A noise reduction method and device, electronic equipment and storage medium

By dynamically selecting denoising methods, combining signal-to-noise ratio and noise type, and using spectral subtraction and lightweight denoising models, the problems of high complexity in traditional methods and data dependence in deep learning methods are solved, achieving efficient denoising and speech fidelity in complex noise scenarios.

CN122245336A · Pending · Publication Date: 2026-06-19 · 中移信息技术有限公司 +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
中移信息技术有限公司
Filing Date
2026-03-03
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional noise reduction methods are complex or rely too much on prior assumptions about noise, while deep learning methods rely on a large amount of labeled data and computational resources, and their real-time performance is limited by model complexity.

Method used

The signal-to-noise ratio and noise type of the target audio are determined first. Based on the noise type and signal-to-noise ratio classification results, the method then dynamically selects among spectral subtraction, a lightweight noise reduction model, and a combination of the two, so as to balance noise reduction in mixed noise scenarios.

Benefits of technology

It achieves the best balance between noise suppression and voice fidelity in complex dynamic noise scenarios, breaking through the rigid noise reduction that relies on static preset scenarios or fixed thresholds.



Abstract

This invention discloses a noise reduction method, apparatus, electronic device, and storage medium. The technical solution of this application can flexibly select different processing methods based on the signal-to-noise ratio and noise type of different pre-processed signals. It achieves dynamic selection or hybrid selection of algorithms such as spectral subtraction, neural network noise reduction, and high-gain mode based on noise type and SNR classification results, realizing the optimal balance between noise suppression and speech fidelity in complex dynamic noise scenarios. This overcomes the rigid noise reduction methods that rely on static preset scenarios or fixed thresholds and single algorithms.

Description

Technical Field

[0001] This invention relates to the field of noise reduction technology, and in particular to a noise reduction method, apparatus, electronic device, and storage medium.

Background Technology

[0002] In the field of Automatic Speech Recognition (ASR), noise reduction methods are mainly divided into two categories: traditional signal processing and deep learning. Traditional signal processing methods are based on mathematical modeling and physical acoustic principles, separating noise from speech through time-frequency domain transformations (such as Fourier transforms), but they rely too heavily on prior assumptions about noise. Deep learning methods use a data-driven approach, jointly training neural networks and end-to-end ASR, but this approach relies on a large amount of labeled data and computational resources, and its real-time performance is limited by model complexity.

Summary of the Invention

[0003] This invention provides a noise reduction method, apparatus, electronic device, and storage medium to solve the problems of high complexity or over-reliance on prior noise assumptions in traditional noise reduction methods.

[0004] According to one aspect of the present invention, a noise reduction method is provided, the method comprising:
determining the target audio, the target audio consisting of several frames of preprocessed signals;
determining the signal-to-noise ratio and noise type of the preprocessed signal;
if the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than the first preset signal-to-noise ratio, denoising the preprocessed signal based on spectral subtraction to obtain an enhanced time-domain signal;
if the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than the second preset signal-to-noise ratio, denoising the preprocessed signal based on the pre-trained lightweight denoising model to obtain an enhanced time-domain signal, wherein the second preset signal-to-noise ratio is lower than the first preset signal-to-noise ratio, and the lightweight denoising model includes a time-frequency domain network with at least one layer of convolutional encoder and at least one layer of decoder;
if the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, performing initial noise reduction based on coarse-grained spectral subtraction and then fine enhancement based on the pre-trained lightweight noise reduction model to obtain the enhanced time-domain signal, wherein the mixed noise is a mixture of steady-state noise and non-steady-state noise;
if the signal-to-noise ratio of the preprocessed signal is greater than the third preset signal-to-noise ratio, increasing the over-subtraction coefficient of the spectral subtraction method and denoising the preprocessed signal based on the spectral subtraction method to obtain the enhanced time-domain signal.

[0005] According to another aspect of the present invention, a noise reduction device is provided, the device comprising:
a target audio determination module, used to determine the target audio, the target audio consisting of several frames of preprocessed signals;
a signal-to-noise ratio determination module, used to determine the signal-to-noise ratio and noise type of the preprocessed signal;
a first noise reduction module, used to denoise the preprocessed signal based on spectral subtraction if the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than the first preset signal-to-noise ratio, so as to obtain an enhanced time-domain signal;
a second noise reduction module, used to denoise the preprocessed signal based on a pre-trained lightweight noise reduction model if the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than the second preset signal-to-noise ratio, so as to obtain an enhanced time-domain signal, wherein the second preset signal-to-noise ratio is lower than the first preset signal-to-noise ratio, and the lightweight noise reduction model includes a time-frequency domain network with at least one layer of convolutional encoder and at least one layer of decoder;
a third noise reduction module, used to perform initial noise reduction based on coarse-grained spectral subtraction if the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, and then to perform fine enhancement based on the pre-trained lightweight noise reduction model to obtain the enhanced time-domain signal, wherein the mixed noise is a mixture of steady-state noise and non-steady-state noise;
a fourth noise reduction module, used to increase the over-subtraction coefficient of the spectral subtraction method if the signal-to-noise ratio of the preprocessed signal is greater than the third preset signal-to-noise ratio, and to denoise the preprocessed signal based on the spectral subtraction method to obtain an enhanced time-domain signal.

[0006] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, such that the at least one processor is able to perform the noise reduction method of any embodiment of the present invention.

[0007] According to another aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions for causing a processor to execute and implement the noise reduction method of any embodiment of the present invention.

[0008] The technical solution of this invention involves determining a target audio signal, which is composed of several frames of preprocessed signals; determining the signal-to-noise ratio (SNR) and noise type of the preprocessed signals; if the noise type of the preprocessed signals is steady-state noise and the SNR is greater than a first preset SNR, then the preprocessed signals are denoised using spectral subtraction to obtain an enhanced time-domain signal; if the noise type of the preprocessed signals is non-steady-state noise and the SNR is lower than a second preset SNR, then the preprocessed signals are denoised using a pre-trained lightweight denoising model to obtain an enhanced time-domain signal; the second preset SNR is less than the first preset SNR; the lightweight denoising model includes a time-frequency domain network comprising at least one layer of convolutional encoder and at least one layer of decoder. If the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio (SNR) is greater than a second preset SNR but less than a first preset SNR, then initial denoising is performed based on coarse-grained spectral subtraction, followed by fine enhancement based on a pre-trained lightweight denoising model to obtain an enhanced time-domain signal; the mixed noise is a mixture of steady-state and non-steady-state noise; if the SNR of the preprocessed signal is greater than a third preset SNR, then the over-subtraction coefficient of the spectral subtraction is increased, and the preprocessed signal is denoised based on the spectral subtraction to obtain an enhanced time-domain signal. This achieves dynamic selection or hybrid spectral subtraction, neural network denoising, high-gain mode, and other algorithms based on noise type and SNR classification results, achieving optimal balance between noise suppression and speech fidelity in complex dynamic noise scenarios. 
It overcomes the rigidity of noise reduction that relies on static preset scenarios, fixed thresholds, or a single algorithm.

[0009] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Attached Figure Description

[0010] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0011] Figure 1 is a flowchart of a noise reduction method provided in Embodiment 1 of the present invention;
Figure 2 is a flowchart of another noise reduction method provided in Embodiment 2 of the present invention;
Figure 3 is a schematic diagram of a noise reduction device according to Embodiment 3 of the present invention;
Figure 4 is a schematic diagram of the structure of an electronic device that implements the noise reduction method of the present invention.

Detailed Implementation

[0012] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0013] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

Embodiment 1

[0014] Figure 1 is a flowchart of a noise reduction method according to Embodiment 1 of the present invention. This embodiment is applicable to audio noise reduction. The method can be executed by a noise reduction device, which can be implemented in hardware and/or software and can be configured in an electronic device with data processing capabilities. As shown in Figure 1, the method includes:

S110. Determine the target audio; the target audio consists of several frames of preprocessed signals.

[0015] The target audio can be determined through real-time acquisition using devices such as microphones, directly obtained audio, or extracted from video. This application does not limit the specific acquisition method. Audio and video can be obtained through real-time video recording, database querying and receiving, etc. The target audio consists of several consecutive frames of preprocessed signals. The preprocessed signal can be the signal after preprocessing the original audio; preprocessing includes, but is not limited to, frequency filtering, echo cancellation, and normalization, etc., which this application does not limit. Preprocessing can be omitted, and the original audio signal can be directly used as the preprocessed signal, thereby saving the computation required for this step.

[0016] Optionally, determining the target audio includes: determining the original audio, which consists of several time-domain samples; filtering the original audio to obtain the initial audio, whose frequency band lies within the preset frequency band; and applying gain normalization to the initial audio to obtain the target audio.

[0017] Time-domain sampled data refers to a one-dimensional digital sequence obtained by discretely sampling a continuous analog sound signal at fixed time intervals through analog-to-digital conversion.

[0018] The target audio is obtained by continuously collecting raw sound signals containing speech and noise in the environment through devices such as microphone arrays to form frames of time-domain sampling data.

[0019] The original time-domain sample sequence, sampled at a given sampling rate (e.g., 16 kHz), is passed through a bandpass filter that removes unwanted frequency components below a first cutoff frequency and above a second cutoff frequency.

[0020] Bandpass filtering can be achieved using IIR / FIR filters.

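[0021] For each frame $x_{\mathrm{bp}}[n]$ of length N of the bandpass filtering result, the RMS energy is calculated: $E_{\mathrm{rms}} = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x_{\mathrm{bp}}[n]^2}$, where $E_{\mathrm{rms}}$ represents the RMS energy.

[0022] Then the frame signal is scaled by the normalization coefficient $g = E_{\mathrm{ref}} / (E_{\mathrm{rms}} + \epsilon)$ to obtain $x_{\mathrm{norm}}[n] = g \cdot x_{\mathrm{bp}}[n]$, where $E_{\mathrm{ref}}$ indicates the preset reference energy, $\epsilon$ is a small constant that prevents division by zero, and $x_{\mathrm{norm}}[n]$ is the result of normalization.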
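The gain normalization step can be illustrated with a minimal Python sketch. The function name `normalize_frame`, the reference level `ref_rms = 0.1`, and the guard constant `eps` are illustrative assumptions, not values taken from the application; the bandpass filter is assumed to have been applied already.

```python
import numpy as np


def normalize_frame(frame: np.ndarray, ref_rms: float = 0.1, eps: float = 1e-8) -> np.ndarray:
    """Scale one frame so that its RMS energy approaches ref_rms.

    ref_rms and eps are hypothetical defaults chosen for illustration.
    """
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    gain = ref_rms / (rms + eps)  # eps guards against division by zero on silent frames
    return gain * frame


# Usage: a 20 ms frame (320 samples at 16 kHz) of a 440 Hz tone.
frame = 0.5 * np.sin(2 * np.pi * 440 * np.arange(320) / 16000)
normalized = normalize_frame(frame)
```

Adding the small constant to the denominator keeps the gain finite on all-zero frames, at the cost of a negligible deviation from the reference level on normal frames.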

[0023] Furthermore, to improve the accuracy of subsequent processing, echo cancellation can be performed on the normalized result by calling an AEC (Acoustic Echo Cancellation) algorithm to remove the echo components. The result after echo cancellation is output as the target audio.

[0024] Optionally, after determining the target audio, the method may further include: determining whether the preprocessed signal contains speech; if the preprocessed signal does not contain speech, it is not processed.

[0025] For each frame of the preprocessed signal, it is necessary to first determine whether the frame contains valid speech. This avoids performing unnecessary noise reduction calculations on segments without speech, which saves computing power and reduces erroneous suppression.

[0026] Optionally, determining whether the preprocessed signal contains speech includes: determining the short-time energy of the preprocessed signal; if the short-time energy of the preprocessed signal is greater than the preset short-time energy, performing feature extraction on the preprocessed signal to obtain audio features; inputting the audio features into a pre-trained voice activity detection model to obtain the probability that speech exists in the preprocessed signal; and, if the probability of speech in the preprocessed signal is greater than the preset probability, determining that speech exists in the preprocessed signal.

[0027] To this end, the short-time energy of the current frame $x_t[n]$ is calculated: $E_t = \sum_{n=0}^{L-1} x_t[n]^2$. $E_t$ is compared with the preset short-time energy $E_{\mathrm{th}}$: if $E_t$ is less than $E_{\mathrm{th}}$, the frame is determined to contain no speech (VAD=0).

[0028] If $E_t$ is greater than or equal to $E_{\mathrm{th}}$, the frame is assumed to possibly contain speech. Simple features of the current frame, such as MFCC, short-time spectral entropy, and spectral centroid, are extracted and input into a pre-trained voice activity detection model (which can be a GMM or a small DNN) to obtain a more accurate speech presence probability $P_{\mathrm{speech}}$.

[0029] If $P_{\mathrm{speech}}$ is greater than or equal to the preset presence probability $P_{\mathrm{th}}$, the frame is determined to contain speech (VAD=1); otherwise VAD=0.

[0030] The VAD results of several adjacent frames are smoothed by sliding-window majority voting (the current frame is finally determined to be speech only if at least two out of three consecutive frames are determined to be speech) to reduce timing jitter.

[0031] The final output is the smoothed VAD decision of the frame. When VAD=0, the noise reduction step is skipped immediately; when VAD=1, noise reduction is performed.
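The two-stage decision and the majority-vote smoothing can be sketched as follows. This is a minimal illustration under stated assumptions: `speech_prob` is a hypothetical stand-in for the pre-trained detection model, the thresholds are arbitrary, and the three-frame window is taken as the current frame plus its two predecessors, which the text does not specify.

```python
import numpy as np
from typing import Callable, List


def frame_vad(frame: np.ndarray,
              energy_threshold: float,
              speech_prob: Callable[[np.ndarray], float],
              prob_threshold: float = 0.5) -> int:
    """Energy gate first, then a model-based probability check."""
    energy = float(np.sum(frame ** 2))  # short-time energy of the frame
    if energy < energy_threshold:
        return 0  # too quiet: skip the model entirely
    return 1 if speech_prob(frame) >= prob_threshold else 0


def smooth_vad(raw: List[int]) -> List[int]:
    """Majority vote: speech only if at least 2 of the last 3 decisions are speech."""
    smoothed = []
    for t in range(len(raw)):
        window = raw[max(0, t - 2): t + 1]  # current frame and up to two predecessors
        smoothed.append(1 if sum(window) >= 2 else 0)
    return smoothed


# Usage with a dummy model that always reports a high speech probability.
decisions = [frame_vad(f, 1e-3, lambda _: 0.9) for f in (np.zeros(320), np.ones(320))]
stable = smooth_vad([1, 0, 1, 1, 0, 0])
```

Running the cheap energy test before the model means the classifier is only evaluated on frames that could plausibly contain speech.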

[0032] S120. Determine the signal-to-noise ratio and noise type of the preprocessed signal.

[0033] The signal-to-noise ratio (SNR) is the ratio of effective signal power to noise power. The noise type is the category to which the noise belongs, including but not limited to: "steady-state noise", such as air conditioner noise and fan noise; "non-steady-state noise", such as traffic noise and construction noise; "multi-source mixed noise", which contains multiple types at the same time; and "human voice background interference", the sound of multiple people talking in the far field.

[0034] Optionally, determining the noise type of the preprocessed signal includes: performing a short-time Fourier transform on the preprocessed signal to obtain the spectrum of the preprocessed signal; determining the Mel energy based on the Mel filter bank and the spectrum corresponding to the preprocessed signal; performing a discrete cosine transform on the Mel energy to obtain the Mel frequency cepstral coefficients; and inputting the Mel frequency cepstral coefficients into a pre-trained noise classification model to obtain the noise type of the preprocessed signal.

[0035] For each frame of the preprocessed signal $x_t[n]$, a Short-Time Fourier Transform (STFT) is performed first. Assuming the frame length is L (e.g., 320 sampling points, corresponding to 20 ms), its spectrum is calculated: $X_t[k] = \sum_{n=0}^{L-1} x_t[n]\, w[n]\, e^{-j 2\pi k n / L}$, where $X_t[k]$ is the spectrum corresponding to the preprocessed signal and $w[n]$ is a window function (usually a Hamming window).

[0036] The power spectrum $|X_t[k]|^2$ is taken, and the Mel energy is obtained through the Mel filter bank: $M[m] = \sum_{k} |X_t[k]|^2\, H_m[k]$, where $M[m]$ is the Mel energy and $H_m[k]$ is the m-th filter of the Mel filter bank.

[0037] A discrete cosine transform (DCT) is then performed on {M[m]} to obtain the MFCC cepstral coefficients.

[0038] The MFCC vector is input into a pre-trained noise classification model (such as a lightweight 2D convolutional neural network, a decision tree, or a support vector machine, SVM) to obtain a discrete noise type label: {Steady-state noise, sudden noise, mixed noise, ...}.
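The feature extraction chain (windowed transform, power spectrum, Mel filter bank, DCT) can be sketched in NumPy as below. The filter count, the number of coefficients, the Hz-to-Mel formula, and the logarithm applied before the DCT are common conventions assumed here for illustration; the text itself only states that the DCT is applied to the Mel energies. The classifier is not shown.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb


def mfcc_frame(frame: np.ndarray, sample_rate: int = 16000,
               n_filters: int = 20, n_coeffs: int = 13) -> np.ndarray:
    """MFCCs of a single frame; the log before the DCT is an assumed convention."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n))) ** 2      # power spectrum
    mel_energy = mel_filterbank(n_filters, n, sample_rate) @ power
    log_mel = np.log(mel_energy + 1e-10)
    # DCT-II basis, written out explicitly to avoid extra dependencies.
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(n_filters) + 0.5) / n_filters)
    return basis @ log_mel


rng = np.random.default_rng(0)
coeffs = mfcc_frame(rng.standard_normal(320))  # one 20 ms frame at 16 kHz
```

The resulting vector is what would be passed to the noise classification model.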

[0039] For example, common environments can be categorized as: "Steady-state noise": such as the sound of air conditioners and fans; "Non-steady-state noise": such as traffic noise and construction noise; "Multi-source mixed noise": includes multiple types simultaneously; "Human voice background interference": Multi-person conversations in far-field situations.

[0040] The obtained noise type is denoted as NoiseType[t], and the frame-level timestamp t is recorded at the same time.

[0041] Optionally, the signal-to-noise ratio of the preprocessed signal is determined, including: The energy of the preprocessed signal is calculated to obtain the total energy of the preprocessed signal; If the preprocessed signal is steady-state noise, the noise energy of the preprocessed signal is determined based on the minimum energy of the silent frame or the minimum value method of the sliding window and the total energy of the preprocessed signal, and the signal-to-noise ratio of the preprocessed signal is determined based on the noise energy. If the preprocessed signal is non-steady-state noise, the noise power spectrum estimate of the preprocessed signal is determined based on the spectrum of the preprocessed signal and the noise power spectrum estimate of the previous preprocessed signal. Based on the noise power spectrum estimate of the preprocessed signal, the time-domain energy of the preprocessed signal is determined, and the signal-to-noise ratio of the preprocessed signal is determined based on the time-domain energy.

[0042] The short-time energy of the current frame is calculated as the total energy: $E_{\mathrm{total}}[t] = \sum_{n=0}^{L-1} x_t[n]^2$. Noise energy estimation: the estimation method is selected according to the noise type. For steady-state noise, the minimum energy of the frames recently identified as silent, or the minimum value over a sliding window, is used as the estimated noise energy: $\hat{E}_{\mathrm{noise}}[t] = \min_{i \in W} E_{\mathrm{total}}[i]$, where $\hat{E}_{\mathrm{noise}}[t]$ is the estimated noise energy and W is the set of indices of the preceding frames that were judged to contain no speech.

[0043] For non-steady-state noise, a noise tracking method based on spectral subtraction can be used: the power spectrum of the current frame, $|X_t[k]|^2$, is exponentially smoothed with the noise power spectrum estimate of the previous frame, $\hat{N}_{t-1}[k]$: $\hat{N}_t[k] = \lambda\, \hat{N}_{t-1}[k] + (1-\lambda)\, |X_t[k]|^2$, where $\lambda$ is the smoothing factor. The corresponding time-domain energy is then calculated from $\hat{N}_t[k]$.

[0044] For mixed noise or human voice background, the signal is first divided into speech segments and non-speech segments, and the above methods are applied in the non-speech segments. If the human voice in a frame is weak, the noise spectrum can be continuously updated using near-field silent segments. Overall, spectrum tracking or the minimum energy method is still used.

[0045] SNR calculation: after the frame-level estimated noise energy is obtained, the signal-to-noise ratio is calculated from the total energy and the estimated noise energy, with a small constant added to prevent division by zero. The same formula applies to both the steady-state and non-steady-state noise estimation results.

[0046] The calculation result is denoted as SNR[t]. Subsequent threshold determinations and parameter selections are based on this value.
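The two noise estimators and the SNR computation can be sketched as follows. Several details are assumptions made for illustration: the window length, the smoothing factor, the fallback to the plain sliding-window minimum when no silent frame is available, and the SNR being expressed in decibels as the ratio of total energy to estimated noise energy.

```python
import numpy as np
from typing import Sequence


def steady_noise_energy(frame_energies: Sequence[float],
                        vad_flags: Sequence[int],
                        window: int = 50) -> float:
    """Minimum energy among recent frames judged to contain no speech."""
    recent_e = list(frame_energies)[-window:]
    recent_v = list(vad_flags)[-window:]
    silent = [e for e, v in zip(recent_e, recent_v) if v == 0]
    # Assumed fallback: plain sliding-window minimum when no silent frame exists.
    return min(silent) if silent else min(recent_e)


def track_noise_psd(prev_noise_psd: np.ndarray, frame_psd: np.ndarray,
                    smoothing: float = 0.9) -> np.ndarray:
    """Exponential smoothing of the noise power spectrum estimate."""
    return smoothing * prev_noise_psd + (1.0 - smoothing) * frame_psd


def snr_db(total_energy: float, noise_energy: float, eps: float = 1e-10) -> float:
    """SNR in dB; eps keeps the ratio finite when the noise estimate is zero."""
    return float(10.0 * np.log10(total_energy / (noise_energy + eps)))


noise = steady_noise_energy([4.0, 1.0, 9.0, 2.0], [1, 0, 1, 0])
updated = track_noise_psd(np.array([1.0, 1.0]), np.array([2.0, 0.0]))
snr = snr_db(100.0, noise)
```

A smoothing factor close to 1 makes the tracker slow to follow changes, which suits slowly varying noise; smaller values react faster but let speech energy leak into the noise estimate.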

[0047] S130. If the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than the first preset signal-to-noise ratio, then the preprocessed signal is denoised based on spectral subtraction to obtain an enhanced time-domain signal.

[0048] If NoiseType[t] belongs to "steady-state noise" (such as air conditioner noise, fan noise, running water noise, etc.) and the signal-to-noise ratio is greater than the first preset signal-to-noise ratio, then spectral subtraction is used for noise reduction. In this case, the noise estimation is more accurate, the speech energy is higher than the noise, and the spectral subtraction algorithm can further suppress conventional noise while ensuring speech fidelity.

[0049] S140. If the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than the second preset signal-to-noise ratio, then the preprocessed signal is denoised based on the pre-trained lightweight denoising model to obtain an enhanced time-domain signal.

[0050] The second preset signal-to-noise ratio is less than the first preset signal-to-noise ratio; the lightweight denoising model includes a time-frequency domain network with at least one convolutional encoder and at least one decoder.

[0051] If NoiseType[t] belongs to "non-steady-state noise" (such as traffic noise, sudden industrial noise, etc.) and the signal-to-noise ratio is lower than the second preset signal-to-noise ratio, pure spectral subtraction will often cause significant speech distortion, so a lightweight neural network noise reduction model (a time-frequency domain network with one layer of convolutional encoder and one layer of decoder) is required. Model weights are pre-trained offline for each type of non-steady-state noise; at run time, only the corresponding model needs to be called online to output the enhanced time-domain frame.

[0052] S150. If the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, then the initial noise reduction is performed based on coarse-grained spectral subtraction, and then fine enhancement is performed based on the pre-trained lightweight noise reduction model to obtain the enhanced time-domain signal; the mixed noise is a mixture of steady-state noise and non-steady-state noise.

[0053] If the environmental noise is a mixture of multiple sources (containing both steady-state and non-steady-state noise), or if the signal-to-noise ratio is greater than the second preset signal-to-noise ratio but less than the first preset signal-to-noise ratio, a hybrid noise reduction strategy is adopted: first, coarse-grained spectral subtraction is performed, and then a neural network model is used for fine enhancement.

[0054] First, spectral subtraction is computed on the spectrum of the frame. Its parameters are all obtained from the estimated values, either as presets or by table lookup. The spectrum after subtraction is transformed by IFFT to obtain the coarsely denoised time-domain frame.

[0055] Then the coarsely denoised frame is input into the lightweight neural network, which outputs the final enhanced frame after noise reduction.
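The coarse-then-fine strategy can be sketched as below. This is an illustration under assumptions, not the application's formula: it uses the common power-domain form of spectral subtraction with an over-subtraction factor `alpha` and a spectral floor `beta`, reuses the noisy phase, and represents the lightweight model by a hypothetical callable `refine_model`. Overlap-add across frames is omitted, so the output is the windowed frame.

```python
import numpy as np
from typing import Callable


def spectral_subtract(frame: np.ndarray, noise_psd: np.ndarray,
                      alpha: float = 2.0, beta: float = 0.02) -> np.ndarray:
    """Power spectral subtraction on one frame (assumed textbook variant)."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    power = np.abs(spectrum) ** 2
    # Subtract the scaled noise estimate, but never go below a fraction of the input power.
    clean_power = np.maximum(power - alpha * noise_psd, beta * power)
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))  # keep noisy phase
    return np.fft.irfft(enhanced, n=len(frame))


def hybrid_denoise(frame: np.ndarray, noise_psd: np.ndarray,
                   refine_model: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Coarse spectral subtraction followed by a model-based refinement."""
    coarse = spectral_subtract(frame, noise_psd, alpha=1.0)  # gentle first pass
    return refine_model(coarse)


rng = np.random.default_rng(1)
frame = rng.standard_normal(320)
noise_psd = np.full(161, 50.0)            # flat noise estimate, illustrative only
out = hybrid_denoise(frame, noise_psd, refine_model=lambda x: x)  # identity stand-in
```

Keeping the first pass gentle leaves residual noise for the model to handle while limiting the musical-noise artifacts that aggressive subtraction tends to introduce.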

[0056] S160. If the signal-to-noise ratio of the preprocessed signal is less than the third preset signal-to-noise ratio, increase the over-subtraction coefficient of the spectral subtraction method, and perform noise reduction processing on the preprocessed signal based on the spectral subtraction method to obtain the enhanced time-domain signal.

[0057] When the signal-to-noise ratio (SNR) is less than the third preset threshold, the strongest noise suppression mode can be used (a larger over-subtraction coefficient, a more aggressive neural network model), but it is crucial to preserve key vowel formants in the time-frequency domain to prevent speech intelligibility from collapsing. In this case, the parameters used for spectral subtraction or for adjusting neural network thresholds are set for this mode, with a focus on maintaining high gain in the 200 Hz to 3500 Hz frequency band (the main energy concentration area of vowels) in the network front end or filter.

[0058] In the spectral subtraction scenario, the over-subtraction coefficient αt is mapped from the SNR and bounded by the minimum and maximum over-subtraction coefficients allowed by the spectral subtraction algorithm, which can be calibrated through offline experiments. The same segmented mapping approach can also be used for the filter gain. Furthermore, for neural network models, the number of nodes in the front-end hidden layer or the activation function threshold can be set in a similar way to control the strength of network enhancement.
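One plausible form of the SNR-to-coefficient mapping is a clamped linear interpolation, sketched below. The shape of the mapping, the bounds `alpha_min` and `alpha_max`, and the SNR range are all assumptions for illustration; the application states only that the coefficient is bounded by a minimum and a maximum calibrated offline.

```python
import numpy as np


def over_subtraction_coeff(snr_db: float,
                           alpha_min: float = 1.0, alpha_max: float = 4.0,
                           snr_low: float = -5.0, snr_high: float = 20.0) -> float:
    """Map SNR to an over-subtraction coefficient: low SNR gives a larger coefficient.

    All default values are hypothetical and would be calibrated offline.
    """
    ratio = (snr_high - snr_db) / (snr_high - snr_low)
    return float(alpha_min + (alpha_max - alpha_min) * np.clip(ratio, 0.0, 1.0))


# Usage: coefficient for a frame with an estimated SNR of 7.5 dB.
alpha_t = over_subtraction_coeff(7.5)
```

Clamping keeps the coefficient inside the calibrated range even when the SNR estimate is an outlier.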

[0059] After processing with the above adaptive strategy, the current frame outputs an enhanced time-domain signal. The SNR of the enhanced frame is calculated, and its difference from the original SNR of the frame is taken as the frame-level gain, which serves as a frame-level performance metric for subsequent joint optimization.
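The per-frame choice among the branches S130 to S160 can be summarized in a small dispatch function. This is a sketch under assumptions: the string labels, the threshold values, the ordering of the checks (the very-low-SNR branch of paragraphs [0056] and [0057] is tested first), and the fallback to the hybrid strategy for combinations the text does not cover are all illustrative choices.

```python
from enum import Enum


class Strategy(Enum):
    SPECTRAL_SUBTRACTION = "spectral_subtraction"
    LIGHTWEIGHT_MODEL = "lightweight_model"
    HYBRID = "hybrid"
    AGGRESSIVE = "aggressive_spectral_subtraction"


def select_strategy(noise_type: str, snr_db: float,
                    snr1: float = 15.0, snr2: float = 5.0, snr3: float = -5.0) -> Strategy:
    """Pick a noise reduction strategy for one frame.

    Assumes snr3 < snr2 < snr1; the numeric defaults are hypothetical.
    """
    if snr_db < snr3:
        return Strategy.AGGRESSIVE            # strongest suppression, larger over-subtraction
    if noise_type == "steady" and snr_db > snr1:
        return Strategy.SPECTRAL_SUBTRACTION
    if noise_type == "non_steady" and snr_db < snr2:
        return Strategy.LIGHTWEIGHT_MODEL
    if noise_type == "mixed" or snr2 < snr_db < snr1:
        return Strategy.HYBRID
    return Strategy.HYBRID                    # assumed fallback for uncovered combinations


choice = select_strategy("steady", 25.0)
```

Keeping the selection in one pure function makes the thresholds easy to calibrate and the branch logic easy to test in isolation from the signal processing.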

[0060] The technical solution of this application involves determining a target audio signal, which is composed of several frames of preprocessed signals; determining the signal-to-noise ratio (SNR) and noise type of the preprocessed signals; if the noise type of the preprocessed signals is steady-state noise and the SNR is greater than a first preset SNR, then the preprocessed signals are denoised using spectral subtraction to obtain an enhanced time-domain signal; if the noise type of the preprocessed signals is non-steady-state noise and the SNR is lower than a second preset SNR, then the preprocessed signals are denoised using a pre-trained lightweight denoising model to obtain an enhanced time-domain signal; the second preset SNR is less than the first preset SNR; the lightweight denoising model includes a time-frequency domain network comprising at least one layer of convolutional encoder and at least one layer of decoder. If the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio (SNR) is greater than a second preset SNR but less than a first preset SNR, then initial denoising is performed based on coarse-grained spectral subtraction, followed by fine enhancement based on a pre-trained lightweight denoising model to obtain an enhanced time-domain signal; the mixed noise is a mixture of steady-state and non-steady-state noise; if the SNR of the preprocessed signal is greater than a third preset SNR, then the over-subtraction coefficient of the spectral subtraction is increased, and the preprocessed signal is denoised based on the spectral subtraction to obtain an enhanced time-domain signal. This achieves dynamic selection or hybrid spectral subtraction, neural network denoising, high-gain mode, and other algorithms based on noise type and SNR classification results, achieving optimal balance between noise suppression and speech fidelity in complex dynamic noise scenarios. 
It overcomes the rigidity of noise reduction that relies on static preset scenarios, fixed thresholds, or a single algorithm.

Embodiment 2

[0061] Figure 2 is a flowchart of another noise reduction method provided in Embodiment 2 of the present invention. Building on the above embodiments, this embodiment further optimizes the processing performed after the enhanced time-domain signal is obtained. This embodiment can be combined with the optional solutions of one or more of the above embodiments. As shown in Figure 2, the noise reduction method in this embodiment may include the following steps:

S2010. Determine the target audio; the target audio consists of several frames of preprocessed signals.

[0062] S2020. Determine the signal-to-noise ratio and noise type of the preprocessed signal.

[0063] S2030. If the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than the first preset signal-to-noise ratio, then the preprocessed signal is denoised based on spectral subtraction to obtain an enhanced time-domain signal.

[0064] S2040. If the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than the second preset signal-to-noise ratio, then the preprocessed signal is denoised based on the pre-trained lightweight denoising model to obtain an enhanced time-domain signal.

[0065] The second preset signal-to-noise ratio is less than the first preset signal-to-noise ratio; the lightweight denoising model includes a time-frequency domain network with at least one convolutional encoder and at least one decoder.

[0066] S2050. If the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, then the initial noise reduction is performed based on coarse-grained spectral subtraction, and then the fine enhancement is performed based on the pre-trained lightweight noise reduction model to obtain the enhanced time-domain signal; the mixed noise is a mixture of steady-state noise and non-steady-state noise.

[0067] S2060. If the signal-to-noise ratio of the preprocessed signal is greater than the third preset signal-to-noise ratio, the over-subtraction coefficient of the spectral subtraction method is increased, and the preprocessed signal is denoised based on the spectral subtraction method to obtain an enhanced time-domain signal.

[0068] S2070. Determine the acoustic characteristics of the enhanced time-domain signal.

[0069] S2080. Based on acoustic characteristics, determine the word error rate of the enhanced time-domain signal.

[0070] S2090. Based on the signal-to-noise ratio of the enhanced time-domain signal and the signal-to-noise ratio of the preprocessed signal, determine the frame-level gain.

[0071] S2100. Determine the sentence-average gain of the target audio based on the frame-level gains.

[0072] S2110. Adjust the noise reduction process based on the word error rate and the sentence-average gain.
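Steps S2030 and S2060 both rely on spectral subtraction, with S2060 raising the over-subtraction coefficient. A minimal single-frame sketch of power spectral subtraction follows. The function name, the spectral floor `beta`, the default values, and the reuse of the noisy phase are assumptions of this sketch and are not taken from the text.

```python
import numpy as np


def spectral_subtract(frame, noise_psd, alpha=2.0, beta=0.01):
    """Denoise one time-domain frame by power spectral subtraction.

    alpha is the over-subtraction coefficient; beta is a spectral floor
    (an assumption here) that keeps the result from going negative.
    noise_psd must have len(frame) // 2 + 1 bins.
    """
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    # Subtract the scaled noise estimate, but never drop below the floor.
    clean_power = np.maximum(power - alpha * noise_psd, beta * power)
    # Reuse the noisy phase and return to the time domain.
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
    return np.fft.irfft(enhanced, n=len(frame))
```

Increasing `alpha` removes more of the estimated noise in every frequency bin, which is the effect the high-gain mode of S2060 describes.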

[0073] After noise reduction is completed, the noise reduction process needs to be adjusted to ensure the accuracy of the noise reduction results.

[0074] To this end, each frame of the enhanced time-domain signal needs to be converted into acoustic features that an automatic speech recognition (ASR) system can recognize. A common approach is as follows. First, pre-emphasis is performed with a first-order difference filter to enhance high-frequency components. Second, the signal is divided into frames with a frame length L=400 (25 ms at 16 kHz) and a frame shift S=160 (10 ms at 16 kHz), and each frame is multiplied by a Hamming window w[n]. An FFT is then performed on each windowed frame to obtain its spectrum.

[0075] Next, the power spectrum is calculated and mapped through a set of Mel filters to obtain the Mel energy. A DCT is performed on the Mel energy to obtain the MFCC features. The first-order and second-order differences of the MFCCs are calculated at the same time, and the three are concatenated to form the final feature vector. The acoustic features of each frame are output and passed to ASR decoding.
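The feature pipeline described above (pre-emphasis, framing, Hamming window, FFT, power spectrum, Mel filtering, DCT, differences) can be sketched with NumPy as follows. Only the frame length of 400 and the frame shift of 160 come from the text. The pre-emphasis coefficient 0.97, the 512-point FFT, 26 Mel filters, 13 cepstral coefficients, the logarithm applied before the DCT, and the central-difference deltas are conventional choices assumed for illustration.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb


def delta(feat):
    # Central difference along the time axis, with edge padding.
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0


def mfcc_features(x, sr=16000, frame_len=400, hop=160, n_fft=512,
                  n_mels=26, n_ceps=13, alpha=0.97):
    x = np.asarray(x, dtype=float)
    # Pre-emphasis: first-order difference filter.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Framing and Hamming window.
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)
    # FFT and power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # Mel energies (log-compressed, an assumption of this sketch).
    mel_energy = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # DCT-II over the Mel axis gives the cepstral coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_mels))
    ceps = mel_energy @ basis.T
    d1 = delta(ceps)
    d2 = delta(d1)
    # Concatenate static, first-order and second-order features.
    return np.hstack([ceps, d1, d2])
```

With these defaults each frame yields a 39-dimensional vector (13 static coefficients plus their first-order and second-order differences).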

[0076] An attention-based encoder-decoder neural network is used as the acoustic model. The frame-level features are input to obtain the phoneme-level or character-level probability distribution for each frame t. The output of the acoustic model is combined with a neural network language model to obtain a joint probability or score, in which the language model contribution is scaled by a language model weight coefficient.

[0077] Using the beam search algorithm (beam width B), a set of candidate paths is maintained over time, and the B highest-scoring candidates are retained at each step. The process continues until an end-of-sentence marker is detected, at which point the final recognized sentence is output. For the final decoding path, a confidence score (Conf) can be calculated, for example by normalizing the average probability of the acoustic model or the decoder score.
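The decoding loop can be illustrated with a small, generic beam search. This is a sketch under stated assumptions: `beam_search`, `step_fn`, and `toy_step` are hypothetical names, the toy scorer stands in for the fused acoustic-model and language-model score, and the confidence is taken as the exponential of the length-normalized path score, which is only one of the normalizations the text allows.

```python
import math


def beam_search(step_fn, beam_width, eos, max_len=20):
    """Keep the beam_width best partial paths at every step.

    step_fn(prefix) returns {token: log_score} for the next token.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Paths that emit the end-of-sentence marker leave the beam.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    best_seq, best_score = max(finished or beams, key=lambda c: c[1])
    # Length-normalized score mapped back to the (0, 1] range.
    conf = math.exp(best_score / len(best_seq))
    return best_seq, conf


def toy_step(prefix, lm_weight=0.3):
    # Hypothetical fused score: log p_acoustic + lm_weight * log p_lm.
    if len(prefix) < 2:
        am = {"a": 0.6, "b": 0.3, "</s>": 0.1}
    else:
        am = {"a": 0.1, "b": 0.1, "</s>": 0.8}
    lm = {"a": 0.5, "b": 0.2, "</s>": 0.3}
    return {t: math.log(am[t]) + lm_weight * math.log(lm[t]) for t in am}
```

A call such as `beam_search(toy_step, beam_width=2, eos="</s>")` returns the best token sequence together with its confidence.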

[0078] If manually corrected text is available, the word error rate (WER) is calculated on a per-word basis as WER = (S + D + I) / N, where S is the number of substitution errors, D is the number of deletion errors, I is the number of insertion errors, and N is the length of the reference text. If no manual labels are available, a confidence threshold is used to determine whether the recognition meets the standard.
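A per-word WER is usually computed from the word-level edit distance between the reference and the recognized sentence. The sketch below assumes the standard definition, in which the minimum number of substitutions, deletions, and insertions is divided by the reference length; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of words. The minimum total edit count is
    found with the classic Levenshtein dynamic program.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = fewest edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(m + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m] / n
```

Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.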

[0079] The recognized sentence contains several frames, each with a frame-level gain; the average gain of the current sentence is calculated as the mean of these frame-level gains. Speech quality metrics (PESQ, STOI, etc.) can also be calculated as a supplement.

[0080] A word error rate threshold and an average gain threshold are set. If the word error rate does not exceed its threshold and the sentence-average gain reaches its threshold, the current noise reduction parameters and acoustic model are adapted to the current environment and do not need to be updated.

[0081] If the sentence-average gain is below its threshold, the front-end noise reduction did not meet expectations, and the noise reduction strategy or parameters need to be adjusted.

[0082] If the sentence-average gain reaches its threshold but the word error rate exceeds its threshold, the front-end noise reduction is sufficient, but the back-end recognition model does not generalize well enough to this environment or speaker, and incremental fine-tuning of the acoustic model or the noise reduction network model is required.
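The three-way decision in [0080] to [0082] can be sketched as follows. The frame-level gain is assumed here to be the SNR improvement in dB (enhanced SNR minus input SNR). The function names, the returned labels, and the strictness of each comparison are also assumptions.

```python
def sentence_average_gain(snr_enhanced_db, snr_input_db):
    """Mean frame-level gain, taking each gain as the SNR improvement in dB."""
    gains = [e - p for e, p in zip(snr_enhanced_db, snr_input_db)]
    return sum(gains) / len(gains)


def adaptation_action(wer, avg_gain_db, wer_max, gain_min_db):
    """Decide what to update after a sentence has been recognized."""
    if avg_gain_db < gain_min_db:
        # Front-end denoising fell short: adjust strategy or parameters.
        return "adjust_frontend"
    if wer > wer_max:
        # Denoising is sufficient but recognition is poor: fine-tune the models.
        return "finetune_backend"
    # Both metrics meet their thresholds: keep the current configuration.
    return "keep"
```

Checking the gain first mirrors the text: a weak front end is addressed before any model fine-tuning is considered.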

[0083] When the system determines that the situation calls for "front-end parameter adjustment", it uses the current environment characteristics, including the noise type (NoiseType), to look up a new set of spectral subtraction coefficients and filter smoothing parameters in an experience table built offline, or calculates them using linear or other interpolation methods. The new parameters are then applied directly to the next round of spectral subtraction or the hybrid strategy, to increase the noise suppression intensity for subsequent frames or to change the enhancement threshold of the network model.
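A table lookup with linear interpolation might look like the sketch below. The second lookup key is not recoverable from the text, so SNR in dB is assumed purely for illustration. The table contents, the parameter names `alpha` (subtraction coefficient) and `beta` (smoothing parameter), and the clamping at the table edges are likewise hypothetical.

```python
from bisect import bisect_left

# Hypothetical experience table: noise type -> rows of (snr_db, alpha, beta),
# sorted by SNR. The values are placeholders, not taken from the text.
EXPERIENCE_TABLE = {
    "steady": [(0.0, 4.0, 0.9), (10.0, 3.0, 0.8), (20.0, 2.0, 0.7)],
}


def lookup_params(noise_type, snr_db, table=EXPERIENCE_TABLE):
    """Return (alpha, beta) for the given conditions, interpolating between rows."""
    rows = table[noise_type]
    snrs = [row[0] for row in rows]
    # Clamp to the first or last row outside the tabulated range.
    if snr_db <= snrs[0]:
        return rows[0][1:]
    if snr_db >= snrs[-1]:
        return rows[-1][1:]
    hi = bisect_left(snrs, snr_db)
    (s0, a0, b0), (s1, a1, b1) = rows[hi - 1], rows[hi]
    w = (snr_db - s0) / (s1 - s0)
    return (a0 + w * (a1 - a0), b0 + w * (b1 - b0))
```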

[0084] When the situation is determined to call for "back-end model optimization", several recently acquired enhanced frames and the corresponding manually annotated text are used as incremental training data, and the acoustic model and / or the lightweight noise reduction network are fine-tuned for a few iterations to improve recognition performance in the current environment or with new speakers.

[0085] The specific process is as follows: the most recent 10 sentences are recorded, each together with its enhanced frames and its manually annotated text.

[0086] For fine-tuning the acoustic model (or simultaneously fine-tuning the denoising model), a joint loss is used that combines a noise reduction and reconstruction loss with a recognition loss, the latter being a cross-entropy loss or a CTC loss; two loss weights balance the goals of noise reduction and reconstruction against recognition accuracy.

[0087] A small number of gradient descent iterations are performed on the incremental data to update the model parameters of the acoustic model and / or the noise reduction network. The learning rate used in the update can be less than 1e-5. After the update is complete, the new weights are written back to local storage, and the version number is upgraded from v to v+1.
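One conventional way to write the joint loss and the update step is shown below. The symbols are assumptions introduced for illustration: \(\mathcal{L}_{\mathrm{rec}}\) for the noise reduction and reconstruction loss, \(\mathcal{L}_{\mathrm{asr}}\) for the cross-entropy or CTC recognition loss, \(\lambda_1\) and \(\lambda_2\) for the two loss weights, \(\theta_{\mathrm{AM}}\) and \(\theta_{\mathrm{NR}}\) for the acoustic model and noise reduction network parameters, and \(\eta\) for the learning rate.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{joint}} &= \lambda_1\,\mathcal{L}_{\mathrm{rec}} + \lambda_2\,\mathcal{L}_{\mathrm{asr}}, \\
\theta_{\mathrm{AM}} &\leftarrow \theta_{\mathrm{AM}} - \eta\,\nabla_{\theta_{\mathrm{AM}}}\mathcal{L}_{\mathrm{joint}}, \\
\theta_{\mathrm{NR}} &\leftarrow \theta_{\mathrm{NR}} - \eta\,\nabla_{\theta_{\mathrm{NR}}}\mathcal{L}_{\mathrm{joint}}.
\end{aligned}
```

A small learning rate such as the one mentioned in the text keeps the few incremental iterations from drifting far from the pre-trained weights.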

[0088] The current environmental characteristics (including NoiseType), the optimal parameters, the updated model version number v+1, and the corresponding metrics such as WER and PESQ are written to the local experience library (e.g., as JSON). This allows direct reuse in similar environments and avoids redundant iterations.
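Writing an entry to a JSON-based experience library could be sketched as follows. The file layout (a single JSON array), the function name, and the field names used in the usage note are hypothetical; the text only states that the environment characteristics, parameters, model version, and metrics are stored.

```python
import json
import os


def append_experience(path, record):
    """Append one record to a JSON array stored at path and return the new count."""
    entries = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            entries = json.load(f)
    entries.append(record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
    return len(entries)
```

A record might look like `{"noise_type": "steady", "params": {"alpha": 3.0, "beta": 0.8}, "model_version": 2, "wer": 0.08, "pesq": 3.1}`, where every key and value is illustrative.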

[0089] In addition, the latest recognition result of this round, the confidence score (Conf), the current noise type (NoiseType), the latest model version number, and other information can be exposed to upper-layer applications in a structured form through a local API. Upper-layer applications can then display captions, execute subsequent business logic, or use the recognition records for further statistical analysis as needed.

[0090] The technical solution of this application involves determining the acoustic characteristics of the enhanced time-domain signal; determining the word error rate of the enhanced time-domain signal based on the acoustic characteristics; determining the frame-level gain based on the signal-to-noise ratio of the enhanced time-domain signal and the signal-to-noise ratio of the preprocessed signal; determining the sentence-average gain of the target audio based on each frame-level gain; and adjusting the noise reduction process based on the word error rate and sentence-average gain to continuously update the noise reduction process and ensure the accuracy of the noise reduction results.

[0091] Example 3. Figure 3 is a structural block diagram of a noise reduction device provided by this invention, applicable to audio noise reduction. The noise reduction device can be implemented in hardware and / or software and can be configured in an electronic device with data processing capabilities. As shown in Figure 3, the noise reduction device in this embodiment may include: a target audio determination module 310, a signal-to-noise ratio determination module 320, a first noise reduction module 330, a second noise reduction module 340, a third noise reduction module 350, and a fourth noise reduction module 360. The target audio determination module 310 is used to determine the target audio; the target audio is composed of several frames of preprocessed signals. The signal-to-noise ratio determination module 320 is used to determine the signal-to-noise ratio and noise type of the preprocessed signal. The first noise reduction module 330 is used to perform noise reduction processing on the preprocessed signal based on spectral subtraction if the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than the first preset signal-to-noise ratio, so as to obtain an enhanced time-domain signal. The second noise reduction module 340 is used to perform noise reduction processing on the preprocessed signal based on a pre-trained lightweight noise reduction model to obtain an enhanced time-domain signal if the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than the second preset signal-to-noise ratio; the second preset signal-to-noise ratio is lower than the first preset signal-to-noise ratio; the lightweight noise reduction model includes a time-frequency domain network comprising at least one convolutional encoder layer and at least one decoder layer. The third noise reduction module 350 is used to perform initial noise reduction based on coarse-grained spectral subtraction if the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, and then perform fine enhancement based on the pre-trained lightweight noise reduction model to obtain the enhanced time-domain signal; the mixed noise is a mixture of steady-state noise and non-steady-state noise. The fourth noise reduction module 360 is used to increase the over-subtraction coefficient of the spectral subtraction method if the signal-to-noise ratio of the preprocessed signal is greater than the third preset signal-to-noise ratio, and to perform noise reduction processing on the preprocessed signal based on the spectral subtraction method to obtain an enhanced time-domain signal.

[0092] Based on the above embodiments, optionally, determining the target audio includes: The original audio is determined; the original audio consists of several time-domain sampled data. The original audio is filtered to obtain the initial audio; the frequency band of the initial audio is within the preset frequency band. The initial audio is normalized by gain to obtain the target audio.

[0093] Based on the above embodiments, optionally, determining the noise type of the preprocessed signal includes: Perform a short-time Fourier transform on the preprocessed signal to obtain the spectrum of the preprocessed signal; The Mel frequency is determined based on the Mel frequency cepstral filter bank and the spectrum corresponding to the preprocessed signal. The discrete cosine transform of the Mel frequency is used to obtain the Mel frequency cepstral coefficients; The Mel frequency cepstral coefficients are input into a pre-trained noise classification model to obtain the noise type of the pre-processed signal.

[0094] Based on the above embodiments, optionally, determining the signal-to-noise ratio of the preprocessed signal includes: The energy of the preprocessed signal is calculated to obtain the total energy of the preprocessed signal; If the preprocessed signal is steady-state noise, the noise energy of the preprocessed signal is determined based on the minimum energy of the silent frame or the minimum value method of the sliding window and the total energy of the preprocessed signal, and the signal-to-noise ratio of the preprocessed signal is determined based on the noise energy. If the preprocessed signal is non-steady-state noise, the noise power spectrum estimate of the preprocessed signal is determined based on the spectrum of the preprocessed signal and the noise power spectrum estimate of the previous preprocessed signal. Based on the noise power spectrum estimate of the preprocessed signal, the time-domain energy of the preprocessed signal is determined, and the signal-to-noise ratio of the preprocessed signal is determined based on the time-domain energy.
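The two SNR estimation paths can be sketched as below. For steady-state noise the noise energy is taken as the minimum frame energy inside a sliding window; for non-steady-state noise the noise power spectrum is updated recursively from the previous estimate and the current spectrum. The function names, the window length, the smoothing factor, and the exact SNR formula are assumptions of this sketch.

```python
import numpy as np


def snr_steady_db(frame_energies, window=50):
    """SNR of the latest frame, using the sliding-window minimum as noise energy."""
    e = np.asarray(frame_energies, dtype=float)
    noise = max(float(np.min(e[-window:])), 1e-12)
    total = float(e[-1])
    # Speech energy is what remains after removing the noise estimate.
    speech = max(total - noise, 1e-12)
    return 10.0 * np.log10(speech / noise)


def update_noise_psd(prev_noise_psd, frame_spectrum, smoothing=0.9):
    """Recursive noise power spectrum estimate for non-steady-state noise."""
    power = np.abs(frame_spectrum) ** 2
    return smoothing * prev_noise_psd + (1.0 - smoothing) * power
```

A smoothing factor close to 1 makes the estimate track slowly changing noise; smaller values follow abrupt changes more quickly at the cost of a noisier estimate.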

[0095] Based on the above embodiments, optionally, after determining the target audio, the method further includes: Determine whether the preprocessed signal contains speech; If the signal does not contain speech, it will not be processed.

[0096] Based on the above embodiments, optionally, determining whether the preprocessed signal contains speech includes: Determine the short-time energy of the preprocessed signal; If the short-time energy of the preprocessed signal is greater than the preset short-time energy, then feature extraction is performed on the preprocessed signal to obtain audio features; The audio features are input into a pre-trained speech activity detection and discrimination model to obtain the probability that speech exists in the pre-processed signal; If the probability of speech in the preprocessed signal is greater than the preset probability, then it is determined that speech exists in the preprocessed signal.
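The two-stage speech check can be sketched as a cheap energy gate followed by a model-based decision. Here `extract_features` and `vad_model` are hypothetical callables standing in for the feature extractor and the pre-trained speech activity detection model, and short-time energy is assumed to be the mean squared sample value.

```python
def contains_speech(frame, energy_threshold, prob_threshold,
                    extract_features, vad_model):
    """Return True only if the frame passes both the energy gate and the model."""
    # Stage 1: short-time energy gate; low-energy frames skip the model entirely.
    energy = sum(s * s for s in frame) / len(frame)
    if energy <= energy_threshold:
        return False
    # Stage 2: the detection model returns the probability that speech is present.
    prob = vad_model(extract_features(frame))
    return prob > prob_threshold
```

Running the energy gate first means the model is evaluated only on frames that could plausibly contain speech, which keeps the per-frame cost low.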

[0097] Based on the above embodiments, optionally, after obtaining the enhanced time-domain signal, the method further includes: Determine the acoustic characteristics of the enhanced time-domain signal; Based on acoustic characteristics, the word error rate of the enhanced time-domain signal is determined; The frame-level gain is determined based on the signal-to-noise ratio of the enhanced time-domain signal and the signal-to-noise ratio of the preprocessed signal. Based on the gain at each frame level, determine the sentence-average gain of the target audio. The noise reduction process is adjusted based on the word error rate and the average sentence gain.

[0098] The noise reduction device provided in the embodiments of the present invention can execute the noise reduction method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method.

[0099] Example 4. Figure 4 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and / or claimed herein.

[0100] As shown in Figure 4, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input / output (I / O) interface 15 is also connected to the bus 14.

[0101] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0102] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as noise reduction methods.

[0103] In some embodiments, the noise reduction method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the noise reduction method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the noise reduction method by any other suitable means (e.g., by means of firmware).

[0104] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0105] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0106] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0107] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0108] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0109] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.

[0110] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.

[0111] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A noise reduction method, characterized in that it comprises: determining a target audio, the target audio being composed of several frames of preprocessed signals; determining the signal-to-noise ratio and noise type of the preprocessed signal; if the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than a first preset signal-to-noise ratio, denoising the preprocessed signal based on spectral subtraction to obtain an enhanced time-domain signal; if the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than a second preset signal-to-noise ratio, denoising the preprocessed signal based on a pre-trained lightweight denoising model to obtain an enhanced time-domain signal, wherein the second preset signal-to-noise ratio is less than the first preset signal-to-noise ratio, and the lightweight denoising model includes a time-frequency domain network with at least one convolutional encoder layer and at least one decoder layer; if the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, performing initial noise reduction based on coarse-grained spectral subtraction, followed by fine enhancement based on the pre-trained lightweight denoising model, to obtain an enhanced time-domain signal, wherein the mixed noise is a mixture of steady-state noise and non-steady-state noise; and if the signal-to-noise ratio of the preprocessed signal is greater than a third preset signal-to-noise ratio, increasing the over-subtraction coefficient of the spectral subtraction method and denoising the preprocessed signal based on the spectral subtraction method to obtain an enhanced time-domain signal.

2. The method according to claim 1, characterized in that determining the target audio comprises: determining the original audio, the original audio consisting of several time-domain sampled data; filtering the original audio to obtain initial audio, the frequency band of the initial audio being within a preset frequency band; and performing gain normalization on the initial audio to obtain the target audio.

3. The method according to claim 1, characterized in that determining the noise type of the preprocessed signal comprises: performing a short-time Fourier transform on the preprocessed signal to obtain the spectrum of the preprocessed signal; determining the Mel frequency based on the Mel frequency cepstral filter bank and the spectrum corresponding to the preprocessed signal; performing a discrete cosine transform on the Mel frequency to obtain the Mel frequency cepstral coefficients; and inputting the Mel frequency cepstral coefficients into a pre-trained noise classification model to obtain the noise type of the preprocessed signal.

4. The method according to claim 1, characterized in that determining the signal-to-noise ratio of the preprocessed signal comprises: calculating the energy of the preprocessed signal to obtain the total energy of the preprocessed signal; if the preprocessed signal is steady-state noise, determining the noise energy of the preprocessed signal based on the minimum energy of the silent frame or the sliding-window minimum method, together with the total energy of the preprocessed signal, and determining the signal-to-noise ratio of the preprocessed signal based on the noise energy; and if the preprocessed signal is non-steady-state noise, determining the noise power spectrum estimate of the preprocessed signal based on the spectrum of the preprocessed signal and the noise power spectrum estimate of the previous preprocessed signal, determining the time-domain energy of the preprocessed signal based on the noise power spectrum estimate of the preprocessed signal, and determining the signal-to-noise ratio of the preprocessed signal based on the time-domain energy.

5. The method according to claim 1, characterized in that, after determining the target audio, the method further comprises: determining whether the preprocessed signal contains speech; and if the preprocessed signal does not contain speech, not processing it.

6. The method according to claim 5, characterized in that determining whether the preprocessed signal contains speech comprises: determining the short-time energy of the preprocessed signal; if the short-time energy of the preprocessed signal is greater than a preset short-time energy, performing feature extraction on the preprocessed signal to obtain audio features; inputting the audio features into a pre-trained speech activity detection and discrimination model to obtain the probability that speech exists in the preprocessed signal; and if the probability that speech exists in the preprocessed signal is greater than a preset probability, determining that speech exists in the preprocessed signal.

7. The method according to claim 1, characterized in that, after obtaining the enhanced time-domain signal, the method further comprises: determining the acoustic characteristics of the enhanced time-domain signal; determining the word error rate of the enhanced time-domain signal based on the acoustic characteristics; determining the frame-level gain based on the signal-to-noise ratio of the enhanced time-domain signal and the signal-to-noise ratio of the preprocessed signal; determining the sentence-average gain of the target audio based on the frame-level gains; and adjusting the noise reduction process based on the word error rate and the sentence-average gain.

8. A noise reduction device, characterized in that it comprises: a target audio determination module, used to determine the target audio, the target audio being composed of several frames of preprocessed signals; a signal-to-noise ratio determination module, used to determine the signal-to-noise ratio and noise type of the preprocessed signal; a first noise reduction module, used to perform noise reduction processing on the preprocessed signal based on spectral subtraction if the noise type of the preprocessed signal is steady-state noise and the signal-to-noise ratio is greater than a first preset signal-to-noise ratio, thereby obtaining an enhanced time-domain signal; a second noise reduction module, used to perform noise reduction processing on the preprocessed signal based on a pre-trained lightweight noise reduction model to obtain an enhanced time-domain signal if the noise type of the preprocessed signal is non-steady-state noise and the signal-to-noise ratio is lower than a second preset signal-to-noise ratio, wherein the second preset signal-to-noise ratio is lower than the first preset signal-to-noise ratio, and the lightweight noise reduction model includes a time-frequency domain network with at least one convolutional encoder layer and at least one decoder layer; a third noise reduction module, used to perform initial noise reduction based on coarse-grained spectral subtraction if the noise type of the preprocessed signal is mixed noise, or the signal-to-noise ratio is greater than the second preset signal-to-noise ratio and less than the first preset signal-to-noise ratio, and then perform fine enhancement based on the pre-trained lightweight noise reduction model to obtain an enhanced time-domain signal, wherein the mixed noise is a mixture of steady-state noise and non-steady-state noise; and a fourth noise reduction module, used to increase the over-subtraction coefficient of the spectral subtraction method if the signal-to-noise ratio of the preprocessed signal is greater than a third preset signal-to-noise ratio, and to perform noise reduction processing on the preprocessed signal based on the spectral subtraction method to obtain an enhanced time-domain signal.

9. An electronic device, characterized in that the electronic device comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the noise reduction method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that cause a processor to execute the noise reduction method according to any one of claims 1-7.