An audio restoration method and processing terminal based on a hierarchical progressive diffusion model

By using a hierarchical progressive diffusion model, combined with spectrum sensing analysis and a multi-level spectrum restoration network, the problems of high computational overhead, poor adaptability, and insufficient phase reconstruction in existing audio restoration methods are solved, achieving high-quality audio restoration results.

CN122201316A · Pending · Publication Date: 2026-06-12 · GUANGZHOU BAOLUN ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU BAOLUN ELECTRONICS CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing diffusion probability models for audio restoration suffer from problems such as huge computational overhead, difficulty in balancing global structural consistency and local detail accuracy, lack of adaptive processing capabilities, and insufficient attention to phase reconstruction issues.

Method used

By employing a hierarchical progressive diffusion model, high-quality audio signal restoration is achieved through spectrum sensing analysis, multi-scale feature extraction and aggregation, multi-level spectrum restoration network, phase reconstruction and balance fusion.

Benefits of technology

High-quality audio restoration with consistent phase and amplitude was achieved, improving the quality of the restored audio.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an audio restoration method based on a hierarchical progressive diffusion model, comprising the following steps: inputting a damaged audio signal to be repaired into a spectrum sensing and analysis module to obtain a damage mask, a spectral feature map and a damage type label of the damaged audio signal; concatenating the damage mask and the spectral feature map to obtain a combined feature tensor; extracting features from the combined feature tensor at multiple resolution levels to obtain multi-scale features, aggregating them into multi-scale fused features, and generating a guiding condition vector from the fused features; inputting the damage mask, the guiding condition vector and the damaged spectrum into a multi-layer hierarchical progressive diffusion repair network for spectrum repair to obtain a repaired spectrum; and converting the amplitude spectrum of the repaired spectrum into a complete complex spectrum through phase reconstruction and balanced fusion. The application achieves high-quality phase reconstruction that is consistent with the amplitude, improving the quality of the repaired audio.

Description

Technical Field

[0001] This invention relates to the field of computer technology, specifically to an audio restoration method and processing terminal based on a hierarchical progressive diffusion model.

Background Technology

[0002] The diffusion probability model is an emerging audio restoration technique. Its core idea is to define a forward diffusion process, gradually adding Gaussian noise to the data until it becomes pure noise, then learning an inverse denoising process to gradually recover the original data from the noise. In audio restoration, the diffusion probability model uses a conditional generation mechanism, based on damaged audio and a damage mask, to gradually generate the restoration content for the damaged areas during the denoising process.

[0003] Existing diffusion probability models for audio restoration still have some limitations. First, most methods employ a single-scale diffusion process, performing all denoising steps at the original resolution, resulting in huge computational overhead and difficulty in balancing global structural consistency and local detail accuracy. Second, the condition injection mechanism is simplistic, typically only concatenating the mask with the input, and fails to fully utilize the gradient information of the damage boundary and the multi-scale features of the surrounding context. Third, there is a lack of adaptive processing for different damage types; a uniform restoration strategy is still used for damage as different as missing segments, noise contamination, and clipping. Fourth, the amplitude spectrum generated by the diffusion process may be inconsistent with the original phase; existing methods do not pay enough attention to the phase reconstruction problem.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, the purpose of this invention is to provide an audio restoration method and processing terminal based on a hierarchical progressive diffusion model, which can solve the problems described in the background art.

[0005] The technical solution to achieve the objective of this invention is an audio restoration method based on a hierarchical progressive diffusion model, comprising the following steps:

Step 1: Input the damaged audio signal to be repaired into the spectrum sensing and analysis module. The module analyzes the spectrum of the damaged audio signal and processes the analysis result with an attention-based damage detection network to obtain the damage mask, spectral feature map and damage type label of the damaged audio signal.

Step 2: The adaptive guidance signal generation module concatenates the input damage mask and spectral feature map to obtain a combined feature tensor. It extracts features from the combined feature tensor at multiple resolution levels to obtain multi-scale features, aggregates them into multi-scale fused features, and, based on the fused features, generates a guiding condition vector G suitable for the diffusion probability model.

Step 3: Input the damage mask, the guiding condition vector G and the damaged spectrum S into a multi-layer hierarchical progressive diffusion repair network for spectrum repair to obtain the repaired spectrum. Each layer of the network repairs the spectral structure of its corresponding resolution level. The damaged spectrum S is the spectrum of the damaged audio signal after the damage mask has been applied.

Step 4: Convert the amplitude spectrum of the repaired spectrum obtained by the diffusion repair network into a complete complex spectrum through phase reconstruction and balanced fusion, so as to reconstruct a high-quality time-domain audio signal and complete the audio restoration.

[0006] Furthermore, the specific implementation of step 1 includes the following steps:

Step 11: Perform a short-time Fourier transform on the damaged audio signal to obtain a discrete time-frequency domain audio signal.

Step 12: Perform complex spectral decomposition on the discrete time-frequency domain audio signal to obtain the amplitude spectrum and the phase spectrum.

Step 13: Calculate the spectral gradient of the amplitude spectrum, which includes the time gradient and the frequency gradient. The calculated spectral gradient captures abrupt changes and discontinuities in the spectrum.

Step 14: Concatenate the spectral gradient and the amplitude spectrum, and feed the result into an attention-based damage detection network built on the U-Net architecture to identify the damaged region, thereby obtaining the damage mask, spectral feature map and damage type label of the damaged audio signal.

[0007] Furthermore, the spectral gradient is calculated using the following formula:

$$g(k,t)=\sqrt{\lambda_t\left(\frac{\partial A}{\partial t}\right)^2+\lambda_f\left(\frac{\partial A}{\partial f}\right)^2+\lambda_{tf}\left(\frac{\partial^2 A}{\partial t\,\partial f}\right)^2}$$

[0008] In the formula, $g(k,t)$ represents the combined spectral gradient at frequency index k and time frame index t; $A$ represents the amplitude spectrum matrix; $\partial A/\partial t$ represents the first-order partial derivative of the amplitude spectrum along the time direction; $\partial A/\partial f$ represents the first-order partial derivative of the amplitude spectrum along the frequency direction; $\partial^2 A/(\partial t\,\partial f)$ represents the mixed second-order partial derivative of the amplitude spectrum. $\lambda_t$, $\lambda_f$ and $\lambda_{tf}$ represent three learnable weight coefficients that control the contribution ratios of the time gradient, the frequency gradient, and the time-frequency mixed term in the overall gradient calculation, respectively. $\sqrt{\cdot}$ (the square root operation) is used to calculate the Euclidean norm. The three weight coefficients are automatically optimized during training through backpropagation.

[0009] Furthermore, the specific implementation of step 2 includes the following steps:

Step 21: Concatenate the damage mask and the spectral feature map input into the adaptive guidance signal generation module to form a combined feature tensor.

Step 22: Input the combined feature tensor into the multi-scale feature pyramid network to extract features at four different resolution levels, thereby obtaining multi-scale features.

Step 23: Input the multi-scale features into the feature aggregation network, which brings the features of different resolutions to a common size and fuses them into multi-scale fused features.

Step 24: Input the multi-scale fused features into the conditional coding projection layer. This layer compresses the multi-scale fused features into a one-dimensional vector through global average pooling and then maps it to a 512-dimensional guiding condition vector through two fully connected layers. The vector is then L2-normalized to obtain the final guiding condition vector G.

[0010] Furthermore, the first resolution level downsamples the extracted features to 1/8 of the combined feature tensor size, the second resolution level downsamples them to 1/4 of that size, the third resolution level downsamples them to 1/2 of that size, and the fourth resolution level keeps them at the original size of the combined feature tensor. Each resolution level uses residual convolutional blocks for feature extraction, and the number of channels is gradually increased during the downsampling process.

[0011] Furthermore, feature aggregation is performed according to the following formula:

$$F_{\mathrm{ms}}=\sum_{i=1}^{4}\omega_i\cdot\mathrm{Up}_i\!\left(\mathrm{Att}_i\odot\mathrm{GELU}\!\left(W_i * F_i+b_i\right)\right)$$

[0012] In the formula, $F_{\mathrm{ms}}$ represents the multi-scale fused feature with dimensions C×H×W, where C is the number of channels, and H and W are the height and width, respectively. $F_i$ represents the feature of the i-th resolution level, and 4 is the total number of resolution levels. $W_i$ represents the convolution kernel weight matrix corresponding to the i-th resolution level. $*$ represents a two-dimensional convolution operation. $b_i$ represents the bias vector of the convolutional layer corresponding to the i-th resolution level. $\mathrm{GELU}(\cdot)$ represents the GELU activation function. $\mathrm{Att}_i$ represents the spatial attention map at the i-th resolution level, which is obtained by compressing the features along the channel dimension and normalizing them using the sigmoid function. $\odot$ represents the element-wise Hadamard product. $\mathrm{Up}_i(\cdot)$ represents the upsampling operator for the i-th resolution level, which uses bilinear interpolation to upsample the features of the i-th resolution level to the original size. $\omega_i$ represents the fusion weight coefficient for the i-th resolution level, which is adaptively generated by the dynamic weight adjuster based on the damage type label.

[0013] Furthermore, the specific implementation of step 3 includes the following steps:

Step 31: Input the damage mask, the guiding condition vector G and the damaged spectrum S into the denoising network; downsample the damaged spectrum S to 1/4 of the original size to form a low-resolution spectrum, and input the low-resolution spectrum into the denoising network to obtain the denoised estimated spectrum.

Step 32: Upsample and downsample to obtain the upsampled spectrum (from the denoised estimated spectrum) and the downsampled spectrum. Fuse the upsampled spectrum and the downsampled spectrum to obtain the fused spectrum, then perform several steps of residual diffusion on the fused spectrum to obtain the medium-resolution spectrum.

Step 33: Perform residual diffusion repair on the medium-resolution spectrum at the original resolution to obtain the repaired spectrum.

[0014] Further, in step 31, the denoising network performs T1 diffusion steps according to the following diffusion formula to denoise and obtain the denoised estimated spectrum:

$$X_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\!\left(X_t,t,G,M\right)\right)+\sigma_t\,M\odot z$$

[0015] In the formula, $X_{t-1}$ represents the denoised estimated spectrum obtained after diffusion at time step t-1. $X_t$ represents the noisy spectrum at time step t in the first stage, which is also the damaged spectrum S. $\alpha_t$ represents the noise schedule coefficient at time step t during the diffusion process, and its value is between 0 and 1. $\bar\alpha_t$ is the cumulative noise coefficient from time step 0 to time step t, defined as $\bar\alpha_t=\prod_{s=1}^{t}\alpha_s$. $\epsilon_\theta$ is the noise prediction neural network with parameters $\theta$, employing an improved U-Net architecture; its inputs are the noisy spectrum, the time-step embedding, the guiding condition vector and the damage mask, and its output is the predicted noise component from which the denoised estimated spectrum is formed. $G$ is the guiding condition vector and $M$ is the damage mask. $\sigma_t$ is the noise standard deviation for time step t, and $z$ is a random noise vector sampled from a standard normal distribution. The fraction $1/\sqrt{\alpha_t}$ is the scaling factor for the denoising step, and the fraction $(1-\alpha_t)/\sqrt{1-\bar\alpha_t}$ is the scaling factor for the noise prediction. Step 33 uses the following spectral consistency constraint diffusion formula for repair to obtain the repaired spectrum:

[0016]

$$Y_{t-1}=M\odot\mathcal{P}\!\left(Y_t-\epsilon_{\theta_3}\!\left(Y_t,t,X_{\mathrm{mid}}\right)\right)+(1-M)\odot S$$

In the formula, $Y_{t-1}$ indicates the output spectrum at time step t-1. $Y_t$ represents the input spectrum at time step t. $\epsilon_{\theta_3}$ indicates the noise prediction network used in this step. $X_{\mathrm{mid}}$ indicates the medium-resolution spectrum. $\mathcal{P}$ represents the spectral consistency projection operator. $S$ indicates the damaged spectrum. $(1-M)$, the complement of the damage mask, is used to identify non-damaged regions; the last term, $(1-M)\odot S$, ensures that the original spectral values are directly preserved in the non-damaged regions, unaffected by the denoising process.

[0017] Furthermore, the specific implementation of step 4 includes the following steps:

Step 41: Input the phase spectrum of the damaged audio signal, the damage mask, and the repaired spectrum obtained in step 3 into the deep phase prediction network to obtain the estimated phase of the damaged region of the damaged audio signal.

Step 42: Enhance the consistency between the estimated phase and the amplitude of the damaged audio signal to obtain the phase matrix, which has the same dimensions as the amplitude spectrum. This step uses the following iterative consistency optimization formula to enhance the consistency between the estimated phase and the amplitude:

[0018]

$$\Phi^{(n+1)}=M\odot\left[\angle\,\mathrm{STFT}\!\left(\mathrm{iSTFT}\!\left(\hat A\odot e^{j\Phi^{(n)}}\right)\right)-\lambda_s\,\nabla_{\Phi}\mathcal{L}_{\mathrm{smooth}}\right]+(1-M)\odot\Phi_0$$

In the formula, $\Phi^{(n+1)}$ represents the phase matrix obtained after the (n+1)th iteration. $\Phi^{(n)}$ represents the phase matrix of the nth iteration. $\hat A$ indicates the repaired spectrum, and $e^{j\Phi^{(n)}}$ indicates the unit complex matrix constructed from $\Phi^{(n)}$; j is the imaginary unit, $\odot$ denotes element-wise multiplication, iSTFT(•) denotes the inverse short-time Fourier transform, and STFT(•) denotes the short-time Fourier transform. $\angle$ represents the argument operation for complex numbers. $M$ is the damage mask, $\Phi_0$ is the original phase spectrum, $\lambda_s$ is the smoothing regularization coefficient, and $\nabla_{\Phi}\mathcal{L}_{\mathrm{smooth}}$ represents the gradient of the smoothing loss function with respect to the phase.

Step 43: Feather the boundary between the repaired and non-damaged regions of the repaired spectrum to achieve smooth spectral boundary fusion, resulting in the processed repaired spectrum. Feathering is performed using the following Gaussian weighted fusion formula:

[0019]

$$\tilde S(k,t)=w(k,t)\,\hat S(k,t)+\bigl(1-w(k,t)\bigr)\,S_o(k,t),\qquad w(k,t)=\exp\!\left(-\frac{d(k,t)^2}{2\sigma_b^2}\right)\mathbb{1}_{B}(k,t)+\mathbb{1}_{I}(k,t)$$

[0020] In the formula, $\tilde S(k,t)$ represents the spectral value at frequency index k and time frame t after final fusion. $\hat S(k,t)$ and $S_o(k,t)$ represent the spectral values of the restored spectrum and the original spectrum at frequency index k and time frame t, respectively. $w(k,t)$ is the fusion weight at position (k,t). $d(k,t)$ indicates the Euclidean distance from position (k,t) to the boundary of the damaged area. $\sigma_b$ is the boundary smoothing bandwidth parameter, and $\exp(\cdot)$ is the Gaussian decay function. $\mathbb{1}_{B}$ is the boundary region indicator function: its value is 1 when position (k,t) lies within the smoothing band at the boundary of the damaged area, and 0 otherwise. $\mathbb{1}_{I}$ is the interior region indicator function: its value is 1 when position (k,t) lies inside the damaged area and far from the boundary, and 0 otherwise.

Step 44: Perform perceptual enhancement on the repaired spectrum after smooth fusion of the spectral boundaries to obtain the perceptually enhanced repaired spectrum.

Step 45: Perform an inverse short-time Fourier transform on the perceptually enhanced repaired spectrum to obtain the restored audio signal, thus completing the restoration.
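The consistency iteration of step 42 can be illustrated with a Griffin-Lim-style projection: resynthesize audio from the repaired amplitude and the current phase, re-analyze it, and keep the new phase only inside the damaged region. The sketch below is a minimal illustration under stated assumptions, not the patented implementation: the smoothing-loss gradient term is omitted, the helper names (`stft`, `istft`, `refine_phase`) and the small window size are hypothetical, and the deep phase prediction network of step 41 is replaced by a zero initial phase in the damaged frames.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Hann-windowed STFT without padding; returns a complex (K, T) matrix."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * win for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def istft(spec, n_fft, hop):
    """Overlap-add inverse STFT with window-squared normalization."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=0)
    length = n_fft + hop * (spec.shape[1] - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i in range(spec.shape[1]):
        out[i * hop : i * hop + n_fft] += frames[:, i] * win
        norm[i * hop : i * hop + n_fft] += win**2
    return out / np.maximum(norm, 1e-8)

def refine_phase(amplitude, phase0, mask, n_iter=20, n_fft=256, hop=64):
    """Iteratively project the phase; undamaged bins (mask == 0) keep phase0."""
    phase = phase0.copy()
    for _ in range(n_iter):
        audio = istft(amplitude * np.exp(1j * phase), n_fft, hop)
        projected = np.angle(stft(audio, n_fft, hop))
        phase = mask * projected + (1.0 - mask) * phase0
    return phase

rng = np.random.default_rng(0)
signal = rng.standard_normal(256 + 64 * 15)   # yields 16 frames
spec = stft(signal, 256, 64)
mask = np.zeros(spec.shape)
mask[:, 6:9] = 1.0                            # three damaged frames
phase0 = np.angle(spec) * (1.0 - mask)        # unknown phase set to zero
refined = refine_phase(np.abs(spec), phase0, mask)
```

Because the mask gates the update, the original phase is carried through unchanged wherever the spectrum was not damaged, which mirrors the role the damage mask and the original phase spectrum play in the description.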

[0021] A processing terminal, comprising: a memory for storing program instructions; and a processor for running the program instructions to perform the steps of the audio restoration method based on a hierarchical progressive diffusion model.

[0022] The beneficial effects of this invention are as follows: the invention achieves phase reconstruction in which phase and amplitude are consistent, thereby improving the quality of the restored audio.

Description of the Drawings

[0023] Figure 1 is a flowchart illustrating a preferred embodiment of the method of the present invention; Figure 2 is a schematic diagram of the processing terminal.

Detailed Implementation

[0024] The present invention will be further described below with reference to the accompanying drawings and specific embodiments. As shown in Figure 1, an audio restoration method based on a hierarchical progressive diffusion model includes the following steps: Step 1: Input the damaged audio signal to be repaired into the spectrum sensing and analysis module. The spectrum sensing and analysis module analyzes the spectrum of the damaged audio signal and processes the analysis result using an attention-based damage detection network to obtain the damage mask, spectral feature map, and damage type label of the damaged audio signal.

[0025] For example, a specific implementation of step 1 includes the following steps: Step 11: Perform a short-time Fourier transform (STFT) on the damaged audio signal to obtain a discrete time-frequency domain audio signal.

[0026] Understandably, the input damaged audio signal is a time-domain audio signal. Assume its sampling rate is fs and its length is L sampling points. The short-time Fourier transform can use a Hann window as the analysis window function, with a window length N of 2048 sampling points and a frame shift H of 512 sampling points, to balance frequency resolution and temporal resolution. After the short-time Fourier transform, a complex spectrum matrix of dimension K×T is obtained, where K=N/2+1 is the number of frequency points and T is the number of time frames.

[0027] Step 12: Perform complex spectral decomposition on the discrete time-frequency domain audio signal to obtain the amplitude spectrum and phase spectrum.
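Steps 11 and 12 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a 16 kHz sampling rate, a synthetic test tone, and framing without padding (none of which the text specifies); the function name `stft` is hypothetical. It uses the window length N = 2048 and frame shift H = 512 given above and shows that the result has K = N/2 + 1 = 1025 frequency points.

```python
import numpy as np

def stft(x, n_fft=2048, hop=512):
    """Hann-windowed STFT; returns a complex (K, T) matrix with K = n_fft // 2 + 1."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop   # frames that fit without padding
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

fs = 16000                                   # assumed sampling rate
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440.0 * t)       # one-second test tone

spec = stft(signal)                          # step 11: complex spectrum matrix
amplitude = np.abs(spec)                     # step 12: amplitude spectrum
phase = np.angle(spec)                       # step 12: phase spectrum
print(spec.shape)                            # (1025, 28)
```

The amplitude and phase together carry the same information as the complex matrix, since `amplitude * np.exp(1j * phase)` reproduces `spec`.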

[0028] Step 13: Calculate the spectral gradient of the amplitude spectrum, which includes the time gradient and the frequency gradient. By using the calculated spectral gradient, abrupt changes and discontinuities in the spectrum can be captured.

[0029] Understandably, this step uses the spectral gradient to expose abrupt changes and discontinuous regions in the spectrum.

[0030] For example, the spectral gradient is calculated using the following formula:

$$g(k,t)=\sqrt{\lambda_t\left(\frac{\partial A}{\partial t}\right)^2+\lambda_f\left(\frac{\partial A}{\partial f}\right)^2+\lambda_{tf}\left(\frac{\partial^2 A}{\partial t\,\partial f}\right)^2}$$

[0031] In the formula, $g(k,t)$ represents the spectral gradient, which is the combined spectral gradient at frequency index k and time frame index t. This combined spectral gradient includes both time and frequency gradients, reflecting the degree of spectral change at that position. $A$ represents the amplitude spectrum matrix. $\partial A/\partial t$ represents the first-order partial derivative of the amplitude spectrum along the time direction; it can be approximated by the difference between adjacent time frames and is used to detect abrupt changes in the time dimension. $\partial A/\partial f$ represents the first-order partial derivative of the amplitude spectrum along the frequency direction; it can be approximated by the difference between adjacent frequency points and is used to detect abrupt changes in the frequency dimension. $\partial^2 A/(\partial t\,\partial f)$, the mixed second-order partial derivative of the amplitude spectrum, is used to capture joint time-frequency variations, which is particularly important for detecting oblique damage boundaries. $\lambda_t$, $\lambda_f$ and $\lambda_{tf}$ represent three learnable weight coefficients that control the contribution ratios of the time gradient, the frequency gradient, and the time-frequency mixed term in the overall gradient calculation. These three weight coefficients are automatically optimized during training through backpropagation. $\sqrt{\cdot}$ (the square root operation) is used to calculate the Euclidean norm, combining the gradient components in the three directions of time, frequency, and time-frequency mixture into a scalar gradient value.
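A finite-difference version of this spectral gradient might look as follows. It is a sketch under assumptions: the three weights are fixed constants here (the text describes them as learned by backpropagation), they are applied to the squared terms inside the square root, and `np.gradient` stands in for "the difference between adjacent frames or frequency points". The function and parameter names are hypothetical.

```python
import numpy as np

def spectral_gradient(A, w_t=1.0, w_f=1.0, w_tf=1.0):
    """Combine time, frequency and mixed finite differences of A (K x T) into one scalar map."""
    dA_dt = np.gradient(A, axis=1)           # difference along time frames
    dA_df = np.gradient(A, axis=0)           # difference along frequency bins
    d2A_dtdf = np.gradient(dA_dt, axis=0)    # mixed second-order term
    return np.sqrt(w_t * dA_dt**2 + w_f * dA_df**2 + w_tf * d2A_dtdf**2)

# Toy amplitude spectrum: constant energy with a silent gap over two frames.
A = np.ones((6, 8))
A[:, 4:6] = 0.0
g = spectral_gradient(A)
```

In this toy case the gradient is zero in the flat region and rises to 0.5 on the frames adjacent to the gap, which is the kind of discontinuity cue the damage detection network receives alongside the amplitude spectrum.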

[0032] Step 14: Concatenate the spectral gradient and the amplitude spectrum, and feed the result into an attention-based damage detection network built on the U-Net architecture to identify the damaged region, thereby obtaining the damage mask, spectral feature map and damage type label of the damaged audio signal.

[0033] The attention-based damage detection network embeds self-attention modules at the skip connections of its encoder and decoder to model global dependencies, thereby effectively identifying scattered damage regions.

[0034] Understandably, the attention-based damage detection network outputs three branches: a damage mask branch that produces a binary damage mask M of the same size as the input spectrum, where a value of 1 in the binary damage mask M indicates a damage region that needs to be repaired; a spectral feature map branch that outputs a 128-channel deep spectral feature map that encodes the contextual information surrounding the damage region; and a damage type label branch that outputs labels to classify the damage type into one of the preset types, such as missing, noisy, clipped, or mixed types, to guide the selection of subsequent repair strategies.

[0035] Step 2: The adaptive guidance signal generation module concatenates the input damage mask and spectral feature map to obtain a concatenated combined feature tensor. Features are extracted from the combined feature tensor at multiple resolution levels to obtain multi-scale features. These multi-scale features are then aggregated to obtain multi-scale fused features. Based on these multi-scale fused features, a conditional code suitable for the diffusion probability model is generated.

[0036] The conditional code is also referred to as the guiding condition vector G.

[0037] For example, a specific implementation of step 2 includes the following steps: Step 21: The damage mask and spectral feature map input into the adaptive guidance signal generation module are concatenated to form a combined feature tensor.

[0038] Step 22: Input the combined feature tensor into a multi-scale feature pyramid network to extract features at four different resolution levels, thus obtaining multi-scale features. Specifically, the first resolution level downsamples the extracted features to 1/8 of the combined feature tensor size (i.e., the original size) to capture global structural information. The second resolution level downsamples the extracted features to 1/4 of that size to extract medium-scale texture patterns. The third resolution level downsamples the extracted features to 1/2 of that size to retain more local details. The fourth resolution level keeps the extracted features at the original size of the combined feature tensor to encode fine spectral variations. Each resolution level uses residual convolutional blocks for feature extraction, and the number of channels is gradually increased during downsampling to maintain information capacity.

[0039] Step 23: Input the multi-scale features into the feature aggregation network for aggregation, so as to unify the various features with different resolutions and obtain multi-scale fused features.

[0040] In this embodiment, feature aggregation is performed according to the following formula:

$$F_{\mathrm{ms}}=\sum_{i=1}^{4}\omega_i\cdot\mathrm{Up}_i\!\left(\mathrm{Att}_i\odot\mathrm{GELU}\!\left(W_i * F_i+b_i\right)\right)$$

[0041] In the formula, $F_{\mathrm{ms}}$ represents the multi-scale fused feature with dimensions C×H×W, where C is the number of channels, and H and W are the height and width, respectively. $F_i$ represents the feature of the i-th resolution level, and the upper limit of the sum is the total number of resolution levels, which is 4 in this case. $W_i$ represents the convolution kernel weight matrix corresponding to the i-th resolution level, which is used to perform channel transformation and spatial filtering on the features of that resolution level. $*$ represents a two-dimensional convolution operation, in which the convolution kernel slides over each spatial location of the feature. $b_i$ represents the bias vector of the convolutional layer corresponding to the i-th resolution level, providing a learnable offset for each output channel. $\mathrm{GELU}(\cdot)$ represents the GELU activation function. $\mathrm{Att}_i$ represents the spatial attention map at the i-th resolution level, which is obtained by compressing features along the channel dimension and normalizing them using the sigmoid function, and is used to highlight important spatial locations. $\odot$ indicates the element-wise Hadamard product, which applies the attention weights to the activated features. $\mathrm{Up}_i(\cdot)$ represents the upsampling operator for the i-th resolution level, which uses bilinear interpolation to upsample the features of the i-th resolution level to the original size, so that features at different scales can be added and fused. $\omega_i$ represents the fusion weight coefficient for the i-th resolution level, which is adaptively generated by the dynamic weight adjuster based on the damage type label.

[0042] Understandably, the dynamic weight adjuster receives damage type labels as input and predicts fusion weight coefficients across four resolution levels using a small multilayer perceptron network. For missing damage, the dynamic weight adjuster tends to increase the weights at lower resolution scales to reconstruct the overall structure; for noisy damage, it increases the weights at higher resolution scales to accurately locate the noise; and for mixed damage, it generates a balanced weight distribution.
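The aggregation and the damage-dependent fusion weights can be illustrated with a small NumPy sketch. Several simplifications are assumptions rather than details from the text: the convolution is reduced to a 1×1 channel mix, nearest-neighbour repetition replaces bilinear upsampling, the attention map is computed from the activated features, the multilayer perceptron of the dynamic weight adjuster is replaced by fixed hypothetical logits passed through a softmax, and all levels share 4 input channels. All function names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(x, factor):
    # nearest-neighbour stand-in for the bilinear upsampling named in the text
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def aggregate(features, kernels, biases, weights):
    """features[i]: (C_in, H_i, W_i); kernels[i]: (C, C_in) 1x1 conv; weights[i]: fusion weight."""
    full_h = max(f.shape[1] for f in features)
    fused = 0.0
    for f, k, b, w in zip(features, kernels, biases, weights):
        act = gelu(np.einsum("oc,chw->ohw", k, f) + b[:, None, None])
        attn = sigmoid(act.mean(axis=0, keepdims=True))   # spatial attention map
        fused = fused + w * upsample_nearest(attn * act, full_h // f.shape[1])
    return fused

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
scales = [8, 4, 2, 1]                                     # 1/8, 1/4, 1/2, original
features = [rng.standard_normal((4, H // s, W // s)) for s in scales]
kernels = [rng.standard_normal((C, 4)) * 0.1 for _ in scales]
biases = [np.zeros(C) for _ in scales]

# Hypothetical adjuster output for "missing" damage: favour the coarse levels.
logits = np.array([2.0, 1.0, 0.5, 0.0])
weights = np.exp(logits) / np.exp(logits).sum()
fused = aggregate(features, kernels, biases, weights)
```

With these logits the coarsest level receives the largest weight, matching the described behaviour for missing-type damage; a noise-type label would shift the weight toward the finer levels.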

[0043] Step 24: Input the multi-scale fused features into the conditional coding projection layer. This layer compresses the multi-scale fused features into a one-dimensional vector through global average pooling and then maps it to a 512-dimensional guiding condition vector through two fully connected layers. The vector is then L2-normalized to obtain the final guiding condition vector G, which is also the conditional code.
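The projection in step 24 is short enough to sketch directly. The following is an illustration only: the hidden width of 256, the ReLU between the two fully connected layers, the 64 input channels and the random weights are assumptions, and the function name `condition_vector` is hypothetical. The 512-dimensional output and the L2 normalization come from the text.

```python
import numpy as np

def condition_vector(fused, W1, b1, W2, b2):
    """Global average pooling, two fully connected layers, then L2 normalization."""
    pooled = fused.mean(axis=(1, 2))              # (C, H, W) -> (C,)
    hidden = np.maximum(W1 @ pooled + b1, 0.0)    # first FC layer (ReLU assumed)
    g = W2 @ hidden + b2                          # second FC layer -> 512 dims
    return g / (np.linalg.norm(g) + 1e-12)        # unit L2 norm

rng = np.random.default_rng(0)
fused = rng.standard_normal((64, 32, 32))         # stand-in multi-scale fused features
W1, b1 = rng.standard_normal((256, 64)) * 0.1, np.zeros(256)
W2, b2 = rng.standard_normal((512, 256)) * 0.1, np.zeros(512)
G = condition_vector(fused, W1, b1, W2, b2)
```

Normalizing to unit length keeps the scale of G fixed regardless of the input energy, so the conditioning signal seen by the diffusion network depends only on its direction.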

[0044] Step 3: Input the damage mask, guiding condition vector G, and damaged spectrum S into a multi-layered progressive diffusion repair network for spectrum repair to obtain the repaired spectrum. Each layer of the progressive diffusion repair network repairs the spectrum structure of the corresponding resolution level. The damaged spectrum S is the spectrum of the damaged audio signal after damage masking.

[0045] Understandably, through multi-layered spectral restoration, a progressive spectral restoration from coarse to fine is achieved to obtain the restored spectrum.

[0046] The hierarchical progressive diffusion repair network for spectrum repair combines a diffusion probability model with a multi-resolution cascaded architecture, enabling progressive spectrum repair from coarse to fine.

[0047] For example, a specific implementation of step 3 includes the following steps: Step 31: Input the damage mask, guiding condition vector G, and damaged spectrum S into the denoising network. Downsample the damaged spectrum S to 1/4 of its original size to form a low-resolution spectrum. Input the low-resolution spectrum into the denoising network, which performs T1 diffusion steps according to the following diffusion formula to denoise and obtain the denoised estimated spectrum:

$$X_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\!\left(X_t,t,G,M\right)\right)+\sigma_t\,M\odot z$$

[0048] In the formula, $X_{t-1}$ represents the denoised estimated spectrum obtained after diffusion at time step t-1. $X_t$ is the noisy spectrum at time step t in the first stage, also known as the damaged spectrum S, and is the input to the denoising network. $\alpha_t$ represents the noise schedule coefficient of the diffusion process at time step t; its value lies between 0 and 1 and changes gradually with the time step, controlling the intensity of noise reduction at each step. $\bar\alpha_t$ is the cumulative noise coefficient from time step 0 to time step t, defined as $\bar\alpha_t=\prod_{s=1}^{t}\alpha_s$; it characterizes the overall degree to which the data is contaminated by noise at time step t. $\epsilon_\theta$ is the noise prediction neural network with parameters $\theta$; it adopts an improved U-Net architecture, its inputs are the noisy spectrum, the time-step embedding, the guiding condition vector and the damage mask, and its output is the predicted noise component from which the denoised estimated spectrum is formed. $G$ is the guiding condition vector, which is injected into the intermediate layers of the U-Net through a cross-attention mechanism to guide the repair direction. $M$ is the damage mask, which here serves as a spatial condition indicating which areas require repair. $\sigma_t$ is the noise standard deviation for time step t, used to control the degree of randomness in the denoising process. $z$ is a random noise vector sampled from a standard normal distribution, used to introduce randomness during the denoising process. Multiplying it by the damage mask, $M\odot z$, ensures that random noise is added only to damaged regions, while non-damaged regions receive deterministic updates. The fraction $1/\sqrt{\alpha_t}$ is the scaling factor for the denoising step, used to compensate for the amplitude attenuation introduced during noise addition. The fraction $(1-\alpha_t)/\sqrt{1-\bar\alpha_t}$ is the scaling factor for the noise prediction, which converts the predicted noise component into an appropriate denoising update.
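A single reverse step of this kind can be sketched as follows. The update follows the standard DDPM ancestral sampling rule with the random term gated by the damage mask, as the description of the scaling factors and the masked noise suggests; the linear noise schedule, the choice of the noise standard deviation, and the zero placeholder for the network output are assumptions made for illustration, and `reverse_step` is a hypothetical name.

```python
import numpy as np

def reverse_step(x_t, t, alphas, alpha_bars, eps_pred, mask, rng):
    """One masked reverse-diffusion step; noise is injected only where mask == 1."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(a_t)
    sigma_t = np.sqrt(1.0 - a_t) if t > 0 else 0.0   # a common choice, assumed here
    z = rng.standard_normal(x_t.shape)
    return mean + sigma_t * mask * z

T1 = 100
betas = np.linspace(1e-4, 0.02, T1)                  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)                      # cumulative product of alphas

x_t = np.random.default_rng(0).standard_normal((4, 6))
mask = np.zeros((4, 6))
mask[:, 2:4] = 1.0                                   # damaged time frames
eps_pred = np.zeros_like(x_t)                        # placeholder for the network output

a = reverse_step(x_t, 50, alphas, alpha_bars, eps_pred, mask, np.random.default_rng(1))
b = reverse_step(x_t, 50, alphas, alpha_bars, eps_pred, mask, np.random.default_rng(2))
```

Running the step twice with different random seeds gives identical values outside the mask and different values inside it, which is the deterministic-versus-stochastic split the text attributes to the masked noise term.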

[0049] Step 32: Upsample and downsample to obtain the upsampled spectrum (from the denoised estimated spectrum) and the downsampled spectrum, respectively. Fuse the upsampled and downsampled spectra to obtain the fused spectrum. Perform residual diffusion for several steps (e.g., 50 steps) on the fused spectrum to focus on repairing its harmonic texture and band correlation, thereby obtaining a medium-resolution spectrum. The number of diffusion steps in this step is less than the number of diffusion steps in Step 31. For example, if Step 31 has 100 diffusion steps, this step has 50 diffusion steps.

[0050] Understandably, compared to step 31, this step is performed at a medium resolution (half the original size). The residual diffusion strategy models only the residual between the coarse repair result and the true spectrum, thereby reducing the learning difficulty and accelerating convergence. The number of diffusion steps can be set to 50. The denoising in this step receives not only the guiding condition vector G but also the output of the coarse repair stage as a condition, realizing cross-stage information transfer.

[0051] Specifically, the denoised estimated spectrum is upsampled to half the original size to obtain the upsampled spectrum, which is then fused with the downsampled spectrum obtained from the downsampling.

[0052] Step 33: Perform residual diffusion on the medium-resolution spectrum at the original resolution to focus on restoring high-frequency details and fine textures, resulting in a restored spectrum. The number of residual diffusion steps in this step is less than in Step 32; for example, this step reduces the number of steps to 25. This reduction in steps is possible because the first two steps have already established stable low-resolution and medium-resolution spectra, i.e., stable low-frequency and mid-frequency structures. This step only needs to repair the remaining high-frequency deficiencies.

[0053] In this step, the following spectrum consistency constraint diffusion formula is used for repair to obtain the repaired spectrum:

[0054] In the formula, the output spectrum of each update is the result of this step and ultimately forms the repaired spectrum, and the input spectrum is the spectrum at time step t. The noise prediction network used in this step has its own parameters, which are independent of those in the previous two steps and are specifically optimized for high-frequency details. The medium-resolution spectrum output by the fine repair in Step 32 serves as a conditional input, providing prior information about the mid-frequency structure. The spectral consistency projection operator projects the denoising result onto a manifold that satisfies the physical constraints of the spectrum. Specifically, it performs boundary-smoothing fusion of the repair result with the original undamaged region while ensuring the non-negativity of the spectrum and energy conservation. S is the damaged spectrum of the original input. The complement of the damage mask identifies the non-damaged regions. The last term of the formula ensures that the original spectral values are directly preserved in the non-damaged regions, unaffected by the denoising process.
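The projection and mask separation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the projection here only clips negative values and rescales to a caller-supplied energy, and it omits the boundary-smoothing fusion that the text also attributes to the operator. The names `project_spectrum`, `constrained_update`, and `target_energy` are hypothetical.

```python
import numpy as np


def project_spectrum(x, target_energy):
    """Hypothetical projection: enforce non-negativity, then match a target energy."""
    x = np.maximum(x, 0.0)
    energy = np.sum(x ** 2)
    if energy > 0.0:
        x = x * np.sqrt(target_energy / energy)
    return x


def constrained_update(denoised, damaged_spec, mask, target_energy):
    """Keep the projected result in damaged cells and the original values elsewhere."""
    projected = project_spectrum(denoised, target_energy)
    # Last term: the complement of the mask passes the original spectrum through.
    return projected * mask + damaged_spec * (1.0 - mask)


rng = np.random.default_rng(1)
S = np.abs(rng.standard_normal((3, 5)))  # damaged input spectrum (magnitudes)
mask = np.zeros((3, 5))
mask[:, 1:3] = 1.0
denoised = rng.standard_normal((3, 5))  # stand-in for one denoising output
x_prev = constrained_update(denoised, S, mask, target_energy=np.sum(S ** 2))
```

Applying the complement mask after the projection is what guarantees that intact cells are bit-for-bit unchanged, regardless of what the network produces there.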

[0055] Step 4: The amplitude spectrum of the restored spectrum obtained by the diffusion restoration network is converted into a complete complex spectrum through phase reconstruction and balance fusion to reconstruct a high-quality time-domain audio signal and complete the audio restoration.

[0056] Understandably, this step, through spectral consistency constraints and post-processing, transforms the amplitude spectrum output by the diffusion repair network into a complete complex spectrum. This transformation requires phase reconstruction.

[0057] Specifically, the implementation of step 4 includes the following steps: Step 41: Input the phase spectrum of the damaged audio signal, the damage mask, and the repair spectrum obtained in step 3 into the deep phase prediction network for processing to obtain the estimated phase of the damaged region of the damaged audio signal.

[0058] The first challenge is phase reconstruction, a key issue in spectral-domain audio restoration. For non-damaged regions, the original phase spectrum can be reused directly; for damaged regions, however, the original phase is often unreliable or completely missing and must be estimated from the restored amplitude spectrum. To address this, a deep phase prediction network is designed.

[0059] The deep phase prediction network adopts an encoder-decoder architecture. The encoder extracts local structural features of the amplitude spectrum in the repair spectrum, and the decoder generates phase prediction step by step through deconvolution. The atan2 function is used in the output layer to ensure that the predicted phase falls within the range of [-π, π].
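As a small illustration of the output layer, the sketch below assumes the decoder emits two real-valued channels that are interpreted as the real-like and imaginary-like parts of a unit phasor; that two-channel layout is an assumption, since the text only states that atan2 is used. `np.arctan2` always returns values in [-π, π], whatever the scale of its inputs.

```python
import numpy as np


def phase_from_two_channels(real_like, imag_like):
    """Map two unbounded decoder channels to a phase in [-pi, pi]."""
    return np.arctan2(imag_like, real_like)


rng = np.random.default_rng(0)
real_like = 100.0 * rng.standard_normal((5, 7))  # arbitrary-scale decoder outputs
imag_like = 100.0 * rng.standard_normal((5, 7))
phase = phase_from_two_channels(real_like, imag_like)
```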

[0060] Step 42: Enhance the consistency between the estimated phase and the amplitude of the damaged audio signal to obtain the phase matrix, the dimension of which is the same as the dimension of the amplitude spectrum.

[0061] This step follows a consistency iterative optimization formula, which employs an improved Griffin-Lim algorithm for iterative optimization to enhance the consistency between the estimated phase and amplitude:

[0062] In the formula, the result is the phase matrix obtained after the (n+1)th iteration, and the phase matrix of the nth iteration serves as the input to the current iteration (the (n+1)th iteration). The repaired spectrum supplies the amplitude. The unit complex matrix is constructed from the current phase matrix, where j is the imaginary unit, and converts the phase into complex exponential form. The element-wise multiplication operator multiplies the amplitude by this complex exponential to construct a complex spectrum. iSTFT(•) denotes the inverse short-time Fourier transform, which converts the complex spectrum back to a time-domain signal. STFT(•) denotes the short-time Fourier transform, which converts the time-domain signal back to the frequency domain. The argument operation extracts the phase component of the transformed complex spectrum. This process of "complex spectrum → time domain → complex spectrum → phase extraction" is the core of the Griffin-Lim algorithm, which uses the consistency constraint between STFT and iSTFT to optimize the phase. The damage mask ensures that the iterative optimization is applied only to the damaged region. The original phase spectrum, i.e. the phase spectrum of the damaged audio signal, is used directly in the undamaged region. The smoothing regularization coefficient controls the strength of the smoothing constraint; a typical value is... The final term is the gradient of the smoothing loss function with respect to the phase. The loss function is defined as the sum of the squares of the second derivatives of the phase at the boundary of the damaged region, and its gradient indicates the update direction that makes the phase transition smoother.

[0063] After N=50 iterations, the phase matrix can maintain a stable state consistent with the repaired spectrum.
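The iteration can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the claimed formula: the update is assumed to take the form φ_{n+1} = M ⊙ (arg(STFT(iSTFT(A ⊙ e^{jφ_n}))) − λ∇L) + (1 − M) ⊙ φ_orig; the STFT and iSTFT are minimal hand-written versions; the smoothness gradient uses second differences along the time axis over the whole matrix (the mask then gates the update) and ignores phase wrapping; and the value `lam=0.01` is arbitrary.

```python
import numpy as np

N_FFT, HOP = 64, 16
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def stft(x):
    """Minimal STFT: returns a complex array of shape (freq, time)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T


def istft(spec, length):
    """Minimal inverse STFT by windowed overlap-add."""
    frames = np.fft.irfft(spec.T, n=N_FFT, axis=1) * WINDOW
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)


def smoothness_grad(phase):
    """Gradient of sum((second time difference)^2); ignores phase wrapping."""
    d2 = phase[:, :-2] - 2.0 * phase[:, 1:-1] + phase[:, 2:]
    grad = np.zeros_like(phase)
    grad[:, :-2] += 2.0 * d2
    grad[:, 1:-1] -= 4.0 * d2
    grad[:, 2:] += 2.0 * d2
    return grad


def refine_phase(amplitude, phase_init, phase_orig, mask, length, n_iter=50, lam=0.01):
    """Masked Griffin-Lim-style refinement with a smoothness gradient term."""
    phase = phase_init.copy()
    for _ in range(n_iter):
        spec = amplitude * np.exp(1j * phase)  # amplitude times unit phasor
        projected = np.angle(stft(istft(spec, length)))  # STFT/iSTFT round trip
        updated = projected - lam * smoothness_grad(projected)
        # Damaged cells take the update; intact cells keep the original phase.
        phase = mask * updated + (1.0 - mask) * phase_orig
    return phase


length = N_FFT + HOP * 30
clean = stft(np.sin(2 * np.pi * 0.05 * np.arange(length)))
amplitude, phase_orig = np.abs(clean), np.angle(clean)
mask = np.zeros(amplitude.shape)
mask[:, 10:15] = 1.0  # five damaged time frames
rng = np.random.default_rng(0)
phase_init = np.where(mask > 0, rng.uniform(-np.pi, np.pi, mask.shape), phase_orig)
phase = refine_phase(amplitude, phase_init, phase_orig, mask, length)
```

In the described method the initial estimate would come from the deep phase prediction network; the random initialization here is only a stand-in.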

[0064] Step 43: Feather the boundary between the repaired and non-damaged regions of the repaired spectrum to achieve smooth fusion of the spectrum boundaries and obtain the processed repaired spectrum.

[0065] Feathering is performed using the following Gaussian weighted fusion formula:

[0066]

[0067] In the formula, the result is the spectral value at frequency index k and time frame t after final fusion, that is, the spectral value after feathering. The two spectral terms are the values of the repaired spectrum and of the original spectrum, respectively, at frequency index k and time frame t. The fusion weight is defined at each position (k, t), and its range of values is... The distance term is the Euclidean distance from position (k, t) to the boundary of the damaged area. The boundary smoothing bandwidth parameter controls the width of the transition region; a typical value is 5 pixels. The Gaussian decay function makes the weights transition smoothly near the boundary. The boundary-region indicator function takes the value 1 when position (k, t) lies within the smoothing band at the boundary of the damaged region, and 0 otherwise. The internal-region indicator function takes the value 1 when position (k, t) lies inside the damaged area and far from the boundary, and 0 otherwise. This formula ensures that the repair results are fully used inside the damaged area and that the spectrum transitions smoothly to the original data (non-repair results) at the boundary.
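Since the fusion formula itself is not reproduced in this text, the sketch below is one possible reading of the description, not the claimed formula. It assumes a weight of 1 − exp(−d²/(2σ²)) inside the boundary band (so the weight tends to 0 at the boundary and the original data dominates there), a weight of 1 in the interior, a weight of 0 in intact cells, and a band width of 3σ. The brute-force distance computation is only suitable for small masks.

```python
import numpy as np


def feather_weights(mask, sigma=5.0, band=None):
    """Fusion weight per cell under the assumptions stated above."""
    band = 3.0 * sigma if band is None else band  # assumed width of the boundary band
    weights = np.zeros(mask.shape)
    damaged = np.argwhere(mask > 0.5)
    intact = np.argwhere(mask <= 0.5)
    if len(damaged) == 0 or len(intact) == 0:
        weights[mask > 0.5] = 1.0
        return weights
    # Euclidean distance from every damaged cell to its nearest intact cell.
    diff = damaged[:, None, :] - intact[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    in_band = dist <= band  # boundary-region indicator; the rest is interior
    w = np.where(in_band, 1.0 - np.exp(-dist ** 2 / (2.0 * sigma ** 2)), 1.0)
    weights[damaged[:, 0], damaged[:, 1]] = w
    return weights


mask = np.zeros((12, 40))
mask[:, 10:30] = 1.0  # damaged time frames 10..29
rng = np.random.default_rng(0)
original = np.abs(rng.standard_normal(mask.shape))
repaired = np.abs(rng.standard_normal(mask.shape))
w = feather_weights(mask, sigma=5.0)
fused = w * repaired + (1.0 - w) * original
```

With this choice the fused spectrum equals the original outside the damaged region and leans increasingly on the repaired values toward the centre of the damage.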

[0068] Step 44: Perform perceptual enhancement on the repaired spectrum after smooth fusion of the spectral boundaries to obtain the perceptually enhanced repaired spectrum.

[0069] Understandably, the purpose of perceptual enhancement is to fine-tune the fused spectrum based on the masking characteristics of human hearing, thereby suppressing spectral artifacts that may cause auditory discomfort.

[0070] Step 45: Perform an inverse short-time Fourier transform on the repaired spectrum after perceptual enhancement to obtain the restored audio signal, thus completing the restoration.

[0071] The purpose of using the inverse short-time Fourier transform is to convert the complex spectrum back into an audio signal in the time domain.

[0072] This invention innovatively proposes a damage region detection method based on multi-directional spectral gradients. It constructs a comprehensive spectral gradient feature map by calculating the gradient components of the amplitude spectrum in the time, frequency, and mixed time-frequency directions. The innovative spectral gradient calculation formula integrates the first-order partial derivatives of time, frequency, and the mixed second-order partial derivatives using learnable weight coefficients, accurately capturing abrupt changes in damage region boundaries, particularly demonstrating excellent detection capabilities for oblique damage boundaries. This gradient feature, combined with the original amplitude spectrum, is fed into an attention-based damage detection network to achieve precise damage location and automatic damage type classification. Key features include the mathematical definition of multi-directional spectral gradients, an adaptive adjustment method for learnable weight coefficients, a fusion input strategy for gradient features and amplitude spectrum, and a damage detection network architecture based on U-Net and a self-attention mechanism.
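A minimal sketch of the multi-directional spectral gradient follows. It assumes the combined form sqrt(w_t (∂A/∂t)² + w_f (∂A/∂k)² + w_tf (∂²A/∂k∂t)²), which is consistent with the description of three weighted terms under a square root but whose exact weight placement is an assumption. The weights are plain non-negative constants here rather than learned parameters, and the derivatives are finite differences from `np.gradient`.

```python
import numpy as np


def spectral_gradient(A, w_t=1.0, w_f=1.0, w_tf=1.0):
    """Combined gradient map of an amplitude spectrum A with shape (freq k, time t)."""
    dA_df, dA_dt = np.gradient(A)  # axis 0 = frequency, axis 1 = time
    d2A_dfdt = np.gradient(dA_dt, axis=0)  # mixed second-order difference
    return np.sqrt(w_t * dA_dt ** 2 + w_f * dA_df ** 2 + w_tf * d2A_dfdt ** 2)


A = np.ones((8, 10))
A[:, 5:] = 0.0  # energy drops to zero from time frame 5 onward
g = spectral_gradient(A)
```

In this toy spectrum the gradient map is zero in the flat regions and non-zero only around the abrupt change at frames 4 and 5, which is the behaviour the detection network relies on.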

[0073] This invention designs a multi-scale feature pyramid network for guiding signal generation, extracting contextual semantic information around the damaged region at four different resolution levels. An innovative adaptive spectral aggregation formula fuses features at different scales through an attention-weighted mechanism. Features at each scale are upsampled to a uniform resolution after convolutional transformation, GELU activation, and spatial attention modulation before being weighted and summed. A dynamic weight adjuster automatically adjusts the fusion weights of features at each scale based on the damage type label, enhancing the role of low-resolution global features for missing damage and enhancing the role of high-resolution local features for noisy damage, achieving adaptive processing capabilities for different damage types. The key features of this method encompass the four-level feature pyramid network structure design, the mathematical formula for adaptive spectral aggregation, the spatial attention modulation mechanism, and the dynamic weight generation network based on damage type.
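The aggregation described above can be written compactly as follows. This is a reconstruction from the verbal description (convolution, GELU activation, spatial attention modulation, upsampling, then weighted summation), and every symbol name is an assumption introduced for illustration: F_i is the feature at the i-th resolution level, W_i and b_i are the convolution kernel and bias, A_i is the spatial attention map, U_i is the bilinear upsampling operator, and λ_i is the fusion weight produced by the dynamic weight adjuster from the damage type label.

```latex
F_{\mathrm{ms}} \;=\; \sum_{i=1}^{4} \lambda_i \,
  \mathcal{U}_i\!\Big( A_i \odot \mathrm{GELU}\big( W_i * F_i + b_i \big) \Big),
\qquad F_{\mathrm{ms}} \in \mathbb{R}^{C \times H \times W}
```

Under this reading, a missing-type damage label would raise the λ_i of the low-resolution levels, and a noise-type label would raise the λ_i of the high-resolution levels.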

[0074] This invention overcomes the single-scale repair limitation of traditional diffusion models by proposing a three-stage cascaded diffusion repair network that progresses from coarse to fine. The coarse repair stage performs 100 diffusion steps at 1/4 resolution to reconstruct the global structure and energy distribution. The fine repair stage performs 50 diffusion steps at 1/2 resolution to restore harmonic texture and frequency-band correlation. The precise repair stage performs 25 diffusion steps at the original resolution to supplement high-frequency details and refine transitions. Information flows between the three stages through residual connections and feature propagation, and the diffusion process of each stage is conditioned on the adaptive guiding vector. This hierarchical design allocates computational resources efficiently: the low-resolution stage quickly determines the overall framework, and the high-resolution stage refines local details. The protection points of this architecture include the three-stage cascaded network topology, the resolution configuration and diffusion step count of each stage, the residual connections and feature propagation methods between stages, and the multi-resolution conditional guidance injection mechanism.
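The effect of this step allocation can be estimated with a short calculation. It rests on two assumptions that the text does not state: that the resolution factor applies to each axis of the spectrum, and that the cost of one denoising step is proportional to the number of time-frequency cells. Under a different cost model the figures change.

```python
# (stage name, diffusion steps, per-axis resolution factor)
stages = [("coarse", 100, 0.25), ("fine", 50, 0.5), ("precise", 25, 1.0)]


def relative_cost(steps, scale):
    """Cost in full-resolution-step equivalents, assuming cost ~ cell count."""
    return steps * scale ** 2


per_stage = {name: relative_cost(steps, scale) for name, steps, scale in stages}
total = sum(per_stage.values())
baseline = sum(steps for _, steps, _ in stages)  # same step count, all at full resolution
ratio = total / baseline
```

Under these assumptions the three stages cost the equivalent of 6.25, 12.5, and 25 full-resolution steps, 43.75 in total, which is a quarter of the cost of running all 175 steps at the original resolution.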

[0075] This invention addresses the specific needs of audio restoration tasks by designing a diffusion denoising formula that integrates guided conditions and damage masks. Building upon standard diffusion denoising updates, a guiding vector G is introduced and injected into the noise prediction network via a cross-attention mechanism to guide the semantic direction of the restored content. Simultaneously, the random noise term is multiplied by the damage mask M, ensuring that randomness only affects damaged regions, while non-damaged regions maintain deterministic updates. In the fine restoration stage, a spectral consistency projection operator is further introduced. After each denoising step, the result is projected onto a manifold that satisfies spectral physical constraints, and a mask separation operation ensures that non-damaged regions directly retain their original values. The protection points of this formula include the mathematical expression of conditionally guided denoising, the handling method for mask-constrained random noise, the definition of the spectral consistency projection operator, and the separation and fusion strategy for damaged and non-damaged regions.

[0076] This invention addresses the phase estimation challenge in spectral-domain restoration by proposing a hybrid reconstruction scheme that combines a deep phase prediction network with an improved Griffin-Lim algorithm. The deep phase prediction network takes the repaired amplitude spectrum and the damage mask as input and directly predicts the phase values of the damaged region through an encoder-decoder structure. The output layer uses an atan2 activation function to keep the predicted phase within the valid range. The network output serves as the initial estimate and is fed into an improved consistency iterative optimization process, in which the consistency between phase and amplitude is enhanced through STFT-iSTFT round-trip transformation. The iterative formula adds a smoothing-regularization gradient term to the standard Griffin-Lim update, promoting a smooth phase transition at the damage boundary, and uses masking operations to ensure that the phase in non-damaged regions retains its original value. The protection points of this method include the architecture design of the deep phase prediction network, the consistency iterative formula with smoothing constraints, the mask-selective update mechanism, and the cascaded process of deep prediction and iterative optimization.

[0077] This invention designs a refined fusion method for the transition between the repaired and original regions. The innovative fusion weight calculation formula generates smooth transition fusion weights based on the Euclidean distance from each time-frequency location to the damage boundary using a Gaussian attenuation function. A boundary bandwidth parameter controls the width of the transition region, and a region indicator function distinguishes between the boundary and internal regions, applying different weighting strategies—the internal region fully utilizes the repaired result, the boundary region mixes the repaired result with the original content using Gaussian weights, and the non-damaged region retains its original values. This design effectively eliminates potential discontinuities and auditory artifacts between the repaired and original regions. The protection points of this strategy include the distance-based Gaussian attenuation weighting function, the mechanism of the boundary bandwidth parameter, the definition of the region indicator function, and the complete calculation process of weighted fusion.

[0078] This invention integrates four modules (spectral sensing analysis, adaptive guidance generation, hierarchical diffusion repair, and consistency post-processing) into a unified end-to-end training framework. The interface design between modules ensures smooth gradient propagation, enabling the entire system to be jointly optimized for the final repair quality target. The training loss function considers multiple dimensions, including signal-domain reconstruction error, spectral-domain structural similarity, perceptual-domain quality assessment, and adversarial discrimination loss, so that the modules co-evolve through multi-task learning. The protection points of this framework include the cascaded topology of the four modules, the design of data flow and gradient flow between modules, the construction of the multi-dimensional joint loss function, and the optimization strategy for end-to-end training.

[0079] As shown in Figure 2, the present invention also provides a processing terminal 100, which includes: a memory 101 for storing program instructions; and a processor 102 configured to run the program instructions to perform the steps of the audio restoration method based on the hierarchical progressive diffusion model.

[0080] The embodiments disclosed in this specification are merely illustrative of one aspect of the invention, and the scope of protection of the invention is not limited to these embodiments. Any other functionally equivalent embodiments fall within the scope of protection of the invention. Those skilled in the art can make various other corresponding changes and modifications based on the technical solutions and concepts described above, and all such changes and modifications should fall within the scope of protection of the claims of this invention.

Claims

1. An audio restoration method based on a hierarchical progressive diffusion model, characterized in that it comprises the following steps: Step 1: Input the damaged audio signal to be repaired into the spectrum sensing and analysis module; the spectrum sensing and analysis module analyzes the spectrum of the damaged audio signal and processes the analysis result using an attention-based damage detection network to obtain the damage mask, spectral feature map, and damage type label of the damaged audio signal; Step 2: The adaptive guidance signal generation module concatenates the input damage mask and spectral feature map to obtain a combined feature tensor, extracts features from the combined feature tensor at multiple different resolution levels to obtain multi-scale features, aggregates the multi-scale features to obtain multi-scale fusion features, and generates, based on the multi-scale fusion features, a guiding condition vector G suitable for the diffusion probability model; Step 3: Input the damage mask, the guiding condition vector G, and the damaged spectrum S into a multi-layer hierarchical progressive diffusion repair network for spectrum repair to obtain the repaired spectrum, wherein each layer of the hierarchical progressive diffusion repair network repairs the spectral structure at its corresponding resolution level, and the damaged spectrum S is the spectrum of the damaged audio signal after damage masking; Step 4: Convert the amplitude spectrum of the repaired spectrum obtained by the diffusion repair network into a complete complex spectrum through phase reconstruction and balance fusion, so as to reconstruct a high-quality time-domain audio signal and complete the audio restoration.

2. The audio restoration method based on a hierarchical progressive diffusion model according to claim 1, characterized in that the specific implementation of step 1 includes the following steps: Step 11: Perform a short-time Fourier transform on the damaged audio signal to obtain a discrete time-frequency domain audio signal; Step 12: Perform complex spectral decomposition on the discrete time-frequency domain audio signal to obtain the amplitude spectrum and the phase spectrum; Step 13: Calculate the spectral gradient of the amplitude spectrum, which includes the time gradient and the frequency gradient, the calculated spectral gradient capturing abrupt changes and discontinuities in the spectrum; Step 14: Concatenate the spectral gradient and the amplitude spectrum, and feed the concatenated result into an attention-based damage detection network built on the U-Net architecture to identify the damaged region, thereby obtaining the damage mask, spectral feature map, and damage type label of the damaged audio signal.

3. The audio restoration method based on a hierarchical progressive diffusion model according to claim 2, characterized in that the spectral gradient is calculated using the following formula: In the formula, the result is the combined spectral gradient at frequency index k and time frame index t; A is the amplitude spectrum matrix; the three derivative terms are the first-order partial derivative of the amplitude spectrum along the time direction, the first-order partial derivative of the amplitude spectrum along the frequency direction, and the mixed second-order partial derivative of the amplitude spectrum; the three learnable weight coefficients control the contribution ratios of the time gradient, the frequency gradient, and the mixed time-frequency term in the overall gradient calculation, respectively; the square-root operation is used to calculate the Euclidean norm; and the three weight coefficients are automatically optimized during training through backpropagation.

4. The audio restoration method based on a hierarchical progressive diffusion model according to claim 1, characterized in that the specific implementation of step 2 includes the following steps: Step 21: Concatenate the damage mask and the spectral feature map input into the adaptive guidance signal generation module to form a combined feature tensor; Step 22: Input the combined feature tensor into the multi-scale feature pyramid network to extract features at four different resolution levels, thereby obtaining multi-scale features; Step 23: Input the multi-scale features into the feature aggregation network for aggregation, so as to unify the features of different resolutions and obtain multi-scale fusion features; Step 24: Input the multi-scale fusion features into the conditional coding projection layer, which compresses the multi-scale fusion features into a one-dimensional vector through global average pooling and then maps it to a 512-dimensional guiding condition vector through two fully connected layers; this vector is then L2-normalized to obtain the final guiding condition vector G.

5. The audio restoration method based on a hierarchical progressive diffusion model according to claim 4, characterized in that the first resolution level downsamples the extracted features to 1/8 of the size of the combined feature tensor; the second resolution level downsamples the extracted features to 1/4 of the size of the combined feature tensor; the third resolution level downsamples the extracted features to 1/2 of the size of the combined feature tensor; and the fourth resolution level keeps the extracted features at the original size of the combined feature tensor. Each resolution level uses residual convolutional blocks for feature extraction, and the number of channels is gradually increased during downsampling.

6. The audio restoration method based on a hierarchical progressive diffusion model according to claim 5, characterized in that feature aggregation is performed using the following formula: In the formula, the result is the multi-scale fusion feature with dimensions C×H×W, where C is the number of channels and H and W are the height and width, respectively; the per-level input is the feature of the i-th resolution level, and 4 is the total number of resolution levels; the convolution kernel weight matrix corresponds to the i-th resolution level and is applied through a two-dimensional convolution operation; the bias vector belongs to the convolutional layer of the i-th resolution level; the activation is the GELU activation function; the spatial attention map of the i-th resolution level is obtained by compressing the features along the channel dimension and normalizing them with the sigmoid function; the element-wise product is the Hadamard product; the upsampling operator of the i-th resolution level uses bilinear interpolation to upsample the features of the i-th resolution level to the original size; and the fusion weight coefficient of the i-th resolution level is adaptively generated by the dynamic weight adjuster based on the damage type label.

7. The audio restoration method based on a hierarchical progressive diffusion model according to claim 1, characterized in that the specific implementation of step 3 includes the following steps: Step 31: Input the damage mask, the guiding condition vector G, and the damaged spectrum S into the denoising network, downsample the damaged spectrum S to 1/4 of the original size to form a low-resolution spectrum, and input the low-resolution spectrum into the denoising network to obtain the denoised estimated spectrum; Step 32: Upsample and downsample the denoised estimated spectrum to obtain the upsampled spectrum and the downsampled spectrum, fuse the upsampled spectrum and the downsampled spectrum to obtain the fused spectrum, and perform several steps of residual diffusion on the fused spectrum to obtain the medium-resolution spectrum; Step 33: Perform residual diffusion repair on the medium-resolution spectrum at the original resolution to obtain the repaired spectrum.

8. The audio restoration method based on a hierarchical progressive diffusion model according to claim 7, characterized in that, in step 31, the denoising network performs T1 diffusion steps according to the following diffusion formula to denoise and obtain the denoised estimated spectrum: In the formula, the result is the denoised estimated spectrum obtained after diffusion at time step t-1; the input is the noisy spectrum at time step t in the first stage, which corresponds to the damaged spectrum S; the noise schedule coefficient at time step t of the diffusion process takes a value between 0 and 1; the cumulative noise coefficient accumulates the schedule coefficients from time step 0 to time step t; the noise prediction neural network, with its learnable parameters, adopts an improved U-Net architecture, its inputs being the noisy spectrum, the time-step embedding, the guiding condition vector, and the damage mask, and its output being used to obtain the denoised estimated spectrum; M is the damage mask; the noise standard deviation is that of time step t; the random noise vector is sampled from a standard normal distribution; and of the two fractional coefficients, one is the scaling factor of the denoising step and the other is the scaling factor of the noise prediction. Step 33 uses the following spectral consistency constraint diffusion formula for repair to obtain the repaired spectrum: In the formula, the output spectrum is the result of each update of this step; the input spectrum is the spectrum at time step t; the noise prediction network is the one used in this step; the conditional input is the medium-resolution spectrum; the projection operator is the spectral consistency projection operator; S is the damaged spectrum; the complement of the damage mask identifies the non-damaged regions; and the last term ensures that the original spectral values are directly preserved in the non-damaged regions, unaffected by the denoising process.

9. The audio restoration method based on a hierarchical progressive diffusion model according to claim 1, characterized in that the specific implementation of step 4 includes the following steps: Step 41: Input the phase spectrum of the damaged audio signal, the damage mask, and the repaired spectrum obtained in step 3 into the deep phase prediction network for processing to obtain the estimated phase of the damaged region of the damaged audio signal; Step 42: Enhance the consistency between the estimated phase and the amplitude of the damaged audio signal to obtain the phase matrix, which has the same dimension as the amplitude spectrum, using the following consistency iterative optimization formula: In the formula, the result is the phase matrix obtained after the (n+1)th iteration; the input is the phase matrix of the nth iteration; the repaired spectrum supplies the amplitude; the unit complex matrix is constructed from the current phase matrix, where j is the imaginary unit; the element-wise multiplication operator combines amplitude and complex exponential; iSTFT(•) denotes the inverse short-time Fourier transform and STFT(•) denotes the short-time Fourier transform; the argument operation extracts the phase of a complex number; M is the damage mask; the original phase spectrum is the phase spectrum of the damaged audio signal; the smoothing regularization coefficient controls the smoothing constraint; and the final term is the gradient of the smoothing loss function with respect to the phase; Step 43: Feather the boundary between the repaired and non-damaged regions of the repaired spectrum to achieve smooth fusion of the spectral boundaries and obtain the processed repaired spectrum, feathering being performed using the following Gaussian weighted fusion formula: In the formula, the result is the spectral value at frequency index k and time frame t after final fusion; the two spectral terms are the values of the repaired spectrum and of the original spectrum, respectively, at frequency index k and time frame t; the fusion weight is defined at position (k, t); the distance term is the Euclidean distance from position (k, t) to the boundary of the damaged area; the bandwidth term is the boundary smoothing bandwidth parameter; the decay term is a Gaussian decay function; the boundary-region indicator function takes the value 1 when position (k, t) lies within the smoothing band at the boundary of the damaged area, and 0 otherwise; and the internal-region indicator function takes the value 1 when position (k, t) lies inside the damaged area and far from the boundary, and 0 otherwise; Step 44: Perform perceptual enhancement on the repaired spectrum after smooth fusion of the spectral boundaries to obtain the perceptually enhanced repaired spectrum; Step 45: Perform an inverse short-time Fourier transform on the repaired spectrum after perceptual enhancement to obtain the restored audio signal, thus completing the restoration.

10. A processing terminal, characterized in that it includes: a memory for storing program instructions; and a processor for running the program instructions to perform the steps of the audio restoration method based on a hierarchical progressive diffusion model according to any one of claims 1-9.