Audio processing method and apparatus

By using frequency domain pop detection and time domain restoration models, the accuracy problem of pop restoration in complex video content scenarios in existing technologies has been solved, and efficient restoration of arbitrary audio content and sampling rate has been achieved.

CN122201329A · Pending · Publication Date: 2026-06-12 · BEIJING ZITIAO NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date
2024-12-12
Publication Date
2026-06-12


Abstract

Embodiments of the present disclosure provide an audio processing method and device, the method comprising: obtaining audio data of a video file, converting each frame data in the audio data into frequency domain data by a pop noise detection model, and detecting whether pop noise data exists in each frame data based on the frequency domain data; if it is detected that pop noise data exists in the frame data, performing pop noise repair processing on the frame data by a pop noise repair model to obtain repaired frame data, the pop noise repair model being used to filter out pop noise data and spectral mirror image in the frame data; and performing normalization processing on audio loudness of the repaired frame data based on a maximum amplitude of the repaired frame data in a time domain to obtain repaired audio data of the video file. The method can improve the accuracy of audio processing.

Description

Technical Field

[0001] This disclosure relates to the field of Internet technology, and in particular to an audio processing method and device.

Background Technology

[0002] Audio clipping is a common type of audio quality defect. In actual audio recording systems, because there is a certain upper limit to the loudness of recordable audio, clipping occurs when the loudness of the sound to be recorded is too high. Clipping sounds like a hissing or booming noise, and its presence degrades the user's listening experience. Therefore, clipping in audio data needs to be processed to improve the user experience.

[0003] Current mainstream methods for pop sound restoration treat it as an inverse problem, inferring the true data of pop sound segments based on observed non-pop sound data, and then solving it using relevant theories of constrained optimization methods. Among these, sparsity-based pop sound restoration is the most common method. However, this type of method makes a sparsity assumption about the audio signal, which limits its application in many practical scenarios, especially for complex video content scenarios where the audio is often a mixture of various sound sources, making it difficult to satisfy the sparsity assumption and resulting in lower accuracy for these pop sound restoration methods.

Summary of the Invention

[0004] This disclosure provides an audio processing method and apparatus that can improve the accuracy of popping sound repair.

[0005] In a first aspect, embodiments of this disclosure provide an audio processing method, including:

[0006] The audio data of the video file is obtained, and each frame of the audio data is converted into frequency domain data through a pop detection model. Based on the frequency domain data, it is detected whether pop data exists in each frame.

[0007] If popping sound data is detected in the frame data, the frame data is processed by a popping sound repair model to obtain repaired frame data. The popping sound repair model is used to filter out popping sound data and spectral mirroring in the frame data.

[0008] Based on the maximum amplitude of the repaired frame data in the time domain, the audio loudness of the repaired frame data is normalized to obtain the repaired audio data of the video file.

[0009] In a second aspect, embodiments of this disclosure provide an audio processing device, including:

[0010] The acquisition unit is used to acquire the audio data of the video file, convert each frame data in the audio data into frequency domain data through the pop sound detection model, and detect whether pop sound data exists in each frame data based on the frequency domain data.

[0011] An audio processing unit is configured to perform pop noise repair processing on the frame data using a pop noise repair model if pop noise data is detected in the frame data, thereby obtaining repaired frame data. The pop noise repair model is used to filter out pop noise data and spectral mirrors in the frame data.

[0012] The limiting unit is used to normalize the audio loudness of the repaired frame data based on the maximum amplitude value of the repaired frame data in the time domain, so as to obtain the repaired audio data of the video file.

[0013] In a third aspect, embodiments of this disclosure provide an electronic device, including: a processor and a memory;

[0014] The memory stores computer-executable instructions;

[0015] The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio processing method as described in the first aspect and various possible designs of the first aspect.

[0016] In a fourth aspect, embodiments of this disclosure provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the audio processing method described in the first aspect and various possible designs of the first aspect.

[0017] In a fifth aspect, embodiments of this disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.

[0018] This embodiment provides an audio processing method and apparatus. The method includes: acquiring audio data from a video file; converting each frame of the audio data into frequency domain data using a pop detection model; detecting whether pop data exists in each frame based on the frequency domain data; if pop data is detected, performing pop repair processing on the frame data using a pop repair model to obtain repaired frame data; the pop repair model is used to filter out pop data and spectral mirroring in the frame data; and normalizing the audio loudness of the repaired frame data based on the maximum amplitude in the time domain to obtain repaired audio data of the video file. In this technical solution, pop detection is performed in the frequency domain, and pops can be clearly observed in the frequency domain, thus improving the accuracy of pop detection. Furthermore, pop repair is performed in the time domain, and only the truncated components of the waveform need to be compensated in the time domain, thus allowing repair of arbitrary audio content and arbitrary sampling rates, and is applicable to pop truncation in scenarios such as variable sampling and encoding / decoding distortion, thereby improving the accuracy of audio repair.

Description of the Drawings

[0019] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a schematic diagram illustrating an application scenario of an audio processing method provided in an embodiment of the present disclosure;

[0021] Figure 2 is a flowchart of the audio processing method provided in the embodiments of this disclosure;

[0022] Figure 3 is a first schematic diagram of the audio processing method provided in the embodiments of this disclosure;

[0023] Figure 4 is a second schematic diagram of the audio processing method provided in the embodiments of this disclosure;

[0024] Figure 5 is a schematic diagram of the structure of an audio processing device provided in an embodiment of the present disclosure;

[0025] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.

Detailed Implementation

[0026] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0028] Audio clipping is a common type of audio quality defect. In actual audio recording systems, because there is a certain upper limit to the loudness of the audio that can be recorded, a clipping sound will occur when the loudness of the sound to be recorded is too high. The clipping sound is perceived as a hissing or booming noise, and its presence degrades the user's listening experience.

[0029] Furthermore, the presence of pops in audio processing can limit the effectiveness of processing algorithms and even produce unreasonable results. For example, in echo cancellation algorithms, if a pop occurs in the acquired signal, the correlation calculation between the reference signal and the acquired signal will be inaccurate, easily leading to echo leakage. Therefore, it is necessary to process pops in audio data to improve the user experience.

[0030] Current mainstream methods for pop sound restoration treat it as an inverse problem, inferring the true data of pop sound segments based on observed non-pop sound data and then solving it using constrained optimization theory. Among these, sparsity-based pop sound restoration is the most common method. However, this approach assumes sparsity in the audio signal, limiting its application in many real-world scenarios, especially complex video content where audio is often a mixture of various sound sources, making it difficult to satisfy the sparsity assumption. Furthermore, these methods rely on sampling-level pop sound detection. When encountering soft pops, encoding / decoding distortion, or secondary editing by the video creator (variable sampling, overlaying other components, etc.), the accuracy of pop sound detection significantly decreases, leading to limited or even failed restoration results. Therefore, the accuracy of these pop sound restoration methods is relatively low.

[0031] Therefore, there is an urgent need for an effective technical solution to improve the accuracy of methods for repairing popping sounds.

[0032] In video creation scenarios, increasing the loudness of audio content during editing can easily lead to popping and clipping. Some audio editing software uses a limiter to limit audio exceeding a threshold, producing soft pops. Unlike hard clipping, which can be directly detected based on amplitude values in the time domain waveform, soft popping cannot be directly detected, making popping repair more challenging. This invention focuses on the popping clipping problem in video creation scenarios, aiming to design a universal popping repair method that can repair arbitrary audio content and sampling rates, and is applicable to popping clipping in scenarios with variable sampling and codec distortion.
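The difference between hard and soft clipping described above can be illustrated with a minimal sketch. The tanh-shaped limiter, the threshold value, and the helper names below are assumptions chosen for illustration; the disclosure does not specify how any particular editing tool limits audio.

```python
import numpy as np

def hard_clip(x, threshold):
    """Truncate every sample to the range [-threshold, threshold]."""
    return np.clip(x, -threshold, threshold)

def soft_clip(x, threshold):
    """Hypothetical tanh-shaped limiter: saturates smoothly and never reaches the threshold."""
    return threshold * np.tanh(x / threshold)

def fraction_at_ceiling(x, threshold, tol=1e-9):
    """Share of samples sitting at the ceiling, a simple time-domain hard-clipping cue."""
    return float(np.mean(np.abs(x) >= threshold - tol))

t = np.arange(4800)
loud = 2.0 * np.sin(2 * np.pi * 100 * t / 48000)  # tone that exceeds full scale

hard_fraction = fraction_at_ceiling(hard_clip(loud, 1.0), 1.0)
soft_fraction = fraction_at_ceiling(soft_clip(loud, 1.0), 1.0)
```

In this sketch the amplitude test flags the hard-clipped tone but finds nothing in the soft-clipped one, even though both waveforms are distorted. This is consistent with the motivation for detecting pops from spectral features instead.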

[0033] To address the technical problems in existing technologies, the inventors' technical concept is as follows: Since pop sounds have obvious characteristics observed in the frequency domain, such as local spectral energy leakage, this invention considers pop sound detection in the frequency domain. However, if pop sound repair were also performed in the frequency domain, some locations would need to be compensated (where spectral energy has been lost) while others would need to be suppressed (where the leaked spectral energy appears). In this case, the model has to perform both addition and subtraction, resulting in a large learning burden and limited effectiveness. In the time domain, only the truncated components of the waveform need to be compensated, i.e., only addition is required. Comparatively, the model can more easily learn the data mapping pattern in the pop sound repair task. Therefore, this invention considers pop sound repair in the time domain.
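The spectral energy leakage mentioned above can be made concrete with a small numerical sketch. The test tone, its amplitude, and the function name are illustrative assumptions and are not taken from the disclosure.

```python
import numpy as np

def out_of_band_energy_ratio(signal, tone_bin):
    """Fraction of spectral energy lying outside the FFT bin of the original tone."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    return float((power.sum() - power[tone_bin]) / power.sum())

n = 2048
tone_bin = 32  # assumed: an exact FFT bin, so the clean tone shows no leakage
t = np.arange(n)
clean = 1.5 * np.sin(2 * np.pi * tone_bin * t / n)
clipped = np.clip(clean, -1.0, 1.0)  # hard clipping at full scale

clean_ratio = out_of_band_energy_ratio(clean, tone_bin)
clipped_ratio = out_of_band_energy_ratio(clipped, tone_bin)
```

The clean tone keeps essentially all of its energy in one bin, while the clipped tone spreads a measurable share into harmonics. That spread energy is the kind of cue a frequency-domain detector can pick up.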

[0034] Based on the above considerations, this invention proposes a general pop clipping repair method applicable to practical video creation scenarios. The method includes: a relatively simple pop detection model for pop detection in the frequency domain, and a robust general pop repair model for pop repair in the time domain.

[0035] Accordingly, the specific steps may include: First, acquiring the audio data of the video file; converting each frame of the audio data into frequency domain data using a pop detection model; and detecting the presence of pop data in each frame based on the frequency domain data. Then, if pop data is detected in the frame data, performing pop repair processing on the frame data using a pop repair model to obtain repaired frame data. The pop repair model is used to filter out pop data and spectral mirroring in the frame data. Finally, based on the maximum amplitude of the repaired frame data in the time domain, normalizing the audio loudness of the repaired frame data to obtain the repaired audio data of the video file.

[0036] In this technical solution, pop detection is performed in the frequency domain, where pops can be clearly observed, thus improving the accuracy of pop detection. Furthermore, pop repair is performed in the time domain, where only the truncated components of the waveform need to be compensated. This allows for the repair of any audio content and any sampling rate, and is applicable to pop clipping in scenarios such as variable sampling and codec distortion, thereby improving the accuracy of audio repair.

[0037] The application scenarios of the embodiments of this disclosure are explained below:

[0038] The audio processing method provided in this disclosure can be applied to various scenarios in which popping sounds in video files are repaired. Figure 1 is a schematic diagram illustrating an application scenario of an audio processing method provided in an embodiment of this disclosure. As shown in Figure 1, when a user sends a video file playback request to the server 102 via the terminal 101, the server 102 can use the audio processing method provided in this embodiment to repair the popping sound in the video file. The server 102 then returns the video file with the repaired popping sound to the terminal 101 for playback.

[0039] The following describes the specific implementation process of the audio processing method and device involved in the embodiments of this disclosure. Some examples are merely illustrative and not intended to limit the scope. The executing entity of the audio processing method involved in the embodiments of this disclosure is an electronic device, which may be a terminal, server, etc.

[0040] Figure 2 is a flowchart of the audio processing method provided in the embodiments of this disclosure. As shown in Figure 2, the audio processing method may include:

[0041] S201. Obtain the audio data of the video file, convert each frame of the audio data into frequency domain data using a pop detection model, and detect whether pop data exists in each frame of the audio data based on the frequency domain data.

[0042] In this embodiment of the disclosure, the video file may be a video draft including audio data. Optionally, the video file may be a video draft in the process of video creation and editing.

[0043] To better detect the presence of popping sounds while minimizing computational complexity, this invention proposes to perform popping sound detection in the time-frequency domain and predict whether a popping sound occurs using frames as the basic unit.

[0044] Optionally, the pop detection model includes a feature extraction module and a pop prediction module; correspondingly, detecting whether pop data exists in each frame of data based on frequency domain data includes: converting the frame data into frequency domain data through Fourier transform for each frame of data; modeling on the local spectrum through the feature extraction module to extract a first frequency domain feature for determining whether a pop is present; compressing the first frequency domain feature through the pop prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop data exists in the frame data through the second frequency domain feature.

[0045] For example, for each frame of data, a short-time Fourier transform with a frame length of 2048, a frame shift of 512, and a Hanning window as the window function can be used to transform the input time-domain waveform signal (i.e., the frame data) into frequency-domain data. The feature extraction module then models the transformed spectral data over the local spectrum and extracts the first frequency-domain feature used to determine whether there is a pop. The feature extraction module may include multiple two-dimensional convolutional modules.
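A minimal sketch of such a short-time Fourier transform, using the frame length of 2048 and frame shift of 512 stated above, might look as follows. The function name and the absence of padding are assumptions. A 2048-point real FFT yields 1025 frequency bins, so an implementation working with 1024 bins would need to drop one of them, which this sketch does not do.

```python
import numpy as np

def stft(signal, frame_length=2048, hop=512):
    """Hann-windowed STFT returning a complex array of shape (freq_bins, n_frames)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(0)
spectrum = stft(rng.standard_normal(2048 + 3 * 512))
```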

[0046] For example, as shown in Table 1 below, the feature extraction module may include five two-dimensional convolution modules: two-dimensional convolution 1, two-dimensional convolution 2, two-dimensional convolution 3, two-dimensional convolution 4, and two-dimensional convolution 5.

[0047] Table 1. Popping sound detection model

[0048]

[0049] Optionally, as shown in Table 1 above, the pop sound prediction module includes a linear layer, a one-dimensional convolutional layer, and a sigmoid non-linear activation function. Correspondingly, the pop sound prediction module compresses the first frequency domain features to obtain a second frequency domain feature of a preset dimension. The second frequency domain feature is then used to detect whether pop sound data exists in the frame data. This includes: compressing the first frequency domain features using the linear layer and one-dimensional convolutional layer in the pop sound prediction module to obtain a second frequency domain feature of a preset dimension; using the sigmoid non-linear activation function and the second frequency domain feature in the pop sound prediction module to detect pop sound in the frame data and predict the pop sound probability corresponding to the frame data; if the pop sound probability is greater than a preset probability, it is determined that pop sound data exists in the frame data; if the pop sound probability is not greater than the preset probability, it is determined that pop sound data does not exist in the frame data.
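The final decision rule in this paragraph, a sigmoid output compared against a preset probability, can be sketched as below. The input values and the default threshold of 0.5 are placeholders; the linear and convolutional layers that would produce these values are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frames_with_pop(logits, preset_probability=0.5):
    """Return per-frame pop probabilities and a boolean flag per frame.

    A frame is flagged only when its probability is strictly greater than
    the preset probability, matching the rule described in the text.
    """
    probabilities = sigmoid(np.asarray(logits, dtype=float))
    return probabilities, probabilities > preset_probability

probabilities, flags = frames_with_pop([-2.0, 0.0, 3.0])
```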

[0050] It should be noted that during the training phase, this invention trains the detection model as a regression task, using the root mean square error as the loss function. During the inference phase, given an input signal y(t) ∈ R^{1×S}, the specific steps for detecting popping sounds are as follows:

[0051] Step 1: Transform y(t) to the frequency domain using the short-time Fourier transform to obtain the complex spectrum Y(f,t) ∈ C^{1024×T}.

[0052] Step 2: Stack the amplitude spectrum with the real and imaginary parts of the complex spectrum to form the input feature Feat ∈ R^{3×1024×T}.

[0053] Step 3: Use the feature extraction module to model the local spectral features and extract the local features, in R^{64×2×T}, that can be used to identify pops.

[0054] Step 4: Use the pop prediction module to further predict the pop probability of each frame.

[0055] Step 5: Calculate the maximum percentage of popping sounds within a sliding window with a length of N and a movement of M, and return the result.
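Steps 2 and 5 above can be sketched without the neural network itself. The sketch below assumes that the "percentage of popping sounds" in a window is the fraction of frames flagged as pops. The disclosure does not state how the percentage is computed, and the function names and the example values of N and M are hypothetical.

```python
import numpy as np

def stack_features(complex_spectrum):
    """Step 2: stack magnitude, real part and imaginary part, (F, T) -> (3, F, T)."""
    return np.stack([
        np.abs(complex_spectrum),
        complex_spectrum.real,
        complex_spectrum.imag,
    ])

def max_pop_ratio(frame_flags, window, hop):
    """Step 5: largest fraction of pop-flagged frames over all sliding windows."""
    flags = np.asarray(frame_flags, dtype=float)
    if len(flags) == 0:
        return 0.0
    if len(flags) <= window:
        return float(flags.mean())
    starts = range(0, len(flags) - window + 1, hop)
    return max(float(flags[s:s + window].mean()) for s in starts)

features = stack_features(np.zeros((1024, 5), dtype=complex))
ratio = max_pop_ratio([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], window=4, hop=2)
```

Taking the maximum over windows, not the average over the whole file, means a short burst of pops is enough to trigger repair even when most of the audio is clean.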

[0056] S202. If popping data is detected in the frame data, the frame data is processed by the popping repair model to obtain the repaired frame data. The popping repair model is used to filter out popping data and spectral mirroring in the frame data.

[0057] Existing audio generation models typically use a decoder to progressively generate target speech based on extracted audio representations. This decoder mainly consists of several upsampling layers. Because of these upsampling layers, such audio generation models are prone to spectral mirroring, which usually has to be addressed through adversarial training with a suitable discriminator. The design of the discriminator significantly impacts the audio generation quality. To achieve more robust pop repair results and simplify the training process, this invention proposes a time-domain pop repair model that effectively avoids the spectral mirroring problem.

[0058] Optionally, the repair model includes an encoding module, a feature processing module, and an anti-mirror decoding module. Accordingly, the frame data is processed by the pop sound repair model to obtain repaired frame data, including: encoding the frame data using the encoding module in the pop sound repair model to obtain frame data with a preset feature dimension; extracting target feature data without pop sounds from the frame data with the preset feature dimension using the feature processing module in the pop sound repair model; and decoding the target feature data using the anti-mirror decoding module in the pop sound repair model to filter out spectral mirroring in the target feature data, thus obtaining the repaired frame data.

[0059] For example, the structure of the pop sound repair model is shown in Figure 3. The pop sound repair model consists of three modules: an encoding module, a feature processing module, and an anti-mirror decoding module. The encoder first compresses the input time-domain signal in the time dimension and expands the feature dimension; it consists of a series of gated one-dimensional convolutional modules. The feature processing module extracts pop sound-related features based on the encoder output; it consists of several layers of residual compression BLSTM (Bidirectional Long Short-Term Memory) modules. Finally, the anti-mirror decoding module progressively upsamples the extracted features to recover the pop-free speech signal while avoiding spectral mirroring; it is symmetric to the encoder and consists of a series of anti-mirror gated one-dimensional deconvolutions.

[0060] In some embodiments, encoding the frame data by the encoding module in the pop sound repair model to obtain frame data with a preset feature dimension includes: downsampling the frame data by the first convolutional layer in the encoding module of the pop sound repair model to obtain frame data compressed in the time dimension; and expanding the feature dimension of that frame data with a second convolutional layer in the encoding module to obtain frame data with a preset feature dimension.

[0061] For example, as shown in Figure 3, the encoding module consists of seven gated one-dimensional convolutional layers. Each gated one-dimensional convolutional layer includes a one-dimensional convolutional layer (Conv1d), a group normalization layer (GroupNorm), a GELU activation function layer, and a GLU non-linear activation function layer. The first convolutional layer is used for downsampling, compressing the temporal dimension and expanding the feature dimension; the second convolutional layer is used to double the feature dimension to facilitate gating with GLU. The parameters of the encoding module are shown in Table 2 below.

[0062] Table 2 Parameters of the encoding module

[0063]
Layer ID   Input Channel   Output Channel   Convolution Kernel   Stride   GroupNorm
1          1               32               8                    1        1
2          32              64               8                    4        1
3          64              128              8                    4        1
4          128             256              8                    4        2
5          256             512              4                    2        2
6          512             1024             4                    2        2
7          1024            1024             4                    2        4
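The amount of time compression implied by Table 2 can be worked out with a short sketch. The channel, kernel, and stride values come from the table; the assumption of zero padding and the helper names are illustrative, since the table does not list padding.

```python
# (input channels, output channels, kernel size, stride) for layers 1 to 7 of Table 2
ENCODER_LAYERS = [
    (1, 32, 8, 1), (32, 64, 8, 4), (64, 128, 8, 4), (128, 256, 8, 4),
    (256, 512, 4, 2), (512, 1024, 4, 2), (1024, 1024, 4, 2),
]

def conv_output_length(length, kernel, stride, padding=0):
    """Standard output-length formula for a one-dimensional convolution."""
    return (length + 2 * padding - kernel) // stride + 1

def encoder_output_shape(num_samples):
    """Return (channels, time steps) after all seven layers, assuming no padding."""
    length = num_samples
    channels = 1
    for _, out_channels, kernel, stride in ENCODER_LAYERS:
        length = conv_output_length(length, kernel, stride)
        channels = out_channels
    return channels, length

total_stride = 1
for _, _, _, stride in ENCODER_LAYERS:
    total_stride *= stride

shape_for_48000_samples = encoder_output_shape(48000)
```

Under these assumptions the strides multiply to 512, so one encoded time step summarizes roughly 512 input samples. A 48000-sample input becomes 1024 channels by 91 steps.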

[0064] Optionally, the feature processing module in the pop sound repair model extracts target feature data without pop sounds from frame data of a preset feature dimension, including: extracting target feature data without pop sounds from frame data of a preset feature dimension using the bidirectional long short-term memory network (BLSTM) model in the feature processing module of the pop sound repair model.

[0065] For example, as shown in Figure 3, the feature processing module consists of three residual compression DConv modules. Each module uses residual connections and includes a one-dimensional convolution (Conv1d), a group normalization layer (GroupNorm), a GELU activation function, a bidirectional long short-term memory (BLSTM) network, a pointwise one-dimensional convolution (Conv1d 1×1), a group normalization layer (GroupNorm), and a GLU non-linear activation function. The first convolution is used to compress the input feature dimension to reduce the computational complexity of the subsequent BLSTM; the second convolution is used to expand the feature dimension, working with GLU to filter out useful features and restore the feature dimension. Each module has the same structure, with a dimensionality compression coefficient of 4.
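The reason a convolution doubles the feature dimension before GLU is that a gated linear unit halves it again: one half of the channels carries values and the other half carries gates. A minimal NumPy sketch of this standard operation follows; the channel-first layout is an assumption.

```python
import numpy as np

def glu(features, axis=0):
    """Gated linear unit: split channels in half and gate the first half
    with the sigmoid of the second half."""
    values, gates = np.split(features, 2, axis=axis)
    return values * (1.0 / (1.0 + np.exp(-gates)))

doubled = np.ones((8, 10))   # 8 channels produced by the doubling convolution
doubled[4:] = 0.0            # gate inputs of zero give a gate value of 0.5
gated = glu(doubled)
```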

[0066] Optionally, the anti-mirror decoding module includes multiple decoders and filters; correspondingly, the target feature data is decoded by the anti-mirror decoding module in the pop sound restoration model to filter out the spectral mirror in the target feature data and obtain the restored frame data, including: upsampling the target feature data by multiple decoders in the pop sound restoration model to obtain the restored initial frame data; and filtering out the spectral mirror in the initial frame data by the filter in the pop sound restoration model to obtain the restored frame data.

[0067] For example, as shown in Figure 3, the anti-mirror decoding module consists of a 7-layer structure symmetrical to the encoding module. The main difference is that the downsampling convolution is replaced by an upsampling transposed convolution. Furthermore, directly generating audio through upsampling easily produces spectral mirroring (the initialized model parameters are random), and the mirroring depth depends on the upsampling factor: the larger the factor, the deeper the mirroring and the slower the model converges. To avoid this problem, this invention introduces a low-pass filter in the last layer of each decoding module to explicitly filter out potential spectral mirroring. At the same time, to ensure that the final signal does not lose information, this invention keeps the number of output channels consistent with the input in the last decoding module. After the low-pass filter, a one-dimensional convolution is applied to recover the repaired signal without spectral mirroring.
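The effect of a low-pass filter after upsampling can be seen in a small sketch. Zero-insertion upsampling stands in here for a learned transposed convolution, and the windowed-sinc filter, its length, and the test tone are all assumptions; the disclosure does not specify the filter design.

```python
import numpy as np

def zero_stuff_upsample(x, factor):
    """Upsample by inserting factor-1 zeros between samples (creates spectral images)."""
    up = np.zeros(len(x) * factor)
    up[::factor] = x
    return up

def lowpass_fir(factor, taps=63):
    """Hann-windowed sinc low-pass with cutoff at the original Nyquist frequency."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / factor) * np.hanning(taps)
    return h * factor / h.sum()  # DC gain of `factor` offsets the inserted zeros

def image_energy_ratio(signal, factor):
    """Share of spectral energy above the original Nyquist frequency."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    cutoff = len(power) // factor
    return float(power[cutoff:].sum() / power.sum())

factor = 2
x = np.sin(2 * np.pi * 20 * np.arange(512) / 512)
raw = zero_stuff_upsample(x, factor)
filtered = np.convolve(raw, lowpass_fir(factor), mode="same")

raw_ratio = image_energy_ratio(raw, factor)
filtered_ratio = image_energy_ratio(filtered, factor)
```

Before filtering, about half of the energy sits in the mirrored copy of the tone. After filtering, almost none does. This is the structural removal of mirroring that lets training proceed without a discriminator.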

[0068] It should be noted that during the training phase, in order to obtain better listening quality while maintaining the consistency of the time-domain waveform, this invention employs a multi-resolution STFT loss and a time-domain MAE loss to optimize the model. Furthermore, since the proportion of pop segments in audio is often much smaller than that of non-pop segments, pop segments carry a smaller weight when the loss function is computed over the whole signal, resulting in a weaker perception of the pop repair task by the model and limited repair effectiveness.

[0069] Therefore, this invention calculates the loss functions for pop segments and non-pop segments separately, and then sums them with weights to obtain the final loss function, which is defined as follows:

[0070]

[0071] Here, x and X denote the target speech signal and its complex spectral features, with corresponding symbols for the speech signal output by the model and its complex spectral features. Short-time Fourier transforms with frame lengths of 640, 960, 1024, 1536, and 2048 are used, each with a frame shift of 1 / 4 of the frame length and a Hanning window as the window function. α and w are weight control factors for the loss function terms, set to 100 and 0.5 respectively.
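The idea of computing the loss separately on pop and non-pop segments and then combining the two with weights can be illustrated with the time-domain MAE term alone. This is only a sketch of the general idea and not the loss defined in the disclosure. The multi-resolution STFT term and the factor α are omitted, and using w as the mixing weight between the two segment types is an assumption.

```python
import numpy as np

def segment_weighted_mae(target, output, pop_mask, w=0.5):
    """Weighted sum of the MAE over pop samples and the MAE over non-pop samples."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    mask = np.asarray(pop_mask, dtype=bool)
    error = np.abs(target - output)
    pop_loss = error[mask].mean() if mask.any() else 0.0
    clean_loss = error[~mask].mean() if (~mask).any() else 0.0
    return float(w * pop_loss + (1.0 - w) * clean_loss)

target = np.zeros(10)
output = np.zeros(10)
output[:2] = 1.0                      # the model is wrong only inside the pop segment
pop_mask = np.arange(10) < 2

plain_mae = float(np.abs(target - output).mean())
weighted = segment_weighted_mae(target, output, pop_mask)
```

In this toy case the plain MAE is diluted by the many correct non-pop samples, while the segment-weighted value keeps the pop error prominent.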

[0072] It should be noted that, unlike speech denoising, pop sound restoration is essentially a generative task. Training directly using a regression model is unlikely to achieve the desired results. Some existing methods combine generative adversarial networks (GANs) for training, using the adversarial training between the discriminator and the generative model to achieve better results. However, these methods are cumbersome to train and their effectiveness depends heavily on the discriminator design. This invention first conducts an in-depth analysis of the common time-domain model structure UNet and finds that it tends to produce spectral mirroring. Training with a regression objective alone is insufficient to eliminate this mirroring, whereas once a discriminator is introduced the mirroring problem is gradually alleviated as training proceeds. Based on this observation, this invention considers how to overcome mirroring from the perspective of model structure, so that model training no longer depends on a discriminator, simplifying training while achieving better restoration results. To this end, this invention proposes an anti-mirror time-domain network for pop sound restoration. The proposed network model effectively avoids spectral mirroring, and robust pop sound restoration results can be obtained using regression training alone.

[0073] When constructing the training data, data augmentation was performed on secondary processing scenarios such as soft clipping, hard clipping, variable sampling, and encoding / decoding distortion. Sampling rates were randomly set, and the data included random superpositions of various sound sources such as music, vocals, and noise. The model uses time-domain waveforms as input and applies the same processing mode to audio at different sampling rates. During the inference phase, given an input signal y(t) ∈ R^{1×S}, the specific steps to repair the popping sound are as follows:

[0074] Step 1: Use the encoder to compress y(t) ∈ R^{1×S} in the time dimension and expand the feature dimension, obtaining frame data with a preset feature dimension, F_encoded ∈ R^{C×S'}.

[0075] Step 2: Use the feature processing module to extract the pop-free audio features F_filtered ∈ R^{C×S'} from F_encoded.

[0076] Step 3: Pass F_filtered through the anti-mirror decoder to obtain the signal with the popping sound repaired.

[0077] S203. Based on the maximum amplitude of the repaired frame data in the time domain, the audio loudness of the repaired frame data is normalized to obtain the audio data of the video file after repair.

[0078] In this embodiment, since the amplitude of the repaired audio signal is greater than that of the original signal in the time domain waveform, direct storage would result in further truncation due to the limitation of the numerical range. Therefore, this invention considers normalizing the loudness of the repaired audio to -16 LUFS, and then using a limiter to normalize the maximum amplitude. It is worth mentioning that, unlike the soft pops caused by excessively increasing the loudness and directly using a limiter during video creation, this invention first reduces the overall loudness, resulting in fewer sampling points exceeding the threshold. Thus, the soft pops introduced after the limiter can be almost ignored.

[0079] Optionally, this step may include: determining the maximum amplitude value of the repaired frame data in the time domain; if the maximum amplitude value is greater than a preset amplitude threshold, normalizing the audio loudness of the repaired frame data to a preset audio loudness to obtain normalized frame data; reducing the amplitude of the target frame data to a preset amplitude threshold using a limiter to obtain the audio data of the video file after repair, wherein the target frame data is the frame data in the normalized frame data whose amplitude is greater than the preset amplitude threshold.

[0080] For example, given an audio signal $y(t)\in\mathbb{R}^{1\times S}$, the specific steps of a general method for repairing popping sounds are as follows:

[0081] Step 1: Predict the frame-level clipping ratio using the clipping detection model.

[0082] Step 2: Determine whether the clipping_ratio at the frame level exceeds a certain threshold. If it does, proceed to the next step to repair the clipping; otherwise, end and return. The threshold used in this invention is 0.5.

[0083] Step 3: Directly input the time-domain signal into the pop sound restoration model based on the anti-mirror time-domain network to obtain the restored audio.

[0084] Step 4: Calculate the maximum amplitude $v_{max}$ of the repaired audio in the time domain.

[0085] Step 5: If the maximum amplitude $v_{max}$ exceeds 1.0, normalize the audio loudness to -16 LUFS and limit the maximum amplitude with a limiter; otherwise, return the repaired audio directly.
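The control flow of Steps 1 to 5 can be sketched as follows. This is an illustrative outline only: the detection and repair models of this disclosure are learned networks, whereas `toy_detect`, `toy_repair` and `toy_post_process` below are hypothetical placeholders chosen so that the example runs on its own. Only the thresholds 0.5 and 1.0 are taken from the text.

```python
def repair_audio(samples, detect, repair, post_process, ratio_threshold=0.5):
    """Run detection, conditional repair, and conditional post-processing."""
    # Steps 1-2: skip repair when the predicted clipping ratio is not above threshold.
    if detect(samples) <= ratio_threshold:
        return samples
    # Step 3: repair directly on the time-domain signal.
    repaired = repair(samples)
    # Step 4: maximum amplitude of the repaired audio in the time domain.
    v_max = max(abs(s) for s in repaired)
    # Step 5: post-process only when the result would not fit the numeric range.
    return post_process(repaired) if v_max > 1.0 else repaired


def toy_detect(samples, full_scale=1.0):
    """Placeholder detector: fraction of samples pinned at full scale."""
    return sum(1 for s in samples if abs(s) >= full_scale) / len(samples)


def toy_repair(samples, full_scale=1.0):
    """Placeholder repair: lift pinned samples above full scale."""
    return [s * 1.5 if abs(s) >= full_scale else s for s in samples]


def toy_post_process(samples):
    """Placeholder post-processing: scale so that the peak equals 1.0."""
    v_max = max(abs(s) for s in samples)
    return [s / v_max for s in samples]


clipped = [1.0] * 6 + [0.2] * 4
result = repair_audio(clipped, toy_detect, toy_repair, toy_post_process)
```

Passing the three stages in as callables keeps the sketch independent of any particular model; in practice they would be replaced by the pop detection model, the pop repair model and the loudness post-processing described in this embodiment.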

[0086] In summary, as shown in Figure 4, the audio processing method in this application includes three main modules: a pop detection module, a pop repair module, and a post-processing module. Given an audio signal, the pop detection model is first used to determine whether the audio signal needs pop repair. For signals requiring repair, the pop repair model is used to repair the audio signal, and then the loudness is normalized by the post-processing module to avoid further truncation.

[0087] This disclosure provides an audio processing method: First, audio data from a video file is acquired. Each frame of the audio data is converted to frequency domain data using a pop detection model, and the presence of pop data in each frame is detected based on the frequency domain data. Then, if pop data is detected in the frame data, a pop repair model is used to repair the pop data, resulting in repaired frame data. The pop repair model filters out pop data and spectral mirroring from the frame data. Finally, based on the maximum amplitude of the repaired frame data in the time domain, the audio loudness of the repaired frame data is normalized to obtain the repaired audio data of the video file. In this technical solution, pop detection is performed in the frequency domain, where pops are clearly observable, thus improving the accuracy of pop detection. Furthermore, pop repair is performed in the time domain, requiring only compensation for the truncated waveform components. This allows for repair of arbitrary audio content and sampling rates, and is applicable to pop truncation in scenarios such as variable sampling and encoding / decoding distortion, thereby improving the accuracy of audio repair.

[0088] Figure 5 is a schematic structural diagram of the audio processing device provided in the embodiments of this disclosure. As shown in Figure 5, the audio processing device includes:

[0089] The pop sound detection unit 501 is used to acquire the audio data of the video file, convert each frame data in the audio data into frequency domain data through the pop sound detection model, and detect whether pop sound data exists in each frame data based on the frequency domain data.

[0090] The pop sound repair unit 502 is used to perform pop sound repair processing on the frame data through a pop sound repair model if pop sound data is detected in the frame data, so as to obtain repaired frame data. The pop sound repair model is used to filter out pop sound data in the frame data.

[0091] The audio processing unit 503 is used to normalize the audio loudness of the repaired frame data based on the maximum amplitude value in the time domain of the repaired frame data, so as to obtain the repaired audio data of the video file.

[0092] According to one or more embodiments of this disclosure, the pop sound detection model includes a feature extraction module and a pop sound prediction module; correspondingly, the step of detecting whether pop sound data exists in each frame of data based on the frequency domain data includes: for each frame of audio data, converting the time-domain waveform signal corresponding to the frame data into frequency domain data through Fourier transform; modeling the local spectrum of the frequency domain data by the feature extraction module to extract a first frequency domain feature for determining whether a pop sound exists; compressing the first frequency domain feature by the pop sound prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop sound data exists in the frame data by the second frequency domain feature.
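To illustrate why the frequency domain is a convenient place to look for pops, the following sketch splits a signal into frames and computes a magnitude spectrum per frame with a naive discrete Fourier transform. It covers only the Fourier transform step; the feature extraction and prediction modules are learned and are not reproduced here. The function names, frame length and hop size are assumptions made for this example.

```python
import cmath
import math


def split_frames(samples, frame_len, hop):
    """Cut the time-domain signal into (possibly overlapping) frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one real-valued frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# A clean tone (4 cycles per 64-sample frame) and a hard-truncated copy of it.
N = 64
clean = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
clipped = [max(-0.5, min(0.5, s)) for s in clean]

clean_spec = magnitude_spectrum(clean)      # energy concentrated in bin 4
clipped_spec = magnitude_spectrum(clipped)  # extra energy at odd harmonics, e.g. bin 12
```

The truncated copy shows energy at harmonics of the tone that the clean frame lacks, which is the kind of local spectral pattern a feature extraction module could model.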

[0093] According to one or more embodiments of this disclosure, the pop sound prediction module includes a linear layer, a one-dimensional convolutional layer, and a sigmoid nonlinear activation function. Correspondingly, the step of compressing the first frequency domain feature using the pop sound prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop sound data exists in the frame data using the second frequency domain feature, includes: compressing the first frequency domain feature using the linear layer and one-dimensional convolutional layer in the pop sound prediction module to obtain a second frequency domain feature of a preset dimension; performing pop sound detection on the frame data using the sigmoid nonlinear activation function in the pop sound prediction module and the second frequency domain feature, and predicting the pop sound probability corresponding to the frame data; if the pop sound probability is greater than a preset probability, then it is determined that pop sound data exists in the frame data; if the pop sound probability is not greater than the preset probability, then it is determined that pop sound data does not exist in the frame data.
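A minimal sketch of a prediction head of this kind is given below, under several assumptions: the linear layer is applied before the one-dimensional convolution, the preset dimension is one value per frame, the convolution kernel has odd length with zero padding, and all weights are supplied by the caller rather than learned. The names `linear`, `conv1d_same`, `predict_pops` and `preset_probability` are hypothetical.

```python
import math


def linear(features, weights, bias):
    """Compress each frame's feature vector to one scalar (linear layer)."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in features]


def conv1d_same(seq, kernel):
    """1-D convolution over the frame axis with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_pops(features, weights, bias, kernel, preset_probability=0.5):
    """Return per-frame pop probabilities and the thresholded decisions."""
    logits = conv1d_same(linear(features, weights, bias), kernel)
    probs = [sigmoid(z) for z in logits]
    # A frame is flagged only when its probability is greater than the preset value.
    return probs, [p > preset_probability for p in probs]


# Usage with three frames of two-dimensional features and illustrative weights.
probs, flags = predict_pops([[0.0, 0.0], [4.0, 4.0], [0.0, 0.0]],
                            weights=[1.0, 1.0], bias=-2.0, kernel=[0.0, 1.0, 0.0])
```

With these illustrative weights only the middle frame is flagged, since its probability is the only one greater than the preset probability.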

[0094] According to one or more embodiments of this disclosure, the pop sound repair model includes an encoding module, a feature processing module, and an anti-mirror decoding module. Accordingly, the step of performing pop sound repair processing on the frame data using the pop sound repair model to obtain repaired frame data includes: encoding the frame data using the encoding module in the pop sound repair model to obtain frame data with a preset feature dimension; extracting target feature data without pop sounds from the frame data with the preset feature dimension using the feature processing module in the pop sound repair model; and decoding the target feature data using the anti-mirror decoding module in the pop sound repair model to filter out spectral mirroring in the target feature data, thereby obtaining the repaired frame data.

[0095] According to one or more embodiments of this disclosure, the step of encoding the frame data through the encoding module in the pop sound restoration model to obtain frame data with a preset feature dimension includes: downsampling the frame data through a first convolutional layer in the encoding module of the pop sound restoration model to obtain frame data with a time dimension; and expanding the feature dimension of the frame data with a second convolutional layer in the encoding module to obtain frame data with the preset feature dimension.
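The shape change performed by such an encoding module, from $1\times S$ to $C\times S'$, can be traced with a small sketch. The kernel, stride and channel weights below are arbitrary illustrative values rather than learned parameters, and modelling the second convolutional layer as a pointwise (kernel size 1) convolution is an assumption made for brevity.

```python
def strided_conv1d(signal, kernel, stride):
    """First layer: strided 1-D convolution that shortens the time axis."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]


def expand_channels(sequence, channel_weights):
    """Second layer (assumed pointwise): map 1 input channel to C output channels."""
    return [[w * x for x in sequence] for w in channel_weights]


def encode(signal, kernel, stride, channel_weights):
    """Compress in time, then expand the feature dimension: 1 x S -> C x S'."""
    downsampled = strided_conv1d(signal, kernel, stride)
    return expand_channels(downsampled, channel_weights)


# Usage: S = 16 samples, stride 2 gives S' = 8; four channel weights give C = 4.
encoded = encode([float(v) for v in range(16)], kernel=[0.5, 0.5], stride=2,
                 channel_weights=[1.0, 2.0, 3.0, 4.0])
shape = (len(encoded), len(encoded[0]))
```

Here `shape` corresponds to $(C, S')$; a real encoder would normally stack several such layers with nonlinearities between them.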

[0096] According to one or more embodiments of this disclosure, the step of extracting target feature data without popping sounds from frame data of the preset feature dimension through the feature processing module in the popping sound repair model includes: extracting target feature data without popping sounds from frame data of the preset feature dimension through a bidirectional long short-term memory network (BLSTM) model in the feature processing module of the popping sound repair model.

[0097] According to one or more embodiments of this disclosure, the anti-mirror decoding module includes multiple decoders and filters; correspondingly, the step of decoding the target feature data through the anti-mirror decoding module in the pop sound restoration model to filter out spectral mirrors in the target feature data and obtain restored frame data includes: upsampling the target feature data through multiple decoders in the pop sound restoration model to obtain restored initial frame data; and filtering out spectral mirrors in the initial frame data through the filters in the pop sound restoration model to obtain restored frame data.
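The role of the filter after upsampling can be illustrated with classical signal processing. In the sketch below, zero-insertion upsampling stands in for the learned decoders and a fixed Hamming-windowed sinc low-pass filter stands in for the filter of the anti-mirror decoding module; both substitutions, and all function names and parameter values, are assumptions made for illustration.

```python
import cmath
import math


def upsample_zero_insert(signal, factor):
    """Insert factor-1 zeros after each sample; this creates spectral mirror images."""
    out = []
    for s in signal:
        out.append(s * factor)  # gain compensates for the inserted zeros
        out.extend([0.0] * (factor - 1))
    return out


def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff in cycles per sample (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        taps.append(ideal * (0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))))
    total = sum(taps)
    return [t / total for t in taps]  # unity gain at DC


def convolve(signal, taps):
    """Full linear convolution of the signal with the filter taps."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out


def anti_mirror_upsample(signal, factor, num_taps=15):
    """Upsample, then low-pass at the original Nyquist frequency to remove images."""
    return convolve(upsample_zero_insert(signal, factor),
                    lowpass_fir(num_taps, 0.5 / factor))


def tone_magnitude(signal, freq):
    """Spectrum magnitude at `freq` cycles per sample (single-frequency DTFT probe)."""
    return abs(sum(s * cmath.exp(-2j * math.pi * freq * n) for n, s in enumerate(signal)))


tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
raw = upsample_zero_insert(tone, 2)      # tone at 8/128 plus a mirror image at 56/128
clean = anti_mirror_upsample(tone, 2)    # mirror image strongly attenuated
```

Comparing `tone_magnitude(raw, 56 / 128)` with `tone_magnitude(clean, 56 / 128)` shows the mirror component being suppressed while the component at `8 / 128` is largely preserved.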

[0098] According to one or more embodiments of this disclosure, the step of normalizing the audio loudness of the repaired frame data based on the maximum amplitude value in the time domain to obtain the repaired audio data of the video file includes: determining the maximum amplitude value in the time domain of the repaired frame data; if the maximum amplitude value is greater than a preset amplitude threshold, normalizing the audio loudness of the repaired frame data to a preset audio loudness to obtain normalized frame data; and reducing the amplitude of the target frame data to the preset amplitude threshold using a limiter to obtain the repaired audio data of the video file, wherein the target frame data is the frame data in the normalized frame data whose amplitude is greater than the preset amplitude threshold.

[0099] Referring to Figure 6, a schematic structural diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure is shown. The electronic device 600 can be a terminal device or a server. The terminal device can include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, personal digital assistants (PDAs), tablet computers, portable media players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 6 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0100] As shown in Figure 6, electronic device 600 may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 602 or a program loaded from storage device 608 into random access memory (RAM) 603. RAM 603 also stores various programs and data required for the operation of electronic device 600. The processing unit 601, ROM 602, and RAM 603 are interconnected via bus 604. Input/output (I/O) interface 605 is also connected to bus 604.

[0101] Typically, the following devices can be connected to I/O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 608 including, for example, magnetic tapes, hard disks, etc.; and communication devices 609. Communication device 609 allows electronic device 600 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 6 shows an electronic device 600 with various devices, it should be understood that not all of the devices shown need to be implemented or present. More or fewer devices may alternatively be implemented or present.

[0102] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 609, or installed from storage device 608, or installed from ROM 602. When the computer program is executed by the processing unit 601, it performs the functions defined in the methods of embodiments of this disclosure.

[0103] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0104] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.

[0105] The aforementioned computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.

[0106] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0107] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0108] The units described in the embodiments of this disclosure can be implemented in software or in hardware. The name of a unit does not necessarily limit the unit itself; for example, the first acquisition unit can also be described as "a unit that acquires at least two Internet Protocol addresses".

[0109] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0110] In a first aspect, according to one or more embodiments of this disclosure, an audio processing method is provided, comprising:

[0111] The audio data of the video file is obtained, and each frame of the audio data is converted into frequency domain data through a pop detection model. Based on the frequency domain data, it is detected whether pop data exists in each frame.

[0112] If popping sound data is detected in the frame data, the frame data is processed by a popping sound repair model to obtain repaired frame data. The popping sound repair model is used to filter out popping sound data and spectral mirroring in the frame data.

[0113] Based on the maximum amplitude of the repaired frame data in the time domain, the audio loudness of the repaired frame data is normalized to obtain the repaired audio data of the video file.

[0114] According to one or more embodiments of this disclosure, the pop sound detection model includes a feature extraction module and a pop sound prediction module; correspondingly, the step of detecting whether pop sound data exists in each frame of data based on the frequency domain data includes: for each frame of audio data, converting the time-domain waveform signal corresponding to the frame data into frequency domain data through Fourier transform; modeling the local spectrum of the frequency domain data by the feature extraction module to extract a first frequency domain feature for determining whether a pop sound exists; compressing the first frequency domain feature by the pop sound prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop sound data exists in the frame data by the second frequency domain feature.

[0115] According to one or more embodiments of this disclosure, the pop sound prediction module includes a linear layer, a one-dimensional convolutional layer, and a sigmoid nonlinear activation function. Correspondingly, the step of compressing the first frequency domain feature using the pop sound prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop sound data exists in the frame data using the second frequency domain feature, includes: compressing the first frequency domain feature using the linear layer and one-dimensional convolutional layer in the pop sound prediction module to obtain a second frequency domain feature of a preset dimension; performing pop sound detection on the frame data using the sigmoid nonlinear activation function in the pop sound prediction module and the second frequency domain feature, and predicting the pop sound probability corresponding to the frame data; if the pop sound probability is greater than a preset probability, then it is determined that pop sound data exists in the frame data; if the pop sound probability is not greater than the preset probability, then it is determined that pop sound data does not exist in the frame data.

[0116] According to one or more embodiments of this disclosure, the pop sound repair model includes an encoding module, a feature processing module, and an anti-mirror decoding module. Accordingly, the step of performing pop sound repair processing on the frame data using the pop sound repair model to obtain repaired frame data includes: encoding the frame data using the encoding module in the pop sound repair model to obtain frame data with a preset feature dimension; extracting target feature data without pop sounds from the frame data with the preset feature dimension using the feature processing module in the pop sound repair model; and decoding the target feature data using the anti-mirror decoding module in the pop sound repair model to filter out spectral mirroring in the target feature data, thereby obtaining the repaired frame data.

[0117] According to one or more embodiments of this disclosure, the step of encoding the frame data through the encoding module in the pop sound restoration model to obtain frame data with a preset feature dimension includes: downsampling the frame data through a first convolutional layer in the encoding module of the pop sound restoration model to obtain frame data with a time dimension; and expanding the feature dimension of the frame data with a second convolutional layer in the encoding module to obtain frame data with the preset feature dimension.

[0118] According to one or more embodiments of this disclosure, the step of extracting target feature data without popping sounds from frame data of the preset feature dimension through the feature processing module in the popping sound repair model includes: extracting target feature data without popping sounds from frame data of the preset feature dimension through a bidirectional long short-term memory network (BLSTM) model in the feature processing module of the popping sound repair model.

[0119] According to one or more embodiments of this disclosure, the anti-mirror decoding module includes multiple decoders and filters; correspondingly, the step of decoding the target feature data through the anti-mirror decoding module in the pop sound restoration model to filter out spectral mirrors in the target feature data and obtain restored frame data includes: upsampling the target feature data through multiple decoders in the pop sound restoration model to obtain restored initial frame data; and filtering out spectral mirrors in the initial frame data through the filters in the pop sound restoration model to obtain restored frame data.

[0120] According to one or more embodiments of this disclosure, the step of normalizing the audio loudness of the repaired frame data based on the maximum amplitude value in the time domain to obtain the repaired audio data of the video file includes: determining the maximum amplitude value in the time domain of the repaired frame data; if the maximum amplitude value is greater than a preset amplitude threshold, normalizing the audio loudness of the repaired frame data to a preset audio loudness to obtain normalized frame data; and reducing the amplitude of the target frame data to the preset amplitude threshold using a limiter to obtain the repaired audio data of the video file, wherein the target frame data is the frame data in the normalized frame data whose amplitude is greater than the preset amplitude threshold.

[0121] In a second aspect, according to one or more embodiments of this disclosure, an audio processing device is provided, comprising:

[0122] The pop sound detection unit is used to acquire the audio data of the video file, convert each frame data in the audio data into frequency domain data through the pop sound detection model, and detect whether pop sound data exists in each frame data based on the frequency domain data.

[0123] The pop sound repair unit is used to perform pop sound repair processing on the frame data through a pop sound repair model if pop sound data is detected in the frame data, so as to obtain repaired frame data. The pop sound repair model is used to filter out pop sound data in the frame data.

[0124] An audio processing unit is used to normalize the audio loudness of the repaired frame data based on the maximum amplitude value in the time domain, so as to obtain the repaired audio data of the video file.

[0125] According to one or more embodiments of this disclosure, the pop sound detection model includes a feature extraction module and a pop sound prediction module; correspondingly, the step of detecting whether pop sound data exists in each frame of data based on the frequency domain data includes: for each frame of audio data, converting the time-domain waveform signal corresponding to the frame data into frequency domain data through Fourier transform; modeling the local spectrum of the frequency domain data by the feature extraction module to extract a first frequency domain feature for determining whether a pop sound exists; compressing the first frequency domain feature by the pop sound prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop sound data exists in the frame data by the second frequency domain feature.

[0126] According to one or more embodiments of this disclosure, the pop sound prediction module includes a linear layer, a one-dimensional convolutional layer, and a sigmoid nonlinear activation function. Correspondingly, the step of compressing the first frequency domain feature using the pop sound prediction module to obtain a second frequency domain feature of a preset dimension, and detecting whether pop sound data exists in the frame data using the second frequency domain feature, includes: compressing the first frequency domain feature using the linear layer and one-dimensional convolutional layer in the pop sound prediction module to obtain a second frequency domain feature of a preset dimension; performing pop sound detection on the frame data using the sigmoid nonlinear activation function in the pop sound prediction module and the second frequency domain feature, and predicting the pop sound probability corresponding to the frame data; if the pop sound probability is greater than a preset probability, then it is determined that pop sound data exists in the frame data; if the pop sound probability is not greater than the preset probability, then it is determined that pop sound data does not exist in the frame data.

[0127] According to one or more embodiments of this disclosure, the pop sound repair model includes an encoding module, a feature processing module, and an anti-mirror decoding module. Accordingly, the step of performing pop sound repair processing on the frame data using the pop sound repair model to obtain repaired frame data includes: encoding the frame data using the encoding module in the pop sound repair model to obtain frame data with a preset feature dimension; extracting target feature data without pop sounds from the frame data with the preset feature dimension using the feature processing module in the pop sound repair model; and decoding the target feature data using the anti-mirror decoding module in the pop sound repair model to filter out spectral mirroring in the target feature data, thereby obtaining the repaired frame data.

[0128] According to one or more embodiments of this disclosure, the step of encoding the frame data through the encoding module in the pop sound restoration model to obtain frame data with a preset feature dimension includes: downsampling the frame data through a first convolutional layer in the encoding module of the pop sound restoration model to obtain frame data with a time dimension; and expanding the feature dimension of the frame data with a second convolutional layer in the encoding module to obtain frame data with the preset feature dimension.

[0129] According to one or more embodiments of this disclosure, the step of extracting target feature data without popping sounds from frame data of the preset feature dimension through the feature processing module in the popping sound repair model includes: extracting target feature data without popping sounds from frame data of the preset feature dimension through a bidirectional long short-term memory network (BLSTM) model in the feature processing module of the popping sound repair model.

[0130] According to one or more embodiments of this disclosure, the anti-mirror decoding module includes multiple decoders and filters; correspondingly, the step of decoding the target feature data through the anti-mirror decoding module in the pop sound restoration model to filter out spectral mirrors in the target feature data and obtain restored frame data includes: upsampling the target feature data through multiple decoders in the pop sound restoration model to obtain restored initial frame data; and filtering out spectral mirrors in the initial frame data through the filters in the pop sound restoration model to obtain restored frame data.

[0131] According to one or more embodiments of this disclosure, the step of normalizing the audio loudness of the repaired frame data based on the maximum amplitude value in the time domain to obtain the repaired audio data of the video file includes: determining the maximum amplitude value in the time domain of the repaired frame data; if the maximum amplitude value is greater than a preset amplitude threshold, normalizing the audio loudness of the repaired frame data to a preset audio loudness to obtain normalized frame data; and reducing the amplitude of the target frame data to the preset amplitude threshold using a limiter to obtain the repaired audio data of the video file, wherein the target frame data is the frame data in the normalized frame data whose amplitude is greater than the preset amplitude threshold.

[0132] In a third aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, comprising: at least one processor and a memory;

[0133] The memory stores computer-executable instructions;

[0134] The at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the audio processing method as described in the first aspect and various possible designs of the first aspect.

[0135] In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, wherein computer-executable instructions are stored therein, which, when executed by a processor, implement the audio processing method described in the first aspect and various possible designs of the first aspect.

[0136] In a fifth aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the audio processing method as described in the first aspect and various possible designs of the first aspect.

[0137] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. One example is a technical solution formed by replacing the above features with (but not limited to) technical features disclosed in this disclosure that have similar functions.

[0138] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0139] Although the subject matter has been described using language specific to structural features and/or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.

Claims

1. An audio processing method, characterized in that it comprises: obtaining audio data of a video file, converting each frame of data in the audio data into frequency domain data by means of a pop sound detection model, and detecting, based on the frequency domain data, whether pop sound data exists in each frame of data; if pop sound data is detected in the frame data, performing pop sound repair processing on the frame data by means of a pop sound repair model to obtain repaired frame data, wherein the pop sound repair model is used to filter out the pop sound data and spectral mirror images in the frame data; and normalizing the audio loudness of the repaired frame data based on the maximum amplitude of the repaired frame data in the time domain, to obtain repaired audio data of the video file.

2. The audio processing method according to claim 1, characterized in that the pop sound detection model includes a feature extraction module and a pop sound prediction module; correspondingly, converting each frame of data in the audio data into frequency domain data by means of the pop sound detection model and detecting, based on the frequency domain data, whether pop sound data exists in each frame of data includes: for each frame of data in the audio data, converting the time-domain waveform signal corresponding to the frame data into frequency domain data by Fourier transform; modeling, by the feature extraction module, the local spectrum of the frequency domain data and extracting a first frequency domain feature used to determine whether a pop sound is present; and compressing, by the pop sound prediction module, the first frequency domain feature to obtain a second frequency domain feature of a preset dimension, and detecting, using the second frequency domain feature, whether pop sound data exists in the frame data.
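The framing and Fourier transform step recited in claim 2 can be illustrated with a minimal sketch. The frame length, hop size, Hann window, and the function names `frame_signal` and `magnitude_spectrum` are assumptions made for illustration only; the claim does not fix any of them, and a practical implementation would use an FFT rather than the naive transform shown here.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split a waveform into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 of one Hann-windowed frame."""
    n = len(frame)
    # Periodic Hann window (an assumed choice) to reduce spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum

# Hypothetical input: a sine with exactly 8 cycles per 64-sample frame.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
frames = frame_signal(signal, frame_len=64, hop=32)
spectra = [magnitude_spectrum(f) for f in frames]
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
```

Each entry of `spectra` is the frequency domain data for one frame, which is the kind of input the feature extraction module would consume.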

3. The audio processing method according to claim 2, characterized in that the pop sound prediction module includes a linear layer, a one-dimensional convolutional layer, and a Sigmoid nonlinear activation function; correspondingly, compressing, by the pop sound prediction module, the first frequency domain feature to obtain a second frequency domain feature of a preset dimension, and detecting, using the second frequency domain feature, whether pop sound data exists in the frame data includes: compressing the first frequency domain feature by the linear layer and the one-dimensional convolutional layer in the pop sound prediction module to obtain the second frequency domain feature of the preset dimension; performing pop sound detection on the frame data by the pop sound prediction module using the Sigmoid nonlinear activation function and the second frequency domain feature, and predicting a pop sound probability corresponding to the frame data; if the pop sound probability is greater than a preset probability, determining that pop sound data exists in the frame data; and if the pop sound probability is not greater than the preset probability, determining that pop sound data does not exist in the frame data.
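The prediction path recited in claim 3 (linear layer, one-dimensional convolution, Sigmoid, comparison with a preset probability) can be sketched as follows. All weights, shapes, and the 0.5 threshold are hand-picked, hypothetical values; in the claimed method they would come from a trained model. Applying the convolution along the frame (time) axis is also an assumption of this sketch.

```python
import math

def linear(x, weight, bias):
    """y = W x + b for one feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def conv1d_over_time(seq, kernel, bias):
    """Collapse a T x H sequence to T scalars with a zero-padded temporal kernel.

    kernel has shape K x H with K odd, so the output keeps length T.
    """
    k, h = len(kernel), len(kernel[0])
    pad = k // 2
    padded = [[0.0] * h] * pad + seq + [[0.0] * h] * pad
    out = []
    for t in range(len(seq)):
        acc = bias
        for j in range(k):
            acc += sum(kernel[j][c] * padded[t + j][c] for c in range(h))
        out.append(acc)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detect_pops(features, weight, bias, kernel, kernel_bias, threshold=0.5):
    """Return per-frame pop probabilities and the thresholded decisions."""
    compressed = [linear(f, weight, bias) for f in features]   # D -> H per frame
    logits = conv1d_over_time(compressed, kernel, kernel_bias)  # H -> 1 per frame
    probs = [sigmoid(z) for z in logits]
    return probs, [p > threshold for p in probs]

# Hypothetical first frequency domain features: 4 frames, 3 values each.
features = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [4.0, 4.0, 4.0], [0.0, 0.0, 0.0]]
weight = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
bias = [0.0, 0.0]
kernel = [[0.1, 0.1], [1.0, 1.0], [0.1, 0.1]]
probs, flags = detect_pops(features, weight, bias, kernel, kernel_bias=-3.0)
```

With these illustrative values only the third frame exceeds the threshold, so only that frame would be passed on to the repair model.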

4. The audio processing method according to claim 1, characterized in that the pop sound repair model includes an encoding module, a feature processing module, and an anti-mirror decoding module; correspondingly, performing pop sound repair processing on the frame data by means of the pop sound repair model to obtain the repaired frame data includes: encoding the frame data by the encoding module in the pop sound repair model to obtain frame data of a preset feature dimension; extracting, by the feature processing module in the pop sound repair model, target feature data without pop sound from the frame data of the preset feature dimension; and decoding the target feature data by the anti-mirror decoding module in the pop sound repair model and filtering out spectral mirror images in the target feature data, to obtain the repaired frame data.

5. The audio processing method according to claim 4, characterized in that encoding the frame data by the encoding module in the pop sound repair model to obtain frame data of a preset feature dimension includes: downsampling the frame data by a first convolutional layer in the encoding module of the pop sound repair model to obtain frame data in the time dimension; and expanding the frame data in the time dimension by a second-dimensional convolutional layer in the encoding module to obtain the frame data of the preset feature dimension.
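The two encoding steps of claim 5 can be pictured with a small sketch: a strided convolution that shortens the time axis, followed by a layer that widens the feature dimension. Reading the second layer as a pointwise (kernel size 1) convolution is an assumption of this sketch, as are the kernel values, the stride, the channel count, and the function names.

```python
def strided_conv1d(x, kernel, stride):
    """Valid 1-D convolution (cross-correlation) with a stride; shortens the time axis."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]

def pointwise_expand(x, channel_weights):
    """Kernel-size-1 convolution from one channel to len(channel_weights) channels."""
    return [[w * v for w in channel_weights] for v in x]

# Hypothetical 16-sample frame; stride 4 reduces it to 4 time steps.
frame = [float(i) for i in range(16)]
down = strided_conv1d(frame, kernel=[0.25, 0.25, 0.25, 0.25], stride=4)
# Expand each time step to an assumed feature dimension of 3.
encoded = pointwise_expand(down, channel_weights=[1.0, -1.0, 0.5])
```

Here `down` has one quarter of the original time steps and `encoded` holds three feature values per remaining step; a trained encoder would use learned kernels and many more channels.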

6. The audio processing method according to claim 4, characterized in that extracting, by the feature processing module in the pop sound repair model, target feature data without pop sound from the frame data of the preset feature dimension includes: extracting the target feature data without pop sound from the frame data of the preset feature dimension by a bidirectional long short-term memory (BLSTM) network model in the feature processing module of the pop sound repair model.

7. The audio processing method according to claim 4, characterized in that the anti-mirror decoding module includes multiple decoders and a filter; correspondingly, decoding the target feature data by the anti-mirror decoding module in the pop sound repair model and filtering out spectral mirror images in the target feature data to obtain the repaired frame data includes: upsampling the target feature data by the multiple decoders in the pop sound repair model to obtain repaired initial frame data; and filtering out, by the filter in the pop sound repair model, the spectral mirror images in the initial frame data to obtain the repaired frame data.
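Why upsampling produces spectral mirror images, and how a filter removes them, can be shown with a minimal sketch. The claimed decoders are learned layers; here plain zero-insertion upsampling stands in for them, and the filter is a fixed Hamming-windowed sinc low-pass. Both substitutions, the tap count, the cutoff, and the function names are assumptions for illustration.

```python
import cmath
import math

def upsample_zero_insert(x, factor):
    """Insert factor-1 zeros after each sample; scale to keep the baseband amplitude."""
    out = []
    for v in x:
        out.append(v * factor)
        out.extend([0.0] * (factor - 1))
    return out

def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc taps; cutoff is a fraction of the sample rate (< 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        m = n - mid
        ideal = 2 * cutoff if m == 0 else math.sin(2 * math.pi * cutoff * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # unity gain at DC

def convolve_same(x, taps):
    """Zero-delay FIR filtering for an odd-length symmetric kernel; edges see zeros."""
    half = len(taps) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + half - j
            if 0 <= idx < len(x):
                acc += t * x[idx]
        out.append(acc)
    return out

def bin_magnitude(x, k):
    """Normalized magnitude of DFT bin k."""
    n = len(x)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(x))) / n

# Hypothetical low-rate signal: a sine at 1/8 of the low sample rate.
low_rate = [math.sin(2 * math.pi * n / 8) for n in range(128)]
initial = upsample_zero_insert(low_rate, 2)              # baseband at bin 16, image at bin 112
repaired = convolve_same(initial, lowpass_fir(31, 0.25))
image_before = bin_magnitude(initial, 112)
image_after = bin_magnitude(repaired, 112)
base_after = bin_magnitude(repaired, 16)
```

Before filtering, the image at bin 112 is as strong as the wanted component at bin 16; after the low-pass filter the image is strongly attenuated while the wanted component is essentially preserved.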

8. The audio processing method according to claim 1, characterized in that normalizing the audio loudness of the repaired frame data based on the maximum amplitude of the repaired frame data in the time domain to obtain the repaired audio data of the video file includes: determining the maximum amplitude of the repaired frame data in the time domain; if the maximum amplitude is greater than a preset amplitude threshold, normalizing the audio loudness of the repaired frame data to a preset audio loudness to obtain normalized frame data; and reducing, by a limiter, the amplitude of target frame data to the preset amplitude threshold to obtain the repaired audio data of the video file, wherein the target frame data is the frame data, among the normalized frame data, whose amplitude is greater than the preset amplitude threshold.
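The normalization and limiting steps of claim 8 can be sketched as below. Measuring loudness as RMS, the threshold and target values, and the use of a hard limiter are all assumptions of this sketch; the claim does not specify a loudness measure, and production limiters typically apply smoothed gain reduction rather than a hard ceiling.

```python
import math

def rms(samples):
    """Root-mean-square level, used here as a simple stand-in for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_and_limit(samples, amp_threshold=0.9, target_rms=0.2):
    """Normalize loudness only when the peak exceeds the threshold, then limit."""
    peak = max(abs(s) for s in samples)
    if peak <= amp_threshold:
        return list(samples)            # peak already within bounds: leave unchanged
    gain = target_rms / rms(samples)    # rms > 0 here because peak > amp_threshold > 0
    normalized = [s * gain for s in samples]
    # Hard limiter: pull any sample still above the threshold down to the threshold.
    return [max(-amp_threshold, min(amp_threshold, s)) for s in normalized]

# Hypothetical repaired frame: one large peak among quiet samples.
repaired_frame = [1.5] + [0.01] * 99
output = normalize_and_limit(repaired_frame)
```

In this example the loudness gain raises the quiet samples, and the single sample that would still exceed the threshold is held at the threshold by the limiter.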

9. An audio processing device, characterized in that the device comprises: a pop sound detection unit, configured to acquire audio data of a video file, convert each frame of data in the audio data into frequency domain data by means of a pop sound detection model, and detect, based on the frequency domain data, whether pop sound data exists in each frame of data; a pop sound repair unit, configured to, if pop sound data is detected in the frame data, perform pop sound repair processing on the frame data by means of a pop sound repair model to obtain repaired frame data, wherein the pop sound repair model is used to filter out the pop sound data in the frame data; and an audio processing unit, configured to normalize the audio loudness of the repaired frame data based on the maximum amplitude of the repaired frame data in the time domain, to obtain repaired audio data of the video file.

10. An electronic device, characterized in that it comprises: a processor and a memory; wherein the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio processing method according to any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the audio processing method according to any one of claims 1 to 8.

12. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the audio processing method according to any one of claims 1 to 8.