Method and apparatus for processing audio data, and storage medium
Training data drawn from the audio data to be screened is classified into high-quality and low-quality audio sets, which are then used to train an audio classification model. This addresses the inconsistent quality of audio data from Internet platforms and improves the accuracy of screening audio for training a large speech synthesis model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU SHIYUAN ELECTRONICS CO LTD
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, the quality of audio data collected from internet platforms varies, resulting in poor performance when training large speech synthesis models. Furthermore, audio training data using data simulation cannot cover audio features in real-world environments, affecting the accuracy of high-quality audio selection.
Training data is obtained from the audio data to be screened and classified to form high-quality and low-quality audio sets. The audio classification model is then trained using the high-quality and low-quality audio data to form a well-trained audio classification model, which is used to screen high-quality audio data.
The performance of the audio classification model has been improved, enabling it to better distinguish between high-quality and low-quality audio, and increasing the accuracy of selecting high-quality audio from the audio data to be screened.
Figure CN122201265A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, such as a method and apparatus for processing audio data and a storage medium.
Background Technology
[0002] Currently, to save training costs, large amounts of audio data are typically collected from internet platforms during the training of large-scale speech synthesis models. However, since this audio data includes low-quality audio, training the model with low-quality audio negatively impacts its performance, resulting in poor-quality synthesized audio. Therefore, to ensure the performance of the large-scale speech synthesis model, it is necessary to select high-quality audio from the collected internet data for training.
[0003] To filter out high-quality audio, the relevant technology first constructs audio training data using data simulation and then trains a neural network model using this data to obtain a trained neural network model. Next, audio data collected from internet platforms is input into the pre-trained neural network model, which then filters out high-quality audio from the collected data.
[0004] However, in related technologies, the audio training data used to train neural network models is constructed using data simulation. Therefore, compared to audio data collected from internet platforms, the audio training data lacks sufficient audio features. Using audio training data with insufficient audio features to train neural network models will affect the performance of the neural network models, thereby affecting the accuracy of selecting high-quality audio.
Summary of the Invention
[0005] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or important components or to delineate the scope of protection of these embodiments; it serves as a prelude to the detailed description that follows.
[0006] This application provides an audio data processing method, apparatus, and storage medium, which can improve the accuracy of filtering high-quality audio using neural network models.
[0007] In a first aspect, embodiments of this application provide an audio data processing method applied to an electronic device, comprising:
[0008] Obtain training data from the audio data to be filtered;
[0009] The training data is classified to obtain a first audio set and a second audio set; the audio quality in the first audio set is higher than that in the second audio set.
[0010] Select the first target audio from the first audio set, and select the second target audio from the second audio set;
[0011] The audio classification model to be trained is trained using the first target audio and the second target audio to obtain a trained audio classification model.
[0012] The trained audio classification model is used to classify the audio data to be filtered to obtain the target audio data.
[0013] Optionally, selecting a first target audio from a first audio set includes: evaluating the quality of the audio in the first audio set using a preset evaluation model to obtain a quality score for the audio in the first audio set; selecting audio from the first audio set whose quality score is higher than a first threshold to obtain a third audio set; evaluating the quality of the audio in the third audio set using a preset evaluation algorithm to obtain a quality score for the audio in the third audio set; and selecting audio from the third audio set whose quality score is higher than a second threshold to obtain the first target audio.
[0014] Optionally, the quality of the audio in the third audio set is evaluated using a preset evaluation algorithm to obtain a quality score for the audio in the third audio set, including: performing noise reduction processing on the audio in the third audio set using a preset noise reduction model to obtain a reference audio corresponding to the audio in the third audio set; calculating the difference between the audio in the third audio set and the reference audio corresponding to the audio; and generating a quality score for the audio in the third audio set based on the difference.
[0015] Optionally, selecting a second target audio from the second audio set includes: evaluating the quality of the audio in the second audio set using a preset evaluation model to obtain a quality score for the audio in the second audio set; selecting audio from the second audio set whose quality score is lower than a third threshold to obtain a fourth audio set; evaluating the quality of the audio in the fourth audio set using a preset evaluation algorithm to obtain a quality score for the audio in the fourth audio set; and selecting audio from the fourth audio set whose quality score is lower than a fourth threshold to obtain the second target audio.
[0016] Optionally, the quality of the audio in the fourth audio set is evaluated using a preset evaluation algorithm to obtain a quality score for the audio in the fourth audio set, including: performing noise reduction processing on the audio in the fourth audio set using a preset noise reduction model to obtain a reference audio corresponding to the audio in the fourth audio set; calculating the difference between the audio in the fourth audio set and the reference audio corresponding to the audio; and generating a quality score for the audio in the fourth audio set based on the difference.
[0017] Optionally, selecting the first target audio from the first audio set includes: evaluating the quality of the audio in the first audio set using a preset evaluation algorithm to obtain a quality score for the audio in the first audio set; selecting audio from the first audio set whose quality score is higher than a fifth threshold to obtain a fifth audio set; evaluating the quality of the audio in the fifth audio set using a preset evaluation model to obtain a quality score for the audio in the fifth audio set; and selecting audio from the fifth audio set whose quality score is higher than a sixth threshold to obtain the first target audio.
[0018] Optionally, selecting a second target audio from the second audio set includes: evaluating the quality of the audio in the second audio set using a preset evaluation algorithm to obtain a quality score for the audio in the second audio set; selecting audio from the second audio set whose quality score is lower than a seventh threshold to obtain a sixth audio set; evaluating the quality of the audio in the sixth audio set using a preset evaluation model to obtain a quality score for the audio in the sixth audio set; and selecting audio from the sixth audio set whose quality score is lower than an eighth threshold to obtain the second target audio.
[0019] Optionally, the audio classification model to be trained is trained using the first target audio and the second target audio to obtain a trained audio classification model. This includes: inputting the first target audio and the second target audio into the audio classification model to be trained, and training the audio classification model; calculating the learning loss of the audio classification model to be trained during the training process; adjusting the parameters of the audio classification model to be trained based on the learning loss; verifying whether the audio classification model to be trained with adjusted parameters meets preset conditions using labeled test data; wherein the test data is obtained from the audio data to be screened; if the audio classification model to be trained with adjusted parameters meets the preset conditions, the training of the audio classification model to be trained is completed, and a trained audio classification model is obtained.
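The training procedure in [0019] (train, compute the learning loss, adjust parameters, verify against labeled test data, stop when a preset condition is met) can be illustrated with a minimal sketch. This is not the model of this application: a one-feature logistic regression stands in for the audio classification model, and the scalar feature, learning rate, epoch limit, and 90% accuracy condition are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_until_valid(train, test, target_acc=0.9, lr=0.5, max_epochs=500):
    """Toy stand-in for training the audio classification model.

    train/test: lists of (feature, label), label 1 = high quality, 0 = low quality.
    The scalar feature and all hyperparameters are hypothetical.
    """
    w, b = 0.0, 0.0
    n = len(train)
    for epoch in range(1, max_epochs + 1):
        # Forward pass: accumulate the binary cross-entropy loss and its gradient.
        loss = gw = gb = 0.0
        for x, y in train:
            p = sigmoid(w * x + b)
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            gw += (p - y) * x
            gb += (p - y)
        # Adjust the parameters based on the gradient of the learning loss.
        w -= lr * gw / n
        b -= lr * gb / n
        # Verify on labeled test data whether the preset condition is met.
        correct = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in test)
        if correct / len(test) >= target_acc:
            return w, b, epoch
    raise RuntimeError("preset condition not met within max_epochs")
```

In this sketch the preset condition is a test accuracy threshold; other conditions, such as a loss ceiling, would slot into the same place in the loop.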
[0020] In a second aspect, embodiments of this application provide an audio data processing apparatus, including a processor and a memory storing program instructions, wherein the processor is configured to execute the audio data processing method as described in the first aspect when running the program instructions.
[0021] In a third aspect, embodiments of this application provide a storage medium storing program instructions, wherein the program instructions, when run, execute the audio data processing method as described in the first aspect.
[0022] This application provides a method and apparatus for processing audio data, as well as a storage medium, which can achieve the following technical effects:
[0023] In this embodiment, before training the audio classification model to be trained, the electronic device can obtain training data from the audio data to be screened and classify the training data. After classification, the training data forms two audio sets: a first audio set and a second audio set. The audio in the first audio set is of higher quality, while the audio in the second audio set is of lower quality. High-quality first target audio is further selected from the high-quality first audio set, and low-quality second target audio is selected from the low-quality second audio set. The high-quality first target audio and the low-quality second target audio are used to train the audio classification model to be trained, resulting in a trained audio classification model. After the audio data to be screened is classified using the trained audio classification model, target audio data for training a large speech synthesis model can be obtained.
[0024] In this embodiment, since the first and second target audios used to train the audio classification model are obtained by filtering the training data layer by layer, and the training data is obtained from the audio data to be filtered, the first and second target audios are actually obtained from the audio data to be filtered. Thus, the first and second target audios can contain the audio features of the audio data to be filtered. By using the first and second target audios to train the audio classification model, the audio classification model can learn the information of the audio data to be filtered. That is, the audio classification model can learn rich audio information during training, thereby improving the performance of the trained audio classification model and thus improving the accuracy of filtering high-quality target audio data from the audio data to be filtered.
[0025] Furthermore, since the first target audio selected from the training data is high-quality audio, it better reflects the characteristics of high-quality audio in the audio data to be filtered. And since the second target audio selected from the training data is low-quality audio, it better reflects the characteristics of low-quality audio in the audio data to be filtered. Thus, training the audio classification model using the first target audio reflecting high-quality audio characteristics and the second target audio reflecting low-quality audio characteristics allows the model to better learn information about high-quality and low-quality audio during training. This enables the trained audio classification model to better distinguish between high-quality and low-quality audio, further improving the accuracy of selecting high-quality target audio data from the audio data to be filtered.
[0026] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.
Attached Figure Description
[0027] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations and drawings do not constitute a limitation on the embodiments. Elements having the same reference numerals in the drawings are considered similar elements. The drawings do not constitute a limitation of scale, and wherein:
[0028] Figure 1 is a schematic diagram of an audio data processing method provided in an embodiment of this application;
[0029] Figure 2 is a flowchart of a method for filtering high-quality audio provided in an embodiment of this application;
[0030] Figure 3 is a flowchart of a method for filtering low-quality audio provided in an embodiment of this application;
[0031] Figure 4 is a flowchart of another method for filtering high-quality audio provided in an embodiment of this application;
[0032] Figure 5 is a flowchart of another method for filtering low-quality audio provided in an embodiment of this application;
[0033] Figure 6 is a schematic diagram of a method for training an audio classification model according to an embodiment of this application;
[0034] Figure 7 is a schematic diagram of an audio classification model provided in an embodiment of this application;
[0035] Figure 8 is a schematic diagram of an audio data processing device provided in an embodiment of this application.
Detailed Implementation
[0036] The terms "first," "second," etc., used in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.
[0037] Unless otherwise stated, the term "multiple" means two or more.
[0038] In this embodiment, the character " / " indicates that the objects before and after it are in an "or" relationship. For example, A / B means: A or B.
[0039] The term "and / or" describes an association between objects, indicating that three relationships can exist. For example, A and / or B means: A or B, or A and B.
[0040] The term "correspondence" can refer to an association or binding relationship. The correspondence between A and B means that there is an association or binding relationship between A and B.
[0041] To provide a more detailed understanding of the features and technical content of the embodiments of this application, the implementation of the embodiments of this application will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this application. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.
[0042] Large-scale speech synthesis models are artificial intelligence models that utilize deep learning technology to generate natural and fluent speech. Compared to traditional speech synthesis models, large-scale speech synthesis models typically require a large amount of audio training data to learn how to generate high-quality speech. During the training process of large-scale speech synthesis models, it is costly and time-consuming for professional users to record large amounts of audio training data in a professional recording studio. Therefore, to save training costs, large-scale speech synthesis models are usually trained using large amounts of audio data collected from internet platforms.
[0043] However, due to the inconsistent quality of audio data collected from internet platforms, low-quality audio, such as audio containing noise, background music, and reverberation, is often included. Training a large-scale speech synthesis model with low-quality audio will negatively impact its performance, resulting in poor-quality synthesized audio. Therefore, low-quality audio data collected from internet platforms is unsuitable for training large-scale speech synthesis models. To ensure the performance of the large-scale speech synthesis model, it is necessary to select high-quality audio from the collected data and use it for training.
[0044] To filter out high-quality audio from audio data collected from internet platforms, the relevant technologies mainly adopt the following two approaches.
[0045] The first approach is an audio classification scheme, which filters high-quality audio by categorizing audio data. This approach first requires constructing training data by obtaining clean audio data, noisy data, and music data from clean audio datasets, noisy datasets, and music datasets, respectively. The clean audio data can be divided into two parts: one part serves as the audio data for the clean audio category, and the other part is used to simulate and generate noisy audio data, audio data with background music, and audio data with reverberation. The clean audio data and noisy data are superimposed using different signal-to-noise ratios to obtain noisy audio data. Similarly, the clean audio data and music data are superimposed using different signal-to-noise ratios to obtain audio data with background music. Room Impulse Response (RIR) is simulated using different parameter configurations, and the clean audio data is convolved with the RIR to obtain reverberated audio data. Thus, four categories of audio data are obtained: clean audio data, noisy audio data, audio data with background music, and audio data with reverberation. Secondly, a classification model is trained using clean audio data, as well as simulated audio data with noise, background music, and reverberation. Then, audio data collected from internet platforms is input into the trained classification model, which outputs the predicted probabilities of the audio data corresponding to the four categories mentioned above. The category with the highest predicted probability is selected as the predicted category of the audio data, thus filtering out audio data belonging to the clean audio data category as high-quality audio.
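The simulation steps described for the first approach (superimposing clean audio with noise or music at a chosen signal-to-noise ratio, and convolving clean audio with an RIR) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, signals are plain lists of float samples, and the direct-form convolution is chosen for clarity rather than speed.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(clean, interference, snr_db):
    """Scale `interference` so the clean-to-interference power ratio
    equals `snr_db` decibels, then add it to `clean`."""
    if len(clean) != len(interference):
        raise ValueError("signals must have the same length")
    p_interference = power(interference)
    if p_interference == 0:
        raise ValueError("interference is silent")
    # SNR(dB) = 10 * log10(P_clean / P_interference), solved for the target power.
    target_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_power / p_interference)
    return [c + gain * n for c, n in zip(clean, interference)]

def convolve(signal, rir):
    """Direct-form convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out
```

Calling `mix_at_snr` with noise data yields noisy audio, calling it with music data yields audio with background music, and `convolve` yields reverberated audio, which together with the untouched clean audio gives the four simulated categories.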
[0046] The second approach is an audio quality prediction scheme. It uses a MOS (Mean Opinion Score) prediction model to predict the score a listener would give when subjectively rating the audio quality on a scale from 1 to 5. A higher score indicates higher audio quality. In this second approach, training data needs to be constructed first. The process of constructing training data is similar to that in the first approach and will not be repeated here. The constructed training data includes four categories of audio data: clean audio data, noisy audio data, audio data with background music, and audio data with reverberation. Next, the MOS prediction model is trained using the clean audio data, as well as the simulated audio data with noise, background music, and reverberation. Then, audio data collected from internet platforms is input into the trained MOS prediction model, which outputs a score for each piece of audio data, and audio data with a score higher than 4 is selected as high-quality audio.
[0047] Both approaches in the related technologies share the commonality of employing data simulation to construct audio training data and then using this data to train a neural network model, resulting in a pre-trained model. Audio data collected from internet platforms is then input into the pre-trained neural network model, which then filters out high-quality audio from the collected data.
[0048] However, in related technologies, the audio training data for training neural network models is constructed using data simulation, resulting in differences compared to audio data collected from internet platforms. Specifically, the audio training data constructed using data simulation is based on clean audio data, noise data, and music data obtained separately from clean audio datasets, noise datasets, and music datasets. When the noise dataset has a limited range of noise types, and the music dataset has a limited range of background music types, the types of audio training data become limited, failing to cover the audio types of audio data from real-world environments (i.e., audio data collected from internet platforms). Furthermore, there is a feature discrepancy between simulated audio training data and real-world audio data; the simulated audio training data cannot reflect the audio characteristics of audio data collected from internet platforms, resulting in insufficient audio features. Therefore, training neural network models with simulated training data makes it difficult for the neural network model to learn rich audio information, easily affecting the performance of the neural network model and thus impacting the accuracy of selecting high-quality audio from audio data collected from internet platforms.
[0049] Therefore, embodiments of this application provide an audio data processing method, apparatus, and storage medium. In these embodiments, training data can be directly obtained from audio data collected from the internet and then used to train an audio classification model. This allows the audio classification model to learn rich audio information, thereby improving the performance of the trained model. Using a well-trained, high-performance audio classification model to filter high-quality audio from the collected audio data can improve accuracy.
[0050] In this embodiment, the entity executing the audio data processing method can be an electronic device, such as a mobile phone, tablet computer, laptop, desktop computer, smart interactive whiteboard, or other device with audio filtering capabilities. The electronic device integrates multiple algorithm models, and when filtering high-quality audio based on collected audio data, it can achieve this through these integrated algorithm models.
[0051] The following section explains how electronic devices use various algorithm models to process audio data.
[0052] As shown in Figure 1, this application provides a method for processing audio data, which can be applied to the above-mentioned electronic device. The method includes the following steps:
[0053] S11, Obtain training data from the audio data to be filtered.
[0054] S12, classify the training data to obtain a first audio set and a second audio set. The audio quality in the first audio set is higher than that in the second audio set.
[0055] S13, select the first target audio from the first audio set.
[0056] S14, select the second target audio from the second audio set.
[0057] S15, use the first target audio and the second target audio to train the audio classification model to be trained, and obtain the trained audio classification model.
[0058] S16, use the trained audio classification model to classify the audio data to be filtered to obtain the target audio data.
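Steps S11 to S16 can be read as a simple pipeline. The sketch below is a hedged outline of that flow, not the implementation of this application: every callable passed in (`coarse_is_high`, `select_high`, `select_low`, `train_classifier`) is a hypothetical placeholder for the preset classification model, the multi-level filters, and the training routine, and random sampling in S11 is one optional way of obtaining the training data.

```python
import random

def process_audio(candidates, coarse_is_high, select_high, select_low,
                  train_classifier, sample_size, seed=0):
    """Outline of S11 to S16 with all models supplied as callables (assumed)."""
    rng = random.Random(seed)
    # S11: obtain training data from the audio data to be filtered.
    training_data = rng.sample(candidates, min(sample_size, len(candidates)))
    # S12: classify the training data into a first (higher quality)
    # and a second (lower quality) audio set.
    first_set = [a for a in training_data if coarse_is_high(a)]
    second_set = [a for a in training_data if not coarse_is_high(a)]
    # S13 and S14: select the first and second target audio.
    first_target = select_high(first_set)
    second_target = select_low(second_set)
    # S15: train the audio classification model; the result is assumed to be
    # a callable returning True for high-quality audio.
    classifier = train_classifier(first_target, second_target)
    # S16: classify the audio data to be filtered to obtain the target audio data.
    return [a for a in candidates if classifier(a)]
```

Passing the models in as callables keeps the outline independent of any particular classifier, scorer, or audio representation.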
[0059] Using the audio data processing method provided in this application, before training the audio classification model to be trained, the electronic device can obtain training data from the audio data to be screened and classify the training data. After classification, the training data forms two audio sets: a first audio set and a second audio set. The audio in the first audio set has higher quality, while the audio in the second audio set has lower quality. High-quality first target audio is further selected from the high-quality first audio set, and low-quality second target audio is selected from the low-quality second audio set. The high-quality first target audio and the low-quality second target audio are used to train the audio classification model to be trained, resulting in a trained audio classification model. After the audio data to be screened is classified using the trained audio classification model, target audio data for training a large speech synthesis model can be obtained.
[0060] In this embodiment, since the first and second target audios used to train the audio classification model are obtained by filtering the training data layer by layer, and the training data is obtained from the audio data to be filtered, the first and second target audios are actually obtained from the audio data to be filtered. Thus, the first and second target audios can contain the audio features of the audio data to be filtered. By using the first and second target audios to train the audio classification model, the audio classification model can learn the information of the audio data to be filtered. That is, the audio classification model can learn rich audio information during training, thereby improving the performance of the trained audio classification model and thus improving the accuracy of filtering high-quality target audio data from the audio data to be filtered.
[0061] Furthermore, since the first target audio selected from the training data is high-quality audio, it better reflects the characteristics of high-quality audio in the audio data to be screened. And since the second target audio selected from the training data is low-quality audio, it better reflects the characteristics of low-quality audio in the audio data to be screened. Thus, training the audio classification model using the first target audio reflecting high-quality audio characteristics and the second target audio reflecting low-quality audio characteristics allows the model to better learn information about high-quality and low-quality audio during training. This enables the trained audio classification model to better distinguish between high-quality and low-quality audio, further improving the accuracy of selecting high-quality target audio data from the audio data to be screened.
[0062] Optionally, in step S11 above, the audio collected from the internet platform is segmented into audio segments to form the audio data to be filtered. The duration of the audio segments in the audio data to be filtered can be in the range of 5s to 20s. Using overly long audio to train a large speech synthesis model can easily exhaust the memory of the electronic device's graphics card, interrupting training. On the other hand, an audio segment that is too short contains limited speech content, so the sentences in that content are often semantically incomplete. If semantically incomplete audio segments participate in the training of the large speech synthesis model, the model will find it difficult to capture complete and accurate features, which degrades the model's learning and thus its performance. Therefore, in this embodiment, it is necessary to select, from the multiple audio segments, those within a preset duration range, so that the duration of the audio segments is neither too long nor too short.
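The duration screening described in [0062] can be sketched as below. The `Segment` type and its field names are hypothetical; only the 5s to 20s range comes from the text, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """Hypothetical record for one audio segment cut from a longer recording."""
    source: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

def keep_in_range(segments, min_s=5.0, max_s=20.0):
    """Keep segments whose duration lies within [min_s, max_s] seconds."""
    return [s for s in segments if min_s <= s.duration_s <= max_s]
```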
[0063] As an optional embodiment, a first preset number of audio segments are randomly selected from the audio segments included in the audio data to be screened, and used as training data. The total duration of the first preset number of audio segments in the training data can be 2000 hours.
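One way to realize the random selection in [0063] is to shuffle the segments and accumulate them until a total-duration budget is reached. This is a sketch under assumptions: the text fixes a preset number of segments and mentions a total of 2000 hours, whereas the budget-driven stopping rule, the function name, and the fixed seed are illustrative choices.

```python
import random

def sample_by_duration(durations_s, budget_hours, seed=0):
    """Randomly pick segment indices whose total duration stays within the budget.

    durations_s: per-segment durations in seconds.
    Returns (chosen_indices, total_seconds).
    """
    budget_s = budget_hours * 3600.0
    order = list(range(len(durations_s)))
    random.Random(seed).shuffle(order)
    chosen, total = [], 0.0
    for i in order:
        # Skip any segment that would push the total past the budget.
        if total + durations_s[i] > budget_s:
            continue
        chosen.append(i)
        total += durations_s[i]
    return chosen, total
```

For the figure given in the text, the call would be `sample_by_duration(durations_s, budget_hours=2000)`.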
[0064] Optionally, in step S12 above, the training data is classified by a preset classification model. The preset classification model can be an open-source binary classification model. The preset classification model can be used to perform preliminary screening of the training data. The purpose of preliminary screening is to divide the training data into two categories, where the first audio set including higher quality audio is one category, and the second audio set including lower quality audio is another category.
[0065] Optionally, in step S13 above, multiple filtering is performed based on the first audio set to filter out high-quality first target audio from the first audio set, which includes high-quality audio.
[0066] Furthermore, as shown in Figure 2, the steps for the electronic device to select the first target audio from the first audio set are as follows:
[0067] S21, use a preset evaluation model to evaluate the quality of the audio in the first audio set and obtain a quality score for the audio in the first audio set.
[0068] The preset evaluation model can be a MOS prediction model. The structure of the MOS prediction model can be constructed by stacking Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Bi-directional Long Short-term Memory (BLSTM), Residual Blocks, and Attention modules. For example, it can be constructed by stacking 4 Residual Blocks, 2 CNNs, and 4 Attention modules.
[0069] S22, select audio files with quality scores higher than the first threshold from the first audio set to obtain the third audio set.
[0070] S23, evaluate the quality of the audio in the third audio set using a preset evaluation algorithm, and obtain the quality score of the audio in the third audio set.
[0071] The preset evaluation algorithm can be the Perceptual Evaluation of Speech Quality (PESQ) algorithm. The PESQ algorithm compares the original audio (i.e., the clean reference audio) and the processed audio (i.e., the audio in the third audio set) to obtain the difference between the two, and calculates the quality score (i.e., PESQ score) of the audio in the third audio set based on the difference.
[0072] S24, select audio files with quality scores higher than the second threshold from the third audio set to obtain the first target audio.
[0073] Steps S21 to S24 will be explained below.
[0074] In step S21, after the audio from the first audio set is input into the MOS prediction model, the MOS prediction model can evaluate the quality of the audio in the first audio set and output the quality score (i.e., MOS score) of the audio in the first audio set.
[0075] In step S22, the audio quality score (i.e., MOS score) ranges from 1 to 5. For example, the first threshold can be set to 4.5. After the MOS prediction model outputs the MOS scores of the audio in the first audio set, the audio in the first audio set with a quality score higher than 4.5 is selected to obtain the third audio set. Therefore, the MOS scores of the audio in the third audio set are all higher than 4.5.
[0076] In this embodiment, when evaluating the quality of the audio in the first audio set using the MOS prediction model, the MOS prediction model may have prediction errors. If the MOS prediction model has prediction errors, the quality score output by the MOS prediction model after evaluating the quality of the audio in the first audio set will be incorrect. If the quality score of the audio in the first audio set is incorrect, after filtering the audio in the first audio set using a first threshold, the resulting third audio set may contain audio of lower quality. Therefore, steps S23 and S24 are executed to further evaluate the quality of the audio in the third audio set using a preset evaluation algorithm, so as to filter out high-quality first target audio based on the quality of the audio in the third audio set.
[0077] In step S23, the quality of the audio in the third audio set is evaluated using a preset evaluation algorithm to obtain a quality score for the audio in the third audio set. This includes: performing noise reduction processing on the audio in the third audio set using a preset noise reduction model to obtain a reference audio corresponding to the audio in the third audio set; calculating the difference between the audio in the third audio set and the corresponding reference audio; and generating a quality score for the audio in the third audio set based on the difference.
[0078] In this implementation, since there is no reference audio corresponding to the audio in the third audio set in the audio data to be filtered, it is necessary to perform noise reduction processing on the audio in the third audio set. After noise reduction processing, the corresponding reference audio is obtained, and the difference between the audio in the third audio set and the corresponding reference audio is calculated. This difference can characterize the differences between the audio in the third audio set and the corresponding reference audio in terms of loudness, delay, and frequency distortion. The smaller the difference, the closer the quality of the audio in the third audio set and the corresponding reference audio is, and the higher the audio quality score (i.e., PESQ score) is.
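The use of a reference-based metric without a ready-made reference, as described above, can be sketched as follows. This is only a minimal illustration under stated assumptions: `denoise` stands in for the preset noise reduction model and `compare` for a PESQ implementation. Both are hypothetical callables, and the toy stand-ins below are neither real noise reduction nor real PESQ.

```python
from typing import Callable, List, Sequence

Audio = Sequence[float]


def score_with_pseudo_reference(
    audio: Audio,
    denoise: Callable[[Audio], Audio],
    compare: Callable[[Audio, Audio], float],
) -> float:
    """Score `audio` against a reference obtained by denoising it.

    `denoise` is a placeholder for the preset noise reduction model and
    `compare` for a reference-based metric such as PESQ (both assumed).
    """
    reference = denoise(audio)        # reference audio for this clip
    return compare(reference, audio)  # smaller difference -> higher score


# Toy stand-ins so the sketch runs; they are NOT real denoising or PESQ.
def toy_denoise(audio: Audio) -> List[float]:
    """Crude smoother: average each sample with its direct neighbours."""
    smoothed = []
    for i in range(len(audio)):
        window = audio[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


def toy_compare(reference: Audio, degraded: Audio) -> float:
    """Map the mean absolute difference onto the range -0.5 to 4.5."""
    diff = sum(abs(r - d) for r, d in zip(reference, degraded)) / len(reference)
    return max(-0.5, 4.5 - 5.0 * diff)


steady_clip = [0.5] * 8        # unchanged by the smoother
jittery_clip = [0.5, -0.5] * 4  # changed a lot by the smoother
steady_score = score_with_pseudo_reference(steady_clip, toy_denoise, toy_compare)
jittery_score = score_with_pseudo_reference(jittery_clip, toy_denoise, toy_compare)
```

In this toy setup the clip that the smoother leaves unchanged receives the highest score, which mirrors the statement that a smaller difference yields a higher quality score.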
[0079] In step S24, the PESQ score ranges from -0.5 to 4.5. For example, the second threshold can be set to 4, and the audio in the third audio set whose quality score is higher than 4 is selected to obtain the high-quality first target audio. Combined with the example in step S22 above, since the third audio set consists entirely of audio with a MOS score higher than 4.5, in this embodiment the first target audio finally obtained is high-quality audio with both a MOS score higher than 4.5 and a PESQ score higher than 4.
[0080] By adopting this implementation, when selecting the high-quality first target audio from the first audio set, the quality of the audio is evaluated more than once, using a preset evaluation model (i.e., the MOS prediction model) and a preset evaluation algorithm (i.e., the PESQ algorithm), and one round of filtering is performed after each evaluation. This realizes multi-level filtering of the audio and helps ensure the accuracy of the high-quality audio obtained.
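Steps S21 to S24 can be sketched as a two-stage selection. This is a minimal sketch, not the applicant's implementation: `mos_score` and `pesq_score` are hypothetical callables standing in for the MOS prediction model and the PESQ algorithm, and the default thresholds are the example values 4.5 and 4 given above.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def select_first_target_audio(
    first_audio_set: Iterable[T],
    mos_score: Callable[[T], float],   # assumed wrapper around the MOS prediction model
    pesq_score: Callable[[T], float],  # assumed wrapper around the PESQ algorithm
    first_threshold: float = 4.5,
    second_threshold: float = 4.0,
) -> List[T]:
    # S21-S22: keep audio whose MOS score is higher than the first threshold.
    third_audio_set = [a for a in first_audio_set if mos_score(a) > first_threshold]
    # S23-S24: of those, keep audio whose PESQ score is higher than the second threshold.
    return [a for a in third_audio_set if pesq_score(a) > second_threshold]


# Hypothetical clips: name -> (MOS score, PESQ score)
example_scores = {"a": (4.8, 4.2), "b": (4.8, 3.5), "c": (4.0, 4.4), "d": (4.6, 4.1)}
first_target_audio = select_first_target_audio(
    list(example_scores),
    mos_score=lambda name: example_scores[name][0],
    pesq_score=lambda name: example_scores[name][1],
)
```

Because the PESQ scorer only runs on the third audio set, the second evaluation is applied to fewer clips than the first.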
[0081] Optionally, in step S14 above, the second audio set, which includes low-quality audio, is filtered multiple times to select the low-quality second target audio from it.
[0082] Furthermore, as shown in Figure 3, the steps for the electronic device to select the second target audio from the second audio set are as follows:
[0083] S31, use a preset evaluation model to evaluate the quality of the audio in the second audio set and obtain the quality score of the audio in the second audio set.
[0084] The preset evaluation model can be a MOS prediction model. The structure of the MOS prediction model can be constructed by stacking CNN, RNN, BLSTM, Residual Block and Attention. For example, it can be constructed by stacking 4 layers of Residual Block, 2 layers of CNN and 4 layers of Attention.
[0085] S32, select audio from the second audio set whose quality score is lower than the third threshold to obtain the fourth audio set.
[0086] S33: Evaluate the quality of the audio in the fourth audio set using a preset evaluation algorithm to obtain the quality score of the audio in the fourth audio set.
[0087] The preset evaluation algorithm can be the PESQ algorithm. The PESQ algorithm compares the original audio (i.e., the clean reference audio) and the processed audio (i.e., the audio in the fourth audio set) to obtain the difference between the two, and calculates the quality score (i.e., the PESQ score) of the audio in the fourth audio set based on the difference.
[0088] S34: Select audio files with quality scores below the fourth threshold from the fourth audio set to obtain the second target audio.
[0089] Steps S31 to S34 will be explained below.
[0090] In step S31, after the audio from the second audio set is input into the MOS prediction model, the MOS prediction model can evaluate the quality of the audio in the second audio set and output the quality score (i.e., MOS score) of the audio in the second audio set.
[0091] In step S32, the audio quality score (i.e., MOS score) ranges from 1 to 5. For example, the third threshold can be set to 3. After the MOS prediction model outputs the MOS scores of the audio in the second audio set, the audio in the second audio set with a quality score lower than 3 can be selected to obtain the fourth audio set. Therefore, the MOS scores of the audio in the fourth audio set are all lower than 3.
[0092] In this embodiment, when using the MOS prediction model to evaluate the quality of the audio in the second audio set, the MOS prediction model may have prediction errors. If the MOS prediction model has prediction errors, the quality score output by the MOS prediction model after evaluating the quality of the audio in the second audio set will be incorrect. If the quality score of the audio in the second audio set is incorrect, after filtering the audio in the second audio set using a third threshold, the resulting fourth audio set may contain audio of higher quality. Therefore, steps S33 and S34 are executed to further evaluate the quality of the audio in the fourth audio set using a preset evaluation algorithm, so as to filter out the low-quality second target audio based on the quality of the audio in the fourth audio set.
[0093] In step S33, the quality of the audio in the fourth audio set is evaluated using a preset evaluation algorithm to obtain a quality score for the audio in the fourth audio set. This includes: performing noise reduction processing on the audio in the fourth audio set using a preset noise reduction model to obtain a reference audio corresponding to the audio in the fourth audio set; calculating the difference between the audio in the fourth audio set and the corresponding reference audio; and generating a quality score for the audio in the fourth audio set based on the difference.
[0094] In this implementation, since there is no reference audio corresponding to the audio in the fourth audio set in the audio data to be filtered, it is necessary to perform noise reduction processing on the audio in the fourth audio set. After noise reduction processing, the corresponding reference audio is obtained, and the difference between the audio in the fourth audio set and the corresponding reference audio is calculated. This difference can characterize the differences in loudness, delay, and frequency distortion between the audio in the fourth audio set and the corresponding reference audio. The greater the difference, the greater the quality difference between the audio in the fourth audio set and the corresponding reference audio, and the lower the audio quality score (i.e., PESQ score).
[0095] In step S34, the PESQ score ranges from -0.5 to 4.5. For example, the fourth threshold can be set to 2, and the audio in the fourth audio set with a quality score below 2 is selected to obtain the low-quality second target audio. Combined with the example in step S32 above, since the fourth audio set consists entirely of audio with a MOS score below 3, in this embodiment the second target audio finally obtained is low-quality audio with both a MOS score below 3 and a PESQ score below 2.
[0096] By adopting this implementation, when selecting the low-quality second target audio from the second audio set, the quality of the audio is evaluated more than once, using a preset evaluation model (i.e., the MOS prediction model) and a preset evaluation algorithm (i.e., the PESQ algorithm), and one round of filtering is performed after each evaluation. This realizes multi-level filtering of the audio and helps ensure the accuracy of the low-quality audio obtained.
[0097] In the foregoing embodiments of this application, when selecting the first target audio from the first audio set in steps S21 to S24, the audio in the first audio set is first evaluated with the MOS prediction model and filtered once to obtain the third audio set. The audio in the third audio set is then evaluated with the PESQ algorithm and filtered once more. This is equivalent to filtering the audio in the first audio set twice to obtain the high-quality first target audio.
[0098] As an optional implementation, in this embodiment of the application, when filtering the first target audio from the first audio set, the audio in the first audio set can first be evaluated and filtered once using the PESQ algorithm, and then the audio obtained from the first filtering can be further evaluated and filtered a second time based on the MOS prediction model to obtain high-quality first target audio. Specifically, as follows:
[0099] As shown in Figure 4, the steps for the electronic device to select the first target audio from the first audio set are as follows:
[0100] S41, evaluate the quality of the audio in the first audio set using a preset evaluation algorithm, and obtain the quality score of the audio in the first audio set.
[0101] S42, select audio files with quality scores higher than the fifth threshold from the first audio set to obtain the fifth audio set.
[0102] S43. Use a preset evaluation model to evaluate the quality of the audio in the fifth audio set and obtain a quality score for the audio in the fifth audio set.
[0103] S44: Select audio files with quality scores higher than the sixth threshold from the fifth audio set to obtain the first target audio.
[0104] The preset evaluation algorithm and preset evaluation model are similar to the preset evaluation algorithm (PESQ algorithm) and preset evaluation model (MOS prediction model) used in steps S21 to S24 of the aforementioned embodiments, respectively. For details, please refer to the aforementioned embodiments, and they will not be repeated here.
[0105] In addition, when evaluating the quality of the audio in the first audio set using the PESQ algorithm, it is also necessary to obtain the reference audio corresponding to the audio in the first audio set. The specific method for obtaining the reference audio can also be referred to the aforementioned embodiments, and will not be repeated here.
[0106] In steps S41 and S42, the PESQ score ranges from -0.5 to 4.5. For example, the fifth threshold can be set to 4. After evaluating the quality of the audio in the first audio set using the PESQ algorithm and obtaining the quality score (i.e., PESQ score) of the audio in the first audio set, audio with a quality score higher than 4 can be selected to obtain the fifth audio set. Therefore, the PESQ scores of the audio in the fifth audio set are all higher than 4.
[0107] In this embodiment, the PESQ algorithm may introduce errors when evaluating the quality of the audio in the first audio set, resulting in incorrect quality scores. If the quality scores of the audio in the first audio set are incorrect, then after filtering the audio in the first audio set using the fifth threshold, the resulting fifth audio set may contain audio of lower quality. Therefore, steps S43 and S44 are executed to further evaluate the quality of the audio in the fifth audio set using a preset evaluation model, in order to select the high-quality first target audio based on the quality of the audio in the fifth audio set.
[0108] In steps S43 and S44, the audio quality score (i.e., MOS score) ranges from 1 to 5. For example, the sixth threshold can be set to 4.5. After the MOS prediction model outputs the MOS scores of the audio in the fifth audio set, the audio in the fifth audio set with a quality score higher than 4.5 can be selected to obtain the high-quality first target audio. Combined with the examples in steps S41 and S42 above, since the fifth audio set consists entirely of audio with a PESQ score higher than 4, in this embodiment the first target audio finally obtained is high-quality audio with both a PESQ score higher than 4 and a MOS score higher than 4.5.
[0109] In this embodiment, when selecting the high-quality first target audio from the first audio set, the quality of the audio is evaluated more than once, using a preset evaluation algorithm (i.e., the PESQ algorithm) and a preset evaluation model (i.e., the MOS prediction model), and one round of filtering is performed after each evaluation. This achieves multi-level filtering of the audio and helps ensure the accuracy of the high-quality audio obtained.
[0110] In the foregoing embodiments of this application, when selecting the second target audio from the second audio set in steps S31 to S34, the audio in the second audio set is first evaluated with the MOS prediction model and filtered once to obtain the fourth audio set. The audio in the fourth audio set is then evaluated with the PESQ algorithm and filtered once more. This is equivalent to filtering the audio in the second audio set twice to obtain the low-quality second target audio.
[0111] As an optional implementation, in this embodiment of the application, when filtering the second target audio from the second audio set, the audio in the second audio set can first be evaluated and filtered using the PESQ algorithm, and then the audio obtained from the first filtering can be further evaluated and filtered again based on the MOS prediction model to obtain the low-quality second target audio. Specifically, as follows:
[0112] As shown in Figure 5, the steps for the electronic device to select the second target audio from the second audio set are as follows:
[0113] S51, evaluate the quality of the audio in the second audio set using a preset evaluation algorithm, and obtain the quality score of the audio in the second audio set.
[0114] S52, select audio files with quality scores below the seventh threshold from the second audio set to obtain the sixth audio set.
[0115] S53. Use a preset evaluation model to evaluate the quality of the audio in the sixth audio set and obtain a quality score for the audio in the sixth audio set.
[0116] S54: Select audio files with quality scores below the eighth threshold from the sixth audio set to obtain the second target audio.
[0117] The preset evaluation algorithm and preset evaluation model are similar to the preset evaluation algorithm (PESQ algorithm) and preset evaluation model (MOS prediction model) used in steps S31 to S34 of the aforementioned embodiments, respectively. For details, please refer to the aforementioned embodiments, and they will not be repeated here.
[0118] In addition, when evaluating the quality of the audio in the second audio set using the PESQ algorithm, it is also necessary to obtain the reference audio corresponding to the audio in the second audio set. For the specific method of obtaining the reference audio, refer to the aforementioned embodiments; it will not be repeated here.
[0119] In steps S51 and S52, the PESQ score ranges from -0.5 to 4.5. For example, the seventh threshold can be set to 2. After evaluating the quality of the audio in the second audio set using the PESQ algorithm and obtaining the quality score (i.e., PESQ score) of the audio in the second audio set, the audio with a quality score below 2 can be selected to obtain the sixth audio set. Therefore, the PESQ scores of the audio in the sixth audio set are all below 2.
[0120] In this embodiment, when evaluating the quality of the audio in the second audio set using the PESQ algorithm, errors may occur, resulting in incorrect quality scores for the audio in the second audio set. If the quality scores of the audio in the second audio set are incorrect, then after filtering the audio in the second audio set using the seventh threshold, the resulting sixth audio set may contain audio of higher quality. Therefore, steps S53 and S54 are executed to further evaluate the quality of the audio in the sixth audio set using a preset evaluation model, so as to select the low-quality second target audio based on the quality of the audio in the sixth audio set.
[0121] In steps S53 and S54, the audio quality score (i.e., MOS score) ranges from 1 to 5. For example, the eighth threshold can be set to 3. After the MOS prediction model outputs the MOS scores of the audio in the sixth audio set, the audio in the sixth audio set with a quality score below 3 can be selected to obtain the low-quality second target audio. Combined with the examples in steps S51 and S52 above, since the sixth audio set consists entirely of audio with a PESQ score below 2, in this embodiment the second target audio finally obtained is low-quality audio with both a PESQ score below 2 and a MOS score below 3.
[0122] In this embodiment, when selecting the low-quality second target audio from the second audio set, the quality of the audio is evaluated more than once, using a preset evaluation algorithm (i.e., the PESQ algorithm) and a preset evaluation model (i.e., the MOS prediction model), and one round of filtering is performed after each evaluation. This achieves multi-level filtering of the audio and helps ensure the accuracy of the low-quality audio obtained.
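The two orderings described in the embodiments (MOS prediction model first, or PESQ algorithm first) can be expressed with one helper. The sketch below is illustrative only: `cascade_filter`, the clip names, and the scores are hypothetical, and the scorers are plain lookups rather than a real MOS model or PESQ implementation.

```python
import operator
from typing import Any, Callable, Iterable, List, Sequence, Tuple

Stage = Tuple[Callable[[Any], float], float]  # (scorer, threshold)


def cascade_filter(items: Iterable[Any], stages: Sequence[Stage], keep_high: bool) -> List[Any]:
    """Apply the stages in order, keeping items strictly above (or below) each threshold."""
    compare = operator.gt if keep_high else operator.lt
    kept = list(items)
    for scorer, threshold in stages:
        # Each stage only scores what survived the previous stage.
        kept = [item for item in kept if compare(scorer(item), threshold)]
    return kept


# Hypothetical clips: name -> (MOS score, PESQ score)
clips = {"a": (2.5, 1.5), "b": (2.8, 2.6), "c": (3.4, 1.2), "d": (1.9, 1.8)}


def mos(name: str) -> float:
    return clips[name][0]


def pesq(name: str) -> float:
    return clips[name][1]


# S31-S34 order: MOS below 3, then PESQ below 2.
mos_first = cascade_filter(clips, [(mos, 3.0), (pesq, 2.0)], keep_high=False)
# S51-S54 order: PESQ below 2, then MOS below 3.
pesq_first = cascade_filter(clips, [(pesq, 2.0), (mos, 3.0)], keep_high=False)
```

When each scorer returns the same score for the same clip, both orderings keep the same clips; the order only changes how many clips each scorer has to process.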
[0123] Optionally, in step S15 above, as shown in Figure 6, the electronic device uses the first target audio and the second target audio to train the audio classification model to obtain the trained audio classification model. The steps are as follows:
[0124] S61, input the first target audio and the second target audio into the audio classification model to be trained, and train the audio classification model.
[0125] S62, during the training of the audio classification model to be trained, calculate the learning loss of the audio classification model to be trained.
[0126] S63, adjust the parameters of the audio classification model to be trained based on the learning loss.
[0127] S64, use labeled test data to verify whether the audio classification model to be trained, with its parameters adjusted, meets the preset conditions. The test data is obtained from the audio data to be filtered.
[0128] S65, if the audio classification model to be trained, with its parameters adjusted, meets the preset conditions, training is complete and the trained audio classification model is obtained.
[0129] In this embodiment, the test data is obtained from the audio data to be filtered. For example, a second preset number of audio segments are randomly selected from the audio segments included in the audio data to be filtered, and these are used as test data. The second preset number of audio segments in the test data can be 10,000 audio segments with a total duration of 10 hours. Since the amount of test data is relatively small, and to ensure the accuracy of the test data labels, the test data can be labeled manually.
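Random selection of the test data can be sketched as follows. This is a minimal sketch: the function name `sample_test_data` is hypothetical, and the fixed `seed` is an added assumption for reproducibility that the embodiment does not mention.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_test_data(audio_segments: Sequence[T], second_preset_number: int, seed: int = 0) -> List[T]:
    """Randomly pick `second_preset_number` distinct segments to be labeled manually."""
    rng = random.Random(seed)  # assumption: seeded so the same test set can be rebuilt
    return rng.sample(list(audio_segments), second_preset_number)


# Hypothetical segment identifiers standing in for the audio data to be filtered.
segments = [f"segment_{i}" for i in range(100)]
test_data = sample_test_data(segments, 10)
```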
[0130] In this embodiment, the audio classification model's role is to filter high-quality target audio data from the audio data to be filtered, so that a large speech synthesis model can be trained using this high-quality target audio data. Therefore, a binary classification model can be used. Compared to related technologies that use neural network models to classify audio into clean audio data, noisy audio data, audio data with background music, and audio data with reverberation, the binary classification model in this embodiment only needs to distinguish high-quality audio from low-quality audio, which reduces the number of categories. This reduces the complexity of filtering and labeling training data, thereby lowering the learning difficulty of the audio classification model during training and improving the model's training performance.
[0131] Specifically, a binary classification model can categorize audio data to be filtered into high-quality target audio data and low-quality audio data. For example, the binary classification model can label audio data with binary labels (0 or 1), using 1 to identify high-quality target audio data and 0 to identify low-quality audio data. After inputting the audio data to be filtered into the binary classification model, it can calculate the probability that the audio data belongs to high-quality audio and the probability that it belongs to low-quality audio, with the sum of the probabilities of belonging to low-quality audio and high-quality audio being 1. When the probability of audio data belonging to high-quality audio exceeds 50%, it means that the probability of audio data belonging to low-quality audio is less than 50%, i.e., the probability of belonging to high-quality audio is greater than the probability of belonging to low-quality audio. In this case, the audio data can be labeled with the high-quality label 1, thus achieving the classification of the audio data.
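The labeling rule above can be illustrated with a short sketch. The embodiment does not state how the two probabilities are produced; a softmax over two raw model outputs is assumed here, and the function name `classify` and its inputs are hypothetical.

```python
import math
from typing import Tuple


def classify(output_low: float, output_high: float) -> Tuple[int, float, float]:
    """Return (label, p_high, p_low), where label 1 means high-quality audio.

    Assumption: the model emits one raw output per class and a softmax
    turns them into probabilities that sum to 1.
    """
    largest = max(output_low, output_high)  # subtract the max for numerical stability
    e_low = math.exp(output_low - largest)
    e_high = math.exp(output_high - largest)
    p_high = e_high / (e_low + e_high)
    p_low = 1.0 - p_high
    label = 1 if p_high > 0.5 else 0        # high-quality only when p_high exceeds 50%
    return label, p_high, p_low


label_a, p_high_a, p_low_a = classify(output_low=0.0, output_high=2.0)
label_b, p_high_b, p_low_b = classify(output_low=1.0, output_high=1.0)
```

With equal outputs the probability of high-quality audio is exactly 50%, which does not exceed the threshold, so the sketch assigns label 0 in that case.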
[0132] In this embodiment of the application, the network architecture of the audio classification model can be constructed using CNN, RNN, BLSTM, Residual Block, and Attention. As shown in Figure 7, the audio classification model consists of 6 layers of residual blocks and 8 layers of attention modules.
[0133] In steps S61 to S63 above, the first target audio and the second target audio are input into the audio classification model to be trained. During the training process of the audio classification model, the learning loss of the audio classification model needs to be calculated. The learning loss of the audio classification model is used to characterize the difference between the predicted result output by the audio classification model after processing the audio data and the true label of the audio data during the training process. This difference can reflect the performance of the audio classification model; that is, the smaller the difference, the better the performance of the audio classification model.
[0134] In this implementation, calculating the learning loss of the audio classification model to be trained serves to adjust the parameters of the model based on the difference between the predicted results and the true labels during training. This guides the training process, reduces the discrepancy between the predicted results and the true labels, improves the performance of the audio classification model, and enables the trained model to better recognize audio data. The learning loss of the audio classification model can be calculated using either the cross-entropy loss function or the logistic regression loss function.
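As an illustration of the learning loss, the sketch below computes a binary cross-entropy over predicted probabilities and true labels. It is a minimal sketch under assumptions: the function name, the clamping constant `eps`, and the example values are not taken from the application. For a two-class problem, the cross-entropy loss and the logistic regression loss are commonly the same quantity.

```python
import math
from typing import Sequence


def binary_cross_entropy(p_high: Sequence[float], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean cross-entropy between predicted P(high-quality) and true labels (1 or 0)."""
    total = 0.0
    for p, y in zip(p_high, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Predictions close to the true labels give a small loss ...
loss_good = binary_cross_entropy([0.9, 0.1], [1, 0])
# ... and predictions far from the true labels give a large loss.
loss_bad = binary_cross_entropy([0.1, 0.9], [1, 0])
```

The smaller loss for the first case matches the statement that a smaller difference between predictions and true labels indicates better model performance.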
[0135] In steps S64 and S65 above, during the training of the audio classification model to be trained, the test data is used to verify whether the model meets the preset conditions, in order to determine whether training is complete. The preset conditions can be that the accuracy of the audio classification model to be trained reaches an accuracy threshold and that its recall reaches a recall threshold. Accuracy here is the proportion of the audio predicted as high-quality by the model that is actually high-quality (this quantity is commonly called precision). Recall is the proportion of all high-quality audio in the test data that the model correctly predicts as high-quality. For example, the test data includes 10,000 audio data entries, of which 5,000 are high-quality audio. After classifying the 10,000 entries, the audio classification model to be trained predicts that 8,000 are high-quality audio. After verification, only 4,000 of the predicted high-quality entries are found to be high-quality, while the remaining 4,000 are low-quality. Therefore, the accuracy of the audio classification model to be trained is 4,000 / 8,000 = 50%, and the recall is 4,000 / 5,000 = 80%.
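The worked example above can be reproduced with a few lines. This is only a sketch: the function name `accuracy_and_recall` is hypothetical, labels use 1 for high-quality and 0 for low-quality, and "accuracy" follows the definition used in this paragraph (correct high-quality predictions divided by all high-quality predictions).

```python
from typing import Sequence, Tuple


def accuracy_and_recall(predicted: Sequence[int], actual: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, recall) for the high-quality class (label 1)."""
    correct_high = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    predicted_high = sum(predicted)  # everything the model called high-quality
    actual_high = sum(actual)        # everything that really is high-quality
    accuracy = correct_high / predicted_high if predicted_high else 0.0
    recall = correct_high / actual_high if actual_high else 0.0
    return accuracy, recall


# 10,000 entries: the first 5,000 are truly high-quality, the rest low-quality.
actual = [1] * 5000 + [0] * 5000
# The model predicts 8,000 as high-quality, of which 4,000 are correct.
predicted = [1] * 4000 + [0] * 1000 + [1] * 4000 + [0] * 1000
accuracy, recall = accuracy_and_recall(predicted, actual)
```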
[0136] As an optional implementation, when validating the audio classification model using test data, a preset condition can be set that the accuracy and recall of the audio classification model to be trained both reach 80%. In this way, if the accuracy and recall of the audio classification model to be trained reach 80% after processing the test data during the training process, it indicates that the audio classification model to be trained has been successfully trained, and a well-trained audio classification model is obtained.
[0137] By adopting this implementation method, during the training process of the audio classification model to be trained, after verifying the audio classification model to be trained through test data, it can be determined whether the audio classification model to be trained meets the preset conditions based on the verification results, so as to train an audio classification model that meets the preset conditions, thereby ensuring the performance of the trained audio classification model.
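Steps S61 to S65 can be sketched as a loop that stops once the preset conditions are met. This is a minimal sketch with hypothetical names: the three callables stand in for the actual training, parameter adjustment, and test-data verification, the 80% defaults are the example values given above, and the `max_rounds` cap is an added assumption.

```python
from typing import Callable, Optional, Tuple


def train_until_conditions_met(
    train_one_round: Callable[[], float],                        # S61-S62: train, return learning loss
    adjust_parameters: Callable[[float], None],                  # S63: update parameters from the loss
    evaluate_on_test_data: Callable[[], Tuple[float, float]],    # S64: return (accuracy, recall)
    accuracy_threshold: float = 0.8,
    recall_threshold: float = 0.8,
    max_rounds: int = 100,                                       # assumption: safety cap on rounds
) -> Optional[int]:
    """Return the round at which the preset conditions were met, or None if never."""
    for round_index in range(1, max_rounds + 1):
        loss = train_one_round()
        adjust_parameters(loss)
        accuracy, recall = evaluate_on_test_data()
        if accuracy >= accuracy_threshold and recall >= recall_threshold:
            return round_index  # S65: training is complete
    return None


# Toy stand-ins: fixed verification results for three successive rounds.
verification_results = iter([(0.5, 0.9), (0.7, 0.85), (0.82, 0.81)])
finished_round = train_until_conditions_met(
    train_one_round=lambda: 0.0,
    adjust_parameters=lambda loss: None,
    evaluate_on_test_data=lambda: next(verification_results),
)
```

In the toy run, training stops at the third round, the first one in which both accuracy and recall reach 80%.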
[0138] Optionally, in step S16 above, after the audio classification model to be trained has been trained, it can be used to classify audio and thereby select high-quality audio. As shown in Figure 7, the audio data to be filtered is input into the trained audio classification model. The trained audio classification model can extract audio features (Mel-spectral features) from the audio data and, after performing convolution operations on the audio features, map the features to the probabilities of the classification categories through a fully connected layer. Based on these probabilities, it determines whether the audio data is high-quality audio, thereby filtering out high-quality target audio data from the audio data to be filtered.
[0139] As shown in Figure 8, this application embodiment provides an audio data processing apparatus 800, including a processor 801 and a memory 802. Optionally, the apparatus 800 may further include a communication interface 803 and a bus 804. The processor 801, communication interface 803, and memory 802 can communicate with each other via the bus 804. The communication interface 803 can be used for information transmission. The processor 801 can call logical instructions in the memory 802 to execute the audio data processing method described in the above embodiment.
[0140] Furthermore, the logic instructions in the aforementioned memory 802 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0141] The memory 802, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this application. The processor 801 executes functional applications and data processing by running the program instructions / modules stored in the memory 802, that is, it implements the audio data processing method in the above embodiments.
[0142] The memory 802 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 802 may include high-speed random access memory and may also include non-volatile memory.
[0143] This application provides a storage medium storing computer-executable instructions, which are configured to execute the audio data processing method described in the above embodiments.
[0144] The aforementioned storage medium can be a transient computer-readable storage medium or a non-transitory computer-readable storage medium.
[0145] The technical solutions of this application embodiment can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the method described in this application embodiment. The aforementioned storage medium can be a non-transitory storage medium, including: USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, and other media capable of storing program code; it can also be a transient storage medium.
[0146] The foregoing description and accompanying drawings fully illustrate embodiments of this disclosure to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operation may vary. Parts and features of some embodiments may be included in or replace parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of embodiments and claims, the singular forms “a,” “an,” and “the” are intended to equally include the plural forms unless the context clearly indicates otherwise. Similarly, the term “and / or” as used in this application means including one or more of the associated listed items and all possible combinations thereof. Additionally, when used in this application, the term "comprise" and its variations "comprises" and / or "comprising" refer to the presence of stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Without further limitations, an element defined by the phrase "comprises a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element. In this document, each embodiment may focus on the differences from other embodiments, and similar or identical parts between embodiments can be referred to mutually. For methods, products, etc., disclosed in the embodiments, if they correspond to the method section disclosed in the embodiments, the relevant parts can be referred to the description of the method section.
[0147] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0148] The methods and products (including but not limited to devices and equipment) disclosed in the embodiments herein can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For instance, the division of units may be merely a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to implement this embodiment according to actual needs. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
[0149] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. In the descriptions corresponding to the flowcharts and block diagrams in the accompanying drawings, the operations or steps corresponding to different blocks may also occur in a different order than disclosed in the description; sometimes there is no specific order between different operations or steps. For example, two consecutive operations or steps may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Each block in a block diagram and/or flowchart, and combinations of blocks in a block diagram and/or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
Claims
1. A method for processing audio data, characterized in that the method is applied to an electronic device and comprises: obtaining training data from audio data to be screened; classifying the training data to obtain a first audio set and a second audio set, wherein the quality of the audio in the first audio set is higher than the quality of the audio in the second audio set; selecting a first target audio from the first audio set, and selecting a second target audio from the second audio set; training an audio classification model to be trained using the first target audio and the second target audio to obtain a trained audio classification model; and classifying the audio data to be screened using the trained audio classification model to obtain target audio data.
2. The method according to claim 1, characterized in that selecting the first target audio from the first audio set comprises: evaluating the quality of the audio in the first audio set using a preset evaluation model to obtain a quality score for the audio in the first audio set; selecting, from the first audio set, audio with a quality score higher than a first threshold to obtain a third audio set; evaluating the quality of the audio in the third audio set using a preset evaluation algorithm to obtain a quality score for the audio in the third audio set; and selecting, from the third audio set, audio with a quality score higher than a second threshold to obtain the first target audio.
3. The method according to claim 2, characterized in that evaluating the quality of the audio in the third audio set using the preset evaluation algorithm to obtain the quality score for the audio in the third audio set comprises: denoising the audio in the third audio set using a preset denoising model to obtain reference audio corresponding to the audio in the third audio set; calculating the difference between the audio in the third audio set and the corresponding reference audio; and generating, based on the difference, the quality score for the audio in the third audio set.
4. The method according to claim 1, characterized in that selecting the second target audio from the second audio set comprises: evaluating the quality of the audio in the second audio set using a preset evaluation model to obtain a quality score for the audio in the second audio set; selecting, from the second audio set, audio with a quality score lower than a third threshold to obtain a fourth audio set; evaluating the quality of the audio in the fourth audio set using a preset evaluation algorithm to obtain a quality score for the audio in the fourth audio set; and selecting, from the fourth audio set, audio with a quality score lower than a fourth threshold to obtain the second target audio.
5. The method according to claim 4, characterized in that evaluating the quality of the audio in the fourth audio set using the preset evaluation algorithm to obtain the quality score for the audio in the fourth audio set comprises: denoising the audio in the fourth audio set using a preset denoising model to obtain reference audio corresponding to the audio in the fourth audio set; calculating the difference between the audio in the fourth audio set and the corresponding reference audio; and generating, based on the difference, the quality score for the audio in the fourth audio set.
6. The method according to claim 1, characterized in that selecting the first target audio from the first audio set comprises: evaluating the quality of the audio in the first audio set using a preset evaluation algorithm to obtain a quality score for the audio in the first audio set; selecting, from the first audio set, audio with a quality score higher than a fifth threshold to obtain a fifth audio set; evaluating the quality of the audio in the fifth audio set using a preset evaluation model to obtain a quality score for the audio in the fifth audio set; and selecting, from the fifth audio set, audio with a quality score higher than a sixth threshold to obtain the first target audio.
7. The method according to claim 1, characterized in that selecting the second target audio from the second audio set comprises: evaluating the quality of the audio in the second audio set using a preset evaluation algorithm to obtain a quality score for the audio in the second audio set; selecting, from the second audio set, audio with a quality score lower than a seventh threshold to obtain a sixth audio set; evaluating the quality of the audio in the sixth audio set using a preset evaluation model to obtain a quality score for the audio in the sixth audio set; and selecting, from the sixth audio set, audio with a quality score lower than an eighth threshold to obtain the second target audio.
8. The method according to claim 1, characterized in that training the audio classification model to be trained using the first target audio and the second target audio to obtain the trained audio classification model comprises: inputting the first target audio and the second target audio into the audio classification model to be trained, and training the audio classification model to be trained; calculating, during the training, the learning loss of the audio classification model to be trained; adjusting the parameters of the audio classification model to be trained based on the learning loss; verifying, using labeled test data, whether the audio classification model with the adjusted parameters meets a preset condition, wherein the test data is obtained from the audio data to be screened; and when the audio classification model with the adjusted parameters meets the preset condition, completing the training to obtain the trained audio classification model.
9. An audio data processing apparatus, comprising a processor and a memory storing program instructions, characterized in that the processor is configured to perform the audio data processing method according to any one of claims 1 to 8 when executing the program instructions.
10. A storage medium storing program instructions, characterized in that the program instructions, when executed, perform the audio data processing method according to any one of claims 1 to 8.
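For illustration only, the two-stage selection recited in claims 2 to 7 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: `moving_average_denoise` is a hypothetical stand-in for the preset denoising model, `toy_model_score` is a hypothetical stand-in for the preset evaluation model, the mapping from difference to score (`1 / (1 + difference)`) is assumed because the claims only state that the score is generated based on the difference, and all threshold values are arbitrary.

```python
from typing import Callable, List, Sequence

Audio = Sequence[float]            # assumption: one clip is a sequence of samples
Scorer = Callable[[Audio], float]  # maps a clip to a quality score


def moving_average_denoise(audio: Audio, width: int = 3) -> List[float]:
    """Hypothetical stand-in for the preset denoising model."""
    half = width // 2
    smoothed = []
    for i in range(len(audio)):
        window = audio[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def difference_score(audio: Audio) -> float:
    """Evaluation algorithm of claims 3 and 5: denoise, compare, score."""
    if not audio:
        raise ValueError("audio must contain at least one sample")
    reference = moving_average_denoise(audio)
    # Mean squared difference between the clip and its denoised reference.
    diff = sum((a - r) ** 2 for a, r in zip(audio, reference)) / len(audio)
    # Assumed mapping: the less denoising changes the clip, the closer to 1.
    return 1.0 / (1.0 + diff)


def toy_model_score(audio: Audio) -> float:
    """Hypothetical stand-in for the preset evaluation model.

    It simply penalizes samples at or above 0.9 in magnitude.
    """
    loud = sum(1 for a in audio if abs(a) >= 0.9)
    return 1.0 - loud / len(audio)


def select_two_stage(
    audio_set: Sequence[Audio],
    first_scorer: Scorer,
    first_threshold: float,
    second_scorer: Scorer,
    second_threshold: float,
    keep_high: bool = True,
) -> List[Audio]:
    """Filter with one scorer, then filter the survivors with another.

    keep_high=True keeps scores above the thresholds (first target audio);
    keep_high=False keeps scores below them (second target audio).
    """
    def passes(score: float, threshold: float) -> bool:
        return score > threshold if keep_high else score < threshold

    intermediate = [a for a in audio_set if passes(first_scorer(a), first_threshold)]
    return [a for a in intermediate if passes(second_scorer(a), second_threshold)]


if __name__ == "__main__":
    smooth = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    jittery = [0.9, -0.9, 0.9, -0.9, 0.9, -0.9]
    # Model first, then algorithm (the order of claims 2 and 4).
    print(select_two_stage([smooth, jittery], toy_model_score, 0.5, difference_score, 0.9))
    print(select_two_stage([smooth, jittery], toy_model_score, 0.5, difference_score, 0.5,
                           keep_high=False))
```

Passing `difference_score` as the first scorer and `toy_model_score` as the second gives the reversed order of claims 6 and 7. The selected clips would then serve as the labeled positive and negative examples for training the audio classification model of claim 8, which this sketch does not cover.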