Method and apparatus for processing audio data, and storage medium
By filtering out upsampled audio segments and adjusting parameters, audio data is segmented and labeled, solving the problem of low-quality audio affecting the performance of large speech synthesis models, and achieving efficient feature learning and performance improvement.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU SHIYUAN ELECTRONICS CO LTD
- Filing Date
- 2024-12-13
- Publication Date
- 2026-06-16
AI Technical Summary
During the training of large speech synthesis models, the use of low-quality audio data leads to performance degradation. Existing techniques adjust the sampling rate of audio data by upsampling, but the high-frequency range lacks effective spectral features, which affects model performance.
By filtering out upsampled audio segments, adjusting audio data parameters, segmenting into segments within a preset duration range, selecting high-quality segments and labeling them, data consistency is ensured, the influence of upsampling is avoided, and multiple algorithm models are used for cleaning and labeling.
It improves the performance of large speech synthesis models, ensures that the models learn effective feature information, and enhances the quality of speech synthesis.
Figure CN122224136A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, for example, to a method and apparatus for processing audio data, and a storage medium.
Background Technology
[0002] Currently, to save training costs, large amounts of audio data are typically collected from internet platforms during the training of large-scale speech synthesis models. However, since this collected audio data includes low-quality audio, training the large-scale speech synthesis model with low-quality audio will negatively impact its performance, resulting in poor-quality synthesized audio. Therefore, to ensure the performance of the large-scale speech synthesis model, it is necessary to clean the audio data collected from internet platforms to select high-quality audio for training.
[0003] In related technologies, neural network models are primarily used to clean the collected audio data. To ensure the performance of the neural network model during audio data cleaning, the consistency of the audio data input to the model must be guaranteed. This means that before the audio data is input into the neural network model, the parameters of the audio data (e.g., sampling rate) need to be adjusted to ensure consistency and standardization. In these technologies, audio data standardization typically involves upsampling low-sampling-rate audio data to unify the sampling rates.
[0004] However, in related technologies, for high-sampling-rate audio data obtained through upsampling, effective spectral features are still concentrated in the low-frequency range, while effective spectral features are absent in the high-frequency range. Therefore, when upsampled audio data is used to train large-scale speech synthesis models, these models struggle to learn effective feature information, thus impacting their performance.
Summary of the Invention
[0005] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not intended as a general commentary, nor is it intended to identify key / important components or describe the scope of protection of these embodiments, but rather as a prelude to the detailed description that follows.
[0006] This application provides an audio data processing method, apparatus, and storage medium. By cleaning and labeling the audio data, it is possible to minimize the presence of upsampled audio data in the labeled audio data.
[0007] In a first aspect, embodiments of this application provide an audio data processing method applied to an electronic device, comprising:
[0008] adjusting the parameters of the audio data to be cleaned;
[0009] dividing the parameter-adjusted audio data to be cleaned into multiple audio segments;
[0010] selecting, from the multiple audio segments, the audio segments whose duration falls within a preset duration range to obtain a first audio segment set;
[0011] selecting the audio segments in the first audio segment set that meet a first preset condition to obtain a second audio segment set, where audio segments that meet the first preset condition are high-quality audio segments;
[0012] filtering out the audio segments in the second audio segment set that meet a second preset condition to obtain a third audio segment set, where audio segments that meet the second preset condition are audio segments that include the voices of multiple speakers; and
[0013] filtering out the upsampled audio segments in the third audio segment set to obtain a target audio segment set.
[0014] Optionally, the adjusted audio data to be cleaned is segmented into multiple audio segments, including: obtaining silent segments from the adjusted audio data to be cleaned; determining whether a silent segment is a segment to be segmented based on its duration; and segmenting the adjusted audio data to be cleaned into multiple audio segments when the duration of the silent segment exceeds a duration threshold.
[0015] Optionally, obtaining silent segments from the adjusted audio data to be cleaned includes: obtaining audio segments of a set duration from the adjusted audio data to be cleaned; determining the audio segment as a silent audio segment when its energy is below an energy threshold; and merging time-continuous silent audio segments from the adjusted audio data to be cleaned to obtain at least one silent segment.
[0016] Optionally, the audio segments in the first audio segment set that meet the first preset condition are selected to obtain the second audio segment set. This includes: classifying the audio segments in the first audio segment set using a pre-trained audio classification model to obtain a first-class audio segment set and a second-class audio segment set. The set of audio segments with higher quality from the first-class and second-class audio segment sets is then selected as the second audio segment set.
[0017] Optionally, filtering out audio segments in the second audio segment set that meet the second preset condition to obtain a third audio segment set includes: inputting audio segments from the second audio segment set into a speaker log (speaker diarization) model, so that the model outputs the start and end times of each speaker's speech; and, based on those start and end times, filtering out the audio segments in the second audio segment set that include speech from multiple speakers to obtain the third audio segment set.
[0018] Optionally, upsampled audio segments are filtered out from the third audio segment set to obtain the target audio segment set. This includes: for each audio segment in the third audio segment set, obtaining the first average spectral energy of the audio segment within a first frequency interval, and obtaining the second average spectral energy of the audio segment within a second frequency interval. The frequencies in the first frequency interval are lower than the frequencies in the second frequency interval. The ratio of the second average spectral energy to the first average spectral energy is calculated. From the third audio segment set, audio segments with a ratio less than a threshold are filtered out to obtain the target audio segment set.
[0019] Optionally, before filtering out upsampled audio segments in the third audio segment set, the method further includes filtering out audio segments in the third audio segment set whose sampling rate is lower than the sampling rate threshold.
[0020] Optionally, the target audio segments in the target audio segment set are labeled to obtain target audio data, including: using a natural speech recognition model to identify the target audio segments in the target audio segment set to generate text content corresponding to the target audio segments; calculating the perplexity of the text content corresponding to the target audio segments; and removing target audio segments from the target audio segment set when the perplexity of the text content corresponding to the target audio segments is higher than a perplexity threshold to obtain target audio data.
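As an illustration of the perplexity filter in the preceding paragraph, the sketch below computes perplexity as the exponential of the negative mean token log-probability, which is the standard definition. The application does not state which language model supplies the probabilities or which threshold is used, so the `token_log_probs` inputs, the function names, and the threshold of 50.0 are assumptions made only for this example.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def keep_low_perplexity(transcripts, threshold):
    """Keep the ids whose transcript perplexity does not exceed the threshold.

    transcripts: dict mapping a segment id to its token log-probabilities
    (hypothetical values that a language model would supply).
    """
    return [seg_id for seg_id, log_probs in transcripts.items()
            if perplexity(log_probs) <= threshold]


# Hypothetical log-probabilities for two recognized transcripts.
transcripts = {
    "fluent": [math.log(0.25)] * 4,    # every token has probability 0.25
    "garbled": [math.log(0.001)] * 4,  # every token has probability 0.001
}
kept = keep_low_perplexity(transcripts, threshold=50.0)
```

Here the "fluent" transcript has a perplexity of about 4 and is kept, while the "garbled" transcript has a perplexity of about 1000 and is removed.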
[0021] In a second aspect, embodiments of this application provide an audio data processing apparatus, including a processor and a memory storing program instructions, wherein the processor is configured to execute the audio data processing method as described in the first aspect when running the program instructions.
[0022] In a third aspect, embodiments of this application provide a storage medium storing program instructions, which, when executed, perform the audio data processing method as described in the first aspect.
[0023] The audio data processing method, apparatus, and storage medium provided in this application embodiment can achieve the following technical effects:
[0024] The electronic device makes the parameters of the audio data to be cleaned consistent by adjusting them. It then divides the parameter-consistent audio data into multiple audio segments and selects those within a preset duration range to obtain a first audio segment set. From the first audio segment set, it selects the high-quality audio segments that meet a first preset condition to obtain a second audio segment set. It then filters out the audio segments in the second audio segment set that meet the second preset condition, i.e., audio segments containing multiple speakers, to obtain a third audio segment set. Finally, it filters out the upsampled audio segments in the third audio segment set to obtain a target audio segment set, which completes the cleaning of the audio data to be cleaned. The target audio data obtained by labeling the target audio segments in the target audio segment set can be used to train a large speech synthesis model. In this embodiment, because upsampled audio segments in the third audio segment set are filtered out during the cleaning process, the target audio segment set contains as little upsampled audio data as possible, so the target audio segments contain more feature information. By training the large speech synthesis model with the labeled target audio segments, the model can learn effective feature information, which ensures its performance.
[0025] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.
Attached Figure Description
[0026] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations and drawings do not constitute a limitation on the embodiments. Elements having the same reference numerals in the drawings are considered similar elements. The drawings do not constitute a limitation of scale, and wherein:
[0027] Figure 1 is a flowchart of a method for cleaning audio data;
[0028] Figure 2 is a flowchart of another method for cleaning audio data;
[0029] Figure 3 is a flowchart of an audio data processing method provided in an embodiment of this application;
[0030] Figure 4 is a flowchart of a method for training an audio classification model provided in an embodiment of this application;
[0031] Figure 5 is a schematic diagram of an audio data processing apparatus provided in an embodiment of this application.
Detailed Implementation
[0032] The terms "first," "second," etc., used in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.
[0033] Unless otherwise stated, the term "multiple" means two or more.
[0034] In this embodiment, the character " / " indicates that the objects before and after it are in an "or" relationship. For example, A / B means: A or B.
[0035] The term "and / or" describes an association between objects, indicating that three relationships can exist. For example, A and / or B means: A or B, or A and B.
[0036] The term "correspondence" can refer to an association or binding relationship. The correspondence between A and B means that there is an association or binding relationship between A and B.
[0037] To provide a more detailed understanding of the features and technical content of the embodiments of this application, the implementation of the embodiments of this application will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this application. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.
[0038] Large-scale speech synthesis models are artificial intelligence models that utilize deep learning technology to generate natural and fluent speech. Compared to traditional speech synthesis models, large-scale speech synthesis models typically require a large amount of audio training data to learn how to generate high-quality speech. During the training process of large-scale speech synthesis models, it is costly and time-consuming for professional users to record large amounts of audio training data in a professional recording studio. Therefore, to save training costs, large-scale speech synthesis models are usually trained using large amounts of audio data collected from internet platforms.
[0039] However, due to the inconsistent quality of audio data collected from internet platforms, low-quality audio, such as audio containing noise, background music, and reverberation, is often included. Training a large-scale speech synthesis model with low-quality audio will negatively impact its performance, resulting in poor-quality synthesized audio. Therefore, low-quality audio data collected from internet platforms is unsuitable for training large-scale speech synthesis models. To ensure the performance of the large-scale speech synthesis model, the collected audio data needs to be cleaned to select high-quality audio for training.
[0040] To clean audio data collected from internet platforms, that is, to select high-quality audio from the collected audio data and label it, the related technologies mainly adopt the following two schemes.
[0041] As shown in Figure 1, the first approach includes the following steps:
[0042] S11, the audio data to be cleaned is segmented to obtain multiple audio segments.
[0043] S12: Select audio segments of a set length from the multiple audio segments to obtain the audio segments to be screened.
[0044] S13: Use a classification model to divide the audio segments to be screened into five preset categories.
[0045] The five categories correspond to high-quality audio, audio with noise, audio with background music, audio with reverberation, and audio including dialogue from multiple speakers. The training data for the classification model was constructed using data simulation.
[0046] S14: Select audio clips from the high-quality audio categories out of the five categories as audio training data.
[0047] S15, label the audio training data to obtain labeled audio training data. The audio training data includes high-quality audio segments with text labels.
[0048] The first approach primarily relies on a classification model to categorize audio data, thereby enabling the selection of high-quality audio. However, because the training data for this model is constructed by data simulation, it covers a limited range of audio types and fails to encompass the audio types found in real-world audio data (i.e., audio data collected from internet platforms). Furthermore, the simulated training data differs in its features from real-world audio data; it does not capture the audio characteristics of data collected from internet platforms, so its audio features are insufficient. Therefore, a classification model trained on simulated data has difficulty learning rich audio information, which can degrade its classification performance. To address this, a second approach is provided. As shown in Figure 2, the second approach includes the following steps:
[0049] S21, Adjust the parameters of the audio data to be cleaned to make the parameters (such as audio format, number of channels and sampling rate) consistent.
[0050] Specifically, when adjusting the sampling rate of the audio data to be cleaned, the low sampling rate audio data in the audio data to be cleaned is upsampled.
[0051] S22, using a sound source separation model to process the audio data to be cleaned with adjusted parameters, and extracting human voice audio from the audio data to be cleaned with adjusted parameters.
[0052] S23. Use a speaker log (speaker diarization) model to process the human voice audio and predict the start and end times of each speaker's speech in the human voice audio, obtaining the prediction results.
[0053] For example, suppose the length of a human voice audio is 60 seconds, and the audio includes the voices of 4 speakers. After processing the human voice audio using a speaker log model, it can be predicted that in the 60-second human voice audio, 0-10s corresponds to the voice of speaker A, 10-15s corresponds to the voice of speaker B, 15-25s corresponds to the voice of speaker C, and 25-60s corresponds to the voice of speaker D.
[0054] S24. Based on the prediction results, the human voice audio is segmented to obtain multiple audio segments.
[0055] Continuing with the previous example, after segmenting the human voice audio according to different speakers, we get 4 audio segments, and each audio segment corresponds to a speaker. The 4 audio segments correspond to speaker A, speaker B, speaker C and speaker D respectively.
[0056] S25: Select an audio segment of a set length from multiple audio segments to obtain the audio segment to be selected.
[0057] S26, Label the audio segments to be selected to obtain labeled audio training data. The audio training data includes audio segments with text labels.
[0058] S27. Using an audio quality assessment model, the quality of the labeled audio training data is evaluated to obtain the evaluation results. These evaluation results characterize the quality of audio segments within the audio training data.
[0059] S28. Based on the evaluation results, select high-quality audio segments with text labels from the labeled audio training data.
[0060] The second approach primarily utilizes a neural network model to clean the collected audio data. To ensure the performance of the neural network model during audio data cleaning, the consistency of the audio data input to the model must be guaranteed. This means that before the audio data is input into the neural network model, the parameters of the audio data (e.g., audio format, number of channels, and sampling rate) need to be adjusted to ensure consistency and standardization. In standardizing audio data, low-sampling-rate audio data is typically upsampled to unify the sampling rates of the audio data.
[0061] However, in the second approach, for high-sampling-rate audio data obtained through upsampling, the effective spectral features remain in the low-frequency range, while no effective spectral features exist in the high-frequency range. For example, if the original audio data has a sampling rate of 16kHz, upsampling it yields audio data with a sampling rate of 24kHz. Since the highest frequency of the original audio data is 16 / 2 = 8kHz, the effective spectral features in the original audio data are below the highest frequency of 8kHz. Although the original audio with a sampling rate of 16kHz is upsampled to audio data with a sampling rate of 24kHz, and the highest frequency of the upsampled audio data is 24 / 2 = 12kHz, the effective spectral features of the upsampled audio data are still concentrated below 8kHz. This means that there are no effective spectral features in the frequency range of 8–12kHz in the upsampled audio data. Therefore, when upsampled audio data is used in the training process of a large speech synthesis model, the large speech synthesis model struggles to learn effective features, thus affecting the performance of the large speech synthesis model.
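The missing high-frequency content described above can be checked numerically with the band-energy ratio outlined in paragraph [0018]. The following is a minimal sketch under stated assumptions, not the application's implementation: the two frequency intervals are taken as 0 to 8kHz and 8 to 12kHz to match the 16kHz to 24kHz example, the threshold of 0.01 is hypothetical, and ideally upsampled audio is emulated by synthesizing a 24kHz signal with no content above 8kHz rather than by running a real resampler.

```python
import cmath
import math


def band_energy(samples, sample_rate, f_lo, f_hi):
    """Average power of the DFT bins whose frequency lies in [f_lo, f_hi)."""
    n = len(samples)
    total, count = 0.0, 0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            # Naive DFT of a single bin; adequate for a short test signal.
            x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                      for i, s in enumerate(samples))
            total += abs(x_k) ** 2
            count += 1
    return total / count if count else 0.0


def high_to_low_ratio(samples, sample_rate, split_hz):
    """Ratio of average energy above split_hz to average energy below it."""
    low = band_energy(samples, sample_rate, 0.0, split_hz)
    high = band_energy(samples, sample_rate, split_hz, sample_rate / 2)
    return high / low if low > 0 else 0.0


def tone(freq_hz, sample_rate, n):
    """n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 24000   # target sampling rate from the example above
N = 480               # 20 ms of audio, giving 50 Hz bin spacing
SPLIT_HZ = 8000       # highest frequency of the original 16kHz audio
THRESHOLD = 0.01      # hypothetical ratio threshold

# Natively recorded 24kHz audio: content both below and above 8kHz.
native = [a + 0.5 * b for a, b in zip(tone(3000, SAMPLE_RATE, N),
                                      tone(10000, SAMPLE_RATE, N))]
# Stand-in for 16kHz audio upsampled to 24kHz: nothing above 8kHz.
band_limited = tone(3000, SAMPLE_RATE, N)

ratio_native = high_to_low_ratio(native, SAMPLE_RATE, SPLIT_HZ)
ratio_upsampled = high_to_low_ratio(band_limited, SAMPLE_RATE, SPLIT_HZ)
native_flagged = ratio_native < THRESHOLD        # False: segment is kept
upsampled_flagged = ratio_upsampled < THRESHOLD  # True: segment is filtered out
```

Real recordings would need windowing and averaging over many frames; the exact-bin test tones are used here only to keep the arithmetic easy to follow.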
[0062] Therefore, embodiments of this application provide an audio data processing method, apparatus, and storage medium. During the cleaning process of the audio data to be cleaned, by filtering out upsampled audio segments, the presence of upsampled audio data in the cleaned audio data can be minimized. Using the cleaned and labeled audio data to train a large-scale speech synthesis model allows the model to learn effective feature information, thereby ensuring the performance of the large-scale speech synthesis model.
[0063] In this embodiment, the entity executing the audio data processing method can be an electronic device, such as a mobile phone, tablet computer, laptop, desktop computer, smart interactive flat panel, or other device with audio filtering capabilities. The electronic device integrates multiple algorithm models, which can be used to clean and label the collected audio data.
[0064] As shown in Figure 3, an embodiment of this application provides a method for processing audio data, which can be applied to the above-mentioned electronic device. The method includes the following steps:
[0065] S31, Adjust the parameters of the audio data to be cleaned.
[0066] The audio data to be cleaned is audio data obtained from internet platforms that requires cleaning. After cleaning, high-quality audio data is obtained, which can be used to train large-scale speech synthesis models.
[0067] The parameters of the audio data to be cleaned include at least the audio format and the number of channels. For example, when adjusting the audio format, the audio format can be uniformly adjusted to WAV (Waveform Audio File Format) or MP3 (Moving Picture Experts Group Audio Layer III). This application embodiment does not specifically limit the adjusted audio format. When adjusting the number of channels, the number of channels can be uniformly adjusted to 1, i.e., uniformly adjusted to mono audio data. Of course, the number of channels can also be adjusted to other values according to user needs. This application embodiment does not specifically limit the adjustment of the number of channels.
[0068] S32: Divide the parameter-adjusted audio data to be cleaned into multiple audio segments.
[0069] On an electronic device, training a large speech synthesis model with long-duration audio data can easily exhaust the memory of the graphics card and interrupt training. Therefore, in this embodiment, when cleaning audio data, it is necessary to divide the long-duration audio data to be cleaned into multiple shorter audio segments.
[0070] S33: Select audio segments within a preset duration range from multiple audio segments to obtain the first audio segment set.
[0071] After the audio to be cleaned is segmented into multiple audio segments, the durations of these segments will vary. For audio segments that are too long, if they are used in the training process of a large speech synthesis model, it can still lead to insufficient storage space and computational resources on the electronic device's graphics card, resulting in training interruptions. For audio segments that are too short, the limited speech content they contain can easily lead to incomplete semantics. If semantically incomplete audio segments are used in the training process of a large speech synthesis model, the model will struggle to capture complete and accurate features, thus affecting the model's learning effect and performance. Therefore, in this embodiment, it is necessary to select audio segments within a preset duration range from the multiple audio segments to ensure that the duration of the audio segments is neither too long nor too short.
[0072] The preset duration range can be, for example, 4 to 20 seconds. With this range, every audio segment in the first audio segment set obtained by the selection has a duration between 4 and 20 seconds.
[0073] S34, select audio segments from the first audio segment set that meet the first preset condition to obtain the second audio segment set.
[0074] The first preset condition for an audio segment is that it contains no noise, background music, or reverberation, meaning it is a high-quality audio segment. Since training a large speech synthesis model with high-quality audio data yields better training results, it is necessary to select high-quality audio segments from the first audio segment set.
[0075] S35, filter out audio segments in the second audio segment set that meet the second preset condition to obtain the third audio segment set.
[0076] The audio segment that meets the second preset condition refers to an audio segment that includes the speech of multiple speakers. Audio segments that include the speech of multiple speakers can easily interfere with the training of the model, affecting the model's learning effect and thus its performance. Therefore, in this embodiment of the application, when cleaning the audio data, it is necessary to filter out audio segments that include the speech of multiple speakers, so that the audio segments in the third audio segment set only include the speech of one speaker.
[0077] S36, filter out the upsampled audio segments in the third audio segment set to obtain the target audio segment set.
[0078] S37, label the target audio segments in the target audio segment set to obtain the target audio data.
[0079] Using the audio data processing method provided in this application, the electronic device makes the parameters of the audio data to be cleaned consistent by adjusting them. The electronic device then divides the parameter-consistent audio data into multiple audio segments and selects the audio segments within a preset duration range to obtain a first audio segment set. From the first audio segment set, the electronic device selects the high-quality audio segments that meet a first preset condition to obtain a second audio segment set. The electronic device then removes the audio segments that meet the second preset condition from the second audio segment set, that is, deletes audio segments that include the speech of multiple speakers, to obtain a third audio segment set, and removes the upsampled audio segments from the third audio segment set to obtain a target audio segment set, thereby completing the cleaning of the audio data to be cleaned. The target audio data obtained by labeling the target audio segments in the target audio segment set can be used to train a large speech synthesis model.
[0080] In this embodiment, since upsampled audio segments in the third audio segment set are removed during the cleaning process, the cleaned target audio segment set contains as little upsampled audio data as possible, so the target audio segments contain more feature information. Thus, training a large-scale speech synthesis model using the labeled target audio segments in the target audio segment set allows the model to learn effective feature information, thereby ensuring the performance of the large-scale speech synthesis model.
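The order of the filtering steps S33 to S36 can be sketched as a chain of filters. This is only an illustrative outline: the `Segment` fields are hypothetical stand-ins for the outputs of the audio classification model, the speaker log model, and the upsampling check, and the 4 to 20 second range is the example range given above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    duration_s: float
    high_quality: bool    # stand-in for the audio classification result
    speaker_turns: list   # stand-in for (speaker, start_s, end_s) predictions
    upsampled: bool       # stand-in for the upsampling check result


def clean(segments, min_s=4.0, max_s=20.0):
    """Apply the S33 to S36 filters in order and return the target set."""
    # S33: keep segments whose duration is within the preset range.
    first = [s for s in segments if min_s <= s.duration_s <= max_s]
    # S34: keep segments that meet the first preset condition (high quality).
    second = [s for s in first if s.high_quality]
    # S35: drop segments in which more than one distinct speaker talks.
    third = [s for s in second
             if len({turn[0] for turn in s.speaker_turns}) <= 1]
    # S36: drop upsampled segments.
    return [s for s in third if not s.upsampled]


# Hypothetical segments, each failing at a different step except the last.
segments = [
    Segment("a", 2.0, True, [("A", 0, 2)], False),                 # too short
    Segment("b", 10.0, False, [("A", 0, 10)], False),              # low quality
    Segment("c", 12.0, True, [("A", 0, 5), ("B", 5, 12)], False),  # two speakers
    Segment("d", 8.0, True, [("A", 0, 8)], True),                  # upsampled
    Segment("e", 15.0, True, [("A", 0, 15)], False),               # passes all
]
target_names = [s.name for s in clean(segments)]  # -> ["e"]
```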
[0081] Optionally, in step S32 above, the audio data to be cleaned with adjusted parameters is divided into multiple audio segments, including: obtaining the silent segments in the audio data to be cleaned with adjusted parameters; determining whether the silent segment is a silent segment to be segmented based on the duration of the silent segment; and segmenting the silent segment from a preset position to divide the audio data to be cleaned with adjusted parameters into multiple audio segments.
[0082] The parameter-adjusted audio data to be cleaned contains silent segments and speech segments. A silent segment is an audio segment that contains no speaker's voice, while a speech segment is an audio segment that contains a speaker's voice.
[0083] In this implementation, when segmenting the audio data to be cleaned, segmentation is performed starting from the silent segments of the audio data. However, errors may occur during segmentation; segmenting from a short silent segment might deviate from the silent segment and result in segmentation from an adjacent speech segment, thus affecting the integrity of the adjacent speech segment. Therefore, when segmenting the audio data to be cleaned, it is necessary to select a silent segment with a longer duration for segmentation.
[0084] By using a duration threshold, it's possible to determine which silent segments in the audio data to be cleaned are long enough to be segmented (i.e., segments to be segmented). Specifically, if the duration of a silent segment exceeds the duration threshold, it indicates that the segment is long enough to be segmented. For example, the duration threshold can be set to 600ms; if a silent segment's duration is greater than 600ms, it's considered long enough to be segmented.
[0085] After identifying longer silence segments in the audio data to be cleaned, appropriate positions can be selected within these segments for segmentation. Specifically, the middle position of a longer silence segment is relatively far from the adjacent speech segments; therefore, segmentation can be performed from the middle position of the longer silence segment.
[0086] For example, suppose the audio data to be segmented is 10 seconds long, consisting of a 5000ms speech segment a, a 400ms silence segment b, a 600ms speech segment c, a 700ms silence segment d, and a 3300ms speech segment e, with a duration threshold of 600ms. It can be seen that the audio data to be segmented includes two silence segments, namely silence segment b and silence segment d. Since the duration of silence segment b is less than the duration threshold, and the duration of silence segment d is greater than the duration threshold, the audio data to be segmented is segmented starting from silence segment d. Specifically, the audio data to be segmented can be segmented from the middle position of silence segment d (i.e., at 350ms of silence segment d), resulting in two audio segments: one with a duration of 6350ms and the other with a duration of 3650ms.
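The worked example above can be reproduced with a short sketch. It assumes the audio has already been described as an ordered list of labeled parts; that representation and the function name are chosen for illustration and are not part of the application.

```python
def split_at_long_silences(parts, threshold_ms=600):
    """Split at the midpoint of every silence longer than threshold_ms.

    parts: list of (kind, duration_ms) tuples in time order, where kind is
    "speech" or "silence". Returns the durations of the resulting segments.
    """
    cuts = []
    elapsed = 0
    for kind, duration in parts:
        if kind == "silence" and duration > threshold_ms:
            cuts.append(elapsed + duration / 2)  # middle of the long silence
        elapsed += duration
    bounds = [0] + cuts + [elapsed]
    return [end - start for start, end in zip(bounds, bounds[1:])]


parts = [
    ("speech", 5000),   # a
    ("silence", 400),   # b: not longer than the threshold, no cut
    ("speech", 600),    # c
    ("silence", 700),   # d: longer than the threshold, cut at its 350 ms point
    ("speech", 3300),   # e
]
durations = split_at_long_silences(parts)  # -> [6350.0, 3650.0]
```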
[0087] It is understandable that the audio data to be cleaned may contain multiple long silent segments. After these long silent segments are split, multiple audio clips can be formed.
[0088] Training a large speech synthesis model with incomplete audio segments results in the model acquiring incomplete features, affecting its learning effectiveness and performance. Therefore, in this implementation, when segmenting the audio data to be cleaned, the segmentation is performed from the longer silent segments to minimize the number of incomplete audio segments in the segmented data. This approach aims to prevent incomplete audio segments from participating in the training process of the large speech synthesis model, thereby ensuring the model's learning effectiveness and performance.
[0089] Furthermore, the process of obtaining silent segments from the adjusted audio data to be cleaned includes: obtaining audio segments of a set duration from the adjusted audio data each time; determining that an audio segment is silent when its energy is below an energy threshold; and merging temporally consecutive silent audio segments from the adjusted audio data to be cleaned to obtain at least one silent segment.
[0090] Audio segment energy refers to the power or intensity carried by an audio signal within a certain time period, reflecting the physical performance of the audio signal at a specific moment or time interval. The energy of a speech audio segment is higher than that of a silent audio segment. Based on this energy difference between speech and silent audio segments, it is possible to determine whether an audio segment is a silent audio segment.
[0091] Specifically, an audio segment of a set duration (e.g., 100ms) can be obtained from the audio data to be cleaned each time, and it can be determined whether that audio segment is a silent audio segment. If the energy of the audio segment is less than the energy threshold, the audio segment is determined to be a silent audio segment. Multiple audio segments are obtained from the audio data to be cleaned in this way, and after every audio segment has been checked, multiple silent audio segments are obtained. The temporally consecutive silent audio segments can then be merged into one silent segment, thereby obtaining at least one silent segment.
[0092] Continuing with the previous example, suppose the audio data to be segmented is 10 seconds long, comprising a 5000ms speech segment a, a 400ms silence segment b, a 600ms speech segment c, a 700ms silence segment d, and a 3300ms speech segment e, and that the set duration is 100ms. From this data, 100 audio segments of 100ms each can be obtained. After each one is checked, 11 silent audio segments and 89 speech audio segments are obtained. Among the 11 silent audio segments, 4 consecutive ones are merged into the 400ms silence segment b, and 7 consecutive ones are merged into the 700ms silence segment d.
[0093] In this embodiment, when acquiring silent segments from the audio data to be cleaned, short audio segments are obtained and each one is checked for silence. Since the energy of a short audio segment is relatively stable, whether it is silent can be accurately determined from that energy. Thus, once the silent audio segments in the audio data to be cleaned have been accurately identified, the silent segments formed by merging consecutive silent audio segments can also be accurately obtained.
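The frame-by-frame check and the merging of consecutive silent frames might be sketched as below. The application does not define how energy is computed or what the energy threshold is, so the mean squared amplitude, the default `energy_threshold`, the function names, and the toy 1000Hz sampling rate are all assumptions made for the illustration.

```python
from typing import List, Tuple


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame (an assumed energy definition)."""
    return sum(x * x for x in frame) / len(frame)


def find_silent_segments(samples: List[float], sample_rate: int,
                         frame_ms: int = 100,
                         energy_threshold: float = 1e-4) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each run of consecutive silent frames.
    A trailing partial frame, if any, is ignored in this sketch."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments = []
    run_start = None  # start time (ms) of the current silent run, if any
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        silent = frame_energy(frame) < energy_threshold
        if silent and run_start is None:
            run_start = i * frame_ms
        elif not silent and run_start is not None:
            segments.append((run_start, i * frame_ms))
            run_start = None
    if run_start is not None:  # audio ends inside a silent run
        segments.append((run_start, n_frames * frame_ms))
    return segments


# Toy signal at 1000 Hz (one sample per ms): constant 0.5 stands in for speech.
signal = [0.5] * 5000 + [0.0] * 400 + [0.5] * 600 + [0.0] * 700 + [0.5] * 3300
print(find_silent_segments(signal, sample_rate=1000))  # [(5000, 5400), (6000, 6700)]
```

The two runs returned correspond to the 400ms silence segment b and the 700ms silence segment d of the example.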
[0094] It's worth noting that when extracting silence segments from the audio data to be cleaned, a Voice Activity Detection (VAD) model can be used. The VAD model leverages the energy difference between speech and silence segments to predict the start and end times of silence and speech segments in the audio data to be cleaned. Based on the start and end times of the silence segments, the silence segments can be further extracted from the audio data to be cleaned.
[0095] Optionally, in step S34 above, selecting audio segments from the first audio segment set that meet the first preset condition to obtain the second audio segment set includes: using a pre-trained audio classification model to classify the audio segments in the first audio segment set to obtain a first-class audio segment set and a second-class audio segment set; and selecting, from the first-class and second-class audio segment sets, the set with the higher audio quality as the second audio segment set.
[0096] In this embodiment, the role of the audio classification model is to select audio segments that meet preset conditions from the first audio segment set, that is, to select high-quality audio segments from the first audio segment set so as to train a large speech synthesis model using high-quality audio segments.
[0097] Specifically, the audio classification model can employ a binary classification model. This model categorizes the audio data to be filtered into high-quality and low-quality audio data. For example, the binary classification model can label audio data with binary labels (0 or 1), using 1 to identify high-quality audio data and 0 to identify low-quality audio data. After inputting the audio data to be filtered into the binary classification model, it can calculate the probability that the audio data belongs to low-quality audio and the probability that it belongs to high-quality audio, with the sum of these probabilities being 1. When the probability that audio data belongs to high-quality audio exceeds 50%, it means that the probability of audio data belonging to low-quality audio is less than 50%, i.e., the probability of belonging to high-quality audio is greater than the probability of belonging to low-quality audio. In this case, the audio data can be labeled with the high-quality label 1, thus achieving the classification of the audio data.
[0098] Using the aforementioned binary classification model, the audio segments in the first audio segment set are classified to obtain a first-class audio segment set and a second-class audio segment set. The first-class audio segment set contains high-quality audio segments, while the second-class audio segment set contains low-quality audio segments. The first-class audio segment set can then be used as the second audio segment set.
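The labeling rule can be illustrated with a short sketch. The classifier itself is not shown; the dictionary of high-quality probabilities, the segment ids, and the function name `partition_by_quality` are hypothetical. Treating a probability of exactly 50% as low quality is also an assumption, since the description only covers the case where the probability exceeds 50%.

```python
from typing import Dict, List, Tuple


def partition_by_quality(p_high: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split segment ids into (first-class, second-class) sets.
    Label 1 (high quality) is assigned when p_high > 0.5, label 0 otherwise."""
    first_class = [seg for seg, p in p_high.items() if p > 0.5]
    second_class = [seg for seg, p in p_high.items() if p <= 0.5]
    return first_class, second_class


# Hypothetical classifier outputs: probability that each segment is high quality.
probs = {"seg_a": 0.92, "seg_b": 0.31, "seg_c": 0.50}
high, low = partition_by_quality(probs)
print(high)  # ['seg_a']
print(low)   # ['seg_b', 'seg_c']
```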
[0099] In this embodiment, the binary classification model only needs to focus on high-quality audio and low-quality audio. Compared with the first scheme in the related technology, which uses a classification model to classify audio into five categories, this embodiment reduces the number of categories, thereby reducing the complexity of training data screening and labeling, thus reducing the learning difficulty in the training process of the audio classification model, improving the training effect of the model, and enabling the model to screen out high-quality audio data.
[0100] Furthermore, as shown in Figure 4, training the audio classification model includes the following steps:
[0101] S41, Obtain audio training data and audio test data from the first audio segment set.
[0102] In this process, a first preset number of audio segments are randomly selected from a first set of audio segments as audio training data. The total duration of the first preset number of audio segments in the audio training data can be 4000 hours. A second preset number of audio segments are randomly selected from the first set of audio segments as audio test data. The second preset number of audio segments can be 10,000 audio segments with a total duration of 10 hours. Since the amount of audio test data is relatively small, and to ensure the accuracy of the audio test data labels, manual annotation can be used.
[0103] S42, use an audio quality prediction model to evaluate the quality of audio segments in the audio training data, and obtain the quality score of the audio segments in the audio training data.
[0104] The audio quality prediction model is used to evaluate the quality of audio segments in the audio training data and generate a quality score for each segment. The quality score ranges from 0 to 5, with a higher score indicating better quality. Audio quality prediction models can include DNSMOS (Deep Noise Suppression Mean Opinion Score) and NISQA (Non-Intrusive Speech Quality Assessment) models, among others.
[0105] S43: Select audio segments with quality scores higher than the first quality score threshold from the audio training data and use them as the first audio sample data.
[0106] The first quality score threshold can be 4. Audio segments with a quality score higher than 4 in the audio training data can be considered as high-quality audio segments and used as the first audio sample data.
[0107] S44: Select audio segments with quality scores lower than the second quality score threshold from the audio training data and use them as the second audio sample data.
[0108] The second quality score threshold can be 2. Audio segments with a quality score lower than 2 in the audio training data can be considered as low-quality audio segments and used as the second audio sample data.
[0109] S45: Input the first audio sample data and the second audio sample data into the audio classification model to train the audio classification model.
[0110] Here, the first audio sample data consists of high-quality audio segments and therefore reflects the characteristics of high-quality audio, while the second audio sample data consists of low-quality audio segments and therefore reflects the characteristics of low-quality audio. Training the audio classification model on both lets the model learn what distinguishes high-quality from low-quality audio, which improves its ability to tell the two apart and thus the accuracy of selecting high-quality audio segments from the first audio segment set.
[0111] S46: During training, when verification against the labeled audio test data shows that the audio classification model meets the training completion conditions, training is complete and a trained audio classification model is obtained.
[0112] When training the audio classification model, the test data is used to verify whether the model meets the training completion conditions and thus whether training is complete. These conditions can be that the accuracy of the audio classification model reaches an accuracy threshold and its recall reaches a recall threshold. Accuracy is the proportion of correctly predicted high-quality audio segments. Recall is the proportion of correctly predicted high-quality audio segments out of all high-quality audio segments in the test data. For example, the condition can be that both accuracy and recall reach 80%. Thus, if the audio classification model achieves at least 80% accuracy and 80% recall on the test data during training, training is considered complete and a trained audio classification model is obtained.
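Steps S43, S44, and S46 can be sketched together as follows. The quality scores would come from a model such as DNSMOS or NISQA, which is not invoked here; the scores, ids, and function names are hypothetical. The sketch also assumes that "accuracy" is measured over the segments predicted as high-quality (what is often called precision), which is one reading of the definition above rather than something the description states explicitly.

```python
from typing import Dict, List, Tuple


def select_training_samples(scores: Dict[str, float],
                            high_threshold: float = 4.0,
                            low_threshold: float = 2.0) -> Tuple[List[str], List[str]]:
    """S43/S44: ids scoring above high_threshold form the first (high-quality)
    sample data, ids scoring below low_threshold the second (low-quality)."""
    first = [seg for seg, q in scores.items() if q > high_threshold]
    second = [seg for seg, q in scores.items() if q < low_threshold]
    return first, second


def precision_recall(predicted: List[int], actual: List[int]) -> Tuple[float, float]:
    """Precision and recall for label 1 (high quality)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    predicted_pos = sum(predicted)
    actual_pos = sum(actual)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def training_complete(predicted: List[int], actual: List[int],
                      threshold: float = 0.8) -> bool:
    """S46: training is complete when both metrics reach the threshold."""
    precision, recall = precision_recall(predicted, actual)
    return precision >= threshold and recall >= threshold


scores = {"a": 4.3, "b": 3.1, "c": 1.2, "d": 4.0}  # hypothetical 0-5 scores
print(select_training_samples(scores))  # (['a'], ['c'])

actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]     # manual labels on the test data
predicted = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # hypothetical model predictions
print(precision_recall(predicted, actual))   # (0.8, 0.8)
print(training_complete(predicted, actual))  # True
```

Segments scoring between the two thresholds (such as "b" and "d") are left out of both sample sets, so the classifier is trained only on clearly high-quality and clearly low-quality examples.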
[0113] In this embodiment, after sufficient training, the audio classification model can accurately filter out high-quality audio data. Compared to the first approach in related technologies, which uses simulated training data to train the classification model, in this embodiment, since the first and second audio sample data used to train the audio classification model are derived from a first set of audio segments (i.e., real audio data), they can contain audio features of the real audio data. Thus, training the audio classification model using the first and second audio sample data allows the model to learn information from the real audio data; that is, the audio classification model can learn rich audio information during training, thereby improving its performance and ultimately increasing the accuracy of filtering high-quality audio data.
[0114] Furthermore, in the second approach of the related technology, the human voice audio extracted using the sound source separation model still contains some noise. However, in the embodiments of this application, the audio classification model, after sufficient training, can accurately filter out high-quality audio data, ensuring that the filtered audio data is as free as possible from noise, background music, and reverberation, thereby extracting higher-quality audio data.
[0115] Optionally, in step S35 above, filtering out audio segments in the second audio segment set that meet the second preset condition to obtain the third audio segment set includes: inputting audio segments from the second audio segment set into the speaker log model, causing the speaker log model to output the start and end times of the speaker's speech; and filtering out audio segments in the second audio segment set that include multiple speaker speech segments based on the start and end times of the speaker's speech to obtain the third audio segment set.
[0116] Here, the speaker log model (commonly known as a speaker diarization model) is a model for processing multi-speaker dialogue audio. It identifies and records the time periods during which different speakers are active in an audio segment, that is, it answers the question of "who spoke when".
[0117] It should be noted that the first audio segment set includes not only audio segments containing noise, background music, and reverberation, but also audio segments containing multiple speakers. However, the audio classification model can only select high-quality audio segments from the first audio segment set (that is, audio segments without noise, background music, or reverberation) and cannot filter out audio segments containing multiple speakers. The second audio segment set may therefore still contain audio segments with multiple speakers, and since such segments are not suitable for training a large speech synthesis model, they need to be filtered out. Specifically, a speaker log model can be used to analyze the audio segments in the second audio segment set. The speaker log model outputs the start and end times of each speaker's speech, and based on these times it can be determined whether an audio segment in the second audio segment set contains the speech of multiple speakers, so that the audio segments containing multiple speakers can be filtered out.
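A minimal sketch of this filtering step is given below. The speaker log model is not run here; its output is assumed to be a list of `(speaker_label, start_s, end_s)` turns per audio segment, and that format, the labels, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed output format of the speaker log model: (speaker_label, start_s, end_s).
Turn = Tuple[str, float, float]


def has_multiple_speakers(turns: List[Turn]) -> bool:
    """True when the turns of one audio segment involve more than one speaker."""
    return len({speaker for speaker, _, _ in turns}) > 1


def filter_single_speaker(diarization: Dict[str, List[Turn]]) -> List[str]:
    """Keep only the ids of audio segments with at most one speaker."""
    return [seg for seg, turns in diarization.items()
            if not has_multiple_speakers(turns)]


second_set = {
    "seg_1": [("spk0", 0.0, 4.2)],
    "seg_2": [("spk0", 0.0, 2.0), ("spk1", 2.1, 5.0)],
    "seg_3": [("spk0", 0.0, 1.5), ("spk0", 2.0, 3.8)],
}
print(filter_single_speaker(second_set))  # ['seg_1', 'seg_3']
```

Here "seg_3" is kept because its two turns belong to the same speaker, while "seg_2" is removed because two different speakers are active.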
[0118] In this embodiment, by using the speaker log model to further filter the audio segments in the second audio segment set, the third audio segment set can be made to contain as few audio segments as possible that correspond to the speech of multiple speakers, thereby obtaining audio data that is more suitable for the speech synthesis model and improving the training effect of the speech synthesis model.
[0119] Optionally, in step S36 above, filtering out upsampled audio segments from the third audio segment set to obtain the target audio segment set includes: for each audio segment in the third audio segment set, obtaining the first average spectral energy of the audio segment within a first frequency range and the second average spectral energy of the audio segment within a second frequency range, where the frequencies in the first frequency range are lower than the frequencies in the second frequency range; calculating the ratio of the second average spectral energy to the first average spectral energy; and filtering out, from the third audio segment set, the audio segments whose ratio is less than a ratio threshold to obtain the target audio segment set.
[0120] In this embodiment, since there are no effective spectral features in the high-frequency range of the upsampled audio segment, and the effective spectral features are all concentrated in the low-frequency range, the upsampled audio segment has the characteristic that the spectral energy in the high-frequency range is close to 0 and the spectral energy in the low-frequency range is relatively large. Based on this characteristic, it can be determined whether the audio segment is an upsampled audio segment.
[0121] For example, suppose that in the third set of audio clips, the highest frequency of a certain audio clip is 8kHz. Let the first frequency range be the low-frequency range of 0–4kHz, and the second frequency range be the high-frequency range of 4–8kHz, with a ratio threshold of 0.05. If the calculated ratio of the average spectral energy of the high-frequency range to the average spectral energy of the low-frequency range is less than 0.05 and approaches 0, it indicates that the spectral energy of the high-frequency range is close to 0, and that the high-frequency range of this audio clip lacks effective spectral characteristics; that is, the audio clip is an upsampled audio clip.
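The ratio test in this example might be sketched as follows. To stay self-contained, the sketch uses a naive discrete Fourier transform on a short frame; a practical implementation would more likely use an FFT library. The function names, the 16kHz sampling rate (giving an 8kHz highest frequency), the synthetic test tones, and the handling of an all-zero signal are assumptions made for the illustration.

```python
import cmath
import math
from typing import List


def power_spectrum(samples: List[float]) -> List[float]:
    """|X[k]|^2 for k = 0..N/2 using a naive O(N^2) DFT (short frames only)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def high_to_low_energy_ratio(samples: List[float], sample_rate: int,
                             split_hz: float) -> float:
    """Average spectral energy at or above split_hz divided by that below it.
    Assumes 0 < split_hz <= sample_rate / 2 so both bands are non-empty."""
    n = len(samples)
    power = power_spectrum(samples)
    low = [p for k, p in enumerate(power) if k * sample_rate / n < split_hz]
    high = [p for k, p in enumerate(power) if k * sample_rate / n >= split_hz]
    low_mean = sum(low) / len(low)
    high_mean = sum(high) / len(high)
    if low_mean == 0.0:
        return float("inf")  # assumption: degenerate input is not flagged
    return high_mean / low_mean


def looks_upsampled(samples: List[float], sample_rate: int,
                    split_hz: float = 4000.0,
                    ratio_threshold: float = 0.05) -> bool:
    return high_to_low_energy_ratio(samples, sample_rate, split_hz) < ratio_threshold


SR, N = 16000, 256  # highest representable frequency is SR / 2 = 8 kHz
narrow = [math.sin(2 * math.pi * 1000 * t / SR) for t in range(N)]
wide = [math.sin(2 * math.pi * 1000 * t / SR)
        + math.sin(2 * math.pi * 6000 * t / SR) for t in range(N)]
print(looks_upsampled(narrow, SR))  # True: no energy in the 4-8 kHz range
print(looks_upsampled(wide, SR))    # False: the 6 kHz tone fills the high range
```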
[0122] In this embodiment, by judging the magnitude of the spectral energy in the high-frequency range of the audio segment, it is possible to indirectly determine whether there are effective spectral features in the high-frequency range of the audio segment, thereby accurately determining whether the audio segment is an upsampled audio segment, so as to accurately filter out upsampled audio segments in the audio data.
[0123] It is understandable that during audio segment transmission, if the audio segment is compressed, it will exhibit mid-to-high frequency distortion, meaning the spectral energy in the mid-to-high frequency range will be lower than that in the low-frequency range. Therefore, by implementing the above-described method for filtering upsampled audio segments, mid-to-high frequency distorted audio segments can also be identified and filtered out, thereby minimizing the presence of such distorted segments in the audio data.
[0124] Optionally, before step S36 above, that is, before filtering out upsampled audio segments in the third audio segment set, audio segments in the third audio segment set with a sampling rate lower than the sampling rate threshold can also be filtered out.
[0125] In this implementation, considering that the third audio segment set may include low-sampling-rate audio segments, which contain less feature information and are not suitable for training large speech synthesis models, it is necessary to filter out low-sampling-rate audio segments from the third audio segment set. Specifically, if the sampling rate of an audio segment in the third audio segment set is lower than a sampling rate threshold (e.g., 24kHz), it indicates that the sampling rate of that audio segment is low, and such low-sampling-rate audio segments need to be filtered out.
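As a small illustration, the sampling-rate filter reduces to a single comparison per audio segment. The dictionary of sampling rates and the function name are hypothetical; the 24kHz default follows the example threshold given above.

```python
from typing import Dict, List


def drop_low_sample_rate(sample_rates: Dict[str, int],
                         threshold_hz: int = 24000) -> List[str]:
    """Keep the ids of segments whose sampling rate is not below threshold_hz."""
    return [seg for seg, rate in sample_rates.items() if rate >= threshold_hz]


rates = {"seg_1": 16000, "seg_2": 24000, "seg_3": 44100}  # hypothetical values
print(drop_low_sample_rate(rates))  # ['seg_2', 'seg_3']
```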
[0126] It should be noted that the step of filtering out low-sampling-rate audio segments from the third audio segment is not limited to being performed before step S36, but can also be performed after step S36. The specific execution order does not affect the implementation of the embodiments of this application. Therefore, the embodiments of this application do not impose specific limitations.
[0127] In this implementation, by filtering out low-sampling-rate audio segments, the audio data can be kept as free of such segments as possible, ensuring that the audio segments in the data are as uniform as possible with high sampling rates. Using high-sampling-rate audio segments to train a large speech synthesis model can improve the model's training performance.
[0128] Optionally, step S37 above, namely, labeling the target audio segments in the target audio segment set, includes: using a natural speech recognition model to identify the target audio segments in the target audio segment set, so as to generate text content corresponding to the target audio segments. The perplexity of the text content corresponding to the target audio segments is calculated. When the perplexity of the text content corresponding to the target audio segments is higher than a perplexity threshold, the target audio segments are removed from the target audio segment set.
[0129] The perplexity of text content measures the accuracy of its grammatical structure and common grammatical expressions. A lower perplexity indicates more accurate grammatical structure and expressions.
[0130] In this implementation, after cleaning the audio data to be screened to obtain a set of target audio segments containing high-quality target audio fragments, the high-quality target audio fragments in the target audio segment set can be further labeled to facilitate the subsequent training of a large speech synthesis model using the labeled target audio fragments. Specifically, a natural speech recognition model can be used to recognize the text content of the target audio fragments, thus obtaining target audio fragments with text content.
[0131] Even when the target audio segment itself is of high quality, the recognized text content may be of poor quality. For example, if the target audio segment is spoken in a dialect or the speaker's pronunciation is unclear, the resulting text content may contain grammatically incorrect sentences. Therefore, in this embodiment, the perplexity of the text content corresponding to each target audio segment in the target audio segment set can be calculated. A low perplexity indicates that the sentences in the text content conform to grammatical rules. When the perplexity of the text content corresponding to a target audio segment is higher than a perplexity threshold (e.g., 180), the sentences in the text content are considered not grammatically correct, and the target audio segment should be removed from the target audio segment set.
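The perplexity check can be sketched as below. The application does not say which language model produces the perplexity, so the sketch assumes that per-token natural-log probabilities of the recognized text are already available; the input values, ids, and function names are hypothetical. Perplexity is computed in the usual way, as the exponential of the negative mean log probability.

```python
import math
from typing import Dict, List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def keep_low_perplexity(log_probs_by_segment: Dict[str, List[float]],
                        threshold: float = 180.0) -> List[str]:
    """Remove segments whose transcript perplexity is higher than the threshold."""
    return [seg for seg, log_probs in log_probs_by_segment.items()
            if perplexity(log_probs) <= threshold]


# Hypothetical per-token log probabilities from some language model.
transcripts = {
    "seg_1": [math.log(0.1)] * 5,    # perplexity of about 10
    "seg_2": [math.log(0.001)] * 5,  # perplexity of about 1000
}
print(keep_low_perplexity(transcripts))  # ['seg_1']
```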
[0132] In this implementation, by filtering out audio segments from the target audio segment set whose labeled text content is not standardized, the quality of the audio segments in the target audio segment set can be further guaranteed. When the quality of all audio segments in the target audio segment set is high, training a large speech synthesis model using these audio segments can improve the model's training performance.
[0133] It should also be noted that the second approach in the related technology involves first annotating the audio segments and then evaluating and selecting high-quality audio segments. However, in this embodiment, the approach is to first select high-quality audio segments and then annotate them. In comparison, because the audio segments are pre-selected before annotation, only high-quality audio segments need to be annotated, resulting in a smaller amount of data for the annotated audio segments. This saves time on audio segment annotation and improves efficiency.
[0134] As shown in Figure 5, this application provides an audio data processing apparatus 500, including a processor 501 and a memory 502. Optionally, the apparatus 500 may further include a communication interface 503 and a bus 504. The processor 501, communication interface 503, and memory 502 can communicate with each other via the bus 504. The communication interface 503 can be used for information transmission. The processor 501 can call logical instructions in the memory 502 to execute the audio data processing method described in the above embodiment.
[0135] Furthermore, the logic instructions in the aforementioned memory 502 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0136] The memory 502, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this application. The processor 501 executes functional applications and data processing by running the program instructions / modules stored in the memory 502, that is, it implements the audio data processing method in the above embodiments.
[0137] The memory 502 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 502 may include high-speed random access memory and may also include non-volatile memory.
[0138] This application provides a storage medium storing computer-executable instructions configured to execute the audio data processing method described in the above embodiments.
[0139] The aforementioned storage medium can be a transient computer-readable storage medium or a non-transitory computer-readable storage medium.
[0140] The technical solutions of this application embodiment can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the method described in this application embodiment. The aforementioned storage medium can be a non-transitory storage medium, including: USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, and other media capable of storing program code; it can also be a transient storage medium.
[0141] The foregoing description and accompanying drawings fully illustrate embodiments of this disclosure to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operation may vary. Parts and features of some embodiments may be included in or replace parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of embodiments and claims, the singular forms "a," "an," and "the" are intended to equally include the plural forms unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application means including one or more of the associated listed items and all possible combinations thereof. Additionally, when used in this application, the term "comprise" and its variations "comprises" and/or "comprising" refer to the presence of stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Without further limitations, an element defined by the phrase "comprises a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element. In this document, each embodiment may focus on its differences from other embodiments, and similar or identical parts of the embodiments can be cross-referenced. For methods, products, etc., disclosed in the embodiments, where they correspond to the method section disclosed in the embodiments, reference can be made to the description of the method section for the relevant parts.
[0142] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0143] The methods and products (including but not limited to devices and equipment) disclosed in the embodiments herein can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For instance, the division of units may be merely a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to implement this embodiment according to actual needs. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
[0144] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. In the descriptions corresponding to the flowcharts and block diagrams in the accompanying drawings, the operations or steps corresponding to different blocks may also occur in a different order than disclosed in the description; sometimes there is no specific order between different operations or steps. For example, two consecutive operations or steps may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Each block in a block diagram and / or flowchart, and combinations of blocks in a block diagram and / or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
Claims
1. A method for processing audio data, characterized in that it is applied to an electronic device and comprises:
adjusting the parameters of the audio data to be cleaned;
dividing the parameter-adjusted audio data to be cleaned into multiple audio segments;
selecting, from the multiple audio segments, audio segments within a preset duration range to obtain a first audio segment set;
selecting audio segments that meet a first preset condition from the first audio segment set to obtain a second audio segment set, wherein the audio segments that meet the first preset condition are high-quality audio segments;
filtering out audio segments that meet a second preset condition from the second audio segment set to obtain a third audio segment set, wherein the audio segments that meet the second preset condition are audio segments that include the voices of multiple speakers;
filtering out upsampled audio segments from the third audio segment set to obtain a target audio segment set; and
labeling the target audio segments in the target audio segment set to obtain target audio data.
2. The method according to claim 1, characterized in that dividing the parameter-adjusted audio data to be cleaned into multiple audio segments comprises:
obtaining silent segments from the parameter-adjusted audio data to be cleaned;
determining, based on the duration of a silent segment, whether the silent segment is a silent segment to be segmented; and
when the silent segment is a silent segment to be segmented, segmenting at a preset position of the silent segment to be segmented, so as to divide the parameter-adjusted audio data to be cleaned into multiple audio segments.
3. The method according to claim 2, characterized in that obtaining silent segments from the parameter-adjusted audio data to be cleaned comprises: obtaining audio segments of a set duration from the parameter-adjusted audio data to be cleaned; when the energy of an audio segment is lower than an energy threshold, determining the audio segment to be a silent audio segment; and merging time-continuous silent audio segments in the parameter-adjusted audio data to be cleaned to obtain at least one silent segment.
4. The method according to claim 1, characterized in that selecting, from the first audio segment set, audio segments that meet the first preset condition to obtain the second audio segment set comprises: classifying the audio segments in the first audio segment set using a pre-trained audio classification model to obtain a first-class audio segment set and a second-class audio segment set; and selecting, from the first-class audio segment set and the second-class audio segment set, the set whose audio segments are of high quality, and using that set as the second audio segment set.
5. The method according to claim 1, characterized in that filtering out, from the second audio segment set, audio segments that meet the second preset condition to obtain the third audio segment set comprises: inputting the audio segments in the second audio segment set into a speaker diarization model, so that the speaker diarization model outputs the start and end times of the speakers' speech; and filtering out, based on the start and end times of the speakers' speech, audio segments that include speech from multiple speakers from the second audio segment set to obtain the third audio segment set.
6. The method according to claim 1, characterized in that filtering out upsampled audio segments from the third audio segment set to obtain the target audio segment set comprises: for an audio segment in the third audio segment set, obtaining a first average spectral energy of the audio segment in a first frequency range and a second average spectral energy of the audio segment in a second frequency range, wherein the frequencies of the first frequency range are lower than the frequencies of the second frequency range; calculating the ratio of the second average spectral energy to the first average spectral energy; and filtering out, from the third audio segment set, audio segments whose ratio is less than a ratio threshold to obtain the target audio segment set.
7. The method according to claim 1 or 6, characterized in that, before filtering out upsampled audio segments from the third audio segment set, the method further comprises: filtering out, from the third audio segment set, audio segments whose sampling rate is lower than a sampling rate threshold.
8. The method according to claim 1, characterized in that labeling the target audio segments in the target audio segment set to obtain the target audio data comprises: recognizing the target audio segments in the target audio segment set using a natural speech recognition model to generate text content corresponding to the target audio segments; calculating the perplexity of the text content corresponding to a target audio segment; and when the perplexity of the text content corresponding to the target audio segment is higher than a perplexity threshold, filtering the target audio segment out of the target audio segment set to obtain the target audio data.
9. An audio data processing apparatus, comprising a processor and a memory storing program instructions, characterized in that the processor is configured to perform the audio data processing method according to any one of claims 1 to 8 when executing the program instructions.
10. A storage medium storing program instructions, characterized in that the program instructions, when executed, perform the audio data processing method according to any one of claims 1 to 8.
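
For illustration only, the silence-based segmentation of claims 2 and 3 and the spectral energy ratio test of claim 6 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: all function names, the default frame duration, and the thresholds are hypothetical; "energy" is taken to be mean-square amplitude; the "preset position" is taken to be the midpoint of the silent segment; and a naive discrete Fourier transform from the Python standard library stands in for whatever spectral analysis an actual embodiment would use.

```python
import cmath
import math


def find_silent_segments(samples, sample_rate, frame_sec=0.02, energy_threshold=1e-4):
    """Return (start_sec, end_sec) spans of merged, time-continuous silent frames."""
    frame_len = max(1, int(frame_sec * sample_rate))
    segments = []
    run_start = None  # sample index where the current silent run began
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)  # assumed: mean-square energy
        if energy < energy_threshold:
            if run_start is None:
                run_start = start
        elif run_start is not None:
            segments.append((run_start / sample_rate, start / sample_rate))
            run_start = None
    if run_start is not None:
        segments.append((run_start / sample_rate, len(samples) / sample_rate))
    return segments


def split_points(silent_segments, min_silence_sec=0.3):
    """Cut at the midpoint (assumed preset position) of each long-enough silence."""
    return [(s + e) / 2 for s, e in silent_segments if e - s >= min_silence_sec]


def band_energy_ratio(samples, sample_rate, low_band, high_band):
    """Ratio of average spectral energy in high_band to that in low_band.

    Bands are (min_hz, max_hz) pairs; max_hz is exclusive.
    """
    n = len(samples)

    def avg_energy(band):
        lo, hi = band
        bins = [k for k in range(n // 2 + 1) if lo <= k * sample_rate / n < hi]
        if not bins:
            raise ValueError("frequency band contains no DFT bins")
        total = 0.0
        for k in bins:
            # Naive DFT coefficient for bin k (fine for short illustrative signals).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(samples))
            total += abs(coeff) ** 2
        return total / len(bins)

    low = avg_energy(low_band)
    return math.inf if low == 0 else avg_energy(high_band) / low


def keep_non_upsampled(segments, sample_rate, low_band, high_band, ratio_threshold):
    """Drop segments whose high-to-low energy ratio is below the threshold."""
    return [seg for seg in segments
            if band_energy_ratio(seg, sample_rate, low_band, high_band) >= ratio_threshold]
```

The intuition behind the ratio test is that an upsampled recording carries almost no energy above its original Nyquist frequency, so the ratio of the second (higher) band to the first (lower) band is close to zero, whereas a recording natively captured at the higher sampling rate retains measurable high-frequency content. Suitable band edges and the ratio threshold depend on the target sampling rate and are not specified here.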