An audio data classification method, apparatus, device and medium
By combining a base classifier and a novel classifier, utilizing a convolutional neural network and a global temporal pooling layer, and incorporating dual data augmentation strategies and an attention mechanism, the accuracy issues of audio data classification models in situations with insufficient data and dynamic scenarios are addressed, achieving high-precision audio data classification in various scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM NETWORK SECURITY TECH CO LTD
- Filing Date
- 2023-01-18
- Publication Date
- 2026-06-26
AI Technical Summary
Existing audio data classification models are prone to overfitting during training and require a large amount of data, resulting in low audio classification accuracy in dynamically changing or unknown scenarios.
We employ a method that combines a base classifier and a novel classifier. By acquiring an audio dataset, we determine the weight matrices of the base class and the novel class. We then utilize a convolutional neural network and a global temporal pooling layer, along with dual data augmentation strategies and an attention mechanism, to dynamically expand the classifier to adapt to different scenarios.
It improves the accuracy of audio data classification, and can effectively identify data in both fixed category vocabularies and dynamically changing or previously unknown scenarios, thus enhancing recognition capabilities.
Figure CN116129888B_ABST
Description
Technical Field
[0001] This invention relates to the field of deep learning technology, and in particular to an audio data classification method, apparatus, device, and medium.
Background Technology
[0002] In the data era, the focus of enterprise transformation and digitalization has shifted from "data" to "data assets," with audio data, an indispensable part of enterprise digital processes, gradually becoming a key area of attention. Data classification and grading, as a crucial step in data asset management, is of significant guiding importance for differentiated security protection and refined security control. With the development of computer auditory technology, deep learning has played a vital role in audio data classification and grading.
[0003] However, deep neural networks are prone to overfitting during training and require a large amount of data. Collecting large-scale effective audio data for model training is not practical, resulting in high difficulty and low accuracy in audio recognition.
[0004] Therefore, audio data classification models in related technologies usually rely on fixed category vocabularies to achieve high-precision classification. However, such models have limited recognition capabilities in dynamically changing or previously unknown scenarios, resulting in lower audio classification accuracy in these scenarios.
Summary of the Invention
[0005] This application provides an audio data classification method, apparatus, device, and medium to address the problem of low audio classification accuracy in the prior art.
[0006] In a first aspect, embodiments of this application provide an audio data classification method, the method comprising:
[0007] Obtain the audio dataset;
[0008] The audio dataset is input into a base classifier, and the base class weight matrix of the audio dataset is determined based on the base classifier.
[0009] The audio dataset is input into a new class classifier, and a new class weight matrix for the audio dataset is determined based on the new class classifier.
[0010] The classification result of the audio dataset is determined based on the base class weight matrix and the new class weight matrix.
[0011] In a second aspect, embodiments of this application provide an audio data classification apparatus, the apparatus comprising:
[0012] The acquisition module is used to acquire audio datasets;
[0013] The classification module is used to input the audio dataset into a base classifier, determine the base class weight matrix of the audio dataset based on the base classifier, and input the audio dataset into a new class classifier, determine the new class weight matrix of the audio dataset based on the new class classifier.
[0014] The determination module is used to determine the classification result of the audio dataset based on the base class weight matrix and the new class weight matrix.
[0015] In a third aspect, embodiments of this application provide an electronic device, which includes at least a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the steps of the audio data classification method as described in any of the above.
[0016] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the audio data classification method as described in any of the above.
[0017] In this embodiment, an audio dataset is acquired; the audio dataset is input into a base classifier, and a base class weight matrix for the audio dataset is determined based on the base classifier; the audio dataset is input into a new class classifier, and a new class weight matrix for the audio dataset is determined based on the new class classifier; and the classification result of the audio dataset is determined based on the base class weight matrix and the new class weight matrix. Because the classification result draws on both weight matrices, the method is applicable to scenarios with a fixed category vocabulary as well as scenarios that change dynamically or are not known in advance, thus improving recognition capabilities and audio classification accuracy in different scenarios.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of an audio data classification process provided in some embodiments of this application;
[0020] Figure 2 is a schematic diagram of a hierarchical classification model group provided in some embodiments of this application;
[0021] Figure 3 is a schematic diagram of an audio data classification process provided in some embodiments of this application;
[0022] Figure 4 is a schematic diagram of an audio data classification process provided in some embodiments of this application;
[0023] Figure 5 is a schematic diagram of the structure of an audio data classification apparatus provided in some embodiments of this application;
[0024] Figure 6 is a schematic diagram of the structure of an electronic device provided in some embodiments of this application.
Detailed Implementation
[0025] To make the objectives and implementation methods of this application clearer, the exemplary implementation methods of this application will be clearly and completely described below with reference to the accompanying drawings of the exemplary embodiments of this application. Obviously, the exemplary embodiments described are only some embodiments of this application, and not all embodiments.
[0026] It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood in their ordinary and common meaning.
[0027] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar or related objects or entities, and do not necessarily imply a specific order or sequence, unless otherwise specified. It should be understood that such terms are interchangeable where appropriate.
[0028] The terms “comprising” and “having”, and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device that includes a series of components is not necessarily limited to the components that are clearly listed, but may include other components that are not clearly listed or that are inherent to such product or device.
[0029] The term "module" refers to any known or subsequently developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and / or software code that is capable of performing the functions associated with that element.
[0030] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
[0031] For ease of explanation, the above description has been provided in conjunction with specific embodiments. However, the above exemplary discussion is not intended to be exhaustive or to limit the embodiments to the specific forms disclosed above. Various modifications and variations can be obtained based on the above teachings. The selection and description of the above embodiments are for the purpose of better explaining the principles and practical applications, thereby enabling those skilled in the art to better utilize the described embodiments and various different variations of embodiments suitable for specific use considerations.
[0032] Example 1:
[0033] Figure 1 is a schematic diagram of an audio data classification process provided in some embodiments of this application. The process includes:
[0034] S101: Obtain the audio dataset.
[0035] The audio data classification method provided in this application is applied to electronic devices, including but not limited to audio acquisition devices (such as microphones), user devices (such as mobile phones, tablets, wearable devices, etc.), or servers.
[0036] In the training scenario of a classifier (including a base classifier and / or a new class classifier) for audio data classification, the audio dataset may include one or more training sets, validation sets, or test sets, and the corresponding classifier is the classifier being trained. In the recognition scenario of audio data classification, the audio dataset may include the audio data to be recognized / classified, and the corresponding classifier is a pre-trained or fully trained classifier.
[0037] An audio dataset can include the original audio dataset and / or a collection obtained by processing the original audio dataset. Processing the original audio dataset can include preprocessing and / or data augmentation. For example, when preprocessing the original audio dataset, the audio can be segmented, and the segments can be adjusted to equal length through methods such as trimming and padding. Optionally, the equal-length segments can be shuffled and then divided into training and test sets. The duration of the equal-length segments is not limited here; it can be, for example but not limited to, 10 seconds. As another example, when performing data augmentation on the original audio dataset, audio data augmentation, spectrogram data augmentation, or a combination of the two can be performed.
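The preprocessing described in [0037] can be sketched as follows. This is only an illustrative sketch, not the implementation of this application: the function names, the zero-padding choice, the `pad_last` switch, and the 80/20 split are all assumptions introduced for the example.

```python
import numpy as np

def to_equal_length_segments(waveform, sample_rate, segment_seconds=10.0, pad_last=True):
    """Cut a 1-D waveform into equal-length segments.

    pad_last=True zero-pads the final short segment (padding);
    pad_last=False drops the short tail instead (trimming).
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    seg_len = int(round(segment_seconds * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    if pad_last:
        n_segments = max(1, int(np.ceil(len(waveform) / seg_len)))
        padded = np.zeros(n_segments * seg_len, dtype=np.float32)
        padded[: len(waveform)] = waveform
        return padded.reshape(n_segments, seg_len)
    n_segments = len(waveform) // seg_len
    return waveform[: n_segments * seg_len].reshape(n_segments, seg_len)

def shuffle_and_split(segments, train_fraction=0.8, seed=0):
    """Shuffle the segments and split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(segments))
    n_train = int(len(segments) * train_fraction)
    return segments[order[:n_train]], segments[order[n_train:]]
```

For a 25-second recording and 10-second segments, padding yields three segments and trimming yields two.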
[0038] Optionally, the original audio dataset may include publicly available datasets or self-collected datasets.
[0039] The electronic device may include a base classifier and a new class classifier, which may optionally be integrated into an audio classification model (or model group). This audio classification model (or model group) is suitable for classifying and grading multi-label audio data with few samples.
[0040] S102: Input the audio dataset into the base classifier, and determine the base class weight matrix of the audio dataset based on the base classifier.
[0041] For example, the base classifier can be a convolutional neural network (CNN) model, such as, but not limited to, a 14-layer CNN containing 6 convolutional blocks, each consisting of 2 convolutional layers with 3×3 kernels. Batch normalization is applied between the convolutional layers, and a rectified linear unit (ReLU) non-linear activation function is used to accelerate and stabilize training. 2×2 average pooling can be applied to downsample the output of each convolutional block in the base classifier. Optionally, global temporal pooling can also be applied in the base classifier to summarize features, improving training performance for weakly labeled audio data. The implementation of global temporal pooling can be found in subsequent embodiments.
[0042] The base classifier can be trained on a public audio dataset. For example, the base classifier can be pre-trained on a public audio dataset by dividing it into training, validation and test sets. Then, by fine-tuning the parameters in the feature extraction module, the feature extraction module can be transferred to other public datasets and / or self-collected datasets to obtain the base class to which the audio data in the audio dataset belongs and the base class weight matrix.
[0043] S103: Input the audio dataset into the new class classifier, and determine the new class weight matrix of the audio dataset based on the new class classifier.
[0044] The new class classifier can be obtained through dynamic few-shot learning; that is, the new class classifier can be obtained by extending the base classifier described above through a few-shot classification weight generator module.
[0045] The new class classifier can identify new categories other than the base classes, and obtain the new class to which the audio data in the audio dataset belongs and the new class weight matrix.
[0046] S104: Determine the classification result of the audio dataset based on the base class weight matrix and the new class weight matrix.
[0047] In one implementation, the electronic device can directly use the base class weight matrix and the new class weight matrix as the classification result.
[0048] In another implementation, the electronic device determines the category to which the audio data in the audio dataset belongs based on the base class weight matrix and the new class weight matrix, and uses the category to which the audio data belongs as the classification result.
[0049] In this embodiment, the base classifier can determine the base class weight matrix of the audio dataset, and the new class classifier can determine the new class weight matrix of the audio dataset. Then, the classification result of the audio dataset is determined based on the base class weight matrix and the new class weight matrix. This method is applicable to scenarios with a fixed category vocabulary as well as scenarios with dynamic changes or unknown priors. Therefore, it can improve the recognition ability in different scenarios and improve the audio classification accuracy.
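One possible way to turn the two weight matrices into a classification result, as in [0048], is sketched below. The text above does not specify the scoring function; the cosine-similarity scoring, the sigmoid, the `scale` factor, and the 0.5 threshold are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def classify(features, base_weights, new_weights, scale=10.0, threshold=0.5):
    """Score features against base and new class weights in one pass.

    features:     (batch, dim) feature vectors from the feature extractor
    base_weights: (n_base, dim) base class weight matrix
    new_weights:  (n_new, dim) new class weight matrix
    Returns per-class probabilities and boolean multi-label decisions.
    """
    # Stack both matrices so base and new classes are predicted jointly.
    weights = np.concatenate([base_weights, new_weights], axis=0)
    cosine = l2_normalize(features) @ l2_normalize(weights).T
    probs = 1.0 / (1.0 + np.exp(-scale * cosine))
    return probs, probs >= threshold
```

Because each class gets its own sigmoid score, several base and new classes can be reported for the same clip.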
[0050] Example 2:
[0051] Based on the above embodiments, in this embodiment of the application, obtaining the audio dataset includes:
[0052] The original audio dataset is subjected to a first data augmentation, which includes one or more of the following data augmentation processes: audio rotation, audio pitch correction, audio pitch shifting, or noise addition.
[0053] The original audio dataset after the first data augmentation is converted into a Mel spectrogram;
[0054] Calculate the average value in the Mel spectrogram;
[0055] The average value is used to replace the selected row and / or column data in the Mel spectrogram to obtain the Mel spectrogram after the second data augmentation;
[0056] The audio dataset is determined based on the original audio dataset and the Mel spectrogram after the second data augmentation.
[0057] Because deep neural networks are prone to overfitting during training and require a large amount of data, the final recognition accuracy is uncontrollable when faced with situations where there are many types of labels but a small total amount of data. Furthermore, collecting large-scale effective audio data for model training is not practical, and audio data is usually affected by complex factors such as background noise. Therefore, in this embodiment, a dual data augmentation strategy can be applied to the original audio samples to expand the dataset, increase data diversity, and improve the generalization of the feature extraction model. This is beneficial for solving the problem of having many types of labels but a small total amount of data in the scenario of classifying and grading audio data with few samples and many labels.
[0058] The dual data augmentation in this embodiment includes audio data augmentation and spectrogram data augmentation.
[0059] In audio data augmentation, the original dataset can be expanded using one or more of the following methods: audio rotation, audio pitch correction, audio pitch shifting, and noise addition, to complete the first data augmentation of the audio data in the original dataset.
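Two of the audio-level augmentations named in [0059] can be sketched as follows. The text does not define them in detail, so this sketch assumes that "audio rotation" means a circular shift of the samples and that noise addition means additive Gaussian noise at a chosen signal-to-noise ratio; the function names and the `snr_db` parameter are hypothetical.

```python
import numpy as np

def rotate_audio(waveform, shift_samples):
    """Circularly shift the waveform (assumed meaning of 'audio rotation')."""
    return np.roll(waveform, shift_samples)

def add_noise(waveform, snr_db, rng):
    """Add white Gaussian noise at roughly the requested SNR in decibels."""
    waveform = np.asarray(waveform, dtype=np.float64)
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```

Pitch correction and pitch shifting require resampling or a phase vocoder and are left out of this sketch.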
[0060] For example, after audio data augmentation, spectrogram data augmentation can be performed. During spectrogram data augmentation, parameters such as frame length, frame shift, number of Mel bands, and sampling frequency can be set to convert the original audio dataset after the first data augmentation into one or more Mel spectrograms. For each Mel spectrogram, the electronic device can calculate the average value of the spectrogram and randomly select a portion of its rows and columns. Replacing the randomly selected row and column data with the average value yields a new Mel spectrogram. It is understood that random selection of row and / or column data is merely an example; in other examples, row and / or column data can be selected according to set rules. The set rules are not limited in this embodiment.
[0061] When determining the audio dataset based on the original audio dataset and the new Mel spectrogram, the new Mel spectrogram can be added to the original audio dataset to obtain the audio dataset, thus completing the second data augmentation for the spectrogram data.
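The second (spectrogram-level) augmentation can be sketched as below. It assumes the Mel spectrogram is already available as a 2-D array of shape (Mel bands, frames) and that the replacement value is the mean over the whole spectrogram; the function name and the `n_rows` / `n_cols` parameters are hypothetical.

```python
import numpy as np

def mean_mask_spectrogram(mel, n_rows, n_cols, rng):
    """Replace randomly chosen rows and columns with the spectrogram mean.

    mel: 2-D array (n_mel_bands, n_frames). The input is not modified.
    Returns the augmented copy plus the selected row and column indices.
    """
    mel = np.asarray(mel, dtype=np.float64)
    mean_value = mel.mean()
    rows = rng.choice(mel.shape[0], size=n_rows, replace=False)
    cols = rng.choice(mel.shape[1], size=n_cols, replace=False)
    augmented = mel.copy()
    augmented[rows, :] = mean_value   # mask whole frequency bands
    augmented[:, cols] = mean_value   # mask whole time frames
    return augmented, rows, cols
```

The augmented spectrogram would then be added to the dataset alongside the original data, as [0061] describes.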
[0062] In this embodiment of the application, performing dual data augmentation on the original audio samples can expand the dataset, increase data diversity, and further improve the accuracy of audio classification.
[0063] Example 3:
[0064] Based on the above embodiments, in this embodiment of the application, a global temporal pooling layer is connected after the last convolutional layer in the base classifier.
[0065] Global temporal pooling layers can summarize features, improve training performance for weakly labeled audio data, and increase the accuracy of weakly labeled audio classification.
[0066] For example, the base classifier applies a 2×2 average pooling to downsample each convolutional block and applies a global temporal pooling to summarize features after the last convolutional layer.
[0067] Because audio data has the characteristic of multiple sounds overlapping in time, audio classification suffers from weak labeling. In this embodiment, global temporal pooling is applied after the last convolutional layer of the base classifier to summarize audio features, which can further improve the accuracy of audio classification.
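The two pooling operations can be sketched on a feature map of shape (channels, time, frequency). The text does not state which statistic the global temporal pooling uses; the combination of maximum and mean over time below is an assumption borrowed from common practice in audio tagging networks, and both function names are hypothetical.

```python
import numpy as np

def avg_pool_2x2(x):
    """2x2 average pooling over the (time, freq) axes of a (C, T, F) map.

    Odd trailing rows or columns are dropped.
    """
    c, t, f = x.shape
    t2, f2 = t // 2, f // 2
    x = x[:, : t2 * 2, : f2 * 2]
    return x.reshape(c, t2, 2, f2, 2).mean(axis=(2, 4))

def global_temporal_pool(x):
    """Summarize a (C, T, F) map into one value per channel.

    Frequency is averaged first; the time axis is then collapsed by
    adding the maximum and the mean over time (assumed statistic).
    """
    per_frame = x.mean(axis=2)                           # (C, T)
    return per_frame.max(axis=1) + per_frame.mean(axis=1)  # (C,)
```

Because the time axis is collapsed entirely, a clip-level label can supervise the network without frame-level annotations, which is what makes this pooling suited to weakly labeled audio.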
[0068] Example 4:
[0069] Based on the above embodiments, in this embodiment of the application, the method further includes:
[0070] In the base class, determine the pseudo-new class used to train the new class classifier, and determine multiple labeled data of the pseudo-new class and the weight of each labeled data belonging to the pseudo-new class;
[0071] Input the multiple labeled data, the weight of each labeled data belonging to the pseudo-new class, and the base class weight vector output by the base classifier into the new class classifier;
[0072] Based on the new class classifier, the average feature vector of the pseudo-new class is calculated according to multiple labeled data and the weight of each labeled data belonging to the pseudo-new class;
[0073] Based on the new class classifier, the average feature vector and the base class weight vector are weighted to obtain the weight vector of the pseudo-new class.
[0074] The base class weight matrix is updated based on the weight vector of the pseudo-new class and the base class weight vector; the parameters of the new class classifier are then updated based on the updated base class weight matrix.
[0075] The new class classifier can extend the base classifier to achieve the recognition of new audio categories.
[0076] In the embodiments of this application, the electronic device can train a new class classifier based on a pre-trained base classifier and a training set containing base classes. In each iteration, one or more pseudo-new classes can be taken from the base classes to simulate the new category in the inference stage. Then, K training samples (i.e. labeled data) are sampled for each pseudo-new class, and a new weight vector for the pseudo-new class is generated by the new class classifier.
[0077] When generating a new weight vector for a pseudo-new class using a new class classifier, the labeled data of the pseudo-new class, the weights corresponding to each labeled data, and the base class weight vector can be used as inputs to the new class classifier. The new class classifier can calculate the average feature vector of the pseudo-new class data using multiple labeled data and the weights corresponding to each labeled data. Then, based on the weighted sum of this average feature vector and the base class weight vector, the weight vector of the pseudo-new class is determined.
[0078] The base class weight vector is a linear combination of the weight vectors of the individual base classes. Optionally, the coefficient of each base class in this combination can be calculated by an attention module consisting of a cosine similarity function in the base classifier followed by a softmax (normalized exponential function) over the base classes.
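The weight generation in [0077] and [0078] can be sketched as follows. This is a simplified, non-learned sketch: it assumes the weighted average feature itself serves as the attention query, and the mixing coefficients `alpha` and `beta` and the `temperature` are hypothetical fixed values standing in for parameters that would be learned during training.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def generate_pseudo_new_weight(features, sample_weights, base_weights,
                               alpha=0.5, beta=0.5, temperature=10.0):
    """Build a weight vector for a pseudo-new class from K labeled samples.

    features:       (K, dim) features of the K labeled samples
    sample_weights: (K,) weight of each sample belonging to the class
    base_weights:   (n_base, dim) weight vectors of the base classes
    """
    w = np.asarray(sample_weights, dtype=np.float64)
    avg_feature = (w / w.sum()) @ features                 # weighted average feature
    cosine = l2_normalize(base_weights) @ l2_normalize(avg_feature)
    attention = softmax(temperature * cosine)              # one coefficient per base class
    base_vector = attention @ base_weights                 # linear combination of base weights
    return alpha * avg_feature + beta * base_vector, attention

def extend_weight_matrix(base_weights, new_weight):
    """Append the generated weight vector as an extra row."""
    return np.vstack([base_weights, new_weight])
```

The extended matrix has one more row than the base class weight matrix, so base classes and the pseudo-new class can be scored together when computing the training loss.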
[0079] In this implementation, the new class classifier can be learned through dynamic few-shot learning; for example, it can be learned from K labeled data of the pseudo-new class. In this embodiment, the value of K is not limited; for example, but not limited to, K≤5.
[0080] During the training of the new class classifier, a new base class weight matrix is formed based on the weight vector of the pseudo-new class and the weight vector of the base class. Then, based on the new base class weight matrix, the parameters of the new class classifier can be updated to minimize the classification loss of this batch.
[0081] In one implementation, when training the new class classifier, the optimization process uses an adaptive moment estimation (Adam) optimizer with a high learning rate.
[0082] In the embodiments of this application, the new class classifier can be learned based on dynamic few-shot learning, which can continuously expand the trained base classifier so that new categories can be identified based on only a small amount of labeled data during the inference stage, overcoming the application limitations of a fixed class vocabulary in dynamically changing or previously unknown scenarios.
[0083] In this method, the base class weight vector can also be updated based on the updated base class weight matrix. During the training of the few-shot new class classifier, the base class weight vector can be updated based on the new base class weight matrix formed by the pseudo-new class weight vector and the base class weight vector to minimize the classification loss of this batch.
[0084] A few-shot weight generator is built based on the attention mechanism. It makes full use of the prior knowledge of the base class classification weights and can obtain the corresponding classification weights based on only a small amount of new class labeled data. By combining the new class weights with the original base class weights, the prior matrix of classification weights is dynamically expanded, thereby realizing the joint prediction of the base class and the new class in a unified framework. It can also further improve the accuracy of audio classification.
[0085] Example 5:
[0086] Based on the above embodiments, in the embodiments of this application, the loss function in the new class classifier includes the binary cross-entropy loss function.
[0087] Typically, the loss function in neural networks is the categorical cross-entropy loss function. However, in this embodiment, the binary cross-entropy loss function is used instead of the categorical cross-entropy loss function to train the neural network, which enables the transfer from multi-class tasks to multi-label tasks.
[0088] Because audio data has the characteristic of multiple sounds overlapping in time, audio classification is a multi-label problem. In this embodiment, the binary cross-entropy loss function is used to train the new class classifier, which can adapt to multi-label audio classification and further improve the accuracy of audio classification.
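For reference, a minimal sketch of the binary cross-entropy loss for multi-label targets is given below. It assumes the classifier outputs raw scores (logits) and uses a standard numerically stable formulation; the function name is hypothetical.

```python
import numpy as np

def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy over all (sample, class) pairs.

    logits:  (batch, n_classes) raw classifier scores
    targets: (batch, n_classes) 0/1 labels; several may be 1 per sample
    Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z):
    max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    loss = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return loss.mean()
```

Each class is treated as an independent yes/no decision, so a clip containing overlapping sounds can carry several positive labels at once, which a softmax-based categorical loss cannot express.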
[0089] Example 6:
[0090] Based on the above embodiments, in this embodiment of the application, the method further includes:
[0091] Based on the test results of the base classifier and / or the new class classifier, identify the confused classes that have a classification accuracy below a set threshold and are mixed with other classes, as well as the correct classes that have a classification accuracy above a set threshold.
[0092] If the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, the number of nodes in the base classifier and / or the new class classifier is modified according to the number of confused classes to obtain the classifier to be trained; the classifier is then trained again using a sub-audio dataset containing the confused classes from the audio dataset.
[0093] In classification tasks, the difficulty of recognizing different labels often varies. Therefore, in this embodiment, a hierarchical classification model group training method can be used to address the problem of uneven classification accuracy of the classifier across different categories. The classifier includes a base classifier and / or a new class classifier, which will be described in this embodiment.
[0094] The classifier can be tested using a validation set and / or a test set to obtain the test results of the current classifier. Based on the test results, confused classes (those with classification accuracy below a set threshold and mixed with other classes) and correct classes (those with classification accuracy above the set threshold) can be identified. Because of their low prediction accuracy and their tendency to be mixed with other classes, confused classes can also be regarded as classes that are easily confused and prone to misclassification. Optionally, confused classes can be recorded in the confusion matrix of the test results. In this embodiment, the value of the set threshold is not limited.
[0095] The decision to train the lower-level model can be determined by the ratio k of the number of confused classes to the number of correct classes. Specifically, if the ratio k exceeds a set threshold, training the lower-level model is initiated. This threshold can be considered as the generation threshold p parameter for the lower-level model. Optionally, the parameters for the lower-level model can also include a learning rate variation parameter q. This parameter q allows adjustment of the learning rate, thereby adjusting the convergence speed of the lower-level model. Further details can be found in subsequent embodiments.
[0096] In one implementation, the lower-level model is trained when the ratio of the number of confused classes to the number of correct classes exceeds a set ratio (e.g., k≥p). In another implementation, the lower-level model is trained when the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, and the upper-level model (i.e., the current classifier) is not a binary classification model.
[0097] When training the lower-level model, the pre-trained classifier of the upper level can be copied, the number of softmax nodes in the classifier can be modified, and the lower-level model (the classifier with the modified number of nodes) can then be retrained using a sub-audio dataset that retains only the confused classes. This allows for focused correction of the classification performance on the confused classes after removing interference from other data. For example, the modified number of nodes is the same as the number of confused classes.
[0098] In one implementation, when the ratio of the number of confused classes to the number of correct classes is less than a set ratio (e.g., k < p), the lower-level model is not trained, and the generation of the lower-level model is terminated. In another implementation, when the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, and the upper-level model is a binary classification model, the lower-level model is not trained, and the generation of the lower-level model is terminated.
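The decision rule in [0094] to [0098] can be sketched as follows. The sketch assumes a single-label confusion matrix whose rows are true classes and whose columns are predicted classes, treats accuracy exactly equal to the threshold as "correct", and uses k ≥ p as the trigger; the function names are hypothetical.

```python
import numpy as np

def split_classes(confusion, accuracy_threshold):
    """Return indices of confused and correct classes from a confusion matrix."""
    confusion = np.asarray(confusion, dtype=np.float64)
    totals = confusion.sum(axis=1)
    # Per-class accuracy; classes with no test samples count as accuracy 0.
    accuracy = np.divide(np.diag(confusion), totals,
                         out=np.zeros_like(totals), where=totals > 0)
    confused = np.flatnonzero(accuracy < accuracy_threshold)
    correct = np.flatnonzero(accuracy >= accuracy_threshold)
    return confused, correct

def should_train_lower_level(confusion, accuracy_threshold, p):
    """Decide whether to generate a lower-level model.

    Returns (decision, confused class indices). No lower-level model is
    generated when the upper-level model is binary or nothing is confused.
    """
    confused, correct = split_classes(confusion, accuracy_threshold)
    n_classes = np.asarray(confusion).shape[0]
    if n_classes <= 2 or len(confused) == 0:
        return False, confused
    k = len(confused) / len(correct) if len(correct) else float("inf")
    return k >= p, confused
```

When the decision is positive, the lower-level model would get as many output nodes as there are confused classes and be retrained only on samples of those classes.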
[0099] Taking Figure 2 as an example, the electronic device trains a few-shot audio classification model (base classifier and / or new class classifier) using dataset S. It then tests the trained model and obtains confusion matrix 0, which includes confused classes. Based on the confused classes in confusion matrix 0, datasets S1 and S2 are determined within dataset S, where S1 and S2 can correspond to the same or different confused classes. Based on the confused classes, the number of nodes in the few-shot audio classification model is modified to obtain the lower-level few-shot audio classification models 1 and 2. The electronic device trains few-shot audio classification model 1 using dataset S1 and few-shot audio classification model 2 using dataset S2. If testing the trained few-shot audio classification model 1 yields confusion matrix 1, which includes confused classes, dataset S3 is determined within dataset S based on those confused classes. The number of nodes in few-shot audio classification model 1 is then modified to obtain the lower-level few-shot audio classification model 3, which is trained using dataset S3. If the test results of the trained few-shot audio classification model 2 indicate that there is no need to train a next-level model, generation of the next-level model is terminated.
[0100] When the classifier's classification accuracy is uneven across categories, constructing a hierarchical classification model group to focus on correcting and training easily confused and misclassified categories can further optimize the existing few-sample audio classification model, thereby improving the overall accuracy of audio classification.
[0101] Example 7:
[0102] Based on the above embodiments, in this application embodiment, the method further includes:
[0103] After a set number of training rounds, the learning rate of the classifier is lowered using a learning rate variation parameter.
[0104] In addition to the aforementioned set ratio (such as the threshold p for generating the lower-level model), the parameters for constructing the lower-level model can also include a learning rate variation parameter (such as the learning rate variation parameter q of the lower-level model). The learning rate of the classifier can be adjusted through the learning rate variation parameter, thereby adjusting the convergence speed of the classifier.
[0105] Generating the lower-level model (i.e., the classifier to be trained after modifying the number of nodes) increases training time, which in turn reduces the convergence speed of the classifier. Therefore, by using the learning rate variation parameter to lower the learning rate of the classifier, the convergence speed of the classifier can be improved.
[0106] In this embodiment, the learning rate of the classifier can be reduced once after a set number of training epochs. One epoch is defined as when every sample in the training set participates in one training iteration of the classifier. There is no restriction on the value of the set number of epochs; for example, the set number of epochs can be m epochs, where m is a positive integer.
[0107] For example, the learning rate variation parameter is q, where q < 1 and q is a positive number. When using the learning rate variation parameter to reduce the learning rate of the classifier, the learning rate of the classifier can be reduced to q times its original value.
[0108] In this embodiment of the application, a learning rate variation parameter q can be introduced to address the training time consumption caused by the large number of lower-level models generated. The learning rate of the lower-level model is reduced to q times its original value after every m epochs, thereby improving the convergence speed of the multi-level model. In other words, during the training of the hierarchical model group, by setting the learning rate variation parameter of the lower-level model, the learning rate reduction can be appropriately accelerated while ensuring accuracy, so as to alleviate the training time consumption caused by the large number of sub-models generated.
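The schedule described above (multiply the learning rate by q after every m epochs) can be written as a short sketch. The function name and the 0-indexed epoch convention are assumptions made for illustration and are not specified in the application.

```python
def stepped_learning_rate(initial_lr, q, m, epoch):
    """Learning rate in effect at a 0-indexed epoch.

    The rate is multiplied by q (0 < q < 1) once after every m completed epochs.
    """
    if not 0 < q < 1:
        raise ValueError("q must be a positive number smaller than 1")
    if m < 1:
        raise ValueError("m must be a positive integer")
    return initial_lr * q ** (epoch // m)
```

For example, with an initial learning rate of 0.1, q = 0.5 and m = 10, epochs 0 to 9 use 0.1, epochs 10 to 19 use 0.05, and epochs 20 to 29 use 0.025.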
[0109] Example 8:
[0110] Based on the above embodiments, Figure 3 provides a flowchart of audio data classification, which includes the following steps:
[0111] S301: The original audio samples are trimmed or padded into audio segments of equal length, and then subjected to dual data augmentation processing of audio data augmentation and spectrogram data augmentation to obtain the expanded audio dataset.
[0112] When performing the dual data augmentation processing of audio data augmentation and spectrogram data augmentation, the first data augmentation is completed by means of audio pitch adjustment, pitch shifting, and noise addition; the Mel spectrogram of the audio data after the first data augmentation is then obtained, and a new Mel spectrogram is generated by random mean replacement to complete the second data augmentation.
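Parts of step S301 can be sketched as follows, under stated assumptions. The application does not fix these details: zero-padding at the end of the clip, Gaussian noise, and use of the global mean of the spectrogram are illustrative choices, and all function names are hypothetical. Computing the Mel spectrogram itself is left to an audio library and is represented here by a plain list of rows.

```python
import random


def pad_or_trim(samples, target_len):
    """Trim a waveform to target_len samples, or zero-pad it at the end (assumed)."""
    return samples[:target_len] + [0.0] * max(0, target_len - len(samples))


def add_noise(samples, noise_std, rng):
    """First augmentation (one option): add Gaussian noise to each sample (assumed)."""
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def mean_replace(mel, rows=(), cols=()):
    """Second augmentation: overwrite selected rows and / or columns with the mean.

    mel is a list of rows (frequency bands), each a list of values over time.
    The input is not modified; a new spectrogram is returned.
    """
    values = [v for row in mel for v in row]
    mean = sum(values) / len(values)
    out = [list(row) for row in mel]
    for r in rows:
        out[r] = [mean] * len(out[r])
    for c in cols:
        for row in out:
            row[c] = mean
    return out
```

In practice the rows and columns to replace would be chosen at random for each sample, for instance with `random.Random(seed).sample(range(n_rows), k)`.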
[0113] S302: Train a base classifier based on a CNN model (including a feature extraction module and a base class weight matrix module in the base classifier), use the base classifier to extract audio signal features, and obtain the probability of each base class by applying a set of classification weight vectors to the features.
[0114] Specifically, the feature extraction module can extract features from the audio signal, and the base class weight matrix module can apply a set of corresponding classification weight vectors to the extracted features. Applying this set of classification weight vectors yields the probability that the audio signal belongs to each corresponding base class, thus obtaining the probability of each base class.
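The scoring in step S302 can be illustrated with a minimal sketch. Several details are assumptions and are not stated in this passage: the global temporal pooling is taken to be an average over time, the score is a plain dot product, and a sigmoid is used so that each base class receives an independent probability (consistent with a multi-label setting). The function names are hypothetical.

```python
import math


def global_temporal_pool(feature_map):
    """Average a (time x channels) feature map over time into one feature vector."""
    n_frames = len(feature_map)
    n_channels = len(feature_map[0])
    return [sum(frame[c] for frame in feature_map) / n_frames
            for c in range(n_channels)]


def class_probabilities(feature, weight_matrix):
    """One sigmoid probability per class; each row of weight_matrix is a class weight vector."""
    probabilities = []
    for weights in weight_matrix:
        score = sum(f * w for f, w in zip(feature, weights))
        probabilities.append(1.0 / (1.0 + math.exp(-score)))
    return probabilities
```

Pooling over time gives a feature vector of fixed length regardless of how many frames the convolutional layers output, which is what allows the same weight matrix to be applied to every clip.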
[0115] S303: Train an attention-based few-shot weight generator (i.e., a new class classifier), combine the generated new class weights with the original weights of other base classes to construct a new classification weight matrix, and update the parameters of the weight generator and the base class weight vector.
[0116] This step is the same as the training process of the new classifier in the above embodiments, and will not be repeated here.
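As a rough illustration of the weight generation in S303, the sketch below computes a weighted average feature vector for a (pseudo-)new class, attends over the base class weight vectors, and mixes the two results. The cosine-similarity attention, the temperature, and the fixed scalar mixing coefficients are assumptions; in a trained weight generator these would be learnable parameters, and the application's exact formulation may differ. All names are hypothetical.

```python
import math


def weighted_mean(features, sample_weights):
    """Average feature vector of the class, weighted by each sample's class weight."""
    total = sum(sample_weights)
    dim = len(features[0])
    return [sum(w * f[d] for f, w in zip(features, sample_weights)) / total
            for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attention_over_base(query, base_weights, temperature=10.0):
    """Softmax-weighted combination of base class weight vectors (assumed attention form)."""
    scores = [temperature * cosine(query, w) for w in base_weights]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the peak for stability
    z = sum(exps)
    attention = [e / z for e in exps]
    return [sum(a * w[d] for a, w in zip(attention, base_weights))
            for d in range(len(query))]


def novel_class_weight(features, sample_weights, base_weights,
                       mix_avg=0.5, mix_att=0.5):
    """Weight vector for the new class from its few labeled samples."""
    avg = weighted_mean(features, sample_weights)
    att = attention_over_base(avg, base_weights)
    return [mix_avg * a + mix_att * b for a, b in zip(avg, att)]
```

The generated vector can then be appended to the base class weight vectors, for example `base_weights + [novel_class_weight(...)]`, to form the new classification weight matrix.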
[0117] S304: Train a hierarchical classification model group. Based on the confusing classes that the upper-level model failed to classify accurately, obtain a subset of the original dataset and perform transfer learning on the upper-level model to obtain a lower-level model for the confusing classes, and finally obtain a series of model groups for high-precision classification.
[0118] This step is the same as the process of continuing to train the classifier based on the confusion class in the above embodiments, and will not be repeated here.
[0119] S305: Input the audio signal into the trained few-shot audio classification model (group) and output the classification result corresponding to the audio signal.
[0120] In this step, audio data is input into the trained few-shot multi-label audio classification model (group) to obtain the classification results corresponding to the audio signals. Specifically: the audio data is input into the few-shot multi-label audio classification model (group) and converted into equal-length audio segments by cropping or padding. The first data augmentation (audio data augmentation) is completed by methods such as audio tuning, pitch shifting, and noise addition. The Mel spectrogram of the augmented data is obtained. New Mel spectrogram data is generated by random mean replacement, completing the second data augmentation (spectrogram data augmentation). Based on the base classifier (in the few-shot multi-label audio classification model), audio features are extracted and the base class weight matrix is obtained. Based on the new class classifier (in the few-shot multi-label audio classification model), corresponding classification weights are generated for the new class audio data and the classification results are output.
[0121] S304 is an optional step. For example, S304 is executed when there are confused classes, when the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, or when the ratio of the number of confused classes to the number of correct classes exceeds a set ratio and the current classifier is not a binary classification model. One possible implementation is shown in Figure 4. The audio data is preprocessed and then subjected to dual data augmentation. A base classifier and an attention-based few-shot weight generator are trained using the augmented audio data. The base classifier and weight generator are tested to determine whether any confused classes exist. If so, a hierarchical classification model group is constructed to further train the base classifier and / or weight generator. Finally, the trained base classifier and / or weight generator are used to classify and identify the audio data, outputting the classification result. If no confused classes exist, the trained base classifier and / or weight generator can be directly used to classify and identify the audio data, outputting the classification result.
[0122] In this embodiment, an audio data classification and grading method for few-shot, multi-label scenarios is constructed based on a secure data platform. The dataset is expanded through a dual data augmentation strategy so that the base classifier can be trained even when the sample size is insufficient. The base classifier is dynamically expanded based on dynamic few-shot learning technology and an attention mechanism, so that new categories can be identified from only a small amount of labeled data during the inference stage. The model definition and loss function are fine-tuned to adapt to audio data classification in multi-label and weakly labeled scenarios. When the classification accuracy of the classifier is uneven across categories, a hierarchical classification model group is constructed to focus on correcting and training easily confused and error-prone categories, thereby further optimizing the existing few-shot audio classification model.
[0123] The embodiments of this application are applicable to, but not limited to, the following scenarios:
[0124] Scenario 1: Sensitive Audio Judgment on Audio Review Platforms. In the digital age, audio has become a crucial means of information transmission, with social media platforms generating hundreds of millions of audio clips daily. Some harmful audio containing sensitive information and unsuitable for dissemination has also emerged. Compared to normal audio, this sensitive content is usually small in quantity, but if it cannot be accurately identified and removed, it will have a negative impact on national security, social stability and harmony, and especially on the growth of teenagers. Therefore, a few-shot audio classification model can be used to accurately classify sensitive audio content with relatively small datasets, improving the efficiency and accuracy of audio review.
[0125] Scenario 2: Access Control for Audio Conference Minutes. The widespread adoption of online meetings has led to a surge in enterprise-archived meeting minutes. These minutes often contain a small amount of important business audio content. A few-shot, multi-label audio classification model is needed to identify this important audio content. Different levels of security access permissions should then be set to achieve differentiated management of these business audio conference minutes. Regular staff have only Level 1 access to the audio conference minutes, allowing them to access only the anonymized versions of the important content. Administrators and senior management have Level 2 access to the complete audio conference minutes.
[0126] Example 9:
[0127] Based on the same technical concept and the above embodiments, this application provides an audio data classification device. Figure 5 is a schematic diagram of an audio data classification device provided by some embodiments of this application. As shown in Figure 5, the device includes:
[0128] The acquisition module 501 is used to acquire the audio dataset;
[0129] The classification module 502 is used to input the audio dataset into a base classifier, determine the base class weight matrix of the audio dataset based on the base classifier, and input the audio dataset into a new class classifier, determine the new class weight matrix of the audio dataset based on the new class classifier.
[0130] The determination module 503 is used to determine the classification result of the audio dataset based on the base class weight matrix and the new class weight matrix.
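How the determination module might combine the two weight matrices can be sketched as follows. This is an illustration under assumptions, not the application's implementation: the matrices are simply stacked, each class is scored with a dot product and a sigmoid, and every class whose probability reaches a hypothetical threshold is reported (multi-label output). The function name, the threshold, and the class names used in any call are all hypothetical.

```python
import math


def classify(feature, base_weights, novel_weights, class_names, threshold=0.5):
    """Score one feature vector against base and new class weight vectors together.

    Returns a dict of per-class probabilities and the list of predicted class names.
    """
    weight_matrix = list(base_weights) + list(novel_weights)  # expanded classifier
    if len(weight_matrix) != len(class_names):
        raise ValueError("one class name is needed per weight vector")
    probabilities = {}
    for name, weights in zip(class_names, weight_matrix):
        score = sum(f * w for f, w in zip(feature, weights))
        probabilities[name] = 1.0 / (1.0 + math.exp(-score))
    predicted = [name for name, prob in probabilities.items() if prob >= threshold]
    return probabilities, predicted
```

Because the new class rows are appended rather than retrained into the base matrix, new categories can be added at inference time without changing the scores of the existing base classes.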
[0131] In one possible implementation, the acquisition module 501 is specifically used to perform a first data enhancement on the original audio dataset. The first data enhancement includes one or more of the following data enhancement processes: audio rotation, audio pitch correction, audio pitch shifting, or noise addition. The original audio dataset after the first data enhancement is converted into a Mel spectrogram. The average value in the Mel spectrogram is calculated. The average value is used to replace the selected row data and / or column data in the Mel spectrogram to obtain a second data-enhanced Mel spectrogram. The audio dataset is determined based on the original audio dataset and the second data-enhanced Mel spectrogram.
[0132] In one possible implementation, a global temporal pooling layer is connected after the last convolutional layer in the base classifier.
[0133] In one possible implementation, the device further includes:
[0134] The training module is used to determine the pseudo-new classes for training the new class classifier from the base classes, and to determine multiple labeled data for the pseudo-new classes and the weights of each labeled data belonging to the pseudo-new class; input the multiple labeled data, the weights of each labeled data belonging to the pseudo-new class, and the base class weight vector output by the base classifier into the new class classifier; based on the new class classifier, calculate the average feature vector of the pseudo-new class according to the multiple labeled data and the weights of each labeled data belonging to the pseudo-new class; based on the new class classifier, perform weighted processing on the average feature vector and the base class weight vector to obtain the weight vector of the pseudo-new class; update the base class weight matrix according to the weight vector of the pseudo-new class and the base class weight vector; update the parameters of the new class classifier according to the updated base class weight matrix.
[0135] In one possible implementation, the training module is further configured to update the base class weight vector based on the updated base class weight matrix.
[0136] In one possible implementation, the loss function in the new class classifier includes the binary cross-entropy loss function.
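For reference, the binary cross-entropy loss over a set of independent class probabilities can be sketched as below. The clipping constant `eps` and the averaging over classes are assumptions made to keep the sketch numerically safe; the application only states that the loss function includes binary cross-entropy.

```python
import math


def binary_cross_entropy(probabilities, labels, eps=1e-7):
    """Mean binary cross-entropy over classes for one multi-label sample.

    probabilities: predicted probability per class, each in [0, 1].
    labels: 1 if the class is present in the sample, otherwise 0.
    """
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from zero
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probabilities)
```

Each class contributes its own term, so a sample may carry several positive labels at once, which is why this loss suits the multi-label setting better than a single softmax cross-entropy.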
[0137] In one possible implementation, the device further includes:
[0138] The correction module is used to determine, based on the test results of the base classifier and / or the new classifier, confused classes whose classification accuracy is lower than a set threshold and mixed with other classes, and correct classes whose classification accuracy is higher than a set threshold; if the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, the number of nodes in the base classifier and / or the new classifier is modified according to the number of confused classes to obtain the classifier to be trained; the classifier is further trained using a sub-audio dataset containing confused classes from the audio dataset.
[0139] In one possible implementation, the correction module is further configured to lower the learning rate of the classifier by using a learning rate change parameter after a set number of training rounds.
[0140] Example 10:
[0141] Based on the same technical concept, this application also provides an electronic device. Figure 6 is a schematic structural diagram of an electronic device provided by this application. As shown in Figure 6, the electronic device includes: processor 601, communication interface 602, memory 603 and communication bus 604, wherein processor 601, communication interface 602 and memory 603 communicate with each other through communication bus 604.
[0142] The memory 603 stores a computer program, which, when executed by the processor 601, causes the processor 601 to perform the following steps:
[0143] Obtain the audio dataset;
[0144] The audio dataset is input into a base classifier, and the base class weight matrix of the audio dataset is determined based on the base classifier.
[0145] The audio dataset is input into a new class classifier, and the new class weight matrix of the audio dataset is determined based on the new class classifier.
[0146] The classification results of the audio dataset are determined based on the base class weight matrix and the new class weight matrix.
[0147] In one possible implementation, the processor 601 is specifically configured to perform a first data enhancement on the original audio dataset, the first data enhancement including one or more of the following data enhancement processes: audio rotation, audio pitch correction, audio pitch shifting, or noise addition; convert the original audio dataset after the first data enhancement into a Mel spectrogram; calculate the average value in the Mel spectrogram; replace selected row data and / or column data in the Mel spectrogram with the average value to obtain a second data-enhanced Mel spectrogram; and determine the audio dataset based on the original audio dataset and the second data-enhanced Mel spectrogram.
[0148] In one possible implementation, a global temporal pooling layer is connected after the last convolutional layer in the base classifier.
[0149] In one possible implementation, the processor 601 is further configured to: determine pseudo-new classes for training the new class classifier in the base class; determine multiple labeled data of the pseudo-new class and the weight of each labeled data belonging to the pseudo-new class; input the multiple labeled data, the weight of each labeled data belonging to the pseudo-new class, and the base class weight vector output by the base classifier into the new class classifier; calculate the average feature vector of the pseudo-new class based on the new class classifier, according to the multiple labeled data and the weight of each labeled data belonging to the pseudo-new class; perform weighted processing on the average feature vector and the base class weight vector based on the new class classifier to obtain the weight vector of the pseudo-new class; update the base class weight matrix according to the weight vector of the pseudo-new class and the base class weight vector; and update the parameters of the new class classifier according to the updated base class weight matrix.
[0150] In one possible implementation, the processor 601 is further configured to update the base class weight vector based on the updated base class weight matrix.
[0151] In one possible implementation, the loss function in the new class classifier includes the binary cross-entropy loss function.
[0152] In one possible implementation, the processor 601 is further configured to determine, based on the test results of the base classifier and / or the new class classifier, confused classes with classification accuracy below a set threshold and mixed with other classes, and correct classes with classification accuracy above a set threshold; if the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, modify the number of nodes in the base classifier and / or the new class classifier according to the number of confused classes to obtain a classifier to be trained; and continue training the classifier using a sub-audio dataset containing confused classes from the audio dataset.
[0153] In one possible implementation, the processor 601 is further configured to, after a set number of training rounds, use a learning rate variation parameter to lower the learning rate of the classifier.
[0154] The communication bus mentioned in the above electronic devices can be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.
[0155] The communication interface 602 is used for communication between the above-mentioned electronic device and other devices.
[0156] The memory may include RAM (Random Access Memory) or NVM (Non-Volatile Memory), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
[0157] The processors mentioned above can be general-purpose processors, including central processing units, network processors (NPs), etc.; they can also be DSPs (Digital Signal Processors), application-specific integrated circuits, field-programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0158] Example 11:
[0159] Based on the same technical concept, embodiments of this application provide a computer-readable storage medium storing a computer program executable by an electronic device. When the program is run on the electronic device, it causes the electronic device to implement any of the above embodiments.
[0160] The aforementioned computer-readable storage medium can be any available medium or data storage device that can be accessed by the processor in an electronic device, including but not limited to magnetic storage such as floppy disks, hard disks, magnetic tapes, MO (magneto-optical disks), optical storage such as CDs, DVDs, BDs, HVDs, etc., and semiconductor storage such as ROMs, EPROMs, EEPROMs, NAND flash (non-volatile memory), SSDs (solid-state drives), etc.
[0161] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0162] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to this application. It should be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0163] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0164] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0165] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. An audio data classification method, characterized in that, The method includes: Obtain the audio dataset; The audio dataset is input into a base classifier, and the base class weight matrix of the audio dataset is determined based on the base classifier. The audio dataset is input into a new class classifier, and a new class weight matrix for the audio dataset is determined based on the new class classifier; wherein the new class classifier is obtained by dynamic few-shot learning and by extending the base classifier; wherein the few-shot samples used in the dynamic few-shot learning are multiple labeled data of a pseudo-new class determined in the base class for training the new class classifier; The classification result of the audio dataset is determined based on the base class weight matrix and the new class weight matrix.
2. The method as described in claim 1, characterized in that, The acquisition of the audio dataset includes: The original audio dataset is subjected to a first data augmentation, which includes one or more of the following data augmentation processes: audio rotation, audio pitch correction, audio pitch shifting, or noise addition. The original audio dataset after the first data augmentation is converted into a Mel spectrogram; Calculate the average value in the Mel spectrogram; The average value is used to replace the selected row and / or column data in the Mel spectrogram to obtain the Mel spectrogram after second data enhancement; The audio dataset is determined based on the original audio dataset and the second data-enhanced Mel spectrogram.
3. The method as described in claim 1 or 2, characterized in that, The last convolutional layer in the base classifier is followed by a global temporal pooling layer.
4. The method as described in claim 1 or 2, characterized in that, The method further includes: In the base class, a pseudo-new class is determined for training the new class classifier, and multiple labeled data of the pseudo-new class and the weight of each labeled data belonging to the pseudo-new class are determined; The multiple labeled data, the weight of each labeled data belonging to the pseudo-new class, and the base class weight vector output by the base classifier are input into the new class classifier; Based on the new class classifier, the average feature vector of the pseudo-new class is calculated according to the multiple labeled data and the weight of each labeled data belonging to the pseudo-new class; Based on the new class classifier, the average feature vector and the base class weight vector are weighted to obtain the weight vector of the pseudo-new class. The base class weight matrix is updated based on the weight vector of the pseudo-new class and the weight vector of the base class; the parameters of the new class classifier are updated based on the updated base class weight matrix.
5. The method as described in claim 4, characterized in that, The loss function in the new class classifier includes the binary cross-entropy loss function.
6. The method as described in claim 1 or 2, characterized in that, The method further includes: Based on the test results of the base classifier and / or the new class classifier, identify confused classes with classification accuracy below a set threshold and mixed with other classes, as well as correct classes with classification accuracy above the set threshold. If the ratio of the number of confused classes to the number of correct classes exceeds a set ratio, the number of nodes in the base classifier and / or the new class classifier is modified according to the number of confused classes to obtain a classifier to be trained; the classifier is then trained again using a sub-audio dataset containing the confused classes from the audio dataset.
7. The method as described in claim 6, characterized in that, The method further includes: After a set number of training rounds, the learning rate of the classifier is adjusted downwards by using a learning rate variation parameter.
8. An audio data classification device, characterized in that, The device includes: The acquisition module is used to acquire the audio dataset; A classification module is used to input the audio dataset into a base classifier, and based on the base classifier, determine the base class weight matrix of the audio dataset; input the audio dataset into a new class classifier, and based on the new class classifier, determine the new class weight matrix of the audio dataset; wherein, the new class classifier is obtained by dynamic few-shot learning and by extending the base classifier; wherein, the few-shot samples used in the dynamic few-shot learning are multiple labeled data of a pseudo-new class determined in the base class for training the new class classifier; The determination module is used to determine the classification result of the audio dataset based on the base class weight matrix and the new class weight matrix.
9. An electronic device, characterized in that, The electronic device includes at least a processor and a memory, the processor being used to implement the steps of the audio data classification method as described in any one of claims 1-7 when executing a computer program stored in the memory.
10. A computer storage medium, characterized in that, It stores a computer program executable by an electronic device, which, when run on the electronic device, causes the electronic device to perform the steps of the audio data classification method according to any one of claims 1-7.