Audio recognition method and training method of audio recognition model

By performing mask gradient updates and text pseudo-label training on the audio recognition model, the problem of difficulty in grasping homology in training unlabeled audio data is solved, and the recognition accuracy of the audio recognition model is improved.

CN116229949B · Active · Publication Date: 2026-06-26 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
IFLYTEK CO LTD
Filing Date
2022-12-29
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing audio recognition models often suffer from poor recognition performance during training on unlabeled audio data due to the difficulty in grasping the homology.

Method used

The initial audio recognition model is trained using first audio data containing text pseudo-labels and second audio data containing text labels. The model parameters related to the audio recognition task are then determined in the trained model, and mask gradient updates are performed on those parameters.

Benefits of technology

This improves the recognition accuracy of the audio recognition model for audio data, avoids the problem of difficulty in grasping the homology between unlabeled data and the audio recognition model, and enhances the training effect.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN116229949B_ABST
Patent Text Reader

Abstract

The application provides an audio recognition method and a training method of an audio recognition model. The audio recognition method comprises: obtaining audio data to be recognized; performing audio recognition processing on the audio data to be recognized by using a pre-trained audio recognition model to obtain text data corresponding to the audio data; wherein the audio recognition model is obtained based on mask gradient updating of model parameters related to an audio recognition task in a first audio recognition model; the first audio recognition model is obtained by performing audio recognition training on an initial audio recognition model by using first audio data containing text pseudo-labels and second audio data containing text labels, and the text pseudo-labels are determined by performing audio recognition on the first audio data by the initial audio recognition model.

Description

Technical Field

[0001] This application relates to the field of speech recognition, specifically to an audio recognition method and a training method for an audio recognition model.

Background Technology

[0002] With the continuous development of artificial intelligence, audio recognition technology has been involved in all aspects of people's lives.

[0003] In existing technologies, audio recognition models used in audio recognition processes are usually obtained through semi-supervised training based on a large amount of unlabeled audio data. However, during the training process using a large amount of unlabeled audio data, it is difficult to grasp the homology between unlabeled data and audio recognition technology, and there are data points with domain offsets, which leads to poor performance in completing audio recognition tasks based on audio recognition models.

[0004] Therefore, how to complete the audio data recognition task with high quality has become a technical problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0005] This application provides an audio recognition method and a training method for an audio recognition model, to improve the quality of audio recognition tasks.

[0006] According to a first aspect of the embodiments of this application, an audio recognition method is provided, comprising:

[0007] Obtain the audio data to be recognized;

[0008] Using a pre-trained audio recognition model, the audio data to be recognized is processed to obtain text data corresponding to the audio data;

[0009] The audio recognition model is obtained by masking gradient updates of the model parameters related to the audio recognition task in the first audio recognition model. The first audio recognition model is obtained by training the initial audio recognition model with first audio data containing text pseudo-labels and second audio data containing text labels. The text pseudo-labels are determined by the initial audio recognition model performing audio recognition on the first audio data.

[0010] According to a second aspect of the embodiments of this application, a method for training an audio recognition model is provided, comprising:

[0011] The initial audio recognition model is trained using first audio data containing text pseudo-labels and second audio data containing text labels to obtain the trained first audio recognition model.

[0012] Determine the model parameters related to the audio recognition task in the first audio recognition model, and perform mask gradient update on the model parameters to obtain the audio recognition model with updated parameters.

[0013] In one optional embodiment of this application, before the step of training an initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels to obtain a trained first audio recognition model, the method further includes:

[0014] The initial audio recognition model is used to perform audio recognition processing on the first audio dataset to obtain the text pseudo-labels of each audio data in the first audio dataset.

[0015] Based on the text pseudo-label confidence of each audio data and the training stage of the initial audio recognition model, a preset number of audio data corresponding to the training stage are determined from the first audio dataset.

[0016] The preset number of audio data is used as the first audio data.

[0017] In one optional embodiment of this application, determining a preset number of audio data points corresponding to the training stage from the first audio dataset based on the text pseudo-label confidence of each audio data point and the training stage of the initial audio recognition model includes:

[0018] Based on the confidence level of each text pseudo-label, sort the audio data in the first audio dataset.

[0019] Based on the training phase of the initial audio recognition model, a preset number of audio data corresponding to the training phase are determined from the first audio dataset in descending order of confidence.

[0020] In one optional embodiment of this application, after obtaining the trained first audio recognition model, the method further includes:

[0021] Delete the first audio data in the first audio dataset, and add the preset number of new audio data to the first audio dataset.

[0022] In one optional embodiment of this application, the confidence level of the text pseudo-label is obtained in the following manner:

[0023] The audio data is input into the initial audio recognition model, and a perturbation is added to the audio recognition process of the initial audio recognition model to obtain the perturbation text pseudo-label of the audio data.

[0024] Determine the first confidence level of the text pseudo-label corresponding to the audio data, and the second confidence level of the perturbation text pseudo-label;

[0025] The confidence level of the text pseudo-label is obtained based on the edit distance between the text pseudo-label and the perturbation text pseudo-label, and the average confidence level of the first confidence level and the second confidence level.

[0026] In one optional embodiment of this application, determining the model parameters related to the audio recognition task in the first audio recognition model and performing mask gradient updates on the model parameters to obtain the parameter-updated audio recognition model includes:

[0027] Obtain the correlation between each model parameter in the first audio recognition model and the audio recognition task;

[0028] Based on the preset mask threshold and the correlation between each model parameter and the audio recognition task, determine the model parameters in the first audio recognition model that are related to the audio recognition task;

[0029] The model parameters related to the audio recognition task are updated using a mask gradient to obtain the audio recognition model with updated parameters.

[0030] In one optional embodiment of this application, it further includes:

[0031] The updated audio recognition model is used as the initial audio recognition model. The process of training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels is then repeated to obtain the trained first audio recognition model.

[0032] According to a third aspect of the embodiments of this application, an audio recognition device is provided, comprising:

[0033] The first unit is used to acquire the audio data to be recognized;

[0034] The second unit is used to perform audio recognition processing on the audio data to be recognized using a pre-trained audio recognition model to obtain text data corresponding to the audio data.

[0035] The audio recognition model is obtained by updating the mask gradient of the model parameters related to the audio recognition task in the first audio recognition model; the first audio recognition model is obtained by training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels.

[0036] According to a fourth aspect of the embodiments of this application, a training apparatus for an audio recognition model is provided, comprising:

[0037] The third unit is used to train the initial audio recognition model using the first audio data containing text pseudo-labels and the second audio data containing text labels, so as to obtain the trained first audio recognition model.

[0038] The fourth unit is used to determine the model parameters related to the audio recognition task in the first audio recognition model, and to perform mask gradient update on the model parameters to obtain the audio recognition model after parameter update.

[0039] According to a fifth aspect of the embodiments of this application, an electronic device is provided, comprising:

[0040] A processor;

[0041] A memory used to store instructions executable by the processor;

[0042] The processor is configured to execute the above method by running instructions in the memory.

[0043] According to a sixth aspect of the embodiments of this application, a computer-readable storage medium is provided, the storage medium storing a computer program that, when executed by a processor, performs the above-described method.

[0044] Compared with the prior art, this application has the following advantages:

[0045] This application provides an audio recognition method and an audio recognition model training method. The audio recognition method includes: acquiring audio data to be recognized; using a pre-trained audio recognition model to perform audio recognition processing on the audio data to be recognized to obtain text data corresponding to the audio data; wherein the audio recognition model is obtained based on mask gradient updates of model parameters related to the audio recognition task in a first audio recognition model; the first audio recognition model is obtained by training an initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels, wherein the text pseudo-labels are determined by the initial audio recognition model performing audio recognition on the first audio data.

[0046] This method utilizes mask gradient updates to update the model parameters related to the audio recognition task in the audio recognition model, avoiding the problem of difficulty in grasping the homology between unlabeled data and the audio recognition model during the training process using unlabeled sample data, and improving the audio recognition accuracy of the trained audio recognition model on the audio data to be recognized.

Attached Figure Description

[0047] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0048] Figure 1 is a schematic diagram illustrating an application scenario of the audio recognition method provided in the embodiments of this application;

[0049] Figure 2 is a flowchart of an audio recognition method provided in another embodiment of this application;

[0050] Figure 3 is a flowchart of a training method for an audio recognition model provided in another embodiment of this application;

[0051] Figure 4 is a schematic structural diagram of an audio recognition device provided in another embodiment of this application;

[0052] Figure 5 is a schematic structural diagram of a training device for an audio recognition model provided in another embodiment of this application;

[0053] Figure 6 is a schematic structural diagram of an electronic device provided in another embodiment of this application.

Detailed Implementation

[0054] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0055] With the continuous development of artificial intelligence, audio recognition technology has been involved in all aspects of people's lives.

[0056] In existing technologies, audio recognition models used in audio recognition processes are usually obtained through semi-supervised training based on a large amount of unlabeled audio data. However, during the training process using a large amount of unlabeled audio data, it is difficult to grasp the homology between unlabeled data and audio recognition technology, and there are data points with domain offsets, which leads to poor performance in completing audio recognition tasks based on audio recognition models.

[0057] Therefore, how to complete the audio data recognition task with high quality has become a technical problem that urgently needs to be solved by those skilled in the art.

[0058] To address the aforementioned technical problems, this application provides an audio recognition method and an audio recognition model training method, which will be described in detail in the following embodiments.

[0059] Exemplary Implementation Environment

[0060] To facilitate understanding of the audio recognition method and audio recognition model training method provided in this application, the specific application scenarios of the audio recognition method will first be introduced.

[0061] Please refer to Figure 1, which is a schematic diagram illustrating an application scenario of the audio recognition method provided in this application embodiment. In this scenario embodiment, the audio recognition method is applied to speaker audio recognition in a conference setting.

[0062] Figure 1 includes: audio to be recognized 101, audio recognition model 102, and audio transcription text 103.

[0063] The audio to be recognized 101 can be the audio of any speaker at the meeting. The audio recognition model 102 is specifically a neural network or machine learning model used to recognize the audio to be recognized 101.

[0064] In practical applications, the audio to be recognized 101 is input into the audio recognition model 102, so that the audio recognition model 102 performs audio recognition processing on the audio to be recognized 101 to obtain the audio transcription text 103 corresponding to the audio to be recognized 101.

[0065] Specifically, the audio recognition model 102 is trained through the following steps S101 to S105:

[0066] Step S101: Sort the first audio data in the audio cache pool by the confidence of their text pseudo-labels, and take the k first audio data with the highest confidence, each containing a text pseudo-label, where k is the current training stage of the audio recognition model 102.

[0067] Step S102: Obtain second audio data containing text labels.

[0068] Step S103: Use the first audio data containing text pseudo-labels and the second audio data containing text labels as training samples to train the audio recognition model 102.

[0069] Step S104: Determine the model parameters related to the audio recognition task in the audio recognition model obtained after training, perform mask gradient update on the model parameters, and synchronize the updated parameters to the audio recognition model 102.

[0070] Step S105: Return to step S101 to perform the next training phase on the audio recognition model 102.
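The loop formed by steps S101 to S105 can be sketched as follows. This is a minimal Python outline under stated assumptions, not the implementation described in this application: the callbacks `train_one_stage` and `masked_update`, the `Sample` tuple layout, and the helper names are hypothetical placeholders, and the number of pseudo-labeled items taken in stage k is assumed to equal k, as stated in step S101.

```python
from typing import Callable, List, Tuple

# Hypothetical layout: (audio_id, text_pseudo_label, confidence)
Sample = Tuple[str, str, float]
Labeled = Tuple[str, str]  # (audio_id, text_label)


def select_top_k(pool: List[Sample], k: int) -> List[Sample]:
    """S101: sort the cache pool by confidence (descending) and keep k items."""
    return sorted(pool, key=lambda s: s[2], reverse=True)[:k]


def run_stages(
    pool: List[Sample],
    labeled: List[Labeled],
    num_stages: int,
    train_one_stage: Callable[[List[Sample], List[Labeled]], None],
    masked_update: Callable[[], None],
) -> List[List[str]]:
    """Run stages 1..num_stages and return the audio ids used in each stage."""
    history = []
    for k in range(1, num_stages + 1):      # S105 is the loop itself
        pseudo = select_top_k(pool, k)      # S101
        train_one_stage(pseudo, labeled)    # S102 + S103
        masked_update()                     # S104
        history.append([s[0] for s in pseudo])
    return history


# Usage with no-op stand-ins for the two training callbacks.
pool = [("a", "hello", 0.9), ("b", "world", 0.4), ("c", "hi", 0.7)]
history = run_stages(pool, [("x", "ref")], 2, lambda p, l: None, lambda: None)
print(history)
```

In a real system the two callbacks would wrap the recognizer's training step and the mask gradient update; here they are left empty so that only the control flow is shown.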

[0071] It should be noted that the above description of the implementation scenarios of this application is only for the purpose of facilitating a better understanding of the audio recognition method provided in this application, and is not intended to limit the application scenarios of the audio recognition method. The audio recognition method can also be applied to other scenarios, such as for recognizing call speech, recognizing recording information, etc. This application does not impose any limitations on this.

[0072] Exemplary methods

[0073] This application also provides an audio recognition method, the core of which is to use mask gradient updates on the model parameters related to the audio recognition task in the audio recognition model, thereby avoiding the problem of difficulty in grasping the homology between unlabeled data and the audio recognition model during the training process of the audio recognition model with unlabeled sample data.

[0074] In one optional embodiment of this application, the entity implementing the audio recognition method may be a user terminal such as a laptop, tablet, desktop computer, set-top box, or mobile device (e.g., mobile phone, portable music player, personal digital assistant, dedicated messaging device, game console), any combination of two or more such terminals, or a server.

[0075] Please refer to Figure 2, which is a flowchart of an audio recognition method provided in another embodiment of this application.

[0076] As shown in Figure 2, the audio recognition method includes the following steps S201 and S202:

[0077] Step S201: Obtain the audio data to be recognized.

[0078] The audio data to be recognized can be understood as voice information that needs to be transcribed into text data. In one optional embodiment of this application, the audio data to be recognized can be acquired by an audio acquisition device such as a recorder or voice recorder; for example, in some conference scenarios, the audio data to be recognized can be the voice information of the conference speaker obtained through an audio acquisition device.

[0079] In another optional embodiment of this application, the audio data to be recognized may also be voice or audio data collected from the Internet, covering fields such as science and technology, medical care, and education.

[0080] Step S202: Using a pre-trained audio recognition model, perform audio recognition processing on the audio data to be recognized to obtain text data corresponding to the audio data;

[0081] The audio recognition model is obtained by masking gradient updates of the model parameters related to the audio recognition task in the first audio recognition model. The first audio recognition model is obtained by training the initial audio recognition model with first audio data containing text pseudo-labels and second audio data containing text labels. The text pseudo-labels are determined by the initial audio recognition model performing audio recognition on the first audio data.

[0082] The audio recognition model can be understood as a convolutional neural network. In specific applications, this application uses machine learning (ML) to obtain the audio recognition model. Machine learning (a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc.) is specifically used to study how to acquire new knowledge or skills through training samples, reorganize existing knowledge structures, and continuously improve its performance. Machine learning typically includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning, and is a branch of artificial intelligence (AI) technology.

[0083] Furthermore, to facilitate understanding of the audio recognition method provided in this application, the training method of the audio recognition model will first be introduced.

[0084] Please refer to Figure 3, which is a flowchart illustrating the training method for an audio recognition model provided in another embodiment of this application.

[0085] As shown in Figure 3, the training method for the audio recognition model includes the following steps S301 and S302:

[0086] Step S301: Use the first audio data containing text pseudo-labels and the second audio data containing text labels to train the initial audio recognition model and obtain the trained first audio recognition model.

[0087] The initial audio recognition model can be understood as a model with preliminary audio recognition capabilities. In one optional embodiment of this application, the initial audio recognition model can be trained using sample speech with text labels as training samples.

[0088] The first audio data can contain any one or more pieces of audio data. The text pseudo-label of the first audio data can be understood as a special form of text identifier of the first audio data; unlike a text label, the text pseudo-label may not correspond to the real text of the first audio data.

[0089] Furthermore, the first audio data specifically refers to the audio data in the first audio dataset. The pseudo-labels of the first audio data can be obtained by using the initial audio recognition model to perform audio recognition processing on the first audio dataset, thereby obtaining the text pseudo-labels of each audio data in the first audio dataset.

[0090] The text labels of the second audio data can be understood as the actual text identifiers of the second audio data.

[0091] In the embodiments of this application, the first audio data containing text pseudo-labels can be obtained from a first audio dataset, specifically through the following steps S1 to S3:

[0092] Step S1: Use the initial audio recognition model to perform audio recognition processing on the first audio dataset to obtain text pseudo-labels for each audio data in the first audio dataset.

[0093] Specifically, the process of obtaining the text pseudo-labels for each audio data in the first audio dataset can be represented by the following formula (1):

[0094] (Formula (1) is not reproduced in the source text.)

[0095] Where q represents a piece of audio data in the first audio dataset that is input to the initial audio recognition model; the output of formula (1) is the text pseudo-label of that audio data; and u represents the first audio dataset.

[0096] Step S2: Based on the text pseudo-label confidence of each audio data and the training stage of the initial audio recognition model, determine a preset number of audio data corresponding to the training stage from the first audio dataset.

[0097] Step S3: Use the preset number of audio data as the first audio data.

[0098] The confidence level of the text pseudo-labels for each audio data is obtained by using the robust confidence score (CRS) of the text pseudo-label.

[0099] Furthermore, in this embodiment of the application, the training of the initial audio recognition model is carried out in multiple different stages, wherein each stage requires the use of a different amount of first audio data from the first audio dataset.

[0100] Specifically, step S2 above includes:

[0101] First, the audio data is input into the initial audio recognition model, and a perturbation is added to the audio recognition process of the initial audio recognition model to obtain the perturbation text pseudo-labels of the audio data.

[0102] Next, determine the first confidence level of the text pseudo-label corresponding to the audio data, and the second confidence level of the perturbation text pseudo-label;

[0103] Finally, the confidence level of the text pseudo-label is obtained based on the edit distance between the text pseudo-label and the perturbation text pseudo-label, and the average confidence level of the first confidence level and the second confidence level.

[0104] Specifically, the first confidence level of the text pseudo-label corresponding to the audio data can be understood as the posterior probability of each sub-word token of the text pseudo-label. In practical applications, the first confidence level of the text pseudo-label can be expressed by the following formula (2):

[0105] CS=min(softmax(logit)) (2);

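[0106] Wherein, logit is the output of the initial audio recognition model for each sub-word token of the text pseudo-label, softmax(logit) is the corresponding posterior probability, and CS is the first confidence level of the text pseudo-label, taken as the minimum of these posterior probabilities.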
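Formula (2) can be illustrated with a short Python sketch. It assumes one reading of the formula that the text does not spell out: each token's posterior is taken as the softmax probability of its most likely vocabulary entry, and CS is the minimum of those values over the tokens of the pseudo-label. The function names and the toy logits are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's vocabulary logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_score(token_logits: List[List[float]]) -> float:
    """CS for one utterance: the weakest token posterior.

    Assumption: token_logits[t] holds the vocabulary logits for token t,
    and the posterior of token t is the largest softmax probability.
    """
    return min(max(softmax(row)) for row in token_logits)


# Two tokens over a three-entry vocabulary: the second token is uniform,
# so it is the least certain one and determines the score.
token_logits = [[2.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]]
cs = confidence_score(token_logits)
```

Taking the minimum means a single uncertain token is enough to lower the confidence of the whole pseudo-label.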

[0107] Furthermore, regarding the perturbation text pseudo-label: assume that any audio data in the first audio dataset is u_n, and that processing u_n with the initial audio recognition model yields its text pseudo-label. The process described above, in which audio data is input into the initial audio recognition model and a perturbation is added to the audio recognition process of that model to obtain the perturbation text pseudo-label, can be understood as applying weak augmentation to the audio data u_n as the perturbation. The perturbation is propagated through the initial audio recognition model while it recognizes u_n, thereby yielding the perturbation text pseudo-label.

[0108] After obtaining the perturbation text pseudo-label, the second confidence level of the perturbation text pseudo-label is obtained in a similar manner to that provided by the above formula (2).

[0109] Furthermore, after obtaining the first confidence level of the text pseudo-label and the second confidence level of the perturbation text pseudo-label, the mean information of the first confidence level and the second confidence level is determined; at the same time, the edit distance between the text pseudo-label and the perturbation text pseudo-label is calculated, and the edit distance is used as a penalty term to obtain the confidence level of the text pseudo-label.

[0110] Specifically, the confidence level of the text pseudo-label can be represented by the following formula (3):

[0111] (Formula (3) is not reproduced in the source text.)

[0112] The terms of formula (3) are: the first confidence level of the text pseudo-label; the second confidence level of the perturbation text pseudo-label; the edit distance between the text pseudo-label and the perturbation text pseudo-label; λ, the balancing weight between the text pseudo-label and the perturbation text pseudo-label; and l, the length of the text pseudo-label.
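Because the body of formula (3) is not available in this text, the following Python sketch only shows one plausible way to combine the quantities named above. It assumes the form: mean of the two confidence levels, minus λ times the edit distance normalized by the pseudo-label length l. This form is an assumption consistent with the description of the edit distance as a penalty term, not the formula of this application; `robust_confidence` and its parameters are hypothetical names.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def robust_confidence(label: Sequence[str], perturbed: Sequence[str],
                      cs_first: float, cs_second: float,
                      lam: float = 1.0) -> float:
    """Assumed form: mean(CS1, CS2) - lam * ED(label, perturbed) / len(label)."""
    mean_cs = (cs_first + cs_second) / 2.0
    length = max(len(label), 1)  # guard against an empty pseudo-label
    return mean_cs - lam * edit_distance(label, perturbed) / length


label = "the cat sat".split()
perturbed = "the cat sit".split()
score = robust_confidence(label, perturbed, 0.8, 0.6, lam=0.3)
```

Under this assumed form, a pseudo-label that changes when the input is perturbed is penalized in proportion to how much of it changed, so stable transcriptions rank higher.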

[0113] After obtaining the text pseudo-label confidence of the audio data, in accordance with the method provided in step S2 above, and in conjunction with the training phase of the initial audio recognition model, a preset number of audio data corresponding to the training phase is determined from the first audio dataset.

[0114] Specifically, the audio data in the first audio dataset is sorted according to the confidence level of each text pseudo-label; and a preset number of audio data corresponding to the training stage are determined from the first audio dataset in descending order of confidence level, based on the training stage of the initial audio recognition model.
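The sorting and selection just described can be sketched in Python as follows. The schedule that maps a training stage to a preset number of items is not given in the text, so `per_stage` below is a hypothetical hyperparameter, and the dictionary of confidences is toy data.

```python
from typing import Dict, List


def select_for_stage(confidences: Dict[str, float], stage: int,
                     per_stage: int = 1) -> List[str]:
    """Return the ids of the highest-confidence items for one stage.

    Assumption: stage s uses s * per_stage pseudo-labeled items, capped
    at the size of the dataset.
    """
    count = min(stage * per_stage, len(confidences))
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:count]


confidences = {"u1": 0.55, "u2": 0.91, "u3": 0.12, "u4": 0.78}
stage_one = select_for_stage(confidences, stage=1)
stage_three = select_for_stage(confidences, stage=3)
```

Later stages admit progressively lower-confidence pseudo-labels, which matches the idea of gradually enlarging the pseudo-labeled portion of the training samples.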

[0115] To facilitate understanding of the above process, we will first introduce the training of the initial audio recognition model and the training samples used in different stages:

[0116] In this embodiment of the application, the training of the initial audio recognition model is implemented based on samples of audio data with progressively enhanced text pseudo-labels.

[0117] Specifically, assuming that the training of the initial audio recognition model includes K stages, then for the kth training stage of the initial audio recognition model, the training samples used to train the initial audio recognition model include: k first audio data containing text pseudo-labels in the first audio dataset, and second audio data containing text labels.

[0118] That is, as the training phase of the initial audio recognition model increases, the training samples contain more audio data with text pseudo-labels. Specifically, the number of audio data with text pseudo-labels in the training samples can be controlled by hyperparameters.

[0119] In one optional embodiment of this application, in order to improve the training efficiency of the initial audio recognition model and the diversity of training data during the training of the initial audio recognition model, after the initial audio recognition model has completed one stage of training, the method further includes:

[0120] Delete the first audio data in the first audio dataset, and add the preset number of new audio data to the first audio dataset.

[0121] For example, suppose that in the nth training stage of the initial audio recognition model, n first audio data containing text pseudo-labels are selected from the first audio dataset. Then, in the (n+1)th training stage of the initial audio recognition model, the aforementioned n first audio data containing text pseudo-labels are deleted from the first audio dataset, and n new audio data are added to the first audio dataset. When training the initial audio recognition model in the (n+1)th stage, first audio data containing text pseudo-labels are reselected from the first audio dataset with the added new audio data.
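The pool refresh in this example can be sketched in Python as below. The pool is modeled as a hypothetical mapping from audio id to pseudo-label confidence, and the incoming items are assumed to already carry confidences computed by the current model; neither detail comes from the text.

```python
from typing import Dict, Iterable, List, Tuple


def refresh_pool(pool: Dict[str, float], used_ids: Iterable[str],
                 incoming: List[Tuple[str, float]]) -> Dict[str, float]:
    """Drop the items used in the finished stage and add as many new ones."""
    used = set(used_ids)
    refreshed = {k: v for k, v in pool.items() if k not in used}
    # Add exactly as many new items as were removed from the pool.
    for audio_id, confidence in incoming[:len(used)]:
        refreshed[audio_id] = confidence
    return refreshed


pool = {"u1": 0.55, "u2": 0.91, "u3": 0.12}
new_pool = refresh_pool(pool, ["u2"], [("u4", 0.60), ("u5", 0.30)])
```

Returning a new dictionary rather than mutating `pool` keeps the state of the previous stage available for inspection.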

[0122] Furthermore, as the training of the initial audio recognition model advances through its stages, the amount of first audio data containing text pseudo-labels used for training also increases. However, this increase causes the model to move from overfitting the data in the early training stages to underfitting it in the later training stages, so that the training samples of the initial audio recognition model exhibit a sample domain shift.

[0123] To eliminate the bias during the initial audio recognition model training process, the following step S302 is further performed.

[0124] Step S302: Determine the model parameters related to the audio recognition task in the first audio recognition model, and perform mask gradient update on the model parameters to obtain the audio recognition model after parameter update.

[0125] Specifically, step S302 above includes steps S4 to S6:

[0126] Step S4: Obtain the correlation between each model parameter in the first audio recognition model and the audio recognition task.

[0127] Step S5: Based on the preset mask threshold and the correlation between each model parameter and the audio recognition task, determine the model parameters in the first audio recognition model that are related to the audio recognition task.

[0128] Step S6: Perform mask gradient update on the model parameters related to the audio recognition task to obtain the audio recognition model with updated parameters.

[0129] In this embodiment of the application, the correlation between each model parameter in the first audio recognition model and the audio recognition task can be represented by the Fisher Information Matrix (FIM) of each parameter in the first audio recognition model. Specifically, the Fisher Information Matrix of each model parameter can be obtained by the following formula (4):

[0130]

[0131] Where F(w) represents the Fisher information matrix of each model parameter of the first audio recognition model; x and y represent the audio data input to the first audio recognition model and the text data output by the first audio recognition model, respectively.
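Formula (4) is not reproduced in this text. For orientation only, the Fisher information matrix is commonly written in the form below, where $p(y \mid x; w)$ denotes the model's output distribution. This standard form is an assumption and may differ in detail from the formula in the original filing.

```latex
F(w) = \mathbb{E}\left[
  \left( \frac{\partial \log p(y \mid x; w)}{\partial w} \right)
  \left( \frac{\partial \log p(y \mid x; w)}{\partial w} \right)^{\top}
\right]
```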

[0132] After obtaining the aforementioned Fisher information matrix, the model parameters related to the audio recognition task are estimated using the diagonal elements of the matrix. Specifically, the Fisher information of each parameter in the first audio recognition model is determined. In this embodiment, the Fisher information of the i-th parameter of the first audio recognition model can be obtained by the following formula (5):

[0133]

[0134] Where i represents the i-th parameter of the first audio recognition model, and D represents the number of the i-th parameter.

[0135] Furthermore, the more important the i-th parameter is to the audio recognition task, the higher the value of the Fisher information for that parameter.
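A diagonal estimate of this kind can be sketched as the mean of squared per-sample gradients. The sketch below is illustrative only: it assumes that the per-sample gradients of the log-likelihood are already available and that D counts the samples being averaged over, which the text above does not state clearly. The name `diagonal_fisher` is hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal of the Fisher information matrix.

    per_sample_grads -- list of gradient vectors, one per sample; each vector
                        holds d log p(y|x; w) / d w_i for every parameter i.
    Returns a list whose i-th entry is the mean squared gradient of parameter i.
    """
    if not per_sample_grads:
        raise ValueError("at least one sample is required")
    num_samples = len(per_sample_grads)
    num_params = len(per_sample_grads[0])
    fisher = [0.0] * num_params
    for grad in per_sample_grads:
        for i, g in enumerate(grad):
            fisher[i] += g * g
    return [value / num_samples for value in fisher]


# Two samples, three parameters (made-up gradient values).
grads = [[1.0, 0.0, -2.0], [3.0, 0.0, 0.0]]
fisher = diagonal_fisher(grads)
```

A parameter whose gradient is consistently zero receives zero Fisher information, which matches the statement that less important parameters have lower values.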

[0136] Furthermore, in order to filter out the model parameters related to the audio recognition task from the various parameters of the first audio recognition model, this embodiment of the application filters each parameter in the first audio recognition model through a preset mask threshold.

[0137] Specifically, assuming the first audio recognition model is obtained after the t-th training stage of the initial audio recognition model, a 0-1 mask $M_t$ is preset, and the model parameters related to the audio recognition task are obtained by filtering using the following formula (6).

[0138]

[0139] Where σ is the mask threshold. When the Fisher information of the i-th model parameter is greater than the preset mask threshold, the weight of that parameter needs to be updated; when the Fisher information of the i-th model parameter is less than or equal to the preset mask threshold, the weight of that parameter does not need to be updated.
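The thresholding rule just described can be sketched in a few lines. This is an illustrative sketch, not the formula of this application; the function name `generate_mask` and the example values are assumptions.

```python
def generate_mask(fisher, threshold):
    """Return a 0-1 mask: 1 where the Fisher information exceeds the threshold.

    Parameters whose Fisher information is less than or equal to the
    threshold get 0, so their gradients are suppressed later on.
    """
    return [1 if value > threshold else 0 for value in fisher]


# Made-up Fisher values; the third one equals the threshold and is masked out.
mask = generate_mask([5.0, 0.0, 2.0], threshold=2.0)
```

The comparison is strict, so a parameter whose Fisher information equals the threshold is not updated.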

[0140] Specifically, the process of updating the parameters can be achieved through the following formulas (7)-(15):

[0141]

[0142] Where $w_t$ represents the parameters to be updated in the initial audio recognition model at the t-th training stage, and $g_t$ represents the gradient value of the parameters to be updated in the initial audio recognition model at the t-th training stage.

[0143] $C_t = \mathrm{getNetParam}(M_\theta)$ (8);

[0144] Where $M_\theta$ represents the initial audio recognition model, and $C_t$ indicates the network to which the parameter $w_t$ belongs.

[0145] $M_t = \mathrm{GenerateMask}(C_t)$ (9);

[0146] Where $M_t$ represents the mask matrix added to the initial audio recognition model.

[0147] $g_{t+1} = g_t \odot M_t$ (10);

[0148] Where $g_{t+1}$ represents the gradient value of the parameter $w_t$ after updating based on the mask matrix.

[0149] $m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_{t+1}$ (11);

[0150]

[0151] Where $m_t$ represents the first-order moment vector, $v_t$ represents the second-order moment vector, and $\beta_1$ represents the exponential decay rate of the parameter to be updated, where $\beta_1 \in (0,1)$.

[0152]

[0153]

[0154] The two preceding formulas give, respectively, the first-order moment vector after bias correction and the second-order moment vector after bias correction.

[0155]

[0156] Where η represents the learning rate of the initial audio recognition model, and $w_{t+1}$ represents the parameter $w_t$ after the weight update.
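The update chain can be sketched as one Adam-style step applied to a masked gradient. Only formulas (10) and (11) survive in this text, so the second-moment, bias-correction, and weight-update lines below follow the standard Adam optimizer and are assumptions. The names `masked_adam_step`, `beta2`, and `eps` are hypothetical.

```python
import math


def masked_adam_step(w, grad, mask, m, v, step, lr=0.001,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update in which masked-out gradients are zeroed.

    w, grad, mask, m, v are equal-length lists; step is the 1-based step count.
    Returns the updated (w, m, v).
    """
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, k_i, m_i, v_i in zip(w, grad, mask, m, v):
        g_i = g_i * k_i                                # masked gradient, formula (10)
        m_i = beta1 * m_i + (1.0 - beta1) * g_i        # first moment, formula (11)
        v_i = beta2 * v_i + (1.0 - beta2) * g_i * g_i  # second moment (assumed form)
        m_hat = m_i / (1.0 - beta1 ** step)            # bias correction (assumed form)
        v_hat = v_i / (1.0 - beta2 ** step)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


# Two parameters with the same gradient; only the first is selected by the mask.
w, m, v = masked_adam_step(
    w=[1.0, 1.0], grad=[0.5, 0.5], mask=[1, 0],
    m=[0.0, 0.0], v=[0.0, 0.0], step=1, lr=0.1,
)
```

In this sketch the masked parameter keeps its value because its moments start at zero. Since the mask acts on the gradient rather than on the final update, a parameter with nonzero moment history would still move slightly.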

[0157] In summary, the audio recognition method provided in this application uses mask gradient updates on the model parameters related to the audio recognition task in the audio recognition model. This avoids the problem of difficulty in grasping the homology between unlabeled data and the audio recognition model during the training process using unlabeled sample data, and improves the audio recognition accuracy of the trained audio recognition model on the audio data to be recognized.

[0158] Exemplary device

[0159] Accordingly, this application also provides an audio recognition device. Please refer to Figure 4, which is a schematic diagram of the structure of an audio recognition device provided in another embodiment of this application.

[0160] As shown in Figure 4, the audio recognition device includes:

[0161] The first unit 401 acquires the audio data to be recognized;

[0162] The second unit 402 uses a pre-trained audio recognition model to perform audio recognition processing on the audio data to be recognized, and obtains text data corresponding to the audio data.

[0163] The audio recognition model is obtained by performing mask gradient updates on the model parameters related to the audio recognition task in the first audio recognition model. The first audio recognition model is obtained by training the initial audio recognition model with first audio data containing text pseudo-labels and second audio data containing text labels. The text pseudo-labels are determined by the initial audio recognition model performing audio recognition on the first audio data.

[0164] The audio recognition device provided in this embodiment belongs to the same concept as the audio recognition method provided in the above embodiments of this application. It can execute the audio recognition method provided in any of the above embodiments of this application and has the corresponding functional modules and beneficial effects for executing the audio recognition method. Technical details not described in detail in this embodiment can be found in the specific processing content of the audio recognition method provided in the above embodiments of this application, and will not be repeated here.

[0165] This application also provides a training device for an audio recognition model. Please refer to Figure 5, which is a schematic diagram of the structure of a training device for an audio recognition model provided in another embodiment of this application.

[0166] As shown in Figure 5, the training device for the audio recognition model includes:

[0167] Unit 501 uses the first audio data containing text pseudo-labels and the second audio data containing text labels to train the initial audio recognition model and obtain the trained first audio recognition model.

[0168] Unit 502 determines the model parameters related to the audio recognition task in the first audio recognition model, and performs mask gradient update on the model parameters to obtain the audio recognition model after parameter update.

[0169] In one optional embodiment of this application, before the step of training an initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels to obtain a trained first audio recognition model, the method further includes:

[0170] The initial audio recognition model is used to perform audio recognition processing on the first audio dataset to obtain the text pseudo-labels of each audio data in the first audio dataset.

[0171] Based on the text pseudo-label confidence of each audio data and the training stage of the initial audio recognition model, a preset number of audio data corresponding to the training stage are determined from the first audio dataset.

[0172] The preset number of audio data is used as the first audio data.

[0173] In one optional embodiment of this application, determining a preset number of audio data points corresponding to the training stage from the first audio dataset based on the text pseudo-label confidence of each audio data point and the training stage of the initial audio recognition model includes:

[0174] Based on the confidence level of each text pseudo-label, sort the audio data in the first audio dataset.

[0175] Based on the training phase of the initial audio recognition model, a preset number of audio data corresponding to the training phase are determined from the first audio dataset in descending order of confidence.
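The confidence-ranked selection can be sketched as a sort followed by a stage-dependent cut. This is a minimal sketch under stated assumptions: the helper `select_for_stage` and the `count_for_stage` schedule are hypothetical, and the schedule used in the usage lines (stage n selects n items) mirrors the worked example given earlier for the dataset refresh.

```python
def select_for_stage(items, confidences, stage, count_for_stage):
    """Pick the most confident pseudo-labeled items for a training stage.

    items            -- audio identifiers
    confidences      -- pseudo-label confidence per item, same order as items
    stage            -- 1-based training stage index
    count_for_stage  -- hypothetical callable mapping a stage to the preset count
    """
    ranked = sorted(zip(items, confidences), key=lambda pair: pair[1], reverse=True)
    count = min(count_for_stage(stage), len(ranked))
    return [item for item, _ in ranked[:count]]


selected = select_for_stage(
    items=["a1", "a2", "a3", "a4"],
    confidences=[0.2, 0.9, 0.5, 0.7],
    stage=2,
    count_for_stage=lambda s: s,  # assumed schedule: stage n selects n items
)
```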

[0176] In one optional embodiment of this application, after obtaining the trained first audio recognition model, the method further includes:

[0177] Delete the first audio data in the first audio dataset, and add the preset number of new audio data to the first audio dataset.

[0178] In one optional embodiment of this application, the confidence level of the text pseudo-label is obtained in the following manner:

[0179] The audio data is input into the initial audio recognition model, and a perturbation is added to the audio recognition process of the initial audio recognition model to obtain the perturbation text pseudo-label of the audio data.

[0180] Determine the first confidence level of the text pseudo-label corresponding to the audio data, and the second confidence level of the perturbation text pseudo-label;

[0181] The confidence level of the text pseudo-label is obtained based on the edit distance between the text pseudo-label and the perturbation text pseudo-label, and the average confidence level of the first confidence level and the second confidence level.
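One way to combine the two signals is sketched below. The paragraphs above only state that the confidence depends on the edit distance and on the average of the two confidence levels; the specific combination used here (average confidence scaled by one minus the length-normalized edit distance) is an assumption made for illustration, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pseudo_label_confidence(label, perturbed_label, conf, perturbed_conf):
    """Combine agreement under perturbation with the average model confidence."""
    average = (conf + perturbed_conf) / 2.0
    longest = max(len(label), len(perturbed_label), 1)
    agreement = 1.0 - edit_distance(label, perturbed_label) / longest
    return average * agreement


# Made-up transcripts and confidence values.
score = pseudo_label_confidence("hello world", "hello word", 0.9, 0.7)
```

With this choice, a pseudo-label that is unchanged under perturbation keeps the full average confidence, while a pseudo-label that changes substantially is penalized in proportion to how much it changed.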

[0182] In one optional embodiment of this application, determining the model parameters related to the audio recognition task in the first audio recognition model and performing mask gradient updates on the model parameters to obtain the parameter-updated audio recognition model includes:

[0183] Obtain the correlation between each model parameter in the first audio recognition model and the audio recognition task;

[0184] Based on the preset mask threshold and the correlation between each model parameter and the audio recognition task, determine the model parameters in the first audio recognition model that are related to the audio recognition task;

[0185] The model parameters related to the audio recognition task are updated using a mask gradient to obtain the audio recognition model with updated parameters.

[0186] In an optional embodiment of this application, the training device for the audio recognition model is further used for:

[0187] The updated audio recognition model is used as the initial audio recognition model. The process of training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels is then repeated to obtain the trained first audio recognition model.

[0188] The audio recognition model training device provided in this embodiment belongs to the same application concept as the audio recognition model training method provided in the above embodiments of this application. It can execute the audio recognition model training method provided in any of the above embodiments of this application and has the corresponding functional modules and beneficial effects for training the audio recognition model. Technical details not described in detail in this embodiment can be found in the specific processing content of the audio recognition model training method provided in the above embodiments of this application, and will not be repeated here.

[0189] Exemplary electronic devices

[0190] Another embodiment of this application also proposes an electronic device. Please refer to Figure 6, which is a schematic diagram of the structure of an electronic device provided in another embodiment of this application. The electronic device includes:

[0191] Memory 200 and processor 210;

[0192] The memory 200 is connected to the processor 210 and is used to store programs;

[0193] The processor 210 is used to implement the audio recognition method or audio recognition model training method disclosed in the above embodiments by running the program stored in the memory 200.

[0194] Specifically, the aforementioned electronic device may also include: a bus, a communication interface 220, an input device 230, and an output device 240.

[0195] The processor 210, memory 200, communication interface 220, input device 230, and output device 240 are interconnected via a bus. Among them:

[0196] A bus can include a pathway for transmitting information between various components of a computer system.

[0197] The processor 210 can be a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, or an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of the present invention. It can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

[0198] Processor 210 may include a main processor, as well as a baseband chip, modem, etc.

[0199] The memory 200 stores a program that executes the technical solution of this invention, and may also store an operating system and other key business functions. Specifically, the program may include program code, which includes computer operation instructions. More specifically, the memory 200 may include read-only memory (ROM), other types of static storage devices capable of storing static information and instructions, random access memory (RAM), other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, etc.

[0200] Input device 230 may include a device for receiving user input data and information, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor.

[0201] Output device 240 may include devices that allow information to be output to a user, such as a display screen, printer, speaker, etc.

[0202] The communication interface 220 may include a device that uses any transceiver to communicate with other devices or communication networks, such as Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), etc.

[0203] The processor 210 executes the program stored in the memory 200 and calls other devices, which can be used to implement the various steps of the audio recognition method or audio recognition model training method provided in the above embodiments of this application.

[0204] Exemplary computer program products and storage media

[0205] In addition to the methods and devices described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps in the audio recognition method or audio recognition model training method according to various embodiments of this application as described in the "Exemplary Methods" section of this specification.

[0206] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0207] Furthermore, embodiments of this application may also be storage media storing a computer program which, when executed by a processor, causes the processor to perform the steps of the audio recognition method or audio recognition model training method according to various embodiments of this application described in the "Exemplary Methods" section above. Specifically, the following steps can be implemented:

[0208] Step S201: Obtain the audio data to be recognized;

[0209] Step S202: Using a pre-trained audio recognition model, perform audio recognition processing on the audio data to be recognized to obtain text data corresponding to the audio data;

[0210] The audio recognition model is obtained by performing mask gradient updates on the model parameters related to the audio recognition task in the first audio recognition model. The first audio recognition model is obtained by training the initial audio recognition model with first audio data containing text pseudo-labels and second audio data containing text labels. The text pseudo-labels are determined by the initial audio recognition model performing audio recognition on the first audio data.

[0211] or,

[0212] Step S301: Use the first audio data containing text pseudo-labels and the second audio data containing text labels to train the initial audio recognition model and obtain the trained first audio recognition model.

[0213] Step S302: Determine the model parameters related to the audio recognition task in the first audio recognition model, and perform mask gradient update on the model parameters to obtain the audio recognition model after parameter update.

[0214] For the foregoing method embodiments, in order to simplify the description, they are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0215] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. For similar or identical parts, the embodiments can be referred to one another. Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant parts, refer to the descriptions in the method embodiments.

[0216] The steps in the methods of the various embodiments of this application can be reordered, merged, or deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.

[0217] The modules and sub-modules in the various embodiments of the present application's devices and terminals can be merged, divided, and deleted according to actual needs.

[0218] It should be understood that the disclosed terminals, devices, and methods can be implemented in other ways, given the several embodiments provided in this application. For example, the terminal embodiments described above are merely illustrative. For instance, the division of modules or sub-modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms.

[0219] The modules or submodules described as separate components may or may not be physically separate. The components that constitute a module or submodule may or may not be physical modules or submodules; that is, they may be located in one place or distributed across multiple network modules or submodules. Some or all of the modules or submodules can be selected to achieve the purpose of this embodiment's solution, depending on actual needs.

[0220] Furthermore, the functional modules or sub-modules in the various embodiments of this application can be integrated into one processing module, or each module or sub-module can exist physically separately, or two or more modules or sub-modules can be integrated into one module. The integrated modules or sub-modules described above can be implemented in hardware or in the form of software functional modules or sub-modules.

[0221] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0222] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software unit executed by a processor, or a combination of both. The software unit can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0223] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0224] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An audio recognition method, characterized in that it comprises: obtaining the audio data to be recognized; and processing the audio data to be recognized using a pre-trained audio recognition model to obtain text data corresponding to the audio data; wherein the audio recognition model is obtained by performing mask gradient updates on the model parameters in the first audio recognition model whose Fisher information is greater than a preset mask threshold; the Fisher information is obtained according to the Fisher information matrix of the model parameters; the first audio recognition model is obtained by training the initial audio recognition model with first audio data containing text pseudo-labels and second audio data containing text labels; and the text pseudo-labels are determined by the initial audio recognition model performing audio recognition on the first audio data.

2. A training method for an audio recognition model, characterized in that it comprises: training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels to obtain the trained first audio recognition model; and determining the model parameters related to the audio recognition task in the first audio recognition model, and performing mask gradient updates on the model parameters whose Fisher information is greater than a preset mask threshold to obtain the audio recognition model with updated parameters; wherein the Fisher information is obtained based on the Fisher information matrix of the model parameters.

3. The method according to claim 2, characterized in that, before the step of training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels to obtain the trained first audio recognition model, the method further comprises: using the initial audio recognition model to perform audio recognition processing on the first audio dataset to obtain the text pseudo-labels of each audio data in the first audio dataset; determining, based on the text pseudo-label confidence of each audio data and the training stage of the initial audio recognition model, a preset number of audio data corresponding to the training stage from the first audio dataset; and using the preset number of audio data as the first audio data.

4. The method according to claim 3, characterized in that the step of determining a preset number of audio data corresponding to the training stage from the first audio dataset based on the text pseudo-label confidence of each audio data and the training stage of the initial audio recognition model comprises: sorting the audio data in the first audio dataset based on the confidence level of each text pseudo-label; and determining, based on the training stage of the initial audio recognition model, a preset number of audio data corresponding to the training stage from the first audio dataset in descending order of confidence.

5. The method according to claim 3, characterized in that, after obtaining the trained first audio recognition model, the method further comprises: deleting the first audio data from the first audio dataset, and adding the preset number of new audio data to the first audio dataset.

6. The method according to claim 3, characterized in that the confidence level of the text pseudo-labels is obtained in the following way: inputting the audio data into the initial audio recognition model, and adding a perturbation to the audio recognition process of the initial audio recognition model to obtain the perturbation text pseudo-label of the audio data; determining the first confidence level of the text pseudo-label corresponding to the audio data, and the second confidence level of the perturbation text pseudo-label; and obtaining the confidence level of the text pseudo-label based on the edit distance between the text pseudo-label and the perturbation text pseudo-label, and the average confidence level of the first confidence level and the second confidence level.

7. The method according to claim 2, characterized in that the step of determining the model parameters related to the audio recognition task in the first audio recognition model and performing mask gradient updates on the model parameters to obtain the parameter-updated audio recognition model comprises: obtaining the correlation between each model parameter in the first audio recognition model and the audio recognition task; determining, based on the preset mask threshold and the correlation between each model parameter and the audio recognition task, the model parameters in the first audio recognition model that are related to the audio recognition task; and performing mask gradient updates on the model parameters related to the audio recognition task to obtain the audio recognition model with updated parameters.

8. The method according to claim 2, further comprising: The updated audio recognition model is used as the initial audio recognition model. The process of training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels is then repeated to obtain the trained first audio recognition model.

9. An audio recognition device, characterized in that it comprises: a first unit, used to acquire the audio data to be recognized; and a second unit, used to perform audio recognition processing on the audio data to be recognized using a pre-trained audio recognition model to obtain text data corresponding to the audio data; wherein the audio recognition model is obtained by performing mask gradient updates on the model parameters in the first audio recognition model whose Fisher information is greater than a preset mask threshold; the Fisher information is obtained from the Fisher information matrix of the model parameters; and the first audio recognition model is obtained by training the initial audio recognition model using first audio data containing text pseudo-labels and second audio data containing text labels.

10. A training device for an audio recognition model, characterized in that it comprises: a third unit, used to train the initial audio recognition model using the first audio data containing text pseudo-labels and the second audio data containing text labels, so as to obtain the trained first audio recognition model; and a fourth unit, used to determine the model parameters related to the audio recognition task in the first audio recognition model, and to perform mask gradient update on the model parameters whose Fisher information is greater than a preset mask threshold to obtain the audio recognition model with updated parameters; wherein the Fisher information is obtained based on the Fisher information matrix of the model parameters.

11. An electronic device, characterized in that it comprises: a processor; and a memory used to store instructions executable by the processor; wherein the processor is configured to execute the method described in any one of claims 1-8 by running the instructions in the memory.

12. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, performs the method described in any one of claims 1-8.