Speech recognition model adjustment method and device, and electronic device
By processing and recognizing audio data in the speech database, misrecognized audio data is selected as negative samples to train the initial speech recognition model. An autoencoder is introduced to adjust the model structure, which solves the problem of high misrecognition rate in the existing command word recognition model and improves the recognition accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGZHOU SHIYUAN ELECTRONICS CO LTD
- Filing Date
- 2022-03-03
- Publication Date
- 2026-06-16
AI Technical Summary
Existing command word recognition models have a high false recognition rate, especially in complex environments where they are easily affected by microphone performance and external noise, leading to unreliable recognition results.
By acquiring audio data from a speech database, a first initial audio data excluding command words and a second initial audio data including command words are separated. After processing, command word recognition is performed. The misidentified audio data is selected as negative sample data. The initial speech recognition model is trained using the negative sample and the second initial audio data. An autoencoder is introduced to prevent errors from negative sample data. The generation part and the output layer are adjusted to maintain the consistency of the model structure.
It effectively reduced the false recognition rate of the command word recognition model and improved the recognition accuracy of the model in complex environments.
Figure CN116741152B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of speech recognition, and more specifically, to a method, apparatus, computer-readable storage medium, processor, and electronic device for adjusting a speech recognition model.
Background Technology
[0002] Command word recognition is a subfield of speech recognition. It is typically performed offline, with minimal computational requirements, and is generally used for controlling terminal devices (including wake-up). In more complex situations, command word recognition is highly susceptible to objective factors such as microphone performance and ambient noise, resulting in unreliable recognition results. The most significant issue is false recognition, which is particularly common in multi-command word recognition models.
[0003] Therefore, the high false recognition rate of existing command word recognition models urgently needs to be addressed.
[0004] The information disclosed in this background section is intended only to enhance understanding of the background of the technology described herein. The background section may therefore contain information that does not constitute prior art already known to those skilled in the art in this country.
Summary of the Invention
[0005] The main objective of this application is to provide a method, apparatus, computer-readable storage medium, processor, and electronic device for adjusting a speech recognition model, in order to solve the problem of high misrecognition rate in existing command word recognition models.
[0006] According to one aspect of the present invention, a method for adjusting a speech recognition model is provided, comprising: acquiring audio data from a speech database to obtain first initial audio data and second initial audio data, wherein the first initial audio data does not include command words and the second initial audio data includes the command words; processing the first initial audio data to obtain audio processing data, wherein the audio processing data is different from the first initial audio data; using the initial speech recognition model to perform command word recognition on the audio processing data to obtain a target recognition result, wherein the target recognition result is used to characterize that the corresponding audio processing data includes the command words, and the audio processing data corresponding to the target recognition result constitutes negative sample data; and training at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model.
[0007] Optionally, the initial speech recognition model includes an initial generation part and an initial output layer. Training at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model includes: obtaining an initial reconstruction part based on the initial generation part, wherein the initial reconstruction part and the initial generation part constitute an initial autoencoder; training the initial reconstruction part using at least a portion of the second initial audio data to obtain a first target reconstruction part; and training the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain the target speech recognition model.
[0008] Optionally, training the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model includes: training the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain a target generation part, a target output layer, and a second target reconstruction part; deleting the second target reconstruction part to obtain the target speech recognition model, wherein the target speech recognition model includes the target generation part and the target output layer.
[0009] Optionally, training the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model includes: recognizing the audio processing data in the negative sample data as text to obtain text data, wherein the audio processing data and the text data constitute negative sample training data; and training the initial speech recognition model using at least a portion of the second initial audio data and the negative sample training data to obtain the target speech recognition model.
[0010] Optionally, the first initial audio data is processed to obtain audio processed data, including: adding scene features to the first initial audio data to obtain the audio processed data, wherein the scene features include at least one of the following: ambient noise, rate perturbation, and reverberation.
[0011] Optionally, adding scene features to the first initial audio data to obtain the audio processing data includes: extracting a portion of the first initial audio data to obtain a first initial sub-audio; and adding the scene features to the first initial sub-audio to obtain the audio processing data.
[0012] Optionally, there are multiple first initial audio data. The first initial audio data is processed to obtain audio processed data, including: extracting portions of at least two of the first initial audio data to obtain multiple second initial sub-audio data; and superimposing the multiple second initial sub-audio data to obtain the audio processed data.
[0013] Optionally, after acquiring audio data from the speech database and obtaining first initial audio data and second initial audio data, and before using the initial speech recognition model to perform command word recognition on the audio processing data, the method further includes: constructing the initial speech recognition model based on the second initial audio data.
[0014] According to another aspect of the present invention, a speech recognition model adjustment device is also provided, comprising: an acquisition unit, configured to acquire audio data from a speech database to obtain first initial audio data and second initial audio data, wherein the first initial audio data does not include command words and the second initial audio data includes the command words; a processing unit, configured to process the first initial audio data to obtain processed audio data, wherein the processed audio data is different from the first initial audio data; a recognition unit, configured to use the initial speech recognition model to perform command word recognition on the processed audio data to obtain a target recognition result, wherein the target recognition result is used to characterize that the corresponding processed audio data includes the command words, and the processed audio data corresponding to the target recognition result constitutes negative sample data; and a training unit, configured to train at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model.
[0015] According to another aspect of the present invention, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored program, wherein the program executes any one of the methods described.
[0016] According to another aspect of the present invention, a processor is also provided, the processor being configured to run a program, wherein the program, when running, executes any one of the methods described.
[0017] According to another aspect of the present invention, an electronic device is also provided, comprising: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any one of the methods described.
[0018] In this embodiment of the invention, audio data is first acquired from a speech database to obtain first initial audio data excluding command words and second initial audio data including command words. The first initial audio data is then processed to obtain processed audio data different from the first initial audio data. Next, an initial speech recognition model performs command word recognition on the processed audio data to obtain a target recognition result, which indicates that the corresponding processed audio data includes command words. The processed audio data corresponding to the target recognition result constitutes negative sample data. Finally, the initial speech recognition model is trained based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model. Because the first initial audio data contains no command words, any processed audio data recognized as containing a command word has been misrecognized, and it is this misrecognized audio that is selected as negative sample data. Training the initial speech recognition model on the negative samples and the second initial audio data increases the amount of misrecognition-prone sample data seen during training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rate in existing command word recognition models.
Attached Figure Description
[0019] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0020] Figure 1 shows a flowchart of a method for adjusting a speech recognition model according to an embodiment of this application;
[0021] Figure 2 shows a schematic diagram of the structure of an adjustment device for a speech recognition model according to an embodiment of this application;
[0022] Figure 3 shows a structural diagram of an initial autoencoder according to an embodiment of this application;
[0023] Figure 4 shows a structural diagram of an initial generation section, an initial output layer, and a first target reconstruction section according to an embodiment of this application;
[0024] Figure 5 shows a flowchart of a method for adjusting a speech recognition model according to an embodiment of this application.
[0025] The above figures include the following reference numerals:
[0026] 200. Initial generation section; 201. Initial reconstruction section; 202. Initial output layer; 203. First target reconstruction section.
Detailed Implementation
[0027] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0028] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.
[0029] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0030] It should be understood that when an element (such as a layer, film, region, or substrate) is described as being "on" another element, the element may be directly on the other element, or there may be an intermediate element present. Furthermore, in the specification and claims, when an element is described as being "connected" to another element, the element may be "directly connected" to the other element, or "connected" to the other element via a third element.
[0031] As mentioned in the background section, the command word recognition model in the prior art has a high false recognition rate. In order to solve the above problem, in a typical embodiment of this application, a method, apparatus, computer-readable storage medium, processor and electronic device for adjusting a speech recognition model are provided.
[0032] According to an embodiment of this application, a method for adjusting a speech recognition model is provided.
[0033] Figure 1 is a flowchart of a method for adjusting a speech recognition model according to an embodiment of this application. As shown in Figure 1, the method includes the following steps:
[0034] Step S101: Obtain audio data from the speech database to obtain first initial audio data and second initial audio data. The first initial audio data does not include command words, and the second initial audio data includes the command words.
[0035] Step S102: Process the first initial audio data to obtain processed audio data, which is different from the first initial audio data.
[0036] Step S103: The initial speech recognition model is used to perform command word recognition on the above audio processing data to obtain target recognition results. The target recognition results are used to characterize that the corresponding audio processing data includes the command words. The audio processing data corresponding to the target recognition results constitute negative sample data.
[0037] Step S104: Based on at least a portion of the second initial audio data and the negative sample data, train at least the initial speech recognition model to obtain the target speech recognition model.
[0038] In the above method, firstly, audio data from a speech database is acquired to obtain first initial audio data excluding command words and second initial audio data including command words. Then, the first initial audio data is processed to obtain processed audio data different from the first initial audio data. Next, an initial speech recognition model is used to perform command word recognition on the processed audio data to obtain a target recognition result representing the corresponding processed audio data including command words. The processed audio data corresponding to the target recognition result constitutes negative sample data. Finally, based on at least a portion of the second initial audio data and the negative sample data, the initial speech recognition model is trained to obtain a target speech recognition model. In this method, the first initial audio data excluding command words is processed to obtain processed audio data. Then, command word recognition is performed on the processed audio data, and audio data that is misrecognized is selected to constitute negative sample data. Then, the initial speech recognition model is trained using the negative samples and the second initial audio data, increasing the amount of sample data prone to misrecognition during model training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rates in existing command word recognition models.
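For illustration only, the flow of steps S102 and S103 can be sketched as follows. This is a minimal sketch, not the implementation of this application: the function names, the list-of-floats audio representation, and the toy `toy_process` and `toy_recognize` stand-ins are all assumptions introduced here. The point it shows is that, because the source clips contain no command words, every processed clip the initial model accepts is a false recognition and is kept as a negative sample.

```python
from typing import Callable, List, Sequence

Audio = List[float]  # assumed representation: one clip as a list of samples


def mine_negative_samples(
    non_command_audio: Sequence[Audio],
    process: Callable[[Audio], Audio],
    recognize: Callable[[Audio], bool],
) -> List[Audio]:
    """Keep the processed clips that the initial model wrongly accepts."""
    negatives: List[Audio] = []
    for clip in non_command_audio:
        processed = process(clip)        # S102: derive a clip that differs from the source
        if recognize(processed):         # S103: model reports a command word
            negatives.append(processed)  # no command word exists, so this is a false accept
    return negatives


# Hypothetical stand-ins for the processing step and the initial model.
def toy_process(clip: Audio) -> Audio:
    return [0.5 * sample for sample in clip]


def toy_recognize(clip: Audio) -> bool:
    return max(abs(sample) for sample in clip) > 0.4


first_initial_audio = [[0.2, 0.1], [1.0, -0.9]]
negative_samples = mine_negative_samples(first_initial_audio, toy_process, toy_recognize)
```

The mined clips would then be combined with the second initial audio data for the training in step S104.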
[0039] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0040] The aforementioned first initial audio data can be various types of audio data, such as noise data, song data, and second initial audio data excluding command words. The aforementioned second initial audio data includes command word data and general data. Specifically, the command word data can be speech data of command words, such as "previous episode", together with the corresponding text labels; the general data can be data composed of any sentence, such as "how is the weather today?", including the speech data and the corresponding text labels.
[0041] In one specific embodiment of this application, the learning rate of the initial speech recognition model is 10^-3. This allows the trained model to be more accurate. Of course, in practical applications, the learning rate mentioned above can also be other values, which can be set by those skilled in the art according to the actual situation.
[0042] In one embodiment of this application, the initial speech recognition model includes an initial generation part and an initial output layer. Training at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model includes: obtaining an initial reconstruction part 201 based on the initial generation part 200, wherein the initial reconstruction part 201 and the initial generation part 200 constitute an initial autoencoder, whose structure is shown in Figure 3; training the initial reconstruction part using at least a portion of the second initial audio data to obtain the first target reconstruction part; and, as shown in Figure 4, training the initial generation part 200, the first target reconstruction part 203, and the initial output layer 202 based on at least a portion of the second initial audio data and the negative sample data to obtain the target speech recognition model. To avoid text recognition errors caused by significant interference in the audio of the negative sample data, an autoencoder is introduced during model training in this embodiment. The autoencoder can prevent large errors in the negative sample data from causing significant deviations in the target speech recognition model. To keep the structure of the original model unchanged, the feature encoding part of the autoencoder is the same as the initial generation part of the initial speech recognition model, and the decoding part of the autoencoder is the aforementioned reconstruction part. According to the characteristics of an autoencoder, the initial reconstruction part is derived in reverse from the initial generation part. First, the initial autoencoder is trained using the second initial audio data to obtain the first target reconstruction part. Then, the initial generation part, the first target reconstruction part, and the initial output layer are trained using at least a portion of the second initial audio data and the negative sample data, thereby further reducing the misrecognition rate of the target speech recognition model.
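The two training stages can be sketched with a deliberately tiny scalar model. This is a hedged illustration under stated assumptions, not the network of this application: each part is reduced to a single weight (`w_g` for the generation part, `w_r` for the reconstruction part, `w_o` for the output layer), and the squared-error reconstruction loss, the logistic output, and the weighting factor `lam` are choices made here, since the application does not specify loss functions.

```python
import math


def reconstruction_loss(x, w_g, w_r):
    """Squared error between an input and its reconstruction r(g(x))."""
    return (w_r * (w_g * x) - x) ** 2


def train_reconstruction_part(data, w_g, w_r=0.0, lr=0.1, steps=50):
    """Stage 1: fit only the reconstruction weight; the generation weight is not updated."""
    for _ in range(steps):
        # Mean gradient of the reconstruction loss with respect to w_r.
        grad = sum(2 * (w_r * w_g * x - x) * (w_g * x) for x in data) / len(data)
        w_r -= lr * grad
    return w_r


def joint_loss(x, label, w_g, w_r, w_o, lam=1.0):
    """Stage 2 objective (assumed form): classification loss plus weighted reconstruction loss."""
    hidden = w_g * x                               # generation part output
    prob = 1.0 / (1.0 + math.exp(-w_o * hidden))   # output layer as a logistic unit
    classification = -math.log(prob if label == 1 else 1.0 - prob)
    return classification + lam * reconstruction_loss(x, w_g, w_r)


command_word_features = [1.0, -0.5, 0.25]  # toy stand-in for second initial audio data
w_g = 2.0                                  # generation weight taken from the initial model
w_r = train_reconstruction_part(command_word_features, w_g)
```

In stage 2 all three weights would be updated against `joint_loss` on both the command word data and the negative samples; the reconstruction term is one plausible way for the autoencoder to limit how far noisy negative samples can pull the shared generation part.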
[0043] In another embodiment of this application, training the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model includes: training the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain a target generation part, a target output layer, and a second target reconstruction part; deleting the second target reconstruction part to obtain the target speech recognition model, wherein the target speech recognition model includes the target generation part and the target output layer. In this embodiment, because the initial speech recognition model includes an initial generation part and an initial output layer, the second target reconstruction part obtained after training needs to be deleted in order to ensure that the structure of the obtained target speech recognition model is consistent with that of the initial speech recognition model.
[0044] To further reduce the false recognition rate of the command word recognition model, in another embodiment of this application, the initial speech recognition model is trained at least based on at least a portion of the aforementioned second initial audio data and the aforementioned negative sample data to obtain a target speech recognition model. This includes: recognizing the audio processing data in the aforementioned negative sample data as text to obtain text data, wherein the audio processing data and the aforementioned text data constitute negative sample training data; and training the initial speech recognition model at least using at least a portion of the aforementioned second initial audio data and the aforementioned negative sample training data to obtain the target speech recognition model. Before training the initial speech recognition model, it is necessary to recognize the audio processing data as text to obtain text data, and then use the audio processing data and text data to train the initial speech recognition model.
[0045] In one specific embodiment of this application, a continuous speech recognition model can be used to recognize the above-mentioned audio processing data into text data. Of course, in practical applications, other methods can also be used to process audio data into text data.
[0046] In another specific embodiment of this application, not only can audio processing data be recognized as text data, but also initials and finals can be recognized from the audio processing data.
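The pairing of negative sample audio with recognized text can be sketched as below. The `transcribe` callable stands in for a continuous speech recognition model; it and the `toy_transcribe` helper are hypothetical names introduced for this sketch only.

```python
from typing import Callable, List, Sequence, Tuple

Audio = List[float]  # assumed representation: one clip as a list of samples


def build_negative_training_data(
    negative_clips: Sequence[Audio],
    transcribe: Callable[[Audio], str],
) -> List[Tuple[Audio, str]]:
    """Pair each negative sample clip with the text recognized from it."""
    return [(clip, transcribe(clip)) for clip in negative_clips]


# Hypothetical transcriber, used only so the sketch runs end to end.
def toy_transcribe(clip: Audio) -> str:
    return "long clip" if len(clip) > 2 else "short clip"


pairs = build_negative_training_data([[0.1, 0.2], [0.3, 0.1, 0.0]], toy_transcribe)
```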
[0047] In another embodiment of this application, the first initial audio data is processed to obtain processed audio data, including: adding scene features to the first initial audio data to obtain the processed audio data. The scene features include at least one of the following: ambient noise, rate perturbation, and reverberation. In this embodiment, adding scene features to the first initial audio data allows the processed audio data to more realistically reflect various situations in real life, increasing the number of training samples for the model and making the sample set closer to actual application scenarios, thereby further improving the accuracy of the speech recognition model.
[0048] The aforementioned environmental noise can be noise such as television, music, or running water, used to simulate the actual background noise of the environment; the aforementioned rate perturbation can lengthen or shorten the first initial audio data; the aforementioned reverberation can be a room impulse response, used to simulate the reflection of sound in a room.
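The three scene features can be sketched in plain Python as follows. This is an illustrative sketch under assumptions made here: audio is a list of float samples, noise is mixed at a chosen signal-to-noise ratio, rate perturbation uses linear interpolation, and reverberation is direct convolution with an impulse response. The function names and parameters are hypothetical, and the noise clip is assumed to have nonzero power.

```python
import math


def power(samples):
    """Mean squared amplitude of a clip."""
    return sum(s * s for s in samples) / len(samples)


def add_noise(signal, noise, snr_db):
    """Mix noise into the signal so the result has the requested SNR in dB."""
    tiled = [noise[i % len(noise)] for i in range(len(signal))]  # repeat noise to length
    scale = math.sqrt(power(signal) / (power(tiled) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, tiled)]


def perturb_rate(signal, factor):
    """Resample by linear interpolation: factor > 1 shortens, factor < 1 lengthens."""
    n_out = max(1, int(round(len(signal) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


def add_reverb(signal, impulse_response):
    """Convolve the signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out
```

For example, `add_reverb(clip, [1.0, 0.5])` adds a single half-amplitude echo one sample later, and `perturb_rate(clip, 2.0)` halves the clip length.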
[0049] In order to construct diverse audio processing data, in another embodiment of this application, scene features are added to the first initial audio data to obtain the audio processing data, including: extracting a portion of the first initial audio data to obtain a first initial sub-audio; adding the scene features to the first initial sub-audio to obtain the audio processing data.
[0050] In another embodiment of this application, there are multiple first initial audio data points. Processing the first initial audio data to obtain processed audio data includes: extracting portions of at least two of the first initial audio data points to obtain multiple second initial sub-audio data points; and superimposing the multiple second initial sub-audio data points to obtain the processed audio data. In this embodiment, the extracted portions of multiple first initial audio data points can also be used to obtain multiple second initial sub-audio data points, which can then be superimposed to obtain synthesized audio. Scene features are then added to the synthesized audio.
[0051] The portions extracted from at least two of the first initial audio data can have the same duration or different durations, and the signal-to-noise ratios of the multiple second initial sub-audio data can be the same or different, as long as the generated audio processing data is different from the first initial audio data.
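Extraction and superposition of sub-audio can be sketched as below. The sketch assumes clips are lists of float samples, picks segment start positions at random, and sums segments sample by sample; `extract_segment` and `superimpose` are names introduced here, not part of this application.

```python
import random


def extract_segment(clip, length, rng):
    """Cut a contiguous segment of at most `length` samples at a random offset."""
    length = min(length, len(clip))
    start = rng.randrange(0, len(clip) - length + 1)
    return clip[start:start + length]


def superimpose(segments):
    """Sum segments sample by sample; shorter segments end early."""
    mixed = [0.0] * max(len(segment) for segment in segments)
    for segment in segments:
        for i, sample in enumerate(segment):
            mixed[i] += sample
    return mixed


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clip_a = [float(i) for i in range(10)]
clip_b = [0.5] * 6
sub_a = extract_segment(clip_a, 4, rng)
sub_b = extract_segment(clip_b, 3, rng)
synthesized = superimpose([sub_a, sub_b])
```

Segments of different durations are allowed here, matching the text above; scene features could then be added to `synthesized`.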
[0052] In another embodiment of the present application, after obtaining the audio data in the voice database to obtain the first initial audio data and the second initial audio data, before using the initial voice recognition model to recognize the command words in the above audio processing data, the above method further includes: constructing the initial voice recognition model according to the above second initial audio data. In this embodiment, the initial voice recognition model is constructed according to the above second initial audio data, and the sample scale of the above second initial audio data is generally much smaller than that of the first initial audio data. Therefore, the initial voice recognition model may have a relatively high misrecognition rate for the samples in the first initial audio data and needs to be adjusted through negative sample data.
[0053] In a specific embodiment of the present application, the above initial voice recognition model is constructed based on the initials and finals of the corresponding characters in the second initial audio data. Of course, in actual applications, there are other modeling methods. For example, the initial voice recognition model can be constructed based on the corresponding characters in the second initial audio data, or can be constructed based on the syllables of the corresponding characters in the second initial audio data (that is, the combination of the initials and finals in the order of Chinese pinyin pronunciation), or can be constructed based on the phonemes of the corresponding characters in the second initial audio data. Among them, the above phoneme is the smallest speech unit divided according to the natural attributes of speech. Analyzing according to the pronunciation actions in the syllable, one action constitutes one phoneme. For example, the Chinese syllable (a, ah) has only one phoneme, the Chinese syllable (ai, love) has two phonemes, and the Chinese syllable (dai, generation) has three phonemes, etc.
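As a small illustration of the initial and final modeling units, the sketch below splits a pinyin syllable into its initial and final. It assumes toneless, lowercase pinyin and uses the 21 conventional initials; syllables without an initial, such as "ai", return an empty initial. The function name `split_syllable` is introduced here for illustration.

```python
# Two-letter initials come first so that "zh" is not matched as "z".
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
)


def split_syllable(syllable):
    """Return (initial, final) for a toneless pinyin syllable."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable


units = [split_syllable(s) for s in ("a", "ai", "dai", "shang")]
```

A syllable-level model would instead use the whole string, such as "dai", as one unit, and a phoneme-level model would split it further.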
[0054] The embodiment of the present application further provides an adjustment device for a voice recognition model. It should be noted that the adjustment device for the voice recognition model in the embodiment of the present application can be used to execute the method for adjusting the voice recognition model provided in the embodiment of the present application. The following introduces the adjustment device for the voice recognition model provided in the embodiment of the present application.
[0055] Figure 2 is a schematic diagram of the adjustment device for the voice recognition model according to the embodiment of the present application. As shown in Figure 2, the device includes:
[0056] An acquisition unit 10, configured to acquire audio data in a voice database to obtain first initial audio data and second initial audio data. The above first initial audio data does not include command words, and the above second initial audio data includes the above command words;
[0057] A processing unit 20, configured to process the above first initial audio data to obtain audio processing data, and the above audio processing data is different from the above first initial audio data;
[0058] The recognition unit 30 is used to perform command word recognition on the audio processing data using an initial speech recognition model to obtain a target recognition result. The target recognition result is used to characterize that the corresponding audio processing data includes the command word. The audio processing data corresponding to the target recognition result constitutes negative sample data.
[0059] Training unit 40 is used to train the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain the target speech recognition model.
[0060] The aforementioned apparatus includes an acquisition unit, a processing unit, a recognition unit, and a training unit. The acquisition unit acquires audio data from a speech database to obtain first initial audio data excluding command words and second initial audio data including command words. The processing unit processes the first initial audio data to obtain processed audio data different from the first initial audio data. The recognition unit uses an initial speech recognition model to perform command word recognition on the processed audio data, obtaining a target recognition result representing the corresponding processed audio data including command words. The processed audio data corresponding to the target recognition result constitutes negative sample data. The training unit trains at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model. In this apparatus, the first initial audio data excluding command words is processed to obtain processed audio data. Then, command word recognition is performed on the processed audio data to select audio data that has been misrecognized, thus forming negative sample data. Subsequently, the initial speech recognition model is trained using the negative samples and the second initial audio data, increasing the amount of sample data prone to misrecognition during model training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rates in existing command word recognition models.
[0061] The aforementioned first initial audio data can be various types of audio data, such as noise data, song data, and second initial audio data excluding command words; the aforementioned second initial audio data includes command word data and general data. Specifically, the command word data can be speech data of command words such as "previous episode" together with the corresponding text labels; the general data can be data composed of any sentence, including speech data and corresponding text labels, such as "how is the weather today?".
[0062] In one specific embodiment of this application, the learning rate of the initial speech recognition model is 10⁻³, which allows the trained model to be more accurate. Of course, in practical applications, the learning rate can also take other values, which can be set by those skilled in the art according to the actual situation.
[0063] In one embodiment of this application, the initial speech recognition model includes an initial generation part and an initial output layer, and the training unit includes a constructing module, a first training module, and a second training module. The constructing module is used to obtain an initial reconstruction part 201 based on the initial generation part 200; the initial reconstruction part 201 and the initial generation part 200 constitute an initial autoencoder, whose structure is shown in Figure 3. The first training module is used to train the initial reconstruction part using at least a portion of the second initial audio data to obtain the first target reconstruction part. As shown in Figure 4, the second training module is used to train the initial generation part 200, the first target reconstruction part 203, and the initial output layer 202 based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model. To avoid recognition errors when converting negative sample data into text due to excessive interference, an autoencoder is introduced in this embodiment during model training. The autoencoder can prevent large errors in the negative sample data from causing significant deviations in the target speech recognition model. To ensure the original model structure remains unchanged, the feature encoding part of the autoencoder in this application is the same as the initial generation part of the initial speech recognition model, and the decoding part of the autoencoder is the aforementioned reconstruction part. Based on the characteristics of the autoencoder, the initial reconstruction part is derived by working backward from the initial generation part. First, the initial autoencoder is trained using the second initial audio data to obtain the first target reconstruction part. Then, at least a portion of the second initial audio data and the negative sample data are used to train the initial generation part, the first target reconstruction part, and the initial output layer, thereby further reducing the misrecognition rate of the target speech recognition model.
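As a non-limiting illustration, the first training stage, in which the reconstruction part is trained on top of the generation part, can be sketched with a toy linear autoencoder. The linear layers, the initialization of the reconstruction part as the transpose of the generation part, the choice to keep the generation part fixed during this stage, the mean squared reconstruction loss, and all names (`matvec`, `recon_loss`, `train_reconstruction`) are assumptions made for this sketch and are not specified by the application.

```python
# Toy sketch only: W plays the "generation part" (encoder) and V the
# "reconstruction part" (decoder). Real models would be neural networks.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of matrix m as a list of rows."""
    return [list(col) for col in zip(*m)]

def recon_loss(W, V, data):
    """Mean squared error between x and its reconstruction V(W(x))."""
    total = 0.0
    for x in data:
        x_hat = matvec(V, matvec(W, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train_reconstruction(W, V, data, lr=0.1, steps=200):
    """Full-batch gradient descent on V only; W is kept fixed (assumption)."""
    V = [row[:] for row in V]  # work on a copy so the initial part is preserved
    n = len(data)
    for _ in range(steps):
        grad = [[0.0] * len(V[0]) for _ in V]
        for x in data:
            h = matvec(W, x)          # encoded features
            x_hat = matvec(V, h)      # reconstruction
            for i, (xh_i, x_i) in enumerate(zip(x_hat, x)):
                err = 2.0 * (xh_i - x_i) / n
                for j, h_j in enumerate(h):
                    grad[i][j] += err * h_j
        for i in range(len(V)):
            for j in range(len(V[0])):
                V[i][j] -= lr * grad[i][j]
    return V

# Hypothetical 3-dimensional "audio features" standing in for the second initial audio data.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
V0 = transpose(W)  # initial reconstruction part derived from the generation part
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, -0.5], [-1.0, 0.2, 0.3]]
V1 = train_reconstruction(W, V0, features)  # "first target reconstruction part"
```

In the second stage, the generation part, the trained reconstruction part, and the output layer would all be updated together on the second initial audio data and the negative sample data; that joint objective is not detailed in the application and is therefore not sketched here.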
[0064] In another embodiment of this application, the second training module includes a training submodule and a deletion submodule. The training submodule is used to train the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data, to obtain a target generation part, a target output layer, and a second target reconstruction part. The deletion submodule is used to delete the second target reconstruction part to obtain the target speech recognition model, which includes the target generation part and the target output layer. In this embodiment, because the initial speech recognition model includes an initial generation part and an initial output layer, the second target reconstruction part obtained after training needs to be deleted to ensure that the structure of the obtained target speech recognition model remains consistent with that of the initial speech recognition model.
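As a non-limiting illustration, the deletion step can be sketched as follows. The container type `TrainingModel`, its field names, and the placeholder string values are hypothetical; the sketch only shows that the deployed model keeps the trained generation part and output layer while the reconstruction part is dropped.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional

@dataclass(frozen=True)
class TrainingModel:
    generation: Any                       # shared encoder ("generation part")
    output_layer: Any                     # command word output layer
    reconstruction: Optional[Any] = None  # decoder, present only during training

def strip_reconstruction(trained: TrainingModel) -> TrainingModel:
    """Return a model with the same generation part and output layer, no decoder."""
    return replace(trained, reconstruction=None)

# Placeholder strings stand in for trained network components.
trained = TrainingModel(generation="target generation part",
                        output_layer="target output layer",
                        reconstruction="second target reconstruction part")
deployed = strip_reconstruction(trained)
```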
[0065] To further reduce the false recognition rate of the command word recognition model, in another embodiment of this application, the training unit includes a recognition module and a third training module. The recognition module is used to recognize the audio processing data in the negative sample data as text, obtaining text data. The audio processing data and the text data constitute the negative sample training data. The third training module is used to train the initial speech recognition model using at least a portion of the second initial audio data and the negative sample training data to obtain the target speech recognition model. That is, before the initial speech recognition model is trained, the audio processing data is first recognized as text to obtain text data, and the initial speech recognition model is then trained using the audio processing data and the text data.
[0066] In one specific embodiment of this application, a continuous speech recognition model can be used to recognize the above-mentioned audio processing data into text data. Of course, in practical applications, other methods can also be used to process audio data into text data.
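As a non-limiting illustration, constructing the negative sample training data amounts to pairing each misrecognized clip with the text that a general-purpose recognizer assigns to it. The function `build_negative_training_data` and the stand-in `fake_transcribe` are hypothetical; a real system would call a continuous speech recognition model in place of the stand-in.

```python
from typing import Callable, List, Sequence, Tuple

Audio = Sequence[float]

def build_negative_training_data(
    negative_audio: List[Audio],
    transcribe: Callable[[Audio], str],
) -> List[Tuple[Audio, str]]:
    """Pair each negative-sample clip with its recognized text label."""
    return [(clip, transcribe(clip)) for clip in negative_audio]

def fake_transcribe(clip: Audio) -> str:
    """Stand-in for a continuous speech recognition model (illustration only)."""
    return "loud" if max(abs(s) for s in clip) > 0.5 else "quiet"

pairs = build_negative_training_data([[0.1, -0.2], [0.9, 0.3]], fake_transcribe)
```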
[0067] In another specific embodiment of this application, not only can audio processing data be recognized as text data, but also initials and finals can be recognized from the audio processing data.
[0068] In another embodiment of this application, the processing unit includes a first adding module, wherein the first adding module is used to add scene features to the first initial audio data to obtain the audio processed data. The scene features include at least one of the following: environmental noise, rate perturbation, and reverberation. In this embodiment, adding scene features to the first initial audio data makes the audio processed data more realistically reflect various situations in real life, increases the number of sample sets for model training, and makes the sample sets closer to actual application scenarios, thereby further improving the accuracy of the speech recognition model.
[0069] The aforementioned environmental noise can be noise such as television, music, or running water, used to simulate the actual background noise of the environment; the aforementioned rate perturbation can lengthen or shorten the first initial audio data; the aforementioned reverberation can be a room impulse response, used to simulate the reflection of sound in a room.
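As a non-limiting illustration, the three scene features can be sketched on lists of samples. The function names, the use of a target signal-to-noise ratio to scale the noise, and linear interpolation for rate perturbation are assumptions of this sketch; the application does not prescribe particular algorithms.

```python
import math
from typing import List

def power(x: List[float]) -> float:
    """Mean power of a signal."""
    return sum(s * s for s in x) / len(x)

def add_noise(signal: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Mix looped noise into the signal at the requested SNR (noise must be non-silent)."""
    looped = [noise[i % len(noise)] for i in range(len(signal))]
    scale = math.sqrt(power(signal) / (power(looped) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, looped)]

def perturb_rate(signal: List[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 shortens, factor < 1 lengthens."""
    n_out = max(1, int(len(signal) / factor))
    out = []
    for i in range(n_out):
        src = i * factor
        lo = min(int(src), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = src - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out

def add_reverb(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Convolve the signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out
```

The three functions can be chained in any order, for example `perturb_rate(add_reverb(add_noise(clip, tv_noise, 10.0), rir), 1.1)`, where `clip`, `tv_noise`, and `rir` are hypothetical inputs.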
[0070] In order to construct diverse audio processing data, in another embodiment of this application, the first adding module includes an extraction submodule and an adding submodule, wherein the extraction submodule is used to extract a portion of the first initial audio data to obtain a first initial sub-audio; the adding submodule is used to add the scene features to the first initial sub-audio to obtain the audio processing data.
[0071] In another embodiment of the present application, there are multiple pieces of the above-mentioned first initial audio data. The processing unit includes an interception module, a superimposition module, and a second adding module. The interception module is used to intercept parts of at least two pieces of the first initial audio data to obtain multiple second initial sub-audio data; the superimposition module is used to superimpose the multiple second initial sub-audio data to obtain synthesized audio data; the second adding module is used to add the above-mentioned scene features to the synthesized audio data to obtain the above-mentioned audio processing data. In other words, in this embodiment, parts of multiple pieces of the first initial audio data can be intercepted and superimposed to obtain synthesized audio, and the scene features are then added to the synthesized audio.
[0072] The above-mentioned parts of at least two pieces of the first initial audio data can be intercepted to obtain audio data of the same duration or audio data of different durations; the signal-to-noise ratios of the multiple second initial sub-audio data can be the same or different, as long as the generated audio processing data is different from the first initial audio data.
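As a non-limiting illustration, interception and superimposition can be sketched as slicing and sample-wise addition. The function names and the zero-padding of shorter segments are assumptions of this sketch; the sample values are placeholders.

```python
from typing import List

def intercept(clip: List[float], start: int, length: int) -> List[float]:
    """Cut `length` samples out of a clip, starting at sample index `start`."""
    return clip[start:start + length]

def superimpose(segments: List[List[float]]) -> List[float]:
    """Add segments sample by sample; shorter segments are treated as zero-padded."""
    out = [0.0] * max(len(seg) for seg in segments)
    for seg in segments:
        for i, s in enumerate(seg):
            out[i] += s
    return out

# Two hypothetical pieces of first initial audio data.
song = [0.25, 0.5, 0.75, 1.0, 0.5]
chatter = [1.0, 2.0, 3.0]

# Segments of different durations are intercepted and then overlaid.
mix = superimpose([intercept(song, 1, 3), intercept(chatter, 0, 2)])
```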
[0073] In yet another embodiment of the present application, the above-mentioned device further includes a construction unit. Among them, the construction unit is used to construct the above-mentioned initial speech recognition model according to the second initial audio data after obtaining the first initial audio data and the second initial audio data in the speech database and before using the initial speech recognition model to recognize command words for the above-mentioned audio processing data. In this embodiment, the above-mentioned initial speech recognition model is constructed according to the second initial audio data. The sample size of the second initial audio data is generally much smaller than that of the first initial audio data. Therefore, the initial speech recognition model may have a relatively high misrecognition rate for samples in the first initial audio data and needs to be adjusted through negative sample data.
[0074] In a specific embodiment of the present application, the above-mentioned initial speech recognition model is constructed based on the initials and finals of the corresponding characters in the second initial audio data. Of course, in actual applications, there are other modeling methods. For example, the initial speech recognition model can be constructed based on the corresponding characters in the second initial audio data, or can be constructed based on the syllables of the corresponding characters in the second initial audio data (that is, the combination of the initials and finals in the order of Chinese pinyin pronunciation), or can be constructed based on the phonemes of the corresponding characters in the second initial audio data. Among them, the above-mentioned phoneme is the smallest speech unit divided according to the natural attributes of speech. Analyzed according to the pronunciation actions in the syllable, one action constitutes one phoneme. For example, the Chinese syllable (a, ah) has only one phoneme, the Chinese syllable (ai, love) has two phonemes, and the Chinese syllable (dai, generation) has three phonemes, etc.
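As a non-limiting illustration, the decomposition of pinyin syllables into initials and finals as modeling units can be sketched as follows. The initials table, the treatment of "y" and "w" as initials (a common lexicon convention), and the romanization "shang yi ji" used in the usage example are assumptions of this sketch.

```python
from typing import List, Tuple

# Pinyin initials, with two-letter initials first so that "zh" wins over "z".
# Including "y" and "w" is an assumed lexicon convention.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable such as "a" or "ai"

def to_units(syllables: List[str]) -> List[str]:
    """Flatten a syllable sequence into initial/final modeling units."""
    units = []
    for syl in syllables:
        ini, fin = split_syllable(syl)
        units.extend(u for u in (ini, fin) if u)
    return units

units = to_units(["shang", "yi", "ji"])
```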
[0075] The adjustment device for the aforementioned speech recognition model includes a processor and a memory. The aforementioned acquisition unit, processing unit, recognition unit, and training unit are all stored in the memory as program units, and the processor executes the aforementioned program units stored in the memory to achieve the corresponding functions.
[0076] The processor contains a kernel, which retrieves the corresponding program units from memory. One or more kernels can be configured, and adjusting kernel parameters can address the high false recognition rate of existing command word recognition models.
[0077] The memory may include forms of computer-readable media such as non-permanent memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash RAM; the memory includes at least one memory chip.
[0078] This invention provides a computer-readable storage medium storing a program that, when executed by a processor, implements the above-described method for adjusting a speech recognition model.
[0079] This invention provides a processor for running a program, wherein the program executes the adjustment method of the speech recognition model during runtime.
[0080] This invention provides an electronic device including one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
[0081] In the aforementioned electronic device, the one or more programs stored in the memory are executed by the one or more processors to perform any of the aforementioned methods. In this method, first initial audio data excluding command words is processed to obtain audio processing data; then, command word recognition is performed on the audio processing data, and audio processing data that is misrecognized is selected to constitute negative sample data. Subsequently, an initial speech recognition model is trained using the negative sample data and second initial audio data, increasing the amount of sample data prone to misrecognition during model training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rates in existing command word recognition models.
[0082] This invention provides a device including a processor, a memory, and a program stored in the memory and executable on the processor. When the processor executes the program, it performs at least the following steps:
[0083] Step S101: Obtain audio data from the speech database to obtain first initial audio data and second initial audio data. The first initial audio data does not include command words, and the second initial audio data includes the command words.
[0084] Step S102: Process the first initial audio data to obtain processed audio data, which is different from the first initial audio data.
[0085] Step S103: The initial speech recognition model is used to perform command word recognition on the above audio processing data to obtain target recognition results. The target recognition results are used to characterize that the corresponding audio processing data includes the command words. The audio processing data corresponding to the target recognition results constitute negative sample data.
[0086] Step S104: Based on at least a portion of the second initial audio data and the negative sample data, train at least the initial speech recognition model to obtain the target speech recognition model.
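As a non-limiting illustration, the selection of negative sample data in step S103 can be sketched as a filter over the audio processing data. Because every clip derives from first initial audio data that contains no command word, any command word result is by construction a misrecognition. The type alias `Recognizer`, the function `mine_negative_samples`, and the threshold-based `toy_recognizer` are hypothetical stand-ins for the initial speech recognition model.

```python
from typing import Callable, List, Optional, Sequence

Audio = Sequence[float]
# Hypothetical recognizer interface: returns the detected command word, or None.
Recognizer = Callable[[Audio], Optional[str]]

def mine_negative_samples(recognize: Recognizer, processed: List[Audio]) -> List[Audio]:
    """Keep command-word-free clips that nevertheless trigger a command word result."""
    return [clip for clip in processed if recognize(clip) is not None]

def toy_recognizer(clip: Audio) -> Optional[str]:
    """Stand-in model that 'fires' on loud clips (illustration only)."""
    mean_abs = sum(abs(s) for s in clip) / len(clip)
    return "previous episode" if mean_abs > 0.5 else None

processed_clips = [[0.1, 0.2], [0.9, 0.8], [0.6, 0.7]]
negatives = mine_negative_samples(toy_recognizer, processed_clips)
```

The resulting `negatives` list would then be combined with at least a portion of the second initial audio data for the training in step S104.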
[0087] The devices mentioned in this article can be servers, PCs, tablets, mobile phones, etc.
[0088] This application also provides a computer program product which, when executed on a data processing device, is suitable for executing a program initialized with at least the following method steps:
[0089] Step S101: Obtain audio data from the speech database to obtain first initial audio data and second initial audio data. The first initial audio data does not include command words, and the second initial audio data includes the command words.
[0090] Step S102: Process the first initial audio data to obtain processed audio data, which is different from the first initial audio data.
[0091] Step S103: The initial speech recognition model is used to perform command word recognition on the above audio processing data to obtain target recognition results. The target recognition results are used to characterize that the corresponding audio processing data includes the command words. The audio processing data corresponding to the target recognition results constitute negative sample data.
[0092] Step S104: Based on at least a portion of the second initial audio data and the negative sample data, train at least the initial speech recognition model to obtain the target speech recognition model.
[0093] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions of this disclosure will be described in detail below with reference to specific embodiments and comparative examples.
[0094] Example
[0095] The flowchart of the adjustment method for this speech recognition model is shown in Figure 5. The adjustment method includes the following steps:
[0096] Step 1: Construct diverse audio samples;
[0097] First, audio data is acquired from the speech database to obtain first initial audio data and second initial audio data. The first initial audio data does not include command words, while the second initial audio data does include command words. Then, an initial speech recognition model is constructed based on the second initial audio data. Finally, portions of at least two pieces of the first initial audio data are extracted to obtain multiple second initial sub-audio data, and these second initial sub-audio data are processed to obtain audio processing data. There are two specific schemes for processing the second initial sub-audio data: in the first, noise interference and reverberation are added to one of the second initial sub-audio data to obtain synthesized audio data, and the rate of the synthesized audio data is adjusted to obtain the audio processing data; in the second, two of the second initial sub-audio data are selected, and their rates are adjusted to obtain the audio processing data.
[0098] Step 2: Recognize the constructed audio samples to obtain samples that are identified as command words;
[0099] The above-mentioned initial speech recognition model is used to perform command word recognition on the above-mentioned audio processing data to obtain target recognition results. A target recognition result indicates that the corresponding audio processing data includes the above-mentioned command words. The audio processing data corresponding to the target recognition results constitute the negative sample data.
[0100] Step 3: Recognize negative samples as text;
[0101] A continuous speech recognition model is used to recognize the audio processing data in the negative sample data as text, obtaining text data; the audio processing data and the text data constitute the training data.
[0102] Step 4: Fine-tune and train the initial speech recognition model;
[0103] The initial reconstruction part is trained using at least a portion of the second initial audio data to obtain a first target reconstruction part. The initial generation part, the first target reconstruction part, and the initial output layer are trained using at least a portion of the second initial audio data and the training data to obtain a target generation part, a target output layer, and a second target reconstruction part. The second target reconstruction part is deleted to obtain the target speech recognition model, which includes the target generation part and the target output layer.
[0104] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0105] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units described above can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.
[0106] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0107] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0108] If the aforementioned integrated units are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0109] As can be seen from the above description, the embodiments of this application achieve the following technical effects:
[0110] 1) The method for adjusting the speech recognition model in this application firstly acquires audio data from a speech database to obtain first initial audio data excluding command words and second initial audio data including command words. Then, the first initial audio data is processed to obtain audio processed data different from the first initial audio data. Next, the initial speech recognition model is used to perform command word recognition on the audio processed data to obtain a target recognition result representing the corresponding audio processed data including command words. The audio processed data corresponding to the target recognition result constitutes negative sample data. Finally, based on at least a portion of the second initial audio data and the negative sample data, the initial speech recognition model is trained to obtain a target speech recognition model. In this method, the first initial audio data excluding command words is processed to obtain audio processed data. Then, command word recognition is performed on the audio processed data, and audio processed data that is misrecognized is selected to constitute negative sample data. Then, the initial speech recognition model is trained using the negative sample data and the second initial audio data, increasing the amount of sample data prone to misrecognition during model training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rate in existing command word recognition models.
[0111] 2) The speech recognition model adjustment device of this application includes an acquisition unit, a processing unit, a recognition unit, and a training unit. The acquisition unit is used to acquire audio data from a speech database to obtain first initial audio data excluding command words and second initial audio data including command words. The processing unit is used to process the first initial audio data to obtain audio processing data different from the first initial audio data. The recognition unit is used to use the initial speech recognition model to perform command word recognition on the audio processing data to obtain a target recognition result that represents the corresponding audio processing data including command words. The audio processing data corresponding to the target recognition result constitutes negative sample data. The training unit is used to train the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model. In this device, the first initial audio data excluding command words is processed to obtain audio processed data. Then, command word recognition is performed on the audio processed data, and audio processed data that are misrecognized are selected to form negative sample data. Subsequently, the initial speech recognition model is trained using the negative samples and the second initial audio data, which increases the amount of sample data that is prone to misrecognition during model training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rate of command word recognition models in the prior art.
[0112] 3) The electronic device of this application includes one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the above-described methods. In this method, first initial audio data excluding command words is processed to obtain audio processing data; then, command word recognition is performed on the audio processing data, and audio processing data that has been misrecognized is selected to constitute negative sample data. Subsequently, the initial speech recognition model is trained using the negative sample data and second initial audio data, increasing the amount of sample data prone to misrecognition during model training, thereby making the target speech recognition model more accurate and solving the problem of high misrecognition rates in existing command word recognition models.
[0113] The above description is merely a preferred embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, or improvements made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A method for adjusting a speech recognition model, characterized in that it comprises: Audio data is obtained from a speech database to obtain first initial audio data and second initial audio data. The first initial audio data does not include command words, and the second initial audio data includes the command words. The second initial audio data contains command word data and general data, and the general data includes speech data and corresponding text tags. The first initial audio data is processed to obtain audio processing data, which is different from the first initial audio data. An initial speech recognition model is used to perform command word recognition on the audio processing data to obtain target recognition results. The target recognition results are used to characterize that the corresponding audio processing data includes the command word. The audio processing data corresponding to the target recognition results constitutes negative sample data. Based on at least a portion of the second initial audio data and the negative sample data, at least the initial speech recognition model is trained to obtain the target speech recognition model; The initial speech recognition model includes an initial generation part and an initial output layer. Training at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain the target speech recognition model includes: obtaining an initial reconstruction part based on the initial generation part, wherein the initial reconstruction part and the initial generation part constitute an initial autoencoder; training the initial reconstruction part using at least a portion of the second initial audio data to obtain a first target reconstruction part; and training the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain the target speech recognition model.
2. The method according to claim 1, characterized in that, The initial generation part, the first target reconstruction part, and the initial output layer are trained based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model, including: The initial generation part, the first target reconstruction part, and the initial output layer are trained based on at least a portion of the second initial audio data and the negative sample data to obtain the target generation part, the target output layer, and the second target reconstruction part. The second target reconstruction part is deleted to obtain the target speech recognition model, which includes the target generation part and the target output layer.
3. The method according to claim 1, characterized in that, Based on at least a portion of the second initial audio data and the negative sample data, at least the initial speech recognition model is trained to obtain a target speech recognition model, including: The audio processing data in the negative sample data is identified as text to obtain text data. The audio processing data and the text data constitute the negative sample training data. The initial speech recognition model is trained using at least a portion of the second initial audio data and the negative sample training data to obtain the target speech recognition model.
4. The method according to claim 1, characterized in that, The first initial audio data is processed to obtain processed audio data, including: Add scene features to the first initial audio data to obtain the audio processing data. The scene features include at least one of the following: ambient noise, rate perturbation, and reverberation.
5. The method according to claim 4, characterized in that, Adding scene features to the first initial audio data to obtain the audio processing data includes: A portion of the first initial audio data is extracted to obtain the first initial sub-audio. The scene features are added to the first initial sub-audio to obtain the audio processing data.
6. The method according to claim 1, characterized in that, There are multiple pieces of the first initial audio data. The first initial audio data is processed to obtain audio processing data, including: Portions of at least two pieces of the first initial audio data are extracted to obtain multiple second initial sub-audio data; The audio processing data is obtained by superimposing the multiple second initial sub-audio data.
7. The method according to any one of claims 1 to 6, characterized in that, After acquiring audio data from the speech database and obtaining first and second initial audio data, before performing command word recognition on the audio processing data using an initial speech recognition model, the method further includes: The initial speech recognition model is constructed based on the second initial audio data.
8. A device for adjusting a speech recognition model, characterized in that it comprises: The acquisition unit is used to acquire audio data from a speech database to obtain first initial audio data and second initial audio data. The first initial audio data does not include command words, and the second initial audio data includes the command words. The second initial audio data contains command word data and general data, and the general data includes speech data and corresponding text tags. A processing unit is configured to process the first initial audio data to obtain audio processing data, wherein the audio processing data is different from the first initial audio data. The recognition unit is used to perform command word recognition on the audio processing data using an initial speech recognition model to obtain a target recognition result. The target recognition result is used to characterize that the corresponding audio processing data includes the command word. The audio processing data corresponding to the target recognition result constitutes negative sample data. A training unit is configured to train at least the initial speech recognition model based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model. The initial speech recognition model includes an initial generation part and an initial output layer. The training unit includes a construction module, a first training module, and a second training module. The construction module is used to obtain an initial reconstruction part based on the initial generation part, and the initial reconstruction part and the initial generation part constitute an initial autoencoder. The first training module is used to train the initial reconstruction part using at least a portion of the second initial audio data to obtain a first target reconstruction part. 
The second training module is used to train the initial generation part, the first target reconstruction part, and the initial output layer based on at least a portion of the second initial audio data and the negative sample data to obtain a target speech recognition model.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium includes a stored program, wherein the program performs the method according to any one of claims 1 to 7.
10. A processor, characterized in that, The processor is used to run a program, wherein the program executes the method according to any one of claims 1 to 7 when it runs.
11. An electronic device, characterized in that it comprises: One or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method according to any one of claims 1 to 7.