Speech recognition text correction method and apparatus

By using a combination of text similarity and edit distance models with a binary classification model in refrigeration equipment, the problem of typos in speech recognition was solved, achieving more accurate text correction and improving user experience.

CN117951247B · Active · Publication Date: 2026-06-19 · QINDAO HAIER REFRIGERATOR CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: QINDAO HAIER REFRIGERATOR CO LTD
Filing Date: 2022-10-21
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for speech recognition in refrigeration equipment suffer from inaccurate word recognition, especially due to misspellings caused by homophones in Chinese, resulting in the final text being far removed from the user's true meaning.

Method used

We use a text similarity model and an edit distance model to filter similar text sets and distance sets, combine a binary classification model to determine the value of each bit of the text data, and use the text with the highest frequency in the candidate text set as the error correction text. We then use a Transformer+sigmoid model for error correction.

Benefits of technology

It improves the accuracy of speech recognition, ensures the accuracy of text correction, and enhances the user experience.



Abstract

The application discloses a speech recognition text correction method and device, and the method comprises the following steps: converting speech data into text data; screening a similar text set with similar semantics with the text data in a corpus using a text similarity model; screening a distance set with an edit distance within a preset threshold range with the text data in the corpus; judging whether the value of each bit of the text data is a first value or a second value through a binary classification model; screening a candidate text set with the same text length as the text data and the same content at each specified position in the similar text set and the distance set; and taking the text with the highest frequency of occurrence in the candidate text set as the corrected text. The speech recognition text correction method and device can determine the corrected text without splitting the sentence, realize text correction, and thus more clearly recognize the real demand of the user.

Description

Technical Field

[0001] This invention relates to the field of voice recognition for refrigeration equipment, and more particularly to a method and apparatus for text error correction in voice recognition for refrigeration equipment.

Background Technology

[0002] With the advancement of technology, users have put forward new requirements for the intelligence of refrigeration equipment. For example, in the scenario where users use a refrigerator, the user speaks to the refrigerator, the refrigerator performs voice recognition, and then performs the corresponding operation according to the voice command.

[0003] Although speech recognition technology is relatively mature, it cannot yet achieve 100% accuracy. This is because Chinese has the characteristic of homophones with different characters. Even if these speech sounds are matched in a dictionary and some typos are corrected, some words will still be recognized inaccurately.

[0004] In particular, existing technologies for correcting errors in speech recognition text first perform word segmentation, then perform operations such as judging these words, extracting candidate words, sorting candidate words, and converting them into text. However, if an error occurs in the word segmentation step, subsequent steps will amplify the error step by step, resulting in the final text being far removed from the user's true meaning.

Summary of the Invention

[0005] To address at least one of the aforementioned problems in the prior art, the present invention aims to provide a speech recognition text correction method and apparatus capable of accurately correcting text errors in the context of refrigeration equipment applications.

[0006] To achieve the above-mentioned objective, one embodiment of the present invention provides a speech recognition text error correction method, comprising the following steps:

[0007] Translate speech data into text data;

[0008] A text similarity model is used in the corpus to filter a set of similar texts that are semantically similar to the text data, wherein the set of similar texts may include multiple identical sentences;

[0009] A distance set is selected from the corpus, consisting of texts whose edit distance from the text data is within a preset threshold range, wherein the distance set may include multiple identical text sentences;

[0010] A binary classification model is used to determine whether the value at each position of the text data is the first value or the second value;

[0011] From the similar text set and the distance set, a candidate text set is selected whose texts have the same text length as the text data and whose content at each specified position is the same as the content of the text data, where a specified position is a position in the text data with the first value;

[0012] The text with the highest frequency of occurrence in the candidate text set is selected as the corrected text.

[0013] As a further improvement of the present invention, the binary classification model is a Transformer+sigmoid model.

[0014] As a further improvement of the present invention, the step of using a text similarity model in the corpus to filter a set of semantically similar texts to the text data includes:

[0015] The text data is used to generate a text semantic vector through the text similarity model;

[0016] Each text in the corpus is used to generate a similar semantic vector through the text similarity model;

[0017] Calculate the similarity between the text semantic vector and the similar semantic vector to determine whether the text data is similar to the corresponding text in the corpus.

[0018] As a further improvement to the present invention, the following steps are also included:

[0019] Filter the corpus to select a first set whose text length is equal to that of the text data;

[0020] The text similarity model is used in the first set to filter a set of similar texts that are semantically similar to the text data;

[0021] From the first set, filter the distance set of texts whose edit distance from the text data is within a preset threshold range.

[0022] As a further improvement to the present invention, the following steps are also included:

[0023] If every bit of the text data has the first value, the text data is corrected text.

[0024] As a further improvement of the present invention, the edit distance is equal to the minimum number of edit operations required to transform one string into the other.

[0025] As a further improvement of the present invention, the step of translating speech data into text data includes:

[0026] The voice data is denoised to obtain enhanced user voice data;

[0027] Extract the enhanced user voice data to obtain user voice data;

[0028] The user's voice data is recognized to obtain text data.

[0029] To achieve one of the above-mentioned objectives, an embodiment of the present invention provides a speech recognition text correction device, comprising:

[0030] The translation module is used to translate speech data into text data;

[0031] The similar text set filtering module is used to filter similar text sets that are semantically similar to the text data in the corpus using a text similarity model. The similar text set may include multiple identical sentences.

[0032] The distance set filtering module is used to filter distance sets in the corpus that are within a preset threshold range of the edit distance to the text data, wherein the distance set may include multiple identical text sentences;

[0033] The judgment module is used to determine whether each value of the text data is a first value or a second value using a binary classification model;

[0034] The candidate text set filtering module filters candidate text sets from the similar text set and the distance set that have the same length as the text data and whose content at each specified position is the same as the content of the text data, wherein the specified position is the position in the text data with a value of the first value;

[0035] The sorting module is used to select the text with the highest frequency of occurrence from the candidate text set as the corrected text.

[0036] To achieve one of the above-mentioned objectives, one embodiment of the present invention provides an electronic device, comprising:

[0037] Storage module, used to store computer programs;

[0038] The processing module, when executing the computer program, can implement the steps in the above-described speech recognition text correction method.

[0039] To achieve one of the above-mentioned objectives, one embodiment of the present invention provides a readable storage medium storing a computer program that, when executed by a processing module, can implement the steps in the above-mentioned speech recognition text correction method.

[0040] Compared with existing technologies, the present invention has the following beneficial effects: With this speech recognition text correction method and device, a suitable range of texts is first narrowed down through the similar text set filtered by the text similarity model and the distance set filtered by edit distance. Then, using the binary classification model, the texts in these sets whose content matches the text data at every position having the first value are retained, which indirectly yields, from other texts in the corpus, the possible content for the positions having the second value. These texts form the candidate text set. The text with the highest frequency is then selected as the corrected text, which has the highest probability of being the correct text. Text correction is achieved without word segmentation of sentences, thereby more clearly identifying the user's real needs and improving the user experience.

Attached Figure Description

[0041] Figure 1 is a flowchart of a speech recognition text error correction method according to an embodiment of the present invention;

[0042] Figure 2 is a flowchart illustrating the specific steps of step S10 in an embodiment of the present invention;

[0043] Figure 3 is a schematic diagram of a speech recognition text correction device according to an embodiment of the present invention.

Detailed Implementation

[0044] The present invention will now be described in detail with reference to the specific embodiments shown in the accompanying drawings. However, these embodiments do not limit the present invention, and any structural, methodological, or functional modifications made by those skilled in the art based on these embodiments are included within the scope of protection of the present invention.

[0045] One embodiment of the present invention provides a speech recognition text correction method and apparatus that can accurately correct text errors in the context of refrigeration equipment use.

[0046] The refrigeration equipment in this embodiment can be a refrigerator, freezer, upright refrigerator, wine cabinet, etc. The following embodiment will be described using a refrigerator as an example. The refrigerator includes a cabinet, a storage space with an opening disposed in the cabinet, and a door covering the opening. Food can be placed in the storage space. A microphone can be disposed inside and / or outside the door for receiving user voice data.

[0047] After the microphone receives the voice data, the following speech recognition and text correction method can be run through the processing module on the refrigeration equipment. Alternatively, the voice data can be uploaded to a server and run through the server, or through the user's mobile phone or computer. After the process is complete, the refrigeration equipment acts on the recognition result. For example, if "What kind of stone is in the refrigerator?" is corrected to "What kind of food is in the refrigerator?", the correct answer to the question can be given, or the correct result can be displayed on the screen, so that text errors do not make the user's speech incomprehensible.

[0048] The following describes, with reference to Figure 1, a speech recognition text correction method according to an embodiment of the present invention. Although this application provides method operation steps as shown in the following embodiments or flowcharts, the execution order of steps that have no necessary logical dependency on one another is not limited to the order provided in the embodiments of this application, and may be changed through conventional or non-creative labor. For example, the order of steps S20 and S30 below can be arbitrarily adjusted, or the two steps can be performed simultaneously, without distinguishing the chronological order.

[0049] Specifically, the speech recognition text error correction method of this embodiment includes the following steps:

[0050] Step S10: Translate the speech data into text data;

[0051] As shown in Figure 2, step S10 further includes the following steps:

[0052] Step S11: Reduce noise in the voice data to obtain enhanced user voice data; Step S11 can reduce environmental noise and enhance the intensity of user voice.

[0053] Step S12: Extract the enhanced user voice data to obtain user voice data; Step S12 may only extract the content of the user voice and exclude other irrelevant content;

[0054] Step S13: Recognize the user's voice data to obtain text data.

[0055] After step S10, the voice data is converted into clean text data for subsequent processing.

[0056] For example, suppose the text data obtained after speech-to-text translation is "What kind of stone is in the refrigerator?"

[0057] Steps S20 and S30 below can filter directly from the corpus, or the following step can be performed first before starting steps S20 and S30.

[0058] In the corpus, a first set of texts with the same length as the text data is selected.

[0059] Taking the example above, "What kind of stone is in the refrigerator" has 8 characters and 16 bytes in the original Chinese. In the corpus, texts with 8 characters are first selected and merged into a new first set.

[0060] Then, steps S20 and S30 are both filtered within the first set, that is, steps S20 and S30 are filtered within a subset of the corpus, which greatly reduces the workload of filtering.
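The length pre-filter described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `first_set` and the sample corpus are hypothetical, and the Chinese strings are an assumed rendering of the translated example sentences rather than text taken from the source.

```python
from typing import List


def first_set(corpus: List[str], text: str) -> List[str]:
    """Keep corpus entries whose character count equals that of `text`.

    Duplicates are kept on purpose, because later steps count how often
    each text occurs.
    """
    return [entry for entry in corpus if len(entry) == len(text)]


# Hypothetical corpus; the first sentence appears twice.
corpus = [
    "冰箱里有什么食材",  # assumed rendering of "What kind of food is in the refrigerator"
    "冰箱里有什么食材",
    "打开冷藏室的灯",    # different length, so it is dropped
    "冰箱里有什么石材",  # assumed rendering of "What kind of stone is in the refrigerator"
]
print(first_set(corpus, "冰箱里有什么石材"))
```

Because the filter only compares lengths, it is cheap to run before the more expensive similarity and edit-distance steps.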

[0061] Step S20: Use a text similarity model in the corpus to filter a set of similar texts that are semantically similar to the text data, wherein the set of similar texts may include multiple identical sentences;

[0062] The corpus can include some repeated sentences, and a small amount of corpus content may even be incorrect. When filtering with the text similarity model, a sentence is not limited to being selected once: if there are n identical texts in the corpus, the sentence is selected n times.

[0063] Step S20 specifically also includes:

[0064] Step S21: Generate a text semantic vector from the text data using the text similarity model;

[0065] Step S22: Generate a similar semantic vector for each text in the corpus using the text similarity model;

[0066] Step S23: Calculate the similarity between the text semantic vector and the similar semantic vector, and determine whether the text data is similar to the corresponding text in the corpus.

[0067] Statistical methods are used here to calculate the degree of similarity.

[0068] Taking "What kind of stone is in the refrigerator?" as an example, after filtering by the text similarity model, semantically similar content such as "What kind of food is in the refrigerator?", "What kind of stone is in the refrigerator?", and "What kind of food is in the refrigerator?" can be selected.
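Steps S21 to S23 can be sketched as below. The source does not specify the text similarity model or the similarity measure, so this sketch makes two loud assumptions: a character-count vector stands in for the model's semantic vector (it does not capture real semantics), and cosine similarity with a hypothetical threshold of 0.8 stands in for the similarity calculation. All names and sample strings are illustrative.

```python
import math
from collections import Counter
from typing import Counter as CounterType, List


def embed(text: str) -> CounterType[str]:
    # Assumed stand-in for the text similarity model: character counts.
    return Counter(text)


def cosine(a: CounterType[str], b: CounterType[str]) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_set(corpus: List[str], text: str, threshold: float = 0.8) -> List[str]:
    """Return every corpus entry similar enough to `text`, keeping duplicates."""
    query = embed(text)
    return [entry for entry in corpus if cosine(query, embed(entry)) >= threshold]


corpus = ["冰箱里有什么食材", "冰箱里有什么食材", "今天天气怎么样"]
print(similar_set(corpus, "冰箱里有什么石材"))
```

In a real system `embed` would be replaced by the trained model's encoder; the filtering loop and the decision to keep duplicates would stay the same.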

[0069] Step S30: Filter the distance set in the corpus that is within a preset threshold range of the edit distance of the text data, wherein the distance set may include multiple identical text sentences.

[0070] As described in step S20, the filtered distance set can contain completely identical texts, as long as each of them passes the distance filter.

[0071] The edit distance is equal to the minimum number of edit operations required to transform one string into the other.

[0072] Taking the example above, the edit distance between "What kind of stone is in the refrigerator" and "What kind of food is in the refrigerator" is equal to 1, and the edit distance between "What kind of stone is in the refrigerator" and "What kind of food is in the refrigerator" is equal to 5. If the preset threshold is set to 4, texts with an edit distance greater than 4 are not considered, and only texts with an edit distance of 4 or less are included in the distance set.

[0073] Step S40: Using a binary classification model, determine whether the value at each position of the text data is the first value or the second value.

[0074] After using the binary classification model, each result is classified as either the first value or the second value. The first value is a value that can be distinguished from the second value. Here, the first value can represent a correct result and the second value an incorrect result. For example, the first value can be 1 and the second value can be 0.

[0075] Thus, the binary classification model has two possible results: 1 or 0. 1 indicates correctness and 0 indicates error. That is, a character at a position marked 1 is correct and a character at a position marked 0 is incorrect.
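The edit-distance filter can be sketched with the classic dynamic-programming computation. The source only says "minimum number of edit operations"; this sketch assumes the Levenshtein definition (insertion, deletion, and substitution, each costing 1). The function names and sample strings are hypothetical, and the Chinese strings are an assumed rendering of the translated example.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between `a` and `b`, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of `a`
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_set(corpus: List[str], text: str, threshold: int = 4) -> List[str]:
    """Keep corpus entries within `threshold` edits of `text`, with duplicates."""
    return [entry for entry in corpus if edit_distance(text, entry) <= threshold]


print(edit_distance("冰箱里有什么石材", "冰箱里有什么食材"))  # prints 1
```

Only one row of the table is kept at a time, so memory use grows with the shorter dimension rather than with the product of both lengths.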

[0076] Suppose the original text is: "What kind of stone is in the refrigerator?" The result after using the binary classification model is: 1111101, meaning that the character at the position of "stone" is likely incorrect. The binary classification model marks positions that are most likely correct as 1 and positions that are likely incorrect as 0.

[0077] Furthermore, the binary classification model is a Transformer+sigmoid model.

[0078] In other cases, the SVM algorithm or other binary classification models can be used instead.
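The sigmoid stage of a Transformer+sigmoid classifier can be sketched as follows. The Transformer encoder itself is not shown: the sketch assumes it has already produced one raw score (logit) per character, and the scores, the 0.5 cutoff, and the function names are all hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def position_flags(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Return 1 (likely correct) or 0 (likely wrong) for each position."""
    return [1 if sigmoid(score) >= threshold else 0 for score in logits]


# Hypothetical per-character scores for an 8-character sentence;
# the seventh score is negative, so that position is flagged as 0.
logits = [3.1, 2.8, 4.0, 3.5, 2.2, 2.9, -1.7, 2.4]
print(position_flags(logits))  # prints [1, 1, 1, 1, 1, 1, 0, 1]
```

Each position is classified independently, which is why the output is a sequence of flags with the same length as the text.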

[0079] Step S50: Filter the candidate text set from the similar text set and the distance set, where the text length is the same as the text data and the content at each specified position is the same as the content of the text data, wherein the specified position is the position in the text data where the value is the first value.

[0080] In other words, assuming the original text is "What kind of stone is in the refrigerator", each text in the candidate text set will also have 8 characters.

[0081] Furthermore, if the binary classification result for "What kind of stone is in the refrigerator" is 1111101, then each text in the candidate text set matches the pattern 11111X1 when compared with the text data: the position marked X may or may not match, but every other position must match.
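Step S50 can be sketched as a positional match against the flagged text. This is an illustrative sketch: the source does not say whether the two sets are merged by concatenation or by union, so concatenation is assumed here in order to preserve duplicate counts. The function name and sample data are hypothetical.

```python
from typing import List


def candidate_set(similar: List[str], distance: List[str],
                  text: str, flags: List[int]) -> List[str]:
    """Keep texts of equal length that agree with `text` wherever the flag is 1."""
    if len(flags) != len(text):
        raise ValueError("expected one flag per character of the text")
    fixed = [i for i, flag in enumerate(flags) if flag == 1]
    pool = similar + distance  # assumption: concatenate, keeping duplicates
    return [cand for cand in pool
            if len(cand) == len(text)
            and all(cand[i] == text[i] for i in fixed)]


text = "冰箱里有什么石材"
flags = [1, 1, 1, 1, 1, 1, 0, 1]  # seventh character is suspected wrong
similar = ["冰箱里有什么食材", "冰箱里有什么石材"]
distance = ["冰箱里有什么食材", "冰箱里有多少食材", "打开冷藏室的灯"]
print(candidate_set(similar, distance, text, flags))
```

Positions flagged 0 are left unconstrained, so the surviving candidates supply the possible replacements for the suspected character.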

[0082] Step S60: Select the text with the highest frequency of occurrence from the candidate text set as the corrected text.

[0083] Since both the similar text set and the distance set can contain the same text multiple times, the same text may appear multiple times in the candidate text set processed in step S60, for example, more than 3 times. The text that appears most frequently is selected from among these texts.

[0084] After following the steps above, the correct text, "What kind of food is in the refrigerator?", will most likely be found.

[0085] Furthermore, on the one hand, because users' speech to refrigeration equipment is generally not very long in the environment where the equipment is used, and the speech produced by many users is generally similar, it can usually be found in the corpus. On the other hand, after three rounds of screening, there are generally not many qualified texts that can appear in the candidate text set. Therefore, even if there is only one text in the candidate text set, that text is very likely to be correct at position X in the example above.

[0086] If there are two or more texts with the same frequency in the candidate text set, on the one hand, as mentioned above, the probability is extremely small; on the other hand, this can usually be resolved manually or by having the user repeat the text.
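Step S60 and the tie case can be sketched with a frequency count. This is a minimal sketch under one assumption: a tie or an empty candidate set is signalled by returning `None`, so the caller can fall back to manual handling or ask the user to repeat. The function name is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def pick_correction(candidates: List[str]) -> Optional[str]:
    """Return the most frequent candidate, or None if there is no clear winner."""
    if not candidates:
        return None
    ranked = Counter(candidates).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # two texts share the top frequency
    return ranked[0][0]


print(pick_correction(["food", "stone", "food"]))  # prints food
```

Returning `None` rather than picking arbitrarily keeps the ambiguous case visible to the calling code.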

[0087] Furthermore, even if the values of the speech data translated into text data in step S10 are completely correct, after being filtered by the edit distance and text similarity model, it is highly likely that the corresponding sentence will be found.

[0088] The method in this embodiment does not perform word segmentation and analysis on the speech first. Instead, it finds text that is similar to the speech and has correct content as much as possible. In this way, it can deduce the specific content of the incorrect words with a value of 0 in the binary classification.

[0089] Furthermore, the step of using a text similarity model in the corpus to filter a set of semantically similar texts to the text data includes:

[0090] In addition, it also includes the following steps:

[0091] If every bit of the text data has the first value, the text data is corrected text.

[0092] In other words, if the binary classification model determines that the character at each position is the correct character, then the text content is assumed to be correct, thus avoiding the need to filter the text through distance sets and similar text sets to change it to other content.
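The shortcut for fully correct text can be sketched as a guard in front of the correction steps. In this sketch, `run_correction` is a hypothetical callback standing in for the filtering and selection steps; the names are illustrative only.

```python
from typing import Callable, List


def correct(text: str, flags: List[int],
            run_correction: Callable[[str, List[int]], str]) -> str:
    """Return `text` unchanged when every position is flagged 1 (correct)."""
    if all(flag == 1 for flag in flags):
        return text  # nothing to fix, so skip the set filtering entirely
    return run_correction(text, flags)


# The lambda is a placeholder for the real correction pipeline.
print(correct("abc", [1, 1, 1], lambda t, f: "corrected"))  # prints abc
```

Checking the flags first avoids both the cost of the set filtering and the risk of replacing an already correct sentence.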

[0093] Compared with the prior art, this embodiment has the following beneficial effects:

[0094] With this speech recognition text correction method and device, a suitable range of texts is first narrowed down through the similar text set filtered by the text similarity model and the distance set filtered by edit distance. Then, using the binary classification model, the texts in these sets whose content matches the text data at every position having the first value are retained, which indirectly yields, from other texts in the corpus, the possible content for the positions having the second value. These texts form the candidate text set. Finally, the text with the highest frequency is selected as the corrected text, which has the highest probability of being the correct text. Text correction is achieved without word segmentation of sentences, thereby more clearly identifying the user's real needs and improving the user experience.

[0095] In one embodiment, a speech recognition text correction device is provided, as shown in Figure 3. The speech recognition text correction device includes the following modules, and the specific functions of each module are as follows:

[0096] The translation module is used to translate speech data into text data;

[0097] The similar text set filtering module is used to filter similar text sets that are semantically similar to the text data in the corpus using a text similarity model. The similar text set may include multiple identical sentences.

[0098] The distance set filtering module is used to filter distance sets in the corpus that are within a preset threshold range of the edit distance to the text data, wherein the distance set may include multiple identical text sentences;

[0099] The judgment module is used to determine whether each bit of the text data is 0 or 1 using a binary classification model;

[0100] The candidate text set filtering module filters, from the similar text set and the distance set, a candidate text set whose texts have the same length as the text data and whose content is the same as that of the text data at all positions with a value of 1.

[0101] The sorting module is used to select the text with the highest frequency of occurrence from the candidate text set as the corrected text.

[0102] It should be noted that for details not disclosed in the speech recognition text correction device of this embodiment, please refer to the details disclosed in the speech recognition text correction method of this embodiment.

[0103] Those skilled in the art will understand that the schematic diagram of the module is merely an example of a speech recognition text correction device and does not constitute a limitation on the terminal device of the speech recognition text correction device. It may include more or fewer components than shown in the diagram, or combine certain components, or different components. For example, the speech recognition text correction device may also include input / output devices, network access devices, buses, etc.

[0104] The speech recognition text correction device may further include computing devices such as computers, laptops, PDAs, and cloud servers, as well as, but not limited to, a processing module, a storage module, and a computer program stored in the storage module and capable of running on the processing module, such as the speech recognition text correction method program described above. When the processing module executes the computer program, it implements the steps in the various speech recognition text correction method embodiments described above, for example, the steps shown in Figure 1.

[0105] The speech recognition text correction device may also include a signal transmission module and a communication bus. The signal transmission module is used to send data to other external processing modules or servers. Other external processing modules, such as mobile phones, can exchange data with the refrigeration equipment wirelessly, such as via Bluetooth, Wi-Fi, or ZigBee. The communication bus is used to establish a connection between the microphone, signal transmission module, processing module, and storage module. The communication bus may include a path for transmitting information between the microphone, signal transmission module, processing module, and storage module.

[0106] In addition, the present invention also proposes an electronic device, which includes a storage module and a processing module. When the processing module executes the computer program, it can implement the steps in the above-mentioned speech recognition text correction method, that is, implement the steps in any of the above-mentioned speech recognition text correction methods.

[0107] The electronic device may be part of a speech recognition text correction device, a local terminal device, or part of a cloud server.

[0108] The processing module can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. General-purpose processors can be microprocessors or any conventional processor. The processing module is the control center of the speech recognition and text correction device, connecting all parts of the device via various interfaces and lines.

[0109] The storage module can be used to store the computer programs and / or modules. The processing module implements various functions of the speech recognition and text correction device by running or executing the computer programs and / or modules stored in the storage module and by calling the data stored in the storage module. The storage module may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function, etc. In addition, the storage module may include high-speed random access memory, and may also include non-volatile memory, such as hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.

[0110] For example, the computer program can be divided into one or more modules / units, which are stored in a storage module and executed by a processing module to complete the present invention. The one or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program in the speech recognition and text correction device.

[0111] Furthermore, one embodiment of the present invention provides a readable storage medium storing a computer program that, when executed by a processing module, can implement the steps in the above-described speech recognition text correction method, that is, implement the steps in any of the technical solutions of the above-described speech recognition text correction method.

[0112] If the integrated module of the speech recognition text correction method is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by the processing module, it can implement the steps of the above-described method embodiments.

[0113] The computer program includes computer program code, which can be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium can include any entity or device capable of carrying the computer program code, recording media, U disks, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added to or subtracted according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

[0114] It should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This way of describing the specification is only for clarity. Those skilled in the art should regard the specification as a whole. The technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.

[0115] The detailed descriptions listed above are merely specific descriptions of feasible embodiments of the present invention, and are not intended to limit the scope of protection of the present invention. All equivalent embodiments or modifications made without departing from the spirit of the present invention should be included within the scope of protection of the present invention.

Claims

1. A speech recognition text correction method, characterized in that it includes the following steps: translating speech data into text data; using a text similarity model to filter, from a corpus, a similar text set that is semantically similar to the text data, wherein the similar text set may include multiple identical sentences; filtering, from the corpus, a distance set whose edit distance from the text data is within a preset threshold range, wherein the distance set may include multiple identical text sentences; using a binary classification model to determine whether the value at each position of the text data is a first value or a second value, where the first value indicates a correct result and the second value indicates an incorrect result; filtering, from the similar text set and the distance set, a candidate text set whose texts have the same text length as the text data and whose content at each specified position is the same as the content of the text data, wherein a specified position is a position in the text data where the value is the first value; selecting the text with the highest frequency of occurrence in the candidate text set as the corrected text; and, if every position of the text data has the first value, taking the text data as the corrected text.

2. The speech recognition text correction method according to claim 1, characterized in that the binary classification model is a Transformer+sigmoid model.
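
As a rough illustration of the sigmoid part of such a model, the sketch below turns one score per position into a first value or a second value. It assumes that a Transformer encoder (not shown) has already produced one real-valued logit per position; the names `sigmoid` and `position_flags`, the 0.5 threshold, and the encoding of the first value as `1` are assumptions that the claim does not state.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def position_flags(logits, threshold=0.5):
    """Map per-position logits to 1 (judged correct) or 0 (judged incorrect).

    `logits` stands in for the output of a Transformer encoder, which is
    outside the scope of this sketch.
    """
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]
```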

3. The speech recognition text correction method according to claim 1, characterized in that the step of using a text similarity model to screen, from the corpus, a similar text set that is semantically similar to the text data comprises: generating a text semantic vector from the text data through the text similarity model; generating a similar semantic vector from each text in the corpus through the text similarity model; and calculating the similarity between the text semantic vector and the similar semantic vector to determine whether the text data is similar to the corresponding text in the corpus.
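
The vector comparison in claim 3 can be sketched as follows. The claim does not name a similarity measure, so cosine similarity is used here purely as an assumption, and the 0.8 threshold and the function names are likewise hypothetical. The semantic vectors are assumed to have been produced beforehand by the text similarity model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def similar_text_set(query_vec, corpus_vecs, threshold=0.8):
    """Return corpus texts whose vector is close enough to the query vector.

    corpus_vecs is a list of (text, vector) pairs. The result is a list,
    so identical sentences are preserved as duplicates.
    """
    return [
        text
        for text, vec in corpus_vecs
        if cosine_similarity(query_vec, vec) >= threshold
    ]
```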

4. The speech recognition text correction method according to claim 1, characterized by further comprising the following steps: screening, from the corpus, a first set whose text length is equal to that of the text data; using the text similarity model to screen, from the first set, a similar text set that is semantically similar to the text data; and screening, from the first set, a distance set whose edit distance from the text data is within a preset threshold range.

5. The speech recognition text correction method according to claim 1, characterized in that the edit distance is equal to the minimum number of edit operations required to transform one of two strings into the other.
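
A minimal sketch of this definition is given below. It assumes the usual Levenshtein operations (insert, delete, or substitute one character, each costing 1), which the claim does not enumerate. The helper `distance_set` and its `max_distance` parameter are hypothetical and model the preset threshold range as a simple upper bound.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic program)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into the empty string needs i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        prev = curr
    return prev[-1]


def distance_set(text, corpus, max_distance=2):
    """Return corpus texts within max_distance edits of text, duplicates kept."""
    return [s for s in corpus if edit_distance(text, s) <= max_distance]
```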

6. The speech recognition text correction method according to claim 1, characterized in that the step of converting speech data into text data comprises: denoising the speech data to obtain enhanced user speech data; performing extraction on the enhanced user speech data to obtain user speech data; and recognizing the user speech data to obtain the text data.

7. A speech recognition text correction device, characterized by comprising: a conversion module, configured to convert speech data into text data; a similar text set screening module, configured to use a text similarity model to screen, from a corpus, a similar text set that is semantically similar to the text data, wherein the similar text set may include multiple identical text sentences; a distance set screening module, configured to screen, from the corpus, a distance set whose edit distance from the text data is within a preset threshold range, wherein the distance set may include multiple identical text sentences; a judgment module, configured to use a binary classification model to determine whether the value of each bit of the text data is a first value or a second value, wherein the first value indicates a correct result and the second value indicates an incorrect result, and wherein, if every bit of the text data has the first value, the text data is taken as the corrected text; a candidate text set screening module, configured to screen, from the similar text set and the distance set, a candidate text set whose text length is the same as that of the text data and whose content at each specified position is the same as the content of the text data, wherein a specified position is a position in the text data whose value is the first value; and a sorting module, configured to take the text with the highest frequency of occurrence in the candidate text set as the corrected text.

8. An electronic device, characterized by comprising: a storage module, configured to store a computer program; and a processing module which, when executing the computer program, can implement the steps of the speech recognition text correction method according to any one of claims 1 to 6.

9. A readable storage medium storing a computer program, characterized in that, when executed by a processing module, the computer program can implement the steps of the speech recognition text correction method according to any one of claims 1 to 6.