A noise level-based speech correction confirmation interaction method
By determining the granularity of confirmation through noise classification and ambiguity levels, expanding the speech context step by step, generating confirmation interaction information, and adjusting the granularity based on user feedback, the problem of handling ambiguity in speech recognition results under complex noise environments is solved, improving the efficiency of error correction confirmation and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG LEGEND COMM CO LTD
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to handle ambiguities in speech recognition results under complex noise environments, leading to issues such as excessively long confirmation content, too much irrelevant information, a heavy burden on user comprehension, and an increase in interaction rounds.
The confirmation granularity is determined by noise classification and ambiguity level, the speech context is expanded level by level, confirmation interaction information is generated, and the confirmation granularity is adjusted according to user feedback until the ambiguity is eliminated.
It improves the targeting and efficiency of error correction and confirmation, reduces invalid feedback and duplicate confirmation, and enhances the robustness of interaction and user experience in complex scenarios.
Figure CN122245298A_ABST
Description
Technical Field
[0001] This invention relates to the field of voice interaction and voice recognition technology, specifically a voice error correction and confirmation interaction method based on noise classification.
Background Technology
[0002] With the development of speech recognition and intelligent voice interaction technologies, voice input has been widely applied in scenarios such as vehicle control, smart terminals, navigation retrieval, voice assistants, telephone services, and human-computer dialogue. Existing technologies typically first perform automatic speech recognition on the user's voice to obtain an initial recognition result, and then perform post-processing and error correction on the recognition result based on language models, hot word models, domain lexicons, or user feedback.
[0003] However, existing technologies mainly focus on improving the accuracy of the recognition results or understanding the semantics of user responses to queries. When ambiguity exists in the recognition results, the system lacks a more refined processing mechanism regarding the granularity, scope, and content of confirmation required for error correction and confirmation interactions with the user. CN118471201A proposes a hot word error correction scheme for speech recognition engines. It collects misidentified data, constructs a training dataset, and establishes an error correction model for hot word errors. This model detects and corrects errors in the text output by the speech recognition engine, improving its ability to correct hot word errors. The key focus is on error detection and correction of the speech recognition engine's output text through training and inference modules. Its core improvement direction is model-level hot word error correction and data-driven optimization, primarily addressing the issue of automatic error correction after recognition. While this type of scheme helps improve the error correction performance of specific word classes, especially hot words, in complex noisy environments, the user's original speech often contains more than just misidentified hot words; there may be unclear phrase boundaries, weakened contextual semantic connections, and multiple coexisting candidate correction results. In this situation, relying solely on text-level automatic correction often makes it difficult to determine which part of the content the system should ask the user for confirmation, and it is also difficult to dynamically control the scope of confirmation based on the current level of ambiguity. This can easily lead to problems such as excessively long confirmation content, too much irrelevant information, a heavy burden on the user's understanding, and an increase in the number of interaction rounds.
[0004] On the other hand, CN111696535A discloses an information verification scheme based on voice interaction. It acquires the user's voice response text to a voice inquiry and performs semantic understanding by combining character and phonetic information to identify the user's intentions such as confirmation, denial, answering, or asking questions. Although it involves parsing user feedback to voice inquiries and can identify confirmation, denial, answering, or asking questions, its technical focus is on semantic understanding of the inquiry feedback text to complete the predetermined information verification task. Such schemes typically assume that the content of the system's inquiry is already relatively clear. However, existing technologies do not provide targeted hierarchical processing mechanisms for locating the voice segment to be confirmed in the automatic speech recognition result, determining the level of ambiguity of the segment, dynamically expanding the confirmation unit, and determining whether the user's attitudinal feedback (yes, no, uncertain) uniquely points to one of multiple candidate correction results when the user only provides attitudinal feedback. 
Especially in high-noise scenarios, if the system still uses a fixed template or fixed granularity for confirmation, the following defects are likely to occur. First, the confirmation statement may contain contextual information unrelated to the current error correction, making it difficult for users to quickly grasp the differences. Second, when user feedback cannot uniquely correspond to a candidate result, the system lacks a mechanism to gradually expand the confirmation granularity, supplement the necessary context, and reconfirm, which may lead to repeated inquiries, unclear referential relationships, invalid user feedback, or error correction failure. Third, the linkage between noise intensity and ambiguity level is not considered, so the conventional confirmation method is still used when noise is high, which further reduces confirmation efficiency and degrades the interactive experience.
Summary of the Invention
[0005] The purpose of this invention is to provide a speech error correction confirmation interaction method based on noise classification, thereby solving some of the drawbacks and shortcomings pointed out in the background art.
[0006] The present invention addresses the aforementioned technical problems by employing the following technical solution: a speech error correction and confirmation interaction method based on noise classification, comprising: acquiring user speech input, and performing speech recognition on the user speech input to obtain an initial recognition result; determining a noise level based on the noise characteristics of the user speech input; identifying, based on the initial recognition result, the speech segments to be confirmed that have recognition ambiguity; determining the ambiguity level of the speech segments to be confirmed; determining the confirmation granularity based on the noise level and the ambiguity level, and outputting confirmation interaction information; receiving user feedback information and correcting the initial recognition result based on the user feedback information; and, if the corrected recognition result still has recognition ambiguity and the user feedback information cannot uniquely correspond to a corrected result, adjusting the confirmation granularity to the next higher level of confirmation granularity, re-outputting confirmation interaction information, and receiving user feedback information again until the recognition ambiguity is eliminated or a preset termination condition is reached.
[0007] Furthermore, taking the speech segment to be confirmed as the center, the speech range is gradually expanded to both sides along the speech sequence of the initial recognition result until the expanded speech unit can correspond to a semantic content to be executed; the speech unit corresponding to the semantic content to be executed is determined as the target confirmation unit, and the ambiguity level of the speech segment to be confirmed is determined according to the composition range of the target confirmation unit.
[0008] Furthermore, when the noise level reaches a preset level, distinguishing speech components used to differentiate different correction results are extracted from the target confirmation unit, and confirmation interaction information is generated according to the rule of prioritizing distinguishing speech components and omitting non-distinguishing speech components; wherein, the confirmation interaction information only includes the target confirmation unit corresponding to the current confirmation granularity, and does not include other speech content in the initial recognition result that is unrelated to the current error correction.
[0009] Furthermore, for at least two candidate correction results corresponding to the target confirmation unit, the dissimilar speech components in each candidate correction result are compared, and the speech components that enable each candidate correction result to be distinguished from each other are determined as the distinguishing speech components.
[0010] Furthermore, the distinguishing speech components are retained, as are the adjacent related speech components used to define the directional relationship of the distinguishing speech components; the modifying speech components, repetitive speech components, and background context speech components that do not affect the distinction between different correction results in the target confirmation unit are deleted.
[0011] Furthermore, the voice broadcast is performed in the order of first outputting the distinguishing speech components and then outputting the adjacent related speech components; when the noise level is higher than the preset level, the confirmation interaction information is limited to a candidate selection confirmation statement, so that the user feedback information directly corresponds to one of the candidate correction results.
[0012] Furthermore, error correction direction determination is performed on the user feedback information; only when the user feedback information cannot uniquely correspond to a correction result is the current confirmation granularity adjusted to the next higher level of confirmation granularity; wherein the next higher level of confirmation granularity corresponds to a target confirmation unit containing more speech context, and the confirmation interaction information is regenerated based on the adjusted confirmation granularity so that the speech range covered by the new round of confirmation interaction information is expanded beyond the speech range covered by the previous round of confirmation interaction information.
[0013] Furthermore, the user feedback information is compared with multiple candidate correction results at the current confirmation granularity; when the user feedback information can only represent the confirmation attitude but cannot represent the distinguishing relationship between the candidate correction results, it is determined that the user feedback information cannot uniquely correspond to a correction result; wherein, the confirmation attitude includes positive feedback, negative feedback and uncertain feedback.
[0014] Furthermore, taking the target confirmation unit corresponding to the current confirmation granularity as the center, the speech context is preferentially expanded toward the side on which the speech components that differ between the candidate correction results are located, until the expanded target confirmation unit can establish the correspondence between the user feedback information and the target correction result; and a new round of confirmation interaction information is generated based on the expanded target confirmation unit.
[0015] Furthermore, when the user feedback information is determined to only represent a confirmation attitude and not a distinguishing relationship between candidate correction results, a secondary pointing analysis is performed on the confirmation attitude by adding the order information of each candidate correction result under the current confirmation granularity; if the user feedback information still cannot establish a unique correspondence with any candidate correction result, the user feedback information is determined to be invalid feedback, and the next round of confirmation interaction is triggered.
[0016] The beneficial effects of this invention are as follows: The noise level is determined based on the noise characteristics of the user's voice input. Then, combined with the ambiguous speech segments to be confirmed in the initial recognition results and their ambiguity levels, the confirmation granularity is jointly determined, and confirmation interaction information is output accordingly. Therefore, it is possible to adaptively select a more suitable confirmation range based on the strength of environmental noise and the complexity of ambiguity, avoiding over-confirmation in low-noise, low-ambiguity situations, and also avoiding an overly narrow confirmation range in high-noise, high-ambiguity situations that makes it difficult to eliminate ambiguity, thereby improving the targeting and efficiency of error correction confirmation.
[0017] After user feedback is received, the error correction process does not simply end on the basis of a single round of feedback. Instead, if the corrected recognition result still has ambiguity and the user feedback does not uniquely correspond to a single corrected result, the confirmation granularity is adjusted to the next higher level and confirmation interaction information is output again, until the ambiguity is resolved or a preset termination condition is met. This progressively expanding confirmation mechanism introduces more of the necessary speech context when user feedback is unclear, strengthening the correspondence between feedback and candidate corrected results and reducing the probability of invalid feedback and duplicate confirmations, which improves the error correction success rate, interaction robustness, and user experience in complex scenarios.
Description of the Attached Figures
[0018] Figure 1 is a flowchart of the speech error correction confirmation interaction method based on noise classification according to the present invention.
[0019] Figure 2 illustrates the composition of the comprehensive noise score and the determination of the noise level in Embodiment 1 of the present invention.
[0020] Figure 3 shows the expansion of the speech segment to be confirmed and the determination of its ambiguity dependency in Embodiment 1 of the present invention.
[0021] Figure 4 illustrates the candidate confirmation content screening and confirmation interaction organization in Embodiment 1 of the present invention.
[0022] Figure 5 is a comparison chart of the data composition of the first-round feedback matching scores in Embodiment 2 of the present invention.
[0023] Figure 6 is a diagram of the unique-correspondence difference and the confirmation granularity adjustment judgment in Embodiment 2 of the present invention.
[0024] Figure 7 is a diagram showing the composition of the secondary pointing analysis score in Embodiment 2 of the present invention.
Detailed Implementation
[0025] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0026] With reference to Figure 1, this invention discloses a voice error correction and confirmation interaction method based on noise classification. First, the system acquires the user's voice input and performs voice recognition processing on it to obtain an initial recognition result. The voice input can be real-time voice from a terminal device, human-computer interaction device, or voice control device, or it can be voice data that has been collected and then input into the system. After receiving the user's voice input, the system can call a voice recognition model to decode the voice content, outputting a text result, candidate text results, and confidence information related to the recognition process, which together form the initial recognition result. The candidate text results can serve as the basis for subsequent candidate correction results.
[0027] After obtaining the initial recognition result, the system further determines the noise level based on the noise characteristics of the user's voice input. Noise characteristics may include background noise intensity, signal-to-noise ratio, proportion of non-speech interference components, speech intelligibility, and speech segment stability information. The system can extract features from the acquired speech signal and, combined with pre-set noise grading rules or a noise level determination model, output the corresponding noise level. Preferably, the noise level is divided into at least three levels: low noise, medium noise, and high noise. When using rule-based determination, feature values can be calculated for each noise feature separately, and the level can be graded according to the weighted summation result or segmented threshold. When using model-based determination, the noise features can be input into the noise level determination model, and the model outputs the noise level. The grading threshold can be determined based on training samples, historical interaction samples, or preset empirical values. Reaching the preset level indicates that the noise level is medium noise or higher, while exceeding the preset level indicates that the noise level is high noise.
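The rule-based determination described above can be sketched as a weighted sum followed by segmented thresholds. This is only a minimal illustration: the feature names, the normalization of every feature to [0, 1] (with 1 meaning worst), the weights, and the two thresholds are all assumed values chosen for the example and are not given in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class NoiseFeatures:
    # Assumption: every feature is pre-normalized to [0, 1], where 1 is worst.
    background_intensity: float
    snr_penalty: float        # e.g. 1 - normalized signal-to-noise ratio
    non_speech_ratio: float   # proportion of non-speech interference
    unintelligibility: float  # 1 - speech intelligibility
    instability: float        # 1 - speech segment stability


# Hypothetical weights and thresholds, for illustration only.
WEIGHTS = {
    "background_intensity": 0.25,
    "snr_penalty": 0.30,
    "non_speech_ratio": 0.15,
    "unintelligibility": 0.20,
    "instability": 0.10,
}
MEDIUM_THRESHOLD = 0.35
HIGH_THRESHOLD = 0.65


def noise_score(features: NoiseFeatures) -> float:
    """Weighted sum of the normalized noise features."""
    return sum(w * getattr(features, name) for name, w in WEIGHTS.items())


def noise_level(features: NoiseFeatures) -> str:
    """Map the score to low / medium / high with segmented thresholds."""
    score = noise_score(features)
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def reaches_preset_level(level: str) -> bool:
    # "Reaching the preset level" means medium noise or higher.
    return level in ("medium", "high")


def exceeds_preset_level(level: str) -> bool:
    # "Exceeding the preset level" means high noise.
    return level == "high"
```

In a real system the thresholds would come from training samples, historical interaction samples, or empirical values, as the paragraph above notes.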
[0028] The system can combine recognition confidence, differences between candidate text results, semantic parsing conflicts, and context matching to locate ambiguous speech segments from the initial recognition results and designate these segments as speech segments to be confirmed. Subsequently, the system determines the ambiguity level of the speech segment to be confirmed. The ambiguity level characterizes the degree of impact of the current ambiguity on semantic understanding and the range of context required to complete error correction. The ambiguity level can be divided based on the location of the speech segment to be confirmed, its correlation with adjacent speech content, the degree of difference between candidate interpretation results, and the range of semantic information required to eliminate ambiguity. Preferably, the ambiguity level can be divided into three levels: low, medium, and high. The low level corresponds to ambiguity that can be eliminated within the speech segment itself or its short nearest neighbor; the medium level corresponds to ambiguity that requires phrase-level context; and the high level corresponds to ambiguity that requires semantic unit-level or complete instruction-level context.
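As a rough illustration of the locating step, the sketch below flags runs of low-confidence tokens as speech segments to be confirmed. It uses recognition confidence only; the other signals named above (candidate differences, semantic parsing conflicts, context matching) are left out. The function name, the (token, confidence) input format, and the 0.6 threshold are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def find_segments_to_confirm(
    tokens_with_conf: Sequence[Tuple[str, float]],
    min_conf: float = 0.6,  # hypothetical confidence threshold
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index spans of consecutive low-confidence tokens."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, (_token, conf) in enumerate(tokens_with_conf):
        if conf < min_conf:
            if start is None:
                start = i  # a new low-confidence run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(tokens_with_conf)))
    return spans
```

For example, with the recognized tokens `navigate (0.95) to (0.90) main (0.40) street (0.50) now (0.90)`, the span covering "main street" is returned as the segment to be confirmed.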
[0029] After determining the noise level and ambiguity level, the system determines the confirmation granularity based on these factors and outputs confirmation interaction information. The confirmation granularity characterizes the range of speech and information covered by the current confirmation interaction. Preferably, the confirmation granularity can be divided into segment level, phrase level, semantic unit level, and complete instruction level, according to the range of coverage, from smallest to largest. When the noise level is low, the confirmation granularity can be relatively focused on the speech segment to be confirmed; when the ambiguity level or noise level is high, the confirmation granularity can be extended to speech units containing more contextual information. Based on the determined confirmation granularity, the system generates confirmation interaction information corresponding to the current error correction task and outputs it to the user in the form of voice broadcast, text display, or a combination of voice and text, to guide the user to confirm or correct the currently ambiguous recognition content.
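One possible way to combine the two levels into a confirmation granularity is shown below. The disclosure does not fix a concrete mapping, so the rule used here (the ambiguity level sets the base granularity and high noise raises it by one level) is purely an assumption for illustration, as are all names.

```python
from enum import IntEnum


class Granularity(IntEnum):
    # Ordered by coverage, from smallest to largest.
    SEGMENT = 0
    PHRASE = 1
    SEMANTIC_UNIT = 2
    FULL_INSTRUCTION = 3


# Hypothetical base mapping from ambiguity level to granularity.
_AMBIGUITY_BASE = {
    "low": Granularity.SEGMENT,
    "medium": Granularity.PHRASE,
    "high": Granularity.SEMANTIC_UNIT,
}


def choose_granularity(noise_level: str, ambiguity_level: str) -> Granularity:
    """Pick a granularity; high noise widens the range by one level (assumed rule)."""
    base = _AMBIGUITY_BASE[ambiguity_level]
    bump = 1 if noise_level == "high" else 0
    return Granularity(min(base + bump, Granularity.FULL_INSTRUCTION))


def next_higher(granularity: Granularity) -> Granularity:
    """Move to the next higher level, capped at the complete instruction level."""
    return Granularity(min(granularity + 1, Granularity.FULL_INSTRUCTION))
```

Using an ordered enumeration keeps "adjust to the next higher level" a simple increment with a cap.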
[0030] After outputting confirmation interaction information, the system receives user feedback and corrects the initial recognition result based on it. User feedback can be positive, negative, candidate selection, or rephrased. The system can perform semantic analysis on the user feedback and, combined with the candidate correction results at the current confirmation granularity, replace, revise, or retain the speech segment to be confirmed in the initial recognition result, thereby obtaining the corrected recognition result. The candidate correction results can consist of multiple candidate text results corresponding to the initial recognition result, candidate results reordered based on context, or candidate results formed after performing local replacement on the speech segment to be confirmed.
[0031] If the corrected recognition result still has ambiguity, and the current user feedback information cannot uniquely correspond to a corrected result, the system will adjust the confirmation granularity to the next higher level and re-output the confirmation interaction information and receive the user feedback information. The next higher level of confirmation granularity corresponds to a confirmation method with a wider coverage, which includes more speech context or a more complete semantic range.
[0032] In subsequent rounds, the system continues to generate new confirmation interaction information based on the adjusted confirmation granularity, and receives user feedback again to update the recognition results. This process can be repeated until the recognition ambiguity is eliminated or a preset termination condition is met. The preset termination condition may include reaching a set number of confirmation rounds, the user explicitly terminating the error correction process, multiple consecutive rounds of ineffective feedback, or the system determining that the current interaction cannot continue to complete effective error correction. When the preset termination condition is met, the system may output a termination prompt and, based on the existing interaction results, retain the current recognition result, revert to the initial recognition result, or request the user to re-enter voice input.
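The multi-round process and its termination conditions can be sketched as a simple loop. The callbacks `ask_user` and `resolve`, the literal `"cancel"` reply, and the default limits are hypothetical stand-ins for the output, feedback-parsing, and policy components of a real system.

```python
from typing import Callable, Optional, Sequence, Tuple

LEVELS = ("segment", "phrase", "semantic_unit", "full_instruction")


def run_confirmation(
    start_level: str,
    candidates: Sequence[str],
    ask_user: Callable[[str, Sequence[str]], str],
    resolve: Callable[[str, Sequence[str]], Optional[str]],
    max_rounds: int = 3,   # hypothetical limit on confirmation rounds
    max_invalid: int = 2,  # hypothetical limit on consecutive invalid rounds
) -> Tuple[Optional[str], str]:
    """Return (chosen_candidate_or_None, stop_reason)."""
    level_idx = LEVELS.index(start_level)
    invalid_streak = 0
    for _ in range(max_rounds):
        feedback = ask_user(LEVELS[level_idx], candidates)
        if feedback == "cancel":
            return None, "user_terminated"
        chosen = resolve(feedback, candidates)
        if chosen is not None:
            return chosen, "resolved"
        # Feedback did not point to exactly one candidate.
        invalid_streak += 1
        if invalid_streak >= max_invalid:
            return None, "too_many_invalid_rounds"
        # Escalate to the next higher confirmation granularity.
        level_idx = min(level_idx + 1, len(LEVELS) - 1)
    return None, "max_rounds_reached"
```

When the loop stops without a result, the caller decides whether to keep the current recognition result, revert to the initial one, or ask the user to speak again.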
[0033] The system can start from the location of the speech segment to be confirmed, take the speech segment itself as the initial analysis scope, and determine whether the analysis scope can form clear semantic content to be executed based on a semantic parsing model, an instruction recognition model, or context matching rules. The semantic content to be executed represents an expression of intent that the system can directly execute or map. Preferably, the analysis scope is determined to form clear semantic content to be executed when at least one of an action or an object can be parsed from it and the other can be supplied through context completion so that a stable meaning is formed, or when it can be mapped to a predefined instruction template. When the corresponding semantic content to be executed cannot be determined from the speech segment to be confirmed alone, the system gradually introduces adjacent speech content in the forward and backward directions of the speech sequence to expand the current analysis scope. After each expansion, the system re-evaluates the semantic integrity and semantic orientation of the expanded speech unit to confirm whether it has a relatively complete semantic expression capability and whether it can support the formation of a clear execution intention, control object, or operation content. Semantic integrity can be determined based on slot filling results, syntactic dependency relations, or intent recognition confidence; semantic orientation can be determined based on context matching scores or the consistency of candidate interpretation results.
[0034] When the system determines that the expanded speech unit can correspond to a semantic content to be executed, it designates that speech unit as the target confirmation unit. The target confirmation unit defines the speech range surrounding the current confirmation interaction, ensuring that when the system outputs confirmation interaction information subsequently, it covers the necessary content related to ambiguity resolution while avoiding introducing excessive speech information unrelated to the current error correction. The scope of the target confirmation unit corresponds to the confirmation granularity; the higher the confirmation granularity, the larger the context range covered by the target confirmation unit.
[0035] After identifying the target confirmation unit, the system determines the ambiguity level of the speech segment to be confirmed based on the scope of the target confirmation unit. The scope can be characterized as the context span covered by the target confirmation unit relative to the speech segment to be confirmed, or as the range of speech content required to achieve a clear semantic correspondence. If the speech segment to be confirmed can correspond to a semantic content to be executed along with adjacent speech content within a small scope, its ambiguity level can be determined to be low; if phrase-level context is required before the expanded speech unit can correspond to a semantic content to be executed, its ambiguity level can be determined to be medium; if semantic unit-level or complete instruction-level context is required before the expanded speech unit can correspond to a semantic content to be executed, its ambiguity level can be determined to be high. Thus, the system can distinguish the ambiguity level based on the degree of contextual dependence of the speech segment to be confirmed.
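The expansion and level assignment in the preceding paragraphs can be sketched as follows. The `is_executable` predicate stands in for the semantic parsing or instruction recognition model, and the token-count cut-offs used to separate low, medium, and high are assumed values, since the disclosure describes the levels only qualitatively.

```python
from typing import Callable, Sequence, Tuple


def expand_to_target_unit(
    tokens: Sequence[str],
    start: int,
    end: int,
    is_executable: Callable[[Sequence[str]], bool],
) -> Tuple[int, int]:
    """Grow the half-open span [start, end) by one token per side per step
    until the span maps to executable semantic content or covers everything."""
    lo, hi = start, end
    while not is_executable(tokens[lo:hi]):
        if lo == 0 and hi == len(tokens):
            break  # the whole utterance is already covered
        lo = max(lo - 1, 0)
        hi = min(hi + 1, len(tokens))
    return lo, hi


def ambiguity_level(segment_len: int, unit_len: int) -> str:
    """Grade ambiguity by how much context the target unit had to add.
    The cut-offs (2 and 5 added tokens) are illustrative assumptions."""
    added = unit_len - segment_len
    if added <= 2:
        return "low"
    if added <= 5:
        return "medium"
    return "high"
```

For the utterance "please navigate to main street now" with "main" as the segment to be confirmed, and a toy predicate that requires both "navigate" and "street", the span grows to "navigate to main street now".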
[0036] When the system determines that the noise level of the current user's voice input reaches the preset level, it no longer outputs all the speech content in the target confirmation unit, in order to reduce the comprehension burden of the confirmation interaction in a noisy environment. Instead, it extracts from the target confirmation unit the distinguishing speech components that differentiate the different correction results, and generates confirmation interaction information according to the rule of prioritizing distinguishing speech components and omitting non-distinguishing speech components.
[0037] For at least two candidate correction results corresponding to the target confirmation unit, the system compares the speech content in each candidate correction result and identifies the differing speech components. These differing speech components can manifest as different words, phrases, or semantic meanings. After identifying the dissimilar speech components among the candidate correction results, the system further determines the speech content that can characterize the differences between the candidate correction results and designates this speech content as the distinguishing speech component. The distinguishing speech component directly reflects the distinguishing relationship between different candidate correction results, enabling the user to quickly identify the key differences that need to be confirmed when receiving confirmation interaction information. Preferably, the distinguishing speech component can be obtained by performing word-by-word comparison, phrase alignment, or semantic slot alignment on multiple candidate correction results.
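A minimal word-level comparison might look like the sketch below. It treats every word that is not shared by all candidates as distinguishing, which ignores word position and semantic slots; it is a simplification of the word-by-word comparison mentioned above, and the function name is hypothetical.

```python
from typing import List, Sequence


def distinguishing_components(candidates: Sequence[str]) -> List[List[str]]:
    """For each candidate, return the words that are not shared by all candidates."""
    token_lists = [c.split() for c in candidates]
    # Words present in every candidate carry no distinguishing information.
    common = set(token_lists[0]).intersection(*map(set, token_lists[1:]))
    return [[t for t in tokens if t not in common] for tokens in token_lists]
```

For the candidates "turn on the living room light" and "turn on the dining room light", only "living" and "dining" remain, which is the difference the user needs to confirm.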
[0038] After identifying the distinguishing speech components, the system further filters the remaining speech content in the target confirmation unit. Adjacent speech content that defines the object, action, or semantic relationship to which the distinguishing speech components refer is retained as adjacent related speech components and included in the confirmation interaction information together with the distinguishing speech components. Speech content that does not affect the distinction between different candidate correction results is not included in the confirmation interaction information. The omitted content may include modifying speech components, repetitive speech components, and background context speech components. Modifying speech components are descriptive content that does not affect the distinction between candidate results; repetitive speech components are repeated expressions that are not directly related to the distinction between candidate results; and background context speech components are speech content that only supplements the overall context but does not participate in the current error correction judgment.
[0039] When outputting the confirmation interaction information, the system broadcasts it by voice in the order of distinguishing speech components first, followed by adjacent related speech components. Because the distinguishing speech components are presented first, users perceive the core differences between the candidate correction results immediately; the adjacent related speech components that follow then let them understand where those differences sit semantically and what they refer to. This output order helps users form a clear understanding of the differences between candidate results in a shorter time, thereby improving the accuracy and relevance of the feedback information.
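The filtering and ordering steps can be illustrated with tokens that already carry a role label. How the roles are assigned is outside this sketch; the role names, the function names, and the "as in" wording of the broadcast template are assumptions for the example.

```python
from typing import List, Sequence, Tuple


def organize_confirmation(
    tagged_tokens: Sequence[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Split a target confirmation unit into (distinguishing, adjacent_related).

    Tokens with any other role (modifier, repetition, background) are omitted.
    """
    distinguishing = [tok for tok, role in tagged_tokens if role == "distinguishing"]
    related = [tok for tok, role in tagged_tokens if role == "related"]
    return distinguishing, related


def render_broadcast(distinguishing: Sequence[str], related: Sequence[str]) -> str:
    # Distinguishing components are spoken first, then the adjacent context.
    return " ".join(distinguishing) + ", as in: " + " ".join(related)
```

With the unit "please open the left window window", where "left" is distinguishing and "open" and the first "window" are adjacent related components, the broadcast text becomes "left, as in: open window".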
[0040] Furthermore, when the noise level exceeds a preset level, it indicates a further increase in interference in the current environment, and the difficulty for users to understand longer confirmation messages also increases accordingly. In this case, the system limits the confirmation interaction information to candidate selection confirmation statements, enabling user feedback to directly correspond to one of the candidate correction results. The system can generate confirmation statements with clear selection relationships based on the current candidate correction results, presenting different candidate correction results to the user in a distinguishable manner, and guiding the user to directly provide feedback on the target result by selecting, confirming, or negating. Preferably, the candidate selection confirmation statements can present each candidate correction result in a sequential numbering, enumeration, or side-by-side display on the screen.
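A candidate selection confirmation statement with sequential numbering could be generated and parsed roughly as follows. The prompt wording and the small vocabulary of ordinals and digits are assumptions for illustration; a deployed system would rely on its own language understanding component.

```python
from typing import Optional, Sequence

# Hypothetical vocabulary: ordinals and digits only.
_SELECTION_WORDS = {
    "first": 1, "1": 1,
    "second": 2, "2": 2,
    "third": 3, "3": 3,
}


def candidate_selection_prompt(candidates: Sequence[str]) -> str:
    """Present the candidates as a numbered list in one short statement."""
    options = "; ".join(f"{i}: {c}" for i, c in enumerate(candidates, start=1))
    return f"Which one did you mean? {options}. Please say the number."


def parse_selection(feedback: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the selected candidate, or None if the reply names no single option."""
    hits = {_SELECTION_WORDS[w] for w in feedback.lower().split() if w in _SELECTION_WORDS}
    if len(hits) == 1:
        idx = hits.pop()
        if 1 <= idx <= len(candidates):
            return candidates[idx - 1]
    return None
```

Because each reply maps to a numbered option, the feedback corresponds directly to one candidate correction result without requiring the user to repeat the ambiguous words in a noisy environment.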
[0041] After outputting the confirmation interaction information for the current round and receiving user feedback, the system performs error correction direction determination on the user feedback information to determine whether the feedback information uniquely corresponds to a correction result at the current confirmation granularity. The purpose of error correction direction determination is to determine whether the user feedback has sufficient discriminative power to support the system in making explicit corrections to the current recognition ambiguity. Only when the user feedback information cannot uniquely correspond to a correction result will the system adjust the current confirmation granularity to the next higher level. The next higher level of confirmation granularity corresponds to the target confirmation unit that includes more speech context. Based on the adjusted confirmation granularity, the system regenerates the confirmation interaction information, expanding the speech range covered by the new round of confirmation interaction information beyond the speech range covered by the previous round of confirmation interaction information.
[0042] The system compares user feedback with multiple candidate correction results at the current confirmation granularity. This comparison can be based on feedback semantics, feedback keywords, the object the feedback refers to, and the matching relationship between the feedback and the candidate correction results. If the user feedback clearly points to one of the candidate correction results, the system determines the target correction result and corrects the current recognition result accordingly. Preferably, the system can calculate a matching score between the user feedback and each candidate correction result; when the difference between the highest matching score and the second-highest matching score is greater than a preset difference threshold, the system determines that the user feedback uniquely corresponds to the candidate correction result with the highest matching score; otherwise, the system determines that the user feedback cannot uniquely correspond to a correction result.
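The preferred scoring rule, comparing the gap between the two best matching scores against a threshold, can be written as a short check. The matching scores are taken as given; the function name and the 0.2 margin are illustrative assumptions.

```python
from typing import Dict, Optional


def unique_match(scores: Dict[str, float], min_margin: float = 0.2) -> Optional[str]:
    """Return the best-scoring candidate if it leads the runner-up by more
    than min_margin; otherwise return None (no unique correspondence)."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best, top_score), (_, second_score) = ranked[0], ranked[1]
    return best if top_score - second_score > min_margin else None
```

A `None` result is the signal for the system to adjust the confirmation granularity to the next higher level.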
[0043] If user feedback only indicates a confirming attitude but not a distinguishing relationship between candidate correction results, then the user feedback cannot uniquely correspond to a single correction result. Confirming attitudes include positive, negative, and uncertain feedback. Positive feedback indicates that the user agrees with the current confirmed content, negative feedback indicates that the user disagrees with the current confirmed content, and uncertain feedback indicates that the user cannot judge or has not formed a clear choice. When multiple candidate correction results exist, the aforementioned confirming attitudes alone are usually insufficient to establish a unique correspondence between the feedback and a particular candidate correction result. Therefore, the system considers this type of feedback as failing to complete the correction target for the current round.
[0044] After determining that the current user feedback information cannot uniquely correspond to a correction result, the system expands the speech context centered on the target confirmation unit corresponding to the current confirmation granularity. During the expansion process, the system prioritizes expanding the speech context towards the side with speech components that differ from the candidate correction result. One side can represent the adjacent context direction that is closer to the action, object, or modifier relationship corresponding to the differing speech component; when the differing speech component mainly affects the recognition of action words, the system prioritizes expanding the context adjacent to action words; when the differing speech component mainly affects the recognition of object words, the system prioritizes expanding the context adjacent to object words; when both sides of expansion can enhance the distinguishability, a two-sided expansion can be adopted. In other words, the system introduces adjacent speech content along the direction that enhances the distinguishability of candidate results to improve the recognizability between candidate correction results in the next round of confirmation interaction. As the speech context is gradually expanded, the system continuously judges whether the expanded target confirmation unit can establish a correspondence between the user feedback information and the target correction result. When the expanded target confirmation unit can carry enough semantic information to distinguish the candidate correction results, the system generates a new round of confirmation interaction information based on the expanded target confirmation unit and enters the next round of confirmation interaction.
[0045] The system can analyze whether user feedback implicitly points to a specific candidate correction result by combining the presentation order of each candidate correction result in the current confirmation interaction information. When the user feedback is affirmative or negative, the system can determine whether the feedback actually corresponds to a specific candidate correction result by combining the broadcast order, display order, or primary / secondary arrangement of the current candidate results. Preferably, when the confirmation interaction information is output by listing candidate correction results in sequence, affirmative or negative feedback can be preferentially interpreted as targeting the currently focused candidate correction result or the last broadcast candidate correction result. If, after secondary pointing analysis, the user feedback information still cannot establish a unique correspondence with any candidate correction result, the system determines the user feedback information as invalid feedback and triggers the next round of confirmation interaction. The next round of confirmation interaction can be conducted at the adjusted confirmation granularity, so as to improve the mapping clarity between user feedback and target correction result by introducing more voice context and clearer difference information.
[0046] Example 1:
[0047] A voice interaction system is deployed in a terminal device or voice control device. The system receives a control voice input from the user and calls the speech recognition module to decode it, obtaining an initial recognition result and multiple candidate correction results. Taking a device control voice as an example, the speech recognition module outputs the initial recognition result "Turn on the living room main light and turn it up by two levels," together with two candidate correction results: "Turn on the living room main light and turn it up by two levels" and "Turn on the living room main light and brighten it by two levels." The system then performs noise assessment, ambiguity localization, target confirmation unit determination, and confirmation interaction information generation based on the initial recognition result. For the first round of confirmation interaction, the system determines the initial confirmation granularity based on the noise level and the ambiguity level, and organizes the confirmation content for the current round at this initial confirmation granularity.
[0048] The system extracts noise features from the user's voice input. These noise features include at least the normalized background noise intensity, the signal-to-noise ratio, the proportion of non-speech interference, and the speech segment instability. To unify multiple noise features within the same judgment framework, this embodiment constructs a comprehensive noise score N as the basis for noise level determination, expressed as:
$$N = w_1 B + w_2 \cdot \frac{1}{\mathrm{SNR}} + w_3 P + w_4 S$$
[0049] Where N is the overall noise score, B is the normalized value of the background noise intensity, SNR is the signal-to-noise ratio, P is the proportion of non-speech interference, S is the speech segment instability, and $w_1$, $w_2$, $w_3$, $w_4$ are weight coefficients satisfying $w_1 + w_2 + w_3 + w_4 = 1$. The SNR enters the formula as a reciprocal term because a higher signal-to-noise ratio indicates clearer speech, whereas the overall noise score should increase with increasing noise; the reciprocal term keeps this component aligned with the trend of the overall score. The weight coefficients can be pre-configured based on statistics of voice interaction samples. For example, background noise, signal-to-noise ratio, interference ratio, and instability can be assigned weights of 0.30, 0.30, 0.20, and 0.20, respectively.
[0050] In a set of actual test samples, the normalized value of the background noise intensity collected by the system is ..., the signal-to-noise ratio (SNR) is 5, the proportion of non-speech interference is ..., and the speech segment instability is .... Substituting this set of data into the above formula gives an overall noise score of N = 0.336.
[0051] The system pre-classifies the overall noise score into three levels: N < 0.20 corresponds to the low noise level, 0.20 ≤ N < 0.35 corresponds to the medium noise level, and N ≥ 0.35 corresponds to the high noise level. This voice input therefore corresponds to the medium noise level, which reaches the preset level, so subsequent confirmation interaction information is generated in the difference-priority output mode.
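The scoring and grading described in [0048] to [0051] can be sketched as follows. This is a minimal Python illustration, not the applicant's implementation: the function and parameter names are hypothetical, the score is assumed to be a weighted sum in which the signal-to-noise ratio enters as a reciprocal, and the default weights are the example values given in [0049].

```python
# Example weights from [0049]: background noise, reciprocal SNR,
# non-speech interference ratio, speech segment instability.
WEIGHTS = (0.30, 0.30, 0.20, 0.20)


def noise_score(background, snr, interference, instability, weights=WEIGHTS):
    """Weighted sum of four noise features (assumed form of the score N).

    The SNR is used as a reciprocal so that the score rises as noise rises.
    """
    if snr <= 0:
        raise ValueError("snr must be positive")
    w_b, w_s, w_i, w_u = weights
    return (w_b * background + w_s * (1.0 / snr)
            + w_i * interference + w_u * instability)


def noise_level(n):
    """Map the overall noise score to the three levels of [0051]."""
    if n < 0.20:
        return "low"
    if n < 0.35:
        return "medium"
    return "high"


# The two scores reported for Figure 2.
print(noise_level(0.336))  # medium
print(noise_level(0.390))  # high
```

The individual feature values of the two samples are not available in this text, so the example grades only the reported totals.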
[0052] As shown in Figure 2, the overall noise scores of Sample 1 and Sample 2 can each be decomposed into four weighted contributions: background noise, reciprocal of the signal-to-noise ratio, interference ratio, and instability. The overall noise score of Sample 1 is N = 0.336, which falls into the medium noise range, while the overall noise score of Sample 2 is N = 0.390, which falls into the high noise range. This supports selecting different confirmation output methods for the two samples.
[0053] After obtaining the noise level, the system combines the word-level confidence of the initial recognition result, the distribution of differences among the candidate correction results, and the semantic parsing results to locate the speech segment with recognition ambiguity. For the above sample, the system detects that the word-level confidence at the position where "turn up" and "brighten" compete is lower than that of the other words in the sentence, and that the two words correspond to different control actions. Therefore, the position of "turn up" is identified as the speech segment to be confirmed. Although this segment forms a word-level expression by itself, it cannot independently determine the complete semantic content to be executed, so it needs to be expanded to both sides.
[0054] The system centers on the speech segment to be confirmed and progressively introduces adjacent speech content on both sides along the speech sequence of the initial recognition result, performing a semantic integrity judgment after each expansion. The semantic integrity judgment is based on whether the expanded speech unit contains at least both an action object and action content, or whether it can be mapped to a preset control template. In this sample, the speech segment to be confirmed, "turn up," only represents an action and is insufficient to uniquely correspond to an execution object. Expanding forward to "main light and turn up" still leaves interference from the connecting word and does not yet form a clear control semantic. Continuing to expand forward and adding the subsequent quantity description yields "living room main light and turn up two levels." The system can now parse the control object as "living room main light," the control action as "turn up," and the control quantity as "two levels," so it determines that the expanded speech unit corresponds to a semantic content to be executed and identifies it as the target confirmation unit.
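The outward expansion with a semantic integrity judgment after each step can be sketched as below. This is an illustrative Python sketch under stated assumptions: the tokenization, the small `OBJECTS` and `ACTIONS` vocabularies, and the rule "one token on each side per step" are all hypothetical, and the integrity check only tests for the presence of an object and an action rather than a full control template.

```python
# Hypothetical vocabularies standing in for a real semantic parser.
OBJECTS = {"living room main light"}
ACTIONS = {"turn up", "brighten"}


def has_object_and_action(unit):
    """Simplified semantic integrity judgment: object plus action present."""
    return any(t in OBJECTS for t in unit) and any(t in ACTIONS for t in unit)


def expand_to_target_unit(tokens, start, end, is_complete=has_object_and_action):
    """Grow tokens[start:end] outward until the unit is semantically complete.

    Returns the target confirmation unit, or None if even the whole
    utterance does not pass the integrity judgment.
    """
    while not is_complete(tokens[start:end]):
        if start == 0 and end == len(tokens):
            return None
        start = max(0, start - 1)       # one token to the left, if any
        end = min(len(tokens), end + 1)  # one token to the right, if any
    return tokens[start:end]


tokens = ["turn on", "living room main light", "and", "turn up", "two levels"]
# The segment to be confirmed is "turn up" at index 3.
print(expand_to_target_unit(tokens, 3, 4))
# ['living room main light', 'and', 'turn up', 'two levels']
```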
[0055] To quantify the dependence of the speech segment to be confirmed on its context, this embodiment uses the ambiguity dependency degree A as the criterion for determining the ambiguity level, and its expression is as follows:
[0056] Where A represents the ambiguity dependency degree, which is determined from the length of the speech unit required to form explicit semantic content to be executed and the length of the voice segment to be confirmed. Length can be counted in characters, words, or syllables; this embodiment counts words. The rationale is that if the semantics can be determined from a shorter context, the value of A is small, indicating a lower ambiguity level; if more context must be introduced to determine the semantics, the value of A increases, indicating a higher ambiguity level. In this sample, the number of words in the voice segment to be confirmed, "turn up," is ..., and the number of words in the target confirmation unit, "living room main light and turn up two levels," is ..., which gives A = 4.
[0057] The system presets three ranges of A that correspond to the low, medium, and high ambiguity levels, respectively. This sample corresponds to the high ambiguity level.
[0058] As shown in Figure 3, after the preceding and following context is gradually introduced, the voice segment to be confirmed, "turn up," is expanded from a segment containing only action semantics into a target confirmation unit containing the control object, action content, and control amount. The corresponding ambiguity dependency degree rises from the starting segment level to A = 4, which falls in the high ambiguity level.
[0059] The system determines the initial confirmation granularity according to the noise level and the ambiguity level. Since the noise level in this sample is medium and the ambiguity level is high, the system does not adopt segment-level confirmation of a single word but adopts semantic-unit-level confirmation of the target confirmation unit, so that the confirmation information both covers the difference between the candidate actions and contains the necessary object qualification information. For the candidate correction results at this initial confirmation granularity, the system obtains "living room main light and turn up two levels" and "living room main light and brighten two levels," and enters the stage of screening distinguishing speech components. Whether the confirmation granularity needs further adjustment in subsequent interaction rounds depends on whether the user feedback information can uniquely correspond to a correction result. This embodiment only illustrates how the first-round confirmation content is generated.
[0060] The system performs a word-by-word comparison of the multiple candidate correction results corresponding to the target confirmation unit and extracts the speech components that distinguish the candidate correction results from each other. For the above two candidate correction results, the comparison shows that "living room main light," "and," and "two levels" are identical in both candidates; the only difference is between "turn up" and "brighten," and this difference directly determines the action to be executed. Therefore, the system determines "turn up" and "brighten" as the distinguishing speech components. The system then retrieves the adjacent associated speech components around the distinguishing speech components and determines that "living room main light" is the action object qualification component and "two levels" is the action amount qualification component. These two, together with the distinguishing speech components, constitute the effective confirmation content.
[0061] During screening, the system retains the distinguishing speech components and their adjacent associated speech components, and deletes modifying speech components, repetitive speech components, and background context speech components that do not affect the distinction between correction results. In the target confirmation unit of this sample, "and" only serves a connecting function and does not affect the distinction between "turn up" and "brighten," so it is deleted. After screening, one candidate confirmation content is organized as "living room main light, turn up, two levels," and the other as "living room main light, brighten, two levels." If the initial utterance also contains a wake word prefix, polite expressions, or additional explanations unrelated to the current error correction, such content is not included in the current confirmation interaction information.
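The word-by-word comparison and screening of [0060] and [0061] can be illustrated with a short sketch. It assumes, for simplicity, that the candidates are already tokenized into lists of equal length so that they can be compared position by position; the function names and the `CONNECTORS` set are hypothetical.

```python
# Hypothetical set of components that never help distinguish candidates.
CONNECTORS = {"and"}


def split_components(candidates):
    """Compare equally long token lists position by position.

    Returns (shared, differing): shared holds (position, token) pairs that
    are identical in every candidate; differing holds (position, tokens)
    pairs where the candidates disagree.
    """
    shared, differing = [], []
    for position, column in enumerate(zip(*candidates)):
        if len(set(column)) == 1:
            shared.append((position, column[0]))
        else:
            differing.append((position, column))
    return shared, differing


def confirmation_content(candidate, drop=CONNECTORS):
    """Keep distinguishing and qualifying components, drop pure connectors."""
    return [token for token in candidate if token not in drop]


first = ["living room main light", "and", "turn up", "two levels"]
second = ["living room main light", "and", "brighten", "two levels"]

shared, differing = split_components([first, second])
print(differing)                     # [(2, ('turn up', 'brighten'))]
print(confirmation_content(first))   # ['living room main light', 'turn up', 'two levels']
print(confirmation_content(second))  # ['living room main light', 'brighten', 'two levels']
```

Candidates of unequal length would need an alignment step first, which this sketch does not attempt.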
[0062] As shown in Figure 4, during candidate confirmation content screening the system retains "living room main light" as the action object qualification component, retains "turn up" or "brighten" as the distinguishing speech component, and retains "two levels" as the action amount qualification component, while deleting "and," which only serves as a connector, to form effective confirmation content better suited to confirmation interaction broadcast.
[0063] During the confirmation interaction information generation phase, the system organizes the broadcast content so that the distinguishing speech components come first, followed by the adjacent associated speech components. In medium-noise scenarios, the system first outputs the distinguishing speech components "turn up" and "brighten," allowing the user to perceive the core difference between the candidates, and then supplements them with "living room main light" and "two levels" to define the object and range to which the difference applies. The resulting confirmation interaction information can be expressed as "Please confirm, do you want to turn up the living room main light by two levels, or brighten the living room main light by two levels?" This organization places the difference at the forefront of the user's perception, so that in a noisy environment recognition of the key difference is not weakened by first hearing longer shared content.
[0064] When the overall noise score rises further and exceeds a preset level, the system further limits the confirmation interaction information to candidate-selection confirmation statements to reduce the user's burden of parsing long statements. For example, in another set of actually collected samples, the normalized value of the background noise intensity is ..., the signal-to-noise ratio (SNR) is 4, the proportion of non-speech interference is ..., and the speech segment instability is .... Substituting this set of data into the aforementioned comprehensive noise scoring formula gives N = 0.390.
[0065] Because N = 0.390 ≥ 0.35, the system determines that the voice input corresponds to the high noise level. Even if a larger target confirmation unit is subsequently introduced internally for error correction, the output confirmation interaction information can still present the expanded target confirmation unit as a candidate selection or in compressed form. For the current sample, the system no longer broadcasts the complete phrase but compresses the two candidate correction results into a directly selectable confirmation statement, such as "Please confirm, the first option is to turn up by two levels, and the second option is to brighten by two levels." If the terminal has display capabilities, it can also display the candidate option numbers and text simultaneously, allowing the user to confirm with a brief selection. Because the confirmation interaction information is constructed only around the target confirmation unit corresponding to the current confirmation granularity, it does not introduce other recognition content unrelated to the current error correction.
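The two output styles, difference-priority broadcast at the medium noise level and candidate selection at the high noise level, can be sketched as a single prompt builder. This is only an illustration: the dictionary keys `diff`, `object`, and `amount`, the exact wording of the prompts, and the choice to reuse the difference-priority form for any level other than "high" are assumptions made for this example.

```python
def difference_first(candidate):
    """Difference-priority order: distinguishing component first."""
    return ", ".join([candidate["diff"], candidate["object"], candidate["amount"]])


def numbered_option(number, candidate):
    """Compressed candidate-selection form: action and amount only."""
    return f"option {number}: {candidate['diff']} by {candidate['amount']}"


def build_prompt(level, candidates):
    """Build confirmation text for the given noise level."""
    if level == "high":
        body = "; ".join(
            numbered_option(n, c) for n, c in enumerate(candidates, start=1)
        )
    else:
        # Assumption: every other level uses the difference-priority form.
        body = "; or ".join(difference_first(c) for c in candidates)
    return f"Please confirm: {body}?"


candidates = [
    {"diff": "turn up", "object": "living room main light", "amount": "two levels"},
    {"diff": "brighten", "object": "living room main light", "amount": "two levels"},
]
print(build_prompt("medium", candidates))
print(build_prompt("high", candidates))
```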
[0066] As further shown in Figure 4, under high noise levels the confirmation interaction information is compressed from the difference-priority broadcast used in medium-noise scenarios into a candidate selection confirmation statement, allowing the user to confirm between the two candidate actions with a short reply.
[0067] Example 2:
[0068] A voice interaction system receives user-input control voice and calls a speech recognition module to decode the voice segment, obtaining an initial recognition result and multiple candidate correction results. Taking a device control voice as an example, the initial recognition result might be "Turn off bedroom air conditioner silent mode." The system simultaneously generates two candidate correction results: "Turn off bedroom air conditioner silent mode" and "Turn off bedroom air conditioner sleep mode." Since "silent mode" and "sleep mode" are acoustically similar and correspond to different control functions, the system identifies the location of "silent mode" as the voice segment to be confirmed. In the first round of confirmation interaction, based on the current confirmation granularity, it forms the first-round target confirmation unit "air conditioner silent mode" and its corresponding candidate correction results. The first-round confirmation granularity can be pre-determined by noise level and ambiguity level. Whether to adjust the confirmation granularity to the next higher level in subsequent interaction rounds depends on whether the user feedback information can uniquely correspond to a correction result.
[0069] The system outputs a first-round confirmation interaction message to guide the user to provide feedback on the current recognition ambiguity. This first-round confirmation interaction message can be "Please confirm, is it silent mode or sleep mode?" In this round of interaction, the user's feedback is "Yes." After receiving this feedback, the system does not directly update the recognition result based on it. Instead, it first performs an error correction determination on the feedback to determine whether the feedback uniquely corresponds to a corrected result at the current confirmation granularity.
[0070] To quantify the indicative power of user feedback for each candidate correction result, this embodiment uses a feedback matching score $M_i$ as the basis for judgment, expressed as:
$$M_i = \alpha K_i + \beta R_i + \gamma O_i$$
[0071] Where $M_i$ is the matching score of the user feedback for the i-th candidate correction result, i is the index of the candidate correction result, $K_i$ is the matching degree between the feedback keywords and the i-th candidate correction result, $R_i$ is the matching degree between the feedback semantic relationship and the i-th candidate correction result, $O_i$ is the degree of correspondence between the object the feedback refers to and the i-th candidate correction result, and $\alpha$, $\beta$, $\gamma$ are weight coefficients that satisfy .... This formula is used because whether user feedback can uniquely point to a candidate result depends not only on the presence of explicit keywords, but also on the semantic relationship between the feedback and the current confirmation statement, and on whether the feedback contains a locatable referent. When user feedback is only a confirming attitude such as "yes," "no," or "uncertain," keyword information, semantic relationship information, and object information are each insufficient on their own, so all three factors need to be evaluated together.
[0072] In this sample, the first candidate correction result is "silent mode," the second candidate correction result is "sleep mode," and the user feedback is "yes." The system presets the weight coefficients as .... Since "yes" only expresses an affirmative attitude and contains nothing that directly distinguishes the candidate results, the keyword matching degree of this feedback is low for both candidates and is taken as ... and ..., respectively. The semantic relationship matching degrees, which both reflect an affirmative response to the currently confirmed content, are taken as ... and ..., respectively. Because the feedback lacks a clear object reference, the object correspondence degrees are taken as ... and ..., respectively. Substituting the above data gives feedback matching scores of 0.4025 for the first candidate and 0.399 for the second.
[0073] This shows that the matching scores of the two candidate correction results are very close, indicating that although the feedback expressed a positive attitude, it did not form a clear distinction for a specific candidate correction result.
[0074] As shown in Figure 5, the first and second candidate correction results have the same weighted contributions for the keyword and object items. The main difference lies in the weighted contribution of the semantic relationship item, which is 0.2975 and 0.294 respectively, resulting in feedback matching scores of only 0.4025 and 0.399 respectively.
[0075] To further determine whether the feedback information can uniquely correspond to a correction result, the system calculates the unique corresponding difference, expressed as:
$$\Delta = M_{\max} - M_{\mathrm{second}}$$
[0076] Where $\Delta$ is the unique corresponding difference, $M_{\max}$ is the highest matching score among the multiple candidate correction results, and $M_{\mathrm{second}}$ is the second-highest matching score. The system presets a unique corresponding judgment threshold $T$: when $\Delta$ reaches $T$, the user feedback is determined to uniquely correspond to the target correction result; when $\Delta$ is below $T$, the user feedback is determined not to uniquely correspond to a single correction result. The engineering meaning of the threshold $T$ is that the system only performs a correction when the score difference between the best and second-best candidates is large enough to distinguish them stably, which avoids misjudging ambiguous feedback as a definite choice. In this sample, $M_{\max} = 0.4025$ and $M_{\mathrm{second}} = 0.399$; with the threshold set to $T = \ldots$, this gives:
$$\Delta = 0.4025 - 0.399 = 0.0035$$
[0077] Because the unique corresponding difference is below the threshold, the system determines that the feedback information cannot uniquely correspond to a single correction result. Therefore, when user feedback only expresses a confirming attitude without containing distinguishing content, it cannot be used to directly lock in a candidate correction result, even if it is positive feedback. The same applies to negative and uncertain feedback: they only indicate rejection of, or inability to judge, the current confirmation content, and cannot establish a one-to-one correspondence with a specific candidate.
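The uniqueness judgment of [0075] to [0077] amounts to a margin test between the two best matching scores. The sketch below uses the matching scores reported for Figure 5; the function name and the threshold value 0.05 are hypothetical, since the threshold used in this embodiment is not available in this text, and "reaches the threshold" is read here as greater than or equal to.

```python
def unique_target(scores, threshold):
    """Return the index of the uniquely matched candidate, or None.

    A candidate is accepted only if its score exceeds the runner-up by at
    least `threshold` (assumed reading of "reaches the threshold").
    """
    if len(scores) < 2:
        raise ValueError("at least two candidate scores are required")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, second = ranked[0], ranked[1]
    margin = scores[best] - scores[second]
    return best if margin >= threshold else None


# Matching scores of the two candidates for the feedback "yes".
scores = [0.4025, 0.399]
print(round(max(scores) - min(scores), 4))     # 0.0035
print(unique_target(scores, threshold=0.05))   # None, so the granularity is raised
```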
[0078] As shown in Figure 6, the actual unique corresponding difference at the first-round confirmation granularity is only 0.0035, which is significantly lower than the unique corresponding judgment threshold. Therefore, the system does not directly perform the correction, but instead adjusts the confirmation granularity to the next higher level to expand the target confirmation unit.
[0079] The system only adjusts the current confirmation granularity to the next higher level when user feedback cannot uniquely correspond to a correction result. Compared to the smaller confirmation scope centered on "air conditioner silent mode" in the first round, the next higher level of confirmation granularity corresponds to a larger target confirmation unit. The system expands the speech context around the target confirmation unit from the previous round, prioritizing expansion towards speech components that differ from the candidate correction result. Since the speech components that differ between "silent mode" and "sleep mode" are both located at the mode object position, this difference requires combining action constraint words and device status words for clearer differentiation. Therefore, the system prioritizes introducing control semantics related to mode switching, resulting in the expanded target confirmation unit being "turn off bedroom air conditioner silent mode" and "turn off bedroom air conditioner sleep mode." Compared to simply comparing "silent mode" and "sleep mode," the expanded confirmation unit simultaneously includes the action word "turn off" and the control object "bedroom air conditioner," thereby enhancing the mappability between user feedback and candidate correction results. In high-noise scenarios, even if the system expands the target confirmation unit because the feedback cannot uniquely locate the correction result, when outputting confirmation interaction information, the expanded target confirmation unit can still be expressed in a candidate selection or compressed manner to reduce the user's understanding burden.
[0080] In the second round of interaction, the system generates new confirmation interaction information based on the expanded target confirmation unit, such as "Please confirm, do you want to turn off the bedroom air conditioner's silent mode or the bedroom air conditioner's sleep mode?" The user responds with "the first one" in this round. Because this feedback contains sequence indication information, the system can directly map it onto the current order of the candidate correction results and determine that it uniquely points to the first candidate correction result, thereby correcting the recognition result to "turn off bedroom air conditioner silent mode." In this case, the system does not need to raise the confirmation granularity again, which avoids unnecessary follow-up questions.
[0081] In another set of interaction samples, the second-round user feedback is still "yes." After determining that this feedback only represents a confirming attitude, the system triggers secondary pointing analysis. Secondary pointing analysis means that, when the original feedback content is insufficient to make a distinction, the order information of the current candidate correction results is added to the feedback judgment in order to determine whether the user is actually responding to the candidate option at a particular position. To this end, this embodiment constructs a secondary pointing score $Q_i$, expressed as:
$$Q_i = \lambda_1 M_i + \lambda_2 C_i$$
[0082] Where $Q_i$ is the secondary pointing score of the i-th candidate correction result after the order information is combined, i is the index of the candidate correction result, $M_i$ is the original matching score of the feedback for the i-th candidate correction result, $C_i$ is the sequence association value of the i-th candidate correction result, and $\lambda_1$ and $\lambda_2$ are weight coefficients that satisfy .... The sequence association value $C_i$ is determined by the order in which the candidate correction results are broadcast, displayed, or focused within the current confirmation statement; a larger value indicates a stronger sequential relationship between the candidate correction result and the current feedback. This formula is used because when a user simply says "yes" or "no," the language itself is insufficient to distinguish the candidate options. If the system confirms that the current broadcast lists the candidate options in sequence, it can further determine whether the confirming attitude corresponds to the currently focused candidate option or to the last-broadcast candidate option.
[0083] In this sample set, the system uses sequential broadcast, first broadcasting the first candidate correction result and then the second, and sets the last-broadcast candidate as the current voice focus. The user feedback is still "yes," so the original matching scores can be taken as .... The system sets the weight coefficients to ... and, based on the broadcast order, assigns a sequence association value of ... to the first candidate and ... to the second candidate. Substituting this set of data gives secondary pointing scores of 0.347 for the first candidate and 0.541 for the second.
[0084] With the secondary pointing difference threshold set to 0.10, the difference between the two secondary pointing scores is 0.541 − 0.347 = 0.194.
[0085] Since the difference reached the threshold, the system determined that the user's feedback, after being supplemented with sequence information, could uniquely correspond to the second candidate correction result, and corrected the recognition result to "turn off bedroom air conditioner sleep mode". It can be seen that secondary pointing parsing is not applicable to all feedback, but is only triggered when the feedback content itself is insufficient to distinguish the candidate options, but the order of the candidate options can provide additional positioning information.
[0086] As shown in Figure 7, in the secondary pointing analysis the secondary pointing scores of the first candidate and the second candidate are 0.347 and 0.541 respectively, with a difference of 0.194, which reaches the secondary pointing difference threshold of 0.10. Therefore, the system can uniquely map this round's feedback to the second candidate correction result.
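Secondary pointing analysis can be sketched as a weighted combination of the original matching score and a sequence association value, followed by the same kind of margin test. In the sketch below the function names, the equal weights of 0.5, and the sequence association values 0.2 and 0.8 are hypothetical placeholders chosen for illustration; they are not the values behind Figure 7, which are not available in this text. Only the threshold of 0.10 comes from the description.

```python
def secondary_pointing_scores(match_scores, order_values, w_match, w_order):
    """Combine each original matching score with its sequence association value."""
    return [w_match * m + w_order * c for m, c in zip(match_scores, order_values)]


def resolve_by_order(match_scores, order_values,
                     w_match=0.5, w_order=0.5, threshold=0.10):
    """Return the index of the uniquely indicated candidate, or None."""
    scores = secondary_pointing_scores(match_scores, order_values, w_match, w_order)
    if len(scores) < 2:
        raise ValueError("at least two candidates are required")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]]
    return ranked[0] if margin >= threshold else None


# Feedback "yes": nearly tied matching scores, last-broadcast candidate in focus.
print(resolve_by_order([0.4025, 0.399], [0.2, 0.8]))  # 1 (second candidate)
# Without usable order information the feedback stays ambiguous.
print(resolve_by_order([0.4025, 0.399], [0.5, 0.5]))  # None
```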
[0087] If, after secondary pointing analysis, a unique correspondence cannot be established with any candidate correction result, the system determines the feedback as invalid and triggers the next round of confirmation interaction. For example, when the user feedback is "uncertain," and the current feedback matching score and sequential association value are insufficient to form a clear bias, the system does not use this feedback to correct the identification result. Instead, it continues to expand the context centered on the previous round's target confirmation unit, or outputs a re-expressed prompt after reaching a preset round threshold. Preset termination conditions may include reaching a set number of confirmation rounds, multiple consecutive rounds of invalid feedback, or the system determining that the current interaction cannot form a stable mapping relationship.
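The overall multi-round flow, in which invalid feedback raises the confirmation granularity until the ambiguity is resolved or a termination condition is met, can be sketched as a small loop. The callbacks `ask` and `resolve` and the round limit of 3 are hypothetical; only the "set number of confirmation rounds" termination condition is modeled, and granularity is represented as a simple integer level.

```python
def run_confirmation(ask, resolve, max_rounds=3):
    """Run confirmation rounds with step-by-step granularity increases.

    ask(granularity) outputs the confirmation for that granularity and
    returns the user's feedback; resolve(feedback, granularity) returns the
    index of the uniquely indicated candidate, or None for invalid feedback.
    Returns (candidate_index, granularity); candidate_index is None when the
    round limit is reached without a unique correspondence.
    """
    granularity = 0
    for _ in range(max_rounds):
        feedback = ask(granularity)
        target = resolve(feedback, granularity)
        if target is not None:
            return target, granularity
        granularity += 1  # invalid feedback: widen the target confirmation unit
    return None, granularity


# Simulated dialogue: "yes" is ambiguous, "first" points at candidate 0.
replies = iter(["yes", "first"])
result = run_confirmation(
    ask=lambda granularity: next(replies),
    resolve=lambda feedback, granularity: {"first": 0, "second": 1}.get(feedback),
)
print(result)  # (0, 1)
```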
Claims
1. A voice error correction and confirmation interaction method based on noise classification, characterized in that it comprises: acquiring user voice input and performing speech recognition on the user voice input to obtain an initial recognition result; determining a noise level based on noise characteristics of the user voice input; identifying, based on the initial recognition result, a speech segment to be confirmed that has recognition ambiguity; determining an ambiguity level of the speech segment to be confirmed; determining a confirmation granularity based on the noise level and the ambiguity level, and outputting confirmation interaction information; and receiving user feedback information and correcting the initial recognition result based on the user feedback information, wherein, if the corrected recognition result still has recognition ambiguity and the user feedback information cannot uniquely correspond to a correction result, the confirmation granularity is adjusted to the next higher level of confirmation granularity, and confirmation interaction information is re-output and user feedback information is received again until the recognition ambiguity is eliminated or a preset termination condition is reached.
2. The voice error correction and confirmation interaction method based on noise classification according to claim 1, characterized in that, centered on the speech segment to be confirmed, expansion proceeds outward within the speech sequence of the initial recognition result until the expanded speech unit corresponds to a semantic content to be executed; the speech unit corresponding to the semantic content to be executed is determined as the target confirmation unit, and the ambiguity level of the speech segment to be confirmed is determined according to the composition range of the target confirmation unit.
3. The voice error correction and confirmation interaction method based on noise classification according to claim 2, characterized in that, when the noise level reaches a preset level, distinguishing speech components used to differentiate between different correction results are extracted from the target confirmation unit, and confirmation interaction information is generated according to the rule of prioritizing distinguishing speech components and omitting non-distinguishing speech components; wherein the confirmation interaction information includes only the target confirmation unit corresponding to the current confirmation granularity, and does not include other speech content in the initial recognition result that is unrelated to the current error correction.
4. The voice error correction and confirmation interaction method based on noise classification according to claim 3, characterized in that, for at least two candidate correction results corresponding to the target confirmation unit, the dissimilar speech components in each candidate correction result are compared, and the speech components that can distinguish the candidate correction results from one another are determined as the distinguishing speech components.
5. The voice error correction and confirmation interaction method based on noise classification according to claim 3, characterized in that the distinguishing speech components are retained, as are the adjacent related speech components used to define the directional relationship of the distinguishing speech components; and the modifying speech components, repetitive speech components, and background context speech components in the target confirmation unit that do not affect the distinction between different correction results are deleted.
6. The voice error correction and confirmation interaction method based on noise classification according to claim 5, characterized in that the voice broadcast is performed in the order of first outputting the distinguishing speech components and then outputting the adjacent related speech components; and when the noise level is higher than the preset level, the confirmation interaction information is limited to a candidate selection confirmation statement, so that the user feedback information directly corresponds to one of the candidate correction results.
7. The voice error correction and confirmation interaction method based on noise classification according to claim 1, characterized in that error correction is performed based on the user feedback information; only when the user feedback information cannot uniquely correspond to a correction result is the current confirmation granularity adjusted to the next higher level of confirmation granularity; the next higher level of confirmation granularity corresponds to a target confirmation unit that includes more speech context; and confirmation interaction information is regenerated based on the adjusted confirmation granularity, so that the speech range covered by the new round of confirmation interaction information is expanded beyond the speech range covered by the previous round of confirmation interaction information.
8. The voice error correction and confirmation interaction method based on noise classification according to claim 7, characterized in that the user feedback information is compared with multiple candidate correction results at the current confirmation granularity; when the user feedback information can represent only a confirmation attitude but cannot represent the distinguishing relationship between the candidate correction results, it is determined that the user feedback information cannot uniquely correspond to a correction result; wherein the confirmation attitude includes positive feedback, negative feedback, and uncertain feedback.
9. The voice error correction and confirmation interaction method based on noise classification according to claim 8, characterized in that, centered on the target confirmation unit corresponding to the current confirmation granularity, the speech context is expanded on the side where the candidate correction results have differing speech components, until the expanded target confirmation unit can establish the correspondence between the user feedback information and the target correction result; and a new round of confirmation interaction information is generated based on the expanded target confirmation unit.
10. The voice error correction and confirmation interaction method based on noise classification according to claim 8, characterized in that, when the user feedback information is determined to represent only a confirmation attitude and not a distinguishing relationship between candidate correction results, a secondary pointing analysis is performed on the confirmation attitude by incorporating the order information of each candidate correction result under the current confirmation granularity; and if the user feedback information still cannot establish a unique correspondence with any candidate correction result, the user feedback information is determined to be invalid feedback, and the next round of confirmation interaction is triggered.
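The selection of distinguishing speech components recited in claims 4 to 6 can be illustrated informally as follows. This sketch does not form part of the claims and is only one possible reading: it assumes that each candidate correction result is already split into speech components (here plain word tokens), treats a component as distinguishing when it is not shared by all candidates, and treats a shared component as adjacent related when it sits directly next to a distinguishing one. The function name `compact_units` and the example tokens are hypothetical.

```python
from typing import List


def compact_units(candidates: List[List[str]]) -> List[List[str]]:
    """Return, per candidate, distinguishing components followed by
    the adjacent related components; everything else is omitted."""
    # Components present in every candidate cannot tell candidates apart.
    shared = set(candidates[0]).intersection(*(set(c) for c in candidates[1:]))

    units = []
    for cand in candidates:
        # Distinguishing components: not shared by all candidates.
        distinct = [tok for tok in cand if tok not in shared]

        # Adjacent related components: shared tokens that directly
        # neighbour a distinguishing token in this candidate.
        adjacent = [
            tok
            for i, tok in enumerate(cand)
            if tok in shared
            and (
                (i > 0 and cand[i - 1] not in shared)
                or (i + 1 < len(cand) and cand[i + 1] not in shared)
            )
        ]

        # Output order: distinguishing components first, then adjacent ones.
        units.append(distinct + adjacent)
    return units


if __name__ == "__main__":
    options = compact_units(
        [["call", "Zhang", "Wei"], ["call", "Zhang", "Hui"]]
    )
    print(options)
```

For the two example candidates, the shared component "call" is dropped as background context, while "Zhang" is kept after the distinguishing component because it is directly adjacent to it.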