A hearing aid speech recognition enhancement method for noisy scenes

A two-syllable test and pure tone audiometry are performed using the hearing aid's matching testing terminal to generate an auditory assessment profile. Task type label matching and speech activity detection are then performed, which addresses the insufficient preservation of key pronunciation segments by hearing aids in noisy scenes and improves the effectiveness of speech recognition.

CN122245293A · Pending · Publication Date: 2026-06-19 · SUZHOU HIKADI HEARING TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SUZHOU HIKADI HEARING TECH CO LTD
Filing Date: 2026-03-31
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing hearing aids struggle to accommodate differences in digit recognition, binaural frequency sensitivity, and binaural asymmetry among different subjects in noisy environments, resulting in insufficient preservation of key pronunciation segments and impacting speech recognition performance.

Method used

The hearing aid's matching testing terminal is used to perform a two-syllable test and binaural pure tone audiometry to generate an auditory assessment profile. Based on the profile, task type labels are generated, and speech activity detection and key segment localization are performed to enhance the speech to be recognized, with a focus on protecting and enhancing key segments.

Benefits of technology

It improves the speech recognition effectiveness of hearing aids in noisy environments by ensuring the accurate preservation and recognition of key pronunciation segments through targeted enhancement processing.

Figure CN122245293A_ABST

Abstract

This invention discloses a speech recognition enhancement method for hearing aids in noisy environments, belonging to the field of speech enhancement technology. The method includes: conducting a two-syllable test on a subject under background noise using the hearing aid's matching testing terminal; generating test feedback information and performing screening judgment to generate a two-syllable screening result; performing type matching on the subject's auditory assessment profile to generate task type labels; performing speech activity detection on the speech to be recognized by the hearing aid based on the task type labels to generate candidate speech segments; locating key segments in the candidate speech segments and generating key segment protection windows; enhancing the speech to be recognized using the key segment protection windows to generate enhanced speech; and performing speech recognition on the enhanced speech to generate the target recognition result. By performing result judgment to generate the target recognition result, this invention improves the effectiveness of hearing aid speech recognition in noisy environments.

Description

Technical Field

[0001] This invention relates to the field of speech enhancement technology, and in particular to a method for enhancing speech recognition in hearing aids in noisy environments.

Background Technology

[0002] With the development of digital signal processing, miniature microphone arrays, adaptive noise reduction, and speech recognition technologies, hearing aids have gradually evolved from simple sound amplification devices into intelligent hearing assistance devices that combine acoustic compensation, environmental perception, and human-computer interaction functions. Meanwhile, pure-tone audiometry has long been a basic means of identifying and assessing the degree of hearing loss, and can reflect changes in hearing thresholds at different frequencies. Recognition-in-noise tests that use digits as stimulus material are widely used to reflect subjects' performance in noisy environments because the test procedure is simple, the influence of language cognition factors is relatively low, and they are suitable for rapid screening.

[0003] However, existing technologies still have the following shortcomings. Existing solutions mainly rely on general noise reduction or uniform speech enhancement strategies, which make it difficult to account for differences in digit recognition, differences in the frequency sensitivity of the two ears, and binaural asymmetry among different subjects in noisy scenes. Existing enhancement technologies mostly process the entire speech to be recognized uniformly, so that background noise is reduced but the key pronunciation segments that determine the recognition result are insufficiently preserved.

Summary of the Invention

[0004] In view of the aforementioned existing problems, the present invention is proposed.

[0005] Therefore, this invention provides a hearing aid speech recognition enhancement method for noisy environments to address the problem of insufficient retention of key pronunciation segments that determine the recognition results.

[0006] To solve the above technical problems, the present invention provides the following technical solution. In a first aspect, the present invention provides a hearing aid speech recognition enhancement method for noisy environments, comprising: using the hearing aid's matching testing terminal to conduct a two-syllable test on a subject under background noise, generating test feedback information, and performing screening judgment to generate a two-syllable screening result; when the two-syllable screening result is a fail, performing binaural pure tone audiometry on the subject to generate binaural frequency band hearing threshold results; extracting auditory features from the failed two-syllable screening result and the binaural frequency band hearing threshold results to generate an auditory assessment profile of the subject; performing type matching on the auditory assessment profile to generate task type labels; based on the task type labels, performing speech activity detection on the speech to be recognized by the hearing aid to generate candidate speech segments; locating key segments in the candidate speech segments and generating key segment protection windows; enhancing the speech to be recognized using the key segment protection windows to generate enhanced speech; and performing speech recognition on the enhanced speech to generate the target recognition result.

[0007] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenarios described in this invention, the specific steps for generating the test feedback information are as follows: the hearing aid's matching testing terminal outputs background noise and two-syllable test data to the subject's ears, controls the sound pressure level of the background noise and the playback rhythm of the two-syllable test data, collects the subject's identification of the two-syllable test data, and generates test response information; the response content, response timing, and response results in the test response information are recorded to generate the original feedback record; and the numbers of correct, incorrect, and missed responses in the original feedback record are counted to generate the test feedback information.

[0008] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenarios described in this invention, the specific steps for generating the two-syllable screening result are as follows: the answers in the test feedback information are compared for consistency with the two-syllable test corpus to generate answer judgment information; based on the answer judgment information, the accuracy rate and error distribution of the answers are calculated, and screening judgment information is output; and state discrimination is performed on the screening judgment information to generate the two-syllable screening result.

[0009] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenarios described in this invention, the specific steps for generating the binaural frequency band hearing threshold results are as follows: when the two-syllable screening result is a fail, pure tone test signals of different frequencies and intensities are output to the subject's ears through the hearing aid's matching testing terminal to generate binaural pure tone test signals; binaural auditory response data is generated by detecting the subject's responses to the binaural pure tone test signals; and the binaural auditory response data is aggregated by frequency band to generate the binaural frequency band hearing threshold results.

[0010] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenarios described in this invention, the specific steps for generating the subject's auditory assessment profile are as follows: accuracy features, error distribution features, and noise scene identification features are extracted from the failed two-syllable screening result and summarized to form a screening feature set; left and right ear frequency band sensitivity features, frequency threshold change features, and binaural difference features are extracted from the binaural frequency band hearing threshold results to obtain a hearing threshold feature set; and the screening feature set and the hearing threshold feature set are merged and organized to generate the subject's auditory assessment profile.

[0011] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenarios described in this invention, the specific steps for generating the task type labels are as follows: noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features are extracted from the subject's auditory assessment profile to form a task matching feature set; the task matching feature set is compared with the preset digit recognition constraints and control command recognition constraints, respectively, to obtain the matching degree of each task type; and the task type with the highest matching degree is determined to be the current task type and is labeled to generate the task type label.

[0012] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenes described in this invention, the specific steps for generating the candidate speech segments are as follows: rule matching is performed on the task type labels to generate speech activity detection rules; the speech to be recognized is acquired from the hearing aid, and endpoint detection and activity interval detection are performed on it according to the speech activity detection rules to generate speech activity intervals; and the speech activity intervals are spliced to generate the candidate speech segments.
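The three steps above (rule matching, activity interval detection, segment splicing) can be illustrated with a minimal sketch that uses a plain frame-energy threshold. The source does not say how the detection rules are defined, so the rule table `VAD_RULES`, its fields, the label names, and every numeric value below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadRule:
    energy_threshold: float  # mean squared amplitude above which a frame counts as speech
    max_gap_frames: int      # silent gaps up to this many frames are bridged when splicing


# Hypothetical rule table keyed by task type label (illustrative values only).
VAD_RULES = {
    "digit_recognition": VadRule(energy_threshold=0.01, max_gap_frames=2),
    "control_command": VadRule(energy_threshold=0.02, max_gap_frames=4),
}


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def activity_intervals(energies, threshold):
    """Return [start, end) frame intervals whose energy exceeds the threshold."""
    intervals, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(energies)))
    return intervals


def splice(intervals, max_gap):
    """Merge neighbouring intervals separated by at most max_gap frames."""
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def candidate_segments(samples, task_label, frame_len=4):
    """Rule matching, interval detection, then splicing, in that order."""
    rule = VAD_RULES[task_label]
    energies = frame_energies(samples, frame_len)
    return splice(activity_intervals(energies, rule.energy_threshold), rule.max_gap_frames)
```

A practical detector would likely work on overlapping windowed frames and adapt its threshold to the noise floor; the fixed threshold here only keeps the control flow visible.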

[0013] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenes described in this invention, the specific steps for generating the key segment protection window are as follows: task type information is extracted from the task type labels, and speech segments that match the task type information are identified in the candidate speech segments to generate target speech segments of interest; duration statistics, energy change statistics, and frequency band distribution statistics are computed for the target speech segments of interest to generate segment statistical results; the importance of the target speech segments of interest is compared using the segment statistical results to generate key segment localization results; and the start and end positions in the key segment localization results are expanded outward to generate the key segment protection window.
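The final step, expanding the located start and end positions into a protection window, can be sketched as follows. The margin size, the clamping to the signal length, and the merging of windows that overlap after expansion are assumptions made for illustration; the source only states that the boundaries are expanded.

```python
def protection_windows(key_segments, margin, total_len):
    """Expand each (start, end) key segment by `margin` samples on both sides.

    Windows are clamped to [0, total_len] and merged when the expansion makes
    them overlap or touch (an assumption; the source does not discuss overlap).
    """
    windows = []
    for start, end in sorted(key_segments):
        lo = max(0, start - margin)
        hi = min(total_len, end + margin)
        if windows and lo <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], hi))
        else:
            windows.append((lo, hi))
    return windows
```

For example, with a margin of 20 samples, segments at (100, 200) and (220, 300) merge into a single window (80, 320), since their expanded ranges overlap.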

[0014] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenes described in this invention, the specific steps for generating the enhanced speech are as follows: frequency band preservation and noise suppression are applied to the speech to be recognized that is covered by the key segment protection window to generate protection enhancement segments; standard noise reduction is applied to the speech to be recognized outside the key segment protection window to generate background enhancement segments; and the protection enhancement segments and the background enhancement segments are concatenated in time order to generate the enhanced speech.
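The split-process-concatenate structure described above can be sketched as below. The two processing functions are passed in as callables because the source does not specify the actual algorithms; `preserve_segment` and `attenuate_background` are hypothetical stand-ins that only scale the samples so the piecewise behaviour can be checked, and they are not real noise reduction.

```python
def enhance(samples, windows, protect, denoise):
    """Apply `protect` inside each [start, end) window and `denoise` elsewhere.

    `windows` must be sorted and non-overlapping. The processed pieces are
    concatenated in their original time order.
    """
    out, cursor = [], 0
    for start, end in windows:
        if cursor < start:
            out.extend(denoise(samples[cursor:start]))  # background enhancement segment
        out.extend(protect(samples[start:end]))         # protection enhancement segment
        cursor = end
    if cursor < len(samples):
        out.extend(denoise(samples[cursor:]))
    return out


def preserve_segment(segment):
    """Stand-in for band-preserving processing: mild attenuation only."""
    return [0.9 * x for x in segment]


def attenuate_background(segment):
    """Stand-in for standard noise reduction: stronger attenuation."""
    return [0.5 * x for x in segment]
```

Calling `enhance([1.0] * 6, [(2, 4)], preserve_segment, attenuate_background)` leaves the two protected samples at 0.9 and the rest at 0.5. In practice some cross-fading at the window boundaries would probably be needed to avoid audible discontinuities.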

[0015] As a preferred embodiment of the hearing aid speech recognition enhancement method for noisy scenes described in this invention, the specific steps for generating the target recognition result are as follows: feature extraction is performed on the enhanced speech to obtain recognition speech features; the recognition speech features are labeled and decoded using the task type labels to generate initial recognition content; and result judgment is performed on the initial recognition content to generate the target recognition result.

[0016] The beneficial effects of this invention are as follows: by performing state discrimination, it generates the two-syllable screening result, avoiding the coarse judgment caused by relying solely on a single accuracy indicator; by merging and organizing, it generates the subject's auditory assessment profile, integrating the functional identification results with the frequency band hearing threshold results; by determining the task type with the highest matching degree as the current task type and generating a task type label, it enables task-oriented processing according to the different requirements of digit recognition or control command recognition, making enhancement and recognition more targeted; by expanding the boundaries, it generates the key segment protection window, framing the key segments accurately and protecting them with an appropriate margin, which provides a clear target for the subsequent differentiated enhancement processing; and by performing result judgment, it generates the target recognition result, improving the effectiveness of hearing aid speech recognition in noisy environments.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of the speech recognition enhancement method for hearing aids in noisy environments.

[0019] Figure 2 is a flowchart for generating the subject's auditory assessment profile.

[0020] Figure 3 is a flowchart for generating the key segment protection window.

[0021] Figure 4 is a flowchart for generating the target recognition result.

Detailed Implementation

[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0023] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0024] The term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor does it refer to a separate or alternative embodiment that is mutually exclusive with other embodiments.

[0025] Referring to Figures 1-4, as one embodiment of the present invention, this embodiment provides a method for enhancing speech recognition in hearing aids in noisy environments, comprising the following steps: S1: A two-syllable test is conducted on the subject under background noise using the hearing aid's matching testing terminal; test feedback information is generated, and screening judgment is performed to generate the two-syllable screening result.

[0026] S1.1: The background noise and the two-syllable test data are output to the subject's ears through the hearing aid's matching testing terminal. The sound pressure level of the background noise and the playback rhythm of the two-syllable test data are controlled, the subject's identification of the two-syllable test data is collected, and test response information is generated.

[0027] The two-syllable test data and the background noise data are imported into the hearing aid's matching testing terminal. The two-syllable test data uses digit-combination speech material, in which each item is formed by pronouncing two digits in sequence; multiple different digit combinations are selected to form the two-syllable test data. The background noise data uses environmental noise recordings selected from at least one of the following: steady-state noise, traffic noise, interfering human voices, or mixed noise from daily-life scenarios. The sound pressure level range of the background noise is set in the testing terminal, for example from 40 decibels to 70 decibels. The playback rhythm range of the two-syllable test data is also set; for example, the playback duration of each two-syllable test item is from 0.5 seconds to 2 seconds, and the interval between two adjacent test items is from 1 second to 3 seconds.

[0028] The hearing aid's matching testing terminal first outputs background noise to the subject's ears to establish the background-noise listening condition. After the background noise output stabilizes, the two-syllable test data is superimposed on the background noise, and the two digits of the two-syllable test data are output to the subject's ears in sequence. The subject gives a first identification content and a second identification content, respectively. According to the output order of the two digits, the testing terminal maps the first identification content to the preceding digit and the second identification content to the following digit, forming the identification result. The testing terminal then associates and records the two-syllable test data with the identification result to generate the test response information.

[0029] S1.2: Record the content, timing, and results of the responses in the test response information to generate the original feedback record; count the number of correct responses, the number of incorrect responses, and the number of missed responses in the original feedback record to generate test feedback information.

[0030] The hearing aid's matching testing terminal reads the two-syllable test data, the identification results, and the positional correspondence between the identification results and the two-syllable test data from the test response information. The order of the two digits in the two-syllable test data is used as the response sequence, the two digits of the identification result, in order, are used as the response content, and the consistency between the identification result and the two-syllable test data is used as the response result. The response content, response sequence, and response result are written into the test record to form the original feedback record.

[0031] The hearing aid's matching testing terminal analyzes each response in the original feedback record. A response that fully matches the two-syllable test data is recorded as correct. A response that contains identification content but is inconsistent with the two-syllable test data is recorded as incorrect. A test item for which the original feedback record contains no corresponding identification result is recorded as a missed response. The testing terminal accumulates the correct responses to obtain the number of correct responses, the incorrect responses to obtain the number of incorrect responses, and the missed responses to obtain the number of missed responses. These three counts are then summarized to generate the test feedback information.

[0032] S1.3: Compare the answers in the test feedback information with the two-syllable test corpus to generate answer judgment information.

[0033] The hearing aid's matching testing terminal reads the answers from the test feedback information together with the corresponding two-syllable test data. It compares the first digit of the answer with the first digit of the two-syllable test data, then compares the second digit of the answer with the second digit of the two-syllable test data, and records the comparison result for each digit. If both the first and second digits match, the answer is recorded as correct. If either digit does not match, the answer is recorded as incorrect. If the test feedback information contains no answer corresponding to the two-syllable test data, the answer is recorded as missed. The testing terminal collects the correct, incorrect, and missed answers to form a judgment record, which is then associated with the answer content in the test feedback information to generate the answer judgment information.
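The digit-by-digit comparison and the three-way classification can be sketched as follows. The data layout (a target and an answer as two-element tuples, with `None` standing for a missing answer) and the function names are assumptions made for illustration.

```python
from collections import Counter


def judge_response(target, answer):
    """Compare one answer with its two-digit target, digit by digit.

    target: (first_digit, second_digit); answer: same shape, or None if the
    subject gave no answer for this test item.
    """
    if answer is None:
        return {"first_ok": False, "second_ok": False, "verdict": "missed"}
    first_ok = answer[0] == target[0]
    second_ok = answer[1] == target[1]
    verdict = "correct" if first_ok and second_ok else "incorrect"
    return {"first_ok": first_ok, "second_ok": second_ok, "verdict": verdict}


def tally(trials):
    """Count correct, incorrect, and missed answers over (target, answer) pairs."""
    counts = Counter(judge_response(target, answer)["verdict"] for target, answer in trials)
    return {key: counts.get(key, 0) for key in ("correct", "incorrect", "missed")}
```

Keeping the per-digit flags alongside the verdict makes it possible to tell later whether an error fell on the preceding or the following digit.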

[0034] S1.4: Based on the answer judgment information, calculate the accuracy rate and error distribution of the answer content, and output the screening judgment information; perform state discrimination on the screening judgment information and generate the two-syllable screening results.

[0035] The hearing aid's matching testing terminal reads the correct, incorrect, and missed answers from the answer judgment information and counts the total number of responses, the number of correct responses, and the number of incorrect responses. It calculates the ratio of the number of correct responses to the total number of responses to obtain the accuracy rate of the response content. For each incorrect response, it takes the digit combination of the corresponding two-syllable test item and determines whether the error lies in the preceding digit or the following digit. Incorrect responses with the same digit combination and an incorrect preceding digit are grouped into the same preceding-error category, and incorrect responses with the same digit combination and an incorrect following digit are grouped into the same following-error category. The occurrence frequency of each preceding-error and following-error category is counted to obtain the error distribution of the response content. Finally, the accuracy rate and error distribution of the response content are summarized, and the screening judgment information is output.
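A minimal sketch of the accuracy and error-distribution statistics follows. Two points are assumptions: missed responses are counted in the total, since the text does not say whether they are excluded, and the error categories are represented as `(digit combination, position)` keys.

```python
from collections import Counter


def screening_statistics(trials):
    """Return (accuracy, error_distribution) for a list of (target, answer) pairs.

    answer is None for a missed response. Errors are grouped by digit
    combination and by which digit (preceding or following) was wrong.
    """
    total = len(trials)
    correct = 0
    errors = Counter()
    for target, answer in trials:
        if answer is None:
            continue  # missed: counted in the total (assumption), no error position
        if answer == target:
            correct += 1
            continue
        if answer[0] != target[0]:
            errors[(target, "preceding")] += 1
        if answer[1] != target[1]:
            errors[(target, "following")] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, dict(errors)
```

A response with both digits wrong contributes to both the preceding-error and the following-error category of its digit combination.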

[0036] A pass rate threshold is set on the proportion of correct answers among all answers in the two-syllable test. This threshold characterizes the subject's basic recognition level for the two-syllable test corpus under background noise. The pass rate threshold ranges from 70% to 90%; this range is determined from the fluctuation range of two-syllable test accuracy under background noise and from the discrimination requirements of rapid screening, and it balances the ability of rapid screening to differentiate answer accuracy against the accuracy of identifying the fail state. When the accuracy rate of the answers reaches the pass rate threshold and the error distribution is not concentrated in the same digit combination, the screening judgment information is judged as a pass. When the accuracy rate does not reach the pass rate threshold or the error distribution is concentrated in the same digit combination, the screening judgment information is judged as a fail. The two-syllable screening result is generated accordingly.
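The pass/fail rule can be sketched as below. The text does not define when an error distribution counts as "concentrated", so the criterion used here (one digit combination accounting for more than `share_limit` of at least `min_errors` errors) and its default values are hypothetical. The error distribution is assumed to be a mapping from `(digit combination, position)` to a count.

```python
from collections import Counter


def is_concentrated(error_distribution, share_limit=0.5, min_errors=2):
    """Hypothetical concentration test: does one digit combination dominate the errors?"""
    per_pair = Counter()
    for (pair, _position), count in error_distribution.items():
        per_pair[pair] += count
    total = sum(per_pair.values())
    if total < min_errors:
        return False  # too few errors to speak of concentration
    return max(per_pair.values()) / total > share_limit


def screening_result(accuracy, error_distribution, pass_threshold=0.8):
    """Pass only if accuracy reaches the threshold and errors are not concentrated."""
    if not 0.7 <= pass_threshold <= 0.9:
        raise ValueError("pass_threshold lies outside the 70%-90% range given in the text")
    passed = accuracy >= pass_threshold and not is_concentrated(error_distribution)
    return "pass" if passed else "fail"
```

Under these assumptions, a subject with 85% accuracy still fails when three of four errors fall on the same digit combination, which reflects the point that accuracy alone is not the only criterion.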

[0037] S2: When the two-syllable screening result is a fail, perform binaural pure tone audiometry on the subject to generate binaural frequency band hearing threshold results, and extract auditory features from the failed two-syllable screening result and the binaural frequency band hearing threshold results to generate the subject's auditory assessment profile.

[0038] S2.1: When the two-syllable screening result is a fail, pure tone test signals of different frequencies and intensities are output to the subject's ears through the hearing aid's matching testing terminal to generate binaural pure tone test signals.

[0039] The hearing aid's matching testing terminal reads the fail status from the two-syllable screening result and transfers subjects with the fail status to the pure tone audiometry process. Multiple test frequencies and multiple test intensities are selected. The test frequencies are used to characterize hearing sensitivity in different frequency bands, and the test intensities are used to characterize changes in audibility at the same frequency. The testing terminal combines and arranges the test frequencies and test intensities in the order of left ear testing followed by right ear testing, forming a left ear pure tone output sequence and a right ear pure tone output sequence. Each pure tone signal in the left ear sequence is output to the subject's left ear at its test frequency and test intensity, and each pure tone signal in the right ear sequence is output to the subject's right ear at its test frequency and test intensity. The testing terminal combines the left ear and right ear pure tone output sequences to generate the binaural pure tone test signal.
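The construction of the left-then-right output sequence can be sketched as a simple Cartesian product of frequencies and intensities. The record layout and the example values in the usage line are illustrative and are not taken from the source.

```python
from itertools import product


def pure_tone_sequence(ear, frequencies_hz, levels_db):
    """One test item per (frequency, intensity) combination for a single ear."""
    return [
        {"ear": ear, "frequency_hz": freq, "level_db": level}
        for freq, level in product(frequencies_hz, levels_db)
    ]


def binaural_test_signal(frequencies_hz, levels_db):
    """Left ear sequence followed by right ear sequence, as described in the text."""
    return (
        pure_tone_sequence("left", frequencies_hz, levels_db)
        + pure_tone_sequence("right", frequencies_hz, levels_db)
    )
```

For instance, `binaural_test_signal([1000, 4000], [20, 40, 60])` yields twelve items: six for the left ear, then six for the right ear.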

[0040] S2.2: Generate binaural auditory response data by detecting the subject's response to the binaural pure tone test signal; collect the binaural auditory response data by frequency band to generate binaural frequency band hearing threshold results.

[0041] As the left ear and right ear pure tone output sequences are played, the subject responds to pure tone test signals of different frequencies and intensities. For each pure tone test signal, the hearing aid's matching testing terminal records whether the subject heard it, generating a binaural auditory response record. The testing terminal then combines and organizes the test frequency, test intensity, left ear auditory response, and right ear auditory response from the binaural auditory response record to generate the binaural auditory response data.

[0042] The binaural auditory response data is grouped by test frequency. The different test intensities at the same test frequency, together with the corresponding auditory responses, are recorded as one group of frequency band data. In each group of frequency band data, the hearing aid's matching testing terminal selects the test intensities that produced an auditory response from the subject and uses the lowest such intensity at that test frequency as the hearing threshold record for that frequency. The hearing threshold records at each test frequency are summarized separately for the left and right ears by the testing terminal to generate the binaural frequency band hearing threshold results.
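A minimal sketch of the frequency band aggregation follows. Taking the lowest intensity that produced a response is an assumption consistent with the usual meaning of a hearing threshold; the tuple layout of the response data is likewise illustrative.

```python
def band_thresholds(responses):
    """Aggregate responses into per-ear, per-frequency hearing thresholds.

    responses: iterable of (ear, frequency_hz, level_db, heard) tuples, where
    ear is "left" or "right". The threshold is the lowest level that was heard.
    Frequencies at which nothing was heard are left out of the result.
    """
    thresholds = {"left": {}, "right": {}}
    for ear, freq, level, heard in responses:
        if not heard:
            continue
        current = thresholds[ear].get(freq)
        if current is None or level < current:
            thresholds[ear][freq] = level
    return thresholds
```

Clinical audiometry normally uses an adaptive up-down procedure rather than a fixed grid of levels; the grid version here simply mirrors the sequence described in the text.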

[0043] S2.3: Extract accuracy features, error distribution features, and noise scene identification features from the failed two-syllable screening result, and summarize them to form a screening feature set.

[0044] The hearing aid's matching testing terminal reads the accuracy rate, the error distribution, and the two-syllable identification performance under background noise from the failed two-syllable screening result. The accuracy rate is used as the accuracy feature, the distribution of incorrectly identified digit combinations is used as the error distribution feature, and the identification performance on the two-syllable test data under background noise is used as the noise scene identification feature. The accuracy feature, error distribution feature, and noise scene identification feature are summarized to form the screening feature set.

[0045] S2.4: Extract left and right ear frequency band sensitivity features, frequency threshold change features, and binaural difference features from the binaural frequency band hearing threshold results to obtain a hearing threshold feature set; merge and organize the screening feature set and the hearing threshold feature set to generate a subject's auditory assessment profile.

[0046] The hearing aid's testing terminal organizes the left and right ear hearing thresholds from the binaural frequency band results separately according to the test frequency. The left and right ear hearing thresholds are compared at the same test frequency to obtain the hearing threshold differences between the left and right ears at each test frequency. These hearing threshold differences from all test frequencies are combined to form binaural difference features. The changes in left and right ear hearing thresholds are compared between adjacent test frequencies to obtain threshold changes between different test frequencies, and these threshold changes between different test frequencies are used as frequency threshold change features. The high and low levels of left and right ear hearing thresholds at all test frequencies are organized to obtain the auditory sensitivity of the left and right ears at different test frequencies, and this is used as the left and right ear frequency band sensitivity features. The left and right ear frequency band sensitivity features, frequency threshold change features, and binaural difference features are summarized to obtain a hearing threshold feature set. The screening feature set and the hearing threshold feature set are merged and organized to generate a subject's auditory assessment profile.
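The three hearing threshold features can be sketched as follows. The input layout (a dictionary with `"left"` and `"right"` keys, each mapping test frequency to threshold), the left-minus-right sign convention, and the representation of sensitivity as a ranking of frequencies from lowest to highest threshold are assumptions made for illustration.

```python
def threshold_features(thresholds):
    """Derive binaural difference, frequency threshold change, and sensitivity features.

    thresholds: {"left": {freq_hz: level_db}, "right": {freq_hz: level_db}}.
    Only frequencies measured in both ears are used.
    """
    left, right = thresholds["left"], thresholds["right"]
    freqs = sorted(set(left) & set(right))
    ears = {"left": left, "right": right}

    # Left minus right at each frequency (sign convention is an assumption).
    binaural_difference = {f: left[f] - right[f] for f in freqs}

    # Threshold change between each pair of adjacent test frequencies, per ear.
    frequency_change = {
        ear: {(lo, hi): levels[hi] - levels[lo] for lo, hi in zip(freqs, freqs[1:])}
        for ear, levels in ears.items()
    }

    # Frequencies ordered from most sensitive (lowest threshold) to least, per ear.
    band_sensitivity = {
        ear: sorted(freqs, key=lambda f: levels[f])
        for ear, levels in ears.items()
    }

    return {
        "binaural_difference": binaural_difference,
        "frequency_change": frequency_change,
        "band_sensitivity": band_sensitivity,
    }
```

Merging this dictionary with the screening feature set would then give one possible concrete form of the auditory assessment profile.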

[0047] S3: Perform type matching on the auditory assessment profile of the subject, generate task type labels, perform speech activity detection on the speech to be recognized by the hearing aid based on the task type labels, generate candidate speech segments, locate key segments in the candidate speech segments, and generate key segment protection windows.

[0048] S3.1: Extract noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features from the subject's auditory assessment profile to form a task matching feature set.

[0049] The hearing aid's accompanying testing terminal reads the screening feature set and hearing threshold feature set from the subject's auditory assessment profile. It then lists the noise scene identification-related content in the screening feature set separately, and lists the left and right ear frequency band sensitivity-related content and binaural difference-related content separately in the hearing threshold feature set. The noise scene identification-related content in the screening feature set is used as the noise scene identification feature. The auditory sensitivity of the left and right ears at different test frequencies in the hearing threshold feature set is used as the left and right ear frequency band sensitivity feature. The hearing threshold difference of the left and right ears at the same test frequency in the hearing threshold feature set is used as the binaural difference feature. The noise scene identification feature, left and right ear frequency band sensitivity feature, and binaural difference feature are summarized to form the task matching feature set.

[0050] S3.2: Compare the task matching feature set with the preset digit recognition constraints and control command recognition constraints, respectively, to obtain the matching degree of each task type.

[0051] Digit recognition constraints are set based on the digit combination recognition requirements in the diatonic test corpus. These constraints include requirements for noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features. Control command recognition constraints are set based on the recognition requirements of hearing aid control command speech. These constraints also include requirements for noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features. The noise scene identification features in the task matching feature set are compared with the noise identification requirements in the digit recognition constraints. When the noise scene identification features indicate that the subject's recognition ability reaches the level required for digit recognition (e.g., the subject's recognition accuracy on the diatonic test corpus is not less than 70%), the noise scene identification features are recorded as meeting the noise scene identification feature requirements in the digit recognition constraints. Otherwise, they are recorded as not meeting those requirements.

[0052] The left and right ear frequency band sensitivity features are compared with the frequency band sensitivity requirements in the digit recognition constraints. When these features indicate that the subject's auditory sensitivity in the main frequency band of digit speech reaches the frequency band sensitivity level required for digit recognition (for example, both ears maintain an effective auditory response within the main frequency band of digit speech and there is no obvious hearing threshold break between adjacent frequency bands), the left and right ear frequency band sensitivity features are recorded as meeting the frequency band sensitivity feature requirements in the digit recognition constraints. Otherwise, they are recorded as not meeting those requirements.

[0053] The binaural difference features are compared with the binaural difference requirements in the digit recognition constraints. When the degree of difference between the left and right ears in digit speech recognition, as represented by the binaural difference features, is within the acceptable range for digit recognition (for example, the hearing threshold difference between the left and right ears at the same test frequency does not exceed 10 dB to 15 dB), the binaural difference features are recorded as meeting the binaural difference feature requirements in the digit recognition constraints. Otherwise, they are recorded as not meeting those requirements. When the noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features all meet the digit recognition constraints, the digit recognition task is recorded as a complete match. When two of the three meet the digit recognition constraints, the digit recognition task is recorded as a partial match. When only one of the three meets the digit recognition constraints, or when none of them do, the digit recognition task is recorded as a low match.

[0054] The noise scene identification features in the task matching feature set are compared with the noise identification requirements in the control command recognition constraints. When the noise scene identification features indicate that the subject's ability to recognize control command speech reaches the level required for control command recognition (e.g., the subject's accuracy in recognizing control command speech is not less than 70%), the noise scene identification features are recorded as meeting the noise scene identification feature requirements in the control command recognition constraints. Otherwise, they are recorded as not meeting those requirements. The left and right ear frequency band sensitivity features are compared with the frequency band sensitivity requirements in the control command recognition constraints. When these features indicate that the subject's auditory sensitivity in the main frequency band of control command speech reaches the frequency band sensitivity level required for control command recognition (for example, both ears maintain an effective auditory response within the main frequency band of control command speech and there is no obvious hearing threshold break between adjacent frequency bands), the left and right ear frequency band sensitivity features are recorded as meeting the frequency band sensitivity feature requirements in the control command recognition constraints. Otherwise, they are recorded as not meeting those requirements.

[0055] The binaural difference features are compared with the binaural difference requirements in the control command recognition constraints. When the degree of difference between the left and right ears in control command speech recognition, as represented by the binaural difference features, is within the acceptable range for control command recognition (for example, the hearing threshold difference between the left and right ears at the same test frequency does not exceed 10 dB to 15 dB), the binaural difference features are recorded as meeting the binaural difference feature requirements in the control command recognition constraints. Otherwise, they are recorded as not meeting those requirements. When the noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features all meet the control command recognition constraints, the control command recognition task is recorded as a complete match. When two of the three meet the control command recognition constraints, the control command recognition task is recorded as a partial match. When only one of the three meets the control command recognition constraints, or when none of them do, the control command recognition task is recorded as a low match.
The complete match, partial match, and low match of the digit recognition task are taken as the matching degree of the digit recognition task, and the complete match, partial match, and low match of the control command recognition task are taken as the matching degree of the control command recognition task, thus obtaining the matching degree of each task type.
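
The three-way grading in S3.2 amounts to counting how many of the three feature checks pass. A minimal sketch follows, assuming the 70% accuracy example and a 10 dB binaural limit taken from the lower end of the 10 dB to 15 dB example; the function names are hypothetical, and the frequency band sensitivity check is passed in as a boolean because the text gives no numeric criterion for it.

```python
from typing import Dict

def noise_ok(accuracy: float, required: float = 0.70) -> bool:
    # Example criterion from the text: accuracy not less than 70%.
    return accuracy >= required

def binaural_ok(differences_db: Dict[int, float], limit_db: float = 10.0) -> bool:
    # limit_db is an assumption; the text only gives a 10 dB to 15 dB example.
    return all(abs(d) <= limit_db for d in differences_db.values())

def matching_degree(noise: bool, band: bool, binaural: bool) -> str:
    """Map the number of satisfied constraints to a matching degree."""
    satisfied = sum([noise, band, binaural])
    if satisfied == 3:
        return "complete"
    if satisfied == 2:
        return "partial"
    return "low"  # one or zero constraints satisfied
```

The same function would be called once with the digit recognition constraints and once with the control command recognition constraints to obtain the matching degree of each task type.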

[0056] S3.3: Determine the task type with the highest matching degree as the current task type, assign a label to it, and generate a task type label.

[0057] The matching degree of the digit recognition task is compared with that of the control command recognition task. When the digit recognition task achieves a complete match and the control command recognition task achieves a partial or low match, the digit recognition task is determined as the current task type. When neither the digit recognition task nor the control command recognition task achieves a complete match, a partial match is treated as a higher matching degree than a low match, and the task type with the higher matching degree is determined as the current task type. After the current task type is determined, when the current task type is a digit recognition task, the task type content in the task type label is recorded as digit recognition. When the current task type is a control command recognition task, the task type content in the task type label is recorded as control command recognition. When the matching degree corresponding to the current task type is a complete match, the matching level content in the task type label is recorded as a complete match. When the matching degree corresponding to the current task type is a partial match, the matching level content in the task type label is recorded as a partial match. When the matching degree corresponding to the current task type is a low match, the matching level content in the task type label is recorded as a low match. The task type content and the matching level content are combined to generate the task type label.
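
The selection in S3.3 can be sketched as a comparison of ranked matching degrees. The text does not say what happens when both task types have the same matching degree, so the `tie_break` parameter below is purely an assumption of this sketch, as are the function name and the label layout.

```python
# Ranking stated in the text: complete > partial > low.
RANK = {"low": 0, "partial": 1, "complete": 2}

def task_type_label(digit_match: str, command_match: str,
                    tie_break: str = "digit recognition") -> dict:
    """Pick the task type with the higher matching degree and build a label."""
    if RANK[digit_match] > RANK[command_match]:
        task = "digit recognition"
    elif RANK[command_match] > RANK[digit_match]:
        task = "control command recognition"
    else:
        # Equal matching degrees are not covered by the text (assumption).
        task = tie_break
    level = digit_match if task == "digit recognition" else command_match
    # The label combines task type content and matching level content.
    return {"task_type": task, "matching_level": level}
```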

[0058] S3.4: Perform rule matching on task type labels to generate voice activity detection rules.

[0059] The task type content and matching level content in the task type label are read using the hearing aid's accompanying testing terminal. When the task type is a digit recognition task, the digit speech start position detection content, digit speech end position detection content, and digit speech pause interval detection content are determined as the speech activity detection rule content for the digit recognition task. Here, the digit speech start position detection content is used to determine where pronunciation of the digit speech begins, the digit speech end position detection content is used to determine where pronunciation of the digit speech ends, and the digit speech pause interval detection content is used to determine the interval position between two digits. When the task type is a control command recognition task, the command speech start position detection content, command speech end position detection content, and command speech continuous segment detection content are determined as the speech activity detection rule content for the control command recognition task. Here, the command speech start position detection content is used to determine where pronunciation of the command speech begins, the command speech end position detection content is used to determine where pronunciation of the command speech ends, and the command speech continuous segment detection content is used to determine the complete interval of continuous command speech.

[0060] When the matching level is a complete match, all detection content in the speech activity detection rules of the digit recognition task or the control command recognition task is retained. When the matching level is a partial match, only the start position detection content and end position detection content are retained. When the matching level is a low match, only the start position detection content is retained. All retained detection content is combined to generate a speech activity detection rule.

[0061] It should be noted that the speech activity detection rule refers to the detection rule formed by combining the speech start position detection content, the speech end position detection content, and the digit speech pause interval detection content or command speech continuous segment detection content. It is used to determine the speech activity start position, speech activity end position, digit interval position, or continuous command speech interval in the speech to be recognized.
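
The rule generation in S3.4 can be read as a lookup: the task type fixes the full list of detection content, and the matching level decides how much of that list is retained. The following is a minimal sketch with hypothetical string identifiers for the detection content.

```python
from typing import List

def vad_rule(task_type: str, matching_level: str) -> List[str]:
    """Return the retained detection content for a task type label."""
    if task_type == "digit recognition":
        third = "digit_pause_interval"
    else:  # control command recognition
        third = "command_continuous_segment"
    full = ["start_position", "end_position", third]
    # complete: keep all three; partial: start and end; low: start only.
    keep = {"complete": 3, "partial": 2, "low": 1}[matching_level]
    return full[:keep]
```

Keeping the detection content in a fixed order makes the three retention levels a simple prefix of the same list.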

[0062] S3.5: Obtain the speech to be recognized from the hearing aid, perform endpoint detection and activity interval detection on the speech to be recognized according to the speech activity detection rules, and generate speech activity intervals; splice the speech activity intervals to generate candidate speech segments.

[0063] The hearing aid's accompanying testing terminal acquires the speech to be recognized and arranges it into a continuous speech sequence in time order. The speech start position detection content in the speech activity detection rule is applied to the continuous speech sequence to determine the position where speech activity begins in the speech to be recognized. The speech end position detection content in the speech activity detection rule is applied to the continuous speech sequence to determine the position where speech activity ends in the speech to be recognized. When the speech activity detection rule includes digit speech pause interval detection content, that content is applied to the continuous speech sequence to determine the interval position between adjacent digits. When the speech activity detection rule includes command speech continuous segment detection content, that content is applied to the continuous speech sequence to determine the complete interval of continuous command speech. The speech start position, speech end position, digit interval position, or continuous command speech interval are sorted to obtain the speech activity intervals, for example, 0.2s-2s.

[0064] The splicing duration is set according to the time interval between speech activity intervals, for example, 0.1s-0.5s. When the time interval between adjacent speech activity intervals is less than the splicing duration, the adjacent speech activity intervals are merged into the same segment. When the time interval between adjacent speech activity intervals is greater than the splicing duration, the adjacent speech activity intervals are retained as different segments and the segment splicing result is obtained. All speech segments in the segment splicing result are arranged in chronological order to generate candidate speech segments.
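
The splicing step is a standard interval merge driven by the gap between neighbors. A minimal sketch follows, assuming intervals are `(start, end)` tuples in seconds and a splicing duration of 0.3 s chosen from the 0.1s-0.5s example range. The text does not cover a gap exactly equal to the splicing duration; this sketch keeps such intervals separate, which is an assumption.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

def splice_intervals(intervals: List[Interval],
                     splice_duration: float = 0.3) -> List[Interval]:
    """Merge adjacent speech activity intervals whose gap is below the limit."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):  # chronological order
        if merged and start - merged[-1][1] < splice_duration:
            # Gap is shorter than the splicing duration: same segment.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Gap is at least the splicing duration: keep as a new segment.
            merged.append((start, end))
    return merged
```

For instance, with intervals `(0.2, 0.8)`, `(0.9, 1.5)`, and `(2.2, 2.6)`, the first two are about 0.1 s apart and are merged, while the third stays separate.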

[0065] S3.6: Extract task type information from task type labels, identify speech segments that match the task type information in candidate speech segments, and generate target speech interest segments; perform duration statistics, energy change statistics, and frequency band distribution statistics on target speech interest segments to generate segment statistics results.

[0066] The task type information in the task type label is read through the hearing aid's accompanying testing terminal. When the task type is a digit recognition task, the digit speech segments in the candidate speech segments are used as the recognition objects. The candidate speech segments are divided into multiple digit speech segments according to the pause position between two digits. Among these digit speech segments, those that retain a complete digit pronunciation are identified as speech segments that match the task type information. When the task type is a control command recognition task, the continuous speech segments in the candidate speech segments are used as the recognition objects. The candidate speech segments are divided into multiple command speech segments according to the start and end positions of continuous pronunciation. Among these command speech segments, those that retain a complete continuous pronunciation are identified as speech segments that match the task type information. The matching speech segments are then aggregated to generate the target speech interest segments.

[0067] The duration between the start and end positions of the target speech interest segment is recorded to obtain the duration statistics of each target speech interest segment. The speech intensity change process in the target speech interest segment is recorded to obtain the energy change statistics of each target speech interest segment. The speech signal in the target speech interest segment is divided according to frequency, and the energy distribution in each frequency range is statistically analyzed to obtain the frequency band distribution statistics of each target speech interest segment. The duration statistics, energy change statistics, and frequency band distribution statistics are summarized to generate the segment statistics.

[0068] S3.7: Compare the importance of target speech interest segments based on segment statistics results to generate key segment localization results; expand the boundaries of the start and end positions in the key segment localization results to generate key segment protection windows.

[0069] The duration statistics, energy change statistics, and frequency band distribution statistics are read from the segment statistics results using the hearing aid's accompanying testing terminal. These statistics are then mapped to each target speech interest segment. The duration statistics of each target speech interest segment are compared with the speech length requirements of the digit recognition task or control command recognition task. If the duration of the target speech interest segment covers the complete digit or command pronunciation, the segment is recorded as a duration-satisfied segment. If the duration does not cover the complete digit or command pronunciation, the segment is recorded as a duration-insufficient segment. The energy change statistics of the target speech interest segment are then compared. If the speech energy fluctuations within the target speech interest segment are significant and the start, body, and end of the pronunciation can be distinguished, the segment is recorded as an energy-stable segment. If the speech energy variation within the target speech interest segment is too small or too scattered, the segment is recorded as an energy-unstable segment.

[0070] The frequency band distribution statistics of the target speech interest segments are compared. When the frequency band energy distribution within the target speech interest segment can maintain the main audio band required for the current task type, the target speech interest segment is recorded as a frequency band concentrated segment. When the frequency band energy distribution within the target speech interest segment cannot maintain the main audio band required for the current task type, the target speech interest segment is recorded as a frequency band dispersed segment. Segments that meet the duration, energy stability, and frequency band concentration criteria are designated as high-importance segments. Target speech interest segments that meet two of these criteria are designated as medium-importance segments. Target speech interest segments that meet only one of these criteria or none of them are designated as low-importance segments. When high-importance segments exist, they are identified as key segment localization results. When no high-importance segments exist but medium-importance segments exist, medium-importance segments are identified as key segment localization results. When only low-importance segments exist, the low-importance segment with the longest duration is identified as the key segment localization result.
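
The importance comparison in S3.7 can be sketched as a count of satisfied criteria followed by a fallback chain. The segment layout below (a dictionary with `start`, `end`, and three boolean flags) is an assumption of this sketch; the text does not define how the flags are computed numerically.

```python
from typing import Dict, List

def importance(seg: Dict) -> str:
    """Grade a segment by how many of the three criteria it meets."""
    met = sum([seg["duration_ok"], seg["energy_ok"], seg["band_ok"]])
    return "high" if met == 3 else "medium" if met == 2 else "low"

def locate_key_segments(segments: List[Dict]) -> List[Dict]:
    """Return the key segment localization result for the given segments."""
    high = [s for s in segments if importance(s) == "high"]
    if high:
        return high
    medium = [s for s in segments if importance(s) == "medium"]
    if medium:
        return medium
    if not segments:
        return []
    # Only low-importance segments exist: keep the longest one.
    return [max(segments, key=lambda s: s["end"] - s["start"])]
```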

[0071] The starting position in the key segment localization result is extended forward, and the ending position in the key segment localization result is extended backward. For example, the extension time is 0.05s-0.2s. When the starting position is extended forward beyond the starting boundary of the candidate speech segment, the starting boundary of the candidate speech segment is taken as the extended starting position. When the ending position is extended backward beyond the ending boundary of the candidate speech segment, the ending boundary of the candidate speech segment is taken as the extended ending position. The speech interval between the extended starting position and the extended ending position is taken as the key segment protection window.
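
The boundary expansion is an extend-and-clamp operation. A minimal sketch follows, assuming times in seconds and an extension of 0.1 s picked from the 0.05s-0.2s example range; the function and parameter names are hypothetical.

```python
from typing import Tuple

def protection_window(key_start: float, key_end: float,
                      cand_start: float, cand_end: float,
                      extension: float = 0.1) -> Tuple[float, float]:
    """Extend the key segment on both sides, clamped to the candidate segment."""
    # Extend forward in time before the start, but not past the candidate start.
    start = max(cand_start, key_start - extension)
    # Extend backward in time after the end, but not past the candidate end.
    end = min(cand_end, key_end + extension)
    return start, end
```

With a candidate segment spanning 0.2 s to 2.0 s, a key segment from 0.25 s to 1.0 s would be extended to roughly 0.2 s to 1.1 s, since the forward extension is clamped at the candidate's starting boundary.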

[0072] S4: Enhance the speech to be recognized through the key segment protection window to generate enhanced speech, perform speech recognition on the enhanced speech, and generate target recognition results.

[0073] S4.1: Perform frequency band preservation and noise suppression on the speech to be recognized covered by the key segment protection window to generate a protection enhancement segment; perform conventional noise reduction on the speech to be recognized outside the key segment protection window to generate a background enhancement segment.

[0074] The starting and ending positions of the key segment protection window are read using the hearing aid's accompanying testing terminal. The speech to be recognized is segmented into the speech interval covered by the key segment protection window and the speech interval outside the key segment protection window. The speech interval covered by the key segment protection window is divided into multiple frequency bands according to frequency; the frequency bands with concentrated speech energy are retained, and the noise components in the frequency bands without concentrated speech energy are reduced. The retained speech frequency bands and the noise-suppressed frequency bands are recombined to generate a protection enhancement segment. The speech interval outside the key segment protection window is divided into multiple noise reduction segments in time order. The background noise intensity in each noise reduction segment is measured, and the noise components in each noise reduction segment are weakened in descending order of background noise intensity, while the speech components in each noise reduction segment are retained. The noise reduction segments are spliced together in the original time order to generate a background enhancement segment.
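
One way to picture the band handling inside the protection window is as a per-band gain: bands judged to hold concentrated speech energy pass unchanged, and the rest are attenuated. The sketch below is a simplification under stated assumptions: "concentrated" is approximated by a band's share of total energy, and the 0.2 share threshold and 0.5 attenuation gain are illustrative values that do not come from the text.

```python
from typing import List

def band_gains(band_energy: List[float],
               share_threshold: float = 0.2,
               attenuation: float = 0.5) -> List[float]:
    """Return one gain per frequency band for the protected interval."""
    total = sum(band_energy)
    if total == 0:
        # Silent interval: nothing to preserve or suppress.
        return [1.0] * len(band_energy)
    gains = []
    for energy in band_energy:
        if energy / total >= share_threshold:
            gains.append(1.0)          # retain band with concentrated energy
        else:
            gains.append(attenuation)  # reduce noise-dominated band
    return gains
```

Applying these gains to the band signals and summing the bands again would correspond to recombining the retained and suppressed bands into the protection enhancement segment.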

[0075] S4.2: The protection enhancement segment and the background enhancement segment are sequentially concatenated to generate enhanced speech; the enhanced speech is then subjected to feature extraction to obtain the recognition speech features.

[0076] The start and end positions of the protection enhancement segment and the background enhancement segment in the speech to be recognized are read using the hearing aid's accompanying testing terminal. The protection enhancement segment is placed in the speech interval covered by the key segment protection window, and the background enhancement segment is placed in the speech interval outside the key segment protection window. The protection enhancement segment and the background enhancement segment are connected according to their time order in the speech to be recognized to generate enhanced speech.

[0077] The enhanced speech is divided into continuous speech frames in chronological order. The speech energy changes, frequency distributions, and temporal changes in each continuous speech frame are extracted to obtain the frame features of each continuous speech frame. The frame features of each continuous speech frame are then combined in chronological order to obtain the recognition speech features.
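
A reduced sketch of the framing step is shown below. It covers only two of the three feature kinds named above, namely per-frame energy and its change from the previous frame; the frequency distribution is omitted to keep the example dependency-free. Non-overlapping frames, the frame length, and the function name are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

def frame_features(samples: Sequence[float],
                   frame_len: int = 4) -> List[Tuple[float, float]]:
    """Return (mean energy, energy change) for each complete frame in order."""
    # Split into non-overlapping frames; a trailing partial frame is dropped.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    # Mean squared amplitude as the frame energy.
    energies = [sum(x * x for x in frame) / frame_len for frame in frames]
    # Temporal change: difference from the previous frame (0.0 for the first).
    deltas = [0.0] + [b - a for a, b in zip(energies, energies[1:])]
    return list(zip(energies, deltas))
```

The returned list, kept in chronological order, stands in for the combined frame features.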

[0078] S4.3: Use task type labels to perform label-guided decoding of the recognition speech features to generate initial recognition content; perform result judgment on the initial recognition content to generate target recognition results.

[0079] The task type content in the task type label is read by the hearing aid's accompanying testing terminal. When the task type content is a digit recognition task, the digit speech feature segments in the recognition speech features are extracted and decoded according to the order of the digit combination to generate digit recognition content. When the task type content is a control command recognition task, the command speech feature segments in the recognition speech features are extracted and decoded according to the continuous expression order of the command speech to generate control command recognition content. The digit recognition content and control command recognition content are combined to form the initial recognition content.

[0080] The method checks whether the speech content in the initial recognition content is complete, whether the speech sequence is correct, and whether the speech content meets the recognition requirements of the current task type. When the initial recognition content meets the requirements of completeness, sequence, and task type, the initial recognition content is determined as the target recognition result, thus completing a hearing aid speech recognition enhancement method for noisy scenarios based on diatonic screening and binaural pure tone audiometry.

[0081] This embodiment also provides a computer device applicable to the hearing aid speech recognition enhancement method for noisy scenes, including: a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the hearing aid speech recognition enhancement method for noisy scenes as proposed in the above embodiment.

[0082] The computer device can be a terminal, comprising a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.

[0083] This embodiment also provides a storage medium storing a computer program, which, when executed by a processor, implements the hearing aid speech recognition enhancement method for noisy scenes as proposed in the above embodiments. The storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0084] In summary, this invention generates diatonic screening results by performing state discrimination, avoiding the coarse judgments caused by relying solely on a single accuracy indicator. By merging and organizing the feature sets, it generates a subject's auditory assessment profile, integrating functional identification results with frequency-based hearing threshold results. By determining the task type with the highest matching degree as the current task type and generating a task type label, it enables task-oriented processing based on the different requirements of digit recognition or control command recognition, thereby making enhancement and recognition more targeted. By expanding the boundaries and generating a key segment protection window, it achieves accurate framing and appropriate expansion protection of key segments, providing a clear target for subsequent differentiated enhancement processing. By judging the results, it generates target recognition results, improving the effectiveness of hearing aid speech recognition in noisy environments.

[0085] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A hearing aid speech recognition enhancement method for noisy scenes, characterized in that it includes: using the hearing aid's accompanying testing terminal to conduct a diatonic test on a subject under background noise, generating test feedback information, and performing screening and judgment to generate diatonic screening results; when the diatonic screening result is a fail, performing binaural pure tone audiometry on the subject to generate binaural frequency band hearing threshold results; extracting auditory features from the failed diatonic screening results and the binaural frequency band hearing threshold results to generate an auditory assessment profile of the subject; performing type matching on the auditory assessment profile of the subject to generate task type labels; based on the task type labels, performing speech activity detection on the speech to be recognized by the hearing aid to generate candidate speech segments; locating key segments in the candidate speech segments and generating key segment protection windows; enhancing the speech to be recognized through the key segment protection windows to generate enhanced speech; and performing speech recognition on the enhanced speech to generate the target recognition result.

2. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 1, characterized in that: The specific steps for generating test feedback information are as follows: The test terminal for the hearing aid outputs background noise and diatonic test data to the subject's ears, controls the sound pressure level of the background noise and the playback rhythm of the diatonic test data, identifies and detects the diatonic test data, and generates test response information. Record the content, timing, and results of the responses in the test response information to generate the original feedback record; The test feedback information is generated by statistically analyzing the number of correct answers, incorrect answers, and missed answers in the original feedback records.

3. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 2, characterized in that: The specific steps for generating the diatonic screening results are as follows: The answers in the test feedback information are compared for consistency with the diatonic test corpus to generate answer judgment information; Based on the answer judgment information, the accuracy rate and error distribution of the answers are statistically analyzed, and screening judgment information is output; State discrimination is performed on the screening judgment information to generate the diatonic screening result.

4. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 3, characterized in that the specific steps for generating the binaural frequency band hearing threshold results are as follows: when the two-syllable screening result is a fail, binaural pure tone test signals of different frequencies and intensities are generated and output to the subject's ears through the test terminal of the hearing aid; the subject's responses to the binaural pure tone test signals are detected to generate binaural auditory response data; and frequency band aggregation is performed on the binaural auditory response data to generate the binaural frequency band hearing threshold results.
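
The frequency band aggregation of claim 4 is not specified further in the claim. One plausible reading, sketched below, averages the per-frequency thresholds of each ear inside a few bands. The band limits in `BANDS`, the use of an arithmetic mean, and the input layout (ear, then frequency in Hz, then threshold in dB HL) are assumptions.

```python
from statistics import mean

# Hypothetical band limits in Hz (inclusive); the patent does not define them.
BANDS = {"low": (250, 500), "mid": (1000, 2000), "high": (4000, 8000)}


def aggregate_bands(thresholds_by_ear):
    """Average per-frequency hearing thresholds into per-band thresholds.

    `thresholds_by_ear` maps an ear name to {frequency_hz: threshold_db}.
    Bands with no tested frequency are left out of the result for that ear.
    """
    result = {}
    for ear, thresholds in thresholds_by_ear.items():
        result[ear] = {}
        for band, (low_hz, high_hz) in BANDS.items():
            values = [db for hz, db in thresholds.items() if low_hz <= hz <= high_hz]
            if values:
                result[ear][band] = mean(values)
    return result
```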

5. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 4, characterized in that the specific steps for generating the auditory assessment profile of the subject are as follows: accuracy features, error distribution features, and noise scene identification features are extracted from the failed two-syllable screening result and summarized to form a screening feature set; left and right ear frequency band sensitivity features, frequency threshold change features, and binaural difference features are extracted from the binaural frequency band hearing threshold results to obtain a hearing threshold feature set; and the screening feature set and the hearing threshold feature set are merged and organized to generate the auditory assessment profile of the subject.
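
The hearing threshold feature set of claim 5 could be derived from per-band thresholds roughly as sketched here. The claim names the three feature groups but not their formulas, so the definitions used below (raw band thresholds as sensitivity, differences between adjacent bands as threshold change, left minus right as binaural difference) and the band names are assumptions.

```python
def hearing_threshold_features(band_thresholds, band_order=("low", "mid", "high")):
    """Build an assumed hearing threshold feature set from band thresholds.

    `band_thresholds` maps "left" and "right" to {band_name: threshold_db}.
    """
    features = {"band_sensitivity": {}, "threshold_change": {}, "binaural_difference": {}}
    for ear in ("left", "right"):
        levels = [band_thresholds[ear][band] for band in band_order]
        # Sensitivity: the band thresholds themselves, per ear.
        features["band_sensitivity"][ear] = dict(zip(band_order, levels))
        # Threshold change: difference between each pair of adjacent bands.
        features["threshold_change"][ear] = [b - a for a, b in zip(levels, levels[1:])]
    for band in band_order:
        # Binaural difference: left threshold minus right threshold, per band.
        features["binaural_difference"][band] = (
            band_thresholds["left"][band] - band_thresholds["right"][band]
        )
    return features
```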

6. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 5, characterized in that the specific steps for generating the task type label are as follows: noise scene identification features, left and right ear frequency band sensitivity features, and binaural difference features are extracted from the auditory assessment profile of the subject to form a task matching feature set; the task matching feature set is compared with preset digit recognition constraints and control command recognition constraints to obtain a matching degree for each task type; and the task type with the highest matching degree is identified as the current task type and labeled to generate the task type label.
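
The selection of the best-matching task type in claim 6 can be illustrated with the sketch below. The constraint format (one allowed range per feature), the feature names, every numeric limit in `TASK_CONSTRAINTS`, and the definition of matching degree as the fraction of satisfied constraints are hypothetical; the claim only requires that some matching degree be computed and the highest one chosen.

```python
# Hypothetical preset constraints: feature name -> (minimum, maximum) allowed value.
TASK_CONSTRAINTS = {
    "digit_recognition": {
        "noise_level_db": (0, 60),
        "high_band_threshold_db": (0, 40),
        "binaural_difference_db": (0, 10),
    },
    "control_command_recognition": {
        "noise_level_db": (50, 90),
        "high_band_threshold_db": (30, 80),
        "binaural_difference_db": (0, 30),
    },
}


def matching_degree(features, constraints):
    """Fraction of constraints whose range contains the feature value."""
    satisfied = sum(
        1
        for name, (low, high) in constraints.items()
        if name in features and low <= features[name] <= high
    )
    return satisfied / len(constraints)


def select_task_type(features, task_constraints=TASK_CONSTRAINTS):
    """Return the task type label with the highest matching degree, plus all degrees."""
    degrees = {task: matching_degree(features, c) for task, c in task_constraints.items()}
    # On a tie, max() keeps the task type listed first in the constraint table.
    best = max(degrees, key=degrees.get)
    return best, degrees
```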

7. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 6, characterized in that the specific steps for generating the candidate speech segments are as follows: rule matching is performed on the task type label to generate speech activity detection rules; the speech to be recognized is acquired from the hearing aid, and endpoint detection and activity interval detection are performed on it according to the speech activity detection rules to generate speech activity intervals; and segment splicing is performed on the speech activity intervals to generate the candidate speech segments.
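
A simple energy-based reading of claim 7 is sketched below. The claim does not say which detector is used, so the frame-energy threshold, the rule table `VAD_RULES`, its numeric values, and the gap-based splicing criterion are assumptions. Intervals are half-open frame index ranges `(start, end)`.

```python
# Hypothetical speech activity detection rules per task type label.
VAD_RULES = {
    "digit_recognition": {"energy_threshold": 0.02, "max_gap_frames": 2},
    "control_command_recognition": {"energy_threshold": 0.05, "max_gap_frames": 4},
}


def detect_intervals(frame_energies, threshold):
    """Endpoint detection: runs of frames whose energy reaches the threshold."""
    intervals, start = [], None
    for index, energy in enumerate(frame_energies):
        if energy >= threshold and start is None:
            start = index
        elif energy < threshold and start is not None:
            intervals.append((start, index))
            start = None
    if start is not None:
        intervals.append((start, len(frame_energies)))
    return intervals


def splice_intervals(intervals, max_gap):
    """Segment splicing: merge intervals separated by at most `max_gap` frames."""
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def candidate_segments(frame_energies, task_type):
    """Apply the rule matched to the task type label and return candidate segments."""
    rule = VAD_RULES[task_type]
    intervals = detect_intervals(frame_energies, rule["energy_threshold"])
    return splice_intervals(intervals, rule["max_gap_frames"])
```

With these assumed rules, the same input can yield different candidate segments for different task type labels, because the tolerated pause between activity intervals differs.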

8. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 7, characterized in that the specific steps for generating the key segment protection window are as follows: task type information is extracted from the task type label, and speech segments that match the task type information are identified in the candidate speech segments to generate target speech segments of interest; duration statistics, energy change statistics, and frequency band distribution statistics are performed on the target speech segments of interest to generate segment statistical results; the importance of the target speech segments of interest is compared using the segment statistical results to generate a key segment localization result; and the start and end positions in the key segment localization result are expanded to generate the key segment protection window.
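
The localization and window expansion of claim 8 can be sketched as follows. The importance score (duration multiplied by mean energy), the `margin` parameter, and the clamping to the signal bounds are assumptions; frequency band distribution statistics are omitted here for brevity. Segments are half-open frame index ranges and are assumed to be non-empty.

```python
def segment_statistics(frame_energies, segment):
    """Duration and energy statistics for one segment of interest."""
    start, end = segment
    frames = frame_energies[start:end]
    return {
        "duration": end - start,
        "energy_change": max(frames) - min(frames),
        "mean_energy": sum(frames) / len(frames),
    }


def locate_key_segment(frame_energies, segments):
    """Pick the segment with the highest assumed importance score."""
    def importance(segment):
        stats = segment_statistics(frame_energies, segment)
        return stats["duration"] * stats["mean_energy"]  # hypothetical score
    return max(segments, key=importance)


def protection_window(segment, margin, total_frames):
    """Expand the segment by `margin` frames on both sides, within signal bounds."""
    start, end = segment
    return max(0, start - margin), min(total_frames, end + margin)
```

Expanding the boundaries keeps the onset and decay of the key segment inside the protected region even when endpoint detection is slightly off.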

9. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 8, characterized in that the specific steps for generating the enhanced speech are as follows: band preservation and noise suppression are performed on the speech to be recognized that is covered by the key segment protection window to generate protection-enhanced segments; standard noise reduction is performed on the speech to be recognized outside the key segment protection window to generate background-enhanced segments; and the protection-enhanced segments and the background-enhanced segments are concatenated in time order to generate the enhanced speech.
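
The region-dependent processing and time-order concatenation of claim 9 can be illustrated with the sketch below. It substitutes a simple sample-level noise gate for both the band-preserving suppression and the standard noise reduction, which the claim does not specify; the gate threshold, both attenuation factors, and the function names are assumptions. The window is a half-open sample index range.

```python
def noise_gate(samples, threshold, attenuation):
    """Placeholder processor: attenuate samples whose magnitude is below the threshold."""
    return [x if abs(x) >= threshold else x * attenuation for x in samples]


def enhance(samples, window, gate_threshold=0.05,
            protected_attenuation=0.8, background_attenuation=0.2):
    """Process the protected window gently and the rest more aggressively.

    The three processed pieces are concatenated in their original time order,
    so the output has the same length as the input.
    """
    start, end = window
    before = noise_gate(samples[:start], gate_threshold, background_attenuation)
    protected = noise_gate(samples[start:end], gate_threshold, protected_attenuation)
    after = noise_gate(samples[end:], gate_threshold, background_attenuation)
    return before + protected + after
```

In this sketch, low-level samples inside the window keep most of their amplitude, which stands in for the idea that weak but informative parts of a key pronunciation segment should not be removed together with the noise.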

10. The hearing aid speech recognition enhancement method for noisy scenes as described in claim 9, characterized in that the specific steps for generating the target recognition result are as follows: feature extraction is performed on the enhanced speech to obtain recognition speech features; label decoding is performed on the recognition speech features using the task type label to generate initial recognition content; and result judgment is performed on the initial recognition content to generate the target recognition result.