Sound recognition and synthesis method for hearing aids

Feedback shadow isolation and an improved YAMNet model are used to construct a feedback-free auditory embryo, addressing acoustic feedback and processing delay in hearing aids and providing stable, coordinated output of speech and ambient sound.

CN122201246A · Pending · Publication Date: 2026-06-12 · BOLI ZHITONG TECHNOLOGY (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BOLI ZHITONG TECHNOLOGY (SHANGHAI) CO LTD
Filing Date
2026-04-20
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing hearing aids suffer from acoustic feedback problems under high-gain conditions, resulting in howling. Furthermore, the introduction of sound recognition and synthesis processing lengthens the system processing chain, affecting real-time performance and auditory continuity.

Method used

The method introduces a feedback shadow isolation mechanism, a feedback-free auditory embryo construction method based on an improved YAMNet model, and hearing loss-constrained dual-stream reconstruction with delay compensation. Feedback shadow components are identified and isolated. A speech-environment coupling constraint graph is constructed, and dominant event competition screening and accompanying event attachment processing are performed to generate a feedback-free auditory embryo. Frequency band weight redistribution and delay compensation are then performed.

Benefits of technology

It effectively reduces the probability of feedback, improves the reliability and robustness of hearing aids, enables the coordinated expression of speech and ambient sound, and enhances the user's auditory experience and the stability of output sound.

Generated by Eureka AI based on patent content.

Figure CN122201246A_ABST

Abstract

The application discloses a sound recognition and synthesis method for a hearing aid, comprising the following steps: collecting external sound signals and obtaining a receiver driving reference signal to generate a sound analysis frame sequence and a reference driving frame sequence; identifying feedback shadow components and isolating them to generate an effective sound sequence; inputting the effective sound sequence into an improved YAMNet model to obtain frame-level embedding vectors, and performing prototype clustering and boundary detection to generate prototype node sequences; constructing a speech-environment coupling constraint graph and, in combination with a loudness trajectory, screening dominant events and attaching accompanying events to generate a feedback-free auditory embryo; obtaining a hearing loss profile to perform frequency band weight adjustment and dynamic range constraint, and performing dual-stream reconstruction on the dominant and accompanying events; and performing delay compensation and gain control on the output sound before driving the receiver to complete the final sound output. Through feedback shadow isolation and improved YAMNet modeling, the application achieves stable, feedback-free hearing aid output and coordinated perception of speech and the environment.

Description

Technical Field

[0001] This invention relates to the field of audio signal processing technology, and in particular to a method for sound recognition and synthesis for hearing aids.

Background Technology

[0002] With the development of hearing-assistance technology, hearing aids have evolved from traditional analog amplification devices to intelligent devices based on digital signal processing. Current hearing aids typically collect external sound signals through microphones, and after noise reduction, gain control, and bandwidth compensation, directly drive the receiver output to improve the user's perception of speech signals. Some improved solutions further incorporate adaptive feedback suppression algorithms and speech enhancement algorithms to improve speech clarity and wearing comfort in complex environments. With the development of artificial intelligence technology, existing solutions attempt to introduce speech recognition and speech synthesis technologies into hearing aids to replace the traditional direct amplification path, thereby improving speech intelligibility to some extent.

[0003] However, existing hearing aids commonly suffer from acoustic feedback under high-gain operating conditions. This means that the receiver's output sound leaks back to the microphone via the acoustic path, is re-collected and processed, forming a closed-loop amplification that produces howling. Current technologies often employ adaptive feedback cancellation algorithms to suppress this problem, but these rely on estimating the feedback path and struggle to maintain stable performance in dynamic wearing environments. When hearing aids incorporate sound recognition and synthesis architectures, the leaked sound may still enter the recognition process, being misinterpreted as real speech or ambient sound, leading to incorrect reconstruction and repeated amplification, resulting in distorted, jittery, or unstable output sound.

[0004] Existing technologies, when dealing with the relationship between speech and ambient sound, typically only enhance the speech, lacking structured modeling and controllable reconstruction mechanisms for the ambient sound. This results in outputs that either ignore environmental information and reduce realism, or fail to distinguish between primary and secondary information, leading to information aliasing. Introducing recognition and synthesis processes lengthens the system's processing chain, easily causing accumulated latency and affecting real-time performance and auditory continuity.

[0005] Therefore, how to provide a sound recognition and synthesis method for hearing aids is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0006] One objective of this invention is to propose a sound recognition and synthesis method for hearing aids. The invention sets out a complete processing flow for feedback-free sound recognition and reconstructed output under high-gain conditions by introducing a feedback shadow isolation mechanism, a feedback-free auditory embryo construction method based on an improved YAMNet model, and hearing loss-constrained dual-stream reconstruction and delay compensation. It avoids misidentification of feedback components while performing structured modeling and collaborative reconstruction of speech and ambient sound, and offers strong anti-feedback capability, high speech intelligibility, realistic environmental perception, and good output stability.

[0007] A sound recognition and synthesis method for a hearing aid according to an embodiment of the present invention includes:

[0008] The external sound signals are collected through the hearing aid microphone, the output reference signal of the receiver driver is obtained, and the external sound signals and the output reference signal are processed to generate a sound analysis frame sequence and a reference driver frame sequence.

[0009] Based on the time delay correspondence, frequency band energy correspondence, and phase following relationship between the sound analysis frame sequence and the reference driving frame sequence, the feedback shadow component is identified, and the corresponding feedback shadow component in the sound analysis frame sequence is isolated to generate an effective sound sequence.

[0010] The effective sound sequence is input into the improved YAMNet model to obtain the corresponding frame-level embedding vector. Online prototype clustering and self-attention boundary detection are performed on the frame-level embedding vector to dynamically generate a sequence of speech prototype nodes and an environment prototype node sequence with timestamps.

[0011] Based on the speech prototype node sequence and the environment prototype node sequence, a speech-environment coupling constraint graph is constructed. The dominant event competition screening and accompanying event attachment processing are performed on the speech-environment coupling constraint graph by combining the loudness fluctuation trajectory of the effective sound sequence, generating a feedback-free auditory embryo containing a dominant event chain, an accompanying event chain and a loudness envelope chain.

[0012] The system acquires a user's hearing loss profile, performs frequency band weight redistribution, dynamic range constraint, and key speech component enhancement on the feedback-free auditory embryo, and performs dual-stream reconstruction processing on the dominant event chain and the accompanying event chain to generate the target output sound.

[0013] Delay compensation processing is performed on the target output sound. The delay-compensated sound signal is then input into the amplification unit for gain control before driving the receiver output.

[0014] Optionally, the external sound signal includes a speech signal, an ambient sound signal, and a background noise signal. The speech signal includes continuous speech segments and discontinuous speech segments. The ambient sound signal includes traffic sounds, mechanical sounds, animal sounds, and music. The background noise signal includes steady-state noise and non-steady-state noise. The output reference signal includes a receiver drive electrical signal, a corresponding receiver diaphragm displacement change signal, and a return acoustic response signal of the receiver output sound within the hearing aid housing.

[0015] Optionally, the processing of the external sound signal and the output reference signal to generate a sound analysis frame sequence and a reference driving frame sequence includes:

[0016] External sound signals and output reference signals are synchronously sampled at a uniform sampling frequency to obtain corresponding discrete time sequences. The discrete time sequences are then divided into frames according to a preset frame length and frame shift to generate a continuous time frame sequence. Windowing is applied to each time frame to reduce inter-frame leakage, resulting in windowed time frames. Frequency banding is then performed on the windowed time frames to obtain multi-frequency band component sequences. Amplitude normalization is then performed on the multi-frequency band component sequences to generate a sound analysis frame sequence and the reference driving frame sequence.
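The framing, windowing, band-splitting, and normalization steps above can be illustrated with a minimal sketch. The specification does not fix the frame length, frame shift, window type, or band layout, so the Hann window, the naive DFT, the equal-width bands, and every function name and default value below are assumptions chosen only for illustration. The same routine would be applied to both the microphone signal and the output reference signal.

```python
import cmath
import math


def frame_signal(samples, frame_len, frame_shift):
    """Split a discrete time sequence into overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]


def hann_window(frame):
    """Taper the frame edges with a Hann window (assumed window type)."""
    n = len(frame)
    return [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def band_magnitudes(frame, num_bands):
    """Naive DFT, then average the bin magnitudes inside each equal-width band."""
    n = len(frame)
    half = n // 2
    mags = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(acc))
    width = half // num_bands
    return [sum(mags[b * width:(b + 1) * width]) / width for b in range(num_bands)]


def normalize(bands, eps=1e-12):
    """Amplitude normalization: scale so the largest band is approximately 1."""
    peak = max(bands)
    return [b / (peak + eps) for b in bands]


def analysis_frames(samples, frame_len=64, frame_shift=32, num_bands=4):
    """Frame, window, band-split, and normalize one signal."""
    return [normalize(band_magnitudes(hann_window(f), num_bands))
            for f in frame_signal(samples, frame_len, frame_shift)]
```

Calling `analysis_frames` once on the microphone samples and once on the reference samples, with identical parameters, yields two frame sequences that line up frame by frame and band by band.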

[0017] Optionally, the step of identifying the feedback shadow components and performing isolation processing on the corresponding feedback shadow components in the sound analysis frame sequence to generate an effective sound sequence includes:

[0018] The sound analysis frame sequence and the reference driving frame sequence are aligned frame by frame according to the same time frame order. Within each time frame, the corresponding driving change segment, energy fluctuation segment and continuous state segment are extracted according to the frequency band partition to generate a reference projection segment sequence that corresponds to the sound analysis frame sequence frame by frame.

[0019] Within each frequency band of each time frame, the sound analysis frame sequence is compared with the reference projection segment sequence to identify the sound components that simultaneously satisfy the following conditions: the component follows the reference drive, its frequency band position remains continuously locked, and its energy fluctuations change synchronously with the reference. These sound components are marked as initial feedback shadow candidate components.

[0020] Perform backflow closed-loop verification on the initial feedback shadow candidate components along continuous time frames to identify candidate components that continue to rise after the reference drive is enhanced, have delayed attenuation after the reference drive is weakened, and still maintain frequency band lingering in the speech discontinuity area or environmental event switching area. The candidate components that pass the backflow closed-loop verification are determined as feedback shadow components.

[0021] The feedback shadow components are aggregated according to the time continuity relationship, frequency band expansion relationship and intensity transmission relationship to form feedback shadow trajectory segments. The center frequency band of each feedback shadow trajectory segment is used as the main isolation band and the adjacent traction frequency band is used as the accompanying isolation band to generate the corresponding feedback shadow isolation region.

[0022] Hierarchical isolation processing is performed on the components of the sound analysis frame sequence that fall into the feedback shadow isolation region. Specifically, the feedback shadow component in the main isolation band is suppressed and stripped, and the feedback shadow component in the accompanying isolation band is attenuated with edge preservation. The isolated time frames are then spliced together continuously to generate an effective sound sequence.
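As a rough illustration of the identification and hierarchical isolation steps, the sketch below flags a band when its microphone energy track follows the reference drive track after a fixed lag, then applies a strong gain reduction in the flagged (main) band and a milder one in its neighbours (accompanying bands). The Pearson-correlation test stands in for the three conditions and the closed-loop verification described above; it is not the patented criterion. The lag, threshold, gain values, and function names are all hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either track is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def shadow_bands(mic, ref, lag=1, threshold=0.9):
    """Return band indices whose mic energy follows the reference `lag` frames later.

    `mic` and `ref` are equal-length lists of per-frame band energy lists.
    """
    flagged = []
    for b in range(len(mic[0])):
        mic_track = [frame[b] for frame in mic[lag:]]
        ref_track = [frame[b] for frame in ref[:len(ref) - lag]]
        if pearson(mic_track, ref_track) >= threshold:
            flagged.append(b)
    return flagged


def isolation_gains(num_bands, main_bands, main_gain=0.0, side_gain=0.5):
    """Main isolation bands are stripped; adjacent bands are only weakened."""
    gains = [1.0] * num_bands
    for b in main_bands:
        for nb in (b - 1, b + 1):
            if 0 <= nb < num_bands and nb not in main_bands:
                gains[nb] = min(gains[nb], side_gain)
    for b in main_bands:
        gains[b] = main_gain
    return gains


def isolate(mic, gains):
    """Apply the per-band gains to every frame."""
    return [[v * g for v, g in zip(frame, gains)] for frame in mic]
```

A practical system would compute the gains per trajectory segment rather than once for the whole sequence, and would smooth them over time so that the spliced frames remain continuous.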

[0023] Optionally, the step of performing online prototype clustering and self-attention boundary detection on the frame-level embedding vectors to dynamically generate timestamped speech prototype node sequences and environment prototype node sequences includes:

[0024] The effective sound sequence is continuously segmented according to the preset frame length and frame shift. Log-Mel spectrum transformation, frequency band energy normalization and time ordering are performed on each time frame in sequence to generate frame-level acoustic feature maps that correspond one-to-one with the time frames.

[0025] An improved YAMNet model is constructed, comprising an input shaping layer, a backflow residue suppression layer, a cross-frame prototype coding layer, and a self-attention boundary detection layer, wherein:

[0026] The input shaping layer receives frame-level acoustic feature maps and performs channel unrolling processing to generate input feature blocks of uniform size;

[0027] The backflow residue suppression layer is set before the original YAMNet backbone network. Based on the narrowband continuous enhancement relationship, frequency band tail relationship and abrupt loss relationship between adjacent time frames, the feature response corresponding to the residual backflow component is reduced to generate a clean feature map.

[0028] The cross-frame prototype coding layer is set after the original YAMNet backbone network. It performs cross-frame convergence processing on the high-level embedding results of multiple consecutive time frames to generate a frame-level embedding vector containing the features of the current frame, the continuation features of the previous frame, and the transition features of the subsequent frame.

[0029] The self-attention boundary detection layer is connected to the cross-frame prototype coding layer. It jointly calculates the correlation strength, class transfer strength and energy fluctuation synchronization between adjacent frame-level embedding vectors to generate boundary detection results.

[0030] The frame-level acoustic feature map is input into the improved YAMNet model to obtain a sequence of frame-level embedding vectors arranged in time order;

[0031] Perform online prototype clustering on frame-level embedding vector sequences:

[0032] Read the frame-level embedding vectors in chronological order, establish prototype centers and perform similarity merging and updating. For frame-level embedding vectors that do not meet the merging conditions, establish new prototype centers and perform merging processing on prototype centers that are not chronologically continuous and have short durations to generate candidate prototype sequences.

[0033] Based on the boundary detection results output by the self-attention boundary detection layer, boundary correction processing is performed on the candidate prototype sequence:

[0034] Event boundary points are established at locations where boundary detection results indicate enhanced category transfer and synchronous changes in energy fluctuations. Event boundary points are eliminated at locations where a continuous relationship exists. Based on the dominant category of each candidate prototype, the candidate prototypes are divided into speech prototype nodes and environment prototype nodes. The start time, end time, and duration interval of each prototype node are recorded in chronological order to generate a sequence of speech prototype nodes and an environment prototype node sequence with timestamps.
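The online prototype clustering step can be sketched as a greedy pass over the embedding sequence. This is a simplification under stated assumptions: each new embedding is compared only with the most recent prototype center (the specification also merges non-contiguous prototypes), cosine similarity is used as the merge test, and the threshold, frame shift, and dictionary layout are hypothetical. The embeddings would come from the improved YAMNet model, which is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def online_prototypes(embeddings, frame_shift_s=0.01, sim_threshold=0.8):
    """Read embeddings in time order and build timestamped candidate prototypes."""
    nodes = []
    for i, vec in enumerate(embeddings):
        t = i * frame_shift_s
        if nodes and cosine(nodes[-1]["center"], vec) >= sim_threshold:
            # Similar enough: merge into the current prototype (running mean).
            node = nodes[-1]
            n = node["count"]
            node["center"] = [(c * n + v) / (n + 1) for c, v in zip(node["center"], vec)]
            node["count"] = n + 1
            node["end"] = t + frame_shift_s
        else:
            # Merge condition not met: open a new prototype center.
            nodes.append({"center": list(vec), "count": 1,
                          "start": t, "end": t + frame_shift_s})
    return nodes


def merge_short(nodes, min_frames=2):
    """Fold prototypes shorter than `min_frames` into the preceding prototype.

    Only the time span and frame count are merged; the center is left unchanged.
    """
    merged = []
    for node in nodes:
        if merged and node["count"] < min_frames:
            merged[-1]["end"] = node["end"]
            merged[-1]["count"] += node["count"]
        else:
            merged.append(dict(node))
    return merged
```

Each resulting node carries a start time and an end time, which is the information the boundary correction step then adjusts before the nodes are split into speech and environment prototypes.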

[0035] Optionally, generating a feedback-free auditory embryo comprising a dominant event chain, an accompanying event chain, and a loudness envelope chain includes:

[0036] Read the speech prototype node sequence and the environment prototype node sequence in chronological order, and identify the temporal adjacency relationship, co-occurrence intensity relationship and coverage substitution relationship between speech prototype nodes and environment prototype nodes. A speech prototype node and an environment prototype node whose start and end times are connected, whose duration intervals overlap, or whose energy changes are synchronized are established as a candidate coupling node pair.

[0037] Based on each candidate coupled node pair, a speech-environment coupling constraint graph is constructed. In the speech-environment coupling constraint graph, adjacency constraint edges are established for nodes with temporal connection relationships, co-occurrence constraint edges are established for nodes with synchronization enhancement relationships, and substitution constraint edges are established for nodes with time period coverage and category replacement relationships, thus generating a time-expanded coupling constraint structure.

[0038] Within each time-competition segment, a dominant event competition screening is performed on the candidate nodes in the speech-environment coupling constraint graph:

[0039] Candidate nodes that are continuous in their duration, have a stable energy framework, and maintain an event continuity relationship with the previous time-competitive segment are identified as dominant event nodes.

[0040] Candidate nodes that are not identified as dominant event nodes but have a temporal or energy-following relationship with the dominant event node are retained as accompanying event nodes;

[0041] Connect the dominant event nodes in chronological order to generate a dominant event chain;

[0042] Based on the start-end containment relationship, boundary fitting relationship and intensity dependency relationship between each accompanying event node and the corresponding dominant event node, the accompanying event node is attached to the front, back or middle section of the corresponding dominant event node to generate an accompanying event chain.

[0043] Based on the loudness fluctuations, peak-to-valley transitions, and sustained stable intervals of the effective sound sequence in each time segment, a loudness change trajectory arranged in chronological order is extracted. The loudness change trajectory is then aligned with the dominant event chain and the accompanying event chain on a unified time axis. Loudness control markers are applied to the segments covered by the dominant event chain, loudness yield markers are applied to the segments to which the accompanying event chain is attached, and loudness release markers are applied to the gap segments that contain no event. This generates a feedback-free auditory embryo containing the dominant event chain, the accompanying event chain, and the loudness envelope chain.
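To make the graph construction and loudness marking more concrete, the following sketch builds typed edges between speech and environment prototype nodes from their time intervals alone, and then labels time points as control, yield, or release. The interval-only rules are a deliberately crude stand-in: the specification also uses energy synchronization and category replacement, which are not modelled here. The `Node` type, the tolerance, and the edge rules are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str      # "speech" or "env"
    start: float   # seconds
    end: float     # seconds


def coupling_edges(speech_nodes, env_nodes, tol=0.05):
    """Return (speech, env, edge_type) triples for coupled node pairs."""
    edges = []
    for s in speech_nodes:
        for e in env_nodes:
            overlap = min(s.end, e.end) - max(s.start, e.start)
            if abs(s.end - e.start) <= tol or abs(e.end - s.start) <= tol:
                # One node ends where the other begins.
                edges.append((s.name, e.name, "adjacency"))
            elif overlap > tol:
                # Full coverage is treated as substitution, partial overlap as co-occurrence.
                covers = ((s.start <= e.start and e.end <= s.end)
                          or (e.start <= s.start and s.end <= e.end))
                edges.append((s.name, e.name, "substitution" if covers else "co-occurrence"))
    return edges


def loudness_markers(times, dominant, accompanying):
    """Label each time point: dominant coverage wins, then accompanying, else release."""
    def covered(t, nodes):
        return any(n.start <= t < n.end for n in nodes)

    markers = []
    for t in times:
        if covered(t, dominant):
            markers.append("control")
        elif covered(t, accompanying):
            markers.append("yield")
        else:
            markers.append("release")
    return markers
```

The dominant event competition itself (duration continuity, energy stability, and continuity with the previous segment) would sit between these two functions and decide which nodes are passed in as `dominant` and which as `accompanying`.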

[0044] Optionally, the step of performing dual-stream reconstruction processing on the dominant event chain and the accompanying event chain respectively to generate the target output sound includes:

[0045] Obtain a user's hearing loss profile, map the hearing loss profile to each time segment and frequency band region of the feedback-free auditory embryo, and generate hearing loss constraint distribution results;

[0046] Based on the hearing loss constraint distribution results, priority enhancement marking is performed on the speech-related frequency bands in the time segment where the dominant event chain is located, secondary preservation marking is performed on the environment-related frequency bands in the time segment where the accompanying event chain is located, and energy attenuation marking is performed on the no-event segment to generate zoning control results;

[0047] Based on the comfortable loudness range, the loudness envelope chain is subjected to segmented adjustment processing. Compression processing is performed in the time segment where the loudness is higher than the upper limit of comfort, enhancement processing is performed in the time segment where the loudness is lower than the lower limit of perception, and the original trend of change is maintained in the time segment where the loudness is within the comfortable range, thus generating the adjusted loudness envelope chain.

[0048] Based on the zonal control results and the adjusted loudness envelope chain, speech-dominant reconstruction processing is performed on the dominant event chain. The continuity of the speech structure is restored according to the temporal order of the speech prototype nodes. Directional enhancement is performed on the key frequency bands of the speech. Smooth transition processing is performed at the speech boundaries to generate a speech reconstruction stream. Environmental dependency reconstruction processing is performed on the accompanying event chain. Based on the attachment relationship between the accompanying event nodes and the dominant event nodes, the environmental events are subjected to continuation compensation, boundary fitting and energy following adjustment in the corresponding time segment to generate an environmental reconstruction stream.

[0049] The speech reconstruction stream and the environment reconstruction stream are fused together in a unified time sequence. The speech reconstruction stream is the main output in the dominant event segment, the environment reconstruction stream is superimposed in the accompanying event segment, and a smooth transition process is performed in the no-event segment to generate the target output sound.
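The segmented loudness adjustment and the final fusion of the two streams can be sketched as follows. The comfort limits, compression ratio, boost amount, mixing weight, gap gain, and marker names are hypothetical values chosen for the example; the specification does not give numeric parameters, and real streams would be audio frames rather than single numbers.

```python
def adjust_envelope(envelope_db, lower_db=40.0, upper_db=85.0, ratio=3.0, boost_db=6.0):
    """Compress above the comfort ceiling, boost below the perception floor."""
    adjusted = []
    for level in envelope_db:
        if level > upper_db:
            # Compression: only 1/ratio of the excess over the ceiling is kept.
            adjusted.append(upper_db + (level - upper_db) / ratio)
        elif level < lower_db:
            # Enhancement: raise quiet segments, but not past the floor itself.
            adjusted.append(min(level + boost_db, lower_db))
        else:
            # Inside the comfortable range the original trend is kept.
            adjusted.append(level)
    return adjusted


def fuse_streams(speech, env, markers, env_mix=0.5, gap_gain=0.25):
    """Merge the speech and environment reconstruction streams on one timeline."""
    out = []
    for s, e, m in zip(speech, env, markers):
        if m == "dominant":
            out.append(s)                   # speech is the main output
        elif m == "accompanying":
            out.append(s + env_mix * e)     # environment superimposed on speech
        else:
            out.append(gap_gain * (s + e))  # attenuated output in no-event segments
    return out
```

For instance, `adjust_envelope([30.0, 38.0, 60.0, 91.0])` leaves the 60 dB segment untouched, lifts the two quiet segments toward the 40 dB floor, and pulls the 91 dB peak down to 87 dB.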

[0050] Optionally, the step of performing delay compensation processing on the target output sound, and then inputting the delay-compensated sound signal into the amplification unit for gain control before driving the receiver output includes:

[0051] The processing time of each stage, including feedback shadow isolation processing, prototype node construction, feedback-free auditory embryo generation, and dual-stream reconstruction processing, is recorded and accumulated in chronological order to generate the total processing delay corresponding to the current target output sound.

[0052] The total processing delay is compared with the preset real-time output threshold. When the total processing delay is greater than the real-time output threshold, the delay compensation trigger segment and the corresponding time range are determined.

[0053] Within the delay compensation triggering section, a compensation segment is generated based on the dominant event chain, accompanying event chain, and loudness envelope chain of the previous stable time segment. The compensation segment is inserted into the corresponding position of the target output sound, and amplitude continuity processing and frequency band smoothing processing are performed on the connected region to generate continuous output sound.

[0054] The continuous output sound is input into the amplification unit, which performs frequency band gain adjustment according to the preset gain control parameters and sends the gain-adjusted sound signal to the receiver for output.
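The delay accumulation and compensation logic can be illustrated with a short sketch. It assumes that the compensation segment is built by repeating values from the previous stable segment, that it is placed in front of the late output, and that amplitude continuity is approximated by a short linear blend at the junction. The threshold, frame duration, fade length, and function name are hypothetical.

```python
import math


def compensate(output, last_stable, stage_delays_ms,
               threshold_ms=10.0, frame_ms=2.0, fade=2):
    """Insert a compensation segment when the accumulated delay exceeds the threshold."""
    # Accumulate the per-stage processing times into a total delay.
    excess = sum(stage_delays_ms.values()) - threshold_ms
    if excess <= 0 or not last_stable:
        return list(output)  # real-time budget met: pass through unchanged

    # Number of frames needed to cover the excess delay.
    n_fill = math.ceil(excess / frame_ms)
    fill = [last_stable[i % len(last_stable)] for i in range(n_fill)]

    # Blend the first few output values toward the end of the fill segment
    # so the junction has no amplitude jump.
    head = list(output)
    anchor = fill[-1]
    for i in range(min(fade, len(head))):
        w = (i + 1) / (fade + 1)
        head[i] = (1 - w) * anchor + w * head[i]
    return fill + head
```

With stage delays of 3, 4, 2, and 5 ms against a 10 ms threshold, the 4 ms excess is covered by two 2 ms fill frames, after which the output ramps from the fill level back to its own level.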

[0055] The beneficial effects of this invention are:

[0056] This invention introduces a feedback shadow isolation mechanism at the front end of the sound processing link. Before sound recognition, it identifies and isolates the feedback components generated by receiver backflow, ensuring that the signal entering subsequent processing does not contain feedback-dominant components. This prevents feedback signals from being re-identified and amplified at the source. Compared to existing methods that rely solely on feedback suppression algorithms, this invention does not depend on precise modeling of the feedback path. It maintains stable performance even in dynamic wearing environments and under high-gain conditions, effectively reducing the probability of howling and improving the overall reliability and robustness of the hearing aid.

[0057] This invention constructs a feedback-free auditory embryo based on an improved YAMNet model, structurally modeling speech and ambient sound information along a unified time axis. It distinguishes and organizes different sound components through dominant and accompanying event chains, ensuring the output sound not only has clear speech intelligibility but also retains the realism and continuity of ambient sound. Compared to existing methods that merely enhance speech or simply superimpose ambient sound, this invention achieves coordinated expression of speech and ambient sound, avoiding information aliasing or missing environmental information, and enhancing the user's auditory experience.

[0058] This invention incorporates user hearing loss characteristics into the sound reconstruction process through hearing loss-constrained dual-stream reconstruction and delay compensation mechanisms. It performs differentiated processing on different frequency bands and event types, and compensates for and smooths the output when the processing link delay exceeds a real-time threshold, ensuring both personalized adaptation and temporal continuity in the final output sound. Compared to the delay accumulation and output instability problems in existing recognition and synthesis schemes, this invention achieves real-time, stable, and natural sound output even under complex processing procedures.

Attached Figure Description

[0059] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:

[0060] Figure 1 is a flowchart of the sound recognition and synthesis method for hearing aids proposed in this invention;

[0061] Figure 2 is a schematic diagram of the feedback-free auditory embryo construction process of the sound recognition and synthesis method for hearing aids proposed in this invention.

Detailed Implementation

[0062] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0063] Referring to Figure 1 and Figure 2, the sound recognition and synthesis method for hearing aids includes:

[0064] The external sound signals are collected through the hearing aid microphone, the output reference signal of the receiver driver is obtained, and the external sound signals and the output reference signal are processed to generate a sound analysis frame sequence and a reference driver frame sequence.

[0065] Based on the time delay correspondence, frequency band energy correspondence, and phase following relationship between the sound analysis frame sequence and the reference driving frame sequence, the feedback shadow component is identified, and the corresponding feedback shadow component in the sound analysis frame sequence is isolated to generate an effective sound sequence.

[0066] The effective sound sequence is input into the improved YAMNet model to obtain the corresponding frame-level embedding vector. Online prototype clustering and self-attention boundary detection are performed on the frame-level embedding vector to dynamically generate a sequence of speech prototype nodes and an environment prototype node sequence with timestamps.

[0067] Based on the speech prototype node sequence and the environment prototype node sequence, a speech-environment coupling constraint graph is constructed. The dominant event competition screening and accompanying event attachment processing are performed on the speech-environment coupling constraint graph by combining the loudness fluctuation trajectory of the effective sound sequence, generating a feedback-free auditory embryo containing a dominant event chain, an accompanying event chain and a loudness envelope chain.

[0068] The system acquires a user's hearing loss profile, performs frequency band weight redistribution, dynamic range constraint, and key speech component enhancement on the feedback-free auditory embryo, and performs dual-stream reconstruction processing on the dominant event chain and the accompanying event chain to generate the target output sound.

[0069] Delay compensation processing is performed on the target output sound. The delay-compensated sound signal is then input into the amplification unit for gain control before driving the receiver output.

[0070] In this embodiment, the external sound signal includes a speech signal, an ambient sound signal, and a background noise signal. The speech signal includes continuous speech segments and discontinuous speech segments. The ambient sound signal includes traffic sounds, mechanical sounds, animal sounds, and music. The background noise signal includes steady-state noise and non-steady-state noise. The output reference signal includes a receiver drive electrical signal, a corresponding receiver diaphragm displacement change signal, and a return acoustic response signal of the receiver output sound within the hearing aid housing.

[0071] In this embodiment, the process of processing the external sound signal and the output reference signal to generate a sound analysis frame sequence and a reference driving frame sequence includes:

[0072] External sound signals and output reference signals are synchronously sampled at a uniform sampling frequency to obtain corresponding discrete time sequences. The discrete time sequences are then divided into frames according to a preset frame length and frame shift to generate a continuous time frame sequence. Windowing is applied to each time frame to reduce inter-frame leakage, resulting in windowed time frames. Frequency banding is then performed on the windowed time frames to obtain multi-frequency band component sequences. Amplitude normalization is then performed on the multi-frequency band component sequences to generate a sound analysis frame sequence and the reference driving frame sequence.

[0073] In this embodiment, the step of identifying the feedback shadow components and performing isolation processing on the corresponding feedback shadow components in the sound analysis frame sequence to generate an effective sound sequence includes:

[0074] The sound analysis frame sequence and the reference driving frame sequence are aligned frame by frame according to the same time frame order. Within each time frame, the corresponding driving change segment, energy fluctuation segment and continuous state segment are extracted according to the frequency band partition to generate a reference projection segment sequence that corresponds to the sound analysis frame sequence frame by frame.

[0075] Within each frequency band of each time frame, the sound analysis frame sequence is compared with the reference projection segment sequence to identify the sound components that simultaneously satisfy the following conditions: the component follows the reference drive, its frequency band position remains continuously locked, and its energy fluctuations change synchronously with the reference. These sound components are marked as initial feedback shadow candidate components.

[0076] Perform backflow closed-loop verification on the initial feedback shadow candidate components along continuous time frames to identify candidate components that continue to rise after the reference drive is enhanced, have delayed attenuation after the reference drive is weakened, and still maintain frequency band lingering in the speech discontinuity area or environmental event switching area. The candidate components that pass the backflow closed-loop verification are determined as feedback shadow components.
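As a rough sketch of the candidate marking and backflow closed-loop verification, the following assumes that each frame is a list of per-band energies. The names `mark_candidates` and `verify_closed_loop`, the energy floor, and the use of a simple persistence test as a stand-in for the rise, delayed-decay, and lingering checks are assumptions of this sketch, not details taken from the embodiment.

```python
def mark_candidates(mic, ref, min_energy=0.1):
    """Return {(frame, band)} where the microphone band follows the reference drive.

    mic, ref: lists of frames, each frame a list of per-band energies.
    """
    candidates = set()
    for t in range(1, len(mic)):
        for b in range(len(mic[t])):
            d_mic = mic[t][b] - mic[t - 1][b]
            d_ref = ref[t][b] - ref[t - 1][b]
            follows = d_mic * d_ref > 0  # energy moves in the same direction as the drive
            locked = mic[t][b] >= min_energy and mic[t - 1][b] >= min_energy
            if follows and locked:
                candidates.add((t, b))
    return candidates

def verify_closed_loop(candidates, min_run=3):
    """Keep candidates that persist in the same band for min_run consecutive frames.

    This persistence test is only a proxy for the closed-loop verification.
    """
    confirmed = set()
    for (t, b) in candidates:
        run = [(t + k, b) for k in range(min_run)]
        if all(p in candidates for p in run):
            confirmed.update(run)
    return confirmed
```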

[0077] The feedback shadow components are aggregated according to their temporal continuity, frequency band spread, and intensity transfer relationships to form feedback shadow trajectory segments. The center frequency band of each feedback shadow trajectory segment is used as the main isolation band, and the adjacent traction frequency band is used as the accompanying isolation band, generating corresponding feedback shadow isolation regions. The specific details of forming the feedback shadow trajectory segments are as follows:

[0078] Read the determined feedback shadow components in each time frame in chronological order, and identify the feedback shadow components in adjacent time frames that are continuously connected in frequency band position, maintain the same direction of energy change, and maintain synchronous continuity in phase change.

[0079] Feedback shadow components that satisfy the time continuity relationship are cascaded along the time direction, feedback shadow components that satisfy the frequency band spread relationship are combined along the frequency band direction, and feedback shadow components that satisfy the intensity transfer relationship are included in the same continuous propagation segment.

[0080] Extract the start time position, end time position, center frequency band position, frequency band coverage range, and intensity duration interval of the continuous propagation segment after cascading and merging.

[0081] The continuous propagation segment that simultaneously possesses continuous time span, stable frequency band coverage, and consistent intensity transmission direction is defined as the feedback shadow trajectory segment.
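One possible way to aggregate confirmed feedback shadow components into trajectory segments is a connected-component grouping over the time-frequency grid, sketched below. The adjacency rule (neighbouring frame and neighbouring band), the `min_frames` limit, and the use of the median band as the center band are assumptions; the phase and energy-direction conditions of the embodiment are omitted for brevity.

```python
def trajectory_segments(components, min_frames=2):
    """Group (frame, band) components into connected propagation segments."""
    remaining = set(components)
    segments = []
    while remaining:
        seed = remaining.pop()
        stack, members = [seed], {seed}
        while stack:
            t, b = stack.pop()
            # Visit the 8 neighbours in time and frequency band.
            for dt in (-1, 0, 1):
                for db in (-1, 0, 1):
                    nb = (t + dt, b + db)
                    if nb in remaining:
                        remaining.remove(nb)
                        members.add(nb)
                        stack.append(nb)
        frames = sorted({t for t, _ in members})
        bands = sorted({b for _, b in members})
        if len(frames) >= min_frames:  # discard isolated, single-frame components
            segments.append({
                "start": frames[0],
                "end": frames[-1],
                "center_band": bands[len(bands) // 2],
                "band_range": (bands[0], bands[-1]),
            })
    return sorted(segments, key=lambda s: s["start"])
```

Under these assumptions, `center_band` would serve as the main isolation band and the remaining bands in `band_range` as accompanying isolation bands.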

[0082] Hierarchical isolation processing is performed on the components falling into the feedback shadow isolation region in the sound analysis frame sequence. Specifically, suppression and stripping processing is performed on the feedback shadow components within the main isolation band, and edge-preserving attenuation processing is performed on the feedback shadow components within the accompanying isolation band. The isolated time frames are then continuously spliced together to generate an effective sound sequence.

[0083] Suppression and stripping processing is performed on the feedback shadow components within the main isolation band, specifically as follows:

[0084] The main frequency band is located for the feedback shadow components corresponding to each time frame within the main isolation band, and the main frequency band center position, main frequency band coverage area and main frequency band energy ratio are extracted.

[0085] Based on the center position and coverage of the main frequency band, fixed-point suppression is performed on the frequency band components that are consistent with the feedback shadow trajectory segment within the main isolation band, reducing the amplitude of the corresponding frequency band components to below the preset residual upper limit;

[0086] After the fixed-point suppression is completed, gap compensation is performed on the main isolation band. The adjacent frequency band information that was not marked as feedback shadow component in the previous and next time frames of the main isolation band is called to continuously fill the suppressed area.

[0087] The main isolation band after gap compensation is subjected to time-series smoothing to eliminate abrupt edges between adjacent time frames, generating a main isolation band sequence after suppression stripping.
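The fixed-point suppression, gap compensation, and time-series smoothing of the main isolation band could be sketched as below. The grid layout `grid[frame][band]`, the donor rule (unflagged neighbouring bands in the previous and next frames), the residual limit value, and the 3-point moving average are assumptions of this sketch.

```python
def strip_main_band(grid, flagged, band, residual_limit=0.05):
    """Suppress, gap-fill, and smooth one main isolation band.

    grid: grid[frame][band] magnitudes; flagged: set of (frame, band) shadow cells.
    Returns the processed magnitude track of `band` over time.
    """
    n_frames, n_bands = len(grid), len(grid[0])
    track = [grid[t][band] for t in range(n_frames)]
    for t in range(n_frames):
        if (t, band) not in flagged:
            continue
        # Fixed-point suppression: clamp to the residual upper limit.
        track[t] = min(track[t], residual_limit)
        # Gap compensation: borrow unflagged adjacent bands from neighbouring frames.
        donors = [grid[tt][bb]
                  for tt in (t - 1, t + 1)
                  for bb in (band - 1, band + 1)
                  if 0 <= tt < n_frames and 0 <= bb < n_bands
                  and (tt, bb) not in flagged]
        if donors:
            track[t] = sum(donors) / len(donors)
    # Time-series smoothing: 3-point moving average, shortened at the edges.
    smooth = []
    for t in range(n_frames):
        window = track[max(0, t - 1):t + 2]
        smooth.append(sum(window) / len(window))
    return smooth
```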

[0088] The feedback shadow components within the accompanying isolation band are subjected to edge-preserving attenuation processing, specifically as follows:

[0089] Extract the boundary frequency band position, edge energy change trend, and traction correlation with the main isolation band from the feedback shadow components corresponding to each time frame within the accompanying isolation band;

[0090] Based on the location of the boundary frequency band and the traction correlation, gradient reduction processing is performed on the feedback shadow component of the accompanying isolation band that is close to the main isolation band, while the original boundary contour is preserved for the frequency band component that is far from the main isolation band.

[0091] Boundary preservation correction is performed on the accompanying isolation band after gradient reduction processing to ensure that the fluctuations of the accompanying isolation band in the time continuity direction are consistent with the adjacent frequency bands that are not affected by feedback.

[0092] The accompanying isolation bands after boundary preservation correction are continuously spliced to create a smooth transition among the accompanying isolation bands, the main isolation bands, and the non-isolated frequency bands, generating an accompanying isolation band sequence after edge-preserving attenuation processing.

[0093] In this embodiment, the step of performing online prototype clustering and self-attention boundary detection on the frame-level embedded vectors to dynamically generate timestamped speech prototype node sequences and environment prototype node sequences includes:

[0094] The effective sound sequence is continuously segmented according to the preset frame length and frame shift. Log-Mel spectrum transformation, frequency band energy normalization and time ordering are performed on each time frame in sequence to generate frame-level acoustic feature maps that correspond one-to-one with the time frames.

[0095] An improved YAMNet model is constructed, comprising an input shaping layer, a residual backflow suppression layer, a cross-frame prototyping layer, and a self-attention boundary detection layer, wherein:

[0096] The input shaping layer receives the frame-level acoustic feature maps and performs channel expansion processing to generate input feature blocks of uniform size. Specifically, the uniform-size input feature blocks are generated as follows:

[0097] The frame-level acoustic feature maps corresponding to each time frame are aligned and arranged according to the preset frequency band order and time order. Edge padding is performed on the part with insufficient frequency band dimension, and truncation is performed on the part with excessive frequency band dimension to form a single frame feature segment with consistent frequency band length.

[0098] Each single-frame feature segment is rearranged according to a preset channel expansion rule, and the feature components of different frequency band regions are mapped to the corresponding channel positions to form a multi-channel feature segment with consistent channel distribution.

[0099] Each multi-channel feature segment is spliced together according to a preset time window length. For parts with insufficient time length, tail padding is performed, and for parts with excessive time length, fixed-length truncation is performed to generate input feature blocks of uniform size.
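The padding and truncation performed by the input shaping layer can be illustrated with the minimal sketch below. The defaults of 64 bands and 96 frames match the log-Mel patch size used by the original YAMNet; the function names and the choice of zero-valued tail padding are assumptions, and the channel expansion rule is not modelled because the embodiment does not specify it.

```python
def fit_bands(frame, n_bands):
    """Truncate a frame to n_bands, or edge-pad it by repeating its last value."""
    frame = list(frame[:n_bands])
    edge = frame[-1] if frame else 0.0
    return frame + [edge] * (n_bands - len(frame))

def make_input_block(frames, n_bands=64, n_frames=96):
    """Build a fixed-size (n_frames x n_bands) block from ragged frames."""
    rows = [fit_bands(f, n_bands) for f in frames[:n_frames]]  # fixed-length truncation
    # Tail padding with independent zero rows (no shared list objects).
    rows += [[0.0] * n_bands for _ in range(n_frames - len(rows))]
    return rows
```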

[0100] The residual backflow suppression layer is placed before the original YAMNet backbone network. Based on the narrowband continuous enhancement relationship, frequency band tail relationship, and abrupt loss relationship between adjacent time frames, it performs reduction processing on the feature responses corresponding to the residual backflow components to generate a cleaned feature map. Specifically, the reduction processing on the feature responses corresponding to the residual backflow components is as follows:

[0101] Read the feature responses corresponding to adjacent time frames in chronological order to locate the abnormal response segments that maintain narrow-band concentrated enhancement within multiple consecutive time frames and are inconsistent with the changing trends of normal speech or ambient sound before and after;

[0102] The frequency band trail range, duration, and energy attenuation trajectory are extracted from the abnormal response segments. The abnormal response segments that simultaneously meet the characteristics of obvious trail continuation, insufficient abrupt fluctuations, and cross-frame synchronous retention are identified as residual backflow response segments.

[0103] The feature responses within the residual backflow response segments are subjected to targeted reduction processing that gradually reduces the response intensity from the center frequency band toward both sides, and boundary smoothing is applied to the adjacent regions after reduction to generate the cleaned feature map.

[0104] The cross-frame prototyping layer is placed after the original YAMNet backbone network. It performs cross-frame convergence processing on the high-level embedding results of multiple consecutive time frames, generating a frame-level embedding vector containing features of the current frame, continuation features of the previous frame, and transition features of the subsequent frame. Specifically, the cross-frame convergence processing on the high-level embedding results of multiple consecutive time frames is as follows:

[0105] Read the high-level embedding results corresponding to multiple consecutive time frames in chronological order, take the high-level embedding result of the current time frame as the center embedding, extract the continuation features from the high-level embedding results of the preceding adjacent time frames, and extract the transition features from the high-level embedding results of the following adjacent time frames to form a cross-frame candidate embedding group.

[0106] Temporal correlation alignment processing is performed on each cross-frame candidate embedding group. High-level embedding results that maintain category continuity and energy continuity with the current time frame are incorporated into the previous-frame continuation features, and high-level embedding results that have boundary transition and event switching relationships with the current time frame are incorporated into the subsequent-frame transition features.

[0107] The current frame features, previous frame continuation features, and subsequent frame transition features are combined in a unified embedding order, and the combined result is subjected to compact mapping processing to generate a frame-level embedding vector corresponding to the current time frame.
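A simplified sketch of the cross-frame convergence follows. Here the continuation feature is taken as the mean of the preceding embeddings and the transition feature as the difference between the mean of the following embeddings and the current embedding; both choices, the `context` width, and the plain concatenation used in place of a learned compact mapping are assumptions of this sketch.

```python
def mean_vec(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cross_frame_embedding(embs, t, context=2):
    """Concatenate current, continuation, and transition features for frame t."""
    current = list(embs[t])
    dim = len(current)
    # Continuation: summary of the preceding adjacent frames.
    continuation = mean_vec(embs[max(0, t - context):t], dim)
    # Transition: how the following frames depart from the current frame.
    following = embs[t + 1:t + 1 + context]
    if following:
        nxt = mean_vec(following, dim)
        transition = [n - c for n, c in zip(nxt, current)]
    else:
        transition = [0.0] * dim
    return current + continuation + transition
```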

[0108] The self-attention boundary detection layer, connected to the cross-frame prototyping layer, jointly calculates the association strength, category transfer strength, and energy fluctuation synchronization degree between adjacent frame-level embedding vectors to generate boundary detection results. Specifically, the joint calculation of these quantities is as follows:

[0109] Read adjacent frame-level embedding vectors in chronological order, extract the feature similarity, orientation change and local aggregation between adjacent frame-level embedding vectors, and generate the corresponding association strength results;

[0110] By combining the changes in category distribution corresponding to adjacent frame-level embedding vectors, category preservation segments and category switching segments are identified. The continuous preservation intensity is extracted for the category preservation segments, and the switching abrupt intensity is extracted for the category switching segments, generating the corresponding category transfer intensity results.

[0111] Based on the energy change trend of adjacent time frames in the effective sound sequence, the loudness rise and fall consistency, peak-valley correspondence and fluctuation synchronization range between adjacent frames are extracted to generate the corresponding energy fluctuation synchronization results.

[0112] The association strength results, category transfer strength results, and energy fluctuation synchronization results are jointly converged to identify boundary enhancement locations and boundary continuation locations, and generate boundary detection results.
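The joint convergence of the three cues can be illustrated with a hand-weighted score, shown below. This is a deliberately simple substitute for the self-attention layer: the weights, the threshold, and the use of cosine dissimilarity, a binary category-switch flag, and a relative energy jump are all assumptions of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def boundary_scores(embs, classes, energies, weights=(0.5, 0.3, 0.2)):
    """Score the transition into each frame t >= 1; higher means more boundary-like."""
    w_assoc, w_class, w_energy = weights
    scores = []
    for t in range(1, len(embs)):
        assoc = 1.0 - cosine(embs[t - 1], embs[t])        # weak association
        switch = 0.0 if classes[t - 1] == classes[t] else 1.0  # category transfer
        hi = max(energies[t - 1], energies[t])
        jump = abs(energies[t] - energies[t - 1]) / hi if hi > 0 else 0.0
        scores.append(w_assoc * assoc + w_class * switch + w_energy * jump)
    return scores

def detect_boundaries(scores, threshold=0.5):
    """Return the frame indices at which a new event is assumed to start."""
    return [t + 1 for t, s in enumerate(scores) if s >= threshold]
```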

[0113] The frame-level acoustic feature map is input into the improved YAMNet model to obtain a sequence of frame-level embedding vectors arranged in time order;

[0114] Perform online prototype clustering on frame-level embedding vector sequences:

[0115] The frame-level embedding vectors are read sequentially over time, prototype centers are established, and similarity merging and updating are performed. New prototype centers are established for frame-level embedding vectors that do not meet the merging criteria. Merging is then performed on prototype centers that are not time-continuous and have short durations to generate candidate prototype sequences, where:

[0116] Establish a prototype center and perform similarity merging and updating, specifically as follows:

[0117] Read the frame-level embedding vectors in chronological order, establish the feature result corresponding to the first frame-level embedding vector as the initial prototype center, and record the time start point, time end point and category distribution status of the initial prototype center.

[0118] The similarity between the currently read frame-level embedding vector and each prototype center is compared. Frame-level embedding vectors that meet the merging conditions are assigned to the corresponding prototype centers, and the feature representation, temporal coverage and category distribution status of the corresponding prototype centers are updated synchronously.

[0119] For frame-level embedding vectors that do not meet the merging conditions, establish new prototype centers and record the starting time position, feature representation and category distribution state corresponding to the new prototype centers;
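The online prototype clustering described above can be sketched as a single pass over the embedding sequence. The cosine similarity measure, the merge threshold of 0.8, the running-mean center update, and the dictionary layout of each prototype are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def online_prototypes(embs, labels, threshold=0.8):
    """Assign each frame to the most similar prototype or open a new one."""
    protos = []
    for t, (e, lab) in enumerate(zip(embs, labels)):
        best, best_sim = None, threshold
        for p in protos:
            s = cosine(e, p["center"])
            if s >= best_sim:
                best, best_sim = p, s
        if best is None:
            # Merging condition not met: establish a new prototype center.
            protos.append({"center": list(e), "count": 1,
                           "start": t, "end": t, "classes": {lab: 1}})
        else:
            # Merge: update feature representation, time coverage, category counts.
            n = best["count"]
            best["center"] = [(c * n + x) / (n + 1)
                              for c, x in zip(best["center"], e)]
            best["count"] = n + 1
            best["end"] = t
            best["classes"][lab] = best["classes"].get(lab, 0) + 1
    return protos
```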

[0120] Based on the boundary detection results output by the self-attention boundary detection layer, boundary correction processing is performed on the candidate prototype sequence:

[0121] Event boundary points are established at locations where boundary detection results indicate enhanced category transfer and synchronous energy fluctuations. Event boundary points are eliminated at locations with continuous relationships. Based on the dominant category of each candidate prototype, candidate prototypes are divided into speech prototype nodes and environment prototype nodes. The start time, end time, and duration interval of each prototype node are recorded in chronological order, generating timestamped sequences of speech prototype nodes and environment prototype nodes. Specifically, the division of candidate prototypes into speech prototype nodes and environment prototype nodes is as follows:

[0122] The category distribution results corresponding to each candidate prototype are statistically analyzed, the category with the highest proportion is extracted as the dominant category, and the time coverage interval corresponding to the dominant category is recorded.

[0123] Candidate prototypes whose dominant category belongs to the speech category set are identified as speech prototype nodes, and candidate prototypes whose dominant category belongs to the ambient sound category set are identified as ambient prototype nodes. Candidate prototypes in the boundary alternation segment are subjected to classification adjustment processing according to the category continuity relationship of adjacent time segments.

[0124] The divided speech prototype nodes and environment prototype nodes are rearranged in chronological order, and each prototype node is marked with a start time, end time, and duration interval, generating a sequence of speech prototype nodes and an environment prototype node sequence with timestamps.
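A minimal sketch of dividing candidate prototypes by dominant category is given below. The contents of `SPEECH_CLASSES` are an illustrative assumption (a few AudioSet-style labels), as is the candidate dictionary layout with times in seconds; the adjustment of candidates in boundary alternation segments is not modelled.

```python
# Assumed, illustrative speech category set; a real system would define its own.
SPEECH_CLASSES = {"Speech", "Conversation", "Narration, monologue"}

def split_prototype_nodes(candidates, speech_classes=SPEECH_CLASSES):
    """Split candidates into timestamped speech and environment prototype nodes.

    Each candidate: {"start": s, "end": e, "classes": {category: frame_count}}.
    """
    speech, environment = [], []
    for c in sorted(candidates, key=lambda c: c["start"]):
        # Dominant category: the one with the highest share of frames.
        dominant = max(c["classes"], key=c["classes"].get)
        node = {"category": dominant, "start": c["start"], "end": c["end"],
                "duration": c["end"] - c["start"]}
        if dominant in speech_classes:
            speech.append(node)
        else:
            environment.append(node)
    return speech, environment
```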

[0125] In this embodiment, generating a feedback-free auditory embryo comprising a dominant event chain, an accompanying event chain, and a loudness envelope chain includes:

[0126] Read the speech prototype node sequence and the environment prototype node sequence in chronological order, identify the temporal adjacency relationship, co-occurrence intensity relationship and coverage substitution relationship between the speech prototype node and the environment prototype node, and establish the speech prototype node and environment prototype node with the start and end times connected, the continuous intervals overlapping or the energy changes synchronized as candidate coupling node pairs.

[0127] Based on each candidate coupled node pair, a speech-environment coupling constraint graph is constructed. In this graph, adjacency constraint edges are established for nodes with temporal continuity relationships, co-occurrence constraint edges are established for nodes with synchronization enhancement relationships, and substitution constraint edges are established for nodes with time period coverage and category replacement relationships. This generates a time-expanded coupling constraint structure. Specifically, the construction of the speech-environment coupling constraint graph is as follows:

[0128] Read each candidate coupling node pair in chronological order, record each speech prototype node and each environment prototype node as graph nodes into a unified node set, and mark each graph node with node category, start time, end time and duration interval.

[0129] Establish time connection relationships for graph nodes whose start and end times are connected or whose durations overlap; establish co-occurrence relationships for graph nodes that show synchronous enhancement or synchronous weakening trends within the same time segment; and establish substitution relationships for graph nodes that have event substitution, event exit, or category switching within the same coverage period.

[0130] The temporal connection relationship is written into the adjacency constraint edge, the co-occurrence relationship is written into the co-occurrence constraint edge, and the substitution relationship is written into the substitution constraint edge. The constraint edges are then connected and expanded according to the temporal order of the graph nodes to generate a temporally expanded speech-environment coupling constraint graph.
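One way to picture the speech-environment coupling constraint graph is as three edge lists over speech and environment nodes, as in the sketch below. The node fields `id` and `trend` (+1 rising, -1 falling), the tolerance `tol`, and the concrete tests used for each edge type (touching or overlapping intervals for adjacency, overlap with equal trend for co-occurrence, full interval containment for substitution) are assumptions chosen for illustration.

```python
def build_constraint_graph(speech_nodes, env_nodes, tol=0.05):
    """Return adjacency, co-occurrence, and substitution edges as (speech_id, env_id)."""
    edges = {"adjacency": [], "co_occurrence": [], "substitution": []}
    for s in speech_nodes:
        for e in env_nodes:
            # gap < 0: intervals overlap; gap close to 0: start and end are connected.
            gap = max(s["start"], e["start"]) - min(s["end"], e["end"])
            if gap > tol:
                continue  # not a candidate coupling node pair
            pair = (s["id"], e["id"])
            edges["adjacency"].append(pair)
            if gap < 0 and s["trend"] == e["trend"]:
                edges["co_occurrence"].append(pair)  # synchronous rise or fall
            s_covers_e = s["start"] <= e["start"] and e["end"] <= s["end"]
            e_covers_s = e["start"] <= s["start"] and s["end"] <= e["end"]
            if s_covers_e or e_covers_s:
                edges["substitution"].append(pair)   # one event covers the other
    return edges
```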

[0131] Within each time-competition segment, a dominant event competition screening is performed on the candidate nodes in the speech-environment coupling constraint graph:

[0132] Candidate nodes that are continuous in their duration, have a stable energy framework, and maintain an event continuity relationship with the previous time-competitive segment are identified as dominant event nodes.

[0133] Candidate nodes that are not identified as dominant event nodes but have a temporal or energy-following relationship with the dominant event node are retained as accompanying event nodes;

[0134] Connect the dominant event nodes in chronological order to generate a dominant event chain;

[0135] Based on the start-end containment relationship, boundary fitting relationship and intensity dependency relationship between each accompanying event node and the corresponding dominant event node, the accompanying event node is attached to the front, back or middle section of the corresponding dominant event node to generate an accompanying event chain.

[0136] Based on the loudness fluctuations, peak-to-valley transitions, and sustained stable intervals of the effective sound sequence within each time segment, loudness change trajectories arranged chronologically are extracted. These trajectories are then time-aligned with the dominant event chain and the accompanying event chain. Loudness master control markers are applied to the segments covered by the dominant event chain, loudness yield markers are applied to the segments attached to the accompanying event chain, and loudness release markers are applied to event-free gap segments. This generates a feedback-free auditory embryo containing the dominant event chain, the accompanying event chain, and the loudness envelope chain.

[0137] The loudness master control marker is applied to the segments covered by the dominant event chain, specifically as follows:

[0138] The main control region is located for the loudness change trajectory within the time segment of the dominant event chain, and the peak position, duration interval and energy distribution range of the corresponding segment are extracted;

[0139] Based on the temporal continuity of the dominant event chain and the speech priority characteristics, the loudness in the main control area is preferentially maintained so that the dominant event chain maintains a continuous dominant state in the corresponding time segment.

[0140] Transition constraint processing is applied to loudness changes at the boundary of the main control area to ensure smooth connection of the dominant event chain when entering and exiting the corresponding time segment;

[0141] The loudness yield marker is applied to the segments attached to the accompanying event chain, specifically as follows:

[0142] The subordinate region is located for the loudness change trajectory within the time segment of the accompanying event chain, and the time segment that overlaps with or is adjacent to the dominant event chain is extracted.

[0143] Based on the temporal connection between the accompanying event chain and the dominant event chain, the loudness in the subordinate region is proportionally compressed so that the accompanying event chain remains in an auxiliary expression state within the corresponding time segment.

[0144] Synchronous adjustment processing is performed on the boundary positions within the subordinate region to ensure that the loudness changes of the accompanying event chain are consistent with the changing trend of the dominant event chain;

[0145] The loudness release marker is applied to event-free gap segments, specifically as follows:

[0146] For time segments not covered by the dominant event chain and accompanying event chain, gap segments are located, and the start and end time range and loudness change trend of the corresponding segments are extracted.

[0147] Based on the duration of the gap segment and the loudness level of the adjacent event segments, a gradual transition process is performed on the loudness within the gap segment, so that the loudness change smoothly transitions from the previous event segment to the next event segment.

[0148] Abrupt peaks and abnormal fluctuations within the gap section are reduced to keep the loudness change in that section stable.

[0149] In this embodiment, the step of performing dual-stream reconstruction processing on the dominant event chain and the accompanying event chain respectively to generate the target output sound includes:

[0150] Obtain the user's hearing loss profile, which includes the hearing threshold distribution of each frequency band, the comfortable loudness range, and the distribution of speech-sensitive frequency bands. Map the hearing loss profile to each time segment and frequency band region of the feedback-free auditory embryo to generate a hearing loss constraint distribution result. Specifically, the hearing loss constraint distribution result is generated as follows:

[0151] Read each time segment in the feedback-free auditory embryo in chronological order, extract the coverage of the dominant event chain, accompanying event chain and loudness envelope chain corresponding to each time segment, and mark the corresponding frequency band region;

[0152] Based on the hearing threshold distribution of each frequency band, the perceptibility of the corresponding frequency band in each time segment is classified and labeled. Frequency bands below the hearing threshold range are labeled as enhancement segments, frequency bands close to the hearing threshold range are labeled as compensation segments, and frequency bands above the hearing threshold range are labeled as maintenance segments.

[0153] Based on the comfortable loudness range, interval matching is performed on the loudness envelope chain in each time segment. Segments above the upper limit of comfort are marked as compressed segments, segments below the lower limit of comfort are marked as enhanced segments, and segments within the comfortable range are marked as stable segments.

[0154] Based on the distribution of speech-sensitive frequency bands, priority marking is performed on key speech bands within the time segment of the dominant event chain, and subordinate marking is performed on non-key frequency bands within the time segment of the accompanying event chain.

[0155] The frequency band classification markers, loudness interval markers, and speech sensitivity markers for each time segment are integrated to generate hearing loss constraint distribution results for each time segment and frequency band region.
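The three kinds of markers and their integration might be sketched for a single time segment as follows. Levels are assumed to be in dB; the `margin_db` used to decide what counts as "close to" the hearing threshold, the profile dictionary layout, and the label strings are assumptions of this sketch.

```python
def band_label(level_db, threshold_db, margin_db=5.0):
    """Classify band audibility relative to the hearing threshold."""
    if level_db < threshold_db - margin_db:
        return "enhance"      # below the hearing threshold range
    if level_db <= threshold_db + margin_db:
        return "compensate"   # close to the hearing threshold range
    return "maintain"         # above the hearing threshold range

def loudness_label(level_db, comfort):
    """Classify segment loudness relative to the comfortable range (low, high)."""
    low, high = comfort
    if level_db > high:
        return "compress"
    if level_db < low:
        return "enhance"
    return "stable"

def constraint_distribution(band_levels_db, envelope_db, profile, chain):
    """Integrate the markers for one time segment; chain is 'dominant' or 'accompanying'."""
    loud = loudness_label(envelope_db, profile["comfort"])
    result = []
    for b, level in enumerate(band_levels_db):
        in_speech = b in profile["speech_bands"]
        if chain == "dominant" and in_speech:
            role = "priority"
        elif chain == "accompanying" and not in_speech:
            role = "subordinate"
        else:
            role = "none"
        result.append({"band": b,
                       "audibility": band_label(level, profile["thresholds"][b]),
                       "loudness": loud,
                       "role": role})
    return result
```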

[0156] Based on the hearing loss constraint distribution results, priority enhancement marking is performed on the speech-related frequency bands in the time segment where the dominant event chain is located, secondary preservation marking is performed on the environment-related frequency bands in the time segment where the accompanying event chain is located, and energy attenuation marking is performed on the event-free segments, generating zonal control results, where:

[0157] Priority enhancement marking is performed on the speech-related frequency bands within the time segment of the dominant event chain, specifically as follows:

[0158] Extract the speech-related frequency band range for each time segment covered by the dominant event chain, and determine the frequency band that belongs to the speech-sensitive frequency band and is in the enhancement or compensation segment as the priority enhancement frequency band;

[0159] Mark the enhancement level for the priority enhancement frequency band within the corresponding time segment, and record the enhancement start position, enhancement duration interval and enhancement end position;

[0160] Mark the transition interval at the boundary position between the priority enhancement frequency band and the adjacent non-enhanced frequency band, and record the corresponding time range and frequency band range;

[0161] Secondary preservation marking is performed on the environmentally relevant frequency bands within the time segment of the accompanying event chain, specifically as follows:

[0162] For each time segment covered by the accompanying event chain, the environmentally relevant frequency band range is extracted, and the area that is not covered by the speech priority enhancement marker and belongs to the environmental event frequency band is determined as the secondary reserved frequency band.

[0163] Mark the retention level for secondary reserved frequency bands within the corresponding time segment, and record the retention range and corresponding time segment;

[0164] Mark the coordination interval for the overlapping area between the secondary reserved frequency band and the frequency band where the dominant event chain is located, and record the overlapping position and duration interval;

[0165] Energy attenuation marking is performed on the event-free segments, specifically as follows:

[0166] Extract the corresponding frequency band range and loudness change trajectory for time segments not covered by the dominant event chain and accompanying event chain, and identify such time segments as event-free segments;

[0167] Mark the attenuation level for each frequency band within the event-free segment, and record the attenuation start position, attenuation duration range, and attenuation end position;

[0168] Mark the transition intervals at the connection points between event-free segments and adjacent event segments, and record the corresponding time ranges;

[0169] Based on the comfortable loudness range, segmented adjustment is applied to the loudness envelope chain: compression is performed in time segments where the loudness is above the upper comfort limit, enhancement is performed in time segments where the loudness is below the lower comfort limit, and the original trend is maintained in time segments where the loudness lies within the comfortable range, generating the adjusted loudness envelope chain.
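The segmented adjustment of the loudness envelope can be sketched as a piecewise mapping in dB. The compression ratio, the fixed boost, and the rule that a boosted value never exceeds the lower comfort bound are assumptions of this sketch rather than values from the embodiment.

```python
def adjust_envelope(envelope_db, low, high, ratio=3.0, boost_db=6.0):
    """Compress above `high`, boost below `low`, keep values inside the range."""
    adjusted = []
    for v in envelope_db:
        if v > high:
            # Compression: the excess over the upper bound is divided by the ratio.
            adjusted.append(high + (v - high) / ratio)
        elif v < low:
            # Enhancement: raise quiet segments, but not beyond the lower bound.
            adjusted.append(min(v + boost_db, low))
        else:
            adjusted.append(v)  # original trend maintained
    return adjusted
```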

[0170] Based on the zonal control results and the adjusted loudness envelope chain, speech-dominant reconstruction processing is performed on the dominant event chain. The continuity of the speech structure is restored according to the temporal order of the speech prototype nodes. Directional enhancement is performed on the key frequency bands of the speech. Smooth transition processing is performed at the speech boundaries to generate a speech reconstruction stream. Environmental dependency reconstruction processing is performed on the accompanying event chain. Based on the attachment relationship between the accompanying event nodes and the dominant event nodes, the environmental events are subjected to continuation compensation, boundary fitting and energy following adjustment in the corresponding time segment to generate an environmental reconstruction stream.

[0171] The speech reconstruction stream and the environment reconstruction stream are fused together in a unified time sequence. The speech reconstruction stream is the main output in the dominant event segment, the environment reconstruction stream is superimposed in the accompanying event segment, and smooth transition processing is performed in the event-free segment to generate the target output sound.
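The fusion of the two reconstruction streams might look like the per-frame mix below. The segment labels `"dominant"`, `"accompanying"`, and `"none"`, the environment gain, and the plain attenuation used for event-free frames (in place of a true smooth transition) are assumptions of this sketch.

```python
def fuse_streams(speech, environment, labels, env_gain=0.5, idle_gain=0.25):
    """Mix two aligned streams frame by frame according to the segment label."""
    out = []
    for s, e, label in zip(speech, environment, labels):
        if label == "dominant":
            out.append(s)                    # speech stream is the main output
        elif label == "accompanying":
            out.append(s + env_gain * e)     # environment stream superimposed
        else:
            out.append(idle_gain * (s + e))  # event-free: attenuated residual
    return out
```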

[0172] In this embodiment, the step of performing delay compensation processing on the target output sound, and then inputting the delay-compensated sound signal into the amplification unit for gain control before driving the receiver output includes:

[0173] The processing time of each stage, including feedback shadow isolation processing, prototype node construction, feedback-free auditory embryo generation, and dual-stream reconstruction processing, is recorded and accumulated in chronological order to generate the total processing delay corresponding to the current target output sound.

[0174] The total processing delay is compared with the preset real-time output threshold. When the total processing delay is greater than the real-time output threshold, the delay compensation trigger segment and the corresponding time range are determined.

[0175] Within the delay compensation triggering section, a compensation segment is generated based on the dominant event chain, accompanying event chain, and loudness envelope chain of the previous stable time segment. The compensation segment is inserted into the corresponding position of the target output sound, and amplitude continuity processing and frequency band smoothing processing are performed on the connected region to generate continuous output sound.
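The delay check and the insertion of a compensation segment can be sketched as follows. The stage names, the 10 ms threshold used in the usage note, the cyclic reuse of samples from the previous stable segment, and the linear ramp used for amplitude continuity are assumptions; frequency band smoothing is not modelled.

```python
def total_delay(stage_ms):
    """Accumulate the recorded processing time of every stage (milliseconds)."""
    return sum(stage_ms.values())

def needs_compensation(stage_ms, threshold_ms):
    """True when the accumulated delay exceeds the real-time output threshold."""
    return total_delay(stage_ms) > threshold_ms

def insert_compensation(output, gap_start, gap_len, stable, fade=2):
    """Fill output[gap_start:gap_start+gap_len] from the last stable segment.

    The first `fade` samples ramp linearly from the sample before the gap
    to the fill value so that the amplitude stays continuous.
    """
    out = list(output)
    prev = out[gap_start - 1] if gap_start > 0 else stable[0]
    for i in range(gap_len):
        fill = stable[i % len(stable)]   # reuse the stable segment cyclically
        w = min(1.0, (i + 1) / fade)     # ramp weight toward the fill value
        out[gap_start + i] = (1 - w) * prev + w * fill
    return out
```

For example, with stage times of 2.0, 4.0, 3.0, and 3.5 ms and an assumed threshold of 10 ms, `needs_compensation` returns `True` and a compensation segment would be generated for the affected time range.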

[0176] The continuous output sound is input to the amplification unit, where the gain of each frequency band is adjusted according to preset gain control parameters. The gain-adjusted sound signal is then sent to the receiver for output. The preset gain control parameters comprise, for each frequency band, the target gain value, the maximum output limit, the compression threshold, and the compression ratio.
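The four per-band gain control parameters named above map naturally onto a static compression rule in dB, sketched below. The function name, the hard-knee compression curve, and the numeric values in the usage note are assumptions for illustration.

```python
def band_output_db(input_db, gain_db, threshold_db, ratio, max_out_db):
    """Apply compression, target gain, and output limiting to one band level (dB)."""
    if input_db > threshold_db:
        # Above the compression threshold the level grows 1/ratio as fast.
        compressed = threshold_db + (input_db - threshold_db) / ratio
    else:
        compressed = input_db
    # Add the target gain, then enforce the maximum output limit.
    return min(compressed + gain_db, max_out_db)
```

With an assumed target gain of 20 dB, compression threshold of 60 dB, ratio of 2, and output limit of 100 dB, an input of 50 dB yields 70 dB and an input of 80 dB yields 90 dB, so loud inputs receive less effective gain than soft ones.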

[0177] Example 1: To verify the feasibility of this invention in practice, it was applied to a hearing rehabilitation center in a hospital to conduct hearing aid performance verification tests on a group of subjects with moderate to severe sensorineural hearing loss. The test location was set in the waiting area on the third floor of the outpatient building. This area is a semi-open space with multiple sound sources superimposed, including conversations, announcements, cart movements, and air conditioning operation, representing a typical complex auditory environment. The tests were conducted from 9:00 AM to 11:00 AM for three consecutive days, with each test lasting approximately 120 minutes. A total of 15 subjects participated in the test, aged between 50 and 70 years old. Their hearing curves generally showed mid-to-high frequency hearing loss, and they had previously experienced problems such as frequent feedback, unclear speech, and significant environmental noise interference when using traditional digital hearing aids.

[0178] In this scenario, the method of this invention is deployed in a hearing aid device with embedded processing capabilities. During operation, the hearing aid acquires external sound signals through its microphone and simultaneously obtains the output reference signal of the receiver driver in real time, processing both signals synchronously at a uniform sampling frequency. In the waiting area environment, ear canal leakage and spatial reflection cause the receiver output sound to re-enter the microphone, forming a backflow path. Through feedback shadow isolation processing at the front end, this invention performs correspondence analysis on the sound analysis frame sequence and the reference driving frame sequence, identifying feedback shadow components that exhibit time delay consistency, sustained frequency band locking, and energy changes synchronized with the receiver drive signal. These components are then isolated from the subsequent processing links so that feedback components do not enter the recognition and synthesis process.
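Of the three criteria named above, time delay consistency is the easiest to illustrate. The sketch below searches for the lag at which the microphone signal best matches the receiver reference using a normalized cross-correlation. It covers only that one criterion; frequency band locking and energy synchronization are not modeled, and the names `best_lag` and `is_shadow_candidate` and the 0.8 threshold are assumptions.

```python
import math
from typing import List, Tuple


def best_lag(mic: List[float], ref: List[float], max_lag: int) -> Tuple[int, float]:
    """Return (lag, corr) maximizing the normalized cross-correlation
    between mic[n] and ref[n - lag] for 0 <= lag <= max_lag."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        n = min(len(mic) - lag, len(ref))
        if n <= 0:
            break
        x = mic[lag:lag + n]   # microphone samples, shifted by the lag
        y = ref[:n]            # reference samples they are compared with
        num = sum(a * b for a, b in zip(x, y))
        den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
        corr = num / den if den > 0 else 0.0
        if corr > best[1]:
            best = (lag, corr)
    return best


def is_shadow_candidate(mic: List[float], ref: List[float],
                        max_lag: int, min_corr: float = 0.8) -> bool:
    """Flag the frame when some lag explains the mic signal well enough."""
    _, corr = best_lag(mic, ref, max_lag)
    return corr >= min_corr
```

A practical detector would also require the estimated lag to stay stable across consecutive frames before treating a component as a feedback shadow; that check is omitted here for brevity.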

[0179] After feedback shadow isolation is complete, the effective sound sequence is input into the improved YAMNet model. Through internal backflow residue suppression and cross-frame prototype coding, speech and ambient sound are modeled in a unified manner, generating a speech prototype node sequence and an environment prototype node sequence. During testing in the waiting area, the system stably recognized speech events arising from doctor-patient interactions, while also recognizing environment events such as broadcast announcements, cart sounds, and footsteps. Furthermore, by constructing the speech-environment coupling constraint graph, speech and environment events in different time segments are competitively screened to determine the dominant event chain and the accompanying event chain. When the doctor speaks, the speech event is identified as the dominant event, while background broadcast sounds and ambient noise are attached as accompanying events in the corresponding time segments.

[0180] Subsequently, dual-stream reconstruction processing was performed based on the hearing loss profile of the subjects. Addressing the prevalent hearing loss in the 2kHz to 4kHz frequency range among the test population, the system enhanced the key speech frequency bands in the dominant event chain while simultaneously applying energy constraints and boundary fitting to the ambient sound in the accompanying event chain, ensuring the ambient sound remained present without interfering with speech recognition. During processing, when the cumulative delay of the recognition and reconstruction link approached the real-time threshold, the system generated transition segments through a delay compensation mechanism and smoothly connected them, thus ensuring continuous and stable sound output. The final output sound was then amplified and subjected to frequency band gain control before driving the receiver for playback.
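The patent does not state how the hearing loss profile is converted into frequency band weights. As a placeholder only, the sketch below substitutes the classic half-gain rule of thumb (gain of roughly half the hearing loss in each band) and gives the accompanying (ambient) stream a fixed offset below the dominant (speech) stream. The function names, the 40 dB gain cap, the 6 dB offset, and the audiogram values in the usage note are all assumptions.

```python
from typing import Dict, Tuple


def band_gains_db(hearing_loss_db: Dict[int, float],
                  max_gain_db: float = 40.0) -> Dict[int, float]:
    """Half-gain rule of thumb: gain is half the loss, capped at max_gain_db."""
    return {hz: min(0.5 * loss, max_gain_db)
            for hz, loss in hearing_loss_db.items()}


def dual_stream_gains_db(hearing_loss_db: Dict[int, float],
                         ambient_offset_db: float = 6.0
                         ) -> Dict[int, Tuple[float, float]]:
    """Per band: (gain for the dominant stream, gain for the accompanying
    stream). The accompanying stream is kept audible but lower."""
    gains = band_gains_db(hearing_loss_db)
    return {hz: (g, max(g - ambient_offset_db, 0.0))
            for hz, g in gains.items()}
```

For a hypothetical profile of 20, 30, 60, and 70 dB loss at 500, 1000, 2000, and 4000 Hz, the dominant stream receives 10, 15, 30, and 35 dB, so the 2 kHz to 4 kHz region gets the largest boost, while the accompanying stream sits 6 dB lower in every band.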

[0181] In the same test environment, the actual performance of traditional digital hearing aids and of the method of this invention was compared, and the key performance indicators were statistically analyzed to obtain the following results.

[0182] Table 1 Comparison of Key Performance Indicators of Hearing Aids

[0183]

[0184] As shown in Table 1, in the real and complex hospital waiting environment, traditional hearing aids exhibit significant feedback under high-gain conditions, averaging 4.8 feedback events per hour. The method of this invention reduces this to 0.9 events per hour, indicating that the feedback shadow isolation mechanism effectively blocks the backflow signal from entering the recognition and synthesis link, suppressing feedback at its source. The speech recognition accuracy improved from 64.5% to 81.2%, and the speech intelligibility score increased from 5.2 to 8.1, demonstrating that this invention extracts the dominant event chain and reconstructs speech with high accuracy, enhancing speech intelligibility and improving the user's auditory experience in conversational scenarios.

[0185] In terms of ambient sound perception, traditional hearing aids, lacking a structured mechanism for processing ambient sound, are prone to attenuating ambient sound or letting it overlap with speech, resulting in an ambient sound score of only 4.8. The method of this invention, by constructing an accompanying event chain and performing collaborative reconstruction, improves the ambient sound score to 7.3, preserving necessary environmental information while prioritizing speech, resulting in a more natural overall auditory experience. The output stability score increased from 6.0 to 8.5, and the number of output interruptions decreased from 2.6 per hour to 0.5 per hour. This demonstrates that the invention, through dual-stream reconstruction and delay compensation mechanisms, reduces sound jitter and dropouts, improving continuity and stability.

[0186] Although the average processing latency of the method in this invention increased from 15 milliseconds to 30 milliseconds, and the latency fluctuation range also expanded accordingly, it was still controlled within the acceptable real-time range for hearing aids and did not significantly affect the user's normal hearing. Combined with the result of reduced output interruption frequency, it can be seen that this invention compensates for the impact of the extended processing link through latency compensation, maintaining output smoothness and continuity while ensuring improved processing accuracy. Overall, this invention achieves significant improvements in feedback control, voice enhancement, environmental awareness, and output stability, demonstrating good practical effects.

[0187] Example 2: In a video conference room with an area of approximately 30 square meters, a video conferencing system is installed, including a conference host, a conference speaker, and three omnidirectional microphones distributed at different positions on the conference table. The meeting is conducted remotely using mainstream video conferencing software. To ensure clarity of speech, all three microphones are turned on, and the conference speaker volume is set to a high level so that all participants in the room can clearly hear the remote speaker.

[0188] In practical use, when a remote participant speaks, the sound played by the conference speakers is picked up again by multiple microphones and transmitted back to the conference system through their respective audio links, forming an acoustic feedback loop. Simultaneously, different microphones also create cross-feedback through the speakers. Under traditional echo cancellation technology, when multiple microphones are working simultaneously and the conference environment involves movement, changes in speaking position, or adjustments to speaker volume, the feedback path constantly changes. This can easily lead to problems such as sharp howling, fluctuating volume, and speech being misinterpreted as echo and weakened, severely impacting conference quality.

[0189] In this scenario, the sound recognition and synthesis method of this invention is deployed in the audio processing module of the conference host. The sounds collected by all microphones serve as the external sound signal inputs, and the speaker drive signals serve as the output reference signals. The system first performs unified frame segmentation processing on the multiple sound streams, generating a sound analysis frame sequence and a reference driving frame sequence. Subsequently, based on the time delay correspondence, frequency band energy correspondence, and phase following relationship between the two, the feedback shadow component formed by the sound played back by the speaker after spatial propagation is identified, and this component undergoes hierarchical isolation processing to obtain the effective sound sequence.
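Unified frame segmentation means every microphone stream and the speaker reference are cut with the same frame length and frame shift, so that frame k covers the same time span in all streams. A minimal sketch follows; the Hann window is an assumption, since the patent only says that windowing is applied, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def hann_window(n: int) -> List[float]:
    """Symmetric Hann window of length n."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the signal into overlapping frames and window each frame."""
    w = hann_window(frame_len)
    return [[s * wi for s, wi in zip(signal[start:start + frame_len], w)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def frame_streams(mics: List[List[float]], reference: List[float],
                  frame_len: int, hop: int
                  ) -> Tuple[List[List[List[float]]], List[List[float]]]:
    """Frame all microphone streams and the reference with shared parameters,
    yielding time-aligned analysis frames and reference frames."""
    mic_frames = [split_frames(m, frame_len, hop) for m in mics]
    ref_frames = split_frames(reference, frame_len, hop)
    return mic_frames, ref_frames
```

Because the frame length and hop are shared, equal-length input streams always produce the same number of frames, which is what allows the later frame-by-frame comparison against the reference.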

[0190] Next, the effective sound sequence is input into the improved YAMNet model to extract frame-level embedding vectors. Through online prototype clustering and self-attention boundary detection, the sound is dynamically divided into speech prototype nodes and environment prototype nodes. In a conference scenario, participants' speeches are identified as speech events, while keyboard clicks, page-turning sounds, and air conditioning noise are identified as environment events. Feedback components from the speakers are effectively isolated in the pre-processing stage and do not enter the recognition and modeling chain. The system constructs a speech-environment coupling constraint graph based on the speech and environment prototype nodes, generating a feedback-free auditory embryo containing the dominant event chain, accompanying event chain, and loudness envelope chain. Dual-stream reconstruction processing is performed according to the need for speech priority in the conference, outputting a clear and stable conference sound signal, which is finally played through the speakers after delay compensation and gain control.
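The online prototype clustering step can be sketched as a single pass over the frame-level embedding vectors: each vector is merged into the most similar existing prototype center, or starts a new one when no center is similar enough. The cosine similarity measure, the running-mean update, the 0.9 threshold in the usage note, and the function names are assumptions. The self-attention boundary detection is not implemented; boundaries are simply taken where the assigned prototype changes.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def online_prototypes(embeddings: Sequence[Sequence[float]],
                      threshold: float) -> Tuple[List[int], List[List[float]]]:
    """Assign each embedding, in time order, to a prototype center."""
    centers: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for e in embeddings:
        sims = [cosine(e, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] >= threshold:
            n = counts[best]   # merge: update the center as a running mean
            centers[best] = [(c * n + x) / (n + 1) for c, x in zip(centers[best], e)]
            counts[best] = n + 1
            labels.append(best)
        else:                  # no center is similar enough: new prototype
            centers.append(list(e))
            counts.append(1)
            labels.append(len(centers) - 1)
    return labels, centers


def label_segments(labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Collapse consecutive equal labels into (label, first, last) segments."""
    segments: List[Tuple[int, int, int]] = []
    for i, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1], i)
        else:
            segments.append((lab, i, i))
    return segments
```

With a threshold of 0.9, the toy sequence `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]` yields two prototypes and three segments, the last frame rejoining the first prototype.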

[0191] Table 2 Performance Comparison Table for Multi-Microphone Conferencing Scenarios

[0192]

[0193] As shown in Table 2, the method of this invention demonstrates significant advantages in suppressing feedback and improving voice quality in multi-microphone online conferencing scenarios. In traditional solutions, due to the complex acoustic feedback loops formed between multiple microphones and speakers, an average of 3.6 feedback incidents occur per hour. The method of this invention, by identifying and isolating feedback shadow components, reduces this to 0.4 incidents per hour, largely eliminating feedback interference during the meeting. The voice clarity score improved from 6.1 to 8.6, and the far-end voice intelligibility increased from 82.3% to 93.8%, indicating that this invention effectively suppresses feedback while better preserving and enhancing authentic voice information, providing participants with a clearer and more natural auditory experience.

[0194] In terms of voice protection and environmental restoration, this invention also demonstrates significant advantages. Traditional solutions, due to their conservative echo cancellation and noise reduction strategies, are prone to misclassifying some real speech as echo or noise and attenuating it, resulting in a false suppression rate of 11.5%. This invention, by constructing a feedback-free auditory embryo and performing dual-stream reconstruction of dominant and accompanying events, reduces the false suppression rate to 2.8%, effectively avoiding excessive speech attenuation. Simultaneously, the environmental sound naturalness score improves from 5.7 to 7.9, and the output stability score improves from 6.4 to 8.8, indicating that this invention, while ensuring speech clarity, can more realistically restore the atmosphere of the meeting environment and maintain the continuity and stability of sound output, making it particularly suitable for meeting scenarios with multiple speakers speaking simultaneously and frequent speaker switching. Furthermore, the accuracy rate for recognizing multiple speakers simultaneously improves from 78.9% to 91.2%, further validating the superior performance of this invention in complex acoustic environments.

[0195] Regarding system real-time performance, the average processing latency of the method in this invention increased from 12 ms to 26 ms, and the latency fluctuation range changed from 7-18 ms to 24-30 ms. This increase in latency mainly comes from processing steps such as feedback shadow recognition, improved YAMNet modeling, and dual-stream reconstruction. However, the overall latency is still far below the 150 ms threshold typically acceptable for real-time voice communication, and no perceptible stuttering or desynchronization occurred during actual meetings. This invention achieves a comprehensive effect of reduced feedback, significantly improved voice quality, more natural ambient sound, and more stable system output with acceptable latency overhead, fully demonstrating the effectiveness and practical value of this method in audio scenarios with acoustic feedback loops, such as multi-microphone conferencing systems.

[0196] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A sound recognition and synthesis method for hearing aids, characterized in that it includes: External sound signals are collected through the hearing aid microphone, the output reference signal of the receiver driver is obtained, and the external sound signals and the output reference signal are processed to generate a sound analysis frame sequence and a reference driving frame sequence. Based on the time delay correspondence, frequency band energy correspondence, and phase following relationship between the sound analysis frame sequence and the reference driving frame sequence, the feedback shadow component is identified, and the corresponding feedback shadow component in the sound analysis frame sequence is isolated to generate an effective sound sequence. The effective sound sequence is input into the improved YAMNet model to obtain the corresponding frame-level embedding vectors. Online prototype clustering and self-attention boundary detection are performed on the frame-level embedding vectors to dynamically generate a timestamped speech prototype node sequence and a timestamped environment prototype node sequence. Based on the speech prototype node sequence and the environment prototype node sequence, a speech-environment coupling constraint graph is constructed. Dominant event competition screening and accompanying event attachment processing are performed on the speech-environment coupling constraint graph in combination with the loudness fluctuation trajectory of the effective sound sequence, generating a feedback-free auditory embryo containing a dominant event chain, an accompanying event chain and a loudness envelope chain. The user's hearing loss profile is acquired, frequency band weight redistribution, dynamic range constraint, and key speech component enhancement are performed on the feedback-free auditory embryo, and dual-stream reconstruction processing is performed on the dominant event chain and the accompanying event chain to generate the target output sound. Delay compensation processing is performed on the target output sound, and the delay-compensated sound signal is then input into the amplification unit for gain control before driving the receiver output.

2. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The external sound signals include speech signals, ambient sound signals, and background noise signals. The speech signals include continuous speech segments and non-continuous speech segments. The ambient sound signals include traffic sounds, mechanical sounds, animal sounds, and music sounds. The background noise signals include steady-state noise and non-steady-state noise. The output reference signals include receiver drive electrical signals, corresponding receiver diaphragm displacement change signals, and receiver output sound echo acoustic response signals within the hearing aid housing.

3. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The process of processing external sound signals and output reference signals to generate sound analysis frame sequences and reference driving frame sequences includes: External sound signals and output reference signals are synchronously sampled at a uniform sampling frequency to obtain corresponding discrete time sequences. The discrete time sequences are then divided into frames according to a preset frame length and frame shift to generate a continuous time frame sequence. Windowing is applied to each time frame to reduce inter-frame leakage, resulting in windowed time frames. Frequency banding is then performed on the windowed time frames to obtain multi-frequency band component sequences. Amplitude normalization is then performed on the multi-frequency band component sequences to generate a sound analysis frame sequence and the reference driving frame sequence.

4. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The process of identifying the feedback shadow component and isolating the corresponding feedback shadow component in the sound analysis frame sequence to generate an effective sound sequence includes: The sound analysis frame sequence and the reference driving frame sequence are aligned frame by frame according to the same time frame order. Within each time frame, the corresponding driving change segment, energy fluctuation segment and continuous state segment are extracted according to the frequency band partition to generate a reference projection segment sequence that corresponds to the sound analysis frame sequence frame by frame. Within each frequency band of each time frame, the sound analysis frame sequence is compared with the reference projection segment sequence to identify the sound components that simultaneously satisfy the following conditions: the component follows the reference drive, its frequency band position is continuously locked, and its energy fluctuations change synchronously. These sound components are marked as initial feedback shadow candidate components. Backflow closed-loop verification is performed on the initial feedback shadow candidate components along continuous time frames to identify candidate components that continue to rise after the reference drive is enhanced, show delayed attenuation after the reference drive is weakened, and still linger in the frequency band in speech discontinuity areas or environmental event switching areas. The candidate components that pass the backflow closed-loop verification are determined as feedback shadow components. The feedback shadow components are aggregated according to the time continuity relationship, frequency band expansion relationship and intensity transmission relationship to form feedback shadow trajectory segments. 
The center frequency band of each feedback shadow trajectory segment is used as the main isolation band and the adjacent traction frequency band is used as the accompanying isolation band to generate the corresponding feedback shadow isolation region. Hierarchical isolation processing is performed on the components of the sound analysis frame sequence that fall into the feedback shadow isolation region. Specifically, the feedback shadow component in the main isolation band is suppressed and stripped, and the feedback shadow component in the accompanying isolation band is weakened with edge preservation. The isolated time frames are then continuously spliced together to generate the effective sound sequence.

5. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The process of performing online prototype clustering and self-attention boundary detection on frame-level embedding vectors to dynamically generate timestamped speech prototype node sequences and environment prototype node sequences includes: The effective sound sequence is continuously segmented according to the preset frame length and frame shift. Log-Mel spectrum transformation, frequency band energy normalization and time ordering are performed on each time frame in sequence to generate frame-level acoustic feature maps that correspond one-to-one with the time frames. An improved YAMNet model is constructed, comprising an input shaping layer, a backflow residue suppression layer, a cross-frame prototype coding layer, and a self-attention boundary detection layer, wherein: The input shaping layer receives frame-level acoustic feature maps and performs channel unrolling processing to generate input feature blocks of uniform size; The backflow residue suppression layer is set before the original YAMNet backbone network. Based on the narrowband continuous enhancement relationship, frequency band tail relationship and abrupt loss relationship between adjacent time frames, the feature response corresponding to the residual backflow component is reduced to generate a clean feature map. The cross-frame prototype coding layer is set after the original YAMNet backbone network. It performs cross-frame convergence processing on the high-level embedding results of multiple consecutive time frames to generate a frame-level embedding vector containing the features of the current frame, the continuation features of the previous frame, and the transition features of the subsequent frame. The self-attention boundary detection layer is connected to the cross-frame prototype coding layer. 
It jointly calculates the correlation strength, class transfer strength and energy fluctuation synchronization between adjacent frame-level embedding vectors to generate boundary detection results. The frame-level acoustic feature map is input into the improved YAMNet model to obtain a sequence of frame-level embedding vectors arranged in time order; Perform online prototype clustering on frame-level embedding vector sequences: Read the frame-level embedding vectors in chronological order, establish prototype centers and perform similarity merging and updating. For frame-level embedding vectors that do not meet the merging conditions, establish new prototype centers and perform merging processing on prototype centers that are not chronologically continuous and have short durations to generate candidate prototype sequences. Based on the boundary detection results output by the self-attention boundary detection layer, boundary correction processing is performed on the candidate prototype sequence: Event boundary points are established at locations where boundary detection results indicate enhanced category transfer and synchronous changes in energy fluctuations. Event boundary points are eliminated at locations where a continuous relationship exists. Based on the dominant category of each candidate prototype, the candidate prototypes are divided into speech prototype nodes and environment prototype nodes. The start time, end time, and duration interval of each prototype node are recorded in chronological order to generate a sequence of speech prototype nodes and an environment prototype node sequence with timestamps.

6. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The generation of a feedback-free auditory embryo comprising a dominant event chain, an accompanying event chain, and a loudness envelope chain includes: Read the speech prototype node sequence and the environment prototype node sequence in chronological order, identify the temporal adjacency relationship, co-occurrence intensity relationship and coverage substitution relationship between the speech prototype nodes and the environment prototype nodes, and establish as candidate coupling node pairs those speech prototype nodes and environment prototype nodes whose start and end times are connected, whose continuous intervals overlap, or whose energy changes are synchronized. Based on each candidate coupling node pair, a speech-environment coupling constraint graph is constructed. In the speech-environment coupling constraint graph, adjacency constraint edges are established for nodes with temporal connection relationships, co-occurrence constraint edges are established for nodes with synchronization enhancement relationships, and substitution constraint edges are established for nodes with time period coverage and category replacement relationships, thus generating a time-expanded coupling constraint structure. Within each time-competition segment, dominant event competition screening is performed on the candidate nodes in the speech-environment coupling constraint graph: Candidate nodes that are continuous in their duration, have a stable energy framework, and maintain an event continuity relationship with the previous time-competition segment are identified as dominant event nodes. 
Candidate nodes that are not identified as dominant event nodes but have a temporal or energy-following relationship with the dominant event node are retained as accompanying event nodes; Connect the dominant event nodes in chronological order to generate a dominant event chain; Based on the start-end containment relationship, boundary fitting relationship and intensity dependency relationship between each accompanying event node and the corresponding dominant event node, the accompanying event node is attached to the front, back or middle section of the corresponding dominant event node to generate an accompanying event chain. Based on the loudness fluctuations, peak-to-valley transitions, and sustained stable intervals of the effective sound sequence in each time segment, the loudness change trajectory arranged in chronological order is extracted. The loudness change trajectory is then aligned with the dominant event chain and the accompanying event chain in a unified time. Loudness control markers are applied to the segments covered by the dominant event chain, loudness yield markers are applied to the segments connected to the accompanying event chain, and loudness release markers are applied to the segments without event gaps. This generates a feedback-free auditory embryo containing the dominant event chain, the accompanying event chain, and the loudness envelope chain.

7. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The process of performing dual-stream reconstruction processing on the dominant event chain and the accompanying event chain to generate the target output sound includes: Obtain a user's hearing loss profile, map the hearing loss profile to each time segment and frequency band region of the feedback-free auditory embryo, and generate hearing loss constraint distribution results; Based on the hearing loss constraint distribution results, priority enhancement marking is performed on the speech-related frequency bands in the time segment where the dominant event chain is located, secondary preservation marking is performed on the environment-related frequency bands in the time segment where the accompanying event chain is located, and energy attenuation marking is performed on the no-event segment to generate zoning control results; Based on the comfortable loudness range, the loudness envelope chain is subjected to segmented adjustment processing. Compression processing is performed in the time segment where the loudness is higher than the upper limit of comfort, enhancement processing is performed in the time segment where the loudness is lower than the lower limit of perception, and the original trend of change is maintained in the time segment where the loudness is within the comfortable range, thus generating the adjusted loudness envelope chain. Based on the zoning control results and the adjusted loudness envelope chain, speech-dominant reconstruction processing is performed on the dominant event chain. The continuity of the speech structure is restored according to the temporal order of the speech prototype nodes. Directional enhancement is performed on the key frequency bands of the speech. Smooth transition processing is performed at the speech boundaries to generate a speech reconstruction stream. 
Environmental dependency reconstruction processing is performed on the accompanying event chain. Based on the attachment relationship between the accompanying event nodes and the dominant event nodes, the environmental events are subjected to continuation compensation, boundary fitting and energy following adjustment in the corresponding time segment to generate an environmental reconstruction stream. The speech reconstruction stream and the environment reconstruction stream are fused together in a unified time sequence. The speech reconstruction stream is the main output in the dominant event segment, the environment reconstruction stream is superimposed in the accompanying event segment, and a smooth transition process is performed in the no-event segment to generate the target output sound.

8. The sound recognition and synthesis method for hearing aids according to claim 1, characterized in that, The process of performing delay compensation processing on the target output sound, and then inputting the delay-compensated sound signal into the amplification unit for gain control before driving the receiver output, includes: The processing time of each stage, including feedback shadow isolation processing, prototype node construction, feedback-free auditory embryo generation, and dual-stream reconstruction processing, is recorded and accumulated in chronological order to generate the total processing delay corresponding to the current target output sound. The total processing delay is compared with the preset real-time output threshold. When the total processing delay is greater than the real-time output threshold, the delay compensation trigger segment and the corresponding time range are determined. Within the delay compensation trigger segment, a compensation segment is generated based on the dominant event chain, accompanying event chain, and loudness envelope chain of the previous stable time segment. The compensation segment is inserted at the corresponding position of the target output sound, and amplitude continuity processing and frequency band smoothing processing are performed on the junction regions to generate a continuous output sound. The continuous output sound is input to the amplification unit, which performs frequency band gain adjustment according to the preset gain control parameters and sends the gain-adjusted sound signal to the receiver for output.