Method, apparatus, and computer program for neural network hearing aids
A neural network-based audio processing system in hearing aids addresses the challenge of separating speech from background noise by dynamically adjusting sound filtering, improving speech intelligibility in noisy environments through a dual-path signal chain.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- FORTELL RESEARCH INC
- Filing Date
- 2022-01-14
- Publication Date
- 2026-06-26
AI Technical Summary
Conventional hearing aids struggle to effectively separate speech from background noise in noisy environments, especially for individuals with hearing loss, due to limitations in computational power and the inability of existing algorithms to adapt to diverse acoustic environments.
Integration of a neural network-based audio processing system within a hearing aid that selectively engages a dual-path signal chain, utilizing both digital signal processing and neural networks to dynamically adjust sound filtering based on user input, environmental conditions, and acoustic analysis.
Enhances speech intelligibility in noisy environments by providing fine-grained adjustments to both magnitude and phase of incoming sounds, reducing background noise while maintaining a desirable user experience and minimizing latency.
Abstract
Description
Technical Field
[0001] The present disclosure generally relates to methods, apparatuses, and computer programs for neural network-enabled auditory devices. In some embodiments, the present disclosure provides a method, a computer program, and an apparatus for improving a user's speech comprehension in real-time conversation by processing speech through a neural network included in an auditory device such as a headset or a hearing aid.
[0002] The ease of conversation among people in real-world situations is often hampered by background noise. When the background noise is high relative to the utterance, the utterance may be drowned out by the background noise. Bars, restaurants, concerts, etc. are examples of environments that are generally difficult for conversation. In particularly difficult "signal-to-noise" ratios, even normal-hearing people struggle, but these environments are especially difficult for people with hearing loss.
[0003] Hearing loss or impairment makes it difficult to hear, recognize, and understand sounds. Hearing loss can occur regardless of age and can be due to congenital defects, aging, or other causes. The most common type of hearing loss is sensorineural hearing loss. Sensorineural hearing loss is a permanent hearing loss that occurs when there is damage to small hair-like cells called stereocilia in the inner ear or to the auditory nerve itself, preventing or weakening the transmission of nerve signals to the brain. Sensorineural hearing loss typically impairs both volume sensitivity (the ability to hear soft sounds) and frequency selectivity (the ability to distinguish clear sounds from noise). This second impairment has a particularly severe impact on the intelligibility of conversation in noisy environments. Even when the speech is well above the hearing threshold, people with hearing loss experience a decline in their ability to follow a conversation in the presence of background noise compared to normal-hearing people.
[0004] Conventional hearing aids amplify the sound to compensate for reduced sound sensitivity. While this is helpful in quiet environments, the effect of amplification is limited in noisy environments, making it difficult for the hearing-impaired to selectively hear the sounds they want to hear. Conventional hearing aids use various techniques to improve the wearer's signal-to-noise ratio, such as directional microphones, beamforming technology, and post-filters. However, these methods are not particularly effective because they often rely on incorrect assumptions, such as the speaker's position and the statistical characteristics of signals in different frequency bands. As a result, even with the latest hearing aids, the hearing-impaired still have difficulty following conversations in noisy environments.
[0005] Neural networks provide a means of treating sounds differently based on the semantics of sound. While such algorithms can be used to separate speech from background noise in real time, placing powerful algorithms like neural networks in the signal path has previously been considered impossible in hearing aids and headphones. Hearing aids have limited battery capacity for computing such algorithms, and such algorithms have struggled to function adequately in the various environments encountered in the real world. The disclosed embodiments address these and other shortcomings of conventional hearing aids.
Brief Description of the Drawings
[0006] The disclosed embodiments will be described in relation to the following exemplary and non-limiting embodiments, in which similar elements are numbered similarly.
[0007]
Figure 1 is a system diagram according to one embodiment of the present disclosure.
Figure 2 schematically shows an exemplary front-end receiving unit according to one embodiment of the present disclosure.
Figure 3A is a schematic diagram of an exemplary system according to one embodiment of the present disclosure.
Figure 3B shows the speech volume, background noise level control, and mode switch.
Figure 4 shows a signal processing system according to another embodiment of the present disclosure.
Figure 5A shows the interaction between user preference and the nonlinear gain applied by an exemplary NNE according to one embodiment of the present disclosure.
Figure 5B is an illustrative diagram of an exemplary NNE circuit logic implemented according to one embodiment of the present disclosure.
Figure 5C is a schematic diagram illustrating an exemplary architecture for engaging an NNE circuit according to one embodiment of the present disclosure.
Figure 6 is a flowchart illustrating an exemplary activation / deactivation of an NNE circuit according to one embodiment of the present disclosure.
Figure 7 is a block diagram of an SOC package according to one embodiment.
Figure 8 is a block diagram of an exemplary auxiliary processing system that may be used in relation to the disclosed principle.
Figure 9 is a generalized diagram of a machine learning software stack according to one or more embodiments.
Figure 10 shows the training and deployment of a deep neural network according to one or more embodiments.
Modes for Carrying Out the Invention
[0008] The following description and exemplary embodiments are provided to give a thorough understanding of the various embodiments. However, the various embodiments may be carried out without specific details. In other examples, well-known methods, procedures, components, and circuits are not described in detail so as not to obscure certain embodiments. Furthermore, various aspects of the embodiments may be carried out by various means, such as integrated semiconductor circuits ("hardware"), computer-readable instructions organized into one or more programs ("software"), or any combination of hardware and software. For the purposes of this disclosure, reference to "logic" means either hardware, software, firmware, or any combination thereof.
[0009] The disclosed embodiments generally relate to the improvement of speech data in in-ear systems such as hearing aids and headphones using neural networks. Neural network-based speech improvement is also being deployed in other applications such as video conferencing and other telecommunication media. In many of these applications, these algorithms are used to reduce background noise and make it easier for the user to hear the target sound (usually the speech of the person speaking to the user). Neural network-based speech improvement has been considered too difficult for face-to-face applications where the user is in the same location as the person or thing they are trying to hear.
[0010] One of the main reasons neural network-based speech improvement has been considered impractical for face-to-face communication is the complexity of the task the algorithms face. In video communication, acceptable latency is relatively long (>50 milliseconds), the speaker is usually close to the microphone (resulting in a relatively high signal-to-noise ratio (SNR) for the signal received by the microphone), and ambient noise is usually limited compared with what is encountered in face-to-face scenarios. Face-to-face scenarios are far less forgiving.
[0011] Human hearing is highly sensitive to delays caused by signal processing from devices worn on the ear. If the delay is too large, both the original sound and the amplified sound played through the earphone arrive at the ear at different times, resulting in an echo-like perception. Furthermore, delays can disrupt the brain's processing of incoming sounds by creating a disconnect between visual cues (such as lip movements) and the arrival of associated sounds. Hearing aids are a prime example of an ear-worn device for face-to-face communication. While the optimal latency for such devices is 10 milliseconds (ms) or less, longer latency of up to 32 milliseconds may be acceptable in some situations.
[0012] In such face-to-face scenarios, the nature of background noise is highly variable, and signals with much lower signal-to-noise ratios (SNRs) are also generated. Social environments such as bars, restaurants, and outdoor venues often require conversations to take place in the presence of overwhelming background noise. Similarly, general types of environments are far more diverse than typical teleconferencing. Therefore, creating a neural network that is robust to such situations is more difficult.
[0013] Neural networks offer a fundamentally different method of sound filtering compared to conventional hearing aids. The main difference lies in the power and flexibility with which they execute auditory algorithms. Conventional digital signal processing systems required manual adjustment of the parameters of the auditory equations. Neural networks, on the other hand, can discover optimal parameters through training. This is a computational process in which the network learns to solve tasks by adjusting parameters, gradually improving its performance. While humans can optimally adjust 100 parameters, neural networks can learn millions of parameters.
[0014] Digital signal processing in conventional hearing devices typically applies a series of filters and gains (or weights, interchangeably) to adjust the magnitude of signals at different frequencies. In conventional hearing aids, these gains specifically compensate for the user's lost frequency sensitivity. These algorithms typically do not adjust the phase of the incoming signal. Neural networks have the computational power to robustly generate fine-grained adjustments to both the magnitude and phase of the incoming signal, with excellent granularity in both the time and frequency domains.
[0015] A challenge associated with incorporating neural network algorithms is the computational cost. There is a well-established positive correlation between network size and network performance, observed across different domains of deep learning. To obtain the fine-grained responses necessary to robustly handle various acoustic environments, neural networks have thousands of parameters and require millions, or even billions, of operations per second. The size of a feasible network is limited by the computational power of the hearing device's processor. Hearing aids must be compact and suitable for long-term use to ensure comfort and convenience for the wearer. Ideally, hearing aids should be integrated into a single device rather than spanning multiple devices (e.g., a hearing device and a smart device).
[0016] Such neural network algorithms are also difficult to incorporate in a way that delivers an optimal user experience. Even if a hearing aid can isolate sound from a single source, its operation is not always desirable. For example, ambient noise may be important to a pedestrian. Even if speech isolation is the primary objective, a certain level of ambient noise may be desirable. For instance, someone in a restaurant might prefer to allow at least low levels of ambient noise to pass through, so as not to hear only speech and thus be able to perceive the atmosphere. Thus, achieving a desirable user experience requires the device to leverage the power of neural networks and use their output intelligently.
[0017] Another problem for creating a good user experience is to handle model errors. Even a well-trained large-scale neural network does not function perfectly, and in certain environments, it may not be able to distinguish one sound source from another. In such scenarios, the device needs to gracefully fail in a way that provides a comfortable auditory experience for the user. As an example, a conversation interrupted by a loud vehicle may generate unintelligible white noise for the listener if the model output is played back without considering model errors. Therefore, solutions are needed to monitor the output and performance of the model and dynamically adjust it to create an appropriate user experience.
[0018] As used herein, a hearing device generally refers to a hearing aid, an active ear protection device, or other audio processing devices, which can be configured to improve, amplify, and / or protect the user's hearing ability. A hearing aid can be implemented in one or two earpieces. Such devices typically receive an acoustic signal from the user's surroundings, apply possible modifications to the audio signal to generate a corresponding audio signal, and provide the modified audio signal to the user as an audible signal. The modifications can be implemented in one or both hearing devices corresponding to each ear of the user. In certain embodiments, the hearing device can include earphones (individually or as a pair), a headset, or other external devices adapted to provide an audible acoustic signal to the outer ear of the user. The transmitted acoustic signal may be finely adjusted through one or more controls to optimally transmit mechanical vibrations to the user's auditory system.
[0019] In one embodiment, the present disclosure relates to a hearing aid that can utilize neural network-based audio improvement in a signal processing chain. As used herein, a neural network within a signal processing chain includes a system in which the neural network is integrated with an in-ear hearing device. In some embodiments, the hearing device particularly includes a neural network integrated with an auxiliary circuit on an integrated circuit (IC). The IC may include a system-on-chip (SoC).
[0020] In some embodiments, the exemplary device is configured to, among other things, amplify all ambient sounds, filter incoming sounds down to speech only (removing background noise), filter incoming sounds down to one or more target speakers, switch between these modes in response to user input, adjust the volume of background noise in response to user input, change what kinds of sounds are considered "noise", adjust the output of the hearing aid in all modes to conform to the user's auditory profile (comprising frequency sensitivity and dynamic range), and the like.
[0021] In one embodiment, a neural network is incorporated into the hearing aid. The hearing aid can comprise one or more processors optimized to handle the workload of the neural network. The one or more processors can be selectively engaged based on the operating mode of the device. Some embodiments of the present invention address these problems by introducing a dual-path signal chain that can selectively engage one or more of a neural network and a digital signal processor. By creating a dual signal processing path, a hearing aid user can enjoy the benefits of neural network-based improvements when the engagement of the neural network is necessary and desirable. These and other embodiments of the present disclosure are discussed in connection with the following exemplary embodiments.
[0022] Figure 1 is a system diagram according to one embodiment of the present disclosure. System 100 may be implemented in a hearing aid. In an exemplary embodiment, system 100 is implemented in one or both earpieces of the hearing device. System 100 may be implemented as an integrated circuit (IC) or a system-on-chip (SoC).
[0023] System 100 receives an incoming signal 110 and provides an output signal 190. The incoming signal 110 may consist of acoustic signals emitted from multiple sound sources. The acoustic sources emitting the acoustic signal 110 may include ambient noise, human voices, alarm sounds, etc. Each acoustic source may emit sound at a different volume relative to other sound sources. Therefore, the incoming signal 110 may be a collection of different sounds that reach System 100 at different volumes.
[0024] The front-end receiver 120 may comprise one or more modules configured to convert the incoming acoustic signal 110 into a digital signal using an analog-to-digital converter (ADC). The front-end receiver 120 can also receive signals from one or more microphones in one or more earphones. In one embodiment, the signal received by one earphone is transmitted to the other earphone for signal processing using a low-latency protocol such as near-field magnetic induction. The output of the front-end receiver 120 is a digital signal 125 representing one or more received audio streams. Note that Figure 1 shows an exemplary embodiment in which the front-end 120 and the control unit 130 are separate components. In another embodiment, the front-end 120 may be omitted, and one or more of its functions may be performed by the control unit 130.
[0025] In the embodiment shown in Figure 1, the NNE circuit 150 is interposed between the control unit 130 and the DSP 140. Therefore, the NNE circuit 150 is in the direct signal processing path. This means that, when this signal path is employed, the speech is processed and improved through the neural network before the same speech is played back. This is in contrast to methods in which a neural network is employed outside the direct signal chain to adjust the parameters of the direct signal chain. Those methods use the neural network output to improve speech received afterward, rather than the same speech that was processed through the neural network. In one embodiment, the NNE circuit is configured to selectively apply a complex ratio mask to the incoming signal from the front-end receiver to obtain multiple components, each of which corresponds to a speech class or an individual speaker, and the NNE circuit is further configured to combine these components into an output signal, with the volume of each component set to obtain a user-controlled signal-to-noise ratio.
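The masking and remixing step of paragraph [0025] can be illustrated with a minimal sketch. It assumes a two-component split (speech and noise), masks that sum to one, and a gain rule that leaves speech untouched and scales only the noise. None of these choices, nor the function names, come from the disclosure; they are hypothetical and serve only to show how a complex mask alters both magnitude and phase and how a requested signal-to-noise ratio can be met.

```python
import math

def apply_mask(spectrum, mask):
    """Multiply each complex time-frequency bin by its complex mask value.

    Because the mask is complex, it changes both magnitude and phase.
    """
    return [s * m for s, m in zip(spectrum, mask)]

def power(bins):
    """Sum of squared magnitudes across bins."""
    return sum(abs(b) ** 2 for b in bins)

def remix_at_snr(speech, noise, target_snr_db):
    """Scale the noise component so the speech-to-noise power ratio hits the target.

    Hypothetical gain rule: speech is passed unchanged, only noise is scaled.
    """
    p_s, p_n = power(speech), power(noise)
    if p_n == 0.0:
        return list(speech)
    wanted_p_n = p_s / (10 ** (target_snr_db / 10))
    g = math.sqrt(wanted_p_n / p_n)
    return [s + g * n for s, n in zip(speech, noise)]

# Four made-up time-frequency bins of a mixture and a made-up speech mask.
mixture = [1 + 1j, 2 - 1j, 0.5 + 0j, -1 + 2j]
speech_mask = [0.8 + 0.1j, 0.2 - 0.3j, 0.9 + 0j, 0.1 + 0.4j]
noise_mask = [1 - m for m in speech_mask]  # assumption: masks sum to one

speech = apply_mask(mixture, speech_mask)
noise = apply_mask(mixture, noise_mask)
output = remix_at_snr(speech, noise, target_snr_db=12.0)
```

In this sketch the user-facing control would simply be the `target_snr_db` argument; a real device would presumably apply per-component volumes frame by frame.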
[0026] The control unit 130 receives a digital signal 125 from the front-end receiver 120. The control unit 130 may include one or more processor circuits (as specified herein, processors), memory circuits, and other electronic and software components configured to (a) perform digital signal processing operations necessary to prepare the signal for processing by the neural network engine 150 or the DSP engine 140, and (b) determine the next step in the processing chain from a plurality of options. In one embodiment of the present disclosure, the control unit 130 performs decision logic to determine whether to proceed with signal processing through one or both of the DSP unit 140 and the neural network engine (NNE) circuit 150. Note that the front-end 120 includes one or more processors that transform the incoming signal, while the control unit 130 includes one or more processors that perform the exemplary tasks disclosed herein, and these functions may be combined and performed in the control unit 130.
[0027] The DSP 140 may be configured to apply a set of filters to the incoming audio components. Each filter can isolate the incoming signal within a desired frequency range and apply a non-linear, time-varying gain to each filtered signal. The gain value may be set to achieve dynamic range compression or to identify steady-state background noise. The DSP 140 may then recombine the filtered, gain-adjusted signals to provide an output signal.
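The per-band gain stage of paragraph [0027] might be sketched as follows. The sketch assumes the band-splitting filters have already run and that a level estimate in dB is available for each band. The compression threshold, ratio, and linear gain are hypothetical values, not parameters taken from the disclosure.

```python
def compression_gain_db(level_db, threshold_db=50.0, ratio=3.0, linear_gain_db=20.0):
    """Level-dependent gain for one band (simple dynamic range compression).

    Below the threshold the full linear gain applies; above it, output level
    grows by only 1/ratio dB per input dB, so the gain shrinks.
    """
    if level_db <= threshold_db:
        return linear_gain_db
    excess = level_db - threshold_db
    return linear_gain_db - excess * (1.0 - 1.0 / ratio)

def process_bands(band_signals, band_levels_db):
    """Apply each band's gain and sum the bands back into one signal."""
    n = len(band_signals[0])
    out = [0.0] * n
    for samples, level_db in zip(band_signals, band_levels_db):
        g = 10 ** (compression_gain_db(level_db) / 20.0)  # dB to linear amplitude
        for i, x in enumerate(samples):
            out[i] += g * x
    return out
```

Recomputing `band_levels_db` for every frame is what would make the gain time-varying; the dependence of gain on level is what makes it non-linear.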
[0028] As described above, in one embodiment, the control unit performs digital signal processing operations to prepare the signal for processing by one or both of the DSP 140 and NNE 150. The NNE 150 and DSP 140 may accept a time-frequency domain signal (e.g., signal 110) as input, and the control unit 130 may perform a short-time Fourier transform (STFT) on the incoming signal before passing it to the NNE 150 or the DSP 140. In another example, the control unit 130 can perform beamforming of signals received by different microphones to improve sound coming from a particular direction.
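For readers unfamiliar with the short-time Fourier transform mentioned in paragraph [0028], a minimal stdlib-only sketch is given here. The frame length, hop size, and Hann window are assumptions chosen for illustration, and the naive DFT is used only to keep the example self-contained; a real device would use an optimized FFT.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(frame):
    """Naive O(n^2) DFT of a real frame, non-negative frequency bins only."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]

def stft(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of complex frequency bins."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append(dft([w * x for w, x in zip(window, chunk)]))
    return frames

# A 1 kHz tone sampled at 16 kHz. Bin spacing is 16000 / 64 = 250 Hz,
# so the tone falls exactly on bin 4.
fs = 16000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
spec = stft(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: abs(spec[0][k]))
```

Each entry of `spec` is one time slice of complex bins, which is the kind of time-frequency representation a mask-based network would operate on.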
[0029] In some embodiments, the control unit 130 continuously determines the next step in the signal chain for processing the received audio data. For example, the control unit 130 activates the NNE 150 based on one or more of user-controlled criteria, user-independent criteria, user clinical criteria, accelerometer data, location information, stored data, and calculated metrics characterizing the audio environment, such as the signal-to-noise ratio (SNR). If the NNE 150 is not activated, the control unit 130 instead passes signal 135 directly to the DSP 140. In some embodiments, the control unit 130 may pass data to both the NNE 150 and the DSP 140 simultaneously, as indicated by arrow 135.
[0030] User-controlled criteria (interchangeably, user-controlled logic or user-defined criteria) may include user input, such as the selection of an operating mode via an application on the user's smartphone or input on the device (e.g., by tapping the device). For example, if a user is in a restaurant, they can change the operating mode to noise cancellation / speech separation by making an appropriate selection on their smartphone. User-controlled criteria may also include a set of user-defined settings or preferences that the user inputs through an application (app) or that the device learns over time. For example, the user-controlled logic may include user preferences regarding the sounds the user hears (e.g., a new parent might always want to amplify a baby's crying, or a dog owner might always want to amplify barking), or the user's general tolerance for background noise. User clinical criteria may include a clinically relevant auditory profile, such as the user's general degree of hearing loss or their ability to understand speech in noisy environments.
[0031] User-controlled logic can be used in relation to or independently of user-independent criteria (or logic). User-independent logic can consider variables independent of the user. For example, user-independent logic may consider the available power level of the hearing aid, the time of day, or the expected duration of NNE operation (as a function of the expected NNE operation demand).
[0032] In some embodiments, acceleration data captured by sensors within the device may assist the control unit 130 in determining whether to direct the output signal 135 of the control unit to one or both of the DSP 140 and NNE 150. Motion or acceleration information may guide the control unit 130 in determining whether the user is moving or sitting. Acceleration data may be used in combination with other information or may be overridden by other data. Similarly, data from acceleration-capturing sensors may be provided to the neural network as information for inference.
[0033] In other embodiments, the user's location may be used by the control unit 130 to determine whether to engage one or both of the DSP 140 and the NNE circuit 150. Certain locations may require the activation of the NNE circuit 150. For example, if the user's location exhibits high ambient noise (e.g., the user is walking in a park or attending a concert) and there is no direct conversation, the control unit 130 may activate only the DSP 140. On the other hand, if the user's location suggests the user is on the move (e.g., using a car or train) and other indicators suggest human communication, the NNE circuit 150 may be activated to amplify the human voice more than the ambient noise.
[0034] The stored data may also be a factor in determining the processing path for the control unit 130. The stored data may include important characteristics of user-specific sounds, voices, preferences, or commands. System 100 may optionally include a storage circuit 132 for storing data representing speech that, when detected, can function as input to the logic of the control unit. The storage circuit 132 may be local, as shown in the figure, or remote from the auditory device. The stored data may include a so-called voice registry of known conversation partners. The voice registry may provide information necessary for a neural network to detect a particular voice and isolate it from background noise. The voice registry may include classification embeddings of each registered voice computed by a neural network not on the device (i.e., a large NNE), which are described herein as voice signatures, and a neural network on the device (i.e., a local NNE) may be configured to accept voice signatures as input for isolating utterances that match the signatures.
[0035] In addition to voice signatures, system 100 may store different preferences for each voice in a memory circuit (registry) 132 so that different speakers can elicit different actions from the device. NNE 150 can then implement various algorithms to determine which voice to amplify relative to other voices.
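One way to picture the voice registry of paragraphs [0034] and [0035] is as a small table of signatures with per-voice preferences. The sketch below is hypothetical throughout: the field names, the three-dimensional embeddings, the cosine-similarity lookup, and the similarity cutoff are assumptions made for illustration. The disclosure states only that signatures are embeddings computed off-device and that the on-device network accepts them as input.

```python
import math
from dataclasses import dataclass

@dataclass
class VoiceEntry:
    name: str
    signature: list   # embedding computed off-device (values below are made up)
    gain_db: float    # hypothetical per-voice preference

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(registry, embedding, min_similarity=0.8):
    """Return the closest registered voice, or None if nothing is close enough."""
    best = max(registry, key=lambda e: cosine(e.signature, embedding), default=None)
    if best is None or cosine(best.signature, embedding) < min_similarity:
        return None
    return best

registry = [
    VoiceEntry("partner", [0.9, 0.1, 0.0], gain_db=6.0),
    VoiceEntry("colleague", [0.0, 0.8, 0.6], gain_db=0.0),
]
```

A matched entry's `gain_db` stands in for the "different actions" a given speaker could elicit; an unmatched voice would fall back to default behavior.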
[0036] The control unit 130 can execute algorithmic logic to select a processing path. The control unit 130 can consider the detected SNR and determine whether to engage one or both of the DSP 140 and NNE 150. In one embodiment, the control unit 130 compares the detected SNR value to a threshold to determine which processing path to initiate. The threshold may be empirically determined, user-independent, user-controlled, or a combination of these. The control unit 130 may also consider other user preferences or parameters when determining the threshold, as described above.
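The threshold comparison of paragraph [0036] can be sketched as a small decision function. The direction of the comparison (engaging the NNE when the SNR is below the threshold), the preference fields, and the numeric offsets are assumptions for illustration; the disclosure says only that the detected SNR is compared to a threshold that may reflect empirical, user-independent, and user-controlled inputs.

```python
from dataclasses import dataclass
from enum import Enum

class Path(Enum):
    DSP_ONLY = "dsp_only"
    NNE_THEN_DSP = "nne_then_dsp"

@dataclass
class Preferences:
    base_threshold_db: float = 10.0   # hypothetical empirically chosen default
    noise_tolerance_db: float = 0.0   # hypothetical user-controlled offset
    battery_saver: bool = False       # hypothetical user-independent input

def effective_threshold(prefs):
    """Combine the default threshold with user and device adjustments."""
    t = prefs.base_threshold_db - prefs.noise_tolerance_db
    if prefs.battery_saver:
        t -= 5.0  # engage the NNE less readily when power is scarce
    return t

def select_path(snr_db, prefs):
    """Assumed rule: route through the NNE only when the SNR is below threshold."""
    if snr_db < effective_threshold(prefs):
        return Path.NNE_THEN_DSP
    return Path.DSP_ONLY
```

For example, `select_path(3.0, Preferences())` selects the NNE path under these assumed defaults, while the same call with an SNR of 15 dB stays on the DSP-only path.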
[0037] In another embodiment, the control unit 130 may compute specific metrics to characterize the incoming speech as input for determining the subsequent processing path. These metrics can be computed based on the received speech signal. For example, the control unit 130 may detect periods of silence, in which case only the DSP 140 needs to be engaged because silence does not require neural network improvement. In a more complex example, the control unit 130 may include a voice activity detector (VAD) 134 to determine the processing path for speech separation modes. In some embodiments, the VAD may be a much smaller (i.e., much less computationally intensive) neural network within the control unit.
[0038] In an exemplary embodiment, the control unit 130 may receive the output of the NNE 150 for recently processed speech as input to its calculations, as indicated by arrow 151. The NNE 150 may be configured to isolate the speech of interest in the presence of background noise and provide the input necessary to robustly estimate the SNR. The control unit 130 can use this capability to detect whether the SNR of the incoming signal is high or low enough to affect the processing path. In yet another example, the output of the NNE 150 can be used as the basis for a more robust VAD 134. Speech detection in the presence of noise is computationally intensive. By leveraging the output of the NNE 150, the system 100 can perform this task with minimal computational overhead.
[0039] When the control unit 130 utilizes the NNE output 151, it can only use the output 151 to influence the signal path of subsequently received audio. When a given audio sample is received by the control unit, the output of the NNE 150 for that sample has not yet been calculated and therefore cannot be used to influence the control unit's decision regarding that sample. However, the audio environment from less than one second ago can predict the current environment, so the NNE output of previously received audio can be used.
[0040] If the NNE 150 is activated, using the NNE output 151 in the control unit does not incur any additional computational cost. In one embodiment, the control unit 130 can activate the NNE 150 for support calculations even in modes where the NNE 150 is not on the selected signal path. In such modes, the incoming voice signal is passed directly from the control unit 130 to the DSP 140, but data (i.e., voice clips) is additionally passed to the NNE 150 at less frequent intervals for calculations. These calculations provide an estimate of the signal-to-noise ratio of the ambient environment or detect speech in the presence of noise in virtually real time. In an exemplary implementation, the control unit 130 may transmit a 16 ms data window once per second for VAD 134 detection at the NNE 150. In some embodiments, the NNE 150 may be used for VAD instead of the control unit 130. In another embodiment, the control unit 130 may dynamically adjust the duration of the voice clips or the frequency of communicating the voice clips as a function of the estimated probability of useful calculations. For example, if the SNR has fluctuated significantly across recent requests, the control unit 130 can request additional NNE calculations at more frequent intervals.
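The adaptive probing described in paragraph [0040] can be sketched with two small helpers. The 16 ms window and once-per-second default come from the text; the swing threshold, the shortened interval, and the rule for detecting "significant" fluctuation are hypothetical.

```python
def next_probe_interval(recent_snrs_db, base_interval_s=1.0,
                        min_interval_s=0.25, swing_threshold_db=6.0):
    """Shorten the gap between NNE probes when recent SNR estimates swing widely.

    Assumed rule: compare the spread of recent estimates against a fixed threshold.
    """
    if len(recent_snrs_db) < 2:
        return base_interval_s
    swing = max(recent_snrs_db) - min(recent_snrs_db)
    if swing >= swing_threshold_db:
        return min_interval_s
    return base_interval_s

def duty_cycle(window_s, interval_s):
    """Fraction of incoming audio that is sent to the NNE for support calculations."""
    return window_s / interval_s
```

With the figures from the text, `duty_cycle(0.016, 1.0)` gives 0.016, meaning the NNE would process about 1.6% of the audio in this support mode.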
[0041] The NNE 150 may include one or more actual and virtual circuits for receiving the control unit output signal 135 and providing an improved digital signal 155. In exemplary embodiments, the NNE 150 improves the signal by generating a set of intermediate signals using a neural network algorithm (NN model). Each intermediate signal is representative of one or more original sound sources that make up the original signal. For example, the incoming signal 110 may consist of two utterances, an alarm, and other background noise. In some embodiments, the NN model running on the NNE 150 can generate a first intermediate signal representing the utterances and a second intermediate signal representing the background noise. The NNE 150 may also separate one speaker from the other. The NNE 150 may separate the alarm from the remaining background noise so that the user can be sure the alarm is heard even when the noise cancellation mode is activated. Different situations may require different intermediate signals, and different embodiments of the invention may include different neural networks with different capabilities that are best suited to the wearer's needs. In one embodiment, a remote (off-chip) NNE can enhance the capabilities of a local (on-chip) NNE.
[0042] As will be discussed in relation to Figures 7-10, a neural network, also called an artificial neural network (ANN) or simulated neural network (SNN) in the case of artificial neurons, is an interconnected group of natural or artificial neurons that uses mathematical or computational models for information processing based on a so-called connectionist approach to computation. In most cases, an ANN is an adaptive system, changing its structure based on external or internal information flowing through the network. Neural networks are nonlinear statistical data modeling or decision-making tools. Such systems may be used to model complex relationships between inputs and outputs and to find patterns in data. The usefulness of artificial neural network models lies in their ability to infer and utilize functions from observations. This is achieved by training the model. The model takes representative data as input and iteratively changes the weights of parameters in the network to optimize a given function. In supervised learning, the model operates on labeled datasets, while in unsupervised learning, the model operates on unlabeled data. These methods can be used in combination. An illustrative explanation of an ANN or NNE is shown in Figure 10.
[0043] According to some of the disclosed principles, a neural network (which may be implemented via a neural network engine) is trained to separate one or more sound sources. In an exemplary embodiment, this is done by supervised learning. As input data, the model receives pairs of audio clips, one of which is the target and the other is a mixture containing both the target signal and other signals. Training data includes clips of speakers talking without background noise as the target; these clips can then be synthetically mixed with recordings of background noise to form the mixture clip. Through training, the model learns to generate a complex mask for each pair of clips, and when this mask is applied to the mixture clip, it returns, on average, the audio that best approximates the target clip as measured by a loss function (training aims to minimize the loss across the training dataset). By devising a model that works well with a variety of different clips representing the task at hand, the model learns a function that can generalize to audio data it has never seen before. When applied to data with speaker utterances and background noise, the model can estimate a signal containing only the utterance content, or at least substantially the utterance content.
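The construction of a supervised training pair described above can be sketched as follows. This is a minimal illustration only: the function names (`power`, `snr_db`, `mix_at_snr`, `mse`), the toy signals, and the choice of mean squared error as the loss are assumptions made for the example and are not taken from the disclosure.

```python
import math

def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from mean powers."""
    return 10.0 * math.log10(power(signal) / power(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the speech/noise power ratio equals
    target_snr_db, then add it to the clean speech.
    Returns (mixture, scaled_noise)."""
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

def mse(estimate, target):
    """Mean squared error between an estimate and the target clip."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

# Toy stand-ins for a clean speech clip and a background-noise recording.
speech = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
noise = [((n * 37) % 11 - 5) / 5 for n in range(64)]

# One (target, mixture) training pair at 6 dB SNR.
mixture, scaled_noise = mix_at_snr(speech, noise, target_snr_db=6.0)

# Loss of the trivial "do nothing" estimate; a trained model should do better.
baseline_loss = mse(mixture, speech)
```

The baseline loss equals the power of the added noise, which gives a simple reference point against which a model's estimate of the target clip can be compared.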
[0044] To generate a model suitable for face-to-face processing of speech, the model may be trained to produce an output based on an input representing a small sample of speech. The model can process speech sequentially, receiving and processing each sample (or audio clip) so that it is ready for playback before playback of the preceding sample finishes.
[0045] As an example, this model operates with 4ms audio samples. At t=0, the preprocessor begins receiving data from the microphone. At t=4ms, the control unit (e.g., control unit 130), having received the entire sample, passes the sample to the NNE 150 for processing. The NNE then calculates an estimate of the 4ms audio sample (clip) and passes the intermediate signal to the next step in the signal chain. Once the remaining signal processing is complete, playback to the user begins. At t=8ms, the NNE 150 receives the next 4ms sample clip from control unit 130. By the time the playback of the first sample is complete (4ms after playback begins), the next 4ms sample clip is ready for playback without any gaps. In the case of a recurrent neural network, this means that the calculation of each sample must be completed in less than the sample length, because the calculation of subsequent samples depends on updated activations from the current sample. Other model architectures can circumvent this constraint through parallelization (at a high computational cost).
[0046] In this example, the model operates with a 4ms audio clip sample. The sample length can be increased or decreased depending on various parameters. For example, the sample length could be as short as less than 1ms or as long as 32ms of data. The longer the sample length, the longer the model has to wait to provide a response, and therefore the longer the delay the user experiences. If the model waits a full second for the audio data, it may provide excellent background noise suppression, but the user may experience an unbearable playback delay. In some embodiments, the model can have a look-ahead function, thereby allowing the model to wait to receive more audio before processing, thereby increasing the information available to the model. Extending the above example, the model might wait until t=8ms (given a 4ms look-ahead) before starting to process the first 4ms of audio, which may improve the model's performance but introduces additional latency. In some embodiments, the total latency is kept below 32ms (or below 20ms) to prevent unpleasant echoes for the user.
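The timing constraints in the two preceding paragraphs can be summarized in a short sketch. The function names and the example compute time of 2ms are hypothetical; only the 4ms frame, the 4ms look-ahead, and the 20ms and 32ms latency ceilings come from the text.

```python
def total_latency_ms(frame_ms, lookahead_ms, compute_ms, other_ms=0.0):
    """Delay from the first sample of a frame arriving at the microphone
    to the start of its playback: the frame must be fully received, any
    look-ahead audio must also arrive, and the processing must finish."""
    return frame_ms + lookahead_ms + compute_ms + other_ms

def keeps_up_in_real_time(frame_ms, compute_ms):
    """A sequential (e.g. recurrent) model must finish each frame in less
    time than the frame lasts, or playback gaps will appear."""
    return compute_ms < frame_ms

def within_budget(latency_ms, budget_ms=20.0):
    """Check a latency figure against a chosen ceiling (20 ms or 32 ms)."""
    return latency_ms < budget_ms

# 4 ms frames with a 4 ms look-ahead and an assumed 2 ms of compute.
latency = total_latency_ms(frame_ms=4.0, lookahead_ms=4.0, compute_ms=2.0)
```

Under these assumed numbers the total is 10ms, which leaves headroom under either ceiling; doubling the frame length or the look-ahead consumes that headroom quickly.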
[0047] In one embodiment, the auditory system may be configured to generate an audible signal approximately 30–35 milliseconds, 20–30 milliseconds, 10–20 milliseconds, 8–12 milliseconds, 6–10 milliseconds, or 3–8 milliseconds from the reception of an incoming speech signal.
[0048] Many variations exist for the training methods disclosed. For example, a model can be trained to capture multiple audio streams from multiple microphones. The input data may be in the time domain or the time-frequency domain. The loss function may be the mean squared error of the signal or the mean squared error of a complex ideal ratio mask. The input data may include additional sensor data. The input data may include information about the desired target of the neural network, such as in the example where the network is trained to separate speech that matches a particular speech signature and also receives the signature as input data. The model can also be trained to output each speaker individually or to output multiple speakers in a single signal. The model's training target may be speech at a different SNR (rather than the utterance alone). It is also possible to train the model using unsupervised techniques to enable the use of audio without a clean target. The training data may be synthetically generated or generated by recording real-world audio streams simultaneously. The above variations are provided to illustrate the basic concepts and do not cover all potential variations in model training.
[0049] One exemplary embodiment of the NNE150 includes a recurrent neural network of approximately 40 million units, organized into six layers. This network takes an 8ms clip (or frame, interchangeably) of audio data as input and internally transforms the clip into a time-frequency representation using a short-time Fourier transform. Thus, the network can generate a complex mask that can be applied to the original signal to modify the phase and magnitude of each frequency. The network then outputs a clean time-domain audio signal.
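How a complex mask modifies both magnitude and phase can be illustrated on a single frame. The sketch below is a simplification under stated assumptions: it uses a plain discrete Fourier transform of one frame rather than a windowed, overlapping short-time Fourier transform, and the masks are written by hand rather than produced by a network. The helper names `dft`, `idft`, and `apply_complex_mask` are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of one frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform; the real part is returned as the time-domain frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def apply_complex_mask(frame, mask):
    """Multiply each frequency bin by a complex mask value. The magnitude
    of the mask scales the bin; its angle rotates the bin's phase."""
    masked = [m * s for m, s in zip(mask, dft(frame))]
    return idft(masked)

n = 8
frame = [1.0] + [0.0] * (n - 1)          # a single click at t = 0

# Magnitude-only mask: halve every bin.
half = apply_complex_mask(frame, [0.5] * n)

# Phase-only mask: unit magnitude, linear phase, which delays the click
# by one sample without changing any bin's magnitude.
delay_mask = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
delayed = apply_complex_mask(frame, delay_mask)
```

A real-valued mask could only produce the first kind of change; the second example is the reason a complex mask gives finer control over the output.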
[0050] In an additional embodiment, the NNE150 consists of a convolutional neural network of approximately 1 million units organized into 13 layers. The first six layers correspond to encoders, where the input is gradually downsampled along the frequency axis via strided one-dimensional convolutions. Gated recurrent unit (GRU) layers are applied in the bottleneck layer to aggregate the temporal context. The decoder consists of six layers, which gradually upsample the input from the bottleneck via transposed convolutions. The network takes a time-domain signal with speech and noise (divided into 8ms clips fed to the model in real time) as input and outputs a corresponding clean time-domain signal.
[0051] Next, the NNE150 recombines the intermediate signals to generate a new signal. In some embodiments, the signals are recombined in a manner that maximizes the SNR by retaining only the signal (or signal component) containing the target speech. For example, the modified signal may contain only the voice of the target speaker. In other embodiments, the recombination is performed with a preferred SNR as the target, and the preference is determined by user-based and user-independent criteria. As used herein, SNR refers to the ratio of the power of the intermediate signals in the combined signal, each being an estimate of a particular sound source in the original signal, and it is recognized that such estimates are approximations.
[0052] User-based criteria may include user input in an application on a smartphone connected to the hearing aid via wireless communication. For example, the user may be able to use a slider or dial to raise or lower the desired amount of background noise, which is converted into the model's target SNR. In another example, when the user selects noise cancellation, the user may have a preferred level of background noise stored as an application setting, so that the desired SNR is already known as a predetermined value. In another embodiment, the SNR may be determined as a function of clinical criteria, where the SNR is set to achieve clarity and comfort for the user, based on the user's stored auditory profile, while retaining a certain amount of ambient noise. If there are multiple intermediate signals (i.e., multiple speakers), the logic described above is extended so that each source is adjusted to achieve a desired SNR. Because the same noise may be shared between two simultaneous speakers, the optimal SNR for each of the two speakers may be different. User-based criteria (i.e., user-defined criteria or user-controlled criteria) will be discussed further in relation to Figure 3B.
[0053] Once processed, the signal components (i.e., intermediate signals) are recombined by selecting the degree of amplification (i.e., gain) to be applied to each signal. The challenge in setting the gain is to ensure that the speech is mixed in a way that achieves the target SNR without causing too much variation in gain. For example, if the SNR is targeted for every 4-millisecond sample of speech, the SNR of the incoming signal measured over such short samples will be very unstable, and the gain applied to each signal may change dramatically every 4 milliseconds, so the results will be meaningless. Therefore, the NNE150 may consider a slower moving average to determine the SNR (in other words, it may evaluate the relative volume over a longer time window), and may respond differently to changes in background noise volume and changes in speaker volume.
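One way to stabilize the gain, sketched here under stated assumptions, is to track speech and noise power with an exponential moving average and derive the noise gain from the smoothed values. The class name `SmoothedSnrMixer`, the smoothing factor `alpha`, and the decision to leave the speech gain at unity and only attenuate noise are illustrative choices, not details from the disclosure.

```python
import math

class SmoothedSnrMixer:
    """Recombine speech and noise estimates toward a target SNR using
    slowly varying power estimates rather than per-frame measurements."""

    def __init__(self, target_snr_db, alpha=0.05, eps=1e-12):
        self.target_snr_db = target_snr_db
        self.alpha = alpha          # small alpha = long averaging window
        self.eps = eps              # guards against division by zero
        self.speech_pow = None
        self.noise_pow = None

    @staticmethod
    def _power(frame):
        return sum(v * v for v in frame) / len(frame)

    def _smooth(self, old, new):
        # The first frame initializes the average; later frames nudge it.
        if old is None:
            return new
        return (1.0 - self.alpha) * old + self.alpha * new

    def noise_gain(self, speech_frame, noise_frame):
        """Update the running powers and return the linear noise gain."""
        self.speech_pow = self._smooth(self.speech_pow, self._power(speech_frame))
        self.noise_pow = self._smooth(self.noise_pow, self._power(noise_frame))
        snr_in = (self.speech_pow + self.eps) / (self.noise_pow + self.eps)
        snr_target = 10 ** (self.target_snr_db / 10)
        # Never amplify noise: cap the gain at 1.
        return min(1.0, math.sqrt(snr_in / snr_target))

    def mix(self, speech_frame, noise_frame):
        g = self.noise_gain(speech_frame, noise_frame)
        return [s + g * n for s, n in zip(speech_frame, noise_frame)]
```

With a per-frame measurement, a sudden 20 dB jump in noise would change the gain by a factor of ten within one frame; with the smoothed estimate the gain moves only part of the way, which avoids audible pumping. Separate smoothing factors for speech and noise would be one way to respond differently to each.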
[0054] To optimize speech quality, user-independent criteria can be used. These criteria may include algorithms generally known to achieve a desirable user experience. For example, in the absence of personalized settings, noise cancellation might target a signal-to-noise ratio (SNR) that generally leads to improved intelligibility for the hearing impaired. In exemplary embodiments, the SNR can be dynamically set based on the performance of the neural network (NN) model.
[0055] Another important user-independent aspect of intermediate signal recombination is the model's estimation performance. Even the best-trained models will struggle just as much as someone with normal hearing if the signal-to-noise ratio (SNR) is extremely low (the noise is significantly louder than the speech), as the noise completely obscures the speech signal. Therefore, in exemplary embodiments, measuring the SNR is useful as an indicator of when the model is likely to fail, allowing the system to fail gracefully rather than reproducing an estimated speech that sounds unnatural and incomprehensible. In one embodiment, the model simply plays nothing. In another embodiment, the model can default to returning the original signal. In yet another embodiment, the model may mix the speech estimate with the original signal, or mix in, to some extent, a noise estimate, which is the difference between the original signal and the speech estimate.
[0056] In some embodiments, a neural network model can use other measures of its performance as input to a recombination algorithm. Certain intermediate metrics computed by the neural network may serve as a surrogate for model reliability, which can be leveraged to monitor the likelihood of model failure. In one embodiment, the neural network can estimate the phase of the target signal using Gumbel softmax, and the pre-threshold value can be used as a frame-by-frame measure of model reliability. The processor may also include other algorithms specifically tuned to measure the quality of the model output. Some examples are metrics commonly used in speech enhancement research, such as PESQ and STOI, while others may be specifically developed for this purpose, such as lightweight neural networks trained simply to assess the quality of clean speech output.
[0057] In an exemplary embodiment, the NNE150 combines the target SNR with the limit SNR, where the target SNR is generated based on user input (such as the user adjusting the desired levels of background noise and speech in the app), and the limit SNR represents the maximum achievable SNR that the model estimates it can achieve while fitting certain estimated performance requirements. For example, a user might want to set the noise reduction parameter to maximum and eliminate background noise if overwhelming background noise is present, but the model might not be able to properly improve the incoming speech because the incoming SNR is very difficult for the model to handle. In this case, the limit SNR is determined to be the incoming SNR, and the speech is played back without modification (which may be preferable to playing back an incomprehensible estimated speech).
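The interaction between the target SNR and the limit SNR can be sketched as follows. The disclosure does not state how the two are combined, so taking the minimum is an assumption, as are the function names and the simplification that attenuating the noise estimate by a given number of decibels raises the SNR by the same amount.

```python
def effective_snr_db(target_snr_db, limit_snr_db):
    """Assumed combination rule: aim for what the user asked for, but no
    higher than what the model estimates it can deliver."""
    return min(target_snr_db, limit_snr_db)

def recombine(original, speech_est, target_snr_db, limit_snr_db, incoming_snr_db):
    """Return the output frame for one frame of input."""
    snr = effective_snr_db(target_snr_db, limit_snr_db)
    if snr <= incoming_snr_db:
        # No achievable improvement: play the input back unmodified.
        return list(original)
    # Noise estimate is the difference between the input and the speech estimate.
    noise_est = [o - s for o, s in zip(original, speech_est)]
    # Attenuate the noise by the number of decibels still to be gained.
    noise_gain = 10 ** (-(snr - incoming_snr_db) / 20)
    return [s + noise_gain * n for s, n in zip(speech_est, noise_est)]

original = [1.0, 2.0]
speech_est = [0.5, 1.0]

# User asks for 40 dB, but the model's limit equals the incoming SNR.
passthrough = recombine(original, speech_est, 40.0, 0.0, 0.0)

# User asks for 40 dB, the model can deliver 20 dB.
improved = recombine(original, speech_est, 40.0, 20.0, 0.0)
```

In the first call the input is returned unchanged, matching the case in which unmodified playback is preferable to an unintelligible estimate.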
[0058] The NNE circuit 150 can be updated via wireless communication with a processing device or the cloud. In a preferred embodiment, an application on the user's smartphone can connect to the cloud and download an updated model (retrained for better performance), which can then be transmitted to the device via a wireless protocol. In another embodiment, the model is retrained on the smartphone using user-specific data collected by recording voice on the device. Once retrained, the updated model is transmitted to the hearing aid device.
[0059] In one embodiment, NNE150 may run on a remote device that communicates with the hearing aid. For example, NNE150 may run on a smart device (e.g., a smartphone) that communicates with the hearing aid. The hearing aid and the smart device may communicate using Bluetooth® Low Energy (BLE). In yet another embodiment, part or all of NNE150 may run on an auxiliary device that communicates with the hearing aid. The auxiliary device may comprise any device capable of communicating with one or more servers capable of executing the machine learning algorithms disclosed herein.
[0060] The DSP140 comprises hardware, software, or a combination of hardware and software (firmware) for applying digital signal processing to an incoming frequency band. In some embodiments, a key objective of DSP processing is to improve the audibility and clarity of the incoming signal for hearing aid wearers, taking into account the user's hearing loss. Conventionally, this is done by compensating for reduced volume sensitivity, reduced dynamic range, and increased sensitivity to background noise at specific frequencies. The DSP140 can implement various digital signal processing algorithms to achieve dynamic range compression, amplification, and frequency adjustment (applying differential amplification to different frequency bands). The digital signal processing may include these conventional algorithms or may include additional processing functions configured to reduce background noise (e.g., a steady-state noise reduction algorithm). In some embodiments, the DSP140 can apply a predetermined gain to the incoming signal (e.g., the control unit output signal 135 or the improved digital signal 155). The applied gain may be linear or nonlinear and may be configured to increase the amplification of one frequency band compared to other bands.
[0061] In an exemplary embodiment, the DSP140 can pass the incoming signal through a filter bank. The filter bank divides the incoming signal into different frequency bands and applies gain to each band. The gain may be linear or nonlinear with respect to each frequency band or group of frequencies. A group of frequencies is often referred to as a channel. In a preferred embodiment, certain parameters of the filter, particularly the gain, are user-specific and configured so that the final signal applies greater amplification to frequencies at which the user has greater hearing loss. The gain can be set to apply greater amplification to quieter sounds than to relatively louder sounds, thereby compressing the dynamic range of the signal. In this embodiment, the parameters are set as a function of the user's auditory profile, which includes but is not limited to an audiogram. The process of adjusting the parameters applied by the DSP processor to a specific individual can be done either by the individual themselves through an app-based adaptation process or by a professional who can program the device via software connected to the device wirelessly.
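The per-band gain with dynamic range compression can be sketched with a simple level-dependent rule. The knee point, compression ratio, band levels, and base gains below are invented for illustration; in a fitted device they would derive from the user's auditory profile.

```python
def band_gain_db(level_db, base_gain_db, knee_db=50.0, ratio=2.0):
    """Gain for one band. Below the knee the full base gain applies;
    above it, each extra decibel of input yields only 1/ratio decibels
    of extra output, so louder sounds receive less gain."""
    excess = max(0.0, level_db - knee_db)
    return base_gain_db - excess * (1.0 - 1.0 / ratio)

def apply_band_gains(band_levels_db, base_gains_db, knee_db=50.0, ratio=2.0):
    """Output level per band after the level-dependent gain is applied."""
    return [level + band_gain_db(level, base, knee_db, ratio)
            for level, base in zip(band_levels_db, base_gains_db)]

# Hypothetical three-band example: more base gain where loss is greater.
input_levels = [40.0, 70.0, 70.0]     # dB, measured per band
base_gains = [20.0, 20.0, 30.0]       # dB, from an assumed auditory profile
output_levels = apply_band_gains(input_levels, base_gains)
```

In the first two bands a 30 dB difference at the input becomes a 20 dB difference at the output, which is the compression of dynamic range described above.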
[0062] In another embodiment, the filter and gain are set by analyzing the incoming signal in the time-frequency domain. In some embodiments, the signal is received in this form so STFT is not required in the DSP140, but in other embodiments, the processor receives the signal in the time domain and then applies STFT. In some embodiments, the algorithm can be applied to different frequency bands or groups of frequency bands to analyze their contents and set the gain accordingly. As an example, such an algorithm can be applied to identify which frequencies contain steady noise and attenuate these frequencies (receiving them with a lower gain) to improve the SNR of the reconstructed signal. After applying frequency gain to different frequency bands, the bands can be recombined into a single signal.
[0063] The output 145 of the DSP 140 is directed to the backend / output processor 160. The backend processing circuit 160 may include one or more circuits for converting the processed signal bandwidth 145 into an audible signal in the time domain. For example, the backend processor 160 may include a digital-to-analog (DAC) converter (not shown) that converts the amplified digital signal into an analog signal. The DAC then sends the analog signal to a driver and one or more diaphragm speakers (not shown) to present the processed and amplified sound to the user. The speakers (not shown) may further include means for adjusting the output volume.
[0064] As mentioned above, the DSP140 can receive signal data from either the control unit 130 or the NNE150. This means that the signal may pass through the NNE150 (incurring the corresponding computational cost and gaining the associated improvements) or pass directly to the DSP140. In either case, the DSP140 may become engaged. If the NNE150 becomes engaged, the number of steps in the signal processing chain increases, increasing the system's power consumption and the time required for computation. The additional processing may result in further latency for the end user.
[0065] In one embodiment, the system 100 in Figure 1 is formed on an IC. The IC may define a SoC. The integrated circuit may further include a speaker and a driver for the speaker. In the latter embodiment, the integrated circuit 100 may include one or more communication circuits to enable communication between the circuit 100 and one or more external devices that support the NNE 150. Such communication may include, for example, Bluetooth® (BT) and Bluetooth Low Energy (BLE) or other short-range wireless technology.
[0066] As mentioned above, one of the main obstacles to placing neural networks in signal paths is the power consumption required to run the neural network on a battery available for such processing. Therefore, in order to achieve excellent performance while maintaining a long battery life, one embodiment of the present invention must achieve high efficiency in the neural network circuit, such as measured in milliwatts of operation.
[0067] Batteries found in conventional rechargeable hearing aids and headphones typically have a capacity of about 300 milliwatt-hours. In exemplary embodiments, approximately 10 milliwatt-hours of this capacity can be freed up for neural network processing by slightly shortening the runtime or increasing the battery size. Ideally, a user needs access to 10 hours of neural network processing to use speech enhancement features and lead an active, social life, which means that only about 1 milliwatt of additional power (10 milliwatt-hours spread over 10 hours) can be consumed when the neural network circuit is running. Therefore, if a chip efficiency of 2 to 3 billion operations per second per milliwatt is achieved, the neural network's computation budget would be 2 to 3 billion operations per second, which is sufficient for speech separation. In other embodiments, a larger computation budget can be allocated to the neural network by reducing the total execution time (therefore allocating more battery budget to the neural network) or by reducing the execution time of the neural network (therefore increasing the neural network's budget per second).
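The budget arithmetic in this paragraph can be checked with a few lines. The function names are hypothetical; the numbers are those given in the text.

```python
def power_budget_mw(energy_mwh, hours):
    """Average power available when an energy allowance is spread
    evenly over the given number of hours."""
    return energy_mwh / hours

def compute_budget_ops_per_s(power_mw, ops_per_s_per_mw):
    """Operations per second affordable at a given power and chip efficiency."""
    return power_mw * ops_per_s_per_mw

battery_mwh = 300.0
nn_allowance_mwh = 10.0
nn_hours = 10.0

nn_power = power_budget_mw(nn_allowance_mwh, nn_hours)          # milliwatts
low = compute_budget_ops_per_s(nn_power, 2e9)                   # ops/s
high = compute_budget_ops_per_s(nn_power, 3e9)                  # ops/s
share_of_battery = nn_allowance_mwh / battery_mwh

# Halving the hours of neural network use doubles the per-second budget.
nn_power_5h = power_budget_mw(nn_allowance_mwh, 5.0)
```

The allowance is roughly 3 percent of the battery, and the last line reflects the trade-off noted at the end of the paragraph: fewer hours of neural network use leave more power, and therefore more operations per second, while it runs.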
[0068] To achieve efficient signal processing, the DSP140 and NNE150 may be located on separate cores on a chip having different architectures suited to their respective tasks. For example, the neural network circuit may be configured for low-precision numerical processing with 8-bit (or fewer) arithmetic logic units. It may also be configured for efficient data movement, ensuring that all data required for computation is stored within the SoC. In some embodiments, this neural network core may also be configured so that the same processor used to run the neural network can be used for more traditional DSP operations, such as 24-bit arithmetic. Thus, in some embodiments, the DSP140 and NNE150 can run on the same processor.
[0069] Figure 2 schematically shows an exemplary front-end receiver 200 according to one embodiment of the present disclosure. In Figure 2, incoming sound, which may be a combination of voice and ambient noise, is received by microphones 214 and 224. Microphones 214 and 224 correspond to separate devices located on the left and right sides of the user's head and receive input speech identified as 210 and 220, respectively. In some embodiments, each device comprises multiple microphones. Microphones 214 and 224 direct the received signals 210 and 220 to ADCs 218 and 228, respectively. ADCs 218 and 228 convert the received time-varying signals 210 and 220 into corresponding digital representative values 219 and 229. Once digitized, the signals 219 and 229 are passed to the control units 130 of the respective devices. In some embodiments, they are further passed to the control units of the opposite-side devices to enable processing of binaural input data.
[0070] Figure 3A is a schematic diagram of an exemplary system according to one embodiment of the present disclosure. Specifically, Figure 3A shows an exemplary decision-making process that may be implemented in the control system. The control unit 300 may function as a signal processor that performs specific transformations and calculations on an incoming signal (e.g., 110 or 125, Figure 1) to make the incoming signal into a form required for processing and select the next processing step. In some embodiments, the control unit 300 may function as a selector switch that optimizes user selection, preference, and power consumption. In some embodiments, the control system 300 may determine when to engage a larger NNE based on user preference to amplify the user-preferred sound.
[0071] The control system 300 shown in Figure 3A can be implemented in a hearing aid or headphones. The control unit may be integrated into the hearing aid as hardware, software, or a combination of hardware and software. The control system 300 includes a processor circuit 330 that receives an audio signal 325. The audio signal may be digital (e.g., 125, Figure 1) or a time-varying signal (e.g., 110, Figure 1). If the signal is time-varying, an additional ADC (not shown) may be used. As described in relation to Figure 1, a digital audio signal may consist of multiple components, including one or more audio signals and ambient noise or background noise.
[0072] The processor 330 can receive user input from the user control 310. The user input may include user preferences dialed into the system from an auxiliary device such as a smartphone (see, for example, Figure 3B). Specific user preferences may provide amplification parameters or preferences regarding the relative amplification of different sounds, which may determine the signal-to-noise ratio (SNR). For example, a user might prefer the amplification of speech to that of other ambient sounds. User preferences may be obtained through a graphical user interface (GUI) implemented by an app on the user's auxiliary device, such as a smartphone. User control may be delivered wirelessly to the processor circuit 330. The user control 310 may include a mode selection 312, a directivity selection 314, a source selection 316, and a target volume 318. These exemplary embodiments are described below with reference to Figure 3B.
[0073] In an exemplary embodiment, the system 300 may optionally include a module (not shown) for receiving and executing a so-called wake word. The wake word may be one or more special words designated to activate the device when spoken. The wake word is also known as a hot word or trigger word. The processor 330 may have designated wake words that can be used by a user to activate the NNE 350. Activation overrides the processor 330 and decision logic 335, directing incoming audio to the NNE 350. This is indicated by arrow 331.
[0074] Decision logic 335 is illustrated separately, but may optionally be integrated with processor circuit 330. Decision logic 335 determines when to engage NNE 350 and to what extent such engagement occurs. Decision logic 335 can apply decision considerations provided by the user, the NNE, or a combination of both. Decision logic 335 may optionally consider the input of power indicator 305, which indicates the available battery level. Decision logic 335 may also utilize such considerations to determine the extent of NNE engagement. Decision logic 335 determines whether to engage NNE 350 (or part thereof), DSP 340, or both. If selected, DSP 340 filters the incoming signal 325 into a number of different frequency bands. Processor 330 and decision logic 335 may collectively decide when to engage NNE 350. For example, the processor 330 may use its own logic in combination with user input to determine that the receiving frequency band 325 contains only background noise and does not engage the NNE 350.
[0075] The received signal may be divided into more than 400 frequency bands. The DSP340 then assigns a different gain to each frequency band. The gain may be linear or nonlinear. In one embodiment, the DSP340 sets an ideal gain for each frequency to significantly reduce noise.
[0076] Figure 3B shows an exemplary graphical user interface (GUI) according to one embodiment of the present disclosure. The GUI may be implemented as an application on a smart device. The GUI allows user preferences to be communicated to the hearing aid device. Speech volume and background noise may be configured to allow the user to input preferences for amplification of speech and noise, respectively. Directionality is an additional input that allows the user to increase the relative volume of sound coming from one direction relative to the user (usually in front, but in other embodiments, the user may be able to select a different direction). Detected speaker allows the user to select a particular speaker whose voice to amplify (compared to other sounds, which may be treated as noise). Mode selection 312 allows the user to select an operating mode for the device (exemplified by the activation of conversation mode). In some embodiments, the selectable modes may include conversation mode, ambient mode, and automatic mode. If ambient mode is selected, the NNE 150 may be disengaged. Other modes, such as voice mode, may indicate that noise reduction is desired. The automatic mode can indicate that the processor 330 should make the best prediction about when to turn on the NNE 150 to match user preferences (for example, when the user is engaged in a conversation and there is background noise).
[0077] Total volume, speech volume, background noise, and directionality may each have dials or sliders on the user's device to implement specific user preferences. Additional controls may also be included to accommodate one or more sound categories or sound sources. In some embodiments, a dial on the device functions as a volume control for a set speech class, such as speech or background noise. Turning the dial allows the user to raise or lower a user-defined SNR target for recombining the output of the neural network. In some embodiments, one device may have a dial for ambient volume control, and the other device may have a dial for changing the level of background noise. In some embodiments, a single dial can adjust the SNR by dynamically adjusting either the speech volume or the noise volume based on the starting SNR or incoming volume. For example, the SNR can be increased by initially gradually decreasing the volume of background noise in the output signal, but further improvement of the SNR can be achieved by increasing the volume of the speech signal once the background noise is completely gone (because the speech signal is still competing with sounds entering the ear around the hearing device). In some embodiments, the physical dial can be specifically configured in the settings of a smartphone app to assign different behaviors.
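The single-dial behavior described above, where noise is reduced first and speech is boosted only once noise reduction is exhausted, can be sketched as a mapping from dial position to two gains. The function name, the 0-to-1 dial range, and the 40 dB and 30 dB limits are hypothetical values chosen for the example.

```python
def dial_to_gains_db(dial, max_snr_gain_db=40.0, max_noise_cut_db=30.0):
    """Map a dial position in [0, 1] to (speech_gain_db, noise_gain_db).

    The requested SNR improvement grows linearly with the dial. It is
    met by cutting noise until the cut reaches max_noise_cut_db (the
    noise is treated as effectively gone); any remainder is met by
    raising the speech level instead."""
    dial = min(1.0, max(0.0, dial))
    wanted_db = dial * max_snr_gain_db
    noise_cut_db = min(wanted_db, max_noise_cut_db)
    speech_boost_db = wanted_db - noise_cut_db
    return speech_boost_db, -noise_cut_db

quarter_turn = dial_to_gains_db(0.25)   # noise cut only
full_turn = dial_to_gains_db(1.0)       # noise fully cut, then speech boosted
```

The speech boost in the upper part of the range matters because, even with the noise removed from the output signal, sound still reaches the ear around the device.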
[0078] Figure 3B shows the speech volume, background noise level control, and mode switch. These parameters (together with or in combination with other parameters) can be used to determine the user's desired noise reduction level. Referring to Figure 3A, the user's desired noise reduction level may be communicated to the NNE350 via the processor 330 or it may be input directly to the NNE350 (not shown). Once engaged, the NNE350 can identify different sound sources and separate incoming signals accordingly. Given the user's preferred noise reduction level, the NNE350 can then apply appropriate amplification gains to the target sound and noise.
[0079] In one embodiment, source selection 316 allows the user to pre-identify specific voices and match the identified voices with known individuals. Source selection 316 can be performed optionally. NNE 350 or a subset thereof may be performed to enable the user to perform source selection. Once the incoming frequency band is matched with the identified individuals, the system 300 can perform the step of isolating and amplifying the individuals' voices from ambient noise. Identified voices may include those of caregivers, children, and family members. Other sounds, including alarms or emergency sirens, may also be identified by the user or by the system 300 so that they can be easily isolated and selectively amplified. In one embodiment, source selection 316 allows the user to identify one or more groups of sounds for amplification (or non-amplification).
[0080] Figure 4 shows a signal processing system according to another embodiment of the present disclosure. The system of Figure 4 can be implemented in an auditory device based on the disclosed principles. In Figure 4, a front-end receiver 420 is shown, which combines incoming signals from different microphones into a single digital signal, as described in relation to Figure 2. The control system 430 includes user control 434, an SNR detector 432, and decision logic 436.
[0081] The decision logic 436 communicates with both the DSP 440 and the NNE 450, as described in relation to Figure 3A. In Figure 4, the NNE 450 provides additional feedback to the decision logic 436, as indicated by arrow 451. In some embodiments, the NNE 450 measures the estimated SNR of the incoming signal, which can then serve as an input to the logic 436. If the SNR is extremely high, the NNE 450 may no longer be necessary. If the SNR is exceptionally low, to the point that no voice is detected, the NNE 450 may be of no use. In some embodiments, a way is provided to measure the characteristics of the voice signal without constantly consuming power by intermittently transmitting data to the NNE 450.
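The considerations listed for the decision logic (operating mode, battery level, whether speech is present, and the estimated SNR) can be combined into a simple rule, sketched below. Every threshold, the `Context` fields, and the precedence among the rules are assumptions made for illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Context:
    mode: str                # "conversation", "ambient", or "automatic"
    battery_fraction: float  # 0.0 (empty) to 1.0 (full)
    speech_detected: bool
    snr_db: float            # estimated SNR of the incoming signal

def should_engage_nne(ctx, high_snr_db=20.0, low_snr_db=-15.0, min_battery=0.1):
    """Decide whether to route audio through the neural network engine."""
    if ctx.mode == "ambient":
        return False                      # user asked for unprocessed ambience
    if ctx.battery_fraction < min_battery:
        return False                      # preserve remaining battery
    if ctx.mode == "conversation":
        return True                       # explicit user request
    # Automatic mode: engage only when the network is likely to help.
    if not ctx.speech_detected:
        return False                      # only background noise present
    if ctx.snr_db >= high_snr_db:
        return False                      # speech already clear
    if ctx.snr_db <= low_snr_db:
        return False                      # speech too buried to recover
    return True
```

When the function returns False the signal would go directly to the DSP path. Evaluating the rule only intermittently, rather than on every frame, is one way to avoid spending power on the check itself.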
[0082] The exemplary NNE 450 in Figure 4 includes exemplary modules of source separation 452, relative gain 454, recombiner 456, and performance monitoring 458. When activated, the source separation 452 receives the incoming audio signal frame by frame. The audio can be received in the time domain or the time-frequency domain. For example, a frame may be 10, 14, 16, or 20 milliseconds long. In some embodiments, a frame may be less than 1 millisecond or longer than 30 milliseconds. Each frame is processed through the neural network, which outputs one or more complex masks that can be used to separate one or more sound sources. By applying these masks, the source separation module 452 can filter each frame down to the sound source. Noise can be found by generating a mask for the noise or by subtracting all other separated sources from the original signal so that the noise remains.
[0083] The relative gain module 454 receives the user's auditory preferences from the user control 434 and applies one or more relative gains to each frame received from the source separation 452. The gains applied to different frequency bands in the NNE 450 may be nonlinear (compared to the gains applied in the DSP 440). In this implementation, different gains can be applied on a per-source and per-frame basis.
[0084] Figure 5A illustrates the interaction between user preference and the nonlinear gain applied by an exemplary NNE according to one embodiment of the present disclosure. In Figure 5A, incoming sound is directed to the NNE 510 in the form of a digitized signal 500. Source separation 452 divides the incoming sound into different data streams, for example, as a function of each source. This data is then directed as different bandwidths to relative gain filters 454 that apply different gains based on user preference, as indicated by arrows 435. User preference 540 determines the optimal combination (or optimal weighting) of various sources. Recombining unit 456 then combines the differently weighted frequency bands to form a combined signal 580.
[0085] Referring again to Figure 4, the NNE 450 directs the recombined audio stream to the DSP 440 for further processing. In this way, according to one embodiment, the components of the NNE 450 estimate an ideal ratio mask for separating the audio signal from the noise signal, apply differential gain to the identified audio signal and noise signal respectively, and combine the differentially amplified signals into a single data stream.
[0086] The performance monitoring module 458 is optional. In one embodiment, the performance monitoring module 458 examines the output signal of the NNE 450 to determine whether the output signal meets the auditory requirements criteria. If the output signal does not meet the requirements, the performance monitoring module 458 can signal the decision logic 436 to redirect the incoming signal directly to the DSP 440. This is indicated by arrow 451. Otherwise, the NNE output can be directed to the DSP 440, as shown by arrow 459. In another embodiment, the output of the performance monitor 458 can serve as an input to the relative gain 454 and can limit the aggressiveness of noise suppression if the performance monitor 458 detects an error in the source separation 452.
[0087] The DSP 440 includes, among other components, a filter bank 442 that separates the incoming signal into different frequency bands, and a nonlinear gain filter 444 that applies gain to each band. In one embodiment, each filter identifies noise components within its respective band and applies noise cancellation gain to cancel out the noise components.
[0088] The active noise cancellation (ANC) 425 is located in the signal path between the front-end receiver 420 and the back-end receiver 460. The ANC can be used optionally. The ANC 425 may include processing circuitry configured to receive an ADC signal from the hearing aid microphone and process the signal to improve the signal-to-noise ratio (SNR). Conventional ANC techniques can be used for noise cancellation. The input to the ANC 425 may be the incoming signal 421, optionally the control signal output 431, or both. The ANC process may be performed on each unit of the hearing aid device to address noise uncertainties associated with each unit. In one embodiment of this disclosure, the ANC 425 may remain engaged even without a user control input 434, or without DSP engagement or NNE engagement. Given the latency of speech processing through the neural network and the low-latency requirements of ANC, the ANC is applied to the entire incoming signal (including both speech and noise components), and then the system plays the speech after processing is complete.
[0089] The backend processor 460 includes a speaker 464 and, optionally, a processor circuit 462. The speaker 464 may include a conventional hearing aid speaker that converts the processed digital signal into an audible signal.
[0090] Figure 5B is an explanatory diagram of exemplary NNE circuit logic implemented according to one embodiment of the present disclosure. The logic may be implemented in the NNE engine circuit 550. The received audio signal is shown as input 530. The received audio signal is directed to a neural network (NN) model 532. The NN model 532 may include exemplary algorithms for separating sound sources or improving the SNR, according to the disclosed embodiments. The NN model 532 may comprise hardware, software, or a combination of hardware and software. The NN model 532 receives user preferences in the form of user control 531, as described, for example, in relation to Figure 3B. The output of the NN model 532 (NN output signal 533) is directed to a performance measurement unit 534. The performance measurement 534 implements metrics used to predict the performance of the neural network or to predict errors. These predictions can further be used as input to a recombiner 536, which attempts to optimize the way in which the outputs of the model are recombined to form the final signal. The recombination unit 536 takes into account both the user preferences indicated by the user control 531 and the output of the performance measurement 534 to optimally recombine the output of the NN model 532.
[0091] In an exemplary embodiment, the performance measurement unit 534 receives the output signal 533 in successive frames and determines the SNR for each frame. The measurement unit then estimates the mean SNR of the environment, which can be used to predict the model error (because the model error generally increases with more difficult input SNRs). The performance measurement unit 534 also receives user preferences from user control 531. Given the user preferences and estimated SNR, the performance measurement unit 534 then determines a set of relative gains to be applied to the signal 533 and communicates the gain values to the recombiner 536. In an exemplary embodiment, the gains are set to best suit the user preferences while keeping the total error below a certain threshold.
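A per-frame SNR estimate and its running mean might look like the sketch below. It assumes the model output is already split into a speech stream and a noise stream per frame, and it averages the per-frame values in decibels. Both are assumptions; the disclosure does not say how the mean is formed. The function names are hypothetical.

```python
import math


def frame_snr_db(speech, noise, eps=1e-12):
    """SNR of one frame in dB, from the mean power of each separated stream."""
    speech_power = sum(abs(x) ** 2 for x in speech) / len(speech)
    noise_power = sum(abs(x) ** 2 for x in noise) / len(noise)
    # eps guards against division by zero and log of zero on silent frames.
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))


def mean_snr_db(frames):
    """Average the per-frame SNR (in dB) over a sequence of (speech, noise) pairs."""
    values = [frame_snr_db(speech, noise) for speech, noise in frames]
    return sum(values) / len(values)


# Two hypothetical frames: one at about 20 dB, one at about 0 dB.
frames = [([1.0, 1.0], [0.1, 0.1]), ([0.5, -0.5], [0.5, 0.5])]
environment_snr_db = mean_snr_db(frames)
```

A lower mean would then be read as a harder environment, in which a larger model error should be expected.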
[0092] The recombining unit 536 applies the gain value to the NN output signal 533 to obtain the output signal 538. In one embodiment, multiple gain values are transmitted to the recombining unit 536. Each gain value corresponds to an intermediate signal, and the intermediate signal corresponds to a sound source. The recombining unit 536 multiplies each gain value by the corresponding intermediate signal and combines the results to generate the output 538.
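The multiply-and-sum recombination described above can be illustrated with a short sketch. The signal values and gains are made up, and `recombine` is a hypothetical name; each intermediate signal stands for one separated sound source.

```python
def recombine(intermediate_signals, gains):
    """Scale each intermediate signal by its gain and sum the results sample by sample."""
    if len(intermediate_signals) != len(gains):
        raise ValueError("one gain value is required per intermediate signal")
    length = len(intermediate_signals[0])
    output = [0.0] * length
    for signal, gain in zip(intermediate_signals, gains):
        if len(signal) != length:
            raise ValueError("all intermediate signals must have the same length")
        for i, sample in enumerate(signal):
            output[i] += gain * sample
    return output


# Hypothetical separated streams for one short frame.
speech = [0.2, -0.4, 0.6]
noise = [0.1, 0.1, -0.2]

# Keep speech at unity gain and attenuate noise to one quarter of its amplitude.
output = recombine([speech, noise], [1.0, 0.25])
```

Setting every gain to 1.0 returns the sum of the separated streams, which corresponds to leaving the incoming signal unaltered.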
[0093] The following embodiments illustrate specific, non-exclusive examples of the disclosed principle.
[0094] Example 1 – The average SNR value of signal 533 is below the threshold at which speech can be reliably separated (audible speech threshold). In this example, neural network processing is ineffective regardless of user preference or system capabilities. In this case, the performance measurement unit 534 may set the gain so that the incoming signal is not altered, or it may relay the signal to the control unit 130 as shown in Figure 1 to temporarily turn off the neural network processing in order to conserve battery power.
[0095] Example 2 – The average SNR value of signal 533 is above the audible speech threshold, and user preference is applied. In this embodiment, since the SNR value of signal 533 is above the audible speech threshold, the recombiner 536 can determine an appropriate gain. The gain may be determined as a function of user preference and the estimated error of the model. The performance measurement unit 534 then determines the gain that best approximates the SNR desired by the user, while keeping the model error audible to the user below a certain threshold.
[0096] Example 3 – The average SNR value of signal 533 is above the audible speech threshold, and the recombiner 536 recognizes user preference. The recombiner 536 may override user preference by prioritizing the estimation and application of a different set of relative gains. This may occur when the recombiner determines that higher quality sound can be obtained by applying different gain criteria. In this embodiment, the recombiner 536 substitutes its own criteria for providing an audible output signal 538 that may or may not exceed the user's SNR preference. In this way, the system works with the NNE circuit in the signal path to provide an audible signal in virtually real time while gracefully handling the limitations of deep learning models in a real-world environment.
[0097] Figure 5C schematically illustrates an exemplary architecture for engaging an NNE circuit according to one embodiment of the present disclosure. The architecture of Figure 5C may be implemented in an NNE circuit. In Figure 5C, an incoming signal 550 is received by an NN model 556. User preferences in the form of user control 552 and a target source 554 are also provided to the NN model 556. The target source 554 may comprise one or more identified sources, for example, the voices of known speakers that have been identified and stored, or the voice of the user themselves.
[0098] The user's ideal SNR can also be set using user preferences. The ideal SNR can define threshold SNR values that correspond to the user's personal preferences and hearing impairments. For example, the ideal SNR can be targeted at an output SNR of 10 dB, because that is the balance the user has communicated through the smartphone's user controls, or simply because the user's auditory profile is such that 10 dB is the minimum SNR at which the person can still reliably follow speech without effort.
[0099] The NN model 556 outputs a signal to the performance measurement unit 558. A general description of the performance measurement unit has been provided in relation to Figure 5B and will not be repeated here. In Figure 5C, the performance measurement unit 558 identifies an intermediate signal 560 which may include, for example, a target frequency band and a noise band. The recombiner 590 may include SNR optimization logic 564. The optimization logic 564 receives the user's ideal SNR 562, as well as the output from the performance measurement unit 558, and decides whether to apply or deviate from the user's preference (i.e., the ideal SNR 562). As a result, a set of gain values 568 are determined which are then applied to the intermediate signal 560, respectively, in order to provide the output signal 570. Note that in the exemplary embodiment of Figure 5C, the recombiner 590 also applies the optimization logic 564 to determine the gain values 568.
[0100] In an exemplary embodiment, the performance measurement 558 outputs a limiting SNR. The limiting SNR is the output SNR that keeps the audible distortion caused by the model's error below a certain threshold. The SNR optimization logic then compares the ideal SNR, determined based on user preference, with the limiting SNR and adopts the lower of the two. The gain is then set to target the SNR determined by this function.
[0101] Example 4 – In this embodiment, to comply with the user-preferred SNR 562, an output signal with an SNR of approximately 10 dB may be required. The SNR optimization logic 564 can compare this value to the available system bandwidth and impose a -5 dB limit on the output signal 570. The gain value is then determined based on the -5 dB SNR. In this way, the SNR optimization logic 564 functions as an SNR limiter.
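The limiter behaviour can be sketched in a few lines. The first function takes the lower of the ideal and limiting SNR, as described above. The second converts that target into a noise gain, which is an assumption added for illustration: it holds the speech gain at 1.0, expresses gains as linear amplitudes, and never amplifies noise. All names are hypothetical.

```python
def target_snr_db(ideal_snr_db, limiting_snr_db):
    """Adopt the lower of the user's ideal SNR and the limiting SNR."""
    return min(ideal_snr_db, limiting_snr_db)


def noise_gain_for_target(input_snr_db, target_db):
    """Linear amplitude gain for the noise stream, with speech gain fixed at 1.0.

    Scaling noise amplitude by g changes the SNR by -20*log10(g) dB, so
    g = 10 ** ((input_snr_db - target_db) / 20). The result is capped at 1.0
    so that noise is never boosted (an assumption of this sketch).
    """
    gain = 10.0 ** ((input_snr_db - target_db) / 20.0)
    return min(gain, 1.0)


# Values from Example 4: the user asks for 10 dB, the limit is -5 dB.
target = target_snr_db(10.0, -5.0)

# Hypothetical input at -15 dB SNR: noise must drop by 10 dB to reach the target.
gain = noise_gain_for_target(-15.0, target)
```

With the Example 4 values, the target is the -5 dB limit rather than the 10 dB preference, which is the limiter behaviour the example describes.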
[0102] Thus, according to the specific principles disclosed, the NN model may be run on small audio frames, for example, once per second, to obtain a preliminary SNR value. The frequency and duration of the audio frame tests can be varied.
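As a rough illustration of such intermittent probing, the sketch below picks which frames to send to the NN model when one frame is probed per period. The 20 ms frame length and one-second period are among the values mentioned in this description; the function name and the idea of indexing frames this way are assumptions.

```python
def probe_frames(total_frames, frame_ms=20.0, probe_period_s=1.0):
    """Return the indices of the frames that would be sent to the NN model."""
    # Number of frames between two consecutive probes (at least one).
    step = max(1, round(probe_period_s * 1000.0 / frame_ms))
    return list(range(0, total_frames, step))


# Four seconds of audio in 20 ms frames is 200 frames.
indices = probe_frames(200)
duty_cycle = len(indices) / 200
```

Under these assumptions the model runs on 2% of the frames, which shows why probing costs far less power than continuous processing.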
[0103] Figure 6 is a flowchart illustrating an exemplary activation / deactivation of an NNE circuit according to one embodiment of the present disclosure. Such a flow may be performed in the control unit 130 of Figure 1. In one embodiment, the exemplary process aims to minimize the power consumption of the system while improving the user experience. The disclosed process may be implemented in hardware, software, or a combination of hardware and software. The disclosed process may be implemented in various parts of the system disclosed herein. For example, certain steps may be implemented in the front-end receiver, other steps in the control unit, and yet other steps in the NNE and DSP circuits.
[0104] In one embodiment, the system monitors incoming sounds without continuously activating the NNE circuit. This may be achieved by layering the logic so that it is executed only when a more computationally intensive task (i.e., a high-power computation) is required.
[0105] Referring to Figure 6, in step 602, the system detects incoming sound. Step 602 can be performed in the control unit with relatively low computational cost. A conventional sound detection mechanism can be used in step 602. Once sound is detected, the system determines whether the detected sound exceeds a predetermined threshold. This is shown in step 604. If the threshold is not met, the system returns to step 602 and continues detecting incoming sound. Steps 602 and 604 may be performed continuously or intermittently. These steps may be performed in the front-end receiver or elsewhere in the system.
[0106] Sound detection can be performed on one or both sides of the hearing aid. Sound detection may be performed by intermittently analyzing speech frames in low-power mode. If the detected sound level exceeds a predetermined threshold, the VAD may be activated in step 606. In step 608, the VAD determines whether the detected utterance is continuous. If the detected utterance is not continuous, the process then returns to step 602. If the detected utterance is continuous, the sampling frequency of the incoming speech may then be increased in step 610. Once activated, the logic can search for continuous utterances through more frequent sampling of the incoming speech.
[0107] In step 612, the system engages the NNE circuit to further process the incoming speech signal. When engaging the NNE circuit, the system may consider several competing interests. For example, the system may consider the user input, the NNE's ability to provide a meaningful SNR (i.e., the NNE's performance limits), and power availability. In one embodiment, when continuous speech is detected, the entire NNE circuit may then be engaged to analyze the incoming speech without modifying the output to the user. This allows the device to analyze the SNR of the incoming speech and determine whether it is preferable to activate the NNE.
[0108] In step 614, if the NNE is activated, the output is optionally modified according to user settings, and the audio stream is delivered to the user. Furthermore, the NNE may use the same model output to analyze the SNR of the incoming audio stream or audio clip and indicate whether the NNE should remain activated.
[0109] In step 618, the control unit, having received SNR feedback from the NNE, determines whether the SNR exceeds the NNE's limit for providing audible speech. For example, if the SNR of the incoming speech is very high (conversation in a quiet room), no NNE processing is necessary. Therefore, the system focuses on a threshold SNR level set by the user or the device itself (e.g., if automatic mode is selected). If the SNR is so low that audible speech cannot be provided even with the NNE fully engaged, the system may forgo filtering, as described above. If the SNR level does not exceed the NNE's limit, the algorithm can then process the incoming signal at a level determined by the system or user (i.e., selecting the lower of the target SNR or the NNE's limit SNR). This step is shown as step 620 in Figure 6. The process may then return to step 602.
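The layered flow of Figure 6 can be summarized with the sketch below. It is a simplified reading of the steps, not the disclosed control logic. The voice activity detector and the SNR probe are passed in as callables so that each is invoked only when the cheaper check before it has passed. All names and threshold values are hypothetical, and step 610 (raising the sampling frequency) is omitted.

```python
def control_step(level_db, vad, probe_snr_db, *,
                 level_threshold_db=45.0,
                 user_target_snr_db=10.0,
                 nne_limit_snr_db=15.0,
                 min_usable_snr_db=-10.0):
    """Return (action, target_snr_db) for one pass through the flow."""
    # Steps 602/604: cheap level check; stay in low-power listening if too quiet.
    if level_db < level_threshold_db:
        return ("listen", None)
    # Steps 606/608: run voice activity detection only once the level check passes.
    if not vad():
        return ("listen", None)
    # Step 612: engage the NNE in analysis mode to estimate the incoming SNR.
    snr_db = probe_snr_db()
    # Step 618: skip enhancement when the input is already good enough,
    # or when it is too poor for the NNE to deliver audible speech.
    if snr_db >= user_target_snr_db or snr_db < min_usable_snr_db:
        return ("bypass", None)
    # Step 620: process toward the lower of the target SNR and the NNE's limit SNR.
    return ("enhance", min(user_target_snr_db, nne_limit_snr_db))


# Loud, continuous speech at a hypothetical 2 dB input SNR.
decision = control_step(60.0, lambda: True, lambda: 2.0)
```

Passing callables rather than precomputed values is a design choice of this sketch: it makes explicit that the more power-hungry stages are never executed for quiet or non-speech input.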
[0110] Figure 7 shows a block diagram of an exemplary SOC package. In Figure 7, the SOC 702 includes one or more central processing unit (CPU) cores 720, an input/output (I/O) interface 740, and a memory controller 742. Various components of the SOC package 702 may be optionally coupled to interconnects or buses as described herein with reference to other figures. The SOC package 702 may also include components as described with reference to the hearing aid system of Figures 1-6. Furthermore, each component of the SOC package 702 may include one or more other components, as described with reference to, for example, Figure 2 or Figure 3. In one embodiment, the SOC package 702 (and its components) are provided on one or more integrated circuit (IC) dies, for example, they are packaged into a single semiconductor device. The single semiconductor device may be configured to be used as a hearing aid, an amplification system, or an auditory device used in the human ear canal.
[0111] As shown in Figure 7, the SOC package 702 is coupled to the memory 760 via the memory controller 742. In one embodiment, the memory 760 (or a portion thereof) can be integrated onto the SOC package 702. The I/O interface 740 may be coupled to one or more I/O devices 770, for example, via interconnects and/or buses as discussed herein. The I/O devices 770 may include means for communicating with the SOC 702. In an exemplary embodiment, the I/O interface 740 communicates wirelessly with the I/O devices 770. The SOC package 702 may include, for example, hardware, software, and logic for implementing the embodiments shown in Figures 1 and 4. The implementation may communicate with auxiliary devices, such as the I/O devices 770. The I/O devices 770 may have additional communication capabilities for accessing the NNE, such as cellular or WiFi.
[0112] Figure 8 is a block diagram of an exemplary auxiliary processing system 800 that may be used in connection with the disclosed principles. In various embodiments, the system 800 may comprise one or more processors 802 and one or more graphics processors 808, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 802 or processor cores 807. In one embodiment, the system 800 is a processing platform embedded in a system-on-a-chip (SoC or SOC) integrated circuit for use in mobile, handheld, or embedded devices.
[0113] Embodiments of the system 800 may include, or be incorporated into, a server-based smart device platform or an online server accessible via the Internet. In some embodiments, the system 800 is a mobile phone, smartphone, tablet computing device, or mobile internet device. The data processing system 800 may also include, or be integrated into, a wearable device such as a smartwatch, smart eyewear device (e.g., face-worn glasses), augmented reality device, or virtual reality device. In some embodiments, the data processing system 800 is a television or set-top box device having one or more processors 802 and a graphical interface generated by one or more graphics processors 808.
[0114] In some embodiments, one or more processors 802 each comprises one or more processor cores 807 for processing instructions that, when executed, perform the actions of the system and user software. In some embodiments, each of the one or more processor cores 807 is configured to process a particular instruction set 809. In some embodiments, the instruction set 809 may facilitate computation via complex instruction set computing (CISC), reduced instruction set computing (RISC), or very long instruction word (VLIW). Multiple processor cores 807 may each process a different instruction set 809, which may include instructions that facilitate the emulation of other instruction sets. The processor cores 807 may include other processing devices, such as digital signal processors (DSPs).
[0115] In some embodiments, the processor 802 includes a cache memory 804. Depending on the architecture, the processor 802 may have a single internal cache or multiple levels of internal caches. In some embodiments, the cache memory is shared among various components of the processor 802. In some embodiments, the processor 802 also uses an external cache (e.g., a level-3 (L3) cache or a last-level cache (LLC)) (not shown), which may be shared among processor cores 807 using known cache coherency techniques. The register file 806 may further include different types of registers (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers) for storing different types of data within the processor 802. Some registers may be general-purpose registers, while others may be specific to the design of the processor 802.
[0116] In some embodiments, the processor 802 is coupled to a processor bus 88 to transmit communication signals, such as addresses, data, or control signals, between the processor 802 and other components in the system 800. In one embodiment, the system 800 uses an exemplary “hub” system architecture, including a memory controller hub 816 and an input/output (I/O) controller hub 830. The memory controller hub 816 facilitates communication between memory devices and other components of the system 800, while the I/O controller hub (ICH) 830 provides connectivity to I/O devices via a local I/O bus. In one embodiment, the logic of the memory controller hub 816 is integrated within the processor.
[0117] The memory device 820 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or another memory device having performance suitable for functioning as process memory. In one embodiment, the memory device 820 can act as system memory of the system 800 to store data 822 and instructions 821 for use by one or more processors 802 when executing an application or process. The memory controller hub 816 can also be coupled with an optional external graphics processor 812 and can communicate with one or more graphics processors 808 within the processor 802 to perform graphics and media operations.
[0118] In some embodiments, the ICH 830 allows peripherals to connect to memory devices 820 and processor 802 via a high-speed I/O bus. I/O peripherals include, but are not limited to, an audio controller 846, a firmware interface 828, wireless transceivers 826 (e.g., Wi-Fi, Bluetooth), data storage devices 824 (e.g., hard disk drives, flash memory, etc.), and a legacy I/O controller 840 for connecting legacy (e.g., Personal System 2 (PS/2)) devices to the system. One or more Universal Serial Bus (USB) controllers 842 connect input devices such as a keyboard and mouse combination 844. A network controller 834 can also be connected to the ICH 830. In some embodiments, a high-performance network controller (not shown) is coupled to the processor bus 88. It will be understood that the illustrated system 800 is illustrative and not limiting, as other types of data processing systems with different configurations may also be used. For example, the I/O controller hub 830 may be integrated into one or more processors 802, or the memory controller hub 816 and the I/O controller hub 830 may be integrated into a separate external graphics processor such as an external graphics processor 812.
[0119] Figure 9 is a generalized diagram of the machine learning software stack 900. The machine learning application 902 may be configured to implement machine intelligence relating to the disclosed principles, either by training a neural network using a training dataset or by using a trained deep neural network. The machine learning application 902 may include training and inference functionality for a neural network and/or dedicated software that can be used to train the neural network before deploying it to an auditory device. The machine learning application 902 can implement any type of machine intelligence, including but not limited to image recognition, mapping and localization, autonomous navigation, speech synthesis, medical image processing, or language translation.
[0120] Hardware acceleration for machine learning application 902 can be enabled via machine learning framework 904. Machine learning framework 904 can provide a library of machine learning primitives. Machine learning primitives are fundamental operations commonly performed by machine learning algorithms. Without machine learning framework 904, developers of machine learning algorithms would need to create and optimize the major computational logic associated with their algorithms, and then re-optimize that computational logic every time a new parallel processor is developed. Instead, machine learning applications can be configured to perform the necessary computations using primitives provided by machine learning framework 904. Exemplary primitives include tensor convolution, activation functions, and pooling, which are computational operations performed while training a convolutional neural network (CNN). Machine learning framework 904 can also provide primitives for performing fundamental linear algebra subprograms performed by many machine learning algorithms, such as matrix and vector operations.
[0121] The machine learning framework 904 can process input data received from the machine learning application 902 and generate appropriate input to the compute framework 906. The compute framework 906 can abstract the underlying instructions provided to the GPGPU driver 908 so that the machine learning framework 904 can utilize hardware acceleration via the GPGPU hardware 910 without requiring the machine learning framework 904 to be familiar with the architecture of the GPGPU hardware 910. Furthermore, the compute framework 906 can enable hardware acceleration of the machine learning framework 904 across various types and generations of the GPGPU hardware 910.
[0122] The computational architectures provided by the embodiments described herein can be configured to perform a type of parallel processing particularly suited to training and deploying neural networks for machine learning implementations on auditory devices. A neural network can be generalized as a network of functions having graph relationships. As is known in the art, there are various types of implementations of neural networks used in machine learning. One exemplary type of neural network is the feedforward network, as described above.
[0123] A second exemplary type of neural network is the CNN. A CNN is a special type of feedforward neural network for processing data with a known grid-like topology, such as image data. Thus, CNNs are commonly used for computation in visual and image recognition applications, but they are also used for other types of pattern recognition, such as inference, speech, and language processing. The nodes in the CNN input layer are organized into a set of filters (feature detectors inspired by the receptive fields found in the retina), and the output of each set of filters is propagated to the nodes in subsequent layers of the network. The computation of a CNN involves applying a convolution operation to each filter to produce the output of that filter. Convolution is a special type of mathematical operation performed on two functions to produce a third function that is a modified version of one of the two original functions. In the terminology of convolutional networks, the first function of the convolution is called the input, and the second function is called the convolution kernel. The output is called a feature map. For example, the input to the convolutional layer can be a multidimensional array of data that defines various color components of the incoming image. The convolutional kernel can be a multidimensional array of parameters, which are adapted through the neural network training process.
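A one-dimensional example makes the input, kernel, and feature map terminology concrete. The sketch below computes a discrete convolution without padding, so only positions where the kernel fully overlaps the input are produced. The function name and values are illustrative. Many deep learning libraries skip the kernel flip and compute a cross-correlation instead, which is equivalent once the kernel is learned.

```python
def conv1d_valid(signal, kernel):
    """Discrete 1-D convolution, keeping only fully overlapping positions."""
    n, k = len(signal), len(kernel)
    if k > n:
        raise ValueError("kernel must not be longer than the input")
    flipped = kernel[::-1]  # true convolution reverses the kernel
    return [
        sum(signal[i + j] * flipped[j] for j in range(k))
        for i in range(n - k + 1)
    ]


# The input is a ramp; this kernel responds to the local slope of the input.
feature_map = conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1])
```

Because the input rises by the same amount everywhere, every entry of the feature map is the same value.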
[0124] Recurrent neural networks (RNNs) are a family of neural networks that extend feedforward networks with feedback connections between layers. RNNs enable the modeling of sequential data by sharing parameter data across different parts of the neural network. The architecture of an RNN includes cycles. Cycles represent the influence that the current value of a variable has on its own value at a future point in time, as at least a portion of the output data from the RNN is used as feedback to process subsequent inputs in the sequence. This feature makes RNNs particularly useful for auditory processing due to the variable ways in which auditory data can be structured.
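The feedback cycle can be shown with a single recurrent unit. This is a minimal scalar sketch with made-up weights, not a network from the disclosure: the hidden state computed for one input is fed back when the next input is processed, and the same weights are reused at every step.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One recurrent update: the new state depends on the input and the previous state."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_rnn(inputs, w_in=0.5, w_rec=0.8, bias=0.0):
    """Process a sequence, reusing the same parameters at every time step."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


# The second input is zero, yet the second state is nonzero:
# it carries information from the first input through the feedback path.
states = run_rnn([1.0, 0.0])
```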
[0125] The diagrams described herein present exemplary feedforward, CNN, and RNN networks, and illustrate the general processes for training and deploying each of these types of networks, respectively. These descriptions are illustrative and not limiting to any particular embodiment described herein, and it will be understood that the illustrated concepts are generally applicable to deep neural networks and machine learning techniques.
[0126] The exemplary neural networks described above can be used to perform deep learning in order to implement one or more of the disclosed principles. Deep learning is a form of machine learning that uses deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, in contrast to shallow neural networks which have only a single hidden layer. Deeper neural networks generally require more computation to train. However, the increased number of hidden layers in the network enables multi-stage pattern recognition, resulting in a reduction in output error compared to shallow machine learning techniques.
[0127] Deep neural networks used in deep learning typically consist of a front-end network that performs feature recognition coupled to a back-end network representing a mathematical model that can perform operations (e.g., object classification, noise and / or speech recognition) based on the feature representations provided to the model. Deep learning enables machine learning without requiring manual feature engineering for the model. Instead, deep neural networks can learn features based on statistical structures or correlations in the input data. The learned features can be provided to a mathematical model that can map the detected features to an output. The mathematical models used by the network are generally specialized for the specific task being performed, and different models are used for performing different tasks.
[0128] Once a neural network is structured, a learning model can be applied to the network to train it to perform a specific task. The learning model describes how to adjust the weights within the model to reduce the network's output error. Error backpropagation is a common method used to train neural networks. An input vector is presented to the network for processing. The network's output is compared to the desired output using a loss function, and an error value is calculated for each neuron in the output layer. The error values are propagated backward until each neuron has an associated error value that roughly represents its contribution to the original output. Then, algorithms such as the stochastic gradient descent algorithm are used to learn from these errors and update the weights of the neural network.
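The weight update loop can be illustrated on the smallest possible case: a single linear neuron trained with stochastic gradient descent on a squared-error loss. With only one layer, backpropagation reduces to a single chain-rule step, so this sketch shows the update rule rather than a full multi-layer backward pass. The data, learning rate, and names are illustrative.

```python
def sgd_train(samples, lr=0.1, epochs=200):
    """Fit y = w*x + b by stochastic gradient descent, one sample at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            prediction = w * x + b
            # Loss is 0.5 * (prediction - y)**2, so dLoss/dprediction is the error.
            error = prediction - y
            # Chain rule: dLoss/dw = error * x and dLoss/db = error.
            w -= lr * error * x
            b -= lr * error
    return w, b


# Samples drawn from the line y = 2x + 1.
w, b = sgd_train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

After training, `w` and `b` should be close to 2 and 1, the parameters of the line that generated the samples.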
[0129] Figure 10 shows the training and deployment of a deep neural network according to one embodiment of the present disclosure. Once a given auditory network is structured for the task, the neural network can be trained using the training dataset 1002. Various training frameworks have been developed to enable hardware acceleration of the training process. For example, the machine learning framework 904 in Figure 9 can be configured as training framework 1004. Training framework 1004 can hook into an untrained neural network 1006, train the untrained neural network using the parallel processing resources described herein, and produce a trained neural network 1008. To start the training process, initial weights (e.g., amplification gains corresponding to sound sources) can be selected randomly or by pre-training using a deep belief network. The training cycle is then performed either supervised or unsupervised.
[0130] Supervised learning is a learning method in which training is performed as a mediated operation, such as when the training dataset 1002 contains inputs paired with the desired outputs for those inputs, or when the training dataset contains inputs with known outputs and the output of the neural network is manually graded. The network processes the inputs and compares the resulting outputs to a set of expected or desired outputs. The error is then propagated back through the system. The training framework 1004 can adjust the weights that control the untrained neural network 1006. The training framework 1004 can provide tools to monitor how well the untrained neural network 1006 is converging toward a model that is suitable for producing the correct answers based on known input data. The training process occurs repeatedly so that the network's weights are tuned to improve the output produced by the auditory neural network. The training process can continue until the neural network reaches a statistically desirable accuracy associated with the trained neural network 1008. This decision may be made by technical and auditory experts or it may be performed at a machine level. The trained neural network 1008 can then be deployed to perform any number of machine learning operations.
[0131] Unsupervised learning is an exemplary learning method in which a network attempts to train itself using unlabeled data. Thus, in unsupervised learning, the training dataset 1002 contains input data without associated output data. An untrained neural network 1006 can learn groupings within the unlabeled inputs and determine how individual inputs relate to the entire dataset. Unsupervised training can be used to generate self-organizing maps, a type of trained neural network 1007 that can perform useful operations to reduce the dimensionality of data. Unsupervised training can also be used to perform anomaly detection, thereby identifying data points in an input dataset that deviate from the normal pattern of the data.
[0132] Variations of supervised and unsupervised learning can also be employed. Semi-supervised learning is a method in which the training dataset 1002 contains a mixture of labeled and unlabeled data of the same distribution. Incremental learning is a variation of supervised learning in which input data is continuously used to further train the model. Incremental learning allows the trained neural network 1008 to adapt to new data 1012 without forgetting the knowledge implanted in the network during initial training. All of the aforementioned training can be carried out in collaboration with auditory specialists, doctors, and technicians.
[0133] Whether supervised or unsupervised, the training process for deep neural networks, in particular, can be computationally intensive on a single computing node. Instead of using a single node, a distributed network of computing nodes can be used to accelerate the training process.
[0134] Example 1 is a device for improving an incoming audio signal, comprising: a control unit that receives the incoming signal and provides a control unit output signal; a neural network engine (NNE) circuit that communicates with the control unit, is activatable by the control unit, and is configured to generate an NNE output signal from the control unit output signal; and a digital signal processing (DSP) circuit that receives one or more of the control unit output signals or NNE circuit output signals and thereby generates a processed signal, wherein the control unit determines the processing path of the control unit output signal through one of the DSP circuit or NNE circuit as a function of one or more of predetermined parameters, characteristics of the incoming signal, and feedback of the NNE circuit.
[0135] Example 2 is directed towards the device of Example 1, in which predetermined parameters have user-defined and user-independent characteristics.
[0136] Example 3 is directed to the apparatus of Example 2, wherein the user-defined characteristics further include one or more of the following: a user signal-to-noise ratio (U-SNR) threshold and natural speaker identification information.
[0137] Example 4 is directed towards the device of Example 2, in which user-independent characteristics include one or more of the available power levels and the system signal-to-noise ratio (S-SNR) threshold.
[0138] Example 5 is directed to the device of Example 1, wherein the characteristics of the incoming signal include detectable voice or detectable silence.
[0139] Example 6 is directed to the device of Example 5, wherein the control unit disengages at least one of the DSP or NNE when it detects silence, the silence being defined by a noise level below a predetermined threshold.
[0140] Example 7 is directed to the device of Example 1, wherein the feedback of the NNE circuit includes a detected SNR value.
[0141] Example 8 is directed to the device of Example 1, wherein the feedback from the NNE circuit includes instructions for voice detection in the NNE circuit.
[0142] Example 9 is directed towards the apparatus of Example 1, in which the control unit is configured to send audio clips to the NNE circuit and receive feedback from the NNE circuit.
[0143] Example 10 is directed to the device of Example 9, wherein an audio clip defines a portion of the incoming signal and is transmitted intermittently from the control unit.
[0144] Example 11 is directed to the apparatus of Example 9, in which audio clips have a predetermined length, are transmitted at predetermined intervals and at a certain frequency, the transmission frequency being determined as a function of the feedback signal of the NNE circuit.
[0145] Example 12 is directed towards the apparatus of Example 1, in which the control unit determines the processing path of the control unit output signal in virtually real time.
[0146] Example 13 is directed towards the device of Example 1, in which the control unit, DSP, and NNE are integrated on a system-on-a-chip (SOC).
[0147] Example 14 is directed to the device of Example 1, in which the control unit, DSP, and NNE are integrated into a hearing aid configured to fit into a human ear.
[0148] Example 15 is directed to the apparatus of Example 1, further comprising an active noise cancellation (ANC) circuit for processing the control unit output signal.
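The path selection recited in Examples 1 to 6 can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the `Params` fields, their default values, the order of the checks, and the path labels `'idle'`, `'dsp'`, and `'nne+dsp'` are hypothetical, and the S-SNR threshold and speaker identification information are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Params:
    """Predetermined parameters (all values are illustrative assumptions)."""
    silence_level_db: float = -60.0   # below this level a frame is treated as silence
    u_snr_threshold_db: float = 10.0  # user signal-to-noise ratio (U-SNR) threshold
    min_power_fraction: float = 0.15  # remaining power needed before the NNE is engaged

def select_path(level_db, nne_snr_db, power_fraction, params=Params()):
    """Return 'idle', 'dsp', or 'nne+dsp' for the current frame."""
    # Characteristic of the incoming signal: detectable silence.
    if level_db < params.silence_level_db:
        return "idle"
    # User-independent parameter: available power level.
    if power_fraction < params.min_power_fraction:
        return "dsp"
    # NNE feedback: a detected SNR below the threshold is treated as noisy speech,
    # so the NNE is engaged ahead of the DSP (assumed policy).
    if nne_snr_db is not None and nne_snr_db < params.u_snr_threshold_db:
        return "nne+dsp"
    return "dsp"
```

For instance, `select_path(-30.0, 5.0, 0.9)` selects the NNE path under these assumed thresholds, while the same frame with a depleted battery (`power_fraction=0.05`) falls back to the DSP-only path.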
[0149] Example 16 relates to a method for improving the quality of an incoming audio signal, comprising: a control unit receiving the incoming signal and providing a control unit output signal; activating a neural network engine (NNE) and processing the control unit output signal to generate a neural network engine (NNE) output signal and an NNE feedback signal; and activating a digital signal processing (DSP) circuit to receive one or more of the control unit output signal and NNE circuit output signals and generate a processed signal, wherein the control unit determines the processing path of the control unit output signal through one of the DSP circuit or NNE circuit as a function of one or more of predetermined parameters, characteristics of the incoming signal, and feedback of the NNE circuit.
[0150] Example 17 is directed towards the method of Example 16, in which predetermined parameters have user-defined and user-independent characteristics.
[0151] Example 18 is directed toward the method of Example 17, wherein the user-defined characteristics further include one or more of the user signal-to-noise ratio (U-SNR) threshold and natural speaker identification information.
[0152] Example 19 is directed toward the method described in Example 17, wherein the user-independent characteristics further include one or more of the available power levels and system signal-to-noise ratio (S-SNR) thresholds.
[0153] Example 20 is directed toward the method of Example 16, wherein the characteristics of the incoming signal include detectable sound or detectable silence.
[0154] Example 21 is directed toward the method of Example 20, further comprising disengaging the DSP and NNE when silence is detected in the control unit.
[0155] Example 22 is directed toward the method of Example 16, further comprising detecting an SNR value at the NNE and providing the detected SNR value as the NNE feedback signal.
[0156] Example 23 is directed toward the method of Example 16, wherein the NNE feedback signal further comprises instructions for voice detection in the NNE.
[0157] Example 24 is directed toward the method of Example 16, further comprising sending an audio clip from the control unit to the NNE before receiving the NNE feedback signal.
[0158] Example 25 is directed toward the method of Example 24, in which an audio clip defines a portion of the incoming signal and is transmitted intermittently.
[0159] Example 26 is directed toward the method of Example 24, in which the audio clip has a predetermined length, is transmitted at predetermined intervals and at a certain frequency, and the transmission frequency is determined as a function of the feedback signal of the NNE circuit.
[0160] Example 27 is directed toward the method of Example 16, further comprising determining the processing path in real time at the control unit.
[0161] Example 28 is directed toward the method of Example 16, further comprising integrating the control unit, DSP, and NNE on a system-on-chip (SOC).
[0162] Example 29 is directed toward the method of Example 16, further comprising integrating the control unit, DSP, and NNE into a hearing aid configured to fit the human ear.
[0163] Example 30 is directed toward the method of Example 16, further comprising engaging an active noise cancellation (ANC) circuit when processing the control output signal via the NNE circuit.
[0164] Example 31 is directed to a non-transient machine-readable medium comprising instructions that, when executed by computing hardware including a processor circuit coupled to a memory circuit, cause the computing hardware to perform the following actions: receive an incoming signal in a control unit and provide a control unit output signal; activate a neural network engine (NNE) to process the control unit output signal and generate an NNE output signal and an NNE feedback signal; and activate a digital signal processing (DSP) circuit to receive one or more of the control unit output signal and NNE circuit output signals and generate a processed signal, wherein the control unit determines the processing path of the control unit output signal through one of the DSP circuit or NNE circuit as a function of one or more of predetermined parameters, characteristics of the incoming signal, and feedback of the NNE circuit.
[0165] Example 32 is directed to the medium of Example 31, in which predetermined parameters have user-defined and user-independent characteristics.
[0166] Example 33 is directed to the medium of Example 32, wherein the user-defined characteristics further include one or more of the user signal-to-noise ratio (U-SNR) threshold and natural speaker identification information.
[0167] Example 34 is directed to the medium of Example 32, wherein the user-independent characteristics further include one or more of the available power levels and system signal-to-noise ratio (S-SNR) thresholds.
[0168] Example 35 is directed to the medium of Example 31, wherein the characteristics of the incoming signal include detectable sound or detectable silence.
[0169] Example 36 is directed to the medium of Example 35, in which the instructions further cause the computing hardware to disengage the DSP and NNE when silence is detected in the control unit.
[0170] Example 37 is directed to the medium of Example 31, in which the instructions further cause the computing hardware to detect an SNR value at the NNE and to provide the detected SNR value as the NNE feedback signal.
[0171] Example 38 is directed to the medium of Example 31, wherein the NNE feedback signal further comprises instructions for voice detection in the NNE.
[0172] Example 39 is directed to the medium of Example 31, in which the instructions cause the computing hardware to send an audio clip from the control unit to the NNE before receiving the NNE feedback signal.
[0173] Example 40 is directed to the medium of Example 39, where an audio clip defines a portion of the incoming signal and is transmitted intermittently.
[0174] Example 41 is directed to the medium of Example 39, where audio clips have a predetermined length, are transmitted at predetermined intervals and at a certain frequency, the transmission frequency being determined as a function of the feedback signal of the NNE circuit.
[0175] Example 42 is directed to the medium of Example 31, wherein the instructions further cause the computing hardware to determine the processing path in real time at the control unit.
[0176] Example 43 is directed to the medium described in Example 31, in which the control unit, DSP, and NNE are integrated into a hearing aid configured to fit the human ear.
[0177] Example 44 is directed to an auditory system for improving incoming audio signals, comprising: a front-end receiver that receives one or more incoming audio signals, at least one of which has multiple signal components, each signal component corresponding to a respective signal source; a control unit that communicates with the front-end receiver, receives an input signal from the front-end receiver, provides a control unit output signal, and selectively provides the output signal to at least one of a first or second signal processing path; a neural network engine (NNE) circuit that communicates with the control unit, defines a part of the first signal processing path, is activatable by the control unit, and is configured to generate an NNE output signal from the control unit output signal; and a digital signal processing (DSP) circuit that forms part of the first and second signal processing paths, receives one or more of the control unit output signals or NNE circuit output signals, and generates a processed signal therefrom, wherein the front-end receiver, control unit, NNE circuit, and DSP circuit are formed on an integrated circuit (IC).
[0178] Example 45 is directed to the auditory system of Example 44, further comprising a backend receiver that receives the output signal from the DSP and forms an audible signal.
[0179] Example 46 is directed to the auditory system of Example 45, wherein the auditory system defines one of a hearing aid, headphones, or face-worn glasses, and the audible signal is formed less than 32 milliseconds after receiving the incoming signal.
[0180] Example 47 is directed towards the auditory system of Example 44, in which the IC is a system-on-a-chip (SOC).
[0181] Example 48 is directed towards the auditory system of Example 47, further comprising a housing that accommodates the SOC and power supply.
[0182] Example 49 is directed towards the auditory system of Example 44, in which the control unit determines the processing path of the control unit output signal as a function of feedback in the NNE circuit.
[0183] Example 50 is directed to the auditory system of Example 44, in which the control unit determines the processing path of the control unit output signal as a function of one or more of predetermined parameters, characteristics of the incoming signal, and feedback of the NNE circuit.
[0184] Example 51 is directed towards the auditory system of Example 44, which further includes a wireless communication system.
[0185] Example 52 is directed towards the auditory system of Example 44, in which an NNE circuit adjusts the relative volume of components of the incoming signal, and a DSP circuit applies frequency and time-varying gains to the received signal.
[0186] Example 53 is directed to the auditory system of Example 52, wherein the components of the incoming signal further comprise at least speech and noise, and the speech volume is increased relative to the noise volume.
[0187] Example 54 is directed to the auditory system of Example 44, in which the front-end receiver processes an incoming signal to provide an input signal to the control unit, the incoming signal including one or more of speech components and noise components.
[0188] Example 55 is directed to the auditory system of Example 52, wherein the NNE circuit selectively applies a ratio mask to the incoming signal of the front-end receiver to obtain multiple components, each of which corresponds to a speech class.
[0189] Example 56 is directed towards the auditory system of Example 44, in which the NNE circuit is configured to selectively apply a complex ratio mask to the control output signal to obtain multiple signal components, each of which corresponds to a speech class or individual speaker, and the NNE circuit is further configured to couple the multiple components to the output signal, and the volume of each component is adjusted relative to at least one other component according to a predetermined user-controlled signal-to-noise ratio.
[0190] Example 57 is directed to the auditory system of Example 56, wherein the signal components further include speech and noise, and the output signal has a speech volume increased relative to the noise volume.
[0191] Example 58 is directed to the auditory system of Example 56, wherein the signal components further include user utterances and several other sound sources, and the output signal includes user utterances reduced relative to the other sound sources.
[0192] Example 59 is directed towards the auditory system of Example 56, in which the NNE circuit is further configured to set the volume of each of the different sound sources as a function of user-controlled parameters.
[0193] Example 60 is directed to the auditory system of Example 44, where the second signal processing path excludes signal processing through the NNE.
[0194] Example 61 is directed to the auditory system of Example 44, in which the NNE circuit is further configured to perform one or more DSP functions.
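The complex ratio mask of Examples 56 and 57 can be illustrated with a short sketch operating on complex time-frequency bins. The function names, the sample values, and the choice to treat everything outside the masked speech as noise are assumptions for illustration only; in practice the mask values would come from the NNE rather than being fixed by hand.

```python
def apply_complex_ratio_mask(mixture_bins, mask_bins):
    """Multiply each complex time-frequency bin by its complex mask value.

    A complex mask can change both the magnitude and the phase of a bin.
    """
    return [m * x for m, x in zip(mask_bins, mixture_bins)]

def remix(mixture_bins, mask_bins, noise_gain):
    """Recombine speech and residual noise with a relative noise gain."""
    speech = apply_complex_ratio_mask(mixture_bins, mask_bins)
    # Assumption: whatever the mask does not assign to speech is treated as noise.
    noise = [x - s for x, s in zip(mixture_bins, speech)]
    return [s + noise_gain * n for s, n in zip(speech, noise)]

# Hypothetical two-bin frame and hand-picked mask values.
mixture = [1 + 1j, 2 + 0j]
mask = [0.5 + 0j, 0 + 0.5j]

speech_only = remix(mixture, mask, noise_gain=0.0)  # noise fully suppressed
unchanged = remix(mixture, mask, noise_gain=1.0)    # reproduces the mixture
```

A `noise_gain` between 0 and 1 lowers the noise relative to the speech without removing it entirely, which corresponds to adjusting the volume of one component relative to another.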
[0195] Example 62 is a method for improving the quality of an incoming audio signal, comprising: receiving one or more incoming audio signals in a front-end receiver, at least one of the incoming audio signals having a plurality of signal components, each signal component corresponding to a respective signal source; receiving an input signal in a control unit and providing a control unit output signal, the control unit selectively providing the output signal to at least one of a first or second signal processing path; generating an NNE output signal from the control unit output signal in a neural network engine (NNE) circuit activatable by the control unit, the NNE defining and generating at least a portion of a first signal processing path; and generating a processing signal from the control unit output signal or the NNE circuit output signal in a digital signal processing (DSP) circuit, the DSP defining and generating at least a portion of the first and second signal processing paths, wherein the front-end receiver, control unit, NNE circuit, and DSP circuit are formed on an integrated circuit (IC).
[0196] Example 63 is directed toward the method of Example 62, further comprising forming an output signal from the processed signal in the backend receiver.
[0197] Example 64 is directed toward the method of Example 63, further comprising forming an output signal in less than 32 milliseconds after receiving an incoming signal.
[0198] Example 65 is directed to the method of Example 63, where the auditory system defines one of the following: a hearing aid, headphones, or faceworn glasses.
[0199] Example 66 is directed toward the method of Example 62, in which the IC comprises a system-on-a-chip (SOC).
[0200] Example 67 is directed toward the method of Example 66, further comprising a housing that accommodates the SOC and power supply.
[0201] Example 68 is directed toward the method of Example 62, further comprising determining the processing path of the control unit output signal as a function of the feedback of the NNE circuit.
[0202] Example 69 relates to the method of Example 62, further comprising determining the processing path of the control unit output signal as a function of one or more of predetermined parameters, characteristics of the incoming signal, and feedback of the NNE circuit.
[0203] Example 70 is directed toward the method of Example 62, further comprising processing an incoming signal having one or more of speech components and noise components in the front-end receiver so as to provide an input signal to the control unit.
[0204] Example 71 is directed toward the method of Example 70, in which the NNE circuit selectively applies a ratio mask to the incoming signal of the front-end receiver so that it obtains multiple components, each of which corresponds to an audio class.
[0205] Example 72 is directed toward the method of Example 62, further comprising applying a complex ratio mask to the control output signal in the NNE circuit to obtain multiple signal components, each of which corresponds to a speech class or individual speaker, and coupling the multiple components to the output signal in the NNE circuit, wherein the volume of each component is adjusted relative to at least one other component according to a predetermined user-controlled signal-to-noise ratio.
[0206] Example 73 is directed toward the method of Example 72, wherein the signal components further comprise speech and noise, and the output signal comprises speech volume increased relative to noise volume.
[0207] Example 74 is directed toward the method of Example 72, wherein the signal components further comprise user utterances and several other sound sources, and the output signal comprises user utterances reduced relative to the other sound sources.
[0208] Example 75 is directed toward the method of Example 72, in which the NNE circuit is further configured to set the volume of each of the different sound sources as a function of user-controlled parameters.
[0209] Example 76 is directed toward the method of Example 62, in which signal processing through the first signal processing path excludes signal processing through the NNE.
[0210] Example 77 is directed to at least one non-transient machine-readable medium comprising instructions that, when executed by computing hardware including a processor circuit coupled to a memory circuit, cause the computing hardware to perform the following actions: receive one or more incoming audio signals at a front-end receiver, at least one of the incoming audio signals having multiple signal components, each signal component corresponding to a respective signal source; receive an input signal from the front-end receiver at a control unit and provide a control unit output signal, the control unit selectively providing the output signal to at least one of a first signal processing path or a second signal processing path; in a neural network engine (NNE) circuit activatable by the control unit, generate an NNE output signal from the control unit output signal, wherein the NNE defines at least a portion of the first signal processing path; and in a digital signal processing (DSP) circuit, generate a processed signal from the control unit output signal or the NNE circuit output signal, wherein the DSP defines at least a portion of the first and second signal processing paths; wherein the front-end receiver, control unit, NNE circuit, and DSP circuit are formed on an integrated circuit (IC).
[0211] Example 78 is directed to the medium of Example 77, wherein the instructions further cause the computing hardware to form an output signal from the processed signal in the backend receiver.
[0212] Example 79 is directed to the medium of Example 78, wherein the instructions further cause the computing hardware to form an output signal in less than 32 milliseconds after receiving an incoming signal.
[0213] Example 80 is directed to the medium of Example 78, wherein the auditory system defines one of the following: a hearing aid, headphones, or face-worn glasses.
[0214] Example 81 is directed towards the medium of Example 77, in which the IC has a system-on-a-chip (SOC).
[0215] Example 82 is directed to the medium of Example 77, in which the instructions further cause the computing hardware to determine the processing path of the control unit output signal as a function of the feedback of the NNE circuit.
[0216] Example 83 is directed to the medium of Example 77, in which the instructions further cause the computing hardware to determine the processing path of the control unit output signal as a function of one or more of predetermined parameters, characteristics of the incoming signal, and feedback of the NNE circuit.
[0217] Example 84 is directed to the medium of Example 77, in which the instructions further cause the computing hardware to process an incoming signal having one or more of speech components and noise components in the front-end receiver so as to provide an input signal to the control unit.
[0218] Example 85 is directed to the medium of Example 84, wherein the NNE circuit selectively applies a ratio mask to the incoming signal of the front-end receiver to obtain multiple components, each of which corresponds to an audio class.
[0219] Example 86 is directed to the medium of Example 77, in which the instructions further cause the computing hardware to apply a complex ratio mask to the control output signal in the NNE circuit to obtain multiple signal components, each of which corresponds to a speech class or individual speaker, and to couple the multiple components to the output signal in the NNE circuit, wherein the volume of each component is adjusted relative to at least one other component according to a predetermined user-controlled signal-to-noise ratio.
[0220] Example 87 is directed to the medium of Example 86, wherein the signal components further include speech and noise, and the output signal has a speech volume increased relative to the noise volume.
[0221] Example 88 is directed to the medium of Example 84, wherein the signal components further include user utterances and several other sound sources, and the output signal includes user utterances reduced relative to the other sound sources.
[0222] Example 89 is directed to the medium of Example 84, wherein the instructions further cause the computing hardware to set the volume of each of the different sound sources as a function of user-controlled parameters.
[0223] Example 90 is directed to the medium of Example 77, where the signal processing through the first signal processing path excludes the signal processing through the NNE.
[0224] Example 91 is directed to an ear-worn auditory system that improves incoming audio signals, comprising a neural network engine (NNE) circuit configured to improve continuously received signal samples and then output a continuous audible signal based on the improved signal samples.
[0225] Example 92 is directed to the auditory system of Example 91, in which the audible signal is generated approximately 32 milliseconds or less after the reception of the received signal.
[0226] Example 93 is directed to the auditory system of Example 91, in which the audible signal is generated approximately 10 milliseconds or less after the reception of the received signal.
[0227] Example 94 is directed to the auditory system of Example 91, in which the audible signal is generated approximately 10–20 milliseconds, 12–8 milliseconds, 10–6 milliseconds, or 8–3 milliseconds from the reception of the incoming speech signal.
[0228] Example 95 is directed towards the auditory system of Example 92, where the neural network performs at least one billion operations per second.
[0229] Example 96 is directed to the auditory system of Example 95, in which the NNE circuit is configured to process the audio signal with associated power consumption of approximately 2 milliwatts or less.
[0230] Example 97 is directed to the auditory system described in Example 96, wherein the NNE circuit is formed on a system-on-chip (SOC) and further comprises multiple non-transient executable logics for performing multiple precision levels of signal processing operations.
[0231] Example 98 is directed towards the auditory system of Example 91, in which a neural network improves the audio signal by estimating a complex ratio mask for each signal sample and obtaining the desired signal components.
[0232] Example 99 is directed towards the auditory system of Example 98, where the desired signal component is speech.
[0233] Example 100 is directed to the auditory system of Example 99, wherein the desired signal component is one or more recognized speakers.
[0234] Example 101 is directed to the auditory system of Example 98, wherein the improved audio signal has reduced background noise that is user-configurable.
[0235] Example 102 is directed toward the auditory system of Example 101, further comprising physical control switches accessible on the auditory system for adjusting the background noise level.
[0236] Example 103 is directed toward the auditory system of Example 101, further comprising a logic control switch accessible via an auxiliary device for adjusting the background noise level.
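As a rough worked calculation, the throughput of Example 95 and the power figure of Example 96 can be combined into an energy budget per operation. This bound is only an arithmetic illustration derived from those two figures, treating both as exact limits; it is not a value stated in the disclosure. With power P of at most 2 milliwatts and an operation rate R of at least one billion operations per second:

```latex
E_{\text{op}} = \frac{P}{R}
  \le \frac{2 \times 10^{-3}\ \text{W}}{1 \times 10^{9}\ \text{operations/s}}
  = 2 \times 10^{-12}\ \text{J}
  = 2\ \text{pJ per operation}
```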
[0237] Example 104 is directed to an in-ear hearing system for improving an incoming audio signal, comprising: a neural network engine (NNE) circuit configured to improve the audibility of the received signal and provide an improved continuous output signal; and a control dial for adjusting background noise by manipulating at least one NNE circuit configuration to respond to user input.
[0238] Example 105 is directed to the auditory system described in Example 104, in which the control dial is provided with an adjustable physical dial.
[0239] Example 106 is directed towards the auditory system of Example 104, where the control dial affects the signal-to-noise ratio (SNR) of the continuous output signal.
[0240] Example 107 is directed to the auditory system of Example 104, in which the control dial exclusively affects the noise components of the incoming sound.
[0241] Example 108 is directed to a device for improving the audibility of an audio signal, comprising: a neural network engine (NNE) circuit that receives one or more input audio signals and outputs one or more intermediate signals, each intermediate signal comprising an audio signal corresponding to one or more sound sources; and a sound mixer circuit configured to receive the one or more intermediate signals, assign a gain to each intermediate signal, and recombine the one or more intermediate signals to form a new output signal, wherein the gain assigned to the one or more intermediate signals is set to achieve a target signal-to-noise ratio (SNR), and the target SNR is determined as a function of at least one user-specific criterion and at least one user-independent criterion.
[0242] Example 109 is directed to the apparatus of Example 108, where user-specific criteria include volume targets for a specific desired signal voice class and noise voice class, or a desired ratio of volume between a desired voice class and the SNR.
[0243] Example 110 is directed to the device of Example 109, wherein the volume of a desired voice class is user-controlled.
[0244] Example 111 is directed to the device of Example 108, in which the number and configuration of intermediate signals output by the neural network can be configured according to user-specific selection criteria.
[0245] Example 112 is directed to the apparatus of Example 109, further comprising user-specific criteria that provide desired amplification for one or more natural speakers.
[0246] Example 113 is directed to the apparatus of Example 109, further comprising a user-independent criterion that provides an estimated SNR of the most recently received and processed input audio signal.
[0247] Example 114 is directed to the apparatus of Example 109, wherein the user-independent criteria further include the estimation error of the neural network.
[0248] Example 115 is directed to the apparatus of Example 114, wherein the sound mixer circuit recombines the one or more intermediate signals based on the neural network's prediction error to form a new output signal.
[0249] Example 116 is directed to the apparatus of Example 108, where the target SNR is determined as the lower of either the user's desired SNR or the SNR based on the neural network's estimation error.
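The gain assignment of Examples 108 and 116 can be illustrated with a minimal sketch. It assumes only two intermediate signals (one speech, one noise), holds the speech gain at 1, and scales the noise so that the recombined output reaches the target SNR, taken as the lower of the user's desired SNR and an SNR limit derived from the network's estimation error. The function names, the two-signal simplification, and the rule that noise is never boosted above its original level are assumptions made for illustration and are not taken from the specification.

```python
import math


def power(signal):
    """Mean power of a block of samples."""
    return sum(x * x for x in signal) / len(signal)


def target_snr_db(user_snr_db, error_limited_snr_db):
    """Example 116: use the lower of the user's desired SNR and the
    SNR permitted by the neural network's estimation error."""
    return min(user_snr_db, error_limited_snr_db)


def mix(speech, noise, user_snr_db, error_limited_snr_db):
    """Recombine two intermediate signals so the output hits the target SNR.

    Assumption: speech gain is fixed at 1 and only the noise gain changes.
    Assumption: the noise gain is capped at 1 so noise is never amplified.
    Returns (output_samples, noise_gain).
    """
    target = target_snr_db(user_snr_db, error_limited_snr_db)
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        # SNR is undefined for a silent component; pass both through.
        return [s + n for s, n in zip(speech, noise)], 1.0
    current = 10.0 * math.log10(p_speech / p_noise)
    # Scaling noise amplitude by g lowers noise power by 20*log10(g) dB.
    noise_gain = min(1.0, 10.0 ** ((current - target) / 20.0))
    output = [s + noise_gain * n for s, n in zip(speech, noise)]
    return output, noise_gain
```

With a user setting of 20 dB and an error-limited SNR of 12 dB, the sketch targets 12 dB, which reflects the idea that the mixer should not suppress noise more aggressively than the network's separation quality supports.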
[0250] In various embodiments, the operations discussed herein, for example with reference to the figures, may be implemented as hardware (e.g., logic circuits), software, firmware, or a combination thereof, which may be provided as a computer program product, for example one including a tangible (e.g., non-transitory) machine-readable or computer-readable medium storing instructions (or software procedures) used to program a computer to perform the processes discussed herein. The machine-readable medium may comprise a storage device such as those discussed with respect to the figures of this application.
[0251] Furthermore, such computer-readable media may be downloaded as a computer program product, in which case the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by data signals provided on a carrier or other propagation medium via a communication link (e.g., a bus, modem, or network connection).
[0252] Any reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, and / or characteristic described in connection with the embodiment may be included in at least one embodiment. The phrase “in one embodiment” appears in various places in this specification, and these appearances may or may not all refer to the same embodiment.
[0253] Furthermore, the terms “joined” and “connected” may be used in this specification and in the claims, along with their derivatives. In some embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Joined” may mean that two or more elements are in direct physical or electrical contact. However, “joined” may also mean that two or more elements cooperate or interact with each other even if they are not in direct contact with each other.
[0254] Thus, while embodiments have been described in language specific to structural features and / or methodological actions, it should be understood that the claimed subject matter may not be limited to the specific features or actions described. Rather, specific features and actions are disclosed as examples of implementing the claimed subject matter.
Claims
1. A device for improving an incoming audio signal, comprising: a control unit that receives the incoming audio signal and provides a control unit output signal; a neural network engine (NNE) circuit that applies a mask in the Short-Time Fourier Transform (STFT) domain to the control unit output signal to generate a plurality of intermediate signals corresponding to a speech class, individual speakers, and / or a noise class, and that outputs an NNE output signal, which is an improved digital signal, by recombining the plurality of intermediate signals; and a digital signal processing (DSP) circuit that receives the control unit output signal or the NNE output signal and generates a processed signal by applying per-band gain (amplification), dynamic range compression, and / or frequency adjustment, wherein the control unit determines the processing path of the control unit output signal based on a determination logic that takes into account at least one of user-defined characteristics, user-independent characteristics, characteristics of the incoming signal, and / or feedback of the NNE circuit, and wherein the processing path is either a first processing path that passes the control unit output signal to the DSP circuit, or a second processing path that passes the control unit output signal to the NNE circuit to generate the NNE output signal and passes the NNE output signal to the DSP circuit.
2. The apparatus according to claim 1, wherein the determination logic takes into account the user-defined characteristics.
3. The apparatus according to claim 2, wherein the user-defined characteristic includes the user's selection of an operating mode via an application on the user's smartphone.
4. The apparatus according to claim 2, wherein the user-defined characteristic includes the user's selection of an operating mode via an input on the apparatus.
5. The apparatus according to claim 1, wherein the NNE circuit is configured to improve continuously received signal samples of the incoming audio signal and output the NNE output signal, and the DSP circuit is configured to output the processed signal as a continuous audible signal based on the NNE output signal.
6. The apparatus according to claim 5, wherein at least one of the following applies: the continuous audible signal is generated within 32 milliseconds or less of reception of the incoming audio signal; the NNE circuit is configured to perform at least 1 billion operations per second; or the NNE circuit is configured to process the incoming audio signal with an associated power consumption of 2 milliwatts or less.
7. The apparatus according to claim 5, wherein the NNE circuit is configured to improve the incoming signal by estimating a complex mask for each received signal sample and obtaining the signal components of the incoming audio signal.
8. The apparatus according to claim 7, wherein the signal component is speech.
9. The apparatus according to claim 8, wherein the processed signal exhibits a reduced level of background noise, and the level of background noise is user-configurable.
10. The apparatus according to any one of claims 1 to 9, wherein the control unit determines the processing path of the output signal of the control unit in real time.
11. The apparatus according to claim 1, wherein the control unit, DSP circuit, and NNE circuit are integrated on a system-on-chip (SOC).
12. The apparatus according to any one of claims 1 to 11, wherein the control unit, DSP circuit, and NNE circuit are integrated into a hearing aid configured to be worn in a human ear.
13. The apparatus according to any one of claims 1 to 12, wherein the DSP circuit is configured to perform one or more of the following: dynamic range compression, amplification, and frequency adjustment.
14. A method for improving the quality of an incoming audio signal, comprising: receiving, by a control unit, the incoming audio signal and providing a control unit output signal; selecting a processing path of the control unit output signal based on a decision logic that considers at least one of user-defined characteristics, user-independent characteristics, characteristics of the incoming signal, and / or feedback from a neural network engine (NNE) circuit, wherein the processing path is either a first processing path that passes the control unit output signal to a digital signal processing (DSP) circuit, or a second processing path that passes the control unit output signal to the NNE circuit to generate an NNE output signal, which is an improved digital signal, and passes the NNE output signal to the DSP circuit; when the second processing path is selected, applying, by the NNE circuit, a mask in the Short-Time Fourier Transform (STFT) domain to the control unit output signal to generate a plurality of intermediate signals corresponding to a speech class, individual speakers, and / or noise classes, and outputting the NNE output signal by recombining the plurality of intermediate signals; and receiving, by the DSP circuit, the control unit output signal or the NNE output signal and generating a processed signal by applying per-band gain (amplification), dynamic range compression, and / or frequency adjustment.
15. A computer program comprising machine-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the method according to claim 14.
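The dual-path structure recited in claims 1 and 14 can be sketched for a single STFT frame as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the frame is assumed to be already in the STFT domain (a list of complex bins), the per-class masks are fixed values standing in for neural network outputs, the decision logic is a hypothetical rule combining a user mode with an estimated SNR threshold, and the DSP stage is reduced to a single flat gain. All function names, mode strings, and the 10 dB threshold are invented for illustration.

```python
def apply_masks(frame, masks):
    """Multiply one STFT frame by each per-class complex mask, yielding
    one intermediate signal per class (e.g. speech, noise)."""
    return {name: [m * x for m, x in zip(mask, frame)]
            for name, mask in masks.items()}


def recombine(intermediates, gains):
    """Weighted sum of the intermediate signals; missing gains default to 1."""
    n_bins = len(next(iter(intermediates.values())))
    out = [0j] * n_bins
    for name, bins in intermediates.items():
        g = gains.get(name, 1.0)
        out = [o + g * b for o, b in zip(out, bins)]
    return out


def select_path(user_mode, estimated_snr_db, snr_threshold_db=10.0):
    """Hypothetical decision logic: the user mode wins; in "auto" mode the
    NNE path is engaged only when the estimated SNR is below a threshold."""
    if user_mode == "nne_off":
        return "dsp_only"
    if user_mode == "nne_on":
        return "nne_then_dsp"
    return "nne_then_dsp" if estimated_snr_db < snr_threshold_db else "dsp_only"


def dsp(frame, flat_gain=2.0):
    """Stand-in for the DSP circuit: a flat gain instead of per-band gain,
    dynamic range compression, and frequency adjustment."""
    return [flat_gain * x for x in frame]


def process_frame(frame, masks, gains, user_mode, estimated_snr_db):
    """Route one frame through the first path (DSP only) or the second
    path (NNE masking and recombination, then DSP)."""
    path = select_path(user_mode, estimated_snr_db)
    if path == "nne_then_dsp":
        frame = recombine(apply_masks(frame, masks), gains)
    return dsp(frame), path
```

In this sketch both paths end in the same DSP stage, mirroring the claim language in which the DSP circuit receives either the control unit output signal or the NNE output signal.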