In-ear tactical communication and / or hearing protection system

The system uses a combination of ambient and vibration-based microphones processed by a trained AI/ML model to dynamically adjust microphone contributions, ensuring clear and robust voice communication across varying noise levels, addressing the limitations of existing systems.

WO2026139463A1 · PCT designated stage · Publication Date: 2026-07-02 · INVISIO COMM

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
INVISIO COMM
Filing Date
2025-12-22
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing communication systems fail to provide crystal-clear audio transmissions and optimal speech intelligibility across diverse acoustic environments, particularly in high-noise conditions, and often compromise natural voice characteristics.

Method used

A communication system using a combination of ambient and vibration-based microphones, processed by a trained artificial intelligence or machine learning model, dynamically adjusts the relative contributions of these microphones to enhance speech-relevant frequency bands and suppress noise, optimizing voice transmission across varying noise levels.

Benefits of technology

The system ensures consistently clear and robust voice communication, adapting to changing acoustic conditions, preserving naturalness in low-noise scenarios and maintaining intelligibility in high-noise environments, thereby enhancing user safety and operational effectiveness.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure EP2025088695_02072026_PF_FP_ABST
Patent Text Reader

Abstract

The present invention relates to a communication system configured to be used in a demanding environment, the communication system comprising at least one communication and hearing protection device (103, 303a, 303b), in particular at least one in-ear communication and hearing protection device (103, 303a, 303b), configured to be worn by a user (101), the communication and hearing protection device (103, 303a, 303b) comprising an ambient microphone (315, 315a, 315b) configured to provide an ambient microphone input signal (505a, 505b, 513), and a transmit (Tx) microphone (317, 317a, 317b) configured to provide a transmit microphone input signal (503a, 503b, 511) in response to registering speech from the user (101), one or more processors (205, 207) configured to implement and execute a trained artificial intelligence or machine learning method or component (515, 607) to generate a correction function or signal (517, 609) in response to the ambient microphone input signal (505a, 505b, 513) and the transmit microphone input signal (503a, 503b, 511).

Description

[0001] In-ear tactical communication and / or hearing protection system

[0002] TECHNICAL FIELD

[0003] The present invention relates in one aspect to an in-ear tactical hearing protection and / or communication system to be used in a demanding and high-noise environment where a trained artificial intelligence or machine learning method or component is implemented and executed. According to another aspect, the invention relates to a method of training an artificial intelligence or machine learning method or component to be executed by a device of a communication system.

[0004] BACKGROUND

[0005] Hearing loss is by far the most prevalent service-connected disability among veteran soldiers worldwide. Sensorineural hearing loss, caused by damage to the inner ear and auditory nerve, results in permanent, irreversible loss of hearing for the individual, leading to difficulty understanding speech, which can cause social isolation and a significant drop in quality of life. Conditions such as auditory processing disorder and hearing loss are often associated with blast exposure from gunfire and grenade explosions or other severe noise exposure, both short- and long-term.

[0006] Even though hearing loss may be helped through the use of hearing aids, it is desirable to preserve natural hearing abilities and prevent any external high noise exposures for individuals who are required to operate in demanding environments.

[0007] Hearing protection devices are generally known and used among soldiers, police forces, etc. for noise attenuation. Typically, passive hearing protection devices such as foam earplugs or earmuffs are used to physically block or damp sound waves, thereby protecting the user against harmful noise exposure; however, this type of hearing protection device blocks out all types of sound, which can be extremely problematic for many types of operations where it can be critical that the user is able to maintain situational awareness of the surroundings. To maintain situational awareness, active hearing protection devices may be used, relying on a combination of passive sound attenuation, speakers, and microphones, typically in combination with noise filtration or the like, for transmitting ambient sound into the ear canal of the user while maintaining the noise level below a predetermined limit.

[0008] Additionally, clear and undistorted communication is vital for military and public safety professionals operating in extreme and demanding environments. Clear and undistorted communication is very significant e.g. for soldiers, police, rescue personnel, fire fighters, and other task forces as it ensures or at least facilitates improved coordination among team members and enhances safety and security by reducing misunderstandings. Rapid and swift communication facilitates quick decision-making and response to changing situations. The ability to share observations and intelligence in real-time between relevant groups or individuals relies on clear and undistorted communication and enhances mission success and maximizes safety. Being able to transmit or receive the correct voice communication in a complex communication setup, independent of audio communication device types and under stressful circumstances, can make the difference between life and death. The ability to receive or transmit field intelligence is essential to coordinate any special operation successfully. Targeted communication between units, individuals, and / or central commands is essential and can make the difference between success or complete failure. The need for communication is high and the complexity of different operations is increasing as multiple different groups or individuals may join forces and work closely together to accomplish a mission. Forces such as police, rescue personnel, and firefighters may collaborate at an emergency site or different special forces or marine troopers from different countries or organizations may collaborate in a joint coalition force. Such constellations require complex and often dynamic communication steps, configurations, setups, etc. containing and / or involving multiple communication devices, communication channels etc. making unambiguous and reliable communication flow challenging. Multiple communication channels may be available e.g. with different levels of confidentiality in relation to a message classification level thereby making administration and control of communication links challenging.

[0009] Some modern tactical communication systems are designed to try to enhance speech intelligibility under challenging conditions such as high background noise, fluctuating sound levels, and unpredictable operational scenarios. Certain conventional solutions have introduced machine learning techniques to improve audio quality by transforming degraded speech signals into clearer outputs. These approaches typically rely on bone-conducted speech signals as the primary input during real-time operation, while using high-quality speech recordings as the target output during training. Although such methods can enhance speech clarity to some extent, they remain limited in their ability to adapt dynamically to diverse noise conditions and often fail to preserve natural voice characteristics when environmental noise levels vary significantly.

[0010] Other established solutions rely on conventional, logical signal processing techniques that employ traditional algorithms for signal mixing, noise reduction, and voice extraction, typically using fixed or adaptive filters and deterministic processing chains.

[0011] US20190230431 A1 describes a hearing protection system in which a processor receives input signals from external (ambient) microphones and ear canal (voice) microphones, applies noise reduction and hear-through processing, and generates output signals for communication and hearing protection. EP 3188508 A1 describes a hearing device that receives input signals from both an ambient microphone (for environmental sounds) and an in-ear or body-conducted microphone (for the user’s own voice). The device processes these signals to extract an improved voice signal, applying noise cancellation and filtering to suppress environmental noise.

[0012] It is therefore an objective of the present disclosure to overcome the above limitations at least in part by enabling crystal-clear audio transmissions and optimal speech intelligibility and noise suppression across a wide range of acoustic environments.

SUMMARY

[0013] One object of the present invention is to overcome at least some of the above-mentioned drawbacks and / or other disadvantages (at least to an extent), or at least to provide an alternative to existing solutions. According to a first aspect disclosed herein are embodiments of a communication system according to independent claim 1 with advantageous embodiments as defined by the dependent claims and as disclosed herein.

[0014] According to the above first aspect, disclosed herein are embodiments of a communication system configured to be used in a demanding environment as disclosed herein.

[0015] According to the first aspect, the present disclosure relates to a communication system for enhancing speech intelligibility and maintaining robust voice transmission in a demanding environment. A communication system may be understood as an arrangement of interconnected devices and components designed to enable the reliable exchange of audio signals, particularly voice, between users or between a user and external devices, often under challenging acoustic conditions. Speech intelligibility refers to the clarity and comprehensibility of spoken words as perceived by a listener or receiving device, while robust voice transmission denotes the consistent and reliable conveyance of speech signals even in the presence of adverse noise or interference.

[0016] In a preferred embodiment according to the first aspect, the system comprises at least one communication and hearing protection device, in particular at least one in-ear communication and hearing protection device, configured to be worn by a user. A communication and hearing protection device may be understood as a wearable apparatus that both facilitates audio communication and provides protection against harmful noise exposure. One advantage of this arrangement is that it enables hands-free, continuous communication while safeguarding the user’s hearing in environments with fluctuating or high noise levels.

[0017] The communication and hearing protection device comprises an ambient microphone configured to provide an ambient microphone input signal comprising, when the user is speaking, airborne acoustic components of the user’s voice and surrounding environmental sounds. An ambient microphone may be understood as a transducer that converts airborne sound waves, including both the user’s speech and environmental noise, into an electrical signal. One advantage of this arrangement is that the ambient microphone can capture high-fidelity speech signals in low-noise conditions, offering a broad frequency response and natural sound quality, which is particularly beneficial for clear communication when background noise is minimal.

[0018] The device further comprises a transmit microphone comprising a vibration-based transducer, mechanically / physically coupled to the user and configured to provide a transmit microphone input signal comprising, when the user is speaking, speech vibrations conducted through the user’s jawbone and / or tissue. A vibration-based transducer may be understood as a sensor, such as a bone conduction or accelerometer-based microphone, that detects mechanical vibrations generated by the user’s speech as they propagate through bone or tissue. One advantage of this arrangement is that the transmit microphone is inherently robust against external airborne noise and excels in isolating the user’s voice in loud surroundings, thereby ensuring reliable voice capture in high-noise environments where airborne microphones may be compromised. The system further comprises one or more processors configured to implement and execute a trained artificial intelligence or machine learning method or component, the processors being further configured to receive the ambient microphone input signal and the transmit microphone input signal (e.g., in response to the user speaking), generate, based on the received signals, a combined correction function or signal, and provide, using the correction function or signal, an improved output voice signal to be transmitted to one or more other users and / or one or more devices.

[0019] A processor may be understood as a hardware component capable of executing computational tasks, while a trained artificial intelligence or machine learning method or component refers to an implementation of a computational model that has been trained on selected representative data to perform tasks such as speech enhancement and noise suppression. A combined correction function or signal may be understood as a set of signal modification parameters or a processed signal generated by the machine learning model through the fusion of the ambient microphone input signal and the transmit microphone input signal. The term “combined” in connection with the correction function or signal may be understood as referring to the process by which the trained artificial intelligence or machine learning method or component extracts and integrates relevant information from both the ambient microphone input signal and the transmit microphone input signal, using learned data-driven fusion techniques to provide a single output. One advantage of this arrangement is that the system can leverage complex patterns in the input signals to generate a correction function or signal that selectively enhances speech-relevant frequency bands and suppresses noise, resulting in a superior, more robust voice signal for transmission.

[0020] In a preferred embodiment according to the first aspect, the combined correction function or signal is dynamically generated, by the trained artificial intelligence or machine learning method or component, by adjusting the relative contributions of the ambient and transmit microphone input signals according to the surrounding environmental sounds of the ambient input signal. The phrase “relative contributions” in this context refers to the proportion or weighting of the parts of the input signals that contain the user’s voice, and more specifically the airborne acoustic components of the user’s voice captured by the ambient microphone and the speech vibrations conducted through the user’s jawbone and / or tissue captured by the transmit microphone. The machine learning model dynamically analyses both input signals, extracts the relevant voice information from each, and adjusts how much each of these voice components influences the final output signal. This adaptive adjustment is performed in response to the detected surrounding environmental sounds, allowing the system to optimize the clarity and intelligibility of the user’s speech by emphasizing the most reliable voice source under current noise conditions, thereby fully exploiting the complementary characteristics of airborne and bone-conducted speech signals. By learning the statistical relationships between noise levels and signal quality during training, the trained artificial intelligence or machine learning method or component can infer optimal weighting strategies for real-time operation, enabling the adaptive or dynamic adjustment behaviour. One advantage of this arrangement is that the system can automatically and readily adapt to changing acoustic conditions in an optimised manner, leveraging the high-fidelity capture of the ambient microphone in quiet settings and the noise-robust capture of the vibration-sensitive microphone in loud settings.
This dynamic adaptation of the trained machine learning model ensures that the output voice signal is consistently optimized for intelligibility and robustness, overcoming the limitations inherent in using either microphone type alone and providing a substantial improvement in communication reliability and user safety, particularly in mission-critical or hazardous environments. Using a trained machine learning model together with both the ambient voice signal and the vibration voice signal offers a significant advancement compared to traditional signal processing approaches. Conventional algorithms and signal processing rely on fixed rules or manually designed features and often degrade sharply when exposed to real-world variability such as fluctuating noise, overlapping sounds, non-stationary signals, and diverse accents or speaking styles. In contrast, a machine learning model learns complex, nonlinear patterns directly from data, enabling it to model intricate relationships between airborne and bone-conducted speech components. This data-driven approach allows the system to handle variability and ambiguity, adapt dynamically to changing acoustic conditions, and generalize across environments when trained on diverse datasets. As a result, the trained machine learning model can intelligently fuse both input signals to produce a consistently clear and robust output voice signal, delivering superior performance in unpredictable, noisy, and mission-critical scenarios where traditional algorithms fail to maintain intelligibility and naturalness.

[0021] In one embodiment, the correction function or signal is dynamically generated such that in low-noise environments the contribution from the ambient input signal is increased, and in high-noise environments the contribution from the transmit microphone input signal is increased, thereby providing optimal speech intelligibility and noise suppression across varying ambient acoustic (e.g., demanding) environments. One advantage of this arrangement is that in low-noise scenarios, typically below 85 dB, the ambient microphone can dominate, ensuring natural and intelligible speech transmission, while in high-noise scenarios typically above 85 dB, the system can prioritize the vibration-based signal to maintain speech intelligibility and suppress environmental noise. This adaptive fusion results in a transmitted voice signal that is consistently clear, intelligible, and robust across a wide range of acoustic conditions, thereby overcoming the limitations inherent in using either microphone type alone while fully exploiting the complementary characteristics of airborne and bone-conducted speech signals. The technical effect is a substantial improvement in communication reliability and user safety, particularly in mission-critical or hazardous environments where clear voice transmission is essential.
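The noise-dependent behaviour described in this embodiment can be illustrated with a hand-written crossfade. This is only a sketch: in the disclosure the weighting is learned by the trained model rather than computed from a fixed rule. The names `mic_weights` and `fuse`, the logistic shape, and the `width_db` parameter are assumptions made for illustration; only the approximate 85 dB crossover is taken from the text.

```python
import math

def mic_weights(noise_db, crossover_db=85.0, width_db=5.0):
    """Return (ambient_weight, transmit_weight); the two weights sum to 1.

    Hypothetical illustration: a logistic crossfade centred on crossover_db.
    The disclosed system learns this behaviour from data instead.
    """
    transmit = 1.0 / (1.0 + math.exp(-(noise_db - crossover_db) / width_db))
    return 1.0 - transmit, transmit

def fuse(ambient_frame, transmit_frame, noise_db):
    """Weighted sample-by-sample mix of two equal-length frames."""
    w_amb, w_tx = mic_weights(noise_db)
    return [w_amb * a + w_tx * t for a, t in zip(ambient_frame, transmit_frame)]
```

With these assumed parameters the ambient microphone dominates well below the crossover and the vibration-based microphone dominates well above it, with a smooth transition in between rather than a hard switch.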

[0022] In one embodiment, the trained artificial intelligence or machine learning method or component has been trained on a dataset comprising paired data records, each data record including simultaneously recorded speech signals from both a vibration-sensitive transmit microphone and an ambient microphone, captured under a range of background noise conditions representative of demanding operational environments, and a corresponding reference voice signal, the reference voice signal being obtained using an external microphone positioned to capture high-quality airborne speech from the user 101. The training data reliably and efficiently enables the model to learn to adaptively weight the contributions of the two microphone signals in response to varying noise conditions in a particularly advantageous way. A dataset may be understood as a structured collection of data used to train a machine learning model, and paired data records refer to input-output pairs that allow the model to learn the relationship between noisy input signals and the desired output. One advantage of this arrangement is that the model learns the statistical relationships between noise levels and signal quality, enabling it to infer optimal weighting strategies for real-time operation. In low-noise environments, where airborne speech components are less corrupted, the contribution from the ambient microphone input is increased to preserve naturalness and full-bandwidth fidelity. Conversely, in high-noise environments, where airborne signals are heavily masked, the contribution from the vibration-sensitive microphone input is increased to maintain intelligibility and suppress environmental noise.
This dynamic adjustment ensures that the output voice signal remains optimized for clarity and robustness across a wide range of acoustic conditions, thereby achieving the technical effect of improved speech intelligibility and reliable communication in demanding operational environments. In other words, the specific training strategy enables the trained machine learning model to dynamically adjust the weighting of airborne and bone-conducted speech components in real time, preserving naturalness in low-noise conditions and maintaining intelligibility in high-noise conditions. The learned adaptive fusion mechanism ensures that the output voice signal remains consistently clear and reliable even when background noise, overlapping sources, or non-stationary signals would cause conventional algorithms to fail. By training on paired data records that include simultaneous inputs from an ambient microphone and a vibration-sensitive transmit microphone across diverse noise environments, the machine learning model achieves superior accuracy, robustness, and adaptability in real-world audio processing scenarios compared to deterministic signal processing methods, which rely on fixed rules and linear assumptions and therefore fail to maintain performance under dynamic and unpredictable acoustic conditions.
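The paired data records described in this embodiment could be represented roughly as follows. This is a sketch under stated assumptions: the class name `PairedRecord`, the `noise_db` metadata field, and the mean-squared-error objective are hypothetical, since the text does not specify a storage format or a loss function.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PairedRecord:
    transmit: List[float]   # vibration-sensitive transmit microphone (model input)
    ambient: List[float]    # ambient microphone, recorded simultaneously (model input)
    reference: List[float]  # external high-quality microphone (training target)
    noise_db: float         # background noise condition (hypothetical metadata)

    def __post_init__(self):
        # The three signals must be time-aligned for supervised training.
        if not (len(self.transmit) == len(self.ambient) == len(self.reference)):
            raise ValueError("signals in a record must have equal length")

def mse_loss(predicted, reference):
    """Assumed training objective: mean squared error against the reference."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)
```

A training loop would feed `transmit` and `ambient` to the model and compare its output with `reference`, drawing records from many `noise_db` conditions so that the weighting behaviour is learned across quiet and loud settings.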

[0023] In one embodiment, the correction function or signal is a processed signal being the improved output voice signal to be transmitted to one or more other users and / or one or more devices, directly generated by the trained artificial intelligence or machine learning method or component. A processed signal may be understood as an output signal that has been transformed by the machine learning model based on the received input signals, such that the resulting signal is immediately suitable for transmission without requiring further post-processing or application of a separate correction function. One advantage of this arrangement is that the machine learning model can directly synthesize an output voice signal that is already optimized for speech intelligibility and noise suppression, streamlining the signal processing chain and reducing latency. By generating the improved output voice signal in a single step, the system can more efficiently adapt to rapidly changing acoustic environments and deliver consistently high-quality voice transmission. This direct generation approach enables robust and reliable communication, particularly in mission-critical or hazardous environments where clear and immediate voice transmission is essential.

[0024] In one embodiment, the correction function or signal is a set of signal modification parameters that selectively enhances speech-relevant frequency bands and suppresses noise. In one embodiment, the step of providing, using the correction function or signal, an improved output voice signal to be transmitted to one or more other users and / or one or more devices comprises applying the correction function or signal to the ambient microphone input signal and / or the transmit microphone input signal. Signal modification parameters may be understood as values or coefficients, such as a gain vector, that are computed by the machine learning model and used to modify the spectral or temporal characteristics of one or both of the (raw) input signals in order to emphasize desired speech components and attenuate unwanted noise. One advantage of this arrangement is that it enables efficient, real-time processing of speech signals on resource-constrained edge devices, such as a push-to-talk control unit or in-ear device, which often have limited computational power and strict power consumption requirements. By configuring the machine learning model to output a computationally lighter correction function rather than a full, directly corrected audio signal, the system reduces the processing effort and power consumption required for real-time speech enhancement and noise filtration. This approach supports robust and reliable voice communication in demanding environments while maintaining low latency and high energy efficiency, which are essential for portable and battery-powered communication systems.

[0025] The communication system comprises at least one communication and hearing protection device, in particular at least one in-ear communication and hearing protection device, configured to be worn by a user. The communication and hearing protection device comprises an ambient microphone configured to provide an ambient microphone input signal and a transmit (Tx) microphone configured to provide a transmit microphone input signal in response to registering speech from the user. The label ‘transmit’ refers to the fact that, at least in some embodiments, the transmit microphone input signal (or rather a processed form thereof, processed according to the first aspect and embodiments thereof) is intended to be transmitted, e.g. as a communication (Tx) signal via a radio or other communication device.

[0026] The communication system further comprises one or more processors configured to implement and execute a trained artificial intelligence or machine learning method or component to generate a correction function or signal in response to the ambient microphone input signal and the transmit microphone input signal. The generated correction function or signal enables a corrected or improved voice signal.

[0027] In some embodiments, only a correction function is generated. In some alternative embodiments, only a correction signal is generated.

[0028] In some embodiments, the correction function or signal enables a corrected or improved voice signal and / or the one or more processors is / are further configured to apply the correction function or signal to the transmit microphone input signal thereby providing a corrected or improved voice signal. The corrected and / or improved part may relate to both speech optimisation (i.e. a clearer and more understandable voice signal) and noise suppression / removal specifically removing noise contributions from the demanding environment.

[0029] In some embodiments, the correction function or signal comprises a set of parameters that are applied to the transmit microphone input signal and / or the ambient microphone input signal to generate a corrected or modified voice signal.

[0030] In one embodiment, the correction function or signal is generated based on the measured background noise level. In some embodiments, the corrected or improved voice signal is provided as a signal for transmission, e.g. to be received by one or more communication devices, e.g. by one or more radios.

[0031] In some embodiments, the transmit (Tx) microphone is a vibration-based transducer, e.g. an accelerometer type vibration sensitive transducer such as a digital voice pick up (VPU) type microphone, acoustically coupled with the user, when the user is wearing the communication and hearing protection device, where the vibration-based transducer is configured to obtain a voice signal of the user in response to vibrations caused by the user speaking and provide the transmit microphone input signal in response thereto.

[0032] In some embodiments, the ambient microphone is a microphone configured to register an airborne ambient acoustic signal of a demanding environment and provide the ambient microphone input signal in response thereto.

[0033] In some embodiments,

[0034] the ambient microphone input signal represents an airborne acoustic signal, and the transmit microphone input signal represents a vibration signal.

[0035] In some embodiments, the generated correction function or signal is or comprises data representing a gain vector or similar, wherein the gain vector or similar comprises a number of predicted adjustment values, predicted by the trained artificial intelligence or machine learning method or component, for different segments or parts of the transmit microphone input signal or a processed version thereof, where an adjustment value for a particular segment or part specifies whether the transmit microphone input signal or a processed version thereof in the particular segment or part should be kept or should be increased or decreased and to what extent.

[0036] In some embodiments, the predicted adjustment values are each limited or clipped to be within predetermined maximum and / or minimum thresholds, wherein the predetermined maximum and / or minimum thresholds are set in response to a derived or estimated sound pressure value representing or estimating an external background noise level, wherein the predetermined maximum and / or minimum thresholds are set to relatively high threshold(s) in case the derived or estimated sound pressure value indicates no or little background noise and are set to relatively low threshold(s) in case the derived or estimated sound pressure value indicates a relatively high background noise.
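The gain vector and its noise-dependent clipping can be sketched as below. All numeric values, the linear interpolation between a quiet and a loud setting, and the function names `clip_limits` and `apply_gain_vector` are assumptions for illustration; the text only states that the thresholds are relatively high in low noise and relatively low in high noise. A gain of 1.0 is taken here to mean "keep", values above 1.0 "increase", and values below 1.0 "decrease".

```python
def clip_limits(noise_db, quiet_db=60.0, loud_db=100.0,
                quiet_limits=(0.5, 4.0), loud_limits=(0.0, 1.5)):
    """Interpolate (min_gain, max_gain) between a quiet and a loud setting.

    Hypothetical values: higher thresholds in quiet, lower thresholds in loud.
    """
    t = min(1.0, max(0.0, (noise_db - quiet_db) / (loud_db - quiet_db)))
    lo = quiet_limits[0] + t * (loud_limits[0] - quiet_limits[0])
    hi = quiet_limits[1] + t * (loud_limits[1] - quiet_limits[1])
    return lo, hi

def apply_gain_vector(tx_magnitudes, gains, noise_db):
    """Clip each predicted gain, then scale the matching transmit segment."""
    lo, hi = clip_limits(noise_db)
    return [m * min(hi, max(lo, g)) for m, g in zip(tx_magnitudes, gains)]
```

Clipping bounds how far any single prediction can boost or attenuate a segment, which limits the audible effect of an outlying model output.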

[0037] In some further embodiments, the external background noise level is represented by, or measured via, the ambient microphone input signal.

[0038] In some embodiments, the communication system is configured to: apply a processed version of the generated correction function or signal to the transmit microphone input signal thereby filtering or reducing noise from the transmit microphone input signal and providing the corrected or improved voice signal.

[0039] In some embodiments, the transmit (Tx) microphone is digital, and the communication system or the communication and hearing protection device comprises a dedicated direct digital-to-analog converter (DAC) circuitry coupled (e.g. or preferably directly) to the transmit (Tx) microphone and configured to perform lossless front-end digital to analog signal conversion. This is advantageous since no additional noise is generated / introduced in this manner, nor will the signal be degraded as a consequence of an additional “signal path” and / or additional logical operation(s) by other circuits and components.

[0040] In some embodiments, the transmit (Tx) microphone is configured to output a digital Pulse Density Modulation (PDM) signal representing an obtained voice signal of the user, and wherein the dedicated direct digital-to-analog converter (DAC) circuitry is configured to receive the digital Pulse Density Modulation (PDM) signal and to convert it into an analog signal using a D-FlipFlop and an active lowpass filter, preferably applying a fourth order Bessel function, comprised by the dedicated direct digital-to-analog converter (DAC) circuitry.
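The principle behind recovering an analog-like waveform from a PDM bitstream by lowpass filtering can be simulated in software. This is a rough sketch, not the disclosed circuit: a first-order sigma-delta modulator is assumed as the PDM source, a simple moving average stands in for the active fourth-order Bessel lowpass, and the re-clocking role of the D flip-flop is not modelled. The names `pdm_encode`, `pdm_decode` and `window` are hypothetical.

```python
def pdm_encode(samples):
    """First-order sigma-delta modulator: floats in [-1, 1] -> list of 0/1 bits."""
    bits, acc = [], 0.0
    for x in samples:
        acc += x
        if acc >= 0.0:
            bits.append(1)   # emit +1 and subtract it from the accumulated error
            acc -= 1.0
        else:
            bits.append(0)   # emit -1 and subtract it from the accumulated error
            acc += 1.0
    return bits

def pdm_decode(bits, window=64):
    """Moving-average lowpass over the +/-1 bit levels (stand-in for the Bessel filter)."""
    levels = [1.0 if b else -1.0 for b in bits]
    out, running = [], 0.0
    for i, v in enumerate(levels):
        running += v
        if i >= window:
            running -= levels[i - window]
        out.append(running / min(i + 1, window))
    return out
```

The density of ones in the bitstream tracks the input amplitude, so averaging over a window much longer than one bit period recovers the slowly varying voice signal.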

[0041] In some embodiments, the one or more processors is further configured

[0042] to process the transmit microphone input signal to generate a transmit magnitude component thereof, or

[0043] to process the transmit microphone input signal to generate a further processed transmit magnitude component thereof,

[0044] and to provide the generated transmit magnitude component or the generated further processed transmit magnitude component as input to the trained artificial intelligence or machine learning method or component to generate the correction function or signal.

[0045] In some embodiments, the one or more processors is further configured

[0046] to process the ambient microphone input signal to generate an ambient magnitude component thereof, or

[0047] to process the ambient microphone input signal to generate a further processed ambient magnitude component thereof,

[0048] and to provide the generated ambient magnitude component or the generated further processed ambient magnitude component as input to the trained artificial intelligence or machine learning method or component to generate the correction function or signal.
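
A magnitude component of a microphone input signal can be obtained in several ways; the passage above does not fix one. The sketch below assumes a Hann-windowed discrete Fourier transform of a single frame and keeps only the magnitudes of the non-negative frequency bins. The function name and the windowing choice are assumptions for illustration.

```python
import cmath
import math


def magnitude_component(frame):
    """Return the magnitude spectrum (bins 0..n/2) of one Hann-windowed frame."""
    n = len(frame)
    # Periodic Hann window (an assumed choice) to reduce spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        mags.append(abs(acc))               # discard the phase, keep the magnitude
    return mags
```

The same function could be applied to a frame of the transmit microphone input signal and to a frame of the ambient microphone input signal before the results are passed to the trained component. A direct DFT is used here only for self-containment; a practical implementation would normally use an FFT.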

[0049] In some embodiments, the one or more processors is further configured to process the generated ambient magnitude component to derive a derived or estimated sound pressure value thereof, the derived or estimated sound pressure value representing or estimating a background noise level,

[0050] to adjust the correction function or signal, prior to the correction function or signal being applied to the transmit microphone input signal, in response to the derived or estimated sound pressure value resulting in an adjusted correction function or signal, and

[0051] applying the adjusted correction function or signal to the transmit microphone input signal instead of the correction function or signal in order to provide the corrected or improved voice signal.

[0052] In some further embodiments, the one or more processors is / are further configured

[0053] to, instead of applying the correction function or signal to the transmit microphone input signal, perform noise filtration of the generated further processed transmit magnitude component in response to (or taking into account) the adjusted correction function or signal resulting in a noise corrected further processed transmit magnitude component, and

[0054] providing the corrected or improved voice signal in response to the noise corrected further processed transmit magnitude component.

[0055] In some embodiments, the communication system comprises a push-to-talk control unit comprising one or more, e.g. all, of the one or more processors. In some alternative embodiments, the in-ear communication and hearing protection device comprises one or more, e.g. all, of the one or more processors. In some further embodiments, the one or more processors, or their functionality, is shared between the push-to-talk (PTT) control unit and the in-ear communication and hearing protection device, and / or e.g. shared with one or more other devices (of the communication system).

[0056] In some embodiments, some or all of the processing functionality of the one or more processors is in another device (than the PTT and the in-ear communication and hearing protection device).

[0057] In some embodiments, the communication system comprises one or more of

[0058] a wireless remote PTT device,

[0059] one or more communication devices,

[0060] one or more radios,

[0061] a radio of a first type and a radio of a second type, and

[0062] one or more end-user-devices (EUDs).

[0063] In some embodiments, the computer program or routine implementing the trained artificial intelligence or machine learning method or component has been trained according to the training method (701) according to the second aspect as disclosed herein.

[0064] In some embodiments, the communication system comprises two communication and hearing protection devices, each corresponding to the communication and hearing protection device as described above, where a first communication and hearing protection device is configured to be inserted in the right ear of the user and a second communication and hearing protection device is configured to be inserted in the left ear of the user, wherein

[0065] the first communication and hearing protection device is configured to

[0066] provide a first ambient microphone input signal by a first ambient microphone,

[0067] a first transmit microphone input signal by a first transmit (Tx) microphone,

[0068] the second communication and hearing protection device is configured to

[0069] provide a second ambient microphone input signal by a second ambient microphone, a second transmit microphone input signal by a second transmit (Tx) microphone, and wherein the

[0070] one or more processors is / are configured to implement and execute the trained artificial intelligence or machine learning method or component to generate the correction function or signal in response to a combination of the first and the second ambient microphone input signals and a combination of the first and the second transmit microphone input signals, wherein the one or more processors is / are configured to apply the correction function or signal to the combined transmit microphone input signal in order to provide the corrected or improved voice signal.

According to a second aspect disclosed herein are embodiments of a method according to independent claim 18 with advantageous embodiments as defined by the dependent claims and as disclosed herein.

[0071] According to the above second aspect, disclosed herein are embodiments of a method of training an artificial intelligence or machine learning method or component to be executed by at least one device of a communication system, as disclosed herein.

[0072] The artificial intelligence or machine learning method or component is configured to generate real-time processing of a user’s voice signal in a demanding environment, where the method comprises

[0073] a) obtaining first data representing a reference speech signal including a speech signal of a user obtained in accordance with a first way,

[0074] b) obtaining second data representing a training transmit (Tx) signal including the speech signal obtained in accordance with a second way,

[0075] c) obtaining third data representing a training ambient signal including the speech signal obtained in accordance with a third way,

[0076] d) providing the second data and the third data to the artificial intelligence or machine learning method or component generating a predicted output in response thereto,

e) comparing the predicted output and the first data and determining a difference therebetween, and

[0077] f) updating parameters of the artificial intelligence or machine learning method or component in response to the determined difference,

[0078] wherein the method further comprises repeating steps a) - f) for new first, second, and third data a plurality of times, typically a large number of times, until the determined difference between the predicted output and the first data is within a predetermined threshold or the determined difference no longer improves sufficiently from cycle to cycle.
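
Steps a) to f) and the two stopping conditions can be sketched as a small training loop. This is a toy under stated assumptions, not the disclosed training method: a two-parameter linear model stands in for the artificial intelligence or machine learning component, the difference is taken to be a mean squared error, and the names `train`, `make_toy_batch`, `patience`, and all numeric defaults are hypothetical.

```python
import random


def make_toy_batch(rng, n=64):
    """Hypothetical data source returning (first, second, third) data."""
    tx = [rng.uniform(-1, 1) for _ in range(n)]
    amb = [rng.uniform(-1, 1) for _ in range(n)]
    ref = [0.7 * t + 0.3 * a for t, a in zip(tx, amb)]  # toy reference signal
    return ref, tx, amb


def train(make_batch, lr=0.5, loss_threshold=1e-6,
          min_improvement=1e-9, patience=5, max_cycles=10_000):
    """Repeat steps a) to f) until the loss is small or stops improving."""
    w_tx, w_amb = 0.0, 0.0             # parameters of the stand-in model
    best_loss, stale = float("inf"), 0
    for cycle in range(1, max_cycles + 1):
        ref, tx, amb = make_batch()                              # steps a) to c)
        pred = [w_tx * t + w_amb * a for t, a in zip(tx, amb)]   # step d)
        err = [p - r for p, r in zip(pred, ref)]                 # step e)
        loss = sum(e * e for e in err) / len(err)
        if loss < loss_threshold:
            break                      # difference within the threshold
        if best_loss - loss < min_improvement:
            stale += 1
            if stale >= patience:
                break                  # no sufficient improvement any more
        else:
            best_loss, stale = loss, 0
        n = len(err)                   # step f): gradient step on the loss
        w_tx -= lr * 2 * sum(e * t for e, t in zip(err, tx)) / n
        w_amb -= lr * 2 * sum(e * a for e, a in zip(err, amb)) / n
    return (w_tx, w_amb), cycle, loss
```

A call such as `train(lambda: make_toy_batch(random.Random(0)))` would draw a fresh batch each cycle. The `patience` counter is one possible reading of "no longer improves sufficiently"; it tolerates a few non-improving cycles before stopping.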

[0079] In some embodiments,

[0080] the first way comprises providing an acoustic signal by a microphone or transducer of a first type being a high quality professional stationary voice recording microphone or transducer, the second way comprises providing an acoustic signal by a microphone or transducer of a second type being a vibration pick-up sensor, and / or

[0081] the third way comprises providing an acoustic signal by a microphone or transducer of a third type being an ambient microphone.

[0082] In some embodiments,

[0083] the second data, representing a training transmit (Tx) signal including the speech signal, further includes user-generated (such as natural or involuntary) noise e.g. or preferably obtained in accordance with the second way, and / or

[0084] the third data, representing a training ambient signal including the speech signal, further includes user-generated noise e.g. or preferably obtained in accordance with the third way.

[0085] In some embodiments,

[0086] the second data, representing a training transmit (Tx) signal including the speech signal, further includes noise being representative of loud noises of a demanding environment, the third data, representing a training ambient signal including the speech signal, further includes noise being representative of loud noises of a demanding environment, and the first data, representing a reference speech signal including the speech signal, does not include noise being representative of loud noises of a demanding environment.

[0087] In some embodiments,

[0088] the noise being representative of loud noises of a demanding environment of the second data has been obtained in accordance with the second way, and / or the noise being representative of loud noises of a demanding environment of the third data has been obtained in accordance with the third way.

[0089] In some embodiments, the noise being representative of loud noises of a demanding environment of the second data and / or the noise being representative of loud noises of a demanding environment of the third data have been obtained at the same time and / or have been obtained when the user is not speaking.

[0090] In some embodiments, the method further comprises

[0091] processing the second data to generate a transmit magnitude component thereof, and providing the generated transmit magnitude component to the artificial intelligence or machine learning method or component instead of providing the second data in step d), and / or processing the third data to generate an ambient magnitude component thereof and providing the generated ambient magnitude component to the artificial intelligence or machine learning method or component instead of providing the third data in step d).

[0092] Further embodiments of the method according to the second aspect correspond to further embodiments of the system according to the first aspect and have the same advantages for the same reasons.

According to a third aspect, the present disclosure relates to a method for enhancing speech intelligibility and maintaining robust voice transmission in a tactical communication system when the user is speaking. A method herein may be understood as a sequence of steps or actions carried out by one or more devices or systems to achieve a technical result. Speech intelligibility refers to the clarity and comprehensibility of spoken words as perceived by a listener or a receiving device. Robust voice transmission denotes the reliable conveyance of speech signals, even in the presence of adverse acoustic conditions, such as high background noise or rapidly changing noise or sound environments. A tactical communication system may be understood as a set of devices and protocols designed to enable secure and effective voice communication for users operating in challenging or mission-critical environments, such as military, law enforcement, or emergency response scenarios.

[0093] The method comprises receiving, from a vibration-sensitive transmit microphone, a first signal representing speech vibrations conducted through the user’s jawbone and / or tissue. A vibration-sensitive transmit microphone, such as a bone conduction or accelerometer-based transducer, may be understood as a sensor that detects mechanical vibrations generated by the user’s speech as they propagate through bone or tissue, rather than through the air. One advantage of this arrangement is that the vibration-sensitive microphone is inherently robust against external airborne noise and excels in isolating the user’s voice in loud surroundings, thereby ensuring reliable voice capture in high-noise environments where airborne microphones may be compromised.

[0094] The method further comprises receiving, from an ambient microphone, a second signal representing airborne acoustic components of the user’s voice and surrounding environmental sounds. An ambient microphone may be understood as a transducer that converts airborne sound waves, including both the user’s speech and environmental noise, into an electrical signal. One advantage of this arrangement is that the ambient microphone can capture high-fidelity speech signals in low-noise conditions, providing a broad frequency response and natural sound quality, which is particularly beneficial for clear communication when background noise is minimal.

[0095] The method further comprises generating feature-extracted representations of the first signal and the second signal. Feature extraction may be understood as the process of transforming raw audio signals into a set of parameters or descriptors that are more suitable for subsequent processing, such as spectral features, temporal characteristics, or other signal attributes. One advantage of this arrangement is that feature extraction enables more efficient and effective processing by a machine learning model, facilitating the identification and enhancement of speech-relevant components while suppressing noise. The method further comprises receiving, by a trained machine learning model implemented on a processing unit, the feature-extracted representations of the first signal and the second signal as input. A trained machine learning model may be understood as an implementation of a computational model, such as a neural network, that has been trained on representative data to perform a specific task, in this case, speech enhancement and noise suppression. A processing unit may be understood as a hardware component, such as a microprocessor or digital signal processor, capable of executing the model. One advantage of this arrangement is that the machine learning model can leverage complex patterns in the input signals to generate an improved output voice signal, adapting to a wide range of acoustic environments and user conditions.

[0096] In a preferred embodiment according to the third aspect, the method further comprises dynamically generating, by the trained machine learning model, an improved output voice signal, to be transmitted to other users or devices, being a combined signal from both the first signal and second signal, enhancing speech-relevant frequency bands and suppressing noise. One advantage of this arrangement is that the system can intelligently combine the strengths of both the ambient and vibration-sensitive microphones, ensuring that the transmitted voice signal remains clear and intelligible across varying noise conditions. By enhancing speech-relevant frequency bands and suppressing noise, the method improves the reliability and quality of voice communication, which is essential for user safety and operational effectiveness in demanding environments.

[0097] In a preferred embodiment according to the third aspect, the trained machine learning model dynamically adjusts the relative contributions of the first and second signals according to the background noise level derived from the second signal, such that in low-noise environments (e.g., below 85 dB) the contribution from the second signal is increased, and in high-noise environments (e.g., above 85 dB) the contribution from the first signal is increased, thereby maintaining optimal speech intelligibility and noise suppression across varying acoustic environments. One advantage of this arrangement is that the system can automatically adapt to changing acoustic conditions, leveraging the high-fidelity capture of the ambient microphone in quiet settings and the noise-robust capture of the vibration-sensitive microphone in loud settings. This dynamic adaptation ensures that the output voice signal is consistently optimized for intelligibility and robustness, overcoming the limitations inherent in using either microphone type alone and providing a substantial improvement in communication reliability for users operating in mission-critical or hazardous environments.
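
The noise-dependent weighting described above is performed by the trained machine learning model. To make the intended behaviour concrete, the sketch below substitutes a hand-written crossfade for the model. The crossover at 85 follows the example threshold in the preceding paragraph; the 10 dB transition width, the linear ramp, and the function names are assumptions.

```python
def ambient_weight(noise_db, crossover_db=85.0, width_db=10.0):
    """Weight of the second (ambient) signal: 1.0 when quiet, 0.0 when loud."""
    t = (noise_db - (crossover_db - width_db / 2)) / width_db
    t = min(1.0, max(0.0, t))          # linear ramp across the transition band
    return 1.0 - t


def blend(tx_frame, ambient_frame, noise_db):
    """Combine the first (Tx) and second (ambient) signals sample by sample."""
    w_amb = ambient_weight(noise_db)
    w_tx = 1.0 - w_amb                 # the Tx share grows as the noise rises
    return [w_tx * t + w_amb * a for t, a in zip(tx_frame, ambient_frame)]
```

Below the transition band the output equals the ambient signal, above it the output equals the vibration-sensitive signal, and at the crossover both contribute equally. A trained model could learn a frequency-dependent and far less rigid version of this mapping.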

[0098] In another preferred embodiment according to the third aspect, the method further comprises dynamically generating, by the trained machine learning model, a correction function, the correction function being signal modification parameters configured to selectively enhance speech-relevant frequency bands and suppress noise. One advantage of this arrangement is that generating a correction function provides an interpretable intermediate, such as a gain vector per frequency bin, which can be inspected, constrained, or post-processed. This arrangement allows the correction function to be dynamically adjusted or blended with traditional signal processing methods, enabling hybrid or fallback modes if the machine learning model is uncertain or encounters out-of-distribution data.

[0099] In some embodiments according to the third aspect, the method further comprises subjecting the generated correction function to an output postprocessing step that performs a signal conditioning operation configured to modify the generated correction function, thereby providing a modified correction function based on predefined criteria, the predefined criteria comprising a predetermined threshold being determined as a function of a background noise level derived from the second signal. One advantage of this arrangement is that the output remains within expected bounds, for example by clipping or thresholding the gain vector, reducing the risk of artifacts or instability that can occur with direct signal generation. This two-step approach can be optimized for low-latency operation, as the correction function can be computed with lower resolution than the full output signal, and the signal modification step can be highly efficient.

[0100] In some embodiments according to the third aspect, the method further comprises providing an improved output voice signal, to be transmitted to other users and / or devices, by applying the modified correction function to the first signal recorded by the vibration-sensitive microphone and / or the second signal recorded by the ambient microphone. One advantage of this arrangement is that the correction component is often lower-dimensional than the full output signal, reducing computational load and memory requirements, which is especially important for real-time, embedded, or battery-powered devices. This arrangement enables reliable, real-time enhancement of voice transmission quality in demanding environments, while maintaining system efficiency, adaptability, and user safety.
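
Applying a correction function that was computed at a lower resolution than the signal can be sketched as follows. The sketch assumes that the correction function is a vector of per-band gains and that the signal is represented by per-bin magnitudes; the nearest-band expansion and the function names are illustrative assumptions.

```python
def expand_gains(band_gains, n_bins):
    """Map a coarse per-band gain vector onto n_bins frequency bins."""
    n_bands = len(band_gains)
    return [band_gains[min(n_bands - 1, k * n_bands // n_bins)]
            for k in range(n_bins)]


def apply_correction(magnitudes, band_gains):
    """Scale each magnitude bin by the gain of the band it falls into."""
    gains = expand_gains(band_gains, len(magnitudes))
    return [m * g for m, g in zip(magnitudes, gains)]
```

For instance, two band gains applied to four bins scale the lower two bins by the first gain and the upper two bins by the second. The model only has to predict as many values as there are bands, which is the source of the reduced computational load noted above.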

[0101] In one embodiment, the trained machine learning model has been trained on a dataset comprising paired data records using supervised learning, each data record including speech signals obtained simultaneously from a vibration-sensitive transmit microphone and an ambient microphone under varying noise conditions representative of demanding operational environments. Supervised learning may be understood as a machine learning paradigm in which a model is trained on input-output pairs, allowing the model to learn a mapping from input data to a desired output. A dataset may be understood as a structured collection of data, in this case comprising paired records of microphone signals and reference signals. A vibration-sensitive transmit microphone may be understood as a sensor that detects mechanical vibrations generated by the user’s speech as they propagate through bone or tissue, while an ambient microphone may be understood as a transducer that converts airborne sound waves, including both the user’s speech and environmental noise, into an electrical signal. One advantage of this arrangement is that the model is exposed to the complementary characteristics of both vibration-sensitive and ambient microphone signals in realistic, noisy environments (e.g., including environmental noise, involuntary user sounds, and equipment-induced vibrations), enabling it to learn how each signal type responds to different noise conditions and to develop optimal strategies for combining them. By training on simultaneous recordings from both microphones, the model can generalize more effectively to real-world operational scenarios, improving its robustness and reliability in tactical communication systems.

[0102] The paired data records further include a corresponding reference voice signal, the reference voice signal being obtained using an external microphone positioned to capture high-quality airborne speech from the user. An external microphone may be understood as a microphone placed in a position to record the user’s speech in optimal conditions, typically outside the influence of the device’s own noise environment. One advantage of this arrangement is that the use of a high-quality external microphone as the reference voice signal ensures that the model is trained to produce output that closely matches the best possible speech intelligibility and naturalness, as perceived in ideal conditions. This enables the machine learning model to generate output signals that are optimized for clarity and intelligibility, even when the input signals are degraded by noise or other environmental factors. By leveraging this training strategy, the method supports the development of communication systems that maintain high-quality voice transmission and intelligibility across a wide range of demanding and unpredictable environments.

[0103] Further embodiments of the method according to the third aspect correspond to further embodiments of the method according to the second aspect and have the same advantages for the same reasons.

According to a fourth aspect, the present disclosure relates to a computer-implemented method of generating a training dataset for a machine learning model, particularly the machine learning model according to the first aspect. A computer-implemented method of generating a training dataset may be understood as a sequence of steps executed by a computing device to create structured data suitable for training a model that learns to perform a specific task, such as speech enhancement or noise suppression. A machine learning model may be understood as a computational system, such as a neural network, that is trained to map input data to desired outputs by learning from examples. A training dataset may be understood as a collection of input-output pairs or records used to teach the model how to perform its task.

[0104] In a preferred embodiment according to the fourth aspect, the method comprises obtaining noise data by recording simultaneous output signals from the vibration-sensitive microphone and the ambient microphone while in a mute mode (i.e., the user is not speaking) under varying noise conditions representative of demanding operational environments. A vibration-sensitive microphone may be understood as a sensor that detects mechanical vibrations, such as those caused by equipment or user movement, while an ambient microphone may be understood as a transducer that captures airborne sounds from the environment. One advantage of this arrangement is that it enables the collection of realistic noise profiles from both microphones, reflecting the types of environmental noise, involuntary user sounds, and equipment-induced vibrations encountered in operational scenarios. By recording in mute mode (i.e., the user is not speaking), the method ensures that the captured signals represent only noise, without contamination from speech, which is significant for generating accurate training data. The method further comprises obtaining speech data by recording simultaneous output signals from the vibration-sensitive microphone, the ambient microphone, and an external microphone positioned to capture high-quality airborne speech from the user, while the user is speaking in a silent background environment. An external microphone may be understood as a device placed to record the user’s speech in optimal conditions, free from environmental noise. One advantage of this arrangement is that it provides clean speech signals from all relevant microphones, allowing for the creation of ground truth reference data that is optimized for speech intelligibility and suitable for use as a target in supervised learning.

[0105] The method further comprises generating paired data records by mixing the obtained speech data from the vibration-sensitive and ambient microphones with the corresponding noise data from the respective microphones, thereby simulating speech signals under varying noise conditions representative of demanding operational environments. One advantage of this arrangement is that it allows for precise control over the signal-to-noise ratio in the mixed signals, enabling the training dataset to cover a wide range of operational noise conditions. By generating paired data records in this way, the method ensures that the model is exposed to realistic combinations of speech and noise, improving its ability to generalize to real-world scenarios.

[0106] The method further comprises associating each paired data record with a reference voice signal, the reference voice signal being the high-quality airborne speech from the user as captured by the external microphone. One advantage of this arrangement is that it provides a reliable target for supervised learning, enabling the model to learn to produce output signals that closely match the best possible speech intelligibility for radio transmission purposes. By structuring the training data in this way, the method supports the development of machine learning models that are robust, adaptable, and capable of delivering high-quality voice communication in demanding and unpredictable environments.
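
The mixing and association steps can be sketched as below. The sketch makes several assumptions that are not stated in the disclosure: signals are plain lists of samples of equal length, the signal-to-noise ratio is defined on the ambient microphone, and the same noise scale factor is applied to both microphones so that the relative noise level between them, as recorded simultaneously, is preserved. The names `PairedRecord`, `noise_scale`, and `make_record` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PairedRecord:
    tx_mix: list        # vibration-sensitive microphone: speech plus noise
    ambient_mix: list   # ambient microphone: speech plus noise
    reference: list     # external-microphone speech used as the target
    snr_db: float       # signal-to-noise ratio used for this record


def power(x):
    """Mean power of a sample sequence."""
    return sum(v * v for v in x) / len(x)


def noise_scale(speech, noise, snr_db):
    """Factor that scales `noise` so that speech / noise power matches snr_db."""
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise recording is silent")
    return math.sqrt(power(speech) / (p_noise * 10 ** (snr_db / 10)))


def make_record(tx_speech, amb_speech, tx_noise, amb_noise, reference, snr_db):
    """Mix clean speech with mute-mode noise and attach the reference signal."""
    scale = noise_scale(amb_speech, amb_noise, snr_db)  # defined on ambient mic
    tx_mix = [s + scale * n for s, n in zip(tx_speech, tx_noise)]
    amb_mix = [s + scale * n for s, n in zip(amb_speech, amb_noise)]
    return PairedRecord(tx_mix, amb_mix, list(reference), snr_db)
```

Calling `make_record` repeatedly with different `snr_db` values and different noise recordings would yield a dataset covering a range of noise conditions, each record carrying the same clean reference as its supervised target.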

[0107] Further embodiments of the method according to the fourth aspect correspond to further embodiments of the method according to the third aspect and have the same advantages for the same reasons.

According to a fifth aspect, the present disclosure relates to a data processing system comprising one or more processors configured to carry out the method according to the third aspect and / or fourth aspect.

[0108] BRIEF DESCRIPTION OF THE DRAWINGS

[0109] Embodiments of the invention will now be described in more detail. Various embodiments of the systems and / or the methods according to the different aspects as disclosed herein will be described in connection with the appended drawings, in which:

FIG. 1 schematically illustrates an example of a user wearing an exemplary personal in-ear tactical hearing protection and communication system configured to be worn by a person in communication with other connected devices, people, or both;

[0110] FIG. 2A schematically illustrates an example of a PTT control unit;

[0111] FIG. 2B schematically illustrates an example of a PTT-button allocation scheme;

[0112] FIG. 2C schematically illustrates an example of a configuration of communication devices connected to a PTT control unit, such as the one of FIG. 2A, and a corresponding PTT button allocation scheme;

FIG. 2D schematically illustrates an example of a modified configuration of communication devices connected to a PTT control unit and a corresponding PTT button allocation scheme;

[0113] FIG. 3A schematically illustrates an example of an improved in-ear tactical communication and hearing protection system comprising an in-ear tactical communication and hearing protecting headset connected to a PTT control unit via a cable and a connector;

[0114] FIG. 3B schematically illustrates an example of a right (R) earpiece of the in-ear tactical communication and hearing protection headset of FIG. 3A when fitted properly inside the right ear-canal of a user;

FIG. 3C schematically illustrates an example of an exploded view of the right (R) earpiece of the in-ear tactical communication and hearing protection headset of FIG. 3A;

[0115] FIG. 3D schematically illustrates an example of a cross-sectional view of the ear tip of the in-ear tactical communication and hearing protection headset of FIG. 3A;

[0116] FIG. 3E schematically illustrates an example of a right (R) earpiece having an attached ear tip 323 in a compressed state;

[0117] FIG. 3F schematically illustrates an example of a right (R) earpiece being inserted into the ear canal of a user’s right ear having an attached ear tip in a less compressed state;

[0118] FIG. 3G schematically illustrates an example of a DAC circuitry located in the earpiece of the in-ear tactical communication and hearing protection headset of FIG. 3A configured to perform a lossless front-end digital-to-analog signal conversion of a digital output signal from the Tx microphone;

[0119] FIG. 4A illustrates an example of the standardized test setup used for quantifying the expansion time of the in-ear tactical communication and hearing protecting headset foam tip;

[0120] FIG. 4B schematically illustrates a cross-sectional view of a stainless-steel acoustic coupler unit of the test setup of FIG. 4A used for measuring foam-tip expansion time;

FIG. 4C illustrates a graphical representation of a data series showing the expansion time measurements for the foam-type ear tip according to the present disclosure plotted together with expansion measurements of prior art examples;

[0121] FIG. 5 schematically illustrates an example of a processing architecture of the in-ear tactical communication and hearing protecting system configured to remove noise and enhance the speech signal quality of the user;

[0122] FIG. 6A schematically illustrates an example of a processing method executed by an in-ear tactical communication and hearing protecting system as disclosed herein to provide a clear and undistorted voice signal via a communication device in demanding environments;

[0123] FIG. 6B schematically illustrates an example of a combined pre-processing- and feature extraction step for a Tx microphone input signal according to embodiments of the processing method executed by the in-ear tactical communication and hearing protecting system;

[0124] FIG. 6C schematically illustrates an example of a combined pre-processing- and feature extraction step for an ambient microphone input signal according to embodiments of the processing method executed by the in-ear tactical communication and hearing protecting system;

[0125] FIG. 6D schematically illustrates an example of a neural network processing step using a deep neural network (DNN) model according to an embodiment of the processing method executed by the in-ear tactical communication and hearing protecting system;

[0126] FIG. 6E schematically illustrates an example of an output postprocessing step according to embodiments of the processing method executed by the in-ear tactical communication and hearing protecting system;

[0127] FIG. 6F schematically illustrates an example of a noise filtration step according to embodiments of the processing method executed by the in-ear tactical communication and hearing protecting system;

FIG. 6G schematically illustrates an example of a feature reconstruction step according to embodiments of the processing method executed by the in-ear tactical communication and hearing protecting system;

[0128] FIG. 7A schematically illustrates an example of a training method applied to train the neural network model of FIG. 6D to provide real-time processing of a user’s voice signal to produce clear and undistorted communication in demanding environments;

[0129] FIG. 7B schematically illustrates an example of a data collection process used for the training of the neural network model according to some embodiments;

[0130] FIG. 7C illustrates a graphical representation of three exemplary audio signal data series in a first subplot, a second subplot, and a third subplot arranged in a vertical stack, representing training data used to train the neural network model to obtain a trained neural network model according to the training method of FIG. 7A; and

[0131] FIG. 7D illustrates a graphical representation of two exemplary audio signal data series in a fourth subplot and a fifth subplot segment arranged in a vertical stack, representing noise data used to train the neural network model to obtain a trained neural network model according to the training method of FIG. 7A.

[0132] DETAILED DESCRIPTION

[0133] Fig. 1 schematically illustrates an example of a user wearing an exemplary personal in-ear tactical hearing protection and communication system configured to be worn by a person in communication with other connected units (e.g. devices, persons or both).

[0134] Illustrated is a user 101 wearing a wearable personal in-ear tactical hearing protection and communication system. In the illustrated example, the personal in-ear tactical hearing protection and communication system comprises one or more devices for tactical communication and an in-ear communication and hearing protection device 103. In the illustrated example, the user 101 is in a so-called dismounted configuration, which refers to a situation where a user 101 may maintain one or more communication links with one or more remote parties (system(s) and / or other user(s) via their respective personal communication equipment, e.g. an in-ear tactical hearing protection and communication system) without a physical link (such as a cable, etc.). Being dismounted enables the user 101 to move around freely while maintaining communication.

[0135] The personal in-ear tactical hearing protection and communication system may comprise an in-ear communication and hearing protection device 103, e.g. in the form of a pair of in-ear earbuds configured to be arranged in the ear canal of the user 101, and one or more PTT control units 105, 113. Additionally, the user 101 may carry and use one or more additional electronic devices (such as a Tactical Display Unit (TDU) or End-User-Device (EUD) 107) and one or more communication devices 109, 111 operably connected to the PTT control unit 105 and to the headset 103 for establishing audio and data links with other remote units, teams or devices.

[0136] In the illustrated example, the user 101 is wearing an in-ear hearing protection headset 103, a PTT control unit 105 (also sometimes referred to as a PTT control hub or box or simply control box), an EUD 107, and two radios 109, 111 of different types or of different settings, where the in-ear headset 103, the EUD 107, and the two radios 109, 111 are each connected to the PTT control unit 105 via respective cables. Alternatively, one or more of the cabled connections may be replaced by suitable wireless connection(s). Additionally, a wireless remote PTT device 113 containing additional PTT button(s) may be located elsewhere on the user 101 or equipment for easy operation during action and / or for expanding the number of available PTT buttons. The wireless remote PTT device 113 may be in wireless connection 115 with the PTT control unit 105 for transmitting PTT button press actions or other control actions to the main PTT control unit 105. The PTT control unit 105 may e.g. be a PTT control unit as disclosed herein, e.g. as described in connection with Fig. 2A and elsewhere.

[0137] Communication devices

[0138] A communication device may be a handheld radio 109, 111 providing voice and data communication, e.g. in the VHF and UHF bands, and offering secure and reliable communication in various operational environments, such as an AN / PRC-152 or another radio 109, 111 for dismounted use that provides interoperable communication, e.g. in multiple frequency bands, and supports voice and data transmission. A communication device may e.g. be a specialised radio system tailored for mounted operations, including the AN / VRC-110, a multiband radio system featuring components like the AN / PRC-117F(C) or AN / PRC-117G(C); the Thales AN / PRC-148 Vehicle Adapter Amplifier (VAA), which adapts the AN / PRC-148 handheld radio for vehicular use; the Collins Aerospace AN / ARC-210, primarily an airborne radio system but also utilized in some mounted vehicle applications for secure voice and data communication; and the Barrett PRC-2091, a vehicular-mounted version of the Barrett PRC-2090 HF tactical radio, offering long-range communication capability in the HF band for mounted operations. These systems are important for providing reliable communication links between military vehicles and command centers and facilitating coordination and situational awareness on the battlefield or in emergency situations. A communication device may be connected to the PTT control unit 105 via a dedicated cable containing a plurality of terminals for exchanging control signals, power, analog voice signals, and data signals as generally known in the field.

[0139] A communication device may e.g. also be a mobile phone, a satellite phone, etc. Generally, connected communication devices can be wired or wirelessly connected to another device / communication device.

[0140] A user may carry and / or be connected to one or more of such mentioned communication devices and / or other types of communication devices, in particular as disclosed herein.

[0141] PTT control unit

[0142] The PTT control unit 105 may be an intelligent control box adapter with Push-To-Talk (PTT) functionality such that communication to and from a connected headset 103 and / or one or more communication devices 109, 111 may be controlled via the PTT interface on the PTT control unit 105. The PTT control unit 105 may e.g. be in the form of a relatively small body-worn box, e.g. to be attached to a vest, suit, or the like worn by the user 101. The unit 105 may contain several interfaces for connecting a headset 103 and one or more communication devices 109, 111, typically referred to as communication ports or simply “COM ports”. The PTT control unit 105 may contain a simple (e.g. stealth mode) user interface for controlling the operation of the headset 103 and / or connected communication devices 109, 111 in an easy and intuitive manner. Stealth mode means with little or no emission of visual or auditory feedback to the surrounding environment. The user interface of the PTT control unit 105 may e.g. be in the form of tactile buttons for controlling one or more devices (103, 105, 107, 109, 111) as worn / carried and used by the user 101. The user interface of the PTT control unit 105 may also control or communicate with other external devices via an intercom system / communication hub. The user interface of the PTT control unit 105 may for example contain two main PTT buttons on the side and two additional PTT buttons on the front for controlling respectively connected communication devices. The PTT control unit 105 may contain one or more buttons for controlling the connected headset 103 such that the different operation modes of the headset 103 may be activated, deactivated, or otherwise influenced.

[0143] The PTT buttons are typically configured to control the transmission of data and / or voice signals from the user 101 via one or more connected devices 109, 111 such that when one or more buttons are activated, the PTT control unit 105 is signalling one or more radios 109, 111 to start transmission.

[0144] An in-ear tactical hearing protection and communication system configured for dismounted operation according to an exemplary embodiment is illustrated in Fig. 1. A user 101, such as a soldier or public safety agent, is wearing a rugged in-ear hearing protection headset 103, such as the INVISIO X7, connected to a PTT control unit 105, such as the INVISIO V60 II ADP, comprising four dedicated PTT buttons, a mode button, three COMM interfaces and one headset interface. The user 101 may additionally be carrying (as illustrated in Fig. 1) an End-User-Device (EUD) 107, e.g. in the form of a chest-mounted rugged casing holding a smartphone or tablet, like the Samsung® S23 Galaxy Tactical Edition smartphone, connected with the PTT control unit 105. The EUD 107 may e.g. be running a “battle management system” such as Android Team Awareness Kit (ATAK) for precision targeting, surrounding land formation intelligence, situational awareness, navigation, and data sharing via a smartphone touch screen interface. The user 101 may e.g. be a team leader in a squad or similar carrying a first radio 109, such as a portable Thales SquadNet® soldier radio, for communication between team members in the squad and a second radio 111, such as a Harris RF-7800M-HH, for voice and data communication with a connected communication device or system of a headquarters or similar. In the illustrated example, both the first radio 109 and the second radio 111 are connected to the PTT control unit 105.

[0145] Fig. 2A schematically illustrates an example of a PTT control unit 105. The PTT control unit 105 may as an example comprise a rugged casing or housing 201 having four connection interfaces 203a-d, one 203a for connecting with a headset 103 and three 203b-d for respectively connecting to one or more of a plurality of communication devices, e.g. such as 107, 109 and 111 of Figure 1 and elsewhere. The PTT control unit 105 may contain a main control unit (MCU) 205 such as a microcontroller system on a chip (SoC) (e.g. from STMicroelectronics), e.g. using an ARM processor core based on a Reduced Instruction Set Computing (RISC) architecture, or an FPGA processor or similar for executing instructions, controlling other units or elements of the PTT control unit 105, and / or performing complex computational tasks. The MCU 205 may be the central unit responsible for overall control and coordination of the PTT control unit 105, such as managing connection interface type determination, group configurations, and communication with the other processors, elements, and electronic circuits. The MCU 205 may e.g. execute desired / available functions of the software / firmware of the PTT unit 105 in accordance with connected devices to enable the functionality / behaviour of connected devices and the PTT unit 105, e.g. or preferably in response to respective cable chip settings of one or more connected devices. The MCU 205 may be in connection with a dedicated digital signal processing (DSP) unit 207 for advanced analog and / or digital signal processing. The DSP unit 207 may be a dedicated processor for handling audio mixing and processing tasks to ensure correct audio distribution across devices connected to the PTT control unit 105. 
The MCU 205 and DSP 207 processors may separately or in conjunction be configured to operate at least one artificial neural network engine configured to execute functionality such as one or more of speech recognition, voice-to-text, image classification, enhanced situational awareness, 3D directional sound processing, advanced signal processing, active noise cancellation, etc.

[0146] The processors 205, 207 may be in connection with a memory component 211 for storing code, settings, instructions, and other relevant data. The MCU 205 may further be in connection with a USB hub element 213 or similar for administering and negotiating digital data protocols with connected devices with digital capabilities and with a power manager 215 controlling power routing via the PTT control unit 105, such as powering the PTT unit 105 itself and a connected headset 103 via an internal battery 217 and / or a connected radio 109, 111 with power sharing capabilities, and / or directing power to or from a connected power bank or external power source to e.g. charge a radio 109 or 111 or the EUD 107. The PTT control unit 105 may further contain a wireless module 219 such as a Bluetooth transmitter or near-field wireless communication component for short range wireless communication with additional devices used by the user 101. The PTT control unit 105 may additionally contain a push-to-talk actuator module 221 handling activation events associated with push-to-talk (PTT) buttons 223a and 223b on the PTT control unit 105 in relation to the connected communication devices (107, 109, 111) so that a communication device 107, 109, 111 or headset 103 respectively can be operated via one or more push buttons on the PTT control unit 105. When a user activates one or more of the physical PTT buttons 223a and 223b on the PTT control unit 105, a Carrier Operated Relay (COR) or Carrier Operated Switch (COS) signal or similar may be generated or triggered by the actuator module 221 and transmitted to the corresponding communication device 109 or 111 associated with the activated PTT button, and additionally the microphone in the headset 103 may be unmuted. 
The COR / COS operates by generating a digital signal that switches between logic levels (typically +5V and Ground), indicating the activation of the transmitter by activating a squelch circuit or similar in the communication device thereby allowing the communication device to transmit a signal.

[0147] It is noted that the PTT control unit 105 may comprise further push buttons than illustrated in Fig. 2A, e.g. two additional PTT buttons on the front side, PTT-3 button (see e.g. 223c in Figs. 2C, 2D, 3A, etc.) and PTT-4 button (see e.g. 223d in Figs. 2C, 2D, 3A, etc.) as described elsewhere along with a dedicated “mode” button (see e.g. 223e in Figs. 2C, 2D, 3A, etc.).

[0148] When a communication device 107, 109, 111 and / or headset 103 is connected via a respective dedicated (e.g. male) connector 225 configured to interface with a respective (e.g. female) connection interface 203a-d on the PTT control unit 105, information and functional settings may be transmitted via the connection interface, e.g. or preferably as described at least in part in EP2845115B1 (hereby incorporated by reference in its entirety), from a “cable chip” 227 embedded or located in the connector 225 (or alternatively embedded or located in the cable of the connector 225 or elsewhere). The cable chip / micro-chip 227 may comprise an embedded memory storing data representing code, settings, instructions, and / or other data. When the connector 225 is connected to a respective connection interface 203a-d, the data (such as information, (configuration) settings, code, instructions, etc.) may be transferred directly from the cable chip 227 to the MCU 205 for configuration or other related tasks and uses of the connected device (as connected by the specific connector 225). Audio signal handling in the PTT control unit 105 may be facilitated by one or more CODEC modules 229.

[0149] The connector pair 203a-d and 225 should preferably be suitable for military and security applications, such as an ODU AMC® connector with between 3 and 55 contacts / terminals, and be water resistant.

[0150] Inside the cable chip 227 (and / or an associated memory of the chip), a set of settings may be stored. When the memory is read, the settings may be transferred into the MCU 205 and distributed to the appropriate sections of the code and / or peripheral internal units such as the DSP 207, etc. The stored information / data of the cable chip may e.g. be organized as a list of “feature calls”, also denoted “feature requests”, followed by specific settings and e.g. other data respectively associated with the requested feature (i.e. the feature request). Thus, the cable chip 227 may store feature requests (of / for the connected device) that, when obtained by the MCU 205, will cause relevant processor(s) to execute instructions stored in the memory 211 of the control unit 105 corresponding to and / or carrying out the specific functionality associated with a requested “feature” using the specific set of settings and / or data transferred from the cable chip 227 for the particular feature. Examples of features may be:

[0151] • Audio Interfacing features, which may define an audio interface, impedance, gain, etc.,

• Push-To-Talk / Protocol Interfacing, which may define protocols such as UART, USB, pulsing interface (e.g. by using a variable resistance between terminals and Ground to signal different actions by generating voltage pulses, for example shorting a particular microphone terminal to Ground for signalling to a connected phone e.g. to answer / hang up / take picture, etc.), etc.,

[0152] • Control functions (e.g. enabling specific functionality based on conditions such as VOX (“voice-operated exchange”), etc.),

[0153] • UI definitions (which may define short / long press of PTT button functionality or mode button functions and key combos (combinations of different simultaneous button presses)),

[0154] • Various different audio algorithms, and

[0155] • Audio routing (that may adjust the flow of the audio including multi headset setup, crossbanding (relay) between connected radios, etc.) and side tone in headset.
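The feature-request mechanism described above can be illustrated with a minimal sketch. It assumes the cable chip contents have already been read into a list of named requests with settings; the class names, feature names, and setting values below are hypothetical and are not taken from the disclosure. The sketch only shows the general idea that each request selects instructions already stored in the control unit and supplies the settings to run them with.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureRequest:
    """One entry read from the cable chip: a feature name plus its settings."""
    name: str
    settings: Dict[str, object] = field(default_factory=dict)


class ControlUnit:
    """Hypothetical stand-in for the firmware side of the PTT control unit."""

    def __init__(self) -> None:
        self.state: Dict[str, Dict[str, object]] = {}
        # Handlers play the role of instructions already stored in the unit's memory.
        self.handlers: Dict[str, Callable[[Dict[str, object]], None]] = {
            "audio_interface": lambda s: self.state.update(audio_interface=s),
            "ptt_protocol": lambda s: self.state.update(ptt_protocol=s),
            "audio_routing": lambda s: self.state.update(audio_routing=s),
        }

    def apply(self, requests: List[FeatureRequest]) -> List[str]:
        """Run the handler for each known feature; return the names that were skipped."""
        skipped = []
        for req in requests:
            handler = self.handlers.get(req.name)
            if handler is None:
                skipped.append(req.name)  # unknown feature: defaults stay in place
            else:
                handler(dict(req.settings))
        return skipped


# Illustrative cable chip contents (values are made up for the example).
cable_chip_data = [
    FeatureRequest("audio_interface", {"impedance_ohm": 150, "gain_db": 6}),
    FeatureRequest("ptt_protocol", {"protocol": "UART"}),
    FeatureRequest("unknown_feature"),
]
unit = ControlUnit()
skipped = unit.apply(cable_chip_data)
```

Keeping the handlers in the control unit and only the requests and settings on the cable chip mirrors the split described above, where the chip selects and parameterizes functionality rather than carrying it.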

[0156] The cable chip 227 may additionally or alternatively be used for device authentication and / or for specific user rights or accesses associated with the specific cable and / or the connected device connected by the specific cable.

[0157] Thus, the PTT control unit 105 may be configured to operate both data and voice communication channels on connected communication devices 107, 109 and 111 in response to the user speaking (VOX) and / or pushing one or more buttons 223a, 223b on the PTT control unit 105. Based on the particular configuration of devices illustrated in Fig. 1 (i.e. setup e.g. like their number, type, and / or respective connections), the personal in-ear tactical hearing protection and communication system (e.g. see Fig. 1) may be configured such that communication is handled in the following manner.

[0158] Communication from the first radio 109 (e.g. receive or “Rx”) is directed to the user 101 via loudspeakers in both the left and right ear of the headset 103. Communication from the user 101 (e.g. transmit or “Tx”) via the first radio 109 may be activated by the voice of the user, referred to as “voice-operated exchange” or “VOX”. In a voice activated transmission configuration (VOX), the PTT control unit 105 may be configured to process signals obtained by a transmit (Tx) microphone (see e.g. 317 in Fig. 3C) dedicated for user voice communication in the headset 103, such that when the user 101 is speaking, the MCU 205 may process the sound signal from at least the microphone 317 to detect a speech signal, thereby recognising an activation event, and the MCU 205 may thus signal the PTT actuator module 221 accordingly. In response to the activation event, the PTT actuator module 221 may send an electrical signal to a PTT circuit in the first radio 109, triggering the radio to activate a transmission mode and thereby allowing the user to transmit a voice signal via the radio 109. The second radio 111 may e.g. be configured in a dual net operation, such that voice communication can be performed via two separate channels or frequency bands simultaneously. The PTT control unit 105 may be adapted such that two PTT buttons 223a and 223b on the PTT control unit 105 may respectively be assigned to transmitting a voice message via the respective two separate nets, a first net “net 1” and a second net “net 2”. In response to the user activating the first PTT-1 button 223a by pressing the button physically, the PTT actuator module 221 may register an activation event similar to the VOX situation but associated with the PTT-1 button 223a, thereby signalling a PTT circuit in the second radio 111 associated with net 1, as described elsewhere. 
Similarly, the second PTT-2 button 223b may be used to transmit a voice signal via net 2 on the second radio 111.
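As a rough illustration of the VOX activation event described above, the following sketch keys the transmit path whenever a frame of the Tx microphone signal exceeds an energy threshold, and holds it keyed for a few frames afterwards so that short pauses do not drop the transmission. The text does not specify how speech is detected; the RMS threshold, the hangover logic, and all names and values here are assumptions chosen only to make the idea concrete.

```python
import math
from typing import Iterable, List, Sequence


def frame_rms(frame: Iterable[float]) -> float:
    """Root-mean-square level of one frame of samples."""
    samples = list(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def vox_events(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.05,   # assumed level, not from the source
    hangover_frames: int = 3,  # assumed hold time in frames
) -> List[bool]:
    """Return, per frame, whether the transmit path should be keyed."""
    keyed = []
    remaining = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            remaining = hangover_frames  # speech detected: (re)start the hold time
            keyed.append(True)
        elif remaining > 0:
            remaining -= 1               # no speech, but still inside the hold time
            keyed.append(True)
        else:
            keyed.append(False)
    return keyed


silence = [0.0] * 160
speech = [0.2, -0.2] * 80  # synthetic stand-in for a voiced frame
result = vox_events([silence, speech, silence, silence, silence], hangover_frames=2)
print(result)  # [False, True, True, True, False]
```

In a real system each `True` would correspond to the MCU signalling the PTT actuator module; a practical detector would likely be more elaborate than a fixed energy threshold, particularly in high-noise conditions.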

[0159] Accordingly, the PTT control unit 105 may enable the user to communicate via a specific communication device by performing an action, such as pushing a button or starting to speak (VOX), where this action may also be referred to as latching in on a communication channel. Voice communication via the EUD 107 may e.g. be controlled via two additional push buttons (not shown, see Figs. 2C and 2D), one button associated with digital “pick up call” signalling from the PTT control unit 105 to the EUD 107 in relation to receiving a cell phone call and one button for “hang up call” for ending a call. Different push button combinations (“key combos”) may be used for more advanced actions, such as transmission of specific data or configuring the way the loudspeakers in the headset should emit sound, such as left / right ear only, mute all communication, mute single communication channel, etc.

[0160] The PTT control unit 105 may additionally or alternatively contain instructions related to assigning PTT buttons 223a and 223b and functionality following a hierarchical scheme or other suitable scheme. Depending on the type of connected communication device, one or more buttons may be requested (e.g. as part of the information stored in the cable chip 227) to support the communication capabilities of the connected device, such that two buttons would be requested if a dual net radio is connected, three buttons in case of a tri-net interface, etc. This is advantageous in combination with the cable chip functionality, as a default scheme may be applied if the cable chip data fails to specify unique instructions, thereby providing a system with an intuitive default configuration behaviour across communication platforms.

[0161] A special set of PTT allocation rules may be applied by the PTT control unit 105 to facilitate easy and intuitive operation and a dynamic use. Individually connected communication devices 107, 109 and 111 may request a number of PTT buttons to operate the respective device via the simple push button interface on the control unit 105. As a limited number of physical buttons are available on the device 105, a negotiation scheme may be implemented for PTT-button allocation.

[0162] An example of a PTT-button allocation scheme 231 is schematically illustrated in Fig. 2B. The PTT-button allocation may follow an intuitive hierarchical structure, where buttons 223a-d are assigned to respective communication devices 107, 109, 111 following a prioritized scheme depending on which connection interface 203b-d a communication device is connected to. One of the “X”s 233 in the allocation scheme 231 shows that a communication device 107, 109, 111 connected to connection interface 203c designated “COM 1” (in Fig. 2C pointing downwards and to the right as seen from the point of view of the user 101, while it is to the left in the illustration, when the PTT control unit 105 illustrated in Fig. 2A, etc. is worn by a user 101) will be assigned to use PTT-1 button 223a for a primary communication channel of the communication device connected to connection interface 203c. Likewise, a communication device 107, 109, 111 connected to connection interface 203d designated “COM 2” (in Fig. 2C pointing downwards and to the left as seen from the user’s point of view, while it is to the right in the illustration, when the PTT control unit 105 illustrated in Fig. 2A, etc. is worn by a user 101) will per default be assigned to use PTT-2 button 223b (as indicated by the appropriate “X” in Fig. 2B) for operating the primary communication channel for the communication device connected to connection interface 203d. Lastly, a communication device 107, 109, 111 connected to connection interface 203b designated “COM 3” (i.e. pointing upwards next to the headset connector 203a when the PTT control unit 105 illustrated in Fig. 2A, etc. is worn by a user 101) will per default be assigned to use PTT-3 button 223c (as indicated by the appropriate “X” in Fig. 2B) for operating the primary communication channel for the communication device connected to connection interface 203b.

[0163] The “(— >)” 235 in Fig. 2B indicates how the allocated PTT buttons 223a-d may be re-assigned when / if additional communication channels are requested by connected communication devices 107, 109, 111, where an example is given in the following.

[0164] In addition to the allocation scheme 231, one or more rules may be applied to ensure easy, intuitive and clear communication via one or more connected communication devices 107,109,111.

[0165] The rules may as an example e.g. be:

[0166] Rule 1: All connected devices, being communication devices 107, 109 and 111 and / or non-user specific radios, are allowed a minimum of one PTT button if requested by the device / cable.

[0167] Rule 2: PTT allocation is prioritized according to port number 203b-d (when port number 203a is used / to be used by a headset according to the allocation scheme 231), i.e. 203b is prioritised before 203c, 203c before 203d, etc.

Rule 3: When a tri-net interface (for a communication device having three dedicated communication channels) has highest priority and another multi-net communication device is connected, one of the nets of the tri-net will be given up (i.e. a dedicated PTT button is no longer assigned to the net given up).

[0168] Rule 4: When a quad-net interface has highest priority and another multi-net interface is connected, two of the nets of the quad-net will be given up (no dedicated PTT button is assigned any longer to the two nets given up) to allow more than one net on another multi-net interface to function.

[0169] In case additional wireless remote PTT buttons are connected, additional rules may be applied such as:

[0170] A five-button wireless remote PTT (see e.g. 113 in Fig. 1) may mimic the entire control unit 105, thereby expanding the number of available PTT buttons or mirroring the buttons on the PTT control unit 105. To illustrate the PTT allocation functionality by way of an example, Figs. 2C and 2D schematically illustrate two different configurations of communication devices connected to the PTT control unit 105 and the corresponding PTT button allocation scheme 231.

[0171] Fig. 2C illustrates a configuration where a first communication device 109 (being as an example an L3Harris AN / PRC-163 Multi-channel Handheld Radio) is configured for single channel communication and is connected to the “COM 1” interface 203c of a PTT control unit 105, and a second communication device 111 (being as an example a Thales AN / PRC-148D IMBITR 2-channel radio) is configured for multi-channel, or more specifically dual channel, communication and is connected to the “COM 2” interface 203d of the PTT control unit 105. The corresponding PTT button allocation scheme 231 is additionally displayed in Fig. 2C and illustrates that the single communication channel of communication device 109 is activated using the PTT-1 button 223a, the first communication channel of communication device 111 is activated using the PTT-2 button 223b, and the second communication channel of communication device 111 is activated using the PTT-3 button 223c. In the configuration displayed in Fig. 2C, all available communication channels are assigned to a respective PTT button on the PTT control unit 105, as the number of requested communication channels does not exceed the number of available buttons.

[0172] Fig. 2D illustrates a modified (compared to Fig. 2C) configuration where the first communication device 109 (still being an L3Harris AN / PRC-163 Multi-channel Handheld Radio) is configured for single channel communication but is now connected to the “COM 2” interface 203d of a PTT control unit 105, and where the second communication device 111 (still being a Thales AN / PRC-148D IMBITR 2-channel radio) is now connected to the “COM 3” interface 203b of the PTT control unit 105. A third communication device 107 (being as an example a Persistent Systems MPU5 network radio, operating the Wave Relay® MANET solution) is configured with three individual channels for voice and / or data transmission and is connected to the “COM 1” interface 203c. The corresponding modified PTT button allocation scheme 231 is additionally illustrated in Fig. 2D and shows that, in this configuration, the number of requested communication channels across the connected communication devices 107, 109, 111 exceeds the number of available PTT buttons on the PTT control unit 105 (as an example, six available nets vs. four available PTT buttons). This requires the PTT control unit 105 to apply one or more of the additional rules. Rule 3 will be applied, such that one of the communication channels of both the third 107 and the second 111 communication device will not be assigned to a PTT button, in order to obey Rule 1, thereby allowing the user (see e.g. 101 in Fig. 1) to operate at least the primary communication channel on each of the connected communication devices directly via a dedicated PTT button. This is advantageous in demanding environments, as clear communication via a plurality of different communication devices may be required.
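The two configurations of Figs. 2C and 2D can be reproduced with a small allocation sketch. It is a simplified reading of the scheme and rules above, not the disclosed negotiation logic: it assumes the ports are prioritised in the order COM 1, COM 2, COM 3, that each connected device first receives its default button for its primary channel (Rule 1), and that any remaining buttons are then handed out one additional channel at a time in priority order. The function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

BUTTONS: List[str] = ["PTT-1", "PTT-2", "PTT-3", "PTT-4"]
# Default button for the primary channel of each port, as in the scheme of Fig. 2B.
DEFAULT_BUTTON: Dict[str, str] = {"COM 1": "PTT-1", "COM 2": "PTT-2", "COM 3": "PTT-3"}


def allocate(requested: Dict[str, int]) -> Dict[Tuple[str, int], str]:
    """Map (port, channel number) to a PTT button.

    requested: number of channels each connected port asks for.
    """
    ports = sorted(requested)  # assumed priority: COM 1, then COM 2, then COM 3
    free = list(BUTTONS)
    result: Dict[Tuple[str, int], str] = {}

    # Pass 1 (Rule 1): every connected device gets a button for its primary channel.
    for port in ports:
        if requested[port] >= 1:
            button = DEFAULT_BUTTON[port]
            free.remove(button)
            result[(port, 1)] = button

    # Later passes: give out the remaining buttons, one extra channel per device
    # per pass, in priority order, until the buttons run out.
    channel = 2
    while free and any(requested[p] >= channel for p in ports):
        for port in ports:
            if free and requested[port] >= channel:
                result[(port, channel)] = free.pop(0)
        channel += 1
    return result


# Fig. 2C: single-channel radio on COM 1, dual-channel radio on COM 2.
fig_2c = allocate({"COM 1": 1, "COM 2": 2})
# Fig. 2D: tri-net radio on COM 1, single-channel on COM 2, dual-channel on COM 3.
fig_2d = allocate({"COM 1": 3, "COM 2": 1, "COM 3": 2})
```

Under these assumptions, `fig_2c` assigns PTT-1 to the single channel on COM 1 and PTT-2 and PTT-3 to the two channels on COM 2, matching the description of Fig. 2C. For `fig_2d`, every device keeps its primary channel, while one channel of the tri-net device and one channel of the dual-channel device are left without a dedicated button, which is the outcome described for Fig. 2D. Which specific button the extra tri-net channel receives is a choice of this sketch and is not stated in the text.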

[0173] In one embodiment, the one or more processors, electronic circuits, and logical components configured to execute desired / available functions of the embedded software / firmware (as explained by some examples above) comprised by the PTT control unit 105 may constitute a communication module 237 (see e.g. Fig. 2A). Thus, the communication module 237 may be configured to receive an audio signal from a communication device 107, 109, 111 (e.g. or preferably a radio 109, 111) and provide the received audio signal, or a processed version thereof, to the user 101, more specifically to the inner ear of the user 101.

[0174] In-ear Headset

[0175] In-ear tactical hearing protection and communication headsets generally provide a higher level of hearing protection than a circumaural headset variant. Additionally, full situational awareness of the surroundings may be achieved by using in-ear devices instead of over-the-ear types, as an in-ear device does not obstruct the natural geometry of the ear, thereby allowing the user to determine the direction of incoming sound almost as precisely as with the naked ear. This provides a clear advantage when used in demanding environments, such as in a riot control operation by a police officer or on a battlefield by a soldier, to precisely pinpoint the direction of a voice or sound. To achieve crystal clear communication even in demanding and high noise environments, in-ear headsets may utilize bone conduction for obtaining and transmitting a voice signal from the user to team members, for example via a land mobile radio (LMR). This means that instead of picking up air-borne vibrations like traditional microphones, bone conduction microphones pick up vibrations directly from the user’s jawbone when speaking. This method allows an in-ear headset comprising a bone conduction microphone to deliver clear communication under extreme noise conditions, as ambient external environmental noise is automatically excluded. An example of such a solution is known from the INVISIO X5 in-ear headset, which uses a more traditional microphone transducer modified with a rubber dome or bladder to convert jaw vibrations picked up at the inner part of the ear into an electrical voice signal, as at least partly described in document EP3298800B1. However, the nature of bone conducted speech can lead to sound degradation due to the lack of high frequency components in the speech signal, resulting in a muffled effect. 
That degradation, combined with the typically aggressive audio compression in tactical radios, can lead to compromised quality when it comes to narrowband wireless RF voice communication via radios and other devices. Additionally, the fitting of the in-ear hearing protection and communication headset may be very significant for the performance of the in-ear headset. If the bone conduction microphone is positioned non-optimally with less, little, or even no contact to the tissue of the ear canal, speech from the user will typically be severely deteriorated or not obtained at all. Furthermore, the otherwise superior hearing protection ability of an in-ear device relies to a high degree on the fit in the ear canal, making it very important that the in-ear device is mounted precisely in the ear canal and with a proper fit for providing the seal to the ambient environment.

[0176] Figure 3A schematically illustrate an example of an improved in-ear tactical communication and hearing protection system 301 comprising an in-ear tactical communication and hearing protecting headset 103 connected to a PTT control unit 105 (e.g. or preferably as disclosed herein) via a cable and a connector 225. The in-ear tactical communication and hearing protecting headset 103 comprising a pair of earpieces 303a, 303b adapted to be mounted inside the right and left ear canal, respectively, of the user when worn to provide hearing protecting, situational awareness, and clear and undistorted communication via one or more communication devices (see e.g. Fig. 1 and elsewhere).

[0177] The headset 103 may be configured to receive and transmit audio and data signals via the wired cable and connector 225. The headset comprises two similar earpieces, left 303b L and right 303a R. Both earpieces contain similar components and may provide similar functionality. Even though the following description may refer only to the schematics, components, and functions of one earpiece, it should be understood that similar or identical components are present in the other, and furthermore, that some functionality may be achieved when both earpieces 303a, 303b work in conjunction.

[0178] Each of the earpieces may contain a plurality of transducers for obtaining and emitting acoustic signals to the user and potentially other sensors such as an optical heart rate monitor. Each of the transducers is communicatively in contact with the PTT control unit 105 via the cable and connector 225 for audio and data exchange, signal processing, and supply of power. When the headset 103 receives an audio signal (Rx) via a connected communication device 107, 109, 111 (e.g. see Fig. 1 and elsewhere), the audio signal may be directed and emitted to the ear canal of the wearer via a loudspeaker or similar (see e.g. 319 in Fig. 3C) inside each of the earpieces 303a, 303b. When the wearer is talking, the speech signal may be obtained by one or more dedicated respective transducers 315, 317 (see e.g. 315 and 317, respectively, in Fig. 3C) and transmitted (Tx) via a radio, through the PTT control unit 105.

[0179] The headset 103 may have a set of ambient microphones or similar (see e.g. 315 in Fig. 3C), one on each earpiece 303a, 303b pointing outwards from the wearer’s head when mounted, configured for picking up ambient sounds, which may then be processed in real-time by the PTT control unit 105 (and / or by the in-ear headset 103 itself in case the headset comprises one or more processors as described in relation to Fig. 2A, describing processors in the PTT control unit 105) before the sound signals are emitted to the wearer via the loudspeakers. This provides the effect of making the headset “transparent” in terms of listening to the surroundings, thereby simultaneously providing situational awareness and hearing protection. Situational awareness should be understood as the ability to hear and / or perceive the surroundings and determine the direction of sounds. Thus, the in-ear tactical communication and hearing protection system 301 may be able to operate in this active mode of operation when the system 301 is powered.

[0180] Fig. 3B schematically illustrates an example of an embodiment of the in-ear tactical communication and hearing protection headset 103 as disclosed herein when fitted properly inside the ear-canal of a user 101. Fig. 3B shows the Right 303a R earpiece fitted into the right ear of the user 101. When securely fitted in the ear-canal of the user, the earpiece exposes parts of the enclosure or shell 305 and an earhook 307 or similar for providing cable guiding and supporting a secure fit in the ear. Parts of a silicone sleeve or other sealing element 309 are also visible; the sleeve is used as a supporting layer between the shell 305 and the ear tissue of the user 101 for improved comfort when worn. Additionally, a wind filter 311 is shown in more or less the centre of the earpiece covering an ambient microphone (see e.g. 315 in Fig. 3C) for mechanically removing or reducing turbulent and / or static wind noise.

[0181] As seen from Fig. 3B, the entire personal ear geometry is exposed to the ambient environment, providing a natural structure for pinpointing or registering the direction and origin of an incoming sound signal. Compared to conventional “over-the-ear” headsets, this provides an improved directional determination of a sound source via the ambient microphone (see e.g. 315 in Fig. 3C) when a situational awareness mode is activated (e.g. by powering the system 301 on), since the in-ear headset 103 does not obstruct the natural structure of the ear and registers sound at a location at or near the entry of the ear canal of the user 101.

[0182] Fig. 3C schematically illustrates an example of an exploded view of the right 303a R earpiece of the in-ear tactical communication and hearing protection headset 103 of Fig. 3A. A main (flexible) printed circuit board 313 (PCB) supports electronic components and circuits of the earpiece 303a while being enclosed by a shell or housing 305. When the earpiece 303a is assembled, it is configured to be submersible in water up to at least 1 meter, and to be dust and sand-proof, e.g. according to a suitable specific ingress protection (IP) rating, e.g. an IP68 rating, to withstand harsh conditions of demanding environments.

[0183] In some embodiments, the PCB 313 may contain the communication module 237 (e.g. one or more processors and / or other logical components) as discussed in relation to the PTT control unit 105 and elsewhere and may then carry out functionality as described in relation to these (either instead or in addition or as a supplement).

[0184] The main functionalities of the headset, i.e., listening to surroundings (situational awareness) and transmitting and receiving sounds, are enabled by three main transducer components: The ambient microphone 315, the transmit (Tx) microphone 317, and the loudspeaker 319.

[0185] The speaker unit 319 (i.e. loudspeaker) is used to convert an electric audio signal into an acoustic signal to play audio into the user’s ear. The sound signal may comprise both the ambient sounds from the user’s surroundings received by the ambient microphone 315 and received radio signals (Rx) in case the PTT control unit 105 is connected to one or more communication devices. In order to minimize the overall size of the earpiece 303a, it may be an advantage to utilize a balanced armature driver as the speaker unit 319, as such drivers have a very small form factor and the ability to produce a fairly high sound pressure level relative to their size.

[0186] The speaker unit 319 in earpiece unit 303a may be directly and / or dedicatedly wired to the PTT control unit 105, e.g. via a direct dedicated connection line through the cable and connector (see e.g. 225 in Fig. 3A).

[0187] Alternatively, the speaker unit 319 may be connected indirectly via the flex PCB 313, sharing connection lines with the PCB 313 to the PTT control unit 105 or to internal processors and other logical components.

[0188] The speaker unit 319 may be mounted in the earpiece 303a in such a way that the sound is guided via a “spout” or “funnel” 321 structure of the earpiece 303a shell, pointing into the ear-canal of the user 101 when the earpiece 303a is worn. The shape and size of the spout or funnel 321 may affect the sound received by the user, and the spout or funnel 321 may also serve as a coupling mechanism for an ear tip 323 that may be mounted as an extension of the shell 305 and forming part of the earpiece 303a. The transmit (Tx) microphone(s) 317 employed in the headset 103 may be an accelerometer type vibration sensitive transducer. When the speaker unit 319 produces a sound signal (e.g. when outputting an incoming radio message), it may cause the speaker unit 319 to vibrate slightly, which may induce an effect called “crosstalk” between the speaker unit 319 and the transmit (Tx) microphone 317, as the Tx microphone 317 may register the vibration of the speaker unit 319. Crosstalk refers to unintended transmission of audio signals between communication channels or circuits, which is undesirable in tactical communication where sensitive or confidential messages may be transmitted. To decrease the vibrations produced by the speaker unit 319, the chosen transducer type is advantageously a so called “dual” balanced armature driver, in which two identical drivers are mounted together, up against each other, to cancel out the vibrations that each driver produces as they will be operating in antiphase. Another way to decrease the vibrations transferred between the speaker 319 and the rest of the headset earpiece 303a may be by vibrational decoupling between the speaker unit 319 and the earpiece shell 305 and other internal components. This may e.g. be achieved by two methods. First, the electrical wires between the speaker unit 319 and the rest of the earpiece 303a may be so called “litz” wires, which are very thin and flexible, ensuring a low amount of vibration transfer. Secondly, the speaker unit 319 may be inserted and mounted into a rubber-sleeve or similar 325. The rubber-sleeve may be designed both to hold the speaker unit 319 and to provide a sound bore for directing the acoustic signal to the spout or funnel 321. The rubber-sleeve may thereby function as a suspension with only a single point of contact with the rest of the headset (i.e., at the spout or funnel 321 of the earpiece shell 305). The rubber-sleeve may thereby advantageously make the speakers “float” or freely hang inside of the earpiece 303a, such that the speaker unit 319 is not in direct physical contact with the flex PCB 313 and associated components 315, 317 at any point.

[0189] The ambient microphone unit 315 may comprise a so called “MEMS” microphone, which has a small form factor. Since the ambient microphone 315 may be a small flat component, it may be possible to mount the microphone directly on the flex PCB 313. The ambient microphone is positioned such that the microphone port or active side is pointing outwards from the headset, almost perpendicular to the head of the user when worn (see e.g. Fig. 3B). Inside the earpiece 303a, a flexible membrane 327, e.g. a rubber membrane, is placed in front of the microphone port (i.e. in front of the active side) to protect components inside the earpiece 303a from dust and debris and to allow submersion in water. On top of the membrane is a grid, grille, or similar 329 (e.g. forming part of the shell 305) to further protect the microphone-membrane assembly from damage. On the grid, etc. 329, a circular porous foam wind filter 311 is placed to reduce wind noise otherwise affecting the microphone, which is advantageous when operating the headset 103 in high wind speeds, such as when sailing in a RHIB (rigid-hulled inflatable boat) across the water, or in other windy environments.

[0190] The main purpose of the ambient microphone 315 is to act as the artificial hearing of the user when wearing the headset 103. The ambient microphone 315 receives sound from the surroundings around the user and provides an audio signal in response thereto, which can then be transmitted to the user via the speaker unit 319 after being processed in the PTT control unit 105 (and / or the headset) so the user can hear the surroundings thereby providing situational awareness. A secondary purpose of the ambient microphone 315 may be to measure the sound pressure level of the external ambient sounds so that the “active hearing protection” algorithm (e.g. executed by the MCU 205 and DSP 207 processors of the PTT control unit 105; see elsewhere) can process and adjust accordingly. It should be understood that “active hearing protection” (AHP) is a form of active noise control, carried out via the headset transducers 315, 319 and the electrical circuitry of the PTT control unit 105, in which ambient sound is allowed to be transmitted to the user while limiting its overall amplitude so as to protect the user's hearing. This feature provides the user with situational awareness by enabling the user to hear ambient sounds while protecting the user's hearing from overly loud and potentially damaging sounds such as heavy machinery, a gunshot, etc.
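The level-dependent limiting described for active hearing protection can be illustrated with a minimal sketch. It assumes a single broadband level estimate and a hard limit; the function names and the 82 dB limit are hypothetical illustration values and are not taken from this disclosure, and a practical implementation would typically operate on short frames with attack and release smoothing.

```python
import math

def spl_db(rms_pressure_pa, ref_pa=20e-6):
    """Sound pressure level in dB re 20 micropascal for an RMS pressure in pascal."""
    return 20.0 * math.log10(max(rms_pressure_pa, 1e-12) / ref_pa)

def ahp_gain(ambient_spl_db, limit_db=82.0):
    """Linear gain applied to the ambient signal before playback.

    limit_db is a hypothetical threshold chosen for illustration only.
    Below the limit the ambient sound passes unchanged (situational
    awareness); above it the gain is reduced so that the reproduced
    level stays at the limit (hearing protection).
    """
    if ambient_spl_db <= limit_db:
        return 1.0
    return 10.0 ** ((limit_db - ambient_spl_db) / 20.0)

# A quiet scene is passed through; a loud event 20 dB above the limit is scaled down.
quiet_gain = ahp_gain(60.0)
loud_gain = ahp_gain(102.0)
```

In this sketch a 102 dB event is reproduced with one tenth of its pressure amplitude, i.e. 20 dB lower, while speech-level ambient sound is left untouched.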

[0191] In the in-ear tactical communication and hearing protection headset 103, the ambient microphone 315 may additionally be used to support the Tx microphone 317 when creating a clear communication signal to be transmitted via a communication device (e.g. see Fig. 1 and elsewhere) in situations with a varying ambient noise level. This is advantageous as a speech signal obtained by the ambient microphone 315 may have a significantly higher quality than that of the Tx microphone 317 in low ambient noise. By dynamically combining the input signal from the Tx microphone 317 and the ambient microphone 315 (i.e. preferably from both earpieces R 303a and L 303b), the voice of the user may become gradually clearer and more undistorted as the external ambient noise level decreases, since the ambient microphone 315 may “take over” or provide the dominant contribution in low ambient noise situations, and vice-versa in high ambient noise situations. The voice signal mixing between the ambient microphone(s) 315 and the Tx microphone(s) 317 may be handled by the PTT control unit 105.
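The noise-dependent mixing described above can be sketched as a simple crossfade. This is an assumption-laden illustration: the linear weighting rule, the function names, and the 60 dB and 85 dB corner levels are hypothetical, and a hand-written rule like this merely stands in for whatever mixing rule or trained model the system actually applies.

```python
def ambient_weight(noise_db, low_db=60.0, high_db=85.0):
    """Share of the ambient microphone in the transmit mix, from 0.0 to 1.0.

    low_db and high_db are hypothetical corner levels: at or below low_db the
    ambient microphone dominates, at or above high_db the Tx microphone does.
    """
    if noise_db <= low_db:
        return 1.0
    if noise_db >= high_db:
        return 0.0
    return (high_db - noise_db) / (high_db - low_db)

def mix_tx(ambient, tx, noise_db):
    """Sample-wise weighted sum of ambient and Tx microphone frames."""
    w = ambient_weight(noise_db)
    return [w * a + (1.0 - w) * t for a, t in zip(ambient, tx)]

# In a quiet room the ambient microphone is used; in loud noise the Tx microphone.
quiet_frame = mix_tx([1.0, 1.0], [0.0, 0.0], noise_db=50.0)
loud_frame = mix_tx([1.0, 1.0], [0.0, 0.0], noise_db=90.0)
```

A smooth transition between the two corner levels avoids an audible switch when the noise level hovers around a single threshold.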

[0192] The main purpose of the Transmission (Tx) microphone 317 is to obtain the voice signal of the user when speaking so the voice signal can be forwarded to the PTT control unit 105 and subsequently be transmitted to a remote receiver via other equipment such as connected communication devices 107, 109, 111 or an intercom system, e.g. or preferably an intercom system as disclosed in European patent applications numbers 24174205.5, 24182433.3, and 24189349.4 (all hereby incorporated by reference in their respective entirety), respectively disclosing embodiments of an intercom system and aspects thereof.

[0193] An important functionality of the Tx microphone 317 may be to ensure clear and undistorted communication by obtaining only the voice of the user (at least to a large or an optimal extent) and not obtaining external ambient noise (at least to a large or an optimal extent) from around the user, even in high noise environments. One way to achieve this is by using so called bone-conduction microphones (BCM), e.g. as known from the INVISIO® X5 headset. However, some types of BCM microphones may be relatively difficult to fit properly in an ear canal of the user, which may cause deterioration of the voice signal. Therefore, a different approach is applied in the tactical communication and hearing protection headset 103 as disclosed herein by utilizing (at least in some / preferred embodiments) a vibration pick-up sensor (VPU) as the Tx microphone 317.

[0194] The Tx microphone 317 may accordingly function as an accelerometer picking up the vibrations from the pinna (e.g. outer ear structure) of the user when speaking. The Tx microphone 317 may be a component comprising a MEMS microphone with a tiny mass attached to an internal membrane, thereby converting the microphone into an accelerometer, particularly when the microphone port hole / surface is closed (e.g. no air borne acoustic signal can reach the active surface of the MEMS microphone). In existing types of BCM microphones such as the ones known from the INVISIO® X5 headset, the Tx microphone is a standard type microphone (e.g. air borne acoustic signal to electric signal transducer) with a small rubber dome on top of the sensing area, whereby it is modified to pick up vibrations and convert them into varying sound pressure fluctuations in front of the microphone port (e.g. as partly described in document EP3298800B1). This type and similar BCM microphones may be quite sensitive to the actual placement in the ear of the user, since the rubber dome must have firm contact with the user’s ear to ensure good vibration transmission. It is therefore advantageous to utilize a VPU component as the Tx microphone 317, which does not rely as much on direct firm physical contact with the user’s ear (canal) as the BCM variant that requires an abutment part (e.g., small rubber dome) to be in direct contact with tissue of the user’s ear. Rather, the (VPU) Tx microphone 317 may be placed as an internal component in the earpiece 303a, as it is less sensitive to the fit of the earpiece 303a in the ear. However, the earpiece 303a still needs to be positioned firmly in the ear-canal of the user, such that a good vibration transfer coupling is achieved and hereby a good voice signal may be obtained by the VPU type Tx microphone 317.

[0195] The Tx microphone 317 may be placed inside the earpiece 303a as close as possible below the part of the shell 305 that is resting against the ear-canal of the user 101 when worn to maximize the vibration transfer from the user to the earpiece 303a when speaking. Furthermore, the Tx microphone 317 may be oriented such that the most sensitive axis of the sensor is perpendicular or substantially perpendicular to the ear canal of the user when worn. This may result in the highest signal output in response to obtaining speech with the flattest frequency response such that resonance contributions are minimized while also being less sensitive to external noise. The (VPU) Tx microphone 317 may be a digital component whereas other components may be analog, such as for example the speaker unit 319. To be able to interface the Tx microphone 317 in an analog signal architecture, for example using analog audio signal exchange with the PTT control unit 105, it may be advantageous to convert the digital signal to an analog signal. This may be achieved in multiple ways involving a digital-to-analog converter component as generally known. However, a dedicated DAC circuitry 349 configured to perform a lossless front-end digital to analog signal conversion may advantageously be applied.

[0196] Fig. 3G schematically illustrates an example of a circuitry 349 located in the earpiece of the in-ear tactical communication and hearing protection headset of Fig. 3A configured to perform a lossless front-end digital to analog signal conversion of a digital output signal from the Tx microphone 317.

[0197] As mentioned, the Tx microphone 317 may be a Digital VPU (Voice Pick Up) Microphone type, which outputs for example a digital Pulse Density Modulation (PDM) signal. For subsequent signal handling in an analog processing framework, each earpiece R 303a and L 303b may advantageously have a direct Digital to Analog signal conversion (DAC) circuit 349 on the PCB 313 configured to perform a lossless front-end conversion of the digital signal to an analog signal. An example of the DAC circuitry 349 in the Left earpiece 303b is shown in Fig. 3G. The DAC circuitry 349 may comprise a D flip-flop and a 4th order active filter with a Bessel function for providing the best possible analog audio conversion compared to other mathematical functions such as e.g. elliptic, Chebyshev, Butterworth, etc. The converted analog speech signal may be routed via a down-lead cable and connector 225 to the headset port 203a on a PTT control unit 105 for further audio processing before being forwarded to one or more connected communication devices.
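A PDM stream encodes amplitude as the density of ones in a 1-bit stream, so low-pass filtering the stream recovers the waveform. The following is a minimal numerical sketch of that principle only. The first-order sigma-delta encoder merely generates test data, and four cascaded one-pole sections are a simple stand-in for the 4th order filter; they do not have a Bessel response, and the function names and the `alpha` parameter are hypothetical.

```python
def pdm_encode(samples):
    """First-order sigma-delta modulator: samples in [-1, 1] -> list of 0/1 bits."""
    bits = []
    integrator = 0.0
    feedback = 0.0
    for x in samples:
        integrator += x - feedback          # accumulate the quantization error
        bit = 1 if integrator >= 0.0 else 0
        feedback = 1.0 if bit else -1.0     # 1-bit DAC value fed back
        bits.append(bit)
    return bits

def lowpass_decode(bits, alpha=0.02, stages=4):
    """Recover a waveform by low-pass filtering the bit stream.

    Four cascaded one-pole smoothers stand in for a 4th order filter;
    alpha is a hypothetical smoothing coefficient per PDM clock tick.
    """
    state = [0.0] * stages
    out = []
    for b in bits:
        y = 1.0 if b else -1.0              # map the bit to a bipolar level
        for i in range(stages):
            state[i] += alpha * (y - state[i])
            y = state[i]
        out.append(y)
    return out

# A constant input of 0.5 yields a stream that is about 75 % ones,
# and the filtered output settles close to 0.5 again.
bits = pdm_encode([0.5] * 4000)
decoded = lowpass_decode(bits)
```

In hardware, the corresponding step is performed in the analog domain by the active filter acting on the re-clocked bit stream; the sketch only shows why low-pass filtering is sufficient to turn the 1-bit stream back into an audio waveform.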

[0198] To ensure an acceptable level of EMC protection, the earpiece 303a may utilize a metalized housing. The metallization is achieved by coating the inner part of the shells 305 in a thin electrically conductive layer, such as a thin metal layer. Both of the shell parts 305 may be metallized and connected to each other at the rim around the edge of the shells 305. The metalized housing may also be connected to a drain-wire that also acts as shielding in the headset 103 cable and connector 225.

[0199] Due to the vibration-sensitive nature of the Tx microphone 317, certain challenges also arise, such as scratching noise from cord movement of cables on the body of the user 101 due to physical movement (e.g. turning, running, jumping, crewing etc.) and / or wind noise caused by the wind making the headset 103 and / or cables shake, thereby causing the Tx microphone 317 to pick up noise. These challenges may at least be alleviated by signal processing algorithms ensuring that the signal-to-noise ratio is kept at least at an acceptable level even when these types of noises are introduced, for example by performing a machine learning based voice signal filtration method with a speech enhancement element as explained in more detail elsewhere (see e.g. Fig. 5 and elsewhere).

[0200] In other words, existing in-ear bone-conduction voice capture solutions (e.g., INVISIO® X5 headset) that use a pressure-sensing transducer with an abutment element operate as a differential measurement system: the transducer (e.g., BCM) outputs a signal proportional to relative motion between the user’s tissue / bone and the sensing diaphragm via pressure variation at the abutment interface. Consequently, when the transducer and the local cranial tissue co-move (e.g. under global head acceleration), the differential term approaches zero and little signal is produced; this affords inherent rejection of common-mode vibrations and certain structure-borne cable disturbances, but it also demands a firm, continuous, and low-compliance contact preload to avoid loss of sensitivity as the interface softens (sweat, slippage, jaw motion) or as the abutment lifts momentarily. By contrast, an accelerometer-type transducer (e.g., the Tx microphone 317) implements an absolute measurement: any acceleration of the sensor mass is converted to an electrical signal irrespective of whether it originates from vocal fold-induced bone vibration or from external excitations, so intimate, pressure-tight contact is less critical provided a mechanically efficient coupling path exists (ear-tip, housing, or bridge structure), yet the VPU sensor 317 becomes intrinsically sensitive to all vibration sources. In in-ear corded implementations (e.g., see Fig. 1, Fig. 3A-3B), cable rubbing on clothing / skin (scratch noise), plug / strain-relief flexure, and jacket stick-slip / triboelectric events inject broadband structure-borne energy (typically perceived as rasping noise across approximately 100 Hz to 5 kHz with superposed low-frequency thumps below 100 Hz) into the housing, which an absolute accelerometer will faithfully transduce along with user-induced signals; additional involuntary user sounds and motions (footfalls, mastication, jaw clench, breathing / sniffing, tongue clicks) likewise appear as coherent body-borne accelerations. Thus, while the absolute approach relaxes contact constraints and can preserve speech under imperfect seating, it trades that benefit for elevated susceptibility to cable microphonics and body-motion artifacts, whereas the differential pressure approach suppresses common-mode vibrations and cable-borne noise when contact is well maintained but risks signal collapse under degraded coupling.
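The contrast between the differential and the absolute measurement principle can be made concrete with a strongly idealized toy model. Everything in it is an assumption made for illustration: the signals are plain sample lists, `contact` and `coupling` are hypothetical scalar factors between 0 and 1, and real transducers have frequency-dependent behaviour that the model ignores.

```python
def differential_pickup(speech, head_motion, contact=1.0):
    """Idealized abutment-type (BCM) pickup: responds to relative motion only."""
    out = []
    for s, h in zip(speech, head_motion):
        tissue = s + h        # tissue moves with speech plus whole-head motion
        diaphragm = h         # sensing element rides along with the head
        out.append(contact * (tissue - diaphragm))
    return out

def absolute_pickup(speech, head_motion, cable_noise, coupling=1.0):
    """Idealized accelerometer-type (VPU) pickup: responds to any acceleration."""
    return [coupling * s + h + c
            for s, h, c in zip(speech, head_motion, cable_noise)]

speech = [0.25, -0.5]
head = [2.0, 2.0]             # common-mode head acceleration
cable = [0.125, 0.0]          # structure-borne cable disturbance

bcm_good_contact = differential_pickup(speech, head)              # speech only
bcm_lost_contact = differential_pickup(speech, head, contact=0.0) # signal collapses
vpu = absolute_pickup(speech, head, cable)                        # speech plus everything else
```

With good contact the differential model returns the speech samples alone, with lost contact it returns nothing, and the absolute model returns speech together with head motion and cable noise, which is why the latter benefits from downstream noise suppression.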

[0201] In further embodiments, the earpiece 303a may be provided with an additional microphone unit and / or loudspeaker unit positioned to monitor the in-ear space (e.g. between the ear tip 323 and the inner part of the user’s ear) used in combination with dedicated circuitry to provide active and / or automatic noise cancellation (ANC) and / or active noise reduction (ANR). The ANR / ANC functionality may use one or more dedicated electronic circuits (e.g. a part of the flex PCB 313 and / or PTT control unit 105) to generate anti-noise signals (e.g. via the speaker units 319a, 319b or the additional loudspeaker unit) that destructively interfere with ambient sound to cancel it, thereby providing improved hearing protection compared to utilizing only passive sound attenuation. Thus, ANC / ANR is a form of active noise control in which certain selected ambient sounds, that may be repetitive, are filtered out so that other sounds, such as radio communications, can be heard more clearly. As one example, a user riding in a helicopter may configure (e.g. via one or more button combination activations) the tactical communication and hearing protection system 301 to attenuate the repetitive sound of the helicopter engine and rotor while not attenuating desirable sounds such as speech and audible warning signals. By providing the earpiece 303a with an additional microphone for ANC / ANR, the overall hearing protection performance may be enhanced (and e.g. provide an additional 4-5 points to the SNR rating of the headset 103) compared to relying on passive hearing protection alone. Hearing protection may generally be quantified or represented by an SNR value under the Single Number Rating (SNR) system as per the International Organization for Standardization's ISO 4869 certification. The implementation of an additional loudspeaker may be embodied by fitting an additional loudspeaker transducer into the existing rubber sleeve 325 or preferably by adapting the interior of the earpiece 303a to fit a second additional separate rubber sleeve adjacent to the first rubber sleeve 325 so the acoustic performance of the loudspeakers may be preserved and engineered individually for providing optimal or enhanced performance. Additionally, or alternatively, an additional microphone may be placed in acoustic connection with the spout or funnel 321 and / or rubber-sleeve for performing a seal-test to enable a check of whether the earpiece 303a is properly fitted prior to exposing the wearer to the demanding environment with potentially damaging noise levels.
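The destructive interference on which ANC / ANR relies can be illustrated for a single tone. The sketch below assumes an idealized sinusoidal noise component of unit amplitude and an anti-noise tone that is inverted but carries a small gain and phase error; the function name and both error parameters are hypothetical, and the result is not a prediction of the SNR improvement stated above.

```python
import math

def residual_db(gain_error=0.0, phase_error_deg=0.0):
    """Residual level in dB relative to the original tone after adding anti-noise.

    The noise is modelled as a unit phasor and the anti-noise as an inverted
    phasor with relative gain (1 + gain_error) and a phase offset.
    """
    g = 1.0 + gain_error
    phi = math.radians(phase_error_deg)
    re = 1.0 - g * math.cos(phi)    # real part of 1 - g * exp(j * phi)
    im = -g * math.sin(phi)         # imaginary part
    mag = math.hypot(re, im)
    return float("-inf") if mag == 0.0 else 20.0 * math.log10(mag)

perfect = residual_db()                         # complete cancellation
ten_percent_low = residual_db(gain_error=-0.1)  # about 20 dB of attenuation
badly_timed = residual_db(phase_error_deg=60.0) # no attenuation at all
```

The example shows why the anti-noise must match the ambient sound closely in both amplitude and timing: a 10 % gain error still leaves roughly 20 dB of cancellation for the tone, whereas a 60 degree phase error removes the benefit entirely.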

[0202] The maximum level of passive hearing protection provided by the tactical communication and hearing protection headset 103 may be achieved when operating the headset 103 in a “passive noise control” mode, which occurs when the electrical circuitry of the headset 103 is turned off or if no ANC / ANR functionality is available, thereby relying only on the earpieces 303a R and 303b L to physically block soundwaves from reaching the eardrum of the user 101. Otherwise, a maximum level of hearing protection may be achieved by having the ANC / ANR circuitry operating, as previously described, thereby relying on passive blocking in combination with anti-noise generation. As mentioned above, the level of hearing protection may be quantified by means of an SNR value according to the ISO 4869 standard measured in dB (decibel). The tactical communication and hearing protection headset 103 provided in this disclosure is configured to provide at least 30 dB SNR. A main factor for achieving a high level of hearing protection is the ear tip 323 configured to be mounted on the spout or funnel 321 of the shell 305. Ear tips for in-ear headsets are generally well known in the art and may be embodied in many different variations such as tri-flange silicone ear plugs like the SureFire™ EP4 Sonic Defenders Plus or foam type ear plugs like the Comply™ 400 Series Foam Ear Tips. However, it has been realized that some specific properties for a foam-type ear tip 323 are specifically advantageous for achieving improved performance for an in-ear tactical communication and hearing protection headset 103.

[0203] Ear tip

[0204] Fig. 3D schematically illustrates an example of a cross-sectional view of the ear tip 323 of the in-ear tactical communication and hearing protection headset 103 of Fig. 3A. In some embodiments, the ear tip 323 may have a distal end 331 facing the shell 305 of the in-ear device (e.g. 303a and / or 303b) and a proximal end 333 in the opposite direction facing the user 101 when wearing the headset 103. The ear tip may contain a sound bore 335 as an inner core of the ear tip 323. The sound bore 335 may be configured as an acoustic channel or tube for directing an air borne audio signal (e.g. generated at least by the loudspeaker 319) to the user 101. Thus, the sound bore may preferably be made of a semi-rigid material being substantially non-compressible such that the sound bore is unobstructed at all times, providing free passage for acoustic signals to be transmitted. The sound bore may be made of plastic and have a structured region in the portion at its distal end 331 and be configured to engage with the spout 321 of the in-ear shell 305, such that the ear tip 323 may be locked into place in relation to the shell of the headset 103. The body 337 of the ear-tip 323 may advantageously be made of a user compressible material such as a foam-type material. Generally, foam-type ear tips work by first being compressed or squeezed together (e.g. into a compressed state, see e.g. 339 in Fig. 3E), whereafter the compressed foam earplug is positioned right away in the ear canal and allowed to relax so the foam will re-expand and adapt firmly to the surrounding ear canal of the wearer (e.g. into a predefined less compressed state, see e.g. 341 in Fig. 3F), thereby creating an acoustic barrier preventing external sounds from reaching the eardrum of the user 101. Figure 3E schematically illustrates an example of a Right R earpiece 303a having an attached ear tip 323 in a compressed state 339 (also referred to herein as a first state). The body 337 of the ear-tip 323 is compressed, e.g. by the user 101, as tightly as possible around the semi-rigid sound bore 335, thereby allowing the user 101 to insert the earpiece 303a into the right ear. Further illustrated is the Tx microphone, here in the form of a VPU 317 as disclosed herein. Fig. 3F schematically illustrates an example of a Right R earpiece 303a being inserted into the ear canal of a user’s 101 right ear and having an attached ear tip 323 in a predefined less compressed state 341. The ear tip in Fig. 3F may thus have expanded over a period of time from the first compressed state 339 (e.g. see Fig. 3E) into a second less compressed state 341 where at least a part of an exterior of the tip (323, 323a, 323b) touches and engages with the ear canal of the user 101, thereby preventing or at least attenuating sound from the external environment before it reaches further into the ear canal. Thus, the body 337 of the ear-tip has now (e.g. in the second less compressed state 341) adapted to the precise geometry of the inner ear canal structure. It is advantageous that the second less compressed state 341 is different from the fully expanded state, as the less compressed state 341 provides a spring effect, whereby the ear tip 323 may be firmly situated in the ear canal, providing an efficient acoustic seal to the external environment and additionally providing a tight and firm interface between the earpiece 303a and the soft tissue 343 and bone structure 345 of the user, thereby allowing vibrations such as jaw bone vibrations 347 (caused by the user speaking) to propagate efficiently from the bone structure 345 into the transmit VPU microphone 317 as disclosed herein, thereby enabling the in-ear headset 103 to efficiently pick up the voice of the user 101 even in high noise and demanding environments.

[0205] The user compressible material (e.g. 323, 323a, 323b) may advantageously have a predetermined expansion rate so that the tip 323 expands from the (first) compressed state 339 at a first point in time (To) to the (second) less compressed state 341 at a second point in time (Ti), where a sound attenuation effect of the tip 323, at the second point in time (Ti), has reached a predetermined attenuation level, e.g. or preferably so that the sound pressure level of ambient sound in the ear canal of the user 101 between the proximal end 333 and the inner ear is reduced to a level of 50% or less of the ambient sound outside the ear / ear canal.

[0206] It is particularly advantageous to utilize a user compressible material being a foam-material with mechanical properties that allow for a relatively long expansion time (from To to Ti), such as being at least 20 seconds, or more preferably being larger than 40 seconds, e.g. 50 or 60 seconds.

[0207] However, a too long expansion time is not preferable, as the fitting procedure of the ear tip 323 would then take too long, thereby prolonging the time for the foam to fully expand (e.g. into the second less compressed state 341) and provide the necessary hearing protection for use in demanding environments. An example is law enforcement personnel sitting in a vehicle and mounting the in-ear tactical communication and hearing protection headset 103 just before being deployed in a riot control operation and potentially exposed to loud fireworks, etc., making rapid and full hearing protection extremely important. Additionally, as the ear tip 323 is required to work as part of the earpiece 303a for providing both communication and hearing protection capabilities, the ear tip 323 is required to have a hollow core or sound bore 335 for directing an acoustic signal from at least the loudspeaker 319 towards the eardrum of the user. Such an acoustic channel may be of a rigid or semi-rigid nature, being less compressible than the surrounding foam material. This design requires the foam material to expand more slowly than a standard passive foam earplug that can be squeezed completely together as part of the fitting process.

[0208] Accordingly, it is beneficial that the expansion time (from To to Ti) is between about 20 seconds and about 100 or about 120 seconds.

[0209] In some embodiments, the expansion time (from To to Ti) is at least about 30 seconds or about 35 seconds, at least about 40 seconds, at least about 60 seconds, or at least about 70 seconds.

[0210] In some further embodiments, the expansion time (from To to Ti) is selected from about 20 to about 90 seconds, selected from about 30 to about 90 seconds, selected from about 60 to about 90 seconds, selected from about 70 to about 90 seconds, or selected from about 70 to about 85 seconds.

[0211] A typical composition of the foam (e.g. the body 337) may for example be a mixture of polyurethane foam, containing a specific combination of materials that enables the sound isolation characteristics, and thermoplastic elastomers, such as a blend of soft absorbing foam with temperature dependent “memory” rubber.

[0212] Memory rubber may generally be porous materials composed of a solid polymer skeleton (also called matrix) and air-filled pores. They can be separated into two main groups according to the nature of their polymer skeleton: thermoplastic and thermoset foams. Within these groups, they can be further differentiated according to their composition, cellular morphology, and other physical and thermal aspects. Their main features are resilience, low weight, high porosity, and good energy absorption. Slower expansion (especially at room temperature) will allow the user more time to fit the earpiece properly, which provides advantages such as enabling deeper insertion of the earpiece into the ear canal of the wearer. Generally, the quality of the user’s transmitted voice signal will increase with a deeper insertion of the ear tip 323 into the ear. This is because the earpiece 303a utilizes the vibration-based sensor 317 (accelerometer, VPU) to “pick up” the voice signal through sound propagation via vibrations 347 of / in the user’s body (e.g. bone structure 345) as they speak. As such, with a larger contact surface area between the ear tip 323 and the ear canal, the vibrations will have a better signal path to propagate through, into the earpiece 303a, thereby increasing the amplitude / loudness of the transmitted signal. Additionally, deeper insertion may provide better passive hearing protection (i.e. improve the SNR rating). The degree of passive hearing protection of the ear tip 323 will increase with a deeper insertion of the ear tip 323 into the ear canal as it allows for a longer ear tip 323. The ear tip 323 essentially acts as an absorber / dampener of the external sound to which the user’s ears are exposed from the external environment, by which the incoming sounds will be reduced in amplitude significantly compared to free air.
By having more foam material between the eardrum and the external sounds, the external signal (noise) will have a longer path through the ear tip before it reaches the eardrum, resulting in a reduced sound pressure level of the noise. Moreover, a deeper insertion may result in a firmer and more secure fit of the earpiece 303a in the ear canal, so the earpiece does not come loose or pull out easily. Additionally, the force required to pull the ear tip 323 out of the user’s ear will increase the further the ear tip is inserted into the ear canal, since the area applying the friction will be larger when more of the tip 323 is in contact with the ear canal. This is important, as a loose fit could jeopardize the user’s hearing if an earpiece 303a or 303b comes loose during operation in high-noise demanding environments. A longer foam expansion time, preferably also temperature dependent, may make quick refitting of the earpiece easier. In case the user is required to remove one or both of the earpieces 303a / 303b from the ears momentarily or for some other purpose, the ear tips 323 (e.g. 323a and 323b) will typically have reached a temperature close to the body temperature of the user (an increased temperature may cause a faster expansion). When the earpiece is removed, the temperature of the tip 323 will decrease, which may cause the ear tips’ expansion rate to decrease so that the ear tips 323 may maintain their shape for a little while, thereby making quick refitting possible. With foam tips generally known in the field, a quick reinsertion would be quite difficult due to a generally fast expansion rate of standard foam materials (i.e. short expansion time) both at room temperature and at body temperature, resulting in a rapid shape deformation of the tip upon removal from the ear, which would make quick direct re-insertion difficult.
A longer expansion time enables the shape of the foam tip 323 to be maintained for a longer period of time making it easier to refit them quickly and securely.

[0213] As used herein, the foam expansion time is defined as the time required for a foam-type ear tip 323 to expand from a first compressed state 339 (e.g. see Fig. 3E) to a second less compressed state 341 (e.g. see Fig. 3F), when inserted into an ear canal of a user (of any typical user), that results in 50% of the maximum steady-state attenuation of the external sound pressure. The foam tip expansion time measurements may be quantified using a standardized test setup 401 as shown in Fig. 4A. The test setup 401 is designed to simulate a realistic user scenario such that the expansion time measured in the test setup 401 may be directly related to real usage of the in-ear device (103, 303a, 303b). The test setup 401 shown in Fig. 4A comprises a sound-isolated box 403 that may acoustically block external noises. The box 403 is equipped with a loudspeaker 405 configured to apply a calibrated pink noise signal at a level of 94 dB SPL (Sound Pressure Level) (mimicking ambient noise of a demanding high-noise environment) and with a frequency range of 100 Hz - 16 kHz. A reference microphone 407 is used to monitor the sound pressure of the applied high noise signal inside the box during the measurement. A stainless-steel acoustic coupler unit 409 is used to simulate the ear canal of the user, having a top part being cylindrically shaped with a conical bore adapted to receive the right R 303a or left L 303b in-ear device containing a foam-type ear tip 323. In the bottom of the stainless-steel acoustic coupler unit 409, an internal microphone (see e.g. 411 in Fig. 4B) is placed for measuring the sound pressure inside the unit 409. Fig. 4B schematically illustrates a cross-sectional view of the stainless-steel acoustic coupler unit 409 of Fig. 4A having a right R 303a in-ear device inserted with a compressed foam-type ear tip 323. The stainless-steel acoustic coupler unit 409 used in the test setup may e.g. be a GRAS 43AC Ear Simulator Kit According to IEC 60318-4 comprising a GRAS RA0401 high-frequency ear simulator 411 having a frequency range of 10 kHz to 20 kHz, a GRAS 40AG 1 / 2" Pressure Microphone 413, and a GRAS 26AC 1 / 4" Preamplifier 415 mounted in a test jig 417, commercially available via the GRAS® Sound & Vibration webpage, where the stainless-steel acoustic coupler unit 409 may be placed on a heater 419 for elevating the temperature to a desired level. The loudspeaker 405, the reference microphone 407 and the internal microphone 411 are all connected to a control PC (not shown) for controlling the units 405, 407, 411 and performing data collection.

[0214] An experimental sequence conducted using the test setup 401 for measuring the expansion time of the foam-type ear tip 323 according to the invention is performed as follows: The earpiece 303a with the attached ear tip 323 in a fully compressed state is inserted into the stainless-steel coupler unit 409 and exposed to external noise outside of the coupler unit 409. The sound pressure level (SPL) is then measured inside the coupler unit 409 continuously as a function of time. As the ear tip 323 expands, the SPL measured by the internal microphone 413 in the coupler unit 409 will decrease until reaching a steady-state maximum attenuation value. Before the experiment is initiated, the stainless-steel coupler unit 409 is heated to a temperature of 34°C to simulate a temperature comparable to the human ear. Both the reference microphone 407 and the internal microphone 413 are calibrated and checked to give equal readings when exposed to external noise (e.g. a pink noise signal at 94 dB SPL) via the speaker unit 405. A stepwise description of the experimental sequence follows:

[0215] 1. Completely compress the foam-type ear tip 323 (in the first compressed state) around the inner sound bore using the thumb and index finger to squeeze.

[0216] 2. Insert the earpiece 303a containing the compressed ear tip into the coupler unit 409. At this point, the ear tip 323 should still be compressed and sit very loosely in the coupler.

[0217] 3. Close the sound box 403 and initiate the measurement by starting the external noise exposure and measuring the SPL in dB every 1-5 seconds using the internal microphone 413. The measurement lasts 180 seconds to ensure that full attenuation is achieved no matter the compression. The above steps are carried out in immediate extension of each other, such that the starting point of the measurement (To) is not substantially delayed after the placement of the compressed ear tip 323 in the test setup 401 (e.g. the configuration displayed in Figs. 4A and 4B). The experimental sequence is repeated multiple times for the in-ear tactical hearing protection headset 103 according to the invention and multiple times for the prior art headset INVISIO X5.

[0218] Fig. 4C illustrates a graphical representation 421 of a number of data series showing the expansion time measurements for the foam-type ear tip 323 according to the invention plotted together with expansion measurements of prior art examples, all obtained according to the aforementioned method and test setup 401. The graphical representation 421 shows the min-max normalized sound pressure level along the y-axis and the time in seconds along the x-axis. The data series 423, indicated by a dark grey dashed line and triangular markers, represent data points measured for prior art headsets, and the data series 425, indicated by the solid black line and circular markers, represent the measurements obtained for the foam-type ear tip 323 of the in-ear communication and hearing protection headset according to the invention and as disclosed herein. The horizontal solid line 427 represents the threshold boundary at 50% attenuation effect relative to the steady-state maximum attenuation (i.e. representing the second less compressed state 341). The expansion time is measured from the time of earpiece insertion (To) in the coupler 409 until a time point (Ti) defined by a drop in the normalized sound pressure signal to 50%, as measured with the internal microphone 413 of the stainless-steel acoustic coupler unit 409.
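
As a minimal sketch of how an expansion time could be read off such a data series, the following Python function min-max normalizes the measured SPL values and reports the first time at which the normalized level has dropped to the 50% threshold. The function name `expansion_time`, the linear interpolation between samples, and the use of the series minimum as the steady-state level are assumptions made for illustration only; they are not taken from the disclosure.

```python
from typing import Optional, Sequence


def expansion_time(times_s: Sequence[float],
                   spl_db: Sequence[float],
                   threshold: float = 0.5) -> Optional[float]:
    """Return the time at which the min-max normalized SPL first falls to
    `threshold` or below, or None if it never does.

    Assumptions (hypothetical, for illustration): the largest reading is the
    level at insertion, the smallest reading approximates the steady-state
    level, and the crossing is linearly interpolated between two samples.
    """
    if len(times_s) != len(spl_db) or len(times_s) < 2:
        raise ValueError("need at least two paired (time, SPL) samples")

    lo, hi = min(spl_db), max(spl_db)
    if hi == lo:
        return None  # flat series: no attenuation was observed

    # Min-max normalization: 1.0 = loudest reading, 0.0 = quietest reading.
    norm = [(v - lo) / (hi - lo) for v in spl_db]

    if norm[0] <= threshold:
        return times_s[0]

    for i in range(1, len(norm)):
        if norm[i] <= threshold:
            # norm[i - 1] is above the threshold and norm[i] is at or below it,
            # so the denominator is strictly positive.
            frac = (norm[i - 1] - threshold) / (norm[i - 1] - norm[i])
            return times_s[i - 1] + frac * (times_s[i] - times_s[i - 1])
    return None
```

With samples taken every 5 seconds, for instance, `expansion_time([0, 5, 10, 15, 20], [94, 90, 80, 70, 64])` places the 50% crossing between the third and fourth samples.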

[0219] Real-time Voice filtering using machine learning

[0220] Conventional headsets often struggle to deliver fully clear speech quality in noisy and demanding environments. The in-ear tactical communication and hearing protection headset as disclosed herein (see e.g. 103 in Fig. 3A and elsewhere) utilizes a vibration-based Voice Pickup Unit (VPU) as a Tx microphone (see e.g. 317 in Fig. 3C) to capture the user’s voice for transmission via one or more connected communication devices (see e.g. 107, 109, 111 in Fig. 1). Hence, the transmit (Tx) microphone (317, 317a, 317b) comprises a vibration-based transducer, mechanically coupled to the user and configured to provide a transmit microphone input signal (503a, 503b, 511) comprising, when the user is speaking, speech vibrations conducted through the user’s jawbone and / or tissue. While a vibration-based Tx microphone 317 may significantly reduce ambient noise compared to traditional air conduction microphones, it may still be sensitive and e.g. pick up internal noises generated from within the user such as chewing, breathing, or other involuntary body sounds, all of which can interfere with voice clarity. Existing noise reduction techniques typically rely on basic filtering methods, which may not adequately differentiate between speech and noise in a vibration-based signal response. Additionally, some external factors, such as scratching noise from cable routing combined with user movement and / or wind-induced vibrations, may lead to a degradation of the voice signal when the in-ear tactical communication and hearing protection headset 103 is used in demanding environments. To overcome at least some of the above-mentioned drawbacks, one or more processors of the in-ear tactical communication and hearing protection system 301 (e.g. see Fig. 3A) may be configured to perform a machine learning based voice signal filtration method with a speech enhancement scheme on the Tx microphone 317 signal to provide a clear voice signal in demanding environments.

[0221] Fig. 5 schematically illustrates an example of a processing architecture 501 of the in-ear tactical communication and hearing protection system as disclosed herein (see e.g. 301 in Fig. 3A and elsewhere) configured to remove noise and enhance the speech signal quality of the user (see e.g. 101 elsewhere). Illustrated is a real-time voice signal filtration method with a speech enhancement scheme applied using a machine learning engine 515 configured to be operated or executed by one or more processors (i.e. the DSP 207 and / or MCU 205 processors of Fig. 2A) of the PTT control unit (see e.g. 105 elsewhere). The machine learning engine 515 may be designed to process and enhance speech signals in real-time for subsequent transmission via one or more connected communication devices (see e.g. Fig. 1). The machine learning engine 515 may be based on a deep neural network (DNN) model trained using a supervised learning approach as discussed elsewhere in relation to Fig. 7A. Unlike conventional deterministic signal processing methods that rely on fixed mathematical assumptions, manually engineered features, and predefined rule sets, the machine learning engine 515 (e.g., trained neural network) introduces a data-driven approach capable of learning complex, non-linear relationships directly from diverse training data (see Figs. 7A-7D). This enables the system to adapt dynamically to unpredictable acoustic environments, including non-stationary signals, overlapping sound sources, and variable noise conditions, which traditional algorithms cannot handle without significant performance degradation. By leveraging learned representations instead of static spectral or temporal descriptors, the machine learning engine 515 can disambiguate distorted or overlapping audio components and optimize speech-relevant frequency bands in real time.
Furthermore, the architecture may capture both short-term and long-term temporal dependencies, allowing improved modelling of speech patterns and environmental noise evolution over time. A neural network-based machine learning engine 515 may learn its own internal representations directly from raw waveforms or spectrograms, eliminating the need for hand-crafted feature engineering pipelines. This capability allows the trained machine learning model 515 to discover subtle temporal and spectral cues that deterministic algorithms or human-designed filters might overlook. Once the architecture is established (e.g., model training is completed), performance can be improved simply by adding more training data or fine-tuning the trained machine learning model even further, without redesigning the processing pipeline. The approach also supports transfer learning, enabling reuse of pre-trained models across different environments or operational scenarios, which significantly reduces development time and enhances adaptability.

[0222] Thus, the use of the machine learning engine 515 (i.e., trained machine learning model) provides enhanced robustness, superior generalization, and consistent speech intelligibility under highly variable conditions, while maintaining low latency suitable for edge devices such as the PTT control unit 105 (see Fig. 2A). These advantages represent a substantial technical improvement over deterministic signal processing methods, which are constrained by fixed assumptions and lack the flexibility to handle real-world variability.

[0223] The machine learning engine 515 may be configured to process either the audio signals from the Tx microphone 317 in a single earpiece, such as the Right earpiece 303a or the Left earpiece 303b, or the combined audio signals from the Tx microphones 317 of both earpieces 303a and 303b. Additionally, the machine learning engine 515 may be configured to process both the audio signals from the Tx microphone 317 and the ambient microphone 315 from a single earpiece, such as the Right earpiece 303a or the Left earpiece 303b, or the combined audio signals from the Tx microphones 317 and the ambient microphones 315 of both earpieces 303a and 303b. It may be advantageous to configure the neural network engine 515 to process as input both the audio signal from the Tx microphone 317 and the audio signal from the ambient microphone 315 in combination.

[0224] A drawback of using a bone conducted speech signal obtained by the Tx microphone 317 may be a relatively limited frequency bandwidth in the vibration-based voice signal propagating through the bone structure (e.g. 347 in Fig. 3F). Traditional air conduction speech signals may comprise a much broader frequency bandwidth, providing a better representation of the voice signal. Accordingly, some valuable speech information for providing clear communication may be absent or missing in the bone conducted speech signal if directly transmitted via a radio in a raw form. However, in high-noise environments, the bone conducted speech signal is far superior in terms of isolating the user’s voice as compared to an air conduction microphone, as the bone conducted signal is less impacted by external ambient airborne noise. By combining both a bone conducted speech signal (e.g. recorded by the Tx microphone 317a, 317b) and a traditional air conducted speech signal (e.g. recorded by the ambient microphone 315a, 315b) in a machine learning based filtering process, a superior quality of a voice signal (to be transmitted via a radio) may be achieved, even in demanding environments with varying and unpredictable ambient noise. Thus, the ambient microphone (315, 315a, 315b) is configured to provide an ambient microphone input signal (505a, 505b, 513) comprising, when the user is speaking, airborne acoustic components of the user’s voice and surrounding environmental sounds.

[0225] This dynamic combination of the two input signals, one representing airborne sound from an ambient microphone (e.g., ambient microphone 315 providing an ambient microphone input signal 505a, 505b, 513) and the other representing vibrations associated with the user’s speech from a vibration-based transmit microphone (e.g., transmit (Tx) microphone 317, such as a VPU), provides a synergistic technical effect that significantly enhances the quality and reliability of the transmitted voice signal (e.g., corrected or improved voice signal 525) in demanding acoustic environments. When the user is speaking, the ambient microphone (e.g., 315) is capable of capturing high-fidelity speech signals in low-noise conditions, offering a broad frequency response and natural sound quality, but its performance can degrade rapidly in high-noise environments due to the intrusion of external sounds. Conversely, the vibration-based Tx microphone (e.g., 317), which detects speech via bone or tissue conduction, is inherently robust against external airborne noise and excels in isolating the user’s voice in loud surroundings, but typically suffers from a limited frequency range and may produce a muffled or less natural sound in quiet conditions. By intelligently combining these two distinct input signals, preferably using adaptive signal processing or machine learning techniques (e.g., 515), the system can dynamically leverage the strengths of each microphone depending on the prevailing noise environment. In low-noise scenarios, the ambient microphone (e.g., 315) can dominate, ensuring natural and intelligible speech transmission; in high-noise scenarios, the system can prioritize the vibration-based Tx signal (e.g., transmit microphone input signal 503a, 503b, 511) to maintain speech intelligibility and suppress environmental noise.
This adaptive fusion (e.g., combination) results in a transmitted voice signal (e.g., 525, 617) that is consistently clear, intelligible, and robust across a wide range of acoustic conditions, thereby overcoming the limitations inherent in using either microphone type alone. The technical advantage is a substantial improvement in communication reliability and user safety, particularly in mission-critical or hazardous environments where clear voice transmission is essential (e.g., when the corrected voice signal 525 is output for transmission via one or more communication devices 107, 109, 111, such as radios 109, 111).

[0226] In practical use, individuals operating in demanding environments, such as military, law enforcement, or rescue personnel, often move rapidly between areas with vastly different ambient noise characteristics, for example when transitioning from a quiet corridor into a noisy machinery room or moving through a building with fluctuating background sounds. The dynamic combination of the ambient (airborne) microphone input (e.g., 505a, 505b, 513 from 315) and the vibration-based (Tx) microphone input (e.g., 503a, 503b, 511 from 317) enables the system to continuously adapt to these changing acoustic conditions. By leveraging the high-fidelity, natural speech capture of the ambient microphone (e.g., 315) in low-noise environments and the robust noise immunity of the vibration-based microphone (e.g., Tx microphone 317) in high-noise environments, the system can intelligently select, weight, or fuse the two input signals in real time (e.g., via machine learning engine 515). This adaptive approach ensures that the transmitted voice signal (e.g., 525), such as one sent via a radio (e.g., 109, 111) to one or more other users and / or one or more devices, remains consistently clear and intelligible, regardless of the user’s location or the surrounding noise level. Notably, this results in a significant improvement over systems relying solely on a vibration-based Tx signal (e.g., 503a, 503b, 511 from 317), which, while effective at suppressing external noise, often produces a muffled or unnatural sound and may fail to capture the full richness of the user’s speech in quieter settings.
The described combination thus provides an advantage by maintaining optimal speech quality and communication reliability as the user moves through diverse and unpredictable noise environments, directly supporting mission-critical operations where effective communication is essential for safety and success (e.g., coordinated transmission via communication devices 107, 109, 111 under control of a PTT control unit 105).

[0227] In other embodiments, the one or more processors is / are configured to implement and execute a trained artificial intelligence or machine learning method or component to generate a background noise dependent correction function or signal in response to the ambient microphone input signal and the transmit microphone input signal.

[0228] In one embodiment, the one or more processors are further configured to dynamically adjust the relative contribution of the ambient microphone input signal and the transmit microphone input signal to the corrected output signal in dependence on a background-noise level derived from the ambient microphone input signal, such that the corrected output signal comprises a greater contribution from the ambient microphone input signal in low-noise environments and a greater contribution from the transmit microphone input signal in high-noise environments.

[0229] In some embodiments, the correction function or signal is configured to dynamically balance or weight the contributions of the ambient and transmit microphone input signals to the output signal based on the detected background-noise level, thereby optimizing speech clarity and noise suppression according to the acoustic environment.

[0230] In further embodiments, the artificial intelligence or machine learning method is configured to dynamically balance the contributions of the ambient microphone input signal and the transmit microphone input signal to the corrected output signal in accordance with the prevailing background-noise level, such that in high-noise environments the contribution from the ambient microphone input signal is reduced and in low-noise environments the contribution from the ambient microphone input signal is increased, relative to the contribution from the transmit microphone input signal. According to an aspect, the present disclosure provides a communication system for use in demanding environments, comprising: at least one in-ear communication and hearing protection device configured to be worn by a user, the device comprising: an ambient microphone configured to provide an input signal representing airborne sound; and a vibration-based transmit microphone configured to provide an input signal representing vibrations caused by the user’s speech; one or more processors operatively connected to the microphones, wherein the one or more processors are configured to: receive both the ambient microphone input signal and the vibration-based transmit microphone input signal; determine a measure of the surrounding noise level based on at least the ambient microphone input signal; dynamically combine the ambient microphone input signal and the vibration-based transmit microphone input signal to generate a transmit voice signal for communication, wherein the relative contribution of each input signal to the transmit voice signal is automatically adjusted in dependence on the determined noise level, such that: when the surrounding noise level is low, the transmit voice signal is generated to include a greater contribution from the ambient microphone input signal to provide high-fidelity speech; and when the surrounding noise level is high, the transmit voice signal is generated to include a greater contribution from the vibration-based transmit microphone input signal to provide robust speech capture with improved noise immunity.
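
The noise-dependent balancing described above can be illustrated with a deliberately simple, rule-based stand-in for the trained model. The sketch below maps a noise level to a weight for the ambient microphone signal and cross-fades the two inputs accordingly. The function names, the dBFS thresholds `quiet_db` and `loud_db`, and the linear cross-fade are all hypothetical choices made for illustration; the disclosure leaves the actual weighting to the trained machine learning method and does not specify how the noise level is estimated.

```python
from typing import List, Sequence


def ambient_weight(noise_db: float,
                   quiet_db: float = -50.0,
                   loud_db: float = -20.0) -> float:
    """Weight of the ambient microphone signal in the mix.

    Returns 1.0 at or below `quiet_db`, 0.0 at or above `loud_db`, and a
    linear ramp in between. Both thresholds are illustrative assumptions.
    """
    if loud_db <= quiet_db:
        raise ValueError("loud_db must be greater than quiet_db")
    w = (loud_db - noise_db) / (loud_db - quiet_db)
    return min(1.0, max(0.0, w))


def mix_frames(ambient: Sequence[float],
               tx: Sequence[float],
               noise_db: float) -> List[float]:
    """Cross-fade one frame of ambient and Tx samples by the noise level.

    Low noise: the ambient microphone dominates (natural sound).
    High noise: the vibration-based Tx microphone dominates (noise immunity).
    """
    if len(ambient) != len(tx):
        raise ValueError("frames must have the same length")
    w = ambient_weight(noise_db)
    return [w * a + (1.0 - w) * t for a, t in zip(ambient, tx)]
```

A learned model can make this trade-off per frequency band and per frame instead of with a single scalar weight, which is one reason the disclosure prefers a trained approach over a fixed rule of this kind.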

[0231] As illustrated in Fig. 5, an audio signal 503a from the Tx microphone 317a and an audio signal 505a from the ambient microphone 315a in the Right earpiece 303a are routed into a multiplexing unit 509. Additionally, the audio signal 503b from the Tx microphone 317b and the audio signal 505b from the ambient microphone 315b in the Left earpiece 303b are similarly routed into the multiplexing unit 509, such that the input Tx microphone audio signal 511 and the input ambient microphone audio signal 513 for the neural network engine 515 may be dynamically configured to originate from either of the earpieces (i.e. R 303a or L 303b), both earpieces collectively (i.e. R 303a and L 303b), or any combination thereof. Thus, the neural network engine 515 is configured to receive the ambient microphone input signal (505a, 505b, 513) and the transmit microphone input signal (503a, 503b, 511). However, it may be advantageous to use the audio signals from a single earpiece only, due to resource constraints on the processors (see e.g. 205, 207 in Fig. 2A) and to decrease the power consumption of the PTT control unit 105, which is essential for long operation in demanding environments. Additionally, it may be advantageous to only use the audio signals from a single earpiece to avoid crosstalk.

[0232] As previously mentioned, crosstalk is generally understood as an undesirable leakage of an electromagnetic signal from one circuit or channel to another. In radio communication systems, cross-channel signal leakage may cause audio signals from one communication channel to unintentionally "leak" into another communication channel.

[0233] In an exemplary scenario, a user 101 being a commander of a team may wear the in-ear tactical hearing protection and communication system 301 connected to two individual radios (see e.g. Fig. 1), one being a secure radio for classified information sharing and the other radio being a team radio. In one situation, the commander may receive a classified voice message (Rx signal) via the secure radio being outputted via the Left 319b and Right 319a speaker units in the headset 103. At the same time, the commander may key the team radio via a PTT button on the PTT control unit 105, thereby activating the Tx microphone 317 and initiating radio transmission to the team members. Thus, the Tx microphone 317 may record the vibrations from the emitting speaker units 319a, 319b, which may cause the classified Rx signal from the secure radio to be re-transmitted (Tx signal) via the team radio, thereby allowing unauthorized personnel to hear sensitive or classified information that is transmitted on a different radio channel. Thus, in military operations crosstalk may cause personnel on a non-secure or lower-security channel to unintentionally receive sensitive information from a secure channel. For example, a unit discussing routine logistics might inadvertently hear classified information about strategic movements or intelligence, thereby posing a major security risk.

[0234] Cross-channel signal leakage may be avoided by configuring the in-ear hearing protecting and communication system such that only one Tx microphone 317a (e.g. in the Right earpiece 303a) is active / used for obtaining the voice of the user while simultaneously focusing any received audio signals from connected communication devices only to be outputted to the user temporarily via a single speaker unit 319b (e.g. in the Left earpiece 303b) opposite of each other, when a radio is keyed by the user (i.e. a PTT button or VOX is activated). Additionally, or alternatively the machine learning based voice signal filtration method with a speech enhancement scheme may be adapted to perform crosstalk cancellation. The signal strength or level of a crosstalk signal may be much lower in intensity than a speech signal from a user, which causes the machine learning engine 515 to remove crosstalk signals as they may be treated as noise.

[0235] As seen from Fig. 5, the input audio signal 511 from the Tx microphone(s) 317 may be split into two data streams, where one data stream is forwarded directly to a final noise filtration step 523 and the other data stream is provided as an input component to the machine learning engine 515 for processing as disclosed herein. The input audio signal 513 from the ambient microphone(s) 315 may additionally be split into two data streams, where one data stream is provided as an additional input component to the machine learning engine 515 and the other data stream is forwarded to an output postprocessing operation 519 as explained further in the following. Thus, when suitably trained, the machine learning engine 515 may be configured to receive the input audio signals 511, 513 from the Tx microphone(s) and the ambient microphone(s) as input and to output (e.g., generate) data representing a correction function being signal modification parameters 517 that, when applied to the input Tx microphone signal 511, filter unwanted or degrading noise from the Tx microphone signal 511 in the noise filtration operation 523, thereby providing a clear / clearer speech signal 525. The correction function 517 output by the machine learning engine 515 may advantageously be adjusted based on the input audio signal 513 from the ambient microphone(s) in an output postprocessing step 519 before the adjusted correction function 521 is applied to the input audio signal 511 from the Tx microphone(s) 317 in the noise filtration step 523. Subsequent to the noise filtration step 523, a final corrected voice signal 525 may be provided for transmission (Tx) via one or more connected communication devices (see e.g. 107, 109, 111 in Fig. 1).

[0236] Thus, the machine learning engine may provide, using the correction function 517, 609, a corrected voice signal 525 (e.g., improved output voice signal) to be transmitted (e.g., via radio (109, 100)) to one or more other users and / or one or more devices. The corrected voice signal 525 may additionally be routed back as an input signal 507a, 507b to one or both loudspeaker(s) 319a, 319b as a so-called “sidetone” signal, which makes real-time processing important: the voice signal is provided to the user’s ears while the user is talking, making a time delay above 100-150 ms unacceptable. Preferably, the delay should be below 50-100 ms.
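The sidetone delay limits stated above can be related to the frame-based processing with a short calculation. The following is a minimal sketch only: the sample rate, frame length, compute time and input / output time are hypothetical values chosen for illustration and are not taken from the disclosure. It assumes that at least one full data frame must be buffered before it can be processed.

```python
def sidetone_latency_ms(frame_len: int, sample_rate_hz: float,
                        compute_ms: float, io_ms: float) -> float:
    """Estimate the sidetone delay of a frame-based processing chain.

    frame_len and sample_rate_hz give the buffering delay of one frame;
    compute_ms and io_ms are assumed processing and converter delays.
    """
    # One full frame has to be collected before it can be transformed.
    buffering_ms = 1000.0 * frame_len / sample_rate_hz
    return buffering_ms + compute_ms + io_ms


# Hypothetical operating point (assumption, not from the disclosure).
latency = sidetone_latency_ms(frame_len=128, sample_rate_hz=8000.0,
                              compute_ms=5.0, io_ms=10.0)
PREFERRED_LIMIT_MS = 50.0  # lower end of the preferred 50-100 ms range
within_budget = latency < PREFERRED_LIMIT_MS
print(f"estimated delay: {latency:.1f} ms, within preferred limit: {within_budget}")
```

Under these assumed values the frame buffering alone accounts for 16 ms, which illustrates why longer frames or larger models may quickly consume the available delay budget.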

[0237] In some embodiments, the correction function being signal modification parameters is configured to selectively enhance speech-relevant frequency bands and suppress noise.

[0238] In one embodiment, the correction function or signal is not itself the final output signal, but is used by the processor to modify the input signal(s) to obtain a corrected or improved output signal.

[0239] In some embodiments, the correction function or signal comprises a set of adjustment parameters, such as a gain vector or filter coefficients, and is generated in response to the ambient microphone input signal (505a, 505b, 513) and the transmit microphone input signal (503a, 503b, 511), wherein said correction function or signal is configured to be applied to at least one of the input signals (e.g., the transmit microphone input signal) to modify and improve the resulting output signal.
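The application of a gain-vector correction function to an input signal may be illustrated as follows. This is a minimal sketch, assuming that the correction function is a real-valued gain per frequency bin and that it is applied multiplicatively to the magnitude of one complex spectral frame of the transmit microphone signal while the phase is left unchanged. The function name `apply_correction` and the example values are hypothetical.

```python
import numpy as np


def apply_correction(tx_spectrum: np.ndarray, gain: np.ndarray) -> np.ndarray:
    """Scale each frequency bin of one complex Tx frame by a real gain."""
    if tx_spectrum.shape != gain.shape:
        raise ValueError("the gain vector needs one coefficient per bin")
    magnitude = np.abs(tx_spectrum)
    phase = np.angle(tx_spectrum)
    # Only the magnitude is modified; the original phase is reused.
    return (gain * magnitude) * np.exp(1j * phase)


# Hypothetical 4-bin frame and gain vector (illustration only).
tx_frame = np.array([1.0 + 0.0j, 0.0 + 2.0j, -3.0 + 0.0j, 0.5 - 0.5j])
gain = np.array([1.0, 0.5, 0.0, 2.0])  # keep, attenuate, remove, amplify
corrected = apply_correction(tx_frame, gain)
```

In this sketch the correction function is not itself an audio signal; it only describes how the input frame is to be modified, which is the distinction drawn in the preceding paragraphs.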

[0240] As previously mentioned, a primary purpose of the ambient microphone 315 may be to act as the artificial hearing of the user 101 when wearing the headset 103. The ambient microphones 315a, 315b obtain sound from the surroundings of the user 101 that can then be transmitted to the user’s ears via the speaker units 319a, 319b. Thus, the audio signal 505a from ambient microphone 315a in the Right earpiece 303a may be routed via the PTT control unit 105 to be an input signal 507a for the Right loudspeaker 319a. Additionally, the audio signal 505b from ambient microphone 315b in the Left earpiece 303b may similarly be routed via the PTT control unit 105 to be an input signal 507b for the Left loudspeaker 319b, such that the user can hear the surroundings, thereby providing situational awareness, even though the respective audio connection is not shown in Fig. 5. In other words, the ambient microphone (315, 315a, 315b) is configured to provide an ambient microphone input signal (505a, 505b, 513) comprising, when the user is speaking, airborne acoustic components of the user’s voice and surrounding environmental sounds.

[0241] Alternatively or additionally, the in-ear communication and hearing protection system 301 may be adapted to perform audio processing of the ambient microphone signals 505a, 505b prior to being routed back to the respective speaker units 319a, 319b of the earpieces 303a, 303b using one or more additional trained neural network models (not shown in Fig. 5) configured to provide active hearing protection and / or enhanced situational awareness by performing real-time processing to filter away unwanted noise contributions, such as wind noise not mechanically removed by the filters 311, and / or to perform noise filtration of the input signals 507a, 507b to below a harmful limit when played to the user via the speaker units 319a, 319b, thereby providing an enhanced situational awareness, where specific segments of interest (SOI) or features in the surrounding audio signal may appear clearer or more isolated, such that the user can perform quick vital actions in response. Examples may be enhanced identification of distant gunfire, footsteps, or a person shouting warnings like “Grenade incoming”, “medic”, etc.

Fig. 6A schematically illustrates an example of steps which may be comprised in a processing method 601 executed by the in-ear tactical communication and hearing protection system 301 to provide a clear and undistorted voice signal via a communication device (see e.g. 107, 109, 111 in Fig. 1A) in demanding environments. In other words, Fig. 6A illustrates a method 601 for enhancing speech intelligibility and maintaining robust voice transmission in a tactical communication system 301 when the user 101 is speaking. One or more processors (see e.g. 205, 207 in Fig. 2A) in a PTT control unit (see e.g. 105 elsewhere) may be configured to execute the method 601 illustrated in Figs. 6A-G in accordance with the processing architecture 501 displayed in Fig. 5.

[0242] Initially, both a Tx microphone input signal 511 and an ambient microphone input signal 513, e.g. in the form of a continuous voice signal, may be obtained from the one or more Tx microphone(s) 317 and ambient microphone(s) 315 in the R 303a and / or L 303b earpieces as described previously. Thus, the method 601 comprises receiving, from an ambient microphone (315a, 315b), a second signal (i.e., ambient microphone input signal 513) representing airborne acoustic components of the user’s voice and surrounding environmental sounds, and receiving, from a vibration-sensitive transmit microphone (317a, 317b), a first signal (i.e., Tx microphone input signal 511) representing speech vibrations conducted through the user’s jawbone and / or tissue. The input signals 511, 513 may be obtained in response to the user performing an action, such as pressing and holding a PTT button or similar (see e.g. 223a-d in Fig. 2A and elsewhere) on the PTT control unit, thereby activating the one or more Tx microphones 317 and simultaneously generating a COS / COR signal to key a connected communication device (see e.g. 107, 109, 111 elsewhere) to initiate a voice message radio transmission. Alternatively, a VOX function may activate the generation of the input signals 511, 513 in response to the user starting to speak. The Tx microphone input signal 511 and the ambient microphone input signal 513 may be treated in a preprocessing step 603a, 603b followed by a feature extraction step 605a, 605b for preparing the data for processing by the machine learning engine 607 / 515. Thus, the method 601 further comprises generating feature-extracted representations 629, 633 of the first signal and the second signal. The preprocessing steps 603a, 603b and feature extraction steps 605a, 605b may employ at least partly identical operations for the two different input signals from the Tx microphone(s) 511 and the ambient microphone(s) 513.
Additionally, the pre-processing steps 603a, 603b and feature extraction steps 605a, 605b may be combined into one step including at least two operations for each input signal 511, 513 separately (i.e. for the Tx microphone input 511 and for the ambient microphone input 513).

[0243] Speech signals in general are considered to be highly nonstationary signals. However, over a short time period of about 10 to 100 ms, a speech signal has characteristics which may be fairly stationary. The preprocessing step 603a, 603b may constitute data treatment operations related to short-time processing techniques, in which short segments of audio data in the continuous signal are isolated and processed separately as though they were short segments from a sustained sound with fixed properties. The preprocessing step may thus continuously segment the input audio signals 511, 513 into data frames of about 10-50 ms. The feature extraction step 605a, 605b may be a processing sequence for analysing and extracting relevant information / data from the input signals 511, 513, which the neural network processing step 607 / 515 may then evaluate or use.

Fig. 6B schematically illustrates an example of a combined pre-processing 603a and feature extraction step 605a for the Tx microphone input signal 511. The operations which may be performed in the combined pre-processing 603a and feature extraction step 605a may generate a spectral representation of the input audio signal 511. The continuous time-varying Tx microphone input signal 511 “X1” may be pre-processed into short-time audio data frames and transformed from the time domain to the frequency domain in a single operation using a Short Time Fourier Transformation (STFT) 619 algorithm. The STFT operation 619 may alternatively be performed in two steps, by initially processing the continuous time-varying Tx microphone input signal 511 “X1” into short-time data frames and subsequently performing a standard Fourier transformation (FT) analysis. A Fast Fourier Transform (FFT) size parameter that defines how many frequency bands or “bins” will be applied in the operation 619 (i.e. the frequency resolution) may be set to between 32-2048N (bins). A higher FFT size (i.e. number of bins) will result in a higher frequency resolution of the voice characteristics but would require a longer time span of the data frames, in addition to increased computational time and power usage. For real-time voice analysis of a communication system as disclosed herein, an optimum has been found to be between 64-128N (bins). The output of the STFT spectral transformation 619 may be a complex array which can be split into a magnitude 621a and a phase 623a component. One or both of the magnitude 621a and phase 623a components may be used separately or collectively for any subsequent analysis. However, it has been found advantageous to use at least the magnitude component 621 for subsequent data processing according to the method 601. The magnitude component 621a may be subjected to additional mathematical operations such as a logarithm (base-10 or base-2) operation 625 followed by a standardization operation 627 as part of the feature extraction step 605a. The logarithm operation 625 may be advantageous to apply to a voice signal, as the human auditory system perceives the strength of the different frequency components on an approximately logarithmic scale. The standardization operation 627 of the magnitude component 621 may be performed to transform the data into a Gaussian distribution with zero mean / unit variance to provide a suitable data format for neural network processing. Thus, the output from the feature extraction step 605a for the Tx microphone input signal 511 “X1” may be in the form of at least a processed Tx magnitude component “|X1|” 629. The processed Tx magnitude component “|X1|” 629 may be provided both as input to the neural network processing step 607 / 515 and forwarded as an input to the noise filtration step 613 / 523 as seen in Fig. 6A. Additionally, a phase component “Xf” 631 may be output from the feature extraction step 605a as an otherwise unprocessed output from the spectral representation operation 619 (i.e. phase component 623a) and may be passed directly to the feature reconstruction step 615 (see e.g. Fig. 6G).
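The combined pre-processing and feature extraction described above may be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of the disclosure: the sample rate (8 kHz), the Hann window, the 50 % frame overlap and the standardization over the current block of frames are hypothetical choices, and the names `extract_features`, `FRAME_LEN` and `HOP` are introduced only for this example. A deployed system might instead standardize with fixed statistics obtained during training.

```python
import numpy as np

SAMPLE_RATE_HZ = 8000   # assumed sample rate (not stated in the disclosure)
FRAME_LEN = 128         # FFT size; 128 samples = 16 ms at 8 kHz
HOP = 64                # assumed 50 % overlap between frames


def extract_features(signal, frame_len=FRAME_LEN, hop=HOP, eps=1e-8):
    """Return (standardized log-magnitude, phase), each (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Short-time segmentation: isolate overlapping, windowed frames.
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Spectral transformation: one real FFT per frame.
    spectrum = np.fft.rfft(frames, axis=1)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)          # kept unprocessed for reconstruction
    # Logarithm operation followed by zero-mean / unit-variance scaling.
    log_mag = np.log10(magnitude + eps)
    standardized = (log_mag - log_mag.mean()) / (log_mag.std() + eps)
    return standardized, phase


# Synthetic one-second test signal: 200 Hz tone plus weak noise.
rng = np.random.default_rng(0)
t = np.arange(SAMPLE_RATE_HZ) / SAMPLE_RATE_HZ
test_signal = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(t.size)
features, phase = extract_features(test_signal)
print(features.shape)  # (124, 65): 124 frames, 65 non-redundant bins
```

Note that a real FFT of size 128 yields 65 non-redundant bins, so the number of features per frame in this sketch is 65 rather than 128.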

[0244] The continuous time varying ambient microphone input signal 513 may at least partly be treated in a similar way as the Tx microphone input signal 511 in the preprocessing 603b and feature extraction step 605b.

[0245] Fig. 6C schematically illustrates an example of a combined pre-processing 603b and feature extraction step 605b for the ambient microphone input signal 513, which contains similar operations as previously described. The output of an identical STFT spectral transformation operation 619 may likewise be a complex array which can be split into a magnitude 621b and a phase 623b component. At least the magnitude component 621b may be subject to additional mathematical operations such as a logarithm (base-10 or base-2) operation 625 followed by a standardization operation 627 as part of the feature extraction step 605b to provide a suitable data format for neural network processing as explained in relation to Fig. 6B. The magnitude component 621b may additionally be used for a Sound Pressure Level (SPL) analysis where the total sound level in dB (decibel) may be computed as a logarithmic sum of the individual frequency bands (i.e. bins) in one audio data frame. The apparent SPL level may additionally be an average value calculated based on several consecutive audio data frames, such as the previous 50-200 data frames, in order to compute an apparent SPL value 637 for the signal obtained by the ambient microphone(s) 315 on a per-second basis rather than in milliseconds to avoid high fluctuations. Thus, the output from the feature extraction step 605b for the ambient microphone input signal 513 “X2” may be in the form of at least a processed Ambient magnitude component “|X2|” 633 and an apparent SPL value 637. The processed Ambient magnitude component “|X2|” 633 may be provided as an additional input variable to the neural network processing step 607 / 515 and the apparent SPL value 637 may be forwarded to the neural network output postprocessing step 611 / 519. The generated phase component 623b may be ignored, as the time domain signal for the ambient microphone may not be required to be reconstructed for the specific purpose of the method 601.

[0246] The machine learning engine (see e.g. 515 in Fig. 5) may be adapted to perform a neural network processing step 607 as illustrated in Fig. 6A, which may be the central part of the processing method 601 for performing the machine learning based voice signal filtration method with a speech enhancement scheme. The neural network processing step 607 / 515 may utilize one or more trained neural network models to perform a non-linear mapping between the input and the output features. For the method 601, to enhance the quality of transmitted speech in the headset (see e.g. 103 in Fig. 3A), a regression type algorithm using a supervised learning technique is advantageous, as the one or more trained models should be capable of removing noise (i.e. predicting a clear voice signal) from a noisy voice signal. Several types of neural network (NN) architectures may be used in the neural network processing step 607 / 515, such as an “artificial neural network” (ANN), like a Feedforward neural network (FNN) or a “convolutional neural network” (CNN), and / or a “deep neural network” (DNN) such as a Recurrent Neural Network (RNN) or the like. Other types of networks like Generative adversarial networks (GANs) may alternatively be used for audio signal processing. In one example, the machine learning engine 515 may be composed of a plurality of neural network models or components arranged in a consecutive order where an output from one neural network (NN) may be provided as input to another neural network (NN). Alternatively, other non-neural network machine-learning or artificial intelligence architectures or models may be used.

[0247] Figure 6D schematically illustrates an example of the neural network processing step 607 / 515 including a deep neural network (DNN) model 639. The deep neural network (DNN) 639 may be composed of several key components that work together to model the complex relationships in data. The DNN model 639 includes a data structure such as an array of neurons 641, each of which may receive input, apply a transformation, and produce an output. Typically, a neuron 641 performs a weighted sum of the inputs, adds a bias, and applies an activation function. The neurons 641 are typically arranged in a plurality of layers 643a-e. The first layer is the “input layer” 643a, which receives the input data 629, 633 (e.g., feature-extracted representations of the ambient and transmit microphone input signals). The subsequent layers 643b-d may be referred to as the “hidden layers” and constitute intermediate layers between the input layer and the final “output layer” 643e. These “hidden layers” 643b-d process the inputs 629, 633 through multiple transformations, enabling the network 639 to learn complex patterns and perform predictions. The final layer 643e produces the network’s output 609 / 517, corresponding to the prediction computed by the DNN model 639. According to the disclosure, the machine learning model 639 may be trained to adjust the relative contributions of the ambient and transmit microphone input signals (503a, 503b, 511, 505a, 505b, 513). The individual connections between neurons 641 in the network are associated with individual weights 645. Weights 645 are the parameters that the network adjusts and “learns” during a training procedure. The weights 645 determine the balance between the different neurons 641 in the different layers 643a-e and thereby how they contribute to the output.
Adjusting the weights during the training procedure (e.g., according to the training method as disclosed herein in relation to figure 7A and elsewhere) allows the model to minimize error and make accurate predictions. When the neural network model 639 has been trained to a particular level, the weights 645 are static and may not be subject to change. The size of the network, i.e. the number of neurons 641 and hidden layers 643b-d, may vary; however, it may be advantageous to limit the size of the neural network 639, as the neural network processing method 607 / 515 may be configured to be executed on one or more processors (e.g. MCU 205 and DSP 207 in Fig. 2A and elsewhere) on an edge device or similar (e.g. a PTT control unit 105 configured to be worn by a person, or correspondingly an in-ear device) with limited computational resources and power constraints. For e.g. optimal performance (and real-time processing) of the neural network model, it may be advantageous to utilize more than 6,000 individual weights 645 (and associated neurons 641 and layers 643b-d).
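The neuron operation described above (weighted sum, bias, activation function) and the layered structure may be illustrated with a small fully connected network. This is a minimal sketch, assuming 65 frequency bins per microphone, two hidden layers of 32 neurons, ReLU hidden activations and a scaled sigmoid output; none of these choices is taken from the disclosure, and the weights are random rather than trained. The names `init_layers` and `forward` are hypothetical.

```python
import numpy as np


def init_layers(sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(layers, tx_features, ambient_features):
    """Map one frame of Tx and ambient features to a per-bin gain vector."""
    # Input layer: both feature vectors are presented side by side.
    x = np.concatenate([tx_features, ambient_features])
    # Hidden layers: weighted sum, bias, then ReLU activation.
    for weights, bias in layers[:-1]:
        x = np.maximum(0.0, x @ weights + bias)
    # Output layer: scaled sigmoid keeps every gain inside (0, 2), so a
    # bin can be attenuated (< 1) or moderately amplified (> 1).
    weights, bias = layers[-1]
    return 2.0 / (1.0 + np.exp(-(x @ weights + bias)))


N_BINS = 65  # assumed number of bins per microphone
layers = init_layers([2 * N_BINS, 32, 32, N_BINS])
n_weights = sum(w.size for w, _ in layers)  # connection weights only

rng = np.random.default_rng(1)
gain_vector = forward(layers, rng.standard_normal(N_BINS),
                      rng.standard_normal(N_BINS))
print(n_weights, gain_vector.shape)
```

Counting the connection weights in this way gives a simple handle on the model size when balancing prediction quality against the computational and power limits of an edge device.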

[0248] Artificial Neural Networks (ANNs), such as the machine learning engine 515 illustrated in Fig. 6D, represent a class of machine learning models particularly suited for complex audio signal processing tasks, including speech enhancement, noise suppression, and pattern recognition in real-time communication systems. An ANN is a biologically inspired computational architecture that learns from training data rather than relying on fixed mathematical assumptions or manually engineered features. In the context of the disclosed system, the ANN may be implemented as a Deep Neural Network (DNN) comprising an input layer, an output layer, and multiple fully connected hidden layers, each consisting of neuron arrays configured to process audio features extracted from the ambient microphone input signal 505a, 505b, 513 and the transmit microphone input signal 503a, 503b, 511. Each neuron computes an activation based on an activation function using weighted inputs from the previous layer, where synaptic circuits store these weights in memory to enable adaptive learning. A synaptic circuit may include a memory for storing a synaptic weight. In an exemplary embodiment, a neuron may comprise a register, a microprocessor, and at least one input. In some embodiments, the ANN may be realized through software executed on processors such as DSP 207 or MCU 205 (see Fig. 2A), or through specialized hardware such as an application-specific integrated circuit (ASIC) optimized for artificial intelligence workloads. ASIC implementations can deliver superior computational efficiency and reduced power consumption compared to traditional CPUs, which is particularly advantageous for edge devices like the PTT control unit 105 operating under strict latency and energy constraints. 
By dynamically generating the correction function or signal 609 based on learned non-linear relationships between the two microphone inputs, the ANN ensures robust speech enhancement and noise suppression in real time, achieving a technical effect of improved intelligibility and reliability under highly variable acoustic conditions. In some embodiments, the machine learning engine 515 may be implemented using an Artificial Neural Network (ANN) architecture realized through an application-specific integrated circuit (ASIC). ASICs can be specially customized for artificial intelligence workloads, providing superior computational capabilities and significantly reduced power consumption compared to traditional CPUs or DSPs. This hardware-based implementation is particularly advantageous for edge devices such as the PTT control unit 105 (see Fig. 2A), which operate under strict latency and energy constraints. By embedding the ANN in an ASIC, the system achieves high-throughput, low-latency inference for real-time audio signal processing tasks, including speech enhancement and noise suppression, while maintaining battery efficiency. This approach enables robust and adaptive performance in demanding acoustic environments without sacrificing portability or operational endurance, delivering a substantial technical improvement over deterministic signal processing methods that cannot meet these constraints on resource-limited hardware. Thus, in one embodiment, the one or more processors configured to implement the machine learning engine 515 (see Fig. 6D) comprise at least an application-specific integrated circuit (ASIC) optimized for executing an artificial neural network (ANN). The ASIC may include a plurality of neurons organized in one or more arrays, wherein each neuron comprises a register, a microprocessor, and at least one input for receiving weighted signals from a preceding layer.
Each neuron produces an activation based on an activation function using the outputs of previous neurons and associated synaptic weights. The ASIC further comprises a plurality of synaptic circuits, each synaptic circuit including a memory for storing a synaptic weight, wherein each neuron is connected to at least one other neuron via one of the plurality of synaptic circuits. In some embodiments, the ANN implemented on the ASIC may be a deep neural network comprising an input layer configured to receive feature-extracted representations of the ambient microphone input signal 505a, 505b, 513 and the transmit microphone input signal 503a, 503b, 511, a plurality of hidden layers for non-linear feature transformation, and an output layer configured to generate the correction function or signal 609 for application to the input signals to produce the improved output voice signal 525, 617.

[0249] Both the processed Tx magnitude component “|X1|” 629 and the processed Ambient magnitude component “|X2|” 633 may be provided as input to the deep neural network 639 in the neural network processing step 607 / 515. Thus, the method 601 further comprises receiving, by a trained machine learning model implemented by a processing unit, the feature-extracted representations of the first 629 and second signals 633 as input. The neural network may thus process the input signals 629, 633 and generate an output vector 609 / 517 (e.g., correction function) in response thereto. Thus, the neural network processing step 607 / 515 may fuse the two input signals to form a single combined output vector 609 / 517. The DNN model 639 may be configured such that the output vector may be in the form of a gain vector “G(x)” 609 / 517, containing an array of gain value coefficients for each frequency band (i.e. “bins”) associated with the processed Tx magnitude component “|X1|”. Thus, the method 601 may further comprise dynamically generating, by the trained machine learning model, a correction function, the correction function being signal modification parameters configured to selectively enhance speech-relevant frequency bands and suppress noise.

[0250] In alternative embodiments, the DNN model 639 may be configured to output the corrected input signal directly. Hence, in one embodiment, the method further comprises the step of dynamically generating, by the trained machine learning model, an improved output voice signal to be transmitted (e.g., via radio) to other users or devices, the improved output voice signal being a combined signal from both the first signal 511 and the second signal 513 in which speech-relevant frequency bands are enhanced and noise is suppressed.

[0251] However, it is advantageous to configure the DNN 639 to output only a correction function (i.e. gain vector “G(x)” 609 / 517), which contains the predicted adjustments (e.g., signal modification parameters) to the individual frequency bands (i.e. bins) of the input processed Tx magnitude component 629 in order to obtain a clear or at least clearer voice signal. A correction function may be computed faster and with less computational effort and power usage than a full signal, such that real-time processing of the voice signal may be performed by the in-ear tactical communication and hearing protection system (see e.g. 301 in Fig. 3A and elsewhere). The deep neural network 639 may be trained on a large dataset of speech and noise samples (e.g., paired data records as explained in detail elsewhere), such that the output gain vector 609 / 517 may be optimized to maximize speech clarity while minimizing noise. In other words, a fundamental distinction exists between generating a correction component being signal modification parameters, such as a correction function, gain vector, or set of filter coefficients, to be applied to an input signal (component-based approach), and generating the final output signal (e.g., improved output voice signal 617 / 525) directly using a machine learning system (direct approach). By adaptively applying the generated correction function to the speech signal recorded by the vibration-sensitive microphone (e.g., transmit microphone input signal), the system enables efficient, real-time processing of speech signals on resource-constrained edge devices (such as a PTT control unit or in-ear device) with limited computational power and power restrictions.
This is achieved by configuring the machine learning model to output a computationally lighter correction function (gain vector) rather than a full, directly corrected audio signal (e.g., improved output signal), thereby reducing the processing effort and power consumption required for real-time speech enhancement and noise filtration.

[0252] The output gain vector 609 / 517 (e.g., correction function) may be subjected to a subsequent postprocessing step 611 / 519. Such a postprocessing step 611 / 519 may be advantageous to perform in an in-ear tactical communication and hearing protection system (see e.g. 301 in Fig. 3A), as the one or more processors (see e.g. 205, 207 elsewhere) configured to perform the method 601 may be operating on an edge device, configured to be worn by a person, with limited processing capacity, power restrictions, and a requirement to process audio data in real time. Such constraints may prompt the deep neural network 639 to be optimized for power efficiency and to be able to perform real-time processing of speech signals, making a small neural network size advantageous (i.e. a reduced number of neurons 641 and / or hidden layers 643b-d). The drawback of utilizing a small network may be a reduction in performance with respect to signal quality, such as speech intelligibility, as the mapping and prediction may not be sophisticated enough to cover a large noise input domain (i.e. to provide sufficiently accurate predictions). To address this, the postprocessing step 611 / 519 may be performed to tune and adjust the output 609 / 517 from the neural network processing 607 / 515, so that otherwise suboptimal neural network performance (due to performance constraints) may be corrected and / or avoided. This prevents the machine learning model from performing over-aggressive noise filtration that could inadvertently degrade or remove parts of the user's superimposed voice signal, particularly in extreme high-noise environments. Thus, the method 601 may further comprise the step of subjecting the generated correction function to an output postprocessing step that performs a signal conditioning operation configured to modify the generated correction function, thereby providing a modified correction function based on predefined criteria.

[0253] Fig. 6E schematically illustrates one exemplary embodiment of the output postprocessing step 611 / 519. The output postprocessing step 611 / 519 may perform a correction operation 647, being a signal conditioning operation, implementing a mathematical operation defining overall constraints on the neural network output 609 / 517 based on one or more criteria (e.g., predefined criteria). The neural network output 609 / 517 may be in the form of a gain vector “G(x)” 609 / 517 as previously mentioned, which provides a frequency-dependent correction function, such as an array containing amplification or attenuation coefficients for each individual frequency band (i.e. “bin”) of the FFT, STFT, etc. of the processed Tx microphone input signal 629. The correction operation 647 of the output postprocessing 611 / 519 may be a signal conditioning operation configured to adjust or modify the value of the individual amplification / attenuation coefficients of the gain vector “G(x)” 609 / 517 based on predefined criteria, such as whether they are above or below a threshold value that depends on a background noise level (p2) 635 measured by the ambient microphone(s) as explained elsewhere (e.g. in relation to step 603b and step 605b, see e.g. Figs. 6A and 6C).

[0254] For the purposes of the present disclosure, a “high-noise environment” may be defined as any ambient environment in which the A-weighted sound pressure level (SPL) is equal to or greater than 85 dB(A). Environments in which the SPL remains below 85 dB(A) may correspondingly be defined as “low-noise environments.” The threshold of 85 dB(A) is adopted in accordance with generally recognized occupational health principles and is consistent with the recommended exposure limit established by the National Institute for Occupational Safety and Health (NIOSH), which identifies 85 dB(A) as the level above which long-term noise exposure presents a significant risk of hearing impairment. This threshold therefore provides a technical and regulatory basis for distinguishing between low-noise and high-noise operating conditions within the context of the present invention and is used by the machine learning model to dynamically adjust the relative contributions of the vibration-sensitive and ambient microphone signals.

[0255] The correction operation 647 may be in the form of a “clipping function” that limits the value of the amplification / attenuation coefficients of the gain vector to a specified maximum and / or minimum threshold, such that if any amplification / attenuation coefficient exceeds the threshold, the value of the coefficient is clipped or truncated to the threshold value. The threshold of the clipping function may be determined as a function of the background noise level (p2) 635, e.g. such as being inversely proportional to the background noise level (p2) 635. This means that if the user is exposed to high background noise while talking (i.e. a high SPL value (p2) 635 as measured by the ambient microphone 315), the threshold value may be set low, thereby providing a strong constraint or limitation on the gain vector “G(x)” 609 / 517. Conversely, if the user is speaking in a low-noise situation, the threshold value may be set high, effectively providing little or no alteration of the gain vector “G(x)” 609 / 517. The relationship between the clipping threshold values and the background noise (i.e. SPL value (p2) 635) may be a continuous non-linear relationship or a step curve providing different threshold values for SPL value intervals. The output of the correction operation 647 may thus be an adjusted gain vector “G'(x)” 649 / 521 (e.g., modified correction function) containing amplification / attenuation coefficients which may have been modified according to the clipping value determined as a function of the apparent background noise level (i.e. SPL value (p2) 635).
Thus, the signal conditioning operation (e.g., correction operation 647) is configured to modify the generated correction function (e.g., gain vector “G(x)” 609 / 517), thereby providing a modified correction function (e.g., adjusted gain vector “G'(x)” 649 / 521) based on predefined criteria, the predefined criteria comprising a predetermined threshold being determined as a function of a background noise level (e.g., background noise level (p2) 635) derived from the second signal (e.g., ambient microphone input signal 513).
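The clipping function with a noise-dependent threshold may be sketched as follows. This is one possible reading, offered as a minimal sketch under stated assumptions: the threshold is interpreted as the largest permitted deviation of a gain coefficient from unity gain, expressed in dB, and it is selected from a step curve over SPL intervals. The step values in `STEP_CURVE` and the function names are hypothetical and are not taken from the disclosure.

```python
import numpy as np

# Hypothetical step curve: (upper SPL bound in dB, permitted deviation
# from 0 dB gain). Higher background noise gives a lower threshold.
STEP_CURVE = [(70.0, 40.0), (85.0, 24.0), (100.0, 12.0), (np.inf, 6.0)]


def clip_threshold_db(spl_db):
    """Select the clipping threshold for the apparent background SPL."""
    for upper_spl, limit_db in STEP_CURVE:
        if spl_db < upper_spl:
            return limit_db
    return STEP_CURVE[-1][1]  # fallback for infinite or invalid levels


def postprocess_gain(gain, spl_db):
    """Clip a linear gain vector G(x) to obtain the adjusted G'(x)."""
    limit_db = clip_threshold_db(spl_db)
    gain_db = 20.0 * np.log10(np.maximum(gain, 1e-6))
    # Coefficients beyond the threshold are truncated to the threshold.
    clipped_db = np.clip(gain_db, -limit_db, limit_db)
    return 10.0 ** (clipped_db / 20.0)


# Hypothetical gain vector proposed by the neural network.
proposed = np.array([0.001, 0.5, 1.0, 4.0])
adjusted_quiet = postprocess_gain(proposed, spl_db=60.0)   # limit 40 dB
adjusted_loud = postprocess_gain(proposed, spl_db=105.0)   # limit 6 dB
```

In the low-noise case only the most extreme coefficient is altered, whereas in the high-noise case every coefficient is held within 6 dB of unity gain, so that the adjusted output stays close to the unprocessed Tx microphone signal.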

[0256] It is advantageous to apply the output postprocessing step 611 / 519 in the method 601 since the trained neural network model 639 may be biased to perform a heavy noise filtration with the collateral effect of removing parts of the superimposed voice signal in extreme noise environments. Thus, by constraining the neural network output 609 / 517 based on the ambient noise level (e.g. SPL value 635 measured by the ambient microphone 315) the in-ear tactical communication and hearing protection system 301 may provide a clear or at least clearer voice signal even in demanding high noise environments. This ensures that the final transmitted voice signal is consistently better than or equal to the unprocessed vibration-sensitive microphone input, thereby preserving speech quality in challenging conditions. Hence, this implementation enables the system to be fine-tuned or constrained in real time, ensuring that the resulting output remains predictable and can be manipulated to always be at least as good as, or better than, the raw vibration-based input signal. Furthermore, separating the generation of the correction component from the actual signal modification step improves stability and robustness, as the correction parameters can be monitored, limited, or post-processed to prevent undesirable artifacts or excessive signal alteration, thereby safeguarding the quality of the transmitted signal under varying conditions. Collectively, these advantages make the component-based approach especially suitable for real-time, user-worn communication systems operating in demanding and dynamic acoustic environments. The constraint is needed because the neural network processing step 607 may be configured to both remove background ambient noise and enhance the speech signal, as obtained by the Tx microphone 317 and ambient microphone 315, simultaneously.
This processing may pose an intrinsic challenge for the trained neural network 639 in extreme high noise environments as both the Tx input signal 511 and the ambient input signal 513 may contain a substantial audio signal contribution from the ambient noise. During the training process 701, e.g. or preferably as described in relation to Fig. 7, the neural network model 713 may be configured to adjust one or more of its weights 645 in response to minimizing a loss score 721 or similar based on a comparison 719 between a ground truth clear speech signal 705 and an ambient noise dominated input signal 707 (i.e. in extreme noise environments). Thus, the trained neural network 639 may be biased to perform a too aggressive noise filtration with the collateral effect of removing the superimposed voice signal to minimize the loss function, thereby causing a potential degradation of the speech quality in the final Tx output (i.e. without the output postprocessing step 611 / 519). The vibration sensitive nature of the Tx microphone 317 is superior in obtaining a speech signal of a user as compared to a normal air conducting microphone (e.g. the ambient microphone 315) in extreme noise environments, and the other way around in low noise environments (ambient microphone 315 is better than Tx microphone 317). Thus, the correction operation 647 may tune or modify the output gain vector 609 / 517 proposed by the neural network processing step 607 / 515 depending on the external noise environment. This is advantageous as the final Tx output 617 / 525 may then always be better than or equal to the unprocessed Tx microphone input 511 (i.e. lower background noise and enhanced speech signal).
Put another way, the correction operation 647 may allow the output gain vector 609 / 517 to modify the Tx signal 629 in the following step 613 / 523 more in low noise environments, when the ambient microphone input 633 is more reliable, than in high noise environments, where the ambient microphone input 633 may be too noisy. Thus, the trained machine learning model 607 / 515 is configured to dynamically generate a correction function 609 / 517, by adjusting the relative contributions of the ambient and transmit microphone input signals (503a, 503b, 511, 505a, 505b, 513) according to the surrounding environmental sounds (731) of the ambient input signal (505a, 505b, 513).

[0257] The adjusted gain vector “G'(x)” 649 / 521 may then be forwarded to the noise filtration step 613 / 523 for performing the actual noise filtration of the speech signal 511 obtained by the Tx microphone(s) 317. Hence, the method 601 further comprises applying the modified correction function (e.g., adjusted gain vector “G'(x)” 649 / 521) to the second signal recorded by the vibration-sensitive microphone (e.g., speech signal 511 obtained by the Tx microphone(s) 317). Fig. 6F schematically illustrates an example of the noise filtration step 613 / 523 implementing an operation 651 configured to apply the adjusted gain vector “G'(x)” 649 / 521 to the processed Tx magnitude component "|X|" 629. Such an operation may be a standard vector multiplication, thereby generating a noise corrected Tx magnitude component "|X̂|" 653.

[0258] A next processing step in the method (see e.g. 601 in Fig. 6A) may be a feature reconstruction step 615 transforming the noise corrected Tx magnitude component "|X̂|" 653 back into a Tx output 617 / 525 e.g. or preferably in the form of a continuous time varying audio signal representing a clear or at least clearer speech signal. An example of an implementation of the feature reconstruction step 615 is schematically illustrated in Fig. 6G.

[0259] The goal of the feature reconstruction step 615 may be to transform the output of the neural network (or e.g. rather or preferably a noise corrected Tx magnitude component version thereof, i.e. the "|X̂|" 653) back into the same domain as the input to the method 601, in this case, the time domain. A first operation of the feature reconstruction step 615 may be to shift the magnitude values of the distribution for the corrected Tx magnitude component "|X̂|" 653 back into the original range. This may e.g. be done by scaling the variance by an expansion operation 655 that may be applied to shift the numeric values back to the same natural variance as the target speech signal (e.g. as input Tx signal 511) and subsequently by inverting with a feature standardization operation 657. A next step may be to perform an operation 659 to invert the previous log10 operation (e.g. see operation 625 in Fig. 6B). A final operation may be to calculate the real and imaginary parts for the Inverse Short Time Fourier transform (ISTFT) 661 using the phase component “Xf” 631 (see e.g. also 623a / 631 of Fig. 6B) of the original input 511 and the corresponding processed magnitudes. The output "X1" 617 / 525 may thus be an enhanced reconstructed version of the Tx microphone input signal 511 (e.g., improved output signal) as a continuous voice signal in the time domain. Thus, the method 601 may comprise the step of providing an improved output voice signal (e.g., output "X1" 617 / 525), to be transmitted (e.g., via radio 109, 111) to other users and / or devices, by applying the modified correction function (e.g., adjusted gain vector “G'(x)” 649 / 521) to the first signal (e.g., Tx microphone input signal 511) recorded by the vibration-sensitive microphone 317.
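The chain of inverse operations in the feature reconstruction step 615 may be illustrated by the following minimal sketch. It assumes that the forward feature extraction computed log10 of the magnitude plus a small offset followed by standardization with a scalar mean and standard deviation. It also simplifies the ISTFT 661 to non-overlapping rectangular frames. The function name `reconstruct_time_signal`, the offset `EPS` and the scalar `expansion` factor are hypothetical and not taken from the disclosure.

```python
import numpy as np

EPS = 1e-8  # assumed offset that keeps log10 finite for silent bins


def reconstruct_time_signal(features, mean, std, phase, expansion=1.0):
    """Map corrected magnitude features back to a time-domain signal.

    features: (frames, bins) standardized log10 magnitudes
    phase:    (frames, bins) phase of the original Tx input
    """
    # Expansion (655) and inverse standardization (657).
    log_mag = np.asarray(features) * expansion * std + mean
    # Inverse of the log10 operation (659).
    magnitude = np.maximum(10.0 ** log_mag - EPS, 0.0)
    # Real and imaginary parts from magnitude and original phase.
    spectrum = magnitude * np.exp(1j * np.asarray(phase))
    # Simplified inverse transform (661): one inverse FFT per frame,
    # frames concatenated without overlap-add.
    frames = np.fft.irfft(spectrum, axis=-1)
    return frames.reshape(-1)


# Round trip on a synthetic signal with no correction applied.
rng = np.random.default_rng(0)
x = rng.standard_normal(4 * 64)
spec = np.fft.rfft(x.reshape(4, 64), axis=-1)
log_mag = np.log10(np.abs(spec) + EPS)
mean, std = log_mag.mean(), log_mag.std()
feats = (log_mag - mean) / std
y = reconstruct_time_signal(feats, mean, std, np.angle(spec))
```

A practical implementation would typically use windowed, overlapping frames with overlap-add in place of the frame concatenation shown here.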

[0260] In an alternative embodiment, it may be advantageous to perform an additional processing of the Tx output "X1" 617 / 525 signal before routing the signal to a radio for wireless transmission, even though this is not shown in Fig. 6A. The additional processing of the Tx output "X1" 617 / 525 signal may be performed by a post adjustment of the Tx output "X1" 617 / 525 signal taking into account the type of the communication device (see e.g. 107, 109, 111 in Fig. 1 and elsewhere) intended as recipient for the wireless transmission of the Tx output "X1" 617 / 525 signal. As previously mentioned, the user 101 may activate a PTT-button 223a-d or similar on the PTT control unit 105 (e.g. see Fig. 3A), thereby keying a connected radio to transmit a voice signal (i.e. the Tx output 617 / 525 signal). As the PTT control unit 105 may obtain information related to the communication type (e.g. via cable ship settings or other), a set of specific communication device type instructions may be applied to adjust the Tx output 617 / 525 signal before transmission via the specific communication device (i.e. keyed radio). Such a specific communication device type adjustment may be advantageous to achieve a clear and undistorted communication, as different communication device types may (or may not) apply an intrinsic audio signal processing as part of the internal communication device itself. Such intrinsic audio signal processing algorithms of a radio may in some cases cause an unwanted warping or distortion of the Tx output "X1" 617 / 525 signal, if for example an additional speech optimization algorithm, signal compression, analog to digital conversion (VoIP), or vice versa, etc. is applied by the radio device, substantially altering the original signal (i.e. altering the Tx output "X1" 617 / 525 signal).
Thus, the specific communication device type adjustment may be a radio device audio encoding optimization, such that the transmitted signal by the radio may be or remain clear and undistorted despite any communication device type variations. In summary, and according to an aspect, the present disclosure relates to a method for enhancing speech intelligibility and maintaining robust voice transmission in a tactical communication system when the user is speaking, the method comprising: receiving, from a vibration-sensitive transmit microphone, a first signal representing speech vibrations conducted through the user’s jawbone and / or tissue;

[0261] receiving, from an ambient microphone, a second signal representing airborne acoustic components of the user’s voice and surrounding environmental sounds,

[0262] generating feature-extracted representations of the first signal and the second signal (respectively),

[0263] receiving, by a trained machine learning model implemented on a processing unit, the feature-extracted representations of the first signal and the second signal as input.

[0264] In one embodiment, the method further comprises:

[0265] dynamically generating, by the trained machine learning model, a correction function, the correction function being signal modification parameters configured to selectively enhance speech-relevant frequency bands and suppress noise;

[0266] subjecting the generated correction function to an output postprocessing step that performs a signal conditioning operation configured to modify the generated correction function, thereby providing a modified correction function based on predefined criteria, the predefined criteria comprising a predetermined threshold being determined as a function of a background noise level derived from the second signal; and

[0267] providing an improved output voice signal, to be transmitted (e.g., via radio) to other users and / or devices, by applying the modified correction function to the first signal recorded by the vibration-sensitive microphone and / or the second signal recorded by the ambient microphone.

[0268] In an alternative embodiment, the method further comprises:

[0269] dynamically (directly) generating, by the trained machine learning model, an improved output voice signal to be transmitted (e.g., via radio) to other users or devices, being a combined signal from both the first signal and second signal, enhancing speech-relevant frequency bands and suppressing noise.

[0270] In some embodiments, the trained machine learning model dynamically adjusts the relative contributions of the first and second signals according to the background noise level derived from the second signal, such that in low-noise environments (e.g., below 85 dB) the contribution from the second signal (ambient microphone) is increased, and in high-noise environments (e.g., above 85 dB) the contribution from the first signal (vibration-sensitive transmit microphone) is increased, thereby maintaining optimal speech intelligibility and noise suppression across varying acoustic environments.

[0271] Neural networks training method

[0272] The trained neural network engine 515 configured to perform the neural network processing 607 in the in-ear tactical communication and hearing protection system as disclosed herein (see e.g. 301 elsewhere) may be trained according to the training method 701, schematically illustrated in Fig. 7A, to provide real-time processing of a user’s voice signal to produce at least clearer and at least more undistorted communication in demanding environments. In the following, a computer-implemented method of generating a training dataset for training a machine learning model, particularly the machine learning model according to the disclosure, is provided. Fig. 7A schematically illustrates an initial data collection 703 used to generate a training data set 707 and a target output set (i.e. ground truth) 705 (or rather a series / sufficient plurality of such) for performing the training method 701 following, as an example, a supervised learning scheme.

[0273] The data collection 703 may, as an example, be performed in accordance with the process schematically illustrated in Fig. 7B. The data collection 703 may be divided into two segments or modes in order to obtain a high quality and realistic target output 705 and training data set 707. One mode or a first mode being a “mute mode with high noise background” 725 used to obtain noisy response signals from the in-ear tactical communication and hearing protection system 301 (e.g., obtaining noise data) and another mode or a second mode being a “speech mode with a silent background” 727 used to obtain clear speech signals from the in-ear tactical communication and hearing protection system 301. In the “mute mode with high noise background” 725, a test subject 101’ wearing the in-ear tactical communication and hearing protection system (see e.g. 301 in Figs. 3A and 3B and elsewhere) may be situated in a sound isolated room including one or more external loudspeaker(s) 729 directed towards the test subject 101’. The loudspeaker(s) 729 may be configured to generate a sound pressure level such as between 60 and 140 dB SPL to simulate both a quiet and a loud real-world environment.

[0274] The test subject 101’ may be instructed not to speak but otherwise produce a variety of natural so called “involuntary sounds” such as breathing, sighing, swallowing, lip smacking, chewing, teeth grinding, sniffling, etc. while also moving around to produce sounds from clothes, worn / carried equipment, and cables, turning the head from side to side, up and down, and producing clicking, tapping, or rustling sounds made by the movement, etc. Simultaneously, the loudspeaker(s) may expose the test subject 101’ to a plurality of loud airborne noise segments 731 to simulate different demanding environment situations (e.g., under varying noise conditions representative of demanding operational environments). Such noise segments 731 may be obtained from a database 733 containing an audio data library from demanding environments, such as heavy machine noise, gunshots, helicopters, explosions, etc.

[0275] The in-ear tactical communication and hearing protection headset 103 may thus obtain several noise responses such as the vibration-based noise signal response 735 generated when the test subject 101’ is exposed to the noise segments 731, “involuntary sounds”, and other vibration-based audio artifacts originating from the equipment (e.g. cable scratching and clothing rustle etc.) via the Tx microphone(s) (see e.g. 317 in Fig. 3C) and obtain the air borne noise segments 731 directly via the ambient microphone(s) (see e.g. 315 in Fig. 3C). The PTT control unit 105 may be configured to output, to a Tx noise signal database 739, raw audio Tx signals 737 obtained by the Tx microphone(s), thus containing noisy signals originating from the equipment (e.g. cable scratching, etc.), the test subject 101’ itself (e.g. involuntary sounds), and the exposure of the test subject 101’ to demanding environments. Additionally, the PTT control unit 105 may be adapted to output raw noise ambient signals 741 obtained by the ambient microphone(s) in response to the exposure to demanding environments to an ambient noise signal database 743. Preferably, the vibration-based noise signal response 735 and the ambient microphone signal response (and thereby their raw output versions thereof, 737, 741) are obtained at the same time, i.e. they are both obtained of the same noise segment(s) 731 and thus offer different ways of obtaining or recording the noise 731, where each way provides its version or its way of obtaining or recording the noise 731, including particularities of each way, respectively. Thus, the method of generating a training dataset for the machine learning model comprises obtaining noise data by acquiring simultaneous response signals from the vibration-sensitive microphone and the ambient microphone under varying noise conditions representative of demanding operational environments. In other words, each way provides a different propagation path (e.g. air borne / vibration-based) and subsequent recording of the noise 731. Herein, obtaining the noise 731 via the Tx microphone(s) is also denoted, at least in some embodiments, obtaining the noise 731 (with no speech) in accordance with a second way. Additionally, obtaining the noise 731 via the ambient microphone(s) is also denoted, at least in some embodiments, obtaining the noise 731 (with no speech) in accordance with a third way.

[0276] In the “speech mode with a silent background” 727 (i.e. no noise from any demanding environments), a test subject 101’ wearing the in-ear tactical communication and hearing protection system (see e.g. 301 in Figs. 3A, 3B, and elsewhere) may be situated in a sound isolated room including an external microphone 745.

[0277] The external microphone 745 may be and preferably is a high quality professional stationary voice recording microphone, such as a Shure SM7B, Sennheiser e935, Audio-Technica AT2010, or the like, and is placed in close vicinity of and facing the test subject 101’ (e.g., positioned to capture high-quality airborne speech from the user). The external microphone 745 may be configured to obtain an air conducted voice signal 747 from the test subject 101’ when speaking to produce a high quality, clear, and undistorted voice signal 749. Preferably the high quality, clear, and undistorted voice signal 749 may undergo a subsequent filtering step 751 to be processed and optimized for speech pronunciation rather than pure audio quality, thereby generating a reference voice signal 753 that is optimized for speech intelligibility rather than audio fidelity when transmitted via a narrow band (e.g. 0-40 kHz) RF wireless transmission via a radio. The filtering step 751 may be configured to apply an equalizer function with an enhancement scheme for selective enhancement of lower frequency bands between 0-3500 Hz, preferably between 500-1500 Hz, thus optimizing the reference signal for radio transmission purposes and enhanced speech intelligibility. The voice signal 749 obtained by the external microphone 745 (also referred to as obtaining the voice signal in accordance with a first way) may thus be used as part of the target output set 705 constituting a ground truth for the supervised training method 701 illustrated in Fig. 7A. The test subject 101’ may be instructed to move around, turn the head from side to side, up and down while reading aloud from a manuscript or other in order to generate both the air conducted voice signal 747 and a bone conducted speech signal 755. The in-ear tactical communication and hearing protection headset 103 may thus obtain the vibration-based speech signal 755, involuntary sounds and movement induced vibrations using the Tx microphone(s) (see e.g. 317 in Fig. 3C), and the air conducted voice signal 747 may simultaneously be obtained by the ambient microphone(s) 315 of the in-ear tactical communication and hearing protection headset 103.

[0278] The PTT control unit 105 may be configured to output a raw intermediate Tx signal 757 obtained via the Tx microphone(s) 317 (i.e. obtained in accordance with the second way) containing the vibration-based speech signal 755 and audio artifacts originating from the test subject 101’ and equipment (e.g. cable scratching) as previously described (i.e. involuntary sounds) and additionally output a raw intermediate ambient signal 759 obtained via the ambient microphone(s) 315 (i.e. obtained in accordance with the third way) containing the air borne speech signal. Thus, the method of generating a training dataset for the machine learning model 515, 607 further comprises obtaining speech data (e.g., intermediate Tx and ambient signals 757, 759) by recording simultaneous output signals from the vibration-sensitive microphone 317, the ambient microphone 315, and an external microphone 745 positioned to capture high-quality airborne speech from the user, while the user is speaking in a silent background environment.
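By way of a non-limiting example, the equalizer function of the filtering step 751 could be sketched as a frequency-domain gain applied to the 500-1500 Hz band. The 6 dB boost, the brick-wall band edges and the function name `emphasize_band` are assumptions chosen for illustration; the disclosure does not specify the shape or amount of the enhancement.

```python
import numpy as np


def emphasize_band(signal, sample_rate, lo_hz=500.0, hi_hz=1500.0,
                   boost_db=6.0):
    """Boost one frequency band of a signal by a fixed gain.

    Simplification: a brick-wall gain mask applied to the full-signal
    spectrum; a practical equalizer would use smooth filter slopes.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gain = np.ones_like(freqs)
    in_band = (freqs >= lo_hz) & (freqs <= hi_hz)
    gain[in_band] = 10.0 ** (boost_db / 20.0)  # dB to linear amplitude
    return np.fft.irfft(spectrum * gain, n=len(signal))


# One second of audio at an assumed 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
tone_in_band = np.sin(2 * np.pi * 1000.0 * t)   # inside 500-1500 Hz
tone_out_band = np.sin(2 * np.pi * 3000.0 * t)  # outside the band
boosted = emphasize_band(tone_in_band, sr)
unchanged = emphasize_band(tone_out_band, sr)
```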

[0279] The intermediate Tx signal 757 (e.g., obtained speech data recorded by the vibration-sensitive microphone 317) may subsequently be modified in a processing step 761 by mixing the signal 757 with noise data from the Tx noise signal database 739, thereby generating training Tx signal 765 data simulating vibration-sensitive microphone based speech signals under varying noise conditions representative of demanding operational environments. Similarly, the intermediate ambient signal 759 (e.g., obtained speech data recorded by the ambient microphone 315) may also be modified in a processing step 763 by mixing the audio signal 759 with noise data from the ambient noise signal database 743, thereby generating training ambient signal 767 data simulating ambient microphone based speech signals under varying noise conditions representative of demanding operational environments. Thus, a reference voice signal 753 (i.e. a ground truth of the target output set 705) and the corresponding pair of a training Tx signal 765 and a training ambient signal 767 (i.e. the associated training data of the training data set 707) may together constitute one data entity being a paired data record. The data set 707 may comprise multiple pairs of training Tx signal / data 765 and training ambient signal / data 767, both with and without noise data.
Thus, the method of generating a training dataset for the machine learning model further comprises the step of generating paired data records (e.g., training Tx signal 765 and a training ambient signal 767) by mixing the obtained speech data from the vibration-sensitive microphone (e.g., intermediate Tx signal 757) and the ambient microphone (e.g., intermediate ambient signal 759) with the corresponding noise data (e.g., Tx noise signal database 739 and ambient noise signal database 743) from the respective microphones 317, 315, thereby simulating speech signals under varying noise conditions representative of demanding operational environments (e.g., training Tx signal 765 and training ambient signal 767). The method further comprises associating each paired data record (e.g., training Tx signal 765 and training ambient signal 767) with a reference voice signal 753, the reference voice signal 753 being the high-quality airborne speech 747 from the user as captured by the external microphone 745.
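The mixing of processing steps 761 and 763 and the association with a reference voice signal may be sketched as below. The disclosure only states that speech data and noise data are mixed; the additive mixing with one shared `noise_gain` for both channels (which preserves the relative noise level between the simultaneously recorded Tx and ambient noise), the dictionary layout and the function name `make_paired_record` are assumptions of this example.

```python
import numpy as np


def make_paired_record(tx_speech, ambient_speech, tx_noise, ambient_noise,
                       reference, noise_gain=1.0):
    """Build one paired data record from clean speech and noise recordings.

    noise_gain = 0.0 yields a record without noise data.
    """
    arrays = [np.asarray(a, dtype=float) for a in
              (tx_speech, ambient_speech, tx_noise, ambient_noise, reference)]
    n = min(len(a) for a in arrays)  # truncate to the common length
    tx_s, amb_s, tx_n, amb_n, ref = (a[:n] for a in arrays)
    return {
        "training_tx": tx_s + noise_gain * tx_n,          # cf. step 761
        "training_ambient": amb_s + noise_gain * amb_n,   # cf. step 763
        "reference": ref,                                 # ground truth
    }


tx_s = np.array([1.0, 2.0, 3.0])
amb_s = np.array([0.5, 0.5, 0.5])
tx_n = np.array([0.1, 0.1, 0.1, 0.1])
amb_n = np.array([1.0, 1.0, 1.0, 1.0])
ref = np.array([1.0, 1.0, 1.0])
noisy = make_paired_record(tx_s, amb_s, tx_n, amb_n, ref, noise_gain=2.0)
clean = make_paired_record(tx_s, amb_s, tx_n, amb_n, ref, noise_gain=0.0)
```

Varying `noise_gain` and the selected noise segment across records is one possible way to cover the varying noise conditions mentioned above.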

[0280] In summary, obtaining data according to the first way comprises providing an acoustic signal by a microphone or transducer of a first type being a high quality professional stationary voice recording microphone or transducer (e.g. 745). Obtaining data according to the second way comprises providing an acoustic signal by a microphone or transducer of a second type being a vibration pick-up sensor or vibration sensitive transducer (e.g. 317 in Fig. 3C), preferably of an in-ear communication and hearing protection device 103. Obtaining data according to the third way comprises providing an acoustic signal by a microphone or transducer of a third type being an ambient microphone (e.g. 315 in Fig. 3C).

[0281] Fig. 7C illustrates a graphical representation of three exemplary audio signal data in a first subplot 769, a second subplot 771, and a third subplot 773 arranged in a vertical stack, representing training data (e.g., paired data records) used to train the neural network model 713 to obtain a trained neural network model 607 / 515 according to the training method 701. The third subplot 773 in Fig. 7C shows an example of the first data (e.g. 705, 753) representing a reference speech signal including a speech signal 747 of the test subject 101’ obtained in accordance with the first way. Thus, the third subplot 773 represents the corresponding reference voice signal, the reference voice signal being obtained using an external microphone positioned to capture high-quality airborne speech from the user. The first subplot 769 in Fig. 7C shows an example of the second data 707, 709, 765 representing a training transmit (Tx) signal including at least the speech signal 747 obtained in accordance with the second way. Thus, subplot 769 represents speech data obtained by acquiring a response signal from the vibration-sensitive microphone in a speech mode with a silent background. The second subplot 771 shows an example of the third data 707, 711, 767 representing a training ambient signal including at least the speech signal 747 obtained in accordance with the third way. Thus, subplot 771 represents speech data obtained by acquiring a response signal from the ambient microphone in a speech mode with a silent background. The subplots 769, 771, 773 are illustrated in a vertical stack arrangement aligned according to the same x-axis being time, as the first (e.g. 749), second (e.g. 757) and third data (e.g. 759) are respectively obtained at substantially the same time in response to a speech signal 747 of the test subject 101’ (e.g., simultaneously obtained response signals from the vibration-sensitive microphone, the ambient microphone and an external microphone).
Each of the subplots 769, 771, 773 may have different scales on their respective y-axis, shown in arbitrary units (AU). The y-axes of the individual subplots 769, 771, 773 may be synchronized so that the data scales relative to each other in a meaningful way, such that changes or trends in the signals can be compared proportionally, even though the signals might have different ranges. The audio signal data represented in the first subplot 769, the second subplot 771, and the third subplot 773 may be obtained according to the second mode being a “speech mode with a silent background” 727 used to obtain clear speech signals from the in-ear tactical communication and hearing protection system 301, as described previously.

[0282] In one embodiment, the dashed box 779 in Fig. 7C shows an example of a data entity (e.g. target output and training data pair) being a paired data record including a part of the reference voice signal 753 (segment in third subplot 773) being the target output 705 element (e.g., reference voice signal) and a corresponding pair including a part of the training Tx signal 765 (segment in first subplot 769) and a part of the training ambient signal 767 (segment in second subplot 771) being the training data 707 element of a data entity. A signal part may vary in length between 10 ms and 10 s, preferably between 20 ms and 150 ms for real-time performance. The dashed box 779 in Fig. 7C may thus represent an example of a data entity including target output (i.e. ground truth) and corresponding training data, not including noise representative of loud noises of a demanding environment, used to train an artificial intelligence or machine learning model, and may thus advantageously be used for speech optimization purposes.
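As an illustration of how time-aligned signal parts of the kind marked by the dashed box 779 might be cut from simultaneously recorded signals, consider the following sketch. The function name `segment_records`, the tuple layout and the non-overlapping segmentation are assumptions of this example; only the 10 ms to 10 s range for a signal part is taken from the description.

```python
import numpy as np


def segment_records(tx, ambient, reference, sample_rate, segment_ms=100.0):
    """Split three time-aligned signals into equal-length paired segments.

    Returns a list of (tx_part, ambient_part, reference_part) tuples.
    """
    if not 10.0 <= segment_ms <= 10_000.0:
        raise ValueError("segment length must be between 10 ms and 10 s")
    seg = int(round(sample_rate * segment_ms / 1000.0))
    count = min(len(tx), len(ambient), len(reference)) // seg
    records = []
    for i in range(count):
        part = slice(i * seg, (i + 1) * seg)  # same time span in all signals
        records.append((tx[part], ambient[part], reference[part]))
    return records


# One second of placeholder data at an assumed 16 kHz sample rate.
sr = 16000
tx = np.arange(sr, dtype=float)
ambient = np.zeros(sr)
reference = np.ones(sr)
records = segment_records(tx, ambient, reference, sr, segment_ms=100.0)
```

Using the same slice for all three signals keeps each segment of the training data aligned in time with its ground truth segment.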

[0283] Fig. 7D illustrates a graphical representation of two exemplary audio signal data in a fourth subplot 775 and a fifth subplot 777 arranged in a vertical stack, representing noise data used to train the neural network model 713 to obtain a trained neural network model 607 / 515 according to the training method 701. The fourth subplot 775 in Fig. 7D shows exemplary noise data representing a training transmit (Tx) signal 737 including noise (e.g. 731, 733) being representative of loud noises of a demanding environment, obtained in accordance with the second way. Thus, the fourth subplot 775 represents noise data obtained by recording output signals from the vibration-sensitive microphone while in a mute mode (i.e., the user is not speaking), under varying noise conditions representative of demanding operational environments. The fifth subplot 777 in Fig. 7D shows exemplary noise data representing a training ambient signal 741 including noise (e.g. 731, 733) being representative of loud noises of a demanding environment, obtained in accordance with the third way. Thus, the fifth subplot 777 represents noise data obtained by recording output signals from the ambient microphone while in a mute mode (i.e., the user is not speaking), under varying noise conditions representative of demanding operational environments.

[0284] The subplots 775, 777 are illustrated in a vertical stack arrangement aligned according to the same x-axis being time, as the noise data 737, 741 are obtained at substantially the same time in response to loud noises 731 of a demanding environment (e.g., simultaneously). The fourth subplot 775 and the fifth subplot 777 may have different scales on their respective y-axis, shown in arbitrary units (AU). The y-axes of the individual subplots 775, 777 may be synchronized so that the data scales relative to each other in a meaningful way, such that changes or trends in the signals can be compared proportionally, even though the signals might have different ranges. The audio signal data represented in the fourth subplot 775 and fifth subplot 777 may be obtained according to the first mode being a “mute mode with high noise background” 725 used to obtain noisy response signals from the in-ear tactical communication and hearing protection system 301, as described previously.

[0285] In another embodiment, the dashed box 781 in Fig. 7D shows an example of a data entity including a pair of noise data, representative of loud noises of a demanding environment, including a part of the noise Tx signal 737 (segment in fourth subplot 775) and a part of the noise ambient signal 741 (segment in fifth subplot 777). A signal part may vary in length between 10 ms and 10 s, preferably between 20 ms and 150 ms for real-time performance. The parts of the noise signals 737, 741 as shown in the dashed box 781 in Fig. 7D may be mixed with the speech signal pair of the part of the training Tx signal 765 (segment in first subplot 769) and a part of the training ambient signal 767 (segment in second subplot 771) shown in Fig. 7C, such that the second data (e.g. 707, 709, 765), representing a training transmit (Tx) signal including the speech signal (e.g. 747), further includes noise (e.g. 731, 733) being representative of loud noises of a demanding environment, and the third data (e.g. 707, 711, 767), representing a training ambient signal including the speech signal (747), further includes noise (e.g. 731, 733) being representative of loud noises of a demanding environment. Hence, the dashed box 779 in Fig. 7C together with the dashed box 781 in Fig. 7D may represent an example of a data entity including target output (i.e. ground truth) and corresponding training data, including noise representative of loud noises of a demanding environment, used to train the artificial intelligence or machine learning model 713, and may thus advantageously be used for both noise suppression and speech optimization purposes. In other words, and according to some embodiments, the trained artificial intelligence or machine learning method or component has been trained on a dataset comprising paired data records using supervised learning, where each data record includes:

[0286] (i) speech signals obtained simultaneously from a vibration-sensitive transmit microphone and an ambient microphone under varying noise conditions representative of demanding operational environments; and

[0287] (ii) a corresponding reference voice signal, the reference voice signal being obtained using an external microphone positioned to capture high-quality airborne speech from the user.
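As an illustration only, a paired data record of the kind listed above may be sketched as a small container holding the three time-aligned signals, together with a helper that cuts a record into signal parts of a chosen length (e.g. within the 20 ms to 150 ms range mentioned above). The names `PairedRecord` and `split_into_parts`, the use of NumPy arrays, and the sample rate in the usage note are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PairedRecord:
    """One paired data record (hypothetical layout, for illustration)."""
    tx: np.ndarray         # vibration-sensitive transmit (Tx) microphone samples
    ambient: np.ndarray    # ambient microphone samples, recorded simultaneously
    reference: np.ndarray  # external-microphone reference voice signal


def split_into_parts(record, sample_rate, part_ms):
    """Cut a record into consecutive, time-aligned parts of part_ms milliseconds.

    Trailing samples that do not fill a whole part are dropped.
    """
    n = int(sample_rate * part_ms / 1000)
    if n <= 0:
        raise ValueError("part length must cover at least one sample")
    usable = min(len(record.tx), len(record.ambient), len(record.reference))
    parts = []
    for i in range(usable // n):
        start, stop = i * n, (i + 1) * n
        parts.append(PairedRecord(record.tx[start:stop],
                                  record.ambient[start:stop],
                                  record.reference[start:stop]))
    return parts
```

For example, at an assumed sample rate of 16 kHz, a part length of 20 ms corresponds to 320 samples per channel.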

[0288] Advantageously, both data entities representing training data including noise representative of loud noises of a demanding environment and training data not including noise representative of loud noises of a demanding environment may form part of the data framework (e.g. collection of data entities) used to train the neural network model 713 according to the training method 701, such that the final Tx output 617 / 525 of the trained neural network model 607 / 515 may be optimized for speech intelligibility and noise suppression and thereby be enabled to provide clear and undistorted communication in demanding environments.

[0289] According to an aspect, the disclosure relates to a computer-implemented method of generating a training dataset for a machine learning model, particularly the machine learning model according to the present disclosure, comprising:

[0290] obtaining noise data by recording simultaneous output signals from the vibration-sensitive microphone (see e.g. 317 in Fig. 3C) and the ambient microphone (see e.g. 315 in Fig. 3C) while in a mute mode (i.e., the user is not speaking), under varying noise conditions representative of demanding operational environments (see noise segments 731 and databases 733, 739, 743 in Figs. 7B, 7D);

obtaining speech data by recording simultaneous output signals from the vibration-sensitive microphone, the ambient microphone, and an external microphone (see 745 in Fig. 7B) positioned to capture high-quality airborne speech from the user, while the user is speaking in a silent background environment (see signals 747, 749 in Figs. 7B-7C);

[0291] generating paired data records by mixing the obtained speech data from the vibration-sensitive and ambient microphones with the corresponding noise data from the respective microphones (see mixing steps 761, 763 in Fig. 7B), thereby simulating speech signals under varying noise conditions representative of demanding operational environments; and

associating each paired data record with a reference voice signal, the reference voice signal being the high-quality airborne speech from the user as captured by the external microphone (see reference signal 753 in Fig. 7C).

[0292] This approach to generating training data, as illustrated in Figs. 7A-7D, offers several important benefits for the development of robust machine learning models in tactical communication systems.

[0293] By capturing simultaneous noise responses from both the vibration-sensitive microphone (see e.g. 317 in Fig. 3C) and the ambient microphone (see e.g. 315 in Fig. 3C) in mute mode under a variety of realistic noise conditions (see noise segments 731 and databases 733, 739, 743 in Figs. 7B, 7D), the method ensures that the noise data accurately reflects the full range of operational environments, including environmental noise, involuntary user sounds, and equipment-induced vibrations. The separate acquisition of speech data from the vibration-sensitive microphone, the ambient microphone, and an external microphone positioned to capture high-quality airborne speech in a silent background (see external microphone 745 and signals 747, 749 in Figs. 7B-7C) provides clean and uncontaminated speech signals for each channel, as well as a reliable reference voice signal (reference signal 753 in Fig. 7C). By subsequently mixing the obtained speech data with the corresponding noise data for each microphone (see mixing steps 761, 763 in Fig. 7B), paired data records can be created that simulate speech signals under a wide range of noise conditions, with precise control over the signal-to-noise ratio. This enables the training dataset to comprehensively represent the spectrum of operational scenarios, which in turn improves the model’s ability to generalize and perform reliably in real-world environments. Associating each paired data record with a reference voice signal captured by the external microphone (see 753 in Fig. 7C) ensures that the model is trained to optimize for speech intelligibility and naturalness as perceived in ideal conditions. As a result, the machine learning model is able to learn optimal strategies for combining and processing the two microphone signals in the presence of noise, supporting supervised learning with accurate input-output pairs and ultimately delivering improved speech intelligibility, noise suppression, and communication reliability even in demanding and unpredictable acoustic environments.
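The mixing of clean speech with separately recorded noise at a controlled signal-to-noise ratio may, purely by way of example, be sketched as follows. The function name `mix_pair`, the choice to define the signal-to-noise ratio on the ambient channel, and the use of a single scale factor for both noise channels (so that the relative level of the simultaneously recorded Tx and ambient noise is preserved) are assumptions of this sketch; the disclosure does not prescribe a particular mixing rule.

```python
import numpy as np


def mix_pair(speech_tx, speech_amb, noise_tx, noise_amb, snr_db):
    """Mix clean speech with mute-mode noise for both microphone channels.

    The noise is scaled so that the ambient channel reaches snr_db
    (speech power over noise power, in dB). The same scale factor is applied
    to the Tx noise, an assumption made here to keep the recorded level
    relationship between the two noise channels intact.
    """
    speech_tx, speech_amb, noise_tx, noise_amb = (
        np.asarray(x, dtype=float)
        for x in (speech_tx, speech_amb, noise_tx, noise_amb)
    )
    n = len(speech_amb)
    if len(speech_tx) != n or len(noise_tx) < n or len(noise_amb) < n:
        raise ValueError("speech channels must match and noise must be long enough")
    noise_tx, noise_amb = noise_tx[:n], noise_amb[:n]

    speech_power = np.mean(speech_amb ** 2)
    noise_power = np.mean(noise_amb ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # Solve speech_power / (scale**2 * noise_power) = 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech_tx + scale * noise_tx, speech_amb + scale * noise_amb
```

Sweeping `snr_db` over a range of values for the same speech and noise parts would yield several paired data records from one recording, each associated with the same reference voice signal.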

[0294] Accordingly, at least in some embodiments, a method (701) of training an artificial intelligence or machine learning method or component (515, 607, 713) to be executed by at least one device (103, 303a, 303b, 107, 109, 111, 205) of a communication system is provided, where the artificial intelligence or machine learning method or component (515, 607) is configured to generate real-time processing of a user’s voice signal (525) in a demanding environment, the method (701) comprising

[0295] a) obtaining first data (705, 753) representing a reference speech signal including a speech signal (747) of a user (101, 10T) obtained in accordance with a first way,

[0296] b) obtaining second data (707, 709, 765) representing a training transmit (Tx) signal including the speech signal (747) obtained in accordance with a second way,

[0297] c) obtaining third data (707, 711, 767) representing a training ambient signal including the speech signal (747) obtained in accordance with a third way,

d) providing the second data and the third data to the artificial intelligence or machine learning method or component (515, 607, 713) generating a predicted output (715, 717) in response thereto,

[0298] e) comparing (719) the predicted output (715, 717) and the first data (705, 753) and determining (721) a difference therebetween, and

[0299] f) updating parameters (645) of the artificial intelligence or machine learning method or component (515, 607, 713) in response to the determined difference (721, 723, 735), wherein the method (701) further comprises repeating steps a) - f) for new first, second, and third data a plurality of times, typically a large number of times, until the generated difference of the predicted output (715, 717) and the first data (705, 753) is within a predetermined threshold or the generated difference stops improving sufficiently from cycle to cycle.
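The termination condition of steps a) - f) may be illustrated by the following minimal sketch, which stops either when the latest difference is within a threshold or when the difference has stopped improving sufficiently over recent cycles. The function name `should_stop` and the parameters `threshold`, `min_improvement` and `patience` are hypothetical and chosen for this example only.

```python
def should_stop(loss_history, threshold, min_improvement, patience):
    """Decide whether to stop repeating the training cycle.

    loss_history holds one difference (loss) value per completed cycle.
    Stop when the latest value is within threshold, or when the best value of
    the last `patience` cycles improves on the best earlier value by less
    than min_improvement.
    """
    if not loss_history:
        return False
    if loss_history[-1] <= threshold:
        return True
    if len(loss_history) <= patience:
        return False  # not enough cycles yet to judge a plateau
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_improvement
```

In a training loop, the function would be called once per cycle after the difference has been determined in step e).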

[0300] Additionally, in some further embodiments,

[0301] the second data (707, 709, 765), representing a training transmit (Tx) signal including the speech signal (747), further includes user-generated noise (e.g. natural or involuntary noises), e.g. or preferably obtained in accordance with the second way (757, 761), and / or the third data (707, 711, 767), representing a training ambient signal including the speech signal (747), further includes user-generated noise, e.g. or preferably obtained in accordance with the third way (759, 763).

[0302] Additionally, in some further embodiments,

[0303] the second data (707, 709, 765), representing a training transmit (Tx) signal including the speech signal (747), further includes noise (731, 733) being representative of loud noises of a demanding environment,

[0304] the third data (707, 711, 767), representing a training ambient signal including the speech signal (747), further includes noise (731, 733) being representative of loud noises of a demanding environment, and

[0305] the first data (705, 753), representing a reference speech signal including the speech signal (747), does not include noise being representative of loud noises of a demanding environment.

[0306] A comprehensive collection of data entities of respective reference voice signals 753 and corresponding pairs of a respective training Tx signal 765 and a training ambient signal 767 constituting the target output set 705 and the training data set 707, respectively, may be generated by repeating the data collection process 703 using multiple test subjects from multiple nationalities and different genders as well as obtaining ambient noise data 741 and Tx noise data 737 involving a range of different relevant noise environments and combining and mixing audio segments and signal parts from the different signals (e.g. 737, 741, 757, 759). Before starting the training process 701, the target output set 705 and the training data set 707 may be split into two data parts. One data part (i.e. a verification data pool) may be used for evaluation of the performance of the neural network model 713 after the training process 701 is completed, and one data part for actual training of the neural network model 713.
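The split into a training part and a verification data pool may, for illustration, be sketched as a seeded random split over data entities. The function name `split_entities`, the default fraction of 20 % and the fixed seed are assumptions of this sketch; the disclosure does not state how the split is performed or in what proportion.

```python
import random


def split_entities(entities, verification_fraction=0.2, seed=0):
    """Split data entities into (training part, verification data pool).

    The entities are shuffled with a fixed seed so the split is reproducible.
    At least one entity is always placed in the verification pool.
    """
    if not 0.0 < verification_fraction < 1.0:
        raise ValueError("verification_fraction must be between 0 and 1")
    shuffled = list(entities)
    random.Random(seed).shuffle(shuffled)
    n_verify = max(1, int(round(len(shuffled) * verification_fraction)))
    return shuffled[n_verify:], shuffled[:n_verify]
```

Where several data entities stem from the same test subject, it may be preferable to split by subject rather than by entity so that the verification data pool contains only unseen speakers; this is a general practice and is not stated in the disclosure.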

[0307] Referring now to the training process illustrated in Fig. 7A, the training Tx signals 765 in the training data set 707 may be processed into Tx input data 709 having a suitable format for training the neural network model 713, which may be or preferably is similar to the data processing described in relation to Fig. 6B. Similarly, the training ambient signals 767 in the training data set 707 may be processed into Ambient input data 711 also having a suitable format for training the neural network model 713, e.g. or preferably similar to the data processing described in relation to Fig. 6C.

[0308] The architecture and type of the untrained neural network model 713 may be similar to the trained neural network 607 / 515 of Fig. 6D, the difference being that the untrained neural network model 713 may initially have more or less arbitrary weights 645 and biases assigned (or alternatively set in any other suitable way) to the individual nodes 641 when the training process 701 is started. The goal of the training process 701 may be to update the weights 645 and biases of the individual nodes in an iterative manner until the prediction 715 of the neural network 713 is very close or basically identical (for all practical purposes) to the target data set 705 (i.e. ground truth).

[0309] The training method 701 may be performed in an iterative manner, where each iteration cycle may comprise a forward pass followed by a backwards pass.

[0310] During a forward pass, the Tx input 709 and Ambient input 711 are passed through the neural network model 713 layer by layer (see e.g. 643b-d in Fig. 6D), where each layer applies transformations, such as weighted sums followed by activation functions. This process results in a prediction 715 as output (see e.g. also 609 in Fig. 6D). The prediction 715 may be forwarded to an output data 717 processing step performed e.g. or preferably similar to the combined noise filtration 613 / 523 and feature reconstruction step 615 (see e.g. Figs. 6A and 6F), where a duplet of the Tx input signal 709 is combined with the prediction 715, e.g. or preferably according to the processing explained in relation to Figs. 6F and 6G. By separating the prediction 715 and output data 717 processing steps into two individual steps, the training process 701 enables the neural network model 713 to generate the prediction 715 as a correction response (i.e. gain vector) to the Tx input 709 rather than data representing the fully corrected signal (e.g., improved output voice signal), which is advantageous as previously described. The signal from the output data step 717 may subsequently be directed to a comparison step 719. The comparison step 719 may contain a so-called “loss function”, which may be configured to compare the output data 717 (i.e. corrected Tx input data) to the corresponding true target value of the target output set 705 and calculate a loss score 721 as a quantitative value of the difference or "error" between the data sets. Thus, the loss score 721 indicates how far the model's prediction 715 is from the actual target (i.e. ground truth). For regression tasks, typical loss functions 719 may be the Mean Squared Error (MSE) or Mean Absolute Error (MAE). For example, MSE calculates the mean of the squared differences between the predicted output 717 and the ground truth 705. Designing a proper loss function for audio processing tasks in machine learning is significant, as it guides the model during training to focus on relevant aspects of the audio signal (such as noise removal) while preserving and enhancing useful information (i.e. the speech signal) and avoiding trivial solutions such as removing all of the audio signal or introducing distortions. Advantageously, effective loss functions for the audio processing may be MSE with weighted frequency bands or a hybrid loss function formed as a combination of different loss functions.
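The separation between the prediction 715 (a gain vector) and the output data 717 (the corrected Tx input), and the comparison against the target by a loss function, may be illustrated with the minimal sketch below. It assumes that the gain is applied by element-wise multiplication to a per-frequency-bin magnitude representation of the Tx input, and that the band weighting is a simple per-bin weight vector; both are assumptions for this example, as are the function names.

```python
import numpy as np


def apply_gain(tx_magnitude, gain):
    """Corrected output: Tx magnitude scaled bin by bin by the predicted gain."""
    return np.asarray(tx_magnitude, dtype=float) * np.asarray(gain, dtype=float)


def mse(predicted, target):
    """Mean Squared Error between corrected output and ground truth."""
    err = np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(err ** 2))


def band_weighted_mse(predicted, target, band_weights):
    """MSE in which each frequency bin contributes according to its weight.

    Inputs may have shape (bins,) or (frames, bins); band_weights has shape
    (bins,). The weighted error is normalized by the sum of the weights and
    averaged over frames.
    """
    err = np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)
    w = np.asarray(band_weights, dtype=float)
    per_frame = np.sum(w * err ** 2, axis=-1) / np.sum(w)
    return float(np.mean(per_frame))
```

Giving speech-relevant bins larger weights makes errors in those bins more costly than errors elsewhere, which is one way of steering training towards speech intelligibility.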

[0311] In the subsequent backwards pass, the loss score 721 may be used in the optimizer step 723, where the gradient of the loss function may be computed with respect to each weight (see e.g. 645 in Fig. 6D) of the neural network model 713. The calculation of the gradients for each individual weight 645 may be achieved using a method called back-propagation, which applies the chain rule of calculus. Back-propagation computes the gradient of the loss function 719 for each weight 645 by propagating the error backward through the layers of the neural network model 713. The gradient may be calculated as the partial derivative of the loss function 719 with respect to the network's weights 645. Thus, the gradient may represent the rate of change of the loss function 719 when a small perturbation is applied to the weights 645 one by one. The gradients calculated in the optimizer step 723 may thus indicate how much and in which direction (positive or negative) a particular weight 645 of the neural network model 713 should be adjusted to minimize the loss (i.e. loss score 721). The subsequent weight updater step 724 may then apply the gradients and adjust the parameters (i.e. weights and biases) of the neural network model 713 using an optimization algorithm. A common algorithm used for this purpose is e.g. Gradient Descent (or its variants, like Stochastic Gradient Descent (SGD)).
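A backward pass and weight update of the kind described above may be illustrated on a deliberately small model: a single layer that maps input features to a per-bin gain through a sigmoid, with the gain applied to the Tx magnitude and the result compared to the target by MSE. The single-layer model, the sigmoid activation and all names below are assumptions chosen to keep the example short; they are not the architecture of the neural network model 713.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grads(W, b, features, tx, target):
    """Forward pass, MSE loss, and gradients obtained with the chain rule."""
    # Forward pass: features -> gain vector -> corrected output -> loss.
    z = W @ features + b
    gain = sigmoid(z)
    output = tx * gain
    err = output - target
    loss = float(np.mean(err ** 2))

    # Backward pass: propagate the error from the loss back to W and b.
    d_output = 2.0 * err / err.size     # dLoss / dOutput
    d_gain = d_output * tx              # output = tx * gain
    d_z = d_gain * gain * (1.0 - gain)  # derivative of the sigmoid
    grad_W = np.outer(d_z, features)    # z = W @ features + b
    grad_b = d_z
    return loss, grad_W, grad_b


def sgd_step(W, b, grad_W, grad_b, learning_rate):
    """Gradient descent update: move each parameter against its gradient."""
    return W - learning_rate * grad_W, b - learning_rate * grad_b
```

Repeating `loss_and_grads` followed by `sgd_step` corresponds to consecutive forward and backward passes; comparing an analytic gradient entry with a finite-difference estimate is a common way to check that the chain-rule derivation is correct.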

[0312] The training method 701 may perform consecutive cycles of forward and backward passes to continuously train the neural network model 713, gradually reducing the loss function's value as the network optimizes its parameters. The process of making predictions, calculating the loss, performing back-propagation, and updating the weights may be repeated for every data entity of reference voice signal 753 and corresponding pairs of training Tx signal 765 and training ambient signal 767 of the target output set 705 and the training data set 707. The training method may be completed when reaching a point where the loss scores 721 are minimized or sufficiently small, indicating that the neural network model has learned the relationship between the input features 707 and the ground truth 705 (sufficiently or adequately for the given purpose). The training method may be stopped when convergence occurs, i.e. when the loss scores 721 stop decreasing significantly, thereby indicating that the neural network model 713 has reached or is close to an optimal set of weights 645. A final step may be a validation step using the verification data pool. The neural network model 713 may be evaluated on the verification data pool to check the performance on ‘unseen’ data (i.e. not part of the training as such). The same comparison step 719, i.e. loss function (MSE, MAE, etc.), may be used to measure how well the model performs (e.g. generalizes). Additional metrics like R-squared (for adequacy of fit), RMSE, or adjusted R-squared might be used to evaluate the neural network model 713. When the training process 701 is completed and validation is successful, the trained neural network model 607 / 515 / 713 may be deployed in the in-ear tactical communication and hearing protection system 301 performing the real-time voice processing 601 as shown in Fig. 6A. As a natural consequence of the described training method 701 and the system architecture illustrated in Figs. 5 and 6A-6G, the trained machine learning model 515 / 607 is configured to dynamically generate a correction function or signal 517 / 609 by adjusting the relative contributions of the ambient and transmit microphone input signals (e.g., ambient microphone input signal (505a, 505b, 513) and transmit microphone input signal (503a, 503b, 511)) according to the surrounding environmental sounds captured by the ambient microphone (315, 315a, 315b). This adaptive behaviour derives from the supervised learning process using paired data records 707 that include both microphone signals under varying noise conditions and a high-quality reference voice signal 705. By learning the statistical relationships between noise levels (e.g., airborne acoustic components of the surrounding environmental sounds) and signal quality during training 701, the model 515 / 607 can infer optimal weighting strategies for real-time operation. In low-noise environments, where airborne speech components are less corrupted, the contribution from the ambient microphone input is increased to preserve naturalness and full-bandwidth fidelity. Conversely, in high-noise environments, where airborne signals are heavily masked, the contribution from the vibration-sensitive microphone input is increased to maintain intelligibility and suppress environmental noise. This dynamic adjustment ensures that the output voice signal remains optimized for clarity and robustness across a wide range of acoustic conditions, thereby achieving the technical effect of improved speech intelligibility and reliable communication in demanding operational environments. Thus, according to an aspect, the present disclosure relates to a communication system for enhancing speech intelligibility and maintaining robust voice transmission in a demanding environment, the system comprising:

[0313] at least one communication and hearing protection device (103, 303a, 303b), in particular at least one in-ear communication and hearing protection device (103, 303a, 303b), configured to be worn by a user (101), the communication and hearing protection device (103, 303a, 303b) comprising:

[0314] an ambient microphone (315, 315a, 315b) configured to provide an ambient microphone input signal (505a, 505b, 513) comprising, when the user is speaking, airborne acoustic components (747) of the user’s voice and surrounding environmental sounds (731), and

[0315] a transmit (Tx) microphone (317, 317a, 317b) comprising a vibration-based transducer, mechanically coupled to the user and configured to provide a transmit microphone input signal (503a, 503b, 511) comprising, when the user is speaking, speech vibrations (755) conducted through the user’s jawbone and / or tissue;

one or more processors (205, 207) configured to implement and execute a trained artificial intelligence or machine learning method or component (515, 607), the processors being further configured to:

[0316] receive the ambient microphone input signal (505a, 505b, 513) and the transmit microphone input signal (503a, 503b, 511),

[0317] generate, based on the received signals (505a, 505b, 513; 503a, 503b, 511), a combined correction function or signal (517, 609), and

[0318] provide, using the correction function or signal (517, 609), an improved output voice signal (525, 617), to be transmitted (e.g., via radio (109, 100)) to one or more other users and / or one or more devices;

wherein the correction function or signal (517, 609) is dynamically generated, by the trained artificial intelligence or machine learning method or component (515, 607), by adjusting the relative contributions of the ambient and transmit microphone input signals (503a, 503b, 511, 505a, 505b, 513) according to the surrounding environmental sounds (731) of the ambient input signal (505a, 505b, 513) such that in low-noise environments (e.g., below 85 dB) the contribution from the ambient input signal (505a, 505b, 513) is increased, and in high-noise environments (e.g., above 85 dB) the contribution from the transmit microphone input signal (503a, 503b, 511) is increased, thereby maintaining optimal speech intelligibility and noise suppression across varying demanding environments.
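In the disclosed system the relative contributions are produced by the trained model and are not hand-coded. Solely to visualize the described tendency, the following sketch uses a hand-written smooth weighting around the exemplary 85 dB level: the ambient contribution dominates below it and the transmit microphone contribution dominates above it. The logistic shape, the `width_db` parameter and the function name are assumptions of this illustration and do not describe the model's learned behaviour.

```python
import numpy as np


def contribution_weights(noise_db, pivot_db=85.0, width_db=10.0):
    """Return (ambient weight, Tx weight) for an estimated noise level in dB.

    The weights sum to one. The Tx weight rises smoothly from 0 towards 1 as
    the noise level passes pivot_db; width_db sets how gradual the change is.
    """
    w_tx = 1.0 / (1.0 + np.exp(-(noise_db - pivot_db) / width_db))
    return 1.0 - w_tx, w_tx


def blend(ambient_magnitude, tx_magnitude, noise_db):
    """Weighted combination of the two microphone magnitude components."""
    w_amb, w_tx = contribution_weights(noise_db)
    return (w_amb * np.asarray(ambient_magnitude, dtype=float)
            + w_tx * np.asarray(tx_magnitude, dtype=float))
```

A learned model can depart from such a fixed curve, for example by weighting individual frequency bands differently depending on the noise spectrum.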

[0319] The disclosed system and method employ a trained machine learning model (e.g., see 515, 607) for audio signal processing, providing technical advantages over traditional audio signal processing algorithms that operate based on fixed mathematical assumptions, manually engineered features, and predetermined rule sets. Conventional algorithms are typically optimized for controlled or idealized audio conditions and often assume stationary signals, limited background interference, or isolated sound sources. In contrast, the trained machine learning model (e.g., see 515, 607) according to the disclosure is capable of processing complex, real-world audio signals that include background noise, overlapping speakers or sound sources, non-stationary signal characteristics, and variations in accent, speaking style, recording device, and acoustic environment, as illustrated in Figs. 6A-6G and elsewhere. Whereas traditional algorithms frequently exhibit performance degradation when such conditions deviate from their design assumptions, the disclosed machine learning model (515, 607) learns non-linear and context-dependent relationships directly from diverse training data (see Figs. 7A-7D, and elsewhere), enabling improved robustness and generalization across operating conditions. The machine learning model (e.g., see 515, 607) further replaces or augments classical feature extraction techniques, such as fixed spectral or temporal descriptors, by learning task-specific representations that adapt dynamically to signal variability. This learned representation enables the model to exploit the complementary characteristics of airborne and bone-conducted speech signals (e.g., see 513, 511) and to disambiguate overlapping or distorted audio components that traditional rule-based filtering, thresholding, or window-based analysis techniques are unable to reliably separate or merge. Additionally, the trained machine learning model captures both short-term and long-term temporal dependencies across the airborne and bone-conducted speech signals (e.g., see 513, 511), enabling improved handling of speech and environmental sound patterns that evolve over time. Traditional algorithms, which are often constrained to short analysis windows and linear processing assumptions, lack the capacity to model such extended temporal context and therefore fail to maintain accuracy under dynamic or ambiguous audio conditions. Accordingly, the disclosed approach achieves superior speech intelligibility, robust voice transmission, and adaptability in real-world audio processing scenarios relative to classic audio signal processing algorithms.

[0320] Some preferred embodiments have been shown in the foregoing, but it should be stressed that the invention is not limited to these, but may be embodied in other ways within the subject matter defined in the following claims. It should be emphasized that the term "comprises / comprising" when used in this specification is taken to specify the presence of stated features, elements, steps or components but does not preclude the presence or addition of one or more other features, elements, steps, components or groups thereof. In the claims enumerating several features, some or all of these features may be embodied by one and the same element, component or item. The mere fact that certain measures are recited in mutually different dependent claims or described in different embodiments does not indicate that a combination of these measures cannot be used to advantage.

[0321] In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.

[0322] The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to an advantage.

[0323] It will be apparent to a person skilled in the art that the various embodiments of the invention as disclosed and / or elements thereof can be combined without departing from the scope of the invention as defined in the claims.

Claims

1. A communication system for enhancing speech intelligibility and maintaining robust voice transmission in a demanding environment, the system comprising:
at least one communication and hearing protection device (103, 303a, 303b), in particular at least one in-ear communication and hearing protection device (103, 303a, 303b), configured to be worn by a user (101), the communication and hearing protection device (103, 303a, 303b) comprising:
an ambient microphone (315, 315a, 315b) configured to provide an ambient microphone input signal (505a, 505b, 513) comprising, when the user is speaking, airborne acoustic components (747) of the user’s voice and surrounding environmental sounds (731), and
a transmit (Tx) microphone (317, 317a, 317b) comprising a vibration-based transducer, mechanically coupled to the user and configured to provide a transmit microphone input signal (503a, 503b, 511) comprising, when the user is speaking, speech vibrations (755) conducted through the user’s jawbone and / or tissue;
one or more processors (205, 207) configured to implement and execute a trained artificial intelligence or machine learning method or component (515, 607), the processors being further configured to:
receive the ambient microphone input signal (505a, 505b, 513) and the transmit microphone input signal (503a, 503b, 511),
generate, based on the received signals (505a, 505b, 513; 503a, 503b, 511), a combined correction function or signal (517, 609), and
provide, using the correction function or signal (517, 609), an improved output voice signal (525, 617), to be transmitted (e.g., via radio (109, 100)) to one or more other users and / or one or more devices;
wherein the correction function or signal (517, 609) is dynamically generated, by the trained artificial intelligence or machine learning method or component (515, 607), by adjusting the relative contributions of the ambient and transmit microphone input signals (503a, 503b, 511, 505a, 505b, 513) according to the surrounding environmental sounds (731) of the ambient input signal (505a, 505b, 513).

2. The communication system according to claim 1, wherein the correction function or signal (517, 609) is dynamically generated such that in low-noise environments the contribution from the ambient input signal (505a, 505b, 513) is increased, and in high-noise environments the contribution from the transmit microphone input signal (503a, 503b, 511) is increased, thereby maintaining optimal speech intelligibility and noise suppression across varying demanding environments.

3. The communication system according to claim 1 or claim 2, wherein the trained artificial intelligence or machine learning method or component (515, 607) has been trained on a dataset comprising paired data records, each data record including:
simultaneously recorded speech signals (765, 767) from both a vibration-sensitive transmit microphone (317, 317a, 317b) and an ambient microphone (315, 315a, 315b), captured under a range of background noise conditions representative of demanding operational environments (725, 727), and
a corresponding reference voice signal (753), the reference voice signal (753) being obtained using an external microphone (745) positioned to capture high-quality airborne speech from the user (101).

4. The communication system according to any one of claims 1 - 3, wherein the correction function or signal (517, 609) is a processed signal being the improved output voice signal (525, 617), to be transmitted (e.g., via radio (109, 111)) to one or more other users and / or one or more devices (107, 109, 111), directly generated by the trained artificial intelligence or machine learning method or component.

5. The communication system according to any one of claims 1 - 3, wherein the correction function or signal (517, 609) is a set of signal modification parameters that selectively enhances speech-relevant frequency bands and suppresses noise.

6. The communication system according to claim 5, wherein the step of providing, using the correction function or signal (517, 609), an improved output voice signal (525, 617), to be transmitted (e.g., via radio (109, 111)) to one or more other users and / or one or more devices, comprises applying the correction function or signal (517, 609) to the ambient microphone input signal (505a, 505b, 513) and / or the transmit microphone input signal (503a, 503b, 511).

7. The communication system according to any one of claims 5 or 6, wherein the generated correction function or signal is or comprises data representing a gain vector or similar, wherein the gain vector or similar comprises a number of predicted adjustment values, predicted by the trained artificial intelligence or machine learning method or component (515, 607), for different segments or parts of the transmit microphone input signal (503a, 503b, 511) or a processed version thereof, where an adjustment value for a particular segment or part specifies whether the transmit microphone input signal (503a, 503b, 511) or a processed version thereof in the particular segment or part should be kept or should be increased or decreased and to what extent.

8. The communication system according to claim 7, wherein the predicted adjustment values each are limited or clipped to be within predetermined maximum and / or minimum threshold, wherein the predetermined maximum and / or minimum threshold are set in response to a derived or estimated sound pressure value (637) representing or estimating an external background noise level, wherein the predetermined maximum and / or minimum threshold are set to relatively high threshold(s) in case the derived or estimated sound pressure value (637) indicates no or little background noise and are set to relatively low threshold(s) in case the derived or estimated sound pressure value (637) indicates a relatively high background noise.

9. The communication system according to any one of claims 5 - 8, wherein the communication system is configured to:
apply a processed version of the generated correction function or signal (517, 609) to the transmit microphone input signal (503a, 503b, 511) thereby filtering or reducing noise from the transmit microphone input signal (503a, 503b, 511) and providing the corrected or improved voice signal (525, 617).

10. The communication system according to any one of claims 1 - 9, wherein the transmit (Tx) microphone (317, 317a, 317b) is digital, and wherein the communication system or the communication and hearing protection device (103, 303a, 303b) comprises a dedicated direct digital-to-analog converter (DAC) circuitry (349) coupled to the transmit (Tx) microphone (317, 317a, 317b) and configured to perform lossless front-end digital to analog signal conversion.

11. The communication system according to claim 10, wherein the transmit (Tx) microphone (317, 317a, 317b) is configured to output a digital Pulse Density Modulation (PDM) signal representing an obtained voice signal of the user (101), and wherein the dedicated direct digital-to-analog converter (DAC) circuitry (349) is configured to receive the digital Pulse Density Modulation (PDM) signal and to convert it into an analog signal using a D-FlipFlop and an active lowpass filter, preferably applying a fourth order Bessel function, comprised by the dedicated direct digital-to-analog converter (DAC) circuitry (349).

12. The communication system according to any one of claims 1 - 11, wherein the one or more processors (205, 207) is further configured to process (603a, 619) the transmit microphone input signal (503a, 503b, 511) to generate a transmit magnitude component (621a) thereof, or to process (603a, 605a, 619, 625, 627) the transmit microphone input signal (503a, 503b, 511) to generate a further processed transmit magnitude component (629) thereof, and to provide the generated transmit magnitude component (621a) or the generated further processed transmit magnitude component (629) as input to the trained artificial intelligence or machine learning method or component (515, 607) to generate the correction function or signal (517, 609).
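One possible reading of a "magnitude component" is the magnitude of a short-time spectrum of the microphone signal. The claims in this section do not define the processing steps (603a, 619), so the sketch below is an assumption: it computes the magnitude of a direct DFT of a single frame, and the function name is hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |X[k]| for k = 0 .. N/2 of a real-valued frame, via a direct DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        # Correlate the frame with a complex exponential at bin k.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags
```

The phase is discarded here; a system that needs to resynthesise audio would have to keep it separately or reconstruct it.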

13. The communication system according to any one of claims 1 - 12, wherein the one or more processors (205, 207) is further configured to process (603b, 619) the ambient microphone input signal (505a, 505b, 513) to generate an ambient magnitude component (621b) thereof, or to process (603a, 605b, 619, 625, 627) the ambient microphone input signal (505a, 505b, 513) to generate a further processed ambient magnitude component (633) thereof, and to provide the generated ambient magnitude component (621b) or the generated further processed ambient magnitude component (633) as input to the trained artificial intelligence or machine learning method or component (515, 607) to generate the correction function or signal (517, 609).

14. The communication system according to claim 13, wherein the one or more processors (205, 207) is further configured to process (635) the generated ambient magnitude component (621b) to derive (635) a derived or estimated sound pressure value (637) thereof, the derived or estimated sound pressure value (637) representing or estimating a background noise level, to adjust (519, 611) the correction function or signal (517, 609), prior to the correction function or signal (517, 609) being applied to the transmit microphone input signal (503a, 503b, 511), in response to the derived or estimated sound pressure value (637) resulting in an adjusted correction function or signal (521, 649), and applying the adjusted correction function or signal (521, 649) to the transmit microphone input signal (503a, 503b, 511) instead of the correction function or signal (517, 609) in order to provide the corrected or improved voice signal (525, 617).
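The three steps of claim 14 (derive a level estimate, adjust the correction function, apply it) could look roughly like the sketch below. Everything concrete in it is assumed: the RMS-based level estimate, the calibration offset of 94 dB, the blending of per-bin gains toward 1.0 in quiet conditions, and the function names are illustrative choices, not the method disclosed in the application.

```python
import math


def estimate_level_db(ambient_magnitudes, calibration_db=94.0):
    """RMS of the magnitude bins in dB, shifted by a hypothetical calibration offset."""
    rms = math.sqrt(sum(m * m for m in ambient_magnitudes) / len(ambient_magnitudes))
    return 20.0 * math.log10(max(rms, 1e-12)) + calibration_db


def adjust_correction(correction, level_db, quiet_db=60.0, loud_db=100.0):
    """Blend per-bin gains toward 1.0 (no change) in quiet; keep them fully in loud noise."""
    strength = min(1.0, max(0.0, (level_db - quiet_db) / (loud_db - quiet_db)))
    return [1.0 + strength * (g - 1.0) for g in correction]


def apply_correction(tx_magnitudes, correction):
    """Multiply each transmit magnitude bin by its (adjusted) gain."""
    return [m * g for m, g in zip(tx_magnitudes, correction)]
```

A caller would chain the three functions per frame: estimate the level from the ambient magnitudes, adjust the model's correction with it, then apply the adjusted correction to the transmit magnitudes.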

15. The communication system according to claims 12 or 13 - 14 as dependent on claim 12, wherein the one or more processors (205, 207) is further configured to, instead of applying the correction function or signal (517, 609) to the transmit microphone input signal (503a, 503b, 511), perform noise filtration (523, 613) of the generated further processed transmit magnitude component (629) in response to the adjusted correction function or signal (521, 649) resulting in a noise corrected further processed transmit magnitude component (653), and providing the corrected or improved voice signal (525, 617) in response to the noise corrected further processed transmit magnitude component (653).

16. The communication system according to any one of claims 1 - 15, wherein the communication system comprises one or more of a wireless remote PTT device (113), one or more communication devices (107, 109, 111), one or more radios (109, 111), a radio of a first type (109) and a radio of a second type (111), and one or more end-user-devices (EUDs) (107).

17. The communication system according to any one of claims 1 - 16, wherein the computer program or routine implementing the trained artificial intelligence or machine learning method or component (515, 607) has been trained according to the training method (701) of any one of claims 18 - 24.

18. A method (701) of training an artificial intelligence or machine learning method or component (515, 607, 713) to be executed by at least one device (103, 303a, 303b, 107, 109, 111, 205) of a communication system, the artificial intelligence or machine learning method or component (515, 607) configured to generate real-time processing of a user’s voice signal (525) in a demanding environment, the method (701) comprising a) obtaining first data (705, 753) representing a reference speech signal including a speech signal (747) of a user (101, 101’) obtained in accordance with a first way, b) obtaining second data (707, 709, 765) representing a training transmit (Tx) signal including the speech signal (747) obtained in accordance with a second way, c) obtaining third data (707, 711, 767) representing a training ambient signal including the speech signal (747) obtained in accordance with a third way, d) providing the second data and the third data to the artificial intelligence or machine learning method or component (515, 607, 713) generating a predicted output (715, 717) in response thereto, e) comparing (719) the predicted output (715, 717) and the first data (705, 753) and determining (721) a difference therebetween, and f) updating parameters (645) of the artificial intelligence or machine learning method or component (515, 607, 713) in response to the determined difference (721, 723, 735), wherein the method (701) further comprises repeating steps a) - f) for new first, second, and third data a plurality of times, typically a large number of times, until the generated difference of the predicted output (715, 717) and the first data (705, 753) is within a predetermined threshold or the generated difference stops improving sufficiently.
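The obtain, predict, compare, update and repeat structure of claim 18 can be sketched as a small supervised training loop. The sketch makes strong simplifying assumptions that are not in the claims: the "model" is just two scalar weights, the data are synthetic, the difference is a mean squared error, and the update is plain gradient descent. The names `make_record` and `train` and all numeric settings are hypothetical.

```python
import random


def make_record(rng, n=32):
    """Hypothetical data source: a clean reference plus two noisy observations of it."""
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    tx = [0.5 * s + rng.gauss(0.0, 0.01) for s in ref]   # attenuated, little noise
    amb = [s + rng.gauss(0.0, 0.3) for s in ref]         # full level, much noise
    return ref, tx, amb


def train(max_iters=5000, lr=0.05, loss_threshold=1e-3,
          min_improvement=1e-9, seed=0):
    rng = random.Random(seed)
    w_tx, w_amb = 0.0, 0.0          # model parameters to be updated
    prev_loss = float("inf")
    loss = prev_loss
    for iteration in range(max_iters):
        ref, tx, amb = make_record(rng)                       # obtain new data
        pred = [w_tx * t + w_amb * a for t, a in zip(tx, amb)]  # predicted output
        diff = [p - r for p, r in zip(pred, ref)]             # compare with reference
        loss = sum(d * d for d in diff) / len(diff)
        # Stop when the difference is small enough or no longer changes.
        if loss < loss_threshold or abs(prev_loss - loss) < min_improvement:
            break
        prev_loss = loss
        # Gradient of the mean squared error with respect to each weight.
        grad_tx = 2.0 * sum(d * t for d, t in zip(diff, tx)) / len(diff)
        grad_amb = 2.0 * sum(d * a for d, a in zip(diff, amb)) / len(diff)
        w_tx -= lr * grad_tx                                   # update parameters
        w_amb -= lr * grad_amb
    return {"iterations": iteration + 1, "loss": loss, "w_tx": w_tx, "w_amb": w_amb}
```

With this synthetic data the loop ends up weighting the cleaner transmit channel more heavily than the noisy ambient channel, which is the kind of behaviour the training is meant to discover from data rather than have hard-coded.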

19. The method (701) of training according to claim 18, wherein the first way (745, 749) comprises providing an acoustic signal by a microphone or transducer of a first type being a high quality professional stationary voice recording microphone or transducer (745), the second way (757, 761) comprises providing an acoustic signal by a microphone or transducer of a second type being a vibration pick-up sensor (317), and / or the third way (759, 763) comprises providing an acoustic signal by a microphone or transducer of a third type being an ambient microphone (315).

20. The method (701) of training according to claim 18 or 19, wherein the second data (707, 709, 765), representing a training transmit (Tx) signal including the speech signal (747), further includes user-generated noise e.g. or preferably obtained in accordance with the second way (757, 761), and / or the third data (707, 711, 767), representing a training ambient signal including the speech signal (747), further includes user-generated noise e.g. or preferably obtained in accordance with the third way (759, 763).

21. The method (701) of training according to any one of claims 18 - 20, wherein the second data (707, 709, 765), representing a training transmit (Tx) signal including the speech signal (747), further includes noise (731, 733) being representative of loud noises of a demanding environment, the third data (707, 711, 767), representing a training ambient signal including the speech signal (747), further includes noise (731, 733) being representative of loud noises of a demanding environment, and the first data (705, 753), representing a reference speech signal including the speech signal (747), does not include noise being representative of loud noises of a demanding environment.

22. The method (701) of training according to claim 21, wherein the noise (731, 733) being representative of loud noises of a demanding environment of the second data (707, 709, 765) has been obtained in accordance with the second way, and / or the noise (731, 733) being representative of loud noises of a demanding environment of the third data (707, 711, 767) has been obtained in accordance with the third way.

23. The method (701) of training according to claim 21 or 22, wherein the noise (731, 733) being representative of loud noises of a demanding environment of the second data (707, 709, 765) and / or the noise (731, 733) being representative of loud noises of a demanding environment of the third data (707, 711, 767) have been obtained at the same time and / or have been obtained when the user (101, 101’) is not speaking.

24. The method (701) of training according to any one of claims 18 - 23, wherein the method (701) further comprises processing (603a, 619) the second data (707, 709, 765) to generate a transmit magnitude component thereof, and providing the generated transmit magnitude component to the artificial intelligence or machine learning method or component (515, 607, 713) instead of providing the second data (707, 709, 765) in step d), and / or processing (603b, 619) the third data (707, 711, 767) to generate an ambient magnitude component (621b) thereof and providing the generated ambient magnitude component to the artificial intelligence or machine learning method or component (515, 607, 713) instead of providing the third data (707, 711, 767) in step d).

25. A method (601) for enhancing speech intelligibility and maintaining robust voice transmission in a tactical communication system (103) when a user is speaking, the method (601) comprising: receiving, from a vibration-sensitive transmit microphone (317, 317a, 317b), a first signal (503a, 503b, 511) representing speech vibrations conducted through the user’s jawbone and / or tissue, receiving, from an ambient microphone (315, 315a, 315b), a second signal (505a, 505b, 513) representing airborne acoustic components of the user’s voice and surrounding environmental sounds, and generating feature-extracted representations of the first signal (629) and the second signal (633); receiving, by a trained machine learning model (607, 515) implemented on a processing unit (205, 207), the feature-extracted representations of the first signal (629) and the second signal (633) as input.

26. The method (601) according to claim 25, wherein the method (601) further comprises: dynamically generating, by the trained machine learning model (607, 515), a correction function (609, 517), the correction function being signal modification parameters configured to selectively enhance speech-relevant frequency bands and suppress noise; subjecting the generated correction function (609, 517) to an output postprocessing step (611, 519) that performs a signal conditioning operation configured to modify the generated correction function (609, 517), thereby providing a modified correction function (649, 521) based on predefined criteria, the predefined criteria comprising a predetermined threshold being determined as a function of a background noise level (635) derived from the second signal (505a, 505b, 513); and providing an improved output voice signal (617, 525), to be transmitted (e.g., via radio (109, 111)) to other users and / or devices, by applying the modified correction function (649, 521) to the first signal (503a, 503b, 511) recorded by the vibration-sensitive microphone (317, 317a, 317b) and / or the second signal (505a, 505b, 513) recorded by the ambient microphone (315, 315a, 315b).

27. The method (601) according to claim 25, wherein the method further comprises: dynamically generating, by the trained machine learning model (607, 515), an improved output voice signal (617, 525), to be transmitted (e.g., via radio (109, 111)) to other users or devices, being a combined signal from both the first signal (503a, 503b, 511) and second signal (505a, 505b, 513), enhancing speech-relevant frequency bands and suppressing noise, wherein the trained machine learning model (607, 515) dynamically adjusts the relative contributions of the first and second signals (503a, 503b, 511, 505a, 505b, 513) according to the background noise level derived from the second signal (505a, 505b, 513), such that in low-noise environments the contribution from the first signal (503a, 503b, 511) is increased, and in high-noise environments the contribution from the second signal (505a, 505b, 513) is increased, thereby maintaining optimal speech intelligibility and noise suppression across varying demanding environments.

28. The method (601) according to any one of claims 25 - 27, wherein the trained machine learning model (607, 515) has been trained on a dataset (703) comprising paired data records (705, 707) using supervised learning, each data record including: speech signals (765, 767) obtained simultaneously from a vibration-sensitive transmit microphone (317, 317a, 317b) and an ambient microphone (315, 315a, 315b) under varying noise conditions representative of demanding operational environments (731, 735); and a corresponding reference voice signal (749), the reference voice signal (749) being obtained using an external microphone (745) positioned to capture high-quality airborne speech from the user (101).

29. A data processing system (105) comprising one or more processors (205, 207) configured to carry out the method (601) of any one of claims 25 to 28.

30. A computer program or a computer-readable medium (211) storing a computer program, the computer program comprising instructions which, when executed by one or more processors (205, 207), cause the one or more processors (205, 207) to carry out the method (601) of any one of claims 25 to 28.

31. A computer-implemented method (703) of generating a training dataset (705, 707) for a machine-learning model (713), in particular the machine-learning model (607, 515) according to claim 25, comprising: obtaining noise data (739, 743) by recording simultaneous output signals (737, 741) from the vibration-sensitive microphone (317, 317a, 317b) and the ambient microphone (315, 315a, 315b) while in a mute mode (725) (i.e., the user is not speaking), under varying noise conditions representative of demanding operational environments (731); obtaining speech data (761, 763, 749) by recording simultaneous output signals (757, 759) from the vibration-sensitive microphone (317, 317a, 317b), the ambient microphone (315, 315a, 315b), and an external microphone (745) positioned to capture high-quality airborne speech (747) from the user, while the user is speaking in a silent background environment (727); generating paired data records (707) by mixing the obtained speech data (761, 763) from the vibration-sensitive and ambient microphones (317, 317a, 317b, 315, 315a, 315b) with the corresponding noise data (739, 743) from the respective microphones, thereby simulating speech signals (765, 763) under varying noise conditions representative of demanding operational environments; associating each paired data record (707) with a reference voice signal (705, 749), the reference voice signal (705, 749) being the high-quality airborne speech (747) from the user as captured by the external microphone (745).
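The mixing step of claim 31 could be sketched as below, under assumptions that go beyond the claim: signals are plain lists of samples of equal length, mixing is sample-wise addition, and each noise recording is reused at several hypothetical gain levels to vary the noise condition. The names `mix` and `build_records` and the record keys are illustrative.

```python
def mix(speech, noise, noise_gain):
    """Add a scaled noise recording to a speech recording, sample by sample."""
    return [s + noise_gain * n for s, n in zip(speech, noise)]


def build_records(speech_tx, speech_amb, reference, noise_pairs,
                  gains=(0.5, 1.0, 2.0)):
    """Create paired training records from quiet speech and separately recorded noise.

    noise_pairs holds (noise_tx, noise_amb) tuples recorded simultaneously while
    the user was silent; the same gain is applied to both so they stay consistent.
    """
    records = []
    for noise_tx, noise_amb in noise_pairs:
        for g in gains:
            records.append({
                "tx": mix(speech_tx, noise_tx, g),         # noisy vibration-pickup signal
                "ambient": mix(speech_amb, noise_amb, g),  # noisy ambient signal
                "reference": list(reference),              # clean external-microphone speech
            })
    return records
```

Each noise pair yields one record per gain, so a modest library of noise recordings can be expanded into many training conditions while the reference stays noise-free.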

32. The computer-implemented method according to claim 27, wherein the varying noise conditions, representative of demanding operational environments, include environmental noise, involuntary user sounds, and equipment-induced vibrations.