Offline AI voice control method based on an intelligent safety helmet, and intelligent safety helmet
By introducing state and posture detection into the smart helmet, combined with a microphone array and offline AI audio processing, the problem of accidental triggering of voice interaction in high-noise environments is solved, achieving low-power and highly reliable device control.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- RUNDE TECH (SHENZHEN) CO LTD
- Filing Date
- 2026-04-24
- Publication Date
- 2026-06-12
AI Technical Summary
In work scenarios without cellular network coverage and accompanied by high-intensity, non-stationary mechanical noise, the offline voice interaction devices of existing smart safety helmets are easily triggered by misoperation, resulting in high power consumption and voice command parsing failure or response delay, making it impossible to reliably control field equipment.
The main control unit is woken up by detecting the wearing status using a state detection unit. After determining a stable interaction state by combining posture detection, audio data is collected through a dual-microphone array, and spatial domain beamforming and frequency domain adaptive filtering are performed. Feature extraction and local speech recognition are performed using an offline AI audio processing unit. Finally, control commands are sent through local wireless communication.
It effectively reduces device power consumption, improves the accuracy and response speed of voice interaction, achieves reliable device control in high-noise environments, and ensures low-latency and high-reliability voice command parsing.
Figure CN122201301A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of smart helmet technology, and in particular to an offline AI voice control method based on a smart helmet and a smart helmet.
Background Technology
[0002] In extreme work scenarios such as construction and mining, where there is no cellular network coverage and high-intensity, non-stationary mechanical noise is present, voice interaction provides great convenience for workers whose hands are occupied. However, existing edge wearable devices such as smart safety helmets generally rely on continuous monitoring of the audio module or simple acoustic threshold wake-up mechanisms when implementing offline voice interaction.
[0003] On the one hand, in complex industrial environments, equipment is easily triggered by high-frequency mechanical noises, such as knocking and machine roaring, resulting in extremely high ineffective power consumption. On the other hand, in offline environments without cloud computing power, existing edge devices struggle to perform accurate local collaborative noise reduction and feature extraction after receiving a large amount of raw audio mixed with strong noise, leading to offline voice command parsing failure or response delay, and consequently, inability to reliably control on-site equipment.
Summary of the Invention
[0004] In view of this, in order to at least partially address the above problems, this application provides an offline AI voice control method based on a smart safety helmet, and a smart safety helmet.
[0005] This application first provides an offline AI voice control method based on a smart safety helmet, applied to a smart safety helmet, wherein the smart safety helmet is equipped with a state detection unit, an attitude detection unit, a main control unit, an audio acquisition unit, and an offline AI audio processing unit; the method includes: The status detection unit detects in real time whether the smart safety helmet is being worn. If it is being worn, it triggers a wake-up command to the main control unit so that the main control unit is woken up. The main control unit wakes up the attitude detection unit, and the attitude detection unit detects the spatial attitude data of the smart helmet in real time and sends it to the main control unit. The main control unit determines whether the smart helmet is in a stable interactive state based on the spatial attitude data. If it is in a stable interactive state, it generates an audio channel activation permission flag. In response to the audio channel activation permission flag, the main control unit controls the audio acquisition unit and the offline AI audio processing unit to power on and run, acquires environmental audio data through the audio acquisition unit, and sends the environmental audio data to the offline AI audio processing unit for caching; The offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data, and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask. The feature soft mask is purified to obtain denoised features, and the denoised features are subjected to local offline speech recognition to obtain target speech features. The offline AI audio processing unit sends the target speech features to the main control unit. 
The main control unit, based on local wireless communication technology, sends device control messages to external devices corresponding to the target voice features according to the target voice features, so that the external devices can execute corresponding control commands according to the device control messages.
[0006] In one embodiment, the smart safety helmet further includes a power management unit. The status detection unit detects in real time whether the smart safety helmet is being worn. If it is being worn, it triggers a wake-up command to the main control unit to wake up the main control unit, including: After the smart safety helmet is powered on, the power management unit disconnects the power supply circuit of the audio acquisition unit and the offline AI audio processing unit, and controls the main control unit to enter standby mode, while providing low-frequency drive power to the status detection unit. The state detection unit periodically detects whether there is a physical obstruction inside the smart safety helmet based on the low-frequency driving power supply. If there is a physical obstruction, it determines that the smart safety helmet is in the wearing state. When it is determined that the device is in a wearing state, the output circuit of the state detection unit generates an output level flip and generates a wake-up command based on the output level flip, so as to wake up the main control unit.
[0007] In one embodiment, the audio acquisition unit includes a dual-microphone array, wherein the physical geometric center distance between the two microphones of the dual-microphone array is limited to a preset distance; The preset distance satisfies the anti-spatial aliasing half-wavelength constraint for the highest operating frequency band of industrial high-frequency noise.
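The half-wavelength constraint in [0007] can be written as d <= c / (2 * f_max), where c is the speed of sound and f_max is the highest frequency the array must handle without spatial aliasing. The following is a minimal sketch of that arithmetic only. The function names, the 343 m/s sound speed, and the 8 kHz example value are assumptions for illustration; the application does not state a numeric preset distance.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C


def max_mic_spacing_m(f_max_hz: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Largest spacing d that satisfies d <= lambda_min / 2 = c / (2 * f_max)."""
    if f_max_hz <= 0:
        raise ValueError("f_max_hz must be positive")
    return c / (2.0 * f_max_hz)


def satisfies_half_wavelength(d_m: float, f_max_hz: float,
                              c: float = SPEED_OF_SOUND_M_S) -> bool:
    """True if a spacing of d_m metres avoids spatial aliasing up to f_max_hz."""
    return d_m <= max_mic_spacing_m(f_max_hz, c)


# Illustrative value only: treat 8 kHz as the top of the noise band of interest.
limit_m = max_mic_spacing_m(8000.0)
print(f"spacing limit at 8 kHz: {limit_m * 1000:.2f} mm")
```

Under these assumed values the limit is a little over 21 mm, so a 20 mm spacing would pass the check and a 30 mm spacing would not.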
[0008] In one embodiment, the main control unit determines whether the smart helmet is in a stable interaction state based on the spatial attitude data. If it is in a stable interaction state, it generates an audio channel activation permission flag, including: The main control unit acquires the spatial attitude data within a preset time window, wherein the spatial attitude data includes a sequence of quaternion matrices containing pitch and yaw angle information; The spatial attitude data within the preset time window is range-checked to detect whether the smart safety helmet is within the preset eye-level working range, and the motion variance of the spatial attitude data within the preset time window is calculated. If the spatial attitude data is within the preset eye-level working range and the motion variance is less than the preset motion variance threshold, the smart safety helmet is determined to be in the stable interaction state, and the audio channel activation permission flag is generated.
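The range check and variance test in [0008] can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the window is taken to be a list of (pitch, yaw) pairs in degrees already derived from the quaternion sequence, the range check is applied to pitch only, the motion variance is taken as the sum of the pitch and yaw population variances, and the numeric range and threshold are placeholders. Yaw wrap-around at +/-180 degrees is ignored for brevity.

```python
from statistics import pvariance


def is_stable_interaction(samples, pitch_range=(-15.0, 15.0), var_threshold=4.0):
    """Return True if the helmet looks stable over one time window.

    samples: list of (pitch_deg, yaw_deg) tuples covering the window.
    pitch_range, var_threshold: hypothetical presets, not from the application.
    """
    if len(samples) < 2:
        return False  # not enough data to estimate a variance
    pitches = [p for p, _ in samples]
    yaws = [y for _, y in samples]
    # Range check: every pitch sample must lie inside the eye-level range.
    in_range = all(pitch_range[0] <= p <= pitch_range[1] for p in pitches)
    # Motion variance: assumed here to be the sum of the per-axis variances.
    motion_var = pvariance(pitches) + pvariance(yaws)
    return in_range and motion_var < var_threshold


# 50 samples, e.g. a 1.0 s window at 50 Hz.
steady = [(2.0, 10.0 + 0.1 * (i % 2)) for i in range(50)]
shaking = [(2.0, 10.0 + 10.0 * (i % 2)) for i in range(50)]
looking_down = [(-40.0, 0.0)] * 50
print(is_stable_interaction(steady), is_stable_interaction(shaking),
      is_stable_interaction(looking_down))
```

Only the steady window would set the audio channel activation permission flag; the other two fail the variance test and the range check respectively.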
[0009] In one embodiment, in response to the audio channel activation permission flag, the main control unit controls the audio acquisition unit and the offline AI audio processing unit to power on and operate, acquires ambient audio data through the audio acquisition unit, and sends the ambient audio data to the offline AI audio processing unit for caching, including: In response to the audio channel activation permission flag, the main control unit controls the audio acquisition unit and the offline AI audio processing unit to power on and operate. The audio acquisition unit acquires external environmental audio data and uses the analog-to-digital converter inside the offline AI audio processing unit to convert the environmental audio data into a discrete digital sequence. By using the integrated circuit's built-in audio bus, the discrete digital sequence of environmental audio data is directly written into the circular buffer inside the offline AI audio processing unit for caching based on direct memory access technology.
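The circular buffer described in [0009] can be modeled in software as shown below. On the actual chip the writes would be performed by the DMA controller rather than by a Python loop; the class name, capacity, and overwrite-oldest policy are assumptions made for illustration.

```python
class RingBuffer:
    """Fixed-size circular buffer that overwrites the oldest samples when full."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = [0] * capacity
        self._cap = capacity
        self._write = 0   # index of the next slot to write
        self._count = 0   # number of unread samples currently stored

    def write(self, samples):
        """Append samples (stands in for a DMA transfer from the audio bus)."""
        for s in samples:
            self._buf[self._write] = s
            self._write = (self._write + 1) % self._cap
            if self._count < self._cap:
                self._count += 1
            # When the buffer is already full, the oldest unread sample is lost.

    def read_frame(self, n: int):
        """Return the n oldest unread samples, or None if fewer are available."""
        if n > self._count:
            return None
        start = (self._write - self._count) % self._cap
        frame = [self._buf[(start + i) % self._cap] for i in range(n)]
        self._count -= n
        return frame


rb = RingBuffer(4)
rb.write([1, 2, 3])
first = rb.read_frame(2)      # oldest two samples
rb.write([4, 5, 6])           # wraps around the end of the storage
second = rb.read_frame(4)
```

Reading in fixed-length frames mirrors the frame-by-frame processing in the later steps, while the producer side never blocks on the consumer.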
[0010] In one embodiment, the offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data, and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask, including: The offline AI audio processing unit reads the environmental audio data in the form of discrete digital sequences and performs a fast Fourier transform to convert the environmental audio data from time-domain frames into frequency-domain signals. Obtain a preset fixed target sound source steering vector, wherein the target sound source steering vector is constructed based on a specific frequency or frequency band of the currently processed frequency domain signal and a fixed incident angle of the target sound source relative to the audio acquisition unit; The phase difference of the frequency domain signal is calculated based on the direction-of-arrival estimation algorithm to lock the fixed noise source and the corresponding initial spatial azimuth angle. Extract the yaw angle change from the spatial attitude data, call the preset rigid body coordinate system rotation matrix operation logic, use the yaw angle change to perform inverse angle compensation on the initial spatial azimuth angle, and calculate the real-time relative incident angle of the fixed noise source in the smart safety helmet follow-up coordinate system. The corresponding noise space steering vector is calculated based on the real-time relative incident angle, and the space noise covariance matrix is calculated based on the noise space steering vector. 
The offline AI audio processing unit, based on the minimum variance distortionless response (MVDR) principle, calculates the optimal frequency domain filtering weights from the spatial noise covariance matrix and the target sound source steering vector, and applies them to the frequency domain signal to obtain the first noise-reduced frequency. After frequency domain adaptive filtering is performed on the first noise-reduced frequency, the result is input into a pre-stored offline regression noise reduction model to generate a feature soft mask.
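For a single frequency bin and a two-microphone array, the chain in [0010] (yaw compensation, noise steering vector, noise covariance, MVDR weights) can be sketched as below. The far-field steering model, the sign convention of the yaw compensation, the rank-one-plus-diagonal-loading covariance, and all numeric values are assumptions; the application only names the steps. The weights follow the standard MVDR form w = R^-1 a_s / (a_s^H R^-1 a_s).

```python
import cmath
import math

C = 343.0  # assumed speed of sound in m/s


def compensate_azimuth(initial_azimuth_deg, yaw_change_deg):
    """Noise direction in the helmet frame after the head turns (assumed sign)."""
    return initial_azimuth_deg - yaw_change_deg


def steering_vector(freq_hz, angle_deg, spacing_m, c=C):
    """Far-field steering vector of a 2-mic array; angle measured from broadside."""
    delay = spacing_m * math.sin(math.radians(angle_deg)) / c
    return [1 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]


def noise_covariance(a_n, noise_power=1.0, loading=1e-3):
    """R = p * a_n a_n^H + loading * I (diagonal loading keeps R invertible)."""
    return [[noise_power * a_n[i] * a_n[j].conjugate() + (loading if i == j else 0.0)
             for j in range(2)] for i in range(2)]


def mvdr_weights(r_n, a_s):
    """w = R^-1 a_s / (a_s^H R^-1 a_s) using a closed-form 2x2 inverse."""
    (a, b), (c, d) = r_n
    det = a * d - b * c
    r_inv = [[d / det, -b / det], [-c / det, a / det]]
    v = [r_inv[i][0] * a_s[0] + r_inv[i][1] * a_s[1] for i in range(2)]
    denom = a_s[0].conjugate() * v[0] + a_s[1].conjugate() * v[1]
    return [vi / denom for vi in v]


def response(w, a):
    """Beamformer response w^H a toward a direction with steering vector a."""
    return w[0].conjugate() * a[0] + w[1].conjugate() * a[1]


freq, spacing = 1000.0, 0.02           # illustrative bin frequency and spacing
a_s = steering_vector(freq, 0.0, spacing)             # fixed target: mouth, broadside
rel = compensate_azimuth(60.0, 20.0)                  # noise at 60 deg, head turned 20 deg
a_n = steering_vector(freq, rel, spacing)
w = mvdr_weights(noise_covariance(a_n), a_s)
print(abs(response(w, a_s)), abs(response(w, a_n)))
```

The first printed magnitude stays at 1 (the distortionless constraint toward the target), while the second is small, which corresponds to the spatial null steered toward the fixed noise source.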
[0011] In one embodiment, the step of performing frequency-domain adaptive filtering on the first noise-reduced frequency and then inputting it into a pre-stored offline regression noise reduction model to generate a feature soft mask includes: The offline AI audio processing unit divides the first noise-reduced frequency into a preset number of non-overlapping frequency bands, and performs multi-band spectral subtraction on the preset number of non-overlapping frequency bands to output the speech energy corresponding to multiple frequency bands. The speech energy from multiple frequency bands is combined with the original phase information of the corresponding frequency band noisy signal to reconstruct a frequency domain signal, and the reconstructed frequency domain signal is input into the normalized least mean square (NLMS) filtering algorithm frame by frame in time order. Simultaneously, the energy envelope gradient difference between the current processing frame and the previous processing frame in the preset low-frequency band of the reconstructed frequency domain signal is extracted in real time. When the energy envelope gradient difference is greater than a preset transient change threshold, the step size factor in the NLMS filtering algorithm is forcibly decayed to suppress weight divergence. When the energy envelope gradient difference of a consecutive preset number of frames is not greater than the transient change threshold, the step size factor is controlled to smoothly recover to the preset original value according to the preset exponential function. After the NLMS filtering algorithm of the current processing frame is iteratively updated, the inner product operation is performed between the updated filter weight vector and the reconstructed frequency domain signal to obtain the actual output signal of the filter of the current processing frame.
The actual output signal of the filter is input frame by frame into a pre-stored offline regression denoising model to generate a feature soft mask.
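The step-size control in [0011] can be sketched as a small controller wrapped around a standard NLMS update. Several details are assumptions: the "energy envelope gradient difference" is interpreted as the absolute change in low-band energy between consecutive frames, the decay factor, threshold, and recovery rate are placeholders, and the desired signal `d` passed to the update is hypothetical because the application does not specify the reference signal.

```python
import math


class StepSizeController:
    """Decays the NLMS step size on transients, then recovers exponentially."""

    def __init__(self, mu0=0.5, decay=0.1, threshold=1.0,
                 calm_frames=3, recover_rate=0.5):
        self.mu0 = mu0                  # preset original step size
        self.mu = mu0
        self.decay = decay              # forced decay factor (assumed)
        self.threshold = threshold      # transient change threshold (assumed)
        self.calm_frames = calm_frames  # consecutive calm frames before recovery
        self.recover_rate = recover_rate
        self._prev_energy = None
        self._calm = 0

    def update(self, low_band_energy):
        grad = 0.0 if self._prev_energy is None else abs(low_band_energy - self._prev_energy)
        self._prev_energy = low_band_energy
        if grad > self.threshold:
            self.mu = self.mu0 * self.decay   # transient: force the step size down
            self._calm = 0
        else:
            self._calm += 1
            if self._calm >= self.calm_frames and self.mu < self.mu0:
                # Exponential approach back toward the original value.
                self.mu = self.mu0 - (self.mu0 - self.mu) * math.exp(-self.recover_rate)
        return self.mu


def nlms_step(w, x, d, mu, eps=1e-8):
    """One NLMS update; returns new weights and the filter output after the update."""
    err = d - sum(wi * xi for wi, xi in zip(w, x))
    norm = eps + sum(xi * xi for xi in x)
    w_new = [wi + mu * err * xi / norm for wi, xi in zip(w, x)]
    y_out = sum(wi * xi for wi, xi in zip(w_new, x))  # inner product with updated weights
    return w_new, y_out


ctl = StepSizeController(mu0=0.5, decay=0.1, threshold=1.0, calm_frames=2, recover_rate=1.0)
mus = [ctl.update(e) for e in (1.0, 5.0, 5.1, 5.0, 5.0)]
```

In this run the second frame is a transient, so the step size drops; it holds for one calm frame and then climbs back toward the original value without exceeding it.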
[0012] In one embodiment, the step of inputting the actual output signal of the filter frame by frame into a pre-stored offline regression denoising model to generate a feature soft mask includes: The offline AI audio processing unit extracts the Mel frequency cepstral coefficients (MFCCs) of the actual output signal of the filter in the current processing frame, together with their time differences, to construct an input feature vector in the form of a one-dimensional tensor. The offline AI audio processing unit feeds the input feature vector into a pre-stored offline regression noise reduction model, wherein the offline regression noise reduction model is a one-dimensional convolutional neural network. After offline training, all weight parameters and biases of the network are quantized from 32-bit floating-point numbers down to 8-bit integers using uniform quantization mapping and statically burned into the internal flash memory of the offline AI audio processing unit; the network topology consists of three cascaded one-dimensional convolutional layers, wherein the first layer uses a convolution kernel of size 9, the following two layers use convolution kernels of size 3, and each layer is followed by a batch normalization operation; At the output of the network, the Sigmoid activation function applies a nonlinear mapping to the linear feature values output by the last one-dimensional convolutional layer, generating an acoustic feature soft mask for the current processing frame whose dimension matches that of the input feature vector exactly. During continuous streaming processing, the acoustic feature soft masks output frame by frame are aggregated in time order to obtain the feature soft mask.
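Two ideas from [0012] lend themselves to a short sketch: the uniform mapping of float weights to 8-bit integers, and a three-layer 1-D convolution stack (kernel sizes 9, 3, 3) whose sigmoid output has the same length as the input. The version below is a simplification under stated assumptions: symmetric per-kernel quantization with one scale factor, single-channel layers, zero padding, no batch normalization or biases, and placeholder weights rather than trained ones.

```python
import math


def quantize_int8(weights):
    """Symmetric uniform quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def conv1d_same(x, kernel):
    """Single-channel 1-D convolution with zero padding; output length == input length."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def soft_mask(features, quantized_layers):
    """Run the features through the layers and squash the result into (0, 1)."""
    h = list(features)
    for q, scale in quantized_layers:
        kernel = [qi * scale for qi in q]   # dequantize on the fly
        h = conv1d_same(h, kernel)
    return [sigmoid(v) for v in h]


# Placeholder kernels with the sizes named in the text: 9, 3, 3.
float_kernels = [[0.1] * 9, [0.5, -1.0, 0.25], [0.3, 0.3, 0.3]]
layers = [quantize_int8(k) for k in float_kernels]
features = [0.2 * i - 1.0 for i in range(13)]   # stand-in for one MFCC frame
mask = soft_mask(features, layers)
```

Because every layer pads to preserve length, the mask has exactly one weight per input feature, which is what allows the element-wise purification in the next step.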
[0013] In one embodiment, the step of refining the feature soft mask to obtain denoised features, and performing local offline speech recognition on the denoised features to obtain target speech features includes: The offline AI audio processing unit performs element-wise multiplication of the feature soft mask with the input feature vector to purify the features and then splices them frame by frame to output a denoised feature sequence. The offline AI audio processing unit inputs the denoising feature sequence into the local offline speech recognition engine for decoding and mapping to obtain the target speech features.
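The purification step in [0013] is an element-wise product between each frame's soft mask and its input feature vector, with the results kept in frame order. A minimal sketch follows; the function name and the list-of-lists layout are assumptions.

```python
def apply_soft_mask(feature_frames, mask_frames):
    """Multiply each feature by its mask weight and keep the frames in time order."""
    if len(feature_frames) != len(mask_frames):
        raise ValueError("feature and mask sequences must have the same length")
    denoised = []
    for features, mask in zip(feature_frames, mask_frames):
        if len(features) != len(mask):
            raise ValueError("mask dimension must match the feature dimension")
        denoised.append([f * m for f, m in zip(features, mask)])
    return denoised


frames = [[2.0, 4.0, -1.0], [0.5, 0.5, 0.5]]
masks = [[1.0, 0.0, 0.5], [0.0, 1.0, 1.0]]
denoised_sequence = apply_soft_mask(frames, masks)
```

Mask weights near 0 suppress a feature and weights near 1 pass it through, so the resulting sequence is what would be handed to the local speech recognition engine.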
[0014] This application also provides a smart safety helmet, which is equipped with a status detection unit, an attitude detection unit, a main control unit, an audio acquisition unit, and an offline AI audio processing unit. The status detection unit, the attitude detection unit, the main control unit, the audio acquisition unit, and the offline AI audio processing unit are used to implement the method described in any of the above claims.
[0015] This application also provides a computer device including a memory and a processor, the memory storing a computer application program, and the processor executing the computer application program, wherein the computer application program is configured to perform the method as described in any of the preceding claims when executed by the processor.
[0016] This application also provides a readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in any of the preceding claims.
[0017] Beneficial effects: In this embodiment, the physical power supply circuit between the audio acquisition unit and the offline AI audio processing unit is closed only when the smart safety helmet is determined to be in a wearing state and in a stable interactive state. This effectively blocks the false triggering of voice interaction caused by mechanical noise from the hardware level, thus reducing the overall power consumption of the smart safety helmet. At the same time, the environmental audio data is directly transmitted to the offline AI audio processing unit for caching, and combined with triple noise reduction processing in the spatial domain, frequency domain and feature domain, the accurate extraction and offline recognition of target voice features in complex industrial overlapping noise is achieved under limited edge computing power. Finally, the control message is sent through local wireless communication technology, thus ensuring low latency and high reliability of voice command parsing and external device collaborative control in high-noise environments without external broadband network connections.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort. In the drawings: Figure 1 is a flowchart illustrating an embodiment of an offline AI voice control method based on a smart safety helmet according to this application.
[0019] Figure 2 is a structural schematic diagram of a smart safety helmet according to another embodiment of this application.
[0020] Figure 3 is a block diagram of a computer device according to an embodiment of this application.
[0021] Figure 4 is a block diagram of a readable storage medium according to an embodiment of this application.
Detailed Implementation
[0022] The terms "first," "second," and "third," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the term "comprising," and any variations thereof, is intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or modules is not limited to the listed steps or modules, but may optionally include steps or modules not listed, or may optionally include other steps or modules inherent to these processes, methods, products, or apparatuses.
[0023] Referring to Figure 1, one embodiment of this application provides an offline AI voice control method based on a smart safety helmet. The method is mainly applied to smart safety helmets used in extreme work scenarios, such as construction and mining, where there is no cellular network coverage and high-intensity mechanical noise is present.
[0024] In this embodiment, the smart helmet body is equipped with at least a status detection unit, an attitude detection unit, a main control unit, an audio acquisition unit, an offline AI audio processing unit, and a power management unit. Specifically, the status detection unit can be an infrared sensor, the attitude detection unit can be a multi-axis inertial measurement unit (IMU), the main control unit is a low-power microcontroller, such as the GD32L233C8T6 chip, the audio acquisition unit can be a microphone array, and the offline AI audio processing unit can be an offline AI voice processing digital signal processor (DSP, such as the CI1306 chip) with built-in memory and an analog-to-digital converter. This hardware structure of the smart helmet avoids reliance on high-performance cloud computing.
[0025] The method in this embodiment includes the following steps: S1. The status detection unit detects in real time whether the smart safety helmet is being worn. If it is being worn, it triggers a wake-up command to the main control unit so that the main control unit is woken up. As described above, after the smart safety helmet is powered on, the power management unit disconnects the power supply circuits of the offline AI audio processing unit and the audio acquisition unit, and controls the main control unit to enter a low-power standby mode. Simultaneously, it only provides low-frequency drive power to the status detection unit for periodic polling detection. When a worker puts on the smart safety helmet, the status detection unit detects a physical obstruction, generates an output level flip, determines that the helmet is being worn, and sends a hardware interrupt signal (wake-up command) to the external interrupt pin of the main control unit, thereby waking up the main control unit.
[0026] By using being in a wearing state as a prerequisite for waking up the main control unit, the audio acquisition unit and the offline AI audio processing unit can always be kept disconnected from power when not in a wearing state. This avoids the offline AI audio processing unit, which is in a high-frequency operating state in a complex industrial environment, being frequently triggered by invalid mechanical noise in the environment, thus reducing the overall power consumption of the smart safety helmet.
[0027] S2. The main control unit wakes up the attitude detection unit, and the attitude detection unit detects the spatial attitude data of the smart safety helmet in real time and sends it to the main control unit. As described above, after being awakened by a wake-up command, the main control unit can send a working clock signal and an enable command to the attitude detection unit, which is in standby or power-off state, via a general-purpose input / output pin or a communication bus to wake up the attitude detection unit. Once awakened, the attitude detection unit collects the acceleration and angular velocity information of the smart helmet in three-dimensional space in real time at a set sampling frequency (e.g., 50Hz), calculates the spatial attitude data representing the current head posture, and continuously sends the spatial attitude data to the main control unit. The spatial attitude data includes a sequence of quaternion matrices for pitch and yaw angles.
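Step S2 states that the attitude data is a quaternion sequence carrying pitch and yaw information. One common way to recover those two angles from a unit quaternion is the Z-Y-X (yaw, pitch, roll) Euler conversion sketched below. The axis convention and the (w, x, y, z) component order are assumptions; the application does not state which convention the IMU uses.

```python
import math


def quaternion_to_pitch_yaw(w, x, y, z):
    """Return (pitch_deg, yaw_deg) for a quaternion, assuming Z-Y-X Euler angles."""
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero-norm quaternion")
    w, x, y, z = (v / norm for v in (w, x, y, z))
    # Clamp to guard against rounding slightly outside [-1, 1] before asin.
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.degrees(math.asin(sin_pitch))
    yaw = math.degrees(math.atan2(2.0 * (w * z + x * y),
                                  1.0 - 2.0 * (y * y + z * z)))
    return pitch, yaw


# A head turn of 90 degrees about the vertical (z) axis.
half = math.radians(90.0) / 2.0
pitch_deg, yaw_deg = quaternion_to_pitch_yaw(math.cos(half), 0.0, 0.0, math.sin(half))
```

Applying this to each quaternion in the window yields the pitch and yaw series that the stability determination in step S3 operates on.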
[0028] S3. The main control unit determines whether the smart safety helmet is in a stable interactive state based on the spatial attitude data. If it is in a stable interactive state, it generates an audio channel activation permission flag. As described above, to prevent involuntary minor tremors or violent movements of the operator from interfering with the voice wake-up determination, the main control unit, after acquiring the spatial attitude data, extracts and verifies the spatial attitude data within a preset time window (e.g., a continuously collected 1.0-second time window). Specifically, the main control unit verifies whether the spatial attitude data (such as pitch and yaw angles) within the time window is within a preset eye-level working range and calculates the motion variance within that time window. If the spatial attitude data is within this range and the motion variance is less than a preset motion variance threshold, it is determined that no violent motion has occurred, i.e., the smart safety helmet is in a stable interactive state. At this time, the main control unit can generate a unique audio channel activation permission flag in its internal register. The audio channel activation permission flag is used to drive the power management unit to supply power to the offline AI audio processing unit and the audio acquisition unit. Only when the main control unit generates this flag is the power management unit allowed to supply power to the offline AI audio processing unit and the audio acquisition unit. This reduces the possibility of accidental activation of voice interaction and improves the accuracy of voice interaction determination under complex operations.
[0029] S4. In response to the audio channel activation permission flag, ambient audio data is collected through the audio acquisition unit, and the ambient audio data is sent to the offline AI audio processing unit for caching; As described above, after obtaining the audio channel activation permission flag, the main control unit closes the physical power supply circuit of the audio acquisition unit and the offline AI audio processing unit through the power management unit, enabling both to power on and operate. The audio acquisition unit then captures external environmental audio data continuously. To avoid data congestion and insufficient memory in the main control unit, the acquired environmental audio data, after being converted into a discrete digital sequence, is not sent to the main control unit but is written directly to the buffer area inside the offline AI audio processing unit via the internal hardware data bus.
[0030] S5. The offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data, and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask. As described above, after the environmental audio data is cached inside the offline AI audio processing unit, the offline AI audio processing unit reads the environmental audio data sequentially according to a preset data frame length, and performs multi-level hardware-level noise reduction calculations using its internal arithmetic logic unit.
[0031] Specifically, firstly, for strong noise sources operating at fixed points in industrial sites, such as the roar of tunnel boring machines, generators, and water pumps, the offline AI audio processing unit combines the spatial attitude data obtained in the aforementioned steps to perform spatial domain beamforming processing on the environmental audio data, so that the auditory directivity of the smart safety helmet forms a spatial null in the fixed noise direction of the external environment; subsequently, for non-stationary transient impact noise in the environment, such as occasional metallic knocking sounds, the offline AI audio processing unit performs frequency domain adaptive filtering processing on the audio data after spatial domain processing to further eliminate high-frequency residual impact noise.
[0032] Next, the offline AI audio processing unit extracts acoustic features from the environmental audio data that has undergone spatial domain beamforming and frequency domain adaptive filtering. These acoustic features can be Mel-frequency cepstral coefficients. The unit then calls upon an offline regression noise reduction model, such as a lightweight one-dimensional convolutional neural network, pre-programmed into its internal memory. The offline AI audio processing unit uses these acoustic features as an input feature matrix to perform forward inference on the model, fitting an output feature soft mask that perfectly matches the input dimension. This feature soft mask is essentially a probability weight matrix composed of values from 0 to 1, representing the probability that the corresponding acoustic feature node belongs to clean speech or noise.
[0033] By performing triple noise reduction on environmental audio data, this application can accurately remove complex overlapping noise in extreme industrial environments with limited edge computing power, providing a highly robust computational basis for the final purification of the feature domain.
[0034] S6. The feature soft mask is purified to obtain denoised features. The denoised features are then subjected to local offline speech recognition to obtain target speech features. The offline AI audio processing unit sends the target speech features to the main control unit. As described above, after outputting the feature soft mask, the offline AI audio processing unit performs element-wise multiplication and fusion processing on the feature soft mask and the noisy acoustic features in its internal matrix operation register. At the physical computation level, this operation uses the minimal weight values in the soft mask to suppress and filter out background noise components in the acoustic feature matrix, while using weight values close to 1 to losslessly preserve the true speech harmonic components, thereby extracting the denoised and clean target speech features.
[0035] Subsequently, the offline AI audio processing unit performs offline acoustic matching and word mapping on the pure target speech features locally to identify the corresponding business instruction control intent. Finally, it sends the target speech features representing the intent to the main control unit through internal communication interfaces such as Universal Asynchronous Receiver / Transmitter (UART) which occupy low bandwidth. The target speech features can be in text form or in the form of corresponding simplified instruction flag codes.
[0036] By performing local offline speech recognition on the features purified with the feature soft mask, rather than transmitting large amounts of undenoised raw audio data to the main control unit through a low-speed serial port as in the existing technology, this application not only eliminates data congestion on the inter-chip communication bus and memory shortage in the main control unit, but also achieves fast and accurate parsing of offline speech features on the smart safety helmet in a network-free environment without cloud computing power.
[0037] S7. The main control unit, based on local wireless communication technology, sends a device control message to an external device corresponding to the target voice feature according to the target voice feature, so that the external device executes the corresponding control command according to the device control message.
[0038] As described above, after acquiring the target speech features parsed by the offline AI audio processing unit and returned via the internal bus, the main control unit calls its internally stored mapping dictionary to translate and package the target speech features into a device control message conforming to a preset communication protocol. At the physical transmission layer, the main control unit broadcasts or directionally sends the device control message in the form of electromagnetic wave signals to the corresponding external device at the work site through its internally integrated wireless radio frequency front-end or an external wireless transceiver module, such as a low-power Bluetooth module, based on local wireless communication technology.
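The lookup-and-package step in [0038] can be illustrated as follows. Everything concrete here is hypothetical: the application does not disclose the mapping dictionary contents, the device identifiers, or the frame layout, so the 4-byte frame (header, device ID, opcode, 8-bit checksum) and the table entries are invented purely to show the shape of the logic. The command phrases are taken from the examples in the description.

```python
import struct

FRAME_HEADER = 0xA5  # hypothetical start-of-frame marker

# Hypothetical mapping dictionary: recognized intent -> (device_id, opcode).
COMMAND_MAP = {
    "start machine": (0x01, 0x10),
    "emergency stop": (0x01, 0xFF),
    "switch walkie-talkie channel": (0x02, 0x20),
}


def build_control_message(intent: str):
    """Translate a recognized intent into a framed control message, or None."""
    entry = COMMAND_MAP.get(intent)
    if entry is None:
        return None  # unknown intent: send nothing
    device_id, opcode = entry
    body = struct.pack("BBB", FRAME_HEADER, device_id, opcode)
    checksum = sum(body) & 0xFF   # simple 8-bit additive checksum (assumed)
    return body + bytes([checksum])


message = build_control_message("emergency stop")
print(message.hex())
```

Returning None for unrecognized intents keeps a misrecognized phrase from producing any radio traffic, which fits the reliability goal stated in the description.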
[0039] For example, in extreme operating scenarios where there is no cellular network or the network signal is extremely weak, such as underground mines or large building foundation pits, the external device can be an industrial physical device that has been pre-paired and networked wirelessly with the smart safety helmet. Examples include a walkie-talkie equipped with a wireless receiving node, a crane control terminal, a tunnel boring machine emergency stop controller, or a field lighting system. After the external device's radio frequency antenna receives the electromagnetic wave signal and unpacks and reconstructs the device control message, it can directly trigger its own physical drive circuit to execute corresponding control commands such as "start machine," "emergency stop," and "switch walkie-talkie channel."
[0040] By sending device control messages based on local wireless communication technology, combined with the aforementioned steps of controlling the power supply of the offline AI audio processing unit based on wearing status and spatial posture data, and performing noise reduction and feature extraction locally on the offline AI audio processing unit, this embodiment can complete the parsing of voice commands and the control of corresponding external devices in high-noise environments without external broadband network connections. This avoids the delay caused by voice data transmission over remote networks and improves the reliability of collaborative device control in harsh environments.
[0041] In one embodiment, step S1, in which the state detection unit detects in real time whether the smart safety helmet is being worn and, if it is, sends a wake-up command to wake up the main control unit, includes: S101. After the smart safety helmet is powered on, the power management unit disconnects the power supply circuits of the audio acquisition unit and the offline AI audio processing unit, and controls the main control unit to enter standby mode, while providing low-frequency drive power to the state detection unit.
[0042] As described above, the smart safety helmet uses a finite state machine (FSM), built upon multimodal sensing, as its underlying hardware control logic. To minimize the power consumption of the smart safety helmet in static conditions, when the device is in state 0 (deep sleep), the power supply to the audio acquisition unit and the offline AI audio processing unit is physically cut off by a low-power power management unit (PMU, such as the ETA6122 chip). Simultaneously, the main control unit enters a low-power standby mode. In this state, the power management unit only provides a low-frequency drive power supply of a preset frequency (e.g., 1 Hz) to the state detection unit, so that the state detection unit operates in a periodic polling manner, thereby keeping the overall static power consumption of the smart safety helmet below 50 μA.
[0043] S102. The state detection unit periodically detects whether there is a physical obstruction inside the smart safety helmet based on the low-frequency drive power supply. If there is a physical obstruction, it determines that the smart safety helmet is in the wearing state. As described above, the state detection unit can be an infrared sensor. The infrared sensor emits an infrared detection signal toward the wearing area inside the smart safety helmet at a preset frequency and listens for the reflected signal. When the worker puts on the smart safety helmet, the infrared sensor receives an infrared reflected signal that meets the preset threshold conditions, and then determines that there is a physical obstruction, that is, that the smart safety helmet is in the wearing state.
[0044] S103. When it is determined that the device is in a wearing state, the output circuit of the state detection unit generates an output level flip and generates a wake-up command based on the output level flip so that the main control unit is woken up.
[0045] As described above, when the state detection unit confirms the wearing state, its internal sensing circuit undergoes an electrical state transition, causing its output pin to flip its output level, for example, from low to high. This output level flip acts as a purely physical layer trigger source, directly forming a pulse-shaped hardware interrupt signal in the circuit. This hardware interrupt signal represents the wake-up command. After the external interrupt pin of the main control unit captures the hardware interrupt signal, it directly triggers the underlying hardware interrupt handling logic of the main control unit, that is, it forcibly restores the core operating clock of the main control unit, causing the main control unit to switch from the low-power standby mode to the normal operating mode, thereby completing the wake-up of the main control unit. During this wake-up process, the offline AI audio processing unit and the audio acquisition unit remain in a state of power disconnection. At this time, the state of the finite state machine is 1.
[0046] In one embodiment, the audio acquisition unit includes a dual-microphone array, wherein the physical geometric center distance between the two microphones of the dual-microphone array is limited to a preset distance; the preset distance satisfies the anti-spatial aliasing half-wavelength constraint for the highest operating frequency band of industrial high-frequency noise.
[0047] Existing safety helmets typically have arbitrarily set microphone array spacing, failing to meet physical acoustic constraints. This leads to phase ambiguity or grating lobe effects when processing high-frequency mechanical noise, rendering beamforming and other noise reduction algorithms ineffective. Therefore, the dual-microphone array in this embodiment consists of two microelectromechanical system (MEMS) silicon microphones. The two microphones are symmetrically arranged on the body of the smart safety helmet, for example, below the brim, with the distance between their respective physical geometric centers strictly limited to a preset distance. In high-noise industrial environments, to ensure that spatial domain beamforming processing in subsequent steps does not experience phase ambiguity or grating lobe effects when processing high-frequency mechanical noise, the preset distance must satisfy the anti-spatial aliasing half-wavelength constraint.
[0048] For example, the highest operating frequency $f_{\max}$ of the industrial high-frequency noise that the smart safety helmet needs to suppress is determined, and the corresponding physical geometric center distance is calculated from $f_{\max}$.
[0049] Specifically, in construction and mining scenarios, the strong noise generated by rotating machinery such as tunnel boring machines and generators contains a large number of high-frequency harmonic components. If the spacing between the microphones is too large, spatial undersampling occurs in the high-frequency band. According to the acoustic anti-aliasing principle, the microphone spacing must be less than or equal to half the wavelength of the highest-frequency sound wave to be processed, and the calculation formula is as follows: $d \le \frac{v}{2 f_{\max}}$, where $d$ represents the physical geometric center distance between the two microphones, $v$ represents the speed of sound in air, which is approximately 340 m/s in this embodiment, and $f_{\max}$ is the preset highest operating frequency of the industrial high-frequency noise that the smart safety helmet needs to process.
[0050] In this embodiment, the smart safety helmet is preset to process industrial high-frequency noise with a maximum operating frequency of 5.6 kHz. Substituting into the above formula, the array spacing should not exceed 3.03 cm to achieve acoustic anti-aliasing within the 5.6 kHz frequency band. Based on this physical constraint, this embodiment limits the physical geometric center distance between the two microphones to 3 cm. By limiting the preset distance of the dual-microphone array to 3 cm, which satisfies the half-wavelength constraint for anti-spatial aliasing, it can be ensured that the offline AI audio processing unit can acquire environmental audio data with accurate phase information within the audio frequency range of 5.6 kHz. This eliminates the grating lobe effect caused by the array spacing violating acoustic principles at the physical structure level, that is, it avoids the dual-microphone array generating a false main lobe with the same receiving gain as the target sound source direction, i.e., the direction of the worker's mouth, in non-target sound source directions, such as the direction of fixed mechanical noise. By eliminating this false main lobe, the failure of the smart safety helmet's spatial directivity is prevented, ensuring that the offline AI audio processing unit can always accurately focus the pickup angle on the target sound source direction, thereby providing a reliable phase basis for subsequent extraction of target speech features.
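As a quick check of the half-wavelength constraint, the following sketch evaluates the bound for the values given in this embodiment (340 m/s and 5.6 kHz). The function names are illustrative and do not appear in the application.

```python
SPEED_OF_SOUND_M_S = 340.0  # value used in this embodiment


def max_mic_spacing(f_max_hz, v=SPEED_OF_SOUND_M_S):
    """Largest spacing in metres satisfying d <= v / (2 * f_max)."""
    return v / (2.0 * f_max_hz)


def alias_free_limit(d_m, v=SPEED_OF_SOUND_M_S):
    """Highest frequency in Hz that spacing d handles without spatial aliasing."""
    return v / (2.0 * d_m)


d_max = max_mic_spacing(5600.0)
print(f"maximum spacing: {d_max * 100:.2f} cm")
print(f"3 cm spacing is alias-free up to about {alias_free_limit(0.03):.0f} Hz")
```

The bound evaluates to slightly more than 3.03 cm, so the chosen 3 cm spacing leaves a small margin above the 5.6 kHz design frequency.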
[0051] In one embodiment, step S2, in which the main control unit wakes up the attitude detection unit and the attitude detection unit detects the spatial attitude data of the smart helmet in real time and sends it to the main control unit, includes: S201. After being woken up, the main control unit sends an enable command and a working clock signal to the attitude detection unit in standby state through a general-purpose input / output pin or a communication bus to wake up the attitude detection unit.
[0052] Specifically, the main control unit sends a hardware enable command and a working clock signal to the attitude detection unit, which is in a low-power standby or power-off sleep state, through general purpose input / output (GPIO) pins or an internal communication interface, such as an I2C bus or an SPI bus. This activates the internal registers and sensing circuits of the attitude detection unit on the physical circuit, thus waking up the attitude detection unit. At this time, the finite state machine is in state 2.
[0053] S202. After the attitude detection unit is awakened, it collects the spatial motion parameters of the smart safety helmet in real time at a preset sampling frequency, and calculates and generates the spatial attitude data.
[0054] As described above, after the attitude detection unit is activated, it uses its integrated microelectromechanical system (MEMS) accelerometer and gyroscope to continuously collect real-time triaxial acceleration and triaxial angular velocity information of the smart safety helmet in three-dimensional space at a set sampling frequency, such as 50Hz, as the spatial motion parameters. The digital motion processor or arithmetic logic circuit inside the attitude detection unit calls a preset attitude fusion algorithm, such as a complementary filtering algorithm or a Kalman filtering algorithm, to calculate the spatial motion parameters in real time: First, it performs integration calculation using the real-time collected triaxial angular velocity data to update the quaternion matrix representing the current attitude; simultaneously, it uses the real-time collected triaxial acceleration data to provide a gravity vector reference to compensate for and correct the low-frequency integral drift error of the gyroscope during the quaternion update process; finally, after obtaining the corrected quaternion matrix, it maps and extracts the corresponding Euler angle information, i.e., pitch angle and yaw angle, according to the rigid body kinematics transformation relationship. In this way, the attitude detection unit can continuously output a sequence of quaternion matrices containing pitch angle and yaw angle information, which serves as the spatial attitude data representing the current head attitude.
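The gyroscope-plus-accelerometer correction can be illustrated with a deliberately simplified sketch. It is a single-axis complementary filter for the pitch angle only, not the quaternion update described above; the blending coefficient `k` and the axis convention are assumptions, while the 0.02 s step follows from the 50 Hz example sampling frequency.

```python
import math


def accel_pitch_deg(ax, ay, az):
    """Pitch implied by the gravity vector, in degrees (assumed axis convention)."""
    return math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))


def complementary_pitch(prev_pitch_deg, gyro_rate_dps, accel, dt=0.02, k=0.98):
    """One filter step: integrate the gyroscope, then pull toward the accelerometer."""
    gyro_estimate = prev_pitch_deg + gyro_rate_dps * dt   # short-term, drifts
    gravity_estimate = accel_pitch_deg(*accel)            # long-term reference
    return k * gyro_estimate + (1.0 - k) * gravity_estimate


# A helmet at rest with a 10 degree initial error settles toward the gravity reference.
pitch = 10.0
for _ in range(500):
    pitch = complementary_pitch(pitch, 0.0, (0.0, 0.0, 1.0))
print(f"pitch after 10 s at rest: {pitch:.4f} deg")
```

The gyroscope term dominates from frame to frame, while the small accelerometer weight removes the slow integration drift, which is the same division of roles the quaternion version relies on.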
[0055] S203. The attitude detection unit continuously sends the spatial attitude data to the receiving register of the main control unit for storage.
[0056] As described above, after processing one or more batches of data, the attitude detection unit packages the quaternion matrix sequence of consecutive frames and sends it to the hardware receiving pin of the main control unit at the aforementioned output frequency of 50Hz. The internal data reading logic of the main control unit sequentially transfers the received spatial attitude data to its internal static random access memory, thereby constructing a continuous data queue for subsequent time window verification, providing basic data support for subsequent determination of whether the smart helmet is in a stable interactive state.
[0057] In one embodiment, step S3, in which the main control unit determines whether the smart safety helmet is in a stable interaction state based on the spatial attitude data and, if so, generates an audio channel activation permission flag, includes: S301. The main control unit acquires the spatial attitude data within a preset time window.
[0058] Specifically, the main control unit extracts from its internal static random access memory, according to the first-in-first-out principle, a preset number of frames of the spatial attitude data, such as 50 consecutive frames corresponding to a preset time window of 1.0 seconds. The extracted data is a sequence of quaternions containing pitch angle and yaw angle information.
[0059] S302. The main control unit performs range verification on the spatial attitude data within the preset time window to detect whether it is within the preset head-up operation range, and calculates the motion variance within the preset time window.
[0060] Specifically, the main control unit extracts the pitch and yaw angle values from each frame of the spatial attitude data and compares them with the head-up working range values pre-stored in the read-only memory to verify whether the operator's head orientation is within the preset head-up working range. For example, the pitch angle range is set to -20° to +20°, and the yaw angle range is set to -30° to +30°. Simultaneously, it extracts the angular velocity data contained within this preset time window and calculates the variance of the angular velocity over this 1.0-second window, i.e., the motion variance. This motion variance characterizes the intensity of movement or the amplitude of shaking of the smart safety helmet during the current time period.
[0061] S303. If the spatial posture data is within the preset head-up working range and the motion variance is less than the preset motion variance threshold, the main control unit determines that the smart safety helmet is in a stable interactive state and generates an audio channel activation permission flag.
[0062] As described above, when the verification result of the main control unit indicates that both the pitch angle and the yaw angle are within the head-up working range, and the calculated motion variance is less than the preset motion variance threshold, the main control unit rules out interference from involuntary head movement or vigorous walking, determines that the smart safety helmet is not currently shaking violently, and confirms that it is in a stable interaction state. After this condition is met, the main control unit generates a unique audio channel activation permission flag bit in hardware by writing a specified logic level value to a specific internal state control register, for example, setting the corresponding data bit from logic "0" to logic "1". This audio channel activation permission flag bit serves as a hardware trigger source to control the startup of the audio acquisition unit and the offline AI audio processing unit. At this time, the state of the finite state machine is 3.
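Steps S301 to S303 can be sketched as a single window check. The angle ranges and the 50-frame window come from the examples in the text; the variance threshold value and the frame tuple layout are hypothetical.

```python
from statistics import pvariance

PITCH_RANGE_DEG = (-20.0, 20.0)  # example range from the text
YAW_RANGE_DEG = (-30.0, 30.0)    # example range from the text
WINDOW_FRAMES = 50               # 1.0 s at 50 Hz
VARIANCE_THRESHOLD = 25.0        # hypothetical threshold, (deg/s)^2


def is_stable_interaction(frames):
    """frames: list of (pitch_deg, yaw_deg, angular_rate_dps), oldest first."""
    if len(frames) < WINDOW_FRAMES:
        return False  # not enough data to fill the time window yet
    window = frames[-WINDOW_FRAMES:]
    in_range = all(
        PITCH_RANGE_DEG[0] <= pitch <= PITCH_RANGE_DEG[1]
        and YAW_RANGE_DEG[0] <= yaw <= YAW_RANGE_DEG[1]
        for pitch, yaw, _ in window
    )
    motion_variance = pvariance([rate for _, _, rate in window])
    return in_range and motion_variance < VARIANCE_THRESHOLD


steady = [(5.0, -10.0, 1.0)] * 50
shaking = [(5.0, -10.0, 40.0 if i % 2 else -40.0) for i in range(50)]
print(is_stable_interaction(steady), is_stable_interaction(shaking))
```

Requiring both conditions means a worker who is looking straight ahead while walking briskly, or standing still while looking down, does not power up the audio path.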
[0063] In one embodiment, step S4, which involves acquiring ambient audio data via the audio acquisition unit in response to the audio channel activation permission flag and sending the ambient audio data to the offline AI audio processing unit for caching, includes: S401. In response to the audio channel activation permission flag, the main control unit controls the power management unit to close the physical power supply circuits of the audio acquisition unit and the offline AI audio processing unit.
[0064] As described above, the main control unit sends a power enable signal to the power management unit via a hardware control pin. Upon receiving this signal, the power management unit closes the power supply loop connecting the audio acquisition unit and the offline AI audio processing unit in the physical circuit, thereby switching the audio acquisition unit and the offline AI audio processing unit from a previous power-off state to a power-on operating state.
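The power-gating sequence described so far (states 0 through 3) can be summarized as a small finite state machine. The state names are illustrative, and the fallback transitions (returning to deep sleep when the helmet is removed, dropping back to state 2 when stability is lost) are assumptions that the text above does not spell out.

```python
from enum import IntEnum


class HelmetState(IntEnum):
    DEEP_SLEEP = 0       # only the state detection unit is polled
    MCU_AWAKE = 1        # main control unit running, audio path unpowered
    ATTITUDE_ACTIVE = 2  # attitude detection unit sampling
    AUDIO_ENABLED = 3    # permission flag set, audio path powered


def next_state(state, worn, attitude_ready, stable):
    """Return the next state given the current sensor conditions."""
    if not worn:
        return HelmetState.DEEP_SLEEP  # assumption: removal forces deep sleep
    if state == HelmetState.DEEP_SLEEP:
        return HelmetState.MCU_AWAKE
    if state == HelmetState.MCU_AWAKE:
        return HelmetState.ATTITUDE_ACTIVE if attitude_ready else state
    if state == HelmetState.ATTITUDE_ACTIVE:
        return HelmetState.AUDIO_ENABLED if stable else state
    # AUDIO_ENABLED: assumption: losing stability drops back to state 2
    return state if stable else HelmetState.ATTITUDE_ACTIVE


def audio_powered(state):
    """The audio acquisition and offline AI units are powered only in state 3."""
    return state == HelmetState.AUDIO_ENABLED


state = HelmetState.DEEP_SLEEP
for worn, ready, stable in [(True, False, False), (True, True, False), (True, True, True)]:
    state = next_state(state, worn, ready, stable)
    print(state.name, audio_powered(state))
```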
[0065] S402. The audio acquisition unit acquires external environmental audio data and uses an analog-to-digital converter to convert the environmental audio data into a discrete digital sequence.
[0066] As described above, after power-on, the audio acquisition unit begins to continuously capture sound signals from the work site, acquiring analog environmental audio data. Subsequently, using the analog-to-digital converter built into the offline AI audio processing unit, the analog environmental audio data is sampled and quantized at high frequency, converting it into a discrete digital sequence for subsequent purely digital signal processing.
[0067] S403. The environmental audio data in discrete digital sequence form is directly written into the circular buffer inside the offline AI audio processing unit for caching via the integrated circuit's built-in audio bus.
[0068] In existing technologies, hardware often transmits raw audio to the main control unit via a low-speed UART bus, which can easily cause data congestion. Furthermore, due to limited MCU resources, such as insufficient memory in edge MCUs with less than 100KB of memory, operation may be interrupted or forced to terminate.
[0069] As described above, for edge-end resource-constrained main control units, such as those with less than 100KB of memory, the converted environmental audio data in this embodiment is not transferred through the main control unit. Instead, it flows directly through the integrated circuit's built-in audio bus in the underlying hardware, such as an I2S bus, and is written to the offline AI audio processing unit. Based on direct memory access technology, it is cached in a pre-defined circular buffer within the unit. At this point, the system state machine transitions to state 3, i.e., acoustic full-time verification. During the environmental audio data caching process, the main control unit only interacts with the offline AI audio processing unit through low-bandwidth interfaces such as a universal asynchronous receiver / transmitter (UART) to exchange instruction flags.
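The circular buffer behaviour can be sketched in software as follows. In the device the writes would be performed by direct memory access hardware rather than by a loop; the class name, capacity, and method names here are illustrative assumptions.

```python
class RingBuffer:
    """Fixed-capacity circular buffer that overwrites the oldest samples."""

    def __init__(self, capacity):
        self._data = [0] * capacity
        self._capacity = capacity
        self._write = 0   # index of the next slot to write
        self._count = 0   # number of valid samples currently stored

    def write(self, samples):
        for sample in samples:
            self._data[self._write] = sample
            self._write = (self._write + 1) % self._capacity
            self._count = min(self._count + 1, self._capacity)

    def latest(self, n):
        """Return the most recent n samples in chronological order."""
        n = min(n, self._count)
        start = (self._write - n) % self._capacity
        return [self._data[(start + i) % self._capacity] for i in range(n)]


buffer = RingBuffer(4)
buffer.write([1, 2, 3, 4, 5, 6])
print(buffer.latest(4))  # [3, 4, 5, 6]
```

Because the buffer has a fixed size and lives inside the audio processing unit, the memory footprint is bounded regardless of how long audio capture runs, and the main control unit never has to hold raw samples.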
[0070] In one embodiment, step S5, in which the offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask, includes: S511. The offline AI audio processing unit extracts the environmental audio data for time-frequency conversion and uses the spatial attitude data to calculate the relative incident angle in real time to update the spatial noise covariance matrix.
[0071] Specifically, the offline AI audio processing unit reads the environmental audio data in discrete digital sequence form from its internal direct memory access circular buffer and performs a fast Fourier transform to convert the environmental audio data from time-domain frames into frequency-domain signals. Since the position of the dual-microphone array relative to the wearer's mouth is physically fixed, the offline AI audio processing unit internally presets a fixed target sound source steering vector $\mathbf{a}(f, \theta_{target})$, where $f$ represents the specific frequency or frequency band of the frequency-domain signal currently being processed, and $\theta_{target}$ represents the fixed angle of incidence of the target sound source (i.e., the wearer's mouth) relative to the center of the dual-microphone array. To handle fixed noise sources such as externally operating mechanical equipment, their initial physical location must be obtained. During the initial calibration period, when the smart safety helmet first enters a stable interaction state after initial power-on, the offline AI audio processing unit invokes a preset direction-of-arrival estimation algorithm, such as generalized cross-correlation with phase transform (GCC-PHAT), to perform phase difference calculation on the frequency-domain signal. This scans for and locks onto the most energetic steady-state spatial sound source as the fixed noise source, obtaining its initial spatial azimuth angle $\theta_{initial}$.
[0072] Since the absolute spatial orientation of the fixed noise source remains unchanged for a short period of time, when the operator's head rotates, the offline AI audio processing unit extracts the yaw angle change $\Delta\psi$ from the spatial attitude data, calls the preset one-dimensional linear inverse angle compensation logic, and uses the yaw angle change $\Delta\psi$ to compensate the initial spatial azimuth angle $\theta_{initial}$, thereby calculating the real-time relative incident angle $\theta_{noise}$ of the fixed noise source in the body-fixed coordinate system of the smart safety helmet. The formula is: $\theta_{noise} = \theta_{initial} - \Delta\psi$. After obtaining the real-time relative incident angle $\theta_{noise}$, the offline AI audio processing unit calculates the corresponding noise spatial steering vector $\mathbf{a}_n(f, \theta_{noise})$ based on the physical geometric center spacing $d$ of the dual-microphone array. Its mathematical model is: $\mathbf{a}_n(f, \theta_{noise}) = \left[1,\ e^{-j 2\pi f d \sin\theta_{noise} / v}\right]^T$, where $f$ represents the specific frequency or frequency band of the frequency-domain signal being processed, $v$ represents the speed of sound in air, $j$ is the imaginary unit, and $[\cdot]^T$ denotes the matrix transpose.
[0073] Based on the calculated noise spatial steering vector, the offline AI audio processing unit uses a diagonally loaded covariance reconstruction strategy to update in real time the spatial noise covariance matrix $\mathbf{R}_{vv}(f)$, which characterizes the external noise distribution. The update formula is: $\mathbf{R}_{vv}(f) = \sigma_n^2(f)\, \mathbf{a}_n(f, \theta_{noise})\, \mathbf{a}_n^H(f, \theta_{noise}) + \sigma_w^2 \mathbf{I}$, where $\sigma_n^2(f)$ is the noise power estimated for the current frequency band, $\sigma_w^2$ is the white noise variance of the smart safety helmet, $\mathbf{I}$ is the identity matrix, and $\mathbf{a}_n^H(f, \theta_{noise})$ is the conjugate transpose of the noise spatial steering vector $\mathbf{a}_n(f, \theta_{noise})$.
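The yaw compensation, steering vector, and diagonally loaded covariance update of step S511 can be sketched for the two-microphone case as follows. The sine-based steering vector assumes the incident angle is measured from the array broadside, which is an assumption about the angle convention; the numeric noise power and white noise variance are placeholders.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, as in this embodiment
MIC_SPACING = 0.03      # m, as in this embodiment


def compensate_angle(theta_initial_deg, delta_yaw_deg):
    """theta_noise = theta_initial - delta_psi."""
    return theta_initial_deg - delta_yaw_deg


def steering_vector(freq_hz, theta_deg, d=MIC_SPACING, v=SPEED_OF_SOUND):
    """Two-element far-field steering vector (angle from broadside, assumed)."""
    delay_s = d * math.sin(math.radians(theta_deg)) / v
    return [1.0 + 0.0j, cmath.exp(-2j * math.pi * freq_hz * delay_s)]


def noise_covariance(a_n, noise_power, white_noise_var):
    """R_vv = noise_power * a_n a_n^H + white_noise_var * I, as a 2x2 nested list."""
    r = [[noise_power * a_n[i] * a_n[j].conjugate() for j in range(2)] for i in range(2)]
    r[0][0] += white_noise_var  # diagonal loading keeps the matrix invertible
    r[1][1] += white_noise_var
    return r


theta_noise = compensate_angle(theta_initial_deg=40.0, delta_yaw_deg=15.0)
a_n = steering_vector(2000.0, theta_noise)
r_vv = noise_covariance(a_n, noise_power=10.0, white_noise_var=0.01)
print(theta_noise, r_vv[0][0].real)
```

Without the loading term the rank-one matrix would be singular, so the inverse needed by the subsequent beamforming step would not exist.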
[0074] S512. The offline AI audio processing unit calculates the optimal frequency domain filtering weights based on the minimum variance distortionless response (MVDR) principle and applies them to the frequency domain signal to obtain the first noise-reduced frequency.
[0075] Specifically, to suppress strong high-frequency noise sources located at fixed positions in the external environment, the offline AI audio processing unit invokes its internal operational logic circuitry to execute the minimum variance distortionless response (MVDR) algorithm. This algorithm calculates the optimal frequency domain filtering weights $\mathbf{w}_{opt}$, which minimize the output interference power while maintaining a gain of 1 in the target sound source direction, i.e., no distortion. The calculation formula for the optimal frequency domain filtering weights is: $\mathbf{w}_{opt}(f) = \frac{\mathbf{R}_{vv}^{-1}(f)\, \mathbf{a}(f, \theta_{target})}{\mathbf{a}^H(f, \theta_{target})\, \mathbf{R}_{vv}^{-1}(f)\, \mathbf{a}(f, \theta_{target})}$, where $\mathbf{R}_{vv}^{-1}(f)$ is the inverse of the spatial noise covariance matrix and $\mathbf{a}^H(f, \theta_{target})$ is the conjugate transpose of the target sound source steering vector. The offline AI audio processing unit multiplies the calculated $\mathbf{w}_{opt}$ point by point with the frequency domain signal. Based on the principle of destructive interference of sound waves, a 180° phase reversal is applied to the noise signal from a specific direction, so that its peaks and troughs cancel out. This generates a deep null in real time in the direction of dynamically changing external spatial interference, that is, an acoustic blind zone with gain approaching zero is formed in this direction, thereby effectively filtering fixed-direction noise in the spatial domain and outputting an enhanced audio signal, which is the first noise-reduced frequency.
[0076] As an example, for sound from a target direction, such as human voice, the algorithm performs phase compensation on the signals of each microphone channel using optimal frequency domain filtering weights to maintain phase alignment. The superimposed signals are then enhanced, achieving distortion-free output. For noise from a fixed direction to be filtered out, the algorithm applies a 180° phase reversal to the signals of each channel based on the noise's incident angle, ensuring that the waveforms of the noise in different channels exhibit a precise correspondence between peaks and troughs. When these signals are superimposed within the offline AI audio processing unit, the noise in that direction cancels each other out due to destructive interference, bringing the energy close to zero. This creates an acoustic blind zone with extremely low gain at the corresponding spatial angle, known as a deep null. Simultaneously, the system dynamically updates the optimal frequency domain filtering weights based on the real-time relative incident angle detected by spatial attitude data, ensuring that this deep null is always aligned with the noise source at a fixed location in space, ultimately outputting the first noise-reduced frequency.
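The distortionless constraint and the deep null can be checked numerically with a two-microphone sketch. The steering vector convention (angle from broadside), the target direction of 0°, the noise direction of 60°, and the power values are all assumptions chosen for illustration.

```python
import cmath
import math


def steering_vector(freq_hz, theta_deg, d=0.03, v=340.0):
    """Two-element far-field steering vector (angle from broadside, assumed)."""
    delay_s = d * math.sin(math.radians(theta_deg)) / v
    return [1.0 + 0.0j, cmath.exp(-2j * math.pi * freq_hz * delay_s)]


def noise_covariance(a_n, noise_power, white_noise_var):
    """Rank-one noise model plus diagonal loading."""
    r = [[noise_power * a_n[i] * a_n[j].conjugate() for j in range(2)] for i in range(2)]
    r[0][0] += white_noise_var
    r[1][1] += white_noise_var
    return r


def mvdr_weights(r_vv, a_target):
    """w = R^-1 a / (a^H R^-1 a), using the closed-form 2x2 inverse."""
    (r00, r01), (r10, r11) = r_vv
    det = r00 * r11 - r01 * r10
    inv = [[r11 / det, -r01 / det], [-r10 / det, r00 / det]]
    numerator = [inv[i][0] * a_target[0] + inv[i][1] * a_target[1] for i in range(2)]
    denominator = sum(a_target[i].conjugate() * numerator[i] for i in range(2))
    return [x / denominator for x in numerator]


def beam_response(w, a):
    """Array response w^H a toward the direction with steering vector a."""
    return sum(w[i].conjugate() * a[i] for i in range(2))


a_target = steering_vector(2000.0, 0.0)   # assumed mouth direction
a_noise = steering_vector(2000.0, 60.0)   # assumed fixed machine direction
w = mvdr_weights(noise_covariance(a_noise, 10.0, 0.01), a_target)
print(abs(beam_response(w, a_target)), abs(beam_response(w, a_noise)))
```

The first printed magnitude stays at unity while the second is close to zero, which corresponds to the distortionless target gain and the deep null described above. Recomputing `a_noise` from the compensated incident angle on each frame keeps the null aligned as the head turns.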
[0077] In one embodiment, step S5, in which the offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data, and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask, further includes: S521. The offline AI audio processing unit divides the first noise-reduced frequency into a preset number of non-overlapping frequency bands and performs multi-band spectral subtraction to output the speech energy corresponding to multiple frequency bands. Specifically, the offline AI audio processing unit acquires the first noise-reduced frequency after spatial domain enhancement, divides it into N non-overlapping frequency bands in the frequency domain, and independently performs multi-band spectral subtraction for each frequency band. The calculation logic is as follows: $|\hat{S}_i|^2 = |Y_i|^2 - \alpha_i\, |\hat{N}_i|^2$, where $|\hat{S}_i|^2$ represents the estimated speech energy in the i-th frequency band; $|Y_i|^2$ represents the initial energy of the noisy signal in the i-th frequency band; $|\hat{N}_i|^2$ represents the noise energy estimated for the i-th frequency band; and $\alpha_i$ represents the preset over-subtraction factor for the i-th frequency band. Through this multi-band calculation, refined removal of steady-state residual noise in different frequency bands is achieved.
[0078] Since the first noise-reduced frequency output by spatial domain beamforming is already a complex signal in the frequency domain, the offline AI audio processing unit can directly extract the frequency domain amplitude spectrum of the first noise-reduced frequency. The magnitude of this amplitude spectrum is squared, and according to preset filter bank frequency band boundaries, such as Mel scale boundaries, the energy values of each discrete frequency point falling within the i-th frequency band are weighted and integrated or directly summed to obtain the initial energy $|Y_i|^2$ of the noisy signal in the current frame of that frequency band.
[0079] The offline AI audio processing unit internally runs voice activity detection (VAD) logic or a minimum statistics tracking algorithm. During the initial power-on phase of the smart safety helmet, or when the VAD logic determines the current frame to be a pure noise frame with no speech activity, the offline AI audio processing unit extracts the energy of the corresponding frequency band and calls a first-order exponential recursive smoothing formula, for example, a weighted average of the noise estimate from the previous frame and the measurement from the current frame, to update the noise baseline data in memory in real time. When a speech activity frame is detected, updating stops and the noise baseline data is latched; the latched value is used as the estimated noise energy $|\hat{N}_i|^2$ for the current i-th frequency band.
[0080] In order to achieve the optimal acoustic balance between filtering out steady-state noise and preserving speech harmonics, the over-subtraction factor $\alpha_i$ is not a global constant, but a preset adjustment parameter that differs between frequency bands. During the factory calibration stage of the smart safety helmet, based on prior statistics of the noise spectrum characteristics of fixed machinery in target work scenarios, such as mines and construction sites, the effective value range of the over-subtraction factor (e.g., $1 \le \alpha_i \le 5$) is preset. For frequency bands with consistently low signal-to-noise ratios (SNR), such as the low-frequency mechanical rumble band, a larger over-subtraction factor value (e.g., between 3 and 5) is burned into the read-only memory of the offline AI audio processing unit to over-subtract the steady-state noise in this frequency band, forcibly reducing the power spectral density of residual background noise. For frequency bands containing key speech formants and with high SNR, a smaller over-subtraction factor value within this range (i.e., close to 1) is burned in to prevent speech distortion. During real-time computation, the offline AI audio processing unit directly addresses and reads the over-subtraction factor $\alpha_i$ corresponding to the i-th frequency band from the read-only memory and substitutes it into the calculation.
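The per-band noise tracking and over-subtraction of step S521 can be sketched as two small functions. The smoothing coefficient is a hypothetical value, and clamping the result at a non-negative floor is an added assumption, since a subtraction result below zero has no meaning as an energy.

```python
def update_noise_estimate(prev_noise, frame_energy, is_speech, smoothing=0.9):
    """First-order recursive smoothing per band; latched while speech is active."""
    if is_speech:
        return list(prev_noise)  # keep the baseline unchanged during speech
    return [
        smoothing * noise + (1.0 - smoothing) * energy
        for noise, energy in zip(prev_noise, frame_energy)
    ]


def multiband_subtract(frame_energy, noise_energy, alphas, floor=0.0):
    """Per band: speech = max(noisy - alpha_i * noise, floor)."""
    return [
        max(noisy - alpha * noise, floor)
        for noisy, noise, alpha in zip(frame_energy, noise_energy, alphas)
    ]


# Band 0: low-frequency rumble (large factor). Band 1: speech formants (factor near 1).
alphas = [4.0, 1.0]
noise = update_noise_estimate([2.0, 1.0], [2.0, 1.0], is_speech=False)
speech = multiband_subtract([10.0, 4.0], noise, alphas)
print(speech)
```

With these example numbers the rumble band loses four times its noise estimate while the formant band loses only one, which reflects the band-dependent trade-off between noise removal and speech distortion.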
[0081] S522. The offline AI audio processing unit reconstructs the frequency domain signal from the speech energy of the multiple frequency bands and inputs it into the normalized least mean square (NLMS) filtering algorithm. At the same time, it extracts in real time the energy envelope gradient difference between adjacent processing frames of the time series in a preset low-frequency band. When the energy envelope gradient difference is greater than the preset transient change threshold, it forcibly attenuates the step size factor in the NLMS filtering algorithm to suppress weight divergence.
[0082] Specifically, after performing multi-band spectral subtraction, the offline AI audio processing unit takes the square root of the estimated speech energy for each frequency band and, combined with the original phase information of the noisy signal in the corresponding frequency band before multi-band spectral subtraction, reconstructs the frequency domain signal with steady-state noise removed. To eliminate non-stationary residual noise that multi-band spectral subtraction fails to handle, such as transient impact signals like metallic knocking sounds in industrial settings, the offline AI audio processing unit takes the reconstructed frequency domain signal as input frame by frame in time order and applies the normalized least mean square (NLMS) filtering algorithm for iterative frame-by-frame calculation. The filter weight update formula is: $\mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\mu}{\|\mathbf{x}(n)\|^2 + \varepsilon}\, e(n)\, \mathbf{x}(n)$, where $n$ represents the current frame number in the time series; $\mathbf{w}(n+1)$ and $\mathbf{w}(n)$ represent the filter weight vectors after the update (i.e., for frame n+1) and before the update (i.e., for the current frame n), respectively, with $\mathbf{w}(n)$ initialized to a zero vector when the smart safety helmet is initially powered on; $\mu$ represents the step size factor that controls the convergence speed and stability of the algorithm, whose original setting value is the steady-state convergence constant burned into the internal read-only memory during factory calibration of the smart safety helmet; $\mathbf{x}(n)$ represents the input signal vector of the current frame n, i.e., the current processing frame of the reconstructed frequency domain signal; $e(n)$ represents the error signal of the current frame n, which is the difference between the set expected target signal and the actual output signal of the filter; and $\varepsilon$ is a small positive constant that prevents the denominator from being zero.
[0083] Since the aforementioned conversion from time-domain frames to frequency-domain signals is performed continuously frame by frame, the subsequent frequency-domain filtering process also maintains this time-series characteristic. During frame-by-frame processing, the offline AI audio processing unit calculates in real time, for the current processing frame (i.e., frame t) and for the previous processing frame (i.e., frame t-1) separately, the summed energy of the frequency-domain signal within a preset low-frequency band, such as 100 Hz to 500 Hz. It then calculates the absolute value of the difference between these two low-frequency energy sums and uses this absolute value as the energy envelope gradient difference.
[0084] When the energy envelope gradient difference is detected to be greater than the preset transient change threshold, it is determined that an external transient noise impact has been encountered.
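The transient test in the two preceding paragraphs can be sketched as below. This is a minimal sketch under stated assumptions: the spectrum is given as a list of FFT bin values, and the names `low_band_energy` and `is_transient`, the sample rate, the FFT size, and the threshold are all hypothetical; only the 100 Hz to 500 Hz band comes from the text.

```python
def low_band_energy(spectrum, sample_rate, fft_size, f_lo=100.0, f_hi=500.0):
    """Sum of squared magnitudes of the bins that fall inside [f_lo, f_hi]."""
    bin_hz = sample_rate / fft_size
    return sum(abs(c) ** 2 for k, c in enumerate(spectrum)
               if f_lo <= k * bin_hz <= f_hi)


def is_transient(prev_spectrum, cur_spectrum, sample_rate, fft_size, threshold):
    """True when the energy envelope gradient difference exceeds the threshold."""
    diff = abs(low_band_energy(cur_spectrum, sample_rate, fft_size)
               - low_band_energy(prev_spectrum, sample_rate, fft_size))
    return diff > threshold


# Usage: a quiet frame followed by a frame with three times the amplitude.
quiet = [1.0] * 257
loud = [3.0] * 257
hit = is_transient(quiet, loud, sample_rate=16000, fft_size=512, threshold=50.0)
```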
[0085] Traditional adaptive filtering algorithms, such as the NLMS algorithm, are prone to overlearning transient features when encountering occasional transient impact noise in industrial environments, such as metal impacts. This can lead to filter weight divergence, causing harmonic components of valid speech to be mistakenly canceled out as noise. If this application does not process transient impact noise, the Normalized Least Mean Square (NLMS) filtering algorithm will continue to call the step size factor of the original setting to update the filter weight vector. This will cause the filter to overlearn the transient noise features, resulting in the filter weight vector deviating from the steady-state optimal solution and exhibiting abnormal divergence, i.e., mistakenly treating speech formants as noise and canceling them out. Therefore, the offline AI audio processing unit triggers an algorithm protection mechanism at this time, forcibly attenuating the step size factor μ in the above formula to a preset multiple of the original setting value, such as 0.1 times. By reducing the iteration step size of the filter weight vector at the computational level, the abnormal divergence of the weights is effectively suppressed, preventing the offline AI audio processing unit from erroneously converging under strong transient noise interference.
[0086] Then, the counter inside the offline AI audio processing unit continuously monitors the energy envelope gradient difference of subsequent frames. When it is detected that the energy envelope gradient difference of consecutive preset frame numbers, such as 25 consecutive frames, is not greater than a preset transient change threshold, it is determined that the external transient noise impact has ended. At this time, the offline AI audio processing unit controls the step size factor μ to smoothly recover to the original set value according to a preset exponential function curve.
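The protection and recovery behaviour of the step size factor can be sketched as a small controller. The 0.1 attenuation and the 25 quiet frames follow the examples in the text; the class name `StepSizeController`, the time constant `tau`, and the particular exponential curve are assumptions, since the application only states that a preset exponential function is used.

```python
import math


class StepSizeController:
    """Attenuates mu on a transient and restores it exponentially afterwards."""

    def __init__(self, mu0, attenuation=0.1, quiet_frames=25, tau=10.0):
        self.mu0 = mu0                  # original set value of the step size
        self.attenuation = attenuation  # preset multiple applied under protection
        self.quiet_frames = quiet_frames
        self.tau = tau                  # hypothetical recovery constant, in frames
        self.mu = mu0
        self.protected = False
        self.quiet_count = 0
        self.recover_k = 0

    def update(self, gradient_diff, threshold):
        if gradient_diff > threshold:
            # Transient impact: force the step size down and restart counting.
            self.protected = True
            self.quiet_count = 0
            self.mu = self.mu0 * self.attenuation
        elif self.protected:
            self.quiet_count += 1
            if self.quiet_count >= self.quiet_frames:
                # Enough consecutive quiet frames: the impact has ended.
                self.protected = False
                self.recover_k = 0
        if not self.protected and self.mu < self.mu0:
            # Exponential approach from the attenuated value back to mu0.
            self.recover_k += 1
            floor = self.mu0 * self.attenuation
            self.mu = self.mu0 - (self.mu0 - floor) * math.exp(-self.recover_k / self.tau)
        return self.mu
```

A smooth recovery avoids a sudden jump in adaptation speed at the moment protection is released.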
[0087] Since the system performs real-time streaming processing on external audio data, during frame-by-frame processing, for each current processing frame, after completing the iterative update of the normalized least mean square filtering algorithm for that frame, the multiplier-accumulator inside the offline AI audio processing unit performs an inner product operation with the currently updated filter weight vector w(n) and the reconstructed frequency domain signal of that frame, i.e., the aforementioned input signal vector x(n), to obtain the actual output signal of the filter for the current processing frame.
[0088] In one embodiment, the step of inputting the actual output signal of the filter frame by frame into a pre-stored offline regression denoising model to generate a feature soft mask includes: S531. The offline AI audio processing unit extracts the Mel frequency cepstral coefficients and their time difference of the actual output signal of the filter in the current processing frame, and constructs an input feature vector in the form of a one-dimensional tensor.
[0089] Specifically, following the streaming output of the pre-processed frequency domain adaptive filtering, the offline AI audio processing unit calls the Mel frequency cepstral coefficient extraction algorithm to extract acoustic features for the actual output signal of the filter in the current processing frame: First, pre-emphasis and windowing (such as Hamming window) are performed on the actual output signal of the filter in the current processing frame, and then a fast Fourier transform is performed to calculate the frequency domain power spectrum; further, the frequency domain power spectrum is passed through a Mel filter bank containing several triangular filters to calculate the logarithmic energy in each filter band; finally, a discrete cosine transform is performed on the logarithmic energy to accurately extract the 40-dimensional Mel frequency cepstral coefficients.
[0090] To capture how the filter's actual output signal evolves over time, the offline AI audio processing unit uses a preset difference equation to calculate time differences. Centered on the current processing frame, it takes the 40-dimensional Mel-frequency cepstral coefficients of a preset number of adjacent frames (e.g., the two preceding and two following frames in the time series) as input variables. Through polynomial fitting or direct difference operations, it calculates an approximate local first-order derivative of the feature parameters over time, thus obtaining the first-order time difference corresponding to the current processing frame. Further, using the calculated first-order time differences of multiple adjacent frames as input, it applies the same difference calculation again to obtain the rate of change of the first-order difference, i.e., the second-order time difference corresponding to the current processing frame. Finally, the offline AI audio processing unit concatenates the extracted 40-dimensional Mel-frequency cepstral coefficients, the first-order time difference, and the second-order time difference in a fixed feature dimension order, constructing a one-dimensional tensor that meets the input dimension requirements of the neural network. This one-dimensional tensor is fed into the subsequent network as the input feature vector V(n) of the current processing frame, where n represents the current frame number in the time series.
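One common way to realize such a difference equation is the regression formula widely used for delta features, sketched below. This is an assumption for illustration: the application does not specify the exact equation, and the function names `delta` and `build_feature_vectors` as well as the edge handling (repeating the first and last frames) are hypothetical.

```python
def delta(frames, n=2):
    """Regression-based time difference over n preceding and n following frames."""
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for j in range(len(frames[0])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, last)][j]   # clamp at the last frame
                behind = frames[max(t - k, 0)][j]     # clamp at the first frame
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def build_feature_vectors(mfcc_frames):
    """Concatenate static, first-order, and second-order features per frame."""
    d1 = delta(mfcc_frames)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(mfcc_frames, d1, d2)]
```

With 40 coefficients per frame, each resulting vector V(n) has 120 elements under this scheme.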
[0091] S532. The offline AI audio processing unit feeds the input feature vector into a one-dimensional convolutional neural network that has been quantized and reduced in order in advance and burned into the internal flash memory, and outputs the feature soft mask through cascaded convolutional kernels of specific sizes and a terminal activation function.
[0092] As mentioned above, to meet the stringent constraints of deploying on smart safety helmets with extremely low computing power, the offline regression denoising model is a lightweight one-dimensional convolutional neural network pre-processed with uniform quantization mapping. After the model is trained offline, all weight parameters and biases are quantized from 32-bit floating-point numbers to 8-bit integers, compressing the overall model size to less than 47.5 KB, and the model is statically burned into the flash memory inside the offline AI audio processing unit, for example a digital signal processor (DSP).
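The uniform quantization mapping from floating-point weights to 8-bit integers can be sketched as follows. The symmetric per-tensor scheme and the names `quantize_int8` and `dequantize` are assumptions; the application only states that uniform quantization to 8-bit integers is used.

```python
def quantize_int8(weights):
    """Symmetric uniform quantization of float weights to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return [v * scale for v in q]


# Usage: each weight now needs one byte instead of four.
q, scale = quantize_int8([-1.0, 0.0, 0.5, 1.0])
restored = dequantize(q, scale)
```

At one byte per parameter, and if KB is taken as 1024 bytes, a 47.5 KB budget corresponds to at most about 48,640 stored parameters, before any storage overhead.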
[0093] In the online processing stage of the current frame, the offline regression denoising model uses a three-layer cascaded one-dimensional convolutional network topology. The offline AI audio processing unit feeds the input feature vector V(n) into the network. Its first one-dimensional convolutional layer uses a kernel of size 9 to capture wide receptive field features, and the subsequent two one-dimensional convolutional layers use kernels of size 3 for deep feature extraction. Batch normalization is performed after each convolutional layer. At the network output, based on the convergence result of fitting the continuous mask target using the mean square error loss function on the offline training platform, the system calls the sigmoid activation function to perform nonlinear mapping on the linear feature values output by the last one-dimensional convolutional layer, directly generating an acoustic feature soft mask for the current frame that is completely consistent with the dimension of the input feature vector. During the continuous streaming processing of the actual output signal of the filter, the acoustic feature soft masks of each frame, continuously output frame by frame, together constitute the feature soft mask in the time series.
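The forward pass of this topology can be sketched in a heavily simplified form. The sketch assumes a single channel per layer, zero padding that preserves the vector length, and batch normalization with stored statistics; the channel counts and any intermediate activation functions are not stated in this application and are left out. All function names are hypothetical.

```python
import math


def conv1d_same(x, kernel, bias=0.0):
    """Single-channel 1D convolution with zero padding; output length equals input."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(x))]


def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-time batch normalization using stored mean and variance."""
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


def soft_mask(v, layers):
    """Run the cascaded layers, then map each value into (0, 1) with a sigmoid."""
    h = list(v)
    for kernel, bias, mean, var in layers:
        h = batch_norm(conv1d_same(h, kernel, bias), mean, var)
    return [1.0 / (1.0 + math.exp(-z)) for z in h]


# Usage: kernel sizes 9, 3, 3 as in the text; the weights here are placeholders.
layers = [
    ([0.0] * 4 + [1.0] + [0.0] * 4, 0.0, 0.0, 1.0),
    ([0.0, 1.0, 0.0], 0.0, 0.0, 1.0),
    ([0.0, 1.0, 0.0], 0.0, 0.0, 1.0),
]
mask = soft_mask([0.0] * 120, layers)
```

Because the sigmoid output lies strictly between 0 and 1, each mask element acts as a per-dimension attenuation weight of the same length as V(n).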
[0094] In one embodiment, step S6, which involves refining the feature soft mask to obtain denoised features and performing local offline speech recognition on the denoised features to obtain target speech features, includes: S61. The offline AI audio processing unit performs element-wise multiplication of the feature soft mask with the noisy input feature vector to achieve feature purification and output denoised features.
[0095] Specifically, after generating the acoustic feature soft mask for the current processing frame, the offline AI audio processing unit performs element-wise multiplication of the predicted output acoustic feature soft mask for the current processing frame with the noisy MFCC matrix before being fed into the network, i.e., the input feature vector V(n). Through the probability weight constraints provided by the acoustic feature soft mask of the current processing frame, non-stationary noise components in the current feature vector corresponding to the acoustic feature soft mask of the current processing frame can be accurately suppressed, thereby extracting the denoised acoustic features of the current processing frame. During continuous frame-by-frame inference, the continuously extracted frame-level denoised acoustic features are cached in the underlying circular buffer and concatenated to form a complete denoised acoustic feature sequence.
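The feature purification step and the frame cache can be sketched as below. `collections.deque` with `maxlen` stands in for the underlying circular buffer; the names `apply_soft_mask` and `FeatureRingBuffer` and the buffer capacity are hypothetical.

```python
from collections import deque


def apply_soft_mask(features, mask):
    """Element-wise product of the noisy feature vector and its soft mask."""
    if len(features) != len(mask):
        raise ValueError("feature vector and mask must have the same length")
    return [f * m for f, m in zip(features, mask)]


class FeatureRingBuffer:
    """Fixed-capacity cache of denoised frames; the oldest frame is overwritten."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        self._frames.append(frame)

    def sequence(self):
        """Frames in time order, ready to be passed to the recognizer."""
        return list(self._frames)


# Usage: mask one frame and cache it.
buffer = FeatureRingBuffer(capacity=2)
buffer.push(apply_soft_mask([2.0, 4.0], [0.5, 0.25]))
```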
[0096] S62. The offline AI audio processing unit inputs the denoising feature sequence into the local offline speech recognition engine for decoding and mapping to obtain the target speech features.
[0097] As described above, for the business requirements of smart safety helmet operation scenarios without network dependence and specific control commands, such as "turn on the searchlight" and "call the front desk," the offline AI audio processing unit has a lightweight offline speech recognition engine pre-programmed into it. The offline AI audio processing unit extracts the concatenated denoised acoustic feature sequence from the circular buffer and feeds it into the acoustic model of the local offline speech recognition engine.
[0098] In the decoding network of this engine, the offline AI audio processing unit calculates the feature distribution distance between the denoised acoustic feature sequence and each phoneme or word template in a preset local restricted vocabulary, and performs Viterbi decoding search in conjunction with the state transition matrix. After joint probability comparison between the acoustic layer and the language layer, the decoding network finally locks the semantic feature vector with the highest global matching confidence, and uses it as the identified target speech feature. Finally, the system directly maps this target speech feature to a specific hardware execution level or digital signal, as an effective offline voice control command to trigger the subsequent underlying control logic of the smart safety helmet.
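The Viterbi search mentioned above can be sketched in the log domain as follows. This is a generic sketch: it assumes the per-frame template distances have already been converted into log emission scores, and the function name `viterbi` and its arguments are hypothetical rather than taken from this application.

```python
def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state path and its log score."""
    n_states = len(log_init)
    score = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    back = []
    for obs in log_emissions[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            # Best predecessor for state s under the state transition matrix.
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + obs[s])
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

Working with log scores turns products of probabilities into sums, which avoids numerical underflow on long feature sequences.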
[0099] It should be noted that when in state 3, the main control unit determines within a preset time window (e.g., 5 seconds) whether it has received the target speech features parsed by the offline AI audio processing unit. If it has not received them, it immediately cancels the audio channel activation permission flag, and the power management unit then cuts off the power supply circuit of the audio acquisition unit and the offline AI audio processing unit, and the state machine automatically switches to state 1. Or, when in state 3, the posture detection unit continues to run in the background. If the operator makes a violent movement, causing the real-time calculated motion variance to exceed the preset motion variance threshold again, the main control unit triggers a posture blocking command, immediately cancels the audio channel activation permission flag, and the power management unit then cuts off the power supply circuit of the audio acquisition unit and the offline AI audio processing unit, and the state machine automatically switches to state 1. Alternatively, when in state 3, if the state detection unit detects that the physical obstruction has disappeared, the main control unit will also cancel the audio channel activation permission flag, and the power management unit will then cut off the power supply circuit of the audio acquisition unit and the offline AI audio processing unit, and the state machine will automatically switch to state 0.
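The three exit conditions from state 3 can be summarized in a small transition function. The function name and argument names are hypothetical, and the order in which the conditions are checked is an assumption, since the application does not state a priority among them.

```python
def next_state_from_acoustic(worn, motion_variance, variance_threshold,
                             seconds_without_command, timeout_s=5.0):
    """Return the next state (0, 1, or 3) while in state 3."""
    if not worn:
        return 0   # physical obstruction disappeared: back to deep sleep
    if motion_variance > variance_threshold:
        return 1   # violent movement: posture blocking, audio powered off
    if seconds_without_command >= timeout_s:
        return 1   # no target speech features within the time window
    return 3       # remain in the acoustic working state


# Usage: helmet worn and steady, but no command for 6 seconds.
state = next_state_from_acoustic(True, 0.001, 0.01, 6.0)
```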
[0100] As an example, consider a high-noise operation scenario in a deep mine pump room without network coverage. The test environment included a steady-state high-noise source (a stationary water pump generating continuous high-frequency mechanical noise) and occasional non-stationary transient impact noise, such as falling metal tools. Before the worker put on the smart safety helmet, the system was in state 0 (deep sleep state) of the finite state machine. In this state, the power management unit completely cut off the power supply to the main control unit and the offline AI audio processing unit at the circuit level, and only a low-frequency 1 Hz polling of the state detection unit was maintained. The measured static power consumption of the entire device was held below 50 μA. When the worker put on the safety helmet, the state detection unit detected that the infrared reflection signal exceeded the physical obstruction threshold, immediately flipped its output level, and sent a hardware interrupt signal to the external interrupt pin of the main control unit. The main control unit was woken up at once and sent a working clock signal to the attitude detection unit, and the system transitioned to state 1 (physical wake-up state). At this point, the audio-related units remained powered off to prevent false triggering by high-frequency noise.
[0101] Subsequently, during the operator's movement or preparation phase, the system enters state 2 (attitude verification state). The attitude detection unit continuously outputs a quaternion attitude matrix containing pitch and yaw angles at a frequency of 50Hz. The main control unit continuously samples and calculates the motion variance of angular velocity within a 1.0-second sliding time window. If the variance exceeds the standard due to personnel movement or bumps, the system determines that the interaction conditions have not been met and maintains audio power-off. When the operator arrives at the workstation and steadily looks ahead for more than 1 second, their pitch and yaw angles enter the preset eye-level working range, and the long-window motion variance is lower than the preset threshold. At this time, the main control unit confirms that the stable interaction conditions are met, issues a unique audio channel activation permission flag, and the power management unit closes the audio power supply circuit accordingly. The system then officially enters state 3 (acoustic working state) with the highest authority.
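The stability decision in state 2 can be sketched as a sliding-window gate. The 50-sample window follows from the 50 Hz output rate and the 1.0-second window in the text; the class name `StabilityGate`, the variance threshold, and the pitch and yaw ranges are hypothetical placeholders.

```python
from collections import deque
from statistics import pvariance


class StabilityGate:
    """Decides whether the helmet has been steady and level for a full window."""

    def __init__(self, window=50, var_threshold=0.01,
                 pitch_range=(-15.0, 15.0), yaw_range=(-20.0, 20.0)):
        self.window = window
        self.var_threshold = var_threshold
        self.pitch_range = pitch_range
        self.yaw_range = yaw_range
        self.samples = deque(maxlen=window)

    def push(self, angular_rate, pitch_deg, yaw_deg):
        self.samples.append((angular_rate, pitch_deg, yaw_deg))
        return self.is_stable()

    def is_stable(self):
        if len(self.samples) < self.window:
            return False   # not enough history for a full window yet
        in_range = all(
            self.pitch_range[0] <= p <= self.pitch_range[1]
            and self.yaw_range[0] <= y <= self.yaw_range[1]
            for _, p, y in self.samples
        )
        variance = pvariance([r for r, _, _ in self.samples])
        return in_range and variance < self.var_threshold
```

Only when `push` returns true would the main control unit issue the audio channel activation permission flag.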
[0102] During the full-time verification in State 3, operators issued voice control commands. The sound was picked up by a dual-microphone array with a physical geometric center-to-center distance of 3cm. This specific distance precisely met the anti-spatial aliasing half-wavelength constraint up to 5.6kHz, effectively avoiding the grating lobe effect caused by high-frequency mechanical noise in this frequency band. The acquired environmental audio data, after being converted into a digital sequence, did not flow through the main control unit but was directly stored in the circular buffer of the offline AI audio processing unit via internal audio buses such as I2S. For the continuously roaring water pump to the left rear, the offline AI audio processing unit had already locked its initial spatial azimuth during the initial calibration phase. When the operator's head turns during maintenance, the system extracts the yaw angle change output in real time from the attitude detection unit, calls the inverse angle compensation logic to dynamically update the spatial noise covariance matrix, and recalculates the optimal frequency domain filter weights to ensure that the spatial null generated by the system is always accurately aligned with the externally fixed water pump. Under dynamic operation, compared with conventional algorithms that do not incorporate attitude compensation, this application can continuously and significantly suppress interference energy from the direction of a fixed noise source, greatly improving the signal-to-noise ratio and clarity of the picked-up target speech features.
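The spacing constraint and the inverse angle compensation can be checked with a short sketch. The speed of sound (343 m/s), the sign conventions for azimuth and yaw, and all function names are assumptions made for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def max_alias_free_frequency(spacing_m, c=SPEED_OF_SOUND):
    """Highest frequency satisfying the half-wavelength constraint d <= lambda / 2."""
    return c / (2.0 * spacing_m)


def compensated_azimuth(initial_azimuth_deg, yaw_change_deg):
    """Noise source azimuth in the helmet frame after the head turns, in [-180, 180)."""
    return (initial_azimuth_deg - yaw_change_deg + 180.0) % 360.0 - 180.0


def steering_vector(azimuth_deg, freq_hz, spacing_m, c=SPEED_OF_SOUND):
    """Two-microphone steering vector; azimuth measured from array broadside."""
    delay = spacing_m * math.sin(math.radians(azimuth_deg)) / c
    return [1.0 + 0.0j, cmath.exp(-2j * math.pi * freq_hz * delay)]


# Usage: 3 cm spacing; pump locked at -135 degrees, head then turns by 30 degrees.
f_max = max_alias_free_frequency(0.03)
pump_now = compensated_azimuth(-135.0, 30.0)
noise_vec = steering_vector(pump_now, 1000.0, 0.03)
```

Under the assumed speed of sound, a 3 cm spacing gives a limit of about 5.7 kHz, which is consistent with the 5.6 kHz figure quoted above. The recomputed steering vector is what would feed the update of the spatial noise covariance matrix.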
[0103] In processing the aforementioned steady-state noise, for transient impact noise such as sudden metal tool drops in the environment, the energy envelope gradient difference between adjacent processing frames in the time series at a preset low frequency band is extracted in real time. When the energy envelope gradient difference exceeds a preset transient change threshold, the system instantly triggers a damping protection mechanism, forcibly attenuating the step size factor of the normalized least mean square filtering algorithm to 0.1 times its original set value. This damping action can suppress abnormal divergence of filter weights and protect the high-frequency harmonic resonance peaks of the target speech from distortion. After the energy envelope gradient difference of 25 consecutive frames is no greater than the preset transient change threshold, the step size factor is smoothly restored to the set initial value according to a preset exponential function. Finally, the filtered and extracted input feature vector is input into a 1D-CNN offline regression denoising model compressed to within 47.5KB, and the denoised feature sequence purified by feature soft masking is sent to the local lightweight speech recognition engine. In a completely offline state, the system accurately identifies the voice control command through Viterbi decoding. The main control unit then packages it into a device control message and broadcasts it to the controlled equipment in the pump room using local wireless communication technology. The controlled equipment executes the corresponding control command according to the device control message, thereby completing low-latency end-to-end device linkage control in a highly complex overlapping noise environment.
[0104] It is understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0105] It should be noted that the offline AI audio processing unit described in this application can be any hardware entity with digital signal processing capabilities, including but not limited to digital signal processors (DSPs), microcontrollers (MCUs), neural network processors (NPUs), or general-purpose central processing units (CPUs). The term "offline AI" is used only to highlight its ability to run lightweight acoustic models locally and should not be construed as limiting its specific hardware form.
[0106] Based on the same inventive concept, this application also provides a solution for implementing the aforementioned smart safety helmet. The solution provided by this smart safety helmet is similar to the solution described in the above method; therefore, the specific limitations in one or more system embodiments provided below can be found in the limitations of the method described above, and will not be repeated here.
[0107] In one exemplary embodiment, as shown in Figure 2, a smart safety helmet is provided, including: a status detection unit 1, an attitude detection unit 2, a main control unit 3, an audio acquisition unit 4, and an offline AI audio processing unit 5.
[0108] Referring to Figure 3, this application also provides a computer device. The computer device 90 may include a processor 91, a memory 92, and a computer application, wherein the memory 92 is used to store the computer application and may also be flash memory. The computer application is, for example, an application that implements the method embodiments described above.
[0109] Processor 91 is configured to execute the computer application stored in the memory to implement the steps in the various method embodiments described above. For details, please refer to the relevant descriptions in the preceding method embodiments.
[0110] Optionally, the memory 92 can be either standalone or integrated with the processor 91.
[0111] When the memory 92 is a device independent of the processor 91, the computer device 90 may further include a bus 93 for connecting the memory 92 and the processor 91.
[0112] Referring to Figure 4, this application also provides a readable storage medium storing a computer application program which, when executed by a processor, implements the methods of the above-described method embodiments.
[0113] The readable storage medium can be a computer storage medium or a communication medium. A communication medium includes any medium that facilitates the transfer of computer programs from one location to another. A computer storage medium can be any available medium accessible to a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to a processor, enabling the processor to read information from and write information to the readable storage medium. Of course, the readable storage medium can also be a component of the processor. The processor and the readable storage medium can be located in an Application-Specific Integrated Circuit (ASIC). In addition, the ASIC can reside within the user equipment. Alternatively, the processor and the readable storage medium can also exist as discrete components within the communication device.
[0114] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.
[0115] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein, and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. An offline AI voice control method based on a smart safety helmet, characterized in that, Applied to smart safety helmets, the smart safety helmet is equipped with a status detection unit, an attitude detection unit, a main control unit, an audio acquisition unit, and an offline AI audio processing unit; the method includes: The status detection unit detects in real time whether the smart safety helmet is being worn. If it is being worn, it triggers a wake-up command to the main control unit so that the main control unit is woken up. The main control unit wakes up the attitude detection unit, and the attitude detection unit detects the spatial attitude data of the smart helmet in real time and sends it to the main control unit. The main control unit determines whether the smart helmet is in a stable interactive state based on the spatial attitude data. If it is in a stable interactive state, it generates an audio channel activation permission flag. In response to the audio channel activation permission flag, the main control unit controls the audio acquisition unit and the offline AI audio processing unit to power on and run, acquires environmental audio data through the audio acquisition unit, and sends the environmental audio data to the offline AI audio processing unit for caching; The offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data, and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask. The feature soft mask is purified to obtain denoised features, and the denoised features are subjected to local offline speech recognition to obtain target speech features. The offline AI audio processing unit sends the target speech features to the main control unit. 
The main control unit, based on local wireless communication technology, sends device control messages to external devices corresponding to the target voice features according to the target voice features, so that the external devices can execute corresponding control commands according to the device control messages.
2. The offline AI voice control method based on a smart safety helmet according to claim 1, characterized in that, The smart safety helmet also includes a power management unit. The status detection unit detects in real time whether the smart safety helmet is being worn. If it is being worn, it triggers a wake-up command to the main control unit to wake up the main control unit, including: After the smart safety helmet is powered on, the power management unit disconnects the power supply circuit of the audio acquisition unit and the offline AI audio processing unit, and controls the main control unit to enter standby mode, while providing low-frequency drive power to the status detection unit. The state detection unit periodically detects whether there is a physical obstruction inside the smart safety helmet based on the low-frequency driving power supply. If there is a physical obstruction, it determines that the smart safety helmet is in the wearing state. When it is determined that the device is in a wearing state, the output circuit of the state detection unit generates an output level flip and generates a wake-up command based on the output level flip, so as to wake up the main control unit.
3. The offline AI voice control method based on a smart safety helmet according to claim 1, characterized in that, The audio acquisition unit includes a dual-microphone array, and the physical geometric center distance between the two microphones of the dual-microphone array is limited to a preset distance; The preset distance satisfies the anti-spatial aliasing half-wavelength constraint for the highest operating frequency band of industrial high-frequency noise.
4. The offline AI voice control method based on a smart safety helmet according to claim 1, characterized in that, The main control unit determines whether the smart helmet is in a stable interactive state based on the spatial attitude data. If it is in a stable interactive state, it generates an audio channel activation permission flag, including: The main control unit acquires the spatial attitude data within a preset time window, wherein the spatial attitude data includes a sequence of quaternion matrices containing pitch and yaw angle information; The spatial attitude data within the preset time window is range-checked to detect whether the smart safety helmet is within the preset head-up working range, and the motion variance of the spatial attitude data within the preset time window is calculated. If the spatial posture data is within the preset eye-level working range and the motion variance is less than the preset motion variance threshold, the smart safety helmet is determined to be in the stable interactive state, and the audio channel activation permission flag is generated.
5. The offline AI voice control method based on a smart safety helmet according to claim 1, characterized in that, In response to the audio channel activation permission flag, the main control unit controls the audio acquisition unit and the offline AI audio processing unit to power on and operate. The audio acquisition unit acquires ambient audio data and sends the ambient audio data to the offline AI audio processing unit for caching, including: In response to the audio channel activation permission flag, the main control unit controls the audio acquisition unit and the offline AI audio processing unit to power on and operate. The audio acquisition unit acquires external environmental audio data and uses the analog-to-digital converter inside the offline AI audio processing unit to convert the environmental audio data into a discrete digital sequence. By using the integrated circuit's built-in audio bus, the discrete digital sequence of environmental audio data is directly written into the circular buffer inside the offline AI audio processing unit for caching based on direct memory access technology.
6. The offline AI voice control method based on a smart safety helmet according to claim 1, characterized in that, The offline AI audio processing unit performs spatial domain beamforming and frequency domain adaptive filtering on the environmental audio data, and inputs the processed environmental audio data into a pre-stored offline regression noise reduction model to generate a feature soft mask, including: The offline AI audio processing unit reads the environmental audio data in the form of discrete digital sequences and performs a fast Fourier transform to convert the environmental audio data from time-domain frames into frequency-domain signals. Obtain a preset fixed target sound source steering vector, wherein the target sound source steering vector is constructed based on a specific frequency or frequency band of the currently processed frequency domain signal and a fixed incident angle of the target sound source relative to the audio acquisition unit; The phase difference of the frequency domain signal is calculated based on the direction-of-arrival estimation algorithm to lock the fixed noise source and the corresponding initial spatial azimuth angle. Extract the yaw angle change from the spatial attitude data, call the preset rigid body coordinate system rotation matrix operation logic, use the yaw angle change to perform inverse angle compensation on the initial spatial azimuth angle, and calculate the real-time relative incident angle of the fixed noise source in the smart safety helmet follow-up coordinate system. The corresponding noise space steering vector is calculated based on the real-time relative incident angle, and the space noise covariance matrix is calculated based on the noise space steering vector. 
The offline AI audio processing unit, based on the principle of minimum variance distortion-free response, calculates the optimal frequency domain filtering weights according to the spatial noise covariance matrix and the target sound source steering vector, and applies them to the frequency domain signal to obtain the first noise-reduced frequency. After performing frequency domain adaptive filtering on the first noise-reduced frequency, it is input into a pre-stored offline regression noise reduction model to generate a feature soft mask.
7. The offline AI voice control method based on a smart safety helmet according to claim 6, characterized in that the step of performing frequency domain adaptive filtering on the first noise-reduced frequency and then inputting it into a pre-stored offline regression noise reduction model to generate a feature soft mask includes:
the offline AI audio processing unit divides the first noise-reduced frequency into a preset number of non-overlapping frequency bands, and performs multi-band spectral subtraction on the preset number of non-overlapping frequency bands to output the speech energy corresponding to each of the frequency bands;
the speech energy of the multiple frequency bands is combined with the original phase information of the noisy signal in the corresponding frequency band to reconstruct a frequency domain signal, and the reconstructed frequency domain signal is input frame by frame, in time order, into a normalized least mean square (NLMS) filtering algorithm;
simultaneously, the energy envelope gradient difference between the current processing frame and the previous processing frame in a preset low-frequency band of the reconstructed frequency domain signal is extracted in real time;
when the energy envelope gradient difference is greater than a preset transient change threshold, the step size factor in the normalized least mean square filtering algorithm is forcibly decayed to suppress weight divergence;
when the energy envelope gradient difference remains not greater than the transient change threshold for a preset number of consecutive frames, the step size factor is controlled to recover smoothly to its preset original value according to a preset exponential function;
after the normalized least mean square filtering algorithm has been iteratively updated for the current processing frame, an inner product operation is performed between the updated filter weight vector and the reconstructed frequency domain signal to obtain the actual output signal of the filter for the current processing frame;
the actual output signal of the filter is input frame by frame into the pre-stored offline regression denoising model to generate the feature soft mask.
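As a non-limiting illustration of the step size control of claim 7, the following minimal sketch decays the NLMS step size when the low-band energy jumps and lets it recover exponentially after a run of calm frames. Several points are assumptions not stated in the application: the gradient difference is taken as the absolute frame-to-frame change in low-band energy, the desired (reference) signal of the filter is supplied by the caller, and all names and default values are hypothetical.

```python
import math


class StepSizeController:
    """Decay the NLMS step size on low-band transients, then recover it exponentially."""

    def __init__(self, mu0=0.5, decay=0.1, threshold=6.0, hold_frames=3, recovery_rate=0.5):
        self.mu0 = mu0                      # preset original step size
        self.decay = decay                  # forced decay factor (hypothetical value)
        self.threshold = threshold          # transient change threshold
        self.hold_frames = hold_frames      # calm frames required before recovery
        self.recovery_rate = recovery_rate  # rate of the exponential recovery
        self.mu = mu0
        self.prev_energy = None
        self.calm_frames = 0

    def update(self, low_band_energy):
        # Assumed definition: absolute change of low-band energy between frames.
        grad = 0.0 if self.prev_energy is None else abs(low_band_energy - self.prev_energy)
        self.prev_energy = low_band_energy
        if grad > self.threshold:
            self.mu = self.mu0 * self.decay
            self.calm_frames = 0
        else:
            self.calm_frames += 1
            if self.calm_frames >= self.hold_frames:
                # Close the remaining gap to mu0 by a factor exp(-rate) per frame.
                self.mu = self.mu0 - (self.mu0 - self.mu) * math.exp(-self.recovery_rate)
        return self.mu


def nlms_step(weights, x, desired, mu, eps=1e-8):
    """One complex NLMS update, followed by the output of the updated filter."""
    y = sum(w.conjugate() * xi for w, xi in zip(weights, x))
    err = desired - y
    norm = eps + sum(abs(xi) ** 2 for xi in x)
    new_weights = [w + mu * xi * err.conjugate() / norm for w, xi in zip(weights, x)]
    # As in the claim, the output is the inner product with the updated weights.
    output = sum(w.conjugate() * xi for w, xi in zip(new_weights, x))
    return new_weights, output
```

A caller would invoke `StepSizeController.update` once per frame with the low-band energy and pass the returned step size to `nlms_step`, so that an impact noise temporarily slows adaptation instead of driving the weights to diverge.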
8. The offline AI voice control method based on a smart safety helmet according to claim 7, characterized in that the step of inputting the actual output signal of the filter frame by frame into a pre-stored offline regression denoising model to generate a feature soft mask includes:
the offline AI audio processing unit extracts the Mel-frequency cepstral coefficients (MFCCs) of the actual output signal of the filter in the current processing frame, together with their temporal differences, to construct an input feature vector in the form of a one-dimensional tensor;
the offline AI audio processing unit feeds the input feature vector into the pre-stored offline regression denoising model, wherein the offline regression denoising model is a one-dimensional convolutional neural network (CNN);
after offline training, all weight parameters and biases of the CNN are quantized from 32-bit floating-point numbers to 8-bit integers using a uniform quantization mapping and are statically written to the internal flash memory of the offline AI audio processing unit;
the network topology of the CNN uses three cascaded one-dimensional convolutional layers, wherein the first one-dimensional convolutional layer uses a kernel of size 9, the two subsequent one-dimensional convolutional layers use kernels of size 3, and each one-dimensional convolutional layer is followed by a batch normalization operation;
at the output of the CNN, a sigmoid activation function is applied to perform a nonlinear mapping on the linear feature values output by the last one-dimensional convolutional layer, so as to generate, for the current processing frame, an acoustic feature soft mask whose dimension is identical to that of the input feature vector;
during continuous streaming processing, the acoustic feature soft masks output frame by frame are aggregated in time order to obtain the feature soft mask.
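As a non-limiting illustration of the network of claim 8, the following minimal sketch runs a single-channel inference pass through three cascaded one-dimensional convolutions (kernel sizes 9, 3 and 3), each followed by batch normalization, with a sigmoid at the output. It is a sketch under stated assumptions: the symmetric int8 quantization scheme, the zero padding that keeps the output length equal to the input length, the ReLU between hidden layers, the floating-point biases, and all kernel values and names are hypothetical and are not specified in the application.

```python
import math


def quantize_int8(values):
    """Symmetric uniform quantization of floats to int8 (assumed scheme)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def conv1d_same(x, q_kernel, scale, bias=0.0):
    """Single-channel 1D cross-correlation with zero padding; output length equals input length."""
    half = len(q_kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        bias + scale * sum(qk * padded[i + j] for j, qk in enumerate(q_kernel))
        for i in range(len(x))
    ]


def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-time batch normalization using stored statistics."""
    inv = gamma / math.sqrt(var + eps)
    return [(v - mean) * inv + beta for v in x]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def make_layer(float_kernel, bias=0.0, mean=0.0, var=1.0):
    """Quantize one kernel; the bias stays in float here for brevity."""
    q, s = quantize_int8(float_kernel)
    return {"q_kernel": q, "scale": s, "bias": bias, "mean": mean, "var": var}


def soft_mask(features, layers):
    """Map one frame's feature vector to a soft mask of the same dimension."""
    h = list(features)
    for idx, layer in enumerate(layers):
        h = conv1d_same(h, layer["q_kernel"], layer["scale"], layer["bias"])
        h = batch_norm(h, layer["mean"], layer["var"])
        if idx < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # assumed ReLU; the claim names no hidden activation
    return [sigmoid(v) for v in h]
```

Because every layer preserves the sequence length, the mask has one value in the open interval (0, 1) per input feature, which is what allows the later element-wise multiplication with the input feature vector.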
9. The offline AI voice control method based on a smart safety helmet according to claim 8, characterized in that the step of performing feature purification with the feature soft mask to obtain denoised features, and then performing local offline speech recognition on the denoised features to obtain target speech features, includes:
the offline AI audio processing unit performs element-wise multiplication of the feature soft mask with the input feature vector to purify the features, and then splices the results frame by frame to output a denoised feature sequence;
the offline AI audio processing unit inputs the denoised feature sequence into the local offline speech recognition engine for decoding and mapping to obtain the target speech features.
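As a non-limiting illustration of the feature purification of claim 9, the following minimal sketch multiplies each frame's soft mask element-wise with its input feature vector and collects the results in frame order. The function names are hypothetical, and the local offline speech recognition engine that consumes the sequence is outside the scope of the sketch.

```python
def purify_frame(mask, features):
    """Element-wise product of one frame's soft mask and its input feature vector."""
    if len(mask) != len(features):
        raise ValueError("mask and feature vector must have the same dimension")
    return [m * f for m, f in zip(mask, features)]


def denoised_sequence(masks, frames):
    """Splice the purified frames in time order into a denoised feature sequence."""
    return [purify_frame(m, f) for m, f in zip(masks, frames)]
```

A mask value near 1 leaves a feature almost unchanged, while a value near 0 suppresses it, so the mask acts as a per-feature gain rather than a hard decision.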
10. A smart safety helmet, wherein the smart safety helmet is provided with a state detection unit, a posture detection unit, a main control unit, an audio acquisition unit, and an offline AI audio processing unit, characterized in that the state detection unit, the posture detection unit, the main control unit, the audio acquisition unit, and the offline AI audio processing unit are used to implement the method according to any one of claims 1-9.