An audio signal processing method and system

By employing a dual-path parallel front-end processing architecture and a multi-dimensional feature extraction audio signal processing method, the problems of noise type differentiation and response lag in wireless Bluetooth headphones have been solved, enabling personalized noise reduction and transparency strategies, thereby improving the listening experience and safety of the headphones.

CN122201328A · Pending · Publication Date: 2026-06-12 · DONGGUAN BEEVO IND CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: DONGGUAN BEEVO IND CO LTD
Filing Date: 2026-03-13
Publication Date: 2026-06-12

Application Information


IPC: G10L21/0224; G10L21/0232; G10L25/24; G10K11/178; G06N3/0442; G06N3/0464; G06N3/049

AI Tagging

Application Domain

Speech analysis; Biological models


AI Technical Summary

Technical Problem

Existing adaptive modes in wireless Bluetooth headphones struggle to accurately distinguish noise types and priorities in complex scenarios, making it impossible to implement personalized noise reduction or transparency strategies, and they also exhibit lag in response to sudden environmental noise.

Method Used

A dual-path parallel front-end processing architecture is adopted, which combines a high-pass filter and an adaptive notch filter to filter out low-frequency noise, and multi-scale wavelet decomposition to enhance burst noise. Audio signals are classified through multi-dimensional feature extraction and a lightweight time-frequency fusion network. Combined with speech recognition and keyword detection, a dynamic decision engine performs personalized transparency or noise reduction operations.

Benefits of Technology

It achieves accurate differentiation and independent extraction of different noise types, improves response speed and wearing safety, ensures timely capture and processing of key sound sources, and provides a personalized listening experience.


Abstract

The present application belongs to the technical field of audio processing and relates in particular to an audio signal processing method and system. The method comprises: obtaining an environmental audio signal and preprocessing it based on a dual-path parallel front-end processing architecture to obtain a preprocessed audio signal; performing multi-dimensional feature extraction and fusion on the preprocessed audio signal to obtain fused features; matching the fused features based on a lightweight time-frequency fusion network to obtain the type of the environmental audio signal; performing time-frequency mask separation on the preprocessed audio signal according to the type of the environmental audio signal to obtain independent time-domain audio signals; performing speech recognition and keyword detection on the independent time-domain audio signals to obtain a speech recognition result; and performing a transparency or noise reduction operation on the independent time-domain audio signals based on a dynamic mode decision engine and the speech recognition result to obtain a transparency signal or a noise reduction signal. The present application allows personalized noise reduction or transparency strategies to be applied to different types of noise.


Description

Technical Field

[0001] This invention belongs to the field of audio signal processing technology, and particularly relates to an audio signal processing method and system.

Background Technology

[0002] With the technological advancements in wireless Bluetooth headphones, their functions have become increasingly diverse. Currently, mid-to-high-end wireless Bluetooth headphones generally feature noise cancellation and transparency modes. Noise cancellation actively generates anti-phase sound waves to cancel ambient noise and isolate interference for immersive listening. Transparency mode amplifies and optimizes ambient sound, allowing the wearer to clearly hear human voices and key prompts, balancing listening needs with environmental awareness. However, noise cancellation and transparency modes are contradictory; while achieving an immersive listening experience, human voices and key prompts are inevitably missed. Therefore, engineers have developed an adaptive mode that bridges the gap between noise cancellation and transparency. Adaptive mode is an intelligent hybrid mode that automatically balances "interference isolation" and "environmental awareness" by dynamically adjusting the noise cancellation depth and ambient sound gain ratio based on real-time ambient noise and user behavior.

[0003] However, existing adaptive modes are limited by microphone accuracy and noise recognition algorithm capabilities, making it difficult to accurately distinguish noise types and priorities in complex scenarios. This is mainly reflected in low recognition of noise in similar frequency domains; for example, it cannot effectively distinguish between the continuous low-frequency noise of subway operation and the conversations of nearby passengers. This leads to applying a uniform noise reduction or transparency strategy to both types of noise, resulting in either excessive suppression of conversations or incomplete cancellation of low-frequency noise. Furthermore, there is a response lag when facing sudden environmental noise. The adaptive mode requires a certain adjustment time to respond to sudden environmental noise, making it impossible to amplify sudden warning sounds in time to ensure safety and also making it difficult to maintain the original noise reduction experience.

[0004] In summary, existing adaptive modes struggle to distinguish noise types and priorities accurately in complex scenarios, cannot apply personalized noise reduction or transparency strategies to different types of noise, and respond with a lag to sudden environmental noise.

Summary of the Invention

[0005] The purpose of this invention is to provide an audio signal processing method and system, which aims to solve the problem that the existing technology cannot provide personalized noise reduction or transparency strategies for different types of noise.

[0006] To achieve the above objectives, the first technical solution provided by the present invention is an audio signal processing method, comprising the following steps:

[0007] The ambient audio signal is acquired and preprocessed based on a dual-path parallel front-end processing architecture to obtain a preprocessed audio signal; multi-dimensional features are extracted from the preprocessed audio signal, and the features extracted from the different dimensions are fused to obtain fused features; the fused features are matched based on a lightweight time-frequency fusion network to obtain the type of the ambient audio signal; time-frequency masking separation is performed on the preprocessed audio signal according to the type of the ambient audio signal to obtain independent time-domain audio signals; speech recognition and keyword detection are performed on the independent time-domain audio signals to obtain a speech recognition result; and, based on a dynamic mode decision engine and the speech recognition result, a transparency or noise reduction operation is performed on the independent time-domain audio signals to obtain the processed signal.

[0008] As an optional embodiment of the present invention, the dual-path parallel front-end processing architecture includes a first processing branch and a second processing branch. The first processing branch includes a high-pass filter and an adaptive notch filter. The high-pass filter is used to filter out low-frequency noise in the ambient audio signal. The adaptive notch filter is used to eliminate narrowband interference. The second processing branch uses multi-scale wavelet decomposition to transiently enhance the burst noise in the ambient audio signal.

[0009] As an optional aspect of this invention, the step of extracting multi-dimensional features from the preprocessed audio signal and fusing the extracted features from different dimensions to obtain fused features specifically includes: calculating the short-time energy entropy, zero-crossing rate change rate, impulse degree, and non-stationarity index of the preprocessed audio signal to obtain time-domain features; performing frequency-domain analysis on the preprocessed audio signal based on the log-Mel spectrum and Gammatone frequency cepstral coefficients to obtain frequency-domain features; extracting typical noise features from the preprocessed audio signal to obtain specific features; and weighting and fusing the time-domain features, frequency-domain features, and specific features to obtain the fused features.

[0010] As an optional aspect of this invention, the step of calculating the short-time energy entropy, zero-crossing rate change rate, impulse degree, and non-stationarity index of the preprocessed audio signal to obtain the time-domain features specifically includes: dividing the preprocessed audio signal into frames of a preset time length to obtain several fixed-length audio signals; windowing the fixed-length audio signals to obtain windowed audio signals; calculating the short-time energy entropy, zero-crossing rate change rate, and impulse degree of the windowed audio signals; and calculating the non-stationarity index of the preprocessed audio signal based on the energy ratio of the intrinsic mode functions obtained by empirical mode decomposition.

[0011] As an optional aspect of this invention, the step of performing frequency-domain analysis on the preprocessed audio signal based on the log-Mel spectrum and Gammatone frequency cepstral coefficients to obtain frequency-domain features specifically includes: performing a Fourier transform on the preprocessed audio signal to obtain the frequency-domain signal; filtering the frequency-domain signal with a Mel filter to obtain the Mel spectrum; taking the logarithm of the Mel spectrum and normalizing it to obtain the log-Mel spectrum; filtering the log-Mel spectrum with a Gammatone filter to obtain the Gammatone spectrum; nonlinearly compressing the Gammatone spectrum to obtain a compressed Gammatone spectrum; and applying a logarithm and a discrete cosine transform to the compressed Gammatone spectrum to obtain the Gammatone frequency cepstral coefficients.

[0012] As an optional aspect of the present invention, the step of matching the fused features based on a lightweight time-frequency fusion network to obtain the type of the environmental audio signal specifically includes: the lightweight time-frequency fusion network includes a convolutional neural network branch, a long short-term memory network branch, and a feature fusion output layer; the fused features are input into the lightweight time-frequency fusion network, where the convolutional neural network branch captures local frequency and channel dependencies of the frequency-domain features to obtain a frequency-domain feature vector, and the long short-term memory network branch captures the temporal dynamics of the time-domain features to obtain a time-domain dynamic feature vector; the frequency-domain feature vector and the time-domain dynamic feature vector are input into the feature fusion output layer and concatenated to obtain the type of the environmental audio signal.

[0013] As an optional aspect of this invention, the step of performing time-frequency masking separation on the preprocessed audio signal according to the type of the environmental audio signal to obtain independent time-domain audio signals specifically includes: performing a short-time Fourier transform on the preprocessed audio signal to obtain its time-frequency complex spectrum, which comprises an amplitude spectrum and a phase spectrum; initializing a group of masking matrices with the same dimensions as the amplitude spectrum, determining the source attribution of each time-frequency unit of the amplitude spectrum according to the type of the environmental audio signal and the fused features, and setting the element values of the masking matrices according to the determination results to obtain the assigned group of masking matrices; separating the amplitude spectrum based on the group of masking matrices to obtain the amplitude spectra of the different audio types; recombining, based on Euler's formula, the amplitude spectrum of each audio type with the phase spectrum to obtain the time-frequency complex spectrum of each type; and performing an inverse short-time Fourier transform on each of these complex spectra and overlap-adding the transformed audio frames to obtain the independent time-domain audio signals.
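The masking and Euler-formula recombination described in this step can be sketched for a single frame as follows. This is only an illustration under assumptions: the function name `separate_sources`, the source labels, and the hand-written binary masks are hypothetical, and the inverse short-time Fourier transform and overlap-add stages are not shown.

```python
import cmath

def separate_sources(spectrum, masks):
    """Apply one mask per source to the amplitude spectrum of a frame and
    recombine each masked amplitude with the original phase (Euler's formula)."""
    amplitude = [abs(z) for z in spectrum]
    phase = [cmath.phase(z) for z in spectrum]
    separated = {}
    for name, mask in masks.items():
        separated[name] = [
            m * a * cmath.exp(1j * p)  # A * e^{j*phi} = A*cos(phi) + j*A*sin(phi)
            for m, a, p in zip(mask, amplitude, phase)
        ]
    return separated

# Hypothetical three-bin complex spectrum and two complementary masks.
spectrum = [1 + 1j, -2j, 3 + 0j]
masks = {"voice": [1.0, 0.0, 1.0], "noise": [0.0, 1.0, 0.0]}
sources = separate_sources(spectrum, masks)
```

Because the two masks sum to one in every bin, adding the separated spectra back together recovers the original spectrum, which is a convenient sanity check for a mask group.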

[0014] As an optional aspect of the present invention, the step of performing speech recognition and keyword detection on the independent time-domain audio signal to obtain the speech recognition result specifically includes: Based on the automatic speech recognition engine, the human voice audio signal in the independent time-domain audio signal is converted into text according to the preset sampling frequency to obtain text data; Keyword matching is performed on the text data to obtain keyword detection results; Calculate the voiceprint embedding cosine similarity of the human voice audio signal in the independent time-domain audio signal, determine whether the human voice audio signal belongs to the user, and obtain the voiceprint detection result; The keyword detection results and the voiceprint detection results together constitute the speech recognition results.
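The voiceprint check in this step reduces to a cosine similarity between two embedding vectors. A minimal sketch follows, assuming the embeddings are already available as lists of floats; the embedding model, the function names, and the decision threshold of 0.7 are assumptions that do not come from the text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no match"
    return dot / (norm_a * norm_b)

def is_user_voice(embedding, enrolled_embedding, threshold=0.7):
    """Hypothetical decision rule: the voice belongs to the user when the
    similarity to the enrolled voiceprint reaches the threshold."""
    return cosine_similarity(embedding, enrolled_embedding) >= threshold

enrolled = [0.2, 0.9, 0.1]
same_speaker = is_user_voice([0.25, 0.85, 0.05], enrolled)
other_speaker = is_user_voice([0.9, -0.1, 0.3], enrolled)
```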

[0015] As an optional aspect of the present invention, the step of performing transparency or noise reduction operations on the independent time-domain audio signals based on the dynamic mode decision engine and the speech recognition result to obtain the processed signal specifically includes: dividing the independent time-domain audio signals into noise reduction tasks or transparency tasks according to their type and the speech recognition result; generating and executing, based on the dynamic mode decision engine, noise reduction or transparency operation instructions for the independent time-domain audio signals to obtain a noise reduction residual signal or a transparency signal; synchronizing and aligning the noise reduction residual signal and the transparency signal to obtain synchronized and aligned signals; and superimposing the synchronized and aligned signals to obtain the processed signal.
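The task division in this step can be pictured as a small rule table. The sketch below is illustrative only: the source labels, the function name `assign_task`, and the handling of cases the description does not spell out (speech from other people without a detected keyword, and unknown source types, both routed to noise reduction here) are assumptions.

```python
TRANSPARENCY = "transparency"
NOISE_REDUCTION = "noise_reduction"

def assign_task(source_type, keyword_detected=False, is_own_voice=False):
    """Map one separated source to a transparency or noise reduction task."""
    if source_type in ("horn", "warning"):
        return TRANSPARENCY            # safety-relevant sounds are passed through
    if source_type in ("wind", "vehicle", "background"):
        return NOISE_REDUCTION         # steady or nuisance noise is suppressed
    if source_type == "voice":
        if is_own_voice:
            return NOISE_REDUCTION     # the wearer's own voice is suppressed
        # Assumption: other speakers are passed through only on a keyword hit.
        return TRANSPARENCY if keyword_detected else NOISE_REDUCTION
    return NOISE_REDUCTION             # assumption: unknown types default to suppression

tasks = {
    "horn": assign_task("horn"),
    "wind": assign_task("wind"),
    "call": assign_task("voice", keyword_detected=True),
    "self": assign_task("voice", keyword_detected=True, is_own_voice=True),
}
```

Keeping the rules in one function makes the priority order explicit: safety sounds first, nuisance noise second, speech last, with the voiceprint result overriding the keyword result.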

[0016] The second technical solution provided by this invention is an audio signal processing system, comprising: an audio acquisition and preprocessing module, used to acquire ambient audio signals and preprocess them based on a dual-path parallel front-end processing architecture to obtain preprocessed audio signals; a feature extraction and fusion module, used to extract multi-dimensional features from the preprocessed audio signal and fuse the features extracted from the different dimensions to obtain fused features; an audio signal classification module, used to match the fused features based on a lightweight time-frequency fusion network to obtain the type of the environmental audio signal; an audio signal separation module, used to perform time-frequency masking separation on the preprocessed audio signal according to the type of the environmental audio signal to obtain independent time-domain audio signals; a speech recognition module, used to perform speech recognition and keyword detection on the independent time-domain audio signals to obtain the speech recognition result; and a dynamic decision module, used to perform transparency or noise reduction operations on the independent time-domain audio signals based on the dynamic mode decision engine and the speech recognition result to obtain the processed signal.

[0017] The audio signal processing method and system provided by the embodiments of the present invention have at least one of the following technical effects: 1. Employing a dual-path parallel front-end processing architecture, this invention can simultaneously suppress steady-state low-frequency interference and enhance sudden transient noise, solving the problem of delayed response to sudden noise in existing technologies. The invention effectively filters out power frequency and low-frequency steady-state noise through high-pass filtering and adaptive notch filtering in the first branch; and enhances transient signals such as sudden alert tones, horns, and human voices through multi-scale wavelet decomposition in the second branch. This enables the back-end classification and separation module to capture key sound sources earlier and more accurately, significantly reducing algorithm response latency during sudden environmental changes and improving wearing safety and auditory experience consistency.

[0018] 2. By fusing multi-dimensional time-domain, frequency-domain, and specific features, this invention solves the problem of distinguishing noise with similar frequency content. Instead of relying on a single spectral feature, this invention combines time-domain features such as short-time energy entropy, zero-crossing rate change rate, impulse degree, and non-stationarity index with frequency-domain features such as the log-Mel spectrum and Gammatone frequency cepstral coefficients. It also extracts specific features for wind noise, tire noise, and horn sounds, and forms a highly discriminative fused feature through weighted fusion. This feature can accurately distinguish easily confused sound sources such as environmental noise, human voices, wind noise, vehicle noise, and horn sounds, avoiding the misjudgments and inaccurate noise reduction caused by relying on a single feature.

[0019] 3. A lightweight time-frequency fusion network is used to achieve accurate environmental audio classification, balancing recognition accuracy and real-time performance. This invention captures local frequency domain features through convolutional neural network branches and models temporal dynamics through long short-term memory network branches. High-precision audio classification is achieved by fusing the features from the two branches. At the same time, the network structure is lightweight and can run in real time on edge devices such as headphones.

[0020] 4. Time-frequency masking separation based on noise type enables independent extraction of different sound sources, fundamentally preventing the suppression of human voices and key prompts. This invention determines the source attribution of time-frequency units based on classification results, and separates human voices, horns, background noise, wind noise, and tire noise into independent time-domain signals through a multi-channel masking matrix, rather than simple global noise reduction. This achieves refined processing of different components and completely resolves the contradiction in traditional adaptive modes where "human voices and noise are at the same frequency, and noise reduction is equivalent to suppressing human voices."

[0021] 5. By combining speech recognition, keyword detection, and voiceprint verification, a dynamic mode decision engine is constructed to achieve personalized noise reduction / transparency strategies. This invention not only distinguishes between noise and human voice, but also identifies dialogue invitations, public announcements, warning prompts, and the user's own voiceprint, and dynamically allocates transparency / noise reduction tasks accordingly: transparency enhancement is applied to calls from other people, horns, and warning sounds; deep noise reduction is applied to wind noise, vehicle noise, and ambient background noise; and the user's own voice is suppressed. This achieves "transparency where it should be transparent and noise reduction where it should be reduced," addressing the inability of existing technologies to adjust intelligently according to sound source priority.

Brief Description of the Drawings

[0022] To more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart of an audio signal processing method according to the present invention.

[0024] Figure 2 is a schematic block diagram of an audio signal processing system according to the present invention.

[0025] Figure 3 is a flowchart of the feature fusion steps of an audio signal processing method according to the present invention.

[0026] Figure 4 is a flowchart of the time-frequency masking and separation steps of an audio signal processing method according to the present invention.

[0027] Figure 5 is a flowchart of the transparency or noise reduction operation steps of an audio signal processing method according to the present invention.

Detailed Description

[0028] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain embodiments of the present invention, and should not be construed as limiting the present invention.

[0029] In the description of the embodiments of the present invention, it should be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the present invention.

[0030] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of the present invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0031] In the embodiments of the present invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. Those skilled in the art can understand the specific meaning of the above terms in the embodiments of the present invention according to the specific circumstances.

[0032] In a specific embodiment of the present invention, as shown in Figure 1, an audio signal processing method is provided, including the following steps:
S1. Acquire an ambient audio signal and preprocess it based on a dual-path parallel front-end processing architecture to obtain a preprocessed audio signal;
S2. Perform multi-dimensional feature extraction on the preprocessed audio signal, and fuse the features extracted from the different dimensions to obtain fused features;
S3. Match the fused features based on a lightweight time-frequency fusion network to obtain the type of the environmental audio signal;
S4. Perform time-frequency masking separation on the preprocessed audio signal according to the type of the environmental audio signal to obtain independent time-domain audio signals;
S5. Perform speech recognition and keyword detection on the independent time-domain audio signals to obtain the speech recognition result;
S6. Based on the dynamic mode decision engine and the speech recognition result, perform a transparency or noise reduction operation on the independent time-domain audio signals to obtain the processed signal.

[0033] Preferably, in a specific embodiment of the present invention, a dual-microphone array is used as the audio acquisition unit. The sampling frequency of the dual-microphone array is 14 kHz to 18 kHz, the sampling resolution is 16 bits, and continuous acquisition is combined with segmented buffering. The acquisition gain is adjusted to 25 dB to 35 dB so that the amplitude of the acquired ambient audio signal stays within the preset effective range.

[0034] The dual-path parallel front-end processing architecture of this invention includes a first processing branch and a second processing branch. The first processing branch, which includes a high-pass filter and an adaptive notch filter, primarily suppresses power-frequency (mains) interference. The high-pass filter removes low-frequency noise from the ambient audio signal, while the adaptive notch filter eliminates narrowband interference and tracks minor fluctuations in the power grid frequency, ensuring efficient suppression without affecting the useful signal. The second processing branch focuses on transient enhancement of burst noise. It applies conventional multi-scale wavelet decomposition to analyze the energy distribution and transient characteristics of burst noise in the ambient audio signal at different scales, enhancing the burst noise components for subsequent identification and separation. This dual-path parallel front-end processing architecture enables the back-end classification and separation module to capture key sound sources earlier and more accurately, reducing algorithm response latency during sudden environmental changes.
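The first processing branch can be illustrated with a first-order high-pass filter followed by an adaptive notch. The sketch below is not the filter design of the application: it assumes a simple RC-style high-pass and a two-weight LMS notch with sine and cosine references, and the 50 Hz hum frequency, the 100 Hz cutoff, the step size `mu`, and all function names are assumptions. The wavelet-based second branch is not shown.

```python
import math

def high_pass(samples, fs, cutoff_hz):
    """First-order (RC-style) high-pass filter that attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / fs)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        prev_y = alpha * (prev_y + x - prev_x)
        prev_x = x
        out.append(prev_y)
    return out

def adaptive_notch(samples, fs, f0, mu=0.01):
    """Two-weight LMS notch: estimates a tone near f0 from sin/cos references
    and outputs the input minus that estimate."""
    w_sin, w_cos = 0.0, 0.0
    out = []
    for n, x in enumerate(samples):
        s = math.sin(2.0 * math.pi * f0 * n / fs)
        c = math.cos(2.0 * math.pi * f0 * n / fs)
        e = x - (w_sin * s + w_cos * c)   # error = input minus estimated interference
        w_sin += 2.0 * mu * e * s         # LMS weight updates
        w_cos += 2.0 * mu * e * c
        out.append(e)
    return out

fs = 16000  # assumed, inside the 14 kHz to 18 kHz range given for acquisition
hum = [0.5 * math.sin(2.0 * math.pi * 50.0 * n / fs + 0.3) for n in range(4000)]
cleaned = adaptive_notch(hum, fs, 50.0)
dc_removed = high_pass([1.0] * 4000, fs, 100.0)
```

Because the weights adapt continuously, the notch follows slow changes in the amplitude and phase of the interference, which is the property the paragraph attributes to the adaptive notch filter.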

[0035] Preferably, referring to Figure 3, step S2 specifically includes: S21. Calculate the short-time energy entropy, zero-crossing rate change rate, impulse degree, and non-stationarity index of the preprocessed audio signal to obtain the time-domain features. Specifically, step S21 includes: S211. Divide the preprocessed audio signal into frames of a preset time length to obtain several fixed-length audio signals. In a specific embodiment of the present invention, the preset time length ranges from 20 ms to 40 ms, and each frame is preferably 30 ms long, with a 50% overlap between adjacent frames.

[0036] S212. Windowing is applied to the fixed-length audio signal to obtain a windowed audio signal. In a specific embodiment of the present invention, a Hamming window is used to process the audio signal to reduce spectral leakage. First, Hamming window coefficients with the same frame length as the fixed-length audio signal are generated. Then, the samples of the audio signal in each frame are multiplied element-wise by the corresponding Hamming window coefficients to complete the application of the window function.
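Framing and Hamming windowing as described in S211 and S212 can be sketched as follows. The 16 kHz sampling rate, the half-frame hop (reflecting the 50% overlap figure mentioned for framing), the 440 Hz test tone, and the function names are assumptions chosen for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len samples, advancing by hop samples."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def hamming(frame_len):
    """Hamming window coefficients w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def window_frames(frames):
    """Multiply every frame element-wise by the Hamming coefficients."""
    w = hamming(len(frames[0]))
    return [[s * c for s, c in zip(frame, w)] for frame in frames]

fs = 16000                       # assumed sampling rate
frame_len = fs * 30 // 1000      # 30 ms frame -> 480 samples
hop = frame_len // 2             # half-frame hop (50% overlap, assumed)
signal = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(1600)]
windowed = window_frames(frame_signal(signal, frame_len, hop))
```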

[0037] S213. Calculate the short-time energy entropy, zero-crossing rate change rate, and impulse degree of the windowed audio signal.

[0038] Specifically, the short-time energy entropy is calculated as follows:

[0039] $$E_i = \sum_{n=0}^{N-1} \left[ x_i(n)\, w(n) \right]^2, \qquad p_i = \frac{E_i}{\sum_{j=1}^{M} E_j}$$

[0040] $$H = -\sum_{i=1}^{M} p_i \log p_i$$

[0041] where $x_i(n)$ denotes the $i$-th frame of the audio signal, so that $x_i(n)\,w(n)$ is the windowed frame; $n$ denotes the index of the time-domain sampling point within the frame; $N$ denotes the frame length; $w(n)$ denotes the Hamming window coefficient; $E_i$ denotes the short-time energy of the $i$-th frame; $p_i$ denotes the probability distribution of the short-time energy of the $i$-th frame; $M$ denotes the number of consecutive frames; $\{E_1, \dots, E_M\}$ denotes the short-time energy sequence of the $M$ consecutive windowed frames; and $H$ denotes the short-time energy entropy of the $M$ consecutive windowed frames.
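As a rough illustration of the short-time energy entropy, the sketch below assumes that the short-time energy of a frame is the sum of its squared windowed samples and that the entropy is the Shannon entropy (natural logarithm) of the energies normalized over the consecutive frames. The logarithm base and the function names are assumptions.

```python
import math

def short_time_energy(windowed_frame):
    """Sum of squared samples of one already-windowed frame."""
    return sum(s * s for s in windowed_frame)

def energy_entropy(windowed_frames):
    """Shannon entropy of the normalized short-time energies of consecutive frames."""
    energies = [short_time_energy(f) for f in windowed_frames]
    total = sum(energies)
    if total == 0.0:
        return 0.0  # silent block: no energy distribution to measure
    entropy = 0.0
    for e in energies:
        p = e / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy

steady = energy_entropy([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
burst = energy_entropy([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
```

Evenly spread energy gives the maximum entropy for the block, while energy concentrated in a single frame, as with a sudden burst, drives the entropy toward zero.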

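[0042] Specifically, the zero-crossing rate change rate is calculated using the following expression:

[0043] $$Z_i = \alpha \sum_{n=1}^{N-1} \left| \operatorname{sgn}[x_i(n)] - \operatorname{sgn}[x_i(n-1)] \right|, \qquad R = \frac{\sigma_Z}{\mu_Z + \varepsilon}$$

[0044] where $x_i(n)$ is the $n$-th sample of the $i$-th frame (the Hamming coefficients are positive, so windowing leaves the signs unchanged); $\alpha$ denotes the normalization factor; $\operatorname{sgn}[\cdot]$ denotes the sign function; $Z_i$ denotes the short-time zero-crossing rate of the $i$-th windowed frame; $L$ denotes the length of the sliding window; $\sigma_Z$ denotes the standard deviation of the short-time zero-crossing rates within the sliding window; $\mu_Z$ denotes the mean of the short-time zero-crossing rates within the sliding window; $\varepsilon$ denotes a very small value that prevents the denominator from being 0; and $R$ denotes the zero-crossing rate change rate of the $L$ consecutive windowed frames.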
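A small sketch of the zero-crossing rate change rate follows. It assumes a normalization factor of 1/(2N) for the per-frame zero-crossing rate and takes the change rate to be the standard deviation of the rates in the sliding window divided by their mean plus a small constant. Both choices and the function names are assumptions.

```python
import math

def sign(value):
    """Sign function that maps zero to +1 so every sample has a definite sign."""
    return 1 if value >= 0 else -1

def zero_crossing_rate(frame):
    """Short-time zero-crossing rate with the assumed normalization 1 / (2N)."""
    crossings = sum(abs(sign(frame[n]) - sign(frame[n - 1]))
                    for n in range(1, len(frame)))
    return crossings / (2 * len(frame))

def zcr_change_rate(frames, eps=1e-8):
    """Standard deviation of the per-frame rates divided by (mean + eps)."""
    rates = [zero_crossing_rate(f) for f in frames]
    mean = sum(rates) / len(rates)
    std = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
    return std / (mean + eps)

alternating = [1.0, -1.0, 1.0, -1.0]
flat = [1.0, 1.0, 1.0, 1.0]
stable = zcr_change_rate([alternating, alternating, alternating])
changing = zcr_change_rate([alternating, flat, alternating])
```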

[0045] Specifically, the impulse degree is calculated using the following expressions:

[0046] $$z_i(n) = \tilde{x}_i(n) + j\,\hat{x}_i(n)$$

[0047] $$a_i(n) = \left| z_i(n) \right| = \sqrt{\tilde{x}_i^{\,2}(n) + \hat{x}_i^{\,2}(n)}$$

[0048]

[0049] where $\tilde{x}_i(n)$ denotes the $i$-th windowed frame; $\hat{x}_i(n)$ denotes the imaginary part obtained by applying the Hilbert transform to the $i$-th windowed frame; $z_i(n)$ denotes the analytic signal of the $i$-th windowed frame; $j$ denotes the imaginary unit; $a_i(n)$ denotes the amplitude of the analytic signal, that is, the Hilbert envelope; $\sigma_i$ denotes the standard deviation of the Hilbert envelope of the $i$-th frame; $\mu_i$ denotes the mean of the Hilbert envelope of the $i$-th frame; $N$ denotes the frame length; and $P_i$ denotes the impulse degree of the $i$-th windowed frame.
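The Hilbert envelope and its statistics can be sketched with a plain discrete Fourier transform, as below. The analytic signal is built by zeroing the negative-frequency bins and doubling the positive ones, which is the usual frequency-domain construction. The function names are assumptions, the O(N²) transform is only suitable for short frames, and the sketch stops at the envelope mean and standard deviation because the expression that combines them into the impulse degree is not reproduced in the text.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def analytic_signal(frame):
    """Keep DC (and Nyquist for even N), double positive bins, zero negative bins."""
    n = len(frame)
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    for k in range(1, (n + 1) // 2):
        gain[k] = 2.0
    spectrum = dft(frame)
    return idft([spectrum[k] * gain[k] for k in range(n)])

def envelope_stats(frame):
    """Hilbert envelope |z(n)| with its mean and (population) standard deviation."""
    envelope = [abs(z) for z in analytic_signal(frame)]
    mean = sum(envelope) / len(envelope)
    std = math.sqrt(sum((a - mean) ** 2 for a in envelope) / len(envelope))
    return envelope, mean, std

tone = [math.cos(2.0 * math.pi * 4 * t / 64) for t in range(64)]
envelope, mean, std = envelope_stats(tone)
```

A steady tone has a flat envelope, so its standard deviation is close to zero; an impulsive sound concentrates the envelope in a few samples and raises the standard deviation relative to the mean.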

[0050] S214. Calculate the nonstationarity index of the preprocessed audio signal based on the energy ratio of the intrinsic mode function of empirical mode decomposition.

[0051] Specifically, empirical mode decomposition is performed on the preprocessed audio signal to obtain intrinsic mode functions and residual components. The expression for empirical mode decomposition is as follows:

[0052] where the symbols denote, in order: the preprocessed audio signal; the intrinsic mode function of a given order (the first order being the highest-frequency component and the last order the lowest-frequency component); and the residual component.

[0053] The energy ratio of high-frequency intrinsic mode functions is calculated using the following expression:

[0054]

[0055]

[0056] where the symbols denote, in order: the sum of the energies of the leading (highest-frequency) intrinsic mode functions; the sum of the energies of all intrinsic mode functions; the number of leading intrinsic mode functions considered; and the ratio of the energy of those leading intrinsic mode functions to the energy of all intrinsic mode functions.

[0057] The energy ratios of the leading high-frequency intrinsic mode functions are assigned decreasing weights, and the resulting weighted high-frequency energy ratio serves as the nonstationarity index. Its expression is as follows:

[0058] where the symbols denote the nonstationarity index and the assigned weights, respectively.
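As a non-limiting illustration, the weighted high-frequency energy ratio may be sketched as follows, taking the intrinsic mode functions as given (the empirical mode decomposition itself is not shown). The default linearly decreasing weights that sum to one, and all names, are assumptions for illustration only.

```python
import numpy as np

def nonstationarity_index(imfs, num_high: int, weights=None) -> float:
    """Weighted energy ratio of the leading (highest-frequency) IMFs.

    imfs: sequence of 1-D arrays ordered from highest to lowest frequency.
    num_high: number of leading IMFs to include.
    """
    energies = np.array([np.sum(np.square(imf)) for imf in imfs], dtype=float)
    total = np.sum(energies)
    if total == 0:
        return 0.0
    ratios = energies[:num_high] / total              # per-IMF energy ratio
    if weights is None:
        # Assumed weighting: linearly decreasing, normalized to sum to 1.
        weights = np.arange(num_high, 0, -1, dtype=float)
        weights /= weights.sum()
    return float(np.sum(np.asarray(weights, dtype=float) * ratios))
```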

[0059] S22. Based on the log-Mel spectrum and gammatone frequency cepstral coefficients, the preprocessed audio signal is analyzed in the frequency domain to obtain frequency domain characteristics.

[0060] Preferably, step S22 specifically includes: S221. Perform Fourier transform on the preprocessed audio signal to obtain the frequency domain signal; Specifically, a short-time Fourier transform is performed on the preprocessed audio signal to transform it from the time domain to the frequency domain. The calculation expression is as follows:

[0061]

[0062] where the symbols denote, in order: the frequency-domain sampling point index within the frame, together with its corresponding linear frequency; the sampling frequency; the frequency-domain representation of the preprocessed audio signal; and the power spectrum.

[0063] S222. The frequency domain signal is filtered using a Mel filter to obtain the Mel spectrum. Specifically, the log-Mel spectrum combines the Mel frequency scale and logarithmic scaling to better simulate the human ear's perception of different frequencies. Mapping the spectrum onto the Mel frequency scale makes it more consistent with human frequency perception. The relationship between Mel frequency and linear frequency is as follows:

[0064] where the symbol denotes the Mel frequency corresponding to a given linear frequency.

[0065] Next, to calculate the Mel spectrum, a set of Mel filters (usually triangular filters) is designed, and the power spectrum is summed over the Mel filters, thus converting the linear spectrum into a Mel spectrum, as expressed below:

[0066] where the symbols denote the frequency-domain response of a given Mel filter and the Mel spectrum of a given frame, respectively.

[0067] S223. Take the logarithm of the Mel spectrogram and then normalize the spectrum to obtain the logarithmic Mel spectrogram; Specifically, in this embodiment of the invention, the logarithmic Mel spectrum is normalized using Z-score to eliminate amplitude differences, resulting in the final logarithmic Mel spectrum, the expression of which is as follows:

[0068] where the symbols denote, in order: a small constant that prevents a zero denominator; the logarithm of the Mel spectrum of a given frame; the mean of the Mel spectrum of the frame; the standard deviation of the Mel spectrum of the frame; and the normalized log-Mel spectrum.
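As a non-limiting illustration, steps S222 and S223 may be sketched as follows. The sketch assumes the commonly used Mel mapping 2595·log10(1 + f/700) and triangular filters spaced evenly on the Mel scale; the mapping constants, the filter count, and all names are assumptions for illustration only and may differ from the expressions of this embodiment.

```python
import numpy as np

def hz_to_mel(f):
    # Commonly used Mel mapping (assumed).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(num_filters: int, n_fft: int, sample_rate: float) -> np.ndarray:
    """Triangular filters, shape (num_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(float(hz_to_mel(0.0)),
                             float(hz_to_mel(sample_rate / 2)),
                             num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, len(bin_freqs)))
    for m in range(num_filters):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def normalized_log_mel(power_spectrum, bank, eps: float = 1e-8) -> np.ndarray:
    """Mel spectrum of one frame, then logarithm, then Z-score normalization."""
    mel = bank @ power_spectrum                      # sum the power spectrum per filter
    log_mel = np.log(mel + eps)
    return (log_mel - log_mel.mean()) / (log_mel.std() + eps)
```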

[0069] S224. The log-Mel spectrum is filtered using a gammatone filter to obtain a gammatone spectrum. Specifically, to further enhance the ability to capture the characteristics of the sound source, in a specific embodiment of the present invention, gammatone frequency cepstral coefficients are superimposed on the log-Mel spectrum. The center frequencies of the gammatone filters correspond one-to-one with the center frequencies of the Mel filters, as expressed below:

[0070] where the symbols denote the linear frequency corresponding to a given Mel frequency and the center frequency of the gammatone filter corresponding to that linear frequency, respectively.

[0071] The log-Mel spectrum of each frame is convolved with the gammatone filter along the Mel frequency dimension to obtain the gammatone filter convolution output of that frame. Finally, the gammatone filter convolution outputs of all frames are concatenated into a two-dimensional matrix to obtain the gammatone spectrum. The gammatone filter convolution output of each frame is calculated as follows:

[0072] where the symbols denote, in order: the discrete frequency offset; the total number of Mel filters; the impulse response of the gammatone filter; and the gammatone filter convolution output of a given frame.

[0073] S225. Perform nonlinear compression on the gammatone spectrum to obtain a compressed gammatone spectrum. Specifically, the expression for the compressed gammatone spectrum is as follows:

[0074] where the symbol denotes the compressed gammatone spectrum of a given frame.

[0075] S226. The compressed gammatone spectrum is sequentially subjected to a logarithm and a discrete cosine transform to obtain the gammatone frequency cepstral coefficients.

[0076] Specifically, the expression for the gammatone frequency cepstral coefficients is as follows:

[0077]

[0078] where the symbols denote, in order: the logarithm of the compressed gammatone spectrum of a given frame; the cepstral coefficient index; and the gammatone frequency cepstral coefficients of the frame.
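As a non-limiting illustration, steps S225 and S226 (nonlinear compression, logarithm, discrete cosine transform) may be sketched as follows for one frame. The cube-root compression exponent, the unnormalized DCT-II, the number of coefficients, and all names are assumptions for illustration only.

```python
import numpy as np

def dct_ii(x, num_coeffs: int) -> np.ndarray:
    """Unnormalized DCT-II, returning the first num_coeffs coefficients."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(num_coeffs)[:, None]       # cepstral coefficient index
    m = np.arange(n)[None, :]                # filter channel index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ x

def cepstral_coefficients(filter_spectrum, num_coeffs: int = 13,
                          power: float = 1.0 / 3.0, eps: float = 1e-8) -> np.ndarray:
    """Compress, take the logarithm, then apply the DCT to one frame."""
    compressed = np.power(np.abs(filter_spectrum), power)   # assumed power-law compression
    log_spec = np.log(compressed + eps)
    return dct_ii(log_spec, num_coeffs)
```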

[0079] S23. Extract typical noise features from the preprocessed audio signal to obtain specific features. Specifically, the present invention extracts specific features for three types of typical noise commonly found in complex scenarios: wind noise, tire noise, and horn noise. This further improves the accuracy with which the classification model identifies these noises and addresses the difficulty of distinguishing noise types whose spectra overlap.

[0080] For wind noise, whose core characteristic is the concentration of energy in the low-frequency band, the specific embodiment of this invention extracts the ratio of the energy in the 0.1-0.5 kHz band to the full-band energy, calculates the standard deviation of the zero-crossing rate in the 0.1-0.5 kHz band, and supplements these with the spectral flatness of the low-frequency band. The combination of these three features serves as the unique identification feature for wind noise.

[0081] For tire noise, a steady-state broadband noise whose core characteristic is its modulation behavior in the 1.2-2.5 kHz band, a specific embodiment of the present invention extracts the modulation depth of the 1.2-2.5 kHz band and supplements it with the ratio of the energy in the 1.2-2.5 kHz band to that in the 0.5-1 kHz band. The combination of the two is used to distinguish tire noise from other steady-state noise.

[0082] For horn sounds, which are non-stationary narrowband signals whose core characteristics are a narrowband energy surge in the 2-3.5 kHz range and a periodic harmonic structure, a specific embodiment of the present invention extracts the energy peak of the 2-3.5 kHz band, calculates the pulse degree of the 2-3.5 kHz band, and detects the number of harmonics and the harmonic frequency interval in the 2-3.5 kHz band. The combination of these three features is used to distinguish sudden horn sounds from other sudden noises.
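As a non-limiting illustration, two of the band-based quantities named above (a band-to-full-band energy ratio and spectral flatness) may be sketched as follows. The flatness is taken as the ratio of the geometric mean to the arithmetic mean of the power spectrum, which is the usual definition; all names and the example frequencies are assumptions for illustration only.

```python
import numpy as np

def band_energy_ratio(power_spectrum, freqs, low_hz: float, high_hz: float,
                      eps: float = 1e-12) -> float:
    """Energy in [low_hz, high_hz] divided by the full-band energy."""
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return float(power_spectrum[mask].sum() / (power_spectrum.sum() + eps))

def spectral_flatness(power_spectrum, eps: float = 1e-12) -> float:
    """Geometric mean over arithmetic mean of the power spectrum."""
    ps = np.asarray(power_spectrum, dtype=float) + eps
    return float(np.exp(np.mean(np.log(ps))) / np.mean(ps))

# Example: a 250 Hz tone falls inside the 0.1-0.5 kHz wind-noise band.
sample_rate, n = 16000, 1024
t = np.arange(n) / sample_rate
freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
low_tone = np.abs(np.fft.rfft(np.sin(2 * np.pi * 250 * t))) ** 2
wind_ratio = band_energy_ratio(low_tone, freqs, 100.0, 500.0)
```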

[0083] S24. Perform feature weighted fusion on the time domain features, the frequency domain features, and the specific features to obtain fused features.

[0084] Specifically, in specific embodiments of the present invention, weights are assigned to corresponding features based on the contribution of each type of feature to noise classification, as expressed below:

[0085]

[0086] where the symbols denote, in order: the frequency domain features, including the gammatone frequency cepstral coefficients; the time-domain features, including the short-time energy entropy, the rate of change of the zero-crossing rate, the pulse degree, and the nonstationarity index; the specific features; the weight of the frequency domain features; the weight of the time-domain features; the weight of the specific features; the weighted frequency domain features; the weighted time-domain features; the weighted specific features; and the fused feature.
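As a non-limiting illustration, the weighted fusion of step S24 may be sketched as follows. The sketch assumes that each feature group is scaled by its weight and that the scaled groups are then concatenated; the concatenation, the example weights, and all names are assumptions for illustration only.

```python
import numpy as np

def fuse_features(freq_feat, time_feat, specific_feat,
                  w_freq: float, w_time: float, w_specific: float) -> np.ndarray:
    """Scale each feature group by its weight, then concatenate (assumed fusion)."""
    return np.concatenate([
        w_freq * np.asarray(freq_feat, dtype=float),
        w_time * np.asarray(time_feat, dtype=float),
        w_specific * np.asarray(specific_feat, dtype=float),
    ])

fused = fuse_features([1.0, 2.0], [3.0], [4.0, 5.0, 6.0], 0.5, 0.3, 0.2)
```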

[0087] Preferably, step S3 specifically includes the following. The lightweight time-frequency fusion network includes a convolutional neural network branch, a long short-term memory network branch, and a feature fusion output layer. S31. Input the fused features into the lightweight time-frequency fusion network; the convolutional neural network branch captures local frequency and channel dependencies in the frequency domain features to obtain the frequency domain feature vector. Specifically, the convolutional neural network branch mainly processes the log-Mel spectrograms. The log-Mel spectrograms first pass through two convolutional layers (3×3 kernel, stride 1, same padding, with 32 and 64 output channels respectively) that use the ReLU6 activation function. Each convolutional layer is followed by a normalization layer and a max pooling layer to reduce the number of parameters and suppress overfitting. Subsequently, the SE (squeeze-and-excitation) attention mechanism weights the channel features, strengthening attention to specific features and improving feature discrimination. Finally, a flattening layer converts the convolutional features into a one-dimensional frequency domain feature vector.

[0088] S32. The long short-term memory network branch captures the temporal dynamics of the time-domain features to obtain a time-domain dynamic feature vector. Specifically, the long short-term memory (LSTM) branch uses two bidirectional LSTM layers, each with 128 hidden units and a dropout probability of 0.2. The temporal feature sequence is input into the first LSTM layer, and its output undergoes layer normalization. The normalized feature sequence is then input into the second LSTM layer, after which the forward and backward features are concatenated. A fully connected layer with ReLU6 activation follows to further compress the feature dimension and obtain the key temporal dynamic feature vector.

[0089] S33. Input the frequency domain feature vector and the time domain dynamic feature vector into the feature fusion output layer for splicing and output to obtain the type of environmental audio signal.

[0090] Specifically, the one-dimensional frequency domain feature vector output by the convolutional neural network branch is concatenated with the temporal dynamic feature vector output by the long short-term memory network branch to obtain the fused feature. The fused feature is fed into a fully connected layer with ReLU6 activation, followed by a dropout layer, and finally into the output layer with sigmoid activation, which outputs the probabilities of the different types of environmental audio signals. In specific embodiments of this invention, these types mainly include background noise, human voice, wind noise, tire noise, and horn sounds.
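As a non-limiting illustration, the inference-time computation of the feature fusion output layer may be sketched as follows. Dropout is omitted because it is inactive at inference. The layer sizes, the random placeholder weights, and all names are assumptions for illustration only; trained parameters would be used in practice.

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_output(freq_vec, time_vec, w_hidden, b_hidden, w_out, b_out) -> np.ndarray:
    """Concatenate the two branch outputs, apply dense + ReLU6, then dense + sigmoid."""
    fused = np.concatenate([freq_vec, time_vec])
    hidden = relu6(w_hidden @ fused + b_hidden)
    return sigmoid(w_out @ hidden + b_out)      # one probability per signal type

# Placeholder sizes: 8-dim frequency vector, 4-dim temporal vector, 5 signal types.
rng = np.random.default_rng(0)
probs = fusion_output(rng.standard_normal(8), rng.standard_normal(4),
                      rng.standard_normal((6, 12)), np.zeros(6),
                      rng.standard_normal((5, 6)), np.zeros(5))
```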

[0091] Preferably, referring to Figure 4, step S4 specifically includes: S41. Perform a short-time Fourier transform on the preprocessed audio signal to obtain the time-frequency domain complex spectrum of the audio signal. The time-frequency domain complex spectrum includes an amplitude spectrum and a phase spectrum. The amplitude spectrum characterizes the energy distribution of the audio signal across frequencies and time frames, and the phase spectrum characterizes the phase information of the audio signal across frequencies and time frames.

[0092] S42. Initialize a masking matrix group with the same dimensions as the amplitude spectrum, determine the source attribution of each time-frequency unit of the amplitude spectrum according to the type of the ambient audio signal and the fusion feature, and set the element values of the masking matrix group according to the determination result to obtain the assigned masking matrix group. Specifically, the masking matrix group includes a human voice masking matrix corresponding to human voice, a horn masking matrix corresponding to horn sound, a background noise masking matrix corresponding to ambient background noise, a wind noise masking matrix corresponding to wind noise, and a tire noise masking matrix corresponding to tire noise. In the initial state, the element values of all masking matrices are set to 0. The target sound source type to be separated and its corresponding feature thresholds (the feature threshold parameters obtained after the lightweight time-frequency fusion network has been trained) are determined from the environmental audio signal type output by the lightweight time-frequency fusion network in step S3. The target sound source type includes at least background noise, human voice, wind noise, tire noise, and horn sound. Each target sound source type corresponds to a preset time-domain feature threshold, frequency-domain feature threshold, and specific feature threshold. The time-domain feature threshold includes a short-time energy entropy threshold, a threshold for the rate of change of the zero-crossing rate, a pulse degree threshold, and a nonstationarity index threshold. The frequency-domain feature threshold includes a log-Mel spectrum energy threshold and a gammatone frequency cepstral coefficient threshold. The specific feature threshold includes the wind noise specific feature thresholds (the threshold for the energy ratio of the 0.1-0.5 kHz band to the full frequency band, the threshold for the zero-crossing rate standard deviation of the 0.1-0.5 kHz band, and the low-frequency band spectral flatness threshold), the tire noise specific feature thresholds (the modulation depth threshold of the 1.2-2.5 kHz band and the threshold for the energy ratio of the 1.2-2.5 kHz band to the 0.5-1 kHz band), and the horn sound specific feature thresholds (the energy peak threshold of the 2-3.5 kHz band, the pulse degree threshold of the 2-3.5 kHz band, and the harmonic number threshold and harmonic frequency interval threshold of the 2-3.5 kHz band).

[0093] Traverse each time-frequency unit of the amplitude spectrum, indexed by the frequency-domain sampling point index and the time-domain sampling point index. Combining the fusion features extracted in step S2 with the feature thresholds of the target sound sources to be separated, determine the sound source attribution of each time-frequency unit according to the following rules: If the time-domain features of the fusion features corresponding to the time-frequency unit meet the threshold range of human voice time-domain features and the frequency-domain features meet the threshold range of human voice frequency-domain features, and the corresponding typical noise features do not belong to horn sounds, wind noise, or tire noise, then the time-frequency unit is determined to belong to human voice, and the element value at the corresponding position in the human voice masking matrix is set to 1.

[0094] If the time-domain features of the fusion features corresponding to the time-frequency unit satisfy the time-domain threshold range of the environmental background noise, the frequency-domain features satisfy the frequency-domain threshold range of the environmental background noise, and the corresponding typical noise features do not belong to horn sounds, wind noise, or tire noise, then the time-frequency unit is determined to belong to the environmental background noise, and the element value at the corresponding position in the environmental background noise masking matrix is set to 1.

[0095] If, in the fusion features corresponding to the time-frequency unit, the specific features meet the threshold of the horn sound specific features, and the time-domain features do not meet the time-domain threshold range of environmental background noise and human voice, while the frequency-domain features meet the frequency-domain threshold range of environmental background noise and human voice, then the time-frequency unit is determined to belong to the horn sound, and the element value at the corresponding position in the horn sound masking matrix is set to 1.

[0096] If, in the fusion features corresponding to the time-frequency unit, the specific features meet the threshold of wind noise specific features, and the time-domain features do not meet the time-domain threshold range of environmental background noise and human voice, while the frequency-domain features meet the frequency-domain threshold range of environmental background noise and human voice, then the time-frequency unit is determined to belong to wind noise, and the element value at the corresponding position in the wind noise masking matrix is set to 1.

[0097] If, in the fusion features corresponding to the time-frequency unit, the specific features meet the threshold of tire noise specific features, and the time-domain features do not meet the time-domain threshold range of environmental background noise and human voice, while the frequency-domain features meet the frequency-domain threshold range of environmental background noise and human voice, then the time-frequency unit is determined to belong to tire noise, and the element value at the corresponding position in the tire noise masking matrix is set to 1.

[0098] For time-frequency units that do not meet any of the above criteria, the element values of the corresponding masking matrices remain 0; these units are treated as invalid noise floor units and filtered out.

[0099] S43. Based on the masking matrix group, the amplitude spectrum is separated to obtain different types of audio amplitude spectra. Specifically, each of the human voice masking matrix, horn masking matrix, background noise masking matrix, wind noise masking matrix, and tire noise masking matrix is multiplied element-wise with the amplitude spectrum to obtain the human voice amplitude spectrum, horn sound amplitude spectrum, background noise amplitude spectrum, wind noise amplitude spectrum, and tire noise amplitude spectrum, thereby separating the different target sound sources in the frequency domain.

[0100] S44. Based on Euler's formula, reconstruct the amplitude spectrum and phase spectrum of different types of audio to obtain different types of time-frequency domain complex spectra. Specifically, the amplitude spectra of human voice, horn sound, background noise, wind noise, and tire noise are each combined with the original phase spectrum obtained in step S41 and substituted into Euler's formula for reconstruction, yielding the time-frequency domain complex spectra corresponding to human voice, horn sound, environmental background noise, wind noise, and tire noise.
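As a non-limiting illustration, steps S43 and S44 may be sketched as follows: each binary mask is multiplied element-wise with the amplitude spectrum, and the result is recombined with the original phase via Euler's formula (amplitude times e raised to j times phase). The function name and the dictionary layout are assumptions for illustration only.

```python
import numpy as np

def separate_sources(complex_spec: np.ndarray, masks: dict) -> dict:
    """Apply each mask to the amplitude spectrum and rebuild a complex spectrum.

    complex_spec: STFT of the preprocessed signal, shape (freq_bins, frames).
    masks: mapping from source name to a 0/1 array of the same shape.
    """
    amplitude = np.abs(complex_spec)      # amplitude spectrum
    phase = np.angle(complex_spec)        # original phase spectrum
    return {
        name: (mask * amplitude) * np.exp(1j * phase)   # Euler's formula
        for name, mask in masks.items()
    }
```

When the masks are disjoint and jointly cover every time-frequency unit, the separated complex spectra sum back to the original spectrum.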

[0101] S45. Perform inverse short-time Fourier transform on the different types of time-frequency domain complex spectra respectively, and superimpose and overlap the transformed audio frames to obtain independent time-domain audio signals.

[0102] Specifically, because the short-time Fourier transform (STFT) uses frame shifting, adjacent audio frames obtained after the inverse short-time Fourier transform (ISTFT) have overlapping regions. The superposition and overlap processing performs a weighted superposition of these overlapping regions to eliminate frame discontinuities and ensure the continuity of the time-domain audio signal. This processing specifically includes: the Hanning window, consistent with the one used in the short-time Fourier transform, is used as the superposition weight window. For each audio frame after the inverse transform, the first overlapping region (the part overlapping with the previous frame) takes the rising half of the Hanning window as its weight, increasing from 0 to 1; the second overlapping region (the part overlapping with the next frame) takes the falling half of the Hanning window, decreasing from 1 to 0; and the weight of the non-overlapping region is set to 1.

[0103] Iterate through the inverse transform audio frames corresponding to each human voice source, horn sound source, ambient background noise source, wind noise source, and tire noise source. Starting from the first frame, the overlapping area between the current frame and the next frame is weighted and summed according to the weighting rules mentioned above. This process is repeated for all audio frames to ensure a smooth transition between adjacent frames.

[0104] After the superposition is completed, the length of the obtained time-domain audio signal is calibrated to ensure that its length is consistent with the length of the preprocessed audio signal. At the same time, the edge redundant signals generated during the superposition process are removed to obtain independent time-domain audio signals (including separated human voice signals, horn sound signals, environmental background noise signals, wind noise signals, and tire noise signals) to avoid signal length deviations affecting subsequent processing.

[0105] The purpose of superposition and overlap processing is to compensate for the inter-frame distortion introduced during the short-time Fourier transform framing and windowing process, eliminate the breakpoints and amplitude abrupt changes between adjacent audio frames, make the reconstructed independent time-domain audio signal closer to the original sound source characteristics, improve signal continuity and sound quality, and provide a high-quality signal foundation for subsequent speech recognition, transparency / noise reduction operations.
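As a non-limiting illustration, the inverse transform with superposition of overlapping frames may be sketched as follows. The sketch is a simplification: it applies a periodic Hanning window once at analysis with 50% overlap, so that the rising and falling halves of adjacent windows sum to 1 and a plain summation of the overlapping regions restores the signal. The frame length, hop size, and all names are assumptions for illustration only.

```python
import numpy as np

def stft(x, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Frame, window (periodic Hanning), and transform. Returns (frames, bins)."""
    window = np.hanning(frame_len + 1)[:-1]
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len: int = 256, hop: int = 128, length=None) -> np.ndarray:
    """Inverse transform each frame and sum the overlapping regions."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(frame_len + hop * (len(frames) - 1))
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame   # overlapping halves add up
    if length is not None:
        # Length calibration: trim or zero-pad to the requested length.
        out = out[:length] if len(out) >= length else np.pad(out, (0, length - len(out)))
    return out
```

In this simplified form, the first and last half-frames are covered by only one window and are therefore attenuated; they correspond to the edge regions that are trimmed or calibrated after superposition.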

[0106] If the independent time-domain audio signal obtained in step S4 does not contain human voice audio signal, skip step S5 and proceed directly to step S6; if the independent time-domain audio signal obtained in step S4 contains human voice audio signal, proceed to step S5.

[0107] Preferably, step S5 specifically includes: S51. Based on the automatic speech recognition engine, the human voice audio signal in the independent time-domain audio signal is converted into text at a preset sampling frequency to obtain text data. Specifically, the human voice signal separated in step S4 is input into the automatic speech recognition engine. In this specific embodiment of the invention, the automatic speech recognition engine adopts the FunASR model, which converts real-time human voice data into text at a sampling frequency of 16 kHz to obtain text data.

[0108] S52. Perform keyword matching on the text data to obtain keyword detection results. Specifically, a dual-category keyword library is constructed, consisting of a keyword library for dialogue invitations and a keyword library for broadcast reminders. The dialogue invitation keyword library contains keywords by which others invite the user into a dialogue, covering forms of address (such as names and nicknames), invitations (such as "calling you", "can you hear me", "talking to you"), inquiries (such as "what do you think", "where are you going", "help me"), and interactions (such as "together", "want to", "okay"). These keywords are mainly colloquial and short, matching the semantic characteristics of others initiating a dialogue with the user, so that dialogue invitations addressed to the user are accurately distinguished from ordinary dialogues addressed to others and from public announcements, avoiding misjudgment.

[0109] The keyword library for announcements and reminders contains commonly used keywords for public place information announcements, covering scene identification (such as scenic spots, high-speed rail, airplanes, and platforms), reminders (such as please note, friendly reminder, and upcoming arrival), and instructions (such as please check your ticket, please board the plane, and please do not lean). The keywords are mainly formal and standardized, adapted to the semantic characteristics of the announcement voice, and associated with exclusive announcement sentences for the corresponding scene (such as train XX is about to arrive at the station, please prepare to check your ticket).

[0110] A string matching algorithm is used to match text data with a preset dual-category keyword library one by one. In a specific embodiment of the present invention, the string matching algorithm is preferably the KMP algorithm. The matching similarity is calculated, and the distinction between dialogue voices and broadcast reminder voices is realized based on the keyword matching results.
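As a non-limiting illustration, keyword lookup with the KMP algorithm may be sketched as follows. The helper names and the example keywords are assumptions for illustration only; how the matching similarity is derived from the matches is not shown.

```python
def build_failure_table(pattern: str) -> list:
    """Length of the longest proper prefix that is also a suffix, per position."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text: str, pattern: str) -> int:
    """Index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]          # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

def matched_keywords(text: str, keywords) -> list:
    """Keywords from one library that occur in the recognized text."""
    return [kw for kw in keywords if kmp_find(text, kw) != -1]
```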

[0111] Specifically, if the text data matches the keyword library for dialogue invitations with a match rate of ≥80%, and matches the keyword library for broadcast reminders with a match rate of <50%, then the voice signal is determined to be a voice for dialogue invitations. If the text data matches the keyword library for broadcast reminders with a match rate of ≥80%, and matches the keyword library for dialogue invitations with a match rate of <50%, then the voice signal is determined to be a voice for broadcast reminders. If the text data does not meet the above conditions, it is determined to be an unclassified voice.
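As a non-limiting illustration, the decision rule above may be sketched as follows, with the two match rates expressed as fractions between 0 and 1. The function name and the label strings are assumptions for illustration only.

```python
def classify_voice(dialogue_rate: float, broadcast_rate: float) -> str:
    """Classify a voice signal from its match rates against the two keyword libraries."""
    if dialogue_rate >= 0.8 and broadcast_rate < 0.5:
        return "dialogue_invitation"
    if broadcast_rate >= 0.8 and dialogue_rate < 0.5:
        return "broadcast_reminder"
    return "unclassified"
```

For example, match rates of 0.85 and 0.6 yield an unclassified voice, because the second condition of either rule is not met.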

[0112] Finally, the above-mentioned keyword detection sub-results for dialogue invitations and broadcasts, as well as unclassified tags, are integrated into a complete keyword detection result, clearly marking the category of human voice signals (dialogue invitation voices / broadcast reminder voices / unclassified voices), corresponding matching keywords, and matching degree, providing a basis for subsequent dynamic mode decisions.

[0113] S53. Calculate the voiceprint embedding cosine similarity of the human voice audio signal in the independent time-domain audio signal, determine whether the human voice audio signal belongs to the user, and obtain the voiceprint detection result. Specifically, the cosine similarity is expressed as follows:

[0114]

[0115]

[0116] where the symbols denote, in order: the cosine similarity; the feature vector of the user's preset voiceprint template; the voiceprint embedding vector of the human voice to be detected; the L2 norm; the n-th element of the feature vector of the user's preset voiceprint template; and the n-th element of the voiceprint embedding vector of the human voice to be detected.
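As a non-limiting illustration, the cosine similarity between the preset voiceprint template and the embedding to be detected (the dot product divided by the product of the L2 norms) may be sketched as follows. The decision threshold of 0.7, the `eps` constant, and all names are assumptions for illustration only; no threshold value is specified above.

```python
import numpy as np

def cosine_similarity(template, embedding, eps: float = 1e-12) -> float:
    """Dot product of the two vectors divided by the product of their L2 norms."""
    a = np.asarray(template, dtype=float)
    b = np.asarray(embedding, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def is_user_voice(template, embedding, threshold: float = 0.7) -> bool:
    """Hypothetical decision: attribute the voice to the user above a threshold."""
    return cosine_similarity(template, embedding) >= threshold
```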

[0117] S54. The keyword detection results and the voiceprint detection results together constitute the speech recognition results.

[0118] Specifically, the keyword detection results and voiceprint detection results are integrated in the time domain. The voiceprint detection results are added to the voice signals that have already been labeled with a voice signal category in the keyword detection results (voices for dialogue invitations / voices for broadcast reminders / voices without clear classification). In other words, each voice labeled as a dialogue invitation, a broadcast reminder, or unclassified is additionally labeled as the user's own voice or another person's voice.

[0119] Preferably, referring to Figure 5, step S6 specifically includes: S61. Based on the type of independent time-domain audio signal and the speech recognition result, the independent time-domain audio signal is divided into tasks to obtain noise reduction tasks or transparency tasks.

[0120] Specifically, the independent time-domain audio signal may include human voice signal, horn signal, environmental noise signal, wind noise signal, and tire noise signal; if a human voice signal is present, based on the speech recognition result, the human voice audio signal is further divided into broadcast reminder and warning signals issued by the user, broadcast reminder and warning signals issued by others, dialogue invitation signals issued by the user, dialogue invitation signals issued by others, unclassified human voice signals issued by the user, and unclassified human voice signals issued by others.

[0121] If the independent time-domain audio signal contains one or more of the following: environmental noise signal, wind noise signal, tire noise signal, and user's own voice signal (including broadcast reminder and warning signals issued by the user, dialogue invitation signals issued by the user, and unclassified human voice signals issued by the user), then noise reduction tasks are performed on the corresponding environmental noise signal, wind noise signal, tire noise signal, or user's own voice signal respectively.

[0122] If the type of independent time-domain audio signal includes one or more of the following: horn signal, broadcast reminder / warning signal issued by another person, or dialogue invitation signal issued by another person, then a transparency task is performed on the horn signal, the broadcast reminder / warning signal issued by another person, or the dialogue invitation signal issued by another person.

[0123] In a specific embodiment of the present invention, if the independent time-domain audio signal contains none of the horn signal, the broadcast reminder / warning signal issued by others, or the dialogue invitation signal issued by others, the noise reduction task state is maintained by default.
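As a non-limiting illustration, the task division of step S61 may be sketched as follows. The label strings and the function name are assumptions for illustration only. Unclassified voices issued by others are assigned to noise reduction here on the basis of the default state described above; that treatment is an assumption of the sketch.

```python
TRANSPARENT_VOICE_CATEGORIES = {"broadcast_reminder", "dialogue_invitation"}

def assign_task(signal_type: str, speaker: str = None, voice_category: str = None) -> str:
    """Return "transparency" or "noise_reduction" for one separated signal.

    signal_type: "horn", "voice", "environmental_noise", "wind_noise", or "tire_noise".
    speaker: "user" or "other" (voice signals only).
    voice_category: "broadcast_reminder", "dialogue_invitation", or "unclassified".
    """
    if signal_type == "horn":
        return "transparency"
    if (signal_type == "voice" and speaker == "other"
            and voice_category in TRANSPARENT_VOICE_CATEGORIES):
        return "transparency"
    # Noise signals, the user's own voice, and the default case.
    return "noise_reduction"
```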

[0124] S62. Based on the dynamic pattern decision engine, generate and execute noise reduction or transparency operation instructions for the independent time-domain audio signal to obtain a noise reduction residual signal or a transparency signal.

[0125] Specifically, the same noise reduction instructions are applied to environmental noise signals, wind noise signals, and tire noise signals. First, a short-time Fourier transform is performed on the noise signal to obtain the noise time-frequency domain complex spectrum, and the noise amplitude spectrum and noise phase spectrum are extracted. Then, an adaptive noise reduction algorithm (preferably the existing adaptive noise cancellation algorithm ANC in this specific embodiment) is used to generate an anti-phase sound wave signal with the opposite phase and the same amplitude as the noise signal according to the energy distribution of the noise amplitude spectrum. The anti-phase sound wave signal is superimposed and canceled with the corresponding environmental noise signal, wind noise signal, and tire noise signal to achieve noise reduction processing of environmental noise and obtain the noise reduction residual signal.

[0126] Specifically, amplitude calibration is performed on the horn signal to adjust its amplitude to a preset transparent amplitude range, ensuring that the horn signal is clearly distinguishable and does not produce distortion; high-pass filtering is used to filter out weak low-frequency residual noise mixed in the horn signal, retaining the core frequency band of the horn signal (1 kHz to 4 kHz), thus obtaining a transparent horn signal.

[0127] Specifically, harmonic enhancement processing is applied to broadcast reminders and warning signals or dialogue invitation signals sent by others to repair minor distortions that may occur during time-frequency masking and separation, improve the clarity of human voices, and obtain transparent broadcast reminders and warning signals and dialogue invitation signals.

[0128] Specifically, noise reduction and elimination are performed on the user's own voice signal, which includes broadcast reminder and warning signals, dialogue invitation signals, and unclassified human voice signals issued by the user. Based on the voiceprint characteristics of the user's own voice, an anti-phase cancellation signal with opposite phase and matching amplitude is generated and superimposed on the user's own voice signal, so that the user's own voice is eliminated and no residue of it remains in the output signal.

[0129] S63. Perform synchronous alignment on the noise reduction residual signal and the transparency signal to obtain synchronously aligned signals, so that all signals share a consistent time axis and signal misalignment is avoided.
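
The specification does not state how the alignment in S63 is computed. As one possible approach, assumed here purely for illustration, the delay between two signals can be estimated from the peak of their cross-correlation and then removed by shifting. The helper names `estimate_lag` and `align` are hypothetical.

```python
import numpy as np

def estimate_lag(reference, delayed):
    """Return the number of samples by which `delayed` lags `reference`."""
    corr = np.correlate(delayed, reference, mode="full")
    # In "full" mode, output index len(reference) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(reference) - 1)

def align(delayed, lag):
    """Shift `delayed` by `lag` samples toward the reference, zero-padding the gap."""
    out = np.zeros_like(delayed)
    if lag > 0:
        out[:-lag] = delayed[lag:]
    elif lag < 0:
        out[-lag:] = delayed[:lag]
    else:
        out[:] = delayed
    return out

# Illustrative data: a random reference and a copy delayed by 5 samples.
rng = np.random.default_rng(0)
reference = rng.standard_normal(256)
delayed = np.concatenate([np.zeros(5), reference[:-5]])

lag = estimate_lag(reference, delayed)
aligned = align(delayed, lag)
```

In a fixed processing pipeline the delay of each branch is often known in advance, in which case the estimation step can be replaced by a constant offset.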

[0130] S64. The synchronized and aligned signals are superimposed and synthesized to obtain the processed signal.

[0131] Specifically, during synthesis, the transformed audio frames are overlap-added. This overlap-add procedure is the same as in step S45 and compensates for inter-frame distortion introduced during signal processing, ensuring the continuity and integrity of the synthesized signal.
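
A minimal overlap-add sketch follows. It assumes a periodic Hann window with 50% overlap, a common choice because the shifted windows sum to one; the window type, frame length, and hop used in step S45 are not given in this passage, so these values and the names `split_frames` and `overlap_add` are assumptions.

```python
import numpy as np

def split_frames(signal, frame_len, hop):
    """Cut the signal into Hann-windowed frames that advance by `hop` samples."""
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [signal[s:s + frame_len] * window for s in starts]

def overlap_add(frames, hop):
    """Sum the frames back into one signal at their original offsets."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

# Illustrative round trip with no per-frame processing in between.
rng = np.random.default_rng(1)
signal = rng.standard_normal(1024)
frames = split_frames(signal, frame_len=256, hop=128)
rebuilt = overlap_add(frames, hop=128)
```

Away from the first and last half frame, `rebuilt` matches `signal`, which is the property that lets per-frame processing be stitched together without audible seams.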

[0132] Preferably, in a specific embodiment of the present invention, amplitude normalization and distortion detection are performed on the synthesized signal. If distortion is detected, that is, the amplitude exceeds the preset range or there are discontinuities between frames, the overlap-add parameters and the amplitude range are adjusted and the synthesis is repeated. If no distortion is detected, the final signal is output.
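
The check in [0132] might be sketched as below. The target peak, the amplitude limit, and the jump threshold used to flag a break between frames are assumed values, and `normalize_peak` and `detect_distortion` are hypothetical names; the specification does not define the detection criteria numerically.

```python
import numpy as np

def normalize_peak(signal, target_peak=0.9):
    """Scale the signal so that its largest absolute sample equals target_peak."""
    peak = np.max(np.abs(signal))
    return signal if peak == 0.0 else signal * (target_peak / peak)

def detect_distortion(signal, frame_len, max_amp=1.0, max_jump=0.5):
    """Return True if the amplitude leaves the preset range or a frame boundary jumps."""
    if np.max(np.abs(signal)) > max_amp:
        return True
    boundaries = np.arange(frame_len, len(signal), frame_len)
    jumps = np.abs(signal[boundaries] - signal[boundaries - 1])
    return bool(np.any(jumps > max_jump))

# Illustrative signals: a smooth tone, and the same tone with a sign flip at a boundary.
fs = 16000
t = np.arange(512) / fs
normalized = normalize_peak(2.0 * np.sin(2 * np.pi * 100 * t))
broken = normalized.copy()
broken[256:] *= -1.0  # simulated break between two frames

clean_flag = detect_distortion(normalized, frame_len=128)
broken_flag = detect_distortion(broken, frame_len=128)
```

When the flag is raised, the embodiment adjusts the overlap-add parameters and the amplitude range and repeats the synthesis; that retry loop is omitted here.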

[0133] The dynamic pattern decision engine monitors changes in the speech recognition result in real time. If the keyword detection result or the voiceprint detection result changes, it adjusts the transparency or noise reduction operation instructions in real time, so that the processing mode always matches the audio signal type and the speech recognition result.
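
The mapping described in [0125] to [0128] and re-evaluated in [0133] can be summarized as a small decision function. The type labels, the function name `decide`, and the fallback for cases not covered by the text are assumptions of this sketch, not part of the specification.

```python
# Hypothetical labels for the separated signal types named in the description.
NOISE_TYPES = {"environmental_noise", "wind_noise", "tire_noise"}

def decide(signal_type, keyword_hit=False, is_user_voice=False):
    """Map a signal type and speech recognition result to an operation instruction.

    keyword_hit:   keyword detection result (reminder, warning, or invitation matched)
    is_user_voice: voiceprint detection result (voice belongs to the wearer)
    """
    if signal_type in NOISE_TYPES:
        return "noise_reduction"
    if signal_type == "horn":
        return "transparency"
    if signal_type == "voice":
        if is_user_voice:
            return "noise_reduction"   # the user's own voice is cancelled
        if keyword_hit:
            return "transparency"      # reminders, warnings, invitations from others
    # Assumption of this sketch: anything not covered above defaults to noise reduction.
    return "noise_reduction"

# Re-evaluating when the recognition result changes yields the updated instruction.
before = decide("voice", keyword_hit=False, is_user_voice=False)
after = decide("voice", keyword_hit=True, is_user_voice=False)
```

Calling `decide` again whenever the keyword or voiceprint result changes corresponds to the real-time adjustment described above.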

[0134] Referring to Figure 2, the present invention provides an audio signal processing system, comprising: an audio acquisition and preprocessing module, used to acquire an ambient audio signal and preprocess it based on a dual-path parallel front-end processing architecture to obtain a preprocessed audio signal; a feature extraction and fusion module, used to extract multi-dimensional features from the preprocessed audio signal and fuse the features extracted from the different dimensions to obtain fused features; an audio signal classification module, used to match the fused features based on a lightweight time-frequency fusion network to obtain the type of the ambient audio signal; an audio signal separation module, used to perform time-frequency masking separation on the preprocessed audio signal according to the type of the ambient audio signal to obtain an independent time-domain audio signal; a voice recognition module, used to perform speech recognition and keyword detection on the independent time-domain audio signal to obtain a speech recognition result; and a dynamic decision module, used to perform transparency or noise reduction operations on the independent time-domain audio signal based on the dynamic pattern decision engine and the speech recognition result to obtain the processed signal.

[0135] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. An audio signal processing method, characterized in that it comprises the following steps: acquiring an ambient audio signal and preprocessing it based on a dual-path parallel front-end processing architecture to obtain a preprocessed audio signal; extracting multi-dimensional features from the preprocessed audio signal and fusing the features extracted from the different dimensions to obtain fused features; matching the fused features based on a lightweight time-frequency fusion network to obtain the type of the ambient audio signal; performing time-frequency masking separation on the preprocessed audio signal according to the type of the ambient audio signal to obtain an independent time-domain audio signal; performing speech recognition and keyword detection on the independent time-domain audio signal to obtain a speech recognition result; and performing transparency or noise reduction operations on the independent time-domain audio signal, based on a dynamic pattern decision engine and the speech recognition result, to obtain a processed signal.

2. The audio signal processing method according to claim 1, characterized in that the dual-path parallel front-end processing architecture includes a first processing branch and a second processing branch; the first processing branch includes a high-pass filter and an adaptive notch filter, the high-pass filter being used to filter out low-frequency noise in the ambient audio signal and the adaptive notch filter being used to eliminate narrowband interference; and the second processing branch uses multi-scale wavelet decomposition to perform transient enhancement of burst noise in the ambient audio signal.

3. The audio signal processing method according to claim 1, characterized in that the step of extracting multi-dimensional features from the preprocessed audio signal and fusing the features extracted from the different dimensions to obtain fused features specifically includes: calculating the short-time energy entropy, zero-crossing rate of change, impulse degree, and non-stationarity index of the preprocessed audio signal to obtain time-domain features; performing frequency domain analysis on the preprocessed audio signal based on the log-Mel spectrum and Gammatone frequency cepstral coefficients to obtain frequency-domain features; extracting typical noise features from the preprocessed audio signal to obtain specific features; and performing weighted fusion of the time-domain features, frequency-domain features, and specific features to obtain the fused features.

4. The audio signal processing method according to claim 3, characterized in that the step of calculating the short-time energy entropy, zero-crossing rate of change, impulse degree, and non-stationarity index of the preprocessed audio signal to obtain time-domain features specifically includes: dividing the preprocessed audio signal into frames of a preset time length to obtain several fixed-length audio signals; windowing the fixed-length audio signals to obtain windowed audio signals; calculating the short-time energy entropy, zero-crossing rate of change, and impulse degree of the windowed audio signals; and calculating the non-stationarity index of the preprocessed audio signal based on the energy ratio of the intrinsic mode functions obtained by empirical mode decomposition.

5. The audio signal processing method according to claim 3, characterized in that the step of performing frequency domain analysis on the preprocessed audio signal based on the log-Mel spectrum and Gammatone frequency cepstral coefficients to obtain frequency-domain features specifically includes: performing a Fourier transform on the preprocessed audio signal to obtain a frequency domain signal; filtering the frequency domain signal with a Mel filter to obtain the Mel spectrum; taking the logarithm of the Mel spectrum and then normalizing it to obtain the log-Mel spectrum; filtering the log-Mel spectrum with a Gammatone filter to obtain the Gammatone spectrum; applying nonlinear compression to the Gammatone spectrum to obtain a compressed Gammatone spectrum; and applying a logarithm and a discrete cosine transform to the compressed Gammatone spectrum to obtain the Gammatone frequency cepstral coefficients.

6. The audio signal processing method according to claim 3, characterized in that the step of matching the fused features based on a lightweight time-frequency fusion network to obtain the type of the ambient audio signal specifically includes: the lightweight time-frequency fusion network includes a convolutional neural network branch, a long short-term memory network branch, and a feature fusion output layer; the fused features are input into the lightweight time-frequency fusion network, and the convolutional neural network branch captures local frequency and channel dependencies in the frequency-domain features to obtain a frequency-domain feature vector; the long short-term memory network branch captures the temporal dynamics of the time-domain features to obtain a time-domain dynamic feature vector; and the frequency-domain feature vector and the time-domain dynamic feature vector are input into the feature fusion output layer and concatenated to obtain the type of the ambient audio signal.

7. The audio signal processing method according to claim 1, characterized in that the step of performing time-frequency masking separation on the preprocessed audio signal according to the type of the ambient audio signal to obtain an independent time-domain audio signal specifically includes: performing a short-time Fourier transform on the preprocessed audio signal to obtain the time-frequency domain complex spectrum of the audio signal, which includes an amplitude spectrum and a phase spectrum; initializing a masking matrix group with the same dimensions as the amplitude spectrum, determining the source attribution of each time-frequency unit of the amplitude spectrum according to the type of the ambient audio signal and the fused features, and setting the element values of the masking matrix group according to the determination results to obtain the assigned masking matrix group; separating the amplitude spectrum based on the masking matrix group to obtain the audio amplitude spectra of the different types; combining, based on Euler's formula, the audio amplitude spectrum of each type with the phase spectrum to reconstruct the time-frequency domain complex spectrum of that type; and performing an inverse short-time Fourier transform on the time-frequency domain complex spectra of the different types and overlap-adding the transformed audio frames to obtain the independent time-domain audio signals.

8. The audio signal processing method according to claim 1, characterized in that the step of performing speech recognition and keyword detection on the independent time-domain audio signal to obtain the speech recognition result specifically includes: converting, based on an automatic speech recognition engine and at a preset sampling frequency, the human voice audio signal in the independent time-domain audio signal into text to obtain text data; performing keyword matching on the text data to obtain a keyword detection result; calculating the cosine similarity of the voiceprint embedding of the human voice audio signal in the independent time-domain audio signal, and determining whether the human voice audio signal belongs to the user, to obtain a voiceprint detection result; the keyword detection result and the voiceprint detection result together constituting the speech recognition result.

9. The audio signal processing method according to claim 8, characterized in that the step of performing transparency or noise reduction operations on the independent time-domain audio signal based on the dynamic pattern decision engine and the speech recognition result to obtain the processed signal specifically includes: assigning the independent time-domain audio signal to a noise reduction task or a transparency task based on the type of the independent time-domain audio signal and the speech recognition result; generating and executing, based on the dynamic pattern decision engine, noise reduction or transparency operation instructions for the independent time-domain audio signal to obtain a noise reduction residual signal or a transparency signal; performing synchronous alignment on the noise reduction residual signal and the transparency signal to obtain synchronously aligned signals; and superimposing and synthesizing the synchronously aligned signals to obtain the processed signal.

10. An audio signal processing system, characterized in that it comprises: an audio acquisition and preprocessing module, used to acquire an ambient audio signal and preprocess it based on a dual-path parallel front-end processing architecture to obtain a preprocessed audio signal; a feature extraction and fusion module, used to extract multi-dimensional features from the preprocessed audio signal and fuse the features extracted from the different dimensions to obtain fused features; an audio signal classification module, used to match the fused features based on a lightweight time-frequency fusion network to obtain the type of the ambient audio signal; an audio signal separation module, used to perform time-frequency masking separation on the preprocessed audio signal according to the type of the ambient audio signal to obtain an independent time-domain audio signal; a voice recognition module, used to perform speech recognition and keyword detection on the independent time-domain audio signal to obtain a speech recognition result; and a dynamic decision module, used to perform transparency or noise reduction operations on the independent time-domain audio signal based on a dynamic pattern decision engine and the speech recognition result to obtain a processed signal.