A method and system for acoustic event recognition

By dynamically adjusting the energy threshold and feature verification in the acoustic environment, the accuracy and robustness issues of sweeping sound recognition in complex acoustic environments are solved, and high-precision sweeping sound detection is achieved.

CN122245345A · Pending · Publication Date: 2026-06-19 · BEIJING ZHONGKE DONGREN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ZHONGKE DONGREN TECH CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately identify sweeping sound events in complex acoustic environments, resulting in decreased accuracy, high false alarm and false negative rates, and an inability to adapt to changing acoustic environments.

Method used

Audio data is acquired, and signal energy reference data is computed dynamically over a sliding window to generate an adaptive dynamic energy threshold. The short-time energy is compared against this threshold, and multiple target features are then verified to filter out abnormal events and achieve high-precision sweeping sound recognition.

Benefits of technology

Achieve high-precision, low-false-alarm, and robust sweeping sound detection in complex acoustic environments, adapt to environmental changes, possess high discrimination accuracy, and balance accuracy, efficiency, and ease of use.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to the field of acoustic signal processing technology, specifically to a method and system for acoustic event recognition. This application dynamically calculates the median and standard deviation of short-term energy using a sliding window, and combines this with a user-defined sensitivity coefficient to generate a dynamically adjusted energy threshold in real time, effectively addressing background noise fluctuations and baseline drift. In the detection phase, a duration constraint mechanism is introduced: when the energy continuously exceeds the threshold for a first preset duration, it is marked as a candidate event, and abnormal events with excessively long durations are eliminated. Furthermore, multi-dimensional target features are extracted from the candidate events, and each target feature is compared with typical sweeping sound features. This method is computationally efficient, can filter events based on dynamic thresholds and provide structured output, significantly reducing false alarm and false negative rates, and is applicable to various scenarios.

Description

Technical Field

[0001] This application relates to the field of acoustic signal processing technology, specifically to a method and system for acoustic event recognition.

Background Technology

[0002] Sound event recognition is one of the key technologies in fields such as intelligent monitoring, industrial inspection, intelligent transportation and consumer electronics. In application scenarios such as intelligent voice interaction, audio enhancement, wind monitoring and edge audio devices, accurately and efficiently identifying wind noise events (i.e. transient wind noise generated by airflow passing over microphones or device structures) is a key technical link to improve system robustness and user experience.

[0003] The physical characteristics of wind sweeping noise are highly uncertain: its intensity is significantly affected by wind speed, wind direction, microphone directivity, and equipment structure; its duration generally ranges from tens of milliseconds to hundreds of milliseconds. This type of noise is usually characterized by suddenness, short duration, high energy, and wide spectrum. Although the spectrum is mainly mid-to-high frequency, it may overlap with certain mechanical noises, clothing friction sounds, or popping sounds, making it very easy to be misjudged as a valid voiceprint or key sound event, thus leading to problems such as incorrect sound event recognition or failure of anomaly detection.

[0004] In practical applications, sound event recognition faces challenges from complex and ever-changing application scenarios. On the one hand, the acoustic environments in which the acquisition equipment is located vary significantly. For example, the sweeping sound generated in narrow air ducts between urban buildings differs drastically in spectral energy distribution from that generated in open outdoor environments. Similarly, the sweeping sound produced by a gentle breeze blowing through doors and windows indoors differs fundamentally in time-frequency characteristics from the turbulent sound caused by high-speed moving objects. On the other hand, even in the same scenario, the duration, intensity envelope, and harmonic components of the sweeping sound exhibit high non-stationarity and randomness as wind speed, wind direction, and environmental obstructions change. This drift in acoustic feature distribution caused by both scenario differences and environmental time-varying characteristics severely limits the generalization ability of recognition models trained on single-scenario datasets in different deployment environments, leading to a sharp decline in recognition accuracy.

[0005] In summary, there is an urgent need to develop a sound event recognition method and system that can adapt to changing acoustic environments and robustly handle the recognition of different forms of air-sweeping sound events, thereby meeting the high-precision sound event recognition requirements in complex dynamic environments.

Summary of the Invention

[0006] In view of this, this application provides a sound event recognition method and system that can adapt to dynamic and changing acoustic environments and thus identify target sweeping sound events.

[0007] In a first aspect, this application provides a sound event recognition method, comprising: acquiring audio data to be processed, the audio data to be processed containing at least one frame; determining signal energy reference data of the current frame within a preset sliding window based on the audio data to be processed, the current frame being the audio segment at the current moment; obtaining a dynamic energy threshold corresponding to the current frame based on the signal energy reference data; and performing a numerical threshold comparison between the short-time energy of the current frame and the dynamic energy threshold corresponding to the current frame to identify a target sweeping sound event.

[0008] In conjunction with the first aspect, in one possible implementation, the step of comparing the short-time energy of the current frame with the dynamic energy threshold corresponding to the current frame to identify the target sweeping sound event includes: in response to the duration for which the short-time energy of the current frame exceeds the dynamic energy threshold reaching a first preset duration, marking the current frame as a candidate sweeping sound event; extracting multiple target features of the candidate sweeping sound event; and sequentially verifying each target feature against its corresponding typical feature, filtering out abnormal sweeping sound events, and obtaining the target sweeping sound event.

[0009] In conjunction with the first aspect, in one possible implementation, acquiring the audio data to be processed includes: receiving initial audio data; uniformly standardizing the initial audio data to obtain a standardized signal; calling a preset filter to filter the standardized signal to obtain an enhanced signal; and performing frame-segmentation processing on the enhanced signal to obtain the audio data to be processed.

[0010] In conjunction with the first aspect, in one possible implementation, determining the signal energy reference data of the current frame within a preset sliding window based on the audio data to be processed includes: setting the preset sliding window by backtracking a preset number of frames based on the current frame; calculating the median and standard deviation of the short-time energy within the preset sliding window to obtain the signal energy reference data.

[0011] In conjunction with the first aspect, in one possible implementation, obtaining the dynamic energy threshold corresponding to the current frame based on the signal energy reference data includes: setting a sensitivity coefficient and a threshold correspondence; and obtaining the dynamic energy threshold of the current frame based on the sensitivity coefficient, the signal energy reference data, and the threshold correspondence.

[0012] In conjunction with the first aspect, in one possible implementation, before extracting multiple target features of the candidate sweeping sound events, the method further includes: determining the time interval between two adjacent candidate sweeping sound events; merging adjacent candidate sweeping sound events whose time interval is less than a preset interval threshold; and marking candidate sweeping sound events whose duration exceeds a maximum preset duration as non-sweeping events and filtering them out.

[0013] In conjunction with the first aspect, in one possible implementation, the target features include at least one of the following: the total duration of the candidate sweeping sound event, the maximum energy amplitude of the candidate sweeping sound event, the root mean square value of the energy amplitude of the candidate sweeping sound event, the total energy of the candidate sweeping sound event, the dominant frequency of the candidate sweeping sound event, and the spectral centroid of the candidate sweeping sound event.

[0014] In conjunction with the first aspect, in one possible implementation, the step of sequentially verifying each of the target features and their corresponding typical features, filtering out abnormal sweeping sound events, and obtaining the target sweeping sound event includes: setting the typical features corresponding to the feature parameters of each target feature; sequentially comparing each target feature with its corresponding typical feature to obtain multiple verification difference values; and in response to a preset number of verification difference values being greater than a preset difference value, marking the corresponding candidate sweeping sound event as an abnormal sweeping sound event and filtering it out to obtain the target sweeping sound event.

[0015] In conjunction with the first aspect, one possible implementation further includes: exporting the target wind-sweeping sound event and its corresponding multiple target features to obtain output data; converting the output data into multiple output files in various formats; and generating visual statistical charts based on the output data.

[0016] In a second aspect, this application provides a sound event recognition system, comprising: a data receiving module configured to acquire audio data to be processed, the audio data to be processed including at least one frame; a dynamic threshold calculation module communicatively connected to the data receiving module, the dynamic threshold calculation module being configured to determine signal energy reference data of the current frame within a preset sliding window based on the audio data to be processed, the current frame being the audio segment at the current moment, and to obtain a dynamic energy threshold corresponding to the current frame based on the signal energy reference data; and a sound event recognition module communicatively connected to the dynamic threshold calculation module, the sound event recognition module being configured to perform a numerical threshold comparison based on the short-time energy of the current frame and the dynamic energy threshold corresponding to the current frame to recognize a target sweeping sound event.

[0017] In application, this application receives audio data to be processed and dynamically calculates signal energy reference data of signals near the current frame based on a sliding window. It then generates a dynamic energy threshold that slides with the frame, effectively addressing background noise fluctuations and baseline drift. Furthermore, short-time energy is introduced for numerical threshold comparison, effectively suppressing false triggers caused by instantaneous impulse noise and improving detection accuracy. Ultimately, a high-confidence target sweeping sound event is obtained, achieving high-precision, low-false-alarm, and robust sweeping sound detection in complex acoustic environments. It is suitable for various sweeping sound recognition application scenarios, adapting to environmental changes and possessing high discrimination accuracy, balancing accuracy, efficiency, and ease of use.

Brief Description of the Drawings

[0018] Figure 1 is a schematic diagram of the steps of a sound event recognition method according to an embodiment of this application.

[0019] Figure 2 is a schematic diagram of the steps of a method for determining target sweeping sound events based on multiple target features.

[0020] Figure 3 is a schematic diagram of the steps for preprocessing audio data.

[0021] Figure 4 is a schematic diagram of the steps for setting the signal energy reference data.

[0022] Figure 5 is a schematic diagram of the steps for setting a dynamic energy threshold.

[0023] Figure 6 is a schematic diagram of the steps of a method for filtering events based on short-time energy.

[0024] Figure 7 is a schematic diagram of the steps for merging and filtering events based on time parameters.

[0025] Figure 8 is a schematic diagram of the steps for verifying each target feature.

[0026] Figure 9 is a schematic diagram of the steps for processing the output data.

[0027] Figure 10 is a schematic diagram of the original audio signal waveform.

[0028] Figure 11 is a diagram of the signal after noise reduction processing.

[0029] Figure 12 is a diagram of the energy distribution after separation.

[0030] Figure 13 is a table of target sweeping sound events and their multiple target features provided for a specific embodiment.

[0031] Figure 14 is a schematic diagram of the system structure of an acoustic event recognition system.

Detailed Implementation

[0032] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0033] Sound event recognition is a key technology in fields such as intelligent monitoring, industrial inspection, intelligent transportation, and consumer electronics. In these applications, acoustic signals are often not isolated but intertwined with complex environmental noise. Among these, the sweeping noise (transient wind noise) generated by airflow passing over microphones or equipment structures is a very common and unavoidable source of interference. In industrial inspection, airflow noise generated by fans and pneumatic equipment on production lines, or high-speed airflow noise caused by pipeline leaks, may be highly similar to the sound patterns of mechanical faults. If the sweeping noise cannot be accurately identified and separated, it will directly lead to a decrease in the accuracy of fault diagnosis.

[0034] Therefore, accurately and efficiently identifying sweeping sound events is not only a key technical aspect for improving the robustness and user experience of the aforementioned systems, but also a fundamental prerequisite for achieving intelligent acoustic perception in complex environments. Related technologies include detection methods based on fixed energy thresholds to identify sweeping sound events. This method processes the audio signal in frames, calculates the short-time energy of each frame, and determines a valid sound event when the energy exceeds a preset fixed threshold. However, this method cannot adapt to changes in the energy baseline of the audio signal over different time periods and under different environmental conditions, leading to false alarms in quiet environments and missed alarms in noisy environments, requiring frequent manual adjustment of the threshold parameters.

[0035] In addition, some related technologies use spectral template matching to identify sweeping sound events. This method pre-extracts typical spectral features of sweeping sounds as templates, and then performs similarity matching (such as cosine similarity, dynamic time warping, etc.) between the spectrum of the audio to be detected and the template. When the similarity exceeds a set threshold, it is determined to be a sweeping sound. This method has high requirements for the universality of the template and has high computational complexity, requiring a large amount of computing resources. Especially when processing long audio files, it is difficult to meet the requirements of real-time detection, and its efficiency is low for batch audio file processing.

[0036] In addition, there are machine learning methods based on Mel-frequency cepstral coefficients (MFCC) for identifying wind sweeping events. This method extracts MFCC features from the audio and uses pre-trained support vector machines, random forests, or neural network classifiers for frame-level or segment-level classification. This method requires a large amount of labeled data for model training and is highly dependent on the quality and representativeness of the training data. It also has high computational complexity and requires a lot of computing resources. Especially when processing long audio files, it is difficult to meet the requirements for real-time detection and has low efficiency for batch audio file processing.

[0037] In addition, there are related technologies that use endpoint detection (VAD)-based methods for sound event recognition. This method employs a dual-threshold endpoint detection algorithm, using two parameters—short-time energy and zero-crossing rate—to identify valid sound segments in the audio. This method is primarily designed for speech detection and has limited ability to distinguish special sound events such as the sound of wind sweeping.

[0038] In summary, this application proposes a sound event recognition method and system that can robustly handle sweeping sound events of different forms. It achieves high accuracy, low false alarms, and strong robustness in sweeping sound event detection in complex acoustic environments. It is applicable to a variety of sweeping sound recognition application scenarios, can adapt to environmental changes, and has high discrimination accuracy. It balances accuracy, efficiency, and ease of use, and meets the high-precision sound event recognition requirements in complex dynamic environments.

[0039] Figure 1 is a schematic diagram of the method steps of an acoustic event recognition method according to an embodiment of this application. As shown in Figure 1, the method includes:

[0040] Step 110: Obtain the audio data to be processed.

[0041] In this step, the audio data to be processed contains at least one frame, and the raw audio signal is obtained from a microphone or other audio input device as the audio data to be processed.

[0042] Step 120: Based on the audio data to be processed, determine the signal energy reference data of the current frame within the preset sliding window.

[0043] In this step, the current frame is the audio segment at the current moment. The audio data to be processed can be pre-divided into frames, for example, with a frame length of 1024 samples and a frame shift of 512 samples. Using the current frame as a reference point, a sliding window that continuously moves based on the current frame is set as a preset sliding window. Within this preset sliding window, one or more reference values of the signal, such as the average or median, are calculated. This reference value is used as the signal energy reference data of the acoustic environment in which the current frame is located. This dynamically reflects the acoustic activity of the current frame.

[0044] Step 130: Obtain the corresponding dynamic energy threshold for the current frame based on the signal energy reference data.

[0045] In this step, based on the reference value obtained in step 120, an adaptive dynamic energy threshold suitable for the current frame is generated through weighting, offsetting, or multiplying by coefficients. This allows the threshold to adaptively adjust with frame shift, improving robust detection of wind noise under varying noise backgrounds. The adaptive dynamic threshold can adjust in real time according to the local statistical characteristics of the audio signal, effectively addressing signal baseline drift and background noise fluctuations. Compared to fixed threshold methods, it reduces the false alarm rate by approximately 30-50% and the false negative rate by approximately 20-40% in variable environments.
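As a minimal sketch of how a threshold could follow the background, the snippet below assumes one particular combination of the reference data: the window median plus the sensitivity coefficient times the window standard deviation. That formula, the function name `dynamic_threshold`, and the numeric values are illustrative assumptions; the application only states that the threshold is obtained through weighting, offsetting, or multiplying by coefficients.

```python
def dynamic_threshold(median, std, sensitivity):
    """Assumed correspondence: background level plus a multiple of its spread."""
    return median + sensitivity * std


# The same frame energy is judged against two different backgrounds.
frame_energy = 3.0
quiet = dynamic_threshold(median=0.5, std=0.25, sensitivity=2.0)
noisy = dynamic_threshold(median=4.0, std=1.0, sensitivity=2.0)

exceeds_in_quiet = frame_energy > quiet  # stands out against a quiet background
exceeds_in_noisy = frame_energy > noisy  # stays below a noisy background
```

With a fixed threshold, one of these two cases would be misjudged; letting the threshold move with the local statistics is what allows both to be handled with a single sensitivity setting.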

[0046] Step 140: Based on the short-time energy of the current frame and the dynamic energy threshold corresponding to the current frame, perform a numerical threshold comparison to identify the target sweeping sound event.

[0047] In this step, an adaptive dynamic threshold is used to determine the target sweeping sound, rather than relying on a fixed static threshold. The short-time energy of the current frame reflects the energy intensity of the audio signal within the current time window. The dynamic energy threshold is not a fixed constant, but is calculated in real time based on the energy statistical characteristics using the current frame as the standard. Then, the target sweeping sound event is identified by comparing the numerical threshold. Introducing a dynamic energy threshold effectively overcomes the problem of traditional fixed threshold methods failing when background noise changes drastically. Whether in a quiet indoor environment or a noisy outdoor environment, the algorithm can automatically adjust the detection benchmark, avoiding false alarms caused by background noise fluctuations and preventing missed detections of weak sweeping sounds due to excessively high thresholds.

[0048] In this embodiment, the audio data to be processed is received, and the signal energy reference data of the signals near the current frame is dynamically statistically analyzed based on a sliding window. A dynamic energy threshold that slides with the frame is then generated, effectively addressing background noise fluctuations and baseline drift. Furthermore, short-time energy is introduced for numerical threshold comparison, effectively suppressing false triggers caused by instantaneous impulse noise and improving detection accuracy. Ultimately, a high-confidence target sweeping sound event is obtained, achieving high-precision, low-false-alarm, and robust sweeping sound detection in complex acoustic environments. It is suitable for various sweeping sound recognition application scenarios, adapting to environmental changes and possessing high discrimination accuracy, balancing accuracy, efficiency, and ease of use.

[0049] Figure 2 is a schematic diagram of the steps of a method for determining target sweeping sound events based on multiple target features. In one embodiment, as shown in Figure 2, step 140 includes:

[0050] Step 141: In response to the duration for which the short-time energy of the current frame exceeds the dynamic energy threshold reaching a first preset duration, mark the current frame as a candidate sweeping sound event.

[0051] This step establishes a screening mechanism for candidate sweeping sound events. It requires not only high energy in a single frame but also that the energy level continuously exceeds the dynamic energy threshold for a first preset duration before it is preliminarily identified as a candidate sweeping sound event. This effectively suppresses false triggers caused by transient impulse noise and improves the temporal stability of event detection. The first preset duration can be pre-set based on experience and actual scenario requirements; no specific restrictions are imposed here.
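The duration constraint can be sketched as a scan for runs of consecutive frames whose energy stays above the per-frame threshold. The sketch assumes the first preset duration is expressed as a number of frames (`min_frames`); the function name and the sample values are hypothetical.

```python
def candidate_events(energies, thresholds, min_frames):
    """Return (start, end) frame index pairs, end exclusive, for every run in
    which the energy stays above its threshold for at least `min_frames` frames."""
    events = []
    run_start = None
    count = min(len(energies), len(thresholds))
    for i in range(count):
        if energies[i] > thresholds[i]:
            if run_start is None:
                run_start = i  # a new run begins at this frame
        else:
            if run_start is not None and i - run_start >= min_frames:
                events.append((run_start, i))
            run_start = None
    # Close a run that is still open at the end of the data.
    if run_start is not None and count - run_start >= min_frames:
        events.append((run_start, count))
    return events


energies = [0.1, 5.0, 0.2, 4.0, 4.5, 5.0, 0.3, 0.2]
thresholds = [1.0] * len(energies)
events = candidate_events(energies, thresholds, min_frames=3)
```

Here the single-frame spike at index 1 is discarded as impulse noise, while the three-frame run starting at index 3 is kept as a candidate.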

[0052] Step 142: Extract multiple target features from the candidate sweeping sound event.

[0053] In this step, more discriminative acoustic features are extracted from candidate event segments, such as the total duration of the candidate sweeping sound event, the maximum energy amplitude within the event, the root mean square value of the energy amplitude within the event, the total energy of the event, the dominant frequency of the event, and the centroid of the event spectrum. This step can reflect the sound characteristics from multiple dimensions, providing a basis for subsequent accurate classification and distinguishing sweeping sounds from other high-energy noise.
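The six listed features could be computed from the samples of one candidate segment roughly as follows. This is a sketch under assumptions: peak and RMS are taken over the signal amplitude, the spectrum comes from a naive DFT (adequate only for short segments), the centroid is magnitude-weighted, and the function and key names are illustrative rather than taken from the application.

```python
import cmath
import math


def event_features(samples, sample_rate):
    """Compute duration, peak, RMS, total energy, dominant frequency and
    spectral centroid for one candidate event segment."""
    n = len(samples)
    half = n // 2 + 1  # non-negative frequency bins of a real signal
    magnitude = [
        abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(half)
    ]
    freqs = [k * sample_rate / n for k in range(half)]
    peak_bin = max(range(half), key=lambda k: magnitude[k])
    total_mag = sum(magnitude)
    energy = sum(s * s for s in samples)
    centroid = sum(f * m for f, m in zip(freqs, magnitude)) / total_mag if total_mag else 0.0
    return {
        "duration_s": n / sample_rate,
        "peak": max(abs(s) for s in samples),
        "rms": math.sqrt(energy / n),
        "total_energy": energy,
        "dominant_freq_hz": freqs[peak_bin],
        "spectral_centroid_hz": centroid,
    }


# Synthetic check: a 1 kHz sine sampled at 8 kHz, 64 samples long.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
features = event_features(tone, sample_rate=8000)
```

For the synthetic tone, the dominant frequency and the spectral centroid both land at 1 kHz, which is a quick sanity check that the frequency axis is scaled correctly.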

[0054] Step 143: Verify each target feature and its corresponding typical feature in turn, filter out abnormal sweeping sound events, and obtain the target sweeping sound events.

[0055] In this step, the extracted target features are compared with the typical features of pre-established typical sweeping sounds. If a certain number of features do not match, the candidate sweeping sound event is marked as an abnormal sweeping sound event and filtered out. The remaining events after filtering are the target sweeping sound events. This step can significantly reduce the false alarm rate and improve the detection accuracy. The multi-feature joint verification mechanism enhances the algorithm's generalization ability and anti-interference ability.
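One way to read this verification is as a count of features that deviate too far from their typical values. The sketch below assumes a relative-difference measure and a mismatch count; the function name, both limits, and every number in `typical` are made-up placeholders for illustration, not values from the application.

```python
def is_abnormal(features, typical, max_relative_diff=0.5, max_mismatches=2):
    """Flag an event as abnormal when at least `max_mismatches` features differ
    from their typical value by more than `max_relative_diff` (relative).
    Typical values are assumed to be non-zero."""
    mismatches = 0
    for name, reference in typical.items():
        diff = abs(features[name] - reference) / abs(reference)
        if diff > max_relative_diff:
            mismatches += 1
    return mismatches >= max_mismatches


# Placeholder typical features, for illustration only.
typical = {"duration_s": 0.2, "rms": 0.3, "spectral_centroid_hz": 2000.0}

close_match = {"duration_s": 0.25, "rms": 0.35, "spectral_centroid_hz": 2400.0}
poor_match = {"duration_s": 0.9, "rms": 0.9, "spectral_centroid_hz": 2100.0}

kept = not is_abnormal(close_match, typical)
filtered = is_abnormal(poor_match, typical)
```

Requiring several mismatches before rejecting an event, rather than a single one, tolerates natural variation in any one feature while still filtering events that differ on most dimensions.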

[0056] In this embodiment, a criterion is introduced that the short-term energy must continuously exceed the dynamic energy threshold for a first preset duration to effectively suppress false triggering caused by instantaneous impulse noise. Then, multiple target features are extracted from the initially screened candidate sweeping sound events, and these features are compared item by item with pre-established typical sweeping sound features to filter out mismatched abnormal events, ultimately obtaining the aforementioned target sweeping sound events.

[0057] Figure 3 is a schematic diagram of the steps of a method for preprocessing audio data. In one embodiment, as shown in Figure 3, prior to step 110, the method further includes:

[0058] Step 210: Receive initial audio data.

[0059] Step 220: Standardize the initial audio data to obtain a standardized signal.

[0060] In this step, the raw audio data is preprocessed to eliminate inconsistencies caused by differences in acquisition equipment, recording conditions, or formats, providing a unified and stable input signal for subsequent feature extraction and event detection.

[0061] The standardization process includes sampling rate unification, channel merging, and amplitude normalization.

[0062] Sampling rate unification refers to resampling audio recorded at different sampling rates (such as 44.1 kHz or 16 kHz) to the system's preset standard sampling rate, which ensures consistent time resolution and allows parameters such as frame length and spectrum to be calculated uniformly. The preset standard sampling rate may be the same or different across systems and can be set according to actual scenario requirements and experience; for example, it can be 16 kHz.

[0063] Channel merging refers to converting multi-channel audio (such as stereo) into mono (usually by averaging the left and right channels) to simplify the processing and avoid interference from energy differences between channels.

[0064] Amplitude normalization refers to scaling the amplitude of an audio signal proportionally to a fixed range. For example, the amplitude of an audio signal can be normalized to [-1, 1], or the root mean square energy can be normalized based on the amplitude of the audio signal to eliminate the influence of the audio volume on the energy threshold judgment.

[0065] Through the above processing, the obtained standardized signal has consistency in time domain scale, channel structure and energy level, which significantly improves the stability and generalization ability of subsequent adaptive threshold detection and feature analysis.
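Two of the three standardization operations, channel merging and amplitude normalization, can be sketched in a few lines. Resampling is omitted here because it needs an interpolation filter; the function names are illustrative, and peak normalization to [-1, 1] is only one of the two normalization options mentioned above.

```python
def to_mono(left, right):
    """Merge a stereo pair into mono by averaging the two channels."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def peak_normalize(samples):
    """Scale the signal so its largest absolute amplitude becomes 1."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silent input is returned unchanged
    return [s / peak for s in samples]


left = [0.25, -0.5, 0.125]
right = [0.75, -0.5, 0.375]
mono = to_mono(left, right)
standardized = peak_normalize(mono)
```

After this step, recordings made at different volumes produce energies on a comparable scale, which is what allows one sensitivity setting to be reused across devices.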

[0066] Step 230: Use a preset filter to filter the standardized signal to obtain an enhanced signal.

[0067] In this step, the filter can be configured according to typical frequency domain characteristics. The filter is an adjustable bandwidth bandpass filter, and the type can be a Butterworth filter, which can maintain the maximum flat response and allow for steeper or smoother transition characteristics in specific frequency bands. The filter settings in this step are as follows:

[0068] .............Formula (1);

[0069] .............Formula (2);

[0070] .......Formula (3);

[0071] In formulas (1) to (3) above, N is the filter order, and the two cutoff parameters are the normalized lower cutoff frequency and the normalized upper cutoff frequency. For example, if the energy of the typical frequency-domain characteristics is mainly concentrated between 500 Hz and 4800 Hz, the lower cutoff frequency is 500 Hz and the upper cutoff frequency is 4800 Hz. H(jω0) is the maximum gain within the passband, which is usually normalized to 1, i.e., |H(jω0)| = 1; ω0 is the geometric center frequency of the passband; and B is the passband width of the filter.

[0072] Step 240: Perform frame segmentation on the enhanced signal to obtain the audio data to be processed.

[0073] In this step, the audio data to be processed is divided into frames, for example, the frame length is 1024 samples and the frame shift is 512 samples.

[0074] In some embodiments, after the framing process in step 240, several characteristics of each frame are calculated, including short-time energy, zero-crossing rate, and spectral flatness. Short-time energy represents the energy level of a frame of signal and can be used to distinguish between voiced and unvoiced sounds, and between silence and non-silence. Zero-crossing rate represents the number of times a frame's signal waveform crosses a zero point and can be used to estimate frequency; for example, for a sweeping sound signal, voiced sounds have a low zero-crossing rate, while unvoiced sounds have a high zero-crossing rate. Spectral flatness represents the flatness of the signal spectrum; for example, for a sweeping sound signal, voiced sounds have lower spectral flatness (with obvious formants), while unvoiced sounds have higher spectral flatness (similar to white noise).

[0075] In the framing process, if the audio data to be processed is represented as x[n], the signal of the i-th frame is $x_i[n] = x[iH + n]$, where H is the frame shift, n = 0, 1, ..., L-1, and L is the frame length.

[0076] The short-time energy $E_i$ of the i-th frame is calculated as:

[0077] $E_i = \sum_{n=0}^{L-1} x_i^2[n]$ ............Formula (4);
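The framing rule and the short-time energy can be sketched as follows, assuming the energy of a frame is the sum of its squared samples and that trailing samples which do not fill a whole frame are dropped. The helper names `split_frames` and `short_time_energy` are illustrative.

```python
def split_frames(x, frame_len, hop):
    """Return frames x_i[n] = x[i*hop + n] for n = 0..frame_len-1."""
    if len(x) < frame_len:
        return []
    count = 1 + (len(x) - frame_len) // hop
    return [x[i * hop : i * hop + frame_len] for i in range(count)]


def short_time_energy(frame):
    """Sum of squared samples of one frame."""
    return sum(s * s for s in frame)


signal = [0.0, 1.0, -1.0, 2.0, 0.5, -0.5, 1.5, -2.0]
frames = split_frames(signal, frame_len=4, hop=2)
energies = [short_time_energy(f) for f in frames]
```

With a frame length of 1024 and a frame shift of 512, as in the example above, consecutive frames overlap by half, so each sample contributes to two energy values.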

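[0078] The zero-crossing rate $Z_i$ of the i-th frame is calculated as:

[0079] $Z_i = \frac{1}{2} \sum_{n=1}^{L-1} \left| \mathrm{Sgn}(x_i[n]) - \mathrm{Sgn}(x_i[n-1]) \right|$ .....Formula (5);

[0080] In formula (5), Sgn(x) is the sign function:

[0081] $\mathrm{Sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$ ....Formula (6);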

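[0082] The spectral flatness $SFM_i$ of the i-th frame is calculated as:

[0083] $SFM_i = \frac{\exp\left(\frac{1}{N} \sum_{k=0}^{N-1} \ln\left(S_i[k] + \epsilon\right)\right)}{\frac{1}{N} \sum_{k=0}^{N-1} S_i[k]}$ ....Formula (7);

[0084] In formula (7), $S_i[k] = |X_i[k]|^2$ represents the power spectrum, $\epsilon$ is a very small constant that prevents taking the logarithm of zero, and N represents the signal length.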
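The zero-crossing count and the spectral flatness can be illustrated with the sketch below. It assumes the sign convention Sgn(x) = 1 for x >= 0 and -1 otherwise, takes flatness as the geometric mean of the power spectrum divided by its arithmetic mean, and uses a naive DFT that is only practical for short frames. All function names and the value of `eps` are illustrative.

```python
import cmath
import math


def sign(v):
    """Assumed sign convention: +1 for v >= 0, -1 otherwise."""
    return 1 if v >= 0 else -1


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples."""
    return sum(abs(sign(b) - sign(a)) for a, b in zip(frame, frame[1:])) // 2


def power_spectrum(frame):
    """|X[k]|^2 computed with a naive O(N^2) DFT."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n)
    ]


def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    s = power_spectrum(frame)
    arithmetic = sum(s) / len(s)
    if arithmetic == 0:
        return 0.0  # all-zero frame: flatness is undefined, report 0
    geometric = math.exp(sum(math.log(v + eps) for v in s) / len(s))
    return geometric / arithmetic


impulse = [1.0] + [0.0] * 7                                # flat spectrum
tone = [math.cos(2 * math.pi * t / 8) for t in range(8)]   # energy in one frequency

crossings = zero_crossing_count([1.0, -1.0, 1.0, -1.0])
flat = spectral_flatness(impulse)
tonal = spectral_flatness(tone)
```

The impulse, whose spectrum is uniform, gives a flatness close to 1, while the pure tone gives a value close to 0, matching the noise-like versus tonal interpretation of the measure.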

[0085] Figure 4 is a schematic diagram of the steps for setting signal energy reference data. In one embodiment, as shown in Figure 4, step 120 includes:

[0086] Step 121: Based on the current frame, backtrack a preset number of frames and set a preset sliding window.

[0087] Step 122: Calculate the median and standard deviation of the short-time energy within the preset sliding window to obtain the signal energy reference data.

[0088] In this embodiment, either the median alone or the median combined with the standard deviation is used as the signal energy reference data. In step 121, a preset sliding window is constructed by backtracking a preset number of frames (such as the most recent 50 frames) with the current frame as the reference point. Based on formula (4), the median and standard deviation of the short-time energy are calculated within the preset sliding window to characterize the background energy level of the current acoustic environment and its degree of fluctuation. The median estimates the typical background energy; it is insensitive to outliers and, compared with the mean, is less affected by extremely large or small energy values. The standard deviation reflects the dispersion of the energy distribution, that is, the stability of the environment. The median can be used alone as the signal energy reference data, or the two can be combined, for example by weighted fusion of the median and standard deviation, to generate more refined signal energy reference data. This design enables the subsequent dynamic threshold to account for both the central tendency and the variability of the background energy, so that detection remains robust in complex scenarios such as sudden noise changes or weak sweeping sounds, improving the accuracy of sweeping sound recognition and reducing false alarms and false negatives.
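A minimal sketch of steps 121 and 122 in Python follows. It assumes that the window includes the current frame and that the population standard deviation is used; neither choice is specified in this application, and the function name `energy_reference` is hypothetical.

```python
import statistics


def energy_reference(energies, index, window=50):
    """Median and standard deviation of the last `window` frames up to `index`."""
    start = max(0, index - window + 1)
    recent = energies[start:index + 1]
    median = statistics.median(recent)
    # Population standard deviation: defined even for a single frame (gives 0).
    std = statistics.pstdev(recent)
    return median, std
```

For the energies `[1, 1, 1, 100, 1]` the median stays at 1 although the mean is 20.8, which illustrates why the median is the more robust estimate of background energy when a brief loud frame falls inside the window.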

[0089] Figure 5 illustrates the steps of setting the dynamic energy threshold. In one embodiment, as shown in Figure 5, step 130 includes:

[0090] Step 131: Set the correspondence between sensitivity coefficient and threshold.

[0091] In this step, the sensitivity coefficient can be set and tested in advance based on expert experience, or it can be determined through multiple experimental tests; no specific limitation is made here.

[0092] Step 132: Based on the sensitivity coefficient, signal energy reference data, and threshold correspondence, obtain the dynamic energy threshold of the current frame.

[0093] In this embodiment, by introducing a user-configurable sensitivity coefficient and threshold correspondence, and combining it with the signal energy reference data obtained in step 120, a dynamic energy threshold for the current frame is generated.

[0094] For example, when only the median of the short-time energy is used, the dynamic energy threshold is T(t) = a*Median(t), where t denotes the array of per-frame short-time energy values, Median(t) is the median of the short-time energy distribution, and a is a sensitivity coefficient set by the user. The higher the desired sensitivity, the smaller a is; the higher the desired strictness, the larger a is.

[0095] Here, "strictness" refers to a metric in the dynamic energy threshold setting, which determines the system's tolerance for noise and non-target sounds when recognizing signals. This tolerance is reflected in the sensitivity coefficient after adjustment. By adjusting the sensitivity coefficient, the strictness with which the algorithm filters background noise can be controlled.

[0096] Higher stringency results in a larger sensitivity coefficient: when a larger sensitivity coefficient is selected, the dynamic energy threshold increases accordingly. In this case, only when the signal energy is significantly higher than the dynamic energy threshold of the current frame will it be considered a valid acoustic event. Increasing the stringency makes it more difficult for the system to trigger a response, thus reducing the possibility of false alarms.

[0097] Conversely, reducing the stringency lowers the dynamic energy threshold, making it easier for the system to detect potential acoustic events, although the resulting higher sensitivity may also increase false alarms.

[0098] By flexibly adjusting the stringency, the system's performance can be optimized according to the needs of the actual application scenario. This mechanism not only enhances adaptability to changes in environmental noise but also allows for dynamic adjustment of detection accuracy and reliability based on specific application scenarios.

[0099] For example, when both the median and the standard deviation of the short-time energy are used, the dynamic energy threshold is T(t) = Median(t) + a*Std_t;

[0100] Here, Median(t) is the median of the short-time energy within the preset sliding window, Std_t is the standard deviation of the short-time energy within the sliding window, and a is a sensitivity coefficient set by the user. The higher the desired sensitivity, the smaller a is; the higher the desired strictness, the larger a is.

[0101] This embodiment achieves flexible adaptive adjustment of the dynamic energy threshold, which not only retains the ability to follow the statistical characteristics of environmental noise, but also supports dynamic adjustment of detection sensitivity according to actual application scenarios, significantly improving the practicality and robustness of the system without increasing algorithm complexity.
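The two threshold forms given in this embodiment, T = a*Median and T = Median + a*Std, can be sketched in Python as follows. The function name `dynamic_threshold` and the default value of `a` are illustrative assumptions.

```python
def dynamic_threshold(median, std=None, a=2.0):
    """Dynamic energy threshold for the current frame.

    With only the median available: T = a * median.
    With median and standard deviation: T = median + a * std.
    """
    if std is None:
        return a * median
    return median + a * std
```

With a median of 2.0 and a standard deviation of 1.0, raising `a` from 2.0 to 3.0 raises the second threshold from 4.0 to 5.0, which corresponds to stricter detection.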

[0102] Figure 6 illustrates the steps of a method for filtering events based on short-time energy. In one embodiment, as shown in Figure 6, after step 140, the sound event recognition method further includes:

[0103] Step 144: If the short-term energy corresponding to the candidate sweeping sound event continuously exceeds the dynamic energy threshold for a second preset duration, then the event is filtered out.

[0104] In this embodiment, the second preset duration is longer than the first preset duration. After candidate sweeping sound events are initially marked, i.e., when the short-time energy continuously exceeds the dynamic threshold for the first preset duration, a further duration criterion is introduced: if the duration of a candidate event exceeds the second preset duration, it is determined to be an atypical sweeping sound event and is removed. The second preset duration can be set according to experience and actual scenario requirements, and is not specifically limited here.

[0105] The rationale is that real sweeping sounds typically have a finite duration, while prolonged high-energy events are more likely to originate from steady-state noise. By setting an upper limit on the duration, such long-duration spurious events can be effectively filtered out. Technically, this mechanism, while preserving short-duration real sweeping sounds, significantly suppresses false alarms caused by persistent background noise, further improving the accuracy and reliability of the detection results and enhancing the algorithm's practicality in complex acoustic environments.
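The two duration constraints can be sketched as a single pass over the frames. In this Python sketch, durations are expressed in frames, a run is discarded only when it is strictly longer than the second preset duration, and the names `find_candidates`, `min_frames`, and `max_frames` are hypothetical; the application leaves these details open.

```python
def find_candidates(energies, thresholds, min_frames, max_frames):
    """Return (start, end) frame index pairs, with end exclusive.

    A run of frames whose energy exceeds the per-frame threshold is kept when
    it lasts at least min_frames (first preset duration) and at most
    max_frames (second preset duration).
    """
    events = []
    start = None
    # A sentinel pair closes a run that reaches the end of the signal.
    pairs = list(zip(energies, thresholds)) + [(0.0, float("inf"))]
    for i, (energy, threshold) in enumerate(pairs):
        if energy > threshold:
            if start is None:
                start = i
        elif start is not None:
            length = i - start
            if min_frames <= length <= max_frames:
                events.append((start, i))
            start = None
    return events
```

A one-frame spike is rejected as too short and a run longer than `max_frames` is rejected as steady-state noise, so only runs of plausible length remain as candidates.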

[0106] Figure 7 illustrates the steps of a method for merging and filtering events based on time parameters. In one embodiment, as shown in Figure 7, prior to step 150, the method further includes:

[0107] Step 1501: Determine the time interval between two adjacent candidate sweeping sound events.

[0108] Step 1502: Merge adjacent candidate sweeping sound events with a time interval less than a preset interval threshold.

[0109] Step 1503: Mark candidate sweeping sound events with a duration exceeding the maximum preset duration as non-sweeping events and filter them out.

[0110] In this embodiment, the start and end times of each candidate sweeping event can be obtained in the time domain. The interval between the end time of the previous candidate sweeping event and the start time of the next candidate sweeping event is the time interval between two candidate sweeping events. The preset interval threshold can be set to a value between 0.1 seconds and 0.5 seconds, and the maximum preset duration can be set to a value between 1 second and 5 seconds. For candidate sweeping sound events with a duration exceeding the maximum preset duration, they generally include multiple independent events or non-sweeping events, and such events are eliminated. In application, this embodiment can perform event-level post-processing optimization before extracting target features from candidate sweeping sound events. Event merging can address the problem of incorrect segmentation of the same sweeping process due to a brief drop in energy, and event elimination can remove other non-sweeping interference. This embodiment effectively improves the accuracy of event boundary identification, avoids over-segmentation or erroneous retention of long-duration pseudo-events, thereby providing cleaner and more representative candidate samples in the subsequent feature extraction and verification stages, further reducing the false alarm rate and enhancing the system's ability to focus on real sweeping sound events.
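Steps 1501 to 1503 can be sketched in Python as follows. The defaults of 0.3 seconds and 3.0 seconds are arbitrary picks from the ranges stated above, and the function name `merge_and_filter` is hypothetical.

```python
def merge_and_filter(events, max_gap=0.3, max_duration=3.0):
    """Merge close events, then drop overly long ones.

    events: list of (start_s, end_s) tuples in seconds, sorted by start time.
    """
    merged = []
    for start, end in events:
        # Gap between the previous event's end and this event's start.
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s <= max_duration]
```

Merging runs before the duration filter in this sketch, so a sweep that was split by a brief energy drop is judged by its full merged length.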

[0111] In one embodiment, the target features include at least one of the following: the total duration of the candidate sweeping sound event, the maximum energy amplitude of the candidate sweeping sound event, the root mean square value of the energy amplitude of the candidate sweeping sound event, the total energy of the candidate sweeping sound event, the dominant frequency of the candidate sweeping sound event, and the spectral centroid of the candidate sweeping sound event.

[0112] In this embodiment, the start and end times of each candidate sweeping sound event are calculated. The total event duration $T_{total}$ is calculated as:

[0113] $$T_{total} = \frac{N}{f_s}$$ Formula (8);

[0114] In formula (8), N is the total number of sampling points of the candidate sweeping sound event, and $f_s$ is the sampling frequency (Hz);

[0115] The maximum energy amplitude $A_{max}$ within the event is calculated as:

[0116] $$A_{max} = \max_{0 \le n \le N-1} \left| x[n] \right|$$ Formula (9);

[0117] In formula (9), x[n] is the energy amplitude value of the n-th sampling point, which is dimensionless and usually normalized;

[0118] The root mean square (RMS) value of the energy amplitude within the event is calculated as:

[0119] $$RMS = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x[n]^2}$$ Formula (10);

[0120] The total energy $E_{total}$ of the event is calculated as:

[0121] $$E_{total} = \sum_{n=0}^{N-1} x[n]^2 \, \Delta t$$ Formula (11);

[0122] In formula (11), Δt is the sampling time interval;

[0123] The dominant frequency $f_{dominant}$ of the event is calculated as:

[0124] $$f_{dominant} = \arg\max_{f} \left| X(f) \right|$$ Formula (12);

[0125] In formula (12), X(f) is the complex amplitude at frequency f;

[0126] The spectral centroid SC of the event is calculated as:

[0127] $$SC = \frac{\sum_{k} f_k \left| X[k] \right|}{\sum_{k} \left| X[k] \right|}$$ Formula (13);

[0128] In formula (13), $f_k$ is the frequency of the k-th of the N frequency bins into which the Discrete Fourier Transform (DFT) decomposes the signal; $f_k$ is calculated as:

[0129] $$f_k = \frac{k f_s}{N}$$ Formula (14);

[0130] In formula (13), X[k] is the Discrete Fourier Transform of the time-domain signal, a frequency-domain representation that includes amplitude and phase information. X[k] is calculated as:

[0131] $$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2 \pi k n / N}$$ Formula (15).

[0132] The total duration of an event is the length of the sound event on the time axis, reflecting the event's duration characteristics. The maximum energy amplitude within an event is the instantaneous maximum sound pressure level (peak loudness) during the event, reflecting the peak intensity of the sound. The root mean square value of the energy amplitude within an event is the average sound energy level during the event, reflecting a key indicator of perceived loudness. The total energy of an event is the total amount of sound energy accumulated throughout the entire event, reflecting the energy scale of the sound. The dominant frequency of an event is the frequency component in the event's spectrum where energy is most concentrated, reflecting the fundamental frequency or dominant frequency of the sound. The centroid of the event's spectrum is the "center of gravity" frequency of the spectral energy, reflecting the brightness / sharpness of the sound.
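The six target features can be computed for one event as in the following Python sketch. It assumes standard definitions, restricts the spectrum to the non-negative frequency bins, and weights the spectral centroid by magnitude rather than power; these choices and the dictionary keys are assumptions made for illustration. The naive DFT keeps the sketch dependency-free, and an FFT library would be used for real data.

```python
import cmath
import math


def event_features(x, fs):
    """Six target features for one non-empty event x sampled at fs (Hz)."""
    n = len(x)
    # One-sided magnitude spectrum, bins 0..n//2 (assumption).
    mags = []
    for k in range(n // 2 + 1):
        xk = sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        mags.append(abs(xk))
    freqs = [k * fs / n for k in range(len(mags))]
    total_mag = sum(mags)
    return {
        "duration": n / fs,
        "max_amplitude": max(abs(s) for s in x),
        "rms_amplitude": math.sqrt(sum(s * s for s in x) / n),
        "energy": sum(s * s for s in x) / fs,  # sum of x^2 * dt, dt = 1/fs
        "dominant_frequency": freqs[mags.index(max(mags))],
        "spectral_centroid": (
            sum(f * m for f, m in zip(freqs, mags)) / total_mag
            if total_mag else 0.0
        ),
    }
```

For a pure tone, the dominant frequency and the spectral centroid coincide; they diverge as the event's spectrum becomes broader.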

[0133] Figure 8 illustrates the steps involved in verifying the target features. In one embodiment, as shown in Figure 8, step 160 includes:

[0134] Step 161: Set the typical features corresponding to the feature parameters of each target feature.

[0135] Step 162: Compare each target feature with its corresponding typical feature in turn to obtain multiple verification difference values.

[0136] Step 163: In response to a preset number of verification difference values being greater than a preset difference value, mark the corresponding candidate sweeping sound event as an abnormal sweeping sound event and filter it out to obtain the target sweeping sound event.

[0137] In this embodiment, the preset number can be set to any value from 1 to the total number of target features. First, for each target feature (such as total event duration, maximum energy amplitude within the event, root mean square value of the energy amplitude within the event, total event energy, dominant event frequency, and event spectral centroid), a typical feature parameter range or benchmark value corresponding to a typical sweeping sound is preset. Then, each target feature extracted from the candidate sweeping sound event is compared in turn with its typical feature, and multiple verification difference values are calculated. If the number of verification difference values exceeding the preset difference threshold reaches or exceeds the preset number (such as 1 or more; the preset number can be configured according to actual needs and is not specifically limited here), the candidate sweeping sound event is determined not to conform to the typical sweeping sound pattern, and it is marked as an abnormal sweeping sound event and filtered out.

[0138] The mechanism in this embodiment achieves joint discrimination based on multi-dimensional acoustic features, avoiding the risk of misjudgment caused by relying on a single feature. By introducing a configurable fault tolerance threshold (i.e., the preset number) and a multi-feature consistency verification strategy, interference events that resemble but differ from sweeping sounds are eliminated while real sweeping sounds are retained, improving the accuracy, robustness, and generalization ability of the detection system and supporting flexible adjustment of the discrimination strictness for different application scenarios.
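Steps 161 to 163 can be sketched in Python as follows. The sketch assumes benchmark values rather than ranges, uses the absolute difference as the verification difference value, and allows a separate difference threshold per feature because the features have different units; the application itself refers only to "a preset difference value". The names `is_typical`, `typical`, `tolerance`, and `preset_number` are hypothetical.

```python
def is_typical(features, typical, tolerance, preset_number=1):
    """Return True when the event passes multi-feature verification.

    features, typical, tolerance: dicts keyed by feature name.
    The event is rejected once `preset_number` or more features deviate
    from their typical value by more than the allowed tolerance.
    """
    violations = 0
    for name, reference in typical.items():
        difference = abs(features[name] - reference)
        if difference > tolerance[name]:
            violations += 1
    return violations < preset_number
```

Setting `preset_number` to 1 rejects an event on any single deviating feature, while a larger value tolerates some deviation before the event is marked abnormal.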

[0139] Figure 9 illustrates the steps of a method for processing output data. In one embodiment, as shown in Figure 9, the sound event recognition method also includes:

[0140] Step 910: Export the target sweeping sound event and its corresponding multiple target features to obtain output data.

[0141] Step 920: Convert the output data into multiple output files in various formats.

[0142] Step 930: Generate visual statistical charts based on the output data.

[0143] In this embodiment, the identified target sweeping sound events and their related target features are exported to form structured output data, enabling the data of sweeping sound events to be recorded and shared in a systematic manner. This output data is converted into various file formats, such as CSV tables and JSON files, to meet the compatibility and needs of different users or systems, allowing the data to be easily imported into various analysis tools, databases, or third-party applications for more in-depth research or processing.
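A minimal sketch of steps 910 and 920 using Python's standard `csv` and `json` modules follows. It assumes each event is a flat dictionary with identical keys and returns the two serializations as strings instead of writing files; the function name `export_events` is hypothetical.

```python
import csv
import io
import json


def export_events(events):
    """Serialize a list of event dicts to CSV text and JSON text."""
    buffer = io.StringIO()
    if events:
        # Column order follows the key order of the first event.
        writer = csv.DictWriter(buffer, fieldnames=list(events[0].keys()))
        writer.writeheader()
        writer.writerows(events)
    return buffer.getvalue(), json.dumps(events, indent=2)
```

The CSV form suits spreadsheets and tabular tools, while the JSON form preserves numeric types for downstream programs.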

[0144] Visualized statistical charts help understand the location and characteristics of each sweeping sound event, and by graphically displaying the statistical characteristics of each event (such as duration distribution, energy distribution, etc.), the efficiency and accuracy of data analysis are greatly improved. For example, marking the location of each sweeping sound event with different colors or shaded areas on the original audio waveform can quickly identify which parts are sweeping sounds and the distribution of target features throughout the audio. This embodiment not only simplifies the data export and sharing process, but also enhances the understanding and analysis capabilities of sweeping sound events by providing intuitive visualization tools. It provides strong support for both subsequent manual review and automated processing. In this embodiment, steps 920 and 930 can be executed independently or sequentially, and this application does not limit this.

[0145] Referring to Figures 10, 11, and 12, the sound of a fan is analyzed. Figure 10 is a schematic diagram of the original audio signal waveform. Figure 11 is the signal after noise reduction, in which multiple events are clearly visible. Figure 12 shows the energy distribution after separation.

[0146] Figure 13 is a table of target sweeping sound events and their target features provided in a specific embodiment. Through this table, the target feature data corresponding to a given event can be queried quickly and clearly. In Figure 13, the total duration of the event is abbreviated as duration, the maximum energy amplitude within the event is abbreviated as maximum amplitude, the root mean square value of the energy amplitude within the event is abbreviated as RMS amplitude, the total energy of the event is abbreviated as energy, the dominant frequency of the event is abbreviated as dominant frequency, and the centroid of the event spectrum is abbreviated as centroid of the spectrum.

[0147] Figure 14 is a schematic diagram of the system structure of an acoustic event recognition system. This application also provides an acoustic event recognition system. As shown in Figure 14, the system includes: a data receiving module 1401, a dynamic threshold calculation module 1402, and a sound event recognition module 1403.

[0148] The data receiving module 1401 is configured to: acquire audio data to be processed; the audio data to be processed contains at least one frame.

[0149] The dynamic threshold calculation module 1402 is communicatively connected to the data receiving module 1401. The dynamic threshold calculation module 1402 is configured to: determine the signal energy reference data of the current frame within a preset sliding window based on the audio data to be processed; the current frame is the audio segment at the current moment; and obtain the dynamic energy threshold corresponding to the current frame based on the signal energy reference data.

[0150] The sound event recognition module 1403 is communicatively connected to the dynamic threshold calculation module 1402. The sound event recognition module 1403 is configured to: perform a numerical threshold comparison based on the short-time energy of the current frame and the dynamic energy threshold corresponding to the current frame to identify the target sweeping sound event.

[0151] In one embodiment, the sound event recognition module 1403 is further configured to: mark the current frame as a candidate sweeping sound event in response to the duration for which the short-term energy of the current frame exceeds the dynamic energy threshold reaching a first preset duration; determine the time interval between two adjacent candidate sweeping sound events; merge adjacent candidate sweeping sound events with a time interval less than a preset interval threshold; mark candidate sweeping sound events with a duration exceeding a maximum preset duration as non-sweeping events and filter them out; extract multiple target features of the candidate sweeping sound events; sequentially verify each target feature with its corresponding typical feature, filter out abnormal sweeping sound events, and obtain the target sweeping sound event. The target features include at least one of the following: the total duration of the candidate sweeping sound event, the maximum energy amplitude of the candidate sweeping sound event, the root mean square value of the energy amplitude of the candidate sweeping sound event, the total energy of the candidate sweeping sound event, the dominant frequency of the candidate sweeping sound event, and the spectral centroid of the candidate sweeping sound event.

[0152] In one embodiment, the sound event recognition module 1403 is further configured to: set typical features corresponding to the feature parameters of each target feature; compare each target feature with the corresponding typical feature in turn to obtain multiple verification difference values; and, in response to a preset number of verification difference values being greater than a preset difference value, mark the corresponding candidate sweeping sound event as an abnormal sweeping sound event and filter it out to obtain the target sweeping sound event.

[0153] In one embodiment, as shown in Figure 14, the sound event recognition system also includes a data processing module 1400, which is communicatively connected to the data receiving module 1401. The data processing module 1400 is configured to: receive initial audio data; uniformly standardize the initial audio data to obtain a standardized signal; call a preset filter to filter the standardized signal to obtain an enhanced signal; and divide the enhanced signal into frames to obtain the audio data to be processed.

[0154] In one embodiment, the dynamic threshold calculation module 1402 is further configured to: set a preset sliding window based on a preset number of frames backward from the current frame; calculate the median and standard deviation of short-time energy within the preset sliding window to obtain signal energy reference data; set the correspondence between sensitivity coefficient and threshold; and obtain the dynamic energy threshold of the current frame based on the sensitivity coefficient, signal energy reference data, and threshold correspondence.

[0155] In one embodiment, as shown in Figure 14, the sound event recognition system also includes a data output module 1404, which is communicatively connected to the sound event recognition module 1403. The data output module 1404 is configured to: export the target sweeping sound event and its corresponding multiple target features to obtain output data; convert the output data into multiple output files in various formats; and generate visual statistical charts based on the output data.

[0156] In this embodiment, the audio data to be processed is received, and the signal energy reference data of the signals near the current frame is dynamically statistically analyzed based on a sliding window. A dynamic energy threshold that slides with the frame is then generated, effectively addressing background noise fluctuations and baseline drift. Furthermore, short-time energy is introduced for numerical threshold comparison, effectively suppressing false triggers caused by instantaneous impulse noise and improving detection accuracy. Ultimately, a high-confidence target sweeping sound event is obtained, achieving high-precision, low-false-alarm, and robust sweeping sound detection in complex acoustic environments. It is suitable for various sweeping sound recognition application scenarios, adapting to environmental changes and possessing high discrimination accuracy, balancing accuracy, efficiency, and ease of use.

[0157] This application improves the accuracy and robustness of sweeping sound detection by integrating an adaptive threshold mechanism, a multi-level event filtering strategy, and multi-dimensional acoustic feature verification. The dynamic energy threshold is generated in real time from the median and standard deviation of the signal energy within a sliding window, combined with a user-adjustable sensitivity coefficient, which addresses background noise fluctuations and signal baseline drift. Compared with fixed threshold methods, it reduces false alarm and false negative rates in complex and variable environments. Furthermore, this application can process multiple audio files in parallel, providing efficient batch audio analysis that meets the automatic detection needs of large-scale data in scenarios such as wind farms. In addition, the system outputs not only the time location of each sweeping sound event but also multi-dimensional feature vectors including duration, energy distribution, dominant frequency, and spectral centroid, supporting event pattern analysis, trend statistics, and anomaly diagnosis, and thereby enhancing the analytical value of the data.

[0158] The basic principles of this application have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this application are merely examples and not limitations, and should not be considered essential to every embodiment of this application. Furthermore, the specific details disclosed above are provided only for illustration and ease of understanding, and are not limitations. These details do not require that the application be implemented using the specific details described above.

[0159] The block diagrams of devices, apparatuses, equipment, and systems involved in this application are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to,” and are used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and / or,” and are used interchangeably with it unless the context clearly indicates otherwise. The term “such as” as used herein means “such as but not limited to,” and is used interchangeably with it.

[0160] It should also be noted that in the apparatus, equipment, and methods of this application, the components or steps can be disassembled and / or recombined. These disassemblies and / or recombinations should be considered as equivalent solutions of this application.

[0161] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects without departing from the scope of this application. Therefore, this application is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0162] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications or equivalent substitutions made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A sound event recognition method, characterized in that it includes: acquiring audio data to be processed, the audio data to be processed containing at least one frame; determining, based on the audio data to be processed, the signal energy reference data of the current frame within a preset sliding window, the current frame being the audio segment at the current moment; obtaining the dynamic energy threshold corresponding to the current frame based on the signal energy reference data; and performing a numerical threshold comparison based on the short-time energy of the current frame and the dynamic energy threshold corresponding to the current frame to identify a target sweeping sound event.

2. The sound event recognition method according to claim 1, characterized in that the step of performing a numerical threshold comparison based on the short-time energy of the current frame and the dynamic energy threshold corresponding to the current frame to identify the target sweeping sound event includes: if the duration for which the short-time energy of the current frame exceeds the dynamic energy threshold reaches a first preset duration, marking the current frame as a candidate sweeping sound event; extracting multiple target features from the candidate sweeping sound event; and verifying each target feature against its corresponding typical feature in turn to filter out abnormal sweeping sound events, thereby obtaining the target sweeping sound event.

3. The sound event recognition method according to claim 1, characterized in that acquiring the audio data to be processed includes: receiving initial audio data; standardizing the initial audio data to obtain a standardized signal; filtering the standardized signal with a preset filter to obtain an enhanced signal; and dividing the enhanced signal into frames to obtain the audio data to be processed.

4. The sound event recognition method according to claim 1, characterized in that the step of determining the signal energy reference data of the current frame within a preset sliding window based on the audio data to be processed includes: setting a preset sliding window by backtracking a preset number of frames from the current frame; and calculating the median and standard deviation of the short-time energy within the preset sliding window to obtain the signal energy reference data.

5. The sound event recognition method according to claim 1, characterized in that the step of obtaining the dynamic energy threshold corresponding to the current frame based on the signal energy reference data includes: setting the correspondence between the sensitivity coefficient and the threshold; and obtaining the dynamic energy threshold of the current frame based on the sensitivity coefficient, the signal energy reference data, and the threshold correspondence.

6. The sound event recognition method according to claim 2, characterized in that, before extracting multiple target features of the candidate sweeping sound events, the method further includes: determining the time interval between two adjacent candidate sweeping sound events; merging adjacent candidate sweeping sound events whose time interval is less than a preset interval threshold; and marking candidate sweeping sound events whose duration exceeds a maximum preset duration as non-sweeping events and filtering them out.

7. The sound event recognition method according to claim 2, characterized in that, The target features include at least one of the following: the total duration of the candidate sweeping sound event, the maximum energy amplitude of the candidate sweeping sound event, the root mean square value of the energy amplitude of the candidate sweeping sound event, the total energy of the candidate sweeping sound event, the dominant frequency of the candidate sweeping sound event, and the spectral centroid of the candidate sweeping sound event.

8. The sound event recognition method according to claim 2, characterized in that the step of verifying each target feature against its corresponding typical feature in turn, filtering out abnormal sweeping sound events, and obtaining the target sweeping sound event includes: setting the typical feature corresponding to the feature parameters of each target feature; comparing each target feature with its corresponding typical feature in turn to obtain multiple verification difference values; and, in response to a preset number of the verification difference values being greater than a preset difference value, marking the corresponding candidate sweeping sound event as an abnormal sweeping sound event and filtering it out to obtain the target sweeping sound event.

9. The sound event recognition method according to claim 1, characterized in that the method further includes: exporting the target sweeping sound event and its corresponding multiple target features to obtain output data; converting the output data into multiple output files in various formats; and generating visual statistical charts based on the output data.

10. A sound event recognition system, characterized in that it includes: a data receiving module configured to acquire audio data to be processed, the audio data to be processed containing at least one frame; a dynamic threshold calculation module communicatively connected to the data receiving module and configured to determine, based on the audio data to be processed, the signal energy reference data of the current frame within a preset sliding window, the current frame being the audio segment at the current moment, and to obtain the dynamic energy threshold corresponding to the current frame based on the signal energy reference data; and a sound event recognition module communicatively connected to the dynamic threshold calculation module and configured to compare the short-time energy of the current frame with the dynamic energy threshold corresponding to the current frame to identify the target sweeping sound event.