An audio anomaly detection method based on physical perception iterative coding
This audio anomaly detection method, based on physical perception iterative coding, decouples audio features into timbre and energy components, simulating the processing of the auditory system. It uses only normal samples for training, addresses unstable detection results in complex acoustic environments, and enables efficient, low-cost detection in industrial scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN ENG UNIV
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-26
AI Technical Summary
Existing audio anomaly detection methods adapt poorly to complex acoustic environments, and their performance degrades especially under domain shift. They also require a large number of labeled anomaly samples, resulting in unstable performance and high costs in practical industrial applications.
A physical perception-based iterative coding method is adopted to decouple audio features into timbre and energy. The processing of the auditory system is simulated through multiple rounds of iteration. Only normal samples are used for training, and anomaly detection is performed using a Gaussian mixture model or a feature distance.
The method maintains stable detection performance under domain shift, reduces reliance on anomalous samples, lowers data preparation costs, and improves the feasibility and practicality of industrial applications.
Description
Technical Field
[0001] This invention relates to the fields of audio signal processing and machine learning technology, and in particular to an audio anomaly detection method based on physical perception iterative coding.
Background Technology
[0002] Audio anomaly detection refers to the technology of identifying abnormal states by analyzing sound signals generated by machines, equipment, or the environment. It has significant application value in fields such as industrial fault diagnosis, equipment health monitoring, and intelligent security. Traditional methods mainly rely on signal processing techniques and shallow machine learning models, such as Gaussian mixture models and support vector machines. These methods work reasonably well in simple scenarios, but they adapt poorly to subtle anomalies and domain shifts in complex acoustic environments.
[0003] In recent years, deep learning-based audio anomaly detection methods have made significant progress. Autoencoders detect anomalies through reconstruction errors, but performance degrades when exposed to anomalous samples during training. Flow-based density estimation methods identify anomalies by modeling the distribution of normal samples, but are sensitive to assumptions about data distribution. Self-supervised classification methods treat different machine identities as classes during training and calculate anomaly scores from classification confidence, but lack stability under domain shift. Most of these existing methods treat audio signals as ordinary data and fail to fully consider the inherent physical characteristics of sound signals, such as timbre, energy, and resonance, and their crucial role in anomaly detection.
[0004] Especially in real-world industrial environments, changes in machine operating conditions can lead to domain shifts in acoustic features, causing models that perform well on the training set to significantly degrade on the test set. Existing methods have limited adaptability to these domain shifts and lack a deep understanding of the physical nature of audio, resulting in unstable detection performance in complex application scenarios. Furthermore, most deep learning methods require large amounts of labeled data for training, while in industrial practice, the scarcity of anomalous samples and the high cost of labeling further limit the practical effectiveness of existing technologies.
Summary of the Invention
[0005] To address the aforementioned issues, this invention proposes an audio anomaly detection method based on physical perception iterative coding. The method simulates the human auditory system's perception of sound, decoupling audio features into physical components such as timbre and energy. Feature representations are progressively refined through multiple iterations, thereby achieving more accurate detection of audio anomalies and, in particular, maintaining stable detection performance under domain shift. Furthermore, this method does not rely on anomalous samples; training can be completed using only normal samples, which addresses the practical problems of scarce anomalous samples and high annotation costs in industrial scenarios and makes the method better suited to real-world industrial applications.
[0006] The objective of this invention is achieved as follows. An audio anomaly detection method based on physical perception iterative coding comprises a feature extraction and fusion module, a physical perception iterative decoupling network module, and an anomaly scoring module. The feature extraction and fusion module extracts Mel spectral features and temporal Gram features from the original audio signal and fuses them into a multi-channel time-frequency feature map. The physical perception iterative decoupling network module decouples the multi-channel time-frequency feature map into timbre features and energy features through multiple rounds of iteration, simulating the physical formation process of audio signals. The anomaly scoring module calculates anomaly scores based on the timbre and energy features and determines anomalies using a Gaussian mixture model or a feature distance.
[0007] In the above audio anomaly detection method based on physical perception iterative coding, the feature extraction and fusion module comprises a Mel spectrum extraction unit, a temporal Gram extraction unit, and a first feature fusion unit. The Mel spectrum extraction unit extracts spectral features from the original audio signal using a short-time Fourier transform and a Mel filter bank. The temporal Gram extraction unit uses a one-dimensional convolutional network to extract temporal structure features from the original audio signal. The first feature fusion unit concatenates the Mel spectral features and the temporal Gram features along the channel dimension to form a multi-channel time-frequency feature map.
[0008] In the above audio anomaly detection method based on physical perception iterative coding, the physical perception iterative decoupling network module comprises a physical perception iterative encoder, multiple iterative decoupling blocks, and a second feature fusion unit. The physical perception iterative encoder initializes the features and controls the iterative process. Each iterative decoupling block includes a spectrum-to-timbre/energy decomposition unit and a timbre/energy-to-spectrum reconstruction unit. The second feature fusion unit fuses the timbre and energy features after multiple iterations.
[0009] Furthermore, the iterative decoupling block also includes a multi-band analysis unit, a timbre path, and an energy path. The multi-band analysis unit simulates the frequency band decomposition mechanism of the auditory system. The timbre path uses a channel attention mechanism to extract stable spectral envelope features. The energy path uses a spatial attention mechanism to extract time-varying energy pattern features.
[0010] In the above audio anomaly detection method based on physical perception iterative coding, the anomaly scoring module comprises a feature extraction unit, a Gaussian mixture model unit, and an anomaly score calculation unit. The feature extraction unit uses the trained model to extract features from the test samples. The Gaussian mixture model unit models the features of normal samples. The anomaly score calculation unit calculates anomaly scores based on the negative log-likelihood of the features under the Gaussian mixture model (GMM) or on the feature norm distance.
[0011] Beneficial Effects: This invention presents an audio anomaly detection method based on physical perception iterative coding, which includes a feature extraction and fusion module, a physical perception iterative decoupling network module, and an anomaly scoring module. The physical perception iterative decoupling network module simulates the physical perception mechanism of the auditory system through multi-band decomposition and iterative decoupling, decoupling audio features into timbre and energy components. It also breaks the frequency and time domain symmetry of spectral features, enhancing the extraction capability of essential audio features. It achieves significant physical perception feature decoupling in both timbre and energy dimensions, enabling the simultaneous extraction of stable timbre features and dynamic energy features. The model is lightweight, has a clear structure, and is suitable for scenarios such as industrial equipment fault diagnosis and environmental sound monitoring. Importantly, this method only requires audio samples under normal operating conditions for model training, eliminating the need to collect and label rare anomaly samples, significantly reducing data preparation costs and improving feasibility and practicality in actual industrial deployments.
Attached Figure Description
[0012] Figure 1 is a schematic diagram of the audio anomaly detection method based on physical perception iterative coding of the present invention.
[0013] Figure 2 is a schematic diagram of the physical perception iterative decoupling network.
[0014] Figure 3 is a detailed structural diagram of the iterative decoupling block.
[0015] Figure 4 is a structural diagram of the multi-band analysis module.
[0016] Figure 5 is a visualization of the timbre features, energy features, and timbre-energy features after the first iteration of decoupling.
[0017] Figure 6 is a flowchart of the anomaly scoring module.
[0018] Figure 7 is a t-SNE visualization of the feature maps extracted by the model.
[0019] Figure 8 is a flowchart of the steps of the audio anomaly detection method based on physical perception iterative coding of the present invention.
Detailed Implementation
[0020] The specific embodiments of the present invention will now be described in further detail with reference to the accompanying drawings.
Specific Implementation Method 1
[0022] In this specific implementation, the audio anomaly detection method based on physical perception iterative coding is described from the perspective of modules. As shown in Figure 1, it includes a feature extraction and fusion module, a physical perception iterative decoupling network module, and an anomaly scoring module. The feature extraction and fusion module extracts Mel spectral features and temporal Gram features from the original audio signal and fuses them into a multi-channel time-frequency feature map. The physical perception iterative decoupling network module decouples the multi-channel time-frequency feature map into timbre features and energy features through multiple rounds of iteration, simulating the physical formation process of audio signals. The anomaly scoring module calculates anomaly scores based on the timbre and energy features and determines anomalies using a Gaussian mixture model or a feature distance.
[0023] Specifically, the feature extraction and fusion module extracts Mel spectral features and temporal Gram features from the original audio signal and fuses them into a multi-channel time-frequency feature map as follows. The original audio signal is analyzed in time and frequency using a Mel spectrum converter to extract Mel spectral features. The Mel spectrum converter includes a short-time Fourier transform and a Mel filter bank, which convert the audio signal into a log-Mel spectrum. Temporal Gram features are extracted from the original audio signal using a temporal Gram network. The temporal Gram network employs one-dimensional convolutional layers and layer normalization operations to capture long-term temporal dependencies in the audio waveform. The Mel spectral features and the temporal Gram features are concatenated along the channel dimension to form a multi-channel time-frequency feature map, which serves as the input to the physical perception iterative decoupling network module.
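The log-Mel conversion described in this paragraph can be illustrated with a minimal NumPy sketch. The sampling rate, FFT size, hop length, number of Mel bands, Hann window, and the particular Mel-scale formula are all assumptions chosen for illustration; the specification does not state them.

```python
import numpy as np

def hz_to_mel(f):
    # One common Mel-scale formula (assumed; the specification names none).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=1024, hop=512, n_mels=128, eps=1e-10):
    """Short-time Fourier transform -> Mel filter bank -> log compression.

    Returns an array of shape (n_mels, n_frames).
    """
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < n_fft:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[t * hop: t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T     # (n_frames, n_mels)
    return np.log(mel + eps).T
```

For a one-second signal at 16 kHz, these assumed settings yield a 128 x 30 map; a production system would more likely rely on an audio library than on this hand-written version.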
[0024] Specifically, the physical perception iterative decoupling network module decouples the multi-channel time-frequency feature map into timbre and energy features through multiple iterations, simulating the physical formation process of audio signals, as follows. The initial features of the multi-channel time-frequency feature map are extracted using a Physically Aware Iterative Encoder (PAIE), which includes an initial convolutional layer and multiple iterative decoupling blocks. The initial features are iteratively decomposed by these iterative decoupling blocks. Each iterative decoupling block includes a spectrum-to-timbre/energy decomposition module and a timbre/energy-to-spectrum reconstruction module. In the spectrum-to-timbre/energy decomposition module, multi-band analysis is used to simulate the frequency band processing of the auditory system, and timbre features and energy features are extracted in parallel. A channel attention mechanism focuses the timbre features on a stable spectral envelope, and a spatial attention mechanism focuses the energy features on a time-varying energy pattern. In the timbre/energy-to-spectrum reconstruction module, timbre features and energy features are modulated and synthesized based on the audio formation model, and then fused with the input features through residual connections to achieve feature reconstruction. After multiple iterations, the decoupled timbre and energy features are averaged and fused to output an enhanced time-frequency feature vector.
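The control flow of the iteration described above (decompose, modulate, add a residual, then average across rounds) can be sketched as follows. This is only an illustration under strong assumptions: the learned decomposition and reconstruction layers are replaced by simple averages and an element-wise product, and the function names and the number of iterations are hypothetical.

```python
import numpy as np

def decompose(x):
    """Stand-in for the learned decomposition of a (C, F, T) feature map.

    timbre: time-averaged spectral envelope, shape (C, F, 1)
    energy: frequency-averaged temporal pattern, shape (C, 1, T)
    """
    timbre = x.mean(axis=2, keepdims=True)
    energy = x.mean(axis=1, keepdims=True)
    return timbre, energy

def reconstruct(timbre, energy):
    """Stand-in modulation: the envelope scaled by the energy pattern, shape (C, F, T)."""
    return timbre * energy

def iterative_decoupling(x, n_iters=3):
    """Run several decouple/reconstruct rounds and average the per-round features."""
    timbres, energies = [], []
    for _ in range(n_iters):
        timbre, energy = decompose(x)
        x = x + reconstruct(timbre, energy)   # residual connection with the block input
        timbres.append(timbre)
        energies.append(energy)
    timbre_avg = np.mean(timbres, axis=0)
    energy_avg = np.mean(energies, axis=0)
    # Fuse the two averaged components into a single feature vector.
    fused = np.concatenate([timbre_avg.reshape(-1), energy_avg.reshape(-1)])
    return timbre_avg, energy_avg, fused
```

With a (2, 8, 5) input, the fused vector has 2*8 + 2*5 = 26 entries. In the described network each stand-in would be a trainable module, so only the loop structure here carries over.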
[0025] Specifically, the anomaly scoring module calculates anomaly scores based on the decoupled features and uses a Gaussian mixture model or a feature distance for anomaly determination. During the training phase, the physical perception iterative decoupling network module is trained using normal audio samples, and the feature vectors of the normal samples are extracted. A Gaussian mixture model is then fitted to these feature vectors to establish the probability distribution of normal features. It is worth noting that the training process of this invention does not rely on anomalous samples at all. During the training phase, only audio samples collected from designated devices under normal operating conditions (the training set contains only samples labeled "normal") are used, and the model is trained through supervised classification. The classification task distinguishes different device types or operating conditions (such as different machine IDs or RPM settings), rather than directly identifying anomalies. This design allows the model to learn stable physical feature representations of normal sound (such as timbre envelope and energy temporal patterns) without any anomalous samples participating in the training.
[0026] During the testing phase, the feature vector of the audio sample to be detected is extracted and its distance from the normal feature distribution is calculated. That is, whether the sample is abnormal is determined by measuring how far it deviates from the learned normal feature space. The anomaly score is calculated by one of the following methods: using the Euclidean distance from the feature vector to the origin as the anomaly score, or using the negative log-likelihood under the Gaussian mixture model as the anomaly score. Whether an audio sample is abnormal is determined by comparing the anomaly score with a preset threshold, where the threshold is determined through performance optimization on the validation set.
Specific Implementation Method 2
[0028] In this specific implementation, the audio anomaly detection method based on physical perception iterative coding is described from the perspective of steps. As shown in Figures 2 to 8, it includes the following steps:
Step S1: As shown in Figure 1, the design includes a feature extraction and fusion module, a physical perception iterative decoupling network module, and an anomaly scoring module.
Step S2: The feature extraction and fusion module first performs standardized preprocessing on the input audio signal, including sampling rate unification and length normalization. Then, a dual-path feature extraction strategy is adopted: on the one hand, frequency domain features are extracted through the Mel spectrum transform to capture the spectral distribution characteristics of the sound; on the other hand, temporal structural features are extracted through a temporal Gram network to capture the temporal pattern of the sound signal.
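The preprocessing in Step S2 (sampling rate unification and length normalization) could look roughly like the sketch below. The target rate of 16 kHz, the 10-second clip length, zero-padding, and the use of plain linear interpolation without an anti-aliasing filter are assumptions made for brevity; none of them is stated in the specification.

```python
import numpy as np

def resample_linear(wave, sr_in, sr_out):
    """Naive resampling by linear interpolation (no anti-aliasing filter)."""
    wave = np.asarray(wave, dtype=np.float64)
    if sr_in == sr_out:
        return wave
    n_out = int(round(len(wave) * sr_out / sr_in))
    t_in = np.arange(len(wave)) / sr_in     # timestamps of the input samples
    t_out = np.arange(n_out) / sr_out       # timestamps of the output samples
    return np.interp(t_out, t_in, wave)

def fix_length(wave, n_samples):
    """Crop long clips and zero-pad short ones to exactly n_samples."""
    if len(wave) >= n_samples:
        return wave[:n_samples]
    return np.pad(wave, (0, n_samples - len(wave)))

def preprocess(wave, sr_in, sr_out=16000, seconds=10.0):
    """Unify the sampling rate, then normalize the clip length."""
    return fix_length(resample_linear(wave, sr_in, sr_out), int(sr_out * seconds))
```

For example, `preprocess(clip, 44100)` would return a 160000-sample array regardless of the original clip duration. A real pipeline would normally use a band-limited resampler.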
[0029] Mel spectrum extraction is based on short-time Fourier transform, using a Mel filter bank to simulate the characteristics of human hearing, and logarithmic compression is applied to the results to enhance feature representation. Temporal Gram feature extraction employs a deep one-dimensional convolutional network, extracting discriminative temporal features through multi-level convolution and normalization operations.
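A single layer of the temporal path (one-dimensional convolution followed by normalization) can be sketched in NumPy as below. The kernel size, stride, channel count, random untrained kernels, and the choice to normalize each time frame across channels are illustrative assumptions; the described network is deep and trained, whereas this shows one layer only.

```python
import numpy as np

def conv1d(wave, kernels, stride):
    """Valid 1-D convolution (cross-correlation form).

    wave: (N,) waveform; kernels: (C, K); returns (C, T).
    """
    _, k = kernels.shape
    n_frames = 1 + (len(wave) - k) // stride
    # Index matrix that gathers every strided window of length K.
    idx = np.arange(k)[None, :] + stride * np.arange(n_frames)[:, None]   # (T, K)
    frames = wave[idx]                                                    # (T, K)
    return kernels @ frames.T                                             # (C, T)

def layer_norm(x, eps=1e-5):
    """Normalize each time frame across the channel axis."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def temporal_features(wave, n_channels=128, kernel=1024, stride=512, seed=0):
    """One conv + normalization layer with random (untrained) kernels."""
    rng = np.random.default_rng(seed)
    kernels = rng.standard_normal((n_channels, kernel)) / np.sqrt(kernel)
    return layer_norm(conv1d(np.asarray(wave, dtype=np.float64), kernels, stride))
```

The kernel size and stride are picked here so that the output has the same 128 x 30 shape as a log-Mel map computed with a 1024-sample window and 512-sample hop on a one-second 16 kHz clip, which is what makes the later channel-wise fusion possible.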
[0030] The feature fusion stage employs channel-wise concatenation, fusing Mel spectral features and temporal Gram features along the channel dimension to form a comprehensive feature representation that incorporates both time and frequency information. This fusion strategy preserves the fine structure of the frequency domain while incorporating dynamic changes in the time domain, providing a rich information foundation for subsequent processing.
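As a small illustration of the fusion step, the sketch below concatenates two single-channel maps along a leading channel axis. It assumes both feature maps have already been brought to the same (F, T) shape; the function name is hypothetical.

```python
import numpy as np

def fuse_features(log_mel, temporal):
    """Concatenate two (F, T) maps along a new leading channel axis -> (2, F, T)."""
    log_mel = np.asarray(log_mel)
    temporal = np.asarray(temporal)
    if log_mel.shape != temporal.shape:
        raise ValueError(f"shape mismatch: {log_mel.shape} vs {temporal.shape}")
    # Give each map an explicit channel axis, then join on that axis.
    return np.concatenate([log_mel[None], temporal[None]], axis=0)
```

Channel 0 then carries the spectral view and channel 1 the temporal view, so later layers can weigh the two views per channel.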
[0031] Step S3: Based on the results of the feature extraction and fusion module, the physical perception iterative decoupling network module performs the next operation. The physical perception iterative decoupling network is the core innovative module of this invention. As shown in Figure 2, its design is inspired by the physical formation mechanism of audio signals. The network simulates the hierarchical processing of sound signals by the auditory system through a multi-round iterative process.
[0032] The network first initializes and encodes the input features, and then enters the iterative decoupling loop shown in Figure 3. Each iteration block contains two key components: a feature decomposition module and a feature reconstruction module. In the feature decomposition module, the input spectral features are first preprocessed by the multi-band analysis module shown in Figure 4 and then decoupled into timbre features and energy features, corresponding to the steady-state characteristics and dynamic changes of sound, respectively. The timbre features capture the spectral envelope and timbre characteristics of the sound, while the energy features focus on the intensity changes and temporal patterns of the sound.
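One way to picture the decomposition module is the NumPy sketch below: the frequency axis is split into bands, a squeeze-and-excitation style channel gate feeds the timbre path, and a per-position gate feeds the energy path. The gate formulas, the sharing of weights across bands, the random weight matrices `w1` and `w2`, and the final averaging are all assumptions for illustration; the specification only names the attention types.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def split_bands(x, n_bands):
    """Split a (C, F, T) map into contiguous frequency bands."""
    return np.array_split(x, n_bands, axis=1)

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style gate: one weight in (0, 1) per channel."""
    squeezed = x.mean(axis=(1, 2))                          # (C,)
    gate = sigmoid(w2 @ np.maximum(0.0, w1 @ squeezed))     # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """One weight in (0, 1) per (F, T) position, from channel-wise mean and max."""
    gate = sigmoid(x.mean(axis=0) + x.max(axis=0))          # (F, T)
    return x * gate[None]

def decompose(x, n_bands, w1, w2):
    """Return timbre (C, F) and energy (n_bands, C, T) for a (C, F, T) input."""
    timbre, energy = [], []
    for band in split_bands(x, n_bands):
        # Timbre path: channel-gated band averaged over time (stable envelope).
        timbre.append(channel_attention(band, w1, w2).mean(axis=2))
        # Energy path: position-gated band averaged over frequency (temporal pattern).
        energy.append(spatial_attention(band).mean(axis=1))
    return np.concatenate(timbre, axis=1), np.stack(energy, axis=0)
```

Averaging over time in the timbre path and over frequency in the energy path is a simple way to make the two outputs reflect steady-state and time-varying behavior respectively; a trained network would learn these projections instead.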
[0033] In the feature reconstruction stage, based on the physical principles of audio formation, timbre and energy features are reconstructed into spectral features through modulation and synthesis, and residual connections are established with the original input. This iterative decoupling-reconstruction mechanism enables the network to gradually improve feature representation and enhance its sensitivity to anomalous features.
[0034] Figure 5 provides a visual representation of the features after the first iteration of decoupling. The decoupling enhances deep features and improves sensitivity to anomalous features.
[0035] The network ultimately aggregates the timbre and energy features generated in each iteration through multi-round feature fusion, and outputs a deep feature representation with strong discriminative power.
[0036] Step S4: Based on the results obtained by the physical perception iterative decoupling network module, the anomaly scoring module performs the next operation.
[0037] The anomaly scoring module adopts a dual-mode design. As shown in Figure 6, it supports both distance-based and probability density-based scoring. During the training phase, the model optimizes its feature extraction capabilities through supervised learning, combining classification loss and feature discrimination loss to guide network training.
[0038] Distance-based scoring methods measure the degree of anomaly by calculating the distance between the features of test samples and the feature distribution of normal samples. Normal samples typically form compact clusters in the feature space, while anomalous samples deviate from this distribution region.
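A minimal sketch of distance-based scoring follows. It uses the Euclidean norm of the feature vector, which the first implementation names as one scoring option, and picks a threshold on a labeled validation set. The balanced-accuracy criterion and the function names are assumptions, since the specification says only that the threshold is tuned on the validation set.

```python
import numpy as np

def norm_score(features):
    """Anomaly score = Euclidean distance from each feature vector to the origin."""
    return np.linalg.norm(np.asarray(features, dtype=np.float64), axis=-1)

def pick_threshold(val_scores, val_labels):
    """Pick the candidate threshold with the best balanced accuracy.

    val_labels: 1 for anomalous, 0 for normal (validation data only).
    """
    val_scores = np.asarray(val_scores, dtype=np.float64)
    val_labels = np.asarray(val_labels)
    best_t, best_acc = None, -1.0
    for t in np.unique(val_scores):
        pred = val_scores >= t
        tpr = pred[val_labels == 1].mean()       # anomalies correctly flagged
        tnr = (~pred)[val_labels == 0].mean()    # normals correctly passed
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t

def is_anomalous(scores, threshold):
    """Flag samples whose score reaches the threshold."""
    return np.asarray(scores) >= threshold
```

Note that only threshold selection touches labeled anomalies here; the feature extractor itself is still trained on normal samples alone, as the method requires.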
[0039] Probability density estimation-based methods model the feature distribution of normal samples and calculate the likelihood of test samples under that distribution. Abnormal samples, because their feature distribution does not conform to the normal pattern, will receive a lower likelihood score.
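The density-based score can be written out as the negative log-likelihood under a Gaussian mixture. The sketch below assumes diagonal covariances and takes already-fitted mixture parameters as inputs; the covariance structure, the number of components, and the fitting procedure are not specified in the source and are left out.

```python
import numpy as np

def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood of each row of x under a diagonal-covariance GMM.

    x: (N, D) features; weights: (K,); means: (K, D); variances: (K, D).
    """
    x = np.atleast_2d(np.asarray(x, dtype=np.float64))
    weights = np.asarray(weights, dtype=np.float64)
    means = np.asarray(means, dtype=np.float64)
    variances = np.asarray(variances, dtype=np.float64)
    diff = x[:, None, :] - means[None, :, :]                          # (N, K, D)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)         # (N, K)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)         # (K,)
    log_joint = -0.5 * (mahal + log_det[None, :]) + np.log(weights)[None, :]
    # Log-sum-exp over components, shifted by the row maximum for stability.
    m = log_joint.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
    return -log_lik
```

A lower likelihood maps to a higher score, so the same "score above threshold means anomalous" rule applies as for the distance-based mode.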
[0040] In practical applications, the two scoring mechanisms can be used individually or in combination, and the detection performance is comprehensively evaluated by calculating indicators such as the area under the ROC curve. This method can effectively distinguish between normal and abnormal sound patterns and maintain stable detection capabilities even in complex acoustic environments.
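The area under the ROC curve mentioned above can be computed directly from scores and labels through its rank interpretation: the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one. The following is a small sketch of that computation; the function name is illustrative.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    labels: 1 for anomalous, 0 for normal. Tied pairs count as half.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one anomalous and one normal sample")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

This pairwise form needs memory proportional to the product of the two class sizes, which is fine for small test sets; a sort-based implementation is preferable for large ones.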
[0041] To verify the effectiveness of the model, we simulated the sound of running factory machines using several types of fans, each with multiple speeds. We recorded the sound of the rotating fan motor as the source of the dataset. To create abnormal samples, we simulated fan malfunctions using methods including, but not limited to, attaching small pieces of paper to the fan blades. Finally, we created a training set containing only normal samples and a test set containing both normal and abnormal samples.
[0042] Figure 7 is a t-SNE diagram that visualizes the feature maps extracted by STgram-MFN. Panel (a) shows the features extracted by STgram-MFN without PAIE, and panel (b) shows the features extracted by STgram-MFN with PAIE. In the t-SNE diagram, ○ represents the features of normal samples in the training set, △ represents the features of normal samples in the test set, and × represents the features of abnormal samples in the test set. The visual comparison shows that the model with PAIE (panel b) separates normal and abnormal samples more clearly than the model without PAIE (panel a). This result indicates that the Physically Aware Iterative Encoding (PAIE) mechanism proposed in this invention can decouple and extract stable physical features that are strongly correlated with device health status in audio signals, thereby separating normal and abnormal samples in the feature space and supporting the effectiveness of the model.
[0043] It should be noted that the above are merely specific embodiments of this application and are not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A method for audio anomaly detection based on physical perception iterative coding, characterized in that it comprises: a feature extraction and fusion module, a physical perception iterative decoupling network module, and an anomaly scoring module; the feature extraction and fusion module is used to extract Mel spectral features and temporal Gram features from the original audio signal, and to fuse the Mel spectral features and temporal Gram features into a multi-channel time-frequency feature map; the physical perception iterative decoupling network module decouples the multi-channel time-frequency feature map into timbre features and energy features through multiple rounds of iteration, simulating the physical formation process of audio signals; the anomaly scoring module calculates anomaly scores based on the timbre and energy features, and uses a Gaussian mixture model or a feature distance to determine anomalies.
2. The audio anomaly detection method based on physical perception iterative coding as described in claim 1, characterized in that the feature extraction and fusion module includes: a Mel spectrum extraction unit, a temporal Gram extraction unit, and a first feature fusion unit; the Mel spectrum extraction unit extracts spectral features from the original audio signal based on a short-time Fourier transform and a Mel filter bank; the temporal Gram extraction unit uses a one-dimensional convolutional network to extract temporal structure features from the original audio signal; the first feature fusion unit concatenates the Mel spectral features and the temporal Gram features along the channel dimension to form a multi-channel time-frequency feature map.
3. The audio anomaly detection method based on physical perception iterative coding as described in claim 1, characterized in that the physical perception iterative decoupling network module includes: a physical perception iterative encoder, multiple iterative decoupling blocks, and a second feature fusion unit; the physical perception iterative encoder is used to initialize features and control the iterative process; each iterative decoupling block includes a spectrum-to-timbre/energy decomposition unit and a timbre/energy-to-spectrum reconstruction unit; the second feature fusion unit is used to fuse timbre and energy features after multiple iterations.
4. The audio anomaly detection method based on physical perception iterative coding as described in claim 3, characterized in that the iterative decoupling block also includes: a multi-band analysis unit, a timbre path, and an energy path; the multi-band analysis unit simulates the frequency band decomposition mechanism of the auditory system; the timbre path uses a channel attention mechanism to extract stable spectral envelope features; the energy path uses a spatial attention mechanism to extract time-varying energy pattern features.
5. The audio anomaly detection method based on physical perception iterative coding as described in claim 1, characterized in that the anomaly scoring module includes: a feature extraction unit, a Gaussian mixture model unit, and an anomaly score calculation unit; the feature extraction unit uses the trained model to extract features from the test samples; the Gaussian mixture model unit models the features of normal samples; the anomaly score calculation unit calculates anomaly scores based on the negative log-likelihood of the features under the Gaussian mixture model (GMM) or on the feature norm distance.