Method, medium, and device for industrial machinery and equipment sound recognition based on Mel-spectrum and time-domain fusion
By fusing Mel-spectrum and time-domain features using a short-time Fourier transform, Mel filter bank filtering, and deep neural networks, the invention addresses the lack of effective fusion of the sound features of industrial machinery and equipment, thereby improving fault diagnosis accuracy and the ability to identify equipment status.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2023-10-30
- Publication Date
- 2026-06-30
Figure CN117457027B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of industrial machinery and equipment sound recognition technology, and more specifically, to a method, medium, and device for industrial machinery and equipment sound recognition based on Mel spectrum time-domain fusion.
Background Technology
[0002] The normal operation of industrial machinery and equipment is crucial for industrial production. Identifying and recognizing the sounds of the equipment can help detect potential defects in a timely manner and prevent them from escalating into serious production accidents and economic losses.
[0003] Existing methods for identifying the condition of industrial machinery and equipment mainly utilize neural networks to classify vibration or sound signals, lacking the fusion of Mel-frequency spectrum and time-domain features of the equipment's sound. For example, Chinese invention patent application "A Fault Diagnosis Method for Wind Turbine Bearings Based on Time-Frequency Domain Convolutional Networks and Deep Forests" (Publication No.: CN114964780A) acquires fault data on vibration, operating conditions, speed, and load, performs feature extraction based on time-frequency domain convolutional networks, and completes fault diagnosis through a two-layer deep forest model. Another Chinese invention patent application, "A Fault Diagnosis Method for Rolling Bearings Based on Time-Frequency Domain Multidimensional Vibration Feature Fusion" (Publication No.: CN104655423A), first performs wavelet denoising on the vibration signal, obtains time-domain feature parameters using feature extraction, obtains the energy matrix using wavelet packet decomposition and energy moment calculation, synthesizes a multidimensional feature matrix, and determines the bearing condition based on the index distance. Chinese invention patent application "A Method for Diagnosing Latent Defects in Transformers Based on Sound Monitoring" (Publication No.: CN113253156A) first determines the installation location of the sound sensor based on noise attenuation laws, and then judges whether the transformer has latent defects based on characteristic frequencies and defect evaluation indicators. Chinese invention patent application "A Method and System for Diagnosing Transformer Faults Using Sound Feature Coding" (Publication No.: CN114527410A) judges transformer fault results based on sound feature coding rules and combinations. 
Chinese invention patent application "A Digital Evaluation Method and System for High-Voltage Bushings Based on Time-Frequency Domain Feature Fusion" (Publication No.: CN115015684A) uses the analysis results of time-domain or frequency-domain evaluation units in different sampling intervals to determine the state of the high-voltage bushing. None of the above inventions involve fusing the Mel-frequency spectrum and time-domain features of industrial machinery and equipment sounds. Because a single Mel-frequency spectrum lacks sufficient fault-related information, and a single time-domain feature lacks prior knowledge, the above methods may lead to a decrease in fault diagnosis accuracy.
Summary of the Invention
[0004] To overcome the shortcomings and deficiencies in the existing technology, the purpose of this invention is to provide a method, medium, and device for industrial machine sound recognition based on Mel spectrum and time domain fusion. This method integrates Mel spectrum and time domain features, making full use of the full-frequency information of industrial machine sound, which can improve the accuracy of recognition results under different states of industrial machine sound, and is beneficial for effective diagnosis of the operating status of industrial machine.
[0005] To achieve the above objectives, the present invention is implemented through the following technical solution: a method for industrial machine equipment sound recognition based on Mel-spectrum time-domain fusion, comprising the following steps:
[0006] Step S1: Collect audio signals from industrial machinery and equipment;
[0007] Step S2: Input the audio signal into the frequency domain and time domain primary feature extraction module; the frequency domain and time domain primary feature extraction module performs short-time Fourier transform and Mel filter bank filtering on the audio signal to obtain the Mel spectrum; the audio signal is subjected to one-dimensional convolution and multiple convolution-normalization-activation modules to obtain time domain features; then the feature maps of the Mel spectrum and time domain features are concatenated at the channel level to obtain primary features;
[0008] Step S3: Use a deep neural network to perform feature transformation on the primary features, fusing the Mel spectrum and time-domain features to obtain the high-level feature embedding code;
[0009] Step S4: Input the high-level feature embedding code into the classifier; based on the output of the classifier, obtain the recognition result of the sound of industrial machinery and equipment.
[0010] Preferably, in step S2, the short-time Fourier transform includes framing, windowing, and discrete Fourier transform;
[0011] The framing refers to cutting the audio signal into frames of fixed length according to a set time length; there is an overlapping area between adjacent frames to ensure the continuity of the audio signal in time.
[0012] The windowing refers to performing window function processing on each frame of the audio signal;
[0013] The short-time Fourier transform S(m,k) is:
[0014] $$S(m,k)=\sum_{n=0}^{N-1} x(n+mH)\,w(n)\,e^{-i\,2\pi k n/N}$$
[0015] Where N is the frame length of the audio signal; H is the frame shift; m is the index of the current window; k is the frequency index, an integer from 0 to N-1; n is the sample index within the frame, an integer from 0 to N-1; x is the audio signal sequence; i is the imaginary unit; and w(n) is the window function.
[0016] Preferably, in the windowing process, the window function is a Hamming window function w(n):
[0017] $$w(n)=(1-\alpha)-\alpha\cos\left(\frac{2\pi n}{N-1}\right),\quad 0\le n\le N-1$$
[0018] Where n is the sample index within the current frame of the audio signal; N is the frame length of the audio signal; and α is a constant, with α = 0.46 for the Hamming window.
[0019] Preferably, in step S2, Mel filter bank filtering refers to a spectrum analysis method that filters the frequency domain signal of an audio signal through a set of triangular filter banks designed with Mel scale.
[0020] The frequency response formula for a Mel filter is:
[0021] $$H_M(k)=\begin{cases}0, & k<f(M-1)\\[4pt] \dfrac{k-f(M-1)}{f(M)-f(M-1)}, & f(M-1)\le k\le f(M)\\[8pt] \dfrac{f(M+1)-k}{f(M+1)-f(M)}, & f(M)<k\le f(M+1)\\[4pt] 0, & k>f(M+1)\end{cases}$$
[0022] Where H_M(k) is the frequency response of the Mel filter; k is the frequency value; M denotes the M-th Mel filter; and f(M) is the center frequency of the M-th Mel filter.
[0023] Preferably, the frequency domain and time domain primary feature extraction module, deep neural network, and classifier refer to the trained frequency domain and time domain primary feature extraction module, deep neural network, and classifier;
[0024] The training method is as follows: audio signal samples of industrial machinery and equipment are collected; the audio signal samples are preprocessed to obtain a sample set; the sample set includes audio signal samples and corresponding labels;
[0025] The audio signal samples in the sample set are processed in steps S2 and S3 to obtain the high-level feature embedding code.
[0026] First, the high-level feature embedding code is input into the classifier for forward propagation. After obtaining the output, the loss is obtained by using the label and the loss function Loss with strong discriminative ability. Then, the loss is backpropagated to obtain the gradients of the frequency domain and time domain primary feature extraction module, deep neural network and classifier. The gradient descent update algorithm is used to update the weight parameters of the frequency domain and time domain primary feature extraction module, deep neural network and classifier, thereby realizing training.
[0027] Preferably, the preprocessing includes labeling, downsampling, and segmentation;
[0028] The labeling refers to: based on the operating condition record table of the industrial machinery and equipment, labeling each audio signal sample with the name of the corresponding time period in that table.
[0029] The downsampling refers to: downsampling an audio signal sample to a set frequency;
[0030] The term "segmentation" refers to dividing each downsampled audio signal sample into several short-duration audio signals.
[0031] Preferably, the loss function Loss is AAM-Softmax:
[0032] $$Loss=-\log\frac{e^{s\cos(\theta_{y_i}+r)}}{e^{s\cos(\theta_{y_i}+r)}+\sum_{j=1,\,j\ne y_i}^{D}e^{s\cos\theta_j}}$$
[0033] Where D is the dimension of the model output probability vector; y_i is the true label of the i-th audio signal sample; θ_j and θ_{y_i} are angles in the angle vector obtained by transforming the normalized high-level feature embedding code through a weight matrix and taking the arccosine; s is the scaling factor; and r is the angular margin (angle interval) in the angle space obtained by the arccosine transform.
[0034] Preferably, the gradient descent update algorithm is used to update the weight parameters of the frequency domain and time domain primary feature extraction module, the deep neural network, and the classifier. The weight parameter update formula is as follows:
[0035] $$g_t=\nabla_{\theta}f_t(\theta_{t-1})$$
[0036] $$b_t=\mu\,b_{t-1}+(1-\tau)\,g_t$$
[0037] $$g_t=b_t$$
[0038] $$\theta_t=\theta_{t-1}-\gamma\,g_t$$
[0039] Where θ_{t-1} is the weight parameter at step t-1; g_t is the gradient of the step-t objective function f_t with respect to θ_{t-1}; θ_t is the weight parameter at step t; b_t is the momentum at the t-th iteration; μ is the momentum coefficient; τ is the momentum damping; ∇ is the gradient operator; and γ is the update step size.
[0040] A readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the aforementioned Mel-spectrum time-domain fusion method for industrial machine equipment sound recognition.
[0041] A computer device includes a processor and a memory for storing a processor-executable program. When the processor executes the program stored in the memory, it implements the aforementioned Mel-spectrum time-domain fusion method for industrial machine sound recognition.
[0042] Compared with the prior art, the present invention has the following advantages and beneficial effects:
[0043] This invention proposes a method for identifying industrial machine sounds by combining Mel spectrum and temporal features. The Mel spectrum is obtained by performing a short-time Fourier transform and Mel filter bank filtering on the industrial machine sound, which fully simulates human hearing's perception of pitch and loudness. The temporal features are obtained by performing one-dimensional convolution and multiple convolution-normalization-activation modules on the industrial machine sound, making full use of the full-frequency information of the industrial machine sound. By combining the Mel spectrum and temporal features, it is beneficial to accurately identify the industrial machine sound under different conditions.
[0044] The identification results obtained by the method of the present invention can be used to diagnose the operating status of industrial machinery and equipment, promptly detect potential defects, and prevent further development that could lead to serious production accidents and economic losses.
Attached Figure Description
[0045] Figure 1 is a flowchart of the industrial machine sound recognition method based on Mel spectrum time-domain fusion of the present invention.
Detailed Implementation
[0046] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.
[0047] Example 1
[0048] This embodiment presents a method for identifying industrial machinery sounds using Mel-spectrum time-domain fusion, aiming to accurately identify the sounds of industrial machinery under different conditions. Mel-spectrum analysis is designed to simulate human hearing's perception of pitch and loudness, focusing on the low-frequency components of sound while neglecting the high-frequency components. Methods relying solely on Mel-spectrum analysis fail to adequately consider the high-frequency components of industrial machinery sounds, thus affecting identification accuracy.
[0049] This invention makes full use of both the full-frequency information of industrial machinery and equipment sound and human hearing's perception of pitch and loudness: it fuses the Mel spectrum with time-domain features and then uses a classification method to identify the sound of industrial machinery and equipment. The specific equipment or scenarios include: water turbine room, main transformer high-voltage side, main transformer low-voltage side, GCB spring energy storage, wind tunnel outer wall, tailrace pipe inlet, volute inlet, unit brake control cabinet, pressurized water tank, leakage drainage pump, governor oil pump, water guide oil pump, thrust and lower guide bearing oil pump, high-pressure oil injection pump, technical water supply pump, and turbine return water exhaust hydraulic valve. The specific sounds include: no sound, normal sound, abnormal sound, no action, normal action sound, abnormal action sound, no exhaust, normal exhaust sound, abnormal exhaust sound, no air leakage sound, and air leakage sound.
[0050] Its process is shown in Figure 1 and includes the following steps:
[0051] Step S1: Collect audio signals from industrial machinery and equipment.
[0052] Step S2: Input the audio signal into the frequency domain and time domain primary feature extraction module; the frequency domain and time domain primary feature extraction module performs short-time Fourier transform and Mel filter bank filtering on the audio signal to obtain the Mel spectrum.
[0053] The short-time Fourier transform includes framing, windowing, and discrete Fourier transform.
[0054] The framing refers to cutting the audio signal into frames of fixed length according to a set time length; a frame is a short segment of the signal. In this invention, the frame length is 25 ms. Adjacent frames overlap to ensure the continuity of the audio signal in time and to avoid abrupt changes at frame boundaries. The time interval between the starts of two adjacent frames is called the frame shift. In this invention, the frame shift is 10 ms.
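The framing step can be illustrated with a minimal sketch. It uses only the Python standard library, and the 16 kHz sampling rate is an assumption for illustration (the text does not state one); the 25 ms frame length and 10 ms frame shift are taken from the paragraph above. The function name `frame_signal` is hypothetical.

```python
def frame_signal(x, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Cut a 1-D sample sequence into overlapping fixed-length frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))  # N, samples per frame
    hop = int(round(sample_rate * shift_ms / 1000.0))        # H, frame shift in samples
    if len(x) < frame_len:
        return []
    n_frames = 1 + (len(x) - frame_len) // hop
    return [x[m * hop: m * hop + frame_len] for m in range(n_frames)]


# Assumed 16 kHz rate: 25 ms -> 400 samples per frame, 10 ms -> 160-sample shift.
signal = list(range(16000))  # one second of dummy samples
frames = frame_signal(signal, sample_rate=16000)
```

With these values adjacent frames share 240 samples, which is the overlap the text relies on for temporal continuity.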
[0055] The windowing process refers to applying a window function to each frame of the audio signal. Window functions improve the accuracy of spectrum estimation by reducing the spectral leakage caused by framing, i.e., the spreading of a frequency component's energy into adjacent frequency components, which distorts the spectrum estimate. Common window functions include the Hamming window and the Hanning window. This invention uses the Hamming window function w(n):
[0056] $$w(n)=(1-\alpha)-\alpha\cos\left(\frac{2\pi n}{N-1}\right),\quad 0\le n\le N-1$$
[0057] Where n is the sample index within the current frame of the audio signal; N is the frame length of the audio signal; and α is a constant, with α = 0.46 for the Hamming window.
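As a small illustration of the windowing step, the sketch below evaluates the standard Hamming window w(n) = (1 - α) - α·cos(2πn/(N - 1)) with α = 0.46. The frame length of 400 samples is an assumption carried over from a hypothetical 16 kHz sampling rate, and the function name is illustrative.

```python
import math


def hamming_window(N, alpha=0.46):
    """Standard Hamming window of length N (alpha = 0.46 as stated in the text)."""
    if N == 1:
        return [1.0]
    return [(1.0 - alpha) - alpha * math.cos(2.0 * math.pi * n / (N - 1))
            for n in range(N)]


w = hamming_window(400)                           # assumed 400-sample frame
windowed = [s * c for s, c in zip([1.0] * 400, w)]  # apply to a dummy frame of ones
```

The window tapers each frame toward 0.08 at both ends, which is what suppresses the discontinuities introduced by cutting the signal into frames.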
[0058] The short-time Fourier transform S(m,k) is:
[0059] $$S(m,k)=\sum_{n=0}^{N-1} x(n+mH)\,w(n)\,e^{-i\,2\pi k n/N}$$
[0060] Where N is the frame length; H is the frame shift; m is the index of the current window; k is the frequency index, an integer from 0 to N-1; n is the sample index within the frame, an integer from 0 to N-1; x is the audio signal sequence; i is the imaginary unit; and w(n) is the window function.
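The following sketch puts framing, windowing, and the discrete Fourier transform together. It assumes the standard definition S(m,k) = Σₙ x(n + mH)·w(n)·e^(−i2πkn/N), which is consistent with the symbols defined above, and it uses a direct O(N²) sum from the standard library for clarity rather than a fast transform. The toy frame length, frame shift, and test tone are illustrative assumptions.

```python
import cmath
import math


def hamming(N, alpha=0.46):
    """Hamming window with alpha = 0.46, as stated in the text."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def stft(x, N, H, window):
    """Return S[m][k] = sum_n x[n + m*H] * w[n] * exp(-i*2*pi*k*n/N)."""
    n_frames = 1 + (len(x) - N) // H if len(x) >= N else 0
    S = []
    for m in range(n_frames):
        frame = [x[n + m * H] * window[n] for n in range(N)]
        S.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)
        ])
    return S


# Toy input: a cosine that completes exactly 4 cycles per 32-sample frame.
N, H = 32, 16
x = [math.cos(2 * math.pi * 4 * n / N) for n in range(64)]
S = stft(x, N, H, hamming(N))
peak_bin = max(range(N // 2 + 1), key=lambda k: abs(S[0][k]))
```

For this toy tone the largest magnitude in the first frame falls at frequency index 4, the bin matching the tone's four cycles per frame.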
[0061] Mel filter bank filtering is a spectral analysis method that filters the frequency domain signal of an audio signal through a set of triangular filters designed with a Mel scale. Human hearing perceives sound non-linearly, and Mel filtering is based on this non-linear characteristic, dividing the linear spectrum into a series of non-linear Mel frequency bands. This results in higher resolution for the low-frequency range and lower resolution for the high-frequency range, better matching the response of the human auditory system. Furthermore, Mel filtering can reduce high-dimensional spectral data to lower-dimensional Mel spectra, reducing computational load and storage space while retaining important information relevant to human hearing.
[0062] The frequency response formula for a Mel filter is:
[0063] $$H_M(k)=\begin{cases}0, & k<f(M-1)\\[4pt] \dfrac{k-f(M-1)}{f(M)-f(M-1)}, & f(M-1)\le k\le f(M)\\[8pt] \dfrac{f(M+1)-k}{f(M+1)-f(M)}, & f(M)<k\le f(M+1)\\[4pt] 0, & k>f(M+1)\end{cases}$$
[0064] Where H_M(k) is the frequency response of the Mel filter; k is the frequency value; M denotes the M-th Mel filter; and f(M) is the center frequency of the M-th Mel filter.
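A triangular Mel filter bank of the kind described above might be sketched as follows. Several details are assumptions not stated in the text: the Hz-to-Mel conversion 2595·log10(1 + f/700), the number of filters, the FFT size, the 16 kHz sampling rate, and the use of the power spectrum as filter input. Center frequencies are kept as fractional bin positions so that neighbouring edges never coincide.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed Mel-scale formula


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular responses H_M(k) over frequency bins k = 0 .. n_fft // 2."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points equally spaced on the Mel scale: f(0) .. f(n_filters + 1)
    mels = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    f = [mel_to_hz(m) * n_fft / sample_rate for m in mels]  # in (fractional) bin units
    bank = []
    for M in range(1, n_filters + 1):
        left, center, right = f[M - 1], f[M], f[M + 1]
        row = []
        for k in range(n_bins):
            if left <= k <= center:
                row.append((k - left) / (center - left))      # rising edge
            elif center < k <= right:
                row.append((right - k) / (right - center))    # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_spectrum(power_frame, bank):
    """Weighted sum of one frame's power spectrum under each triangular filter."""
    return [sum(h * p for h, p in zip(row, power_frame)) for row in bank]


bank = mel_filter_bank(n_filters=10, n_fft=512, sample_rate=16000)
mel = mel_spectrum([1.0] * 257, bank)  # flat dummy power spectrum
```

Because the filter edges are equally spaced on the Mel scale, the low-frequency filters are narrow and the high-frequency filters are wide, which gives the higher low-frequency resolution described in the text.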
[0065] Following the Mel spectrum calculation, the audio signal is passed through a one-dimensional convolution and multiple convolution-normalization-activation modules to obtain time-domain features; the feature maps of the Mel spectrum and the time-domain features are then concatenated at the channel level to obtain the primary features.
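The time-domain branch and the channel-level concatenation can be sketched in plain Python. This is only an illustration under stated assumptions: the text does not specify kernel sizes, strides, the normalization type, or the activation, so the sketch uses a strided first convolution whose output has one value per frame, per-row mean and variance normalization, ReLU, and fixed kernels. In the described method the kernel weights would be learned during training.

```python
import math


def conv1d(x, kernel, stride=1, pad=0):
    """1-D convolution in the deep-learning sense (sliding dot product)."""
    x = [0.0] * pad + list(x) + [0.0] * pad
    K = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(K))
            for i in range(0, len(x) - K + 1, stride)]


def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over one feature row (assumed form)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def time_domain_branch(x, front_kernels, module_kernels, H):
    """First 1-D convolution, then a chain of convolution-normalization-activation modules."""
    rows = [conv1d(x, k, stride=H) for k in front_kernels]
    for k3 in module_kernels:
        # odd-length kernels with pad = len // 2 keep the number of frames unchanged
        rows = [relu(normalize(conv1d(r, k3, pad=len(k3) // 2))) for r in rows]
    return rows


N, H = 8, 4                                        # toy frame length and frame shift
x = [math.sin(0.3 * n) for n in range(64)]         # dummy audio samples
front_kernels = [[math.cos(math.pi * c * n / N) for n in range(N)] for c in range(1, 4)]
module_kernels = [[0.25, 0.5, 0.25], [-0.5, 1.0, -0.5]]
time_map = time_domain_branch(x, front_kernels, module_kernels, H)
mel_map = [[0.0] * len(time_map[0]) for _ in time_map]  # placeholder Mel spectrum
primary = [mel_map, time_map]                      # channel-level concatenation
```

Choosing the first convolution's kernel length and stride equal to the frame length and frame shift is one way to make the two feature maps the same size, so they can be stacked as channels; other alignments are possible.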
[0066] Step S3: Utilize a deep neural network to perform feature transformation on the primary features, fusing the Mel spectrum and time-domain features to obtain the high-level feature embedding code. This feature transformation converts the three-dimensional primary features into a one-dimensional feature vector, achieving dimensionality reduction and extracting the information most useful for fault diagnosis, which makes the subsequent classifier more capable of fault diagnosis.
[0067] Step S4: Input the high-level feature embedding code into the classifier; based on the output of the classifier, obtain the recognition result of the sound of industrial machinery and equipment.
[0068] The frequency domain and time domain primary feature extraction module, deep neural network, and classifier refer to the trained frequency domain and time domain primary feature extraction module, deep neural network, and classifier.
[0069] The training method is as follows: audio signal samples of industrial machinery and equipment are collected; the audio signal samples are preprocessed to obtain a sample set; the sample set includes audio signal samples and corresponding labels;
[0070] The preprocessing includes labeling, downsampling, and segmentation;
[0071] The labeling refers to: based on the operating condition record table of the industrial machinery and equipment, labeling each audio signal sample with the name of the corresponding time period in that table.
[0072] The downsampling refers to: downsampling audio signal samples to a set frequency, which saves storage space for audio signal samples while speeding up the computer system's reading speed of audio signal samples;
[0073] The segmentation refers to dividing each downsampled audio signal sample into several short-duration audio signals to reduce the amount of data contained in a single audio signal sample, thus facilitating subsequent feature extraction and feature transformation.
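The downsampling and segmentation steps might look like the sketch below. The source and target sampling rates (48 kHz and 16 kHz), the one-second segment length, the block-averaging decimation, and the dropping of a trailing partial segment are all assumptions made for illustration. The label string stands in for a name taken from the operating condition record table.

```python
def downsample(x, factor):
    """Crude decimation: average each block of `factor` samples (a simple low-pass)."""
    return [sum(x[i:i + factor]) / factor
            for i in range(0, len(x) - factor + 1, factor)]


def segment(x, seg_len):
    """Split into consecutive short clips of seg_len samples; drop any remainder."""
    return [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]


def build_sample_set(recording, label, factor, seg_len):
    """Pair every short clip with the label of the recording it came from."""
    return [(clip, label) for clip in segment(downsample(recording, factor), seg_len)]


# 3 s of dummy audio at an assumed 48 kHz, reduced to 16 kHz and cut into 1 s clips.
recording = [float(i % 7) for i in range(144000)]
sample_set = build_sample_set(recording, "normal sound", factor=3, seg_len=16000)
```

A production pipeline would normally use a proper anti-aliasing resampler; block averaging is used here only to keep the sketch free of external dependencies.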
[0074] The audio signal samples in the sample set are processed in steps S2 and S3 to obtain the high-level feature embedding code.
[0075] First, the high-level feature embedding code is input into the classifier for forward propagation. After obtaining the output, the loss is obtained by using the label and the loss function Loss with strong discriminative ability.
[0076] The loss function is AAM-Softmax, which enhances the discriminative power of features by applying a margin to the cosine of the angle between features and class weights.
[0077] $$Loss=-\log\frac{e^{s\cos(\theta_{y_i}+r)}}{e^{s\cos(\theta_{y_i}+r)}+\sum_{j=1,\,j\ne y_i}^{D}e^{s\cos\theta_j}}$$
[0078] Where D is the dimension of the model output probability vector; y_i is the true label of the i-th audio signal sample; θ_j and θ_{y_i} are angles in the angle vector obtained by transforming the normalized high-level feature embedding code through the weight matrix and taking the arccosine; s is the scaling factor; r is the angular margin (angle interval) in the angle space obtained by the arccosine transform, which penalizes the angle between the feature vector and the weight, thereby improving intra-class compactness and inter-class diversity.
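A per-sample AAM-Softmax loss can be sketched as below, following the usual additive angular margin formulation: normalize the embedding and each class weight, take the arccosine of their cosine similarity, add the margin r to the true-class angle only, scale by s, and apply softmax cross-entropy. The values s = 30 and r = 0.2 and the tiny two-class weight matrix are illustrative assumptions; the text does not give numeric settings.

```python
import math


def aam_softmax_loss(embedding, weights, label, s=30.0, r=0.2):
    """Additive angular margin softmax loss for one sample."""
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    e = unit(embedding)
    cosines = [sum(a * b for a, b in zip(e, unit(w))) for w in weights]
    thetas = [math.acos(max(-1.0, min(1.0, c))) for c in cosines]
    # the margin r is added only to the angle of the true class
    logits = [s * math.cos(t + r) if j == label else s * math.cos(t)
              for j, t in enumerate(thetas)]
    mx = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[label]


W = [[1.0, 0.0], [0.0, 1.0]]  # one weight row per class (D = 2)
e = [1.0, 1.0]                # embedding equally close to both classes
loss_plain = aam_softmax_loss(e, W, label=0, r=0.0)
loss_margin = aam_softmax_loss(e, W, label=0, r=0.2)
```

With the margin enabled, an embedding that sits on the decision boundary is penalized more heavily than under plain softmax, which pushes training toward tighter clusters per class.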
[0079] The loss is then backpropagated to obtain the gradients of the frequency-domain and time-domain primary feature extraction module, the deep neural network, and the classifier. The gradient descent update algorithm is then used to update the weight parameters of the frequency-domain and time-domain primary feature extraction module, the deep neural network, and the classifier, thereby achieving training.
[0080] The formula for updating the weight parameters is:
[0081] $$g_t=\nabla_{\theta}f_t(\theta_{t-1})$$
[0082] $$b_t=\mu\,b_{t-1}+(1-\tau)\,g_t$$
[0083] $$g_t=b_t$$
[0084] $$\theta_t=\theta_{t-1}-\gamma\,g_t$$
[0085] Where θ_{t-1} is the weight parameter at step t-1; g_t is the gradient of the step-t objective function f_t with respect to θ_{t-1}; θ_t is the weight parameter at step t; b_t is the momentum at the t-th iteration; μ is the momentum coefficient; τ is the momentum damping; ∇ is the gradient operator; and γ is the update step size.
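The update rule above, b_t = μ·b_{t-1} + (1 - τ)·g_t followed by θ_t = θ_{t-1} - γ·b_t, can be traced with a short sketch. The quadratic objective f(θ) = θ² and the values μ = 0.9, τ = 0, and γ = 0.01 are assumptions chosen only to make the behaviour easy to check.

```python
def sgd_momentum_step(theta, grad, b_prev, mu=0.9, tau=0.0, gamma=0.01):
    """One update: b_t = mu*b_{t-1} + (1 - tau)*g_t, then theta_t = theta_{t-1} - gamma*b_t."""
    b = [mu * bp + (1.0 - tau) * g for bp, g in zip(b_prev, grad)]
    theta_new = [th - gamma * bt for th, bt in zip(theta, b)]
    return theta_new, b


# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, b = [1.0], [0.0]
theta_after_one, b_after_one = sgd_momentum_step(theta, [2.0 * theta[0]], b)

for _ in range(500):
    theta, b = sgd_momentum_step(theta, [2.0 * theta[0]], b)
```

In the described method the same step would be applied to every weight of the primary feature extraction module, the deep neural network, and the classifier, with the gradients supplied by backpropagation.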
[0086] This invention proposes a method for identifying industrial machine sounds by combining Mel spectrum and temporal features. The Mel spectrum is obtained by performing a short-time Fourier transform and Mel filter bank filtering on the industrial machine sound, which fully simulates human hearing's perception of pitch and loudness. The temporal features are obtained by performing one-dimensional convolution and multiple convolution-normalization-activation modules on the industrial machine sound. The calculation parameters of the temporal features are learned through a gradient descent update algorithm, making full use of the full-frequency information of the industrial machine sound. By combining the Mel spectrum and temporal features, it is beneficial to accurately identify the industrial machine sound under different conditions.
[0087] To verify the technical effectiveness of the method of the present invention, the same dataset was processed using the Mel spectrum method, the time domain feature method and the Mel spectrum time domain fusion method of the present invention, respectively, and the fault diagnosis accuracy is shown in Table 1.
[0088] Table 1 Fault Diagnosis Accuracy
[0089]
[0090] As shown in Table 1, the industrial machine equipment sound recognition method based on Mel spectrum time-domain fusion of the present invention has a higher fault diagnosis accuracy.
[0091] Example 2
[0092] This embodiment provides a readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the Mel-spectrum time-domain fusion method for industrial machine equipment sound recognition as described in Embodiment 1.
[0093] Example 3
[0094] This embodiment discloses a computer device, including a processor and a memory for storing processor-executable programs. When the processor executes the program stored in the memory, it implements the industrial machine sound recognition method based on Mel-spectrum time-domain fusion as described in Embodiment 1.
[0095] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.
Claims
1. A method for sound recognition of industrial machinery and equipment using Mel-spectrum time-domain fusion, characterized in that it includes the following steps: Step S1: Collect audio signals from industrial machinery and equipment; Step S2: Input the audio signal into the frequency domain and time domain primary feature extraction module; the frequency domain and time domain primary feature extraction module performs short-time Fourier transform and Mel filter bank filtering on the audio signal to obtain the Mel spectrum; the audio signal is subjected to one-dimensional convolution and multiple convolution-normalization-activation modules to obtain time domain features; then the feature maps of the Mel spectrum and time domain features are concatenated at the channel level to obtain primary features; Step S3: Use a deep neural network to perform feature transformation on the primary features, fusing the Mel spectrum and time domain features to obtain the high-level feature embedding code; Step S4: Input the high-level feature embedding code into the classifier; based on the output of the classifier, obtain the recognition result of the sound of industrial machinery and equipment. The frequency domain and time domain primary feature extraction module, deep neural network, and classifier refer to the trained frequency domain and time domain primary feature extraction module, deep neural network, and classifier. The training method is as follows: audio signal samples of industrial machinery and equipment are collected; the audio signal samples are preprocessed to obtain a sample set; the sample set includes audio signal samples and corresponding labels; the audio signal samples in the sample set are processed in steps S2 and S3 to obtain the high-level feature embedding code; first, the high-level feature embedding code is input into the classifier for forward propagation; after the output is obtained, the loss is computed using the label and the loss function Loss; then the loss is backpropagated to obtain the gradients of the frequency domain and time domain primary feature extraction module, deep neural network, and classifier; the gradient descent update algorithm is used to update the weight parameters of the frequency domain and time domain primary feature extraction module, deep neural network, and classifier, thereby realizing training. The preprocessing includes labeling, downsampling, and segmentation; the labeling refers to: based on the operating condition record table of the industrial machinery and equipment, labeling each audio signal sample with the name of the corresponding time period in that table; the downsampling refers to: downsampling an audio signal sample to a set frequency; the segmentation refers to: dividing each downsampled audio signal sample into several short-duration audio signals. The gradient descent update algorithm is used to update the weight parameters of the frequency domain and time domain primary feature extraction module, deep neural network, and classifier; the weight parameter update formula is as follows: $g_t=\nabla_{\theta}f_t(\theta_{t-1})$; $b_t=\mu\,b_{t-1}+(1-\tau)\,g_t$; $g_t=b_t$; $\theta_t=\theta_{t-1}-\gamma\,g_t$; where $\theta_{t-1}$ is the weight parameter at step $t-1$; $g_t$ is the gradient of the step-$t$ objective function $f_t$ with respect to $\theta_{t-1}$; $\theta_t$ is the weight parameter at step $t$; $b_t$ is the momentum at the $t$-th iteration; $\mu$ is the momentum coefficient; $\tau$ is the momentum damping; $\nabla$ is the gradient operator; and $\gamma$ is the update step size.
2. The industrial machine equipment sound recognition method based on Mel-spectrum time-domain fusion according to claim 1, characterized in that: in step S2, the short-time Fourier transform includes framing, windowing, and discrete Fourier transform. The framing refers to cutting the audio signal into frames of fixed length according to a set time length; there is an overlapping area between adjacent frames to ensure the continuity of the audio signal in time. The windowing refers to performing window function processing on each frame of the audio signal. The short-time Fourier transform $S(m,k)$ is: $S(m,k)=\sum_{n=0}^{N-1} x(n+mH)\,w(n)\,e^{-i\,2\pi k n/N}$; where $N$ is the frame length of the audio signal; $H$ is the frame shift; $m$ is the index of the current window; $k$ is an integer from $0$ to $N-1$; $n$ is an integer from $0$ to $N-1$; $x$ is the audio signal sequence; $i$ is the imaginary unit; and $w(n)$ is the window function.
3. The industrial machinery and equipment sound recognition method based on Mel-spectrum time-domain fusion according to claim 2, characterized in that: in the windowing process, the window function used is the Hamming window function $w(n)$: $w(n)=(1-\alpha)-\alpha\cos\left(\frac{2\pi n}{N-1}\right)$; where $n$ is the sample index within the current frame of the audio signal; $N$ is the frame length of the audio signal; and $\alpha$ is a constant, with $\alpha=0.46$ for the Hamming window.
4. The industrial machinery and equipment sound recognition method based on Mel-spectrum time-domain fusion according to claim 1, characterized in that: in step S2, Mel filter bank filtering refers to a spectrum analysis method that filters the frequency domain signal of an audio signal through a set of triangular filter banks designed with Mel scale. The frequency response formula for a Mel filter is: $H_M(k)=\begin{cases}0, & k<f(M-1)\\ \dfrac{k-f(M-1)}{f(M)-f(M-1)}, & f(M-1)\le k\le f(M)\\ \dfrac{f(M+1)-k}{f(M+1)-f(M)}, & f(M)<k\le f(M+1)\\ 0, & k>f(M+1)\end{cases}$; where $H_M(k)$ is the frequency response of the Mel filter; $k$ is the frequency value; $M$ denotes the $M$-th Mel filter; and $f(M)$ is the center frequency of the $M$-th Mel filter.
5. The industrial machine equipment sound recognition method based on Mel-spectrum time-domain fusion according to claim 1, characterized in that: the loss function Loss is AAM-Softmax: $Loss=-\log\dfrac{e^{s\cos(\theta_{y_i}+r)}}{e^{s\cos(\theta_{y_i}+r)}+\sum_{j=1,\,j\ne y_i}^{D}e^{s\cos\theta_j}}$; where $D$ is the dimension of the model's output probability vector; $y_i$ is the true label of the $i$-th audio signal sample; $\theta_j$ and $\theta_{y_i}$ are angles in the angle vector obtained by transforming the normalized high-level feature embedding code through a weight matrix and taking the arccosine; $s$ is the scaling factor; and $r$ is the angular margin (angle interval) in the angle space obtained by the arccosine transform.
6. A readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, causes the processor to perform the industrial machine sound recognition method based on Mel-spectrum time-domain fusion as described in any one of claims 1-5.
7. A computer device comprising a processor and a memory for storing a processor-executable program, characterized in that, when the processor executes the program stored in the memory, it implements the industrial machine equipment sound recognition method based on Mel-spectrum time-domain fusion as described in any one of claims 1-5.