Method of sound source detection and localization

By using a CRNN network with CEEMDAN noise reduction and feature fusion, the problem of insufficient accuracy of sound source detection and localization algorithms in unknown noise environments is solved, and high-precision end-to-end sound source localization is achieved.

CN115267672B · Active · Publication Date: 2026-06-23 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF POSTS & TELECOMM
Filing Date
2022-07-04
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing sound source detection and localization algorithms are not accurate enough in the presence of unknown noise, and traditional methods make it difficult to build end-to-end sound source localization systems, especially in three-dimensional Cartesian space, where the localization requirements are particularly demanding.

Method used

The CEEMDAN noise reduction algorithm is used to denoise the multi-channel signal. Combined with FBANK and GCC feature extraction, multi-task learning is performed through a CRNN network to achieve joint estimation of sound source category and location.

Benefits of technology

It effectively reduces the impact of unknown noise, significantly improves the accuracy of sound source detection and localization, simplifies the complexity of the online prediction process, and realizes an end-to-end sound source localization system.



Abstract

The application provides a sound source detection and positioning method, mainly comprising the following steps: splitting a multi-channel signal into a single-channel signal; using a CEEMDAN noise reduction algorithm for noise reduction processing; extracting FBANK features and GCC features from the single-channel signal after noise reduction respectively; training a CRNN network combined with category labels and position labels to obtain a sound source positioning detection model; splitting the online extracted samples according to channels; extracting FBANK features and GCC features from the single-channel signal after splitting respectively, and jointly inputting the comprehensive features into the sound source positioning detection model to obtain the estimation results of the sound source category and the estimation results of the position. The application reduces the influence of unknown noise on the sound source signal by carrying out noise reduction processing on the sound source signal with unknown noise distribution, and simultaneously uses a multi-task learning method for the sound source category and position, which can significantly improve the precision and reduce the complexity of the online prediction process.

Description

Technical Field

[0001] This invention relates to a method for sound source detection and localization, belonging to the field of deep learning.

Background Technology

[0002] In recent years, due to the widespread application of various positioning algorithms and information, the localization and detection of sound events has received considerable attention, such as intelligent traffic management in smart cities, speech recognition in smart conference rooms, and audio monitoring in smart homes. With the rapid development of the Internet of Things and artificial intelligence, there is an urgent need for a fast and accurate algorithm for the localization and detection of sound events. Typically, such algorithms consist of two sub-tasks: Sound Event Detection (SED) and Sound Source Localization (SSL). The SED task primarily addresses the classification of the sound source, while the SSL task primarily addresses the estimation of the sound source's location.

[0003] For the Sound Event Detection (SED) task, different supervised classification learning methods are typically used to determine the category of the sound source. Existing classifiers include: Hidden Markov Models, Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Convolutional Recurrent Neural Networks (CRNNs). For the SED task, faster and more accurate sound source classification is crucial. Currently, the best performance in academia comes from CRNNs, a network structure obtained by stacking CNN, RNN, and fully connected (FC) layers. This effectively utilizes the receptive fields of CNNs at different levels to reduce feature dimensionality and expand the vertical dimension of features, while leveraging RNNs to effectively model time-related sequences.

[0004] Traditionally, Sound Source Localization (SSL) tasks have employed methods based on time difference of arrival (TDOA), steered response power (SRP), and multiple signal classification (MUSIC). These traditional approaches vary in terms of algorithmic complexity, microphone array geometric constraints, and acoustic scenario model assumptions, making it difficult to achieve end-to-end sound source localization systems. Meanwhile, the continued growth of deep learning in recent years has led an increasing number of researchers to use deep learning frameworks to build SSL networks. In early SSL work, estimating the sound source direction was often treated as a classification task, because in the early stages of deep learning it was much easier to build classification networks than regression networks. Sound source directions were therefore divided subjectively into a number of classes. However, this led to several problems. For example, the granularity of the direction classes directly limited the resolution of the SSL task. Furthermore, most previous work classified elevation and azimuth angles in spherical coordinates. If localization were to be performed in a three-dimensional Cartesian coordinate system, hundreds of categories might be required, which places extremely stringent demands on network construction and training data, rendering it impractical. Therefore, classification-based sound source localization tasks have gradually been replaced by regression-based tasks.

[0005] In speech feature extraction, the most commonly used feature is the Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs utilize the cepstral transform between the sound source signal and the Mel filter bank, as well as the spectral compression transform of the Mel scale. Since the values of the first few MFCCs can capture audio features with invariant pitch, they are often used in tasks that generalize pitch, such as speaker recognition. However, in recent sound event detection work, results using MFCCs show that they are not the optimal choice due to their sensitivity to background noise. In previous work, Mel-Filter Bank (FBANK) features have been shown to be better than MFCCs in deep neural networks. Regarding spatial localization, for multi-channel signals, the generalized cross correlation (GCC) between adjacent channel signals can effectively reveal the differences between channel signals and demonstrate the ability to distinguish signals coming from different directions. GCCs have been widely applied and developed, and are currently a commonly used solution for sound source localization in traditional methods.

[0006] In view of this, it is indeed necessary to propose a method for sound source detection and localization to solve the above problems.

Summary of the Invention

[0007] The purpose of this invention is to provide a method for sound source detection and localization, which effectively reduces the impact of unknown noise on sound source signals.

[0008] To achieve the above objectives, the present invention provides a method for sound source detection and localization, which mainly includes the following steps:

[0009] Step 1: Split the audio signal from the sound source into channels, and split the multi-channel signal into a single-channel signal;

[0010] Step 2: For each single-channel signal, perform noise reduction processing using the CEEMDAN noise reduction algorithm;

[0011] Step 3: Extract FBANK features and GCC features from the denoised single-channel signal, and input the combined FBANK features and GCC features as the comprehensive features into the CRNN network.

[0012] Step 4: Train the CRNN network by combining category labels and location labels to obtain the sound source localization and detection model;

[0013] Step 5: Split the online extracted samples by channel, and split the multi-channel signal into a single-channel signal;

[0014] Step 6: Extract FBANK and GCC features from the single-channel signals split in Step 5, and combine the FBANK and GCC features as a comprehensive feature. Input the combined feature into the sound source localization detection model in Step 4 to obtain the estimated results of the sound source category and location.

[0015] As a further improvement of the present invention, it includes an offline stage and an online stage, wherein steps 1-4 are completed in the offline stage, and steps 5 and 6 are completed in the online stage.

[0016] As a further improvement of the present invention, in step 1, category information and location information are used as labels to mark different sound sources. The category information uses a one-hot code as a label, and the location information is transformed from a spherical coordinate system to a three-dimensional Cartesian coordinate system, as shown in the following formula:

[0017] x = r·cos(ele)·cos(azi)

[0018] y = r·cos(ele)·sin(azi)

[0019] z = r·sin(ele),

[0020] Where r is the distance between the speaker and the microphone, ele is the elevation angle in degrees, azi is the azimuth angle in degrees, and x, y and z are the three-dimensional Cartesian coordinates.

[0021] As a further improvement of the present invention, step 2 specifically includes the following steps:

[0022] Step 21: Add Gaussian white noise to the single-channel signal to be decomposed to obtain the first set of new signals;

[0023] Step 22: Perform EMD decomposition on the first set of new signals to obtain the first-order intrinsic mode components;

[0024] Step 23: Perform an overall average of the generated N modal components to obtain the first intrinsic modal component decomposed by the CEEMDAN noise reduction algorithm;

[0025] Step 24: Calculate and remove the residual signal of the first intrinsic mode component, add positive and negative paired Gaussian white noise to obtain a second set of new signals, and use the second set of new signals as the carrier to perform EMD decomposition to obtain the first-order mode components.

[0026] Step 25: Repeat the above steps until all modal components are obtained;

[0027] Step 26: For each modal component, calculate its cross-correlation coefficient with the single-channel signal to be decomposed in Step 21.

[0028] As a further improvement of the present invention, in step 21, the single-channel signal to be decomposed is y(t), and after adding Gaussian white noise, the first set of new signals is obtained as y(t) + (-1)^q·ε·v_j(t), where q = 1, 2.

[0029] As a further improvement of the present invention, step 3 specifically includes the following steps:

[0030] Step 31: Perform a short-time Fourier transform on the denoised single-channel signal;

[0031] Step 32: Extract the intra-band features from the vector obtained by the short-time Fourier transform using a Mel filter bank;

[0032] Step 33: Perform a logarithmic operation on the obtained internal features to obtain the FBANK features;

[0033] Step 34: Combine different channels in pairs to obtain different combinations;

[0034] Step 35: Perform a Fourier transform on each signal in each combination in step 34, and perform a conjugate operation on one of the signals to obtain two vectors;

[0035] Step 36: Use the GCC-PHAT weighting function to obtain the product of the two vectors;

[0036] Step 37: Perform an inverse Fourier transform on the product to obtain the GCC features between channels;

[0037] Step 38: Overlay the FBANK features and GCC features on the time axis to obtain the comprehensive features.

[0038] As a further improvement of the present invention, in step 32, the Mel filter bank includes 64 triangular filters, and the frequency response of the triangular filters is defined as:

[0039]

[0040] in,

[0041] As a further improvement of the present invention, in step 33, the logarithmic operation is as follows:

[0042]

[0043] The resulting FBANK feature has 513 dimensions.

[0044] As a further improvement of the present invention, in step 34, the received signals between the two channels are respectively

[0045] x1(t)=α1s(t-τ1)+n1(t)

[0046] x2(t)=α2s(t-τ2)+n2(t),

[0047] Where s(t) is the sound source signal, n1(t) and n2(t) are the environmental noise, α1 and α2 are the corresponding attenuation coefficients, and τ1 and τ2 are the times at which the respective array elements receive the sound source signal.

[0048] As a further improvement of the present invention, in step 36, the GCC-PHAT weighting function is:

[0049]

[0050] Where X(ω) is the Fourier transform of the original signal.

[0051] The beneficial effects of this invention are: by denoising the sound source signal with unknown noise distribution, this invention effectively reduces the influence of unknown noise on the sound source signal. At the same time, by using multi-task learning of sound source category and location, it can significantly improve accuracy and reduce the complexity of the online prediction process.

Attached Figure Description

[0052] Figure 1 is a flowchart of the sound source detection and localization method of the present invention.

[0053] Figure 2 is a graph showing the cross-correlation of each IMF component produced by the CEEMDAN noise reduction algorithm in the sound source detection and localization method of the present invention.

[0054] Figure 3 shows the noise reduction effect achieved with different noise reduction thresholds in the sound source detection and localization method of the present invention.

[0055] Figure 4 is a schematic diagram of the feature extraction and fusion used in the sound source detection and localization method of the present invention.

[0056] Figure 5 illustrates how the features extracted in the sound source detection and localization method of the present invention are affected by the sound source category and location.

[0057] Figure 6 is a schematic diagram of the CRNN network framework in the sound source detection and localization method of the present invention.

Detailed Implementation

[0058] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0059] It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and / or processing steps closely related to the present invention are shown in the accompanying drawings, while other details that are not closely related to the present invention are omitted.

[0060] Additionally, it should be noted that the terms “comprising,” “including,” or any other variations are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0061] As shown in Figures 1 to 6, this invention discloses a deep learning-based method for sound source detection and localization using a CRNN. It employs the CEEMDAN denoising algorithm (Complete Ensemble Empirical Mode Decomposition with Adaptive Noise) to denoise sound source signals with unknown noise distributions. The invention mainly includes the following steps:

[0062] Step 1: Split the audio signal from the sound source into channels, and split the multi-channel signal into a single-channel signal;

[0063] Step 2: For each single-channel signal, perform noise reduction processing using the CEEMDAN noise reduction algorithm;

[0064] Step 3: Extract FBANK features and GCC features from the denoised single-channel signal, and input the combined FBANK features and GCC features as the comprehensive features into the CRNN network.

[0065] Step 4: Train the CRNN network by combining category labels and location labels to obtain the sound source localization and detection model;

[0066] Step 5: Split the online extracted samples by channel, and split the multi-channel signal into a single-channel signal;

[0067] Step 6: Extract FBANK and GCC features from the single-channel signals split in Step 5, and combine the FBANK and GCC features as a comprehensive feature. Input the combined feature into the sound source localization detection model in Step 4 to obtain the estimated results of the sound source category and location.

[0068] This invention mainly includes two stages: an offline stage and an online stage. Steps 1-4 are completed in the offline stage, while steps 5 and 6 are completed in the online stage. Steps 1-6 will be described in detail below.

[0069] In step 1, audio signals from sound sources are collected to form a dataset, and the audio signals are split by channel, breaking down multi-channel signals into single-channel signals. For different sound source emission scenarios, the category and location information of each sample are recorded as labels. The category information uses a one-hot code as a marker, and the location information is converted from spherical coordinates to a three-dimensional Cartesian coordinate system, as shown in the following formula:

[0070] x = r·cos(ele)·cos(azi)

[0071] y = r·cos(ele)·sin(azi)

[0072] z = r·sin(ele),

[0073] Where r is the distance between the speaker and the microphone, ele is the elevation angle in degrees, azi is the azimuth angle in degrees, and x, y, and z are the three-dimensional Cartesian coordinates. To accelerate network convergence during the final regression for SSL, the three-dimensional coordinates are normalized so that they all fall within the range (-1, 1). Then, based on the number of microphones in the array, the multi-channel signal is split into single-channel signals, and each signal is resampled to 24 kHz.
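The label conversion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the normalization rule (dividing by a bound `r_max` chosen slightly larger than the greatest source distance) is an assumption, since the text only states that the coordinates end up in (-1, 1).

```python
import math

def spherical_to_cartesian(r, ele_deg, azi_deg):
    """Convert distance r, elevation and azimuth (both in degrees) to (x, y, z)."""
    ele = math.radians(ele_deg)
    azi = math.radians(azi_deg)
    x = r * math.cos(ele) * math.cos(azi)
    y = r * math.cos(ele) * math.sin(azi)
    z = r * math.sin(ele)
    return x, y, z

def normalize_xyz(xyz, r_max):
    """Assumed normalization: scale by r_max, a bound above every source distance."""
    return tuple(c / r_max for c in xyz)

# Example: a source 2 m away, 30 degrees elevation, 45 degrees azimuth,
# normalized with an assumed bound of 2.5 m.
label = normalize_xyz(spherical_to_cartesian(2.0, 30.0, 45.0), r_max=2.5)
```

Because each coordinate has magnitude at most r, dividing by any `r_max` greater than the largest r keeps every label strictly inside (-1, 1), which matches the tanh output range used later for regression.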

[0074] In step 2, each single-channel signal is denoised with the CEEMDAN noise reduction algorithm to improve signal quality and reduce the impact of noise. Let E_i(·) denote the i-th intrinsic mode component obtained by EMD (Empirical Mode Decomposition), and let the i-th intrinsic mode component obtained by the CEEMDAN denoising algorithm be denoted... Let v_j be a Gaussian white noise signal following the standard normal distribution, where j = 1, 2, 3, ..., N counts the white-noise additions, ε is the standard deviation of the white noise, and y(t) is the signal to be decomposed. The CEEMDAN denoising algorithm specifically includes the following steps:

[0075] Step 21: Add Gaussian white noise to the single-channel signal y(t) to be decomposed to obtain the first set of new signals y(t) + (-1)^q·ε·v_j(t), where q = 1, 2.

[0076] Step 22: Perform EMD decomposition on the first set of new signals to obtain the first-order intrinsic mode components.

[0077] Step 23: The overall average of the generated N modal components yields the first intrinsic mode component (IMF) of the CEEMDAN denoising algorithm, i.e.

[0078] Step 24: Calculate and remove the residual signal of the first intrinsic mode component, i.e. A new signal is obtained by adding positive and negative paired Gaussian white noise to r1(t). The new signal is then used as a carrier for EMD decomposition to obtain the first-order mode component D1.

[0079] Step 25: Repeat the above steps until all modal components are obtained. At this point...

[0080] Step 26: For each modal component, calculate its cross-correlation coefficient with the single-channel signal to be decomposed in Step 21 (i.e., the original audio signal). Based on the cross-correlation coefficient, select or discard each Intrinsic Mode Function (IMF). In normal processing, high-frequency signals are usually directly discarded. However, in many cases, high-frequency signals contain useful information, and direct removal would destroy the integrity of the original data. How to select or discard IMFs from each signal decomposition involves the original noise distribution of the original signal. However, real-world noise is very complex, and its specific distribution is often unknown. Therefore, the correlation coefficient is used to determine the selection or rejection of IMFs.
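Steps 21 to 23 above (paired noise injection, EMD of each noisy copy, ensemble averaging) can be sketched as below. This is only an illustration under stated assumptions: `first_imf` is a hypothetical callable standing in for an EMD routine that returns the first intrinsic mode component of its input, and averaging over all 2N noisy copies is one reading of the "overall average" in step 23.

```python
import numpy as np

def ceemdan_first_imf(y, first_imf, n_trials=50, eps=0.2, seed=0):
    """Average the first IMF over positive/negative paired noisy copies of y.

    first_imf: hypothetical callable, signal -> first EMD mode of that signal.
    eps:       noise amplitude (the epsilon in y(t) + (-1)^q * eps * v_j(t)).
    """
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(y, dtype=float)
    for _ in range(n_trials):
        v = rng.standard_normal(y.shape)      # v_j: standard normal white noise
        for q in (1, 2):                      # q = 1 subtracts, q = 2 adds the noise
            acc += first_imf(y + (-1) ** q * eps * v)
    return acc / (2 * n_trials)
```

With a linear stand-in for `first_imf`, the paired noise terms cancel exactly, which is the motivation for adding noise in positive and negative pairs. A real EMD is nonlinear, so the cancellation is only approximate in practice.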

[0081] To demonstrate the algorithm logic, 4000 sampling points are extracted for a simulation of the CEEMDAN noise reduction algorithm, resulting in 13 IMF components. The correlation coefficient of each component with the original signal is then calculated; the results are shown in Figure 2.

[0082] From Figure 2, it can be seen that different IMFs have different degrees of correlation with the original signal. We introduce a noise reduction threshold t: IMFs with a correlation coefficient greater than t are retained, and IMFs with a correlation coefficient less than t are filtered out as noise.
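The threshold rule can be sketched as follows, assuming the IMFs have already been computed and that the correlation coefficient is the Pearson coefficient (the text does not name the estimator). The function name is hypothetical.

```python
import numpy as np

def denoise_by_correlation(signal, imfs, t):
    """Rebuild the signal from the IMFs whose correlation with it exceeds t."""
    kept = []
    for imf in imfs:
        rho = np.corrcoef(signal, imf)[0, 1]   # Pearson correlation (assumed)
        if rho > t:
            kept.append(imf)
    if not kept:                               # every IMF was rejected as noise
        return np.zeros_like(signal, dtype=float)
    return np.sum(kept, axis=0)

# Toy example: a 5 Hz tone plus a weak 200 Hz component standing in for noise.
time = np.linspace(0.0, 1.0, 4000, endpoint=False)
tone = np.sin(2 * np.pi * 5 * time)
hiss = 0.1 * np.sin(2 * np.pi * 200 * time)
clean = denoise_by_correlation(tone + hiss, [hiss, tone], t=0.5)
```

Raising t discards more components, which removes more noise but also more of the original signal, so t trades smoothness against fidelity.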

[0083] Figure 3 compares the original signal with the signals obtained after filtering out noise at different noise reduction thresholds. Sub-image a shows the original signal, sub-image b shows the signal at noise reduction threshold t = 0, sub-image c shows the signal at noise reduction threshold t = 0.05, and sub-image d shows the signal at noise reduction threshold t = 0.5. The noise reduction coefficients were obtained by exhaustive search.

[0084] As can be seen from the circled area in Figure 3, different noise reduction thresholds can smooth the signal while removing some noise. It should be noted that filtering out some IMFs will result in the loss of some information from the original signal.

[0085] In step 3, FBANK features and GCC features are extracted from the denoised signal, and then combined as a comprehensive feature input into the CRNN network.

[0086] Figure 4 shows the overall flowchart of the feature extraction and fusion algorithm. Step 3, which involves extracting and fusing FBANK and GCC features, is divided into the following steps:

[0087] Step 31: First, to extract the FBANK features, perform a short-time Fourier transform on the denoised single-channel signal. A frame length of 25 ms is used, over which the audio signal can be considered stationary. With a sampling rate of 24 kHz, a 1024-point Fourier transform is performed, resulting in a vector of length 513.
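The framing arithmetic in this step can be checked with a short sketch: 25 ms at 24 kHz is 600 samples, and a 1024-point real FFT yields 1024/2 + 1 = 513 frequency bins. The Hann window and the 50% hop below are assumptions, since the text specifies neither.

```python
import numpy as np

def stft_magnitude(x, sr=24000, frame_ms=25, n_fft=1024, hop=None):
    """Magnitude STFT with one row per frame and n_fft // 2 + 1 columns."""
    frame_len = int(sr * frame_ms / 1000)      # 600 samples per 25 ms frame
    hop = hop or frame_len // 2                # assumed 50% overlap
    window = np.hanning(frame_len)             # assumed window shape
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads each 600-sample frame to n_fft = 1024 points.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

# One second of a 440 Hz tone as a stand-in for a denoised channel.
spec = stft_magnitude(np.sin(2 * np.pi * 440 * np.arange(24000) / 24000))
```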

[0088] Step 32: Extract the internal frequency band features of the vector obtained by the short-time Fourier transform using a Mel filter bank. The Mel filter bank used has 64 triangular filters to extract frequency band information. The triangular bandpass filter has two main purposes: to smooth the spectrum and eliminate the effect of harmonics, highlighting the formants of the original speech. The frequency response of the triangular filter is defined as:

[0089]

[0090] The purpose of this filter bank is to simulate the human ear's non-linear perception of sound, with finer resolution at lower frequencies and coarser resolution at higher frequencies. This involves converting frequency to the Mel scale, as shown in the formula:

[0091]

[0092] Step 33: Perform a logarithmic operation on the obtained internal features to obtain the FBANK features. That is...

[0093]

[0094] The extracted FBANK features have 513 dimensions. This concludes the FBANK extraction process.
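As an illustration of steps 32 and 33, the sketch below builds 64 triangular filters over the 513 STFT bins and takes the logarithm of the filter outputs. The Hz-to-Mel formula used here, Mel(f) = 2595·log10(1 + f/700), is the common textbook form and is an assumption, as are the function names. The sketch yields one log energy per filter per frame; how these values map onto the 513-dimensional feature stated above is not specified, so it stops at the filter outputs.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=64, n_fft=1024, sr=24000):
    """Triangular filters, equally spaced on the Mel scale; shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)                     # filter edges and centres in Hz
    bin_freqs = np.linspace(0.0, sr / 2, n_bins)    # centre frequency of each STFT bin
    fb = np.zeros((n_filters, n_bins))
    for m in range(1, n_filters + 1):
        lo, ctr, hi = hz_pts[m - 1], hz_pts[m], hz_pts[m + 1]
        rising = (bin_freqs - lo) / (ctr - lo)      # ramp up from lo to ctr
        falling = (hi - bin_freqs) / (hi - ctr)     # ramp down from ctr to hi
        fb[m - 1] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(power_spec, fb, floor=1e-10):
    """Apply the filter bank to (frames, bins) power spectra and take the log."""
    return np.log(power_spec @ fb.T + floor)

fb = mel_filterbank()
```

The filters are narrow at low frequencies and wide at high frequencies, which is the non-linear resolution the Mel scale is meant to provide.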

[0095] Step 34: Next, extract the GCC features from the denoised single-channel signal and combine different channels in pairs to obtain different combinations. For example, in this experiment, a 4-channel microphone array is used, resulting in 6 possible combinations, thus giving the GCC features 6 dimensions. Assume the received signals between the two microphones are...

[0096] x1(t)=α1s(t-τ1)+n1(t)

[0097] x2(t)=α2s(t-τ2)+n2(t),

[0098] Where s(t) is the sound source signal, n1(t) and n2(t) are the environmental noise, α1 and α2 are the corresponding attenuation coefficients, and τ1 and τ2 are the times at which the respective array elements receive the sound source signal.

[0099] Step 35: Perform a Fourier transform on each signal in each combination, and perform a conjugate operation on one of the signals to obtain two vectors.

[0100] Step 36: Multiply the two obtained vectors using the GCC-PHAT weighting function. The time delay estimation algorithm based on GCC can introduce a weighting function to adjust the cross-power spectral density, thereby optimizing the performance of time delay estimation. Depending on the weighting function, the generalized cross-correlation function has several variants, among which the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method is the most widely used. The GCC-PHAT weighting function itself has a certain degree of robustness to noise and reverberation; therefore, this method is also used here to enhance the robustness of the system. The GCC-PHAT weighting function is...

[0101]

[0102] Where X(ω) is the Fourier transform of the original signal. It can be seen that the PHAT-weighted cross-power spectrum approximates the expression of the unit impulse response, highlighting the peak value of the time delay, effectively suppressing reverberation noise, and improving the accuracy and precision of time delay estimation.
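For reference, the PHAT weighting as it is commonly written in the literature is given below. The symbols X_1(ω) and X_2(ω) for the Fourier transforms of the two channel signals, and R_12(τ) for the resulting correlation, are naming assumptions and may differ from the notation of the original formula.

```latex
\psi_{\mathrm{PHAT}}(\omega) = \frac{1}{\left| X_1(\omega)\, X_2^{*}(\omega) \right|},
\qquad
R_{12}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty}
  \psi_{\mathrm{PHAT}}(\omega)\, X_1(\omega)\, X_2^{*}(\omega)\, e^{j\omega\tau}\, d\omega
```

Dividing by the magnitude leaves only the phase of the cross-power spectrum, so for a pure delay the inverse transform is close to a unit impulse located at the delay.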

[0103] Step 37: Perform an inverse Fourier transform on the product to obtain the inter-channel GCC features. At this point, the GCC feature extraction is complete.
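Steps 34 to 37 can be sketched with NumPy as below. This is an illustrative sketch rather than the patent's code: the function names are hypothetical, the small `floor` term is added only to avoid division by zero, and the full 1024-lag output is returned because the text does not say how the lags are reduced to the stated feature dimension.

```python
from itertools import combinations
import numpy as np

def gcc_phat(x1, x2, n_fft=1024, floor=1e-12):
    """GCC-PHAT between two channels; zero lag sits at index n_fft // 2."""
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)               # cross-power spectrum (step 35)
    cross /= np.abs(cross) + floor         # PHAT weighting: keep the phase only (step 36)
    cc = np.fft.irfft(cross, n=n_fft)      # back to the lag domain (step 37)
    return np.fft.fftshift(cc)

def pairwise_gcc(channels, n_fft=1024):
    """One GCC-PHAT row per unordered channel pair (step 34): 4 channels give 6 rows."""
    pairs = combinations(range(len(channels)), 2)
    return np.stack([gcc_phat(channels[i], channels[j], n_fft) for i, j in pairs])

# Toy check: channel b is channel a delayed by 5 samples.
rng = np.random.default_rng(0)
burst = rng.standard_normal(512)
a = np.zeros(1024); a[100:612] = burst
b = np.zeros(1024); b[105:617] = burst
lag = int(np.argmax(gcc_phat(a, b))) - 1024 // 2
```

With this sign convention, a negative lag means the second channel receives the sound later than the first.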

[0104] Step 38: Finally, the FBANK and GCC features are superimposed on the time axis to obtain the comprehensive feature. At this point, both the FBANK and GCC features have a dimension of 513. Using the concat function of the numpy module in Python, all the extracted features are combined into a comprehensive feature of (10, 513).
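The fusion step amounts to a single concatenation. In the sketch below, the split into 4 FBANK rows (one per channel) and 6 GCC rows (one per channel pair) is an inference from the 4-channel array and the stated (10, 513) result, not something the text spells out; the arrays are placeholders.

```python
import numpy as np

def fuse_features(fbank, gcc):
    """Stack per-channel FBANK rows and per-pair GCC rows into one feature array."""
    return np.concatenate([fbank, gcc], axis=0)

fbank = np.zeros((4, 513))   # placeholder: one 513-dim FBANK row per channel
gcc = np.ones((6, 513))      # placeholder: one 513-dim GCC row per channel pair
combined = fuse_features(fbank, gcc)
```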

[0105] Figure 5 shows a visualization of the features. Subplot (a) shows the time-domain plot of a telephone ringing sound at location A and its corresponding features; subplot (b) shows the time-domain plot of a telephone ringing sound at location B and its corresponding features; and subplot (c) shows the time-domain plot of a knocking sound at location B and its corresponding features. A comparison of subplots (a) and (b) shows that when the same sound source is emitted from different locations, the FBANK features remain largely unchanged, but the GCC features change significantly. A comparison of subplots (b) and (c) shows that when different sound sources are emitted from the same location, the FBANK features change dramatically, while the GCC features remain largely unchanged. This confirms that the combined effect of FBANK and GCC features allows for the determination of sound source category and location.

[0106] In step 4, the sound source localization and detection model is trained by combining the given category labels and location labels of the samples;

[0107] Figure 6 shows the CRNN network framework used in this invention. Sub-figure (a) shows the overall network structure, which consists of three convolutional blocks, two gated recurrent units (GRUs), and fully connected (FC) layers corresponding to classification and regression. Details of the convolutional blocks are shown in sub-figure (b), where a soft attention mechanism is embedded in the first convolutional block; the detailed structure is shown in sub-figure (c). These will be described in detail below.

[0108] First, for the overall framework, the input features have shape (10×1×513). Before the first convolutional block, the attention branch is applied first, as shown in sub-figure (c). Since the effect of convolution is local, multiple convolutions are needed to relate features at different positions across the entire feature map. The attention mechanism can fuse global features within the convolution instead of being limited to the extent of the convolution kernel. The attention mechanism used in this invention draws on ideas from Natural Language Processing (NLP) and uses a soft attention method to implement self-attention. First, the feature maps of each channel are separated. Since the input features of this invention have 10 channels, the vector of each channel is reshaped to matrix size, and the two are multiplied by a dot product. The significance of this step is that entry (i, j) of the resulting attention map represents the influence between the i-th element and the j-th element in that channel, thus capturing the dependency between any two elements in the entire feature map. The attention map is then obtained by softmax normalization. Finally, the attention map is multiplied by the original CNN feature map, updating the weights of each feature in the CNN. As learning deepens, the individual features of the original feature map receive the weights updated by the attention mechanism, which means they gain global dependencies at any position.
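One possible reading of this attention step is sketched below in NumPy. It is an interpretation, not the patent's layer: the function names are hypothetical, there are no learned projections, and the row-wise softmax over the outer product of each channel vector with itself is an assumption about how "dot product" and "softmax normalization" are combined.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def channel_self_attention(features):
    """features: (channels, length). Each channel is re-weighted by its own attention map."""
    out = np.empty(features.shape, dtype=float)
    for c, v in enumerate(features):
        scores = np.outer(v, v)                # entry (i, j): interaction of elements i and j
        weights = softmax(scores, axis=1)      # each row sums to 1
        out[c] = weights @ v                   # every output mixes all positions of the channel
    return out

x = np.random.default_rng(0).standard_normal((10, 16))
attended = channel_self_attention(x)
```

Every output element is a weighted average over the whole channel, which is how a single layer obtains the global dependencies that would otherwise require several stacked convolutions.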

[0109] The parameters of each convolutional block are similar. The batch normalization layer in all convolutional blocks is used to normalize the activations, which can speed up the convergence of training. The dropout layer uses a fixed probability of 0.2 to reduce overfitting during training. The ReLU layer is the activation function at the end of each convolutional block; it introduces non-linearity so that the learned mapping is not purely linear, which also plays a role in preventing overfitting.

[0110] In the structure proposed in this invention, the combined features of FBANK and GCC can be considered as 10-channel combined features, with each channel being a 1D vector of feature dimension with respect to the time dimension. To exploit local shift invariance, a CNN is used for multi-layer learning. The three convolutional blocks use 1×2 2D convolutional kernels with a stride of 1×1, expanding the channel dimension from 10 to 32, and then from 32 to 64. The pooling kernels are also 1×2 2D pooling kernels, with a stride of 1×2. The convolution and pooling parts focus on reducing the dimensionality of the features within a single channel, extracting their local invariant properties along the time dimension, and expanding the features to more dimensional spaces to enhance the deeper information of the features. Additionally, in the third convolutional block, 0×1 edge padding is added to ensure that the CNN output dimension can be transformed into a dimension suitable for the GRU by reshaping the matrix size. The convolutional blocks focus on expanding the channel dimension to deeply mine the combined features, while compressing the feature values along the time dimension to extract the necessary feature information. The purpose of the three convolutional blocks is to integrate the features between the various channels and make them match the subsequent GRU input dimensions.

[0111] The CNN output, after being reshaped to 128×64, is input directly into the GRU to match its sequence length for temporal learning. Specifically, each GRU unit consists of two layers, specified by the `num_layers` parameter in PyTorch. Each GRU layer has an input and output sequence length of 64, and a hidden layer size of 64. The output of each GRU is activated using tanh. ReLU activation is avoided because it is prone to exploding and vanishing gradients in recurrent units; therefore, tanh activation is used throughout the RNNs to prevent these issues. All GRUs are bidirectional. Through GRU learning, temporal information about the features can be obtained, allowing for further feature extraction. After GRU processing, the feature vector output is a 128×256 two-dimensional vector. This feature contains more temporal information.

[0112] Following the backbone network are two branch networks, both constructed from fully connected (FC) layers. These FC layers share weights across time and correspond to the classification task of SED and the regression task of SSL, respectively. The SED branch consists of three FC layers, with the last one using the sigmoid activation function to perform the 11-class classification task. The sigmoid output lies in the range (0,1), and for each event prediction a value above the threshold of 0.5 is treated as a valid output. The SSL branch consists of four FC layers, with the last one using three tanh activations that correspond to the regression predictions of the event's x, y, and z coordinates. Because this invention constrains the x, y, and z coordinates to the range (-1,1) when the labels are specified, tanh is used here to keep the outputs within this range.
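The way the two branch outputs are interpreted for a single time frame can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the network emits 11 pre-sigmoid scores and one triple of pre-tanh values per frame, and the function name `decode_frame` is hypothetical.

```python
import math

SED_THRESHOLD = 0.5  # decision threshold stated in the text


def sigmoid(v):
    """Numerically stable logistic function with output in (0, 1)."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def decode_frame(sed_scores, ssl_values):
    """Turn raw branch outputs for one frame into classes and a position.

    sed_scores: pre-sigmoid scores, one per class (11 in the text).
    ssl_values: three pre-tanh values for x, y and z (assumed layout).
    """
    probs = [sigmoid(v) for v in sed_scores]
    active = [i for i, p in enumerate(probs) if p > SED_THRESHOLD]
    xyz = tuple(math.tanh(v) for v in ssl_values)  # each in (-1, 1)
    return active, xyz


# Hypothetical frame in which only class 4 is active.
scores = [-3.0] * 11
scores[4] = 2.0
active, xyz = decode_frame(scores, [0.5, -1.0, 0.0])
print(active, xyz)
```

Because each class is thresholded independently, several classes can be reported as active in the same frame, which is what distinguishes the sigmoid output from a softmax over the 11 classes.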

[0113] For the loss function, the SED branch uses the Binary Cross-Entropy Loss (BCE Loss) between the predicted and true classes, and the SSL branch uses the Mean Square Error Loss (MSE Loss) to measure the difference between the predicted and true coordinates. Because this invention uses a multi-task end-to-end neural network, and the loss functions for the classification and regression tasks are not of the same order of magnitude, the MSE term is scaled so that the two losses are balanced. After exhaustive parameter tuning, the final loss function for backpropagation is determined to be Loss = BCE + 50 × MSE.
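The combined loss can be written out for a single frame as in the sketch below. The weight of 50 is taken from the text; the mean reduction over classes and over coordinates, the clipping constant `eps`, and the function names are assumptions made for illustration.

```python
import math

MSE_WEIGHT = 50.0  # weighting stated in the text


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def mse(pred, true):
    """Mean squared error between predicted and true coordinates."""
    return sum((a - b) ** 2 for a, b in zip(pred, true)) / len(pred)


def joint_loss(probs, class_targets, xyz_pred, xyz_true):
    """Loss = BCE + 50 * MSE, as given in the text."""
    return bce(probs, class_targets) + MSE_WEIGHT * mse(xyz_pred, xyz_true)


# Hypothetical two-class frame with a small localization error.
print(joint_loss([0.9, 0.2], [1, 0], [0.1, 0.0, -0.2], [0.0, 0.0, -0.1]))
```

With coordinates confined to (-1,1), squared coordinate errors are typically much smaller than the cross-entropy term, which is the imbalance the factor of 50 is meant to compensate.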

[0114] In summary, the sound source detection and localization method based on a convolutional recurrent network proposed in this invention treats sound event detection (SED) and sound source localization (SSL) as a classification task and a regression task, respectively, downstream of a shared backbone network. This multi-task learning approach effectively improves estimation performance. By fusing the FBANK feature, which is fixed for a given sound source category, with the GCC feature, which changes with the sound source's location, comprehensive features for sound source identification and localization can be extracted in one step. These features are used as the input for training the entire neural network, and the resulting model is then used to obtain the prediction results. This is an end-to-end neural network framework that solves the SED and SSL problems simultaneously, effectively improving accuracy. Furthermore, apart from the microphone array used in the data acquisition stage, no additional hardware support is required, which makes the method convenient for end-to-end sound source localization and detection systems.

[0115] Meanwhile, this invention uses the CEEMDAN denoising algorithm to denoise signals with an unknown noise distribution, effectively reducing the impact of unknown noise on the sound source signal. It yields an end-to-end model that handles two tasks with a single network, significantly improving accuracy and reducing the complexity of the online prediction process compared with similar approaches. The CEEMDAN denoising algorithm decomposes the original signal into individual intrinsic mode function (IMF) components, and the degree of cross-correlation between each IMF component and the original signal represents the contribution of that component; a denoising threshold t on this cross-correlation can therefore be used to remove noise effectively. Multiple experiments show that the optimal denoising threshold in this invention is 0.05.
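The threshold-based selection of IMF components can be sketched as follows. The decomposition itself is not shown: the IMFs are assumed to come from a separate CEEMDAN implementation. The use of the normalized (Pearson) correlation coefficient, the comparison of its absolute value against t, and the reconstruction of the denoised signal as the sum of the retained components are assumptions, since the text does not give these details. The function names are hypothetical.

```python
import math


def corr_coeff(a, b):
    """Normalized (Pearson) correlation coefficient of two sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant sequence carries no correlation
    return cov / math.sqrt(var_a * var_b)


def denoise_from_imfs(signal, imfs, threshold=0.05):
    """Keep IMFs whose |correlation| with the signal exceeds the threshold
    and sum them (assumed reconstruction rule)."""
    kept = [imf for imf in imfs if abs(corr_coeff(imf, signal)) > threshold]
    if not kept:
        return [0.0] * len(signal)
    return [sum(values) for values in zip(*kept)]


# Hypothetical stand-ins for IMFs: a strong tone and a weak ripple.
n = 200
tone = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
ripple = [0.01 * math.sin(2 * math.pi * 40 * i / n) for i in range(n)]
noisy = [t + r for t, r in zip(tone, ripple)]
clean = denoise_from_imfs(noisy, [ripple, tone])
```

In this toy case the ripple correlates with the mixture at roughly 0.01, below the 0.05 threshold, so only the tone is retained.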

[0116] Furthermore, this invention provides a novel deep learning network framework for sound source detection and localization. Different frameworks offer varying results for this type of algorithm; the framework provided by this invention is simple to implement and exhibits superior performance compared to other frameworks.

[0117] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A method for sound source detection and localization, characterized in that the main steps include:
Step 1: Split the audio signal from the sound source by channel, separating the multi-channel signal into single-channel signals;
Step 2: For each single-channel signal, perform noise reduction using the CEEMDAN noise reduction algorithm;
Step 3: Extract FBANK features and GCC features from the denoised single-channel signals, and input the combined FBANK and GCC features into the CRNN network as the comprehensive features;
Step 4: Train the CRNN network with category labels and location labels to obtain the sound source localization and detection model;
Step 5: Split the samples extracted online by channel, separating the multi-channel signal into single-channel signals;
Step 6: Extract FBANK features and GCC features from the single-channel signals split in Step 5, combine them into a comprehensive feature, and input it into the sound source localization and detection model from Step 4 to obtain the estimates of the sound source category and location.
In Step 1, category information and location information are used as labels to mark the different sound sources. Category information is labeled with a one-hot code, and location information is transformed from spherical coordinates to three-dimensional Cartesian coordinates by the following formula: [formula not reproduced in the source text], whose inputs are the distance between the speaker and the microphone, the elevation angle ele in degrees, and the azimuth angle azi in degrees, and whose outputs x, y and z are the three-dimensional Cartesian coordinates.
Step 2 specifically includes the following steps:
Step 21: Add Gaussian white noise to the single-channel signal to be decomposed to obtain the first set of new signals;
Step 22: Perform EMD decomposition on the first set of new signals to obtain the first-order intrinsic mode components;
Step 23: Take the ensemble average of the N generated mode components to obtain the first intrinsic mode component of the CEEMDAN decomposition;
Step 24: Compute the residual signal left after removing the first intrinsic mode component, add positive and negative paired Gaussian white noise to it to obtain a second set of new signals, and perform EMD decomposition with the second set of new signals as the carrier to obtain the first-order mode components;
Step 25: Repeat the above steps until all mode components are obtained;
Step 26: For each mode component, calculate its cross-correlation coefficient with the single-channel signal to be decomposed in Step 21; select or discard each intrinsic component based on this coefficient by setting a noise reduction threshold, retaining the intrinsic components with large cross-correlation coefficients and filtering out those with small cross-correlation coefficients.
Step 3 specifically includes the following steps:
Step 31: Perform a short-time Fourier transform on the denoised single-channel signal;
Step 32: Extract the in-band features from the vector obtained by the short-time Fourier transform using a Mel filter bank;
Step 33: Perform a logarithmic operation on the in-band features to obtain the FBANK features;
Step 34: Combine the channels in pairs to obtain the different channel combinations;
Step 35: Perform a Fourier transform on each signal in each combination from Step 34, and take the complex conjugate of one of the two transforms to obtain two vectors;
Step 36: Use the GCC-PHAT weighting function to obtain the weighted product of the two vectors;
Step 37: Perform an inverse Fourier transform on the product to obtain the GCC features between channels;
Step 38: Concatenate the FBANK features and GCC features along the channel dimension to obtain the comprehensive features.
The CRNN network framework consists of three convolutional blocks, two gated recurrent units (GRUs), and fully connected layers for classification and regression. A soft attention mechanism is embedded in the first convolutional block. The convolution and pooling parts of the three convolutional blocks reduce the dimensionality of the features within a single channel, extract their locally invariant properties along the time dimension, and expand the features into a higher-dimensional space to bring out deeper information. The GRUs consist of two bidirectional GRU layers that capture the temporal information of the features. The loss functions for the classification and regression tasks are tuned to the same order of magnitude. After exhaustive parameter tuning, the final loss function for backpropagation is determined as Loss = BCE + 50 × MSE.
Following the backbone network are two branch networks, both constructed from fully connected (FC) layers. These FC layers share weights across time, corresponding to the classification task of SED and the regression task of SSL, respectively. The SED branch network consists of three FCs, with the last FC using the sigmoid activation function to perform the classification task. The SSL branch network consists of four FCs, with the last FC using three tanh activations, corresponding to the regression prediction results of the event on the x, y, and z coordinates, respectively.

2. The method for sound source detection and localization according to claim 1, characterized in that: It includes an offline phase and an online phase, where steps 1-4 are completed in the offline phase, and steps 5 and 6 are completed in the online phase.

3. The method for sound source detection and localization according to claim 1, characterized in that: in step 21, Gaussian white noise is added to the single-channel signal to be decomposed to obtain the first set of new signals [signal and noise expressions not reproduced in the source text], where the added noise is a Gaussian white noise signal satisfying the standard normal distribution, j = 1, 2, 3, ..., N indexes the white-noise additions, and ε is the standard deviation of the white noise.

4. The method for sound source detection and localization according to claim 1, characterized in that: in step 32, the Mel filter bank includes 64 triangular filters, and the frequency response of the triangular filters is defined as: [formula and accompanying condition not reproduced in the source text].

5. The method for sound source detection and localization according to claim 4, characterized in that: in step 33, the logarithmic operation is as follows: [formula not reproduced in the source text]. The resulting FBANK feature has 513 dimensions.

6. The method for sound source detection and localization according to claim 1, characterized in that: in step 34, the signals received by the two channels are given by [expressions not reproduced in the source text], whose terms are the sound source signal, the environmental noise at each channel, and the time taken for each array element to receive the sound source signal.

7. The method for sound source detection and localization according to claim 6, characterized in that: in step 36, the GCC-PHAT weighting function is: [formula not reproduced in the source text], which is expressed in terms of the Fourier transform of the original signal.
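The GCC computation described in steps 35 to 37 can be sketched as follows. Because the weighting formula is not reproduced in this text, the sketch assumes the commonly used PHAT weighting, which divides the cross-spectrum by its own magnitude. A naive DFT is used only to keep the example self-contained; a practical implementation would use an FFT library. The function names and the sign convention (a positive result means the second channel lags the first) are choices made for illustration.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(x1, x2, eps=1e-12):
    """GCC with PHAT weighting (assumed form: cross-spectrum / |cross-spectrum|)."""
    spec1 = dft(x1)
    spec2 = dft(x2)
    cross = [b * a.conjugate() for a, b in zip(spec1, spec2)]  # conjugate one signal
    weighted = [c / (abs(c) + eps) for c in cross]             # PHAT weighting
    return [v.real for v in dft(weighted, inverse=True)]       # back to the lag domain


def estimate_delay(x1, x2):
    """Lag (in samples) at which the GCC peaks; positive if x2 lags x1."""
    cc = gcc_phat(x1, x2)
    n = len(cc)
    peak = max(range(n), key=cc.__getitem__)
    return peak if peak <= n // 2 else peak - n


# Hypothetical pair of channels: x2 is x1 delayed circularly by 3 samples.
x1 = [1.0, 0.6, -0.3, 0.2] + [0.0] * 12
x2 = x1[-3:] + x1[:-3]
print(estimate_delay(x1, x2))
```

The PHAT weighting discards the magnitude of the cross-spectrum and keeps only its phase, so the inverse transform concentrates into a sharp peak at the inter-channel delay, which is the location-dependent information the GCC feature contributes.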