A distributed speech enhancement system based on maximum likelihood

By using a maximum likelihood-based distributed speech enhancement system, and leveraging signal processing modules and filter updates, the problems of noise residue and high computational burden in wireless acoustic sensor networks are solved, achieving efficient speech enhancement results.

CN116524943B · Active · Publication Date: 2026-06-23 · ZHONGBEI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHONGBEI UNIV
Filing Date: 2023-05-11
Publication Date: 2026-06-23

Patent Timeline

11 May 2023: Application
23 Jun 2026: Publication (CN116524943B)

IPC: G10L21/0216

AI Tagging

Application Domain

Speech analysis

Technology Topics

Data compression; Noise

Technical Efficacy Phrases

Increase diversity; good noise cancellation performance


AI Technical Summary

Technical Problem

Existing distributed speech enhancement technologies in wireless acoustic sensor networks suffer from poor output performance, severe noise residue, the need for prior information, and high computational burden, and also lack diversity.

Method used

A distributed speech enhancement system based on maximum likelihood is adopted. Through discrete Fourier transform, speech activity detection, steering vector estimation, data compression, result output and inverse discrete Fourier transform modules, combined with signal construction, weighted correlation matrix estimation and filter update, the system realizes the compression and summation of signals between nodes, and completes distributed speech enhancement.

Benefits of technology

It improves the performance of distributed speech enhancement in wireless acoustic sensor networks, reduces the difficulty of noise cancellation, expands the diversity of technologies, and maintains good speech quality at low signal-to-noise ratios.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application belongs to the technical field of distributed speech enhancement, and particularly relates to a distributed speech enhancement system based on maximum likelihood. In order to expand the diversity of speech enhancement technology in wireless acoustic sensor networks (WASNs) and achieve good noise cancellation performance, the system comprises a discrete Fourier transform module, a speech activity detection module, a steering vector estimation module, a data compression module, a result output module, a signal construction module, a weighted correlation matrix estimation module, a filter update module, and an inverse discrete Fourier transform module. The present application is a distributed speech enhancement technology that can be applied to a wireless acoustic sensor network without a data processing center. The technology estimates a weighted correlation matrix from the local signal constructed by a node and the variance of the output result, and updates a filter by combining the estimated weighted correlation matrix with a constructed local steering vector, so as to complete distributed speech enhancement.

Description

Technical Field

[0001] This invention belongs to the field of distributed speech enhancement technology, and specifically relates to a distributed speech enhancement system based on maximum likelihood.

Background Technology

[0002] Wireless acoustic sensor networks (WASNs) typically consist of multiple nodes, which can be a single microphone, a microphone array, or even a smart device such as a mobile phone, smartwatch, or laptop. Each node possesses some data processing capabilities, and these nodes can transmit data through a pre-established wireless communication protocol. Compared to traditional single microphones and microphone arrays, WASNs can not only utilize the temporal and spatial information of audio signals for speech enhancement but also have a large physical coverage area, ensuring that there are always nodes relatively close to the sound source. This allows for the acquisition of noisy speech with a relatively high input signal-to-noise ratio, which is beneficial for further improving the performance of speech enhancement technology.

[0003] Generally, distributed speech enhancement technologies applied to WASNs can be divided into two main categories. One category is applied to WASNs with a data processing center, known as centralized processing; the other is applied to WASNs without a data processing center, known as distributed processing. In centralized processing, all nodes need to send the received speech signals to the data processing center, where speech enhancement is performed. This approach has significant drawbacks: if the data processing center fails or loses connection with a node, the entire WASN stalls and cannot function properly. Furthermore, even when functioning normally, the data processing center requires considerable computing power and incurs high power consumption. In contrast, WASNs without a data processing center do not suffer from these drawbacks. When distributed speech enhancement is performed in such a network, the computation is shared among the nodes, and even if a node temporarily fails or a new node joins the network, the overall network operation remains unaffected and the performance of distributed speech enhancement does not change significantly.

[0004] A distributed beamforming technique based on linearly constrained minimum variance is proposed in the prior art. In this technique, each node has a microphone array, and each node can complete distributed speech enhancement by using its own local signal and a single-channel compressed signal sent from neighboring nodes. Although this technique implements the existing linearly constrained minimum variance beamforming technique in a distributed manner in a WASN, its output performance is poor.

[0005] In existing technologies based on network summation methods, a distributed speech enhancement technique is proposed that is not limited by network topology. This technique compresses the signal of each node, sums the compressed signals of each node, and finally iterates the speech enhancement at each node using the sum of its local signal and the compressed signals of other nodes. Although this technique can achieve distributed speech enhancement in any topology, the residual noise in the final enhanced signal is still significant because its core algorithm is multi-channel Wiener filtering.

[0006] In addition, existing technologies based on prior information from the desired sound source steering matrix have proposed a distributed adaptive node-specific speech enhancement technique based on generalized eigenvalue decomposition. This technique can achieve good performance even at low signal-to-noise ratios. Although this technique can achieve certain speech enhancement performance, it requires prior information, which is not easy to implement in practical applications.

[0007] Most existing distributed speech enhancement technologies focus on distributed data fusion and rarely extend to the speech enhancement techniques used within the WASN. Distributed data fusion techniques include methods such as average consensus, diffusion, and gossip, as well as methods that compress the signal before data fusion. The speech enhancement techniques used within WASNs, however, are primarily based on Wiener filtering, minimum variance distortionless response, linearly constrained minimum variance, and the generalized sidelobe canceller. To expand the diversity of speech enhancement techniques within WASNs while achieving good noise cancellation performance, this invention presents a maximum likelihood-based distributed speech enhancement system, which uses the variance of the output signal to weight the correlation matrix and perform speech enhancement.

Summary of the Invention

[0008] To address the aforementioned problems, this invention provides a distributed speech enhancement system based on maximum likelihood.

[0009] To achieve the above objectives, the present invention employs the following technical solutions:

[0010] A distributed speech enhancement system based on maximum likelihood includes a discrete Fourier transform module, a speech activity detection module, a steering vector estimation module, a data compression module, a result output module, and an inverse discrete Fourier transform module.

[0011] The Discrete Fourier Transform module first performs frame-by-frame windowing processing on the E-dimensional signal received by J nodes in the wireless acoustic sensor network, and then performs Discrete Fourier Transform on each frame of the windowed signal to obtain a discrete spectrum signal.

[0012] The speech activity detection module receives the discrete spectrum signal transmitted by the discrete Fourier transform module, exploits the characteristic that the first second of the signal mostly contains no speech, and combines this with the logarithmic spectral distance to perform speech activity detection on the discrete spectrum signal, thereby obtaining the speech activity detection result.

[0013] The steering vector estimation module estimates the noisy speech correlation matrix and the noise correlation matrix based on the speech activity detection results obtained by the speech activity detection module. Then, it performs generalized eigenvalue decomposition on the estimated noisy speech correlation matrix and the noise correlation matrix, and finally estimates the steering vector using the eigenvector corresponding to the largest eigenvalue.

[0014] The data compression module compresses the discrete spectrum signal transmitted by the discrete Fourier transform module and the steering vector transmitted by the steering vector estimation module using compression vectors to obtain compressed signals.

[0015] The result output module receives the compressed discrete spectrum signal sent by the data compression module, and each node sums the compressed signals of all nodes to obtain the enhanced speech signal.

[0016] The inverse discrete Fourier transform module receives the enhanced speech signal sent by the result output module, performs an inverse discrete Fourier transform on the signal to obtain the time-domain output speech signal of the current frame, and overlaps and adds the time-domain output speech signals of each frame to obtain the final output signal.

[0017] Furthermore, this system also includes a signal construction module, a weighted correlation matrix estimation module, and a filter update module;

[0018] The signal construction module receives the compressed signal sent by the data compression module, and each node constructs its local signal using its own uncompressed signal and the compressed signals of all other nodes, to obtain the constructed local signal and the local steering vector.

[0019] The weighted correlation matrix estimation module receives the enhanced speech signal sent by the result output module and the constructed local signal sent by the signal construction module, and estimates the weighted correlation matrix.

[0020] The filter update module receives the weighted correlation matrix estimated by the weighted correlation matrix estimation module, updates the filter using the local steering vector constructed by the signal construction module, and transmits the updated filter to the data compression module.

[0021] Furthermore, the estimation of the weighted correlation matrix is performed in the following manner:

[0022] First, the variance of the enhanced speech signal is expressed as:

[0023]

[0024] where $i$ is the iteration index, $d$ is the enhanced speech signal, and $|\cdot|^2$ denotes the squared absolute value;

[0025] Then, the weighted correlation matrix of the noisy speech in the current frame is estimated as follows:

[0026]

[0027] where α is the forgetting factor; the remaining terms denote the estimate of the weighted correlation matrix of the noisy speech in the previous frame and the constructed local signal; $(\cdot)^H$ denotes the conjugate transpose of a vector or matrix; ζ is a very small positive number; and max(a,b) selects the larger of a and b. The estimate of the weighted correlation matrix is updated for each frame of the signal by the above formula.

[0028] Furthermore, the filter update is performed using the following expression:

[0029]

[0030] The terms of this expression denote: the filter corresponding to the constructed local signal; the filter corresponding to the uncompressed $E_j$-dimensional signal of the node; the filter corresponding to the $(J-1)$-dimensional compressed signal; the weighted correlation matrix of the noisy speech; and the constructed local steering vector. $(\cdot)^{-1}$ denotes matrix inversion;

[0031] In each iteration, only the filter of node j is updated according to the above formula. After the update, this node sends ... to the remaining nodes q, and the filters on the remaining nodes are then updated as follows:

[0032] .

[0033] Compared with the prior art, the present invention has the following advantages:

[0034] This invention provides a maximum likelihood-based distributed speech enhancement system, a distributed speech enhancement technology applicable to wireless acoustic sensor networks without a data processing center. It estimates the weighted correlation matrix by using the local signals constructed by each node and the variance of the output results, and then updates the filter by combining the estimated weighted correlation matrix with the constructed local steering vector, thereby completing distributed speech enhancement. This invention expands the diversity of distributed speech enhancement techniques in wireless acoustic sensor networks and achieves excellent noise cancellation performance. It uses the compressor at each node to compress the received signal, and sums the compressed signals from all nodes to obtain the final output.

Attached Figure Description

[0035] Figure 1 is a block diagram illustrating the principle of the maximum likelihood-based distributed speech enhancement system of the present invention.

[0036] Figure 2 is a schematic diagram of the wireless acoustic sensor network in this invention.

[0037] Figure 3 shows the STOI values of each distributed speech enhancement technique under different input signal-to-noise ratios in the embodiment of the present invention.

[0038] Figure 4 shows the PESQ values of each distributed speech enhancement technique under different input signal-to-noise ratios in the embodiment of the present invention.

[0039] Figure 5 shows the ViSQOL values of each distributed speech enhancement technique under different input signal-to-noise ratios in the embodiment of the present invention.

[0040] Figure 6 shows the WER values of each distributed speech enhancement technique under different input signal-to-noise ratios in the embodiment of the present invention.

Detailed Implementation

[0041] To make the technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention:

[0042] As shown in Figure 1, the distributed speech enhancement system based on maximum likelihood includes a Discrete Fourier Transform (DFT) module 1, a speech activity detection module 2, a steering vector estimation module 3, a data compression module 4, a result output module 5, a signal construction module 6, a weighted correlation matrix estimation module 7, a filter update module 8, and an Inverse Discrete Fourier Transform (IDFT) module 9.

[0043] The Discrete Fourier Transform module 1 first performs frame-by-frame windowing processing on the E-dimensional signals received by J nodes in the wireless acoustic sensor network, and then performs Discrete Fourier Transform on each frame of the windowed signal to obtain the discrete spectrum signal.

[0044] The working principle of Discrete Fourier Transform module 1 is as follows: Suppose the WASN has a total of J nodes, where node j has $E_j$ microphones, and the e-th signal received by node j is denoted $y_{j,e}(n)$. Frame-by-frame windowing is applied to each signal, followed by a DFT of each frame. In this embodiment, during verification, the audio sampling frequency is $f_s$ = 16 kHz, the window function is the Hanning window, the frame shift is 50%, and the data length per frame is M = 256 points. The expression for the Hanning window is as follows:

[0045] (1)

[0046] Based on the Hanning window expression, the windowed signal can be obtained as follows:

[0047] (2)

[0048] Then, a DFT is performed on each windowed frame of the signal, and the resulting discrete spectrum is represented as follows:

[0049] (3)

[0050] Where k represents the frequency index and l represents the time frame index.
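The framing, windowing and DFT steps described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `stft_frames` is hypothetical, `np.hanning` is assumed to stand in for the Hanning window of expression (1), and samples that do not fill a complete frame at the end of the signal are simply dropped.

```python
import numpy as np

def stft_frames(signal, frame_len=256, overlap=0.5):
    """Split a 1-D signal into windowed frames and return their DFTs.

    Returns a complex array of shape (num_frames, frame_len), i.e. Y(k, l)
    with the frame index l along the first axis.
    """
    hop = int(frame_len * (1.0 - overlap))       # 128 samples for a 50% shift
    # Assumption: NumPy's symmetric Hann window approximates expression (1).
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((num_frames, frame_len), dtype=complex)
    for l in range(num_frames):
        frame = signal[l * hop : l * hop + frame_len]
        spectra[l] = np.fft.fft(window * frame)  # DFT of the windowed frame
    return spectra

fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)   # 1 s test tone at 1 kHz
Y = stft_frames(x)
```

With these settings the bin spacing is 16000 / 256 = 62.5 Hz, so the 1 kHz test tone falls on bin 16.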

[0051] All signals $Y_{j,e}(k,l)$ received by node j are stacked into a column vector, which is represented as:

[0052] $y_j(k,l) = [Y_{j,1}(k,l), Y_{j,2}(k,l), \ldots, Y_{j,E_j}(k,l)]^T$ (4)

[0053] For ease of representation, the indices k and l are omitted below. All operations in this embodiment are frequency-independent, meaning that the same operation is applied at each frequency bin of each frame. In addition, $y_j = x_j + v_j$, where $x_j = h_j s$ is the reverberant speech component, $v_j$ is additive noise, and $h_j$ is the room impulse response from the source signal $s$ to the j-th node.

[0054] Speech activity detection module 2: Receives the discrete spectrum signal transmitted by the discrete Fourier transform module 1, exploits the characteristic that the first second of the signal mostly contains no speech, and combines this with the logarithmic spectral distance to perform speech activity detection on the discrete spectrum signal, thereby obtaining the speech activity detection result.

[0055] The working principle of the speech activity detection module 2 is as follows: Speech activity detection is performed on the discrete spectrum of each signal obtained from the discrete Fourier transform module 1. Because the first second of a speech signal mostly contains no speech, let the number of initial non-speech frames be NIS, where NIS = $f_s$ / (50% × M) − 1 = 124. The average noise spectrum estimated from these NIS frames is:

[0056] (5)

[0057] Equation (5) represents the summation and averaging of the corresponding frequency points of each frame of signal. Furthermore, the logarithmic spectrum estimate of the noisy frame is expressed as:

[0058] (6)

[0059] where $|\cdot|$ denotes the modulus (absolute value). Then, the logarithmic spectrum of each frame of the signal is calculated, expressed as follows:

[0060] (7)

[0061] From equations (6) and (7), the logarithmic spectral distance between the signal and the noise signal in each frame can be obtained. The formula for the logarithmic spectral distance is as follows:

[0062] (8)

[0063] In summary, the speech activity detection method is as follows. First, set a non-speech segment counter with an initial value of 125, and set the logarithmic spectral distance threshold to 3. Then, calculate the logarithmic spectral distance $d_{spec}$ between each frame of the signal and the noise frame. If $d_{spec}$ is less than the threshold, the frame is a non-speech frame and the counter is incremented by 1. Otherwise, the frame contains speech and the counter is reset to zero regardless of its value. Finally, if the counter's value just before a reset is less than the minimum non-speech length, all frames from the previous reset up to this reset are considered non-speech frames. Here, the minimum non-speech length is set to 10.

[0064] In this embodiment, to reduce speech distortion during verification, a frame is considered a noise frame only if the speech activity detection result of each signal is a noise frame; otherwise, it is considered a speech frame.
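The core of the detection rule can be sketched as follows. Expressions (5) to (8) are not reproduced in this text, so the distance used here (the mean over frequency of the positive part of 20·log10 of the magnitude ratio) is an assumption, as are all function names. The counter and minimum-length post-processing of paragraph [0063] is deliberately left out.

```python
import numpy as np

def initial_noise_frames(fs=16000, frame_len=256, shift=0.5):
    """Number of leading frames (NIS) assumed to contain no speech."""
    return int(fs / (shift * frame_len)) - 1

def log_spectral_distance(mag, noise_mag, eps=1e-12):
    # Assumed form of the distance; the patent's expression (8) may differ.
    diff = 20.0 * (np.log10(mag + eps) - np.log10(noise_mag + eps))
    return float(np.mean(np.maximum(diff, 0.0)))

def noise_flags(spectra, nis, threshold=3.0):
    """True where a frame's distance to the noise estimate is below threshold."""
    mags = np.abs(spectra)
    noise_mag = mags[:nis].mean(axis=0)          # average noise spectrum
    return np.array([log_spectral_distance(m, noise_mag) < threshold
                     for m in mags])

def combine_channels(flags_per_channel):
    """A frame counts as noise only if every channel labels it as noise."""
    return np.all(np.stack(flags_per_channel), axis=0)

nis = initial_noise_frames()
quiet = np.ones((nis + 2, 8), dtype=complex)     # toy flat "noise" spectra
loud = 100.0 * np.ones((4, 8), dtype=complex)    # toy "speech" frames
flags = noise_flags(np.vstack([quiet, loud]), nis)
```

In this toy input the loud frames sit 40 dB above the noise estimate, well over the threshold of 3, so only they are labelled as speech.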

[0065] Steering vector estimation module 3: Based on the speech activity detection results obtained from speech activity detection module 2, the noisy speech correlation matrix and the noise correlation matrix are estimated separately. Then, generalized eigenvalue decomposition is performed on the estimated noisy speech correlation matrix and noise correlation matrix. Finally, the steering vector is estimated using the eigenvector corresponding to the largest eigenvalue.

[0066] The working principle of the steering vector estimation module 3 is as follows: based on the speech activity detection results obtained from the speech activity detection module 2, the noisy speech correlation matrix and the noise correlation matrix are estimated separately. For speech frames, the noisy speech correlation matrix is estimated as follows:

[0067] (9)

[0068] where the parameter α = 0.997, $(\cdot)^H$ denotes the conjugate transpose of a vector or matrix, $y$ is the vector obtained by stacking the $E_j$-channel signals $y_j$ of all J nodes, with dimension $E = \sum_j E_j$, and the remaining term is the noisy speech correlation matrix estimated in the previous frame. The noise $v$ is represented in the same way as $y$, so for non-speech frames the noise correlation matrix is estimated as:

[0069] (10)

[0070] Generalized eigenvalue decomposition is performed on the estimated noisy speech correlation matrix and noise correlation matrix:

[0071] (11)

[0072] The matrices $V_{ec}$ and $G_{ei}$ obtained from the decomposition are the eigenvector matrix and the eigenvalue matrix, respectively. Let φ be the eigenvector corresponding to the largest eigenvalue. Then the steering vector estimate is:

[0073] (12)

[0074] where the E-dimensional vector $h$ contains the steering vectors $h_j$ of all nodes.
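The estimation chain above can be sketched in Python. Because expressions (9) to (12) are missing from this text, the sketch assumes a standard exponentially weighted recursion for the correlation matrices and the common choice of taking the steering vector proportional to the noise correlation matrix times the principal generalized eigenvector, normalised to the first channel. These forms and all function names are assumptions, not the patent's confirmed formulas.

```python
import numpy as np

def update_correlation(R_prev, x, alpha=0.997):
    # Assumed recursion: R(l) = alpha * R(l-1) + (1 - alpha) * x x^H
    return alpha * R_prev + (1.0 - alpha) * np.outer(x, x.conj())

def principal_generalized_eigvec(R_yy, R_vv):
    """Eigenvector phi of R_yy phi = lambda R_vv phi for the largest lambda."""
    L = np.linalg.cholesky(R_vv)                 # R_vv = L L^H
    L_inv = np.linalg.inv(L)
    A = L_inv @ R_yy @ L_inv.conj().T            # whitened Hermitian problem
    eigvals, eigvecs = np.linalg.eigh(A)         # eigenvalues in ascending order
    return L_inv.conj().T @ eigvecs[:, -1]       # map back to the original basis

def estimate_steering_vector(R_yy, R_vv):
    phi = principal_generalized_eigvec(R_yy, R_vv)
    h = R_vv @ phi                               # assumption: h proportional to R_vv phi
    return h / h[0]                              # assumption: first channel as reference

# Toy check with a rank-one speech component plus known noise statistics
h_true = np.array([1.0, 0.5 + 0.5j, -0.3j])
R_vv = np.array([[2.0, 0.1, 0.0], [0.1, 1.5, 0.2], [0.0, 0.2, 1.0]], dtype=complex)
R_yy = 4.0 * np.outer(h_true, h_true.conj()) + R_vv
h_est = estimate_steering_vector(R_yy, R_vv)
```

For a rank-one speech model the principal generalized eigenvector satisfies $R_{vv}\varphi \propto h$, which is why the toy example recovers `h_true` up to the chosen normalisation.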

[0075] Data compression module 4: Compresses the discrete spectrum signal transmitted by the discrete Fourier transform module 1 and the steering vector transmitted by the steering vector estimation module 3 using compression vectors.

[0076] The working principle of data compression module 4 is as follows: the compression vector $w_j^i$ is used to compress the signal $y_j$ obtained from Discrete Fourier Transform module 1 and the steering vector estimate $h_j$ obtained from steering vector estimation module 3, respectively:

[0077] $z_j^i = (w_j^i)^H y_j$ (13)

[0078] $\vartheta_j^i = (w_j^i)^H h_j$ (14)

[0079] where $z_j^i$ and $\vartheta_j^i$ are both one-dimensional compressed signals, and $w_j^i$ is also the portion of the centralized filter corresponding to the data of node j. In addition, the compression vector needs to be initialized; in this embodiment, during verification, the elements of the compression vector are initialized to random numbers uniformly distributed on the unit interval.

[0080] Note that every superscript $i$ denotes the iteration index. The data processed in the i-th iteration is the i-th frame of the signal. In this embodiment, $i$ is initialized to 1 during verification, meaning that processing starts from the first frame of data. Furthermore, this data compression operation is performed at every node in each iteration.

[0081] Result output module 5: Receives the compressed discrete spectrum signal sent by data compression module 4. Each node sums the compressed signals of all nodes to obtain the enhanced speech signal.

[0082] The working principle of the result output module 5 is as follows: it receives the compressed signals $z_j^i$ sent by the data compression module 4, and each node obtains the enhanced speech signal as:

[0083] $d^i = \sum_{j=1}^{J} z_j^i$ (15)
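A minimal Python sketch of the compression and summation steps for one frequency bin is given below. The per-node data are random toy values, the inner product is assumed to use the conjugate transpose of the compression vector, and names such as `compress` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
J, E_j = 4, 4                                    # 4 nodes, 4 microphones each

# Toy per-node spectra and steering vectors for a single frequency bin
y = [rng.standard_normal(E_j) + 1j * rng.standard_normal(E_j) for _ in range(J)]
h = [rng.standard_normal(E_j) + 1j * rng.standard_normal(E_j) for _ in range(J)]
# Compression vectors initialised uniformly on the unit interval
w = [rng.uniform(0.0, 1.0, E_j).astype(complex) for _ in range(J)]

def compress(w_j, y_j, h_j):
    """Return (z_j, theta_j) = (w_j^H y_j, w_j^H h_j)."""
    # np.vdot conjugates its first argument, giving the w^H (.) products
    return np.vdot(w_j, y_j), np.vdot(w_j, h_j)

z, theta = zip(*(compress(w[j], y[j], h[j]) for j in range(J)))
d = sum(z)          # enhanced output: sum of all nodes' compressed signals
```

Summing the per-node outputs equals applying the stacked vector of all $w_j$ to the stacked vector of all $y_j$, which is why each $w_j$ can be read as one portion of a network-wide filter.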

[0084] Signal construction module 6: Receives the compressed signal sent by data compression module 4, and each node constructs its local signal using its own uncompressed signal and the compressed signals of all other nodes.

[0085] The working principle of signal construction module 6 is as follows: it receives the compressed signals $z_j^i$ and $\vartheta_j^i$ sent by data compression module 4. The compressed signals $z_j^i$ and $\vartheta_j^i$ of all nodes are stacked as follows:

[0086] $z^i = [z_1^i, z_2^i, \ldots, z_J^i]^T$ (16)

[0087] $\vartheta^i = [\vartheta_1^i, \vartheta_2^i, \ldots, \vartheta_J^i]^T$ (17)

[0088] Both $z^i$ and $\vartheta^i$ have dimension J. At node j, the node's own compressed signals are removed from these two variables, giving the $(J-1)$-dimensional vectors $z_{-j}^i$ and $\vartheta_{-j}^i$. Node j then uses its uncompressed $E_j$-dimensional signal together with the $(J-1)$-dimensional compressed signal vector and the $(J-1)$-dimensional compressed steering vector to construct its local signal and local steering vector. The constructed local signal and local steering vector are represented as follows:

[0089] (18)

[0090] (19)

[0091] The dimension of the signal constructed by the above formulas is $E_j + J - 1$. During the verification of this patent, the above signal construction was performed at each node in each iteration.
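The construction step can be sketched as follows. The stacking order (a node's own uncompressed channels first, then the other nodes' compressed values) is an assumption, since expressions (18) and (19) are not reproduced here. The function name `construct_local` and the toy values are hypothetical.

```python
import numpy as np

def construct_local(j, y_j, h_j, z, theta):
    """Build node j's local signal and local steering vector.

    j is a zero-based node index; z and theta hold the J compressed values.
    """
    z_minus_j = np.delete(z, j)                  # drop node j's own compressed signal
    theta_minus_j = np.delete(theta, j)          # drop node j's own compressed steering entry
    y_local = np.concatenate([y_j, z_minus_j])   # dimension E_j + J - 1
    h_local = np.concatenate([h_j, theta_minus_j])
    return y_local, h_local

J, E_j = 4, 4
z = np.array([1 + 1j, 2, 3, 4], dtype=complex)
theta = np.array([0.1, 0.2, 0.3, 0.4], dtype=complex)
y_1 = np.arange(E_j, dtype=complex)              # toy uncompressed signal of one node
h_1 = np.ones(E_j, dtype=complex)                # toy steering vector of the same node
y_local, h_local = construct_local(1, y_1, h_1, z, theta)
```

With four nodes of four microphones each, every node works with a 7-dimensional local signal instead of the 16-dimensional network-wide one.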

[0092] Weighted correlation matrix estimation module 7: Receives the enhanced speech signal sent by the result output module 5 and the constructed local signal sent by the signal construction module 6, and estimates the weighted correlation matrix.

[0093] The working principle of the weighted correlation matrix estimation module 7 is as follows: it uses the enhanced speech signal sent by the result output module 5 and the constructed local signal sent by the signal construction module 6 to estimate the weighted correlation matrix. First, the variance of the enhanced speech signal sent by the result output module 5 is expressed as:

[0094] (20)

[0095] where $|\cdot|^2$ denotes the squared absolute value;

[0096] Then, the weighted correlation matrix of the noisy speech in the current frame is estimated as follows:

[0097] (21)

[0098] where the remaining terms denote the estimate of the weighted correlation matrix of the noisy speech in the previous frame and the constructed local signal, ζ is a very small positive number, and max(a,b) selects the larger of a and b. The estimate of the weighted correlation matrix is updated for each frame of the signal using the above formula. In this embodiment, during verification, the weighted correlation matrix estimation was performed at each node in each iteration, with the parameter ζ set to $10^{-5}$.
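One plausible reading of the update described above is sketched below. Expressions (20) and (21) are missing from this text, so the recursion (forgetting factor α, outer product of the local signal divided by the floored output variance) is an assumed form rather than the patent's exact formula. The default α = 0.997 is borrowed from paragraph [0068], and the function name is hypothetical.

```python
import numpy as np

def update_weighted_correlation(R_prev, y_local, d, alpha=0.997, zeta=1e-5):
    """Assumed update: R = alpha * R_prev + (1 - alpha) * y y^H / max(|d|^2, zeta)."""
    variance = abs(d) ** 2                       # variance of the enhanced output
    weight = max(variance, zeta)                 # floor avoids division by ~0
    outer = np.outer(y_local, np.conj(y_local))  # y y^H
    return alpha * R_prev + (1.0 - alpha) * outer / weight

R0 = np.zeros((2, 2), dtype=complex)
y_demo = np.array([1.0, 1j])
# Larger output variance gives the frame a smaller weight
R_loud = update_weighted_correlation(R0, y_demo, d=2.0, alpha=0.5)
# With d = 0 the floor zeta takes over
R_quiet = update_weighted_correlation(R0, y_demo, d=0.0, alpha=0.5, zeta=0.25)
```

The division by the output variance is what distinguishes this matrix from the plain correlation matrix: frames in which the enhanced output is strong contribute less to the estimate.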

[0099] Filter update module 8: Receives the weighted correlation matrix estimated by the weighted correlation matrix estimation module 7, and updates the filter using the local steering vector constructed by the signal construction module 6.

[0100] The working principle of filter update module 8 is as follows: it receives the weighted correlation matrix estimated by weighted correlation matrix estimation module 7 and uses the local steering vector constructed by signal construction module 6 to update the filter:

[0101] (22)

[0102] The terms of this expression denote: the filter corresponding to the constructed local signal; the filter corresponding to the uncompressed $E_j$-dimensional signal of the node; the filter corresponding to the $(J-1)$-dimensional compressed signal; and the weighted correlation matrix of the noisy speech. $(\cdot)^{-1}$ denotes matrix inversion.

[0103] In each iteration, only the filter of node j is updated according to the above formula. After the update, this node sends ... to the remaining nodes q, and the filters on the remaining nodes are then updated as follows:

[0104] (23)
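A hedged sketch of the local filter update follows. Expression (22) is not reproduced here, so the code assumes the familiar distortionless form $R^{-1}h / (h^H R^{-1} h)$ built from the weighted correlation matrix and the local steering vector. The split into a local part and a part acting on the compressed signals follows paragraph [0102]. The update of the remaining nodes in expression (23) is not sketched because its form cannot be recovered from this text.

```python
import numpy as np

def update_filter(R_weighted, h_local, num_local):
    """Assumed update w = R^{-1} h / (h^H R^{-1} h), split into two parts."""
    R_inv_h = np.linalg.solve(R_weighted, h_local)   # R^{-1} h without explicit inverse
    w_full = R_inv_h / np.vdot(h_local, R_inv_h)     # divide by h^H R^{-1} h
    # First num_local taps act on the node's own microphones,
    # the rest act on the compressed signals from the other nodes.
    return w_full[:num_local], w_full[num_local:]

R = np.array([[2.0, 0.5j, 0.0],
              [-0.5j, 1.0, 0.2],
              [0.0, 0.2, 1.5]], dtype=complex)       # toy Hermitian matrix
h_local = np.array([1.0, 0.5 - 0.5j, 0.3j])
w_local, g_remote = update_filter(R, h_local, num_local=2)
response = np.vdot(np.concatenate([w_local, g_remote]), h_local)   # w^H h
```

Under this assumed form the filter passes the steering direction with unit gain, so `response` equals 1 up to rounding.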

[0105] Inverse Discrete Fourier Transform module 9: Receives the enhanced speech signal sent by the result output module 5, performs an inverse discrete Fourier transform on the signal to obtain the time-domain output speech signal of the current frame, and overlaps and adds the time-domain output speech signals of each frame to obtain the final output signal.

[0106] The working principle of the Inverse Discrete Fourier Transform module 9 is as follows: after each iteration, the enhanced speech signal $d^i$ that the result output module 5 delivers to each node is converted to the time domain by the IDFT. The IDFT formula is as follows:

[0107] (24)

[0108] Here, i and l have the same meaning, that is, the i-th iteration calculates the l-th frame signal. Therefore, when they appear together below, the iteration index i will be omitted.

[0109] Because this invention performs frame-by-frame windowing processing on each signal in the Discrete Fourier Transform module 1, and the frame shift is 50%, when the first frame of output speech signal is obtained, it needs to be overlapped and added with the second frame of output speech signal. The overlap portion accounts for 50%, and the specific formula is as follows:

[0110] (25)

[0111] where $\lfloor a \rfloor$ is the rounding-down operation, denoting the largest integer not exceeding a number a.
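The IDFT and overlap-add synthesis can be sketched as follows. The sketch assumes that no synthesis window is applied and that each frame's spectrum is conjugate-symmetric, so the real part of the inverse transform is the time-domain frame. The function name and the toy input are illustrative and do not reproduce expression (25).

```python
import numpy as np

def overlap_add(spectra, hop):
    """Inverse-transform each frame and overlap-add with the given hop size."""
    frames = np.fft.ifft(spectra, axis=1).real       # time-domain frames
    num_frames, frame_len = frames.shape
    out = np.zeros(hop * (num_frames - 1) + frame_len)
    for l in range(num_frames):
        out[l * hop : l * hop + frame_len] += frames[l]
    return out

# Three frames of ones, frame length 4, 50% frame shift (hop of 2 samples)
demo_spectra = np.fft.fft(np.ones((3, 4)), axis=1)
signal_out = overlap_add(demo_spectra, hop=2)
```

In the toy example the samples covered by two frames sum to 2 while the first and last half-frames stay at 1, which shows the 50% overlap directly.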

[0112] This invention discloses a distributed speech enhancement system based on maximum likelihood. To verify the practicality of the technology proposed in this patent, the well-known image-source acoustic environment simulation technique is used to simulate a closed room measuring 5 m × 5 m × 3 m with a reverberation time $T_{60}$ of 0.3 s. The room contains one speaker and four randomly distributed nodes. Each node contains a linear array of $E_j = 4$ microphones, with a distance of 3 cm between the microphones. Figure 2 shows a two-dimensional schematic diagram of the wireless acoustic sensor network, giving the two-dimensional coordinates of the speaker and the four nodes. The speaker's height is 1.7 m, and the height of each of the four nodes is 1 m.

[0113] In verifying this invention, the connection topology between the four nodes is fully connected. The simulated speaker speech is taken from the TIMIT database. Five sentences are randomly selected from this database and concatenated to form a 19-second speech signal as the source signal, with a sampling frequency of 16 kHz. The background noise consists of white noise and babble noise from the NOISEX-92 database, downsampled to 16 kHz. Finally, the input signal-to-noise ratio of the noisy speech signal received by each node is set to -5 dB, 0 dB, 5 dB, 10 dB, and 15 dB, respectively.
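Setting a prescribed input signal-to-noise ratio is usually done by scaling the noise before adding it to the speech. The sketch below shows that standard calculation with synthetic signals. It uses average power over the whole signal, which is an assumption, since the text does not state how the SNR was measured. The function name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power matches snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(speech_power / (gain^2 * noise_power)) = snr_db for gain
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                  # stand-in for a speech signal
noise = rng.standard_normal(16000)                   # stand-in for background noise
mixtures = {snr: mix_at_snr(speech, noise, snr)[0] for snr in (-5, 0, 5, 10, 15)}
```

Each entry of `mixtures` is a noisy signal at one of the five input SNRs listed above.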

[0114] To verify the maximum likelihood-based distributed speech enhancement technology proposed in this invention, it was compared with the distributed speech enhancement technologies proposed in references [1] and [2]. When the background noise is babble noise, three evaluation indicators were used to evaluate the three distributed speech enhancement technologies: Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and Virtual Speech Quality Objective Listener (ViSQOL). STOI values range from 0 to 1, PESQ values from -0.5 to 4.5, and ViSQOL values from 1 to 5. The larger the values of these three indicators, the higher the speech quality. When the background noise is white noise, the Word Error Rate (WER) was used as the evaluation indicator. Google Speech Recognition version 3.9.0 was used to perform speech recognition on the signals enhanced by the three distributed speech enhancement technologies.
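For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of reference words. The following is a minimal sketch in plain Python; the function name `word_error_rate` and the example sentences are made up for illustration and are not taken from the patent's experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
wer_example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many more words than the reference contains.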

[0115] Figures 3, 4, and 5 give the evaluation index values of each distributed speech enhancement technique under different input signal-to-noise ratios. From these three figures, it can be seen that, regardless of the evaluation index, the maximum likelihood-based distributed speech enhancement technique proposed in this invention has the best performance under babble background noise. The technique proposed in reference [1] also performs well, while the technique in reference [2] cannot achieve high speech enhancement performance under babble background noise.

[0116] Figure 6 gives the WER values of each distributed speech enhancement technique under different input signal-to-noise ratios, with the WER of the clean speech signal being 2.06%. From Figure 6, it can be seen that when the input signal-to-noise ratio is low, the maximum likelihood-based distributed speech enhancement technology proposed in this invention can achieve a low error rate.

[0117] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A distributed speech enhancement system based on maximum likelihood, characterized in that it includes a Discrete Fourier Transform module, a speech activity detection module, a steering vector estimation module, a data compression module, a result output module, and an Inverse Discrete Fourier Transform module;
the Discrete Fourier Transform module first performs frame-by-frame windowing on the E-dimensional signal received by the J nodes in the wireless acoustic sensor network, and then performs a Discrete Fourier Transform on each windowed frame to obtain a discrete spectrum signal;
the speech activity detection module receives the discrete spectrum signal, exploits the characteristic that the first second of the signal mostly contains no speech, and combines this with the logarithmic spectral distance to perform speech activity detection on the discrete spectrum signal, thereby obtaining the speech activity detection result;
the steering vector estimation module, based on the speech activity detection result, estimates the noisy speech correlation matrix and the noise correlation matrix respectively, then performs generalized eigenvalue decomposition on the two estimated matrices, and finally estimates the steering vector using the eigenvector corresponding to the largest eigenvalue;
the data compression module compresses the discrete spectrum signal and the steering vector transmitted by the steering vector estimation module using a compression vector to obtain a compressed signal;
the result output module receives the compressed discrete spectrum signal, and each node sums the compressed signals of all nodes to obtain the enhanced speech signal;
the Inverse Discrete Fourier Transform module receives the enhanced speech signal, performs an Inverse Discrete Fourier Transform on it to obtain the time-domain output speech signal of the current frame, and overlap-adds the time-domain output speech signals of all frames to obtain the final output signal;
the system also includes a signal construction module, a weighted correlation matrix estimation module, and a filter update module;
the signal construction module receives the compressed signal, and each node constructs its local signal from its own uncompressed signal and the compressed signals of all other nodes, yielding the constructed local signal and the local steering vector;
the weighted correlation matrix estimation module receives the enhanced speech signal and the constructed local signal, and estimates the weighted correlation matrix;
the filter update module receives the estimated weighted correlation matrix, updates the filter using the local steering vector, and then transmits the updated filter to the data compression module;
the weighted correlation matrix is estimated using the following method: first, the variance of the enhanced speech signal is expressed as: ; where i represents the iteration index, d represents the enhanced speech signal, and |·|^2 represents the square of the absolute value; then, the weighted correlation matrix of the noisy speech in the current frame is estimated as follows: ; where α represents the forgetting factor, which is a parameter; the remaining symbols denote, in order, the estimated weighted correlation matrix of the noisy speech in the previous frame and the constructed local signal; (·)^H represents the conjugate transpose of a vector or matrix; ζ is set to 10^-5; and max(a, b) selects the larger of a and b. The estimate of the weighted correlation matrix is updated by the above formula for each frame of the signal.
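The steering vector estimation step in claim 1 (generalized eigenvalue decomposition followed by selecting the eigenvector of the largest eigenvalue) can be sketched as below. This is an illustration under stated assumptions, not the patent's exact procedure: the function name `estimate_steering_vector` is hypothetical, the mapping from eigenvector to steering vector (multiplying by the noise correlation matrix and normalizing to a reference microphone) is a common covariance-whitening convention that the claim does not spell out, and synthetic matrices stand in for correlation matrices averaged over frames selected by speech activity detection.

```python
import numpy as np

def estimate_steering_vector(R_y: np.ndarray, R_n: np.ndarray, ref: int = 0) -> np.ndarray:
    """Estimate a steering vector from noisy-speech and noise correlation matrices."""
    # Generalized eigenproblem R_y v = lambda R_n v, solved as (R_n^{-1} R_y) v = lambda v.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(R_n, R_y))
    v = eigvecs[:, np.argmax(eigvals.real)]   # eigenvector of the largest eigenvalue
    a = R_n @ v                               # assumed mapping back to the signal domain
    return a / a[ref]                         # normalize to the reference microphone

# Synthetic check with E_j = 4 microphones and a known steering vector.
rng = np.random.default_rng(1)
E = 4
a_true = rng.standard_normal(E) + 1j * rng.standard_normal(E)
a_true = a_true / a_true[0]
B = rng.standard_normal((E, E)) + 1j * rng.standard_normal((E, E))
R_n = B @ B.conj().T + np.eye(E)                        # Hermitian positive definite noise matrix
R_y = 2.0 * np.outer(a_true, a_true.conj()) + R_n       # rank-one speech plus noise

a_est = estimate_steering_vector(R_y, R_n)
```

For a rank-one speech component, the principal generalized eigenvector is proportional to the inverse noise matrix applied to the steering vector, which is why multiplying by the noise matrix recovers the steering vector up to scale in this synthetic case.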

2. A distributed speech enhancement system based on maximum likelihood as described in claim 1, characterized in that the filter is updated using the following expression: ; where the symbols denote, in order: the filter corresponding to the constructed local signal; the filter corresponding to the uncompressed E_j-dimensional signal of node j; the filter corresponding to the (J-1)-dimensional compressed signal; the weighted correlation matrix of the noisy speech; (·)^-1, the matrix inverse; and the constructed local steering vector. In each iteration, only the filter of node j is updated according to the above expression. After the update, this node sends ... to the remaining nodes q, and the filters on the remaining nodes are then updated as follows: .
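The expressions in claims 1 and 2 are not reproduced in this text, so the following Python sketch shows only one plausible form that is consistent with the quantities the claims name. Everything specific here is an assumption: the variance floor max(|d|^2, ζ), the recursive update with forgetting factor α normalized by that variance, the MVDR-style filter built from the inverse weighted correlation matrix and the local steering vector, the value α = 0.95, and the function names `update_weighted_correlation` and `update_filter`.

```python
import numpy as np

ZETA = 1e-5  # floor value zeta stated in claim 1

def update_weighted_correlation(R_prev, y_local, d, alpha=0.95):
    """Assumed form: R = alpha * R_prev + (1 - alpha) * y y^H / max(|d|^2, zeta)."""
    variance = max(abs(d) ** 2, ZETA)
    return alpha * R_prev + (1.0 - alpha) * np.outer(y_local, y_local.conj()) / variance

def update_filter(R, a_local):
    """Assumed MVDR-style form: w = R^{-1} a / (a^H R^{-1} a)."""
    Rinv_a = np.linalg.solve(R, a_local)
    return Rinv_a / (a_local.conj() @ Rinv_a)

# Local signal dimension: E_j = 4 uncompressed channels plus J - 1 = 3 compressed ones.
rng = np.random.default_rng(2)
dim = 4 + 3
a_local = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
R = np.eye(dim, dtype=complex)
w = update_filter(R, a_local)

for _ in range(50):
    # Random data stands in for one frequency bin of the constructed local signal.
    y_local = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
    d = w.conj() @ y_local                     # enhanced output for this frame
    R = update_weighted_correlation(R, y_local, d)
    w = update_filter(R, a_local)
```

Under these assumptions the filter keeps unit response along the local steering vector, and the floor ζ prevents division by a near-zero variance in frames where the enhanced output is almost silent.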