Single-channel speech enhancement method and device based on mean-reverting Schrödinger bridge
The single-channel speech enhancement method using the mean-inverted Schrödinger bridge utilizes posterior mean information to guide the inverse generation process, solving the problems of high computational cost and insufficient generalization ability in existing technologies, and achieving low-cost and efficient speech enhancement effects.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- JIANGSU UNIV
- Filing Date
- 2025-06-06
- Publication Date
- 2026-07-02
Abstract
Description
A Single-Channel Speech Enhancement Method and Apparatus Based on Mean-Inverted Schrödinger Bridge
Technical Field
[0001] This invention relates to the field of speech enhancement in intelligent speech technology, specifically to a single-channel speech enhancement method and apparatus based on a mean-inverted Schrödinger bridge.
Background Technology
[0002] Speech enhancement (SE) technology aims to estimate a clean waveform from noisy speech waveforms, improving the naturalness and intelligibility of speech. It is widely used in speech recognition systems, online conferencing, smart homes and other fields.
[0003] Generally, speech enhancement methods can be divided into two categories: discriminative and generative. Discriminative methods effectively eliminate noise in speech waveforms by directly modeling the deterministic mapping between noisy and clean speech, but this can lead to speech distortion and over-denoising. Generative methods, on the other hand, implicitly or explicitly learn the latent data distribution of clean speech, resulting in more natural and understandable outputs, and exhibiting stronger generalization ability to various noise patterns. In particular, generative methods based on diffusion models treat the speech enhancement task as a conditional generation process or a probability distribution transmission process, effectively improving the perceptual quality of the enhanced speech. However, because the inverse generation process requires multiple iterations, the computational cost is high, making it unsuitable for real-time speech processing requirements.
[0004] Recent research combines the advantages of the two methods mentioned above to develop a hybrid speech enhancement approach: cascading a discriminative model and a diffusion model. In this approach, the discriminative model outputs an initial predicted waveform of clean speech, and then the diffusion model corrects for blurring and distortion in the initial prediction, reducing the number of iterations required by the diffusion model and producing cleaner speech with higher perceptual quality. However, this hybrid approach still has some limitations. Firstly, the reverse generation process of the diffusion model still requires at least 20 neural network calls, resulting in high computational costs. Secondly, due to the limited generalization ability of the preceding discriminative model, its generalization performance is poor in complex acoustic scenarios.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a single-channel speech enhancement method and apparatus based on a mean-inverted Schrödinger bridge.
[0006] The present invention achieves the above-mentioned technical objectives through the following technical means.
[0007] A single-channel speech enhancement method based on the mean-inverted Schrödinger bridge:
[0008] Noisy speech samples are obtained by mixing clean speech samples and noise samples at different signal-to-noise ratios. A set of speech sample pairs is formed from the clean speech samples and the noisy speech samples.
[0009] The speech samples in the speech sample pair set are preprocessed and then subjected to Fourier transform to obtain the spectrum complex matrix pairs;
[0010] A discriminant model is constructed, taking the complex spectrum matrix of noisy speech as input and the estimated complex spectrum matrix of clean speech as output. Based on the difference between the estimated complex spectrum matrix of clean speech and the complex spectrum matrix of real clean speech, the parameters of the discriminant model are adjusted. When the discriminant model meets the preset training requirements, the discriminant model with the optimal parameters is taken as the target discriminant model.
[0011] A score model is constructed, and the inverse optimal offset score of the mean-inverted Schrödinger bridge is parameterized using the score model to estimate the inverse optimal offset score during the inverse generation process. Based on the difference between the estimated inverse optimal offset score and the true inverse optimal offset score, the parameters of the score model are adjusted. When the score model meets the preset training requirements, the score model corresponding to the optimal parameters is taken as the target score model.
[0012] Given noisy speech, the reverse generation process is performed using the inverse optimal offset score of the mean-inverted Schrödinger bridge, as parameterized by the trained target score model, to generate clean speech.
[0013] Furthermore, the process of obtaining the spectral complex matrix pair is as follows: for any speech sample pair (x, y), randomly extract sub-segments of x and y and find the maximum of the sample values in y. Use this maximum to normalize x and y, obtaining x′ and y′. Then obtain the spectral complex matrices $X_0$ and $Y_0$ corresponding to x′ and y′ through Fourier transform, and apply amplitude compression to $X_0$ and $Y_0$ to obtain the complex spectral matrix pair (X, Y).
[0014] Furthermore, the training process for a discriminative model is as follows:
[0015] The complex spectral matrix Y of the noisy speech is input into the discriminant model $D_\theta$, which outputs an estimate $D_\theta(Y)$ of the complex spectral matrix of the clean speech.
[0016] A reconstruction loss function based on the clean speech is constructed as $L(D_\theta(Y), X)$, where L(·) represents the difference between the two; with minimizing this loss function as the training objective, gradient backpropagation is performed and the parameters θ are updated.
[0017] Training stops when the preset training requirements are met; the discriminant model corresponding to the optimal parameters $\theta^*$ found during training is used as the target discriminant model.
[0018] Furthermore, the process of obtaining the mean-inverted Schrödinger bridge is as follows:
[0019] For a complex spectral matrix pair (X, Y), the posterior mean of X conditioned on Y is treated as a variable. Given Y, this variable is regarded as an observation from a conditional posterior distribution, which is approximated using the target discriminative model $D_{\theta^*}$; that is, the posterior mean is approximated by $D_{\theta^*}(Y)$.
[0020] The state sample of the mean-inversion process at a given time t follows a marginal Gaussian distribution, where t is drawn from the uniform distribution on the interval [0, 1].
[0021] Define a Schrödinger bridge process that bridges the noisy speech distribution and the clean speech distribution, and denote its state sample at time t as $X_t$. The sample $X_t$ follows the marginal Gaussian distribution $q(X_t \mid X_0, X_1)$, with $X_0 = X$ and $X_1 = Y$. At time t, taking the state of the mean-inversion process as the condition on $X_t$, the marginal Gaussian distribution of the Schrödinger bridge is derived as follows:
[0022] where the state sample of the mean-inversion process is represented by its point estimate;
[0023] The Schrödinger bridge process that introduces the mean-inversion process is called the mean-inverted Schrödinger bridge; it replaces the boundary condition $X_0 = X$ of the Schrödinger bridge at time t with the point estimate of the state sample of the mean-inversion process.
[0024] Furthermore, the mean-inversion process is a Schrödinger bridge process that bridges the posterior mean distribution and the clean speech distribution.
[0025] Furthermore, at time t, the marginal Gaussian distribution of the mean-reversal process is:
[0026] where the two parameters of this distribution are the mean and the variance of the marginal distribution of the mean-inversion process at time t, and $\alpha_t$ and $\sigma_t$ are the noise schedule of the mean-inversion process, which determines its signal-to-noise ratio.
[0027] Furthermore, the mean-inverted Schrödinger bridge is described by the following two stochastic differential equations:
[0028] where f is the offset (drift) coefficient, g is the diffusion coefficient, the two driving noise terms are standard Wiener processes, $\Psi_t$ and its reverse-time counterpart are the Schrödinger factors, and the scores of these two factors represent the forward optimal offset score and the reverse optimal offset score, respectively;
[0029] At time t, the marginal Gaussian distribution of the mean-inverted Schrödinger bridge is:
[0030] where the two parameters of this distribution are the mean and the variance of the marginal distribution of the mean-inverted Schrödinger bridge at time t, and the noise schedule of the mean-inverted Schrödinger bridge, which determines its signal-to-noise ratio, is itself determined by f and g in the above stochastic differential equations.
[0031] Furthermore, the training process for the score model is as follows:
[0032] The complex spectral matrix Y of the noisy speech is input into the target discriminative model $D_{\theta^*}$, which outputs the posterior mean estimate $D_{\theta^*}(Y)$.
[0033] A time t is sampled at random. Based on the marginal Gaussian distribution of the mean-inversion process at time t, the mean of this distribution is calculated to obtain the point estimate at time t. Next, based on this point estimate, a sample $X_t$ at time t is drawn from the marginal Gaussian distribution of the mean-inverted Schrödinger bridge, using random noise. Finally, the time t and the sample $X_t$ are input into the score model, which outputs a prediction of $X_0$. The true optimal reverse offset score is then expressed in terms of $X_0$, and the estimated optimal reverse offset score in terms of the predicted $X_0$.
[0034] Based on the difference between the estimated inverse optimal offset score and the true inverse optimal offset score, the parameters of the score model are adjusted. This difference is reflected in the difference between $X_0$ and its prediction; specifically, a data prediction loss function is defined, and with minimizing this loss function as the training objective, gradient backpropagation is performed to update the parameters.
[0035] Training stops when the preset training requirements are met; the score model corresponding to the optimal parameters found during training is used as the target score model.
[0036] Furthermore, the reverse generation process specifically includes:
[0037] The time $t_j$ and the sample $X_{t_j}$ are input into the target score model to obtain a prediction of $X_0$;
[0038] this prediction is used to estimate the optimal reverse offset score, and the reverse stochastic differential equation is solved to calculate the sample $X_{t_{j-1}}$;
[0039] Let j = j-1. If j > 0, repeat the above process. If j = 0, stop the reverse generation. The sample at time $t_0$ is the complex spectral matrix of the clean speech obtained by enhancing the noisy speech; it is then subjected to inverse amplitude compression, inverse Fourier transform and inverse normalization to finally obtain the waveform of the clean speech.
[0040] A single-channel speech enhancement device based on a mean-inverted Schrödinger bridge, comprising:
[0041] Construct a sample pair module, which consists of a set of speech sample pairs composed of clean speech samples and noisy speech samples;
[0042] The sample processing module preprocesses and performs Fourier transform on the speech samples in the speech sample pair set to obtain the spectrum complex matrix pairs;
[0043] The discriminant model building module constructs a discriminant model, estimates the complex spectral matrix of clean speech, and adjusts the model parameters based on the difference between the estimated complex spectral matrix of clean speech and the real complex spectral matrix of clean speech. When the model meets the preset training requirements, the discriminant model with the optimal parameters is taken as the target discriminant model.
[0044] The score model building module constructs a score model, estimates the optimal reverse offset score during the reverse generation process, and adjusts the model parameters based on the difference between the estimated optimal reverse offset score and the true optimal reverse offset score. When the model meets the preset training requirements, the score model corresponding to the optimal parameters is taken as the target score model.
[0045] The clean speech generation module, for a given noisy speech, performs the reverse generation process using the inverse optimal offset score of the mean-inverted Schrödinger bridge, as parameterized by the trained target score model, to generate clean speech.
[0046] Compared with the prior art, the present invention has the following beneficial effects:
[0047] (1) This invention proposes a single-channel speech enhancement framework based on the mean-inverted Schrödinger bridge. The speech enhancement task is defined as the optimal transmission problem between noisy speech distributions and clean speech distributions. The corresponding Schrödinger bridge problem is solved using a diffusion model to achieve clean speech generation. Comparison with existing single-channel speech enhancement methods based on diffusion models on public datasets shows that this invention has more advanced performance, significantly reduces the number of neural network calls, and lowers computational costs.
[0048] (2) This invention improves upon the traditional Schrödinger bridge by proposing a mean-inverted Schrödinger bridge, which uses posterior mean information (the posterior mean of clean speech conditioned on noisy speech), estimated by the discriminant model, as a guide for the generation process. Unlike existing hybrid methods, this invention uses the discriminant model only during score model training, to approximate the marginal Gaussian distribution from which the inverse optimal offset score is calculated, thus avoiding the limited generalization ability caused by the preceding discriminant model in hybrid methods. Specifically, this invention designs a mean-inversion process to simulate the conversion from the posterior mean to clean speech. Then, by deriving an approximate form of the Gaussian marginal distribution, the mean-inversion process is introduced into the Schrödinger bridge, so that the posterior mean guides the inverse generation process. In the early stages of the inverse generation process, the mean-inversion process favors posterior mean samples, and the samples of the mean-inverted Schrödinger bridge process lie outside the data manifold. Under the guidance of the posterior mean, the complexity of the neural network estimating the optimal inverse offset score can be reduced, improving the ability to preserve high-frequency speech information. In the later stages of the generation process, the mean-inversion process gradually favors clean speech samples, and the samples of the mean-inverted Schrödinger bridge process lie inside the data manifold; the process becomes one of generating clean speech samples and gradually restores low-frequency speech information, ultimately achieving the optimal conversion between noisy speech and clean speech under the guidance of the posterior mean, thus realizing speech enhancement.
Attached Figure Description
[0049] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0050] Figure 1 is a flowchart of the single-channel speech enhancement method based on the mean-inverted Schrödinger bridge described in this invention.
Detailed Implementation
[0051] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of protection of this invention.
[0052] As shown in Figure 1, the single-channel speech enhancement method based on the mean-inverted Schrödinger bridge provided in this embodiment includes the following steps:
[0053] S1, Constructing sample pairs: Mix clean speech samples and noise samples at different signal-to-noise ratios to obtain noisy speech samples, and form a speech sample pair set from the clean speech samples and the noisy speech samples.
[0054] In this embodiment, the VoiceBank+DEMAND speech enhancement benchmark dataset is downloaded, and its pre-constructed set of speech sample pairs is used directly: $D_{xy} = \{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in D_x,\ y^{(i)} \in D_y,\ i = 1, 2, \ldots, n\}$, where $(x^{(i)}, y^{(i)})$ is the i-th pair of clean and noisy speech samples, $x^{(i)}$ and $y^{(i)}$ are the original waveform of the clean speech and the original waveform of the noisy speech, respectively, $D_x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ is the clean speech set, $D_y = \{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}$ is the noisy speech set, i is the sample index, and n is the total number of sample pairs. The clean speech in this dataset comes from the VoiceBank corpus and the noise comes from the DEMAND corpus; they are mixed at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, and 15 dB.
[0055] The above sample pair construction method is not limited to using pre-built sample pairs from public datasets. Clean speech samples from clean speech datasets such as VoiceBank, TIMIT, and WSJ0 can also be mixed with noise samples from noise datasets such as DEMAND and WHAM! at a specific signal-to-noise ratio to construct speech sample pairs.
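The mixing in step S1 can be illustrated with a minimal sketch. It assumes that the signal-to-noise ratio is defined on mean signal power and that the noise is tiled or trimmed to the length of the clean speech; the function name `mix_at_snr` and the random test signals are hypothetical and are not taken from this embodiment.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Assumption: repeat or trim the noise so it covers the whole clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Target noise power is clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return clean + scaled_noise, scaled_noise

# Placeholder signals standing in for a clean utterance and a noise recording.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(8000)

pairs = []
for snr_db in (0, 5, 10, 15):  # SNR values used by VoiceBank+DEMAND
    noisy, _ = mix_at_snr(clean, noise, snr_db)
    pairs.append((clean, noisy))  # one (x, y) sample pair per SNR
```

Each entry of `pairs` corresponds to one (x, y) sample pair of the set described above.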
[0056] S2, Speech preprocessing: After preprocessing the speech samples in the speech sample pair set, the corresponding spectrum complex matrix pairs are obtained through Fourier transform.
[0057] In this embodiment, any speech sample pair $(x, y) \in D_{xy}$ is preprocessed as follows. First, sub-segments of x and y are randomly selected, and the maximum $y_{\max}$ of the sample values in y is found. x and y are normalized by this maximum to obtain $x' = x / y_{\max}$ and $y' = y / y_{\max}$. Then, the complex spectral matrices $X_0$ and $Y_0$ corresponding to x′ and y′ are obtained through Fourier transform. Next, amplitude compression is applied to $X_0$ and $Y_0$ to obtain the final complex spectral matrix pair (X, Y), where $X = \beta_s |X_0|^{\beta_m} e^{j\angle(X_0)}$ and $Y = \beta_s |Y_0|^{\beta_m} e^{j\angle(Y_0)}$; here |·| represents the modulus, ∠(·) represents the phase angle, $\beta_s$ is the scaling factor, and $\beta_m$ is the exponential factor. In this embodiment, $\beta_s = 0.33$ and $\beta_m = 0.5$. In this embodiment, both the original waveform x and the complex spectral matrix X can represent clean speech, and both the original waveform y and the complex spectral matrix Y can represent noisy speech.
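The preprocessing in step S2 can be sketched as follows. This is only an illustration under stated assumptions: the compression is taken to be scaling factor times magnitude raised to the exponential factor, with the phase kept unchanged; the maximum is taken over absolute sample values of the selected sub-segment; and the STFT window, FFT size, hop size, and segment length are hypothetical values not given in this embodiment.

```python
import numpy as np

BETA_S, BETA_M = 0.33, 0.5  # scaling and exponential factors of this embodiment

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; n_fft and hop are illustrative assumptions."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)  # shape: (n_fft // 2 + 1, n_frames)

def compress(S, beta_s=BETA_S, beta_m=BETA_M):
    """Amplitude compression: scale and exponentiate the modulus, keep the phase."""
    return beta_s * np.abs(S) ** beta_m * np.exp(1j * np.angle(S))

def decompress(C, beta_s=BETA_S, beta_m=BETA_M):
    """Inverse amplitude compression, as needed after reverse generation."""
    return (np.abs(C) / beta_s) ** (1.0 / beta_m) * np.exp(1j * np.angle(C))

def preprocess_pair(x, y, segment_len, rng):
    # Same random offset for x and y so the pair stays time-aligned.
    start = rng.integers(0, len(y) - segment_len + 1)
    x_seg = x[start : start + segment_len]
    y_seg = y[start : start + segment_len]
    y_max = np.max(np.abs(y_seg))  # assumption: maximum absolute sample value
    X0, Y0 = stft(x_seg / y_max), stft(y_seg / y_max)
    return compress(X0), compress(Y0), y_max

# Placeholder waveforms standing in for a clean/noisy sample pair.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
y = x + 0.3 * rng.standard_normal(16000)
X, Y, y_max = preprocess_pair(x, y, segment_len=8192, rng=rng)
```

Returning `y_max` alongside the spectra is a convenience of this sketch: the inverse normalization at the end of the reverse generation needs the same value.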
[0058] S3, Discriminant Model Training: Construct a discriminant model to estimate the posterior mean of clean speech conditioned on noisy speech. This model takes the complex spectral matrix of the noisy speech as input and the estimated complex spectral matrix of the clean speech as output. The parameters of the discriminant model are adjusted based on the difference between the estimated complex spectral matrix of the clean speech and the real complex spectral matrix of the clean speech. When the discriminant model meets the preset training requirements, the discriminant model with the optimal parameters is selected as the target discriminant model.
[0059] In this embodiment, a discriminative model $D_\theta$ is first constructed and then trained according to the following process:
[0060] (1) Select a sample pair $(x, y) \in D_{xy}$ and preprocess it according to S2 to obtain the spectrum complex matrix pair (X, Y);
[0061] (2) Input the complex spectral matrix Y of the noisy speech into $D_\theta$, which outputs an estimate $D_\theta(Y)$ of the complex spectral matrix of the clean speech;
[0062] (3) Construct a reconstruction loss function based on the clean speech, $L(D_\theta(Y), X)$, where L(·) represents the difference between the two and can usually be the mean squared error (MSE); with minimizing this loss function as the training objective, perform gradient backpropagation and update the parameters θ;
[0063] (4) Repeat steps (1) to (3) until training requirements such as the maximum number of iterations are met, then stop training; the discriminant model corresponding to the optimal parameters $\theta^*$ found during training serves as the target discriminant model. This target discriminant model is only used for training the score model in S4 and not for generating clean speech in S5.
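Steps (1) to (4) can be illustrated with a toy training loop. This is a sketch under strong assumptions, not the model of this embodiment: the neural network is replaced by a hypothetical per-frequency complex gain `theta`, the loss is the MSE between the estimate and the clean spectrum, and the update uses the analytic gradient with respect to the conjugate of `theta` instead of automatic backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 8, 64  # illustrative number of frequency bins and frames

# Toy data: the "clean" spectrum is a scaled copy of the "noisy" one.
Y = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
X = 0.5 * Y

def d_theta(theta, Y):
    """Hypothetical stand-in for the discriminative model: per-frequency gain."""
    return theta * Y

def mse(A, B):
    return float(np.mean(np.abs(A - B) ** 2))

theta = np.ones((F, 1), dtype=complex)
best_loss, theta_star = np.inf, theta.copy()
lr = 0.1
for step in range(200):
    X_hat = d_theta(theta, Y)              # step (2): estimate clean spectrum
    loss = mse(X_hat, X)                   # step (3): reconstruction loss
    if loss < best_loss:                   # step (4): keep the best parameters
        best_loss, theta_star = loss, theta.copy()
    # Descent direction, proportional to the gradient w.r.t. conj(theta).
    grad = np.mean((X_hat - X) * np.conj(Y), axis=1, keepdims=True)
    theta = theta - lr * grad
```

After training, `theta_star` plays the role of the optimal parameters, and `d_theta(theta_star, Y)` plays the role of the posterior mean estimate used in S4.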
[0064] S4, Score Model Training: Construct a score model to match the optimal inverse offset score of the mean-inverted Schrödinger bridge. Parameterize the optimal inverse offset score of the mean-inverted Schrödinger bridge using this score model to estimate the optimal inverse offset score during the inverse generation process. Adjust the parameters of the score model based on the difference between the estimated and true optimal inverse offset scores. When the preset training requirements are met, the score model with the optimal parameters is taken as the target score model.
[0065] In this embodiment, the mean-inverted Schrödinger bridge is a special type of Schrödinger bridge that improves the estimation ability of high-frequency information of clean speech by introducing posterior mean information and reduces the difficulty of score matching during training.
[0066] Specifically, for a given pair of complex spectral matrices (X, Y), the posterior mean of X conditioned on Y is first treated as a variable. Given Y, this variable is regarded as an observation from a conditional posterior distribution. This embodiment uses the target discriminative model $D_{\theta^*}$ trained in S3 to approximate the conditional posterior distribution; that is, the posterior mean is approximated by $D_{\theta^*}(Y)$.
[0067] Next, a Schrödinger bridge process bridging the posterior mean distribution and the clean speech distribution is defined, called the mean-inversion process. The state sample of the mean-inversion process at a given time t follows a marginal Gaussian distribution, where t is drawn from the uniform distribution on the interval [0, 1]. At time t, the marginal Gaussian distribution of the mean-inversion process can be specifically represented as:
[0068] where the two parameters of this distribution are the mean and the variance of the marginal distribution of the mean-inversion process at time t, and $\alpha_t$ and $\sigma_t$ are the noise schedule of the mean-inversion process, which determines its signal-to-noise ratio. In the early stages of the reverse generation process, the mean-inversion process is biased towards the posterior mean sample, while in the later stages it gradually shifts towards the clean speech sample, achieving the optimal conversion between the posterior mean and the clean speech.
[0069] Finally, a Schrödinger bridge process is defined to bridge the noisy speech distribution and the clean speech distribution, and its state sample at time t is denoted $X_t$. The sample $X_t$ follows the marginal Gaussian distribution $q(X_t \mid X_0, X_1)$, with $X_0 = X$ and $X_1 = Y$. This Schrödinger bridge can be described by the following two stochastic differential equations:
[0070] where f is the offset (drift) coefficient, g is the diffusion coefficient, the two driving noise terms are standard Wiener processes, $\Psi_t$ and its reverse-time counterpart are the Schrödinger factors, and the scores of these two factors represent the forward optimal offset score and the reverse optimal offset score, respectively.
[0071] To introduce the mean-inversion process into the Schrödinger bridge process, at time t, $X_t$ is taken to be conditioned on the state of the mean-inversion process, and the approximate marginal Gaussian distribution of the Schrödinger bridge is derived as follows:
[0072] where the state sample of the mean-inversion process is represented by its point estimate, which can usually be taken as the mean of its distribution. The Schrödinger bridge process that introduces the mean-inversion process in this way is called the mean-inverted Schrödinger bridge: it replaces the boundary condition $X_0 = X$ of the traditional Schrödinger bridge at time t with this point estimate, and $X_t$ is obtained by approximate sampling from the marginal Gaussian distribution. At time t, the marginal Gaussian distribution of the mean-inverted Schrödinger bridge can be specifically expressed as:
[0073] where the two parameters of this distribution are the mean and the variance of the marginal distribution of the mean-inverted Schrödinger bridge at time t, and the noise schedule of the mean-inverted Schrödinger bridge, which determines its signal-to-noise ratio, is itself determined by f and g in the above stochastic differential equations (the determination process is existing technology).
[0074] The inverse generation process of the mean-inverted Schrödinger bridge refers to the process of transforming time t from 1 to 0, corresponding to sample X. t The process gradually transforms noisy speech into clean speech. The characteristic of the mean-inverted Schrödinger bridge is that, due to the approximate replacement of boundary conditions, in the early stage of the inverse generation process, the mean-inverted process favors the posterior mean sample. The mean-inverted Schrödinger bridge process uses the posterior mean as a condition to maximize the restoration of high-frequency information of clean speech from noisy speech. In the later stage, the mean-inverted process gradually favors the clean speech sample, and the mean-inverted Schrödinger bridge process gradually restores low-frequency information, ultimately achieving the optimal conversion between noisy speech and clean speech under the guidance of the posterior mean.
[0075] During the training of the score model, since the boundary conditions $X_0$ and $X_1$ are known, and the inverse generation process of the mean-inverted Schrödinger bridge is also known, the inverse optimal offset score at time t can be expressed explicitly. The inverse optimal offset score characterizes the optimal path for transporting samples from the noisy speech distribution to the clean speech distribution, indicating the direction of gradual transition from noisy speech samples to clean speech samples. In actual speech enhancement, only the boundary condition $X_1$ is known, so the inverse optimal offset score at time t is parameterized by constructing a score model. The training objective of this score model is to ensure that the estimated inverse optimal offset score can accurately guide the sample from a noisy state to a clean state, thereby achieving the effect of speech enhancement.
[0076] In this embodiment, when training the score model, a score model is first constructed; it typically employs a U-Net structure with residual connections. When the score model is used to parameterize the inverse optimal offset score of the mean-inverted Schrödinger bridge, the model can in practice predict the score and match the true score directly, or match the true score indirectly by predicting the noise or predicting the data. This embodiment illustrates the training method that matches the true score indirectly through data prediction, namely: the score model predicts $X_0$, and this prediction is used to estimate the inverse optimal offset score.
[0077] Specifically, this embodiment trains the score model according to the following process:
[0078] (1) Select a sample pair $(x, y) \in D_{xy}$ and preprocess it according to S2 to obtain the spectrum complex matrix pair (X, Y);
[0079] (2) Input the complex spectral matrix Y of the noisy speech into the target discriminant model $D_{\theta^*}$ obtained by training in S3, which outputs the posterior mean estimate $D_{\theta^*}(Y)$;
[0080] (3) Randomly sample a time t. Based on the marginal Gaussian distribution of the mean-inversion process at time t, calculate the mean of this distribution to obtain the point estimate at time t. Next, based on this point estimate, draw a sample $X_t$ at time t from the marginal Gaussian distribution of the mean-inverted Schrödinger bridge, using random noise. Finally, input the time t and the sample $X_t$ into the score model, which outputs a prediction of $X_0$. The true optimal reverse offset score is then expressed in terms of $X_0$, and the estimated optimal reverse offset score in terms of the predicted $X_0$;
[0081] (4) Based on the difference between the estimated reverse optimal offset score and the true reverse optimal offset score (which is reflected in the difference between $X_0$ and its prediction), adjust the parameters of the score model: define a data prediction loss function between $X_0$ and its prediction, where L(·) represents the difference between the two and is typically the mean squared error. Minimizing this loss function improves the accuracy of the estimated reverse optimal offset score; with this minimization as the training objective, perform gradient backpropagation to update the parameters;
[0082] (5) Repeat steps (1) to (4) until training requirements such as the maximum number of iterations are met, then stop training. The score model corresponding to the optimal parameters found during training is used as the target score model.
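The construction of one training example in steps (2) to (4) can be sketched as follows. The marginal distributions of this embodiment are not reproduced here; as a loudly stated assumption, both the mean-inversion process and the Schrödinger bridge are replaced by the Brownian-bridge special case (zero drift, constant diffusion with hypothetical strength `beta`), whose marginal at time t has mean (1 - t)·a + t·b and variance beta·t·(1 - t). The posterior mean and the score model below are placeholders, not trained models.

```python
import numpy as np

def bridge_mean(a, b, t):
    """Mean at time t of a Brownian bridge pinned at a (t=0) and b (t=1)."""
    return (1.0 - t) * a + t * b

def complex_noise(shape, rng):
    """Unit-variance circular complex Gaussian noise."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2.0)

def make_training_sample(X, Y, posterior_mean, t, beta, rng):
    # Point estimate: mean of the (assumed) mean-inversion marginal, which
    # moves from clean speech X at t=0 to the posterior mean at t=1.
    point_estimate = bridge_mean(X, posterior_mean, t)
    # Sample X_t from the (assumed) bridge marginal, with the boundary
    # condition X_0 = X replaced by the point estimate.
    std = np.sqrt(beta * t * (1.0 - t))
    return bridge_mean(point_estimate, Y, t) + std * complex_noise(X.shape, rng)

def data_prediction_loss(X0_pred, X0):
    """MSE between the predicted and the true clean spectrum."""
    return float(np.mean(np.abs(X0_pred - X0) ** 2))

# Toy spectra; posterior_mean is a hypothetical stand-in for D_theta*(Y).
rng = np.random.default_rng(0)
shape = (4, 5)
X = complex_noise(shape, rng)
Y = X + 0.5 * complex_noise(shape, rng)
posterior_mean = 0.5 * (X + Y)

def score_model(X_t, t):
    """Placeholder for the score model; a real model would be a neural network."""
    return X_t

t = rng.uniform(0.0, 1.0)                      # step (3): random time
X_t = make_training_sample(X, Y, posterior_mean, t, beta=1.0, rng=rng)
loss = data_prediction_loss(score_model(X_t, t), X)  # step (4): loss on X_0
```

Under this assumed schedule the sample equals Y at t = 1 and X at t = 0, which matches the stated boundary conditions of the bridge.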
[0083] S5, Clean Speech Generation: Given noisy speech, the reverse generation process is performed using the inverse optimal offset score of the mean-inverted Schrödinger bridge, as parameterized by the previously trained target score model, to generate clean speech. The initial state of the reverse generation process is noisy speech, and the final state obtained after multiple iterations is the estimated clean speech.
[0084] In this embodiment, given noisy speech $y_{tgt}$, following the preprocessing method of S2 but without extracting sub-segments, the noisy speech is sequentially normalized, Fourier transformed, and amplitude compressed to obtain the corresponding complex spectral matrix $Y_{tgt}$. Let the time list for reverse generation be $[t_0, t_1, \ldots, t_i, \ldots, t_N]$, where $0 < t_i \le 1$, $t_0 = 0.03$, $t_N = 1$, $i = 0, 1, \ldots, N$, and N is the number of iterations. Let j = N, with the initial state $X_{t_N} = Y_{tgt}$. The reverse generation process is as follows:
[0085] (1) Input the time $t_j$ and the sample $X_{t_j}$ into the target score model to obtain a prediction of $X_0$;
[0086] (2) Use this prediction to estimate the optimal reverse offset score and solve the reverse stochastic differential equation to calculate $X_{t_{j-1}}$. A simplified solution is as follows: the sample at time $t_{j-1}$ follows the posterior probability distribution $q(X_{t_{j-1}} \mid X_0, X_{t_j})$ of the Schrödinger bridge; with the sample $X_{t_j}$ known, the prediction of the score model replaces $X_0$, and $X_{t_{j-1}}$ is sampled from the resulting distribution;
[0087] (3) Let j = j-1. If j > 0, repeat (1) to (2). If j = 0, stop the reverse generation. The sample at time $t_0$ is the complex spectral matrix of the clean speech obtained by enhancing the noisy speech; it is then subjected to inverse amplitude compression, inverse Fourier transform and inverse normalization to finally obtain the waveform of the clean speech.
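The loop in steps (1) to (3) can be sketched as below. The posterior distribution of this embodiment is not reproduced here; as an assumption, the sketch uses the Brownian-bridge posterior (zero drift, constant diffusion with hypothetical strength `beta`), for which the state at an earlier time s given $X_0$ and $X_t$ is Gaussian with mean (1 - s/t)·$X_0$ + (s/t)·$X_t$ and variance beta·s·(t - s)/t. The evenly spaced time list and the predictor `predict_x0` are also hypothetical.

```python
import numpy as np

def reverse_generate(Y, predict_x0, n_steps=3, t_min=0.03, beta=1.0, rng=None):
    """Run the reverse generation from t_N = 1 down to t_0 = t_min."""
    rng = np.random.default_rng() if rng is None else rng
    times = np.linspace(t_min, 1.0, n_steps + 1)  # [t_0, ..., t_N], assumed spacing
    X_t = np.asarray(Y).astype(complex)           # initial state: noisy speech
    for j in range(n_steps, 0, -1):
        t, s = times[j], times[j - 1]
        X0_hat = predict_x0(X_t, t)               # step (1): one network call
        # Step (2): sample X_s from the assumed posterior q(X_s | X0_hat, X_t).
        mean = (1.0 - s / t) * X0_hat + (s / t) * X_t
        std = np.sqrt(beta * s * (t - s) / t)
        eps = (rng.standard_normal(X_t.shape)
               + 1j * rng.standard_normal(X_t.shape)) / np.sqrt(2.0)
        X_t = mean + std * eps
    return X_t                                    # sample at time t_0

# Usage with a placeholder predictor (a real one would be the target score model).
rng = np.random.default_rng(0)
Y_tgt = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
X_enh = reverse_generate(Y_tgt, lambda X_t, t: 0.9 * X_t, n_steps=3, rng=rng)
```

With `n_steps=3` the predictor is called three times, matching the number of neural network calls reported for this embodiment.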
[0088] The reverse generation process described above uses a method of indirectly matching the real scores with predicted data to train the score model. Therefore, it can directly sample stepwise through the posterior probability distribution in (2) above to simulate the mean-inverted Schrödinger bridge. If the score model is trained by directly matching the real scores with the predicted scores, or by indirectly matching the real scores with the predicted noise, the reverse generation process is similar to the above process and will not be elaborated here.
[0089] In this embodiment, the VoiceBank+DEMAND and TIMIT+WHAM! datasets are each used to train the discriminant model and the score model to achieve speech enhancement. VoiceBank+DEMAND is a publicly available speech enhancement benchmark dataset whose sample pairs have been pre-constructed. The TIMIT+WHAM! dataset is constructed by mixing clean speech from TIMIT with noise from WHAM! at a random signal-to-noise ratio, following step S1, to generate noisy speech and construct sample pairs; the signal-to-noise ratio is uniformly sampled between -6 dB and 14 dB.
[0090] In this embodiment, the speech enhancement performance of the method was compared, on the VoiceBank+DEMAND dataset, with that of three mainstream diffusion-based single-channel speech enhancement methods: CDiffuSE, SGMSE+, and StoRM. The results are shown in Table 1.
[0091] Table 1 compares the speech enhancement performance of the three mainstream methods on the VoiceBank+DEMAND dataset.
[0092] Here, PESQ (perceptual evaluation of speech quality) is an objective measure of perceived speech quality; ESTOI (extended short-time objective intelligibility) is an objective measure of speech intelligibility; SI-SDR (scale-invariant signal-to-distortion ratio) measures the degree of speech distortion; and DNSMOS is a non-intrusive measure that predicts subjective speech quality ratings. For all four metrics, a higher value is better.
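Of these metrics, SI-SDR has a simple closed form and can be sketched directly. The following uses the commonly used definition, in which the estimate is projected onto the reference before the ratio is taken; it is an illustration rather than the evaluation code used for Table 1, and it omits the mean removal that some implementations apply first.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy                       # optimal scaling of the reference
    target = [scale * r for r in reference]        # component explained by the reference
    distortion = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10(target_energy / distortion_energy)
```

Rescaling the estimate by any non-zero constant leaves the value unchanged, which is what makes the metric scale invariant. A perfect estimate has zero distortion energy and would need special handling.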
[0093] As shown in Table 1, the method in this embodiment shows improvements in PESQ, ESTOI, SI-SDR, and DNSMOS compared to other methods, demonstrating the superior performance of this method. Furthermore, this method only calls the neural network three times during the reverse generation process, significantly reducing computational costs.
[0094] In this embodiment, based on the TIMIT+WHAM! and VoiceBank+DEMAND datasets, the generalization performance of the method was compared with that of two mainstream diffusion-based single-channel speech enhancement methods, SGMSE+ and StoRM. The results are shown in Table 2.
[0095] Table 2 compares the generalization performance of the two mainstream methods on the TIMIT+WHAM! dataset.
[0096] Here, WER represents the word error rate of recognized text on downstream speech recognition tasks, and a lower value is better. Domain matching indicates that the method was trained on the TIMIT+WHAM! dataset and validated on the test set of the TIMIT+WHAM! dataset. Domain mismatch indicates that the method was trained on the VoiceBank+DEMAND dataset and validated on the test set of the TIMIT+WHAM! dataset.
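The word error rate can be illustrated with a minimal sketch based on the word-level edit distance between the reference transcript and the recognized text. Whitespace tokenization and the function name are assumptions; the speech recognizer and any text normalization used for Table 2 are not specified here.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion over six reference words, giving a WER of 2/6. An empty reference is not handled.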
[0097] As shown in Table 2, compared with other methods, the method in this embodiment achieved the best performance in both domain matching and domain mismatch cases, proving the generalization ability of the method.
[0098] In this embodiment, ablation experiments were conducted based on the VoiceBank+DEMAND dataset, and the results are shown in Table 3.
[0099] Table 3 shows the ablation experimental results on the VoiceBank+DEMAND dataset.
[0100] The second to fifth rows are improvements on the traditional Schrödinger bridge method. In the methods of the second to fourth rows, the discriminant model is used in both the training of the score model and the reverse generation process. In the method of the fifth row, which is the method of this embodiment, the discriminant model is used only in the training of the score model.
[0101] As shown in Table 3, the method in this embodiment achieves the highest objective speech perception quality (PESQ) and intelligibility (ESTOI) without the need for discriminative model assistance during the reverse generation process. All other performance aspects are also improved compared to the traditional Schrödinger bridge method.
[0102] Based on the same inventive concept, this embodiment also provides a single-channel speech enhancement device based on a mean-inverted Schrödinger bridge, comprising:
[0103] A sample pair construction module, which constructs a set of speech sample pairs composed of clean speech samples and noisy speech samples;
[0104] The sample processing module preprocesses and performs Fourier transform on the speech samples in the speech sample pair set to obtain the spectrum complex matrix pairs;
[0105] The discriminant model construction module constructs a discriminant model, taking the complex spectrum matrix of noisy speech as input and the estimated complex spectrum matrix of clean speech as output. Based on the difference between the estimated complex spectrum matrix of clean speech and the complex spectrum matrix of real clean speech, the parameters of the discriminant model are adjusted. When the discriminant model meets the preset training requirements, the discriminant model with the optimal parameters is taken as the target discriminant model.
[0106] The score model construction module constructs a score model and uses it to parameterize the inverse optimal offset score of the mean-inverted Schrödinger bridge, so as to estimate the inverse optimal offset score during the reverse generation process. Based on the difference between the estimated inverse optimal offset score and the true inverse optimal offset score, the parameters of the score model are adjusted. When the score model meets the preset training requirements, the score model corresponding to the optimal parameters is taken as the target score model.
[0107] The clean speech generation module, for given noisy speech, performs the reverse generation process using the inverse optimal offset score of the mean-inverted Schrödinger bridge, as parameterized by the trained target score model, to generate clean speech.
[0108] It should be noted that, when the single-channel speech enhancement device based on the mean-inverted Schrödinger bridge provided in the above embodiments performs speech enhancement and generates clean speech, the division into the above functional modules is used only as an example. In practice, the functions can be assigned to different functional modules as needed; that is, the internal structure of the terminal or server can be divided into different functional modules to complete all or part of the functions described above. Furthermore, the device provided in the above embodiments and the single-channel speech enhancement method embodiment based on the mean-inverted Schrödinger bridge belong to the same concept. For the specific implementation process and effects, refer to the method embodiment; they are not repeated here.
[0109] Based on the same inventive concept, this embodiment also provides a computing device, including a memory and one or more processors. The memory stores executable code, and when the one or more processors execute the executable code, they are used to implement the above-mentioned single-channel speech enhancement method based on the mean-inverted Schrödinger bridge.
[0110] At the hardware level, the computing device provided in this embodiment includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may also include other hardware required for its operation. The processor reads the corresponding computer program from the non-volatile storage into memory and then runs it to implement the single-channel speech enhancement method based on the mean-inverted Schrödinger bridge described in S1-S5 above. Besides software implementation, this invention does not exclude other implementations, such as logic devices or a combination of hardware and software; that is, the execution entity of the processing flow is not limited to individual logic units and may also be hardware or logic devices.
[0111] Based on the same inventive concept, this embodiment also provides a computer-readable storage medium storing a program that, when executed by a processor, implements the above-described single-channel speech enhancement method based on the mean-inverted Schrödinger bridge.
[0112] In this embodiment, the computer-readable medium includes both permanent and non-permanent, removable and non-removable media, and information storage can be achieved by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data.
[0113] The embodiments described above are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments. Any obvious improvements, substitutions or modifications that can be made by those skilled in the art without departing from the essence of the present invention shall fall within the protection scope of the present invention.
Claims
1. A single-channel speech enhancement method based on a mean-inverted Schrödinger bridge, characterized in that: noisy speech samples are obtained by mixing clean speech samples and noise samples at different signal-to-noise ratios, and a set of speech sample pairs is formed from the clean speech samples and the noisy speech samples; the speech samples in the set of speech sample pairs are preprocessed and then subjected to Fourier transform to obtain complex spectral matrix pairs; a discriminant model is constructed, taking the complex spectral matrix of noisy speech as input and the estimated complex spectral matrix of clean speech as output; based on the difference between the estimated complex spectral matrix of clean speech and the complex spectral matrix of real clean speech, the parameters of the discriminant model are adjusted, and when the discriminant model meets the preset training requirements, the discriminant model with the optimal parameters is taken as the target discriminant model; a score model is constructed, and the inverse optimal offset score of the mean-inverted Schrödinger bridge is parameterized using the score model to estimate the inverse optimal offset score during the reverse generation process; based on the difference between the estimated inverse optimal offset score and the true inverse optimal offset score, the parameters of the score model are adjusted, and when the score model meets the preset training requirements, the score model corresponding to the optimal parameters is taken as the target score model; and, for given noisy speech, the reverse generation process is performed using the inverse optimal offset score of the mean-inverted Schrödinger bridge, as parameterized by the trained target score model, to generate clean speech.
2. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 1, characterized in that the process of obtaining the complex spectral matrix pair is as follows: for any speech sample pair (x, y), sub-segments of x and y are randomly extracted, and the maximum of the sample values in y is determined; this maximum is used to normalize x and y respectively to obtain x′ and y′; the complex spectral matrices X_0 and Y_0 corresponding to x′ and y′ are then obtained through Fourier transform; and amplitude compression is performed on X_0 and Y_0 to obtain the complex spectral matrix pair (X, Y).
3. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 2, characterized in that the training process of the discriminant model is as follows: the complex spectral matrix Y of the noisy speech is input into the discriminant model D_θ, which outputs an estimate of the complex spectral matrix of the clean speech; a reconstruction loss function is constructed from this estimate and the complex spectral matrix of the clean speech, where L(·) represents the difference between the two; with minimizing the loss function as the training objective, gradient backpropagation is performed to update the parameters θ; training stops when the preset training requirements are met; and the discriminant model corresponding to the optimal parameters θ* found during training is selected as the target discriminant model.
4. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 3, characterized in that the process of obtaining the mean-inverted Schrödinger bridge is as follows: for a complex spectral matrix pair (X, Y), the posterior mean of X with respect to Y is treated as a variable; given Y, this variable is observed from a conditional posterior distribution, and the target discriminant model is used to approximate that conditional posterior distribution; the state sample of the mean-inversion process at time t follows a marginal Gaussian distribution, where t is drawn from the uniform distribution on the interval [0, 1]; a Schrödinger bridge process is defined that bridges the noisy speech distribution and the clean speech distribution, and the state sample of this Schrödinger bridge process at time t is denoted X_t; the sample X_t follows a marginal Gaussian distribution q(X_t | X0, X1), with X0 = X and X1 = Y; at time t, with X_t conditioned on the state sample of the mean-inversion process, the approximate marginal Gaussian distribution of the Schrödinger bridge is derived, in which a point estimate of that state sample is used; the Schrödinger bridge process that introduces the mean-inversion process is called the mean-inverted Schrödinger bridge, which replaces the boundary condition X0 = X at time t of the Schrödinger bridge with...
5. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 4, characterized in that the mean-inversion process is a Schrödinger bridge process that bridges the posterior mean distribution and the clean speech distribution.
6. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 4, characterized in that: at time t, the marginal Gaussian distribution of the mean-inversion process is defined by the mean and the variance of the marginal distribution of the mean-inversion process at time t, where α_t and σ_t are the noise schedule of the mean-inversion process, together with the associated signal-to-noise ratio.
7. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 6, characterized in that: the mean-inverted Schrödinger bridge is described by two stochastic differential equations, one forward and one reverse, where f is the offset coefficient, g is the diffusion coefficient, the two driving processes are standard Wiener processes, Ψ_t and its counterpart are the Schrödinger factors, and the scores of the two Schrödinger factors represent the forward optimal offset score and the inverse optimal offset score, respectively; at time t, the marginal Gaussian distribution of the mean-inverted Schrödinger bridge is defined by the mean and the variance of the marginal distribution of the mean-inverted Schrödinger bridge at time t, where the noise schedule of the mean-inverted Schrödinger bridge is determined by f and g in the above stochastic differential equations, together with the associated signal-to-noise ratio.
8. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 7, characterized in that the training process of the score model is as follows: the complex spectral matrix Y of the noisy speech is input into the target discriminant model, which outputs the estimated complex spectral matrix of the clean speech; a time t is randomly sampled; based on the marginal Gaussian distribution of the mean-inversion process at time t, the mean of this distribution is calculated to obtain the point estimate at time t; next, based on the marginal Gaussian distribution of the mean-inverted Schrödinger bridge at time t, the sample X_t at time t is obtained using random noise; finally, the time t and the sample X_t are input into the score model, which outputs the predicted data; the true inverse optimal offset score and the estimated inverse optimal offset score are thereby determined, and the parameters of the score model are adjusted based on the difference between them, this difference being reflected in the difference between X0 and the predicted data; specifically, a data prediction loss function is defined, and with minimizing this loss function as the training objective, gradient backpropagation is performed to update the parameters; training stops when the preset training requirements are met; and the score model corresponding to the optimal parameters found during training is selected as the target score model.
9. The single-channel speech enhancement method based on the mean-inverted Schrödinger bridge according to claim 8, characterized in that the reverse generation process is specifically as follows: at time t_j, the sample X_{t_j} is input into the target score model to obtain the predicted data; the predicted data is used to estimate the inverse optimal offset score, and the reverse stochastic differential equation is solved to calculate the sample X_{t_{j-1}}; let j = j-1; if j > 0, the above process is repeated; if j = 0, the reverse generation stops; the sample X_{t0} at time t0 is the complex spectral matrix of the clean speech obtained by enhancing the noisy speech, and it is then subjected to inverse amplitude compression, inverse Fourier transform, and inverse normalization to finally obtain the waveform of the clean speech.
10. An apparatus for implementing the single-channel speech enhancement method based on the mean-inverted Schrödinger bridge as described in any one of claims 1-9, characterized in that it includes: a sample pair construction module, which constructs a set of speech sample pairs composed of clean speech samples and noisy speech samples; a sample processing module, which preprocesses the speech samples in the set of speech sample pairs and performs Fourier transform on them to obtain complex spectral matrix pairs; a discriminant model construction module, which constructs a discriminant model, estimates the complex spectral matrix of clean speech, and adjusts the model parameters based on the difference between the estimated complex spectral matrix of clean speech and the complex spectral matrix of real clean speech, the discriminant model with the optimal parameters being taken as the target discriminant model when the model meets the preset training requirements; a score model construction module, which constructs a score model, estimates the inverse optimal offset score during the reverse generation process, and adjusts the model parameters based on the difference between the estimated inverse optimal offset score and the true inverse optimal offset score, the score model corresponding to the optimal parameters being taken as the target score model when the model meets the preset training requirements; and a clean speech generation module, which, for given noisy speech, performs the reverse generation process using the inverse optimal offset score of the mean-inverted Schrödinger bridge, as parameterized by the trained target score model, to generate clean speech.