Voice processing model training method, voice processing method, and related devices

By co-training two speech processing models, the network parameters of the first model are optimized under both the higher-order feedback of the second model and the supervision of clean labels. This addresses insufficient speech enhancement in complex noisy environments and the difficulty of deployment on resource-constrained devices, and achieves efficient, robust speech denoising.

CN122201260A · Pending · Publication Date: 2026-06-12 · XIAOMI EV TECH CO LTD +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XIAOMI EV TECH CO LTD
Filing Date: 2026-03-18
Publication Date: 2026-06-12

IPC: G10L15/06; G10L15/18; G10L21/0208

AI Tagging

Application Domain

Speech recognition

AI Technical Summary

Technical Problem

Existing speech enhancement techniques have limited effectiveness in complex noisy environments, and deep learning models are difficult to deploy efficiently on resource-constrained devices, which degrades the accuracy of speech detection and recognition.

Method used

The first and second speech processing models are co-trained. The second model supplies higher-order feature feedback and the clean speech labels supply direct supervision. Together they form a dual-loss closed loop that optimizes the network parameters of the first model, yielding lightweight and efficient noise reduction.

Benefits of technology

The method improves the performance and robustness of speech enhancement, reduces computational resource consumption, improves model inference efficiency, and supports real-time speech processing on resource-constrained devices.

✦ Generated by Eureka AI based on patent content.

Abstract

The present disclosure provides a speech processing model training method, a speech processing method, and related devices, relating to fields such as data processing technology and artificial intelligence technology. The training method comprises: performing collaborative training on a first speech processing model and a second speech processing model, wherein the input data of the second speech processing model comprises the output data of the first speech processing model; and, in response to completion of model training, using the trained first speech processing model as the target speech processing model for speech enhancement processing of noisy speech data. Through collaborative training, the second speech processing model can guide the first speech processing model to learn better speech features, which effectively improves the speech enhancement performance of the first speech processing model and significantly improves training efficiency. After training is completed, only the first speech processing model is deployed, which both reduces the consumption of computing resources and improves the inference efficiency of the model.

Description

Technical Field

[0001] This disclosure relates to fields such as data processing technology and artificial intelligence technology, and in particular to a training method for a speech processing model, a speech processing method, and related apparatus.

Background Technology

[0002] In related technologies, speech enhancement is one of the core problems in the field of speech processing, with the goal of efficiently extracting the target speech signal from a signal interfered with by noise. Noise interference can significantly reduce the accuracy of speech detection, speech recognition, and related tasks. Therefore, how to reduce the impact of noise on subsequent speech processing tasks is an important research topic in the field.

Summary of the Invention

[0003] This disclosure provides a training method for a speech processing model, a speech processing method, and related apparatus to improve the training efficiency and processing effect of speech enhancement.

[0004] According to a first aspect of the present disclosure, a method for training a speech processing model is provided, comprising: performing collaborative training on a first speech processing model and a second speech processing model, wherein the input data of the second speech processing model includes the output data of the first speech processing model, that is, the input data of the second speech processing model is obtained at least from the output data of the first speech processing model; and, in response to the completion of model training, using the trained first speech processing model as the target speech processing model to perform speech enhancement processing on noisy speech data.

[0005] This disclosure utilizes collaborative training between models, enabling the second speech processing model to guide the first speech processing model in learning superior speech features. This effectively improves the speech enhancement performance of the first speech processing model and significantly enhances training efficiency. After training, deploying only the first speech processing model reduces computational resource consumption and improves inference efficiency.

[0006] In some possible implementations, the input data of the second speech processing model is obtained by fusing the output data of the first speech processing model with the input data of the first speech processing model.

[0007] This disclosure integrates the input and output data of the first speech processing model as the input of the second speech processing model, providing the second speech processing model with complete information about the first speech processing model before and after enhancement processing. This enables the second speech processing model to more accurately evaluate the enhancement quality and distortion level of the first speech processing model, thereby providing higher quality and more targeted gradient feedback. This effectively guides the first speech processing model to find the optimal balance between noise suppression and speech detail preservation, significantly improving the speech enhancement performance and robustness of the first speech processing model.

[0008] In some possible implementations, collaboratively training the first speech processing model and the second speech processing model includes: obtaining sample noisy speech data and the clean speech label corresponding to the sample noisy speech data; performing speech enhancement processing on the sample noisy speech data using the first speech processing model to be trained, to obtain a first processing result; obtaining a second processing result using the second speech processing model to be trained, based on the sample noisy speech data and the first processing result; and adjusting the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label.

[0009] This disclosure uses sample noisy speech data and the preliminary enhancement results of the first speech processing model as input to the second speech processing model. The high-order feature feedback provided by the second speech processing model and the low-level supervision provided by the clean speech labels can constrain the adjustment of the network parameters of the first speech processing model. This multi-dimensional constraint enables the first speech processing model to better preserve the naturalness and detail fidelity of speech while learning how to suppress noise, effectively avoiding overfitting or distortion that is prone to occur in traditional single-model training. Thus, while ensuring the lightweight nature of the model, it significantly improves the perceptual quality and robustness of the enhanced speech.

[0010] In some possible implementations, obtaining the second processing result based on the sample noisy speech data and the first processing result includes: performing speech enhancement processing on the first processing result based on the sample noisy speech data, to obtain an intermediate processing result; and performing a residual connection between the first processing result and the intermediate processing result, to obtain the second processing result.

[0011] This disclosure enables a second speech processing model to focus on learning the residual mapping between the first processing result and the intermediate processing result by performing a residual connection between them. The model then uses raw noisy speech data as a conditional guide for refined enhancement. This not only reduces the learning difficulty of the second model, allowing it to capture details and residual noise more efficiently, but also provides the first model with more accurate and informative supervision signals by integrating the initial results with the refined corrections. Ultimately, this guides the first model to retain more speech details while maintaining a lightweight design, significantly improving the naturalness and robustness of the enhanced speech.
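
The residual connection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: signals are assumed to be plain lists of samples, and `refine` and `toy_refine` are hypothetical stand-ins for the second model's enhancement network, whose architecture the disclosure does not specify.

```python
from typing import Callable, List

Signal = List[float]


def second_model_forward(
    noisy: Signal,
    first_result: Signal,
    refine: Callable[[Signal, Signal], Signal],
) -> Signal:
    """Return first_result + refine(noisy, first_result), sample by sample."""
    # Intermediate processing result: enhancement of the first result,
    # conditioned on the original noisy input.
    intermediate = refine(noisy, first_result)
    if len(intermediate) != len(first_result):
        raise ValueError("refine must preserve the signal length")
    # Residual connection: the network only has to learn a correction
    # on top of the first processing result.
    return [y1 + r for y1, r in zip(first_result, intermediate)]


def toy_refine(noisy: Signal, first_result: Signal) -> Signal:
    """Hypothetical stand-in: half of the difference between input and first result."""
    return [0.5 * (x - y1) for x, y1 in zip(noisy, first_result)]


noisy = [1.0, 2.0, 3.0]
first_result = [0.0, 2.0, 4.0]
second_result = second_model_forward(noisy, first_result, toy_refine)
```

Because the second result is the first result plus a correction, a second model that outputs zeros reproduces the first result unchanged, which is what makes the residual mapping easier to learn than the full mapping.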

[0012] In some possible implementations, adjusting the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label includes: determining a first loss value between the first processing result and the clean speech label, and a second loss value between the second processing result and the clean speech label; determining a fusion loss value based on the first loss value and the second loss value; and adjusting the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value.

[0013] This disclosure constructs a dual-supervised collaborative training loop by separately calculating the losses of the first and second processing results with the clean speech labels, and then fusing them into a total loss value to jointly optimize the parameters of the two models. This ensures that the first speech processing model is not only directly constrained by the labels during training but also by higher-order feedback constraints from the second speech processing model, making its output more optimizable while maintaining low error. Simultaneously, the capabilities of the second speech processing model are continuously improved through joint optimization. The two mutually reinforce each other, ultimately enhancing the denoising accuracy and generalization ability of the lightweight first speech processing model.
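
The disclosure does not state how the two loss values are fused. One common choice, assumed here purely for illustration, is a weighted sum. The symbols are not from the original text: $x$ is the sample noisy speech, $s$ the clean speech label, $f_{\theta_1}$ and $g_{\theta_2}$ the first and second models, $\ell$ any distance between signals (for example mean squared error), and $\alpha, \beta$ hypothetical weights.

```latex
\begin{aligned}
\hat{s}_1 &= f_{\theta_1}(x), \qquad \hat{s}_2 = g_{\theta_2}(x, \hat{s}_1), \\
L_1 &= \ell(\hat{s}_1, s), \qquad L_2 = \ell(\hat{s}_2, s), \\
L_{\text{fusion}} &= \alpha L_1 + \beta L_2, \qquad \alpha, \beta \ge 0 .
\end{aligned}
```

Under this assumption, $L_1$ carries the direct label supervision of the first model, while $L_2$ depends on $\theta_1$ only through $\hat{s}_1$, which is how the second model's feedback reaches the first.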

[0014] In some possible implementations, adjusting the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value includes: backpropagating the fusion loss value to the second speech processing model to update the network parameters of the second speech processing model and the network parameters of the first speech processing model; and backpropagating the fusion loss value to the first speech processing model to update the network parameters of the first speech processing model.

[0015] This disclosure achieves differentiated collaborative updates of two models through backpropagation via dual paths, enabling gradient information to flow fully between the two models and forming an efficient bidirectional promotion loop. This allows the lightweight first speech processing model to achieve performance limits and generalization capabilities exceeding those of single-model training in joint optimization.
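
To make the two gradient paths concrete, here is a toy sketch with hand-written gradients. Everything in it is an assumption for illustration: each model is reduced to a single scalar parameter (`a` for the first model, `b` for the second), the second model uses the residual form `y2 = y1 + b * (x - y1)`, the losses are mean squared errors, and the fusion loss is the weighted sum `alpha * L1 + beta * L2`. In practice an automatic differentiation framework would compute these gradients.

```python
from typing import List, Tuple


def fusion_step(
    a: float, b: float,
    noisy: List[float], clean: List[float],
    alpha: float = 1.0, beta: float = 1.0, lr: float = 0.05,
) -> Tuple[float, float, float]:
    """One gradient-descent step on the fusion loss; returns (a, b, loss before update)."""
    n = len(noisy)
    loss = grad_a_direct = grad_a_via_second = grad_b = 0.0
    for x, s in zip(noisy, clean):
        y1 = a * x                  # first processing result
        y2 = y1 + b * (x - y1)      # second processing result (residual form)
        e1, e2 = y1 - s, y2 - s
        loss += (alpha * e1 * e1 + beta * e2 * e2) / n
        # Path 1: fusion loss -> first loss -> first model.
        grad_a_direct += alpha * 2.0 * e1 * x / n
        # Path 2: fusion loss -> second loss -> second model -> first model.
        # d(y2)/d(y1) = 1 - b and d(y1)/d(a) = x.
        grad_a_via_second += beta * 2.0 * e2 * (1.0 - b) * x / n
        # Second model's own parameter: d(y2)/d(b) = x - y1.
        grad_b += beta * 2.0 * e2 * (x - y1) / n
    new_a = a - lr * (grad_a_direct + grad_a_via_second)
    new_b = b - lr * grad_b
    return new_a, new_b, loss


# Toy data: the noisy signal is the clean one scaled by 2, so a = 0.5 is ideal.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [2.0, -2.0, 2.0, -2.0]

a, b = 1.0, 0.0
history = []
for _ in range(200):
    a, b, loss = fusion_step(a, b, noisy, clean)
    history.append(loss)
```

The first model's parameter receives the sum of both paths, whereas the second model's parameter is updated only through the second loss. After training, only `a` would be kept for deployment.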

[0016] In some possible implementations, the number of parameters of the first speech processing model is less than or equal to the number of parameters of the second speech processing model; and / or, the computational complexity of the first speech processing model is less than or equal to the computational complexity of the second speech processing model.

[0017] This disclosure enables a lightweight first speech processing model by leveraging the powerful speech processing capabilities of a complex second speech processing model, thereby significantly improving the noise reduction effect of the first speech processing model while ensuring inference efficiency.

[0018] According to a second aspect of the present disclosure, a speech processing method is provided, comprising: acquiring noisy speech data to be processed; and performing speech enhancement processing on the noisy speech data using a target speech processing model to obtain target speech data, wherein the target speech processing model is trained using the training method of the speech processing model described in the first aspect of this disclosure.

[0019] The target speech processing model trained by the speech processing model training method described herein processes the noisy speech data to be processed. Since the second speech processing model guides the first speech processing model to learn better speech features during training, the speech enhancement processing performance of the first speech processing model is significantly improved, enhancing the denoising effect on noisy speech data. At the same time, deploying only the first speech processing model not only reduces the consumption of computing resources, but also improves the inference efficiency of the model.

[0020] According to a third aspect of the present disclosure, a training apparatus for a speech processing model is provided, comprising: a training module configured to perform collaborative training on a first speech processing model and a second speech processing model, wherein the input data of the second speech processing model includes the output data of the first speech processing model; and a response module configured to, in response to the completion of model training, use the trained first speech processing model as the target speech processing model to perform speech enhancement processing on noisy speech data.

[0021] In some possible implementations, the training module is further configured to: obtain sample noisy speech data and the clean speech label corresponding to the sample noisy speech data; perform speech enhancement processing on the sample noisy speech data using the first speech processing model to be trained, to obtain a first processing result; obtain a second processing result using the second speech processing model to be trained, based on the sample noisy speech data and the first processing result; and adjust the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label.

[0022] In some possible implementations, the training module is further configured to: perform speech enhancement processing on the first processing result based on the sample noisy speech data, to obtain an intermediate processing result; and perform a residual connection between the first processing result and the intermediate processing result, to obtain the second processing result.

[0023] According to a fourth aspect of the present disclosure, a speech processing apparatus is provided, comprising: an acquisition module configured to acquire noisy speech data to be processed; and a processing module configured to perform speech enhancement processing on the noisy speech data using a target speech processing model to obtain target speech data, wherein the target speech processing model is trained using the training method for the speech processing model described in the first aspect of this disclosure.

[0024] According to a fifth aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the training method of the speech processing model described in the first aspect of the present disclosure or the speech processing method described in the second aspect.

[0025] According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the training method of the speech processing model described in the first aspect of the present disclosure or the speech processing method described in the second aspect.

[0026] According to a seventh aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the training method of the speech processing model described in the first aspect of the present disclosure or the speech processing method described in the second aspect.

[0027] The technical solutions provided by the embodiments of this disclosure may include the following beneficial effects: This disclosure involves co-training a first speech processing model and a second speech processing model. The input data of the second speech processing model includes the output data of the first speech processing model. Upon completion of model training, the trained first speech processing model is used as the target speech processing model for speech enhancement processing of noisy speech data. Thus, through co-training, the second speech processing model can guide the first speech processing model to learn better speech features, effectively improving the speech enhancement performance of the first speech processing model and significantly increasing training efficiency. After training, deploying only the first speech processing model not only reduces computational resource consumption but also improves the model's inference efficiency.

[0028] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0029] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.

[0030] Figure 1 is a flowchart illustrating a training method for a speech processing model according to an exemplary embodiment.

[0031] Figure 2 is a flowchart illustrating a speech processing method according to an exemplary embodiment.

[0032] Figure 3 is a flowchart illustrating a training method for a speech processing model according to an exemplary embodiment.

[0033] Figure 4 is a flowchart illustrating a speech processing method according to an exemplary embodiment.

[0034] Figure 5 is a block diagram illustrating a training apparatus for a speech processing model according to an exemplary embodiment.

[0035] Figure 6 is a block diagram illustrating a voice processing apparatus according to an exemplary embodiment.

[0036] Figure 7 is a block diagram illustrating an electronic device according to an exemplary embodiment.

[0037] Figure 8 is a block diagram illustrating an apparatus for implementing speech processing according to an exemplary embodiment.

Detailed Implementation

[0038] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.

[0039] Traditional speech enhancement methods are based on digital signal processing techniques, including spectral subtraction, filter design, statistical model-based techniques, and subspace methods. Most of these methods rely on specific assumptions, such as that speech is uncorrelated with noise or that noise follows a Gaussian distribution. However, these assumptions often fail in complex, non-stationary noise environments, resulting in limited effectiveness of traditional methods in practical applications.

[0040] With the rapid development of artificial intelligence technology, neural networks, with their powerful nonlinear modeling capabilities, have provided new solutions for speech denoising. A series of deep learning-based speech enhancement models have been proposed, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs, for example SEGAN). These techniques have shown significant performance improvements in speech denoising tasks, and many research results have gained widespread attention.

[0041] While deep learning-based speech enhancement technologies have demonstrated superiority, their high-performance models often come with a large number of parameters and computational requirements, making them difficult to deploy efficiently on resource-constrained edge devices. This is especially true in real-time speech enhancement scenarios, where systems need to complete computational processing within extremely short latency periods, placing extremely high demands on model efficiency and lightweight design.

[0042] Referring to Figure 1, which is a flowchart illustrating a training method for a speech processing model according to an exemplary embodiment, the training method for the speech processing model includes the following steps.

[0043] In step S101, the first speech processing model and the second speech processing model are trained together; the input data of the second speech processing model includes the output data of the first speech processing model.

[0044] In step S102, in response to the completion of model training, the trained first speech processing model is used as the target speech processing model to perform speech enhancement processing on noisy speech data.

[0045] For example, the first speech processing model and the second speech processing model can be the same model or different models. When the first speech processing model and the second speech processing model are different, the second speech processing model can be more complex than the first speech processing model, or the second speech processing model can be similar in complexity to the first speech processing model.

[0046] For example, both the first and second speech processing models can employ any of the following: convolutional neural networks, recurrent neural networks, and generative adversarial networks.

[0047] For example, the first speech processing model can use any of the following: convolutional neural network, convolutional recurrent network, recurrent neural network, etc.; the second speech processing model can use any of the following: large model based on Transformer, generative adversarial network, diffusion model, etc.

[0048] Collaborative training, also known as joint training, trains the first speech processing model and the second speech processing model synchronously. For example, collaborative training can use a sample training set to optimize the parameters of both models simultaneously. During training, information is exchanged bidirectionally between the first and second speech processing models, and their parameter adjustments are synchronized.

[0049] Here, the input data of the second speech processing model includes the output data of the first speech processing model. This indicates that the first speech processing model can perform preliminary processing on the input raw noisy speech data, and the second speech processing model further processes the data after the preliminary processing by the first speech processing model. Thus, the final processing result of the second speech processing model is better than the preliminary processing result of the first speech processing model. During backpropagation based on labels, the first speech processing model can learn the difference information between the labels and the preliminary processing result, as well as the difference information between the final processing result and the preliminary processing result, significantly improving the speech enhancement processing performance of the first speech processing model.

[0050] For example, during training, a first speech processing model can be used to perform speech enhancement processing on noisy speech data, and a second speech processing model can be used to further enhance the data processed by the first speech processing model. Specifically, the second speech processing model can further enhance the data processed by the first speech processing model based on the original noisy speech data.

[0051] This solution trains the first and second speech processing models collaboratively rather than training either model alone. The two models exchange information bidirectionally and dynamically during training, resulting in more comprehensive knowledge transfer. This improves the noise reduction capability of the first speech processing model and enhances training efficiency.

[0052] Model training is considered complete when the training process reaches a preset stopping condition. For instance, training can be determined to be complete when at least one of the following conditions is met: the loss function of the first speech processing model converges; the fusion loss function of the first and second speech processing models converges; a preset number of iterations is reached; or the evaluation metric of the first speech processing model on the validation set reaches the expected threshold.
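
The stopping conditions listed above could be checked as in the following sketch. The names `StopConfig`, `has_converged`, and `training_complete` are hypothetical, and two details are assumptions the disclosure leaves open: "convergence" is taken to mean that the last few loss changes all fall below a tolerance, and the validation metric is assumed to be higher-is-better.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StopConfig:
    max_iterations: int = 10_000           # preset number of iterations
    tol: float = 1e-4                      # assumed convergence tolerance
    patience: int = 5                      # assumed number of consecutive small changes
    target_metric: Optional[float] = None  # expected validation threshold, if any


def has_converged(losses: Sequence[float], tol: float, patience: int) -> bool:
    """Assumed criterion: the last `patience` loss changes are all below `tol`."""
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    return all(abs(later - earlier) < tol for earlier, later in zip(recent, recent[1:]))


def training_complete(
    iteration: int,
    first_losses: Sequence[float],
    fusion_losses: Sequence[float],
    val_metric: Optional[float],
    cfg: StopConfig,
) -> bool:
    """True when at least one of the four stopping conditions is met."""
    metric_reached = (
        cfg.target_metric is not None
        and val_metric is not None
        and val_metric >= cfg.target_metric  # assumes higher is better
    )
    return (
        has_converged(first_losses, cfg.tol, cfg.patience)
        or has_converged(fusion_losses, cfg.tol, cfg.patience)
        or iteration >= cfg.max_iterations
        or metric_reached
    )
```

A training loop would append the first-model loss and the fusion loss after each iteration and call `training_complete` to decide whether to stop.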

[0053] For example, once training is complete, the first speech processing model can be deployed as the target speech processing model to the actual operating environment. When in use, the first speech processing model runs without the need for the second speech processing model, thus ensuring inference speed.

[0054] For example, the first and second speech processing models can be trained collaboratively on electronic devices with abundant computing resources, such as servers or computers. The trained first speech processing model can then be deployed on electronic devices with abundant resources or on electronic devices with limited resources. The latter include smart IoT devices such as smart speakers, smart home central controllers, and in-vehicle voice assistants; wearable devices such as smartwatches and smart glasses; and edge computing devices such as speech processing units deployed at the network edge. It can also be deployed on electronic devices with high requirements for real-time speech processing, such as smartphones, Bluetooth headsets, video conferencing systems, and walkie-talkies.

[0055] This disclosure involves co-training a first speech processing model and a second speech processing model. The input data of the second speech processing model includes the output data of the first speech processing model. Upon completion of model training, the trained first speech processing model is used as the target speech processing model for speech enhancement processing of noisy speech data. Thus, through co-training, the second speech processing model can guide the first speech processing model to learn better speech features, effectively improving the speech enhancement performance of the first speech processing model and significantly increasing training efficiency. After training, deploying only the first speech processing model not only reduces computational resource consumption but also improves the model's inference efficiency.

[0056] In some possible implementations, the number of parameters of the first speech processing model is less than or equal to the number of parameters of the second speech processing model; and / or, the computational complexity of the first speech processing model is less than or equal to the computational complexity of the second speech processing model.

[0057] For example, the number of parameters and computational cost are core metrics for measuring model capacity and complexity. Parameters can include those in neural network layers, fully connected layers, and bias terms, such as each weight value in the convolutional kernel of a convolutional layer, the connection weights between neurons, and bias parameters. Computational cost reflects the hardware computing resources consumed by the model during runtime and can be measured by the number of floating-point operations required for one forward inference operation.
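
As a rough illustration of these two metrics, the sketch below counts parameters and forward-pass floating-point operations for 1-D convolutional and fully connected layers. The layer shapes and the `small`/`large` models are hypothetical, and the FLOP figure uses a common but not universal convention: each multiply-accumulate counts as two operations, and bias additions and activations are ignored.

```python
from typing import Iterable, NamedTuple


class LayerCost(NamedTuple):
    params: int
    flops: int


def conv1d_cost(in_channels: int, out_channels: int, kernel_size: int,
                out_length: int, bias: bool = True) -> LayerCost:
    """Cost of a 1-D convolution producing `out_length` output frames."""
    weights = out_channels * in_channels * kernel_size
    params = weights + (out_channels if bias else 0)
    macs = weights * out_length  # each kernel weight is used once per output frame
    return LayerCost(params, 2 * macs)


def linear_cost(in_features: int, out_features: int, bias: bool = True) -> LayerCost:
    """Cost of a fully connected layer applied to one input vector."""
    weights = in_features * out_features
    params = weights + (out_features if bias else 0)
    return LayerCost(params, 2 * weights)


def total_cost(layers: Iterable[LayerCost]) -> LayerCost:
    layers = list(layers)
    return LayerCost(sum(l.params for l in layers), sum(l.flops for l in layers))


# Hypothetical lightweight first model and larger second model.
small = total_cost([conv1d_cost(1, 16, 3, 100), linear_cost(16, 1)])
large = total_cost([
    conv1d_cost(1, 64, 3, 100),
    conv1d_cost(64, 64, 3, 100),
    linear_cost(64, 1),
])
```

Comparing `small` and `large` field by field gives the "less than or equal" relations between the two models described in paragraph [0056].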

[0058] Here, four situations are possible: (1) the two models have the same number of parameters and the same computational cost; (2) the first speech processing model has fewer parameters than the second speech processing model, and the two have the same computational cost; (3) the two models have the same number of parameters, and the first speech processing model has a lower computational cost than the second speech processing model; (4) the first speech processing model has both fewer parameters and a lower computational cost than the second speech processing model.

[0059] Understandably, the larger the number of parameters or the greater the computational load, the more complex the structure and the stronger the expressive power of the second speech processing model. However, the second speech processing model has higher requirements for memory and computing resources, so the choice can be made according to actual needs.

[0060] Here, the first speech processing model, after training, is suitable for running in resource-constrained environments, with fast response speed and low power consumption. The second speech processing model has a large number of parameters and strong computing power, and can capture more subtle distortions, residual noise, or unnatural details in the output of the first speech processing model and feed them back to the first speech processing model.

[0061] This disclosure enables a lightweight first speech processing model by leveraging the powerful speech processing capabilities of a complex second speech processing model, thereby significantly improving the noise reduction effect of the first speech processing model while ensuring inference efficiency.

[0062] In some possible implementations, the input data of the second speech processing model is obtained by fusing the output data of the first speech processing model with the input data of the first speech processing model.

[0063] For example, the input data of the first speech processing model is noisy speech samples during training, and the output data of the first speech processing model is the preliminary enhancement result obtained after processing the original noisy speech samples. The output data of the first speech processing model and the input data of the first speech processing model are fused together and used as the input of the second speech processing model.

[0064] For example, the input and output data of the first speech processing model can be fused in any of the following ways: concatenating the input and output data in the channel dimension; concatenating the speech waveform corresponding to the output data and the speech waveform corresponding to the input data; adding or multiplying the feature representations of the output and input data element by element; and calculating the residual features between the input and output data.
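The fusion options listed above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the [channel][frame] nested-list layout, the function names, and the sign convention of the residual (input minus output) are assumptions made for the example. Waveform concatenation is omitted because the text does not specify the axis.

```python
from typing import List

Feature = List[List[float]]  # assumed layout: [channel][frame]


def _check_same_shape(noisy: Feature, enhanced: Feature) -> None:
    """Element-wise fusion needs identical shapes."""
    if len(noisy) != len(enhanced) or any(
        len(a) != len(b) for a, b in zip(noisy, enhanced)
    ):
        raise ValueError("input and output features must have the same shape")


def fuse_concat_channels(noisy: Feature, enhanced: Feature) -> Feature:
    """Stack the first model's input and output along the channel dimension."""
    return [list(ch) for ch in noisy] + [list(ch) for ch in enhanced]


def fuse_elementwise(noisy: Feature, enhanced: Feature, op: str = "add") -> Feature:
    """Add or multiply the two feature maps element by element."""
    if op not in ("add", "multiply"):
        raise ValueError("op must be 'add' or 'multiply'")
    _check_same_shape(noisy, enhanced)
    if op == "add":
        return [[a + b for a, b in zip(n, e)] for n, e in zip(noisy, enhanced)]
    return [[a * b for a, b in zip(n, e)] for n, e in zip(noisy, enhanced)]


def fuse_residual(noisy: Feature, enhanced: Feature) -> Feature:
    """Residual feature, taken here as input minus output (an assumed convention)."""
    _check_same_shape(noisy, enhanced)
    return [[a - b for a, b in zip(n, e)] for n, e in zip(noisy, enhanced)]
```

Channel concatenation keeps both signals intact and lets the second model decide how to combine them, whereas the element-wise and residual variants keep the input width of the second model unchanged.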

[0065] This disclosure uses sample noisy speech data and the preliminary enhancement results of the first speech processing model as input to the second speech processing model. The high-order feature feedback provided by the second speech processing model and the low-level supervision provided by the clean speech labels can constrain the adjustment of the network parameters of the first speech processing model. This multi-dimensional constraint enables the first speech processing model to better preserve the naturalness and detail fidelity of speech while learning how to suppress noise, effectively avoiding overfitting or distortion that is prone to occur in traditional single-model training. Thus, while ensuring the lightweight nature of the model, it significantly improves the perceptual quality and robustness of the enhanced speech.

[0066] In some possible implementations, collaboratively training the first speech processing model and the second speech processing model includes: obtaining sample noisy speech data and the clean speech label corresponding to the sample noisy speech data; performing speech enhancement processing on the sample noisy speech data using the first speech processing model to be trained to obtain a first processing result; obtaining a second processing result using the second speech processing model to be trained, based on the sample noisy speech data and the first processing result; and adjusting the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label.

[0067] For example, sample noisy speech data is the input data used during the training phase. Sample noisy speech data can be actual, noisy speech collected in real-world scenarios. It can also be artificially synthesized speech data, such as sample noisy speech data obtained by mixing clean speech with various types of noise at a certain signal-to-noise ratio. Clean speech labels are ideal speech data corresponding to the sample noisy speech data, containing no noise or distortion, and can be used as a standard to measure the quality of the model's output.
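As an illustration of the synthetic case, the sketch below mixes a clean signal with noise at a chosen signal-to-noise ratio. The function names and the power-based SNR definition in decibels are assumptions for the example; the disclosure does not prescribe a mixing procedure.

```python
import math
from typing import List


def mean_power(signal: List[float]) -> float:
    """Average power of a signal (mean of squared samples)."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale the noise so that clean/noise power equals snr_db, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # SNR(dB) = 10 * log10(P_clean / P_noise), solved for the target noise power.
    target_noise_power = mean_power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [c + scale * n for c, n in zip(clean, noise)]
```

The returned signal would serve as the sample noisy speech data, with the unmodified clean signal kept as its clean speech label.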

[0068] For example, the first speech processing model performs speech enhancement processing on the sample noisy speech data to obtain a first processing result. The speech data in the first processing result may have had some noise removed, but compared with the clean speech label, there may be significant residual noise or speech distortion.

[0069] For example, the second speech processing model takes the sample noisy speech data and the first processing result as input and performs further speech enhancement processing on the first processing result to obtain the second processing result. The speech data in the second processing result has better speech quality than that in the first processing result, but may still contain small amounts of residual noise or speech distortion compared with the clean speech label.

[0070] For example, the loss can be calculated using the first processing result, the second processing result, and the clean speech label. Then, all weights and biases in the two models can be fine-tuned using a preset optimization algorithm so that the output of the first speech processing model and / or the second speech processing model is closer to the clean speech label in the next iteration.

[0071] This disclosure constructs a dual supervision mechanism for the first speech processing model, in which the high-order feature feedback provided by the second speech processing model and the low-level supervision provided by the clean speech label are combined. This multi-dimensional constraint enables the first speech processing model to better preserve the naturalness and detail fidelity of speech while suppressing noise, effectively avoiding overfitting or distortion that is prone to occur in traditional single-model training. Thus, while ensuring the lightweight nature of the model, it significantly improves the perceptual quality and robustness of the enhanced speech.

[0072] In some possible implementations, adjusting the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label includes: determining a first loss value between the first processing result and the clean speech label, and a second loss value between the second processing result and the clean speech label; determining a fusion loss value based on the first loss value and the second loss value; and adjusting the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value.

[0073] For example, a first loss value can be used to measure the difference between the first processing result and the clean speech label, and a second loss value can be used to measure the difference between the second processing result and the clean speech label. The second processing result is obtained through sample noisy speech data and the first processing result. The second loss value not only reflects the speech processing effect of the second speech processing model, but also indirectly reflects the difference between the first processing result and the clean speech label.

[0074] For example, a first loss value can be determined using the first processing result, the clean speech label, and the first loss function, and a second loss value can be determined using the second processing result, the clean speech label, and the second loss function.

[0075] Here, the first loss function and the second loss function can be the same loss function or different loss functions, such as mean square error, scale-invariant signal-to-noise ratio (SI-SNR), spectral amplitude loss, etc.
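Two of the loss functions named above can be sketched as follows. This is a minimal time-domain illustration under stated assumptions: the function names, the zero-mean normalization, and the small stabilizing constant eps are choices made for the example, and SI-SNR is negated so that a lower value means a better estimate.

```python
import math
from typing import List


def mse_loss(estimate: List[float], target: List[float]) -> float:
    """Mean squared error between an estimate and the clean label."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def si_snr_loss(estimate: List[float], target: List[float], eps: float = 1e-8) -> float:
    """Negative scale-invariant SNR in dB (lower is better)."""
    e_mean = sum(estimate) / len(estimate)
    t_mean = sum(target) / len(target)
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]
    # Project the estimate onto the target to isolate the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    projection = [dot / t_energy * b for b in t]
    residual = [a - p for a, p in zip(e, projection)]
    ratio = (sum(p * p for p in projection) + eps) / (sum(r * r for r in residual) + eps)
    return -10.0 * math.log10(ratio)
```

Because SI-SNR ignores the overall gain of the estimate while mean squared error does not, using different functions for the first and second loss values changes what each supervision signal penalizes.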

[0076] For example, a fusion loss value can be obtained by weighted summing of the first and second loss values. This fusion loss value is then backpropagated, simultaneously adjusting the parameters of both models. The parameter updates of the first speech processing model are influenced by both the first and second loss values. The parameter updates of the second model are primarily influenced by the second loss value.

[0077] This disclosure constructs a dual-supervised collaborative training loop by separately calculating the losses of the first and second processing results with the clean speech labels, and then fusing them into a total loss value to jointly optimize the parameters of the two models. This ensures that the first speech processing model is not only directly constrained by the labels during training but also by higher-order feedback constraints from the second speech processing model, making its output more optimizable while maintaining low error. Simultaneously, the capabilities of the second speech processing model are continuously improved through joint optimization. The two mutually reinforce each other, ultimately enhancing the denoising accuracy and generalization ability of the lightweight first speech processing model.

[0078] In some possible implementations, obtaining the second processing result based on the sample noisy speech data and the first processing result includes: performing speech enhancement processing on the first processing result, based on the sample noisy speech data, to obtain an intermediate processing result; and obtaining the second processing result by performing a residual connection between the first processing result and the intermediate processing result.

[0079] For example, the second speech processing model can use the sample noisy speech data as context to process the first processing result output by the first speech processing model more finely, for example by further suppressing residual noise, repairing distorted harmonic structures, or enhancing speech details. The resulting intermediate processing result captures details or residual noise patterns that the first speech processing model failed to capture.

[0080] For example, the first processing result is directly added to the intermediate processing result to obtain the final second processing result. The first processing result already has a certain enhancement effect and includes the main speech structure, while the intermediate processing result focuses on learning the missing parts or details to be corrected, which reduces the learning difficulty. In this way, both the main speech structure obtained by the first speech processing model from the sample noisy speech data and the fine-tuning supplemented by the second speech processing model on top of the first processing result are preserved.
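The residual connection described above can be sketched in a few lines. The function name and the `refine` callable, which stands in for the body of the second speech processing model, are hypothetical and introduced only for illustration.

```python
from typing import Callable, List

# Hypothetical stand-in for the second model's body:
# (noisy input, first processing result) -> intermediate processing result.
Refiner = Callable[[List[float], List[float]], List[float]]


def second_processing_result(
    noisy: List[float], first_result: List[float], refine: Refiner
) -> List[float]:
    """Add the intermediate correction to the first processing result."""
    intermediate = refine(noisy, first_result)
    if len(intermediate) != len(first_result):
        raise ValueError("intermediate result must match the first result in length")
    return [y1 + delta for y1, delta in zip(first_result, intermediate)]
```

If `refine` outputs zeros, the second processing result equals the first, so the second model only needs to learn the correction rather than the full enhanced signal.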

[0081] By performing a residual connection between the first processing result and the intermediate processing result, this disclosure enables the second speech processing model to focus on learning the residual correction to the first processing result, using the raw noisy speech data as conditional guidance for refined enhancement. This not only reduces the learning difficulty of the second model, allowing it to capture details and residual noise more efficiently, but also provides the first model with more accurate and informative supervision signals by integrating the initial results with the refined corrections. Ultimately, this guides the first model to retain more speech details while maintaining a lightweight design, significantly improving the naturalness and robustness of the enhanced speech.

[0082] In some possible implementations, adjusting the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value includes: backpropagating the fusion loss value to the second speech processing model to update the network parameters of the second speech processing model and the network parameters of the first speech processing model; and backpropagating the fusion loss value to the first speech processing model to update the network parameters of the first speech processing model.

[0083] For example, backpropagation is a core algorithm in deep learning training, providing the optimizer with the direction and magnitude of model parameter updates. The chain rule can be used to calculate the gradient of the loss function with respect to the network parameters of each layer, and the error signal can be propagated layer by layer along the network structure from the output layer to the input layer. Based on the calculated gradients, optimization algorithms can be used to fine-tune the network parameters to update the network parameters of the first speech processing model.

[0084] Here, the fusion loss value is backpropagated to the second speech processing model to calculate the gradient of the parameters of each layer in the second speech processing model in order to update the parameters of the second speech processing model. At the same time, since the input of the second speech processing model includes the first processing result output by the first speech processing model, the gradient will continue to propagate to the first speech processing model, providing the first speech processing model with the high-order gradient signal after being processed by the second speech processing model, thereby realizing indirect supervision of the learning of the first speech processing model.

[0085] Here, the fusion loss value is backpropagated to the first speech processing model. The fusion loss value can be used to directly calculate and update the network parameters of the first speech processing model, thereby achieving direct supervision of the learning of the first speech processing model.

[0086] Here, two paths for gradient backpropagation are provided. The parameter updates of the first speech processing model are influenced by both direct and indirect supervision gradients. These two gradients naturally superimpose during backpropagation, jointly guiding the optimization direction of the first speech processing model. The parameter updates of the second speech processing model primarily rely on the gradient path through its own output.
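The two gradient paths can be written out with the chain rule. The symbols below are assumptions introduced for illustration: L denotes the fusion loss value formed as a weighted sum with weights α and β, θ1 and θ2 denote the network parameters of the first and second speech processing models, X denotes the sample noisy speech data, and Y1 and Y2 denote the first and second processing results, with Y2 depending on both X and Y1.

```latex
L = \alpha\,\mathrm{Loss}_1(Y_1, Y_{\mathrm{clean}}) + \beta\,\mathrm{Loss}_2(Y_2, Y_{\mathrm{clean}}),
\qquad Y_1 = f_1(X;\theta_1), \quad Y_2 = f_2(X, Y_1;\theta_2)

\frac{\partial L}{\partial \theta_1}
  = \underbrace{\alpha\,\frac{\partial \mathrm{Loss}_1}{\partial Y_1}\,
      \frac{\partial Y_1}{\partial \theta_1}}_{\text{direct supervision}}
  + \underbrace{\beta\,\frac{\partial \mathrm{Loss}_2}{\partial Y_2}\,
      \frac{\partial Y_2}{\partial Y_1}\,
      \frac{\partial Y_1}{\partial \theta_1}}_{\text{indirect supervision via the second model}}

\frac{\partial L}{\partial \theta_2}
  = \beta\,\frac{\partial \mathrm{Loss}_2}{\partial Y_2}\,
      \frac{\partial Y_2}{\partial \theta_2}
```

Under these assumptions the first loss term does not depend on θ2, so the second model's update comes entirely from the path through its own output, while the factor ∂Y2/∂Y1 collects both the fusion path and any residual connection from Y1 to Y2.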

[0087] This disclosure achieves differentiated collaborative updates of two models through backpropagation via dual paths, enabling gradient information to flow fully between the two models and forming an efficient bidirectional promotion loop. This allows the lightweight first speech processing model to achieve performance limits and generalization capabilities exceeding those of single-model training in joint optimization.

[0088] Referring to Figures 3 and 4, a training method and an application method for a speech denoising system are illustrated. The speech denoising system disclosed herein includes a training framework for collaborative training and a simplified network for actual inference.

[0089] In some embodiments, the base model of the collaborative training framework may include two neural networks with different parameters and / or sizes: a first speech processing model, which is a small network with a compact structure and low computational cost; and a second speech processing model, which is a large network with a more complex structure and stronger expressive power.

[0090] For example, a noisy speech signal carrying a clean speech label is input into a collaborative training framework. The noisy speech signal is converted into a frequency domain signal through Fourier transform. Frequency domain features, i.e., the aforementioned sample noisy speech data, are obtained through feature extraction. The frequency domain features are input into a first speech processing model to generate a preliminary enhanced first processing result Y1. Channel-dimensional concatenation or other feature fusion methods can be used to fuse the input and output data of the first speech processing model, i.e., the frequency domain features are fused with the first processing result Y1 to obtain fused features, which are then used as input data for a second speech processing model. The second speech processing model performs speech enhancement processing on the fused features to generate a further enhanced second processing result Y2.

[0091] For example, a residual connection from the first processing result Y1 to the second processing result Y2 can be added in the second speech processing model or before the final output of the second speech processing model, so that the second speech processing model learns the residual mapping and focuses on repairing the noise and distortion that the first speech processing model failed to remove, which helps to stabilize training.

[0092] Here, the first and second speech processing models employ a collaborative training mechanism. A first loss value is calculated by comparing the first processing result Y1 (output of the first speech processing model) with the clean speech label, and a second loss value is calculated by comparing the second processing result Y2 (output of the second speech processing model) with the clean speech label. Both speech processing models optimize towards the same goal, ensuring that the second speech processing model provides a better processing result than the first, thereby guiding the first speech processing model to improve its denoising capability through gradient backpropagation.

[0093] For example, the fusion loss value can be calculated based on the fusion loss function, and can be expressed as: Combine-Loss = α × Loss1(Y1, Yclean) + β × Loss2(Y2, Yclean), where Combine-Loss is the fusion loss value; Loss1() is the first loss function, used to calculate the first loss value; Loss2() is the second loss function, used to calculate the second loss value; Yclean is the clean speech label; and α and β are hyperparameters used to balance the strength of the two supervision signals. Loss1() and Loss2() can use the same or different loss functions, such as mean squared error, SI-SNR, spectral amplitude loss, etc. By adjusting α and β, the strength and direction of the large model's guidance on the small model can be controlled.

[0094] For example, during end-to-end backpropagation, the gradient generated by Combine-Loss flows to both the first and second speech processing models simultaneously. The first speech processing model is subject to both direct supervision from Loss1 and indirect supervision from Loss2, transmitted via the second speech processing model and the feature fusion path. The second speech processing model is directly driven by Loss2 to learn how to generate cleaner results based on the output data of the first speech processing model.
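The gradient flow described above can be made concrete with a deliberately tiny example. Everything in it is an assumption chosen for illustration rather than the disclosed architecture: the first model is a single gain w1, the second model is a linear mix of the noisy input and Y1 (weights w2a and w2b) followed by a residual connection, both losses are mean squared error, and the gradients are derived by hand instead of by an automatic differentiation library.

```python
import math
from typing import List, Sequence, Tuple


def forward(params: Sequence[float], noisy: List[float]) -> Tuple[List[float], List[float]]:
    """Toy first model (gain w1) and toy second model (linear fusion plus residual)."""
    w1, w2a, w2b = params
    y1 = [w1 * x for x in noisy]
    intermediate = [w2a * x + w2b * y for x, y in zip(noisy, y1)]
    y2 = [y + r for y, r in zip(y1, intermediate)]  # residual connection from Y1
    return y1, y2


def mse(a: List[float], b: List[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def combine_loss(params, noisy, clean, alpha, beta) -> float:
    """Fusion loss: alpha * Loss1(Y1, Yclean) + beta * Loss2(Y2, Yclean)."""
    y1, y2 = forward(params, noisy)
    return alpha * mse(y1, clean) + beta * mse(y2, clean)


def gradients(params, noisy, clean, alpha, beta) -> Tuple[float, float, float]:
    """Hand-derived gradients of the fusion loss for the toy models."""
    w1, w2a, w2b = params
    y1, y2 = forward(params, noisy)
    n = len(noisy)
    g1 = g2a = g2b = 0.0
    for x, a, b, c in zip(noisy, y1, y2, clean):
        d1 = alpha * 2.0 * (a - c) / n  # dL/dY1 through Loss1 (direct path)
        d2 = beta * 2.0 * (b - c) / n   # dL/dY2 through Loss2
        # First model: direct path plus indirect path, where dY2/dY1 = 1 + w2b.
        g1 += d1 * x + d2 * (1.0 + w2b) * x
        # Second model: only the path through its own output.
        g2a += d2 * x
        g2b += d2 * a
    return g1, g2a, g2b


def train(noisy, clean, alpha=0.5, beta=0.5, lr=0.05, steps=200):
    """Plain gradient descent on all three parameters at once."""
    params = (0.0, 0.0, 0.0)
    for _ in range(steps):
        grads = gradients(params, noisy, clean, alpha, beta)
        params = tuple(p - lr * g for p, g in zip(params, grads))
    return params


def make_toy_data(length: int = 64):
    """Hypothetical data: a sine wave as the label, amplified and perturbed as input."""
    clean = [math.sin(0.3 * t) for t in range(length)]
    noisy = [2.0 * c + 0.2 * math.cos(1.7 * t) for t, c in enumerate(clean)]
    return noisy, clean
```

Calling `train(*make_toy_data())` updates both toy models from the single fusion loss. In a real system an automatic differentiation framework would compute the same two-path gradient for the first model without hand derivation.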

[0095] In some embodiments, after model training is complete, only the first speech processing model can be retained and deployed to the electronic device. In practical applications, noisy speech signals are transformed into frequency domain data through Fourier transform, and then feature extraction is performed on the frequency domain data to obtain frequency domain features, i.e., the aforementioned noisy speech data. The frequency domain features are directly input into the trained first speech processing model to obtain the target processing result, and the target processing result is transformed into the final target speech data through inverse Fourier transform.
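The inference path above (Fourier transform, feature extraction, first model, inverse Fourier transform) can be sketched as follows. This is a simplified illustration under several assumptions not stated in the disclosure: frames are non-overlapping with a rectangular window, the frequency domain feature is the magnitude spectrum, the phase of the noisy input is reused for reconstruction, and `model` is a hypothetical callable standing in for the trained first speech processing model.

```python
import cmath
import math
from typing import Callable, List


def dft(frame: List[float]) -> List[complex]:
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Naive inverse transform, keeping the real part."""
    n = len(spectrum)
    return [
        (sum(v * cmath.exp(2j * math.pi * k * i / n) for k, v in enumerate(spectrum)) / n).real
        for i in range(n)
    ]


def enhance(
    signal: List[float],
    model: Callable[[List[float]], List[float]],
    frame_len: int = 8,
) -> List[float]:
    """Frame the signal, enhance magnitudes with the model, and resynthesize."""
    pad = (-len(signal)) % frame_len
    padded = list(signal) + [0.0] * pad  # zero-pad to a whole number of frames
    output: List[float] = []
    for start in range(0, len(padded), frame_len):
        spectrum = dft(padded[start:start + frame_len])
        magnitude = [abs(v) for v in spectrum]        # frequency domain features
        phase = [cmath.phase(v) for v in spectrum]    # reused noisy phase (assumption)
        enhanced = model(magnitude)                   # trained first model only
        output.extend(idft([cmath.rect(m, p) for m, p in zip(enhanced, phase)]))
    return output[:len(signal)]
```

A production system would more likely use overlapping windowed frames and a fast transform; the sketch only shows that the second speech processing model plays no part at inference time.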

[0096] This disclosure uses the capabilities of the large model during the training phase to strengthen the small model; through the framework design, this gain is ultimately condensed into the parameters of the small model itself, so that inference retains the small model's efficiency while benefiting from the improved performance.

[0097] Figure 2 is a flowchart illustrating a speech processing method according to an exemplary embodiment. As shown in Figure 1, the speech processing method includes the following steps.

[0098] In step S201, the noisy speech data to be processed is acquired.

[0099] In step S202, the noisy speech data is subjected to speech enhancement processing using a target speech processing model to obtain target speech data; wherein, the target speech processing model is trained using the training method of the speech processing model.

[0100] For example, noisy speech data is the raw audio signal containing noise, which may have been mixed in with interference components during acquisition or transmission. These interference components may affect the clarity and comfort of the speech. For example, they may include environmental noise, equipment noise, transmission interference, echo, etc. Target speech data is the clean audio signal after the noisy speech data has been processed by a speech processing model to enhance the speech and remove the noise.

[0101] Here, noisy speech data is input into a trained target speech processing model so that the target speech processing model can perform speech enhancement processing on the noisy speech data, and the resulting target speech data is the denoised speech data.

[0102] The target speech processing model trained by the speech processing model training method described herein processes the noisy speech data to be processed. Since the second speech processing model guides the first speech processing model to learn better speech features during training, the speech enhancement processing performance of the first speech processing model is significantly improved, enhancing the denoising effect on noisy speech data. At the same time, deploying only the first speech processing model not only reduces the consumption of computing resources, but also improves the inference efficiency of the model.

[0103] Figure 5 is a block diagram illustrating a training apparatus 500 for a speech processing model according to an exemplary embodiment. Referring to Figure 5, the training apparatus 500 includes a training module 501 and a response module 502.

[0104] The training module 501 is configured to perform collaborative training on the first speech processing model and the second speech processing model; the input data of the second speech processing model includes the output data of the first speech processing model. The response module 502 is configured to, in response to the completion of model training, use the trained first speech processing model as the target speech processing model for speech enhancement processing of noisy speech data.

[0105] In some possible implementations, the input data of the second speech processing model may also include the input data of the first speech processing model.

[0106] In some possible implementations, the training module 501 is further configured to: obtain sample noisy speech data and the clean speech label corresponding to the sample noisy speech data; perform speech enhancement processing on the sample noisy speech data using the first speech processing model to be trained to obtain a first processing result; obtain a second processing result using the second speech processing model to be trained, based on the sample noisy speech data and the first processing result; and adjust the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label.

[0107] In some possible implementations, the training module 501 is further configured to: perform speech enhancement processing on the first processing result, based on the sample noisy speech data, to obtain an intermediate processing result; and obtain the second processing result by performing a residual connection between the first processing result and the intermediate processing result.

[0108] In some possible implementations, the training module 501 is further configured to: determine a first loss value between the first processing result and the clean speech label, and a second loss value between the second processing result and the clean speech label; determine the fusion loss value based on the first loss value and the second loss value; and adjust the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value.

[0109] In some possible implementations, the training module 501 is further configured to: backpropagate the fusion loss value to the second speech processing model to update the network parameters of the second speech processing model and the network parameters of the first speech processing model; and backpropagate the fusion loss value to the first speech processing model to update the network parameters of the first speech processing model.

[0110] In some possible implementations, the number of parameters of the first speech processing model is less than or equal to the number of parameters of the second speech processing model; and / or, the computational complexity of the first speech processing model is less than or equal to the computational complexity of the second speech processing model.

[0111] Regarding the training device 500 for the speech processing model in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments concerning the training method of the speech processing model, and will not be elaborated upon here.

[0112] Figure 6 is a block diagram illustrating a voice processing apparatus 600 according to an exemplary embodiment. Referring to Figure 6, the voice processing apparatus 600 includes an acquisition module 601 and a processing module 602.

[0113] The acquisition module 601 is configured to acquire noisy speech data to be processed. The processing module 602 is configured to perform speech enhancement processing on the noisy speech data using a target speech processing model to obtain target speech data; wherein the target speech processing model is trained using the speech processing model training method described in this disclosure.

[0114] Regarding the voice processing device 600 in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the voice processing method, and will not be elaborated upon here.

[0115] Based on the same inventive concept, this disclosure also provides an electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the training method of the speech processing model or the speech processing method provided in this disclosure.

[0116] Based on the same inventive concept, this disclosure also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method or speech processing method of the speech processing model provided in this disclosure.

[0117] Based on the same inventive concept, this disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the training method or speech processing method of the speech processing model provided in this disclosure.

[0118] Figure 7 is a block diagram illustrating an electronic device 700 according to an exemplary embodiment. For example, the electronic device 700 may be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, etc.

[0119] As shown in Figure 7, the electronic device 700 may include one or more of the following components: processing component 702, memory 704, power supply component 706, multimedia component 708, audio component 710, input / output interface 712, sensor component 714, and communication component 716.

[0120] Processing component 702 typically controls the overall operation of electronic device 700, such as operations associated with display, telephone calls, data communication, camera operation, and recording. Processing component 702 may include one or more processors 720 to execute instructions to complete all or part of the steps of the aforementioned speech processing model training method or speech processing method. Furthermore, processing component 702 may include one or more modules to facilitate interaction between processing component 702 and other components. For example, processing component 702 may include a multimedia module to facilitate interaction between multimedia component 708 and processing component 702.

[0121] Memory 704 is configured to store various types of data to support the operation of electronic device 700. Examples of this data include instructions for any application or method operating on electronic device 700, contact data, phonebook data, messages, pictures, videos, etc. Memory 704 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0122] Power supply component 706 provides power to various components of electronic device 700. Power supply component 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 700.

[0123] Multimedia component 708 includes a screen that provides an output interface between the electronic device 700 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 708 includes a front-facing camera and / or a rear-facing camera. When the electronic device 700 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

[0124] Audio component 710 is configured to output and / or input audio signals. For example, audio component 710 includes a microphone (MIC) configured to receive external audio signals when electronic device 700 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 704 or transmitted via communication component 716. In some embodiments, audio component 710 also includes a speaker for outputting audio signals.

[0125] Input / output interface 712 provides an interface between processing component 702 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, start buttons, and lock buttons.

[0126] Sensor assembly 714 includes one or more sensors for providing state assessments of various aspects of electronic device 700. For example, sensor assembly 714 can detect the on / off state of electronic device 700, the relative positioning of components such as the display and keypad of electronic device 700, changes in position of electronic device 700 or a component of electronic device 700, the presence or absence of user contact with electronic device 700, orientation or acceleration / deceleration of electronic device 700, and temperature changes of electronic device 700. Sensor assembly 714 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 714 may also include an accelerometer, gyroscope, magnetometer, pressure sensor, or temperature sensor.

[0127] Communication component 716 is configured to facilitate wired or wireless communication between electronic device 700 and other devices. Electronic device 700 can access wireless networks based on communication standards, such as WiFi, 2G, or 3G, or combinations thereof. In one exemplary embodiment, communication component 716 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 716 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[0128] In an exemplary embodiment, the electronic device 700 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the training method for the speech processing model or the speech processing method described above.

[0129] In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as memory 704 including instructions, which can be executed by processor 720 of electronic device 700 to carry out the training method for the speech processing model or the speech processing method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

[0130] In another exemplary embodiment, a computer program product is also provided. The computer program product comprises a computer program executable by a programmable device, and the computer program has a code portion that, when executed by the programmable device, performs the training method for the speech processing model or the speech processing method described above.

[0131] Referring to Figure 8, Figure 8 is a block diagram illustrating a device 800 for implementing speech processing according to an exemplary embodiment. For example, device 800 may be provided as a server. As shown in Figure 8, device 800 includes a processing component 822, which further includes one or more processors, and memory resources represented by memory 832 for storing instructions executable by processing component 822, such as application programs. The application programs stored in memory 832 may include one or more modules, each corresponding to a set of instructions. Furthermore, processing component 822 is configured to execute the instructions to perform the training method for the speech processing model or the speech processing method described above.

[0132] Device 800 may also include a power supply component 828 configured to perform power management of device 800, a wired or wireless network interface 850 configured to connect device 800 to a network, and an input / output interface 858. Device 800 can operate based on an operating system stored in memory 832, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

[0133] Furthermore, the term “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as advantageous compared to other aspects or designs. Rather, the use of the term “exemplary” is intended to present the concept in a concrete manner. As used herein, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless otherwise specified or clear from the context, “X applies A or B” is intended to mean any of the natural inclusive permutations: if X applies A, X applies B, or X applies both A and B, then “X applies A or B” is satisfied under any of the foregoing instances. Additionally, unless otherwise specified or clear from the context to refer to the singular form, the articles “a” and “an” as used in this application and the appended claims are generally understood to mean “one or more.”

[0134] Similarly, although this disclosure has been shown and described with respect to one or more implementations, equivalent variations and modifications will occur to those skilled in the art upon reading and understanding this specification and the accompanying drawings. This disclosure includes all such modifications and variations and is limited only by the scope of the claims. In particular, with respect to the various functions performed by the components described above (e.g., elements, resources, etc.), unless otherwise indicated, the terminology used to describe such components is intended to correspond to any component that performs the specific function of the described component (that is, any functionally equivalent component), even if it is not structurally equivalent to the disclosed structure. Furthermore, although specific features of this disclosure may have been disclosed with respect to only one of several implementations, such features may be combined with one or more other features of other implementations, as may be desired and advantageous for any given or particular application. Moreover, with regard to the terms “comprising,” “possessing,” “having,” or variations thereof as used in the detailed description or claims, such terms are intended to be inclusive in a manner similar to the term “including.”

[0135] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the appended claims.

[0136] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. A training method for a speech processing model, characterized in that the method includes: collaboratively training a first speech processing model and a second speech processing model, wherein the input data of the second speech processing model includes the output data of the first speech processing model; and in response to completion of the model training, using the trained first speech processing model as a target speech processing model for performing speech enhancement processing on noisy speech data.

2. The method according to claim 1, characterized in that the input data of the second speech processing model further includes the input data of the first speech processing model.

3. The method according to claim 1, characterized in that collaboratively training the first speech processing model and the second speech processing model includes: obtaining sample noisy speech data and a clean speech label corresponding to the sample noisy speech data; performing speech enhancement processing on the sample noisy speech data using the first speech processing model to be trained, to obtain a first processing result; obtaining a second processing result using the second speech processing model to be trained, based on the sample noisy speech data and the first processing result; and adjusting the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label.

4. The method according to claim 3, characterized in that obtaining the second processing result based on the sample noisy speech data and the first processing result includes: performing speech enhancement processing on the first processing result based on the sample noisy speech data, to obtain an intermediate processing result; and obtaining the second processing result by performing a residual connection between the first processing result and the intermediate processing result.

5. The method according to claim 3, characterized in that adjusting the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label includes: determining a first loss value between the first processing result and the clean speech label, and a second loss value between the second processing result and the clean speech label; determining a fusion loss value based on the first loss value and the second loss value; and adjusting the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value.

6. The method according to claim 5, characterized in that adjusting the network parameters of the first speech processing model and the second speech processing model based on the fusion loss value includes: backpropagating the fusion loss value to the second speech processing model to update the network parameters of the second speech processing model and the network parameters of the first speech processing model; and backpropagating the fusion loss value to the first speech processing model to update the network parameters of the first speech processing model.

7. The method according to any one of claims 1-6, characterized in that the number of parameters of the first speech processing model is less than or equal to the number of parameters of the second speech processing model; and / or the computational complexity of the first speech processing model is less than or equal to that of the second speech processing model.

8. A speech processing method, characterized in that the method includes: acquiring noisy speech data to be processed; and performing speech enhancement processing on the noisy speech data using a target speech processing model to obtain target speech data; wherein the target speech processing model is trained using the training method for a speech processing model according to any one of claims 1-7.

9. A training apparatus for a speech processing model, characterized in that the apparatus includes: a training module configured to collaboratively train a first speech processing model and a second speech processing model, wherein the input data of the second speech processing model includes the output data of the first speech processing model; and a response module configured to, in response to completion of the model training, use the trained first speech processing model as a target speech processing model for performing speech enhancement processing on noisy speech data.

10. The apparatus according to claim 9, characterized in that the training module is further configured to: obtain sample noisy speech data and a clean speech label corresponding to the sample noisy speech data; perform speech enhancement processing on the sample noisy speech data using the first speech processing model to be trained, to obtain a first processing result; obtain a second processing result using the second speech processing model to be trained, based on the sample noisy speech data and the first processing result; and adjust the network parameters of the first speech processing model and the second speech processing model based on the first processing result, the second processing result, and the clean speech label.

11. The apparatus according to claim 10, characterized in that the training module is further configured to: perform speech enhancement processing on the first processing result based on the sample noisy speech data, to obtain an intermediate processing result; and obtain the second processing result by performing a residual connection between the first processing result and the intermediate processing result.

12. A speech processing apparatus, characterized in that the apparatus includes: an acquisition module configured to acquire noisy speech data to be processed; and a processing module configured to perform speech enhancement processing on the noisy speech data using a target speech processing model to obtain target speech data; wherein the target speech processing model is trained using the training method for a speech processing model according to any one of claims 1-7.

13. An electronic device, characterized in that the electronic device includes: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the training method for a speech processing model according to any one of claims 1-7, or the speech processing method according to claim 8.

14. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the training method for a speech processing model according to any one of claims 1-7, or the speech processing method according to claim 8.

15. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the training method for a speech processing model according to any one of claims 1-7, or the speech processing method according to claim 8.
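
For illustration only, the collaborative training flow recited in claims 3 to 6 and the deployment of claim 8 might be sketched as follows. This is a toy sketch under stated assumptions, not the implementation of this disclosure: each "model" is reduced to scalar gains so that the gradients can be written by hand, the loss is assumed to be mean squared error, and the fusion loss is assumed to be a weighted sum with hypothetical weights `alpha` and `beta`. All function and variable names are hypothetical.

```python
# Toy co-training sketch. Assumptions (not from the disclosure): scalar-gain
# "models", MSE losses, and fusion loss = alpha * loss1 + beta * loss2.

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def forward(params, noisy):
    """Run the cascade: first model, then second model with a residual connection."""
    a, b, c = params
    first = [a * x for x in noisy]                          # first processing result
    inter = [b * y + c * x for y, x in zip(first, noisy)]   # intermediate result (uses both inputs)
    second = [y + m for y, m in zip(first, inter)]          # residual connection
    return first, second

def fusion_loss(params, noisy, clean, alpha=0.5, beta=0.5):
    """Weighted sum of the first and second loss values (assumed form)."""
    first, second = forward(params, noisy)
    return alpha * mse(first, clean) + beta * mse(second, clean)

def train_step(params, noisy, clean, lr=0.1, alpha=0.5, beta=0.5):
    """One gradient-descent step on the fusion loss, with hand-derived gradients."""
    a, b, c = params
    n = len(noisy)
    first, second = forward(params, noisy)
    # Gradient of the fusion loss with respect to the second processing result.
    g_second = [beta * 2.0 * (y2 - s) / n for y2, s in zip(second, clean)]
    # Gradient with respect to the first processing result: the direct first-loss
    # term plus feedback through the second model (d second / d first = 1 + b).
    g_first = [alpha * 2.0 * (y1 - s) / n + g2 * (1.0 + b)
               for y1, s, g2 in zip(first, clean, g_second)]
    grad_a = sum(g1 * x for g1, x in zip(g_first, noisy))    # first-model parameter
    grad_b = sum(g2 * y1 for g2, y1 in zip(g_second, first)) # second-model parameters
    grad_c = sum(g2 * x for g2, x in zip(g_second, noisy))
    return (a - lr * grad_a, b - lr * grad_b, c - lr * grad_c)

# Made-up sample data: a clean speech label and its noisy counterpart.
clean = [0.5, -0.3, 0.8, -0.6, 0.2, -0.1]
noise = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2]
noisy = [s + n for s, n in zip(clean, noise)]

params = (1.0, 0.0, 0.0)   # start with both models acting as pass-through
initial_fusion = fusion_loss(params, noisy, clean)
for _ in range(500):
    params = train_step(params, noisy, clean)
final_fusion = fusion_loss(params, noisy, clean)

# Deployment: only the trained first model is kept as the target model.
enhanced = [params[0] * x for x in noisy]
print("fusion loss:", initial_fusion, "->", final_fusion)
print("first-model error:", mse(noisy, clean), "->", mse(enhanced, clean))
```

In this sketch the first-model parameter `a` receives gradient from both loss terms, which mirrors the dual supervision described in claim 6, while the second-model parameters `b` and `c` are discarded after training. A practical system would replace the scalar gains with neural networks and an automatic differentiation framework.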