Audio processing methods and apparatus, computer equipment and programs

The speech processing method integrates two-level networks into a single model through stepwise training, addressing computational inefficiencies and enhancing noise and reverberation removal, thus improving speech enhancement performance in real-world applications.

JP7874173B2 · Active · Publication Date: 2026-06-15 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date: 2023-03-31
Publication Date: 2026-06-15

Application Information


IPC: G10L21/0208; G10L25/30

CPC: G10L15/02; G10L15/063; G10L21/0208; G10L2021/02082; G10L25/30

AI Tagging

Application Domain

Speech analysis


AI Technical Summary

Technical Problem

Existing speech enhancement technologies face challenges in efficiently removing noise and reverberation due to the high computational demands of two-level network models, which are not suitable for real-world applications and degrade performance when parameter reduction is attempted.

Method used

A speech processing method that merges two-level networks into a single model through stepwise training using depth clustering and mask estimation loss functions, reducing computational resources while enhancing noise and reverberation removal capabilities.

Benefits of technology

The method effectively improves speech enhancement performance by efficiently removing noise and reverberation, reducing computational load, and optimizing training processes for improved user experience in video conferencing and speech recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application discloses a speech processing method and its apparatus, computer device and program, the method including the steps of acquiring initial speech features of a call speech, inputting the initial speech features into a pre-trained speech enhancement model to obtain target speech features output from the speech enhancement model, the speech enhancement model being obtained by stepwise training based on a depth clustering loss function and a mask estimation loss function, and calculating a target speech with noise and reverberation removed based on the target speech features.


Description

Technical Field

[0001] (Cross-reference to related applications) This application claims the priority of a Chinese patent application with application number 202210495197.5, filed with the Chinese Patent Office on May 7, 2022, and all of its content is incorporated herein by reference.

[0002] This application relates to the field of speech recognition technology, and more specifically, to a speech processing method and apparatus, a computer device, and a program related thereto.

Background Art

[0003] The essence of speech enhancement is to reduce the noise of speech. In daily life, the speech collected by a microphone is usually "contaminated" speech with different noises. The main purpose of speech enhancement is to restore the clean speech we want from this "contaminated" noisy speech, thereby effectively suppressing various interference signals and enhancing the target speech signal. This not only improves the quality of speech but also helps to improve speech recognition performance.

[0004] The application fields of speech enhancement include video conferencing and speech recognition, etc., and it is a pre-processing module for many speech coding and recognition systems. Usually, it is classified into near-field speech enhancement and far-field speech enhancement. In a complex speech collection environment, since noise and reverberation exist simultaneously, in existing speech enhancement, a noise reduction and reverberation removal scheme based on a two-level network is used. However, due to the large computational amount of this two-level network, speech enhancement cannot meet the performance requirements of actual applications.

Summary of the Invention

Problems to be Solved by the Invention

[0005] Embodiments of this application aim to provide a speech processing method and apparatus, a computer device, and a program that improve the performance of speech enhancement.

Means for Solving the Problems

[0006] Embodiments of the present application provide a speech processing method comprising: acquiring initial speech features of a call speech; inputting the initial speech features into a pre-trained speech enhancement model to obtain target speech features output from the speech enhancement model, wherein the speech enhancement model is obtained by stepwise training based on a depth clustering loss function and a mask estimation loss function; and calculating a target speech from which noise and reverberation have been removed based on the target speech features.

[0007] Embodiments of the present invention further provide a speech processing device comprising: an acquisition module configured to acquire initial speech features of a call speech; an enhancement module configured to input the initial speech features into a pre-trained speech enhancement model to obtain target speech features output from the speech enhancement model, wherein the speech enhancement model is obtained by stepwise training based on a depth clustering loss function and a mask estimation loss function; and a computation module configured to calculate a target speech from which noise and reverberation have been removed based on the target speech features.

[0008] Embodiments of the present invention further provide a computer device comprising a processor and memory, wherein computer program instructions are stored in the memory, and the computer program instructions execute the above-described speech processing method when called by the processor.

[0009] Embodiments of the present invention further provide a computer-readable storage medium in which program code is stored, and the above-described speech processing method is executed when the program code is executed by a processor.

[0010] Embodiments of the present application further provide a computer program product or computer program, the computer program product or computer program including computer instructions, the computer instructions being stored in a storage medium. A processor of a computer device reads the computer instructions from the storage medium, and the processor executes the instructions, thereby causing the computer device to perform the steps of the above-described speech processing method.

Effects of the Invention

[0011] Embodiments of this invention train a pre-configured speech enhancement model stepwise with two different loss functions, guiding the model to remove noise and reverberation from speech features efficiently. The noise reduction task and the reverberation removal task can each achieve their optimal training effect in their own training process, which helps to improve the noise reduction and reverberation removal capabilities of the speech enhancement model and improves speech enhancement performance while reducing the computational resources of the model.

Brief Description of the Drawings

[0012]
[Figure 1] A schematic diagram of a general noise reduction and reverberation removal method according to an embodiment of the present invention.
[Figure 2] A schematic diagram of the architecture of the speech processing system according to an embodiment of the present invention.
[Figure 3] A flowchart of the audio processing method according to the embodiment of the present invention.
[Figure 4] A schematic diagram illustrating an application scenario of the audio processing method according to the embodiment of the present invention.
[Figure 5] A schematic diagram of the architecture of the speech enhancement model according to an embodiment of the present invention.
[Figure 6] A flowchart of another audio processing method according to an embodiment of the present invention.
[Figure 7] A flowchart of speech feature extraction according to the embodiment of the present invention.
[Figure 8] A schematic diagram of the architecture of a predetermined enhancement network according to an embodiment of the present invention.
[Figure 9] A module block diagram of the voice processing apparatus according to an embodiment of the present application.
[Figure 10] A module block diagram of the computer device according to an embodiment of the present application.
[Figure 11] A module block diagram of the computer-readable storage medium according to an embodiment of the present application.

Mode for Carrying Out the Invention

[0013] In order to more clearly explain the technical solution of the embodiment of the present application, the drawings used in the description of the embodiment will be briefly introduced below. Obviously, the above drawings are only some embodiments of the present application, and those skilled in the art can obtain other related drawings based on these drawings without creative effort.

[0014] In the following, the embodiments of the present application will be described in detail. Examples of the embodiments are shown in the drawings, where the same or similar reference numerals from beginning to end represent the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended only for explaining the present application and should not be construed as limiting the present application.

[0015] In order for those skilled in the art to better understand the technical solution of the present application, in the following, the technical solution of the embodiment of the present application will be clearly and completely described with reference to the drawings in the embodiment of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative work are included in the protection scope of the present application.

[0016] In daily life, voice communication often has to take place under noise interference, for example when a mobile phone is used in a car or on a train with ambient noise, or when a microphone collects noisy far-end voice during a video conference with multiple people. In such cases, voice enhancement technology is needed to extract an original voice that is as clean as possible from the noisy voice signal. Depending on the call scenario, the calls that users make through a client can be divided into near-end calls and far-end calls. For a participant in a call, the near end is the participant's own location, and the far end is the location of the other participants in the remote conference. Each location has at least one microphone and one speaker. However, a near-end call on a client is suitable only for one person or a small number of people at short distance, and the voice and video experience is ordinary.

[0017] To improve the user experience, industrially, emphasis is placed on the research of far-end calls on large-screen communication devices. However, since far-end calls have a longer call distance, a lower signal-to-noise ratio, and the call voice usually involves noise and reverberation, it is necessary to perform noise reduction and reverberation removal on the call voice using high-performance long-distance voice enhancement. The voice enhancement solutions of related technologies usually adopt two models to perform noise reduction and reverberation removal respectively. For voice with noise and reverberation, reference can be made to FIG. 1. FIG. 1 shows two technical solutions for noise reduction and reverberation removal that are commonly used, including a method of removing reverberation after noise reduction and a method of removing noise after reverberation removal.

[0018] For example, the microphone array is divided into different subsets, and each subset obtains the voice enhanced by each microphone through a first-level voice enhancement network, integrates the enhanced voices, and then obtains the final output through a second-level voice enhancement network. However, in such a two-level network-based voice enhancement solution, a large amount of computational effort needs to be consumed in the training process, which is not suitable for the actual application performance requirements of the product. Reducing the number of network parameters to reduce the computational effort will deteriorate the effect of voice enhancement by the network.

[0019] To solve the above problems, the applicant has proposed, through research, a speech processing method provided by the embodiment of the present application, which can acquire initial speech features of a call speech, input the initial speech features into a pre-trained speech enhancement model, and obtain target speech features output from the speech enhancement model, which is obtained by stepwise training based on a depth clustering loss function and a mask estimation loss function. This merges two models (two-level networks) into one model, reducing the computational cost of the model training process. Based on the target speech features, a target speech from which noise and reverberation have been removed is calculated. In this way, the model can be trained on a pre-configured speech enhancement model through different loss functions, guiding the model to efficiently remove noise and reverberation from speech features, thereby improving speech enhancement performance while reducing model computational resources.

[0020] First, the application scenarios of the audio processing method according to the present invention will be described. Figure 2 is a schematic diagram of the architecture of the audio processing system. In some embodiments, the audio processing system 300 is applied to a remote video conferencing scenario, and the audio processing system 300 may include a near-end client 310, a far-end client 330, and a server-side 350. Here, the near-end client 310, the far-end client 330, and the server-side 350 are connected via a network, and in one embodiment, the near-end client 310 and the far-end client 330 may be large-screen terminals for video, and the server-side 350 may be a cloud server.

[0021] For example, the far-end client 330 can collect initial speech with noise and reverberation emitted by a participant and transmit the initial speech to the server-side 350. After receiving the initial speech, the server-side 350 can use a pre-trained speech enhancement model to perform noise reduction and reverberation removal on the initial speech to obtain enhanced clean speech (target speech), and transmit the clean speech to the near-end client 310. In some embodiments, the speech enhancement model may be placed in the near-end client 310 or the far-end client 330 as needed for the actual application scene.

[0022] It should be noted that the above-described voice processing system 300 is merely an example, and the architecture and application scenarios of the voice processing system described in the embodiments of this application are intended to more clearly illustrate the technical solutions of the embodiments of this application, and do not limit the technical solutions provided by the embodiments of this application. Those skilled in the art will understand that, as voice processing system architectures evolve and new application scenarios emerge, the technical solutions provided by the embodiments of this application can be similarly applied to similar technical problems.

[0023] Referring to Figure 3, Figure 3 is a flowchart of a speech processing method according to one embodiment of the present invention. In a specific embodiment, the speech processing method is applied to a speech processing device 500 shown in Figure 9 and a computer device 600 (Figure 10) on which the speech processing device 500 is located.

[0024] The specific process of the embodiment of this application will be described using computer equipment as an example. Of course, the computer equipment to which the embodiment of this application applies may be a server or a terminal. A server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. A terminal may be, but is not limited to, a smartphone, tablet computer, laptop computer, desktop computer, smart speaker, smartwatch, etc.

[0025] The process shown in Figure 3 will be described in detail below with reference to the application scene shown in Figure 4. Figure 4 is a schematic diagram of an application scene of the voice processing method according to an embodiment of the present invention. In the application scene, the voice processing method may be applied to a specific voice enhancement system, the voice enhancement model 411 of the voice enhancement system may be located on a cloud server 410, and the cloud server 410 can communicate with conference terminals (first conference terminal 430 and second conference terminal 450) in two venues. Here, the first conference terminal 430 and the second conference terminal 450 collect the voices of participants in their respective venues (i.e., original call audio), upload the collected audio to the cloud server 410, the cloud server 410 completes the voice enhancement to obtain clean audio, and finally, the cloud server 410 transmits the clean audio to the corresponding conference terminal for playback. The voice processing method may specifically include the following steps.

[0026] In step S110, the initial voice characteristics of the call audio are acquired.

[0027] In the embodiments of this invention, a computer device can acquire initial speech features of a call audio that requires speech enhancement. Here, the initial speech features are acoustic features obtained by transforming the call audio. Examples include, but are not limited to, the logarithmic power spectrum (LPS) and mel-frequency cepstral coefficients (MFCCs).

[0028] Unlike image data, audio data often cannot be input into a model directly. Furthermore, because audio shows no obvious feature changes in the time domain, learning the features of audio data is difficult. Time-domain audio data is typically sampled at 16 kHz, i.e., 16,000 sampling points per second. Inputting the time-domain sampling points directly produces an excessively large amount of training data and makes it difficult to train a model with practical effect. Therefore, in audio processing tasks, audio data is usually converted into acoustic features that serve as the input or output of the model.

[0029] In one embodiment, after acquiring call audio, frame splitting and windowing processing can be performed on the call audio to obtain initial audio features. For example, frame splitting and windowing processing can be sequentially performed on all call audio collected by the microphone to obtain audio signal frames of the call audio, a Fast Fourier Transform (FFT) can be performed on the audio signal frames to obtain the FFT-processed discrete power spectrum, and then logarithmic calculation can be performed on the obtained discrete power spectrum to obtain a logarithmic power spectrum as the initial audio features. By performing frame splitting and windowing processing on call audio, the call audio can be converted from a non-stationary signal in the time domain to a stationary signal in the frequency domain, making it easier to train the model.
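The feature extraction described above (frame splitting, windowing, FFT, logarithm of the power spectrum) can be sketched as follows. This is only a minimal illustration: the function name `log_power_spectrum`, the 32 ms frame, the 16 ms hop, the Hamming window, and the synthetic test tone are assumptions chosen for the example and are not taken from the application.

```python
import numpy as np

def log_power_spectrum(signal, sample_rate=16000, frame_ms=32, hop_ms=16, eps=1e-10):
    """Frame splitting, windowing, FFT, then log of the discrete power spectrum."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame (assumed 32 ms)
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    # Overlapping frames, shape (n_frames, frame_len), each multiplied by the window.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)          # one-sided FFT of every frame
    power = np.abs(spectrum) ** 2                   # discrete power spectrum
    return np.log(power + eps)                      # eps avoids log(0)

# Synthetic stand-in for one second of call audio: a 440 Hz tone plus weak noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
audio = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(16000)
lps = log_power_spectrum(audio)  # one row per frame, one column per frequency bin
```

Each row of `lps` is the feature vector of one frame, which is the kind of frequency-domain input the description feeds to the model instead of raw sampling points.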

[0030] The purpose of audio signal frame splitting is to group a number of audio sampling points into one frame, within which the characteristics of the audio signal are considered stable. A frame should be short enough for the signal within it to be stable; therefore, the length of one frame should be less than the length of one phoneme, and at a normal speaking rate the duration of one phoneme is approximately 50 ms. Furthermore, for Fourier analysis to be performed, one frame must contain enough oscillation periods. The fundamental frequency is around 100 Hz for male voices and around 200 Hz for female voices, corresponding to periods of 10 ms and 5 ms, respectively. Therefore, the frame length used for audio frame splitting is generally between 10 and 40 ms.
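The frame-length reasoning above can be checked with a little arithmetic. The sketch below assumes the 16 kHz sampling rate quoted earlier in the description; the helper names are hypothetical.

```python
SAMPLE_RATE = 16000  # Hz, the sampling rate quoted in the description

def frame_samples(frame_ms, sample_rate=SAMPLE_RATE):
    """Number of sampling points in a frame of the given duration."""
    return int(round(sample_rate * frame_ms / 1000))

def pitch_periods_per_frame(frame_ms, f0_hz):
    """How many oscillation periods of a voice with pitch f0_hz fit in one frame."""
    return frame_ms / 1000 * f0_hz

for ms in (10, 20, 40):
    print(
        f"{ms} ms frame: {frame_samples(ms)} samples, "
        f"{pitch_periods_per_frame(ms, 100):.1f} periods at 100 Hz, "
        f"{pitch_periods_per_frame(ms, 200):.1f} periods at 200 Hz"
    )
```

A 10 ms frame holds only a single period of a 100 Hz voice, while a 40 ms frame stays below the roughly 50 ms phoneme duration, which is consistent with the 10 to 40 ms range stated above.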

[0031] After frame splitting, each frame is discontinuous at its beginning and end, so the more frames there are, the greater the error relative to the original signal. Windowing addresses this problem by making the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. Usable window functions include, for example, the rectangular window, the Hamming window, and the Hanning (Hann) window.
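The effect of windowing on the frame edges can be illustrated with the three window functions named above. This is a sketch under assumed values: the 400-sample frame and the cosine test signal with 3.5 cycles per frame are chosen only so that the frame starts and ends at very different values.

```python
import numpy as np

frame_len = 400  # assumed frame length: 25 ms at 16 kHz
n = np.arange(frame_len)

# A frame holding 3.5 cycles of a cosine: it starts near +1 and ends near -1,
# so repeating it periodically would create a jump at every frame boundary.
frame = np.cos(2 * np.pi * 3.5 * n / frame_len)

windows = {
    "rectangular": np.ones(frame_len),
    "hamming": np.hamming(frame_len),
    "hanning": np.hanning(frame_len),
}

def edge_jump(window):
    """Size of the discontinuity between the last and first sample of the windowed frame."""
    windowed = frame * window
    return abs(windowed[0] - windowed[-1])

jumps = {name: edge_jump(w) for name, w in windows.items()}
```

The rectangular window leaves the boundary jump untouched, whereas the Hamming and Hanning windows taper both ends of the frame toward small values, which is what makes the framed signal behave more like one period of a continuous periodic function.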

[0032] In the video conferencing scene shown in Figure 4, there is a certain distance between the participants and the conference terminal, resulting in noise and reverberation in the participant's voice collected by the conference terminal. Therefore, by using the audio processing method provided in the embodiment of this application to perform audio enhancement processing on the participant's voice, noise and reverberation in the voice can be removed.

[0033] For example, the second conference terminal 450 collects the voices of participants 420 in the venue, i.e., call audio, via a microphone, and transmits the call audio to the cloud server 410 via the network. The cloud server 410 then performs frame segmentation, windowing, and Fourier transform on the call audio to obtain initial audio features.

[0034] In step S120, the initial speech features are input to a pre-trained speech enhancement model to obtain the target speech features output from the speech enhancement model.

[0035] In real-world application scenarios, call audio collected by microphone arrays contains both noise and reverberation. Considering a two-level network for noise reduction and reverberation removal on call audio, the large number of parameters in the two networks during training requires significant computational resources. Reducing the number of parameters in each network also degrades the noise reduction and reverberation removal performance of the model. Therefore, merging the two-level network into a single network reduces the number of parameters in the merged model compared to the two networks, significantly reducing the computational load of the training process and improving the speech enhancement performance of the model.

[0036] In embodiments of the present invention, the speech enhancement model can generate target speech features corresponding to the call speech, i.e., clean speech features from which noise and reverberation have been removed after speech enhancement, based on the input initial speech features. Referring to Figure 5, Figure 5 is a schematic diagram of the architecture of the speech enhancement model. The speech enhancement model may include a plurality of hidden layers, a depth clustering layer, a speech mask estimation layer, and a noise mask estimation layer.

[0037] Here, the depth clustering layer, the speech mask estimation layer, and the noise mask estimation layer can be linear layers, and the inputs to all three of these layers come from the output of the hidden layer. The hidden layer can calculate intermediate features based on the input initial speech features, and these intermediate features are the intermediate values of the speech enhancement process.

[0038] For example, the depth clustering layer can be implemented using normalization and a hyperbolic tangent function (denoted as tanh). The output of the hidden layer is first normalized so that, to facilitate subsequent processing, it is restricted to a certain range, for example, [0,1] or [-1,1]. Then, the hyperbolic tangent of the normalized result is calculated and used as the output of the depth clustering layer.
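A depth clustering layer of this kind might look like the sketch below. The description does not say which normalization is used, so min-max scaling to [-1,1] over each frame's projected values is an assumption, as are the function name, the layer sizes, and the random weights.

```python
import numpy as np

def depth_clustering_layer(hidden, weight, bias):
    """Linear projection, normalization to [-1, 1], then tanh."""
    projected = hidden @ weight + bias                  # linear layer on the hidden output
    lo = projected.min(axis=-1, keepdims=True)
    hi = projected.max(axis=-1, keepdims=True)
    # Min-max normalization (an assumed choice) restricts each frame to [-1, 1].
    normalized = 2.0 * (projected - lo) / (hi - lo + 1e-8) - 1.0
    return np.tanh(normalized)

rng = np.random.default_rng(0)
hidden_output = rng.standard_normal((5, 16))            # 5 frames, 16 hidden units (illustrative)
weight = rng.standard_normal((16, 4))                   # hypothetical projection to 4 dimensions
bias = np.zeros(4)
embedding = depth_clustering_layer(hidden_output, weight, bias)
```

Because the input to tanh is confined to [-1,1], the layer output always stays within a fixed, bounded range regardless of the scale of the hidden-layer output.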

[0039] For example, both the speech mask estimation layer and the noise mask estimation layer can be implemented using the softmax function.
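One way to read this is that the softmax is taken across the speech and noise outputs at each time-frequency point, so that the two masks sum to one. That reading is an assumption, as are the weight shapes and names in the following sketch.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def estimate_masks(hidden, w_speech, w_noise):
    """Two linear heads on the hidden output, softmax across the two sources."""
    speech_logits = hidden @ w_speech                   # shape (frames, freq_bins)
    noise_logits = hidden @ w_noise
    stacked = np.stack([speech_logits, noise_logits], axis=0)
    masks = softmax(stacked, axis=0)                    # competes per time-frequency point
    return masks[0], masks[1]

rng = np.random.default_rng(0)
hidden_output = rng.standard_normal((5, 16))            # 5 frames, 16 hidden units (illustrative)
w_speech = rng.standard_normal((16, 257))               # hypothetical head weights
w_noise = rng.standard_normal((16, 257))
speech_mask, noise_mask = estimate_masks(hidden_output, w_speech, w_noise)
```

With this arrangement every mask value lies strictly between 0 and 1, and the energy of each time-frequency point is split between speech and noise.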

[0040] The speech mask estimation layer performs mask estimation (mask inference, MI) based on the intermediate features to obtain target speech features with noise and reverberation removed. The noise mask estimation layer performs mask estimation based on the intermediate features to obtain speech features with noise. The depth clustering layer performs deep clustering (DC) based on the acquired intermediate features, thereby supporting noise reduction and reverberation removal in the speech mask estimation layer and the noise mask estimation layer. For example, the hidden layer may be a Long Short-Term Memory (LSTM) network or a variant such as a Bi-directional Long Short-Term Memory (Bi-LSTM) network, because speech features form a time series that is stable over short intervals, which matches the long and short-term memory capabilities of the LSTM. The hidden layer may also be another network with memory-like properties, such as a Gated Recurrent Unit (GRU).

[0041] In one embodiment, the model can be trained stepwise using a depth clustering loss function corresponding to the depth clustering layer and mask estimation loss functions corresponding to the speech mask estimation layer and the noise mask estimation layer, respectively. Exemplarily, in the first step, a noise reduction model is trained based on the depth clustering loss function and the mask estimation loss function, and training is stopped when the noise reduction model converges. Here, the mask estimation loss function corresponding to the speech mask estimation layer uses speech labels that are free of noise but still contain reverberation. In the second step, a reverberation removal model is trained: the noise reduction model trained in the first step is used as the starting reverberation removal model, which is then trained based on the depth clustering loss function and the mask estimation loss function, and training is stopped when the reverberation removal model converges. Here, the mask estimation loss function corresponding to the speech mask estimation layer uses clean speech labels that are free of both noise and reverberation, so that the final reverberation removal model, i.e., the speech enhancement model, has the ability to perform noise reduction and reverberation removal simultaneously.
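The two-step schedule can be illustrated with a deliberately small stand-in. In the sketch below a single linear layer replaces the LSTM-based model, only a mean squared error plays the role of the mask estimation loss, and the depth clustering loss is omitted; all data are synthetic and every name is hypothetical. The point is only the control flow: train to convergence on the first labels, then continue from those weights on the second labels.

```python
import numpy as np

def train_until_converged(weights, inputs, targets, lr=1.0, tol=1e-8, max_steps=5000):
    """Gradient descent on mean squared error; stops once the loss stops improving."""
    prev_loss = np.inf
    for _ in range(max_steps):
        error = inputs @ weights - targets
        loss = float(np.mean(error ** 2))
        if prev_loss - loss < tol:          # treated as convergence
            break
        weights = weights - lr * 2.0 * inputs.T @ error / error.size
        prev_loss = loss
    return weights, loss

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 8))                 # stand-in noisy, reverberant features
denoise_map = rng.standard_normal((8, 8))                # synthetic "ideal" mapping for step 1
dereverb_map = denoise_map + 0.3 * rng.standard_normal((8, 8))
reverberant_labels = features @ denoise_map              # step 1 labels: noise removed only
dry_labels = features @ dereverb_map                     # step 2 labels: noise and reverb removed

# Step 1: train the noise reduction model from scratch until it converges.
stage1_weights, stage1_loss = train_until_converged(
    np.zeros((8, 8)), features, reverberant_labels)
# Step 2: reuse the step 1 model as the starting point and train on the second labels.
stage2_weights, stage2_loss = train_until_converged(
    stage1_weights, features, dry_labels)
```

Only the labels and the starting weights change between the two calls, which mirrors the way the description reuses the converged noise reduction model as the reverberation removal model.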

[0042] Furthermore, the depth clustering loss of the speech enhancement model is a binary loss based on the clustering of time-frequency points. Because of the regularization characteristics of the depth clustering loss, in the training process of related technologies it is difficult to guide the speech mask estimation layer and the noise mask estimation layer to remove noise and reverberation from speech effectively, and therefore difficult to improve the speech enhancement performance of the model effectively. In contrast, with the stepwise training method of the embodiment of the present invention, the noise reduction task and the reverberation removal task can each achieve their optimal training effect in their own training process, thereby helping to improve the noise reduction and reverberation removal capabilities of the speech enhancement model.
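This passage does not give the formula of the depth clustering loss. A commonly used formulation from the deep clustering literature compares the pairwise affinities of the embeddings V with those of the binary labels Y, i.e. the squared Frobenius norm of V Vᵀ − Y Yᵀ. The sketch below assumes that formulation; it is not confirmed by the text, and the names and sizes are illustrative.

```python
import numpy as np

def depth_clustering_loss(embeddings, labels):
    """||V V^T - Y Y^T||_F^2, computed without forming the large affinity matrices."""
    vtv = embeddings.T @ embeddings
    vty = embeddings.T @ labels
    yty = labels.T @ labels
    return float(np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2))

def naive_loss(embeddings, labels):
    """Direct evaluation of the same quantity, used here only as a check."""
    diff = embeddings @ embeddings.T - labels @ labels.T
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
n_points, emb_dim = 60, 4                         # time-frequency points, embedding size
dominant = rng.integers(0, 2, size=n_points)      # 0: speech dominates, 1: noise dominates
labels = np.eye(2)[dominant]                      # binary (one-hot) label per point
embeddings = np.tanh(rng.standard_normal((n_points, emb_dim)))

loss_fast = depth_clustering_loss(embeddings, labels)
loss_naive = naive_loss(embeddings, labels)
```

The labels are binary per time-frequency point, which matches the description of the loss as a binary loss based on time-frequency point clustering; the loss is zero when the embedding affinities reproduce the label affinities exactly.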

[0043] As a result, the speech enhancement model obtained through the above training can acquire intermediate features using a multilayer LSTM, and the speech mask estimation layer can perform mask estimation based on the intermediate features to calculate the speech mask, i.e., the target speech features. For example, in the video conferencing scene shown in Figure 4, the cloud server 410 can acquire initial speech features and then input these initial speech features into the speech enhancement model 411. The speech mask estimation layer of the speech enhancement model 411 can perform mask estimation based on the intermediate features to calculate the speech mask, i.e., the target speech features, and the intermediate features are obtained using a multilayer LSTM. In the application scene of speech enhancement, it is only necessary to reconstruct the speech using the target speech features output from the speech mask estimation, thus effectively reducing the computational load in the speech enhancement process.

[0044] In step S130, the target audio, with noise and reverberation removed, is calculated based on the target audio features.

[0045] In one embodiment, an inverse Fourier transform can be performed on the acquired target audio features to calculate the target audio from which noise and reverberation have been removed. For example, by performing an inverse Fourier transform (IFT) on the target audio features and converting the target audio features from the frequency domain to the time domain, the time-domain audio after audio enhancement, i.e., the target audio, can be obtained. Exemplarily, in the video conferencing scene shown in Figure 4, the cloud server 410 acquires the target audio features output from the audio enhancement model 411, and then converts the target audio features, i.e., clean audio features, into the target audio using an inverse Fourier transform, thereby obtaining clean audio from which noise and reverberation have been removed. The cloud server 410 transmits the clean audio to the first conference terminal 430, and the speaker of the first conference terminal 430 can play back the participant's 420 audio, which is free of noise and reverberation.
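The return to the time domain can be sketched as follows. The sketch assumes that the target speech features take the form of a mask applied to the complex spectrum of the noisy call audio (so the noisy phase is reused) and that frames are recombined by overlap-add; the frame length, hop size, and periodic Hann window are illustrative choices, not details from the application.

```python
import numpy as np

FRAME_LEN, HOP = 512, 256  # assumed frame length and hop (50 % overlap)

def stft(signal):
    """Frame splitting, periodic Hann windowing, and one-sided FFT."""
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME_LEN) / FRAME_LEN)
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    frames = np.stack([
        signal[i * HOP : i * HOP + FRAME_LEN] * window for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def reconstruct(noisy_spectrum, speech_mask):
    """Apply the mask, inverse-transform each frame, and overlap-add."""
    enhanced = noisy_spectrum * speech_mask
    frames = np.fft.irfft(enhanced, n=FRAME_LEN, axis=1)   # back to the time domain
    output = np.zeros(HOP * (len(frames) - 1) + FRAME_LEN)
    for i, frame in enumerate(frames):
        output[i * HOP : i * HOP + FRAME_LEN] += frame
    return output

rng = np.random.default_rng(0)
noisy_audio = rng.standard_normal(4096)                    # stand-in for call audio
spectrum = stft(noisy_audio)
pass_through = reconstruct(spectrum, np.ones(spectrum.shape))
silenced = reconstruct(spectrum, np.zeros(spectrum.shape))
```

With an all-ones mask the interior of the signal is recovered unchanged, because periodic Hann windows at 50 % overlap sum to one; a trained speech mask would instead attenuate the time-frequency points dominated by noise and reverberation before the inverse transform.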

[0046] In the embodiments of this application, initial speech features of a call are acquired, these initial speech features are input to a pre-trained speech enhancement model, and target speech features are obtained from the speech enhancement model. The speech enhancement model is obtained by stepwise training based on a depth clustering loss function and a mask estimation loss function, and can calculate a target speech from which noise and reverberation have been removed based on the target speech features. In this way, model training is performed on a pre-configured speech enhancement model through different loss functions, inducing the model to efficiently remove noise and reverberation from speech features, thereby improving the performance of speech enhancement while reducing the model's computational resources.

[0047] Referring to the methods described in the above examples, the following examples will provide further details.

[0048] In the embodiments of this application, the case in which the voice processing device is specifically incorporated into a computer device will be described as an example.

[0049] Referring to Figure 6, which shows another speech processing method according to an embodiment of the present application, in a specific embodiment, the speech processing method is applied to the predetermined enhancement network shown in Figure 8. The process shown in Figure 6 will be described in detail below.

[0050] The embodiments of this invention incorporate artificial intelligence (AI). Artificial intelligence technology is a theory, method, technique, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive field of computer science that seeks to understand the nature of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that they can perform functions of perception, reasoning, and decision-making.

[0051] Artificial intelligence technology is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operation / interaction systems, and mechatronics. Artificial intelligence software technologies primarily encompass several major areas, such as computer vision technologies, speech processing technologies, natural language processing technologies, and machine learning / deep learning.

[0052] The technologies provided by the embodiments of this application relate to artificial intelligence speech technology, and the core technologies of speech technology include automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition (VPR). Enabling computers to hear, see, speak, and feel is the direction of future human-machine interaction development, and within this context, speech is one of the most preferred methods of human-machine interaction in the future.

[0053] The process shown in Figure 6 and the network architecture diagram shown in Figure 8 will be explained in detail below. The voice processing method may specifically include the following steps.

[0054] In step S210, the computer equipment acquires a training sample set.

[0055] The speech processing method provided by the embodiments of the present application includes training a predetermined enhancement network, which may be pre-trained based on an acquired training sample dataset. Subsequently, whenever it is necessary to perform speech enhancement on the initial speech features of a call, the speech enhancement model obtained through training can be used to calculate the target speech features with noise and reverberation removed, eliminating the need to retrain the predetermined enhancement network each time speech enhancement is performed.

[0056] In some embodiments, the wsj0-2mix (Wall Street Journal) dataset can be used to determine the training sample set, which includes a 30-hour speech training set and a 10-hour speech validation set. By randomly selecting speech from different speakers from the corresponding sets and mixing them at a random relative signal-to-noise ratio (SNR) between 0 dB and 10 dB, it is possible to generate speech with noise and reverberation that can be used for network training.
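
The mixing step described above can be sketched as follows. This is a minimal example under stated assumptions: the function name mix_at_snr is hypothetical, the SNR is defined as the ratio of mean signal powers, and both inputs are assumed to be one-dimensional NumPy arrays at the same sampling rate.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio is snr_db, then mix.

    Returns the mixture and the scaled interferer, both truncated to the shorter input.
    """
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    target_power = np.mean(target ** 2)
    interferer_power = np.mean(interferer ** 2)
    # Gain that brings the interferer to the requested relative level.
    gain = np.sqrt(target_power / (interferer_power * 10.0 ** (snr_db / 10.0)))
    scaled = gain * interferer
    return target + scaled, scaled

# Draw a random relative SNR in the 0 dB to 10 dB range mentioned in the text.
rng = np.random.default_rng(0)
random_snr_db = rng.uniform(0.0, 10.0)
```

A call such as mix_at_snr(speech_a, speech_b, random_snr_db) then produces one training mixture; the arrays speech_a and speech_b stand for utterances drawn from different speakers.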

[0057] In one embodiment, the step of the computer equipment acquiring a training sample set may include the following steps: (1) The computer equipment acquires the first sample audio. (2) The computer equipment performs speech feature extraction on the first sample audio to obtain noise speech features. (3) The computer equipment acquires the second sample audio. (4) The computer equipment performs speech feature extraction on the second sample audio to obtain a first clean speech label and a second clean speech label. (5) The computer equipment determines depth clustering annotations based on the first and second sample audio.

[0058] Here, the first sample audio is audio collected by a microphone, including noise and reverberation. The second sample audio consists of clean audio that is free of noise but contains reverberation, and clean audio that is free of both noise and reverberation. The depth clustering annotation is the proportion of features of the first and second sample audio at each time-frequency point.

[0059] For example, computer equipment can directly collect call audio, including noise and reverberation, using a microphone. In a video conference, for instance, the first sample audio would be the participant's speech collected by the microphone of a large-screen conference terminal. In the actual training process, the technician can directly obtain the first sample audio from a pre-built noise reduction training corpus.

[0060] The computer equipment can perform speech feature extraction on the acquired first sample audio. Referring to Figure 7, which is a flowchart of speech feature extraction, the input is call audio including noise and reverberation collected by a microphone, i.e., the first sample audio.

[0061] The computer equipment can obtain reference clean speech from a noise reduction training corpus and use this clean speech as the second sample audio. To facilitate stepwise training of the predetermined enhancement network, clean speech that is free of noise but contains reverberation and clean speech that is free of both noise and reverberation can be obtained. Next, speech feature extraction is performed on the noise-free speech that contains reverberation to obtain the first clean speech label, and speech feature extraction is performed on the clean speech without noise and reverberation to obtain the second clean speech label. In the computation process, the noise speech label

[0062] In one embodiment, the computer device compares the audio energy of the first sample audio and the second sample audio at each time-frequency point to obtain the depth clustering annotation.
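
One possible reading of this energy comparison is sketched below. It is an assumption rather than the formula of this application: each time-frequency point is marked as speech-dominant when the clean-speech energy is at least as large as the energy of the residual (noisy spectrum minus clean spectrum), and the result is returned as one-hot rows. The function name and the two-class layout are hypothetical.

```python
import numpy as np

def depth_clustering_annotation(noisy_spec, clean_spec):
    """Return one-hot labels of shape (T * F, 2): column 0 speech, column 1 noise."""
    clean_energy = np.abs(clean_spec) ** 2
    # Energy of everything in the noisy signal that is not the clean speech.
    residual_energy = np.abs(noisy_spec - clean_spec) ** 2
    speech_dominant = clean_energy >= residual_energy
    labels = np.stack([speech_dominant, ~speech_dominant], axis=-1)
    return labels.astype(np.float32).reshape(-1, 2)
```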

[0063] In step S220, the computer equipment acquires a predetermined enhancement network.

[0064] When products related to speech enhancement technology are put into industrial use, latency, i.e., real-time performance, is a very strict requirement. Therefore, it is necessary to reduce the number of parameters in the speech enhancement model as much as possible, but as a result, the speech enhancement effect of the model is greatly reduced. For this reason, in the embodiment of this application, we propose fusing a two-level network into a single network, so that the speech enhancement model can perform noise reduction and reverberation removal simultaneously, and still improve the speech enhancement effect without reducing the number of parameters in the model.

[0065] Referring to Figure 8, which is a schematic diagram of the architecture of the predetermined enhancement network, the predetermined enhancement network includes a hidden layer, a depth clustering layer, and a mask estimation layer. The predetermined enhancement network is a network with shared lower-layer weights and multiple outputs. The depth clustering layer assists the mask estimation performed by the speech mask estimation layer and the noise mask estimation layer, so that these two layers can effectively distinguish noise and reverberation in speech during network training. The hidden layer can be an LSTM or a Bi-LSTM. The hidden layer shown in Figure 8 is an LSTM, and the mask estimation layer includes a speech mask estimation layer (Clean-MI) and a noise mask estimation layer (Noise-MI).
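
The shared-trunk, multi-output structure can be illustrated with the following NumPy sketch. It is not the network of Figure 8: the class name, layer sizes, embedding dimension, random initialization, and the use of a single unidirectional LSTM layer with sigmoid mask outputs are assumptions made only to show how one hidden layer feeds three output layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SharedTrunkNetwork:
    """Shared LSTM hidden layer with three output layers (sizes are illustrative)."""

    def __init__(self, n_freq, n_hidden=32, emb_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.n_hidden, self.emb_dim = n_hidden, emb_dim
        # One matrix holds the weights of the four LSTM gates.
        self.w_lstm = 0.1 * rng.standard_normal((n_freq + n_hidden, 4 * n_hidden))
        self.b_lstm = np.zeros(4 * n_hidden)
        self.w_dc = 0.1 * rng.standard_normal((n_hidden, n_freq * emb_dim))
        self.w_speech = 0.1 * rng.standard_normal((n_hidden, n_freq))
        self.w_noise = 0.1 * rng.standard_normal((n_hidden, n_freq))

    def hidden(self, feats):
        """Run the LSTM over feats of shape (T, F); return states of shape (T, H)."""
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        states = []
        for x_t in feats:
            gates = np.concatenate([x_t, h]) @ self.w_lstm + self.b_lstm
            i, f, g, o = np.split(gates, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            states.append(h)
        return np.stack(states)

    def forward(self, feats, training=True):
        h = self.hidden(feats)                      # shared intermediate features
        speech_mask = sigmoid(h @ self.w_speech)    # speech mask estimation layer
        if not training:
            return speech_mask                      # only this output is needed at run time
        emb = (h @ self.w_dc).reshape(-1, self.emb_dim)   # depth clustering layer
        emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
        noise_mask = sigmoid(h @ self.w_noise)      # noise mask estimation layer
        return emb, speech_mask, noise_mask
```

Because all three output layers read the same hidden states, the lower-layer parameters are stored and computed once, and the inference path skips the depth clustering and noise mask outputs entirely.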

[0066] The speech mask estimation layer can calculate an acoustic mask, i.e., a clean speech label, while the noise mask estimation layer can calculate a noise and reverberation mask, i.e., a noise speech label. Furthermore, since the application process only needs the mask output by the speech mask estimation layer to reconstruct the speech, the computational complexity of the speech enhancement process does not increase, improving the efficiency of speech enhancement.

[0067] In step S230, the computer equipment performs noise reduction training and reverberation reduction training on the predetermined enhancement network via a training sample set in stages until the predetermined enhancement network satisfies predetermined conditions, thereby obtaining the trained target enhancement network as a speech enhancement model.

[0068] The target enhancement network obtained after training, i.e., the speech enhancement model, needs to perform two enhancement tasks simultaneously: noise reduction and reverberation removal. However, training these two enhancement tasks simultaneously will not allow the enhancement network to achieve optimal training effectiveness. Therefore, a stepwise training method can be adopted, in which the training processes for the two tasks are performed separately.

[0069] Specifically, the embodiments of the present invention provide two stepwise training methods. For example, noise reduction training may be performed first, followed by reverberation reduction training, or reverberation reduction training may be performed first, followed by noise reduction training. Here, the objective of noise reduction training is to equip the network with the ability to reduce noise, and the objective of reverberation reduction training is to equip the network with the ability to remove reverberation. This allows the two enhancement tasks to achieve optimal training effects in their own training processes, thereby improving the speech enhancement performance of the speech enhancement model.

[0070] In some embodiments, the computer equipment may perform stepwise noise reduction training and reverberation reduction training on the predetermined enhancement network via a training sample set until the predetermined enhancement network satisfies predetermined conditions, and this step may include the following steps: (1) The computer equipment inputs noise speech features into the hidden layer and generates intermediate training features through the hidden layer. (2) The computer equipment inputs the intermediate training features into the depth clustering layer and generates clustering training annotations through the depth clustering layer. (3) The computer equipment inputs the intermediate training features into the speech mask estimation layer and generates clean speech training features through the speech mask estimation layer. (4) The computer equipment inputs the intermediate training features into the noise mask estimation layer and generates noise speech training features through the noise mask estimation layer. (5) The computer equipment constructs a target loss function based on the clean speech label, the noise speech label, the depth clustering annotation, the clean speech training features, the noise speech training features, and the clustering training annotation, and performs noise reduction training and reverberation reduction training on the predetermined enhancement network in stages based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions.

[0071] Here, the intermediate training feature is an intermediate value generated by the hidden layer of the predetermined enhancement network, and is input as a single shared value to the depth clustering layer, the speech mask estimation layer, and the noise mask estimation layer, respectively, thereby achieving lower-layer weight sharing and reducing the number of network parameters. The speech mask estimation layer and noise mask estimation layer use the intermediate training feature to create a clean speech training feature.

[0072] In one embodiment, the step in which the computer equipment constructs a target loss function based on clean voice labels, noisy voice labels, depth clustering annotations, clean voice training features, noisy voice training features, and clustering training annotations, and performs noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies predetermined conditions, may include the following steps:

[0073] (5.1) The computer equipment determines the first loss function based on the clustering training annotation and the depth clustering annotation.
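
For orientation only, a form of clustering loss that is widely used in the deep clustering literature is the squared Frobenius distance between the affinity matrices of the embeddings V and of the labels Y. The sketch below computes that commonly used form; it is not asserted to be the first loss function of this application, and the function name is hypothetical.

```python
import numpy as np

def clustering_loss(emb, labels):
    """Compute ||V V^T - Y Y^T||_F^2 without forming the (TF x TF) affinity matrices.

    emb:    (T * F, D) embeddings produced by the depth clustering layer
    labels: (T * F, C) one-hot depth clustering annotations
    """
    vtv = emb.T @ emb
    vty = emb.T @ labels
    yty = labels.T @ labels
    # Expansion of the Frobenius norm into three small (D or C sized) terms.
    return float(np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2))
```

The expanded form avoids the quadratic memory cost in the number of time-frequency points, which is the usual reason for writing the loss this way.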

[0074] Here, the first loss function is the depth clustering loss function, and exemplarily, the first loss function

[0075] (5.2) The computer equipment determines a second loss function based on the clean voice training features and clean voice labels.

[0076] For the two stepwise training methods, two different second loss functions can be determined based on different clean speech labels.

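[0077] In some embodiments, the computer equipment determines a noise reduction loss function based on the clean voice training features and the first clean voice label, and uses the noise reduction loss function as the second loss function.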

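[0078] In some embodiments, the computer equipment determines a reverberation removal loss function based on the clean voice training features and the second clean voice label, and uses the reverberation removal loss function as the second loss function.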

[0079] (5.3) The computer equipment determines a third loss function based on the noise speech training features and the noise speech labels.

[0080] For example, the third loss function

[0081] Here, the second loss function

[0082] (5.4) The computer equipment constructs a target loss function for a predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and performs noise reduction training and reverberation reduction training on the predetermined enhancement network in stages based on the target loss function until the predetermined enhancement network satisfies predetermined conditions.
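
How the three loss terms might be combined, and how the second loss function changes with the training stage, can be sketched as follows. The mean-squared-error form of the mask estimation loss, the weighting scheme, the parameter dc_weight, and the stage names are all assumptions introduced for illustration; the formulas of this application are not reproduced here.

```python
import numpy as np

def mask_estimation_loss(estimate, label):
    """Mean squared error between estimated features and their label (assumed form)."""
    return float(np.mean((np.asarray(estimate) - np.asarray(label)) ** 2))

def second_loss(stage, clean_estimate, first_clean_label, second_clean_label):
    """Select the clean label that matches the current training stage."""
    if stage == "noise_reduction":
        return mask_estimation_loss(clean_estimate, first_clean_label)
    if stage == "reverberation_removal":
        return mask_estimation_loss(clean_estimate, second_clean_label)
    raise ValueError(f"unknown stage: {stage}")

def target_loss(first, second, third, dc_weight=0.5):
    """Weighted sum of the three losses; dc_weight is a hypothetical hyperparameter."""
    return dc_weight * first + (1.0 - dc_weight) * (second + third)
```

Only the label passed to the second loss differs between the two stages, so the same network and the same optimization code can be reused for both.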

[0083] For example, a computer device uses the first loss function

[0084]

[0085] Here,

[0086] Typically, noise refers to "unwanted sounds" in specific situations, such as loud human voices or various sudden noises. Reverberation refers to the persistence of sound that remains even after the sound source in a room has stopped emitting sound. Considering that the needs for voice enhancement differ depending on the application scene—for example, in a large venue, noise is mainly removed from audio collected by conference terminals, while in a professional recording location, reverberation is mainly removed from audio collected by recording equipment—different methods of step-by-step training can be conducted based on the actual scenes used in the final voice enhancement model.

[0087] In some embodiments, applied scene attributes can be obtained based on the actual scene in which the final speech enhancement model is used, and a corresponding distributed training policy can be determined based on the applied scene attributes. Based on the distributed training policy, a target loss function for the predetermined enhancement network is constructed from the first loss function, the second loss function, and the third loss function, and noise reduction training and reverberation removal training are performed stepwise on the predetermined enhancement network based on the target loss function until the predetermined enhancement network satisfies predetermined conditions.

[0088] Here, the applied scene attribute is used to represent the actual scene to which the speech enhancement model is applied, for example, a scene attribute focused on noise reduction, a scene attribute focused on reverberation removal, etc. The distributed training policy, i.e., the policy that fixes the order of the stepwise training stages, includes a first distributed training policy and a second distributed training policy. The first distributed training policy is used for scenes focused on noise reduction, where noise reduction training is performed first, followed by reverberation removal training. The second distributed training policy is used for scenes focused on reverberation removal, where reverberation removal training is performed first, followed by noise reduction training.
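
The mapping from scene attribute to training order can be written down compactly. The enum, its member names, and the function name below are hypothetical; the default branch reflects the statement elsewhere in this description that noise reduction training may be performed first when the scene imposes no special requirement.

```python
from enum import Enum

class SceneAttribute(Enum):
    NOISE_REDUCTION = "noise_reduction"
    REVERBERATION_REMOVAL = "reverberation_removal"

def training_order(scene=None):
    """Return the ordered list of training stages for an applied scene attribute."""
    if scene is SceneAttribute.REVERBERATION_REMOVAL:
        # Second policy: remove reverberation first, then reduce noise.
        return ["reverberation_removal", "noise_reduction"]
    # First policy, also used when no scene attribute is given.
    return ["noise_reduction", "reverberation_removal"]
```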

[0089] In one embodiment, in an application scenario aimed at noise reduction, for example, in a video conference with multiple participants, the audio collected by the conference terminal includes not only the voice of the speaker but also the voices of other people. Since it is necessary to perform noise reduction processing on the call audio collected by the conference terminal, noise reduction training can be performed before reverberation reduction training. Based on the first distributed training policy, the computer equipment can determine the target loss function of the predetermined enhancement network from the first loss function, the second loss function, and the third loss function, where the second loss function is determined by the noise reduction loss function. Next, noise reduction training is repeatedly performed on the predetermined enhancement network based on the target loss function until the predetermined enhancement network satisfies predetermined conditions, thereby obtaining a noise reduction network, which at this point has only the noise reduction capability.

[0090] In some embodiments, the computer equipment can then determine a target loss function for the noise reduction network based on the first loss function, the second loss function, and the third loss function, the second loss function here being determined by the reverberation removal loss function. Next, reverberation removal training is repeated on the noise reduction network based on the target loss function until the noise reduction network satisfies predetermined conditions. By performing noise reduction training on its own first in this way, interference from reverberation factors in the training process can be avoided, thereby resulting in a target enhancement network with better noise reduction performance.

[0091] In another embodiment, in application scenarios aimed at reducing reverberation, such as recording studios where the demands for sound quality are high and removing unwanted reverberation is particularly important, reverberation removal training can be performed first, followed by noise reduction training. Based on the second distributed training policy, the computer equipment can determine the target loss function of the predetermined enhancement network from the first loss function, the second loss function, and the third loss function, the second loss function being determined by the reverberation removal loss function. Next, reverberation removal training is repeatedly performed on the predetermined enhancement network based on the target loss function until the predetermined enhancement network satisfies predetermined conditions, thereby obtaining a reverberation removal network, which at this point has only the reverberation removal capability.

[0092] In some embodiments, the computer equipment can then determine a target loss function for the reverberation removal network based on the first loss function, the second loss function, and the third loss function, the second loss function here being determined by the noise reduction loss function. Next, noise reduction training is repeated on the reverberation removal network based on the target loss function until the reverberation removal network satisfies predetermined conditions. By performing reverberation removal training on its own first in this way, interference from noise factors in the training process can be avoided, thereby resulting in a target enhancement network with better reverberation removal performance.
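
Both orders share the same control flow: the network produced by the first stage is the starting point of the second stage. A minimal sketch follows; the function name stepwise_train and the callback train_until_converged are hypothetical placeholders for whatever per-stage training routine is used.

```python
def stepwise_train(network, stage_order, train_until_converged):
    """Train one stage after another, carrying the network over between stages.

    train_until_converged(network, stage) is assumed to return the network
    after it has satisfied the predetermined conditions for that stage.
    """
    for stage in stage_order:
        network = train_until_converged(network, stage)
    return network
```

For example, passing ["noise_reduction", "reverberation_removal"] as stage_order corresponds to the first order described above, and the reversed list corresponds to the second.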

[0093] For example, if noise is defined precisely, the concept of noise essentially includes reverberation. Therefore, if the application scenario of the speech enhancement model imposes no special requirements, the predetermined enhancement network can first be trained to reduce noise and then trained to remove reverberation. This allows the network to learn the reverberation removal capability on top of an already well-performing noise reduction network. In this way, both training processes can achieve optimal training effects, improving the speech enhancement performance of the speech enhancement model.

[0094] The predetermined conditions may include the total loss value of the target loss function being less than or equal to a predetermined value, the total loss value of the target loss function ceasing to change, or the number of training iterations reaching a predetermined number. For example, an optimizer can be used to optimize the target loss function, and the learning rate, training batch size, and training epoch can be set based on experimental experience.
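
The three predetermined conditions listed above can be checked with a small helper. The function name, the patience window, and the tolerance min_delta used to decide that the loss has stopped changing are assumptions; the application does not specify how a plateau is detected.

```python
def training_should_stop(loss_history, max_iters, loss_threshold,
                         patience=5, min_delta=1e-6):
    """Return True when any of the predetermined conditions is satisfied."""
    if not loss_history:
        return False
    # Condition 1: total loss is at or below the predetermined value.
    if loss_history[-1] <= loss_threshold:
        return True
    # Condition 2: the number of training iterations reached the predetermined number.
    if len(loss_history) >= max_iters:
        return True
    # Condition 3: the loss has effectively stopped changing over recent iterations.
    if len(loss_history) > patience:
        recent = loss_history[-(patience + 1):]
        if max(recent) - min(recent) < min_delta:
            return True
    return False
```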

[0095] It can be understood that, after multiple cycles of iterative training are performed on the network to be trained (the predetermined enhancement network / denoising network / de-reverberation network) based on the training sample set, where each cycle includes multiple training iterations and the parameters of the network to be trained are continuously optimized, the total loss value mentioned above decreases until it settles at a fixed value or falls below the predetermined value mentioned above. This indicates that the network to be trained has converged. Of course, it is also possible to determine that the predetermined enhancement network / denoising network / de-reverberation network has converged after the number of training cycles has reached a predetermined number.

[0096] In training a given enhancement network using multitask learning, a combination of depth clustering loss and mask estimation loss is used. However, the mask estimation loss is only used during the validation process of the target enhancement network, i.e., the speech enhancement model selection. When the speech enhancement model is executed, the output of the mask estimation branch is used as the speech-enhanced mask, i.e., the target speech features.

[0097] In step S240, the computer equipment acquires the initial voice characteristics of the call audio.

[0098] In step S250, the computer equipment inputs the initial speech features into a hidden layer and generates intermediate features through the hidden layer.

[0099] In step S260, the computer equipment inputs the intermediate features into the speech mask estimation layer, generates clean speech features through the speech mask estimation layer, and uses the clean speech features as the target speech features.

[0100] In one embodiment, a computer device can collect call audio, perform speech feature extraction on the call audio including frame splitting, windowing, and Fourier transform to obtain initial speech features, input the initial speech features into a hidden layer of a speech enhancement network to generate intermediate features through the hidden layer, input the intermediate features into a speech mask estimation layer to generate clean speech features through the speech mask estimation layer, and use the clean speech features as the target speech features.
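
The feature extraction part of this flow (frame splitting, windowing, Fourier transform) can be sketched as follows. The function name, the Hann window, the frame length, and the hop size are illustrative assumptions; the magnitude of the returned spectrum is one plausible choice of initial speech features.

```python
import numpy as np

def extract_features(audio, frame_len=512, hop=256):
    """Split audio into frames, apply a window, and take the FFT of each frame.

    Returns a complex array of shape (n_frames, frame_len // 2 + 1).
    """
    if len(audio) < frame_len:
        raise ValueError("audio is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    # Frame splitting: overlapping segments taken every `hop` samples.
    frames = np.stack([audio[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Windowing followed by a one-sided FFT per frame.
    return np.fft.rfft(frames * window, axis=1)
```

A call such as np.abs(extract_features(call_audio)) then yields a time-frequency magnitude map that can be fed frame by frame to the hidden layer.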

[0101] In step S270, the computer equipment performs an inverse feature transform on the target speech features to calculate the target speech from which noise and reverberation have been removed.

[0102] In one embodiment, a computer device can acquire target speech features, perform an inverse feature transform on the target speech features, and convert the target speech features (mask) in the frequency domain into the target speech in the time domain. In some embodiments, the inverse feature transform may be an inverse Fourier transform. In the embodiment of the present application, a training sample set and a predetermined enhancement network are acquired, and noise reduction training and reverberation reduction training are performed stepwise on the predetermined enhancement network via the training sample set until the predetermined enhancement network satisfies predetermined conditions, so that a trained target enhancement network is obtained as the speech enhancement model. Initial speech features are then input to the hidden layer to generate intermediate features, the intermediate features are input to the speech mask estimation layer to generate clean speech features, and the clean speech features are used as the target speech features. Finally, an inverse feature transform is performed on the target speech features to calculate the target speech from which noise and reverberation have been removed. Because only the target speech features output by the speech mask estimation layer of the speech enhancement model are needed to reconstruct the speech, the computational complexity of the speech enhancement process does not increase, which improves speech enhancement efficiency.

[0103] Referring to Figure 9, which is a modular block diagram of an audio processing device 500 according to an embodiment of the present application, the audio processing device 500 comprises an acquisition module 510, an enhancement module 520, and a computation module 530. The acquisition module 510 is configured to acquire initial speech features of call audio. The enhancement module 520 is configured to input the initial speech features into a pre-trained speech enhancement model to obtain target speech features output from the speech enhancement model, the speech enhancement model being obtained by stepwise training based on a depth clustering loss function and a mask estimation loss function. The computation module 530 is configured to calculate a target speech from which noise and reverberation have been removed based on the target speech features.

[0104] In some embodiments, the speech processing device 500 may further comprise a sample acquisition module, a network acquisition module, and a network training module. The sample acquisition module is configured to acquire a training sample set, the training sample set comprising noisy speech features, clean speech labels, noisy speech labels, and depth clustering annotations; the network acquisition module is configured to acquire a predetermined enhancement network, the predetermined enhancement network comprising a hidden layer, a depth clustering layer, and a mask estimation layer; and the network training module is configured to perform noise reduction training and reverberation removal training on the predetermined enhancement network via the training sample set stepwise until the predetermined enhancement network satisfies predetermined conditions, thereby obtaining a trained target enhancement network as a speech enhancement model.

[0105] In some embodiments, the mask estimation layer includes a speech mask estimation layer and a noise mask estimation layer, and the network training module may comprise a hidden unit, a depth clustering unit, a speech estimation unit, a noise estimation unit, and a network training unit. The hidden unit is configured to input noise speech features into the hidden layer and generate intermediate training features through the hidden layer. The depth clustering unit is configured to input the intermediate training features into the depth clustering layer and generate clustering training annotations through the depth clustering layer. The speech estimation unit is configured to input the intermediate training features into the speech mask estimation layer and generate clean speech training features through the speech mask estimation layer. The noise estimation unit is configured to input the intermediate training features into the noise mask estimation layer and generate noisy speech training features through the noise mask estimation layer. The network training unit is configured to construct a target loss function based on clean speech labels, noisy speech labels, depth clustering annotations, clean speech training features, noisy speech training features, and clustering training annotations, and to perform noise reduction training and reverberation removal training on the predetermined enhancement network in stages based on the target loss function until the predetermined enhancement network satisfies predetermined conditions.

[0106] In some embodiments, the network training unit comprises a first subunit, a second subunit, a third subunit, and a training subunit, wherein the first subunit is configured to determine a first loss function based on clustering training annotations and depth clustering annotations; the second subunit is configured to determine a second loss function based on clean voice training features and clean voice labels; the third subunit is configured to determine a third loss function based on noisy voice training features and noisy voice labels; and the training subunit is configured to construct a target loss function for the predetermined enhancement network based on the first, second, and third loss functions, and to perform noise reduction training and reverberation removal training on the predetermined enhancement network in stages based on the target loss function until the predetermined enhancement network satisfies predetermined conditions.

[0107] In some embodiments, the second subunit may be configured to determine a denoising loss function based on clean voice training features and a first clean voice label, and to use the denoising loss function as the second loss function, where the first clean voice label is a voice label obtained based on a noise-free voice that contains reverberation.

[0108] In some embodiments, the second subunit may be further configured to determine a reverberation removal loss function based on clean voice training features and a second clean voice label, and to use the reverberation removal loss function as the second loss function, where the second clean voice label is a voice label obtained based on a noise-free, reverberation-free voice.

[0109] In some embodiments, the training subunit may be configured to perform the following steps: 1) Determine a target loss function for a given enhanced network based on a first loss function, a second loss function, and a third loss function; 2) Repeat denoising training on the given enhanced network based on the target loss function until the given enhanced network satisfies predetermined conditions, thereby obtaining a denoising network, wherein the second loss function is determined by the denoising loss function; and 3) Determine a target loss function for a denoising network based on a first loss function, a reverberation removal loss function, and a third loss function; 4) Repeat reverberation removal training on the denoising network based on the target loss function until the denoising network satisfies predetermined conditions, wherein the second loss function is determined by the reverberation removal loss function.

[0110] In some embodiments, the training subunit may be configured to perform the following steps: 1) Determine a target loss function for the predetermined enhancement network based on a first loss function, a second loss function, and a third loss function; 2) Repeat reverberation removal training on the predetermined enhancement network based on the target loss function until the predetermined enhancement network satisfies predetermined conditions, thereby obtaining a reverberation removal network, wherein the second loss function is determined by the reverberation removal loss function; 3) Determine a target loss function for the reverberation removal network based on a first loss function, a second loss function, and a third loss function; 4) Repeat denoising training on the reverberation removal network based on the target loss function until the reverberation removal network satisfies predetermined conditions, wherein the second loss function is determined by the denoising loss function.

[0111] In some embodiments, the sample acquisition module may be configured to perform the following steps: acquire a first sample audio, the first sample audio being a noisy audio collected by a microphone; perform speech feature extraction on the first sample audio to obtain noise speech features; acquire a second sample audio, the second sample audio including a clean audio that is free of noise but contains reverberation and a clean audio that is free of both noise and reverberation; perform speech feature extraction on the second sample audio to obtain a first clean speech label and a second clean speech label; and determine a depth clustering annotation based on the first and second sample audio.

[0112] In some embodiments, the speech enhancement model includes a hidden layer, a depth clustering layer, a speech mask estimation layer, and a noise mask estimation layer, and the enhancement module 520 may be configured to input initial speech features into the hidden layer, generate intermediate features through the hidden layer, input the intermediate features into the speech mask estimation layer, generate clean speech features through the speech mask estimation layer, and use the clean speech features as the target speech features. The computing module 530 may be configured to perform an inverse feature transform on the target speech features and compute the target speech from which noise and reverberation have been removed.
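
The inference path described above (hidden layer, then speech mask estimation layer, then inverse feature transform) can be sketched for a single frame as follows. This is a toy illustration under stated assumptions, not the model of this application: the features are assumed to be discrete Fourier transform magnitudes, the noisy phase is assumed to be reused for reconstruction, and `hidden_layer` and `speech_mask_layer` are hypothetical fixed functions standing in for trained layers. The depth clustering layer and the noise mask estimation layer are not called here, since the paragraph above routes only the speech mask branch to the output.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (the feature transform in this toy)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT (the inverse feature transform); returns the real part."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def hidden_layer(features, weight=0.5):
    """Hypothetical stand-in for the shared hidden layer."""
    return [math.tanh(weight * f) for f in features]


def speech_mask_layer(intermediate):
    """Hypothetical stand-in for the speech mask estimation layer (sigmoid output)."""
    return [1.0 / (1.0 + math.exp(-h)) for h in intermediate]


def enhance_frame(frame):
    """Noisy frame -> initial features -> mask -> target features -> target speech."""
    spectrum = dft(frame)
    magnitudes = [abs(c) for c in spectrum]           # initial speech features
    phases = [cmath.phase(c) for c in spectrum]       # assumption: reuse noisy phase
    intermediate = hidden_layer(magnitudes)           # intermediate features
    mask = speech_mask_layer(intermediate)
    target_features = [m * a for m, a in zip(mask, magnitudes)]  # clean speech features
    masked = [cmath.rect(a, p) for a, p in zip(target_features, phases)]
    return idft(masked), mask


# Made-up four-sample frame, for illustration only.
target_speech, mask = enhance_frame([0.0, 1.0, 0.0, -1.0])
```

Because every mask value lies between 0 and 1, the reconstructed frame never carries more energy than the input frame; a trained mask would attenuate the bins dominated by noise and reverberation rather than all bins uniformly.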

[0113] Those skilled in the art will clearly understand that, for convenience and brevity of explanation, the specific operating processes of the above-described apparatus and modules can be understood by referring to the corresponding processes in the method embodiments described above, and will not be repeated here.

[0114] In some embodiments provided by this application, the interconnection between modules may be electrical, mechanical, or in other forms.

[0115] Furthermore, the functional modules in each embodiment of the present application may be integrated into a single processing module, each module may exist physically separately, or two or more modules may be integrated into a single module. The integrated module may be implemented in hardware form or in the form of a software functional module.

[0116] The technical solution provided by this application involves acquiring initial speech features of call audio, inputting these initial speech features into a pre-trained speech enhancement model, and obtaining target speech features output from the speech enhancement model. The speech enhancement model is obtained through stepwise training based on a depth clustering loss function and a mask estimation loss function, and a target speech from which noise and reverberation have been removed can be calculated based on the target speech features. In this way, a pre-configured speech enhancement model is trained with different loss functions, guiding the model to efficiently remove noise and reverberation from the speech, thereby improving speech enhancement performance while reducing the computational resources the model requires.

[0117] As shown in Figure 10, an embodiment of the present invention further provides a computer device 600 comprising a processor 610, memory 620, power supply 630, and input unit 640, wherein the memory 620 stores computer program instructions, and when called by the processor 610, the computer program instructions can perform steps of the various methods provided by the above embodiment. Those skilled in the art will understand that the illustrated computer device configuration does not constitute a limitation of computer devices, and may include more or fewer components than shown, or may combine specific components, or may have different component arrangements.

[0118] The processor 610 may include one or more processing cores. The processor 610 connects the various parts of the computer equipment using various interfaces and lines, and performs the various functions of the computer equipment and processes data by executing instructions, programs, code sets, or instruction sets stored in memory 620 and by retrieving data stored in memory 620, thereby controlling the computer equipment as a whole. In some embodiments, the processor 610 may be implemented in at least one hardware form from among a digital signal processor (DSP), a field programmable gate array (FPGA), and a programmable logic array (PLA). The processor 610 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), and a modem. Here, the CPU primarily handles the operating system, user interface, and applications, the GPU is responsible for rendering and drawing the display content, and the modem handles wireless communication. It will be understood that the modem need not be integrated into the processor 610 and may instead be implemented independently as a single communication chip.

[0119] Memory 620 may include Random Access Memory (RAM) or Read-Only Memory (ROM). Memory 620 may be configured to store instructions, programs, code, code sets, or instruction sets. Memory 620 may include a storage program area and a storage data area, where the storage program area can store instructions for implementing an operating system, instructions for implementing at least one function (e.g., touch function, audio playback function, image playback function, etc.), instructions for implementing the various method embodiments described above, and the like. The storage data area can store data created by the computer equipment during use (e.g., phone book, audio/video data, etc.). Correspondingly, memory 620 may also include a memory controller to provide processor 610 with access to memory 620.

[0120] The power supply 630 is logically connected to the processor 610 via a power management system, which enables functions such as charging, discharging, and power consumption management. The power supply 630 may further include one or more DC or AC power supplies, a recharging system, a power fault detection circuit, a power converter or inverter, a power status indicator, and other optional components.

[0121] The input unit 640 is configured to receive input numerical or character information and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

[0122] Furthermore, although not shown in the figures, the computer equipment 600 may further include a display unit and the like, which will not be described again here. Specifically, in the embodiments of the present invention, the processor 610 in the computer equipment loads executable files corresponding to the processes of one or more applications into memory 620 according to the following instructions, and the processor 610 executes the applications stored in memory 620, thereby realizing the steps of the various methods provided by the embodiments described above.

[0123] As shown in Figure 11, the embodiment of the present invention further provides a computer-readable storage medium 700 in which computer program instructions 710 are stored, and the computer program instructions 710 can be invoked by a processor to execute the methods described in the above embodiment.

[0124] The computer-readable storage medium may be electronic memory such as flash memory, electrically erasable programmable read-only memory (EEPROM), EPROM, hard disk, or ROM. In some embodiments, the computer-readable storage medium includes a non-transitory computer-readable storage medium. The computer-readable storage medium 700 has storage space for program code that performs any of the steps of the above methods. This program code may be read from or written to one or more computer program products. The program code may, for example, be compressed in an appropriate format.

[0125] According to one aspect of the present application, a computer program product or computer program is provided. The computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, thereby causing the computer device to perform the methods provided in the various alternative embodiments described above.

[0126] The above are merely preferred embodiments of the present application and are not intended to limit the present application. A person skilled in the art can make equivalent changes or modifications using the technical content disclosed above without departing from the scope of the technical solution of the present application, and any such changes or modifications made to the above embodiments based on the technical substance of the present application remain within the scope of the technical solution of the present application.

[Explanation of symbols]

[0127]
300 Voice processing system
310 Near-end client
330 Remote client
350 Server side
410 Cloud server
411 Speech enhancement model
420 Participants
430 First conference terminal
450 Second conference terminal
500 Voice processing device
510 Acquisition module
520 Enhancement module
530 Computing module
600 Computer equipment
610 Processor
620 Memory
630 Power supply
640 Input unit
700 Computer-readable storage medium
710 Computer program instructions

Claims

1. An audio processing method performed by a processor, comprising: a step of obtaining initial speech features of call audio; a step of inputting the initial speech features into a pre-trained speech enhancement model including a hidden layer, a depth clustering layer, a speech mask estimation layer, and a noise mask estimation layer, to obtain target speech features output from the speech enhancement model, the step including a step of inputting the initial speech features into the hidden layer and generating intermediate features through the hidden layer, and a step of inputting the intermediate features into the speech mask estimation layer, generating clean speech features through the speech mask estimation layer, and using the clean speech features as the target speech features, wherein the speech enhancement model is obtained by stepwise training based on a depth clustering loss function and mask estimation loss functions corresponding to the speech mask estimation layer and the noise mask estimation layer, respectively; and a step of calculating, based on the target speech features, a target speech from which noise and reverberation have been removed, the step including a step of performing an inverse feature transform on the target speech features to calculate the target speech from which noise and reverberation have been removed.

2. The audio processing method according to claim 1, further comprising a step of pre-training the speech enhancement model, the pre-training including: a step of obtaining a training sample set, wherein the training sample set includes noise speech features, clean speech labels, noise speech labels, and depth clustering annotations; a step of obtaining a predetermined enhancement network; and a step of performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise using the training sample set until the predetermined enhancement network satisfies predetermined conditions, thereby obtaining the trained target enhancement network as the speech enhancement model.

3. The audio processing method according to claim 2, wherein the predetermined enhancement network includes a hidden layer, a depth clustering layer, and a mask estimation layer, the mask estimation layer including a speech mask estimation layer and a noise mask estimation layer, and wherein the step of performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise using the training sample set until the predetermined enhancement network satisfies the predetermined conditions includes: a step of inputting the noise speech features into the hidden layer and generating intermediate training features through the hidden layer; a step of inputting the intermediate training features into the depth clustering layer and generating clustering training annotations through the depth clustering layer; a step of inputting the intermediate training features into the speech mask estimation layer and generating clean speech training features through the speech mask estimation layer; a step of inputting the intermediate training features into the noise mask estimation layer and generating noise speech training features through the noise mask estimation layer; and a step of constructing a target loss function based on the clean speech labels, the noise speech labels, the depth clustering annotations, the clean speech training features, the noise speech training features, and the clustering training annotations, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions.

4. The audio processing method according to claim 3, wherein the step of constructing a target loss function based on the clean speech labels, the noise speech labels, the depth clustering annotations, the clean speech training features, the noise speech training features, and the clustering training annotations, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions includes: a step of determining a first loss function based on the clustering training annotations and the depth clustering annotations; a step of determining a second loss function based on the clean speech training features and the clean speech labels; a step of determining a third loss function based on the noise speech training features and the noise speech labels; and a step of constructing the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions.

5. The audio processing method according to claim 4, wherein the step of constructing the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function includes a step of performing weighted addition of the first loss function, the second loss function, and the third loss function, using the weighting parameters corresponding to the first loss function, the second loss function, and the third loss function, respectively, to obtain the target loss function for the predetermined enhancement network.

6. The audio processing method according to claim 4, wherein the clean speech labels include a first clean speech label, and the step of determining the second loss function based on the clean speech training features and the clean speech labels includes: a step of determining a noise reduction loss function based on the clean speech training features and the first clean speech label; and a step of using the noise reduction loss function as the second loss function, wherein the first clean speech label is a speech label obtained from speech that is noise-free and contains reverberation.

7. The audio processing method according to claim 4, wherein the clean speech labels include a second clean speech label, and the step of determining the second loss function based on the clean speech training features and the clean speech labels includes: a step of determining a reverberation removal loss function based on the clean speech training features and the second clean speech label; and a step of using the reverberation removal loss function as the second loss function, wherein the second clean speech label is a speech label obtained from speech that is noise-free and reverberation-free.

8. The audio processing method according to claim 5, wherein the step of constructing the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions includes: a step of obtaining an application scene attribute; a step of determining a corresponding distributed training policy based on the application scene attribute; and a step of constructing, according to the distributed training policy, the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions.

9. The audio processing method according to claim 8, wherein the distributed training policy includes a first distributed training policy, and the step of constructing, according to the distributed training policy, the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions includes, when the distributed training policy is the first distributed training policy: a step of determining the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and repeatedly performing noise reduction training on the predetermined enhancement network based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions, thereby obtaining a noise reduction network, wherein the second loss function is determined by the noise reduction loss function; and a step of determining a target loss function for the noise reduction network based on the first loss function, the second loss function, and the third loss function, and repeatedly performing reverberation removal training on the noise reduction network based on the target loss function until the noise reduction network satisfies the predetermined conditions, wherein the second loss function is determined by the reverberation removal loss function.

10. The audio processing method according to claim 8, wherein the distributed training policy includes a second distributed training policy, and the step of constructing, according to the distributed training policy, the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and performing noise reduction training and reverberation removal training on the predetermined enhancement network stepwise based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions includes, when the distributed training policy is the second distributed training policy: a step of determining the target loss function for the predetermined enhancement network based on the first loss function, the second loss function, and the third loss function, and repeatedly performing reverberation removal training on the predetermined enhancement network based on the target loss function until the predetermined enhancement network satisfies the predetermined conditions, thereby obtaining a reverberation removal network, wherein the second loss function is determined by the reverberation removal loss function; and a step of determining a target loss function for the reverberation removal network based on the first loss function, the second loss function, and the third loss function, and repeatedly performing noise reduction training on the reverberation removal network based on the target loss function until the reverberation removal network satisfies the predetermined conditions, wherein the second loss function is determined by the noise reduction loss function.

11. The audio processing method according to claim 2, wherein the step of obtaining the training sample set includes: a step of acquiring a first sample audio, wherein the first sample audio is audio including noise and reverberation collected by a microphone; a step of performing speech feature extraction on the first sample audio to obtain the noise speech features; a step of acquiring a second sample audio, wherein the second sample audio includes clean audio that is noise-free and contains reverberation and clean audio that is noise-free and reverberation-free; a step of performing speech feature extraction on the second sample audio to obtain a first clean speech label and a second clean speech label; and a step of determining the depth clustering annotations based on the first sample audio and the second sample audio.

12. The audio processing method according to claim 3, wherein the predetermined conditions include any of the following: the total loss value of the target loss function is less than or equal to a predetermined value; the total loss value of the target loss function stops changing; or the number of training iterations reaches a predetermined number.

13. An audio processing device, comprising: an acquisition module configured to acquire initial speech features of call audio; an enhancement module configured to input the initial speech features into a pre-trained speech enhancement model including a hidden layer, a depth clustering layer, a speech mask estimation layer, and a noise mask estimation layer, to obtain target speech features output from the speech enhancement model, the enhancement module being configured to input the initial speech features into the hidden layer, generate intermediate features through the hidden layer, input the intermediate features into the speech mask estimation layer, generate clean speech features through the speech mask estimation layer, and use the clean speech features as the target speech features, wherein the speech enhancement model is obtained by stepwise training based on a depth clustering loss function and mask estimation loss functions corresponding to the speech mask estimation layer and the noise mask estimation layer, respectively; and a computing module configured to calculate, based on the target speech features, a target speech from which noise and reverberation have been removed, the computing module being configured to perform an inverse feature transform on the target speech features to calculate the target speech from which noise and reverberation have been removed.

14. Computer equipment, comprising: a memory; one or more processors coupled to the memory; and one or more applications stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method according to any one of claims 1 to 12.

15. A computer program including computer instructions stored in a storage medium, wherein a processor of a computer device reads the computer instructions from the storage medium and executes them, thereby causing the computer device to perform the method according to any one of claims 1 to 12.