Speech signal processing method, speech processing model training method, and computing device
By performing feature enhancement on the initial speech signal and processing it with a speech processing model in multiple scale modes, the representation inconsistency and speech quality degradation found in traditional speech super-resolution technology are resolved, achieving fine detail recovery and more accurate audio representation of high-resolution speech signals.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-12
AI Technical Summary
Traditional speech super-resolution technology suffers from representation inconsistencies and speech quality degradation when processing speech signals from multiple domains, and is limited to training with a single fixed sampling rate, resulting in poor processing performance.
The spectrogram of the initial speech signal is acquired and feature-enhanced, and a speech processing model then performs signal enhancement at different scales. The model combines a transposed matrix layer with multiple residual blocks to achieve super-resolution conversion from low resolution to high resolution, thereby improving the sound quality, clarity, and fidelity of the speech signal.
It significantly improves the sound quality, clarity, and fidelity of speech signals, achieves fine detail recovery and more accurate audio representation of high-resolution speech signals, and solves the problem of poor speech signal processing performance in traditional technologies.
Figure CN122201320A_ABST
Abstract
Description
Technical Field
[0001] This application relates to large model technology and the field of speech processing. Specifically, it relates to a speech signal processing method, a training method for a speech processing model, and a computing device.
Background Technology
[0002] With the development of digital communication and audio technology, speech super-resolution (SR) has become a hot technology for improving audio signal quality, especially in processing low-sampling-rate speech signals to enhance clarity and fidelity. Traditional super-resolution tasks are limited to training at a single fixed sampling rate and perform poorly in high-frequency detail recovery and cross-domain applications. When processing multi-domain speech signals, there are problems of representation inconsistency and speech quality degradation.
[0003] There is currently no effective solution to the above problems.
Summary of the Invention
[0004] This application provides a speech signal processing method, a speech processing model training method, and a computing device to at least solve the technical problem of poor speech signal processing performance in related technologies.
[0005] According to one aspect of the embodiments of this application, a speech signal processing method is provided, comprising: acquiring an initial speech signal; performing feature enhancement on a first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; inputting the enhanced spectral features into a speech processing model, and using the speech processing model to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement in different scale modes, and the resolution of the target speech signal is greater than the resolution of the initial speech signal.
[0006] According to another aspect of the embodiments of this application, a method for training a speech processing model is also provided, comprising: acquiring an initial sample speech signal; performing feature enhancement on a first sample spectrogram of the initial sample speech signal based on a preset dimension to obtain sample-enhanced spectral features of the initial sample speech signal; inputting the sample-enhanced spectral features into an initial speech processing model, and using the initial speech processing model to perform feature enhancement on the initial sample speech signal to obtain a target sample speech signal; constructing a target sample loss function based on the initial sample speech signal and the target sample speech signal; and updating the model parameters of the initial speech processing model using the target sample loss function to obtain a speech processing model.
[0007] According to another aspect of the embodiments of this application, a speech signal processing method is also provided, comprising: acquiring an initial speech signal by calling a first interface, wherein the first interface includes a first parameter, and the parameter value of the first parameter includes the initial speech signal; performing feature enhancement on a first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; inputting the enhanced spectral features into a speech processing model, and using the speech processing model to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement in different scale modes, and the resolution of the target speech signal is greater than the resolution of the initial speech signal; and outputting the target speech signal by calling a second interface, wherein the second interface includes a second parameter, and the parameter value of the second parameter includes the target speech signal.
[0008] According to another aspect of the embodiments of this application, a computer terminal is also provided, including: a memory storing an executable program; and a processor for running the program, wherein the program executes the methods in various embodiments of this application when it runs.
[0009] According to another aspect of the embodiments of this application, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored executable program, wherein, when the executable program is running, it controls the device where the computer-readable storage medium is located to perform the methods of various embodiments of this application.
[0010] According to another aspect of the embodiments of this application, a computer program product is also provided, including a computer program that, when executed by a processor, implements the methods of various embodiments of this application.
[0011] According to another aspect of the embodiments of this application, a computer program product is also provided, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the methods in various embodiments of this application.
[0012] According to another aspect of the embodiments of this application, a computer program is also provided, which, when executed by a processor, implements the methods of the various embodiments of this application.
[0013] According to another aspect of the embodiments of this application, a computing device is also provided, including: a memory storing an executable program; and a processor for running the program, wherein the program executes the methods in various embodiments of this application when it runs.
[0014] According to another aspect of the embodiments of this application, a computing device is also provided, including: a memory storing an executable program; and a processor connected to the memory via a bus for running the program, wherein the program executes the methods in various embodiments of this application when it runs.
[0015] According to another aspect of the embodiments of this application, an electronic device is also provided, including: a memory storing an executable program; and a processor connected to the memory via a bus for running the program, wherein the program executes the methods in various embodiments of this application when it runs.
[0016] In this embodiment, an initial speech signal is acquired; the first spectrogram of the initial speech signal is enhanced based on a preset dimension to obtain enhanced spectral features of the initial speech signal; the enhanced spectral features are input into a speech processing model, and the initial speech signal is enhanced using the speech processing model to obtain a target speech signal. The speech processing model is used to achieve signal enhancement at different scales. The resolution of the target speech signal is greater than that of the initial speech signal, achieving super-resolution conversion from low-resolution speech signal to high-resolution speech signal. By enhancing the low-resolution first spectrogram, enhanced spectral features are obtained. These enhanced spectral features are input into the speech processing model, which can reconstruct the initial speech signal at different scales based on the enhanced spectral features. This significantly improves the sound quality, clarity, and fidelity of the speech signal, achieving fine detail recovery and more accurate audio representation of the high-resolution speech signal, thereby solving the technical problem of poor speech signal processing performance in related technologies.
[0017] It is worth noting that the general description above and the detailed description that follows are merely for illustrative purposes and do not constitute a limitation on this application.
Attached Figure Description
[0018] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0019] Figure 1 is a schematic diagram illustrating an application scenario of a speech signal processing method according to an embodiment of this application;
[0020] Figure 2 is a flowchart of a speech signal processing method according to an embodiment of this application;
[0021] Figure 3 is a schematic diagram of a high-fidelity speech super-resolution technology according to an embodiment of this application;
[0022] Figure 4 is a flowchart of a training method for a speech processing model according to an embodiment of this application;
[0023] Figure 5 is a flowchart of a speech signal processing method according to an embodiment of this application;
[0024] Figure 6 is a schematic diagram of a speech signal processing device according to an embodiment of this application;
[0025] Figure 7 is a schematic diagram of a training device for a speech processing model according to an embodiment of this application;
[0026] Figure 8 is a schematic diagram of a speech signal processing device according to an embodiment of this application;
[0027] Figure 9 is a structural block diagram of a computing device according to an embodiment of this application;
[0028] Figure 10 is a structural block diagram of an electronic device according to an embodiment of this application.
Detailed Implementation
[0029] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.
[0030] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0031] The technical solution provided in this application is mainly implemented using large-scale model technology. Here, "large-scale model" refers to a deep learning model with a massive number of parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of parameters. Large-scale models can also be called foundation models. They are pre-trained using large-scale unlabeled corpora to produce pre-trained models with hundreds of millions of parameters. Such models can adapt to a wide range of downstream tasks and have good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.
[0032] It should be noted that, in practical applications, large models can be fine-tuned using a small number of samples to adapt them to different tasks. For example, large models can be widely used in Natural Language Processing (NLP), computer vision, and speech processing. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. Therefore, the main application scenarios for large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design. In this embodiment, the use of a large model for data processing in a speech scenario is taken as an example for explanation.
[0033] First, some nouns or terms that appear in the description of the embodiments of this application shall be interpreted as follows:
[0034] Generative Adversarial Networks (GANs) are deep learning model architectures that consist of a generator and a discriminator. The generator is responsible for generating near-realistic samples, while the discriminator distinguishes the generated samples from real samples. During adversarial training, the generator and discriminator compete against each other, with the generator gradually improving the quality of the generated samples.
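For reference, the competition described above is commonly written as the minimax objective of the original GAN formulation. The symbols below (G for the generator, D for the discriminator, x for a real sample, z for the generator input) are standard textbook notation and are not taken from this application, which does not state which adversarial loss variant it uses.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

In a speech super-resolution setting, the generator input would presumably be derived from the low-resolution signal rather than from random noise, and x would be a real high-resolution recording.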
[0035] Speech Super-Resolution (SR) is a technique that improves the quality of low-resolution speech signals. It aims to convert low-sampling-rate speech signals into high-sampling-rate speech signals, thereby enhancing the clarity and fidelity of audio.
[0036] Mel-spectrogram (or simply Mel spectrogram) is an intermediate representation commonly used in audio processing. It converts the spectral information of an audio signal into a representation at the Mel frequency scale, which can better simulate the human ear's perception of different frequencies.
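As an illustration of the Mel frequency scale, the following minimal sketch uses one widely used conversion formula (the HTK-style formula). The application does not specify which Mel formula or how many bands it uses, so the function names and the band count here are assumptions made for illustration only.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min_hz: float, f_max_hz: float, n_bands: int) -> list:
    """Return n_bands + 2 edge frequencies (Hz), equally spaced in mels.

    Equal spacing in mels gives narrow bands at low frequencies and
    wide bands at high frequencies, which mimics human pitch perception.
    """
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]


# Hypothetical example: 8 bands covering 0 Hz to 2 kHz
# (2 kHz is the Nyquist limit of a 4 kHz signal).
edges = mel_band_edges(0.0, 2000.0, 8)
```

The band edges grow further apart as frequency rises, so the same number of Mel bands devotes more resolution to the low-frequency region where speech carries most of its energy.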
[0037] End-to-End Adversarial Training is a training method in which the entire model, from input to output, is jointly optimized within a unified adversarial framework. This method can better maintain the consistency of model representation and improve the quality of generated results.
[0038] A Transformer-Convolutional Generator (or simply Transformer-convolutional generator) is a generator architecture that combines a Transformer network and a convolutional neural network. The Transformer part is used to encode the low-resolution input signal, while the convolutional network is used to upsample the encoded features to a high-resolution output.
[0039] According to an embodiment of this application, a speech signal processing method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one shown here.
[0040] Considering the large number of model parameters in large models and the limited computing resources of mobile terminals, the speech signal processing method provided in this embodiment can be applied to, for example but not limited to, the application scenario shown in Figure 1. Figure 1 is a schematic diagram illustrating an application scenario of a speech signal processing method according to an embodiment of this application. In the application scenario shown in Figure 1, the large model is deployed on server 10. Server 10 can connect to one or more client devices 20 via a local area network (LAN), wide area network (WAN), internet connection, or other types of data network. These client devices 20 may include, but are not limited to, smartphones, tablets, laptops, PDAs, personal computers, smart home devices, and in-vehicle devices. Client devices 20 can interact with users through a graphical user interface to access the large model, thereby implementing the method provided in this embodiment.
[0041] It should be noted that, with the rapid development of high-performance computing units, the methods provided in this embodiment can also be applied to model-in-machine systems in other application scenarios. In one optional embodiment, the model-in-machine system has multiple built-in models, and users can select one model and adjust it as needed to obtain their own model. The high-performance computing unit built into the model-in-machine system can then directly call the adjusted model to execute the methods provided in this embodiment. In another optional embodiment, the model-in-machine system has a pre-trained model built in, and the high-performance computing unit built into the system can directly call that model to execute the methods provided in this embodiment.
[0042] Furthermore, when users need to train their own models, they can upload their own datasets via the client. These datasets are then sent to the server, allowing the server to adjust the pre-trained model using the dataset to obtain the user's customized model, which can then be deployed to the production environment. To facilitate users' model adjustment needs, the server provides complete adjustment tools, development frameworks, and processes, supporting multiple adjustment strategies. This allows the adjusted model to better adapt to different application domains and achieve a high degree of customization.
[0043] In this embodiment, the system consisting of a client device and a server can perform the following steps: The client device uploads an initial speech signal. The server performs feature enhancement on the first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; the enhanced spectral features are input into a speech processing model, and the speech processing model is used to enhance the initial speech signal to obtain the target speech signal. It should be noted that, provided the client device's operating resources can meet the deployment and operation conditions of a large model, this embodiment can be performed on the client device.
[0044] Under the aforementioned operating environment, this application provides the speech signal processing method shown in Figure 2. Figure 2 is a flowchart of a speech signal processing method according to an embodiment of this application. As shown in Figure 2, the method may include the following steps:
[0045] Step S202: Acquire the initial speech signal;
[0046] The initial speech signal mentioned above is the raw, unprocessed speech signal, typically at a low sampling rate (4kHz is used here only as an example and is not a limitation). Such a signal may be deficient in sound quality and detail, especially because it lacks high-frequency information.
[0047] In one alternative embodiment, low-sampling-rate speech signals can be acquired from different sources as the basis for subsequent processing. These sources can include telephone recordings, speech recognition outputs, etc., and the source of the initial speech signal is not limited here.
[0048] Step S204: Perform feature enhancement on the first spectrogram of the initial speech signal based on a preset dimension to obtain the enhanced spectral features of the initial speech signal;
[0049] The first spectrogram mentioned above can be a representation of the initial speech signal transformed from the time domain to the frequency domain through Fourier transform, usually a Mel spectrogram. A Mel spectrogram is a spectral representation of an audio signal that can convert the energy distribution of the audio signal to the Mel frequency scale, which is closer to human auditory perception.
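The time-to-frequency transformation described above can be sketched with a short-time Fourier transform. The following is a minimal, standard-library-only illustration rather than the method of this application: the frame length, hop size, Hann window, and test tone are all assumptions, and the final mapping onto Mel bands is omitted for brevity.

```python
import cmath
import math


def hann(n: int) -> list:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of frame_len // 2 + 1 magnitudes.

    Uses a naive DFT per frame, which is slow but keeps the example
    dependency-free; an FFT library would be used in practice.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical input: a 500 Hz tone sampled at 4 kHz.
sample_rate = 4000
signal = [math.sin(2 * math.pi * 500.0 * t / sample_rate) for t in range(256)]
spec = magnitude_spectrogram(signal)

# The strongest bin of the first frame maps back to the tone frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one time frame and each column one frequency bin; a Mel spectrogram would additionally pool these bins into Mel-spaced bands.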
[0050] The aforementioned preset dimension can be a pre-defined dimension of the feature space used to represent or process the signal. A specific transformation is performed on the first spectrogram according to the preset dimension to enhance features in a higher-dimensional space, aiming to improve feature richness and expressiveness. For example, increasing the number of bands or frequency resolution of the Mel spectrogram can capture finer audio features.
[0051] The aforementioned feature enhancements are used to increase the representational power of signal features. In speech signal processing, specific algorithms and models, such as deep learning networks, can be used to increase the detail and richness of the spectrogram, thereby better preserving and recovering information in the speech signal.
[0052] The aforementioned speech processing model can be a Transformer-Convolutional Generator (TGM) model, capable of processing spectral features and enhancing the signal at different scales to generate high-fidelity speech signals. Through end-to-end adversarial training, the speech processing model optimizes signal reconstruction, particularly in cross-domain application scenarios.
[0053] The aforementioned different scale modes refer to the ability of speech processing models to operate at different frequency and time scales in order to adapt to different types of speech signals and recover information from different frequency components.
[0054] The target speech signal mentioned above is the speech signal obtained after super-resolution conversion. It has a higher sampling rate and frequency resolution than the initial speech signal and is therefore significantly better in sound quality, clarity, and fidelity.
[0055] In one alternative embodiment, the initial speech signal is converted into a Mel spectrogram, and the spectrogram is then enhanced using deep learning structures such as a MossFormer module. During feature enhancement, mapping the first spectrogram to the preset dimension helps the model learn richer and more detailed spectral features, laying the foundation for subsequent signal enhancement.
[0056] Step S206: Input the enhanced spectral features into the speech processing model, and use the speech processing model to enhance the initial speech signal to obtain the target speech signal.
[0057] The speech processing model is used to enhance the signal at different scales, where the resolution of the target speech signal is greater than that of the initial speech signal. Here, scale refers to the speech processing model's ability to process the enhanced spectral features at different scales to generate the target speech signal.
[0058] In an optional embodiment, the feature-enhanced Mel spectrogram, i.e., the enhanced spectral features mentioned above, can be input into a speech processing model. This model utilizes its processing capabilities at different scales to further optimize and transform the spectral features. Signal enhancement using the speech processing model yields the target speech signal. Internally, the features undergo a series of upsampling and transformation processes to ultimately generate a high-resolution speech signal, i.e., the target speech signal. This process, through the model's operation at different scales, achieves signal enhancement from low resolution to high resolution, improving the quality and clarity of the speech signal.
[0059] For example, suppose we receive a 4kHz sampled telephone recording with poor sound quality, especially in the high-frequency range. First, we convert this recording into its Mel spectrogram representation. Then, we enhance the Mel spectrogram with features of preset dimensions (such as increasing the number of frequency bands and frequency resolution) to enrich its information. Next, a speech processing model can be used to convert the enhanced Mel spectrogram into a high-resolution speech signal with a 48kHz sampling rate. In this process, the model not only recovers the lost high-frequency information but also improves the fidelity and clarity of the entire speech signal through operations at different scales. Ultimately, we obtain a target speech signal with better sound quality and richer detail, significantly improving the listening experience and information delivery of the original telephone recording.
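The arithmetic behind this example can be made explicit. The sketch below only restates the 4kHz and 48kHz figures from the paragraph above; the helper names and the two-second clip length are assumptions introduced for illustration.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal at this sampling rate can represent."""
    return sample_rate_hz / 2.0


def integer_upsampling_factor(src_rate_hz: int, dst_rate_hz: int) -> int:
    """Number of output samples produced per input sample."""
    if dst_rate_hz % src_rate_hz != 0:
        raise ValueError("target rate is not an integer multiple of the source rate")
    return dst_rate_hz // src_rate_hz


src_rate_hz, dst_rate_hz = 4_000, 48_000
factor = integer_upsampling_factor(src_rate_hz, dst_rate_hz)

# Frequency band that is absent from the input and has to be generated.
missing_band_hz = (nyquist_hz(src_rate_hz), nyquist_hz(dst_rate_hz))

# Sample counts for a hypothetical two-second clip.
samples_in = src_rate_hz * 2
samples_out = samples_in * factor
```

With these numbers the input can only carry content below 2kHz, so everything between 2kHz and 24kHz in the output has to be predicted by the model rather than copied from the recording.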
[0060] Super-resolution conversion from low-resolution to high-resolution speech signals achieves significant improvements in sound quality, clarity, and fidelity through the steps described above. Specifically, by enhancing the features of the Mel spectrogram of the initial speech signal, we not only improve the feature richness of the spectrogram but also ensure that high-frequency details in the speech signal can be more accurately recovered when signal enhancement is performed at different scales, thereby improving the naturalness and intelligibility of the speech overall. For example, in a teleconference scenario, speech signals transmitted at low sampling rates can be upsampled to higher sampling rates through a speech processing model, significantly improving call quality and making the speech clearer and more natural, providing a high-fidelity speech experience even under poor network conditions or high background noise.
[0061] By performing latent representation prediction and waveform synthesis within the speech processing model, the network structure is simplified, and the model's versatility and robustness in processing speech signals with different input sampling rates are improved. Furthermore, by introducing multi-scale Mel reconstruction loss and a multi-bandwidth, multi-scale time-frequency discriminator into the adversarial training framework, the model's ability to recover details in the high-frequency region and its overall audio fidelity are further enhanced.
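The idea of a multi-scale spectral reconstruction loss can be sketched as follows. This is a simplified illustration under stated assumptions, not the loss defined in this application: it compares log-magnitude spectrograms on a linear frequency axis (the Mel filterbank is omitted), uses an L1 distance, and the frame lengths and epsilon are hypothetical.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len):
    """Hann-windowed magnitude spectrogram with 50% overlap (naive DFT)."""
    hop = frame_len // 2
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        row = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                signal[start + n]
                * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len))
                * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(abs(acc))
        frames.append(row)
    return frames


def multi_scale_spectral_l1(pred, target, frame_lens=(16, 32, 64), eps=1e-5):
    """Average L1 distance between log spectrograms at several resolutions."""
    if len(pred) != len(target):
        raise ValueError("signals must have the same length")
    total = 0.0
    for frame_len in frame_lens:
        p = stft_magnitudes(pred, frame_len)
        t = stft_magnitudes(target, frame_len)
        diffs = [
            abs(math.log(a + eps) - math.log(b + eps))
            for row_p, row_t in zip(p, t)
            for a, b in zip(row_p, row_t)
        ]
        total += sum(diffs) / len(diffs)
    return total / len(frame_lens)
```

Short frames give fine time resolution and long frames give fine frequency resolution, so summing over several frame lengths penalizes errors that a single-resolution comparison could miss.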
[0062] Through the above steps, an initial speech signal is obtained; the first spectrogram of the initial speech signal is enhanced based on a preset dimension to obtain enhanced spectral features of the initial speech signal; the enhanced spectral features are input into a speech processing model, which enhances the initial speech signal to obtain a target speech signal. The speech processing model is used to achieve signal enhancement at different scales. The resolution of the target speech signal is greater than that of the initial speech signal, achieving super-resolution conversion from low-resolution to high-resolution speech signals. By enhancing the low-resolution first spectrogram, enhanced spectral features are obtained. These enhanced spectral features are then input into the speech processing model, which can reconstruct the initial speech signal at different scales based on the enhanced spectral features. This significantly improves the sound quality, clarity, and fidelity of the speech signal, achieving fine detail recovery and more accurate audio representation of the high-resolution speech signal, thus solving the technical problem of poor speech signal processing performance in related technologies.
[0063] In the above embodiments of this application, the speech processing model includes a transposed matrix layer and multiple residual blocks, and the inputs of the multiple residual blocks are connected to the transposed matrix layer. Using the speech processing model to enhance the initial speech signal to obtain the target speech signal includes: upsampling the enhanced spectral features in the time dimension using the transposed matrix layer to obtain a time-domain speech signal with a preset resolution; enhancing the waveform features of the time-domain speech signal using the multiple residual blocks in different scale modes to obtain multiple enhanced signal features, wherein the convolution kernel sizes and dilation rates of the multiple residual blocks differ; and obtaining the target speech signal based on the multiple enhanced signal features.
[0064] The aforementioned transposed matrix layer can be a collection of transposed convolutional layers, whose function is to upsample low-resolution spectral features in the time dimension to match the length of high-resolution waveforms. The transposed convolutional layer can perform convolution operations in reverse, expanding the size of the feature map to recover the original or longer length of the signal in the time domain. The transposed matrix layer is connected to each residual block, jointly participating in the signal enhancement process. The transposed matrix layer is used for upsampling, transforming the enhanced spectral features into a longer time series to match the length of the target speech signal. Subsequently, multiple residual blocks further refine these upsampled features, with each residual block receiving the output of the previous processing stage, thereby achieving deep feature learning and fusion. This design ensures that the model can capture and recover the details of the speech signal at different scales, ultimately generating a high-fidelity target speech signal.
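How a transposed convolution lengthens a sequence can be shown with a naive single-channel implementation. This is a minimal sketch under stated assumptions: there is no padding, the kernels are fixed to all ones instead of being learned, and the strides (4, 3) are hypothetical values chosen only to make the length arithmetic easy to follow.

```python
def transposed_conv1d(x, kernel, stride):
    """Naive single-channel 1-D transposed convolution without padding.

    Each input value is scaled by the kernel and written into the output
    at offset i * stride, so the output length is
    (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def upsample_stack(x, strides):
    """Apply one transposed convolution per stride, in order.

    With kernel length equal to the stride, every layer multiplies the
    sequence length by exactly that stride.
    """
    for stride in strides:
        x = transposed_conv1d(x, [1.0] * stride, stride)
    return x


frames = [0.0, 1.0, 2.0]
upsampled = upsample_stack(frames, strides=(4, 3))  # length 3 * 4 * 3
```

The total upsampling factor is the product of the strides. In a trained model the kernels are learned and usually longer than the stride, so neighboring outputs overlap and are smoothed rather than simply repeated.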
[0065] The residual block described above is a commonly used structure in deep learning models. It addresses the information loss problem during the training of deep networks by introducing skip connections. In speech processing models, residual blocks are used to enhance the waveform features of time-domain speech signals at different scales. Each residual block includes a multi-receptive field fusion (MRF) module with different convolutional kernel sizes and dilation rates. This module can capture and enhance patterns of different lengths in the signal, thereby enriching waveform details and improving signal fidelity.
[0066] The aforementioned multi-receptive field fusion module is a special residual structure that can capture and fuse features at different time scales through convolution kernels of different sizes and different dilation rates. This is very important for recovering complex structures in speech signals (such as prosody and intonation changes) and can ensure that the generated waveform is closer to real speech in detail.
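The multi-receptive field idea can be sketched with plain dilated convolutions. The following is an illustrative, single-channel simplification and not the network of this application: the leaky ReLU activation, the averaging used to fuse the branches, and all kernel sizes and dilation rates are assumptions, and kernel lengths are assumed to be odd.

```python
def dilated_conv1d(x, kernel, dilation):
    """Length-preserving 1-D convolution with zero padding and dilation."""
    half = (len(kernel) - 1) * dilation // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t + j * dilation - half
            if 0 <= idx < len(x):
                acc += weight * x[idx]
        out.append(acc)
    return out


def leaky_relu(value, slope=0.1):
    return value if value >= 0 else slope * value


def residual_block(x, kernel, dilations):
    """Stack of dilated convolutions, each wrapped in a skip connection."""
    y = list(x)
    for dilation in dilations:
        h = dilated_conv1d([leaky_relu(v) for v in y], kernel, dilation)
        y = [a + b for a, b in zip(y, h)]  # skip connection
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def multi_receptive_field_fusion(x, branches):
    """Run branches with different kernels and dilations, then average."""
    outputs = [residual_block(x, kernel, dilations) for kernel, dilations in branches]
    return [sum(values) / len(outputs) for values in zip(*outputs)]
```

For instance, `receptive_field(3, (1, 3, 5))` is 19 samples while `receptive_field(7, (1, 3, 5))` is 55, so branches with different kernel sizes observe short and long waveform patterns at the same time before their outputs are fused.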
[0067] In one optional embodiment, the enhanced spectral features are upsampled in the time dimension using the transposed matrix layer. This can be achieved through a series of transposed convolutional layers that are applied until the output sequence length matches the waveform length at the target sampling rate. For instance, when converting the Mel spectrogram of a 4kHz speech signal into a 48kHz waveform, the transposed matrix layer gradually enlarges the feature map, restoring and adding lost temporal information. The 48kHz sampling rate is merely an example; other sampling rates are also possible and are not limited here.
[0068] The temporal speech signal output from the transposed matrix layer is then enhanced with features through multiple residual blocks. Each residual block is followed by an MRF module, which captures and enhances patterns of varying lengths in the signal, such as prosody and intonation variations, using different convolutional kernel sizes and dilation rates. This makes the generated waveform more detailed and closer to real speech. For example, by adjusting the size and dilation rate of the convolutional kernels in the MRF module, the model can better learn and reconstruct pauses, stresses, and intonations in speech, resulting in more natural and fluent speech.
[0069] Ultimately, by integrating the enhanced signal features from the outputs of multiple residual blocks, a high-resolution target speech signal is generated. This signal not only surpasses the initial input in sampling rate but also achieves significant improvements in sound quality, clarity, and fidelity. For example, by fusing the enhanced signal features from the outputs of all residual blocks, a speech waveform with the desired sampling rate is generated, resulting in a more delicate and audible sound.
[0070] This application utilizes a transposed matrix layer and multiple residual blocks to gradually recover and enhance the temporal and waveform features of a speech signal, starting from low-resolution enhanced spectral features, ultimately generating a high-resolution speech signal. This process combines frequency and time domain processing, ensuring the comprehensiveness and richness of detail in signal enhancement.
[0071] By combining a transposed matrix layer and multiple residual blocks, this application achieves super-resolution conversion from low-resolution speech signals to high-resolution speech signals, resulting in significant improvements in sound quality, clarity, and fidelity. Specifically, the transposed matrix layer recovers the temporal length of the signal, while the residual blocks and MRF module enhance the details and complexity of the signal, ensuring that the generated speech signal maintains high quality at different scales, thereby improving the naturalness and intelligibility of the speech.
[0072] For example, in telephone communication, due to bandwidth limitations, voice signals are typically transmitted at a low sampling rate (e.g., 4kHz), resulting in poor call quality, such as the loss of high-frequency information (e.g., intonation, detail). By using the voice processing model of this application, this low-resolution voice signal can be received and processed. First, it is converted to a Mel spectrogram and feature-enhanced. Then, the Mel spectrogram is upsampled to a time-domain voice signal with a 48kHz sampling rate through a transposed matrix layer. Subsequently, multiple residual blocks further enhance waveform features, capturing and optimizing sound patterns of different lengths. Finally, by integrating all enhanced signal features, a high-fidelity voice signal with a 48kHz sampling rate is generated. This significantly improves call quality and restores voice details and intonation variations, providing users with a clearer, more natural, and higher-quality call experience. The process also helps reduce error propagation in cross-domain applications, enhancing the model's generalization ability and robustness.
[0073] The speech processing model in this application includes multiple residual blocks, which work together to enhance the waveform features of the time-domain speech signal. Through the cascading effect of multiple residual blocks, the model can capture and enhance the detailed features of the speech signal at different scales.
[0074] Each residual block is designed with a specific receptive field and kernel size, allowing it to focus on local or global features of the signal for effective enhancement at different scales. For example, some residual blocks may focus on capturing short-term speech details, while others may be better suited to handling long-term speech structures, such as intonation continuity and prosodic features. This design allows the model to learn and fuse features at multiple scales to generate higher-quality enhanced signal features.
[0075] Scale patterns refer to the degree of attention the model pays to different temporal resolutions and frequency details when processing speech signals. By adjusting the kernel size, dilation rate, and receptive field in residual blocks, the model can simultaneously learn and process signal features at multiple scales. The cascaded structure of multiple residual blocks allows the model to progressively enhance features from low-level to high-level. The output of each residual block serves as the input to the next block, ensuring the coherence and consistency of features across different scales, thereby improving the fidelity and naturalness of the enhanced signal features. A multi-receptive-field fusion (MRF) module is employed within the residual block, capturing patterns of varying lengths by summing the outputs from different residual blocks. The MRF module utilizes residual blocks with different kernel sizes and dilation rates to construct diverse receptive-field patterns, which helps to more comprehensively recover the complex structures and details in the speech signal.
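The summing of residual-block outputs described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the kernel sizes (3, 5, 7), the dilation rates (1, 3), and the fixed averaging kernel that stands in for learned weights. Nonlinear activations are omitted for brevity.

```python
def dilated_conv1d(x, weights, dilation):
    """Zero-padded, length-preserving 1-D convolution with an odd-length kernel."""
    half = (len(weights) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - half) * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out

def residual_block(x, kernel_size, dilations):
    """Stack of dilated convolutions, each wrapped in a skip connection."""
    # Placeholder averaging kernel; a trained model would learn these weights.
    weights = [1.0 / kernel_size] * kernel_size
    for d in dilations:
        y = dilated_conv1d(x, weights, d)
        x = [a + b for a, b in zip(x, y)]  # skip connection
    return x

def mrf(x, kernel_sizes=(3, 5, 7), dilations=(1, 3)):
    """Sum the outputs of residual blocks that use different kernel sizes."""
    outputs = [residual_block(x, k, dilations) for k in kernel_sizes]
    return [sum(values) for values in zip(*outputs)]

impulse = [0.0] * 8 + [1.0] + [0.0] * 8
fused = mrf(impulse)
print(len(fused), sum(1 for v in fused if v != 0.0))
```

Feeding a single impulse through the sketch shows how far each branch spreads the input in time: the branch with the largest kernel reaches the most distant samples, and the summed output combines all of these receptive fields.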
[0076] Different residual blocks employ convolutional kernels of different sizes, allowing each block to capture information at different scales in the signal. Smaller kernels are better at handling local features, while larger kernels are better at capturing global structure, thus ensuring that the model can effectively learn and enhance features at multiple scales.
[0077] By introducing different dilation rates through dilated convolutions, the receptive field of each residual block is expanded, allowing it to consider broader contextual information during processing. Different dilation rates ensure the model can capture patterns of varying time series lengths, which is crucial for reconstructing the detail and structure of speech signals. By employing different kernel sizes and dilation rates in each residual block, the model can simultaneously learn and process signal features at multiple scales, contributing to improved fidelity and naturalness of the enhanced signal features. This design strategy ensures progressive enhancement from low-level to high-level features, enhancing the model's adaptability and ability to process complex speech signals.
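The effect of dilation on the receptive field can be checked with a short calculation. The sketch below uses the standard formula for stacked stride-1 dilated convolutions; the kernel sizes and dilation rates are hypothetical values chosen for illustration and are not taken from this application.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Hypothetical settings: three kernel sizes, each used with dilations 1, 3, 5.
for k in (3, 7, 11):
    print(k, receptive_field(k, (1, 3, 5)))
# prints:
# 3 19
# 7 55
# 11 91
```

With the same three dilation rates, a kernel of size 3 covers 19 samples while a kernel of size 11 covers 91, which is how blocks with different configurations come to specialize in short-term details or longer-term structure.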
[0078] The kernel size and dilation rate within each residual block are specifically designed to accommodate different signal processing needs. This diversity exists not only between residual blocks but also within the configuration of each residual block, ensuring the model's flexibility and accuracy in processing speech signals. Multiple residual blocks are cascaded, so the output of one residual block becomes the input of the next. This design allows the model to progressively refine and enhance features, from coarse overview to fine detail processing, ensuring the quality of the final enhanced signal features. By adjusting the kernel size and dilation rate of different residual blocks, we construct a multi-level feature enhancement mechanism. This not only helps capture multi-scale features of the signal but also enables further fusion of these features through a multi-receptive-field fusion module, generating high-quality enhanced signal features.
[0079] In the above embodiments of this application, the method further includes: constructing a target loss function based on the target speech signal and the initial speech signal; and updating the model parameters of the speech processing model using the target loss function.
[0080] The aforementioned target loss function (or objective function, cost function, loss function) is a mathematical expression used to evaluate the difference between the model's prediction and the true label or target. By quantifying this difference, the model can adjust its parameters based on the backpropagation algorithm to minimize the loss and improve prediction accuracy. In the field of speech super-resolution, the target loss function may include adversarial loss (to make the generated signal as close as possible to the real signal), Mel reconstruction loss (to ensure the accuracy of the frequency domain representation), etc., to comprehensively optimize model performance.
[0081] The aforementioned model parameter update process refers to the continuous adjustment of deep learning model parameters during training to optimize model performance. These parameters include weights and biases in the neural network. By calculating the target loss function and performing backpropagation, the gradient of each parameter can be calculated, and then the parameters can be updated through optimization algorithms to gradually improve the model's predictive ability.
[0082] In one alternative embodiment, the target loss function can be constructed based on the target speech signal and the initial speech signal during training, or it can be constructed based on the target speech signal and the initial speech signal during actual use of the speech processing model. The speech processing model needs a mechanism to quantify the difference between its generated high-resolution speech signal and the actual high-resolution speech signal. The target loss function serves this purpose and typically includes adversarial loss, Mel reconstruction loss, and possibly other loss terms (such as waveform reconstruction loss). For example, the Mel reconstruction loss can be calculated using the difference between the generated 48kHz target speech signal and the actual 48kHz speech signal, while the adversarial loss is calculated by using a discriminator to evaluate the similarity between the generated signal and the actual signal.
[0083] After calculating the target loss function, the gradient of the model parameters can be calculated using the backpropagation algorithm. Then, an optimization algorithm is used to update the model parameters to minimize the loss function. This process is repeated until the model can stably generate high-quality, high-fidelity speech signals. For example, for each generated 48kHz target speech signal, the loss between it and the real 48kHz speech signal is calculated, and then the parameters of the speech processing model are adjusted to gradually improve the sound quality and clarity of the generated signal.
[0084] Constructing the target loss function and updating the model parameters are core steps in training deep learning models. For speech processing models, this process not only ensures stable operation under different input sampling rates but also significantly improves the model's detail recovery capability and audio fidelity in high-frequency regions by introducing multi-scale and multi-bandwidth loss terms. Through end-to-end training, the model can learn a direct mapping from low-resolution Mel spectrograms to high-resolution time-domain waveforms, avoiding potential representation inconsistencies in traditional methods, thereby achieving higher quality and more stable speech reconstruction.
[0085] For example, when developing a new voice communication application, a low-sampling-rate speech signal (e.g., 4kHz sampling rate) can be received from the user's device. First, the signal is converted to a Mel spectrogram and its features are enhanced using a preprocessing module of a High-Fidelity Speech Super-Resolution (HiFi-SR) model. Then, the enhanced features are input into the speech processing model to generate a high-resolution 48kHz speech signal. During the training phase, a target loss function is constructed, which may include adversarial loss and multi-scale Mel reconstruction loss, to evaluate the difference between the generated signal and the real 48kHz speech signal. Calculating this loss guides the updating of model parameters to progressively improve the sound quality, clarity, and fidelity of the generated signal. Ultimately, the trained model effectively performs super-resolution conversion from low-sampling-rate to high-sampling-rate speech signals, significantly improving the quality of voice communication and enabling users to enjoy a high-fidelity, clear, and natural call experience under any device or network conditions. This process not only improves the quality of the speech signal, but also optimizes the model's versatility and robustness, enabling it to stably improve the resolution and sound quality of the speech signal in various application scenarios.
[0086] In the above embodiments of this application, constructing a target loss function based on a target speech signal and an initial speech signal includes: obtaining a second spectrogram of the target speech signal; constructing a first loss function based on the target speech signal and the initial speech signal; constructing a second loss function based on the first spectrogram and the second spectrogram; and constructing the target loss function based on the first loss function and the second loss function.
[0087] After the model generates a high-resolution target speech signal, its spectral features can be extracted using techniques such as Fourier transform or Mel filter banks to form a second spectrogram. For example, a Mel spectrogram can be extracted from the generated 48kHz speech signal for subsequent frequency domain loss calculations.
[0088] The time-domain representation of the generated target speech signal (48kHz) can be compared with that of the corresponding real high-resolution speech signal (48kHz) to calculate the difference between them. The first loss function may employ metrics such as mean squared error or root mean square error to evaluate the accuracy of waveform reconstruction. For example, the first loss function can measure the waveform difference between the 48kHz speech signal generated by the model and the real 48kHz speech signal.
[0089] By comparing the first spectrogram (the Mel spectrogram of the initial speech signal) with the second spectrogram (the Mel spectrogram of the target speech signal), the difference between the initial and target speech signals can be calculated. The second loss function may employ Mel spectrogram reconstruction loss to ensure that frequency domain characteristics (such as timbre and frequency components) are accurately reconstructed. For example, the second loss function can measure the difference between the Mel spectrogram of a 4kHz speech signal and the Mel spectrogram of a 48kHz speech signal.
[0090] After obtaining the values of the first and second loss functions, they can be combined into a comprehensive loss function to guide the updating of model parameters. Through backpropagation and optimization algorithms, the speech processing model can adjust its internal parameters to simultaneously minimize losses in the time and frequency domains, thereby gradually improving the performance of speech super-resolution.
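The combination of the two losses can be sketched as a weighted sum. This is one possible form under stated assumptions: the mean squared error for the waveform term follows the example metric named above, while the mean absolute difference for the spectrogram term and the weight value of 45.0 are hypothetical choices that this application does not specify.

```python
def waveform_mse(generated, reference):
    """First loss (time domain): mean squared error between two waveforms."""
    return sum((g - r) ** 2 for g, r in zip(generated, reference)) / len(generated)

def spectrogram_l1(spec_a, spec_b):
    """Second loss (frequency domain): mean absolute difference between
    two spectrograms of equal shape, given as lists of frames."""
    total = sum(abs(a - b)
                for frame_a, frame_b in zip(spec_a, spec_b)
                for a, b in zip(frame_a, frame_b))
    count = sum(len(frame) for frame in spec_a)
    return total / count

def target_loss(first_loss, second_loss, weight=45.0):
    """Weighted sum of the two terms; the weight is a hypothetical value."""
    return first_loss + weight * second_loss

first = waveform_mse([1.0, 2.0], [1.0, 4.0])
second = spectrogram_l1([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 4.0]])
print(first, second, target_loss(first, second, weight=2.0))
# prints: 2.0 0.75 3.5
```

The weight controls how strongly the frequency-domain term influences the parameter update relative to the time-domain term.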
[0091] By simultaneously considering time-domain and frequency-domain losses during training, this application ensures that the speech signals generated by the model maintain high quality and high fidelity across different representations. The first loss function focuses on the naturalness and fluency of the waveform, while the second loss function ensures the recovery of the signal's frequency domain characteristics, including details such as timbre and frequency components. This two-pronged training strategy enables the model to not only generate smooth and natural waveforms in speech super-resolution conversion but also recover rich spectral features, significantly improving the sound quality, clarity, and intelligibility of the speech signal.
[0092] For example, suppose there is an application aimed at improving the audio quality of voice signals transmitted over telephone lines. Telephone voice signals are typically transmitted at a 4kHz sampling rate, resulting in low quality, particularly the lack of high-frequency information, leading to a poor call experience. A HiFi-SR model can be used to perform super-resolution processing on this 4kHz voice signal, aiming to convert it into 48kHz high-fidelity audio. During training, a second Mel spectrogram is first extracted from the model-generated 48kHz target voice signal. Then, a first loss function (such as Mean Squared Error, MSE) is used to measure the difference between the generated waveform and the real 48kHz voice signal, ensuring the naturalness and smoothness of the waveform. Simultaneously, a second loss function (such as Mel spectrogram reconstruction loss) can be used to evaluate the similarity between the Mel spectrogram of the 4kHz voice signal and the Mel spectrogram of the generated 48kHz signal, ensuring that the timbre and frequency components of the signal are accurately recovered. Finally, the first and second loss functions can be combined, and the model parameters are updated through backpropagation and optimization algorithms to minimize the overall loss and improve the model's conversion performance. After optimization using this training strategy, the HiFi-SR model can convert low-quality 4kHz telephone voice signals into high-quality 48kHz audio signals, significantly improving call quality. Even in complex environments or network conditions, it can provide a clear, natural, and high-fidelity call experience, demonstrating the model's performance and broad application potential in the field of speech enhancement.
[0093] In the above embodiments of this application, constructing a first loss function based on the target speech signal and the initial speech signal includes: constructing scale loss functions of the initial speech signal and the target speech signal at different scales using a multi-scale discriminator; constructing periodic loss functions of the initial speech signal and the target speech signal at different periods using a multi-period discriminator; constructing time-frequency loss functions of the initial speech signal and the target speech signal at different time frequencies using a time-frequency discriminator; and constructing the first loss function based on the scale loss function, the periodic loss function, and the time-frequency loss function.
[0094] The scale loss function described above uses the multi-scale discriminator to evaluate the similarity of the initial and target speech signals across multiple time scales. By comparing signals at different scales, it ensures that the signals generated by the model remain natural and realistic at different resolutions, which helps to capture speech characteristics across different time scales.
[0095] The aforementioned periodic loss function utilizes a multi-period discriminator to evaluate the similarity between the initial and target speech signals across different periods. This is crucial for recovering periodic features in the signal, such as formants in human voices. By ensuring the accurate recovery of these periodic features, the intelligibility and naturalness of the speech signal can be improved.
[0096] The aforementioned time-frequency loss function is constructed using a multi-bandwidth, multi-scale time-frequency discriminator to evaluate the similarity of signals across different time-frequency bands. This time-frequency loss function considers not only the frequency content of the signal but also its temporal dynamics, helping to capture and recover the complex time-frequency structure of the signal and ensuring that the generated signal is richer and more realistic in detail.
[0097] In one alternative embodiment, a multi-scale discriminator (MSD) can be used to evaluate the similarity of signals across multiple time scales. For example, the MSD might operate at time scales of 1x, 2x, and 4x, using techniques such as average pooling to ensure that the model-generated signals remain consistent with the real signals at different resolutions. Next, a multi-period discriminator (MPD) can be used to evaluate the similarity of signals across different periods. For example, the MPD might process samples with periods of 2, 3, 5, 7, and 11 to ensure that formants and other periodic features in the signal are accurately recovered. Finally, a multi-bandwidth multi-scale time-frequency discriminator (MBD) can be used to construct the time-frequency loss function. For example, the MBD might operate on five different frequency bands of the linear spectrogram and use multiple STFT window lengths to ensure that details of the signal are accurately captured across different time-frequency bands.
[0098] The first loss function is constructed based on the scale loss function, the periodic loss function, and the time-frequency loss function. The backpropagation algorithm is used to guide the update of model parameters to minimize these losses and improve the performance of the model.
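The input views that the MSD and MPD operate on can be sketched as follows. The scales (1x, 2x, 4x) and periods (2, 3, 5, 7, 11) are the example values given above. The rest is assumed for illustration: non-overlapping pooling windows, dropping a trailing partial window, and zero-padding the tail before folding. The discriminator networks themselves are omitted.

```python
def average_pool(signal, factor):
    """Non-overlapping average pooling; a trailing partial window is dropped."""
    n = len(signal) // factor
    return [sum(signal[i * factor:(i + 1) * factor]) / factor for i in range(n)]

def fold_by_period(signal, period):
    """Reshape a 1-D signal into rows of `period` samples, zero-padding the tail,
    so that samples one period apart line up in the same column."""
    pad = (-len(signal)) % period
    padded = list(signal) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]

wave = [float(i) for i in range(12)]

# Views for a multi-scale discriminator at 1x, 2x and 4x.
msd_views = {scale: average_pool(wave, scale) for scale in (1, 2, 4)}

# Views for a multi-period discriminator at the listed periods.
mpd_views = {p: fold_by_period(wave, p) for p in (2, 3, 5, 7, 11)}

print(msd_views[4])       # [1.5, 5.5, 9.5]
print(len(mpd_views[5]))  # 3 rows of 5 samples, the last row zero-padded
```

Each view is then scored by its own sub-discriminator, and the scores for the generated and real signals are compared to form the corresponding loss.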
[0099] When training a speech processing model, scale loss, periodic loss, and time-frequency loss functions can be constructed to evaluate the similarity between the generated speech signal and the target high-resolution signal from different perspectives. These loss functions complement each other, ensuring that the signal generated by the model maintains high quality in the time, frequency, and time-frequency domains, thereby guiding the effective updating of model parameters. By introducing scale, periodic, and time-frequency loss functions during training, this approach can comprehensively optimize the generated speech signal, ensuring that it achieves high fidelity in detail, naturalness, and intelligibility. This multi-dimensional evaluation and optimization greatly improves the model's speech super-resolution conversion capability, enabling it to stably output high-quality speech signals in different scenarios.
[0100] In the above embodiments of this application, feature enhancement is performed on the first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced features of the initial speech signal, including: performing dimensional transformation on the first spectrogram to obtain multiple initial features of the first spectrogram in the preset dimension; and performing feature enhancement on the multiple initial features based on the correlation between the multiple initial features and the feature enhancement time series to obtain enhanced features, wherein the feature enhancement time series is used to represent the number of cycles of feature enhancement applied to the multiple initial features.
[0101] The aforementioned feature enhancement time series refers to the sequence of the number of times the model processes the input features during the feature enhancement process. For example, in the MossFormer module, features may undergo multiple passes of the self-attention mechanism, each aimed at further enhancing the representational power of the features; this sequence of the number of processing iterations is the feature enhancement time series.
[0102] The number of iterations mentioned above refers to the number of times the convolutional network iterates through the input features during feature enhancement. Each iteration includes feature enhancement operations such as convolution and nonlinear transformations, aiming to progressively improve the fineness and richness of the features to better recover the details in the speech signal. The feature enhancement time series is a dynamic representation of the feature enhancement process, containing a sequence of features progressively enhanced across multiple iterations. In each iteration, the features are enhanced through the processing of the convolutional network, and each point in the time series represents the state of the feature at the end of that iteration. These sequence points are connected to form an enhanced time series, demonstrating the evolution of the features from their original state to their final enhanced state.
[0103] The number of iterations primarily refers to the number of times the convolutional network processes the multiple initial features during feature enhancement. In each iteration, features are enhanced through convolutional layers and nonlinear transformations, aiming to progressively improve the resolution and detail of the features to achieve high-quality speech signal reconstruction. The feature enhancement time series is a record of this dynamic process, containing a sequence of features progressively enhanced across multiple iterations. The feature state at the end of each iteration constitutes a point in the time series, demonstrating the evolution of the features from their original state to their final enhanced state and reflecting the depth and detail of the feature enhancement.
[0104] By controlling the number of iterations, the depth of feature enhancement can be adjusted, thereby optimizing the quality of the final generated speech signal. The feature enhancement time series not only demonstrates the feature enhancement process but also provides important evidence for analyzing and adjusting the cyclic feature enhancement mechanism to adapt to the processing needs of different speech signals.
[0105] The dimensionality transformation mentioned above refers to the process in deep learning models of transforming data from one dimensional space to another through specific layers (such as linear layers). This is typically used for feature mapping, compression, or expansion so that the model can process the data more efficiently. For example, transforming the input Mel spectrogram from a lower dimension to a higher dimension space can capture more detailed and abstract features.
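A linear layer of this kind can be sketched in a few lines. The dimensions used here (4 input bins mapped to 8 features) and the random initialization are assumptions chosen to keep the example small; they are not the preset dimension of this application, and a trained model would learn the weights.

```python
import random

def make_linear_layer(in_dim, out_dim, seed=0):
    """Return a function that maps an in_dim-vector to an out_dim-vector."""
    rng = random.Random(seed)
    # Randomly initialized weights stand in for learned parameters.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def apply(frame):
        return [sum(w * x for w, x in zip(row, frame)) + b
                for row, b in zip(weights, bias)]

    return apply

# Toy spectrogram: 3 time frames, 4 frequency bins per frame.
spectrogram = [[0.2, 0.1, 0.0, 0.4],
               [0.3, 0.0, 0.1, 0.2],
               [0.0, 0.5, 0.2, 0.1]]

project = make_linear_layer(in_dim=4, out_dim=8)
initial_features = [project(frame) for frame in spectrogram]
print(len(initial_features), len(initial_features[0]))  # 3 8
```

The layer is applied to each time frame independently, so the number of frames is unchanged while the feature dimension of every frame grows from 4 to 8.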
[0106] In one optional embodiment, the first spectrogram can be dimensionally transformed based on a preset dimension to obtain multiple initial features of the first spectrogram in the preset dimension.
[0107] When the model begins processing the initial speech signal, it first generates its Mel spectrogram representation. Then, a linear layer transforms the Mel spectrogram into a higher-dimensional space, resulting in a set of initial features in a preset dimension, which are then further enhanced. Next, the initial features input to the MossFormer component undergo a series of feature enhancement loops. MossFormer uses a joint local and global self-attention mechanism, enhancing the correlations between features according to the number of loops defined in the feature enhancement time series, capturing long-term dependencies in the input sequence, and thus obtaining enhanced features. For example, the MossFormer component may apply multiple loops of self-attention processing, each loop progressively optimizing the feature representation based on the correlations and dependencies between features.
[0108] When processing the Mel spectrogram of a speech signal, the model first maps the spectrogram to a higher-dimensional space through a dimension transformation layer, obtaining multiple initial features. Then, the MossFormer module enhances these initial features based on the correlations between these features and the defined feature enhancement time series, through a multi-loop self-attention mechanism. This captures and optimizes long-term global dependencies in the signal, generating optimized enhanced features that provide richer information for subsequent waveform generation.
[0109] Through dimensionality transformation and feature enhancement using the MossFormer component, the model can more effectively capture long-term dependencies and detailed features in the signal, thus maintaining the naturalness and clarity of the signal when converted to high-resolution speech. This feature enhancement strategy based on a self-attention mechanism is particularly beneficial for Mel spectrogram transformation, making the generated signal closer to real speech in its high-frequency structure, thus providing a higher-quality audio conversion solution.
[0110] In the above embodiments of this application, multiple initial features are enhanced based on the correlation between multiple initial features and the feature enhancement time series to obtain enhanced features, including: global enhancement of multiple initial features based on the correlation to obtain global enhanced features; local enhancement of global enhanced features based on the time window of the feature enhancement time series to obtain local enhanced features; and enhanced features obtained based on global enhanced features and local enhanced features.
[0111] The aforementioned global enhanced features refer to feature representations that, after being processed by the model's self-attention mechanism, can capture the long-term dependencies and global features of the entire time series.
[0112] Global enhancement can be performed on each initial feature based on the association relationship to obtain the global enhanced feature corresponding to each initial feature.
[0113] The aforementioned local enhanced features refer to feature representations obtained when the model operates on local windows of the time series, focusing on the instantaneous changes and local structure of the signal, and further enhances the features through recurrent modules to capture local temporal information and detail.
[0114] The feature enhancement time series is a dynamic representation of the feature enhancement process. It includes a sequence of features that are progressively enhanced over multiple iterations. A time window refers to the data segment used to train the model in time series analysis. The size and position of the time window can be set according to actual needs. The size of the time window can be determined based on the number of features to be enhanced locally; the more features that need enhancement, the larger the time window, and vice versa.
[0115] In an optional embodiment, a self-attention mechanism can be used to capture the correlations between multiple initial features and the long-term dependencies of the signal throughout the time series, resulting in a set of globally enhanced features. These features, enhanced by the self-attention mechanism, then enter a recurrent module. This module operates on consecutive subsets of the time series using a specified time window size, locally enhancing the global features to capture instantaneous changes and local structures of the signal, resulting in locally enhanced features. Finally, the model combines the global and local enhanced features to obtain the final enhanced features, which contain both global dependencies and detailed local information, providing a richer and more optimized feature representation for subsequent speech super-resolution conversion.
[0116] When processing speech signals, a self-attention mechanism is used to globally enhance multiple initial features to capture the global dependencies of the signal. Then, in the recurrent module, a time window is used to locally enhance the globally enhanced features, focusing on the instantaneous changes and local structures of the signal. Finally, the globally enhanced features and the locally enhanced features are combined to obtain an optimized feature representation that contains both global and local information, namely the enhanced features.
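The three steps above (global enhancement, windowed local enhancement, combination) can be illustrated with a simplified sketch. All specifics are assumptions made for illustration: the attention uses identity projections instead of learned ones, a simple decaying recurrence restarted in each window stands in for the recurrent module, and the two results are combined by element-wise addition.

```python
import math

def self_attention(feats):
    """Global step: every frame attends to every frame (identity projections)."""
    dim = len(feats[0])
    out = []
    for query in feats:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in feats]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        norm = sum(exps)
        out.append([sum(e / norm * value[i] for e, value in zip(exps, feats))
                    for i in range(dim)])
    return out

def windowed_recurrence(feats, window, decay=0.5):
    """Local step: a simple recurrence whose state restarts in each time window."""
    out = []
    state = [0.0] * len(feats[0])
    for t, x in enumerate(feats):
        if t % window == 0:
            state = [0.0] * len(x)  # new window: reset the recurrent state
        state = [decay * s + (1 - decay) * xi for s, xi in zip(state, x)]
        out.append(state)
    return out

def enhance(feats, window=4):
    """Combine global and local enhanced features by element-wise addition."""
    global_feats = self_attention(feats)
    local_feats = windowed_recurrence(global_feats, window)
    return [[g + l for g, l in zip(gv, lv)]
            for gv, lv in zip(global_feats, local_feats)]

frames = [[1.0, 0.0]] * 6
enhanced = enhance(frames, window=4)
print(len(enhanced), len(enhanced[0]))  # 6 2
```

The global step mixes information across the whole sequence, while the local step only sees frames inside the current window, so the combined output carries both kinds of context for each frame.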
[0117] By combining global and local enhancements, the model can more comprehensively understand and represent the characteristics of speech signals. Global enhancement ensures that the model can capture long-term dependencies and global features in the signal, which is crucial for high-frequency structure recovery in speech super-resolution conversion. Local enhancement, on the other hand, ensures that the model can focus on instantaneous changes and local details in the signal, which is essential for maintaining the natural fluency and clarity of the signal. This two-layer enhancement strategy enables the model to not only generate waveforms with natural fluency and high fidelity in speech super-resolution conversion, but also to recover rich spectral details and instantaneous change information in the signal, significantly improving the quality and clarity of the speech signal.
[0118] In the above embodiments of this application, the method further includes: obtaining the sampling frequency of the initial speech signal; sampling the initial speech signal based on the sampling frequency and a preset sampling frequency to obtain a sampled speech signal; and extracting the spectral signal of the sampled speech signal to obtain a first spectrogram.
[0119] The aforementioned preset sampling frequency refers to the target sampling rate set in advance during signal processing, used to convert the original signal to the required sampling rate. In this application, the preset sampling frequency can be 48kHz, chosen to match the input requirements of the model so that high-quality speech super-resolution processing can be performed. The preset sampling frequency is not limited to this value and can be adjusted according to the actual situation.
[0120] For example, the sampling frequency of this speech signal can be identified as 8kHz. According to the system requirements, the preset sampling frequency is 48kHz. Therefore, the 8kHz signal needs to be upsampled to 48kHz to match the model's input requirements, resulting in a sampled speech signal. Once the sampling frequency of the sampled speech signal is adjusted to 48kHz, its spectral information can be extracted using methods such as the Mel filter bank to generate a Mel spectrogram, i.e., the first spectrogram.
[0121] When processing initial speech signals with arbitrary sampling frequencies, the original sampling frequency of the signal is first identified. Then, the signal is upsampled based on a preset sampling frequency to ensure that the signal meets the requirements of the model input. Next, spectral information is extracted from the upsampled signal to generate the first spectrogram. This process provides standardized input for subsequent model processing, ensuring that signals with different sampling frequencies can be effectively processed and obtain consistent high-quality output.
[0122] The above signal preprocessing procedure not only enables the processing of speech signals with multiple sampling frequencies but also ensures that the original information and characteristics of the signal are preserved during the conversion process, reducing information loss caused by sampling frequency mismatch. Upsampling to a preset sampling frequency and extracting the Mel spectrogram provides the model with rich spectral information, which helps the model to accurately perform speech super-resolution conversion and significantly improves the sound quality and clarity of the speech signal.
[0123] In the above embodiments of this application, determining the sampled speech signal corresponding to the initial speech signal based on the sampling frequency and the preset sampling frequency includes: in response to the sampling frequency being less than the preset sampling frequency, upsampling the initial speech signal to obtain a sampled speech signal with the preset sampling frequency; and in response to the sampling frequency being greater than or equal to the preset sampling frequency, determining the initial speech signal as a sampled speech signal.
[0124] The upsampling mentioned above refers to increasing the sampling frequency of a signal, that is, switching from a lower sampling rate to a higher sampling rate. This is usually achieved by inserting zero values or through interpolation algorithms to reduce information loss and improve signal quality.
[0125] When processing speech signals, the first step is to check if the signal's sampling frequency meets the model's preset sampling frequency. If the signal's sampling frequency is lower than the preset frequency, upsampling techniques can be used to increase the signal's sampling frequency to the preset frequency to ensure that the signal meets the model's input requirements. If the signal's sampling frequency is equal to or higher than the preset frequency, the signal is directly used as the model's input without further sampling frequency adjustment. This process ensures that input signals with different sampling frequencies can be processed at a uniform sampling frequency, improving the model's versatility and processing efficiency.
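The sampling-frequency check described above can be sketched as follows. This is a minimal illustration that assumes linear interpolation as the upsampling method; a practical system would more likely use a band-limited (for example polyphase) resampler. The function name and the 48kHz default are illustrative only.

```python
def to_preset_rate(signal, sample_rate, preset_rate=48000):
    """Return (sampled_signal, rate) after the sampling-frequency check."""
    if sample_rate >= preset_rate:
        # At or above the preset frequency: use the initial signal directly.
        return list(signal), sample_rate
    n_out = int(round(len(signal) * preset_rate / sample_rate))
    out = []
    for j in range(n_out):
        pos = j * sample_rate / preset_rate  # position in input sample units
        i = int(pos)
        frac = pos - i
        if i + 1 < len(signal):
            out.append(signal[i] * (1.0 - frac) + signal[i + 1] * frac)
        else:
            out.append(signal[-1])  # hold the last sample at the right edge
    return out, preset_rate

upsampled, rate = to_preset_rate([0.0, 6.0], 8000)         # 8kHz -> 48kHz
unchanged, same_rate = to_preset_rate([1.0, 2.0], 96000)   # passed through
```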
[0126] In one optional embodiment, if the sampling frequency is less than the preset sampling frequency, the initial speech signal can be upsampled based on the preset sampling frequency to obtain a sampled speech signal at the preset sampling frequency.
[0127] The signal processing flow described above, based on sampling frequency and a preset sampling frequency, ensures the consistency of the model's input signal, avoiding information loss and reduced processing efficiency caused by sampling frequency mismatch. For signals with sampling frequencies below the preset frequency, upsampling ensures the integrity of the information in the signal and improves sound quality; for signals with sampling frequencies higher than or equal to the preset frequency, the initial signal is used directly as input, avoiding unnecessary processing overhead. This intelligent signal preprocessing strategy not only improves the model's processing efficiency and versatility but also ensures high quality during signal conversion, providing consistent high-quality processing for speech signals from different sources and with different sampling frequencies.
[0128] This application employs High-Fidelity Speech Super-Resolution (HiFi-SR) technology, which can increase the sampling rate of the input speech signal, for example from any rate between 4kHz and 32kHz up to 48kHz, but is not limited to this. HiFi-SR constructs a unified, end-to-end network structure and uses a generative adversarial training mechanism to ensure high fidelity and broad applicability, opening up a new research direction in the field of 48kHz super-resolution. This application uses an improved Transformer network (MossFormer2) to effectively capture long-term dependencies in the speech signal, providing a solid foundation for inferring high-frequency structures. MossFormer2, as the encoder, transforms the low-resolution Mel spectrogram into a latent representation space, laying a high-quality encoding foundation for subsequent waveform generation.
[0129] In terms of waveform generation, this application introduces a High Fidelity Generative Adversarial Network (HiFi-GAN) as the generator to ensure that the output speech waveform has good clarity and naturalness, which is a key step in improving audio fidelity. To enrich high-frequency details, this application further incorporates a multi-bandwidth, multi-scale time-frequency discriminator and a multi-scale Mel reconstruction loss into the adversarial training mechanism, enhancing the model's ability to capture and recover high-frequency signals, so that the generated speech signal remains detailed and realistic in the high-frequency range.
[0130] This application achieves speech signal conversion from low sampling rate to high sampling rate by integrating MossFormer2 and HiFi-GAN, as well as innovative training strategies and discriminator design. It also makes significant breakthroughs in ensuring speech fidelity and improving the model's generalization ability, providing strong support for the development of speech enhancement, transmission and communication technologies.
[0131] Figure 3 is a schematic diagram of a high-fidelity speech super-resolution technology according to an embodiment of this application. As shown in Figure 3, this structure includes a linear layer, an improved deep network learning structure module (MossFormer2 Module), a recurrent module, a high-fidelity generative adversarial network generator (HiFi-GAN Generator), a multi-scale discriminator (MSD), a multi-period discriminator (MPD), and a multi-bandwidth multi-scale time-frequency discriminator (MBD). The structure can take a Mel spectrogram as input. To adapt to different input sampling rates, the low-sampling-rate speech signal can first be upsampled to 48kHz, and then the Mel spectrogram of the upsampled speech signal can be extracted. The linear layer can project the Mel spectrogram to a higher-dimensional space to obtain the input sequence. In the Transformer-Convolutional Generator, the improved deep network learning structure module and the recurrent module can be repeated N times to fully capture the long-term global dependencies in the input sequence, thereby outputting an enhanced latent representation of the Mel spectrogram. Then, the enhanced latent representation is input into the high-fidelity generative adversarial network generator for waveform synthesis to obtain the synthesized speech signal. This application uses the multi-scale discriminator from the Mel generative adversarial network (MelGAN) and the multi-period discriminator from the high-fidelity generative adversarial network (HiFi-GAN) to capture periodic speech patterns at different levels. Furthermore, a multi-bandwidth multi-scale time-frequency discriminator is used to improve the high-frequency fidelity and overall quality of the generated speech signal. The target speech signal is generated by combining the multi-scale discriminator, the multi-period discriminator, and the multi-bandwidth multi-scale time-frequency discriminator. The total loss can be constructed based on the target speech signal and the synthesized speech signal, and the synthesis effect of the above-mentioned high-fidelity generative adversarial network generator can be improved based on the total loss.
[0132] In the above process, the improved deep network learning structure module can use an attention gating mechanism to reduce the number of self-attention heads to one, removing the need for multi-head attention. A wider receptive field can be achieved by dilated recurrent blocks based on Feed-Forward Sequence Memory Networks (FSMNs). These recurrent blocks are crucial for capturing recurrent patterns related to speech structure, prosody, and semantic associations in speech signals, thus improving the prediction accuracy of high-frequency details in speech signals.
[0133] The aforementioned transformer-convolution generator, based on the HiFi-GAN generator, consists of a series of transposed convolutional layers that upsample the input sequence until the output sequence length matches the length of the high-resolution waveform. Each transposed convolutional layer is followed by a Multi-Receptive Field Fusion (MRF) module. The MRF module captures patterns of varying lengths by summing the outputs from multiple residual blocks, each with a different convolutional kernel size and dilation rate, thus creating diverse receptive field patterns. This application allows adjustment of the hidden dimension, transposed convolutional kernel size, MRF convolutional kernel size, and MRF dilation rate to improve the performance of super-resolution experiments.
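Two properties of the generator described above can be checked with simple arithmetic: how a stack of transposed convolutions stretches the frame sequence to waveform length, and how residual blocks with different kernel sizes and dilation rates yield different receptive fields. The sketch below uses the standard output-length formula for a 1D transposed convolution (dilation 1, no output padding). The strides, kernel sizes, and dilation rates are illustrative assumptions, not parameters disclosed in this application.

```python
def transposed_conv_out_len(length, kernel, stride, padding):
    """Output length of a 1D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel

def frames_to_samples(n_frames, strides=(8, 8, 2, 2)):
    """Chain transposed convolutions with kernel = 2*stride, padding = stride//2."""
    length = n_frames
    for stride in strides:
        # With these settings each layer multiplies the length by its stride.
        length = transposed_conv_out_len(length, 2 * stride, stride, stride // 2)
    return length

def receptive_field(kernel, dilations):
    """Receptive field of stacked dilated convolutions inside one residual block."""
    return 1 + sum((kernel - 1) * d for d in dilations)

samples = frames_to_samples(100)  # 100 frames with a total upsampling factor of 256
fields = {k: receptive_field(k, (1, 3, 5)) for k in (3, 7, 11)}
```

With these assumed values, the three residual blocks see 19, 55, and 91 samples respectively, which is the kind of receptive-field diversity the MRF module relies on when it sums their outputs.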
[0134] The multi-scale discriminator described above analyzes the generated audio waveforms at different scales, capturing details at different levels within the waveform. It can operate at various scales, adjusting the input scale through average pooling to evaluate audio quality at different resolutions. The multi-scale discriminator evaluates audio waveforms from macroscopic to microscopic levels, helping the generator learn waveform features at different scales and avoiding information loss that may result from single-scale evaluation.
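The scale adjustment through average pooling can be sketched as follows. This is a minimal illustration assuming non-overlapping pooling windows and pooling factors of 1, 2, and 4; the factors and function names are hypothetical and not taken from this application.

```python
def average_pool(signal, factor):
    """Non-overlapping average pooling; trailing samples that do not fill a window are dropped."""
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]

def multi_scale_views(signal, factors=(1, 2, 4)):
    """One view of the waveform per scale; each view would feed its own sub-discriminator."""
    return {factor: average_pool(signal, factor) for factor in factors}

views = multi_scale_views([1, 3, 5, 7, 9])
```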
[0135] The aforementioned multi-period discriminator is used to evaluate the periodic patterns of audio signals. It processes non-overlapping samples with period lengths of [2,3,5,7,11], which helps the model learn the periodic and rhythmic information unique to speech. The period lengths here are only illustrative and not specifically limited. By capturing the details of these specific periods, the multi-period discriminator makes the generated audio signals maintain natural rhythm and prosody, which is crucial for the fluency and audibility of synthesized speech.
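One common way to expose periodic structure to a discriminator is to fold the one-dimensional waveform into rows of length p, so that each column holds samples spaced p apart. The sketch below assumes that approach and zero padding at the end of the signal; both are assumptions for illustration, and the function name is hypothetical.

```python
PERIODS = (2, 3, 5, 7, 11)  # illustrative period lengths from the text

def fold_by_period(signal, period):
    """Reshape a 1D signal into rows of `period` samples (zero-padded at the end)."""
    pad = (-len(signal)) % period
    padded = list(signal) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]

rows = fold_by_period(list(range(7)), 3)
first_column = [row[0] for row in rows]  # samples 0, 3, 6: one every 3 steps
```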
[0136] The aforementioned multi-bandwidth, multi-scale time-frequency discriminator preserves phase information by operating on the real and imaginary parts of the complex short-time Fourier transform (STFT), enabling detailed recovery in the high-frequency region. By using five different STFT window lengths (4096, 2048, 1024, 512, 256) and dividing the frequency bands, it can comprehensively evaluate the audio quality of different frequency bands. It should be noted that the window lengths mentioned above are for illustrative purposes only and are not intended to be limiting. The multi-bandwidth, multi-scale time-frequency discriminator overcomes the over-smoothing problem that may exist in multi-scale and multi-period discriminators in the high-frequency region. By preserving phase information, it can provide more accurate fidelity evaluation in the high-frequency part, helping the generator optimize the reconstruction of these key frequency regions and reduce high-frequency artifacts.
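Keeping the real and imaginary parts of the STFT as two separate input channels, rather than reducing them to a magnitude, is what preserves phase. A minimal NumPy sketch follows; the Hann window, the hop length of one quarter of the window, and the function name are assumptions made for illustration.

```python
import numpy as np

def complex_stft_channels(signal, win_length, hop_length):
    """Return an array of shape [2, n_frames, n_bins]: real part, then imaginary part."""
    signal = np.asarray(signal, dtype=np.float64)
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length:i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    spec = np.fft.rfft(frames, axis=1)       # complex spectrum per frame
    return np.stack([spec.real, spec.imag])  # phase is kept implicitly

tone = np.sin(2 * np.pi * 440 * np.arange(1024) / 48000)
channels = complex_stft_channels(tone, win_length=256, hop_length=64)
```

The same call can be repeated for each window length so that the discriminator sees the signal at several time-frequency resolutions.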
[0137] The aforementioned multi-scale discriminator focuses on the macroscopic structure of the waveform, the multi-period discriminator emphasizes the evaluation of speech rhythm, and the multi-bandwidth multi-scale time-frequency discriminator finely adjusts high-frequency details while preserving phase information. The combination of these three technologies comprehensively captures the characteristics of the audio signal from multiple dimensions, providing more comprehensive feedback to the generator and helping it optimize the quality of the generated audio. This integrated approach allows the model to simultaneously evaluate and optimize the periodicity, rhythm, and frequency details of the waveform, ensuring high fidelity of the generated audio across different frequency bands and time scales. The combination of MSD, MPD, and MBD also enhances the model's robustness. Even when dealing with speech signals with different sampling rates or domain differences, this multi-angle, multi-level evaluation method ensures the model's output quality. In summary, MSD, MPD, and MBD each enhance different characteristics of the audio waveform, and their combined use significantly improves the performance of speech super-resolution models in generating high-fidelity audio waveforms.
[0138] The aforementioned multi-bandwidth, multi-scale time-frequency discriminator (MBD) evaluates the fidelity of the generated audio signal by analyzing the characteristics of the signal across different frequency bands and time scales. To achieve this goal, MBD employs the following network structure: 2D convolutional layers (3×8 kernels and 32 channels), dilated 2D convolutional layers, convolutional layers with a stride of 2 along the frequency axis, and a final 2D convolutional layer (3×3 kernel and a (1,1) stride). It should be noted that the structural parameters of the MBD network structure described above are for illustrative purposes only and are not intended to be limiting. The structural parameters can be adjusted according to actual usage.
[0139] The aforementioned 2D convolutional layer is the foundational network block in MBD, responsible for processing the frequency and time dimensions of the input signal. The 3×8 convolutional kernel means it considers 3 points in the frequency dimension and 8 points in the time dimension; the 32 channels represent that the discriminator will learn 32 different feature maps. This design is able to capture the local time-frequency features of the input signal.
[0140] Following the base 2D convolutional layers, MBD uses 2D convolutional layers with dilation rates of 1, 2, and 4 in the temporal dimension. Dilated convolution increases the receptive field size without increasing the number of parameters or computational complexity by inserting holes (i.e., skipped positions) between the weights of the convolutional kernel. A dilation rate of 1 means no holes, while dilation rates of 2 and 4 mean inserting one and three holes, respectively, between adjacent kernel weights. This design increases the discriminator's receptive field in the temporal dimension, enabling it to capture longer speech structures, such as rhythm and intonation variations.
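The relationship between the dilation rate and the number of holes can be verified with a short calculation: a rate of d leaves d - 1 holes between adjacent kernel weights, so no holes, one hole, and three holes correspond to rates of 1, 2, and 4. The sketch below also computes the span covered by a kernel with 8 taps along the time axis, which is an illustrative value; the function names are hypothetical.

```python
def holes_between_taps(dilation):
    """Number of skipped positions between adjacent kernel weights."""
    return dilation - 1

def effective_span(kernel_size, dilation):
    """Number of input positions covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1

holes = {d: holes_between_taps(d) for d in (1, 2, 4)}
spans = {d: effective_span(8, d) for d in (1, 2, 4)}  # 8 taps along the time axis
```

The kernel keeps 8 weights in every case, so the parameter count is unchanged while the covered span grows from 8 to 15 to 29 positions.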
[0141] The convolutional layer described above with a stride of 2 along the frequency axis represents downsampling in the frequency dimension. A stride of 2 means moving two units at a time in the frequency dimension. This design facilitates MBD analysis at different frequency scales while reducing computational cost and the number of parameters.
[0142] The final 2D convolutional layer described above is the last layer of MBD. It uses a small convolutional kernel (3×3) and a stride of (1,1), i.e., no downsampling, to generate the final prediction output. The role of this layer is to perform the final feature integration and classification decision, determining whether the input signal is real or created by the generator.
[0143] The MBD analyzes signal characteristics across different time scales and frequency bands by using the concatenated real and imaginary parts of a complex short-time Fourier transform (STFT) as input. Specifically, the STFT transforms the audio signal into the time-frequency domain, revealing the signal's energy distribution at different frequencies and time points. MBD uses five different STFT window lengths (4096, 2048, 1024, 512, 256) to analyze audio characteristics at different resolutions. The window length determines the frequency resolution; shorter windows capture high-frequency details, while longer windows are better suited for extracting low- and mid-frequency information. Bandwidth division allows MBD to divide the signal spectrum into sub-bands bounded by the fractions [0.0, 0.1, 0.25, 0.5, 0.75, 1.0], each covering a specific frequency range. This division helps the discriminator independently evaluate signal quality across different frequency ranges. For each sub-band and time scale, MBD applies network blocks with the same structure, which includes the aforementioned 2D convolutional layers, dilated convolutional layers, and convolutional layers with a stride of 2 along the frequency axis. This shared network block design ensures consistent evaluation metrics across different frequencies and time scales while reducing the model's training complexity.
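The band division can be sketched by mapping the boundary fractions onto STFT bin indices for each window length. The sketch assumes that the fractions are applied to the number of one-sided frequency bins (window length divided by 2, plus 1) and that boundaries are truncated to integers; both are assumptions for illustration.

```python
BAND_FRACTIONS = (0.0, 0.1, 0.25, 0.5, 0.75, 1.0)
WINDOW_LENGTHS = (4096, 2048, 1024, 512, 256)

def subband_bins(win_length, fractions=BAND_FRACTIONS):
    """Return (start_bin, end_bin) pairs, end exclusive, for each sub-band."""
    n_bins = win_length // 2 + 1  # one-sided spectrum
    edges = [int(f * n_bins) for f in fractions]
    return list(zip(edges[:-1], edges[1:]))

bands_256 = subband_bins(256)
all_bands = {w: subband_bins(w) for w in WINDOW_LENGTHS}
```

Under this reading, at a 48kHz sampling rate the boundaries fall at roughly 0, 2.4, 6, 12, 18, and 24kHz, so the upper sub-bands are evaluated separately from the low band that already exists in the low-resolution input.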
[0144] This application features a unified transformer-convolutional generator capable of seamlessly handling the prediction of latent representations and their conversion to time-domain waveforms. This design allows latent representations to transcend the limitations of Mel spectrograms, enabling the transformer network to adjust these representations for close alignment with the convolutional network during waveform generation. The convolutional network in this application uses a high-fidelity generative adversarial network generator to ensure high-quality waveform generation. To further enhance high-frequency fidelity, this application introduces a multi-bandwidth, multi-scale time-frequency discriminator and a multi-scale Mel reconstruction loss within the adversarial training framework.
[0145] This application proposes a unified Transformer-convolutional generator architecture that combines Transformer networks and convolutional neural networks. The Transformer network encodes low-resolution Mel spectrograms, while the convolutional network upsamples these latent representations into high-resolution time-domain waveforms. This architecture seamlessly handles latent representation prediction and waveform generation within the same network, ensuring representation consistency and high fidelity in speech reconstruction. An end-to-end adversarial training method is used to jointly optimize the generator and discriminator within a unified framework, avoiding representation inconsistencies caused by independent training and concatenated networks, exhibiting higher robustness, especially in cross-domain scenarios. A multi-band, multi-scale time-frequency discriminator is introduced, enabling evaluation of generated speech signals at different frequency bands and scales, improving high-frequency fidelity and speech detail restoration capabilities. Multi-scale Mel reconstruction loss is incorporated into adversarial training, further enhancing the model's reconstruction capabilities at different scales, making the generated audio closer to the real signal. The HiFi-SR model has strong versatility, capable of upsampling any input speech signal above 4kHz to 48kHz, and exhibits excellent performance in both internal and external test scenarios, significantly surpassing existing speech super-resolution methods.
[0146] This application significantly improves speech quality and clarity by enhancing speech super-resolution capabilities, demonstrating outstanding performance in cross-domain application scenarios. This will bring significant commercial value to fields such as speech enhancement, speech transmission, and communication, improving user experience and meeting the demand for high-fidelity speech. This application effectively solves the problems of inconsistency and poor speech quality in traditional speech super-resolution by proposing a unified network structure, achieving higher quality and more stable speech reconstruction. Specifically, this application proposes a unified transformer-convolutional network that converts low-resolution Mel spectrograms into high-resolution waveforms, enabling end-to-end adversarial training.
[0147] According to an embodiment of this application, a method for training a speech processing model is also provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here. Figure 4 is a flowchart illustrating a training method for a speech processing model according to an embodiment of this application. As shown in Figure 4, the method includes:
[0148] Step S402: Obtain a sample initial speech signal;
[0149] Step S404: Perform feature enhancement on the first sample spectrogram of the sample initial speech signal based on a preset dimension to obtain the sample enhanced spectral features of the sample initial speech signal;
[0150] Step S406: Input the sample enhanced spectral features into the initial speech processing model, and use the initial speech processing model to enhance the sample initial speech signal to obtain the sample target speech signal;
[0151] Step S408: Construct a target sample loss function based on the sample initial speech signal and the sample target speech signal;
[0152] Step S410: Update the model parameters of the initial speech processing model using the target sample loss function to obtain the speech processing model.
[0153] During the training phase, a target loss function is constructed to compare the target speech signal with a high-resolution version of the initial speech signal. This loss is then used to guide the update of model parameters to optimize the model's generation capabilities. This process ensures that the model can learn effective strategies for low-resolution to high-resolution conversion. Furthermore, the introduction of adversarial training and multi-scale loss improves the detail fidelity and overall quality of the generated speech signal.
[0154] Through the above steps, an initial speech signal is obtained; the first sample spectrogram of the initial speech signal is enhanced based on a preset dimension to obtain the enhanced spectral features of the initial speech signal; the enhanced spectral features are input into the initial speech processing model, and the initial speech processing model is used to enhance the features of the initial speech signal to obtain the target speech signal; a target sample loss function is constructed based on the initial speech signal and the target speech signal; the model parameters of the initial speech processing model are updated using the target sample loss function to obtain the speech processing model, which realizes the super-resolution conversion from low-resolution speech signal to high-resolution speech signal. By enhancing the features of the low-resolution first spectrogram, enhanced spectral features are obtained. The enhanced spectral features are input into the speech processing model, and the speech processing model can reconstruct the initial speech signal based on the enhanced spectral features at different scales, significantly improving the sound quality, clarity, and fidelity of the speech signal, realizing the fine detail recovery and more accurate audio representation of the high-resolution speech signal, and thus solving the technical problem of poor speech signal processing effect in related technologies.
[0155] In the above embodiments of this application, constructing a target sample loss function based on the initial sample speech signal and the target sample speech signal includes: obtaining a second sample spectrogram of the target sample speech signal; constructing a first sample loss function based on the target sample speech signal and the initial sample speech signal; constructing a second sample loss function based on the first sample spectrogram and the second sample spectrogram; and constructing the target sample loss function based on the first sample loss function and the second sample loss function.
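The way the target sample loss is assembled from a waveform-level term and a spectrogram-level term can be sketched as follows. This is an illustration only: mean absolute error is used as a stand-in for both terms, the two waveforms are assumed to be time-aligned and of equal length, and the weights `w_wave` and `w_spec` are hypothetical. In adversarial training, the waveform-level term would typically come from the discriminators instead.

```python
def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def flatten(spectrogram):
    """Flatten a [frames][bins] spectrogram into one list."""
    return [value for frame in spectrogram for value in frame]

def target_sample_loss(target_wave, reference_wave, first_spec, second_spec,
                       w_wave=1.0, w_spec=1.0):
    first_loss = mean_abs_error(target_wave, reference_wave)                 # waveform term
    second_loss = mean_abs_error(flatten(first_spec), flatten(second_spec))  # spectrogram term
    return w_wave * first_loss + w_spec * second_loss

loss = target_sample_loss([1.0, 2.0], [1.0, 4.0], [[0.0, 1.0]], [[1.0, 1.0]])
```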
[0156] According to an embodiment of this application, a speech signal processing method is also provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here. Figure 5 is a flowchart of a speech signal processing method according to an embodiment of this application. As shown in Figure 5, the method includes:
[0157] Step S502: Obtain the initial voice signal by calling the first interface;
[0158] The first interface includes a first parameter, the value of which includes the initial voice signal.
[0159] The aforementioned first interface can be an interface for data interaction between the cloud server and the client. It can pass the initial voice signal into the interface function as the first parameter of the interface function, thereby achieving the purpose of uploading the initial voice signal to the cloud server.
[0160] Step S504: Perform feature enhancement on the first spectrogram of the initial speech signal based on a preset dimension to obtain the enhanced spectral features of the initial speech signal;
[0161] Step S506: Input the enhanced spectral features into the speech processing model, and use the speech processing model to enhance the initial speech signal to obtain the target speech signal;
[0162] The speech processing model is used to enhance the signal at different scales, with the target speech signal having a higher resolution than the initial speech signal.
[0163] Step S508: Output the target voice signal by calling the second interface.
[0164] The second interface includes a second parameter, the value of which includes the target speech signal.
[0165] The aforementioned second interface can be an interface for data interaction between the cloud server and the client. The cloud server can pass the target voice signal into the interface function as the second parameter of the interface function to achieve the purpose of sending the target voice signal to the client.
[0166] Through the above steps, an initial speech signal is obtained by calling a first interface, where the first interface includes a first parameter whose value includes the initial speech signal; feature enhancement is performed on the first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; the enhanced spectral features are input into a speech processing model, which is used to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement at different scales, and the resolution of the target speech signal is greater than that of the initial speech signal; the target speech signal is output by calling a second interface, where the second interface includes a second parameter whose value includes the target speech signal, thus realizing super-resolution conversion from low-resolution speech signal to high-resolution speech signal. By enhancing the features of the low-resolution first spectrogram to obtain enhanced spectral features, and inputting the enhanced spectral features into the speech processing model, the speech processing model can reconstruct the initial speech signal at different scales based on the enhanced spectral features, significantly improving the sound quality, clarity, and fidelity of the speech signal, achieving fine detail recovery and more accurate audio representation of the high-resolution speech signal, thereby solving the technical problem of poor speech signal processing effect in related technologies.
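The interface-based flow of steps S502 to S508 can be sketched as follows. All names here are hypothetical: `first_interface` and `second_interface` stand for the two interface functions, and the feature enhancement and the speech processing model are passed in as plain callables so that the sketch stays self-contained.

```python
def first_interface(initial_voice_signal):
    """Hypothetical upload interface: the first parameter carries the initial signal."""
    return {"initial_voice_signal": list(initial_voice_signal)}

def second_interface(target_voice_signal):
    """Hypothetical output interface: the second parameter carries the target signal."""
    return {"target_voice_signal": list(target_voice_signal)}

def handle_request(initial_voice_signal, enhance_features, speech_model):
    request = first_interface(initial_voice_signal)                # S502
    features = enhance_features(request["initial_voice_signal"])   # S504
    target = speech_model(features)                                # S506
    return second_interface(target)                                # S508

# Stand-ins: identity "enhancement" and a model that repeats every sample twice.
response = handle_request(
    [1, 2],
    enhance_features=lambda signal: signal,
    speech_model=lambda feats: [v for v in feats for _ in range(2)],
)
```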
[0167] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.
[0168] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.
[0169] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, they can also be implemented by hardware. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods of the various embodiments of this application.
[0170] According to an embodiment of this application, a speech signal processing apparatus for implementing the above-described speech signal processing method is also provided. Figure 6 is a schematic diagram of a speech signal processing device according to an embodiment of this application. As shown in Figure 6, the device 600 includes: an acquisition module 602, an enhancement module 604, and a processing module 606.
[0171] The acquisition module is used to acquire the initial speech signal; the enhancement module is used to enhance the first spectrogram of the initial speech signal based on a preset dimension to obtain the enhanced spectral features of the initial speech signal; the processing module is used to input the enhanced spectral features into the speech processing model and use the speech processing model to enhance the initial speech signal to obtain the target speech signal. The speech processing model is used to achieve signal enhancement in different scale modes, and the resolution of the target speech signal is greater than the resolution of the initial speech signal.
[0172] It should be noted that the acquisition module 602, enhancement module 604, and processing module 606 mentioned above correspond to steps S202 to S206 in the above embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also be part of the device and run in the server 10 provided in the above embodiments.
[0173] In the above embodiments of this application, the speech processing model includes a transposed matrix layer and multiple residual blocks. The inputs of the multiple residual blocks are connected to the transposed matrix layer. The enhancement module is further used to upsample the enhanced spectral features in the time dimension using the transposed matrix layer to obtain a time-domain speech signal with a preset resolution. The waveform features of the time-domain speech signal are enhanced using the multiple residual blocks in different scale modes to obtain multiple enhanced signal features. The convolution kernel size and dilation rate of the multiple residual blocks are different. The target speech signal is obtained based on the multiple enhanced signal features.
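The data flow described above (upsampling in the time dimension, then several residual blocks with different kernel sizes and dilation rates that all take the upsampled signal as input) can be illustrated with a minimal pure-Python sketch. It is not the implementation of this application: the function names, the fixed kernel weights, and the averaging of the block outputs are assumptions chosen for illustration, and a trained model would learn its weights.

```python
from typing import List, Sequence, Tuple


def transposed_upsample(x: Sequence[float], kernel: Sequence[float], stride: int) -> List[float]:
    """1-D transposed convolution: each input sample scatters a scaled copy of the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


def dilated_conv_same(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Same-length 1-D dilated convolution with zero padding (odd kernel sizes)."""
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - half + k * dilation
            if 0 <= idx < len(x):
                acc += weight * x[idx]
        out.append(acc)
    return out


def residual_block(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Residual connection: the input plus a dilated-convolution branch."""
    branch = dilated_conv_same(x, kernel, dilation)
    return [a + b for a, b in zip(x, branch)]


def multi_scale_enhance(
    x: Sequence[float], configs: Sequence[Tuple[Sequence[float], int]]
) -> List[float]:
    """Feed the same input to every block and average the results (averaging is an assumption)."""
    outputs = [residual_block(x, kernel, dilation) for kernel, dilation in configs]
    return [sum(values) / len(outputs) for values in zip(*outputs)]


# Hypothetical usage: three feature frames are upsampled by a factor of two,
# then refined by blocks that differ in kernel size and dilation rate.
features = [1.0, 2.0, 3.0]
waveform = transposed_upsample(features, kernel=[1.0, 1.0], stride=2)
configs = [([0.25, 0.5, 0.25], 1), ([0.25, 0.5, 0.25], 3), ([0.2] * 5, 1)]
enhanced = multi_scale_enhance(waveform, configs)
```

Blocks with a larger kernel or dilation rate cover a wider span of the waveform, which is one way to read the "different scale modes" mentioned above.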
[0174] In the above embodiments of this application, the device further includes: a construction module and an update module.
[0175] The construction module is used to construct a target loss function based on the target speech signal and the initial speech signal; the update module is used to update the model parameters of the speech processing model using the target loss function.
[0176] In the above embodiments of this application, the construction module is further configured to obtain a second spectrogram of the target speech signal; construct a first loss function based on the target speech signal and the initial speech signal; construct a second loss function based on the first spectrogram and the second spectrogram; and construct the target loss function based on the first loss function and the second loss function.
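As a rough illustration of how the two loss terms might be combined, the following sketch computes a spectrogram distance and adds it to a first loss value. The mean absolute difference, the weighting factor `spectral_weight`, and the function names are assumptions; the application only states that the target loss function is constructed from the first and second loss functions.

```python
from typing import Sequence


def spectrogram_l1(spec_a: Sequence[Sequence[float]], spec_b: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between two equally shaped spectrograms (lists of frames)."""
    if len(spec_a) != len(spec_b):
        raise ValueError("spectrograms must have the same number of frames")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(spec_a, spec_b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same number of frequency bins")
        for a, b in zip(frame_a, frame_b):
            total += abs(a - b)
            count += 1
    return total / count


def target_loss(first_loss: float, second_loss: float, spectral_weight: float = 1.0) -> float:
    """Weighted sum of the first loss and the spectrogram loss (the weight is hypothetical)."""
    return first_loss + spectral_weight * second_loss


# Hypothetical values: two frames with two frequency bins each.
first_spectrogram = [[1.0, 2.0], [3.0, 4.0]]
second_spectrogram = [[1.0, 1.0], [1.0, 4.0]]
second = spectrogram_l1(first_spectrogram, second_spectrogram)
total = target_loss(first_loss=0.5, second_loss=second, spectral_weight=2.0)
```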
[0177] In the above embodiments of this application, the construction module is further configured to construct scale loss functions for the initial speech signal and the target speech signal at different scales using a multi-scale discriminator; construct periodic loss functions for the initial speech signal and the target speech signal at different periods using a multi-period discriminator; construct time-frequency loss functions for the initial speech signal and the target speech signal at different time frequencies using a time-frequency discriminator; and construct a first loss function based on the scale loss function, the periodic loss function, and the time-frequency loss function.
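The combination of the three discriminator-based terms can be sketched as follows. The least-squares form of each term and the plain sum are assumptions borrowed from common adversarial training practice, not something this application specifies. The score lists stand in for the outputs that each discriminator family would produce for the target speech signal.

```python
from typing import Sequence


def least_squares_term(scores: Sequence[float]) -> float:
    """Least-squares generator objective: push every discriminator score toward 1."""
    return sum((1.0 - s) ** 2 for s in scores) / len(scores)


def first_loss(
    scale_scores: Sequence[float],
    period_scores: Sequence[float],
    time_frequency_scores: Sequence[float],
) -> float:
    """Sum of the scale, periodic, and time-frequency loss terms (equal weights assumed)."""
    scale_loss = least_squares_term(scale_scores)
    periodic_loss = least_squares_term(period_scores)
    time_frequency_loss = least_squares_term(time_frequency_scores)
    return scale_loss + periodic_loss + time_frequency_loss


# Hypothetical scores: one entry per scale, period, or time-frequency resolution.
loss = first_loss(
    scale_scores=[1.0],
    period_scores=[0.5],
    time_frequency_scores=[0.0],
)
```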
[0178] In the above embodiments of this application, the enhancement module is further configured to perform dimensional transformation on the first spectrogram to obtain multiple initial features of the first spectrogram in a preset dimension; and to perform feature enhancement on the multiple initial features based on the correlation between the multiple initial features and the feature enhancement time series to obtain enhanced features, wherein the feature enhancement time series is used to represent the number of loops of feature enhancement performed on the multiple initial features.
[0179] In the above embodiments of this application, the enhancement module is further used to perform global enhancement on multiple initial features based on the association relationship to obtain global enhanced features; to perform local enhancement on the global enhanced features based on the time window of the feature enhancement time series to obtain local enhanced features; and to obtain enhanced features based on the global enhanced features and the local enhanced features.
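One possible reading of the global-then-local enhancement is sketched below. The softmax-weighted average over all frames (for the global step), the moving average inside a centred time window (for the local step), and the element-wise sum that merges the two are all assumptions made for illustration. The application does not fix these operators.

```python
import math
from typing import List, Sequence

Frames = Sequence[Sequence[float]]


def global_enhance(frames: Frames) -> List[List[float]]:
    """Replace each frame by a similarity-weighted average of all frames."""
    enhanced = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) for key in frames]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        enhanced.append(
            [sum(w * f[d] for w, f in zip(weights, frames)) / norm for d in range(len(query))]
        )
    return enhanced


def local_enhance(frames: Frames, window: int) -> List[List[float]]:
    """Average each frame with its neighbours inside a centred time window."""
    half = window // 2
    out = []
    for t in range(len(frames)):
        neighbours = frames[max(0, t - half): min(len(frames), t + half + 1)]
        out.append(
            [sum(f[d] for f in neighbours) / len(neighbours) for d in range(len(frames[t]))]
        )
    return out


def enhance(frames: Frames, window: int = 3) -> List[List[float]]:
    """Global step, then local step on the global result, then an element-wise sum."""
    global_features = global_enhance(frames)
    local_features = local_enhance(global_features, window)
    return [[g + l for g, l in zip(gf, lf)] for gf, lf in zip(global_features, local_features)]


# Hypothetical input: three frames with two feature dimensions each.
enhanced = enhance([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], window=3)
```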
[0180] In the above embodiments of this application, the device further includes: an acquisition module, a sampling module, and an extraction module.
[0181] The acquisition module is used to acquire the sampling frequency of the initial speech signal; the sampling module is used to sample the initial speech signal based on the sampling frequency and the preset sampling frequency to obtain the sampled speech signal; and the extraction module is used to extract the spectral signal of the sampled speech signal to obtain the first spectrogram.
[0182] In the above embodiments of this application, the sampling module is further configured to upsample the initial speech signal to obtain a sampled speech signal with a preset sampling frequency in response to the sampling frequency being less than the preset sampling frequency; and to determine the initial speech signal as a sampled speech signal in response to the sampling frequency being greater than or equal to the preset sampling frequency.
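The sampling decision described above can be sketched as a small function. Linear interpolation is used here only as a simple stand-in for the upsampling step, since the application does not name a resampling method. The function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def upsample_linear(signal: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Upsample by linear interpolation between neighbouring samples."""
    n_out = int(len(signal) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(signal) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(signal[left] * (1.0 - frac) + signal[right] * frac)
    return out


def prepare_signal(
    signal: Sequence[float], sampling_rate: int, preset_rate: int
) -> Tuple[List[float], int]:
    """Upsample only when the sampling frequency is below the preset sampling frequency."""
    if sampling_rate < preset_rate:
        return upsample_linear(signal, sampling_rate, preset_rate), preset_rate
    # Greater than or equal to the preset frequency: use the initial signal as it is.
    return list(signal), sampling_rate


# Hypothetical usage: an 8 kHz signal is brought to a 16 kHz preset frequency,
# while a 48 kHz signal passes through unchanged.
low, low_rate = prepare_signal([0.0, 2.0], sampling_rate=8000, preset_rate=16000)
high, high_rate = prepare_signal([1.0, 2.0, 3.0], sampling_rate=48000, preset_rate=16000)
```

The spectrogram extraction that follows would then operate on the returned signal in both cases.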
[0183] According to an embodiment of this application, a training apparatus for a speech processing model for implementing the training method of the above-described speech processing model is also provided. Figure 7 is a schematic diagram of a training device for a speech processing model according to an embodiment of this application. As shown in Figure 7, the device 700 includes: an acquisition module 702, an enhancement module 704, a processing module 706, a construction module 708, and an update module 710.
[0184] The system comprises the following modules: an acquisition module for acquiring initial speech signals; an enhancement module for enhancing the first sample spectrogram of the initial speech signals based on a preset dimension to obtain enhanced spectral features of the initial speech signals; a processing module for inputting the enhanced spectral features into an initial speech processing model to enhance the initial speech signals and obtain target speech signals; a construction module for constructing a target sample loss function based on the initial speech signals and target speech signals; and an update module for updating the model parameters of the initial speech processing model using the target sample loss function to obtain a speech processing model.
[0185] It should be noted that the acquisition module 702, enhancement module 704, processing module 706, construction module 708, and update module 710 mentioned above correspond to steps S402 to S410 in the above embodiments. The five modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also run as part of the device in the server 10 provided in the above embodiments.
[0186] In the above embodiments of this application, the construction module is further configured to obtain a second sample spectrogram of the target speech signal; construct a first sample loss function based on the target speech signal and the initial speech signal; construct a second sample loss function based on the first sample spectrogram and the second sample spectrogram; and construct the target sample loss function based on the first sample loss function and the second sample loss function.
[0187] According to an embodiment of this application, a speech signal processing apparatus for implementing the above-described speech signal processing method is also provided. Figure 8 is a schematic diagram of a speech signal processing device according to an embodiment of this application. As shown in Figure 8, the device 800 includes: an acquisition module 802, an enhancement module 804, a processing module 806, and an output module 808.
[0188] The acquisition module is used to acquire an initial speech signal by calling a first interface, wherein the first interface includes a first parameter, and the parameter value of the first parameter includes the initial speech signal; the enhancement module is used to perform feature enhancement on the first spectrogram of the initial speech signal based on a preset dimension to obtain the enhanced spectral features of the initial speech signal; the processing module is used to input the enhanced spectral features into a speech processing model, and use the speech processing model to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement in different scale modes, and the resolution of the target speech signal is greater than the resolution of the initial speech signal; the output module is used to output the target speech signal by calling a second interface, wherein the second interface includes a second parameter, and the parameter value of the second parameter includes the target speech signal.
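The interface-based flow can be pictured with the following sketch, in which the first interface carries the initial speech signal as its parameter value and the second interface carries the target speech signal. All class and function names are hypothetical, and the two callables are trivial stand-ins for the feature enhancement step and the speech processing model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FirstInterfaceCall:
    """Input call: the first parameter holds the initial speech signal."""
    initial_speech: List[float]


@dataclass
class SecondInterfaceCall:
    """Output call: the second parameter holds the target speech signal."""
    target_speech: List[float]


def handle_call(
    call: FirstInterfaceCall,
    enhance_features: Callable[[List[float]], List[float]],
    model: Callable[[List[float]], List[float]],
) -> SecondInterfaceCall:
    """Acquire the signal, enhance its features, run the model, and wrap the output."""
    features = enhance_features(call.initial_speech)
    target = model(features)
    return SecondInterfaceCall(target_speech=target)


def identity_features(signal: List[float]) -> List[float]:
    """Stand-in for spectrogram feature enhancement: returns the input unchanged."""
    return list(signal)


def repeat_model(features: List[float]) -> List[float]:
    """Stand-in for the model: doubles the number of samples by repetition."""
    return [value for value in features for _ in range(2)]


result = handle_call(FirstInterfaceCall([1.0, 2.0]), identity_features, repeat_model)
```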
[0189] It should be noted that the preferred embodiments involved in the above embodiments of this application are the same as the solutions, application scenarios and implementation processes provided in the above embodiments, but are not limited to the solutions provided in the above embodiments.
[0190] Embodiments of this application may provide a computing device, which may be any one of a group of computing devices. Optionally, in this embodiment, the computing device may also be replaced by a terminal device such as a mobile terminal.
[0191] Optionally, in this embodiment, the computing device described above may be located in at least one of a plurality of network devices in a computer network.
[0192] In this embodiment, the computer terminal described above can execute the program code in the method.
[0193] Optionally, Figure 9 is a structural block diagram of a computing device according to an embodiment of this application. As shown in Figure 9, the computing device A may include: one or more (only one is shown in the figure) processors 102, a memory 104, a memory controller, and peripheral interfaces, wherein the peripheral interfaces are connected to a radio frequency module, an audio module, and a display.
[0194] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this application. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to terminal A via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0195] The processor can call the information and application program stored in the memory through the transmission device to perform the following steps: acquire the initial speech signal; perform feature enhancement on the first spectrogram of the initial speech signal based on a preset dimension to obtain the enhanced spectral features of the initial speech signal; input the enhanced spectral features into the speech processing model, and use the speech processing model to enhance the initial speech signal to obtain the target speech signal, wherein the speech processing model is used to achieve signal enhancement in different scale modes, and the resolution of the target speech signal is greater than the resolution of the initial speech signal.
[0196] Embodiments of this application may provide a computing device. Figure 9 is a structural block diagram of a computing device according to an embodiment of this application. As shown in Figure 9, the computing device 100 may include one or more (only one is shown in the figure) processors 102, a memory 104, a memory controller, and peripheral interfaces.
[0197] The aforementioned computing device can be understood as an integrated smart terminal, including but not limited to servers, desktop computers, PCs (Personal Computers), all-in-one model machines, etc., and the computing device may have the model described in the above embodiments of this application pre-installed.
[0198] Specifically, this computing device can pre-install various types of models, including but not limited to models in natural language processing, visual processing, speech processing, code processing, and multimodal task processing, thus providing diverse model selection. In different product forms, this computing device can support one or more model usage methods, including but not limited to model training, model invocation, model fine-tuning, model deployment, model inference, and application. In some product forms, this computing device also supports model management, including but not limited to multi-type model management (supporting the management of discriminative, generative, and other model types), model version control (supporting the control of different model versions), and model evaluation (evaluating model performance and effectiveness based on model evaluation tools). In other product forms, this computing device can also create applications based on models, providing API calling capabilities, allowing models to be called into created applications through API interfaces, and providing application management tools for application management and monitoring.
[0199] Furthermore, the computing device may also include data management (supporting the creation and management of model tuning datasets), a training center (providing abundant training resources to help users learn and master AI technology), and basic control capabilities (providing enterprise-level basic control capabilities to ensure the security and efficient operation of the system). Through the above functions, it provides a comprehensive and integrated device for AI development, training, deployment, and application.
[0200] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.
[0201] Embodiments of this application also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store program code executed by the method provided in the above embodiments.
[0202] Optionally, in this embodiment, the storage medium may be located in any computing device in a group of computing devices in a computer network, or in any mobile terminal in a group of mobile terminals.
[0203] Optionally, in this embodiment, the computer-readable storage medium is configured to store program code for performing the following steps: acquiring an initial speech signal; performing feature enhancement on a first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; inputting the enhanced spectral features into a speech processing model, and using the speech processing model to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement in different scale modes, and the resolution of the target speech signal is greater than the resolution of the initial speech signal.
[0204] Embodiments of this application may provide an electronic device. Figure 10 is a structural block diagram of an electronic device according to an embodiment of this application. As shown in Figure 10, the electronic device may include: an input / output device 112; a memory 114; and a processor 116, wherein the processor 116 is connected to the input / output device 112 and the memory 114 via a bus 118.
[0205] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this application. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to terminal A via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0206] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.
[0207] Embodiments of this application also provide a computer program product. Optionally, in this embodiment, the computer program product may include a computer program that, when executed by a processor, implements the methods provided in the embodiments described above.
[0208] Embodiments of this application also provide a computer program product. Optionally, the computer program product may include a non-volatile computer-readable storage medium, which can be used to store a computer program that, when executed by a processor, implements the method provided in the above embodiments.
[0209] Embodiments of this application also provide a computer program. Optionally, in this embodiment, when the computer program is executed by a processor, it implements the method provided in the above embodiments.
[0210] In the above embodiments of this application, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0211] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.
[0212] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0213] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0214] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.
[0215] The above description is only a preferred embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications should also be considered within the scope of protection of this application.
Claims
1. A speech signal processing method, characterized in that it comprises: acquiring an initial speech signal; performing feature enhancement on a first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; and inputting the enhanced spectral features into a speech processing model and using the speech processing model to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement at different scales, and the resolution of the target speech signal is greater than that of the initial speech signal.
2. The speech signal processing method according to claim 1, characterized in that the speech processing model includes a transposed matrix layer and multiple residual blocks, the inputs of the multiple residual blocks being connected to the transposed matrix layer, and using the speech processing model to enhance the initial speech signal to obtain the target speech signal includes: upsampling the enhanced spectral features in the time dimension using the transposed matrix layer to obtain a time-domain speech signal with a preset resolution; enhancing the waveform features of the time-domain speech signal using the multiple residual blocks in different scale modes to obtain multiple enhanced signal features, wherein the convolution kernel sizes and dilation rates of the multiple residual blocks are different; and obtaining the target speech signal based on the multiple enhanced signal features.
3. The speech signal processing method according to claim 2, characterized in that the method further includes: constructing a target loss function based on the target speech signal and the initial speech signal; and updating the model parameters of the speech processing model using the target loss function.
4. The speech signal processing method according to claim 3, characterized in that constructing the target loss function based on the target speech signal and the initial speech signal includes: obtaining a second spectrogram of the target speech signal; constructing a first loss function based on the target speech signal and the initial speech signal; constructing a second loss function based on the first spectrogram and the second spectrogram; and constructing the target loss function based on the first loss function and the second loss function.
5. The speech signal processing method according to claim 4, characterized in that constructing the first loss function based on the target speech signal and the initial speech signal includes: constructing, using a multi-scale discriminator, scale loss functions for the initial speech signal and the target speech signal at different scales; constructing, using a multi-period discriminator, periodic loss functions for the initial speech signal and the target speech signal at different periods; constructing, using a time-frequency discriminator, time-frequency loss functions for the initial speech signal and the target speech signal at different time frequencies; and constructing the first loss function based on the scale loss functions, the periodic loss functions, and the time-frequency loss functions.
6. The speech signal processing method according to any one of claims 1 to 5, characterized in that performing feature enhancement on the first spectrogram of the initial speech signal based on the preset dimension to obtain enhanced features of the initial speech signal includes: performing dimensional transformation on the first spectrogram to obtain multiple initial features of the first spectrogram in the preset dimension; and performing feature enhancement on the multiple initial features based on the correlation between the multiple initial features and the feature enhancement time series to obtain the enhanced features, wherein the feature enhancement time series is used to represent the number of iterations of feature enhancement performed on the multiple initial features.
7. The speech signal processing method according to claim 6, characterized in that performing feature enhancement on the multiple initial features based on the correlation between the multiple initial features and the feature enhancement time series to obtain the enhanced features includes: performing global enhancement on the multiple initial features based on the correlation to obtain global enhanced features; performing local enhancement on the global enhanced features based on the time window of the feature enhancement time series to obtain local enhanced features; and obtaining the enhanced features based on the global enhanced features and the local enhanced features.
8. The speech signal processing method according to any one of claims 1 to 5, characterized in that the method further includes: obtaining the sampling frequency of the initial speech signal; sampling the initial speech signal based on the sampling frequency and a preset sampling frequency to obtain a sampled speech signal; and extracting the spectral signal of the sampled speech signal to obtain the first spectrogram.
9. The speech signal processing method according to claim 8, characterized in that sampling the initial speech signal based on the sampling frequency and the preset sampling frequency to obtain the sampled speech signal includes: in response to the sampling frequency being less than the preset sampling frequency, upsampling the initial speech signal to obtain the sampled speech signal at the preset sampling frequency; and in response to the sampling frequency being greater than or equal to the preset sampling frequency, determining the initial speech signal as the sampled speech signal.
10. A training method for a speech processing model, characterized in that it comprises: acquiring a sample initial speech signal; performing feature enhancement on a first sample spectrogram of the sample initial speech signal based on a preset dimension to obtain sample enhanced spectral features of the sample initial speech signal; inputting the sample enhanced spectral features into an initial speech processing model and using the initial speech processing model to perform feature enhancement on the sample initial speech signal to obtain a sample target speech signal; constructing a target sample loss function based on the sample initial speech signal and the sample target speech signal; and updating the model parameters of the initial speech processing model using the target sample loss function to obtain the speech processing model.
11. The training method for the speech processing model according to claim 10, characterized in that constructing the target sample loss function based on the sample initial speech signal and the sample target speech signal includes: obtaining a second sample spectrogram of the sample target speech signal; constructing a first sample loss function based on the sample target speech signal and the sample initial speech signal; constructing a second sample loss function based on the first sample spectrogram and the second sample spectrogram; and constructing the target sample loss function based on the first sample loss function and the second sample loss function.
12. A speech signal processing method, characterized in that it comprises: acquiring an initial speech signal by calling a first interface, wherein the first interface includes a first parameter, and the parameter value of the first parameter includes the initial speech signal; performing feature enhancement on a first spectrogram of the initial speech signal based on a preset dimension to obtain enhanced spectral features of the initial speech signal; inputting the enhanced spectral features into a speech processing model and using the speech processing model to enhance the initial speech signal to obtain a target speech signal, wherein the speech processing model is used to achieve signal enhancement at different scales, and the resolution of the target speech signal is greater than that of the initial speech signal; and outputting the target speech signal by calling a second interface, wherein the second interface includes a second parameter, and the parameter value of the second parameter includes the target speech signal.
13. A computing device, characterized in that it comprises: a memory storing an executable program; and a processor for running the program, wherein the program, when run, performs the method according to any one of claims 1 to 12.
14. An electronic device, characterized in that it comprises: a memory storing an executable program; and a processor connected to the memory via a bus and used to run the program, wherein the program, when run, performs the method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored executable program, wherein the executable program, when executed, controls the device on which the computer-readable storage medium is located to perform the method according to any one of claims 1 to 12.
16. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 12.