A speech enhancement method and device based on vector quantization, equipment and medium
By employing a vector quantization-based speech enhancement method, and through multiple feature concatenation and decoding processes, the problem of unstable convergence in deep learning models is solved, achieving high-quality speech enhancement and accurate speech recognition results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-05-06
- Publication Date
- 2026-06-19
AI Technical Summary
Existing deep learning-based speech enhancement models suffer from unstable convergence during speech enhancement, resulting in low speech enhancement quality. In particular, they struggle to accurately interpret the meaning of response speech data in low signal-to-noise ratio environments.
A speech enhancement method based on vector quantization is adopted. By acquiring the original speech features, performing context feature extraction and discretization, and using a vector quantizer and decoder to perform multiple feature concatenation and decoding until N vector quantizations are completed, high-quality reconstructed speech features are obtained.
It improves the quality of speech enhancement and enhances the accuracy of speech recognition, especially in low signal-to-noise ratio environments, enabling better understanding of speech meaning.
Figure CN116486824B_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a speech enhancement method, apparatus, device, and medium based on vector quantization.
Background Technology
[0002] Telephone sales are a common promotional method used by financial institutions such as banks, securities firms, and insurance companies when promoting financial products or other services. An intelligent outbound calling system plays pre-recorded voice messages to users by computer and is an indispensable component of integrated computer-telephone customer service center systems. Such systems are widely used in the industry; repetitive notifications can be broadcast automatically, reducing wasted manpower. In intelligent outbound calling systems, timely and accurate intent recognition of customer responses is crucial. Only by accurately identifying customer intent can targeted responses be provided, such as transferring to a live agent, handling informational questions, or modifying personal information. Currently, intelligent outbound calling systems often suffer from low intent recognition accuracy due to interference from environmental noise, background noise, reverberation, and other external factors. Especially when the signal-to-noise ratio is low, it is often difficult to accurately understand the meaning of the response voice data. High-quality voice is typically obtained through speech enhancement, which aims to improve perceived speech quality and intelligibility by removing background noise from the noisy input signal. In the existing technology, deep learning-based methods have made great progress in speech enhancement tasks. However, because speech enhancement models obtained through deep learning converge unstably, the quality of speech enhancement is low. Therefore, how to improve the quality of speech enhancement is an urgent problem to be solved.
Summary of the Invention
[0003] Therefore, it is necessary to provide a speech enhancement method, apparatus, device, and medium based on vector quantization to address the aforementioned technical problems and solve the issue of low speech enhancement quality during the speech enhancement process.
[0004] A first aspect of this application provides a speech enhancement method based on vector quantization, the speech enhancement method comprising:
[0005] The original speech is acquired, and features are extracted from the original speech to obtain a first speech feature. Context features are extracted from the first speech feature to obtain a second speech feature with context representation.
[0006] The second speech feature is discretized based on vector quantization to obtain discretized speech features, and the discretized features are decoded to obtain reconstructed speech features.
[0007] The reconstructed speech features are concatenated with the first speech features to obtain a concatenated feature result. The concatenated feature result is used as the second speech feature, and the method returns to the step of discretizing the second speech feature based on vector quantization to obtain the discretized speech feature. This process is repeated until N vector quantizations are completed, and the reconstructed speech feature corresponding to the Nth vector quantization is obtained, where N is an integer greater than 1.
[0008] The reconstructed speech features corresponding to the Nth vector quantization are decoded to obtain reconstructed speech, which serves as the enhancement result of the original speech.
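The loop described in the first aspect can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: `make_quantizer`, `iterate_vq`, the toy codebooks, and the stand-in decoder `toy_decode`. Features are treated as a single vector of plain Python floats, and one quantizer is supplied per round because the concatenated feature is wider than the initial second feature. The final decoding of the Nth reconstructed feature into a waveform is not shown.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def make_quantizer(codebook: Sequence[Vector]) -> Callable[[Vector], Vector]:
    """Build a quantizer that snaps a vector to its nearest codebook entry (squared L2)."""
    def quantize(vec: Vector) -> Vector:
        return list(min(codebook,
                        key=lambda code: sum((a - b) ** 2 for a, b in zip(vec, code))))
    return quantize


def iterate_vq(first: Vector, second: Vector,
               quantizers: Sequence[Callable[[Vector], Vector]],
               decode: Callable[[Vector], Vector]) -> Vector:
    """Run N = len(quantizers) rounds of quantize -> decode -> concatenate."""
    if len(quantizers) < 2:
        raise ValueError("N must be an integer greater than 1")
    current = second
    reconstructed: Vector = []
    for round_index, quantize in enumerate(quantizers):
        discrete = quantize(current)          # discretized speech feature
        reconstructed = decode(discrete)      # reconstructed speech feature
        if round_index < len(quantizers) - 1:
            # The concatenation result becomes the next round's second feature.
            current = reconstructed + first
    return reconstructed


# Toy data (assumed values, for illustration only).
first_feature = [0.2, -0.1]
second_feature = [0.9, 0.1]
quantizers = [
    make_quantizer([[1.0, 0.0], [0.0, 1.0]]),                      # round 1: width 2
    make_quantizer([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]),  # round 2: width 4
]
toy_decode = lambda vec: vec[:2]  # stand-in for a trained decoder
result = iterate_vq(first_feature, second_feature, quantizers, toy_decode)
```

With N = 2, the first round quantizes the second feature and the second round quantizes the concatenation of the reconstructed and first features; `result` holds the reconstructed feature of the last round.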
[0009] A second aspect of this application provides a speech enhancement device based on vector quantization, the speech enhancement device comprising:
[0010] The acquisition module is used to acquire the original speech, extract features from the original speech to obtain a first speech feature, and extract context features from the first speech feature to obtain a second speech feature with context representation.
[0011] The reconstruction module is used to discretize the second speech features based on vector quantization to obtain discretized speech features, and to decode the discretized features to obtain reconstructed speech features.
[0012] The concatenation module is used to concatenate the reconstructed speech features with the first speech features to obtain a concatenation result. The concatenation result is used as the second speech feature. The module then returns to the step of performing discretization processing on the second speech features based on vector quantization to obtain discretized speech features. This process continues until N vector quantizations are completed, resulting in the reconstructed speech features corresponding to the Nth vector quantization, where N is an integer greater than 1.
[0013] The decoding module is used to decode the reconstructed speech features corresponding to the Nth vector quantization to obtain reconstructed speech, which serves as the enhancement result of the original speech.
[0014] In a third aspect, embodiments of the present invention provide a computer device, the computer device including a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the speech enhancement method as described in the first aspect.
[0015] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the speech enhancement method as described in the first aspect.
[0016] The advantages of this invention compared to the prior art are:
[0017] The original speech is acquired, and features are extracted from it to obtain the first speech feature. Contextual features are extracted from the first speech feature to obtain the second speech feature with contextual representation. The second speech feature is discretized based on vector quantization to obtain the discretized speech feature, and the discretized feature is decoded to obtain the reconstructed speech feature. The reconstructed speech feature is concatenated with the first speech feature to obtain the concatenated feature result, the concatenated feature result is used as the second speech feature, and the method returns to the step of discretizing the second speech feature based on vector quantization, until N vector quantizations are completed and the reconstructed speech feature corresponding to the Nth vector quantization is obtained, where N is an integer greater than 1. The reconstructed speech feature corresponding to the Nth vector quantization is decoded to obtain reconstructed speech, which serves as the enhancement result of the original speech. In this invention, a vector quantizer is used to discretize the second speech features with contextual representation. After the discretized features are decoded and reconstructed by a decoder, the reconstructed speech features are fused with the features of the original speech. The vector quantizer is used again to discretize the fused features, and the decoder is used again to decode and reconstruct the discretized features. This process is repeated until N vector quantizations are completed, resulting in the reconstructed speech features corresponding to the Nth vector quantization. By performing discretization and reconstruction multiple times, both the discrete and the reconstructed features are fully utilized, thereby obtaining high-quality enhanced speech.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of an application environment for a speech enhancement method based on vector quantization provided in an embodiment of the present invention;
[0020] Figure 2 is a schematic flowchart of a speech enhancement method based on vector quantization provided in an embodiment of the present invention;
[0021] Figure 3 is a schematic diagram of a speech enhancement device based on vector quantization provided in an embodiment of the present invention;
[0022] Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of the present invention.
Detailed Implementation
[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
[0025] It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0026] As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as meaning "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."
[0027] Furthermore, in the description of this invention and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0028] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of the invention include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0029] The embodiments of this invention can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that utilize digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
[0030] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.
[0031] It should be understood that the sequence number of each step in the following embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0032] To illustrate the technical solution of the present invention, specific embodiments are described below.
[0033] An embodiment of the present invention provides a speech enhancement method based on vector quantization, which can be applied to an application environment such as that shown in Figure 1, in which the local device communicates with the server. The local device includes, but is not limited to, desktop computers, laptops, ultra-mobile personal computers (UMPCs), netbooks, and personal digital assistants (PDAs). The server can be implemented using a standalone server or a server cluster consisting of multiple servers.
[0034] In many applications, such as speech recognition, speaker recognition, and hearing aids, speech enhancement technology is often used as a front-end preprocessing module. However, speech signals are easily interfered with by external factors such as environmental noise, background noise, and reverberation. Especially when the signal-to-noise ratio is low, listeners often find it difficult to accurately understand the meaning of the speech. In this embodiment, speech enhancement is used to remove background noise contained in noisy input signals to improve the perceived quality and intelligibility of speech, thereby improving the accuracy of speech recognition.
[0035] Referring to Figure 2, which is a schematic flowchart of a speech enhancement method based on vector quantization provided in an embodiment of the present invention, the method can be applied to the server in Figure 1, where the server is connected to the corresponding local terminal. As shown in Figure 2, the speech enhancement method based on vector quantization may include the following steps.
[0036] S201: Obtain the original speech, extract features from the original speech to obtain the first speech feature, extract context features from the first speech feature to obtain the second speech feature with context representation.
[0037] In step S201, the original speech is obtained; it is noisy speech information to be enhanced. Features are extracted from it to obtain a first speech feature, which is a low-dimensional feature. Contextual features are then extracted from the first speech feature to obtain a second speech feature with contextual representation.
[0038] In this embodiment, in the insurance industry, when a policy is nearing its payment due date or reinstatement deadline, users need to be reminded to make payments. This reminder is provided by an intelligent outbound calling system. During a call with a customer using this system, with the customer's permission, the server acquires the customer's voice data corresponding to the target domain of the call content and uses this data as the original audio. The acquired voice data is then processed to facilitate the application of the voice enhancement method to the intelligent outbound calling system.
[0039] It should be noted that the original speech can be any type of speech information. For example, the original speech may be acquired from a video that includes background music, and the original speech recognized in the video can be speech information in any language such as Chinese, English, or French, speech information corresponding to male or female voices, or speech information of any age group. Of course, it can also be other types of sample speech information with different timbres or qualities. This example embodiment does not make any special limitations on this. The sending terminal can be any type of terminal device capable of accessing a communication network to realize voice calls. For example, the sending terminal can be a smartphone, tablet computer, or CPE (Customer Premise Equipment, any connection device used to access Ethernet or typically access services on a carrier network). This example embodiment does not make any special limitations on the type of sending terminal accessing the communication network.
[0040] It should be noted that when acquiring the original speech, the volume of the original audio can be customized. For example, it can be fixed at 80% or randomly varied. Optionally, the playback method of the original audio can be continuous or intermittent. This example embodiment does not impose any special limitations on the playback volume, playback method, or other parameters of the original audio on the sending terminal. By setting different playback volumes, playback methods, and other parameters of the original speech, the speaking environment of users in different environments can be effectively simulated, improving the quality of speech enhancement.
[0041] It should be noted that before extracting features from the original speech to obtain the first speech features, the original speech can be preprocessed. This preprocessing involves filtering the original speech using a frequency domain filtering model to extract speech features and generate corresponding feature data. The frequency domain filtering model is mainly implemented using frequency domain filters. The feature data is then pooled to obtain the preprocessed original speech.
[0042] Feature extraction is performed on the original speech to obtain the first speech feature. The encoder is used to extract features from the original speech. The encoder includes N coding layers. Each coding layer includes a one-dimensional convolutional layer, a ReLU activation function layer and a GELU activation function layer. The scale of the convolutional kernel in the one-dimensional convolutional layer is k, the stride is S and the number of channels is H.
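As an illustration of one coding layer, the following is a minimal sketch assuming a mono waveform, no padding, and the order convolution, then ReLU, then GELU (the order is an assumption based on how the layers are listed). The function names `conv1d`, `relu`, `gelu`, and `coding_layer` and all numeric values are hypothetical; the kernel size k, stride S, and channel count H appear as the kernel length, `stride`, and the number of kernels. Plain Python lists are used to keep the sketch dependency-free.

```python
import math
from typing import List


def conv1d(signal: List[float], kernels: List[List[float]], stride: int) -> List[List[float]]:
    """Unpadded 1-D convolution (cross-correlation); returns [channel][frame]."""
    k = len(kernels[0])
    n_frames = (len(signal) - k) // stride + 1
    return [[sum(w * signal[t * stride + j] for j, w in enumerate(kernel))
             for t in range(n_frames)]
            for kernel in kernels]


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def coding_layer(signal: List[float], kernels: List[List[float]], stride: int) -> List[List[float]]:
    """One coding layer: 1-D convolution followed by ReLU and then GELU."""
    return [[gelu(relu(v)) for v in channel] for channel in conv1d(signal, kernels, stride)]


# Toy input: k = 2, S = 2, H = 2 (assumed values, for illustration only).
waveform = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [[1.0, -1.0], [0.5, 0.5]]
features = coding_layer(waveform, kernels, stride=2)
```

Each of the H output channels has (len(waveform) - k) // S + 1 frames, so the layer shortens the sequence while widening it, which is how a low-dimensional first speech feature is produced from raw samples.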
[0043] To better utilize contextual information, positional codes are extracted from the original speech after convolutional processing. These positional codes represent the temporal position within the original speech, i.e., positional information in the time dimension. When extracting the second speech feature from the first speech feature, a Transformer encoder is used to encode the first speech feature. Its input is the first speech feature obtained through convolutional feature extraction, with the positional codes representing the original speech sequence added to it. The Transformer encoder, based on the Transformer layer structure, introduces contextual information into the first speech feature through a self-attention mechanism. Finally, the speech feature, positional information, and contextual information at each position are fused into an implicit memory vector representation.
[0044] It should be noted that this embodiment includes N Transformer encoders, each of which contains a multi-head attention layer and a position-wise fully connected feed-forward layer. The feature encoder part of the Transformer consists of a 7-layer one-dimensional convolutional network.
[0045] Optionally, feature extraction is performed on the original speech to obtain the first speech features, including:
[0046] By performing convolution operations on the original speech through a preset convolutional layer, the convolutional features corresponding to the original speech are obtained.
[0047] The convolutional features are linearly mapped using a preset first activation layer and a preset second activation layer to obtain the first speech feature corresponding to the original speech.
[0048] In this embodiment, a one-dimensional convolutional layer is used to perform convolution operations on the original speech to extract the convolutional features of the original speech. The convolutional features are then linearly mapped through a ReLU activation function layer and a GELU activation function layer to obtain the first speech feature corresponding to the original speech.
[0049] It should be noted that each convolutional layer is followed by a corresponding pooling layer. The pooling layer's function is to downsample in the time and / or frequency domains. After extracting the convolutional features from the original speech using a one-dimensional convolutional layer, these features are then downsampled.
[0050] It is important to note that the total downsampling rate of the pooling layers in the time domain should be less than the total downsampling rate in the frequency domain; that is, the total sampling rate of the pooling layers in the time domain should be greater than the total sampling rate in the frequency domain. The same constraint applies when downsampling is performed during the convolution operations. To obtain better first speech features, the total downsampling rate in the time domain should be determined based on the granularity of speech classification performed on the speech to be recognized.
[0051] It should be noted that, to obtain more accurate convolutional features, multiple convolutional layers can be used for feature extraction. Each convolutional layer is followed by a corresponding pooling layer, whose role is to downsample in the time and/or frequency domains. The convolutional kernels of each convolutional layer have the same size, and the number of filters in each subsequent convolutional layer is an integer multiple of the number of filters in the previous convolutional layer. For example, when there are four convolutional layers, each is immediately followed by a pooling layer. The first convolutional layer is conv64block, which can include M channels; for each channel there can be 64 filters, and the kernel size of each filter can be 3x3. It is followed by the pooling layer pool2d_2x2, which downsamples in both the time and frequency domains at a sampling rate of 1/2. Next is the convolutional layer conv128block, which can include N channels; for each channel there can be 128 filters, and the kernel size of each filter can be 3x3. It is followed by the pooling layer pool2d_2x2, which again downsamples in both the time and frequency domains at a sampling rate of 1/2. Next is the convolutional layer conv256block, which can include K channels; for each channel there can be 256 filters, and the kernel size of each filter can be 3x3. It is followed by the pooling layer pool2d_2x1, which downsamples only in the frequency domain at a sampling rate of 1/2. Finally, the convolutional layer conv512block can include L channels; for each channel there can be 512 filters, and the kernel size of each filter can be 3x3. It is followed by the pooling layer pool2d_2x1, which downsamples only in the frequency domain at a sampling rate of 1/2.
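The effect of this four-block stack on the feature-map size can be traced with a short sketch. It assumes that the 3x3 convolutions are padded so that they do not change the map size, and that the pooling window is written as (frequency, time), so that pool2d_2x1 halves only the frequency axis as the text states. The names `STACK` and `trace_shapes` and the 64 x 100 input size are hypothetical.

```python
from typing import List, Tuple

# (block name, number of filters, pooling window as (frequency, time))
STACK: List[Tuple[str, int, Tuple[int, int]]] = [
    ("conv64block", 64, (2, 2)),
    ("conv128block", 128, (2, 2)),
    ("conv256block", 256, (2, 1)),
    ("conv512block", 512, (2, 1)),
]


def trace_shapes(freq_bins: int, time_frames: int) -> List[Tuple[str, int, int, int]]:
    """Return (block, filters, freq_bins, time_frames) after each block's pooling layer."""
    shapes = []
    for name, filters, (pool_freq, pool_time) in STACK:
        freq_bins //= pool_freq
        time_frames //= pool_time
        shapes.append((name, filters, freq_bins, time_frames))
    return shapes


# Assumed input: 64 frequency bins by 100 time frames.
shapes = trace_shapes(64, 100)
total_freq_factor = 64 // shapes[-1][2]
total_time_factor = 100 // shapes[-1][3]
```

For this stack the frequency axis shrinks by a total factor of 16 and the time axis by a total factor of 4, which matches the requirement that the total downsampling in the time domain be smaller than in the frequency domain.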
[0052] In this embodiment, after obtaining the convolutional features corresponding to the original speech, a preset first activation layer and a preset second activation layer are used to linearly map the convolutional features to obtain the first speech features corresponding to the original speech. The preset first activation layer can be a ReLU activation function layer or a GELU activation function layer.
[0053] Optionally, contextual features are extracted from the first speech features to obtain second speech features with contextual representation, including:
[0054] The first speech feature is encoded by a preset coding layer to obtain the encoded feature;
[0055] Based on the attention mechanism, attention values are extracted from the encoded features, and the encoded features are weighted using the attention values to obtain the weighted encoded features corresponding to the encoded features.
[0056] By performing positional encoding on the weighted encoded features through a preset feed-forward layer, a second speech feature with contextual representation is obtained.
[0057] In this embodiment, the first speech features are encoded using a preset encoding layer to obtain encoded features. This preset encoding layer consists of a 7-layer one-dimensional convolutional network. The 7-layer one-dimensional convolutional network extracts the encoded features from the first speech features. It should be noted that this 7-layer one-dimensional convolutional network includes a feature extractor composed of convolutional layers and subsampling layers; this feature extractor can be considered a filter. A convolutional layer refers to a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may only be connected to some neurons in neighboring layers. A convolutional layer typically contains several feature planes, each of which can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights; these shared weights are the convolutional kernel. The convolutional kernel can be initialized as a matrix of random size, and during the training of the convolutional neural network, the kernel can learn to obtain reasonable weights. Furthermore, the direct benefit of shared weights is reducing the connections between layers of the convolutional neural network, while also reducing the risk of overfitting.
[0058] In convolutional neural networks (CNNs), the convolutional layer is the core module. Its primary function is to extract features from raw data. A convolutional layer takes features from the previous layer and a convolutional kernel, performs a convolution operation, and then applies an activation function to obtain the convolution result for that layer, thus forming its features. Shallow convolutional layers can only extract relatively low-level features, while deeper layers extract more complex features from lower-level features. The features output by a convolutional layer are related to the convolution of several features from the previous layer. Each feature can be convolved using different convolutional kernels.
[0059] Based on the attention mechanism, attention values are extracted from the encoded features. These attention values are then used to weight the encoded features, resulting in weighted encoded features. The calculation of these attention values requires spatial transformation of the input encoded features to obtain a feature matrix of a preset dimension. In this embodiment, the input encoded features are obtained through a preset encoding layer. Spatial transformation is then performed on these encoded features to obtain the spatially transformed features. Different fully connected layers can be used to perform spatial transformation on the feature matrix corresponding to the encoded features, resulting in the corresponding transformed features. Different fully connected layers with different parameters will produce different transformed feature matrices, so multiple spatial transformations can be performed on the feature matrix corresponding to the encoded features using different fully connected layers.
[0060] The global features of the encoded features can be calculated through a self-attention mechanism, fully exploring the attention relationships between local features in the encoded features, thereby obtaining more accurate global features. This helps improve the robustness of the encoded features. In this embodiment, the self-attention mechanism allows more resources to be allocated to the most information-rich and discriminative regions during feature extraction, achieving the goal of highlighting key information and suppressing irrelevant information, thus further extracting relevant information.
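The weighting step can be sketched with single-head scaled dot-product self-attention, a standard formulation that is used here as a stand-in because the patent does not give the exact formula. The three weight matrices play the role of the different fully connected layers that perform the spatial transformations; `matmul`, `softmax`, `self_attention`, and the toy values are hypothetical names and data.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Weight every time step by its attention to all time steps."""
    q, k, v = matmul(features, w_q), matmul(features, w_k), matmul(features, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(q_t, k_s)) / scale for k_s in k] for q_t in q]
    weights = [softmax(row) for row in scores]   # attention values, one row per time step
    return matmul(weights, v)                    # weighted encoded features


# Toy encoded features: 3 time steps, width 2 (assumed values, for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weighted = self_attention(encoded, identity, identity, identity)
```

Each output row is a convex combination of the value rows, so time steps that score highly against the query contribute more to the weighted encoded feature at that position.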
[0061] In another embodiment, a multi-head attention matrix can be used to determine the attention relationships between the local feature vectors corresponding to the encoded features. Then, by weighted summation of these local feature vectors, they are fused to obtain a global feature vector. Using a multi-head attention matrix helps determine the attention relationships between local features, enabling targeted fusion and improving the representation of speech features.
[0062] By performing positional encoding on the weighted encoded features through a pre-defined feedforward layer, a second speech feature with contextual representation is obtained. The feedforward layer consists of an input layer, a fully connected layer, and an output layer. The dimensions of the input and output layers are both set to 512, and the fully connected layer is implemented using a maximum value function with a dimension of 2048.
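A minimal sketch of such a feed-forward layer follows, assuming that the "maximum value function" means max(0, ·) applied element-wise after the first linear map, as in the usual position-wise feed-forward layer; that reading is an assumption. The names `random_matrix` and `feed_forward` are hypothetical, the weights are random stand-ins for trained parameters, and small widths replace 512 and 2048 only to keep the demo fast.

```python
import random
from typing import List

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def feed_forward(x: List[float], w1: Matrix, b1: List[float],
                 w2: Matrix, b2: List[float]) -> List[float]:
    """Compute max(0, x @ w1 + b1) @ w2 + b2 for the feature at one time step."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, column)) + bias)
              for column, bias in zip(zip(*w1), b1)]
    return [sum(h * w for h, w in zip(hidden, column)) + bias
            for column, bias in zip(zip(*w2), b2)]


# The source sets the input/output width to 512 and the inner width to 2048;
# smaller widths are used here only to keep the demo fast.
D_MODEL, D_INNER = 8, 32
rng = random.Random(0)
w1, b1 = random_matrix(D_MODEL, D_INNER, rng), [0.0] * D_INNER
w2, b2 = random_matrix(D_INNER, D_MODEL, rng), [0.0] * D_MODEL
frame = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
output = feed_forward(frame, w1, b1, w2, b2)
```

The same weights are applied independently at every time step, so the output keeps the input width while passing through the wider inner layer.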
[0063] S202: Discretize the second speech features based on vector quantization to obtain discrete speech features, and decode the discrete features to obtain reconstructed speech features.
[0064] In step S202, the second speech feature is discretized using a vector quantizer to obtain discretized speech features. Vector quantization is a method similar to clustering, which clusters continuous data into discrete data, thereby reducing the amount of data that needs to be stored and achieving data compression. The discretized features are then decoded to obtain the reconstructed speech features.
[0065] In this embodiment, the second speech features are clustered into discrete data using a vector quantizer, thereby reducing the amount of data that needs to be stored and achieving data compression. The vector quantizer also preserves the most important speech information through vector quantization. The second speech features are input into the vector quantizer and discretized to obtain discrete speech features.
[0066] Specifically, the second speech feature is first vector-quantized using a preset distance formula in the vector quantizer to obtain discrete speech features. The preset distance formula is L2 distance (Euclidean distance), which can also be replaced by L1 distance (Manhattan distance). Both L2 and L1 distance formulas can process continuous variables into discrete variables; the specific processing methods and principles will not be elaborated here.
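The nearest-code lookup with either distance can be sketched as follows. The codebook values and the names `l2_distance`, `l1_distance`, and `quantize` are hypothetical; in practice the codebook would be learned, which is not shown. The toy values are chosen so that the two distances select different codes.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def l1_distance(a: Vector, b: Vector) -> float:
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector],
             distance: Callable[[Vector, Vector], float] = l2_distance
             ) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous frame to the index and value of its nearest code."""
    indices = [min(range(len(codebook)), key=lambda i: distance(frame, codebook[i]))
               for frame in frames]
    return indices, [list(codebook[i]) for i in indices]


# Toy codebook (assumed values, for illustration only).
codebook = [[2.0, 2.0], [3.0, 0.0]]
frames = [[0.0, 0.0]]
l2_indices, l2_codes = quantize(frames, codebook)               # L2: 2.83 vs 3.0
l1_indices, l1_codes = quantize(frames, codebook, l1_distance)  # L1: 4.0 vs 3.0
```

Only the integer indices need to be stored; the discrete feature is recovered by looking the index up in the codebook, which is where the compression comes from.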
[0067] The discretized features are decoded to obtain reconstructed speech features. In the process of decoding the discretized features, the decoder is used to decode and reconstruct the discretized features, making the discretized features closer to the original speech.
[0068] It should be noted that vector quantization is a data compression technique. Its basic idea is to group several scalar values into a vector and then quantize the vector as a whole in the vector space, thus compressing the data with minimal information loss. Vector quantization is a mapping from the N-dimensional real space R^N to L discrete vectors within R^N; it can also be called group quantization. Scalar quantization is the special case of vector quantization in which the dimension is 1. Vector quantization coding is a comparatively recent and widely studied quantization method in speech signal coding. It arose not merely as a quantizer design but, more importantly, as a compression coding method. In traditional predictive and transform coding, the signal is first transformed into a sequence of numbers through a mapping transformation, and each number is then quantized individually. In vector quantization coding, the input data is divided into groups, each group of k numbers is treated as a k-dimensional vector, and each vector is quantized as a unit.
[0069] For example, the second speech features are grouped to obtain multiple sets of scalar data. Each set of scalar data is assembled into a vector, and the resulting vectors are mapped into a vector space to obtain the corresponding quantized vectors, which constitute the discrete features.
[0070] Specifically, the server divides the second speech feature into several scalar data points based on a preset data partitioning principle and groups these scalar data points into multiple k-dimensional initial vectors. These k-dimensional initial vectors are then mapped into the same vector space to generate the quantized vectors. In this embodiment, each k-dimensional initial vector is constructed from a group of scalar data and is then quantized as a whole in the vector space, thereby compressing the data without losing much information.
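The grouping step above can be sketched as follows, under the assumption that the "preset data partitioning principle" is simply to take consecutive runs of k scalars; the embodiment does not specify the principle. The names `group_into_vectors` and `quantize_groups` are hypothetical, and the nearest-codeword search uses squared Euclidean distance for brevity.

```python
def group_into_vectors(scalars, k):
    """Split a flat sequence of scalars into consecutive k-dimensional vectors."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scalars) % k != 0:
        raise ValueError("number of scalars must be a multiple of k")
    return [list(scalars[i:i + k]) for i in range(0, len(scalars), k)]


def quantize_groups(scalars, k, codebook):
    """Quantize each k-dimensional group as a whole.

    Returns one codebook index per group, so len(scalars) floats are
    compressed to len(scalars) // k integers.
    """
    indices = []
    for vec in group_into_vectors(scalars, k):
        best = min(
            range(len(codebook)),
            key=lambda i: sum((x - c) ** 2 for x, c in zip(vec, codebook[i])),
        )
        indices.append(best)
    return indices
```

With k = 1 this reduces to scalar quantization, matching the special case noted in the description.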
[0071] Optionally, the discretized features are decoded to obtain reconstructed speech features, including:
[0072] A deconvolution (transposed convolution) operation is performed on the discretized features through a preset deconvolution layer to obtain the deconvolution features corresponding to the discretized features.
[0073] The deconvolution features are linearly mapped using a preset third activation layer and a preset fourth activation layer to obtain the reconstructed speech features.
[0074] In this embodiment, a decoder is used to decode and reconstruct the discretized features. The decoder includes N decoding layers, the same number as the encoding layers in the encoder. Each decoding layer includes a preset deconvolution layer, a preset third activation layer, and a preset fourth activation layer. The deconvolution layer uses the same kernel size and stride as the convolutional layers in the encoder to perform upsampling, thereby mapping the low-resolution, high-channel-count discretized features to high-resolution, low-channel-count deconvolution features. After the deconvolution layer, a batch normalization layer can be used to normalize the features. The normalized features are then input to the preset third and fourth activation layers for linear mapping to obtain the reconstructed speech features.
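To illustrate how a deconvolution layer upsamples, the following sketch implements a transposed one-dimensional convolution for a single channel without padding. It is a simplification made for the example: the decoding layers in the embodiment operate on multi-channel features and are followed by normalization and activation layers, which are omitted here. The function name `conv_transpose1d` is hypothetical.

```python
def conv_transpose1d(x, kernel, stride):
    """Single-channel transposed 1-D convolution with no padding.

    Each input sample scatters a scaled copy of `kernel` into the output
    at offset i * stride, so the output length is
    (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += value * w
    return out
```

With a stride of 2 and a kernel of length 2, two input samples become four output samples, which is the resolution increase that mirrors the downsampling performed by an encoder convolution with the same kernel size and stride.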
[0075] S203: Concatenate the reconstructed speech features with the first speech features to obtain the concatenated feature result. Use the concatenated feature result as the second speech feature. Return to execute the step of discretizing the second speech feature based on vector quantization to obtain the discretized speech feature. Continue until N vector quantizations are completed to obtain the reconstructed speech feature corresponding to the Nth vector quantization.
[0076] In step S203, multiple discretization passes are performed by multiple vector quantizers. For each vector quantization, the result of concatenating the corresponding reconstructed speech feature with the first speech feature is used as the second speech feature and input into the corresponding vector quantizer. This process continues until N vector quantizations are completed, at which point the reconstructed speech feature corresponding to the Nth vector quantization is obtained.
[0077] In this embodiment, N vector quantizers are used for discretization to obtain corresponding discretized features. Each time discretization is performed, the input data is different. Each time discretization is performed, the reconstructed speech features obtained in the previous step are concatenated with the first speech features. The concatenation result is used as the second speech feature and input to the corresponding vector quantizer for discretization. After each discretization, the discretized features obtained are decoded and reconstructed to obtain the reconstructed speech features. Therefore, after N vector quantizations are completed, the reconstructed speech features corresponding to the Nth vector quantization are obtained.
[0078] It should be noted that the encoder extracts features from the original speech to obtain the first speech features, and the decoder decodes and reconstructs the discretized features to obtain the reconstructed speech features. The encoder includes N encoding layers and the decoder includes N decoding layers. When concatenating, the first speech feature output by an encoding layer of the encoder is concatenated with the reconstructed speech feature output by the corresponding decoding layer of the decoder.
[0079] For example, an encoder is used to extract the first speech feature from the original speech, and a Transformer encoder is then used to extract the contextual representation from the first speech feature to obtain the second speech feature. The second speech feature is input into the first vector quantizer for discretization to obtain the first discretized feature, which is input into the first decoding layer of the decoder for decoding and reconstruction to obtain the first reconstructed speech feature. Next, the first speech feature produced by each encoding layer of the encoder is acquired. The first speech feature corresponding to the first encoding layer is concatenated with the first reconstructed speech feature to obtain the first feature concatenation result. The first feature concatenation result is input into the second vector quantizer for discretization to obtain the second discretized feature, which is input into the second decoding layer of the decoder for decoding and reconstruction to obtain the second reconstructed speech feature. The second reconstructed speech feature is concatenated with the first speech feature corresponding to the second encoding layer to obtain the second feature concatenation result, which is input into the third vector quantizer for discretization to obtain the third discretized feature. The third discretized feature is input into the third decoding layer of the decoder for decoding and reconstruction to obtain the third reconstructed speech feature. The third reconstructed speech feature is concatenated with the first speech feature corresponding to the third encoding layer to obtain the third feature concatenation result, which is input into the fourth vector quantizer for discretization to obtain the fourth discretized feature. The fourth discretized feature is input into the fourth decoding layer of the decoder for decoding and reconstruction to obtain the fourth reconstructed speech feature. This process is repeated with the N vector quantizers until N vector quantizations are completed, yielding the reconstructed speech feature corresponding to the Nth vector quantization, where N is an integer greater than 1.
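The iteration walked through above can be summarized by the following control-flow sketch. It is not the implementation of the embodiment: the quantizers and decoding layers are passed in as arbitrary callables, features are plain Python lists, and concatenation is list concatenation. The name `iterative_vq_enhance` and all argument names are hypothetical.

```python
def iterative_vq_enhance(second_feature, first_features, quantizers, decoder_layers):
    """Run N rounds of quantize -> decode -> concatenate.

    first_features[i]: first speech feature from the i-th encoding layer.
    quantizers[i], decoder_layers[i]: callables for the i-th round.
    Returns the reconstructed feature after the N-th vector quantization.
    """
    n = len(quantizers)
    if n < 2:
        raise ValueError("N must be an integer greater than 1")
    if len(decoder_layers) != n or len(first_features) < n - 1:
        raise ValueError("mismatched number of layers")
    current = second_feature
    reconstructed = None
    for i in range(n):
        discrete = quantizers[i](current)            # i-th vector quantization
        reconstructed = decoder_layers[i](discrete)  # i-th decoding layer
        if i < n - 1:
            # The concatenation result becomes the next "second speech feature".
            current = reconstructed + first_features[i]
    return reconstructed
```

The value returned after the final round corresponds to the reconstructed speech feature of the Nth vector quantization, which step S204 then decodes.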
[0080] In this embodiment, multiple vector quantizers are used for discretization, so that each vector quantizer can refine the clustering result produced by the previous one, thereby achieving better noise reduction performance.
[0081] S204: Decode the reconstructed speech features corresponding to the Nth vector quantization to obtain reconstructed speech, the reconstructed speech being the enhancement result of the original speech.
[0082] In step S204, the reconstructed speech feature corresponding to the Nth vector quantization is the reconstructed speech feature after the last vector quantization clustering. This feature is decoded to obtain the reconstructed speech, which is the enhancement result of the original speech.
[0083] In this embodiment, when decoding the reconstructed speech features corresponding to the Nth vector quantization, the Nth decoding layer of the decoder is used for decoding and reconstruction to obtain the reconstructed speech. The Nth decoding layer includes a preset deconvolution layer, a preset third activation layer, and a preset fourth activation layer. The deconvolution layer uses the same kernel size and stride as the convolutional layer in the encoder to perform upsampling, thereby mapping the low-resolution, high-channel-count discrete features to high-resolution, low-channel-count deconvolution features. After the deconvolution layer, a batch normalization layer can be used to normalize the features. The normalized features are then input to the preset third and fourth activation layers for linear mapping to obtain the reconstructed speech features.
[0084] Optionally, before obtaining the reconstructed speech features corresponding to the Nth vector quantization, the following steps are also included:
[0085] Obtain the N vector quantization losses produced when discretizing the second speech features based on vector quantization, and the Fourier transform loss corresponding to the original speech;
[0086] Construct a speech enhancement loss function based on the N vector quantization losses and the Fourier transform loss.
[0087] In this embodiment, the N vector quantizers yield N vector quantization losses, one for each vector quantizer. The vector quantization loss is the codebook density loss within the vector quantizer, and the Fourier transform loss corresponding to the original speech is the loss incurred when performing a Fourier transform on the original speech. The Fourier transform loss is used to characterize the difference between the acoustic features and Fourier spectral features of the original speech extracted by the deep neural network. Based on the N vector quantization losses and the Fourier transform loss, a speech enhancement loss function is constructed. The calculation formula for the speech enhancement loss function is as follows:
[0088]
[0089] Here, L_total is the speech enhancement loss, L_se is the Fourier transform loss, the remaining loss term is the vector quantization loss in the i-th vector quantizer, and λ_i is the parameter for adjusting the weight of the i-th vector quantizer.
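The formula itself is not reproduced in this text. A form consistent with the quantities defined above would be a weighted sum, in which the speech enhancement loss equals the Fourier transform loss plus each vector quantization loss multiplied by its weight λ_i. The sketch below computes that sum; the weighted-sum form is an assumption, and the function and argument names are hypothetical.

```python
def speech_enhancement_loss(l_se, vq_losses, weights):
    """Assumed form: L_total = L_se + sum_i(lambda_i * vq_loss_i).

    l_se:      Fourier transform loss.
    vq_losses: the N vector quantization losses, one per quantizer.
    weights:   the N weight-adjustment parameters lambda_i.
    """
    if len(vq_losses) != len(weights):
        raise ValueError("need one weight per vector quantization loss")
    return l_se + sum(w * loss for w, loss in zip(weights, vq_losses))
```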
[0090] Optionally, the Fourier transform loss corresponding to the original speech is obtained, including:
[0091] Perform a Fourier transform on the original speech to obtain the frequency domain loss and the time domain loss corresponding to the original speech.
[0092] Calculate the Fourier transform loss corresponding to the original speech based on the frequency domain loss and the time domain loss.
[0093] In this embodiment, the Fourier transform loss includes a frequency domain loss and a time domain loss. The frequency domain loss is a loss function for the speech enhancement task constructed from multi-resolution short-time Fourier transform magnitude spectra; it measures the error between the short-time Fourier transform magnitude spectrum of the output speech of the initial speech enhancement model and that of the real speech. The time domain loss is calculated from the time-domain waveforms of the output speech of the initial speech enhancement model and the real speech. The Fourier transform loss calculation formula is as follows:
[0094]
[0095] Here, L_se is the Fourier transform loss and y denotes the time-domain speech. The other terms of the formula are the short-time Fourier transform of the time-domain speech, the loss at the i-th resolution, the number M of short-time Fourier transform resolutions, and the time-domain loss.
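Because the formula is not reproduced in this text, the following is only a sketch of one loss that matches the description: the mean absolute error between magnitude spectra, averaged over M short-time Fourier transform resolutions, plus the mean absolute error between the waveforms. The use of absolute error, the rectangular window, the unweighted sum of the two parts, and every name in the code are assumptions made for illustration.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Magnitude spectra of overlapping frames (rectangular window, naive DFT)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length sequences."""
    if not a:
        raise ValueError("empty input")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def fourier_transform_loss(enhanced, target, resolutions):
    """Frequency domain loss (averaged over resolutions) plus time domain loss.

    resolutions: list of (frame_len, hop) pairs, one per STFT resolution.
    """
    time_loss = mean_abs_diff(enhanced, target)
    freq_loss = 0.0
    for frame_len, hop in resolutions:
        mag_e = [m for f in stft_magnitude(enhanced, frame_len, hop) for m in f]
        mag_t = [m for f in stft_magnitude(target, frame_len, hop) for m in f]
        freq_loss += mean_abs_diff(mag_e, mag_t)
    freq_loss /= len(resolutions)
    return freq_loss + time_loss
```

A practical implementation would use a fast Fourier transform and a tapered window; the naive transform is used here only to keep the example self-contained.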
[0096] In summary, the original speech is acquired and features are extracted from it to obtain the first speech feature. Contextual features are extracted from the first speech feature to obtain the second speech feature with contextual representation. The second speech feature is discretized based on vector quantization to obtain the discretized speech feature, and the discretized feature is decoded to obtain the reconstructed speech feature. The reconstructed speech feature is concatenated with the first speech feature to obtain the feature concatenation result, the feature concatenation result is used as the second speech feature, and the step of discretizing the second speech feature based on vector quantization is executed again, until N vector quantizations are completed and the reconstructed speech feature corresponding to the Nth vector quantization is obtained, where N is an integer greater than 1. The reconstructed speech feature corresponding to the Nth vector quantization is then decoded to obtain the reconstructed speech, which is the enhancement result of the original speech. In this invention, a vector quantizer is used to discretize the second speech features with contextual representation. After the discretized features are decoded and reconstructed by a decoder, the reconstructed speech features are fused with the features of the original speech. The vector quantizer is used again to discretize the fused features, and the decoder is used again to decode and reconstruct the discretized features. This process is repeated until N vector quantizations are completed, resulting in the reconstructed speech features corresponding to the Nth vector quantization. By discretizing and reconstructing multiple times, both the discrete and the reconstructed features are fully utilized, thereby obtaining high-quality enhanced speech.
[0097] Referring to Figure 3, Figure 3 is a schematic diagram of the structure of a speech enhancement device based on vector quantization provided in an embodiment of the present invention. In this embodiment, the terminal includes units used for performing the steps in the embodiment corresponding to Figure 2. For details, refer to Figure 2 and the relevant descriptions in the embodiment corresponding to Figure 2. For ease of explanation, only the parts relevant to this embodiment are shown. As shown in Figure 3, the speech enhancement device 30 includes: an acquisition module 31, a reconstruction module 32, a splicing module 33, and a decoding module 34.
[0098] The acquisition module 31 is used to acquire the original speech, extract features from the original speech to obtain the first speech feature, and extract context features from the first speech feature to obtain the second speech feature with context representation.
[0099] The reconstruction module 32 is used to discretize the second speech features based on vector quantization to obtain discretized speech features, and to decode the discretized features to obtain reconstructed speech features.
[0100] The splicing module 33 is used to splice the reconstructed speech features with the first speech features to obtain the splicing result. The splicing result is used as the second speech feature. The module then returns to perform the step of discretizing the second speech feature based on vector quantization to obtain the discretized speech feature. This process continues until N vector quantizations are completed, and the reconstructed speech feature corresponding to the Nth vector quantization is obtained, where N is an integer greater than 1.
[0101] Optionally, the aforementioned speech enhancement device 30 further includes:
[0102] The loss acquisition module is used to acquire the N vector quantization losses produced when discretizing the second speech features based on vector quantization, and the Fourier transform loss corresponding to the original speech.
[0103] The construction module is used to construct a speech enhancement loss function based on the N vector quantization losses and the Fourier transform loss.
[0104] Optionally, the acquisition module 31 includes:
[0105] A convolutional unit is used to perform convolution operations on the original speech through a preset convolutional layer to obtain the convolutional features corresponding to the original speech.
[0106] The first linear mapping unit is used to perform linear mapping on the convolutional features using a preset first activation layer and a preset second activation layer to obtain the first speech features corresponding to the original speech.
[0107] Optionally, the acquisition module 31 includes:
[0108] The feature encoding unit is used to encode the first speech feature through a preset encoding layer to obtain the encoded feature.
[0109] Attention units are used to extract attention values from encoded features based on attention mechanisms, and then use these attention values to weight the encoded features to obtain the weighted encoded features corresponding to the encoded features.
[0110] The positional encoding unit is used to perform positional encoding on the weighted encoded features through a preset feedforward layer to obtain a second speech feature with contextual representation.
[0111] Optionally, the above-mentioned reconstruction module 32 includes:
[0112] The deconvolution unit is used to perform a deconvolution operation on the discretized features through a preset deconvolution layer to obtain the deconvolution features corresponding to the discretized features.
[0113] The second linear mapping unit is used to linearly map the deconvolution features using a preset third activation layer and a preset fourth activation layer to obtain reconstructed speech features.
[0114] Optionally, the loss acquisition module mentioned above includes:
[0115] The Fourier transform unit is used to perform Fourier transform on the original speech to obtain the frequency domain loss and the time domain loss corresponding to the original speech.
[0116] The computation unit is used to calculate the Fourier transform loss corresponding to the original speech based on the frequency domain loss and the time domain loss.
[0117] It should be noted that the information interaction and execution process between the above-mentioned units are based on the same concept as the method embodiments of the present invention. For details on their specific functions and technical effects, please refer to the method embodiments section, which will not be repeated here.
[0118] Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of the present invention. As shown in Figure 4, the computer device of this embodiment includes: at least one processor (only one is shown in Figure 4), a memory, and a computer program stored in the memory and executable on the at least one processor. When the processor executes the computer program, the steps of any of the above-described vector quantization-based speech enhancement methods are implemented.
[0119] This computer device may include, but is not limited to, a processor and memory. Those skilled in the art will understand that Figure 4 is merely an example of a computer device and does not constitute a limitation on the computer device. The computer device may include more or fewer components than shown in the illustration, a combination of certain components, or different components, such as network interfaces, displays, and input devices.
[0120] The processor referred to can be a CPU, but it can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0121] Memory includes readable storage media, internal memory, etc., wherein internal memory can be the RAM of a computer device, providing an environment for the operation of the operating system and computer-readable instructions stored in the readable storage media. The readable storage media can be the hard drive of a computer device, or in other embodiments, it can be an external storage device of the computer device, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card. Furthermore, memory can include both internal storage units and external storage devices of a computer device. Memory is used to store the operating system, applications, bootloader, data, and other programs, such as program code for computer programs. Memory can also be used to temporarily store data that has been output or will be output.
[0122] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the functions described above can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this invention. The specific working process of the units and modules in the above device can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here. If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the present invention can implement all or part of the processes in the methods of the above embodiments by instructing related hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the above method embodiments. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. 
A computer-readable medium can include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.
[0123] The present invention can implement all or part of the processes in the methods of the above embodiments, or it can be accomplished by a computer program product. When the computer program product is run on a computer device, the computer device executes the steps in the above method embodiments.
[0124] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0125] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0126] In the embodiments provided by this invention, it should be understood that the disclosed apparatus / computer devices and methods can be implemented in other ways. For example, the apparatus / computer device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0127] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.
[0128] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A speech enhancement method based on vector quantization, characterized in that the speech enhancement method includes: acquiring original speech, extracting features from the original speech to obtain a first speech feature, and extracting context features from the first speech feature to obtain a second speech feature with context representation; discretizing the second speech feature based on vector quantization to obtain discretized speech features, and decoding the discretized speech features to obtain reconstructed speech features; concatenating the reconstructed speech features with the first speech feature to obtain a feature concatenation result, using the feature concatenation result as the second speech feature, and returning to the step of discretizing the second speech feature based on vector quantization to obtain discretized speech features, until N vector quantizations are completed and the reconstructed speech features corresponding to the Nth vector quantization are obtained, where N is an integer greater than 1; and decoding the reconstructed speech features corresponding to the Nth vector quantization to obtain reconstructed speech, the reconstructed speech being the enhancement result of the original speech; wherein, before obtaining the reconstructed speech features corresponding to the Nth vector quantization, the method further includes: obtaining the N vector quantization losses produced when discretizing the second speech feature based on vector quantization, and the Fourier transform loss corresponding to the original speech; and constructing a speech enhancement loss function based on the N vector quantization losses and the Fourier transform loss, the calculation formula of the speech enhancement loss function being expressed in terms of the following quantities: the speech enhancement loss L_total, the Fourier transform loss L_se, the vector quantization loss in the i-th vector quantizer, and λ_i, the parameter for adjusting the weight of the i-th vector quantizer; the vector quantization loss is the codebook density loss in the vector quantizer; and the Fourier transform loss is used to characterize the difference between the acoustic features and Fourier spectral features of the original speech extracted by the deep neural network.
2. The speech enhancement method of claim 1, wherein the step of extracting features from the original speech to obtain the first speech feature includes: performing a convolution operation on the original speech through a preset convolutional layer to obtain the convolutional features corresponding to the original speech; and linearly mapping the convolutional features using a preset first activation layer and a preset second activation layer to obtain the first speech feature corresponding to the original speech.
3. The speech enhancement method of claim 1, wherein the step of extracting context features from the first speech feature to obtain a second speech feature with context representation includes: encoding the first speech feature through a preset encoding layer to obtain encoded features; extracting attention values from the encoded features based on an attention mechanism, and weighting the encoded features using the attention values to obtain the weighted encoded features corresponding to the encoded features; and performing positional encoding on the weighted encoded features through a preset feedforward layer to obtain the second speech feature with context representation.
4. The speech enhancement method of claim 1, wherein decoding the discretized speech features to obtain reconstructed speech features includes: performing a deconvolution operation on the discretized speech features through a preset deconvolution layer to obtain the deconvolution features corresponding to the discretized speech features; and linearly mapping the deconvolution features using a preset third activation layer and a preset fourth activation layer to obtain the reconstructed speech features.
5. The speech enhancement method of claim 1, wherein the step of obtaining the Fourier transform loss corresponding to the original speech includes: performing a Fourier transform on the original speech to obtain the frequency domain loss and the time domain loss corresponding to the original speech; and calculating the Fourier transform loss corresponding to the original speech based on the frequency domain loss and the time domain loss.
6. A speech enhancement apparatus based on vector quantization, characterized in that the speech enhancement apparatus includes: an acquisition module, used to acquire the original speech, extract features from the original speech to obtain a first speech feature, and extract context features from the first speech feature to obtain a second speech feature with context representation; a reconstruction module, used to discretize the second speech feature based on vector quantization to obtain discretized speech features, and to decode the discretized speech features to obtain reconstructed speech features; a concatenation module, used to concatenate the reconstructed speech features with the first speech feature to obtain a concatenation result, take the concatenation result as the second speech feature, and return to the step of discretizing the second speech feature based on vector quantization to obtain discretized speech features, until N vector quantizations are completed and the reconstructed speech features corresponding to the Nth vector quantization are obtained, where N is an integer greater than 1; and a decoding module, used to decode the reconstructed speech features corresponding to the Nth vector quantization to obtain reconstructed speech, the reconstructed speech being the enhancement result of the original speech. The speech enhancement apparatus also includes: a loss acquisition module, used to acquire the N vector quantization losses incurred in discretizing the second speech feature based on vector quantization, and the Fourier transform loss corresponding to the original speech; and a construction module, used to construct a speech enhancement loss function based on the N vector quantization losses and the Fourier transform loss. The speech enhancement loss function is calculated as follows:
$L = L_{F} + \sum_{i=1}^{N} \lambda_i L_{VQ}^{(i)}$
where $L$ is the speech enhancement loss, $L_{F}$ is the Fourier transform loss, $L_{VQ}^{(i)}$ is the vector quantization loss in the i-th vector quantizer, and $\lambda_i$ is the parameter for adjusting the weight of the i-th vector quantizer; the vector quantization loss is the codebook density loss in the vector quantizer; and the Fourier transform loss is used to characterize the difference between the acoustic features and Fourier spectral features of the original speech extracted by the deep neural network.
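The loop over N vector quantizers and the construction of the loss can be sketched as follows. This is a toy illustration under stated assumptions: nearest-codeword lookup stands in for the vector quantizer, squared distance stands in for the per-quantizer loss (the claim calls it a codebook density loss), the decoder is a hypothetical identity function, and the loss is assumed to be the Fourier transform loss plus a weighted sum of the N quantizer losses.

```python
def quantize(vector, codebook):
    """Return the nearest codeword and the squared distance to it."""
    def sq_dist(codeword):
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    best = min(codebook, key=sq_dist)
    return list(best), sq_dist(best)

def run_quantizers(first, second, codebooks, decode):
    """Apply one quantizer per codebook, feeding each result into the next."""
    current, vq_losses = list(second), []
    for codebook in codebooks:
        discrete, loss = quantize(current, codebook)
        vq_losses.append(loss)
        reconstructed = decode(discrete)
        # Concatenate with the first speech feature; this is the next input.
        current = reconstructed + list(first)
    return reconstructed, vq_losses

def enhancement_loss(fourier_loss, vq_losses, weights):
    # Assumed form: Fourier loss plus weighted sum of the quantizer losses.
    return fourier_loss + sum(w * l for w, l in zip(weights, vq_losses))

first = [1.0, 0.0]
second = [0.9, 0.1]
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],                      # quantizer 1 (2-dim input)
    [[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]],  # quantizer 2 (4-dim input)
]
identity_decode = lambda features: list(features)  # hypothetical decoder
reconstructed, vq_losses = run_quantizers(first, second, codebooks, identity_decode)
total = enhancement_loss(0.5, vq_losses, weights=[1.0, 0.5])
```

Because concatenation widens the feature vector, the second codebook in this toy setup holds 4-dimensional codewords; a real decoder could instead project back to a fixed width.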
7. A computer device, characterized in that the computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the speech enhancement method of any one of claims 1 to 5.
8. A computer-readable storage medium storing a computer program, characterized in that, when executed by a processor, the computer program implements the speech enhancement method of any one of claims 1 to 5.