Audio restoration method and device, storage medium and electronic device
By extracting local matrix elements and refining feature submatrices from the audio feature matrix, and combining matrix decomposition networks and audio reconstruction networks, the problems of low accuracy and efficiency in existing audio reconstruction methods are solved, achieving higher accuracy and precision in audio reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG HUACHUANG VISION TECH CO LTD
- Filing Date
- 2023-03-29
- Publication Date
- 2026-06-12
AI Technical Summary
Existing neural network-based audio restoration methods suffer from low accuracy and efficiency.
By extracting features from the audio to be restored, an audio feature matrix is obtained. Local matrix elements are extracted to form multiple feature sub-matrices, and each feature sub-matrix is restored using its corresponding preset audio restoration parameters. Finally, the initial audio restoration signals are merged to obtain the audio restoration result. A matrix factorization network and an audio restoration network, based on a recurrent convolutional neural network and a convolutional neural network respectively, are used for processing.
It improves the accuracy and precision of audio restoration, avoids the mutual influence between different features, and enhances the effect of audio restoration.
Figure CN116524952B_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to an audio restoration method, apparatus, storage medium, and electronic device.
Background Technology
[0002] With the development of science and technology, audio recognition technology has received widespread attention. Research in audio recognition technology can improve audio recognition accuracy, thereby enhancing the user experience. Audio reconstruction technology is an important component of audio recognition technology, which can improve audio recognition accuracy by reconstructing audio signals.
[0003] In recent years, with the development of neural network technology, researchers have begun to develop audio restoration methods based on neural networks. However, these methods often have drawbacks such as low accuracy and low efficiency.
Summary of the Invention
[0004] This application provides at least one audio restoration method, an audio restoration device, a computer-readable storage medium, and an electronic device.
[0005] The first aspect of this application provides an audio restoration method, comprising: extracting features from the audio to be restored to obtain an audio feature matrix; extracting local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices; using preset audio restoration parameters corresponding to each feature sub-matrix to perform audio restoration on each feature sub-matrix to obtain an initial audio restoration signal corresponding to each feature sub-matrix; and merging the initial audio restoration signals to obtain an audio restoration result corresponding to the audio to be restored.
[0006] In one embodiment, extracting local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices includes: obtaining audio parameters of the audio to be restored; determining a matrix element selection strategy based on the audio parameters; and using the matrix element selection strategy to extract local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices.
[0007] In one embodiment, the audio parameters of the audio to be restored include audio length parameters and audio frequency parameters; determining the matrix element selection strategy based on the audio parameters includes: determining the selection window size and window movement step size of the local matrix elements based on at least one of the audio length parameters and audio frequency parameters of the audio to be restored; and obtaining the matrix element selection strategy corresponding to the audio feature matrix based on the selection window size and window movement step size.
[0008] In one embodiment, a matrix element selection strategy is used to extract local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices, including: extracting local matrix elements from the audio feature matrix based on the selection window size and window movement step size to obtain multiple local matrix elements; and recombining the multiple local matrix elements to obtain multiple feature sub-matrices.
[0009] In one embodiment, the audio restoration model includes a matrix factorization network and an audio restoration network, the audio restoration network including multiple audio restoration structures. Extracting local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices includes: inputting the audio feature matrix into the matrix factorization network to obtain multiple feature sub-matrices output by the matrix factorization network. Using preset audio restoration parameters corresponding to each feature sub-matrix to perform audio restoration on each feature sub-matrix to obtain an initial audio restoration signal corresponding to each feature sub-matrix includes: obtaining a target audio restoration structure matching the feature sub-matrix, and using the target audio restoration structure as the preset audio restoration parameters corresponding to the feature sub-matrix; and inputting the feature sub-matrix into the target audio restoration structure to obtain the initial audio restoration signal output by the target audio restoration structure.
[0010] In one embodiment, obtaining the target audio restoration structure that matches the feature sub-matrix includes: obtaining the matrix dimension information of the feature sub-matrix and the structural parameters of each audio restoration structure; and using the audio restoration structure whose structural parameters match the matrix dimension information of the feature sub-matrix as the target audio restoration structure that matches the feature sub-matrix.
[0011] In one embodiment, the matrix factorization network is obtained based on a recurrent convolutional neural network, and the audio restoration structure is obtained based on a convolutional neural network and a deconvolutional neural network.
[0012] A second aspect of this application provides an audio restoration apparatus, comprising: a feature extraction module for extracting features from the audio to be restored to obtain an audio feature matrix; a matrix splitting module for extracting local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices; an audio restoration module for performing audio restoration on each feature sub-matrix using preset audio restoration parameters corresponding to each feature sub-matrix to obtain an initial audio restoration signal corresponding to each feature sub-matrix; and a result generation module for merging the initial audio restoration signals to obtain an audio restoration result corresponding to the audio to be restored.
[0013] A third aspect of this application provides an electronic device, including a memory and a processor, wherein the processor is configured to execute program instructions stored in the memory to implement the aforementioned audio restoration method.
[0014] The fourth aspect of this application provides a computer-readable storage medium having program instructions stored thereon, which, when executed by a processor, implement the above-described audio restoration method.
[0015] The above scheme extracts features from the audio to be restored to obtain an audio feature matrix. Local matrix elements are then extracted from the audio feature matrix to obtain multiple feature sub-matrices. These feature sub-matrices refine the data features, allowing for more detailed handling of the relationships between different features and avoiding mutual interference. Furthermore, each feature sub-matrix is restored using preset audio restoration parameters, resulting in an initial audio restoration signal for each feature sub-matrix. This allows for different processing methods to be applied to different features, improving the accuracy of audio restoration. Finally, each initial audio restoration signal is merged to obtain the audio restoration result corresponding to the audio to be restored, thus improving the precision and accuracy of audio restoration.
[0016] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application.
Attached Figure Description
[0017] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the technical solutions of this application.
[0018] Figure 1 is a schematic diagram of an implementation environment for an exemplary embodiment of the audio restoration method of this application;
[0019] Figure 2 is a flowchart of an exemplary embodiment of the audio restoration method of this application;
[0020] Figure 3 is a schematic diagram illustrating the acquisition of a feature submatrix, as shown in an exemplary embodiment of this application;
[0021] Figure 4 is a structural diagram of an audio restoration model illustrated in an exemplary embodiment of this application;
[0022] Figure 5 is a schematic diagram illustrating audio restoration as shown in an exemplary embodiment of this application;
[0023] Figure 6 is a flowchart illustrating the training of an audio restoration model, as shown in an exemplary embodiment of this application;
[0024] Figure 7 is a block diagram illustrating an audio restoration apparatus according to an exemplary embodiment of this application;
[0025] Figure 8 is a schematic diagram of the structure of an electronic device shown in an exemplary embodiment of this application;
[0026] Figure 9 is a schematic diagram illustrating the structure of a computer-readable storage medium, as shown in an exemplary embodiment of this application.
Detailed Implementation
[0027] The embodiments of this application will now be described in detail with reference to the accompanying drawings.
[0028] In the following description, specific details such as particular system architectures, interfaces, and technologies are presented for illustrative purposes rather than for limiting purposes, in order to provide a thorough understanding of this application.
[0029] In this document, the term "and / or" merely describes an association between related objects and indicates that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " generally indicates that the preceding and following related objects have an "or" relationship. Furthermore, "multiple" in this document means two or more. Moreover, the term "at least one" in this document means any one of a plurality of objects or any combination of at least two of them. For example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C.
[0030] Please refer to Figure 1, which illustrates the operating environment of an audio restoration method provided in one embodiment of this application. This operating environment may include a terminal 10 and a server 20.
[0031] Terminal 10 includes, but is not limited to, mobile phones, computers, smart audio interaction devices, smart home appliances, vehicle terminals, aircraft, game consoles, e-book readers, multimedia playback devices, wearable devices, and other electronic devices.
[0032] Terminal 10 can have an application client installed. The application can be any application capable of providing audio restoration services. Optionally, this application includes, but is not limited to, map navigation applications, smart assistant applications, video applications, shopping applications, content sharing applications, etc., and this embodiment does not limit this. Furthermore, different applications may have different audio content and corresponding functions, which can be pre-configured according to actual needs, and this embodiment does not limit this. Optionally, the terminal 10 runs the client of the aforementioned application.
[0033] Server 20 provides background services to clients of applications in terminal 10. For example, server 20 can be a background server for the aforementioned applications. Server 20 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. Optionally, server 20 can simultaneously provide background services to applications in multiple terminals 10.
[0034] Optionally, terminal 10 and server 20 can communicate with each other via network 30. Terminal 10 and server 20 can be directly or indirectly connected via wired or wireless communication, which is not limited herein.
[0035] Optionally, server 20 may undertake the main audio restoration work and terminal 10 may undertake the secondary audio restoration work; or, server 20 may undertake the secondary audio restoration work and terminal 10 may undertake the main audio restoration work; or, server 20 or terminal 10 may each undertake the audio restoration work independently.
[0036] It is understood that in the specific implementation of this application, data such as audio to be restored and user information are involved. When the above embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0037] Please refer to Figure 2, which illustrates a flowchart of an audio restoration method provided in one embodiment of this application. This method can be applied to a computer device, which refers to an electronic device capable of data computation and processing. For example, the executing entity for each step may be terminal 10 or server 20 in the operating environment shown in Figure 1. It should be understood that this method can also be applied to other exemplary implementation environments and specifically executed by devices in other implementation environments. This embodiment does not limit the implementation environment to which this method is applicable.
[0038] The audio restoration method proposed in this application embodiment will be described in detail below, taking the server as the specific execution subject.
[0039] As shown in Figure 2, in an exemplary embodiment, the audio restoration method includes at least steps S210 to S240, which are described in detail below:
[0040] Step S210: Extract features from the audio to be restored to obtain the audio feature matrix.
[0041] The audio to be restored includes, but is not limited to, voice, songs, and ambient sounds, and this application embodiment does not limit this. For example, a user can control the device to perform corresponding operations through voice commands, such as controlling in-vehicle devices or virtual assistants. When the device detects a voice command, it performs audio recording processing to obtain the aforementioned audio to be restored.
[0042] The audio to be restored can be a segment of audio data in an audio stream or a complete segment of audio stream data; this application embodiment does not limit this.
[0043] The audio features of the audio to be restored are extracted to obtain an audio feature matrix. The audio features of the audio to be restored may include, but are not limited to, Mel-frequency cepstral coefficient (MFCC) features, Fbank features, wave features, gamma features, proso features, etc., and this application embodiment does not limit them.
[0044] For example, the audio to be restored is divided into several time periods to obtain multiple audio frames. Then, a short-time Fourier transform (STFT) is performed on each audio frame to obtain the energy value of the audio signal at different frequencies within each audio frame, thus obtaining the frequency feature representation of each audio frame. Then, the frequency feature representations of each audio frame are merged according to the time sequence of each audio frame to obtain the audio feature matrix corresponding to the audio to be restored.
[0045] For example, if the audio to be restored is a 10-second audio signal, this audio signal can be divided into 10 one-second audio frames, and then a short-time Fourier transform can be performed on each audio frame. If the sampling frequency within each audio frame is chosen to be 16kHz, and the window length within each audio frame is chosen to be 256 sampling points, then a two-dimensional matrix of 10 rows and 256 columns can be obtained. Each row in the two-dimensional matrix represents the energy value of the audio signal at different frequencies within each audio frame.
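The 10x256 example above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the application: the text does not say how the several 256-sample windows inside a one-second frame are reduced to a single row, so this sketch assumes that the per-window energy spectra are averaged and that all 256 FFT bins are kept. The names `fft` and `audio_feature_matrix` are hypothetical.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def audio_feature_matrix(signal, sample_rate=16000, frame_seconds=1.0, window=256):
    """Return one row of `window` energy values per audio frame.

    Assumption: the energy spectra of the windows inside a frame are averaged.
    """
    frame_len = int(sample_rate * frame_seconds)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        n_windows = frame_len // window  # trailing samples are dropped
        energy = [0.0] * window
        for w in range(n_windows):
            segment = frame[w * window:(w + 1) * window]
            for k, value in enumerate(fft(segment)):
                energy[k] += abs(value) ** 2
        rows.append([e / n_windows for e in energy])
    return rows

# 10 seconds of a 1 kHz tone sampled at 16 kHz (synthetic test input).
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(160000)]
features = audio_feature_matrix(tone)
print(len(features), len(features[0]))
```

Each row of `features` corresponds to one audio frame in time order, and each column to one frequency bin.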
[0046] Step S220: Extract local matrix elements from the audio feature matrix to obtain multiple feature submatrices.
[0047] The audio feature matrix consists of multiple matrix elements, and a local matrix element refers to a subset of the overall elements of the audio feature matrix.
[0048] Multiple feature submatrices are obtained by extracting local matrix elements from the audio feature matrix.
[0049] It should be noted that the selection of the feature submatrix requires consideration of multiple factors, and the method for selecting the feature submatrix is not fixed. It needs to be determined based on specific audio restoration requirements, the implementation environment of the audio restoration, and other factors. For example, the factors to be considered in selecting the feature submatrix include at least one of the following: the structure of the audio restoration model, the requirements of the audio restoration task, the signal-to-noise ratio of the audio to be restored, the type of audio to be restored, and the resources of the entity executing the audio restoration task.
[0050] The structure of the audio restoration model: Different neural network layers in the audio restoration model can process feature submatrices of different sizes. For example, if a deconvolution layer in the model can only process a 2x256 feature submatrix, then the audio feature matrix needs to be divided into multiple 2x256 feature submatrices. Similarly, if a convolutional layer in the audio restoration model has a 3x3 kernel, then a feature submatrix of corresponding size needs to be selected for the convolution operation. The size of each neural network layer used for audio restoration needs to match the size of the feature submatrices. If the feature submatrices are too small, they may not capture the key features in the audio to be restored; if the feature submatrices are too large, it may lead to excessive computation and overfitting during model training.
[0051] The requirements of audio restoration tasks: The selection of feature sub-matrices should also be determined based on the requirements of the audio restoration task. For example, if it is necessary to restore a long audio file, it may be necessary to divide the audio feature matrix into multiple larger feature sub-matrices to ensure that the complete audio signal can be processed.
[0052] Signal-to-noise ratio (SNR) of the audio to be restored: Audio is often affected by noise, so the SNR must be considered when selecting the feature submatrices. Smaller feature submatrices are usually more affected by noise, while larger feature submatrices may contain information from multiple signals, leading to information overlap and confusion. For example, if the audio to be restored was recorded in a low SNR environment, it may be necessary to divide the audio feature matrix into multiple smaller feature submatrices whose matrix elements overlap, to increase information redundancy and stability.
[0053] The type of audio to be restored: Different types of audio to be restored have different characteristics. For example, if the audio to be restored is speech, then the gender of the speaker, speaking speed, dialect, etc., will affect the spectral distribution and time-domain waveform of the audio to be restored. Therefore, appropriate feature submatrices can be selected according to different types.
[0054] Resources of the entity executing the audio restoration task: The size and selection of the feature submatrix must also consider the resource constraints of the task's execution entity, including both hardware and software resources. Larger feature submatrices require more memory and computational resources, while smaller feature submatrices may lead to slower convergence or lower accuracy during model training.
[0055] Therefore, the size and selection method of the feature submatrix can be flexibly adjusted according to specific circumstances in order to better meet the application requirements of audio restoration.
[0056] In some implementations, local matrix elements are extracted from the audio feature matrix to obtain multiple feature sub-matrices, including: obtaining audio parameters of the audio to be restored; determining a matrix element selection strategy based on the audio parameters; and using the matrix element selection strategy to extract local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices.
[0057] Matrix element selection strategies are used to limit the extraction method of local matrix elements, the size of feature submatrices, etc.
[0058] The matrix element selection strategy is determined by the audio parameters, and the local matrix elements in the audio feature matrix are extracted using the matrix element selection strategy to obtain multiple feature submatrices.
[0059] For example, the audio parameters of the audio to be restored include audio length parameters and audio frequency parameters; determining the matrix element selection strategy based on the audio parameters includes: determining the selection window size and window movement step size of the local matrix elements based on at least one of the audio length parameters and audio frequency parameters of the audio to be restored; and obtaining the matrix element selection strategy corresponding to the audio feature matrix based on the selection window size and window movement step size.
[0060] For example, an audio feature matrix is used to characterize the temporal and frequency domain features of the audio to be reconstructed. The size of the selection window for local matrix elements is determined based on the temporal and frequency domain features of the audio feature matrix, thereby determining the appropriate size of the feature submatrix. Generally, if the temporal features of the audio feature matrix represent a short duration of the audio to be reconstructed, a smaller local matrix element selection window can be selected, such as 2x256. If the frequency features of the audio feature matrix represent a low frequency of the audio to be reconstructed, a larger local matrix element selection window can be used, such as 8x256, to better capture the temporal features. Conversely, if the frequency features of the audio feature matrix represent a high frequency of the audio to be reconstructed, a smaller local matrix element selection window can be used, and the matrix elements in adjacent local matrix element selection windows can be overlapped, i.e., the window movement step size is smaller than the length of the local matrix element selection window, to better capture the frequency domain features.
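As a rough illustration of how such a matrix element selection strategy might be derived from audio parameters, the sketch below maps an audio length and a dominant frequency to a window size and a step. The thresholds (`short_s`, `high_hz`), the way the rules are prioritised when they conflict, and the names `SelectionStrategy` and `choose_strategy` are all assumptions made for illustration; the text gives only the 2x256 and 8x256 window examples.

```python
from dataclasses import dataclass

@dataclass
class SelectionStrategy:
    window_rows: int  # window height along the time axis
    window_cols: int  # window width along the frequency axis
    step_rows: int    # window movement step along the time axis

def choose_strategy(duration_s, dominant_hz, n_cols=256,
                    short_s=10.0, high_hz=1000.0):
    """Pick a window size and step from audio length and frequency.

    The threshold values are placeholders, not values from the source.
    """
    high = dominant_hz >= high_hz
    # Short or high-frequency audio: small window; otherwise a larger one.
    rows = 2 if (duration_s <= short_s or high) else 8
    # High-frequency audio: step smaller than the window, so windows overlap.
    step = max(1, rows // 2) if high else rows
    return SelectionStrategy(rows, n_cols, step)

print(choose_strategy(60.0, 200.0))
print(choose_strategy(60.0, 4000.0))
```

Returning the window size and step together keeps the strategy a plain value that the extraction step can consume without knowing how it was chosen.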
[0061] Furthermore, based on the selection window size and window movement step size of the local matrix elements, local matrix elements are extracted from the audio feature matrix to obtain multiple local matrix elements; these multiple local matrix elements are then recombined to obtain multiple feature submatrices.
[0062] The same feature submatrix contains at least one local matrix element. The local matrix elements contained in the same feature submatrix may or may not be in adjacent positions in the audio feature matrix, depending on the selection window size and window movement step size of the local matrix elements.
[0063] For example, please refer to Figure 3, which is a schematic diagram illustrating the selection of feature submatrices as shown in an exemplary embodiment of this application. As shown in Figure 3, the audio feature matrix is a 10x256 matrix. Suppose the matrix element selection strategy indicates that the selection window size of the local matrix elements is 2x256, that the window moves down 2 rows per step, and that each extracted local matrix element is used as a feature submatrix. The feature submatrices are then obtained by selecting adjacent rows of the audio feature matrix. This ensures that each selected feature submatrix covers a continuous time series, which is more in line with the time-domain characteristics of the audio signal.
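The window-based extraction in this example can be sketched as follows, assuming the window always spans the full width of the matrix and moves only along the rows. The function name `extract_submatrices` and the placeholder matrix contents are illustrative.

```python
def extract_submatrices(matrix, window_rows, step_rows):
    """Slide a full-width window down the rows of `matrix`."""
    last_start = len(matrix) - window_rows
    return [matrix[r:r + window_rows]
            for r in range(0, last_start + 1, step_rows)]

# Placeholder 10x256 audio feature matrix (row index used as the value).
feature_matrix = [[float(r)] * 256 for r in range(10)]

subs = extract_submatrices(feature_matrix, window_rows=2, step_rows=2)
print(len(subs), len(subs[0]), len(subs[0][0]))
```

With a step equal to the window height the feature submatrices do not overlap; a step smaller than the window height yields overlapping feature submatrices.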
[0064] In some implementations, the selection window size and window movement step size of the local matrix elements can be adjusted to adopt a non-adjacent row selection method, thereby increasing the diversity and randomness of the data and improving the robustness and generalization ability of the audio reconstruction model. For example, if the window movement step size is random, the selection window size of the local matrix elements is 1x256, and the size of the first feature submatrix to be selected is 2x256, then during the selection of local matrix elements of the first feature submatrix, local matrix elements in the first and third rows of the audio feature matrix are randomly selected, and the local matrix elements in the first and third rows are combined to obtain the first feature submatrix.
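A minimal sketch of this non-adjacent selection, assuming 1x256 local matrix elements (single rows) that are drawn at random and recombined in time order. The function name `random_row_submatrix` and the fixed seed are illustrative choices, not part of the source.

```python
import random

def random_row_submatrix(matrix, n_rows, rng):
    """Combine `n_rows` randomly chosen rows into one feature submatrix."""
    chosen = sorted(rng.sample(range(len(matrix)), n_rows))
    return chosen, [matrix[r] for r in chosen]

# Placeholder 10x256 audio feature matrix (row index used as the value).
feature_matrix = [[float(r)] * 256 for r in range(10)]

rng = random.Random(0)  # seeded only to make the sketch repeatable
rows, sub = random_row_submatrix(feature_matrix, 2, rng)
print(rows, len(sub), len(sub[0]))
```

The chosen row indices are returned alongside the submatrix so that the restored signals can later be placed back at the right positions in time.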
[0065] In some implementations, in addition to decomposing the rows of the audio feature matrix, the columns of the audio feature matrix can also be decomposed. It is important to note that when performing column decomposition, the size of the feature submatrices needs to take into account the temporal characteristics of the audio feature matrix.
[0066] In some implementations, rows or columns of the audio feature matrix can be selected based on the distribution characteristics of the data in the audio feature matrix to construct a feature submatrix, so as to extract more accurate feature information from the feature submatrix. For example, the weight of each matrix element in the audio feature matrix can be calculated based on an attention mechanism, and the row or column of the currently selected audio feature matrix can be determined based on the weight of each matrix element.
[0067] When selecting feature submatrices, reasonable optimization and adjustment are necessary based on specific circumstances to maximize the efficiency and accuracy of data analysis. Feature submatrices refine data features, allowing for more detailed handling of the relationships between different features and avoiding mutual interference. For example, frequency and time are two important and closely related features. Treating the entire audio feature matrix as a single unit for audio reconstruction would lead to mutual interference between frequency and time features, affecting the audio reconstruction effect. Therefore, the resulting feature submatrices allow for a more detailed understanding and processing of information across different frequency ranges and time durations. Furthermore, they enable better utilization of the correlation between feature submatrices, improving the precision and accuracy of audio reconstruction.
[0068] Step S230: Using the preset audio restoration parameters corresponding to each feature submatrix, perform audio restoration on each feature submatrix to obtain the initial audio restoration signal corresponding to each feature submatrix.
[0069] Different feature submatrices correspond to different preset audio restoration parameters.
[0070] In some implementations, the preset audio restoration parameters refer to the neural network parameters in the audio restoration model.
[0071] The audio restoration model includes a matrix factorization network and an audio restoration network. The audio restoration network includes multiple audio restoration structures. Extracting local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices includes: inputting the audio feature matrix into the matrix factorization network to obtain multiple feature sub-matrices output by the matrix factorization network. Using preset audio restoration parameters corresponding to each feature sub-matrix to perform audio restoration on each feature sub-matrix to obtain an initial audio restoration signal corresponding to each feature sub-matrix includes: obtaining a target audio restoration structure matching the feature sub-matrix and using the target audio restoration structure as the preset audio restoration parameters corresponding to the feature sub-matrix; and inputting the feature sub-matrix into the target audio restoration structure to obtain the initial audio restoration signal output by the target audio restoration structure.
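The matching of feature sub-matrices to audio restoration structures can be pictured as a lookup keyed by matrix dimensions. The sketch below is an assumption-laden illustration: the restoration structures are stand-in functions rather than trained networks, and the merge step simply concatenates the initial signals in order, which the text does not specify. All names (`shape_of`, `restore_all`, `merge`, `structures`) are hypothetical.

```python
def shape_of(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))

def restore_all(submatrices, structures):
    """Route each sub-matrix to the structure whose input shape matches it."""
    signals = []
    for sub in submatrices:
        structure = structures.get(shape_of(sub))
        if structure is None:
            raise ValueError(f"no restoration structure for shape {shape_of(sub)}")
        signals.append(structure(sub))
    return signals

def merge(signals):
    """Assumed merge: concatenate the initial restoration signals in order."""
    return [sample for signal in signals for sample in signal]

# Hypothetical stand-ins for trained audio restoration structures.
structures = {
    (2, 256): lambda sub: [sum(row) / len(row) for row in sub],
    (8, 256): lambda sub: [max(row) for row in sub],
}

subs = [[[1.0] * 256 for _ in range(2)], [[2.0] * 256 for _ in range(8)]]
result = merge(restore_all(subs, structures))
print(len(result))
```

Raising an error for an unmatched shape makes a missing restoration structure visible instead of silently skipping part of the audio.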
[0072] In some implementations, the matrix factorization network is obtained based on a recurrent convolutional neural network, and the audio reconstruction structure is obtained based on a convolutional neural network and a deconvolutional neural network.
[0073] For example, please see Figure 4, which is a schematic diagram illustrating the model structure of an audio restoration model as shown in an exemplary embodiment of this application. As shown in Figure 4, the audio restoration model includes an input layer, a recursive convolutional layer, an input convolutional layer, an upsampling layer, a pooling layer, a deconvolutional layer, and an output layer.
[0074] The input layer is used to divide the audio to be restored into frames and preprocess each frame, such as normalization, noise reduction, audio signal enhancement, and feature extraction, to obtain an audio feature matrix.
[0075] Recursive convolutional layers (RCLs) are responsible for decomposing the input audio feature matrix into multiple sub-matrices and recursively processing each sub-matrix. For example, in an RCL, the input audio feature matrix can be split into multiple sub-matrices according to the time or frequency dimension, and then a recursive convolution operation is performed on each sub-matrix. The recursive convolution operation further decomposes the sub-matrices until a certain depth is reached or the decomposed sub-matrices reach their minimum size, resulting in multiple feature sub-matrices. The method of splitting the sub-matrices in an RCL can be adjusted according to actual needs, such as splitting by time or frequency, or adjusted according to model performance. At the same time, attention must be paid to the size of the sub-matrices to fully utilize computational resources and ensure model performance.
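The stopping rule described here (a depth limit or a minimum sub-matrix size) can be sketched as below. This shows only the splitting control flow along the time dimension; the learned convolution applied at each level is omitted, and the halving rule, the function name `decompose`, and the default limits are assumptions for illustration.

```python
def decompose(matrix, min_rows=2, max_depth=3, depth=0):
    """Halve the matrix along the time axis until a depth or size limit is hit."""
    if depth >= max_depth or len(matrix) // 2 < min_rows:
        return [matrix]
    mid = len(matrix) // 2
    return (decompose(matrix[:mid], min_rows, max_depth, depth + 1)
            + decompose(matrix[mid:], min_rows, max_depth, depth + 1))

# Placeholder 10x256 audio feature matrix.
feature_matrix = [[0.0] * 256 for _ in range(10)]
parts = decompose(feature_matrix)
print([len(p) for p in parts])
```

Splitting along the frequency dimension would follow the same pattern, slicing columns instead of rows.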
[0076] In some implementations, a recursive convolutional layer consists of multiple recursive convolutional blocks. Each recursive convolutional block includes: a recursive layer that converts the input audio feature matrix into a feature submatrix through recursive convolution operations; a gate layer that controls the information flow of the recursive layer through a gating mechanism; and an attention layer that calculates the similarity between the audio features at the current time step and the audio features at each time step of the recursive convolutional layer, then uses the softmax activation function to convert the similarity into weight coefficients, and finally uses the weight coefficients to weight the audio features at each time step of the recursive convolutional layer to generate new audio feature representations. In this way, the model can adaptively learn which audio features at which time steps are more important for the audio reconstruction task.
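The attention computation described above (similarity, softmax, weighted combination) can be illustrated with a small sketch. The text does not name the similarity measure, so a dot product is assumed here; the function names `softmax` and `attend` are illustrative.

```python
import math

def softmax(scores):
    """Convert similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(current, steps):
    """Weight every time step by its dot-product similarity to `current`."""
    scores = [sum(c * s for c, s in zip(current, step)) for step in steps]
    weights = softmax(scores)
    width = len(current)
    context = [sum(w * step[i] for w, step in zip(weights, steps))
               for i in range(width)]
    return weights, context

steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], steps)
print(weights, context)
```

Time steps whose features resemble the current step receive larger weights and therefore contribute more to the new feature representation.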
[0077] The audio restoration network in the audio restoration model comprises multiple audio restoration structures based on convolutional neural networks and deconvolutional neural networks. For example, each audio restoration structure includes an input convolutional layer, an upsampling layer, a pooling layer, and a deconvolutional layer.
[0078] The input convolutional layer uses a convolutional neural network to perform convolution operations on the feature submatrix output by the recursive convolutional layer to extract local features. The input convolutional layer typically uses one-dimensional convolution operations with a kernel size of k and a number of kernels of m. With a stride of 1 and no padding, each kernel generates a feature sequence of length n-k+1, where n is the length of the input feature submatrix along the convolved dimension.
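The output length n-k+1 can be checked with a short sketch of a one-dimensional convolution with stride 1 and no padding. The helper names `conv1d_valid` and `conv_layer` are hypothetical, and the kernels are fixed values chosen for illustration rather than learned weights. As is usual in neural networks, the kernel is applied without flipping (cross-correlation).

```python
from typing import List


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """Slide `kernel` over `signal` with stride 1 and no padding."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]


def conv_layer(signal: List[float], kernels: List[List[float]]) -> List[List[float]]:
    """Apply m kernels, producing m feature sequences of length n-k+1 each."""
    return [conv1d_valid(signal, kernel) for kernel in kernels]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]                 # n = 5
kernels = [[1.0, 0.0, -1.0], [1.0, 1.0, 1.0]]      # m = 2, k = 3
outputs = conv_layer(signal, kernels)
print(outputs)  # [[-2.0, -2.0, -2.0], [6.0, 9.0, 12.0]]
```

Each of the two output sequences has length 5 - 3 + 1 = 3.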
[0079] After the input convolutional layer, an upsampling layer is needed to convert the low-resolution feature sequence obtained after the convolution operation into a high-resolution audio signal. Upsampling can be implemented using interpolation methods (such as bilinear interpolation or trilinear interpolation) or deconvolution operations.
[0080] The pooling layer performs pooling operations on the upsampled feature sequence to reduce the feature dimension (the pooling operation can be max pooling or average pooling).
[0081] The deconvolutional layer uses deconvolution operations to restore the pooled features to the original audio feature matrix size.
[0082] The output layer performs post-processing on the deconvolved audio feature matrix, such as removing invalid frames, windowing, and restoring the sampling rate, to obtain the restored initial audio signal.
[0083] In some implementations, a residual connection is added between the recursive convolutional layer and the input convolutional layer to improve the training efficiency of the model and reduce overfitting.
[0084] The audio feature matrix input to the audio reconstruction model is decomposed into multiple feature sub-matrices by the recursive convolutional layers, and the feature sub-matrices are then fed into different audio reconstruction structures for processing. This increases the non-linear expressive power of the model and improves the feature processing effect. At the same time, by decomposing the matrix and processing the sub-matrices separately, the computational complexity and memory consumption of the model can be reduced, thereby improving the training speed and running efficiency of the model.
[0085] In some implementations, obtaining the target audio reconstruction structure that matches the feature submatrix includes: obtaining the matrix dimension information of the feature submatrix and the structural parameters of each audio reconstruction structure; and using the audio reconstruction structure whose structural parameters match the matrix dimension information of the feature submatrix as the target audio reconstruction structure that matches the feature submatrix.
[0086] In the audio reconstruction model, the audio reconstruction structures are concatenated in a specific order. Each reconstruction structure processes the input feature submatrix and produces an output, which is then fed into the next reconstruction structure as input, until the final output of the audio signal reconstructed from each feature submatrix is generated. Within each reconstruction structure, the parameters of the convolutional and deconvolutional layers are shared.
[0087] The audio reconstruction structure that matches the feature submatrix is dynamically selected based on the size and dimension of the feature submatrix to maximize model performance.
[0088] For example, the process of determining the target audio reconstruction structure that matches the feature submatrix may include:
[0089] First, determine the size of the current feature submatrix.
[0090] Secondly, determine the size of the feature matrix that each convolutional or deconvolutional layer in the audio reconstruction structure can process. This is typically determined by the parameters of the convolutional or deconvolutional layers in the audio reconstruction structure, such as kernel size and stride size. Therefore, for each convolutional or deconvolutional layer in the audio reconstruction structure, it is necessary to pre-determine the input and output sizes that they can process.
[0091] Finally, based on the size of the current feature submatrix and the size of the feature matrix that each convolutional or deconvolutional layer in the audio reconstruction structure can process, it is possible to determine which convolutional or deconvolutional layer in the audio reconstruction structure the current feature submatrix should be input into, thus obtaining the target audio reconstruction structure that matches the feature submatrix.
[0092] Typically, the size of the feature submatrix should be equal to or a multiple of the size of the feature matrix that the convolutional or deconvolutional layer can process. This ensures that the feature submatrix can be completely input into the convolutional or deconvolutional layer for processing.
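The matching procedure above can be sketched as a lookup over the input sizes that each audio reconstruction structure accepts. The class `RestorationStructure`, its `input_shape` field, and the rule of preferring an exact match over a multiple are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Shape = Tuple[int, int]  # (rows, columns)


@dataclass
class RestorationStructure:
    name: str
    input_shape: Shape  # size the first convolutional layer can process (assumed known in advance)


def is_compatible(sub_shape: Shape, input_shape: Shape) -> bool:
    """True if every dimension equals, or is a whole multiple of, the accepted size."""
    return all(s >= i and s % i == 0 for s, i in zip(sub_shape, input_shape))


def select_structure(sub_shape: Shape,
                     structures: List[RestorationStructure]) -> Optional[RestorationStructure]:
    """Return the target structure for a feature submatrix, or None if nothing matches."""
    for structure in structures:          # assumption: an exact match is preferred
        if structure.input_shape == sub_shape:
            return structure
    for structure in structures:          # otherwise accept a whole multiple
        if is_compatible(sub_shape, structure.input_shape):
            return structure
    return None


structures = [RestorationStructure("small", (2, 256)),
              RestorationStructure("medium", (4, 256)),
              RestorationStructure("large", (8, 256))]
print(select_structure((4, 256), structures).name)  # medium
```

A return value of `None` corresponds to the case where the feature submatrix must first be adjusted before it can be processed.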
[0093] In some implementations, if the size of the feature submatrix does not match the size of the feature matrix that a certain convolutional or deconvolutional layer can process, appropriate adjustments need to be made, such as padding or cropping the feature submatrix to meet the requirements of the convolutional or deconvolutional layer.
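A minimal sketch of such an adjustment is given below. Zero padding at the end of each dimension and cropping from the end are assumptions; this application does not specify where padding is applied or which value is used. The function name `fit_to_shape` is hypothetical.

```python
from typing import List


def fit_to_shape(matrix: List[List[float]], rows: int, cols: int,
                 pad_value: float = 0.0) -> List[List[float]]:
    """Crop or pad `matrix` so that it has exactly rows x cols elements."""
    fitted = []
    for r in range(rows):
        if r < len(matrix):
            row = list(matrix[r][:cols])                 # crop extra columns
            row += [pad_value] * (cols - len(row))       # pad missing columns
        else:
            row = [pad_value] * cols                     # pad missing rows
        fitted.append(row)
    return fitted


print(fit_to_shape([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]], 2, 2))
# [[1.0, 2.0], [4.0, 5.0]]
```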
[0094] For example, suppose an audio reconstruction structure contains a convolutional layer with an input size of 2x256, 64 output channels, and a kernel size of 3x3. A feature submatrix of size 2x256 can then be input directly into that convolutional layer, and convolution operations can be performed on it using the kernels of that layer.
[0095] In some implementations, if the size of the feature submatrix matches the input size of multiple audio reconstruction structures, then a suitable audio reconstruction structure can be selected from multiple audio reconstruction structures based on the characteristics of the model design and training data to achieve better audio reconstruction results.
[0096] For example, factors that can also be considered in the selection of audio restoration structure include:
[0097] Model structure: There may be relationships between different convolutional or deconvolutional layers in the model, such as residual connections. When selecting convolutional or deconvolutional layers, the model structure can be considered to ensure that each feature submatrix is input into a suitable audio reconstruction structure.
[0098] Model training results: The training results of convolutional or deconvolutional layers affect their feature extraction or feature reconstruction capabilities. Therefore, when choosing convolutional or deconvolutional layers, it is necessary to consider the training results of each audio reconstruction structure and select the best-performing audio reconstruction structure.
[0099] Furthermore, factors such as the number of parameters, receptive field size, and computational cost of each audio reconstruction structure can also be considered, but this application does not impose any limitations on these factors.
[0100] By comprehensively considering the above factors, it can be determined which audio reconstruction structure the feature submatrix should be input into. For example, if the size of the feature submatrix is small and simpler features need to be extracted, a convolutional layer with a smaller kernel size can be selected; if the size of the feature submatrix is large and more complex features need to be extracted, a convolutional layer with a larger kernel size can be selected.
[0101] Inputting the decomposed feature sub-matrices into different audio reconstruction structures allows each structure to focus on processing feature sub-matrices at different scales, thus improving audio reconstruction quality. For example, low-frequency audio signals, which change slowly, can be processed using larger convolutional kernels; while high-frequency audio signals, which change rapidly, can be processed using smaller convolutional kernels. Directly inputting the entire audio feature matrix into a single convolutional network may result in poor reconstruction quality because the kernel size is not flexible enough to handle features at different scales simultaneously.
[0102] In addition, convolutional or deconvolutional layers in different audio restoration structures have different receptive fields and feature extraction capabilities. By processing different feature sub-matrices in different convolutional or deconvolutional layers, the feature extraction capabilities of different convolutional or deconvolutional layers can be utilized more fully, local features can be captured better, and thus the accuracy and quality of audio restoration can be improved.
[0103] Furthermore, in audio restoration scenarios, the input audio to be restored may contain noise and interference, which may negatively impact the model's performance. By splitting the matrix and inputting it into different convolutional or deconvolutional networks for processing, the model can pay more attention to the details in the audio and better remove noise and resist interference.
[0104] It should be understood that the audio restoration model shown in Figure 4 is only illustrative. In practical applications, the audio restoration model may contain more or fewer model structures, and this application does not limit this.
[0105] Step S240: Merge each initial audio restoration signal to obtain the audio restoration result corresponding to the audio to be restored.
[0106] By merging each initial audio restoration signal, the audio restoration result corresponding to the audio to be restored is obtained.
[0107] Please refer to Figure 5, which is a schematic diagram illustrating an audio restoration method according to an exemplary embodiment of this application. As shown in Figure 5, the input audio feature matrix is decomposed into n feature sub-matrices, which are then fed into different audio reconstruction structures for processing. For example, the audio feature matrix is a 10x256 two-dimensional matrix, where each row represents the energy values of the audio signal at different frequencies within the corresponding time period. This two-dimensional matrix is then decomposed into multiple feature sub-matrices, for example three feature sub-matrices of sizes 2x256, 4x256, and 8x256. Each feature sub-matrix is then fed into a different audio reconstruction structure for processing.
[0108] The audio reconstruction structure consists of convolutional and deconvolutional layers. The convolutional layer performs convolution operations on the input using a set of convolutional kernels, thereby extracting features such as time and frequency of the audio signal. The deconvolutional layer performs deconvolution operations on the input using a set of deconvolutional kernels, thereby gradually restoring the time, frequency, and other features of the audio signal back to the original audio signal.
[0109] For example, the three feature sub-matrices mentioned above are fed into three different audio reconstruction structures for different audio reconstruction processes, resulting in initial audio reconstruction signals for each feature sub-matrix. Finally, the initial audio reconstruction signals are combined to obtain the complete audio reconstruction result corresponding to the audio to be reconstructed.
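This application does not specify how the initial audio reconstruction signals are combined. One possible approach, shown here purely as an assumption, is to place each initial signal at its starting position on a common time axis and average the samples wherever segments overlap. The function name `merge_signals` and the `(start_index, samples)` representation are hypothetical.

```python
from typing import List, Tuple


def merge_signals(segments: List[Tuple[int, List[float]]], total_length: int) -> List[float]:
    """Merge (start_index, samples) segments; overlapping samples are averaged."""
    sums = [0.0] * total_length
    counts = [0] * total_length
    for start, samples in segments:
        for offset, value in enumerate(samples):
            index = start + offset
            if 0 <= index < total_length:
                sums[index] += value
                counts[index] += 1
    # Positions not covered by any segment are left as silence (0.0).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


merged = merge_signals([(0, [1.0, 1.0, 1.0]), (2, [3.0, 3.0])], total_length=4)
print(merged)  # [1.0, 1.0, 2.0, 3.0]
```

Averaging the overlap is a simple choice; other merging rules, such as weighted cross-fading or plain concatenation of non-overlapping segments, would fit the description equally well.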
[0110] Next, this application further describes the training process of the audio restoration model:
[0111] Please refer to Figure 6, which is a flowchart illustrating the training process of an audio restoration model according to an exemplary embodiment of this application. As shown in Figure 6, the training of the audio restoration model includes steps S610 to S640:
[0112] Step S610: Obtain the audio of the training samples.
[0113] The training sample audio can be preprocessed audio data, and each training sample audio corresponds to a sample reconstruction audio.
[0114] In addition, when training the model, a variety of training sample audios can be used to cover different types of audio to ensure that the model has a certain generalization ability.
[0115] Step S620: Input the training sample audio into the initial audio restoration model to obtain the training audio restoration result output by the initial audio restoration model.
[0116] For the structure of the initial audio restoration model, refer to Figure 4. For the process by which the initial audio restoration model processes the input audio, refer to steps S210 to S240; it will not be described in detail here.
[0117] Step S630: Calculate the loss value based on the training audio reconstruction results and the sample reconstruction audio corresponding to the training sample audio.
[0118] The loss function used to calculate the loss value can be the mean square error (MSE) function, which quantifies the difference between the sample restored audio and the restored training audio as the loss.
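[0119] For example, let $x_i$ be the $i$-th audio sample to be reconstructed and $\hat{x}_i$ the corresponding restored audio signal. The loss function can then be expressed as:
[0120] $$L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2$$
[0121] where $N$ is the number of training sample audio files.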
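The mean square error over a set of training samples can be computed as in the following sketch. Averaging over the samples of each signal before averaging over the N signals is an assumption about normalization; the function name `mse_loss` is hypothetical.

```python
from typing import List


def mse_loss(targets: List[List[float]], outputs: List[List[float]]) -> float:
    """Mean square error between target signals and restored signals."""
    if not targets or len(targets) != len(outputs):
        raise ValueError("targets and outputs must be non-empty and of equal length")
    total = 0.0
    for target, output in zip(targets, outputs):
        # Per-signal error, averaged over its samples (assumed normalization).
        total += sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)
    return total / len(targets)  # average over the N training samples


print(mse_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]]))  # 1.0
```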
[0122] It should be understood that the initial audio reconstruction model contains multiple layers of networks. The loss function can be calculated for each layer, and the losses of all layers are then summed to obtain the loss of the entire initial audio reconstruction model. For example, let $K$ be the total number of layers in the initial audio reconstruction model, and let $L_k$ be the loss value of the $k$-th layer. Then the loss of the entire network can be expressed as:
[0123] $$L_{\mathrm{total}} = \sum_{k=1}^{K} L_k$$
[0124] In some implementations, to prevent overfitting during model training, a regularization term, such as L2 regularization, can be added to the loss function, as follows:
[0125] $$L_{\mathrm{reg}} = \lambda \sum_{k=1}^{K}\sum_{p}\sum_{q}\left\lVert W_k^{(p,q)}\right\rVert^2$$
[0126] where $W_k^{(p,q)}$ represents the weight matrix connecting the $p$-th input channel and the $q$-th output channel in the $k$-th layer, and $\lambda$ is the regularization coefficient. Therefore, the loss function for model training can be:
[0127] $$L = L_{\mathrm{total}} + L_{\mathrm{reg}}$$
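The combined training loss (sum of per-layer losses plus an L2 regularization term) can be sketched as follows. The nested-list weight layout and the function names `l2_penalty` and `training_loss` are assumptions for illustration; a real model would hold one weight matrix per channel pair rather than a flat matrix per layer.

```python
from typing import List

Weights = List[List[float]]  # one weight matrix per layer (simplified layout, assumed)


def l2_penalty(weights_per_layer: List[Weights], reg_coeff: float) -> float:
    """Regularization term: coefficient times the sum of squared weights."""
    squared = sum(w * w for layer in weights_per_layer for row in layer for w in row)
    return reg_coeff * squared


def training_loss(layer_losses: List[float], weights_per_layer: List[Weights],
                  reg_coeff: float) -> float:
    """Total loss = sum of per-layer losses + L2 regularization term."""
    return sum(layer_losses) + l2_penalty(weights_per_layer, reg_coeff)


loss = training_loss([0.5, 0.25], [[[1.0, 2.0]], [[3.0]]], reg_coeff=0.1)
print(round(loss, 6))  # 2.15
```

A larger regularization coefficient penalizes large weights more strongly, trading some fit on the training data for better generalization.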
[0128] Step S640: Adjust the parameters of the initial audio restoration model according to the calculated loss value, and iteratively execute steps S610 to S640 until the preset model training stopping condition is met, and obtain the trained audio restoration model.
[0129] The preset model training stopping condition can be either reaching a preset number of iterations or the loss value being less than a preset loss value; this application does not impose any restrictions on this.
[0130] By adjusting the parameters of the initial audio reconstruction model using the loss value, the audio signal output by the model can be made closer to the desired audio signal, thereby improving the model's performance.
[0131] The initial audio reconstruction model parameters can be adjusted based on the calculated loss value using stochastic gradient descent (SGD). In each iteration, SGD randomly selects a mini-batch of data from the training audio sample set, calculates the gradient of this mini-batch, and then uses gradient descent to update the model parameters. This reduces computational cost and accelerates model convergence.
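The mini-batch update loop and the two stopping conditions (a preset number of iterations, or a loss below a preset value) can be illustrated with a toy example. The model here is a one-parameter-pair linear fit rather than an audio restoration model, and all names and hyperparameter values (`sgd_fit`, `lr`, `batch_size`, `max_iters`, `loss_threshold`) are assumptions chosen for the sketch.

```python
import random
from typing import List, Tuple


def sgd_fit(data: List[Tuple[float, float]], lr: float = 0.05, batch_size: int = 4,
            max_iters: int = 2000, loss_threshold: float = 1e-6,
            seed: int = 0) -> Tuple[float, float, float]:
    """Fit y = w * x + b with mini-batch stochastic gradient descent on the MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(max_iters):                      # stop condition 1: iteration budget
        batch = rng.sample(data, batch_size)        # randomly selected mini-batch
        grad_w = grad_b = 0.0
        loss = 0.0
        for x, y in batch:
            error = (w * x + b) - y
            loss += error * error / batch_size
            grad_w += 2.0 * error * x / batch_size  # d(loss)/dw
            grad_b += 2.0 * error / batch_size      # d(loss)/db
        if loss < loss_threshold:                   # stop condition 2: loss small enough
            break
        w -= lr * grad_w                            # gradient descent update
        b -= lr * grad_b
    return w, b, loss


# Noise-free toy data generated from y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(-10, 11)]
w, b, final_loss = sgd_fit(data)
print(round(w, 2), round(b, 2))
```

Because each update uses only a small batch, the cost per iteration is independent of the size of the training set.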
[0132] In some implementations, model optimization methods, such as learning rate decay and regularization, can be used during model training to improve model performance. For example:
[0133] Model structure optimization: Improve model performance by adjusting the number of network layers, the size of convolutional kernels, and the number of recurrent layers.
[0134] Learning rate optimization: The learning rate controls the step size of parameter updates. During training, learning rate adjustment strategies (such as learning rate decay and learning rate restart) can be used to improve model performance and generalization ability.
[0135] Data augmentation optimization: Data augmentation increases the quantity and diversity of training data to improve the model's generalization ability. In audio restoration, data augmentation methods, such as adding noise or reverberation, can be used to increase the diversity and complexity of the training data.
[0136] Model ensemble optimization: Model ensembling improves the generalization ability and robustness of a model. In audio restoration, multiple models can be ensembled to enhance the audio restoration results.
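Of the optimization methods listed above, learning rate decay and learning rate restart lend themselves to a short sketch. Step-wise exponential decay with a periodic restart is only one possible schedule; the function name `learning_rate` and all parameter values are assumptions.

```python
from typing import Optional


def learning_rate(step: int, base_lr: float = 0.1, decay: float = 0.5,
                  decay_every: int = 10, restart_every: Optional[int] = None) -> float:
    """Step decay: multiply the rate by `decay` every `decay_every` steps.

    If `restart_every` is set, the schedule starts over at that interval
    (learning rate restart).
    """
    if restart_every:
        step = step % restart_every
    return base_lr * decay ** (step // decay_every)


for step in (0, 10, 25):
    print(step, learning_rate(step))
print(20, learning_rate(20, restart_every=20))  # schedule restarts at the base rate
```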
[0137] In some implementations, a test sample audio set can also be used to evaluate the model's audio restoration performance, for example by calculating metrics such as accuracy and recall. Other evaluation metrics include:
[0138] Signal-to-noise ratio (SNR): SNR is a commonly used metric for measuring the quality of audio restoration. The performance of a model can be evaluated by calculating the SNR between the restored audio and the original audio input to the model. Generally, the higher the SNR, the better the restoration quality.
[0139] In addition to signal-to-noise ratio, other audio quality evaluation metrics can be used to evaluate model performance, such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
[0140] Subjective listening evaluation: Subjective listening evaluation is an intuitive method for evaluating audio quality. Professional listeners or ordinary users assess the listening experience of the restored audio in order to evaluate the performance of the model.
[0141] Time and space complexity: The time and space complexity of a model in audio reconstruction are also important metrics for evaluating model performance. Generally, the lower the time and space complexity, the better the model's performance.
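Among the metrics above, the signal-to-noise ratio is straightforward to compute. The sketch below uses the common definition in decibels, treating the difference between the reference audio and the restored audio as noise; the function name `snr_db` is hypothetical, and both signals are assumed to be time-aligned and of equal length.

```python
import math
from typing import List


def snr_db(reference: List[float], restored: List[float]) -> float:
    """SNR in dB: 10 * log10(signal power / power of the restoration error)."""
    if len(reference) != len(restored):
        raise ValueError("signals must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, restored))
    if signal_power == 0.0:
        raise ValueError("reference signal is silent; SNR is undefined")
    if noise_power == 0.0:
        return math.inf  # perfect restoration
    return 10.0 * math.log10(signal_power / noise_power)


print(round(snr_db([1.0, 1.0, 1.0, 1.0], [0.9, 1.1, 0.9, 1.1]), 3))  # 20.0
```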
[0142] The audio restoration method of this application divides the audio feature matrix into multiple feature sub-matrices and performs audio restoration on each feature sub-matrix separately, using the preset audio restoration parameters corresponding to that feature sub-matrix. This allows different processing methods to be applied to different features, improving the accuracy of audio restoration. For example, using different preset audio restoration parameters for low-frequency and high-frequency information can more accurately restore the low-frequency and high-frequency components of the audio signal. Furthermore, processing different feature sub-matrices with different preset audio restoration parameters makes better use of computational resources. In audio restoration scenarios, the audio sampling rate is usually high and the input audio signal is long, so a large amount of computational resources is required for processing. Decomposing the audio feature matrix into multiple feature sub-matrices allows the computational task to be distributed across multiple computing devices for parallel processing, improving computational efficiency. In addition, an irregularly shaped audio signal can be divided into multiple regularly shaped feature sub-matrices that are processed separately, which improves the audio restoration effect for such signals.
[0143] Figure 7 is a block diagram illustrating an audio restoration apparatus according to an exemplary embodiment of this application. As shown in Figure 7, the exemplary audio restoration device 700 includes: a feature extraction module 710, a matrix splitting module 720, an audio restoration module 730, and a result generation module 740. Specifically:
[0144] The feature extraction module 710 is used to extract features from the audio to be restored and obtain an audio feature matrix.
[0145] The matrix splitting module 720 is used to extract local matrix elements from the audio feature matrix to obtain multiple feature sub-matrices.
[0146] The audio restoration module 730 is used to perform audio restoration on each feature submatrix using the preset audio restoration parameters corresponding to each feature submatrix, so as to obtain the initial audio restoration signal corresponding to each feature submatrix.
[0147] The result generation module 740 is used to merge each initial audio restoration signal to obtain the audio restoration result corresponding to the audio to be restored.
[0148] In the above-described exemplary audio restoration device, feature extraction is performed on the audio to be restored to obtain an audio feature matrix. Local matrix elements are extracted from the audio feature matrix to obtain multiple feature sub-matrices. The feature sub-matrices refine the data features, allowing the relationships between different features to be handled in more detail and avoiding mutual interference between different features. Using the preset audio restoration parameters corresponding to each feature sub-matrix, audio restoration is performed on each feature sub-matrix separately to obtain an initial audio restoration signal corresponding to each feature sub-matrix. Different processing methods are thus applied to different features, improving the accuracy of audio restoration. Each initial audio restoration signal is then merged to obtain the audio restoration result corresponding to the audio to be restored, thereby improving the precision and accuracy of audio restoration.
[0149] The functions of each module can be found in the audio restoration method embodiment, and will not be repeated here.
[0150] Please refer to Figure 8, which is a schematic diagram of the structure of an embodiment of the electronic device of this application. The electronic device 800 includes a memory 801 and a processor 802. The processor 802 is used to execute program instructions stored in the memory 801 to implement the steps in any of the above-described audio restoration method embodiments. In a specific implementation scenario, the electronic device 800 may include, but is not limited to, a microcomputer or a server. In addition, the electronic device 800 may also include mobile devices such as laptops and tablets, which are not limited here.
[0151] Specifically, processor 802 controls itself and memory 801 to implement the steps in any of the above-described audio restoration method embodiments. Processor 802 can also be referred to as a CPU (Central Processing Unit). Processor 802 may be an integrated circuit chip with signal processing capabilities. Processor 802 can also be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor. Furthermore, processor 802 can be implemented using integrated circuit chips.
[0152] Please refer to Figure 9, which is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. The computer-readable storage medium 900 stores program instructions 910 that can be executed by a processor. The program instructions 910 are used to implement the steps in any of the above-described audio restoration method embodiments.
[0153] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. For the specific implementation, refer to the description of the above method embodiments; for the sake of brevity, it will not be repeated here.
[0154] The description of the various embodiments above tends to emphasize the differences between them. For parts that are identical or similar, the embodiments may be referred to one another; for the sake of brevity, these parts will not be repeated here.
[0155] In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus implementations described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or other forms.
[0156] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Claims
1. An audio restoration method, characterized in that an audio restoration model includes a matrix decomposition network and an audio restoration network, the audio restoration network includes multiple audio restoration structures, and the matrix decomposition network includes recursive convolutional layers, each composed of multiple recursive convolutional blocks; the method includes: performing feature extraction on the audio to be restored to obtain an audio feature matrix; inputting the audio feature matrix into the matrix decomposition network, wherein the recursive convolutional layer in the matrix decomposition network splits the audio feature matrix into multiple sub-matrices according to the time or frequency dimension, and a recursive convolution operation is performed on each sub-matrix to further decompose the sub-matrix until the decomposed sub-matrix reaches the minimum size, thereby obtaining multiple feature sub-matrices; wherein the recursive convolution operation in the recursive convolutional block includes: converting the input audio features into a feature submatrix through the recursive layer in the recursive convolutional block; controlling the information flow of the recursive layer through a gating layer using a gating mechanism; calculating the similarity between the audio features at the current time and the audio features at each time through an attention layer; converting the similarity into weight coefficients using a softmax activation function; and weighting the audio features at each time using the weight coefficients to generate a feature submatrix with new audio feature representations; wherein the step of splitting the audio feature matrix into multiple sub-matrices according to the time or frequency dimension includes: determining the selection window size and window movement step size of the local matrix elements based on at least one of the audio length parameter and audio frequency parameter of the audio to be restored; extracting local matrix elements from the audio feature matrix based on the selection window size and window movement step size to obtain multiple local matrix elements; and recombining the multiple local matrix elements to obtain multiple feature sub-matrices, wherein the local matrix elements contained in the feature sub-matrices are either in adjacent positions or not in adjacent positions in the audio feature matrix; obtaining the matrix dimension information of the feature submatrix and the structural parameters of each audio restoration structure; taking the audio restoration structure whose structural parameters match the matrix dimension information of the feature submatrix as the target audio restoration structure that matches the feature submatrix; inputting the feature submatrix into the target audio restoration structure to obtain the initial audio restoration signal output by the target audio restoration structure; and merging each initial audio restoration signal to obtain the audio restoration result corresponding to the audio to be restored.
2. The method of claim 1, wherein the audio restoration structure is obtained based on convolutional neural networks and deconvolutional neural networks.
3. An audio restoration device, characterized in that an audio restoration model includes a matrix decomposition network and an audio restoration network, the audio restoration network includes multiple audio restoration structures, and the matrix decomposition network includes recursive convolutional layers, each composed of multiple recursive convolutional blocks; the device includes: a feature extraction module, used to extract features from the audio to be restored and obtain an audio feature matrix; a matrix splitting module, used to input the audio feature matrix into the matrix decomposition network, wherein the recursive convolutional layer in the matrix decomposition network splits the audio feature matrix into multiple sub-matrices according to the time or frequency dimension, and a recursive convolution operation is performed on each sub-matrix to further decompose it until the decomposed sub-matrix reaches the minimum size, thereby obtaining multiple feature sub-matrices; wherein the recursive convolution operation in the recursive convolutional block includes: converting the input audio features into a feature submatrix through the recursive layer in the recursive convolutional block; controlling the information flow of the recursive layer through a gating layer using a gating mechanism; calculating the similarity between the audio features at the current time and the audio features at each time through an attention layer; converting the similarity into weight coefficients using a softmax activation function; and weighting the audio features at each time using the weight coefficients to generate a feature submatrix with new audio feature representations; wherein the step of splitting the audio feature matrix into multiple sub-matrices according to the time or frequency dimension includes: determining the selection window size and window movement step size of the local matrix elements based on at least one of the audio length parameter and audio frequency parameter of the audio to be restored; extracting local matrix elements from the audio feature matrix based on the selection window size and window movement step size to obtain multiple local matrix elements; and recombining the multiple local matrix elements to obtain multiple feature sub-matrices, wherein the local matrix elements contained in the feature sub-matrices are either adjacent or non-adjacent in the audio feature matrix; an audio restoration module, used to obtain the matrix dimension information of the feature submatrix and the structural parameters of each audio restoration structure, to take the audio restoration structure whose structural parameters match the matrix dimension information of the feature submatrix as the target audio restoration structure that matches the feature submatrix, and to input the feature submatrix into the target audio restoration structure to obtain the initial audio restoration signal output by the target audio restoration structure; and a result generation module, used to merge each initial audio restoration signal to obtain the audio restoration result corresponding to the audio to be restored.
4. A computer-readable storage medium having program instructions stored thereon, characterized in that, when the program instructions are executed by a processor, they implement the method of any one of claims 1 to 2.
5. An electronic device, characterized in that it includes a memory and a processor, the processor being configured to execute program instructions stored in the memory to implement the method of any one of claims 1 to 2.