Timbre conversion method and device, storage medium and computer device

By extracting semantic information from the audio data and cross-processing it with the target timbre feature vector during timbre conversion, the method addresses the low accuracy and single-language limitation of existing timbre conversion technologies, achieving accurate and widely applicable timbre conversion.

CN116312583B · Active · Publication Date: 2026-06-16 · BEIJING ZHIMEIYUANSU TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING ZHIMEIYUANSU TECH CO LTD
Filing Date
2023-02-17
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, timbre conversion algorithms cannot effectively preserve the tone of the original speech, which results in low timbre conversion accuracy, and they support conversion for only a single language.

Method used

By acquiring the audio data to be converted and the target timbre, a semantic information vector is extracted using a preset semantic prediction model. This vector is then cross-processed with the target timbre feature vector and input into a preset timbre conversion model for timbre conversion, generating audio with the target timbre and original intonation.

Benefits of technology

It improves the accuracy of timbre conversion, is applicable to any language, avoids the problem of needing to train a model for each language, expands the scope of application, and improves conversion efficiency.

✦ Generated by Eureka AI based on patent content.

Figure CN116312583B_ABST

Abstract

The application discloses a timbre conversion method and device, a storage medium and computer equipment, relates to the technical field of artificial intelligence, and mainly aims at improving the accuracy of timbre conversion. The method comprises the following steps: obtaining audio data to be converted and a target timbre; inputting the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain a semantic information vector corresponding to the audio data to be converted; determining a timbre feature vector corresponding to the target timbre; performing cross processing on the semantic information vector and the timbre feature vector to obtain a timbre cross feature vector; and inputting the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted. The application is suitable for timbre conversion.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method, apparatus, storage medium, and computer device for timbre conversion.

Background Technology

[0002] With the rapid development of AI (Artificial Intelligence) technology in content creation, AI has evolved from merely a tool to assist in content creation to AIGC (AI Generated Content), which can now independently complete creative tasks such as dialogue chat and video generation. The speed of this evolution is remarkable. AIGC is now widely used in various scenarios; for example, the application of timbre replication technology in short video scenarios makes converting audio into audio with a specified timbre particularly important.

[0003] Currently, the common approach is to break down the text in speech into phonemes to train an algorithm, and then use the trained algorithm for timbre conversion. However, because this algorithm operates on text, it cannot capture the tone of the original speech. The converted speech therefore lacks the tone of the original speech, which lowers the accuracy of timbre conversion.

Summary of the Invention

[0004] This invention provides a timbre conversion method, apparatus, storage medium, and computer equipment, which mainly improves the accuracy of timbre conversion.

[0005] According to a first aspect of the present invention, a timbre conversion method is provided, comprising:

[0006] Obtain the audio data to be converted and the target timbre;

[0007] The audio data to be converted is input into a preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted.

[0008] Determine the timbre feature vector corresponding to the target timbre;

[0009] The semantic information vector and the timbre feature vector are cross-processed to obtain the timbre cross feature vector;

[0010] The timbre cross feature vector is input into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is an audio with the target timbre and the original tone in the audio data to be converted.

[0011] Preferably, before inputting the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted, the method further includes:

[0012] Construct a preset initial semantic prediction model;

[0013] Obtain sample audio data and the actual semantic information vector corresponding to the sample audio data;

[0014] The sample audio data is input into the preset initial semantic prediction model for semantic prediction to obtain a predicted semantic information vector.

[0015] Based on the actual semantic information vector and the predicted semantic information vector, construct the loss function corresponding to the preset initial semantic prediction model;

[0016] Based on the loss function, the preset initial semantic prediction model is trained to construct the preset semantic prediction model.

[0017] Preferably, the preset semantic prediction model is a preset encoder, which includes an attention layer and a feedforward neural network layer. The step of inputting the audio data to be converted into the preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted includes:

[0018] Determine the frequency feature vector corresponding to the audio data to be converted;

[0019] The frequency feature vector is input into the attention layer for feature extraction to obtain the first feature vector corresponding to the audio data to be converted.

[0020] The first feature vector and the frequency feature vector are added together to obtain the second feature vector corresponding to the audio data to be converted;

[0021] The second feature vector is input into the feedforward neural network layer for feature extraction to obtain the semantic information vector corresponding to the audio data to be converted.

[0022] Preferably, determining the frequency feature vector corresponding to the audio data to be converted includes:

[0023] The audio data to be converted is sequentially subjected to pre-emphasis, frame segmentation, and windowing to obtain the processed audio data;

[0024] Perform a Fourier transform on the processed audio data to obtain the spectrogram corresponding to the audio data to be converted;

[0025] The spectrogram is filtered using a Mel filter bank to obtain a spectrogram with energy waves output by the Mel filter bank, and the logarithm of the energy waves is calculated to obtain a logarithmic spectrogram.

[0026] The discrete cosine transform is performed on the logarithmic spectrogram to obtain the frequency feature vector corresponding to the audio data to be converted.

[0027] Preferably, the step of cross-processing the semantic information vector and the timbre feature vector to obtain a timbre cross-feature vector includes:

[0028] The semantic information vector and the timbre feature vector are cross-convolved to obtain the first cross vector.

[0029] A low-order cross processing is performed on the semantic information vector and the timbre feature vector to obtain a second cross vector;

[0030] A third cross vector is obtained by performing a cross-linear processing on the semantic information vector and the timbre feature vector.

[0031] The first cross vector, the second cross vector, and the third cross vector are transformed using a preset transformation function to obtain the timbre cross feature vector.

[0032] Preferably, the step of inputting the timbre cross-feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio includes:

[0033] Obtain Gaussian noise and determine the noise step index for denoising the Gaussian noise;

[0034] The Gaussian noise, the noise step index, and the timbre cross feature vector are input into a preset timbre conversion model to perform timbre conversion and obtain the target audio.

[0035] Preferably, the preset timbre conversion model is a preset vocoder model, which consists of a positional coding layer, a downsampling layer, and a conditional upsampling layer. The step of inputting the Gaussian noise, the noise step index, and the timbre cross-feature vector into the preset timbre conversion model for timbre conversion to obtain the target audio includes:

[0036] The noise step index is input into the position encoding layer, and the noise reduction feature vector is output through the position encoding layer.

[0037] The noise reduction feature vector is added to the timbre cross feature vector to obtain the fused feature vector;

[0038] The Gaussian noise is input into the downsampling layer to obtain a noise feature vector;

[0039] The noise feature vector and the fused feature vector are input into the conditional upsampling layer to obtain the target audio.
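The data flow of steps [0036] to [0039] can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the text does not say which positional coding is used (a sinusoidal encoding is assumed here), and the `downsample` and `conditional_upsample` functions are fixed arithmetic stand-ins for what would be learned layers in a real vocoder. The sketch only shows which inputs feed which stage.

```python
import math
import random


def step_embedding(step: int, dim: int) -> list[float]:
    """Sinusoidal encoding of the noise step index (assumed; dim must be even)."""
    half = dim // 2
    out: list[float] = []
    for i in range(half):
        freq = 1.0 / (10000.0 ** (i / half))
        out += [math.sin(step * freq), math.cos(step * freq)]
    return out


def fuse(step_vec: list[float], cross_vec: list[float]) -> list[float]:
    """Add the step encoding to the timbre cross feature vector."""
    return [s + c for s, c in zip(step_vec, cross_vec)]


def downsample(noise: list[float], factor: int) -> list[float]:
    """Toy downsampling layer: average each block of `factor` samples."""
    return [sum(noise[i:i + factor]) / factor
            for i in range(0, len(noise) - factor + 1, factor)]


def conditional_upsample(noise_feat: list[float], cond: list[float],
                         factor: int) -> list[float]:
    """Toy conditional upsampling: repeat each value, shifted by the mean condition."""
    bias = sum(cond) / len(cond)
    return [v + bias for v in noise_feat for _ in range(factor)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]   # Gaussian noise input
cross_feature = [0.1] * 8                          # hypothetical timbre cross feature vector
fused = fuse(step_embedding(step=10, dim=8), cross_feature)
audio = conditional_upsample(downsample(noise, 4), fused, 4)
```

In a trained model these stages would be applied repeatedly, once per noise step; the single pass here is only meant to make the wiring concrete.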

[0040] According to a second aspect of the present invention, a timbre conversion device is provided, comprising:

[0041] The acquisition unit is used to acquire the audio data to be converted and the target timbre;

[0042] The semantic prediction unit is used to input the audio data to be converted into a preset semantic prediction model to perform semantic prediction and obtain the semantic information vector corresponding to the audio data to be converted.

[0043] A determining unit is used to determine the timbre feature vector corresponding to the target timbre;

[0044] The cross-processing unit is used to cross-process the semantic information vector and the timbre feature vector to obtain the timbre cross-feature vector;

[0045] The timbre conversion unit is used to input the timbre cross feature vector into a preset timbre conversion model to perform timbre conversion and obtain target audio, wherein the target audio is audio with the target timbre and the original tone in the audio data to be converted.

[0046] According to a third aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the above-described timbre conversion method.

[0047] According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the above-described timbre conversion method.

[0048] Compared with the current approach of decomposing the text in speech into phonemes to train an algorithm and then using the trained algorithm for timbre conversion, the timbre conversion method, apparatus, storage medium, and computer device provided by the present invention acquire the audio data to be converted and the target timbre; input the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain the corresponding semantic information vector; determine the timbre feature vector corresponding to the target timbre; cross-process the semantic information vector and the timbre feature vector to obtain a timbre cross-feature vector; and finally input the timbre cross-feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted. Because the semantic information vector is obtained by analyzing the speech in the audio data itself, and timbre conversion is then performed based on that vector and the timbre feature vector of the target timbre, the converted audio still carries the tone of the original audio, which improves the accuracy of timbre conversion. Furthermore, analyzing the audio directly avoids the problem that phoneme analysis restricts timbre conversion to a single language. The invention is therefore applicable to timbre conversion in any language, which broadens its scope of application. It also avoids the need to train a separate model for each language, which improves timbre conversion efficiency.

Attached Figure Description

[0049] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0050] Figure 1 shows a flowchart of a timbre conversion method provided by an embodiment of the present invention;

[0051] Figure 2 shows a flowchart of another timbre conversion method provided by an embodiment of the present invention;

[0052] Figure 3 shows a schematic diagram of a timbre conversion device provided by an embodiment of the present invention;

[0053] Figure 4 shows a schematic diagram of another timbre conversion device provided by an embodiment of the present invention;

[0054] Figure 5 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present invention.

Detailed Implementation

[0055] The present invention will be described in detail below with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the present application can be combined with each other.

[0056] Currently, the method of breaking down text in speech into phonemes to train the algorithm and then using the trained algorithm for timbre conversion cannot obtain the intonation of the original audio. This results in the converted speech lacking the intonation of the original speech, leading to low accuracy in timbre conversion.

[0057] To address the aforementioned problems, embodiments of the present invention provide a timbre conversion method. As shown in Figure 1, the method includes:

[0058] 101. Obtain the audio data to be converted and the target timbre.

[0059] The audio data to be converted includes the original timbre, speech content, and original tone; the embodiment of the present invention requires converting the original timbre in the audio to be converted into the target timbre.

[0060] In this embodiment of the invention, a timbre database stores various timbres in advance. The audio data to be converted can be pre-recorded audio data, audio data recorded live, or audio extracted from an audio device. When the original timbre in the audio data to be converted needs to be converted into the target timbre, the target timbre can be obtained directly from the timbre database. The semantic information in the audio data to be converted (including tone information) is then determined. Finally, based on the semantic information and the target timbre information, the audio data to be converted is converted into audio with the target timbre. The result is audio that keeps the original tone and carries the target timbre, which improves the accuracy of timbre conversion.

[0061] 102. Input the audio data to be converted into the preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted.

[0062] The semantic information vector contains the tone information of the original audio in the audio data to be converted, such as an exclamatory, surprised, cheerful, or sad tone.

[0063] In this embodiment of the invention, after obtaining the audio data to be converted and the target timbre, a preset semantic prediction model is used to predict the semantics in the audio data to be converted to obtain a semantic information vector. Then, the audio data to be converted is converted according to the semantic information vector and the target timbre, so that the converted audio not only carries the target timbre, but also carries the corresponding tone, so that the converted audio can express real emotions and enhance the user's auditory experience.

[0064] 103. Determine the timbre feature vector corresponding to the target timbre.

[0065] In this embodiment of the invention, audio carrying the target timbre can be obtained in advance and the target timbre extracted from it. Alternatively, the target timbre can be obtained directly from the timbre database. To convert the audio data to be converted into audio with the target timbre, the timbre features corresponding to the target timbre must first be determined. To this end, the spectral features and temporal features corresponding to the target timbre can be determined in advance. Timbre descriptors are then extracted from the spectral and temporal features, and a timbre feature vector is constructed from these descriptors. Finally, timbre conversion is performed on the audio data to be converted according to the timbre feature vector corresponding to the target timbre and the semantic information vector corresponding to the audio data to be converted, producing audio with the target timbre and the corresponding tone.
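As a rough illustration of building a timbre feature vector from spectral and temporal descriptors, the sketch below computes three common descriptors. The choice of descriptors (spectral centroid, zero-crossing rate, RMS energy) is an assumption, since the text does not name the descriptors it uses, and the function names are hypothetical.

```python
import math


def zero_crossing_rate(signal: list[float]) -> float:
    """Temporal descriptor: fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(signal) - 1)


def rms_energy(signal: list[float]) -> float:
    """Temporal descriptor: root-mean-square amplitude."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def spectral_centroid(magnitudes: list[float], sample_rate: int) -> float:
    """Spectral descriptor: magnitude-weighted mean frequency of a one-sided spectrum."""
    n_bins = len(magnitudes)
    freqs = [k * sample_rate / (2 * (n_bins - 1)) for k in range(n_bins)]
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total


def timbre_feature_vector(signal: list[float], magnitudes: list[float],
                          sample_rate: int) -> list[float]:
    """Stack the descriptors into one vector (an assumed, minimal layout)."""
    return [spectral_centroid(magnitudes, sample_rate),
            zero_crossing_rate(signal),
            rms_energy(signal)]


vector = timbre_feature_vector([1.0, -1.0, 1.0, -1.0],
                               [0.0, 0.0, 1.0, 0.0, 0.0],
                               sample_rate=8000)
```

Paragraph [0088] mentions that a neural network model can extract the timbre feature vector instead; hand-crafted descriptors like these are simply the easiest variant to show in a few lines.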

[0066] 104. Cross-process the semantic information vector and the timbre feature vector to obtain the timbre cross-feature vector.

[0067] In this embodiment of the invention, the semantic information vector corresponding to the audio data to be converted and the timbre feature vector corresponding to the target timbre are vectors of different dimensions. In order to improve the accuracy of timbre conversion, the semantic information vector and the timbre feature vector need to be processed into vectors of the same dimension. Based on this, the semantic information vector and the timbre feature vector can be cross-processed to obtain a timbre cross-feature vector. Then, the timbre cross-feature vector can be input into a preset timbre conversion model for timbre conversion, thereby improving the prediction accuracy of the timbre conversion model and thus improving the accuracy of timbre conversion.

[0068] 105. Input the timbre cross feature vector into the preset timbre conversion model to perform timbre conversion and obtain the target audio, wherein the target audio is the audio with the target timbre and the original tone in the audio data to be converted.

[0069] In this embodiment of the invention, after the timbre cross-feature vector is obtained, it is input into a preset timbre conversion model for timbre conversion, ultimately producing the target audio, which carries the target timbre and the original intonation. In other words, a semantic information vector is obtained by analyzing the semantics of the speech in the audio data, and the target audio is then generated from that semantic information vector and the timbre feature vector corresponding to the target timbre. This ensures that the converted audio still includes the intonation of the original audio, thereby improving the accuracy of timbre conversion.
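Steps 101 to 105 form a simple pipeline, which the following skeleton makes explicit. The function `convert_timbre` and its four callable parameters are hypothetical names introduced for illustration; the lambdas in the usage part are trivial stand-ins, not the models described in the text.

```python
from typing import Callable

Vector = list[float]


def convert_timbre(
    audio: Vector,
    target_timbre: Vector,
    semantic_model: Callable[[Vector], Vector],
    timbre_encoder: Callable[[Vector], Vector],
    cross: Callable[[Vector, Vector], Vector],
    converter: Callable[[Vector], Vector],
) -> Vector:
    """Chain the stages in the order given by steps 102 to 105."""
    semantic = semantic_model(audio)         # step 102: semantic information vector
    timbre = timbre_encoder(target_timbre)   # step 103: timbre feature vector
    crossed = cross(semantic, timbre)        # step 104: timbre cross feature vector
    return converter(crossed)                # step 105: target audio


# Toy stand-ins, only to show that the stages compose.
result = convert_timbre(
    audio=[0.1, 0.2, 0.3],
    target_timbre=[1.0, 2.0],
    semantic_model=lambda a: [sum(a)],
    timbre_encoder=lambda t: [max(t)],
    cross=lambda s, t: [x * y for x in s for y in t],
    converter=lambda c: c * 2,   # list repetition as a placeholder "synthesis"
)
```

Passing the stages as parameters keeps the skeleton independent of any particular model, which matches the text's statement that the same flow applies regardless of language.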

[0070] Compared with the current approach of decomposing the text in speech into phonemes to train an algorithm and then using the trained algorithm to perform timbre conversion, the timbre conversion method provided by the present invention acquires the audio data to be converted and the target timbre; inputs the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain the corresponding semantic information vector; determines the timbre feature vector corresponding to the target timbre; cross-processes the semantic information vector and the timbre feature vector to obtain a timbre cross feature vector; and finally inputs the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted. Because the semantic information vector is obtained by analyzing the speech in the audio data, and timbre conversion is then performed based on that vector and the timbre feature vector corresponding to the target timbre, the converted audio still contains the tone of the original audio, which improves the accuracy of timbre conversion. At the same time, analyzing the audio directly avoids the problem that phoneme analysis restricts timbre conversion to a single language. The invention can therefore be applied to timbre conversion in any language, which broadens its applicability. It also avoids the need to train a separate model for each language, which improves the efficiency of timbre conversion.

[0071] Furthermore, to better illustrate the above-described timbre conversion process, as a refinement and extension of the above embodiments, this invention provides another timbre conversion method. As shown in Figure 2, the method includes:

[0072] 201. Obtain the audio data to be converted and the target timbre.

[0073] Specifically, the audio data to be converted can be extracted from audio data on a playback device, and the target timbre can be obtained from the timbre database. In another embodiment of the present invention, if a user finds the timbre of audio being played in a player pleasant, audio with that timbre can be extracted from the player, and the target timbre can be obtained from the extracted audio.

[0074] 202. Construct a preset initial semantic prediction model and obtain sample audio data and the actual semantic information vector corresponding to the sample audio data.

[0075] The sample audio data can include audio data from multiple languages, and the actual semantic information vector refers to the semantic information vector corresponding to the standard tone contained in the sample audio data.

[0076] 203. Input the sample audio data into the preset initial semantic prediction model to perform semantic prediction and obtain the predicted semantic information vector.

[0077] 204. Based on the actual semantic information vector and the predicted semantic information vector, construct the loss function corresponding to the preset initial semantic prediction model.

[0078] 205. Based on the loss function, train the preset initial semantic prediction model to construct the preset semantic prediction model.

[0079] In this embodiment of the invention, to improve the prediction accuracy of the preset semantic prediction model, the model must first be constructed. The procedure is as follows. First, a preset initial semantic prediction model is constructed, and multilingual sample audio data is acquired; the languages may include English, Chinese, French, and so on, and the sample audio data contains the original timbre and speech content. After the multilingual sample audio data is acquired, actual semantic information vectors are extracted from it. Next, preprocessing, Fourier transform processing, Mel filter bank processing, and discrete cosine transform processing are performed on the sample audio data in sequence to obtain the sample audio feature vectors corresponding to the multilingual sample audio data. The sample audio feature vectors are then input into the preset initial semantic prediction model for semantic prediction to obtain the predicted semantic information vectors. A loss function is constructed from the actual semantic information vectors and the predicted semantic information vectors, and the loss function is used to optimize the parameters of the preset initial semantic prediction model, ultimately yielding a preset semantic prediction model with higher accuracy. When the loss function is used to train the preset initial semantic prediction model, a self-supervised approach can be adopted. Specifically, the Online K-means clustering quantization method can be used to optimize the model parameters of the preset initial semantic prediction model, and the model with optimized parameters is determined as the preset semantic prediction model.
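Two ingredients of this training description can be sketched in a few lines: a loss between actual and predicted semantic vectors, and an online K-means update. Both are illustrative assumptions. The text does not name the loss (mean squared error is assumed here), and it does not specify how the online K-means quantization is wired into training; the sketch only shows the standard running-mean centroid update, with hypothetical function names.

```python
def mse_loss(actual: list[float], predicted: list[float]) -> float:
    """Mean squared error between two vectors (assumed loss, not from the source)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def nearest(centroids: list[list[float]], x: list[float]) -> int:
    """Index of the centroid closest to x in squared Euclidean distance."""
    dists = [sum((c - v) ** 2 for c, v in zip(centroid, x)) for centroid in centroids]
    return dists.index(min(dists))


def online_kmeans(stream: list[list[float]],
                  centroids: list[list[float]]) -> tuple[list[list[float]], list[int]]:
    """Process vectors one at a time, moving the winning centroid toward each vector."""
    centroids = [list(c) for c in centroids]
    counts = [1] * len(centroids)   # each initial centroid counts as one observation
    labels = []
    for x in stream:
        j = nearest(centroids, x)
        counts[j] += 1
        rate = 1.0 / counts[j]      # running-mean step size
        centroids[j] = [c + rate * (v - c) for c, v in zip(centroids[j], x)]
        labels.append(j)
    return centroids, labels


features = [[0.0, 0.0], [10.0, 10.0], [0.2, 0.0], [9.8, 10.0]]
centroids, labels = online_kmeans(features, [[0.0, 0.0], [10.0, 10.0]])
```

The resulting cluster indices in `labels` could serve as discrete targets for self-supervised training, but that use is an interpretation rather than something the text states.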

[0080] 206. Input the audio data to be converted into the preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted.

[0081] The preset semantic prediction model is a preset encoder, which includes an attention layer and a feedforward neural network layer.

[0082] In this embodiment of the invention, after constructing the preset semantic prediction model, the audio data to be converted needs to be input into the preset semantic prediction model for semantic prediction. Based on this, step 206 specifically includes: determining the frequency feature vector corresponding to the audio data to be converted; inputting the frequency feature vector into the attention layer for feature extraction to obtain the first feature vector corresponding to the audio data to be converted; adding the first feature vector and the frequency feature vector to obtain the second feature vector corresponding to the audio data to be converted; and inputting the second feature vector into the feedforward neural network layer for feature extraction to obtain the semantic information vector corresponding to the audio data to be converted.

[0083] Specifically, to improve the prediction accuracy of the preset semantic prediction model, it is first necessary to determine the frequency feature vector corresponding to the audio data to be converted. Based on this, the method includes: performing pre-emphasis, framing, and windowing processing on the audio data to be converted in sequence to obtain processed audio data; performing Fourier transform on the processed audio data to obtain the spectrogram corresponding to the audio data to be converted; filtering the spectrogram using a Mel filter bank to obtain a spectrum diagram with energy waves output by the Mel filter bank, and calculating the logarithm of the energy waves to obtain a logarithmic spectrum diagram; and performing discrete cosine transform on the logarithmic spectrum diagram to obtain the frequency feature vector corresponding to the audio data to be converted.

[0084] The Mel filter bank consists of multiple triangular filters with varying bandwidths.
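The Mel filtering, logarithm, and discrete cosine transform steps can be sketched as follows. This is a minimal illustration under stated assumptions: the Hz-to-Mel formula, the filter count, the FFT size, and the small floor that guards the logarithm are conventional choices that do not come from the text, and all function names are hypothetical.

```python
import math


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters: int, n_fft: int, sample_rate: int) -> list[list[float]]:
    """Triangular filters over the n_fft // 2 + 1 bins of a one-sided spectrum."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge points, equally spaced on the Mel scale, as fractional bin positions
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    edges = [mel_to_hz(m) * n_fft / sample_rate for m in mel_points]
    bank = []
    for f in range(1, n_filters + 1):
        left, center, right = edges[f - 1], edges[f], edges[f + 1]
        row = []
        for k in range(n_bins):
            if left < k <= center:
                row.append((k - left) / (center - left))    # rising slope
            elif center < k < right:
                row.append((right - k) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_energies(power_spectrum: list[float], bank: list[list[float]],
                     floor: float = 1e-10) -> list[float]:
    """Energy output by each filter, then its natural logarithm."""
    return [math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), floor))
            for row in bank]


def dct_ii(values: list[float], n_coeffs: int) -> list[float]:
    """Unnormalized type-II discrete cosine transform, keeping the first n_coeffs terms."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]


bank = mel_filter_bank(n_filters=10, n_fft=64, sample_rate=8000)
flat_spectrum = [1.0] * 33   # hypothetical power spectrum of one frame
features = dct_ii(log_mel_energies(flat_spectrum, bank), n_coeffs=5)
```

Because the edge points are equally spaced on the Mel scale, the filters get wider toward high frequencies, which is the "varying bandwidths" property mentioned above.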

[0085] Specifically, after acquiring the audio data to be converted, redundant data in the audio data is first removed in order to improve the accuracy of timbre conversion. To this end, the audio data is pre-emphasized to obtain the first audio data; pre-emphasis mainly removes the influence of lip radiation and increases the high-frequency resolution of the speech in the audio data. Next, so that the input audio signal can be treated as stable within each analysis unit, the first audio data is divided into short segments, that is, framing is performed to obtain the second audio data. The sampling points in each audio frame are then multiplied by the corresponding elements of a window function, that is, windowing is applied to the second audio data. Windowing addresses the leakage caused by non-periodic truncation of the audio signal, which would otherwise smear the spectrum across the whole frequency band; it makes the overall spectrum more continuous and avoids the Gibbs effect. This yields processed audio data with distinct signal characteristics. A Fourier transform is then performed on the processed audio data to convert it from the time domain to the frequency domain, giving the corresponding spectrogram. A Mel filter bank is used to filter the spectrogram to obtain the energy output by each Mel filter, and the logarithm of each filter's energy is taken to obtain the logarithmic spectrogram. Finally, a discrete cosine transform is performed on the logarithmic spectrogram to obtain the frequency feature vector corresponding to the audio data to be converted.
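The front half of this chain (pre-emphasis, framing, windowing, and the Fourier transform) can be illustrated with a small standard-library sketch. The pre-emphasis coefficient 0.97, the Hamming window, the frame length, and the hop size are conventional assumptions that the text does not specify, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def pre_emphasize(signal: list[float], alpha: float = 0.97) -> list[float]:
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n - 1]."""
    return [signal[0]] + [signal[i] - alpha * signal[i - 1]
                          for i in range(1, len(signal))]


def split_frames(signal: list[float], frame_len: int, hop: int) -> list[list[float]]:
    """Cut the signal into overlapping frames of frame_len samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame: list[float]) -> list[float]:
    """Multiply each sample by the matching Hamming window coefficient."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def power_spectrum(frame: list[float]) -> list[float]:
    """One-sided power spectrum via a naive DFT (O(n^2), fine for a sketch)."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        out.append(abs(acc) ** 2 / n)
    return out


sample_rate = 8000
# Hypothetical test signal: a 1000 Hz sine wave, 256 samples long.
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
frames = split_frames(pre_emphasize(signal), frame_len=64, hop=32)
spectra = [power_spectrum(hamming(f)) for f in frames]
```

With 64-sample frames at 8000 Hz each bin spans 125 Hz, so the 1000 Hz tone is expected to peak in bin 8 of each frame's spectrum.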

[0086] Furthermore, after determining the frequency feature vector corresponding to the audio data to be converted, the frequency feature vector needs to be input into a preset encoder for semantic prediction. There can be multiple preset encoders, each connected end-to-end. The encoder specifically includes an attention layer and a feedforward neural network layer. The specific method for extracting the semantic information vector using the encoder is as follows: the frequency feature vector is input into the attention layer for feature extraction to obtain a first feature vector. The first feature vector and the frequency feature vector are added to obtain a second feature vector. Then, the second feature vector is input into the feedforward neural network layer of the first encoder for feature extraction to obtain the output vector of the first encoder. Since this embodiment of the invention includes multiple encoders, and the multiple encoders are connected end-to-end, the output vector of the first encoder is input into the second encoder for feature extraction to obtain the output vector of the second encoder. In this way, the output vector of the previous encoder is used as the input vector of the next encoder. Finally, the output vector of the last encoder is determined as the semantic information vector corresponding to the audio data to be converted.
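The encoder structure described here (attention layer, addition with the input, feedforward layer, and several encoders chained end-to-end) can be sketched with plain Python lists. The single-head scaled dot-product attention, the ReLU activation, the random untrained weights, and the class name `EncoderLayer` are assumptions for illustration; the text only names the layer types and their order.

```python
import math
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: list[float]) -> list[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class EncoderLayer:
    def __init__(self, dim: int, hidden: int, rng: random.Random) -> None:
        self.dim = dim
        self.wq, self.wk, self.wv = (random_matrix(dim, dim, rng) for _ in range(3))
        self.w1 = random_matrix(dim, hidden, rng)
        self.w2 = random_matrix(hidden, dim, rng)

    def attention(self, x: Matrix) -> Matrix:
        """Single-head scaled dot-product self-attention over the frame sequence."""
        q, k, v = matmul(x, self.wq), matmul(x, self.wk), matmul(x, self.wv)
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(self.dim) for s in row]) for row in scores]
        return matmul(weights, v)

    def forward(self, x: Matrix) -> Matrix:
        first = self.attention(x)                         # first feature vector
        second = [[a + b for a, b in zip(r1, r2)]         # add the layer input
                  for r1, r2 in zip(first, x)]
        hidden = [[max(0.0, h) for h in row] for row in matmul(second, self.w1)]
        return matmul(hidden, self.w2)                    # feedforward output


def encode(frames: Matrix, layers: list[EncoderLayer]) -> Matrix:
    """Chain the encoders: each layer's output is the next layer's input."""
    out = frames
    for layer in layers:
        out = layer.forward(out)
    return out


rng = random.Random(0)
layers = [EncoderLayer(dim=4, hidden=8, rng=rng) for _ in range(2)]
frames = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(3)]  # 3 frames, 4 features
semantic = encode(frames, layers)
```

The output keeps one vector per input frame, so the last encoder's output has the same sequence length as the frequency feature input.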

[0087] 207. Determine the timbre feature vector corresponding to the target timbre.

[0088] Specifically, a neural network model can be used to extract the timbre feature vector corresponding to the target timbre. Then, the timbre feature vector corresponding to the target timbre and the semantic information vector corresponding to the audio data to be converted are cross-processed. Finally, the cross-processed vector is input into the preset timbre conversion model for timbre conversion.

[0089] 208. Cross-process the semantic information vector and the timbre feature vector to obtain the timbre cross-feature vector.

[0090] In this embodiment of the invention, in order to fully utilize the relationships between data, extract more latent features, and simultaneously consider both high-order and low-order processing to make data utilization more efficient and the subsequent transformation results more accurate, thus meeting the needs of practical application scenarios, it is necessary to perform cross processing on the semantic information vector and the timbre feature vector. Based on this, step 208 specifically includes: performing cross-convolution processing on the semantic information vector and the timbre feature vector to obtain a first cross vector; performing low-order cross processing on the semantic information vector and the timbre feature vector to obtain a second cross vector; performing cross-linear processing on the semantic information vector and the timbre feature vector to obtain a third cross vector; and using a preset transformation function to transform the first cross vector, the second cross vector, and the third cross vector to obtain the timbre cross feature vector.

[0091] Specifically, in practical applications, the timbre feature vector corresponding to the target timbre and the semantic information vector corresponding to the audio data to be converted are vectors of different dimensions. Therefore, in order to improve the conversion accuracy of the model, it is necessary to process the vectors from different domains into vectors of the same dimension. Based on this, the method is as follows: if the semantic information vector is (a1, a2) and the timbre feature vector is (b1, b2), the specific cross-processing includes three operations. First, cross-convolution processing is performed between the semantic information vector and the timbre feature vector: after the Cartesian product of all elements of the two vectors is taken, a convolution transformation is performed under a certain weight to obtain the first cross vector f(w*(a1*b1, a1*b2, a2*b1, a2*b2)). Second, low-order cross processing is performed between the semantic information vector and the timbre feature vector: the elements of the semantic information vector and the timbre feature vector are combined pairwise and a low-order cross transformation is applied to obtain the second cross vector f(w(a1, a2, b1, b2)). Third, linear cross processing is performed between the semantic information vector and the timbre feature vector: the Cartesian product of the elements of the vectors is taken, a different weight is assigned to each product, and a linear transformation is performed to obtain the third cross vector f(w1*a1*b1, w2*a1*b2, w3*a2*b1, w4*a2*b2). Finally, the results of these three processing steps are combined and transformed using a preset transformation function to obtain the timbre cross feature vector. This preset function can be set according to actual conditions, and this embodiment does not impose any limitations on it. It should be noted that the above examples are merely illustrative and do not limit the scope of this invention.
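The two-element example above can be worked through numerically. The sketch below is illustrative only: the function `f`, the weights, and the final combination step are not fixed by the embodiment, so `f` defaults to `tanh`, the weights are arbitrary, and the preset transformation function is assumed to be plain concatenation. The names `cross_features` and `combine` are hypothetical.

```python
import math
from itertools import product


def cross_features(a, b, w_conv=0.5, w_low=0.5, w_lin=None, f=math.tanh):
    """Return the first, second and third cross vectors for vectors a and b."""
    # Cartesian product of elements: a1*b1, a1*b2, a2*b1, a2*b2
    products = [x * y for x, y in product(a, b)]
    if w_lin is None:
        w_lin = [1.0] * len(products)
    # First cross vector: one shared weight applied to every product.
    first = [f(w_conv * p) for p in products]
    # Second cross vector: low-order terms, i.e. the raw elements of a and b.
    second = [f(w_low * v) for v in list(a) + list(b)]
    # Third cross vector: a separate weight for each product.
    third = [f(w * p) for w, p in zip(w_lin, products)]
    return first, second, third


def combine(first, second, third):
    """Stand-in for the preset transformation function (assumed: concatenation)."""
    return first + second + third


a = (1.0, 2.0)  # semantic information vector (a1, a2)
b = (3.0, 4.0)  # timbre feature vector (b1, b2)
# Use the identity for f so the arithmetic is easy to follow by hand.
first, second, third = cross_features(
    a, b, w_lin=[1.0, 2.0, 3.0, 4.0], f=lambda v: v
)
timbre_cross = combine(first, second, third)
```

With the identity function and the weights shown, the products are (3, 4, 6, 8), so the first cross vector is (1.5, 2, 3, 4), the second is (0.5, 1, 1.5, 2), and the third is (3, 8, 18, 32).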

[0092] 209. Input the timbre cross feature vector into the preset timbre conversion model to perform timbre conversion and obtain the target audio, wherein the target audio is the audio with the target timbre and the original tone in the audio data to be converted.

[0093] In this embodiment of the invention, after cross-processing the semantic information vector and the timbre feature vector to obtain the timbre cross-feature vector, the timbre cross-feature vector needs to be input into a preset timbre conversion model for timbre conversion. In this embodiment of the invention, in order to improve the accuracy of the preset timbre conversion model, it is first necessary to construct a preset timbre conversion model. The specific construction method includes: firstly, obtaining sample audio data and sample timbre, and determining the actual audio after the sample audio data is converted into the sample timbre; then, determining the sample semantic information vector corresponding to the sample audio data, and at the same time, determining the sample timbre feature vector corresponding to the sample timbre; then, cross-processing the sample semantic information vector and the sample timbre feature vector to obtain the sample timbre cross-feature vector, and inputting the sample timbre cross-feature vector into the preset initial timbre conversion model for timbre conversion to obtain the converted audio; then, based on the converted audio and the converted actual audio corresponding to the same sample audio data, constructing a loss function, and using the loss function to train the preset initial timbre conversion model, that is, continuously optimizing the model parameters of the preset timbre conversion model, and finally obtaining a preset timbre conversion model with high conversion accuracy.
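The training procedure above (convert, compare with the actual audio, build a loss, update the parameters) can be illustrated with a deliberately tiny stand-in. The sketch below assumes a linear model in place of the preset initial timbre conversion model, a mean squared error loss, and plain gradient descent; none of these choices is specified by the embodiment, and the data is synthetic.

```python
import random


def predict(weights, features):
    """Toy 'conversion': one output sample from a cross-feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def mse_loss(weights, samples):
    """Mean squared error between converted output and actual output."""
    return sum((predict(weights, x) - y) ** 2 for x, y in samples) / len(samples)


def train(samples, dim, lr=0.05, epochs=200):
    """Optimize the model parameters by gradient descent on the loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in samples:
            err = predict(weights, x) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


rng = random.Random(0)
true_weights = [0.3, -0.7, 0.5]  # hypothetical "ground truth" conversion
samples = []
for _ in range(20):
    x = [rng.uniform(-1.0, 1.0) for _ in true_weights]  # sample cross features
    samples.append((x, predict(true_weights, x)))       # paired actual output

initial_loss = mse_loss([0.0] * 3, samples)
trained = train(samples, dim=3)
final_loss = mse_loss(trained, samples)
```

In a real system the linear model would be replaced by the timbre conversion model and the synthetic pairs by sample audio with its actual converted audio; only the loop structure carries over.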

[0094] Furthermore, after constructing a preset timbre conversion model with high conversion accuracy, it is necessary to use the preset timbre conversion model to perform timbre conversion on the audio data to be converted. Based on this, step 209 specifically includes: obtaining Gaussian noise and determining the noise step index for denoising the Gaussian noise; inputting the Gaussian noise, the noise step index, and the timbre cross feature vector into the preset timbre conversion model for timbre conversion to obtain the target audio.

[0095] The noise step index is a step size for denoising audio. The value of the noise step index is set according to the actual situation. This embodiment of the invention does not impose a specific limitation on the value of the noise step index.

[0096] Specifically, after determining the timbre cross-feature vector corresponding to the audio data to be converted, in order to perform timbre conversion on the audio data to be converted, it is also necessary to first randomly generate Gaussian noise and, at the same time, determine the noise step index used in the denoising process. Finally, the timbre conversion of the audio to be converted is performed based on the Gaussian noise, the noise step index, and the timbre cross-feature vector. Based on this, the method includes: inputting the noise step index into the positional coding layer, and outputting a denoising feature vector through the positional coding layer; adding the denoising feature vector to the timbre cross-feature vector to obtain a fused feature vector; inputting the Gaussian noise into the downsampling layer to obtain a noise feature vector; and inputting the noise feature vector and the fused feature vector into the conditional upsampling layer to obtain the target audio.

[0097] Specifically, the preset timbre conversion model can be a preset vocoder model based on DDPM (Denoising Diffusion Probabilistic Model). The preset vocoder model includes a positional coding layer, a downsampling layer, and a conditional upsampling layer. The downsampling layer consists of one 2D convolution layer, two 1D convolution layers, and four LReLU layers (non-linear activation layers). The conditional upsampling layer consists of two 2D convolution layers, two 1D convolution layers, two LReLU layers, and four gated activation layers.
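The embodiment names DDPM but does not spell out how the noise step index drives denoising. As a hedged illustration, the sketch below assumes the standard DDPM ancestral sampling loop with a linear noise schedule: starting from Gaussian noise, each step index `t` selects the schedule values used to remove part of the noise. The function names, the schedule, and the `predict_noise(x, t, condition)` interface are assumptions; in a real system the predictor would be the trained vocoder model conditioned on the timbre cross feature vector.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with the derived alpha and cumulative alpha values."""
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def sample(predict_noise, condition, length, num_steps=50, seed=0):
    """Denoise random Gaussian noise step by step, from the last index to 0."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # initial Gaussian noise
    for t in reversed(range(num_steps)):              # t is the noise step index
        eps = predict_noise(x, t, condition)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


def make_oracle(clean, num_steps):
    """Stand-in predictor that already knows the clean signal (for checking only)."""
    _, _, alpha_bars = make_schedule(num_steps)

    def predict_noise(x, t, condition):
        scale = math.sqrt(alpha_bars[t])
        denom = math.sqrt(1.0 - alpha_bars[t])
        return [(xi - scale * c) / denom for xi, c in zip(x, clean)]

    return predict_noise


clean = [0.5, -0.25, 1.0]  # hypothetical target waveform samples
recovered = sample(make_oracle(clean, 50), condition=None, length=3, num_steps=50)
```

With the oracle predictor the loop recovers the clean samples up to floating-point error, which is a convenient way to check the update rule before plugging in a learned model.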

[0098] Specifically, the noise step index is encoded using the positional coding layer in the preset vocoder model to obtain a denoising feature vector. This denoising feature vector is then combined with the timbre cross-feature vector to obtain a fused feature vector; the combination is performed by horizontally concatenating the denoising feature vector and the timbre cross-feature vector. Simultaneously, Gaussian noise is input to the downsampling layer in the preset vocoder model. After processing by a 2D convolutional layer, a 1D convolutional layer, and a nonlinear activation layer in the downsampling layer, a noise feature vector is output. The noise feature vector and the fused feature vector are then input to the conditional upsampling layer. In the conditional upsampling layer, LReLU activation and convolution with a 1D convolutional layer are performed to obtain the initial audio. The initial audio is then passed through a gated activation layer to increase its nonlinearity. Next, the nonlinearized initial audio is added to the residual in the gated activation layer. The summation result is then input into a 2D convolutional layer for convolution processing, and finally the target audio with the target timbre is output. By using the Gaussian noise and the noise step index as input, noise in the audio can be removed, ensuring the clarity of the obtained target audio.
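The first two operations in this paragraph (encoding the noise step index and concatenating the result with the timbre cross-feature vector) can be sketched as follows. The embodiment does not state which positional encoding is used, so a transformer-style sinusoidal encoding is assumed here; `encode_step`, `fuse`, and the embedding size are hypothetical.

```python
import math


def encode_step(step_index, dim=8):
    """Sinusoidal encoding of the noise step index (assumed encoding scheme)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    sines = [math.sin(step_index * f) for f in freqs]
    cosines = [math.cos(step_index * f) for f in freqs]
    return sines + cosines


def fuse(step_embedding, cross_features):
    """Horizontal concatenation of the two vectors into one fused vector."""
    return list(step_embedding) + list(cross_features)


step_embedding = encode_step(step_index=10)
cross_features = [0.1, 0.2, 0.3]  # stand-in timbre cross-feature vector
fused = fuse(step_embedding, cross_features)
```

Concatenation keeps the step information and the timbre/semantic information in separate positions of the fused vector, so the two inputs do not need to share a dimension.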

[0099] Compared with the current approach of decomposing the text in speech into phonemes to train an algorithm and using the trained algorithm to perform timbre conversion, the other timbre conversion method provided by the present invention acquires the audio data to be converted and the target timbre; inputs the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted; simultaneously determines the timbre feature vector corresponding to the target timbre; then cross-processes the semantic information vector and the timbre feature vector to obtain a timbre cross feature vector; and finally inputs the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted. In this way, the semantic information vector is obtained by analyzing the speech in the audio data, and timbre conversion is then performed based on the semantic information vector and the timbre feature vector corresponding to the target timbre to obtain audio with the target timbre. This allows the converted audio to still contain the tone of the original audio, thereby improving the accuracy of timbre conversion. At the same time, because the audio is analyzed directly, the restriction of phoneme-based analysis to timbre conversion in a single language is avoided. Therefore, this invention can be applied to timbre conversion of any language, thus broadening its applicability. Furthermore, this invention avoids the need to train a corresponding model for timbre conversion in each language, thereby improving the efficiency of timbre conversion.

[0100] Furthermore, as a specific implementation of the method shown in Figure 1, an embodiment of the present invention provides a timbre conversion device. As shown in Figure 3, the device includes: an acquisition unit 31, a semantic prediction unit 32, a determination unit 33, a cross-processing unit 34, and a timbre conversion unit 35.

[0101] The acquisition unit 31 can be used to acquire the audio data to be converted and the target timbre.

[0102] The semantic prediction unit 32 can be used to input the audio data to be converted into a preset semantic prediction model to perform semantic prediction and obtain the semantic information vector corresponding to the audio data to be converted.

[0103] The determining unit 33 can be used to determine the timbre feature vector corresponding to the target timbre.

[0104] The cross-processing unit 34 can be used to cross-process the semantic information vector and the timbre feature vector to obtain a timbre cross-feature vector.

[0105] The timbre conversion unit 35 can be used to input the timbre cross feature vector into a preset timbre conversion model to perform timbre conversion and obtain target audio, wherein the target audio is an audio with the target timbre and the original tone in the audio data to be converted.

[0106] In specific application scenarios, in order to construct the preset semantic prediction model, as shown in Figure 4, the device also includes a construction unit 36.

[0107] The construction unit 36 can be used to construct a preset initial semantic prediction model.

[0108] The acquisition unit 31 can also be used to acquire sample audio data and the actual semantic information vector corresponding to the sample audio data.

[0109] The semantic prediction unit 32 can also be used to input the sample audio data into the preset initial semantic prediction model to perform semantic prediction and obtain a predicted semantic information vector.

[0110] The construction unit 36 can be specifically used to construct the loss function corresponding to the preset initial semantic prediction model based on the actual semantic information vector and the predicted semantic information vector.

[0111] The construction unit 36 can also be used to train the preset initial semantic prediction model based on the loss function to construct the preset semantic prediction model.

[0112] In specific application scenarios, in order to perform semantic prediction on the audio data to be converted, the semantic prediction unit 32 includes a first determination module 321, a feature extraction module 322, and a feature addition module 323.

[0113] The first determining module 321 can be used to determine the frequency feature vector corresponding to the audio data to be converted.

[0114] The feature extraction module 322 can be used to input the frequency feature vector into the attention layer for feature extraction to obtain the first feature vector corresponding to the audio data to be converted.

[0115] The feature addition module 323 can be used to add the first feature vector and the frequency feature vector to obtain the second feature vector corresponding to the audio data to be converted.

[0116] The feature extraction module 322 can be specifically used to input the second feature vector into the feedforward neural network layer for feature extraction, so as to obtain the semantic information vector corresponding to the audio data to be converted.

[0117] In specific application scenarios, in order to determine the frequency feature vector corresponding to the audio data to be converted, the first determining module 321 includes a preprocessing submodule, a transformation submodule, a filtering submodule and a cosine transform submodule.

[0118] The preprocessing submodule can be used to perform pre-emphasis, frame segmentation and windowing processing on the audio data to be converted in sequence to obtain the processed audio data.

[0119] The transformation submodule can be used to perform Fourier transform on the processed audio data to obtain the spectrogram corresponding to the audio data to be transformed.

[0120] The filtering submodule can be used to filter the spectrogram using a Mel filter bank to obtain a spectrogram with energy waves output by the Mel filter bank, and to calculate the logarithm of the energy waves to obtain a logarithmic spectrogram.

[0121] The cosine transform submodule can be used to perform discrete cosine transform on the logarithmic spectrogram to obtain the frequency feature vector corresponding to the audio data to be converted.
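The four submodules above follow the familiar mel-frequency cepstral coefficient (MFCC) pipeline. The sketch below walks through the same steps in plain Python. It is a minimal illustration, assuming a Hamming window, a naive discrete Fourier transform, triangular mel filters, and small illustrative sizes; the pre-emphasis coefficient, frame length, hop, and filter count are assumptions, not values given in the embodiment.

```python
import cmath
import math


def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return [signal[0]] + [signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))]


def frame_and_window(signal, frame_len, hop):
    """Split into overlapping frames and apply a Hamming window to each."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    return [
        [signal[start + n] * window[n] for n in range(frame_len)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def power_spectrum(frame):
    """Naive DFT power spectrum over the non-negative frequency bins."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
        for k in range(n // 2 + 1)
    ]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, frame_len, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    edges = [hz * frame_len / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        left, mid, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(frame_len // 2 + 1):
            if left < k <= mid:
                row.append((k - left) / (mid - left))
            elif mid < k < right:
                row.append((right - k) / (right - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]


def mfcc(signal, sample_rate, frame_len=64, hop=32, num_filters=10, num_coeffs=6):
    frames = frame_and_window(pre_emphasis(signal), frame_len, hop)
    bank = mel_filter_bank(num_filters, frame_len, sample_rate)
    features = []
    for frame in frames:
        spectrum = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
        log_energies = [math.log(e + 1e-10) for e in energies]  # avoid log(0)
        features.append(dct2(log_energies, num_coeffs))
    return features


sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(256)]  # test tone
features = mfcc(signal, sample_rate)
```

Each entry of `features` is the frequency feature vector for one frame; a production system would normally use an FFT routine from a numerical library instead of the naive transform shown here.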

[0122] In specific application scenarios, in order to determine the timbre cross feature vector, the cross processing unit 34 includes a convolutional cross module 341, a low-order cross module 342, a linear cross module 343, and a transformation module 344.

[0123] The convolutional cross module 341 can be used to perform cross-convolution processing on the semantic information vector and the timbre feature vector to obtain a first cross vector.

[0124] The low-order cross module 342 can be used to perform low-order cross processing on the semantic information vector and the timbre feature vector to obtain a second cross vector.

[0125] The linear cross module 343 can be used to perform cross-linear processing on the semantic information vector and the timbre feature vector to obtain a third cross vector.

[0126] The transformation module 344 can be used to transform the first cross vector, the second cross vector and the third cross vector using a preset transformation function to obtain the timbre cross feature vector.

[0127] In specific application scenarios, in order to perform timbre transformation on the audio to be converted, the timbre conversion unit 35 includes a second determining module 351 and a timbre conversion module 352.

[0128] The second determining module 351 can be used to acquire Gaussian noise and determine the noise step index for denoising the Gaussian noise.

[0129] The timbre conversion module 352 can be used to input the Gaussian noise, the noise step index and the timbre cross feature vector into a preset timbre conversion model to perform timbre conversion and obtain the target audio.

[0130] In specific application scenarios, in order to perform timbre conversion using a preset timbre conversion model, the timbre conversion module 352 can be used to input the noise step index into the position encoding layer, output a noise reduction feature vector through the position encoding layer; add the noise reduction feature vector to the timbre cross feature vector to obtain a fusion feature vector; input the Gaussian noise into the downsampling layer to obtain a noise feature vector; and input the noise feature vector and the fusion feature vector into the conditional upsampling layer to obtain the target audio.

[0131] It should be noted that other corresponding descriptions of the functional modules involved in the timbre conversion device provided in this embodiment of the invention can be found in the corresponding description of the method shown in Figure 1, and will not be repeated here.

[0132] Based on the method shown in Figure 1 above, this embodiment of the invention accordingly also provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the following steps: acquiring audio data to be converted and a target timbre; inputting the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain a semantic information vector corresponding to the audio data to be converted; determining a timbre feature vector corresponding to the target timbre; performing cross processing on the semantic information vector and the timbre feature vector to obtain a timbre cross feature vector; and inputting the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted.

[0133] Based on the method shown in Figure 1 and the device embodiment shown in Figure 3 above, an embodiment of the invention also provides a physical structure diagram of a computer device. As shown in Figure 5, the computer device includes: a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor 41. Both the memory 42 and the processor 41 are mounted on a bus 43. When the processor 41 executes the program, it performs the following steps: acquiring audio data to be converted and a target timbre; inputting the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain a semantic information vector corresponding to the audio data to be converted; determining the timbre feature vector corresponding to the target timbre; performing cross-processing on the semantic information vector and the timbre feature vector to obtain a timbre cross-feature vector; and inputting the timbre cross-feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted.

[0134] Through the technical solution of this invention, the invention acquires audio data to be converted and a target timbre; inputs the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain a semantic information vector corresponding to the audio data to be converted; simultaneously determines the timbre feature vector corresponding to the target timbre; then performs cross processing on the semantic information vector and the timbre feature vector to obtain a timbre cross feature vector; and finally inputs the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted. In this way, the semantics of the speech in the audio data are analyzed to obtain a semantic information vector, and timbre conversion is then performed based on the semantic information vector and the timbre feature vector corresponding to the target timbre to obtain audio with the target timbre. This allows the converted audio to still contain the tone of the original audio, thereby improving the accuracy of timbre conversion. At the same time, because the audio is analyzed directly, the restriction of phoneme-based analysis to timbre conversion in a single language is avoided. Therefore, this invention can be applied to timbre conversion for any language, thus broadening its applicability. Furthermore, this invention avoids the need to train a corresponding model for timbre conversion in each language, thereby improving the efficiency of timbre conversion.

[0135] It is obvious to those skilled in the art that the modules or steps of the present invention described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they can be implemented using computer-executable program code, thereby storing them in a storage device for execution by a computing device. In some cases, the steps shown or described can be performed in a different order than those presented herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any particular combination of hardware and software.

[0136] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A timbre conversion method, characterized in that it includes: obtaining the audio data to be converted and the target timbre; inputting the audio data to be converted into a preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted; determining the timbre feature vector corresponding to the target timbre; performing cross-convolution processing on the semantic information vector and the timbre feature vector to obtain a first cross vector; performing low-order cross processing on the semantic information vector and the timbre feature vector to obtain a second cross vector; performing cross-linear processing on the semantic information vector and the timbre feature vector to obtain a third cross vector; transforming the first cross vector, the second cross vector, and the third cross vector using a preset transformation function to obtain the timbre cross feature vector; obtaining Gaussian noise and determining the noise step index for denoising the Gaussian noise; and inputting the Gaussian noise, the noise step index, and the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain the target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted.

2. The method according to claim 1, characterized in that, before inputting the audio data to be converted into the preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted, the method further includes: constructing a preset initial semantic prediction model; obtaining sample audio data and the actual semantic information vector corresponding to the sample audio data; inputting the sample audio data into the preset initial semantic prediction model for semantic prediction to obtain a predicted semantic information vector; constructing, based on the actual semantic information vector and the predicted semantic information vector, the loss function corresponding to the preset initial semantic prediction model; and training the preset initial semantic prediction model based on the loss function to construct the preset semantic prediction model.

3. The method according to claim 1, characterized in that the preset semantic prediction model is a preset encoder, which includes an attention layer and a feedforward neural network layer, and the step of inputting the audio data to be converted into the preset semantic prediction model for semantic prediction to obtain the semantic information vector corresponding to the audio data to be converted includes: determining the frequency feature vector corresponding to the audio data to be converted; inputting the frequency feature vector into the attention layer for feature extraction to obtain the first feature vector corresponding to the audio data to be converted; adding the first feature vector and the frequency feature vector to obtain the second feature vector corresponding to the audio data to be converted; and inputting the second feature vector into the feedforward neural network layer for feature extraction to obtain the semantic information vector corresponding to the audio data to be converted.

4. The method according to claim 3, characterized in that determining the frequency feature vector corresponding to the audio data to be converted includes: sequentially subjecting the audio data to be converted to pre-emphasis, frame segmentation, and windowing to obtain the processed audio data; performing a Fourier transform on the processed audio data to obtain the spectrogram corresponding to the audio data to be converted; filtering the spectrogram using a Mel filter bank to obtain a spectrogram with energy waves output by the Mel filter bank, and calculating the logarithm of the energy waves to obtain a logarithmic spectrogram; and performing a discrete cosine transform on the logarithmic spectrogram to obtain the frequency feature vector corresponding to the audio data to be converted.

5. The method according to claim 1, characterized in that the preset timbre conversion model is a preset vocoder model, which consists of a positional coding layer, a downsampling layer, and a conditional upsampling layer, and the step of inputting the Gaussian noise, the noise step index, and the timbre cross-feature vector into the preset timbre conversion model for timbre conversion to obtain the target audio includes: inputting the noise step index into the positional coding layer, and outputting the noise reduction feature vector through the positional coding layer; adding the noise reduction feature vector to the timbre cross feature vector to obtain the fused feature vector; inputting the Gaussian noise into the downsampling layer to obtain a noise feature vector; and inputting the noise feature vector and the fused feature vector into the conditional upsampling layer to obtain the target audio.

6. A timbre conversion device, characterized in that it includes: an acquisition unit, used to acquire the audio data to be converted and the target timbre; a semantic prediction unit, used to input the audio data to be converted into a preset semantic prediction model to perform semantic prediction and obtain the semantic information vector corresponding to the audio data to be converted; a determining unit, used to determine the timbre feature vector corresponding to the target timbre; a cross-processing unit, used to perform cross-convolution processing on the semantic information vector and the timbre feature vector to obtain a first cross vector, perform low-order cross processing on the semantic information vector and the timbre feature vector to obtain a second cross vector, perform cross-linear processing on the semantic information vector and the timbre feature vector to obtain a third cross vector, and use a preset transformation function to transform the first cross vector, the second cross vector, and the third cross vector to obtain the timbre cross feature vector; and a timbre conversion unit, used to acquire Gaussian noise, determine the noise step index for denoising the Gaussian noise, and input the Gaussian noise, the noise step index, and the timbre cross feature vector into a preset timbre conversion model for timbre conversion to obtain target audio, wherein the target audio is audio with the target timbre and the original tone of the audio data to be converted.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.

8. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, it implements the steps of the method according to any one of claims 1 to 5.