Audio synthesis method and device, training method and device, electronic device and medium
By generating phoneme sequences and using dynamic temporal warping algorithms, combined with a target acoustic model and synthesizer, the problem of obtaining clean singing speech is solved, achieving high-quality audio synthesis effects and simplifying the model training process.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA MOBILE (SUZHOU) SOFTWARE TECH CO LTD
- Filing Date
- 2022-10-31
- Publication Date
- 2026-06-16
Figure CN116778904B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of audio technology, and in particular to an audio synthesis method and apparatus, a training method and apparatus, an electronic device, and a storage medium.
Background Technology
[0002] In related technologies, singing synthesis is mainly based on two methods.
[0003] The first method requires a large amount of clean singing speech. The clean singing speech is input into a neural network model, and the neural network model is used to fit the nonlinear transformation relationship from MIDI node features to singing acoustic features.
[0004] The second method is to generate singing acoustic features based on the signal level. By using signal processing, the song is converted into features, and then the singing features are converted according to the duration, pitch and other features in the song.
[0005] In the first method, because clean singing voice is difficult to obtain, noise interference in the singing voice can easily lead to poor model training results. The second method suffers from unsatisfactory synthesis results, limiting its application.
[0006] Therefore, providing an audio synthesis method that is easy to implement and produces good synthesis results is a technical problem that urgently needs to be solved.
Summary of the Invention
[0007] This disclosure provides an audio synthesis method, apparatus, electronic device, and storage medium to improve the audio synthesis effect and/or simplify audio synthesis.
[0008] The first aspect of this disclosure provides an audio synthesis method, including:
[0009] A phoneme sequence is generated based on target text containing pauses, wherein the phoneme sequence includes: one or more phoneme elements; wherein the phoneme elements include pause features;
[0010] The phoneme sequence is input into the target acoustic model to obtain the first acoustic feature; wherein the target acoustic model is obtained by training a base model using sample data from the target user, and the base model is trained using sample data from multiple sample users; wherein the sample data includes: audio data and text corresponding to the audio data;
[0011] Based on the first acoustic feature, the text features of the target text, the reference audio corresponding to the target text, and the synthesizer, a target audio with the pronunciation characteristics of the target user is synthesized.
[0012] Based on the above scheme, the step of inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes:
[0013] The phoneme sequence and the identification information of a specific user are input into the target acoustic model to obtain the first acoustic feature, wherein the specific user is the sample user that satisfies a similarity condition with the target user.
[0014] Based on the above scheme, the step of inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes:
[0015] The phoneme sequence is input into the target acoustic model to obtain phoneme embedding features;
[0016] The phoneme embedding features are processed by a pre-processing network (prenet) to obtain dimension-raised features;
[0017] The dimension-raised features are processed using one or more convolutional modules to obtain the first convolutional feature;
[0018] The first acoustic feature is generated based on the first convolutional feature.
[0019] Based on the above scheme, the step of inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes:
[0020] Generate encoded features based on the first convolutional features;
[0021] The encoded features are processed using one or more convolutional modules to obtain second convolutional features;
[0022] The first acoustic feature is generated based on the second convolutional features.
[0023] Based on the above scheme, the step of synthesizing target audio with the pronunciation characteristics of the target user based on the first acoustic feature, the text features of the target text, the reference audio corresponding to the target text, and the synthesizer includes:
[0024] A first audio signal is synthesized based on the first acoustic feature;
[0025] The conversion relationship is obtained based on the Dynamic Time Warping (DTW) algorithm, the first duration information of the first audio, and the second duration information of the reference audio; wherein the second audio features include: textual features and/or frequency features of the dry audio;
[0026] According to the conversion relationship, the first acoustic feature is converted into the second acoustic feature;
[0027] The target audio is synthesized based on the second acoustic feature.
[0028] Based on the above scheme, the step of synthesizing the target audio according to the second acoustic feature includes:
[0029] A third audio signal is synthesized based on the second acoustic feature;
[0030] Based on the dry sound distribution position of the reference audio, background audio is mixed into the third audio to obtain the target audio.
[0031] Based on the above scheme, the first acoustic feature includes at least one of the following: fundamental frequency, spectral features, and aperiodic spectral features.
[0032] A second aspect of this disclosure provides a method for training a target acoustic model, the method comprising:
[0033] The base model is obtained by training a preset model using sample data from multiple sample users;
[0034] A base model is trained using sample data from the target user to obtain the target acoustic model; wherein, the sample data includes: audio data and text corresponding to the audio data;
[0035] The target acoustic model is used to synthesize target audio with the pronunciation characteristics of the target user based on the target text and the corresponding reference audio.
[0036] Based on the above scheme, the method further includes:
[0037] Based on the sample data of the target user and the sample data of the sample user, the sample user who meets the similarity conditions with the target user is determined; wherein, the user identifier of the sample user who meets the similarity conditions with the target user and the target text are used by the target acoustic model to synthesize the target audio.
[0038] A third aspect of this disclosure provides an audio synthesis apparatus, comprising:
[0039] A generation module is used to generate a phoneme sequence based on target text containing pauses, wherein the phoneme sequence includes: one or more phoneme elements; wherein the phoneme elements include pause features;
[0040] A module configured to input the phoneme sequence into the target acoustic model to obtain the first acoustic feature; wherein the target acoustic model is obtained by training a base model using sample data from the target user, and the base model is trained using sample data from multiple sample users; wherein the sample data comprises: audio data and text corresponding to the audio data;
[0041] The synthesis module is used to synthesize target audio with the pronunciation characteristics of the target user based on the first acoustic feature, the text features of the target text, the reference audio corresponding to the target text, and the synthesizer.
[0042] A fourth aspect of this disclosure provides a target model training apparatus, the apparatus comprising:
[0043] The first training module is used to train a preset model using sample data from multiple sample users to obtain the base model.
[0044] The second training module is used to train a base model using sample data from the target user to obtain a target acoustic model; wherein, the sample data includes: audio data and text corresponding to the audio data;
[0045] The target acoustic model is used to synthesize target audio with the pronunciation characteristics of the target user based on the target text and the corresponding reference audio.
[0046] A fifth aspect of this disclosure provides an electronic device, the electronic device comprising:
[0047] Memory;
[0048] A processor, connected to the memory, is configured to implement the audio synthesis method provided by either the first or second aspect by executing computer-executable instructions stored in the memory.
[0049] A sixth aspect of this disclosure provides a computer storage medium storing computer-executable instructions; when executed by a processor, the computer-executable instructions can implement the audio synthesis method provided by any of the technical solutions of the first or second aspect.
[0050] The technical solution provided in this disclosure starts from a base model trained on multiple users and then trains, with a small amount of sample data from the target user, a target acoustic model that can simulate the pronunciation characteristics of the target user. In this way, the target acoustic model can be obtained without acquiring a large amount of dry audio data of the target user. Furthermore, because the target acoustic model processes the target text, it can obtain the acoustic features of the target user speaking or chanting the target text. Then, using a synthesizer, based on the acoustic features and the reference audio of the target text, a high-quality target audio can be synthesized, thereby improving the audio effect of the target audio.
Attached Figure Description
[0051] Figure 1 is a schematic flowchart of an audio synthesis method provided in an embodiment of the present disclosure;
[0052] Figure 2 is a schematic flowchart of an audio synthesis method provided in an embodiment of the present disclosure;
[0053] Figure 3 is a schematic flowchart of an audio synthesis method provided in an embodiment of the present disclosure;
[0054] Figure 4 is a schematic diagram of the structure of a target acoustic model provided in an embodiment of the present disclosure;
[0055] Figure 5 is a schematic diagram of a DTW diagram provided in an embodiment of this disclosure;
[0056] Figure 6 is a schematic diagram of acoustic feature stretching (expansion) provided in an embodiment of this disclosure;
[0057] Figure 7 is a schematic flowchart of a target acoustic model training method provided in an embodiment of this disclosure;
[0058] Figure 8 is a schematic flowchart of an audio synthesis method provided in an embodiment of this disclosure;
[0059] Figure 9 is a schematic diagram of the structure of an audio synthesis device provided in an embodiment of this disclosure;
[0060] Figure 10 is a schematic diagram of the structure of a target acoustic model training device provided in an embodiment of this disclosure;
[0061] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0062] To gain a more detailed understanding of the features and technical content of this disclosure, the implementation of this disclosure will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for reference and illustration only and are not intended to limit this disclosure.
[0063] As shown in Figure 1, this disclosure provides an audio synthesis method, including:
[0064] S1110: Generate a phoneme sequence based on target text containing pauses, wherein the phoneme sequence includes: one or more phoneme elements; wherein the phoneme elements include pause features;
[0065] S1120: Input the phoneme sequence into the target acoustic model to obtain the first acoustic feature; wherein the target acoustic model is obtained by training a base model using sample data from the target user, and the base model is trained using sample data from multiple sample users; wherein the sample data includes: audio data and text corresponding to the audio data;
[0066] S1130: Based on the first acoustic feature, the text feature of the target text, the reference audio corresponding to the target text, and the synthesizer, synthesize a target audio with the pronunciation characteristics of the target user.
[0067] The pause can be represented by any character that indicates a pause. For example, the pause character may be a rest used in musical notation.
[0068] The target text can be the lyrics of the song to be synthesized.
[0069] The target text is converted into a phoneme sequence. A phoneme sequence can include multiple phoneme elements. These phoneme elements are generated based on the text's pronunciation features. Pause features are generated based on pauses.
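The conversion described above can be illustrated with a minimal sketch. The pause symbols, the pause token name, and the two-entry lexicon below are hypothetical placeholders and are not taken from this disclosure; a practical system would rely on a full grapheme-to-phoneme front end.

```python
# Hypothetical pause symbols, pause token, and pronunciation lexicon.
PAUSE_CHARS = {",", "-"}
PAUSE_TOKEN = "<pau>"
LEXICON = {"zhong": ["zh", "ong"], "guo": ["g", "uo"]}

def text_to_phonemes(tokens):
    """Map text tokens (syllables or pause marks) to a list of phoneme elements."""
    phonemes = []
    for tok in tokens:
        if tok in PAUSE_CHARS:
            # A pause in the text becomes a pause feature in the sequence.
            phonemes.append(PAUSE_TOKEN)
        else:
            # A syllable is expanded into its phonemes.
            phonemes.extend(LEXICON[tok])
    return phonemes

sequence = text_to_phonemes(["zhong", ",", "guo"])
```

Here the pause mark between the two syllables yields one pause element between their phonemes.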
[0070] After obtaining the phoneme sequence, inputting it into the target acoustic model will yield the first acoustic feature output by the target acoustic model. This first acoustic feature characterizes the pronunciation features when the target user speaks or chants the target text.
[0071] The target acoustic model can be obtained by taking a base model trained on data from a large number of sample users and further training it with sample data from the target user. Because the base training has already been completed with the sample users' data, the subsequent training needs only a small amount of sample data from the target user and only adjusts a few parameters and/or fine-tunes the base model's parameters. The resulting model can output acoustic features for converting text into audio with the target user's pronunciation characteristics. Thus, only a small amount of sample data needs to be collected from each target user. Furthermore, since the same base model can be reused for different target users, the cost of training the base model is small when shared across multiple target users, which makes the approach easy to implement.
[0072] For example, the electronic device is delivered to the target user with the model parameters of the base model built in, and it collects sample data of the target user while the user makes calls or records audio with the device.
[0073] For example, when the server provides audio synthesis services online, it imports the model parameters of the base model, then instructs the target user to record a small number of sample audios, and then optimizes the base model corresponding to each target user to obtain the target acoustic model.
[0074] In this embodiment, the target acoustic model is not an end-to-end audio synthesis model. In this embodiment, after obtaining the first acoustic features, the target text and the corresponding reference audio are combined to obtain the target audio when spoken or sung by the target user. In this embodiment, the synthesizer is independent of the target acoustic model, thus allowing the use of any synthesizer provided in related technologies without training a neural network to synthesize audio, thereby further reducing the training load on the target acoustic model.
[0075] The reference audio may be: the audio of the anchor reading the target text, and / or the audio of the singer performing the song corresponding to the target text.
[0076] In this embodiment of the disclosure, a target acoustic model that can simulate the pronunciation characteristics of the target user is trained by starting from a base model trained on multiple users and then using a small amount of sample data from the target user. In this way, the target acoustic model can be obtained without acquiring a large amount of dry audio data of the target user. Since the target acoustic model processes the target text, it can obtain the acoustic features of the target user speaking or chanting the target text. Then, using a synthesizer, based on the acoustic features and the reference audio of the target text, a high-quality target audio can be synthesized, thereby improving the audio effect of the target audio.
[0077] In some embodiments, inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes:
[0078] The phoneme sequence and the identification information of a specific user are input into the target acoustic model to obtain the first acoustic feature, wherein the specific user is the sample user that satisfies a similarity condition with the target user.
[0079] For example, the specific user includes at least one of the following:
[0080] The user whose voice features are most similar to those of the target user among the sample users;
[0081] The user whose pronunciation features are most similar to those of the target user among the sample users;
[0082] The user among the sample users with the highest overall similarity to the target user when pronunciation features and voice features are considered together.
[0083] For example, voice characteristics depend more on the user's physiological characteristics such as vocal cords; pronunciation characteristics may also depend on the user's pronunciation habits.
[0084] Determining the specific user may include:
[0085] The audio data of the target user and the sample user are input into the feature extraction model to obtain the first feature and the second feature, respectively; the first feature is the pronunciation feature and / or voice feature of the target user; the second feature is the pronunciation feature and / or voice feature of the sample user.
[0086] Calculate the similarity between the first feature and the second feature.
[0087] In one embodiment, the second features of each sample user are obtained in advance, so that the second features do not need to be obtained on a temporary basis.
[0088] In another embodiment, for example, the electronic device obtains the second features of the sample users in advance and clusters them to obtain cluster features. First, the first feature of the target user is compared with the cluster features to determine the cluster to which the target user belongs. Then, the first feature of the target user is compared with the second features of the sample users in that cluster to quickly identify the specific user.
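The two-step search can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is assumed as the similarity measure, the cluster feature is assumed to be the mean of its members, and the function name and the two-dimensional features are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed measure)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_specific_user(target_feat, sample_feats, cluster_ids, centroids):
    """Return the index of the sample user most similar to the target user.

    sample_feats: (num_users, dim) second features, computed in advance.
    cluster_ids:  (num_users,) cluster label of each sample user.
    centroids:    (num_clusters, dim) cluster features.
    """
    # Step 1: pick the cluster whose feature is closest to the target user.
    best_cluster = max(range(len(centroids)),
                       key=lambda c: cosine(target_feat, centroids[c]))
    # Step 2: compare only against the sample users inside that cluster.
    members = np.flatnonzero(cluster_ids == best_cluster)
    best = max(members, key=lambda u: cosine(target_feat, sample_feats[u]))
    return int(best)

sample_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
cluster_ids = np.array([0, 0, 1, 1])
centroids = np.array([sample_feats[cluster_ids == c].mean(axis=0) for c in (0, 1)])
specific_user = find_specific_user(np.array([0.2, 0.8]), sample_feats,
                                   cluster_ids, centroids)
```

Comparing against cluster features first means that only the members of one cluster are scored individually, which is where the speed-up comes from.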
[0089] Using a specific user's identifier as input to the target acoustic model allows the target acoustic model to automatically generate acoustic features of the target text that contain the target user's pronunciation characteristics and / or voice features.
[0090] In some embodiments, as shown in Figure 2, S1120 may include:
[0091] S1121: Input the phoneme sequence into the target acoustic model to obtain phoneme embedding features;
[0092] S1122: Process the phoneme embedding features with a pre-processing network (prenet) to obtain dimension-raised features;
[0093] S1123: Process the dimension-raised features using one or more convolutional modules to obtain the first convolutional feature;
[0094] S1124: Generate the first acoustic feature based on the first convolutional feature.
[0095] For example, the phoneme sequence can be converted into phoneme embedding features through the embedding layer of the target model.
[0096] The dimension-raised features are obtained through one or more processing steps in the prenet. For example, the low-dimensional phoneme embedding features of phonemes from different languages can be merged into high-dimensional features through one or more connection layers (e.g., fully connected layers) in the prenet. The length of a dimension-raised feature corresponds to the length of the phoneme embedding feature.
[0097] After the dimension-raised features are obtained, convolutional features are obtained through one or more convolutional modules. Using convolutional modules for further feature optimization, instead of a CBHG module (convolution bank, highway network, and gated recurrent unit), keeps the implementation simple.
[0098] For example, each convolutional module can be a convolutional layer. For example, the first convolutional feature is obtained by transforming the dimension-raised features with three cascaded convolutional layers.
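The data flow of S1121 to S1123 can be sketched with random, untrained weights to show how the feature shapes change. All sizes, the ReLU activation, and the function names are assumptions made for illustration; they are not specified in this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PHONEMES, EMB_DIM, UP_DIM, KERNEL = 50, 8, 32, 5  # illustrative sizes only

embedding = rng.normal(size=(NUM_PHONEMES, EMB_DIM))
prenet_w = rng.normal(size=(EMB_DIM, UP_DIM)) * 0.1
conv_kernels = [rng.normal(size=(KERNEL, UP_DIM, UP_DIM)) * 0.05 for _ in range(3)]

def conv1d_same(x, kernel):
    """1-D convolution over time, zero-padded so the length is unchanged."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + k]                       # (k, in_dim)
        out[t] = np.einsum("ki,kio->o", window, kernel)
    return out

def encode(phoneme_ids):
    x = embedding[phoneme_ids]                         # phoneme embedding features
    x = np.maximum(x @ prenet_w, 0.0)                  # prenet: fully connected + ReLU
    for kernel in conv_kernels:                        # three cascaded conv layers
        x = np.maximum(conv1d_same(x, kernel), 0.0)
    return x                                           # first convolutional feature

features = encode(np.array([3, 17, 0, 42]))
```

The sequence length (4 phonemes) is preserved while the feature dimension rises from 8 to 32.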
[0099] The first acoustic feature may include at least frequency features. The frequency features may at least reflect the pitch and / or timbre of the sample user.
[0100] For example, the first acoustic feature may include at least: fundamental frequency, frequency distribution features, periodic frequency features and / or non-periodic frequency features.
[0101] Further, S1124 may include:
[0102] S1224: Generate encoded features based on the first convolutional features;
[0103] S1324: Process the encoded features using one or more convolutional modules to obtain second convolutional features;
[0104] S1424: Generate the first acoustic feature based on the second convolution feature.
[0105] In one embodiment, feature encoding is performed using one or more encoding layers and first convolutional features to obtain encoded features. The encoded features are then processed by convolution through one or more convolutional modules to obtain second convolutional features.
[0106] After optimizing the encoded features through one or more convolutional modules, acoustic features (i.e., the first acoustic features) that better reflect the target user's reading or chanting of the target text can be obtained.
[0107] In some embodiments, S1224 may include:
[0108] The first convolutional feature is processed through a Location-Sensitive Attention (LSA) layer.
[0109] After the first convolutional features pass through the LSA layer, they are input into another pre-network; this pre-network may also include one or more fully connected layers.
[0110] The features from the previous prenet step are input into the attention mechanism recurrent neural network (RNN) to obtain processed features. These processed features and the aforementioned first convolutional features are used as inputs to the LSA layer, and the output of the LSA layer is fed into the aforementioned prenet.
[0111] Meanwhile, the output of the attention RNN will be input into the decoder RNN, and the output of the decoder RNN will be input into one or more convolutional modules. After processing by the convolutional modules, the second convolutional feature will be obtained.
[0112] The aforementioned attention RNN is a neural network consisting of an attention mechanism layer added to the input and / or output of an RNN.
[0113] Of course, the above is just one example of generating the second convolutional features, and the actual implementation is not limited to it.
[0114] The aforementioned encoding can be performed by an RNN decoder or by a decoder of another type of deep learning model, for example a Long Short-Term Memory (LSTM) network.
[0115] For example, after the identification information of the specific user is encoded into embedded features, the embedded features are processed through one or more fully connected (FC) layers and then through activation function layers, and are then input into the aforementioned prenet and encoder respectively. Using the identification information of the specific user as input in this way makes it easier for the target acoustic model to produce first acoustic features that better reflect the target user.
[0116] Figure 4 is a schematic diagram of a target acoustic model.
[0117] In Figure 4, the character embedding refers to the embedding features of the phoneme sequence, i.e., the phoneme embedding features.
[0118] In the diagram, 3-Conv represents 3 convolutional modules.
[0119] The aforementioned activation function layer is where the activation function is executed. For example, as shown in Figure 4, the activation function can be softsign.
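For reference, softsign maps an input x to x / (1 + |x|), so its output is bounded between -1 and 1. A minimal sketch:

```python
import numpy as np

def softsign(x):
    """Softsign activation: x / (1 + |x|), bounded in (-1, 1)."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.abs(x))

values = softsign([-3.0, 0.0, 1.0])
```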
[0120] Of course, the above are just examples of the first acoustic feature, and the actual implementation is not limited to the examples above.
[0121] In some embodiments, synthesizing target audio with the pronunciation characteristics of the target user based on the first acoustic feature, the text features of the target text, the reference audio corresponding to the target text, and a synthesizer includes:
[0122] A first audio signal is synthesized based on the first acoustic feature;
[0123] The transformation relationship is obtained based on the Dynamic Time Warping (DTW) algorithm, the first duration information of the first audio, and the second duration information of the reference audio;
[0124] According to the conversion relationship, the first acoustic feature is converted into the second acoustic feature;
[0125] The target audio is synthesized based on the second acoustic feature.
[0126] After obtaining the first acoustic feature, the first acoustic feature is input into a synthesizer, and the synthesizer outputs the first audio signal. The synthesizer can be a Sakura, Harmor, or Sylenth1 synthesizer, etc. In some embodiments, the synthesizer can specifically be a WORLD synthesizer (Vocoder).
[0127] At this point, the duration of the synthesized first audio differs from the duration required for the target audio. In this embodiment, the DTW algorithm is used to determine the mapping relationship between the first audio and the reference audio.
[0128] Based on the DTW algorithm, the first duration information of the first audio, and the second duration information of the reference audio, the conversion relationship is obtained, which may include:
[0129] The duration ratio is determined based on the first duration information and the second duration information;
[0130] Based on the duration ratio, the extension of the first acoustic feature is determined;
[0131] The extended first acoustic feature and the acoustic feature of the reference audio are mapped onto a DTW diagram to obtain the DTW path. The curve corresponding to the DTW path represents the transformation relationship.
[0132] For example, if the first duration information indicates that the first audio has 10 frames and the second duration information indicates that the reference audio has 20 frames, then the duration ratio can be 2.
[0133] The first acoustic feature of the nth frame is extended to the first acoustic features of the (2n-1)th and 2nth frames, where n is a positive integer.
[0134] In some embodiments, the first duration information of the first audio and the second duration information of the reference audio are obtained using the alignment technique of Automatic Speech Recognition (ASR).
[0135] The first duration information and the second duration information may be specifically the number of frames and / or a specific duration value, such as Y seconds.
[0136] The first acoustic features of frame 2n-1 and frame 2n are completely identical. Here, 2 can be replaced by any duration ratio x. In this embodiment, the duration ratio is the ratio of the second duration information to the first duration information.
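The frame extension in the 10-frame and 20-frame example can be sketched as below. The function name is hypothetical, and the sketch assumes an integer duration ratio as in that example; a non-integer ratio would need a different repetition rule, which this disclosure does not detail.

```python
import numpy as np

def extend_frames(features, first_frames, second_frames):
    """Repeat each frame of the first acoustic feature by the duration ratio.

    features: (first_frames, dim) array.
    Assumes second_frames is an integer multiple of first_frames.
    """
    ratio = second_frames // first_frames
    # Frame n (1-based) becomes frames ratio*(n-1)+1 .. ratio*n of the output.
    return np.repeat(features, ratio, axis=0)

first = np.arange(10, dtype=float).reshape(10, 1)
extended = extend_frames(first, first_frames=10, second_frames=20)
```

With a ratio of 2, output frames 2n-1 and 2n are identical copies of input frame n.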
[0137] For example, a DTW diagram is constructed based on the acoustic features of the reference audio and the first acoustic features of the first audio after expansion. For instance, the acoustic features of the reference audio are mapped onto the X-axis of the DTW diagram, and the first acoustic features of the first audio after expansion are mapped onto the Y-axis of the DTW diagram. Thus, the curve formed on the DTW diagram represents the aforementioned transformation relationship.
[0138] For example, the DTW algorithm may include the traditional DTW algorithm and the fast DTW algorithm.
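A minimal sketch of the traditional DTW algorithm follows; it is a textbook formulation with Euclidean frame distance, assumed here for illustration, and is not code from this disclosure. It returns the DTW path as index pairs between the reference frames and the extended first-audio frames.

```python
import numpy as np

def dtw_path(x, y):
    """Classic DTW between sequences x (n, dim) and y (m, dim).

    Returns the warping path as a list of (i, j) index pairs.
    """
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the last cell to recover the path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

ref = np.array([[0.0], [1.0], [2.0], [2.0]])   # stand-in reference features
syn = np.array([[0.0], [1.0], [1.0], [2.0]])   # stand-in extended first features
path = dtw_path(ref, syn)
```

The fast DTW variant mentioned above approximates the same path at lower cost and is not shown here.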
[0139] Based on the transformation relationship, the first acoustic feature is stretched or otherwise altered to obtain the transformed second acoustic feature.
[0140] For example, mapping the acoustic features of the reference audio to the X-axis of the DTW graph and mapping the expanded first acoustic features of the first audio to the Y-axis of the DTW graph may include:
[0141] Mapping the spectral features (sp) of the reference audio onto the X-axis of the DTW graph, and mapping the expanded spectral features (sp) of the first audio onto the Y-axis of the DTW graph, will yield the DTW path of sp (i.e., the transformation relationship of sp).
[0142] And / or,
[0143] Mapping the aperiodic frequency features (ap) of the reference audio onto the X-axis of the DTW graph, and mapping the extended aperiodic frequency features (ap) of the first audio onto the Y-axis of the DTW graph, will yield the DTW path of ap (i.e., the transformation relationship of ap).
[0144] After the conversion relationship of sp and the conversion relationship of ap are obtained, the sp and ap of the first acoustic feature are stretched according to their respective conversion relationships. The result is the second acoustic feature, i.e., the first acoustic feature stretched so that its duration equals the duration indicated by the second duration information.
[0145] The second acoustic feature, whose duration equals the duration indicated by the second duration information, is input into the synthesizer, which outputs the dry-sound target audio.
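One way to apply a DTW path to stretch a feature is sketched below. The function name, the example path, and the rule for resolving several candidates per reference frame are assumptions for illustration only.

```python
import numpy as np

def warp_to_reference(first_feature, path, num_ref_frames):
    """Re-time first_feature so that it has one frame per reference frame.

    path: list of (ref_index, first_index) pairs from a DTW alignment.
    When several first-audio frames map to one reference frame, the last
    one on the path is kept (an arbitrary choice for this sketch).
    """
    mapping = np.zeros(num_ref_frames, dtype=int)
    for ref_idx, first_idx in path:
        mapping[ref_idx] = first_idx
    return first_feature[mapping]

sp = np.array([[10.0], [20.0], [30.0]])            # stand-in spectral frames
path = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 2)]    # hypothetical DTW path
second_sp = warp_to_reference(sp, path, num_ref_frames=5)
```

The same function would be applied separately to sp and to ap, each with its own path.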
[0146] Figure 5 shows a DTW diagram, in which the curve is one of the aforementioned DTW paths.
[0147] In some embodiments, synthesizing the target audio based on the second acoustic feature includes:
[0148] A third audio signal is synthesized based on the second acoustic feature;
[0149] Based on the dry sound distribution position of the reference audio, background audio is mixed into the third audio to obtain the target audio.
[0150] The dry sound mentioned in this embodiment refers to the human voice and does not involve background sounds such as instrument sounds.
[0151] In this embodiment of the disclosure, inputting the second acoustic feature back into the synthesizer will generate a third audio signal.
[0152] In one embodiment, the third audio can be used directly as the target audio.
[0153] In another embodiment, the target audio may include a mixture of background audio and a third audio.
[0154] For example, based on the positions of the dry sound, when there is background audio, the third audio and the background audio are mixed at the positions where dry sound is present, and only the background audio is retained at the positions where no dry sound is present.
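The position-dependent mixing can be sketched as follows, assuming the dry sound positions are available as a per-sample boolean mask. The function name and the gain values are hypothetical.

```python
import numpy as np

def mix_with_background(vocal, background, dry_mask, vocal_gain=1.0, bg_gain=0.5):
    """Mix the synthesized vocal into the background only at dry sound positions.

    vocal, background: 1-D sample arrays of equal length.
    dry_mask: boolean array, True where the reference audio contains voice.
    """
    mixed = bg_gain * background
    # Outside the dry sound positions only the background remains.
    mixed = mixed + np.where(dry_mask, vocal_gain * vocal, 0.0)
    return mixed

vocal = np.ones(6)
background = np.full(6, 0.2)
dry_mask = np.array([False, False, True, True, False, False])
target = mix_with_background(vocal, background, dry_mask)
```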
[0155] This method allows for the generation of target audio with better sound quality with simple operation.
[0156] In some embodiments, the method further includes:
[0157] The target audio is post-processed to obtain target audio with further optimized sound quality.
[0158] The post-processing includes, but is not limited to, at least one of the following:
[0159] Perform equalization processing on the target audio to make it more robust;
[0160] The target audio is filtered (e.g., low-pass filtered) to eliminate glitches or loud background noise, thereby optimizing the target audio.
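As an illustration of the low-pass filtering option, the sketch below uses a moving-average filter, which is one of the simplest low-pass filters. This choice and the window length are assumptions; this disclosure does not specify the filter design.

```python
import numpy as np

def moving_average_lowpass(signal, window=5):
    """Simple FIR low-pass filter: average over a sliding window of samples."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

signal = np.zeros(11)
signal[5] = 1.0                      # a single-sample glitch
smoothed = moving_average_lowpass(signal)
```

The single-sample glitch is spread over the window, so its peak amplitude drops by a factor equal to the window length.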
[0161] In one embodiment, the first acoustic feature further includes a fundamental frequency, and the fundamental frequency in the second acoustic feature may be the fundamental frequency of the first acoustic feature or the one obtained after processing the fundamental frequency of the first acoustic feature.
[0162] For example, the processing of the fundamental frequency of the first acoustic feature may include:
[0163] Performing interpolation processing on the fundamental frequency.
[0164] Specifically, the interpolation processing on the fundamental frequency includes:
[0165] For fundamental frequency values equal to 0, setting them to a preset non-zero value, which can be any preset value close to 0, and performing linear interpolation on the substituted non-zero values;
[0166] For fundamental frequency values not equal to 0, constructing a DTW path of the fundamental frequency (f0) according to the aforementioned duration ratio, and interpolating the fundamental frequency based on the DTW path of f0 to obtain an extended fundamental frequency.
[0167] Assume the "zhong" in the current text. According to phoneme division, it can be divided into "zh" and "ong". According to voiceless / voiced (U / V) division, "zh" is voiceless with a fundamental frequency of 0, and "ong" is voiced with a non-zero fundamental frequency. For those with a fundamental frequency of 0 like "zh", linear interpolation is performed, and for those with a non-zero fundamental frequency like "ong", DTW interpolation is performed. The DTW interpolation here is to perform interpolation according to the DTW path.
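The linear branch of this processing, used for voiceless phonemes such as "zh", can be sketched as resampling a per-frame track to a new frame count. This is only a sketch under assumptions: the function name and the idea of passing a target frame count derived from the duration ratio are illustrative and are not specified in this form by the disclosure.

```python
def stretch_linear(values, new_len):
    """Linearly resample a per-frame track to `new_len` frames."""
    if not values or new_len <= 0:
        raise ValueError("values must be non-empty and new_len positive")
    n = len(values)
    if n == 1 or new_len == 1:
        return [float(values[0])] * new_len
    out = []
    for k in range(new_len):
        # Position of output frame k on the input frame axis.
        pos = k * (n - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


if __name__ == "__main__":
    # A 2-frame segment stretched to 3 frames.
    print(stretch_linear([0.0, 2.0], 3))
```

The endpoints of the stretched segment equal the endpoints of the original segment, so adjacent phoneme segments remain continuous after stretching.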
[0168] In some embodiments, the first acoustic feature includes at least one of the following: fundamental frequency, spectral feature, and / or aperiodic spectral feature.
[0169] As shown in Figure 7, an embodiment of the present disclosure provides a method for training a target acoustic model. The method includes:
[0170] S2110: Training a preset model using sample data of multiple sample users to obtain a base model;
[0171] S2120: Training the base model using sample data of the target user to obtain a target acoustic model; wherein, the sample data includes: audio data and text corresponding to the audio data;
[0172] Wherein, the target acoustic model is used to synthesize a target audio with the pronunciation characteristics of the target user according to the target text and a reference audio corresponding to the target text.
[0173] This disclosure provides a method for training a target acoustic model. First, a preset model is trained using sample data from multiple sample users to obtain a base model. For example, in this disclosure, the preset model can be a Tacotron model in which the CBHG module is replaced with one or more convolutional modules.
[0174] Then, the base model is further trained (or optimized) using sample data from the specific target user, thus obtaining the target acoustic model. This reduces the amount of model training for a single user and lowers the implementation difficulty.
[0175] In one embodiment, the method further includes:
[0176] Based on the sample data of the target user and the sample data of the sample user, the sample user who meets the similarity conditions with the target user is determined; wherein, the user identifier of the sample user who meets the similarity conditions with the target user and the target text are used by the target acoustic model to synthesize the target audio.
[0177] When training the preset model using the sample users, the sample users are encoded to obtain their identification information.
[0178] Furthermore, during the training of the base model, the sample users who meet the similarity condition with the target user are determined. These sample users are the specific users mentioned in the aforementioned embodiments.
[0179] For example, the specific user includes at least one of the following:
[0180] The user whose voice features are most similar to those of the target user among the sample users;
[0181] The user whose pronunciation features are most similar to those of the target user among the sample users;
[0182] The user with the highest overall similarity to the target user among the sample users, taking both pronunciation features and voice features into account.
[0183] For example, voice characteristics depend more on the user's physiological characteristics such as vocal cords; pronunciation characteristics may also depend on the user's pronunciation habits.
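The selection of the most similar sample user can be sketched as follows. The disclosure does not state how similarity is measured; this sketch assumes that each user is summarized by a fixed-length feature vector and that cosine similarity is used, both of which are assumptions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_user(target_vec, sample_vecs):
    """Return the ID of the sample user whose vector is closest to the target.

    sample_vecs: dict mapping a sample user ID to that user's feature vector
                 (hypothetical representation of voice/pronunciation features).
    """
    return max(sample_vecs, key=lambda uid: cosine_similarity(target_vec, sample_vecs[uid]))


if __name__ == "__main__":
    samples = {"speaker_a": [0.9, 0.1], "speaker_b": [0.0, 1.0]}
    print(most_similar_user([1.0, 0.0], samples))
```

The returned ID plays the role of the user identifier of the specific user that is supplied to the target acoustic model together with the target text.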
[0184] The trained target acoustic model may be as shown in Figure 4.
[0185] The text phoneme sequence shown in Figure 6 is the phoneme sequence converted from the aforementioned target text.
[0186] The synthesis duration feature is the first duration information of the synthesized first speech. UV prediction predicts whether each phoneme in the phoneme sequence is a voiceless (U) or a voiced (V) sound. Pitch U / V adjustment is performed based on UV to output the pitch feature. This pitch can be represented by the aforementioned fundamental frequency. The extended fundamental frequency is obtained through pitch U / V adjustment.
[0187] If the reference audio is a song, the song's text phoneme sequence and singing duration features are obtained through feature mapping. Based on the transformation relationship obtained from the mapping, the features output by the target acoustic model are transformed to obtain the expanded AP features and the expanded SP features.
[0188] As shown in Figure 8, this disclosure provides a complete system for lyric resource creation, model training, and lyric conversion and synthesis.
[0189] A corpus refers to a dataset in which a user records audio for only a small amount of fixed text; this is sufficient to sing all the songs in the prepared resource library. The corpus can contain 20 or more audio samples; naturally, the larger the corpus, the better the conversion and synthesis effect. For example, a corpus can contain fewer than 50 audio samples. This is, of course, only an example.
[0190] Resource creation may include, but is not limited to, the following operations:
[0191] The song resource library mainly stores the following resources: musical pitch features, duration features, background music, lyrics with pauses, and acoustic features extracted by the World vocoder. The musical pitch features may include, but are not limited to, the fundamental frequency (F0) and / or the pitch. These musical pitch features can be obtained through frame-by-frame processing. The duration features represent the frame length of the song corresponding to each word or phoneme; this embodiment uses the frame length corresponding to each phoneme. This embodiment can use the open-source tool Spleeter to extract the background music; having clean dry audio and background music available directly is even better.
[0192] Song resource production requires: a song with mixed background music (or dry vocals already separated from the background music), a lyrics file, and either a MidiNode file for the music or manual annotation of the duration.
[0193] The resource creation process may include, but is not limited to, the following operations:
[0194] 1. If using a music file with mixed background music, use the open-source tool Spleeter to separate the background music, obtaining the dry audio and the background music. If only segments of a song are needed to create resources, use the Praat audio annotation tool (not limited to Praat; lyc files with lyrics can also be used) to mark the song segments, that is, select a portion of the entire song as the song resource. Then, use a script tool to separate the dry audio and background music.
[0195] 2. If there is a MidiNode file, extract the lyrics and the time corresponding to each character from the MidiNode. Then, according to the statistical rules, obtain the ratio of the initial consonant to the final vowel for each character (also called a syllable in Chinese), and then obtain the duration of the initial consonant and final vowel. This duration is a rough estimate.
[0196] If no MidiNode file is available, a rough duration feature is obtained using an alignment model based on Automatic Speech Recognition (ASR) technology. Pause locations in the song lyrics can be extracted using MidiNode; if no MidiNode is available, pauses are manually marked.
[0197] 3. Use the Praat tool to perform fine annotation of duration features, and accurately record the duration of the initial and final phonemes in the syllables of each lyric word.
[0198] 4. Extract the song's fundamental frequency using Yin (other tools such as Melodia, World, or Reaper may also be used). Adjust the song's fundamental frequency using the voiceless / voiced (U / V) labels in the duration features, where U indicates a frame with a fundamental frequency of 0 and V indicates a frame with a non-zero fundamental frequency. That is, the fundamental frequency is set to 0 at voiceless (U) positions, and where the extracted fundamental frequency is 0 at a voiced (V) position, it is filled in by linear interpolation. Resynthesize the audio using the World vocoder (copy synthesis) and manually adjust the result to obtain a more accurate fundamental frequency.
[0199] 5. If a MidiNode file is available, the pitch values of the MidiNode are used. Each word in the MidiNode has only one or a few average pitch values, so the fundamental frequency extracted by tools such as Yin is used for adjustment to obtain a more accurate fundamental frequency and reduce manual operation.
[0200] 6. Use the World vocoder to extract the acoustic features of the dry singing vocals to obtain the spectral features and aperiodic features of the song.
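Two of the operations above lend themselves to a small sketch: splitting a syllable's duration into initial and final durations by a statistical ratio (operation 2), and turning a MidiNode pitch value into a frequency (operation 5). The sketch assumes that durations are counted in frames and that the MidiNode pitch values are standard MIDI note numbers with A4 = 440 Hz; both are assumptions, and the function names are hypothetical.

```python
def split_syllable_frames(total_frames, initial_ratio):
    """Split a syllable's frame count into (initial, final) frame counts.

    initial_ratio: assumed statistical share of the initial consonant, in [0, 1].
    """
    initial = round(total_frames * initial_ratio)
    initial = max(0, min(total_frames, initial))
    return initial, total_frames - initial


def midi_note_to_hz(note, a4_hz=440.0):
    """Convert a MIDI note number to Hz (equal temperament, note 69 = A4)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


if __name__ == "__main__":
    print(split_syllable_frames(40, 0.25))
    print(midi_note_to_hz(69), midi_note_to_hz(81))
```

Such values are only rough estimates, which is consistent with the fine annotation in Praat and the Yin-based adjustment described in the operations above.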
[0201] Model training can be performed as follows:
[0202] The embodiments disclosed herein employ the Tacotron acoustic model from current mainstream text-to-speech (TTS) technology.
[0203] The Tacotron acoustic model is an attention-based sequence-to-sequence (seq2seq) acoustic model built on an encoder-decoder mechanism.
[0204] This embodiment modifies the Tacotron model by replacing the Tacotron CBHG module with three CNN modules and introducing speaker embedding. The speaker embedding module consists of an embedding layer and a dense layer. After passing through the softsign activation function, different speaker IDs are encoded into 256-dimensional embeddings, which are then fed to three parts of the decoder: the pre-net module, the decoder RNN module, and the post-net module. The embeddings are combined with the inputs of these modules by tensor concatenation.
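The speaker embedding path described above (embedding layer, dense layer, softsign, then concatenation) can be sketched in plain Python. This is a toy illustration with random, untrained weights, not the model of the disclosure; the class name, the internal embedding size of 8, and the random initialization are assumptions, while the 256-dimensional output and the softsign activation follow the description.

```python
import random


def softsign(x):
    """Softsign activation: x / (1 + |x|), with outputs in (-1, 1)."""
    return x / (1.0 + abs(x))


class SpeakerEmbedding:
    """Embedding lookup followed by a dense layer and softsign (toy weights)."""

    def __init__(self, num_speakers, embed_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                      for _ in range(num_speakers)]
        self.weight = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, speaker_id):
        e = self.table[speaker_id]
        return [softsign(sum(w * x for w, x in zip(row, e)) + b)
                for row, b in zip(self.weight, self.bias)]


def concat_speaker(frames, speaker_vec):
    """Concatenate the speaker vector onto every frame of a module input."""
    return [list(frame) + list(speaker_vec) for frame in frames]


if __name__ == "__main__":
    embed = SpeakerEmbedding(num_speakers=20, embed_dim=8, out_dim=256)
    vec = embed(3)
    decoder_in = concat_speaker([[0.0] * 4 for _ in range(5)], vec)
    print(len(vec), len(decoder_in), len(decoder_in[0]))
```

In the described model the same 256-dimensional vector would be concatenated at the pre-net, the decoder RNN, and the post-net; the sketch shows the concatenation for a single module input.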
[0205] This embodiment employs a combination of the World vocoder and a partially end-to-end Tacotron model. The input consists of serialized features with pauses (prosody) and phonetic symbols (phonemes), while the output is the World vocoder's features: the fundamental frequency (f0), the spectral features (denoted as sp), and the aperiodic features (denoted as ap). The model outputs sp, ap, and f0 are fed into the World vocoder for synthesis.
[0206] This disclosure introduces a multi-speaker (i.e., sample user) mechanism to further optimize the Tacotron acoustic model.
[0207] During training, multiple speakers are selected (20 are used in this embodiment, although the number is not limited to this) and each speaker is labeled with a speaker ID. Finally, a multi-speaker base model is obtained through training.
[0208] After a user records audio data for a fixed text, it is fed into the model for training. The input includes gender and voice timbre information, and the user selects the corresponding speaker ID (which is a speaker ID from the base model; ideally, males select male IDs and females select female IDs, prioritizing the base-model speaker whose timbre is closest to the user's). ID is an abbreviation for identifier.
[0209] After selecting the speaker ID, fine-tuning is performed on the pre-trained base model to obtain the user's acoustic model, which is then used to synthesize the user's acoustic features.
[0210] The model is frozen, thus fixing the model's parameters.
[0211] Lyrics conversion and synthesis may include the following operations:
[0212] Step 1: Input data: Lyrics text with pauses output during resource creation.
[0213] Step 2: The text is processed by a speech synthesis front-end. The front-end functions include, but are not limited to, text normalization, word segmentation, part-of-speech tagging, and polyphonic character disambiguation, generating text labels. The text labels contain the phoneme information and pause information of the audio.
[0214] Step 3: The text is processed by phonetic annotation to convert it into phonemes, which contain pause features. At the same time, the phonemes are serialized according to the dictionary.
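The serialization in Step 3 can be sketched as a dictionary lookup from phoneme symbols to integer IDs, with the pause carried as its own symbol. The symbol names, the pause token "sp", and the reserved padding and unknown entries are hypothetical choices made for this sketch.

```python
def build_phoneme_dict(phonemes):
    """Assign an integer ID to each distinct phoneme symbol, in first-seen order."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved entries (assumed)
    for p in phonemes:
        if p not in vocab:
            vocab[p] = len(vocab)
    return vocab


def serialize(phonemes, vocab):
    """Map a phoneme sequence (including pause symbols) to a list of IDs."""
    return [vocab.get(p, vocab["<unk>"]) for p in phonemes]


if __name__ == "__main__":
    # "zhong guo" with a pause symbol "sp" between the two syllables.
    vocab = build_phoneme_dict(["zh", "ong", "sp", "g", "uo"])
    print(serialize(["zh", "ong", "sp", "g", "uo"], vocab))
```

The resulting ID sequence is the kind of serialized input, with pause features included, that Step 4 passes to the trained model.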
[0215] Step 4: Using the trained model, the text sequence is converted into acoustic features, including the synthesized fundamental frequency, synthesized spectral features (sp), and synthesized aperiodic features (ap). Then, the user's synthesized speech is generated using the World vocoder.
[0216] Step 5: The user-synthesized audio from step 4 is processed using Alignment technology in ASR to obtain the duration information of the synthesized audio (the Tacotron model does not support extracting the duration information of the synthesized audio; since the synthesized audio is very clean, the alignment is relatively accurate). If the alignment algorithm supports inputting three synthesis features, then a vocoder synthesis process is not required. This embodiment adopts the second method, where the alignment algorithm supports inputting spectral features and fundamental frequency features.
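Once an alignment is available, the duration information can be read off as the number of frames assigned to each phoneme. The sketch below assumes that the alignment is given as one phoneme label per frame, which is a hypothetical output format; the disclosure does not specify the format produced by the ASR alignment.

```python
def phoneme_durations(frame_labels):
    """Collapse per-frame phoneme labels into (phoneme, frame_count) pairs.

    Note: two identical phonemes that are directly adjacent would be merged
    into one run by this simple sketch.
    """
    runs = []
    for label in frame_labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])
    return [(label, count) for label, count in runs]


if __name__ == "__main__":
    labels = ["zh", "zh", "ong", "ong", "ong", "sp"]
    print(phoneme_durations(labels))
```

The per-phoneme frame counts obtained this way correspond to the first duration information that is later compared with the durations in the song resource.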
[0217] Step 6: As shown in Figure 3, the duration information in the song resource is analyzed to obtain the duration corresponding to each phoneme. The synthesized features from step 4 and the song features of the singing resource are used as the inputs for computing DTW; sp and ap correspond to two DTW graphs, from which the DTW warping path (i.e., the DTW path) is obtained.
[0218] Finally, the transformation relationship is obtained based on the projection of the DTW path onto the x and y axes. The synthesized features (the first acoustic feature output by the target acoustic model) are stretched according to this transformation relationship to obtain features with the same duration as the song's features.
[0219] The DTW warping path is shown as the middle line in Figure 5. This embodiment uses the FastDTW algorithm, which is more efficient and accurate than the traditional DTW algorithm and can better match the relationship between two features. Compared with the linear interpolation scheme in related technologies, it yields higher naturalness. In this embodiment, the DTW algorithm is used to stretch phonemes with a non-zero fundamental frequency, while phonemes with a fundamental frequency of zero are stretched by linear interpolation.
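The idea of Step 6 can be illustrated with a minimal sketch. It uses the classic quadratic-time DTW on one-dimensional sequences rather than FastDTW, and it stretches the synthesized track by averaging the source frames that the path aligns to each song frame; the function names, the scalar features, and the averaging rule are assumptions made for illustration, not the procedure of this embodiment.

```python
def dtw_path(x, y, dist=lambda a, b: abs(a - b)):
    """Classic DTW: return the warping path as a list of (i, j) index pairs."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = []
        if i > 1 and j > 1:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 1:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 1:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def warp_to_target(features, path, target_len):
    """Project the path onto the target axis: one output frame per target index."""
    buckets = [[] for _ in range(target_len)]
    for i, j in path:
        buckets[j].append(features[i])
    return [sum(b) / len(b) for b in buckets]


if __name__ == "__main__":
    synthesized = [1.0, 2.0, 3.0]            # short spoken track
    song = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]    # longer sung track
    path = dtw_path(synthesized, song)
    print(path)
    print(warp_to_target(synthesized, path, len(song)))
```

Because a DTW path visits every index of both sequences, every song frame receives at least one synthesized frame, so the stretched track always has the same length as the song track.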
[0220] Step 7: Music fundamental frequency processing. First, using linear interpolation, all zeros in the fundamental frequency are interpolated to non-zero values. Special processing is applied to the beginning and end of sentences to ensure that the fundamental frequency values at these points do not approach 0. Then, adjustments are made based on the fundamental frequency output by the acoustic model: where the acoustic model outputs a fundamental frequency of 0, as for a consonant that has no fundamental frequency, the corresponding fundamental frequency in the music is set to 0.
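Step 7 can be sketched in two small functions. The sketch assumes that both fundamental frequency tracks are per-frame lists of equal length, and it handles sentence beginnings and ends by holding the nearest voiced value, which is only one possible reading of the special processing mentioned above; the function names are hypothetical.

```python
def fill_zeros_linear(f0):
    """Replace zero frames by linear interpolation between voiced neighbors.

    Leading and trailing zeros are filled with the nearest voiced value
    (an assumed form of the special handling at sentence boundaries).
    """
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)
    out = list(f0)
    first, last = voiced[0], voiced[-1]
    for i in range(first):
        out[i] = f0[first]
    for i in range(last + 1, len(f0)):
        out[i] = f0[last]
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            out[i] = f0[a] + (f0[b] - f0[a]) * (i - a) / (b - a)
    return out


def apply_unvoiced_mask(song_f0, model_f0):
    """Set the song f0 to 0 wherever the acoustic model's f0 is 0."""
    return [0.0 if m == 0 else s for s, m in zip(song_f0, model_f0)]


if __name__ == "__main__":
    song = fill_zeros_linear([0.0, 100.0, 0.0, 0.0, 130.0, 0.0])
    print(song)
    print(apply_unvoiced_mask(song, [0.0, 95.0, 98.0, 0.0, 120.0, 118.0]))
```

The first function produces a continuous pitch contour from the song, and the second restores the voiceless positions dictated by the text being sung, matching the order of operations in this step.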
[0221] Step 8: Input the adjusted fundamental frequency and the stretched acoustic features from step 6 into the World vocoder to synthesize the user's singing audio.
[0222] Step 9: Post-process the synthesized sound: Based on the average fundamental frequency of the song, perform an equalization (EQ) operation on the synthesized song, that is, adjust the frequency proportion in the fundamental frequency to make the synthesized music sound better. At the same time, perform low-pass filtering on the synthesized audio to eliminate sound artifacts.
[0223] Step 10: Use the background music resources and the dry audio positions in the duration file to mix the background music, and add reverb in post-processing.
[0224] In summary, the technical solution of this disclosure proposes a novel speaker-adaptive acoustic model based on Tacotron. This model has low requirements for the dataset: once a speaker-adaptive base model has been trained, a good adapted model can be trained with a corpus of 20 sentences. The FastDTW algorithm is used to optimize the feature stretching of initials and finals, and a linear interpolation scheme is used to achieve feature stretching of initials without fundamental frequency, thus achieving more accurate feature stretching transformation.
[0225] As shown in Figure 9, this disclosure provides an audio synthesis apparatus, including:
[0226] The generation module 110 is configured to generate a phoneme sequence based on target text containing pauses, wherein the phoneme sequence includes: one or more phoneme elements; wherein the phoneme elements include pause features;
[0227] The obtaining module 120 is used to input the phoneme sequence into the target acoustic model to obtain the first acoustic feature; wherein, the target acoustic model includes: a base model trained using sample data of the target user, wherein, the base model is: trained using sample data of multiple sample users; wherein, the sample data includes: audio data and text corresponding to the audio data;
[0228] The synthesis module 130 is used to synthesize target audio with the pronunciation characteristics of the target user based on the first acoustic feature, the text features of the target text, the reference audio corresponding to the target text, and the synthesizer.
[0229] In some embodiments, the audio synthesis device may be any electronic device.
[0230] For example, the generation module 110, the obtaining module 120, and the synthesis module 130 may be program modules; after the program modules are executed by the processor, they can perform the above operations.
[0231] As another example, the generation module 110, the obtaining module 120, and the synthesis module 130 may be combined hardware-software modules; the combined hardware-software modules may be programmable arrays; the programmable arrays include, but are not limited to, field-programmable arrays and / or complex programmable arrays.
[0232] As yet another example, the generation module 110, the obtaining module 120, and the synthesis module 130 may be pure hardware modules; the pure hardware modules may include, but are not limited to, application-specific integrated circuits.
[0233] In one embodiment, the obtaining module 120 is used to input the phoneme sequence and the identification information of a specific user into the target acoustic model to obtain the first acoustic feature, wherein the specific user is: the sample user that meets the similarity condition with the target user.
[0234] In one embodiment, the obtaining module 120 is further configured to input the phoneme sequence into the target acoustic model to obtain phoneme embedding features; and to process the phoneme embedding features through a pre-processing network to obtain higher-dimensional features;
[0235] The first convolutional feature is obtained by processing the higher-dimensional features using one or more convolutional modules; the first acoustic feature is generated based on the first convolutional feature.
[0236] In one embodiment, the obtaining module 120 can also be used to generate encoded features based on the first convolutional features; process the encoded features using one or more convolutional modules to obtain second convolutional features; and generate the first acoustic features based on the second convolutional features.
[0237] In one embodiment, the synthesis module 130 is specifically configured to synthesize a first audio based on the first acoustic feature; obtain a conversion relationship based on the Dynamic Time Warping (DTW) algorithm, the first duration information of the first audio, and the second duration information of the reference audio; convert the first acoustic feature into a second acoustic feature based on the conversion relationship; and synthesize the target audio based on the second acoustic feature.
[0238] In one embodiment, the synthesis module 130 is used to synthesize a third audio based on the second acoustic feature; and to mix background audio into the third audio based on the dry audio distribution position of the reference audio to obtain the target audio.
[0239] In one embodiment, the first acoustic feature includes at least one of the following: fundamental frequency, spectral feature, and / or, aperiodic spectral feature.
[0240] As shown in Figure 10, this disclosure provides a target acoustic model training apparatus, the apparatus comprising:
[0241] The first training module 210 is used to train a preset model using sample data from multiple sample users to obtain a base model.
[0242] The second training module 220 is used to train a base model using sample data from the target user to obtain a target acoustic model; wherein the sample data includes: audio data and text corresponding to the audio data;
[0243] The target acoustic model is used to synthesize target audio with the pronunciation characteristics of the target user based on the target text and the corresponding reference audio.
[0244] The target acoustic model training device can be any electronic device.
[0245] In one embodiment, the first training module 210 and the second training module 220 may be: a program module, a programmable module, and / or a pure hardware module, etc.
[0246] In some embodiments, the target acoustic model training device may further include:
[0247] The determination module is used to determine the sample users that meet the similarity condition with the target user based on the sample data of the target user and the sample data of the sample users; wherein, the user identifier of the sample users that meet the similarity condition with the target user and the target text are used by the target acoustic model to synthesize the target audio.
[0248] This disclosure presents an embodiment of an adaptive singing conversion method based on dynamic time warping (DTW) with limited corpus. The method takes text and song resources as input and outputs singing voice. It is a method in which the user inputs 20 sentences of fixed text and can sing out all the songs in the song resource library.
[0249] The embodiments disclosed herein are mainly divided into four parts: resource production, singing synthesis, vocal model training, and post-processing.
[0250] As shown in Figure 11, this disclosure provides an electronic device, the electronic device comprising:
[0251] Memory;
[0252] The processor, connected to the memory, is configured to implement the audio synthesis method and / or the target acoustic model training method provided in any of the foregoing embodiments by executing computer-executable instructions stored in the memory, for example, any of the methods shown in Figures 1 to 4 and / or Figures 6 to 8.
[0253] As shown in Figure 11, the electronic device may also include a network interface, which can be used for information exchange between the first device and the second device.
[0254] This disclosure provides a computer storage medium storing computer-executable instructions; when executed by a processor, these computer-executable instructions can implement the audio synthesis method and / or the target acoustic model training method provided in any of the foregoing embodiments, for example, any of the methods shown in Figures 1 to 4 and / or Figures 6 to 8. The computer storage medium is a non-transitory storage medium.
[0255] The technical solutions described in the embodiments of this disclosure can be combined arbitrarily without conflict.
[0256] In the several embodiments provided in this disclosure, it should be understood that the disclosed methods and smart devices can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components may be combined, or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0257] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.
[0258] In addition, the functional units in the various embodiments of this disclosure can all be integrated into one processing unit, or each unit can serve as a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.
[0259] The above description is merely a specific embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure.
Claims
1. An audio synthesis method, characterized in that it includes: A phoneme sequence is generated based on target text containing pauses, wherein the phoneme sequence includes: one or more phoneme elements; wherein the phoneme elements include pause features; The phoneme sequence is input into the target acoustic model to obtain the first acoustic feature; wherein, the target acoustic model includes: a base model trained using sample data from the target user, wherein the base model is: trained using sample data from multiple sample users; wherein, the sample data includes: audio data and text corresponding to the audio data; A first audio is synthesized based on the first acoustic feature; A conversion relationship is obtained based on the Dynamic Time Warping (DTW) algorithm, the first duration information of the first audio, and the second duration information of the reference audio corresponding to the target text; According to the conversion relationship, the first acoustic feature is converted into the second acoustic feature; The target audio is synthesized based on the second acoustic feature; the target audio has the pronunciation characteristics of the target user.
2. The method according to claim 1, characterized in that, The step of inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes: The phoneme sequence and the identification information of a specific user are input into the target acoustic model to obtain the first acoustic feature, wherein the specific user is the sample user that meets similar conditions to the target user.
3. The method according to claim 1 or 2, characterized in that, The step of inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes: The phoneme sequence is input into the target acoustic model to obtain phoneme embedding features; The phoneme embedding features are processed by a pre-processing network to obtain higher-dimensional features; The first convolutional feature is obtained by processing the up-dimensional feature using one or more convolutional modules. The first acoustic feature is generated based on the first convolutional feature.
4. The method according to claim 3, characterized in that, The step of inputting the phoneme sequence into the target acoustic model to obtain the first acoustic feature includes: Generate encoded features based on the first convolutional features; The encoded features are processed using one or more convolutional modules to obtain second convolutional features; The first acoustic feature is generated based on the second convolutional features.
5. The method according to claim 1, characterized in that, The process of synthesizing the target audio based on the second acoustic feature includes: A third audio signal is synthesized based on the second acoustic feature; Based on the dry sound distribution position of the reference audio, background audio is mixed into the third audio to obtain the target audio.
6. The method according to claim 1 or 2, characterized in that, The first acoustic feature includes at least one of the following: fundamental frequency, periodic frequency feature, and non-periodic frequency feature.
7. A method for training a target acoustic model, characterized in that, The method includes: A pre-set model is trained using sample data from multiple sample users to obtain a basic model; A base model is trained using sample data from the target user to obtain the target acoustic model; wherein, the sample data includes: audio data and text corresponding to the audio data; The target acoustic model is used to determine target audio with the pronunciation characteristics of the target user based on the input phoneme sequence; the phoneme sequence is generated based on target text containing pauses; the phoneme sequence includes one or more phoneme elements; the phoneme elements include pause features; the target audio is synthesized based on a second acoustic feature; the second acoustic feature is obtained by converting the first acoustic feature output by the target acoustic model according to a conversion relationship; the conversion relationship is obtained based on the DTW algorithm, the duration information of the first audio, and the second duration information of the reference audio corresponding to the target text; the first audio is synthesized based on the first acoustic feature.
8. The method according to claim 7, characterized in that, The method further includes: Based on the sample data of the target user and the sample data of the sample user, the sample user who meets the similarity conditions with the target user is determined; wherein, the user identifier of the sample user who meets the similarity conditions with the target user and the target text are used by the target acoustic model to synthesize the target audio.
9. An audio synthesis device, characterized in that it includes: A generation module is used to generate a phoneme sequence based on target text containing pauses, wherein the phoneme sequence includes: one or more phoneme elements; wherein the phoneme elements include pause features; An obtaining module is configured to input the phoneme sequence into a target acoustic model to obtain a first acoustic feature; wherein the target acoustic model comprises: a base model trained using sample data from a target user, wherein the base model is: trained using sample data from multiple sample users; wherein the sample data comprises: audio data and text corresponding to the audio data; The synthesis module is used to synthesize a first audio based on the first acoustic feature; obtain a conversion relationship based on the Dynamic Time Warping (DTW) algorithm, the first duration information of the first audio, and the second duration information of the reference audio corresponding to the target text; convert the first acoustic feature into a second acoustic feature based on the conversion relationship; and synthesize a target audio based on the second acoustic feature; the target audio has the pronunciation characteristics of the target user.
10. A target model training device, characterized in that, The device includes: The first training module is used to train a preset model using sample data from multiple sample users to obtain a basic model. The second training module is used to train a base model using sample data from the target user to obtain a target acoustic model; wherein, the sample data includes: audio data and text corresponding to the audio data; The target acoustic model is used to determine target audio with the pronunciation characteristics of the target user based on the input phoneme sequence; the phoneme sequence is generated based on target text containing pauses; the phoneme sequence includes one or more phoneme elements; the phoneme elements include pause features; the target audio is synthesized based on a second acoustic feature; the second acoustic feature is obtained by converting the first acoustic feature output by the target acoustic model according to a conversion relationship; the conversion relationship is obtained based on the DTW algorithm, the duration information of the first audio, and the second duration information of the reference audio corresponding to the target text; the first audio is synthesized based on the first acoustic feature.
11. An electronic device, characterized in that, The electronic device includes: Memory; A processor, connected to the memory, is configured to implement the method provided by any one of claims 1 to 6 or 7 to 8 by executing computer-executable instructions stored on the memory.
12. A computer storage medium, characterized in that, The computer storage medium stores computer-executable instructions; when executed by a processor, the computer-executable instructions can implement the method provided by any one of claims 1 to 6 or 7 to 8.