Rap audio generation method, device and equipment and readable storage medium
By extracting semantic PPG features and voiceprint features, and using ASR and GE2E models to generate rap audio, the problems of high data collection costs and unnatural synthesis effects are solved, and high-quality rap audio generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGZHOU QUWAN NETWORK TECH CO LTD
- Filing Date
- 2022-12-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies require parallel speech data from both the source speaker and the target speaker when generating rap audio, resulting in high data collection costs and unnatural synthesis effects, often with mechanical sounds and unclear pronunciation.
Semantic PPG features are extracted with an ASR model and voiceprint features are extracted with a GE2E model; the two are combined and passed to a HIFIGAN model to generate rap audio with the user's timbre, thus optimizing the synthesis effect and avoiding mechanical sounds.
It enhances the expressiveness and naturalness of rap audio, solves the problem of unnatural synthesis effects, and can quickly process long rap songs.
Description
Technical Field
[0001] This application relates to the field of audio analysis, and more specifically, to methods, apparatus, devices, and readable storage media for generating rap audio.
Background Technology
[0002] With the airing of rap shows, rap culture has become popular among young people. Therefore, in the offline functionalities of entertainment social platforms, there is a demand to meet users' needs by generating personalized rap songs. According to the product logic, users read the lyrics aloud, and after finishing, click to generate a rap song with their own vocal timbre. The generated rap song has good rhythm, pronunciation, and emotion. The generation of the rap song utilizes a rap audio conversion algorithm.
[0003] Currently used voice conversion and generation algorithms extract acoustic features from both source-speaker and target-speaker audio corpora during the training phase. They then use inter-frame alignment algorithms such as Dynamic Time Warping (DTW) to align these features, and employ models like Gaussian Mixture Models (GMM) or Artificial Neural Networks (ANNs) to learn the mapping relationship between input and target acoustic features. In other words, commonly used voice conversion and generation algorithms require parallel source-to-target audio corpora.
[0004] However, parallel source-speaker and target-speaker corpora for training such models are expensive and difficult to collect, and in long audio synthesis the result is prone to unnaturalness, mechanical sounds, and unclear pronunciation. Such synthesis results greatly hinder the promotion of the feature and harm the user experience.
[0005] Based on the above-mentioned actual situation, this application proposes a rap audio generation scheme to solve the above-mentioned drawbacks.
Summary of the Invention
[0006] In view of this, this application provides a method, apparatus, device and readable storage medium for generating rap audio. Semantic PPG features and voiceprint features are extracted and converted into Melp features of the user's voice, which are finally converted into rap audio with the user's voice. This optimizes the rap audio synthesis effect, improves the expressiveness and naturalness of the rap audio, and avoids mechanical sounds.
[0007] A method for generating rap audio includes:
[0008] Obtain user-recorded audio and rap templates;
[0009] The parameters of the rap template are validated, and the semantic PPG features of the rap template are extracted using the ASR model.
[0010] The voiceprint features of the user's recorded audio are extracted using the GE2E model, which is trained for a voiceprint recognition task using the GE2E loss function.
[0011] The PPG semantic features and the voiceprint features extracted by the GE2E model are combined and converted into Melp features of the user's voice timbre.
[0012] The HIFIGAN model is used to convert the Melp features of the user's voice into waveforms to generate rap audio.
[0013] Optionally, parameter validation is performed on the rap template, including:
[0014] The sampling rate, number of channels, and quantization bit width of the rap template are validated.
[0015] Optionally, the PPG semantic features and the voiceprint features extracted by the GE2E model are combined to convert the user's voice timbre into Melp features, including:
[0016] Extract the fundamental frequency characteristics of the user-recorded audio;
[0017] Based on the PPG semantic features and the fundamental frequency features, the initial extended features are obtained;
[0018] The voiceprint features extracted by the GE2E model and the initial extended features are concatenated according to the time dimension to generate the target extended features;
[0019] The target extended features are subjected to block-based relative position attention decoding to obtain the Melp features of the user's timbre.
[0020] Optionally, based on the PPG semantic features and the fundamental frequency features, initial extended features are obtained, including:
[0021] The PPG semantic features are extracted by convolutional feature extraction using a PPG processing network to obtain the first feature;
[0022] The fundamental frequency features are extracted by convolutional feature extraction using a fundamental frequency processing network to obtain the second feature;
[0023] The first feature and the second feature are added together to obtain the initial extended feature.
[0024] Optionally, the HIFIGAN model includes a HIFIGAN vocoder and a convolutional residual structure;
[0025] The HIFIGAN vocoder includes a multi-scale discriminator and a multi-period discriminator, used to generate rap audio based on the Melp features of the user's timbre;
[0026] The convolutional residual structure increases the receptive field by alternating between dilated convolution and ordinary convolution, thereby ensuring the synthesized sound quality of the rap audio and improving inference speed.
[0027] Optionally, the target extended features are subjected to block-based relative position attention decoding to obtain the Melp features of the user's timbre, including:
[0028] The hidden features of the target extended features are captured using an RNN network, and the size of the block features is determined.
[0029] The contextual features of the latent features are extracted based on the block feature size using a block relative position attention mechanism, and the Melp features of the user's voice are generated.
[0030] Optionally, the block relative position attention mechanism is as follows:
[0031]
[0032]
[0033] $\mu_i = \mu_{i-1} + \Delta_i$
[0034]
[0035] Where SM is the softmax function, SP is the softplus function, σ is the sigmoid function, and K is the number of sets of mixing parameters, each set including a weight w, a mean step size Δ, a scaling size σ, and a mean size μ; $\alpha_{i,j}$ is the output attention weight, p is the block size, i indexes the i-th decoding step, and j is the time coordinate of the latent features involved.
[0036] A rap audio generation device, comprising:
[0037] The material acquisition unit is used to acquire user-recorded audio and rap templates;
[0038] PPG feature units are used to perform parameter verification on the rap template and extract the semantic PPG features of the rap template using an ASR model.
[0039] The voiceprint feature unit is used to extract the voiceprint features of the user's recorded audio using the GE2E model, which is trained by the GE2E loss function for the voiceprint recognition task.
[0040] The Melp feature unit is used to combine the PPG semantic features and the voiceprint features extracted by the GE2E model to convert them into Melp features of the user's voice timbre.
[0041] The audio generation unit is used to convert the Melp features of the user's voice into waveforms using the HIFIGAN model to generate rap audio.
[0042] A rap audio generation device, including a memory and a processor;
[0043] The memory is used to store programs;
[0044] The processor is used to execute the program to implement the various steps of the rap audio generation method described above.
[0045] A readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the rap audio generation method described above.
[0046] As can be seen from the above technical solutions, the rap audio generation method, apparatus, device, and readable storage medium provided in this application obtain user-recorded audio and a rap template, perform parameter verification on the rap template, and extract the semantic PPG features of the rap template using an ASR model. The voiceprint features of the user-recorded audio are extracted using a GE2E model, which is trained using a GE2E loss function for a voiceprint recognition task. The PPG semantic features and the voiceprint features extracted by the GE2E model are combined and converted into Melp features of the user's timbre. Finally, the Melp features of the user's timbre are converted into a waveform using a HIFIGAN model to generate the rap audio.
[0047] This application utilizes the ASR model to extract the semantic PPG features of the rap template and the GE2E model to extract the voiceprint features of the user-recorded audio. By substituting the voiceprint features extracted from the user audio for those of the rap template, rap audio with the user's timbre can be generated, thereby optimizing the rap audio synthesis effect, improving the expressiveness and naturalness of the rap audio, and avoiding mechanical sounds.
[0048] In addition, to address the reduced conversion quality for long speech, a voice conversion model based on block-based relative position attention is used, which converts long rap songs faster and with better results, thus ensuring the conversion quality of long speech.
Attached Figure Description
[0049] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0050] Figure 1 is a flowchart of a rap audio generation method disclosed in an embodiment of this application;
[0051] Figure 2 is a diagram of a rap audio synthesis framework disclosed in an embodiment of this application;
[0052] Figure 3 is a schematic diagram of the structure of the CLR-VC conversion model disclosed in an embodiment of this application;
[0053] Figure 4 is a schematic diagram illustrating an example of rap audio generation disclosed in this application;
[0054] Figure 5 is a structural block diagram of a rap audio generation device disclosed in an embodiment of this application;
[0055] Figure 6 is a hardware structure block diagram of a rap audio generation device disclosed in an embodiment of this application.
Detailed Implementation
[0056] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0057] This application can be used in a wide variety of general-purpose or special-purpose computing environments or configurations. For example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor devices, distributed computing environments including any of the above devices, etc.
[0058] This application provides a method for generating rap audio. This method can be applied to various computer terminals or smart terminals, and its execution entity can be the processor or server of the computer terminal or smart terminal.
[0059] The following section introduces the solution proposed in this application. The technical solution is as follows, and details are provided below.
[0060] Figure 1 is a flowchart of a rap audio generation method disclosed in an embodiment of this application, and Figure 2 is a diagram of a rap audio synthesis framework disclosed in an embodiment of this application. As shown in Figures 1 and 2, the method may include:
[0061] Step S1: Obtain the user-recorded audio and rap template.
[0062] Specifically, rap templates are first prepared for the rap songs to be released online: rap song clips sung by singers are collected, a vocal-accompaniment separation tool is used to separate the rap vocals from the backing track, and the rap lyrics are annotated. The results are then processed into WAV and TXT files as required. In other words, the rap template is WAV format data.
[0063] Step S2: Perform parameter verification on the rap template and extract the semantic PPG features of the rap template using the ASR model.
[0064] Specifically, the parameters of the rap template are validated, including the sampling rate, number of channels, and quantization bit width of the rap template.
[0065] The sampling rate, number of channels, and quantization bit width of the obtained WAV format rap template are verified to ensure that the reference audio and the audio to be tested meet the requirements of a 16 kHz sampling rate, a single channel, and a 16-bit width. The trained ASR model is then used to extract semantic PPG features from the template.
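The parameter check in step S2 can be illustrated with a minimal sketch using Python's standard `wave` module. The function name `check_wav_params` and the constant names are hypothetical and are not taken from this application; only the three target values (16 kHz, single channel, 16-bit) come from the text above.

```python
import wave

# Target format stated in the text: 16 kHz, single channel, 16-bit samples.
REQUIRED_RATE_HZ = 16000
REQUIRED_CHANNELS = 1
REQUIRED_SAMPLE_WIDTH_BYTES = 2  # 16 bits


def check_wav_params(path):
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sampling rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"{channels} channels, expected {REQUIRED_CHANNELS}")
    if width != REQUIRED_SAMPLE_WIDTH_BYTES:
        problems.append(f"{8 * width}-bit samples, expected 16-bit")
    return problems
```

Returning a list of problems rather than a single boolean is a design choice of this sketch; it lets a caller report every mismatch at once before the template is passed to the ASR model.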
[0066] Step S3: Extract the voiceprint features of the user's recorded audio using the GE2E model, which is trained using the GE2E loss function for voiceprint recognition tasks.
[0067] Specifically, the GE2E model consists of an LSTM network structure and is trained using the GE2E loss function for a voiceprint recognition task. The GE2E loss function aims to give voiceprint features from the same speaker high similarity and voiceprint features from different speakers low similarity. Voiceprint features are extracted from the user-recorded audio as follows: valid audio is first detected and extracted; a sliding window is then used to extract voiceprint features from multiple segments, which are averaged to obtain the final user voiceprint features.
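The sliding-window averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this application: `embed_window` is a hypothetical callable standing in for the trained LSTM speaker encoder, the window and hop sizes are left to the caller, and the final L2 normalization is an assumption added for the sketch.

```python
import math


def average_voiceprint(frames, embed_window, window_size, hop_size):
    """Average per-window embeddings into one utterance-level voiceprint.

    frames       : per-frame feature vectors of the valid (non-silent) audio
    embed_window : hypothetical callable mapping a list of frames to one
                   embedding (stand-in for the trained speaker encoder)
    """
    if not frames:
        raise ValueError("no valid audio frames to embed")
    if len(frames) < window_size:
        windows = [frames]  # too short to slide: use the whole utterance once
    else:
        last_start = len(frames) - window_size
        windows = [frames[s:s + window_size]
                   for s in range(0, last_start + 1, hop_size)]
    embeddings = [embed_window(w) for w in windows]
    dim = len(embeddings[0])
    mean = [sum(e[k] for e in embeddings) / len(embeddings) for k in range(dim)]
    # Assumption: L2-normalize the averaged vector so voiceprints are comparable.
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean
```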
[0068] Step S4: Combine the PPG semantic features and the voiceprint features extracted by the GE2E model to convert them into Melp features of the user's voice timbre.
[0069] Specifically, the PPG semantic features and the voiceprint features extracted by the GE2E model are combined with fundamental frequency features extracted using traditional signal processing methods. These three features are then processed by the CLR-VC conversion model to output Mel spectrum (Melp) features. Simply substituting the voiceprint features extracted from the user's audio is sufficient to generate rap audio with the user's vocal timbre.
[0070] The semantic PPG features and the fundamental frequency features are processed by the PPG processing network and the fundamental frequency processing network respectively and then added together; the sum is concatenated with the voiceprint features along the time dimension and converted into Melp features of the user's timbre.
[0071] Step S5: Use the HIFIGAN model to convert the Melp features of the user's voice into waveforms to generate rap audio.
[0072] Specifically, the HIFIGAN model includes a HIFIGAN vocoder and a convolutional residual structure. The HIFIGAN vocoder comprises a multi-scale discriminator and a multi-period discriminator, used to generate rap audio based on the Melp features of the user's timbre. The convolutional residual structure increases the receptive field by alternately using dilated convolutions and ordinary convolutions, ensuring the synthesized sound quality of the rap audio and improving inference speed.
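The effect of alternating dilated and ordinary convolutions on the receptive field can be checked with a short calculation. For a stack of stride-1 one-dimensional convolutions with kernel size k and dilations d_1, ..., d_n, the receptive field is 1 + Σ (k − 1)·d_l. The kernel size and dilation rates used below are illustrative values chosen for the sketch and are not specified in this application.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def alternating_stack(dilated_rates):
    """Follow each dilated layer with an ordinary (dilation 1) layer."""
    layers = []
    for rate in dilated_rates:
        layers.extend([rate, 1])
    return layers


# Illustrative values only: kernel size 3, dilation rates 1, 3 and 5.
alternating = alternating_stack((1, 3, 5))  # [1, 1, 3, 1, 5, 1]
ordinary_only = [1] * len(alternating)      # same depth, no dilation
```

With these values the alternating stack covers 25 frames while six ordinary layers of the same kernel size cover 13, so the wider context comes without any additional layers.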
[0073] As can be seen from the above technical solutions, the rap audio generation method, apparatus, device, and readable storage medium provided in this application obtain user-recorded audio and a rap template, perform parameter verification on the rap template, and extract the semantic PPG features of the rap template using an ASR model. The voiceprint features of the user-recorded audio are extracted using a GE2E model, which is trained using a GE2E loss function for a voiceprint recognition task. The PPG semantic features and the voiceprint features extracted by the GE2E model are combined and converted into Melp features of the user's timbre. Finally, the Melp features of the user's timbre are converted into a waveform using a HIFIGAN model to generate the rap audio.
[0074] This application utilizes the ASR model to extract the semantic PPG features of the rap template and the GE2E model to extract the voiceprint features of the user-recorded audio. By substituting the voiceprint features extracted from the user audio for those of the rap template, rap audio with the user's timbre can be generated, thereby optimizing the rap audio synthesis effect, improving the expressiveness and naturalness of the rap audio, and avoiding mechanical sounds.
[0075] In addition, to address the reduced conversion quality for long speech, a voice conversion model based on block-based relative position attention is used, which converts long rap songs faster and with better results, thus ensuring the conversion quality of long speech.
[0076] In some embodiments of this application, Figure 3 is a schematic diagram of the structure of the CLR-VC conversion model provided in the embodiments of this application. As shown in Figure 3, step S4, the process of combining the PPG semantic features and the voiceprint features extracted by the GE2E model and converting them into Melp features of the user's voice timbre, may specifically include:
[0077] Step S41: Extract the fundamental frequency features of the user-recorded audio.
[0078] Specifically, the fundamental frequency features of the user-recorded audio are obtained by processing the audio with the pyworld tool.
[0079] Step S42: Based on the PPG semantic features and the fundamental frequency features, obtain the initial extended features.
[0080] Specifically, the process of obtaining the initial extended features based on the PPG semantic features and the fundamental frequency features may include:
[0081] ① The PPG semantic features are extracted by convolutional feature extraction using the PPG processing network to obtain the first feature.
[0082] ② The fundamental frequency features are extracted by convolutional feature extraction using a fundamental frequency processing network to obtain the second feature.
[0083] ③ Add the first feature and the second feature together to obtain the initial extended feature.
[0084] The PPG processing network extracts convolutional features from the PPG semantic features, and the fundamental frequency processing network extracts convolutional features from the fundamental frequency features. The outputs of the two networks are then added together to obtain features with the same time dimension and 256 feature channels, which are the initial extended features. Both networks are composed of convolutional neural networks and ReLU activation functions.
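Step S42 can be sketched in plain Python as two convolutional branches whose outputs are added frame by frame. The sketch makes several assumptions that are not stated in this application: each branch is reduced to a single convolution followed by ReLU, padding keeps the number of frames unchanged, and the function names and parameter layout are hypothetical. The text specifies 256 output channels; any channel count works here as long as both branches agree.

```python
def conv1d_relu(x, weight, bias):
    """'Same'-padded 1-D convolution over time, followed by ReLU.

    x      : T frames, each a list of C_in values
    weight : C_out filters, each with K taps (K odd), each tap C_in values
    bias   : C_out values
    """
    taps = len(weight[0])
    pad = taps // 2
    zero = [0.0] * len(x[0])
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for t in range(len(x)):
        frame = []
        for filt, b in zip(weight, bias):
            acc = b
            for k in range(taps):
                acc += sum(w * v for w, v in zip(filt[k], padded[t + k]))
            frame.append(max(0.0, acc))  # ReLU
        out.append(frame)
    return out


def initial_extended_features(ppg, f0, ppg_params, f0_params):
    """Add the outputs of the PPG branch and the F0 branch frame by frame."""
    first = conv1d_relu(ppg, *ppg_params)   # first feature
    second = conv1d_relu(f0, *f0_params)    # second feature
    return [[a + b for a, b in zip(fa, fb)] for fa, fb in zip(first, second)]
```

Because both branches preserve the time dimension and share the output channel count, the element-wise sum is well defined for every frame.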
[0085] Step S43: Concatenate the voiceprint features extracted from the GE2E model and the initial extended features according to the time dimension to generate the target extended features.
[0086] Specifically, the GE2E model extracts 256-dimensional voiceprint features from the user-recorded audio. These features are expanded along the time dimension to match the summed features and concatenated with them according to the time dimension to form the target extended features, which serve as input to the block-based relative position attention decoding module. The formula for the input to the block-based relative position attention decoding module is as follows:
[0087] $s_i = \mathrm{RNN}_{\mathrm{Att}}([x_{i-1}, c_{i-1}], s_{i-1})$
[0088] $\alpha_i = \mathrm{CLRAttention}(s_i)$
[0089]
[0090] $d_i = \mathrm{RNN}_{\mathrm{Dec}}([c_i, s_i], d_{i-1})$
[0091] $\chi_i = \mathrm{Linear}_{\mathrm{out}}(d_i, c_i)$
[0092] Here, $h_j$ denotes the latent feature at time coordinate $j$, $x$ represents the output after concatenation, $s$ and $d$ represent the latent features output by the RNN networks, $c$ represents the context features extracted using the attention mechanism, $p$ represents the block feature size, $\alpha$ represents the attention weights output by the block relative position attention module, $i$ indexes the i-th decoding step, and $j$ represents the time coordinate of the latent features involved.
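The expansion and concatenation of step S43, which produce the input to the decoding module described above, can be sketched as follows. The sketch assumes one particular reading of the text: the utterance-level voiceprint is repeated once per frame and joined to each frame of the initial extended features, so that 256 feature channels plus a 256-dimensional voiceprint give 512 values per frame. The function name is hypothetical, and other readings of "concatenated according to the time dimension" are possible.

```python
def target_extended_features(initial, voiceprint):
    """Repeat the voiceprint for every frame and join it to that frame.

    initial    : T frames of initial extended features (C values each)
    voiceprint : one utterance-level vector (D values)
    returns    : T frames of C + D values
    """
    return [list(frame) + list(voiceprint) for frame in initial]
```

Under this reading the number of frames is unchanged, which keeps the time coordinate j of the decoder formulas aligned with the frames of the rap template.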
[0093] Step S44: Perform block-based relative position attention decoding on the target extended features to obtain the Melp features of the user's timbre.
[0094] Specifically, the process of performing block-based relative position attention decoding on the target extended features to obtain the Melp features of the user's timbre may include:
[0095] ①Use an RNN network to capture the hidden features of the target extended features and determine the size of the block features.
[0096] ② Use a block-based relative position attention mechanism to extract the context features of the latent features based on the block feature size, and generate the Melp features of the user's voice.
[0097] Specifically, the block-based relative position attention mechanism is as follows:
[0098]
[0099]
[0100] $\mu_i = \mu_{i-1} + \Delta_i$
[0101]
[0102] Where SM is the softmax function, SP is the softplus function, σ is the sigmoid function, and K is the number of sets of mixing parameters, each set including a weight w, a mean step size Δ, a scaling size σ, and a mean size μ; $\alpha_{i,j}$ is the output attention weight, p is the block size, i indexes the i-th decoding step, and j is the time coordinate of the latent features involved.
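Only the mean update μ_i = μ_{i-1} + Δ_i is legible among the formulas above, so the following is a sketch under explicit assumptions rather than the mechanism of this application. It assumes a mixture-of-Gaussians location-relative attention in which softmax (SM) normalizes the K weights and softplus (SP) keeps the step Δ and the scale positive, and it assumes that the block size p limits scoring to p frames around the weighted mean. The role of the sigmoid function is not recoverable from the text and is therefore not modeled. All function and parameter names are hypothetical.

```python
import math


def softplus(x):
    return math.log1p(math.exp(x))


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(prev_mu, w_hat, delta_hat, sigma_hat, num_frames, block_size):
    """One decoding step: return (new means, attention weights over frames)."""
    w = softmax(w_hat)                            # SM: weights sum to 1
    delta = [softplus(v) for v in delta_hat]      # SP: strictly positive step
    sigma = [softplus(v) for v in sigma_hat]      # assumption: SP keeps scale > 0
    mu = [m + d for m, d in zip(prev_mu, delta)]  # mu_i = mu_{i-1} + Delta_i

    # Assumption: only a block of `block_size` frames near the mean is scored.
    centre = sum(wk * mk for wk, mk in zip(w, mu))
    start = max(0, min(num_frames - block_size, int(centre) - block_size // 2))
    stop = min(num_frames, start + block_size)

    alpha = [0.0] * num_frames
    for j in range(start, stop):
        alpha[j] = sum(
            wk * math.exp(-((j - mk) ** 2) / (2.0 * sk ** 2))
            for wk, mk, sk in zip(w, mu, sigma)
        )
    total = sum(alpha)
    if total > 0.0:
        alpha = [a / total for a in alpha]        # assumption: renormalize in block
    return mu, alpha
```

Because softplus is strictly positive, every mean moves forward at each step, so the attention can only advance through the template. Scoring a fixed-size block also keeps the per-step cost independent of the song length, which is consistent with the stated goal of converting long rap songs quickly.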
[0103] The following example will be used to illustrate this application in detail.
[0104] As shown in Figure 4, in practical applications, rap song clips sung by singers are collected. A vocal-accompaniment separation tool is used to separate the rap vocals from the accompaniment, and the rap lyrics are annotated. The results are then processed into WAV and TXT files according to requirements. A pre-trained PPG model is used to extract PPG semantic features from the WAV files.
[0105] Users enter the AI rap interface, click the record button, read the rap lyrics aloud, and then upload the recording. The audio quality detection module judges the quality of the user's audio reading through effective audio detection and the GOP algorithm, returning a quality score to the client. Audio that meets the quality requirements will be sent to the rap synthesis service.
[0106] In the rap synthesis service, noise reduction and voiceprint extraction are performed on the user's audio to obtain a voiceprint feature vector. Then, the pre-processed semantic PPG features and fundamental frequency features of the song template are read and combined with the user's voiceprint features, which are then input into the CLR-VC conversion model and HIFIGAN vocoder to output synthesized rap audio unique to the user.
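The service flow of this example can be outlined as a small orchestration function. Everything here is a hypothetical sketch: the callables `score_quality`, `denoise`, `extract_voiceprint`, `convert` and `vocode` stand in for the quality detection module, the noise reduction step, the GE2E model, the CLR-VC conversion model and the HIFIGAN vocoder respectively; the template is assumed to be a dictionary holding pre-computed `"ppg"` and `"f0"` entries; and the threshold `min_score` is an invented value, since the text does not give one.

```python
def synthesize_rap(user_audio, template, *, score_quality, denoise,
                   extract_voiceprint, convert, vocode, min_score=0.6):
    """Run the example flow; return None if the recording fails the quality gate."""
    # Quality gate: only recordings that meet the requirement are synthesized.
    if score_quality(user_audio) < min_score:
        return None
    # Noise reduction followed by voiceprint extraction on the user's audio.
    voiceprint = extract_voiceprint(denoise(user_audio))
    # Pre-processed template features are combined with the user's voiceprint.
    mel = convert(template["ppg"], template["f0"], voiceprint)
    # The vocoder turns the converted Mel features into a waveform.
    return vocode(mel)
```

Passing the models in as callables keeps the sketch independent of any particular framework and makes each stage replaceable in isolation.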
[0107] The following describes the rap audio generation apparatus provided in the embodiments of this application. The rap audio generation apparatus described below and the rap audio generation method described above may be cross-referenced.
[0108] See Figure 5, which is a structural block diagram of a rap audio generation device disclosed in an embodiment of this application.
[0109] As shown in Figure 5, the rap audio generation device may include:
[0110] Material acquisition unit 110 is used to acquire user-recorded audio and rap templates;
[0111] PPG feature unit 120 is used to perform parameter verification on the rap template and extract the semantic PPG features of the rap template using an ASR model;
[0112] Voiceprint feature unit 130 is used to extract voiceprint features of the user's recorded audio using a GE2E model, which is trained for a voiceprint recognition task using a GE2E loss function.
[0113] Melp feature unit 140 is used to combine the PPG semantic features and the voiceprint features extracted by the GE2E model to convert them into Melp features of the user's voice timbre;
[0114] The audio generation unit 150 is used to convert the Melp features of the user's voice into waveforms using the HIFIGAN model to generate rap audio.
[0115] As can be seen from the above technical solutions, the rap audio generation method, apparatus, device, and readable storage medium provided in this application obtain user-recorded audio and a rap template, perform parameter verification on the rap template, and extract the semantic PPG features of the rap template using an ASR model. The voiceprint features of the user-recorded audio are extracted using a GE2E model, which is trained using a GE2E loss function for a voiceprint recognition task. The PPG semantic features and the voiceprint features extracted by the GE2E model are combined and converted into Melp features of the user's timbre. Finally, the Melp features of the user's timbre are converted into a waveform using a HIFIGAN model to generate the rap audio.
[0116] This application utilizes the ASR model to extract the semantic PPG features of the rap template and the GE2E model to extract the voiceprint features of the user-recorded audio. By substituting the voiceprint features extracted from the user audio for those of the rap template, rap audio with the user's timbre can be generated, thereby optimizing the rap audio synthesis effect, improving the expressiveness and naturalness of the rap audio, and avoiding mechanical sounds.
[0117] In addition, to address the reduced conversion quality for long speech, a voice conversion model based on block-based relative position attention is used, which converts long rap songs faster and with better results, thus ensuring the conversion quality of long speech.
[0118] Optionally, the PPG feature unit, which performs parameter verification on the rap template, may include:
[0119] The sampling rate, number of channels, and quantization bit width of the rap template are validated.
[0120] Optionally, the Melp feature unit, which combines the PPG semantic features and the voiceprint features extracted by the GE2E model to convert into Melp features of the user's timbre, may include:
[0121] Extract the fundamental frequency characteristics of the user-recorded audio;
[0122] Based on the PPG semantic features and the fundamental frequency features, the initial extended features are obtained;
[0123] The voiceprint features extracted by the GE2E model and the initial extended features are concatenated according to the time dimension to generate the target extended features;
[0124] The target extended features are subjected to block-based relative position attention decoding to obtain the Melp features of the user's timbre.
[0125] Optionally, the Melp feature unit, based on the PPG semantic features and the fundamental frequency features, obtains initial extended features, which may include:
[0126] The PPG semantic features are extracted by convolutional feature extraction using a PPG processing network to obtain the first feature;
[0127] The fundamental frequency features are extracted by convolutional feature extraction using a fundamental frequency processing network to obtain the second feature;
[0128] The first feature and the second feature are added together to obtain the initial extended feature.
[0129] Optionally, the HIFIGAN model includes a HIFIGAN vocoder and a convolutional residual structure;
[0130] The HIFIGAN vocoder includes a multi-scale discriminator and a multi-period discriminator, used to generate rap audio based on the Melp features of the user's timbre;
[0131] The convolutional residual structure increases the receptive field by alternating between dilated convolution and ordinary convolution, thereby ensuring the synthesized sound quality of the rap audio and improving inference speed.
[0132] Optionally, the Melp feature unit, which performs block-based relative position attention decoding on the target extended features to obtain the Melp features of the user's timbre, may include:
[0133] The hidden features of the target extended features are captured using an RNN network, and the size of the block features is determined.
[0134] The contextual features of the latent features are extracted based on the block feature size using a block relative position attention mechanism, and the Melp features of the user's voice are generated.
[0135] Optionally, the block-relative position attention mechanism of the Melp feature unit is as follows:
[0136]
[0137]
[0138] $\mu_i = \mu_{i-1} + \Delta_i$
[0139]
[0140] Where SM is the softmax function, SP is the softplus function, σ is the sigmoid function, and K is the number of sets of mixing parameters, each set including a weight w, a mean step size Δ, a scaling size σ, and a mean size μ; $\alpha_{i,j}$ is the output attention weight, p is the block size, i indexes the i-th decoding step, and j is the time coordinate of the latent features involved.
[0141] The rap audio generation device provided in this application embodiment can be applied to rap audio generation equipment. Figure 6 shows a hardware structure block diagram of the rap audio generation equipment. Referring to Figure 6, the hardware structure of the rap audio generation equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
[0142] In this embodiment, the number of processor 1, communication interface 2, memory 3, and communication bus 4 is at least one, and processor 1, communication interface 2, and memory 3 communicate with each other through communication bus 4.
[0143] Processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
[0144] Memory 3 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk storage device;
[0145] The memory stores a program, which the processor can call. The program is used for:
[0146] Obtain user-recorded audio and rap templates;
[0147] The parameters of the rap template are validated, and the semantic PPG features of the rap template are extracted using the ASR model.
[0148] The voiceprint features of the user's recorded audio are extracted using the GE2E model, which is trained for a voiceprint recognition task using the GE2E loss function.
[0149] The PPG semantic features and the voiceprint features extracted by the GE2E model are combined and converted into Melp features of the user's voice timbre.
[0150] The HIFIGAN model is used to convert the Melp features of the user's voice into waveforms to generate rap audio.
[0151] Optionally, for the refined and extended functions of the program, refer to the description above.
[0152] This application embodiment also provides a readable storage medium that can store a program suitable for execution by a processor, the program being used for:
[0153] Obtaining user-recorded audio and a rap template;
[0154] Validating the parameters of the rap template, and extracting the PPG semantic features of the rap template using an ASR model;
[0155] Extracting the voiceprint features of the user-recorded audio using a GE2E model, the GE2E model being trained on a voiceprint recognition task with the GE2E loss function;
[0156] Combining the PPG semantic features with the voiceprint features extracted by the GE2E model, and converting them into Melp features of the user's timbre;
[0157] Converting the Melp features of the user's timbre into a waveform using a HIFIGAN model to generate the rap audio.
[0158] Optionally, for the refined and extended functions of the program, refer to the description above.
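The combination step [0156] is detailed in claims 1 and 3: the PPG semantic features and the fundamental frequency features each pass through a convolutional network, the two results are added, and the voiceprint features are concatenated to the sum along the time dimension. The sketch below only illustrates those tensor shapes in plain Python. Everything concrete in it is an assumption for illustration: the kernel size of 3, the random weights, the voiceprint having the same width as the fused features, and the voiceprint being prepended as a single extra frame. The names `conv1d_same`, `random_kernels`, and `fuse` are hypothetical.

```python
import random


def conv1d_same(frames, kernels):
    """1-D convolution over time with zero padding, keeping the number of frames.

    frames:  list of T frames, each a list of C_in floats
    kernels: list of C_out kernels; each kernel is a list of K taps,
             and each tap is a list of C_in weights
    returns: list of T frames, each a list of C_out floats
    """
    k = len(kernels[0])
    c_in = len(frames[0])
    pad = k // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + frames + [zero] * pad
    out = []
    for t in range(len(frames)):
        window = padded[t:t + k]
        out.append([
            sum(w * x for tap, frame in zip(kern, window) for w, x in zip(tap, frame))
            for kern in kernels
        ])
    return out


def random_kernels(c_out, k, c_in, rng):
    """Stand-in weights; a real system would use trained parameters."""
    return [[[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(k)]
            for _ in range(c_out)]


def fuse(ppg, f0, voiceprint, ppg_kernels, f0_kernels):
    """Return target extended features of shape (T + 1, H)."""
    first = conv1d_same(ppg, ppg_kernels)                  # "first feature", (T, H)
    second = conv1d_same([[v] for v in f0], f0_kernels)    # "second feature", (T, H)
    initial = [[a + b for a, b in zip(fa, fb)] for fa, fb in zip(first, second)]
    # Assumption: the voiceprint joins the sequence as one extra time step.
    return [list(voiceprint)] + initial


if __name__ == "__main__":
    rng = random.Random(0)
    T, D_PPG, H = 5, 4, 3
    ppg = [[rng.random() for _ in range(D_PPG)] for _ in range(T)]
    f0 = [rng.random() for _ in range(T)]
    voiceprint = [rng.random() for _ in range(H)]
    target = fuse(ppg, f0, voiceprint,
                  random_kernels(H, 3, D_PPG, rng), random_kernels(H, 3, 1, rng))
    print(len(target), len(target[0]))  # frames and feature width
```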
[0159] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0160] The embodiments in this specification are described progressively, with each embodiment focusing on its differences from the others. For parts that are the same or similar across embodiments, the embodiments may be referred to one another.
[0161] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method of generating rap audio, characterized by comprising:
obtaining user-recorded audio and a rap template;
validating the parameters of the rap template, and extracting the PPG semantic features of the rap template using an ASR model;
extracting the voiceprint features of the user-recorded audio using a GE2E model, the GE2E model being trained on a voiceprint recognition task with the GE2E loss function;
combining the PPG semantic features with the voiceprint features extracted by the GE2E model and converting them into Melp features of the user's timbre, including: extracting the fundamental frequency features of the user-recorded audio; obtaining initial extended features based on the PPG semantic features and the fundamental frequency features; concatenating the voiceprint features extracted by the GE2E model and the initial extended features along the time dimension to generate target extended features; using an RNN network to capture the latent features of the target extended features and determining the block feature size; and using a block relative position attention mechanism to extract the context features of the latent features based on the block feature size, thereby generating the Melp features of the user's timbre;
converting the Melp features of the user's timbre into a waveform using a HIFIGAN model to generate the rap audio;
wherein the block relative position attention mechanism is as follows: [formula not reproduced in this text], where SM is the softmax function and SP is the softplus function; there are K groups of mixture parameters, each group containing a weight, a mean step size, a scaling term, and the size of the mean; the output of the mechanism is the attention weights; p is the block size; i denotes the i-th decoding output step; j denotes the time coordinate of the latent features involved; and the remaining symbol denotes the hidden features output by the RNN network.
2. The method according to claim 1, characterized in that validating the parameters of the rap template includes: validating the sampling rate, the number of channels, and the quantization bit width of the rap template.
3. The method according to claim 1, characterized in that obtaining the initial extended features based on the PPG semantic features and the fundamental frequency features includes: performing convolutional feature extraction on the PPG semantic features using a PPG processing network to obtain a first feature; performing convolutional feature extraction on the fundamental frequency features using a fundamental frequency processing network to obtain a second feature; and adding the first feature and the second feature to obtain the initial extended features.
4. The method according to claim 1, characterized in that the HIFIGAN model includes a HIFIGAN vocoder and a convolutional residual structure; the HIFIGAN vocoder includes a multi-scale discriminator and a multi-period discriminator and is used to generate the rap audio based on the Melp features of the user's timbre; and the convolutional residual structure enlarges the receptive field by alternating dilated convolution with ordinary convolution, thereby ensuring the synthesized sound quality of the rap audio and improving inference speed.
5. A rap audio generation apparatus, characterized by comprising:
a material acquisition unit, used to obtain user-recorded audio and a rap template;
a PPG feature unit, used to validate the parameters of the rap template and to extract the PPG semantic features of the rap template using an ASR model;
a voiceprint feature unit, used to extract the voiceprint features of the user-recorded audio using a GE2E model, the GE2E model being trained on a voiceprint recognition task with the GE2E loss function;
a Melp feature unit, used to combine the PPG semantic features with the voiceprint features extracted by the GE2E model and convert them into Melp features of the user's timbre, including: extracting the fundamental frequency features of the user-recorded audio; obtaining initial extended features based on the PPG semantic features and the fundamental frequency features; concatenating the voiceprint features extracted by the GE2E model and the initial extended features along the time dimension to generate target extended features; using an RNN network to capture the latent features of the target extended features and determining the block feature size; and using a block relative position attention mechanism to extract the context features of the latent features based on the block feature size, thereby generating the Melp features of the user's timbre;
an audio generation unit, used to convert the Melp features of the user's timbre into a waveform using a HIFIGAN model to generate the rap audio;
wherein the block relative position attention mechanism is as follows: [formula not reproduced in this text], where SM is the softmax function and SP is the softplus function; there are K groups of mixture parameters, each group containing a weight, a mean step size, a scaling term, and the size of the mean; the output of the mechanism is the attention weights; p is the block size; i denotes the i-th decoding output step; j denotes the time coordinate of the latent features involved; and the remaining symbol denotes the hidden features output by the RNN network.
6. A rap audio generation device, characterized by comprising a memory and a processor; the memory is used to store a program; and the processor is configured to execute the program to implement each step of the rap audio generation method according to any one of claims 1-4.
7. A readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements each step of the rap audio generation method according to any one of claims 1-4.
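Claim 4 states that the convolutional residual structure enlarges the receptive field by alternating dilated and ordinary convolutions. The arithmetic behind that statement can be sketched as follows: for stride-1 one-dimensional convolutions, each layer widens the receptive field by (kernel_size - 1) * dilation samples. The kernel size of 3 and the dilation rates (1, 3, 5) are illustrative values chosen for this example and are not taken from the claims; `receptive_field` and `alternating_stack` are hypothetical helper names.

```python
def receptive_field(layers):
    """Receptive field, in input samples, of stacked stride-1 1-D convolutions.

    layers: iterable of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def alternating_stack(kernel_size, dilations):
    """Follow each dilated convolution with an ordinary (dilation 1) convolution."""
    layers = []
    for d in dilations:
        layers.append((kernel_size, d))  # dilated convolution
        layers.append((kernel_size, 1))  # ordinary convolution
    return layers


if __name__ == "__main__":
    alternating = alternating_stack(3, (1, 3, 5))  # six layers in total
    ordinary_only = [(3, 1)] * 6                   # six ordinary layers
    print(receptive_field(alternating))    # 25
    print(receptive_field(ordinary_only))  # 13
```

With the same number of layers and the same number of weights, the alternating stack in this example covers 25 input samples against 13 for ordinary convolutions alone, which is the kind of gain the claim attributes to the structure.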