Song generation method, apparatus, device, and storage medium
By generating songs using deep learning models and voice imitation techniques, and combining volume balancing and rhythm adjustment, the problem of insufficient quality in song generation has been solved, and high-quality songs have been generated.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUNDAI TECH CO LTD
- Filing Date
- 2023-10-20
- Publication Date
- 2026-06-23
Figure CN117496923B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a song generation method, apparatus, device, and storage medium.
Background Technology
[0002] In recent years, with the rapid development of artificial intelligence technology, natural language processing, as an important direction of artificial intelligence, has also made significant progress. Currently, natural language processing technology has been applied in many fields, such as song generation.
[0003] For song generation tasks, the quality of the generated songs is crucial, as it significantly impacts user experience. Therefore, there is an urgent need for an effective method to automatically generate songs and improve their quality.
Summary of the Invention
[0004] This application provides a song generation method, apparatus, device, and storage medium, which can improve song quality. The technical solution is as follows:
[0005] On the one hand, a song generation method is provided, the method comprising:
[0006] Obtain a song generation request; wherein, the song generation request includes input song attribute information and reference sound samples for sound imitation; the song attribute information includes at least the basic components of song style and song emotional type;
[0007] Generate initial audio based on the song attribute information;
[0008] The encoder of the voice imitation model is invoked to map the reference voice sample into a latent vector; and the decoder of the voice imitation model is invoked to generate an imitation audio that matches the voice features of the human voice in the reference voice sample based on the latent vector.
[0009] The initial audio and the imitation audio are synthesized to obtain the target song.
[0010] In one possible implementation, generating the initial audio based on the song attribute information includes:
[0011] The song generation model is invoked to generate the initial audio based on the song attribute information;
[0012] The training process of the song generation model includes:
[0013] Obtain a music dataset and convert the music data included in the music dataset into a music sequence; wherein, a music sequence includes notes at multiple time steps;
[0014] The first deep learning model is trained based on the transformed music sequence to obtain the song generation model;
[0015] The training objective during model training is to maximize the first log-likelihood function of the notes generated by the model. The first log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements. The first i-1 elements and the i-th element come from the music sequence input to the model.
[0016] In one possible implementation, the song generation request further includes an initial input music sequence; the step of calling the song generation model to generate the initial audio based on the song attribute information includes:
[0017] The song generation model is invoked to generate the initial audio based on the initial music sequence, under the constraints of the song attribute information.
[0018] In one possible implementation, the training process of the voice imitation model includes:
[0019] Obtain an unlabeled human voice dataset;
[0020] Preprocess the voice data included in the voice dataset to obtain voice samples;
[0021] The second deep learning model is trained based on the obtained human voice samples to obtain the voice imitation model;
[0022] The training objective during model training is to maximize the second log-likelihood function of the human voice samples generated by the model and minimize the distance between the first latent vector and the second latent vector.
[0023] The second log-likelihood function is used to represent the probability that the model generates the i-th element given the first i-1 elements and the third latent vector; the first i-1 elements and the i-th element are derived from human voice samples input to the model;
[0024] The first latent vector is the latent vector of the human voice sample generated by the model; the second latent vector is the latent vector of the human voice sample input to the model; and the third latent vector is sampled from the probability distribution of the second latent vector.
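The training objective in the preceding paragraphs can be illustrated with a minimal sketch. It assumes a Gaussian likelihood with fixed variance, a squared Euclidean distance between latent vectors, and a weighting factor `weight`; none of these choices is stated in this application, and the function names are hypothetical. The per-element predictions in `reconstruction` are assumed to have already been produced by a decoder conditioned on the preceding elements and on the sampled (third) latent vector.

```python
import math


def gaussian_log_likelihood(target, predicted, sigma=1.0):
    """Sum over elements of log N(target_i | predicted_i, sigma^2).

    Assumption: a fixed-variance Gaussian stands in for the model's
    per-element conditional distribution.
    """
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - 0.5 * ((t - p) / sigma) ** 2
               for t, p in zip(target, predicted))


def squared_distance(a, b):
    """Squared Euclidean distance between two latent vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def imitation_loss(sample, reconstruction, z_generated, z_input, weight=1.0):
    """Quantity to minimize during training.

    Maximizing the log-likelihood equals minimizing its negative; the
    second term penalizes the distance between the latent vector of the
    generated sample (first) and that of the input sample (second).
    """
    nll = -gaussian_log_likelihood(sample, reconstruction)
    return nll + weight * squared_distance(z_generated, z_input)
```

With a perfect reconstruction and identical latent vectors, only the constant term of the Gaussian likelihood remains; any reconstruction error or latent mismatch increases the loss.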
[0025] In one possible implementation, synthesizing the initial audio and the imitation audio to obtain the target song includes:
[0026] The initial audio and the imitation audio are superimposed to obtain the superimposed audio.
[0027] Perform volume balancing, rhythm adjustment, and loss compensation operations on the superimposed audio to obtain the target song;
[0028] The volume balancing operation is used to adjust the volume of different parts of the audio; the rhythm adjustment operation is used to adjust the audio rhythm based on the user's rhythm requirements; and the loss compensation operation is used to repair the sound quality.
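As an illustration of the volume balancing operation only, the following sketch scales each fixed-length segment toward a common RMS level. The segment length, the target level, the gain cap, and the function names are assumptions made for the example; this application does not specify how volume balancing is performed.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def balance_volume(samples, segment_len, target_rms=0.1, max_gain=4.0):
    """Scale each segment so that its RMS approaches target_rms.

    Assumptions: mono audio as floats, non-overlapping segments, and a
    gain cap so that near-silent segments are not amplified into noise.
    """
    balanced = []
    for start in range(0, len(samples), segment_len):
        segment = samples[start:start + segment_len]
        level = rms(segment)
        gain = min(target_rms / level, max_gain) if level > 0 else 1.0
        balanced.extend(s * gain for s in segment)
    return balanced
```

A practical implementation would likely smooth the gain across segment boundaries to avoid audible steps; that refinement is omitted here.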
[0029] In one possible implementation, the step of superimposing the initial audio and the imitation audio to obtain the superimposed audio includes:
[0030] After aligning the initial audio and the imitation audio in time, the sample values of the waveform corresponding to the initial audio and the sample values of the waveform corresponding to the imitation audio are added together to obtain the superimposed audio.
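The superposition step can be sketched as follows, assuming both waveforms are mono, share the same sampling rate, and are represented as lists of floats. The `offset` parameter (the number of samples by which the imitation audio starts after the initial audio) is a hypothetical way of expressing the time alignment.

```python
def superimpose(initial, imitation, offset=0):
    """Add two waveforms sample by sample after aligning them in time.

    `offset` is the (assumed, non-negative) number of samples by which
    `imitation` starts later than `initial`. Positions covered by only
    one waveform keep that waveform's sample value.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    length = max(len(initial), offset + len(imitation))
    mixed = [0.0] * length
    for i, sample in enumerate(initial):
        mixed[i] += sample
    for i, sample in enumerate(imitation):
        mixed[offset + i] += sample
    return mixed
```

Because plain addition can exceed the valid amplitude range, a subsequent volume balancing step is one way to bring the result back within range.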
[0031] In one possible implementation, obtaining the song generation request includes:
[0032] Display the song settings interface; wherein, the song settings interface includes a song attribute information setting control and a reference sound sample upload control;
[0033] Based on the song attribute information setting control, obtain the input song attribute information;
[0034] Based on the reference sound sample upload control, obtain the uploaded reference sound sample;
[0035] Based on the input song attribute information and the uploaded reference sound sample, the song generation request is generated.
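One possible shape for the song generation request is sketched below. The class names, field names, and validation rules are hypothetical and are introduced only for illustration; the optional `initial_sequence` field corresponds to the implementation in which the request also carries an initial music sequence.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SongAttributes:
    style_elements: Dict[str, str]  # e.g. {"rhythm": "4/4", "mode": "minor"}
    emotion: str                    # e.g. "healing"


@dataclass
class SongGenerationRequest:
    attributes: SongAttributes
    reference_sample: bytes                      # uploaded voice recording
    initial_sequence: Optional[List[int]] = None # optional seed notes


def build_request(style_elements, emotion, reference_sample,
                  initial_sequence=None):
    """Assemble a request from the values collected by the settings interface."""
    if not style_elements:
        raise ValueError("at least one style element is required")
    if not emotion:
        raise ValueError("an emotional type is required")
    if not reference_sample:
        raise ValueError("a reference sound sample is required")
    seed = list(initial_sequence) if initial_sequence is not None else None
    return SongGenerationRequest(
        SongAttributes(dict(style_elements), emotion),
        bytes(reference_sample),
        seed,
    )
```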
[0036] On the other hand, a song generation apparatus is provided, the apparatus comprising:
[0037] The acquisition unit is configured to acquire a song generation request; wherein the song generation request includes input song attribute information and reference sound samples for sound imitation; the song attribute information includes at least the basic components of song style and song emotional type;
[0038] The first generation unit is configured to generate initial audio based on the song attribute information;
[0039] The second generation unit is configured to invoke the encoder of the voice imitation model to map the reference sound sample into a latent vector; and to invoke the decoder of the voice imitation model to generate an imitation audio that matches the voice features of the human voice in the reference sound sample based on the latent vector.
[0040] The synthesis unit is configured to synthesize the initial audio and the imitation audio to obtain the target song.
[0041] In one possible implementation, the first generation unit is configured to invoke a song generation model to generate the initial audio based on the song attribute information;
[0042] The training process of the song generation model includes:
[0043] Obtain a music dataset and convert the music data included in the music dataset into a music sequence; wherein, a music sequence includes notes at multiple time steps;
[0044] The first deep learning model is trained based on the transformed music sequence to obtain the song generation model;
[0045] The training objective during model training is to maximize the first log-likelihood function of the notes generated by the model. The first log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements. The first i-1 elements and the i-th element come from the music sequence input to the model.
[0046] In one possible implementation, the song generation request further includes an input initial music sequence; the first generation unit is configured to invoke the song generation model to generate the initial audio based on the initial music sequence under the constraints of the song attribute information.
[0047] In one possible implementation, the training process of the voice imitation model includes:
[0048] Obtain an unlabeled human voice dataset;
[0049] Preprocess the voice data included in the voice dataset to obtain voice samples;
[0050] The second deep learning model is trained based on the obtained human voice samples to obtain the voice imitation model;
[0051] The training objective during model training is to maximize the second log-likelihood function of the human voice samples generated by the model and minimize the distance between the first latent vector and the second latent vector.
[0052] The second log-likelihood function is used to represent the probability that the model generates the i-th element given the first i-1 elements and the third latent vector; the first i-1 elements and the i-th element are derived from human voice samples input to the model;
[0053] The first latent vector is the latent vector of the human voice sample generated by the model; the second latent vector is the latent vector of the human voice sample input to the model; and the third latent vector is sampled from the probability distribution of the second latent vector.
[0054] In one possible implementation, the synthesis unit is configured as follows:
[0055] The initial audio and the imitation audio are superimposed to obtain the superimposed audio.
[0056] Perform volume balancing, rhythm adjustment, and loss compensation operations on the superimposed audio to obtain the target song;
[0057] The volume balancing operation is used to adjust the volume of different parts of the audio; the rhythm adjustment operation is used to adjust the audio rhythm based on the user's rhythm requirements; and the loss compensation operation is used to repair the sound quality.
[0058] In one possible implementation, the synthesis unit is configured as follows:
[0059] After aligning the initial audio and the imitation audio in time, the sample values of the waveform corresponding to the initial audio and the sample values of the waveform corresponding to the imitation audio are added together to obtain the superimposed audio.
[0060] In one possible implementation, the acquisition unit is configured as follows:
[0061] Display the song settings interface; wherein, the song settings interface includes a song attribute information setting control and a reference sound sample upload control;
[0062] Based on the song attribute information setting control, obtain the input song attribute information;
[0063] Based on the reference sound sample upload control, obtain the uploaded reference sound sample;
[0064] Based on the input song attribute information and the uploaded reference sound sample, the song generation request is generated.
[0065] On the other hand, a computer device is provided, the device including a processor and a memory, the memory storing at least one piece of program code, the at least one piece of program code being loaded and executed by the processor to implement the song generation method described above.
[0066] On the other hand, a computer-readable storage medium is provided, wherein at least one piece of program code is stored therein, the at least one piece of program code being loaded and executed by a processor to implement the above-described song generation method.
[0067] On the other hand, a computer program product or computer program is provided, which includes computer program code stored in a computer-readable storage medium. A processor of a computer device reads the computer program code from the computer-readable storage medium and executes the computer program code, causing the computer device to perform the song generation method described above.
[0068] The song generation scheme provided in this application can generate songs with specific vocal features according to user needs. Specifically, the scheme first obtains a song generation request, which includes user-inputted song attribute information and reference sound samples for voice imitation. Then, the song is generated based on the song attribute information. Because user needs are considered during song generation, the generated song better matches user expectations, improving song quality. Furthermore, the scheme includes a voice imitation process, where the encoder of the voice imitation model maps the reference sound sample into a latent vector, and the decoder of the voice imitation model generates an imitation audio that matches the vocal features of the reference sound sample based on the latent vector. Finally, by synthesizing the initial audio and the imitation audio, a song with specific vocal features is generated, greatly enriching the song generation methods.
Attached Figure Description
[0069] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0070] Figure 1 is a schematic diagram of the implementation environment involved in a song generation method provided in an embodiment of this application;
[0071] Figure 2 is a schematic diagram of the architecture of a song generation method provided in an embodiment of this application;
[0072] Figure 3 is a flowchart of a song generation method provided in an embodiment of this application;
[0073] Figure 4 is a flowchart of another song generation method provided in an embodiment of this application;
[0074] Figure 5 is a schematic diagram of the structure of a song generation device provided in an embodiment of this application;
[0075] Figure 6 is a schematic diagram of the structure of a computer device provided in an embodiment of this application;
[0076] Figure 7 is a schematic diagram of the structure of another computer device provided in an embodiment of this application.
Detailed Implementation
[0077] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.
[0078] In this application, the terms "first," "second," etc., are used to distinguish identical or similar items that have essentially the same function. It should be understood that there is no logical or temporal dependency between "first," "second," and "nth," nor does it limit the quantity or execution order. It should also be understood that although the following description uses the terms "first," "second," etc., to describe various elements, these elements should not be limited by the terms.
[0079] These terms are simply used to distinguish one element from another. For example, without departing from the various examples, the first element can be referred to as the second element, and similarly, the second element can be referred to as the first element. Both the first and second elements can be elements, and in some cases, they can be separate and distinct elements.
[0080] "At least one" refers to one or more elements. For example, at least one element can be one element, two elements, three elements, or any integer number of elements greater than or equal to one. "Multiple" refers to two or more elements. For example, multiple elements can be two elements, three elements, or any integer number of elements greater than or equal to two.
[0081] In this application, "and/or" indicates that three relationships can exist. For example, A and/or B can represent: A alone, A and B simultaneously, or B alone. The character "/" generally indicates that the preceding and following objects are in an "or" relationship.
[0082] It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, data stored, data displayed, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant regions.
[0083] Figure 1 is a schematic diagram of the implementation environment involved in a song generation method provided in this application embodiment.
[0084] In this embodiment of the application, the implementation environment includes a computer device. For example, referring to Figure 1, the computer device includes a terminal 101 and a server 102. In other words, the song generation method is jointly executed by the terminal 101 and the server 102, and this application does not limit this.
[0085] In one possible implementation, server 102 is used to train the song generation model and the voice imitation model. Upon receiving a song generation request from terminal 101, server 102 automatically generates a song based on the trained model and returns the generated song to terminal 101. Alternatively, a dedicated server can train the song generation model and the voice imitation model and send the trained model to server 102. In this case, upon receiving a song generation request from terminal 101, server 102 automatically generates a song based on the trained model and returns the generated song to terminal 101.
[0086] For example, terminal 101 is a computer device with a display screen, such as a smartphone or tablet computer; while server 102 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, and this application does not limit it.
[0087] Furthermore, the servers involved in the embodiments of this application may also include other servers to provide more comprehensive and diversified services. Additionally, those skilled in the art will understand that the number of terminals may be more or fewer than illustrated. For example, the number of terminals may be only a few, or dozens or hundreds, or even more; this application does not limit this number.
[0088] For example, terminal 101 is equipped with an application that provides song generation function. For instance, users can download the application via mobile phone or tablet. Server 102 is used to provide background services for the application. For instance, server 102 is equipped with a trained song generation model and a sound imitation model that can automatically generate songs.
[0089] Based on the aforementioned implementation environment, this application provides a song generation scheme that utilizes deep learning technology for song generation and voice imitation. Exemplarily, this scheme uses a large model as the song generation model, i.e., it uses a large model to generate the song, and uses a voice imitation model (such as a variational autoencoder) to imitate a specific human voice. As shown in Figure 2, this application embodiment provides a system 20 for generating songs and performing human voice imitation. Referring to Figure 2, the system includes: a song generation module 21, a sound imitation module 22, and a synthesis module 23.
[0090] In one possible implementation, the song generation module 21 uses a large model to generate songs. This large model learns from a large amount of music data to generate new songs, which include elements such as melody, chords, and rhythm. The voice imitation module 22 is used to imitate human voices based on the VITS model. The VITS model is a deep learning model combining variational autoencoders and Transformers. This module learns from a large amount of human voice data to generate audio that imitates a specific human voice, i.e., imitation audio. This imitation audio can be used for audio synthesis to generate songs that match the vocal characteristics of a specific human voice, i.e., songs with specified vocal features. The synthesis module 23 is used to synthesize the song generated by the song generation module and the imitation audio output by the voice imitation module to obtain a complete song.
[0091] The song generation scheme provided in the embodiments of this application is described in detail below with reference to Figure 1 and Figure 2.
[0092] Figure 3 is a flowchart of a song generation method provided in an embodiment of this application. The method is executed by a computer device. Referring to Figure 3, the method flow provided in this application embodiment includes:
[0093] 301. A computer device acquires a song generation request; wherein the song generation request includes input song attribute information and reference sound samples for sound imitation; the song attribute information includes at least the basic components of the song style and the song emotional type.
[0094] In one possible implementation, the song attribute information includes not only the basic elements of song style, but also the song's emotional type, in order to generate songs that meet the user's emotional needs.
[0095] For example, the basic elements of a song style include, but are not limited to, melody, rhythm, meter, dynamics, register, timbre, harmony, polyphony, mode, and tonality; the emotional types of a song include, but are not limited to, reminiscence, healing, longing, sadness, loneliness, sweetness, happiness, inspiration, passion, and tranquility.
[0096] In another possible implementation, this application embodiment obtains the song generation request in the following manner: a computer device displays a song settings interface; wherein the song settings interface includes at least a song attribute information setting control and a reference sound sample upload control; then, the computer device obtains the song attribute information input by the user based on the song attribute information setting control; and, based on the reference sound sample upload control, obtains the reference sound sample uploaded by the user; finally, the computer device generates a song generation request based on the song attribute information input by the user and the reference sound sample uploaded by the user.
[0097] For example, the song attribute information setting control includes multiple setting items, and each setting item supports the user to set a basic component. This application does not limit this.
[0098] In addition, taking the joint execution of this solution by the terminal and the server as an example, after the terminal generates a song generation request, it will upload the song generation request to the server to request the server to generate the song according to the song generation request.
[0099] 302. The computer device generates the initial audio based on the song's attribute information.
[0100] This step is executed by the song generation module 21 in Figure 2 based on the song generation model. Furthermore, in this embodiment, the song generated by the song generation module 21 is referred to as the initial audio.
[0101] The song generation module 21 generates initial audio based on the song's attribute information, which falls under conditional generation. In conditional generation, the computer device acquires additional information or conditions provided by the user to guide song generation. For example, the user can provide information or conditions such as the song's melody, rhythm, beat, timbre, harmony, and emotional type, and the song generation module can then generate the song based on this information or conditions. This conditional generation method helps users control the characteristics of the generated song more precisely.
[0102] Furthermore, since songs are typically sequences of notes and time steps, song generation models can employ sequence generation techniques to progressively generate the notes for each time step.
[0103] In addition, for song generation, users can also provide an initial music sequence, which includes multiple notes. The song generation model then uses a conditional generation method to gradually expand the entire song based on the initial music sequence. This application does not impose any limitations on this.
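The step-by-step expansion of an initial music sequence under attribute constraints can be sketched as below. The toy probability function is a hypothetical stand-in for a trained song generation model, and the `SCALES` table and the `"mode"` attribute key are assumptions made for the example.

```python
import random

# Hypothetical scales (MIDI pitches) selected by a "mode" attribute.
SCALES = {
    "major": [60, 62, 64, 65, 67, 69, 71, 72],
    "minor": [60, 62, 63, 65, 67, 68, 70, 72],
}


def toy_next_note_probs(prefix, attributes):
    """Stand-in for a trained model.

    Returns a uniform distribution over the last note and its neighbours
    in the scale chosen by the attributes (the condition).
    """
    scale = SCALES[attributes.get("mode", "major")]
    last = prefix[-1] if prefix else scale[0]
    idx = scale.index(last) if last in scale else 0
    candidates = sorted({scale[max(idx - 1, 0)],
                         scale[idx],
                         scale[min(idx + 1, len(scale) - 1)]})
    return {note: 1.0 / len(candidates) for note in candidates}


def extend_sequence(initial, attributes, length, rng):
    """Append one sampled note per time step until `length` notes exist."""
    sequence = list(initial)
    while len(sequence) < length:
        probs = toy_next_note_probs(sequence, attributes)
        notes = list(probs)
        weights = [probs[n] for n in notes]
        sequence.append(rng.choices(notes, weights=weights, k=1)[0])
    return sequence
```

For example, `extend_sequence([60, 62], {"mode": "minor"}, 8, random.Random(0))` keeps the two seed notes and appends six notes drawn from the minor scale.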
[0104] 303. The computer device invokes the encoder of the sound imitation model to map a reference sound sample into a latent vector; and invokes the decoder of the sound imitation model to generate an imitation audio that matches the sound features of a human voice in the reference sound sample based on the latent vector.
[0105] This step is executed by the sound imitation module 22 in Figure 2. It should be noted that the latent vector is also called the hidden vector. For example, the reference sound sample is a human voice sample used for sound imitation.
[0106] For voice imitation, the process includes encoding a reference voice sample into a latent vector using an encoder of the voice imitation model, and generating a human voice similar to the reference voice sample using a decoder of the voice imitation model.
[0107] 304. The computer device synthesizes the initial audio and the imitation audio to obtain the target song.
[0108] In this step, the synthesis module 23 in Figure 2 synthesizes the generated song and the imitated human voice to generate the final song.
[0109] The song generation scheme provided in this application can generate songs with specific vocal features according to user needs. Specifically, the scheme first obtains a song generation request, which includes user-inputted song attribute information and reference sound samples for voice imitation. Then, the song is generated based on the song attribute information. Because user needs are considered during song generation, the generated song better matches user expectations, improving song quality. Furthermore, the scheme includes a voice imitation process, where the encoder of the voice imitation model maps the reference sound sample into a latent vector, and the decoder of the voice imitation model generates an imitation audio that matches the vocal features of the reference sound sample based on the latent vector. Finally, by synthesizing the initial audio and the imitation audio, a song with specific vocal features is generated, greatly enriching the song generation methods.
[0110] The above describes some technical details of the song generation scheme provided in the embodiments of this application. The song generation scheme is introduced below based on the specific implementation shown in Figure 4.
[0111] Figure 4 is a flowchart of another song generation method provided in an embodiment of this application. The method is executed by a computer device. Referring to Figure 4, the method flow provided in this application embodiment includes:
[0112] 401. The computer device obtains a song generation request; wherein the song generation request includes input song attribute information and an initial music sequence, as well as reference sound samples for sound imitation; the song attribute information includes at least the basic components of the song style and the song emotional type.
[0113] In one possible implementation, the song generation request includes not only song attribute information and reference sound samples, but also an initial music sequence. This initial music sequence serves as input data for the song generation model, which then progressively expands the entire song based on this input data, thus obtaining the initial audio.
[0114] 402. The computer device calls the song generation model and, under the constraints of the song's attribute information, generates the initial audio based on the initial music sequence.
[0115] In one possible implementation, the training process of the song generation model includes the following steps:
[0116] 4021. Obtain a music dataset and convert the music data contained in the dataset into a music sequence; wherein a music sequence includes notes at multiple time steps.
[0117] In this embodiment of the application, the music dataset includes a large amount of music data collected in advance. Exemplarily, the music dataset includes various types of music data.
[0118] During the training process, since the raw music data is not suitable as direct input to the model, it will be preprocessed first, that is, the music data will be converted into a series of notes or note groups, and then the converted notes or note groups will be used to train the song generation model, as detailed in step 4022 below.
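One way to convert music data into a music sequence with one note per time step is sketched below. It assumes monophonic input given as `(start, duration, pitch)` events measured in beats, a fixed step size, and a `REST` marker for silent steps; these representations are illustrative and are not specified in this application.

```python
REST = -1  # hypothetical marker for a time step without a sounding note


def events_to_sequence(events, step=0.25):
    """Quantize (start_beats, duration_beats, midi_pitch) events to time steps.

    Returns a list with one entry per step; later events overwrite earlier
    ones if they overlap (monophonic assumption).
    """
    if not events:
        return []
    end = max(start + duration for start, duration, _ in events)
    n_steps = int(round(end / step))
    sequence = [REST] * n_steps
    for start, duration, pitch in events:
        first = int(round(start / step))
        last = int(round((start + duration) / step))
        for i in range(first, min(last, n_steps)):
            sequence[i] = pitch
    return sequence
```

For instance, a half-beat C4 followed, after a rest, by a quarter-beat E4 at beat 1 becomes `[60, 60, -1, -1, 64]` with a quarter-beat step.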
[0119] 4022. Train the first deep learning model based on the transformed music sequence to obtain the song generation model.
[0120] For example, the first deep learning model described above includes, but is not limited to, generative adversarial networks, recurrent neural networks, or Transformer structures, and this application does not limit it in this regard.
[0121] For training the song generation model, the training objective is to maximize the log-likelihood function of the generated notes. This log-likelihood function represents the probability that the model generates the ith element given the first i-1 elements; where the first i-1 elements and the ith element are derived from the music sequence input to the model. In another possible implementation, the log-likelihood function takes the following form:
[0122] $$L(\theta) = \sum_{i} \log P(x_i \mid x_{<i}; \theta)$$
[0123] where $x_i$ is the i-th element of the input music sequence, $x_{<i}$ denotes the first i-1 elements of the input music sequence, $\theta$ denotes the model parameters, and $P$ represents the probability distribution.
[0124] It should be noted that, in order to distinguish it from the log-likelihood function that appears later in the text, the log-likelihood function here is also called the first log-likelihood function, and the log-likelihood function that appears later in the text is also called the second log-likelihood function.
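The first log-likelihood function can be computed for a given music sequence as in the following sketch. The `uniform_model` function is a hypothetical stand-in for a trained model, and the small probability floor is an assumption added to avoid taking the logarithm of zero.

```python
import math


def sequence_log_likelihood(sequence, next_token_probs):
    """Sum of log P(x_i | first i-1 elements) over the whole sequence.

    `next_token_probs(prefix)` must return a dict mapping each candidate
    element to its probability given the prefix.
    """
    total = 0.0
    for i, token in enumerate(sequence):
        probs = next_token_probs(sequence[:i])
        # Floor the probability so unseen elements do not raise an error.
        total += math.log(max(probs.get(token, 0.0), 1e-12))
    return total


def uniform_model(prefix):
    """Toy model: every note in a three-note vocabulary is equally likely."""
    vocabulary = [60, 62, 64]
    return {note: 1.0 / len(vocabulary) for note in vocabulary}
```

Training would adjust the model parameters so that this sum increases over the sequences in the music dataset; the optimization procedure itself is not shown.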
[0125] In another possible implementation, songs can be generated through random sampling, optimization, or a hybrid approach. For random sampling, taking a variational autoencoder as an example, a set of latent vectors with mean and variance can be provided. Then, random sampling is performed from the probability distribution of these latent vectors (obtained by encoding input data into the latent space), generating latent vectors (different from the previous ones). These generated latent vectors are then converted into audio data by a decoder, resulting in the generated song. Optimization generates songs by minimizing or maximizing an objective function. For example, an objective function can be defined, perhaps related to the song's sound quality, rhythm, etc. Then, optimization algorithms are used to adjust model parameters or latent vectors to best meet user requirements. Optimization methods typically require more computational resources and time but offer finer control. Hybrid approaches combine multiple methods to achieve better generation results. That is, hybrid approaches leverage the advantages of different generation methods to meet user needs. For example, a rough music clip can be generated first using random sampling, and then the generated music clip can be adjusted using optimization or conditional generation methods.
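The random sampling approach can be illustrated as follows, assuming a diagonal Gaussian latent distribution described by per-dimension means and variances. The `toy_decoder` function is a hypothetical placeholder for a trained decoder: it merely maps two latent dimensions to the amplitude and frequency of a short sine wave.

```python
import math
import random


def sample_latent(mean, variance, rng):
    """Draw z = mean + sqrt(variance) * eps with eps ~ N(0, 1) per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
            for m, v in zip(mean, variance)]


def toy_decoder(latent, n_samples=16, sample_rate=8000):
    """Placeholder decoder (assumption): latent[0] sets amplitude, latent[1] pitch."""
    amplitude = 1.0 / (1.0 + math.exp(-latent[0]))  # squashed into (0, 1)
    frequency = 220.0 + 20.0 * latent[1]
    return [amplitude * math.sin(2.0 * math.pi * frequency * n / sample_rate)
            for n in range(n_samples)]
```

Each call to `sample_latent` with a non-zero variance yields a different latent vector, so repeated decoding produces varied audio from the same encoded input.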
[0126] In another possible implementation, the following operation can also be performed during the audio generation process to achieve rhythm adjustment:
[0127] When generating songs, note interpolation and expansion techniques can be used to increase or decrease the number of notes to suit specific rhythmic requirements. This approach helps generate songs with good coherence, ensuring smooth transitions between notes. Alternatively, the song generation model can have automated rhythm adjustment capabilities, such as automatically adjusting the duration and intensity of notes to meet the user's rhythmic needs. This method ensures a coherent rhythm between different parts of the generated song.
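One possible reading of note interpolation and expansion is sketched below. The representation of a note as a `(pitch, duration)` pair and both function names are assumptions made for illustration; the application does not specify them.

```python
def insert_passing_notes(notes):
    """Increase the note count: split each note in half and add a
    midpoint-pitch note leading into the next one. Total duration is unchanged."""
    expanded = []
    for (pitch, dur), (next_pitch, _) in zip(notes, notes[1:]):
        expanded.append((pitch, dur / 2))
        expanded.append(((pitch + next_pitch) // 2, dur / 2))
    expanded.append(notes[-1])  # the last note has no successor to lead into
    return expanded

def scale_durations(notes, target_total):
    """Stretch or shrink all durations so the melody lasts target_total beats."""
    total = sum(dur for _, dur in notes)
    factor = target_total / total
    return [(pitch, dur * factor) for pitch, dur in notes]

melody = [(60, 1.0), (64, 1.0), (67, 2.0)]   # MIDI pitches, durations in beats
denser = insert_passing_notes(melody)
fitted = scale_durations(denser, target_total=8.0)
```

The midpoint pitch gives a stepwise transition between neighbouring notes, which is one simple way to keep the added notes coherent with the melody.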
[0128] 403. The computer device invokes the encoder of the sound imitation model to map a reference sound sample into a latent vector; and invokes the decoder of the sound imitation model to generate an imitation audio that matches the sound features of a human voice in the reference sound sample based on the latent vector.
[0129] In one possible implementation, the training process of the sound imitation model includes the following steps:
[0130] 4031. Obtain the unlabeled human voice dataset; preprocess the human voice data included in the human voice dataset to obtain human voice samples.
[0131] In this embodiment, the voice dataset includes a large amount of pre-collected voice data. For example, the voice dataset includes speech and singing clips from different people. It should be noted that this embodiment achieves voice imitation by learning from unlabeled voice data.
[0132] During the training process, this embodiment of the application processes human voice data into human voice samples and uses the human voice samples to train the model, as detailed in step 4032 below.
[0133] Processing human voice data into human voice samples is called preprocessing. The reason for preprocessing is that the raw human voice data is not suitable as direct input to the model.
[0134] For example, if the human voice data is a singing segment, the preprocessing can be to convert the singing segment into a series of notes or groups of notes; if the human voice data is user speech, the preprocessing can be to perform frame segmentation and speech activity detection on the user speech, etc., and this application does not limit it.
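For the speech case, frame segmentation followed by a simple energy-based activity detector could look like the sketch below. The energy threshold, frame length, and hop size are illustrative assumptions; the application does not prescribe a particular detection method.

```python
def split_frames(samples, frame_len, hop):
    """Cut the signal into fixed-length, possibly overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def detect_speech(samples, frame_len=4, hop=2, threshold=0.01):
    """Flag each frame as active when its energy exceeds the threshold."""
    return [frame_energy(f) > threshold
            for f in split_frames(samples, frame_len, hop)]

# Eight silent samples followed by eight samples of a loud alternating signal.
signal = [0.0] * 8 + [0.5, -0.5] * 4
flags = detect_speech(signal)
```

Frames flagged as inactive can then be discarded so that only voiced material is passed to the model as human voice samples.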
[0135] 4032. Train a second deep learning model based on the obtained human voice samples to obtain a voice imitation model.
[0136] For training the voice imitation model, the training objective is to maximize the log-likelihood function (second log-likelihood function) of the generated human voice samples and minimize the distance between the first latent vector and the second latent vector. Here, the first latent vector is the latent vector of the generated human voice samples; the second latent vector is the latent vector of the input human voice samples.
[0137] In another possible implementation, taking singing segments as an example, the human voice data input to the model is also called a human voice sequence (composed of musical notes), and the corresponding log-likelihood function takes the following form:
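[0138] $$L_2(\theta) = \sum_{i} \log P(x_i \mid x_{<i}, z; \theta)$$
[0139] where $x_i$ is the i-th element of the input human voice sequence, $x_{<i}$ denotes the first i-1 elements of the input human voice sequence, $z$ is the third latent vector, $\theta$ denotes the model parameters, and $P$ represents the probability distribution.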
[0140] For the above equation, the log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements and the third latent vector z; where the first i-1 elements and the i-th element come from the human voice samples input to the model. The third latent vector z is sampled from the probability distribution of the second latent vector.
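The two-part training objective (maximize the second log-likelihood, minimize the latent distance) can be folded into one loss to minimize, as in the sketch below. The squared Euclidean distance and the `weight` factor are assumptions; the application names neither a specific distance nor a weighting.

```python
import math

def latent_distance(z_a, z_b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((a - b) ** 2 for a, b in zip(z_a, z_b))

def imitation_loss(step_probs, z_generated, z_input, weight=1.0):
    """Negative log-likelihood plus a weighted latent-distance penalty.

    step_probs[i] stands for P(x_i | x_<i, z) as assigned by the model.
    """
    log_lik = sum(math.log(p) for p in step_probs)
    return -log_lik + weight * latent_distance(z_generated, z_input)

# Hypothetical values: one generated step and two-dimensional latent vectors.
loss = imitation_loss([0.5], [1.0, 2.0], [1.0, 0.0], weight=0.5)
```

Minimizing this loss pushes the per-step probabilities up and pulls the latent vector of the generated sample toward that of the input sample.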
[0141] In another possible implementation, the following operation can also be performed during the audio generation process to achieve volume balancing:
[0142] The volume and sound quality of the generated audio can be adjusted by controlling parameters in the latent vector. Alternatively, continuously varying audio can be generated by interpolating different latent vectors in the latent space. Interpolation in the latent space allows for volume balancing, resulting in a smooth volume transition as the volume of the generated audio gradually changes. Alternatively, volume balancing can be achieved through real-time monitoring and adjustment. This method allows for real-time monitoring and adjustment of the volume during the generation process, ensuring that the generated audio meets volume balance requirements.
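Interpolating in the latent space can be illustrated with plain linear interpolation between two latent vectors. Linear interpolation is an assumption here, since the application does not name a scheme, and the function names are hypothetical.

```python
def lerp_latent(z_start, z_end, t):
    """Linear blend of two latent vectors; t = 0 gives z_start, t = 1 gives z_end."""
    return [(1 - t) * a + t * b for a, b in zip(z_start, z_end)]

def interpolation_path(z_start, z_end, steps):
    """Evenly spaced latent vectors from z_start to z_end (steps must be >= 2)."""
    return [lerp_latent(z_start, z_end, k / (steps - 1)) for k in range(steps)]

# Hypothetical latent vectors for a quiet and a loud rendition.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], steps=5)
```

Decoding each vector on the path in turn would give audio whose characteristics change gradually instead of jumping from one setting to the other.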
[0143] 404. The computer device synthesizes the initial audio and the imitation audio to obtain the target song.
[0144] In one possible implementation, the initial audio and the imitation audio can be synthesized in the following way:
[0145] 4041. Superimpose the initial audio and the imitation audio to obtain the superimposed audio.
[0146] For example, the initial audio and the imitation audio can be superimposed in ways including, but not limited to, the following:
[0147] After the initial audio and the imitation audio are aligned in time, the sample values of the waveform corresponding to the initial audio and the sample values of the waveform corresponding to the imitation audio are added together to obtain the superimposed audio.
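A minimal sketch of sample-wise superposition of two aligned waveforms follows. The `offset` parameter (expressing the time alignment in samples) and the final clipping to [-1, 1] are assumptions added for illustration; the text only specifies that aligned sample values are added.

```python
def superimpose(initial, vocal, offset=0):
    """Add two waveforms sample by sample, with `vocal` starting `offset`
    samples into `initial`. Shorter tracks are implicitly zero-padded."""
    length = max(len(initial), offset + len(vocal))
    mixed = [0.0] * length
    for i, s in enumerate(initial):
        mixed[i] += s
    for i, s in enumerate(vocal):
        mixed[offset + i] += s
    # Clipping is an added safeguard, not part of the described method.
    return [max(-1.0, min(1.0, s)) for s in mixed]

mix = superimpose([0.1, 0.2, 0.3], [0.5, 0.9], offset=2)
```

Because plain addition can exceed the representable range, some form of headroom management is usually needed, which is one reason the subsequent volume balancing step matters.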
[0148] 4042. Perform volume balancing, rhythm adjustment, and loss compensation operations on the superimposed audio to obtain the target song; among them, the volume balancing operation is used to adjust the volume of different parts of the audio; the rhythm adjustment operation is used to adjust the audio rhythm based on the user's rhythm requirements; and the loss compensation operation is used to repair the sound quality.
[0149] In the embodiments of this application, the synthesis module 23 in Figure 2 also provides volume balancing, rhythm adjustment, and loss compensation functions to ensure good consistency and coherence in the final generated song. These functions are described in detail below.
[0150] Volume balancing is the process of adjusting the volume of different parts of an audio file to ensure that the overall volume of the song sounds balanced and consistent. In other words, the main purpose of volume balancing is to prevent certain parts of the audio from being too loud or too soft, which would result in an unbalanced sound in the final song. For example, audio post-processing tools can be used to automatically or semi-automatically adjust the volume of different parts of the audio to ensure that the volume remains balanced throughout the song.
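One simple way to automate volume balancing is to bring each segment to a common RMS level, as sketched below. The segment length, target level, and gain cap are illustrative assumptions rather than values from this application.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def balance_segments(samples, segment_len, target_rms=0.1, max_gain=4.0):
    """Scale each segment toward target_rms; silent segments are left as-is."""
    out = []
    for start in range(0, len(samples), segment_len):
        seg = samples[start:start + segment_len]
        level = rms(seg)
        gain = 1.0 if level == 0 else min(max_gain, target_rms / level)
        out.extend(s * gain for s in seg)
    return out

# A loud segment followed by a quiet one.
track = [0.4, -0.4, 0.4, -0.4, 0.05, -0.05, 0.05, -0.05]
balanced = balance_segments(track, segment_len=4)
```

Applying a different gain to each segment can produce audible steps at segment boundaries; a practical tool would smooth the gain over time.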
[0151] For example, rhythm adjustment can be performed using timeline adjustment techniques or rhythm matching techniques. For timeline adjustment, the generated song typically has a fixed timeline, but may need to be adjusted according to specific rhythm and duration requirements. By adjusting the timing points, beats, or measures on the timeline, it can be ensured that the generated song conforms to the desired rhythm. Rhythm matching is a technique that matches generated notes to a desired rhythmic pattern. For example, rhythm matching achieves this by identifying the start time, duration, and intensity of notes and adjusting them to match the desired rhythm.
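Rhythm matching can be illustrated by snapping note start times to a rhythmic grid. The `(onset, duration, intensity)` note layout and the sixteenth-note grid of 0.25 beats are assumptions made for this sketch.

```python
def quantize_onsets(notes, grid=0.25):
    """Snap each note's start time (in beats) to the nearest grid line,
    leaving its duration and intensity unchanged."""
    return [(round(onset / grid) * grid, dur, vel)
            for onset, dur, vel in notes]

# Slightly off-grid notes, as a generation model might produce them.
loose = [(0.02, 0.5, 80), (0.49, 0.5, 90), (1.13, 0.25, 70)]
tight = quantize_onsets(loose)
```

The same approach extends to durations and intensities by quantizing or rescaling those fields against the desired pattern.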
[0152] Loss compensation refers to the process of repairing or compensating for potential audio quality loss during audio generation. For example, loss compensation includes, but is not limited to: noise reduction and dereverberation, equalizer adjustment, compression and limiting, audio restoration, and dynamic processing.
[0153] Regarding noise reduction and dereverberation, noise or reverberation may be introduced during music generation, and noise reduction and dereverberation can help eliminate this unwanted interference to ensure the sound quality of the music. In other words, they reduce noise and reverberation effects in the audio, making the music sound cleaner and more transparent.
[0154] Equalizer adjustments involve using an equalizer to adjust the frequency response of audio to enhance or reduce the sound in a specific frequency range, which helps improve sound quality.
[0155] Compression and limiting techniques can be used to adjust the dynamic range of audio to avoid drastic volume differences between different parts of the audio. This helps to achieve volume balance in music, preventing certain parts of the audio from being too loud or too soft.
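The effect of compression and limiting on dynamic range can be seen in the simplified sketch below. It works sample by sample with no attack or release smoothing, and the threshold, ratio, and ceiling values are assumptions chosen for illustration.

```python
import math

def compress(samples, threshold=0.5, ratio=4.0, ceiling=0.9):
    """Static compressor followed by a hard limiter.

    Magnitudes above `threshold` are reduced by `ratio`; nothing is allowed
    to exceed `ceiling`. The sign of each sample is preserved.
    """
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio  # compression
        mag = min(mag, ceiling)                          # limiting
        out.append(math.copysign(mag, s))
    return out

squashed = compress([0.2, -0.9, 1.0])
```

Quiet samples pass through unchanged while loud ones are pulled toward the threshold, which narrows the gap between the softest and loudest parts.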
[0156] For audio restoration, various distortions or sound loss may be introduced during the audio generation process, such as popping sounds and noise gaps. Audio restoration technology can help repair these problems to maintain the integrity of the music. This helps to achieve musical continuity and avoid audio interruptions or noise interference.
[0157] For dynamic processing, the dynamic range of the audio can be balanced, and the loudness of different parts of the music can be kept consistent. This makes the music sound more balanced and coherent.
[0158] In conclusion, loss compensation techniques can improve the sound quality of generated songs. By denoising and dereverberation, adjusting the frequency response and dynamic range of the audio, audio restoration, and dynamic processing, the generated songs can sound clearer, more balanced, and more coherent and consistent, thereby improving the sound quality and listenability of the generated songs.
[0159] The song generation scheme provided in this application embodiment can generate songs with specific vocals according to user needs, significantly improving song quality.
[0160] In detail, this solution utilizes a song generation model trained on a large amount of music data to generate high-quality songs. Furthermore, a voice imitation model trained on a large amount of human voice data accurately mimics specific human voices, exhibiting high audibility and natural expressiveness. Additionally, the model training process requires no manual data annotation, making it more practical. Moreover, the absence of manual data annotation allows for training on large-scale datasets, significantly reducing implementation costs. Furthermore, by performing operations such as volume balancing, rhythm adjustment, and loss compensation, the generated songs ensure good consistency and coherence, achieving a level of quality approaching that of human composers, thus improving the sound quality. Finally, the song generation and voice imitation processes can be parallelized, enabling rapid song generation and voice imitation, thus improving song generation efficiency.
[0161] Figure 5 is a schematic diagram of the structure of a song generation device provided in an embodiment of this application. Referring to Figure 5, the device includes:
[0162] The acquisition unit 501 is configured to acquire a song generation request; wherein, the song generation request includes input song attribute information and reference sound samples for sound imitation; the song attribute information includes at least the basic components of song style and song emotional type;
[0163] The first generation unit 502 is configured to generate initial audio based on the song attribute information;
[0164] The second generation unit 503 is configured to call the encoder of the voice imitation model to map the reference sound sample into a latent vector; and to call the decoder of the voice imitation model to generate an imitation audio that matches the voice features of the human voice in the reference sound sample based on the latent vector.
[0165] The synthesis unit 504 is configured to synthesize the initial audio and the imitation audio to obtain the target song.
[0166] The song generation scheme provided in this application can generate songs with specific vocal features according to user needs. Specifically, the scheme first obtains a song generation request, which includes user-inputted song attribute information and reference sound samples for voice imitation. Then, the song is generated based on the song attribute information. Because user needs are considered during song generation, the generated song better matches user expectations, improving song quality. Furthermore, the scheme includes a voice imitation process, where the encoder of the voice imitation model maps the reference sound sample into a latent vector, and the decoder of the voice imitation model generates an imitation audio that matches the vocal features of the reference sound sample based on the latent vector. Finally, by synthesizing the initial audio and the imitation audio, a song with specific vocal features is generated, greatly enriching the song generation methods.
[0167] In one possible implementation, the first generation unit 502 is configured to invoke a song generation model to generate the initial audio based on the song attribute information;
[0168] The training process of the song generation model includes:
[0169] Obtain a music dataset and convert the music data included in the music dataset into a music sequence; wherein, a music sequence includes notes at multiple time steps;
[0170] The first deep learning model is trained based on the transformed music sequence to obtain the song generation model;
[0171] The training objective during model training is to maximize the first log-likelihood function of the notes generated by the model. The first log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements. The first i-1 elements and the i-th element come from the music sequence input to the model.
[0172] In one possible implementation, the song generation request further includes an input initial music sequence; the first generation unit 502 is configured to call the song generation model to generate the initial audio based on the initial music sequence under the constraints of the song attribute information.
[0173] In one possible implementation, the training process of the voice imitation model includes:
[0174] Obtain an unlabeled human voice dataset;
[0175] Preprocess the voice data included in the voice dataset to obtain voice samples;
[0176] The second deep learning model is trained based on the obtained human voice samples to obtain the voice imitation model;
[0177] The training objective during model training is to maximize the second log-likelihood function of the human voice samples generated by the model and minimize the distance between the first latent vector and the second latent vector.
[0178] The second log-likelihood function is used to represent the probability that the model generates the i-th element given the first i-1 elements and the third latent vector; the first i-1 elements and the i-th element are derived from human voice samples input to the model;
[0179] The first latent vector is the latent vector of the human voice sample generated by the model; the second latent vector is the latent vector of the human voice sample input to the model; and the third latent vector is sampled from the probability distribution of the second latent vector.
[0180] In one possible implementation, the synthesis unit 504 is configured as follows:
[0181] Superimpose the initial audio and the imitation audio to obtain the superimposed audio;
[0182] Perform volume balancing, rhythm adjustment, and loss compensation operations on the superimposed audio to obtain the target song;
[0183] The volume balancing operation is used to adjust the volume of different parts of the audio; the rhythm adjustment operation is used to adjust the audio rhythm based on the user's rhythm requirements; and the loss compensation operation is used to repair the sound quality.
[0184] In one possible implementation, the synthesis unit 504 is configured as follows:
[0185] After the initial audio and the imitation audio are aligned in time, the sample values of the waveform corresponding to the initial audio and the sample values of the waveform corresponding to the imitation audio are added together to obtain the superimposed audio.
[0186] In one possible implementation, the acquisition unit 501 is configured as follows:
[0187] Display the song settings interface; wherein, the song settings interface includes a song attribute information setting control and a reference sound sample upload control;
[0188] Obtain the input song attribute information based on the song attribute information setting control;
[0189] Based on the reference sound sample upload control, obtain the uploaded reference sound sample;
[0190] Based on the input song attribute information and the uploaded reference sound sample, the song generation request is generated.
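The assembly of a song generation request from the two inputs could be modelled as in the sketch below. All class, field, and function names are hypothetical; the application states only that the request carries song attribute information (at least style and emotional type) and a reference sound sample.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SongAttributes:
    style: str     # e.g. the song style chosen in the setting control
    emotion: str   # e.g. the song emotional type chosen in the setting control

@dataclass(frozen=True)
class SongGenerationRequest:
    attributes: SongAttributes
    reference_sample: bytes   # raw bytes of the uploaded reference sound sample

def build_request(style, emotion, reference_sample):
    """Validate the two inputs and bundle them into a request."""
    if not style or not emotion:
        raise ValueError("song style and emotional type are both required")
    if not reference_sample:
        raise ValueError("a reference sound sample is required")
    return SongGenerationRequest(SongAttributes(style, emotion), reference_sample)

request = build_request("pop", "joyful", b"\x00\x01")
```

Validating at construction time means later stages can rely on both the attributes and the reference sample being present.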
[0191] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.
[0192] It should be noted that the song generation device provided in the above embodiments is only illustrated by the division of the above functional modules. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the song generation device and the song generation method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.
[0193] Figure 6 is a schematic diagram of the structure of a computer device 600 provided in an embodiment of this application.
[0194] Typically, computer device 600 includes a processor 601 and a memory 602.
[0195] Processor 601 includes one or more processing cores, such as a quad-core processor or an octa-core processor. Processor 601 is implemented using at least one hardware form selected from DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). Alternatively, processor 601 includes a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In one possible implementation, processor 601 integrates a GPU (Graphics Processing Unit) for rendering and drawing content required for the display screen. In another possible implementation, processor 601 also includes an AI (Artificial Intelligence) processor for handling computational operations related to machine learning.
[0196] Memory 602 includes one or more computer-readable storage media that are non-transitory. Memory 602 also includes high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In one possible implementation, the non-transitory computer-readable storage media in memory 602 is used to store at least one program code for execution by processor 601 to implement the song generation method provided in the method embodiments of this application.
[0197] In one possible implementation, the computer device 600 further includes a peripheral device interface 603 and at least one peripheral device. The processor 601, memory 602, and peripheral device interface 603 are connected via a bus or signal line. Each peripheral device is connected to the peripheral device interface 603 via a bus, signal line, or circuit board. The peripheral device includes at least one of the following: a radio frequency circuit 604, a display screen 605, a camera assembly 606, an audio circuit 607, a positioning assembly 608, and a power supply 609.
[0198] Peripheral interface 603 is used to connect at least one I / O (Input / Output) related peripheral device to processor 601 and memory 602. In one possible implementation, processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in another possible implementation, any one or two of processor 601, memory 602, and peripheral interface 603 are implemented on separate chips or circuit boards, which is not limited in this application.
[0199] The radio frequency (RF) circuit 604 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The RF circuit 604 communicates with communication networks and other communication devices via electromagnetic signals. The RF circuit 604 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. In one possible implementation, the RF circuit 604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, etc. The RF circuit 604 communicates with other terminals via at least one wireless communication protocol. This wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and / or WiFi (Wireless Fidelity) networks. In one possible implementation, the RF circuit 604 also includes circuitry related to NFC (Near Field Communication), which is not limited in this application.
[0200] Display screen 605 is used to display a UI (User Interface). This UI includes graphics, text, icons, videos, and any combination thereof. If display screen 605 is a touch display, it also has the ability to collect touch signals on or above its surface. These touch signals are input as control signals to processor 601 for processing. Display screen 605 also provides virtual buttons and / or a virtual keyboard, also known as soft buttons and / or a soft keyboard. In one possible implementation, there is one display screen 605, located on the front panel of computer device 600; in another possible implementation, there are at least two display screens 605, respectively located on different surfaces of computer device 600 or in a folded design; in yet another possible implementation, display screen 605 is a flexible display screen, located on a curved or folded surface of computer device 600. Alternatively, display screen 605 may be configured as a non-rectangular, irregular shape, i.e., a non-rectangular screen. Display screen 605 is made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
[0201] Camera assembly 606 is used to acquire images or videos. In one possible implementation, camera assembly 606 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal. In one possible implementation, there are at least two rear-facing cameras, which can be any one of a main camera, a depth-sensing camera, a wide-angle camera, or a telephoto camera, to achieve background blurring by fusion of the main camera and the depth-sensing camera, panoramic shooting by fusion of the main camera and the wide-angle camera, VR (Virtual Reality) shooting, or other fusion shooting functions. In another possible implementation, camera assembly 606 also includes a flash. The flash is a single-color temperature flash or a dual-color temperature flash. A dual-color temperature flash refers to a combination of a warm-light flash and a cool-light flash, used for light compensation at different color temperatures.
[0202] The audio circuit 607 includes a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, converting them into electrical signals that are input to the processor 601 for processing, or to the radio frequency circuit 604 for voice communication. For stereo sound acquisition or noise reduction purposes, multiple microphones are used, each located in a different part of the computer device 600. Alternatively, the microphones may be array microphones or omnidirectional microphones. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into audible sound waves but also into inaudible sound waves for purposes such as distance measurement. In one possible implementation, the audio circuit 607 also includes a headphone jack.
[0203] Positioning component 608 is used to locate the current geographic location of computer device 600 in order to enable navigation or LBS (Location Based Service). Positioning component 608 can be a positioning component based on the US GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
[0204] Power supply 609 is used to supply power to the various components in computer device 600. Power supply 609 can be alternating current, direct current, a disposable battery, or a rechargeable battery. If power supply 609 includes a rechargeable battery, the rechargeable battery can be a wired or wirelessly rechargeable battery. A wired rechargeable battery is charged via a wired connection, while a wirelessly rechargeable battery is charged via a wireless coil. The rechargeable battery also supports fast charging technology.
[0205] In one possible implementation, the computer device 600 further includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: an accelerometer 611, a gyroscope 612, a pressure sensor 613, a fingerprint sensor 614, an optical sensor 615, and a proximity sensor 616.
[0206] Accelerometer 611 detects the magnitude of acceleration along the three coordinate axes of a coordinate system established by computer device 600. For example, accelerometer 611 is used to detect the components of gravitational acceleration along the three coordinate axes. Processor 601 controls display screen 605 to display the user interface in either a landscape or portrait view based on the gravitational acceleration signal acquired by accelerometer 611. Accelerometer 611 is also used for collecting motion data from games or users.
[0207] The gyroscope sensor 612 detects the orientation and rotation angle of the computer device 600. The gyroscope sensor 612 and the accelerometer sensor 611 work together to acquire the user's 3D movements on the computer device 600. Based on the data acquired by the gyroscope sensor 612, the processor 601 performs the following functions: motion sensing (e.g., changing the UI based on the user's tilt), image stabilization during shooting, game control, and inertial navigation.
[0208] A pressure sensor 613 is disposed on the side bezel of the computer device 600 and / or on the lower layer of the display screen 605. When the pressure sensor 613 is disposed on the side bezel of the computer device 600, it detects the user's grip signal on the computer device 600, and the processor 601 performs left / right hand recognition or quick operation based on the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed on the lower layer of the display screen 605, the processor 601 controls operable controls on the UI interface based on the user's pressure operation on the display screen 605. Operable controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
[0209] The fingerprint sensor 614 is used to collect a user's fingerprint. The processor 601 identifies the user based on the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 itself identifies the user based on the collected fingerprint. When the user's identity is identified as trusted, the processor 601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 614 is located on the front, back, or side of the computer device 600. If the computer device 600 has physical buttons or a manufacturer's logo, the fingerprint sensor 614 is integrated with the physical buttons or manufacturer's logo.
[0210] An optical sensor 615 is used to collect ambient light intensity. In one possible implementation, the processor 601 controls the display brightness of the display screen 605 based on the ambient light intensity collected by the optical sensor 615. When the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is decreased. In another possible implementation, the processor 601 also dynamically adjusts the shooting parameters of the camera assembly 606 based on the ambient light intensity collected by the optical sensor 615.
[0211] A proximity sensor 616, also known as a distance sensor, is typically mounted on the front panel of the computer device 600. The proximity sensor 616 is used to detect the distance between the user and the front of the computer device 600. In one possible implementation, when the proximity sensor 616 detects that the distance between the user and the front of the computer device 600 is gradually decreasing, the processor 601 controls the display screen 605 to switch from a screen-on state to a screen-off state; conversely, when the proximity sensor 616 detects that the distance between the user and the front of the computer device 600 is gradually increasing, the processor 601 controls the display screen 605 to switch from a screen-off state to a screen-on state.
[0212] Those skilled in the art will understand that the structure shown in Figure 6 does not constitute a limitation on the computer device 600, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
[0213] Figure 7 is a schematic diagram of the structure of another computer device 700 provided in an embodiment of this application.
[0214] The computer device 700 can be a server. The computer device 700 can vary significantly due to differences in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 701 and one or more memories 702. The memories 702 store at least one piece of program code, which is loaded and executed by the processors 701 to implement the song generation methods provided in the various method embodiments described above. Of course, the computer device 700 may also have wired or wireless network interfaces, a keyboard, and input / output interfaces for input and output. The computer device 700 may also include other components for implementing device functions, which will not be elaborated upon here.
[0215] In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including program code that can be executed by a processor in a computer device to complete the song generation method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), magnetic tape, floppy disk, and optical data storage device, etc.
[0216] In an exemplary embodiment, a computer program product or computer program is also provided, which includes computer program code stored in a computer-readable storage medium. A processor of a computer device reads the computer program code from the computer-readable storage medium and executes the computer program code, causing the computer device to perform the song generation method described above.
[0217] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0218] The above description is merely an optional embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A song generation method, characterized in that the method comprises: obtaining a song generation request, wherein the song generation request includes input song attribute information and a reference sound sample for sound imitation, and the song attribute information includes at least the basic components of song style and song emotional type; generating initial audio based on the song attribute information; invoking the encoder of a voice imitation model to map the reference sound sample into a latent vector, and invoking the decoder of the voice imitation model to generate, based on the latent vector, imitation audio that matches the voice features of the human voice in the reference sound sample; and synthesizing the initial audio and the imitation audio to obtain a target song; wherein the training process of the voice imitation model includes: obtaining an unlabeled human voice dataset; preprocessing the human voice data included in the human voice dataset to obtain human voice samples; and training a second deep learning model based on the obtained human voice samples to obtain the voice imitation model; wherein the training objective during model training is to maximize a second log-likelihood function of the human voice samples generated by the model and to minimize the distance between a first latent vector and a second latent vector; the second log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements and a third latent vector; the first i-1 elements and the i-th element come from the human voice samples input to the model; the first latent vector is the latent vector of the human voice sample generated by the model; the second latent vector is the latent vector of the human voice sample input to the model; and the third latent vector is sampled from the probability distribution of the second latent vector.
2. The method according to claim 1, characterized in that the step of generating initial audio based on the song attribute information includes: invoking a song generation model to generate the initial audio based on the song attribute information; wherein the training process of the song generation model includes: obtaining a music dataset and converting the music data included in the music dataset into a music sequence, wherein a music sequence includes notes at multiple time steps; and training a first deep learning model based on the converted music sequence to obtain the song generation model; wherein the training objective during model training is to maximize a first log-likelihood function of the notes generated by the model; the first log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements; and the first i-1 elements and the i-th element come from the music sequence input to the model.
3. The method according to claim 2, characterized in that the song generation request further includes an input initial music sequence, and the step of invoking the song generation model to generate the initial audio based on the song attribute information includes: invoking the song generation model to generate the initial audio based on the initial music sequence, under the constraints of the song attribute information.
4. The method according to claim 1, characterized in that the step of synthesizing the initial audio and the imitation audio to obtain the target song includes: superimposing the initial audio and the imitation audio to obtain superimposed audio; and performing volume balancing, rhythm adjustment, and loss compensation operations on the superimposed audio to obtain the target song; wherein the volume balancing operation is used to adjust the volume of different parts of the audio, the rhythm adjustment operation is used to adjust the audio rhythm based on the user's rhythm requirements, and the loss compensation operation is used to repair the sound quality.
5. The method according to claim 4, characterized in that the step of superimposing the initial audio and the imitation audio to obtain the superimposed audio includes: after aligning the initial audio and the imitation audio in time, adding the sample values of the waveform corresponding to the initial audio to the sample values of the waveform corresponding to the imitation audio to obtain the superimposed audio.
6. The method according to any one of claims 1 to 5, characterized in that the step of obtaining the song generation request includes: displaying a song settings interface, wherein the song settings interface includes a song attribute information setting control and a reference sound sample upload control; obtaining the input song attribute information based on the song attribute information setting control; obtaining the uploaded reference sound sample based on the reference sound sample upload control; and generating the song generation request based on the input song attribute information and the uploaded reference sound sample.
7. A song generation apparatus, characterized in that the apparatus comprises: an acquisition unit configured to obtain a song generation request, wherein the song generation request includes input song attribute information and a reference sound sample for sound imitation, and the song attribute information includes at least the basic components of song style and song emotional type; a first generation unit configured to generate initial audio based on the song attribute information; a second generation unit configured to invoke the encoder of a voice imitation model to map the reference sound sample into a latent vector, and to invoke the decoder of the voice imitation model to generate, based on the latent vector, imitation audio that matches the voice features of the human voice in the reference sound sample; and a synthesis unit configured to synthesize the initial audio and the imitation audio to obtain a target song; wherein the training process of the voice imitation model includes: obtaining an unlabeled human voice dataset; preprocessing the human voice data included in the human voice dataset to obtain human voice samples; and training a second deep learning model based on the obtained human voice samples to obtain the voice imitation model; wherein the training objective during model training is to maximize a second log-likelihood function of the human voice samples generated by the model and to minimize the distance between a first latent vector and a second latent vector; the second log-likelihood function represents the probability that the model generates the i-th element given the first i-1 elements and a third latent vector; the first i-1 elements and the i-th element come from the human voice samples input to the model; the first latent vector is the latent vector of the human voice sample generated by the model; the second latent vector is the latent vector of the human voice sample input to the model; and the third latent vector is sampled from the probability distribution of the second latent vector.
8. A computer device, characterized in that the device includes a processor and a memory, the memory storing at least one piece of program code, which is loaded and executed by the processor to implement the song generation method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the storage medium stores at least one piece of program code, which is loaded and executed by a processor to implement the song generation method according to any one of claims 1 to 6.
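The superposition step recited in claim 5 (time alignment followed by sample-wise addition of the two waveforms) can be illustrated with a minimal sketch. It is not the applicant's implementation. It assumes both signals are mono, share the same sample rate, and are given as plain sequences of float samples. The function name `superimpose` and the `offset` parameter (the delay, in samples, of the second waveform relative to the first) are hypothetical and do not appear in the application.

```python
def superimpose(initial, imitation, offset=0):
    """Add two mono waveforms sample by sample after aligning them in time.

    Assumptions (not taken from the application): both inputs use the same
    sample rate, and `offset` is the number of samples by which `imitation`
    starts later than `initial`.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")

    # The mix must be long enough to hold both aligned signals.
    length = max(len(initial), offset + len(imitation))
    mixed = [0.0] * length

    # Place the first waveform at the start of the timeline.
    for i, sample in enumerate(initial):
        mixed[i] += sample

    # Add the second waveform at its aligned position.
    for j, sample in enumerate(imitation):
        mixed[offset + j] += sample

    return mixed


if __name__ == "__main__":
    accompaniment = [1.0, 2.0, 3.0]   # stands in for the initial audio
    voice = [10.0, 20.0]              # stands in for the imitation audio
    print(superimpose(accompaniment, voice, offset=2))
```

Plain addition can push samples outside the valid amplitude range, which is one reason a later volume balancing operation, as in claim 4, is useful. That operation is outside the scope of this sketch.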