Music generation method, apparatus, and electronic device

The method acquires music generation instructions and lyrics text, generates creation parameters and a target music style skeleton from the music creation constraint information, and uses a mixing reference audio to generate natural, smooth, and stylistically consistent music, addressing the low accuracy of music generation in existing technologies.

CN122245260A · Pending · Publication Date: 2026-06-19 · HANGZHOU LIANYING NETWORK TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU LIANYING NETWORK TECHNOLOGY CO LTD
Filing Date
2026-03-26
Publication Date
2026-06-19

Abstract

This invention provides a music generation method, apparatus, and electronic device, relating to the field of music technology. The method includes: acquiring music generation instructions and lyrics text, the music generation instructions including music composition constraint information; generating composition parameters based on the music composition constraint information; generating a target music style skeleton based on the music composition constraint information and composition parameters, the target music style skeleton including track information of multiple instruments; generating a mixing reference audio based on the track information of multiple instruments; and generating target music based on the lyrics text, composition parameters, and mixing reference audio. This invention can generate mixing reference audio based on music generation instructions, and ultimately generate target music by fusing multi-source information such as lyrics text, composition parameters, and mixing reference audio. This results in target music with natural and smooth vocals and accompaniment, a unified style, and a clear musical structure, thereby improving the accuracy of music generation.

Description

Technical Field

[0001] This invention relates to the field of music technology, and more particularly to a music generation method, apparatus, and electronic device.

Background Technology

[0002] In recent years, artificial intelligence technology has provided new pathways for music generation. By analyzing massive amounts of music, models can automatically generate music fragments, bringing new possibilities to fields such as assisted composition, game sound effects, and personalized content.

[0003] In related technologies, music generation instructions are usually directly input into a large language model to obtain the music output by the large language model.

[0004] However, the aforementioned technologies generate music based solely on music generation instructions, which results in problems such as style drift, inconsistency between vocals and accompaniment, and misalignment of musical structure, thereby reducing the accuracy of music generation.

Summary of the Invention

[0005] This invention provides a music generation method, apparatus, and electronic device to address the shortcomings of existing technologies that reduce the accuracy of music generation.

[0006] This invention provides a music generation method, comprising the following steps.

[0007] Obtain music generation instructions and lyrics text, wherein the music generation instructions include music creation constraint information; generate creation parameters based on the music creation constraint information; generate a target music style skeleton based on the music creation constraint information and the creation parameters, the target music style skeleton including the track information of multiple instruments; generate a mixing reference audio based on the track information of the multiple instruments; and generate the target music based on the lyrics text, the creation parameters, and the mixing reference audio.

[0008] According to a music generation method provided by the present invention, the music creation constraint information includes at least one of the following: music style, emotional tendency, rhythm and speed, musical phrase structure, and language type.

[0009] According to a music generation method provided by the present invention, the step of generating composition parameters based on the music composition constraint information includes: The music composition constraint information is input into the large language model to obtain the composition parameters output by the large language model. The composition parameters include at least one of the following: tonality, meter, tempo, music style, musical phrase structure, language type, and vocal gender.

[0010] According to a music generation method provided by the present invention, the step of generating a target music style skeleton based on the music composition constraint information and the composition parameters includes: The music composition constraint information and the composition parameters are input into the music style skeleton generation model to obtain at least two candidate music style skeletons output by the music style skeleton generation model. The candidate music style skeleton selected by the user from the at least two candidate music style skeletons is determined as the target music style skeleton.

[0011] According to a music generation method provided by the present invention, the step of generating a mixing reference audio based on the track information of the plurality of instruments includes: decomposing the track information of the multiple instruments to obtain an independent instrument track for each instrument; synchronizing each independent instrument track based on the beat in the creation parameters to obtain synchronized instrument tracks; for each synchronized instrument track, assigning the synchronized instrument track to its corresponding target virtual instrument sound source, based on the correspondence between instrument tracks and virtual instrument sound sources, to obtain the separate-track audio corresponding to that synchronized instrument track; determining at least one separate-track audio edited by the user, from among all the separate-track audio, as the target separate-track audio; and mixing all the target separate-track audio using an automatic mixing engine to obtain the mixing reference audio. The mixing process includes at least one of the following: dynamic equalization, sound image localization, compression and limiting, and loudness normalization.

[0012] According to a music generation method provided by the present invention, generating target music based on the lyrics text, the composition parameters, and the mixing reference audio includes: The lyrics text, the creation parameters, and the mixing reference audio are input into the song generation model to obtain the target music output by the song generation model.

[0013] According to a music generation method provided by the present invention, the method further includes: The correspondence between target information and task identifiers in the entire music generation process is stored in the cloud server, and the task identifiers are added to the local history interface. The target information includes at least one of the following: music generation instructions, creation parameters, candidate music style skeletons, track audio, mixing reference audio, lyrics text, and target music.

[0014] According to a music generation method provided by the present invention, the method further includes: upon receiving a user's selection operation on the task identifier in the history interface, sending an information retrieval request to the cloud server, the information retrieval request including the task identifier; receiving the target information corresponding to the task identifier sent by the cloud server; loading the target information into the editing interface; receiving the user's modification operations on the target information in the editing interface to obtain modified target information; and generating new music based on the modified target information.

[0015] The present invention also provides a music generation apparatus, comprising: an acquisition unit, used to acquire music generation instructions and lyrics text, wherein the music generation instructions include music creation constraint information; a first generation unit, used to generate creation parameters based on the music creation constraint information; a second generation unit, used to generate a target music style skeleton based on the music creation constraint information and the creation parameters, the target music style skeleton including the track information of multiple instruments; a third generation unit, used to generate a mixing reference audio based on the track information of the multiple instruments; and a fourth generation unit, used to generate target music based on the lyrics text, the creation parameters, and the mixing reference audio.

[0016] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement any of the music generation methods described above.

[0017] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the music generation method as described above.

[0018] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements any of the music generation methods described above.

[0019] The music generation method, apparatus, and electronic device provided by this invention acquire music generation instructions and lyrics text. The music generation instructions include music composition constraint information. Based on the music composition constraint information, composition parameters are generated. Based on the music composition constraint information and composition parameters, a target music style skeleton including track information of multiple instruments is generated. Based on the track information of multiple instruments, a mixing reference audio is generated. Finally, based on the lyrics text, composition parameters, and mixing reference audio, the target music is generated. It can be seen that this invention can generate mixing reference audio based on music generation instructions, and finally generate target music by fusing multi-source information such as lyrics text, composition parameters, and mixing reference audio. This results in target music with natural and smooth vocals and accompaniment, unified style, and clear musical structure, thereby improving the accuracy of music generation.

Attached Figure Description

[0020] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0021] Figure 1 is the first flowchart of the music generation method provided in an embodiment of the present invention.

[0022] Figure 2 is the second flowchart of the music generation method provided in an embodiment of the present invention.

[0023] Figure 3 is the third flowchart of the music generation method provided in an embodiment of the present invention.

[0024] Figure 4 is the fourth flowchart of the music generation method provided in an embodiment of the present invention.

[0025] Figure 5 is a schematic diagram of the structure of the music generation device provided in an embodiment of the present invention.

[0026] Figure 6 is a schematic diagram of the physical structure of the electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0027] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0028] The music generation method of the present invention is described below with reference to Figures 1 to 4. The subject executing this music generation method can be an electronic device such as a terminal, tablet computer, computer, or server, or a music generation device installed in such an electronic device. This music generation device can be implemented through software, hardware, or a combination of both.

[0029] Figure 1 is the first flowchart of the music generation method provided in an embodiment of the present invention. As shown in Figure 1, the music generation method includes the following steps: Step 101: Obtain music generation instructions and lyrics text, wherein the music generation instructions include music creation constraint information.

[0030] For example, the system receives music generation instructions input by the user. These instructions can be natural language instructions, which express the user's music creation constraints. These constraints include at least one of the following: music style, mood, tempo, phrasing, and language type. For example, the music style is pop, the mood is upbeat, the tempo is slow, the phrasing is intro-verse-chorus, and the language is Chinese.

[0031] The lyrics text can be input directly by the user. Alternatively, with the user's authorization, the creation parameters generated from the music creation constraint information can be input into a large language model, which then automatically generates lyrics matching the musical mood, rhythmic structure, and duration, so that the generated lyrics stay consistent with the overall musical context in emotional tone and semantic content. Furthermore, to ensure temporal alignment and structural coordination between the lyrics and the subsequent music generation steps, paragraph tags can be added to the lyrics, for example to mark the verses and choruses. These paragraph tags provide temporal anchors for subsequent vocal synthesis and rhythmic alignment, thereby supporting the logical coherence and artistic expression of the complete song.
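As an illustration of how paragraph tags could act as structural anchors, the following is a minimal sketch. The bracketed tag syntax (for example `[verse]`, `[chorus]`) and the function name `split_tagged_lyrics` are assumptions made for this example; the source does not specify a tag format.

```python
import re
from typing import List, Tuple

# Assumed tag syntax: a line such as "[verse]" or "[chorus]" opens a section.
TAG_PATTERN = re.compile(r"^\[(\w+)\]$")


def split_tagged_lyrics(text: str) -> List[Tuple[str, List[str]]]:
    """Group lyric lines under the most recent paragraph tag."""
    sections: List[Tuple[str, List[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = TAG_PATTERN.match(line)
        if match:
            sections.append((match.group(1).lower(), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            # Lines that appear before any tag start an "untagged" section.
            sections.append(("untagged", [line]))
    return sections


example = """[verse]
Morning light on the river
Slow boats drifting by
[chorus]
Sing it again
"""
sections = split_tagged_lyrics(example)
```

Each resulting section name could then be matched against the musical phrase structure in the creation parameters.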

[0032] Step 102: Generate creation parameters based on the music creation constraint information.

[0033] Optionally, the music composition constraint information is input into a large language model to obtain the composition parameters output by the large language model. The composition parameters include at least one of the following: tonality, meter, tempo, music style, musical phrase structure, language type, and vocal gender.

[0034] For example, upon receiving user-inputted music composition constraints, this information is input into a large language model. The model performs semantic parsing and intent understanding, extracts explicit and implicit constraints, and ultimately outputs structured composition parameters, SongSpec. SongSpec can be represented in JSON data format (e.g., named SongSpec.json), including at least one of the following: key, time signature, tempo, musical style, musical phrase structure, language type, and vocal gender. This serves as a shared context for subsequent modules, ensuring consistency and controllability of the music's semantics throughout the entire process. For instance, the key might be C major, the time signature 3/4, the tempo slow, the musical style pop, the musical phrase structure intro-verse-chorus, the language Chinese, and the vocal gender male.
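A possible shape for SongSpec, using the example values given above, is sketched below. The field names (`key`, `time_signature`, and so on) are assumptions chosen for illustration; the source names the parameters but does not define a JSON schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SongSpec:
    # Field names are illustrative; the source does not fix a schema.
    key: str
    time_signature: str
    tempo: str
    style: str
    structure: List[str]
    language: str
    vocal_gender: str


spec = SongSpec(
    key="C major",
    time_signature="3/4",
    tempo="slow",
    style="pop",
    structure=["intro", "verse", "chorus"],
    language="Chinese",
    vocal_gender="male",
)

# Serialize to the JSON text that would be stored as SongSpec.json,
# then parse it back to confirm the round trip is lossless.
song_spec_json = json.dumps(asdict(spec), indent=2, ensure_ascii=False)
restored = SongSpec(**json.loads(song_spec_json))
```

Keeping the parameters in one typed record is one way to let later modules share the same context without re-parsing the user's instruction.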

[0035] It should be noted that the large language model can be any existing large model that can correctly perform semantic parsing and intent understanding; this invention does not limit it.

[0036] It should be noted that, for users with some music expertise, all generated creation parameters can be output, the user's selection among these parameters can be received, and the subsequent steps can be executed based on the parameters the user selected. This forms a more professional and stable control process that meets the creative needs of users with music expertise. This invention does not limit this aspect.

[0037] Step 103: Based on the music creation constraint information and the creation parameters, generate a target music style skeleton, which includes the track information of multiple instruments.

[0038] For example, based on the aforementioned music composition constraints (such as music style, emotional inclination, tempo, phrasing, and language type) and composition parameters (such as tonality, meter, tempo, music style, phrasing, language type, and vocal gender), a corresponding target music style skeleton can be generated. This target music style skeleton is presented as a multi-track arrangement, represented in MIDI format (e.g., the target music style skeleton is named Seed.mid), providing the basic framework for the music. It includes track information for multiple instruments, such as rhythm groups (drums, bass), harmony groups (piano, guitar), melody groups (lead instruments, vocal melody lines), and ambient background (synthesizers, strings). The instrument track information includes the track sequence and note events. It should be noted that the track information of the multiple instruments is a single, mixed body of track information covering all the instruments.

[0039] Step 104: Generate a mixing reference audio based on the track information of the multiple instruments.

[0040] For example, mixing reference audio can be generated based on track information from multiple instruments for mixing guidance.

[0041] Step 105: Generate the target music based on the lyrics text, the creation parameters, and the mixing reference audio.

[0042] For example, a complete target music can be generated based on the lyrics, composition parameters, and mixing reference audio. This process matches the rhythm, accents, and melody of the lyrics and generates a suitable vocal melody line based on composition parameters (such as tonality, time signature, tempo, musical style, phrasing, language type, and vocal gender). Finally, it is synthesized, aligned, and rendered with the pre-arranged multi-instrument accompaniment skeleton in the mixing reference audio, outputting a target music that includes vocals, harmony, rhythm, and atmosphere.

[0043] The music generation method provided by this invention obtains music generation instructions and lyrics text. The music generation instructions include music composition constraint information. Based on the music composition constraint information, composition parameters are generated. Based on the music composition constraint information and composition parameters, a target music style skeleton including track information of multiple instruments is generated. Based on the track information of multiple instruments, a mixing reference audio is generated. Finally, based on the lyrics text, composition parameters, and mixing reference audio, the target music is generated. It can be seen that this invention can generate mixing reference audio based on music generation instructions, and finally generate target music by fusing multi-source information such as lyrics text, composition parameters, and mixing reference audio. This results in target music with natural and smooth vocals and accompaniment, unified style, and clear musical structure, thereby improving the accuracy of music generation.
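The data flow of steps 102 to 105 can be sketched as a chain of four stages. Everything below is hypothetical scaffolding: the stage functions are placeholders standing in for the large language model, the skeleton generation model, the rendering and mixing chain, and the song generation model, and they only return labelled strings so the ordering can be observed.

```python
from typing import Callable, Dict, List

Params = Dict[str, str]


def generate_music(
    instruction: Params,
    lyrics: str,
    make_params: Callable[[Params], Params],
    make_skeleton: Callable[[Params, Params], List[str]],
    make_mix_reference: Callable[[List[str], Params], str],
    make_song: Callable[[str, Params, str], str],
) -> str:
    """Run the stages in the order described in steps 102 to 105."""
    params = make_params(instruction)                     # step 102
    skeleton = make_skeleton(instruction, params)         # step 103
    mix_reference = make_mix_reference(skeleton, params)  # step 104
    return make_song(lyrics, params, mix_reference)       # step 105


# Placeholder stages (hypothetical); each records that it was called.
trace: List[str] = []


def fake_params(instruction: Params) -> Params:
    trace.append("params")
    return {"tempo": instruction["tempo"], "key": "C major"}


def fake_skeleton(instruction: Params, params: Params) -> List[str]:
    trace.append("skeleton")
    return ["drums", "bass", "piano"]


def fake_mix_reference(skeleton: List[str], params: Params) -> str:
    trace.append("mix_reference")
    return "mix(" + "+".join(skeleton) + ")"


def fake_song(lyrics: str, params: Params, mix_reference: str) -> str:
    trace.append("song")
    return f"song[{params['key']}; {mix_reference}; {len(lyrics.split())} words]"


result = generate_music(
    {"style": "pop", "tempo": "slow"},
    "la la la",
    fake_params,
    fake_skeleton,
    fake_mix_reference,
    fake_song,
)
```

Passing the stages in as callables is a design choice of this sketch, made so that each model can be swapped without touching the orchestration.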

[0044] In one embodiment, step 103 above generates a target music style skeleton based on the music composition constraint information and the composition parameters, which can be implemented in the following ways: The music composition constraint information and the composition parameters are input into the music style skeleton generation model to obtain at least two candidate music style skeletons output by the music style skeleton generation model; the candidate music style skeleton selected by the user from the at least two candidate music style skeletons is determined as the target music style skeleton.

[0045] For example, based on music composition constraints and parameters, prompts are generated for a music style skeleton generation model. These prompts are then formatted and encapsulated according to the model's input protocol. The encapsulated prompts are input into the music style skeleton generation model, which can be an existing large language model specifically designed for symbolic music generation. The final output consists of multiple candidate music style skeletons with consistent style but different track information. Each candidate skeleton includes track information for multiple instruments and can be output in standard MIDI file (SMF) format. For instance, three candidate music style skeletons are output, one of which includes track information for piano, drums, and bass, with track information including instrument timing and note events. The resulting candidate music style skeletons are then output for the user to select from. The user-selected candidate music style skeleton that meets the requirements is designated as the target music style skeleton.

[0046] In this embodiment, music composition constraints and composition parameters are input into the music style skeleton generation model to obtain at least two candidate music style skeletons output by the music style skeleton generation model. The candidate music style skeleton selected by the user from the at least two candidate music style skeletons is determined as the target music style skeleton. This lets the user choose among the candidate music style skeletons, so the selected target music style skeleton is more in line with the user's needs. As a result, the target music generated based on the target music style skeleton better meets the user's needs, which further improves the accuracy of music generation.

[0047] In one embodiment, Figure 2 is the second flowchart of the music generation method provided in an embodiment of the present invention. As shown in Figure 2, step 104 above, generating a mixing reference audio based on the track information of the multiple instruments, can be achieved through the following steps: Step 201: Decompose the track information of the multiple musical instruments to obtain the independent instrument tracks of each instrument.

[0048] For example, decomposing the audio track information of multiple instruments refers to separating and extracting the audio track information of multiple instruments that are mixed together. The specific process is as follows: extracting information such as timestamps, pitches, and dynamics of each instrument, or using audio source separation technology to identify and lock the signal components corresponding to each instrument (such as piano, drum kit, bass, guitar, etc.), and then peeling off these signal components one by one to regenerate an independent audio track for each instrument that contains only its own performance information. In this way, we obtain independent audio tracks for each instrument that are clearly structured and do not interfere with each other.
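For the symbolic (note-event) branch of this decomposition, a minimal sketch is shown below. It assumes the mixed track information is available as a flat list of note events that each carry an instrument label; the `NoteEvent` record and its fields are hypothetical, and the audio source separation branch is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NoteEvent:
    # Hypothetical record; the source lists timestamps, pitches, and dynamics.
    instrument: str
    start_beat: float
    duration_beats: float
    pitch: int     # MIDI note number, 0-127
    velocity: int  # MIDI velocity, 0-127


def split_by_instrument(events: List[NoteEvent]) -> Dict[str, List[NoteEvent]]:
    """Group mixed note events into one time-ordered track per instrument."""
    tracks: Dict[str, List[NoteEvent]] = defaultdict(list)
    for event in events:
        tracks[event.instrument].append(event)
    return {
        name: sorted(notes, key=lambda n: (n.start_beat, n.pitch))
        for name, notes in tracks.items()
    }


mixed = [
    NoteEvent("piano", 1.0, 1.0, 64, 80),
    NoteEvent("bass", 0.0, 2.0, 36, 90),
    NoteEvent("piano", 0.0, 1.0, 60, 80),
    NoteEvent("drums", 0.5, 0.25, 38, 100),
]
tracks = split_by_instrument(mixed)
```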

[0049] Step 202: Based on the beat in the creation parameters, synchronize each of the independent instrument tracks to obtain synchronized instrument tracks.

[0050] For example, based on the beat in the creation parameters, the individual instrument tracks are synchronized separately. This means that the audio rendering module uses a unified rhythm and timing value as a reference to perform beat alignment, measure boundary correction, and length unification on the previously separated individual instrument tracks along the entire music timeline. This ensures that all instrument tracks are precisely synchronized on the timeline, ultimately resulting in synchronized instrument tracks. The synchronized instrument tracks are also output in MIDI format (for example, the synchronized instrument tracks are named aligned_all.mid). The audio rendering module here can be any existing module capable of synchronization processing; this invention will not elaborate further.
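One simple way to realise beat alignment and length unification is to snap note times to a common grid and round the total length up to a whole number of measures. The sketch below assumes notes are `(start_beat, duration_beats, pitch)` tuples and uses a sixteenth-note grid (0.25 beat); both choices are assumptions, not details from the source.

```python
import math
from typing import Dict, List, Tuple

Note = Tuple[float, float, int]  # (start_beat, duration_beats, pitch)


def quantize(value: float, grid: float) -> float:
    """Snap a beat position or duration to the nearest grid step."""
    return round(value / grid) * grid


def synchronize(
    tracks: Dict[str, List[Note]], beats_per_bar: int, grid: float = 0.25
) -> Tuple[Dict[str, List[Note]], int]:
    """Align every track to the grid and return a shared length in beats."""
    aligned: Dict[str, List[Note]] = {}
    latest_end = 0.0
    for name, notes in tracks.items():
        snapped: List[Note] = []
        for start, duration, pitch in notes:
            q_start = quantize(start, grid)
            q_duration = max(grid, quantize(duration, grid))  # never zero-length
            snapped.append((q_start, q_duration, pitch))
            latest_end = max(latest_end, q_start + q_duration)
        aligned[name] = snapped
    # Round the common length up to a measure boundary.
    total_bars = math.ceil(latest_end / beats_per_bar)
    return aligned, total_bars * beats_per_bar


loose_tracks = {
    "piano": [(0.02, 0.98, 60), (1.1, 0.9, 64)],
    "bass": [(0.0, 2.9, 36)],
}
aligned, length_in_beats = synchronize(loose_tracks, beats_per_bar=3)
```

With a 3/4 meter as in the earlier example, both tracks end up sharing a length of one full measure.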

[0051] Step 203: For each synchronized instrument track, based on the correspondence between the instrument track and the virtual instrument sound source, the synchronized instrument track is assigned to the corresponding target virtual instrument sound source to obtain the separate-track audio corresponding to the synchronized instrument track.

[0052] The correspondence between instrument tracks and virtual instrument sound sources can also be called sound source mapping rules.

[0053] For example, for each synchronized instrument track, the target virtual instrument sound source corresponding to that track is looked up in the pre-stored correspondence between instrument tracks and virtual instrument sound sources. The MIDI data of the synchronized instrument track is sent to the target virtual instrument sound source, which renders the corresponding audio waveform signal in real time. This yields the separate-track audio corresponding to the synchronized instrument track. Each separate-track audio can be independently rendered as a mono or stereo WAV format audio file.
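The sound source mapping rules can be pictured as a lookup table. In the sketch below the source names and the fallback entry are invented for illustration; the source does not name any particular virtual instruments or say what happens when a track has no mapping.

```python
from typing import Dict, List

# Hypothetical mapping rules from instrument track to virtual instrument source.
SOUND_SOURCE_MAP: Dict[str, str] = {
    "piano": "grand_piano_sampler",
    "drums": "acoustic_kit_sampler",
    "bass": "electric_bass_sampler",
}

# Assumption: unmapped tracks fall back to a generic source.
DEFAULT_SOURCE = "general_midi_fallback"


def assign_sources(track_names: List[str]) -> Dict[str, str]:
    """Pick the target virtual instrument source for each synchronized track."""
    return {name: SOUND_SOURCE_MAP.get(name, DEFAULT_SOURCE) for name in track_names}


assignment = assign_sources(["piano", "bass", "strings"])
```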

[0054] Step 204: Determine at least one separate-track audio edited by the user, from among all the separate-track audio, as the target separate-track audio.

[0055] For example, after the separate-track audio corresponding to every synchronized instrument track has been obtained, all the separate-track audio is output for the user to edit. The user can view, delete, select, edit, or download the displayed separate-track audio. The selected or edited separate-track audio is then determined as the target separate-track audio. There can be one or more target separate-track audio items, and multiple items form a set of target separate-track audio.

[0056] Step 205: Perform mixing processing on all the target audio tracks based on the automatic mixing engine to obtain the mixing reference audio. The mixing processing includes at least one of the following: dynamic equalization processing, sound image localization processing, compression and limiting processing, and loudness normalization processing.

[0057] For example, upon obtaining the target audio tracks, all target audio tracks are input into an automatic mixing engine. The automatic mixing engine performs mixing processing on all target audio tracks, specifically including dynamic equalization, sound image localization, compression and limiting, and loudness normalization. Dynamic equalization automatically adjusts the gain across multiple frequency bands based on the real-time frequency characteristics of the target audio tracks to optimize timbre clarity and avoid frequency masking. Sound image localization assigns precise left and right positions and widths to each target audio track in a stereo or surround sound field based on a preset sound field model, creating a distinct sense of space. Compression and limiting automatically control the dynamic range of each track, balancing volume fluctuations and ensuring peak levels do not exceed a set threshold, to increase overall loudness stability and impact. Loudness normalization adjusts the total output after integrating all tracks to a target loudness value that meets industry standards, ensuring a consistent and suitable listening volume across different playback devices. By performing the above series of mixing processes, the automatic mixing engine finally generates a high-fidelity, spectrally balanced, and dynamically coordinated mixing reference audio, which can be output in WAV format (for example, the mixing reference audio is named mix.wav).
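Three of these operations can be sketched on plain lists of samples. This is a simplified illustration under stated assumptions, not the automatic mixing engine itself: constant-power panning stands in for sound image localization, a hard clamp stands in for a limiter, and RMS level stands in for a standardized loudness measure. Dynamic equalization and compression are omitted, and the order of normalization and limiting here is also an assumption.

```python
import math
from typing import List, Tuple

Stereo = List[Tuple[float, float]]  # (left, right) sample pairs


def pan_mono(samples: List[float], pan: float) -> Stereo:
    """Constant-power pan; pan runs from -1.0 (left) to 1.0 (right)."""
    angle = (pan + 1.0) * math.pi / 4.0
    left_gain, right_gain = math.cos(angle), math.sin(angle)
    return [(s * left_gain, s * right_gain) for s in samples]


def sum_tracks(tracks: List[Stereo]) -> Stereo:
    """Add tracks sample by sample; shorter tracks are treated as silence."""
    length = max(len(t) for t in tracks)
    left = [0.0] * length
    right = [0.0] * length
    for track in tracks:
        for i, (l, r) in enumerate(track):
            left[i] += l
            right[i] += r
    return list(zip(left, right))


def rms(mix: Stereo) -> float:
    """Root-mean-square level over both channels."""
    total = sum(l * l + r * r for l, r in mix)
    return math.sqrt(total / (2 * len(mix)))


def normalize_rms(mix: Stereo, target_rms: float) -> Stereo:
    """Scale the mix so its RMS level equals target_rms."""
    current = rms(mix)
    if current == 0.0:
        return mix  # silence cannot be scaled to a target level
    gain = target_rms / current
    return [(l * gain, r * gain) for l, r in mix]


def hard_limit(mix: Stereo, ceiling: float) -> Stereo:
    """Clamp every sample to the range [-ceiling, ceiling]."""
    def clamp(x: float) -> float:
        return max(-ceiling, min(ceiling, x))
    return [(clamp(l), clamp(r)) for l, r in mix]


piano = pan_mono([0.5, -0.5, 0.5, -0.5], pan=-0.5)  # left of centre
bass = pan_mono([0.8, 0.8, -0.8, -0.8], pan=0.0)    # centre
mix = hard_limit(normalize_rms(sum_tracks([piano, bass]), 0.2), 0.9)
```

A production engine would normally measure loudness with a perceptual standard rather than raw RMS, and would use a look-ahead limiter instead of a clamp.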

[0058] In this embodiment, the track information of multiple instruments is sequentially decomposed, synchronized, rendered, and automatically mixed to obtain a mixing reference audio. This mixing reference audio is not only used for user auditory evaluation and creative decision-making, but also serves as a key auditory prior condition for the song synthesis model in subsequent stages, thereby improving the accuracy of music generation. In addition, the separate track audio corresponding to each synchronized instrument track obtained by this invention can be edited by the user, making the final generated target music more in line with user needs.

[0059] In one embodiment, step 105 above generates target music based on the lyrics text, the creation parameters, and the mixing reference audio, which can be implemented in the following ways: The lyrics text, the creation parameters, and the mixing reference audio are input into the song generation model to obtain the target music output by the song generation model.

[0060] The song generation model can adopt the open-source model SongGeneration based on the LeVo architecture. This invention does not limit this. The lyrics text can include paragraph tags, such as marking the verses, choruses, etc.

[0061] For example, inputting lyrics, composition parameters, and mixing reference audio as multimodal collaborative conditions into the song generation model is the final synthesis step in the music generation process. The lyrics provide the song generation model with precise temporal structure and semantic context, ensuring that vocal rhythm, accent distribution, and emotional expression conform to the logic of musical segments. The composition parameters serve as a global control signal, constraining key attributes such as overall musical style, language type, and vocal gender. The mixing reference audio provides auditory priors for the accompaniment, guiding the generated results to maintain consistency in dynamic range, spatiality, and spectral characteristics. The song generation model, based on algorithms such as deep learning and generative adversarial networks, jointly models this information in a unified latent space, deconstructing and reconstructing melody, harmony, rhythm, and acoustic features to generate melodic lines matching the lyrics' rhythm. It then synthesizes separate audio tracks with corresponding spatiality, dynamics, and equalization effects according to the mixing features of the mixing reference audio. Finally, it synthesizes a complete finished song audio containing natural vocals and blended accompaniment, i.e., the generated target music, in which vocal timbre, singing style, and overall musical atmosphere maintain a high degree of consistency. The final generated target music audio can support file formats such as WAV and MP3. For example, the target music can be named full_song.mp3.

[0062] In one embodiment, Figure 3 is the third flowchart of the music generation method provided in an embodiment of the present invention. As shown in Figure 3, after step 105 above, the music generation method further includes the following steps: Step 106: Store the correspondence between target information and task identifiers in the entire music generation process to the cloud server, and add the task identifiers to the local history interface. The target information includes at least one of the following: music generation instructions, creation parameters, candidate music style skeletons, track audio, mixing reference audio, lyrics text, and target music.

[0063] For example, a unique task identifier is generated for this music generation task, and a corresponding relationship is established between the task identifier and the target information generated throughout the entire music generation task process. This relationship is stored in the distributed object storage system of the cloud server. The target information covers some or all of the key data from the initial input to the final output, including at least one of the following: music generation instructions, creation parameters, candidate music style skeletons, track audio, mixing reference audio, lyrics text, and target music. At the same time, the task identifier is also added to the user's local history interface, usually presented in the form of an interactive entry, so that the user can conveniently trace and manage the task history locally, and rely on cloud storage to achieve complete traceability of the creation process and persistent management of task data.
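The bookkeeping in this step can be sketched with an in-memory stand-in for the cloud server. The class `InMemoryTaskStore`, its methods, and the use of a random UUID as the task identifier are assumptions for illustration; the source only requires that the identifier be unique and that the correspondence be stored remotely.

```python
import uuid
from typing import Any, Dict, List


class InMemoryTaskStore:
    """Stand-in for the cloud server's object storage (assumption: a dict)."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def save(self, target_info: Dict[str, Any]) -> str:
        """Store target information under a newly generated task identifier."""
        task_id = uuid.uuid4().hex
        self._records[task_id] = dict(target_info)
        return task_id

    def load(self, task_id: str) -> Dict[str, Any]:
        """Return a copy of the target information for a task identifier."""
        return dict(self._records[task_id])


store = InMemoryTaskStore()
local_history: List[str] = []  # identifiers shown in the local history interface

task_id = store.save(
    {
        "creation_parameters": "SongSpec.json",
        "mixing_reference_audio": "mix.wav",
        "target_music": "full_song.mp3",
    }
)
local_history.append(task_id)
```

Only the identifier is kept locally in this sketch; the artifacts themselves live in the store, which mirrors the split between the history interface and cloud storage.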

[0064] In this embodiment, the correspondence between the target information and the task identifier throughout the music generation process is stored on a cloud server. Relying on cloud storage enables complete traceability of the creation process and persistent management of task data, and solves the problem of data loss caused by local caching limitations.

[0065] In one embodiment, Figure 4 is the fourth flowchart of the music generation method provided in this embodiment of the invention. After step 106 above, the music generation method further includes the following steps: Step 401: Upon receiving a user's selection operation on the task identifier in the history interface, send an information retrieval request to the cloud server, the information retrieval request including the task identifier.

[0066] For example, when it is detected that a user has performed a selection operation (such as clicking or touching) on the task identifier corresponding to a music generation task in the history interface, a remote data retrieval process is triggered based on this selection operation: a structured information retrieval request, which includes the task identifier selected by the user, is automatically sent to the cloud server.

[0067] Step 402: Receive the target information corresponding to the task identifier sent by the cloud server.

[0068] For example, after receiving an information retrieval request, the cloud server retrieves the corresponding target information from its stored correspondence between task identifiers and target information, based on the task identifier carried in the information retrieval request. The target information covers the entire chain of information for this music generation task: music generation instructions, creation parameters, candidate music style skeletons, track audio, mixing reference audio, lyrics text, and target music. The cloud server then securely returns this target information to the local electronic device.

[0069] Step 403: Load the target information into the editing interface.

[0070] For example, upon receiving the target information corresponding to the task identifier, the target information is loaded into the editing interface for the user to edit.
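Steps 401 to 403 amount to a request and response exchange keyed by the task identifier. The sketch below assumes a JSON message format; the field names `type`, `task_id`, `status`, and `target_info` are hypothetical, and the cloud server is simulated by a dictionary rather than a real network service.

```python
import json


def build_retrieval_request(task_id: str) -> str:
    """Step 401: build the information retrieval request for a task."""
    return json.dumps({"type": "information_retrieval", "task_id": task_id})


def handle_retrieval_request(raw_request: str, records: dict) -> str:
    """Server side: look up target information by task identifier."""
    task_id = json.loads(raw_request)["task_id"]
    if task_id not in records:
        return json.dumps({"status": "not_found", "task_id": task_id})
    return json.dumps(
        {"status": "ok", "task_id": task_id, "target_info": records[task_id]}
    )


def load_into_editor(raw_response: str) -> dict:
    """Steps 402 and 403: receive the response and return an editable copy."""
    response = json.loads(raw_response)
    if response["status"] != "ok":
        raise KeyError(response["task_id"])
    return dict(response["target_info"])
```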

[0071] Step 404: Receive the user's modification operation on the target information in the editing interface, and obtain the modified target information.

[0072] For example, users can modify the displayed target information in the editing interface based on their secondary creative needs to obtain the modified target information. For instance, users can modify the creative parameters included in the target information while keeping other information unchanged.

[0073] Step 405: Generate new music based on the modified target information.

[0074] For example, if the modified target information includes music generation instructions, lyrics text, and modified creation parameters, then a new target music style skeleton is regenerated based on the music creation constraint information and the modified creation parameters. A new mixing reference audio is generated based on the new target music style skeleton. Finally, a new music is generated based on the lyrics text, the modified creation parameters, and the new mixing reference audio, without having to repeat the step of generating creation parameters based on the music creation constraint information, thus completing the secondary creation.

[0075] If the modified target information includes music generation instructions, lyrics text, creation parameters, and user-modified mixing reference audio, then new music can be generated based on the lyrics text, creation parameters, and user-modified mixing reference audio. It is not necessary to re-execute the steps of generating creation parameters based on music creation constraint information, nor is it necessary to re-execute the steps of generating the target music style skeleton based on music creation constraint information and creation parameters, nor is it necessary to re-execute the steps of generating mixing reference audio based on the track information of multiple instruments. The user only needs to fine-tune some information in the target information to complete the secondary creation.
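The two cases above follow one rule: a step is re-executed only if one of its inputs has changed, and an artifact the user edited directly is kept rather than regenerated. A minimal sketch of that rule is given below. The stage and field names are hypothetical labels for the steps of the method, not identifiers defined by this embodiment.

```python
# Each stage: (output artifact, set of inputs it depends on), in pipeline order.
STAGES = [
    ("composition_params", {"constraints"}),
    ("style_skeleton", {"constraints", "composition_params"}),
    ("mixing_reference_audio", {"style_skeleton"}),
    ("target_music", {"lyrics_text", "composition_params",
                      "mixing_reference_audio"}),
]


def stages_to_rerun(modified: set) -> list:
    """Return the outputs that must be regenerated after the user's edits."""
    dirty = set(modified)
    rerun = []
    for output, inputs in STAGES:
        if output in modified:
            continue  # user edited this artifact directly; keep the edit
        if inputs & dirty:
            rerun.append(output)
            dirty.add(output)  # downstream stages now see a changed input
    return rerun
```

With modified creation parameters, `stages_to_rerun({"composition_params"})` returns the skeleton, mixing reference audio, and target music stages, matching paragraph [0074]. With a user-modified mixing reference audio, only `target_music` is returned, matching paragraph [0075].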

[0076] It should be noted that users can also select multiple task identifiers simultaneously in the history interface to load the target information corresponding to each of the multiple task identifiers in the editing interface. This makes it easier for users to compare information from different versions and make targeted modifications, and then create secondary works based on the modified information. This invention does not limit this.
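Comparing versions loaded from several task identifiers can be illustrated with a simple field-by-field difference. This is a sketch under the assumption that each version's target information is available as a dictionary; the function name `diff_target_info` is hypothetical.

```python
def diff_target_info(version_a: dict, version_b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields that differ.

    A field missing from one version is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(version_a) | set(version_b)):
        if version_a.get(key) != version_b.get(key):
            changes[key] = (version_a.get(key), version_b.get(key))
    return changes
```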

[0077] In this embodiment, the target information corresponding to the task identifier can be loaded into the editing interface, allowing users to modify the content of the target information in the editing interface, and then generate new music based on the modified target information to complete secondary creation. This achieves traceability, reusability, and collaborative friendliness of the creation process, truly opening up the complete creative chain from creative expression to professional production, serving the broad needs of music lovers and professional producers.

[0078] The music generation apparatus provided by the present invention will be described below. The music generation apparatus described below can be referred to in correspondence with the music generation method described above.

[0079] Figure 5 is a schematic diagram of the structure of the music generation device provided in an embodiment of the present invention. As shown in Figure 5, the music generation device 500 includes an acquisition unit 501, a first generation unit 502, a second generation unit 503, a third generation unit 504, and a fourth generation unit 505, wherein: the acquisition unit 501 is used to acquire music generation instructions and lyrics text, wherein the music generation instructions include music creation constraint information; the first generation unit 502 is used to generate creation parameters based on the music creation constraint information; the second generation unit 503 is used to generate a target music style skeleton based on the music creation constraint information and the creation parameters, the target music style skeleton including track information of multiple instruments; the third generation unit 504 is used to generate a mixing reference audio based on the track information of the multiple instruments; and the fourth generation unit 505 is used to generate target music based on the lyrics text, the creation parameters, and the mixing reference audio.

[0080] The music generation apparatus provided by this invention acquires music generation instructions and lyrics text. The music generation instructions include music composition constraint information. Based on the music composition constraint information, composition parameters are generated. Based on the music composition constraint information and composition parameters, a target music style skeleton including track information of multiple instruments is generated. Based on the track information of multiple instruments, a mixing reference audio is generated. Finally, based on the lyrics text, composition parameters, and mixing reference audio, the target music is generated. It can be seen that this invention can generate mixing reference audio based on music generation instructions, and finally generate target music by fusing multi-source information such as lyrics text, composition parameters, and mixing reference audio. This results in target music with natural and smooth vocals and accompaniment, unified style, and clear musical structure, thereby improving the accuracy of music generation.
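The data flow between the units can be sketched as a single function that chains four injected callables. This is only an illustration of the order of operations: the function name, the parameter names, and the dictionary keys `constraints` and `tracks` are assumptions, and the callables stand in for models that this sketch does not implement.

```python
def generate_music(instruction: dict, lyrics_text: str, *,
                   make_params, make_skeleton, make_mix, make_song):
    """Chain the generation steps in the order described for units 502 to 505."""
    constraints = instruction["constraints"]       # from the acquisition unit
    params = make_params(constraints)               # first generation unit
    skeleton = make_skeleton(constraints, params)   # second generation unit
    mix = make_mix(skeleton["tracks"])              # third generation unit
    return make_song(lyrics_text, params, mix)      # fourth generation unit
```

Passing the steps in as arguments keeps the sketch independent of any particular model and makes each step replaceable on its own.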

[0081] Based on any of the above embodiments, the music composition constraint information includes at least one of the following: music style, emotional tendency, rhythm and speed, musical phrase structure, and language type.

[0082] Based on any of the above embodiments, the first generation unit 502 is specifically used for: The music composition constraint information is input into the large language model to obtain the composition parameters output by the large language model. The composition parameters include at least one of the following: tonality, meter, tempo, music style, musical phrase structure, language type, and vocal gender.
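If the large language model is asked to return the composition parameters as JSON, its output can be checked before it is passed on. The sketch below assumes such a JSON format, which this embodiment does not specify. The key names are hypothetical spellings of the parameters listed above.

```python
import json

# Hypothetical key names for the parameters listed in the text.
ALLOWED_KEYS = {
    "tonality", "meter", "tempo", "music_style",
    "phrase_structure", "language_type", "vocal_gender",
}


def parse_composition_params(llm_output: str) -> dict:
    """Parse model output and keep only recognized composition parameters."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    params = {key: value for key, value in data.items() if key in ALLOWED_KEYS}
    if not params:
        # The text requires at least one of the listed parameters.
        raise ValueError("no recognized composition parameters")
    return params
```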

[0083] Based on any of the above embodiments, the second generation unit 503 is specifically used for: The music composition constraint information and the composition parameters are input into the music style skeleton generation model to obtain at least two candidate music style skeletons output by the music style skeleton generation model. The candidate music style skeleton selected by the user from the at least two candidate music style skeletons is determined as the target music style skeleton.

[0084] Based on any of the above embodiments, the third generation unit 504 is specifically used for: The track information of the multiple instruments is decomposed to obtain the independent instrument tracks of each instrument; Based on the beat in the creation parameters, each of the independent instrument tracks is synchronized to obtain the synchronized instrument tracks. For each synchronized instrument track, based on the correspondence between the instrument track and the virtual instrument sound source, the synchronized instrument track is assigned to the corresponding target virtual instrument sound source to obtain the multi-track audio corresponding to the synchronized instrument track. At least one audio track edited by the user from all the said audio tracks is identified as the target audio track. The target audio tracks are mixed using an automatic mixing engine to obtain the mixing reference audio. The mixing process includes at least one of the following: dynamic equalization, image localization, compression and limiting, and loudness normalization.
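Two of the operations above can be illustrated in a few lines: synchronizing a track to the beat and bringing the mixed result to a common level. The sketch makes several assumptions that are not in the text: tracks are given as note onset times in seconds or as lists of samples, synchronization is done by snapping onsets to a grid derived from the tempo, and simple peak normalization is used in place of true loudness normalization. It does not model dynamic equalization, localization, compression, or limiting.

```python
def quantize_onsets(onsets_sec, tempo_bpm, subdivisions=4):
    """Snap note onsets to the nearest grid point of the beat.

    The grid step is one beat divided by `subdivisions`.
    """
    step = 60.0 / tempo_bpm / subdivisions
    return [round(t / step) * step for t in onsets_sec]


def mix_and_normalize(tracks, peak=0.9):
    """Sum tracks sample by sample, then scale so the peak equals `peak`.

    Peak normalization is a simplified substitute for loudness normalization.
    """
    if not tracks:
        return []
    length = max(len(track) for track in tracks)
    mixed = [
        sum(track[i] for track in tracks if i < len(track))
        for i in range(length)
    ]
    current_peak = max((abs(sample) for sample in mixed), default=0.0)
    if current_peak == 0:
        return mixed  # silence: nothing to scale
    gain = peak / current_peak
    return [sample * gain for sample in mixed]
```

At 120 beats per minute with four subdivisions the grid step is 0.125 seconds, so an onset at 0.13 seconds is moved to 0.125 seconds.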

[0085] Based on any of the above embodiments, the fourth generation unit 505 is specifically used for: The lyrics text, the creation parameters, and the mixing reference audio are input into the song generation model to obtain the target music output by the song generation model.

[0086] Based on any of the above embodiments, the music generation device 500 further includes: The storage unit is used to store the correspondence between target information and task identifiers in the entire music generation process to the cloud server, and to add the task identifiers to the local history interface. The target information includes at least one of the following: music generation instructions, creation parameters, candidate music style skeletons, track audio, mixing reference audio, lyrics text, and target music.

[0087] Based on any of the above embodiments, the music generation device 500 further includes: The sending unit is configured to send an information retrieval request to the cloud server when it receives a user's selection operation on the task identifier in the history interface, the information retrieval request including the task identifier; The first receiving unit is used to receive the target information corresponding to the task identifier sent by the cloud server; The loading unit is used to load the target information into the editing interface; The second receiving unit is used to receive the user's modification operation on the target information in the editing interface, and obtain the modified target information; The fifth generation unit is used to generate new music based on the modified target information.

[0088] Figure 6 is a schematic diagram of the physical structure of the electronic device provided in the embodiments of the present invention. As shown in Figure 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute a music generation method, which includes: acquiring music generation instructions and lyrics text, wherein the music generation instructions include music composition constraint information; generating composition parameters based on the music composition constraint information; generating a target music style skeleton based on the music composition constraint information and the composition parameters, the target music style skeleton including track information of multiple instruments; generating a mixing reference audio based on the track information of the multiple instruments; and generating target music based on the lyrics text, the composition parameters, and the mixing reference audio.

[0089] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0090] In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the music generation method provided by the above method embodiments, the method including: acquiring music generation instructions and lyrics text, wherein the music generation instructions include music composition constraint information; generating composition parameters based on the music composition constraint information; generating a target music style skeleton based on the music composition constraint information and the composition parameters, the target music style skeleton including track information of multiple instruments; generating a mixing reference audio based on the track information of the multiple instruments; and generating target music based on the lyrics text, the composition parameters, and the mixing reference audio.

[0091] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the music generation method provided by the above method embodiments, the method comprising: acquiring music generation instructions and lyrics text, wherein the music generation instructions include music composition constraint information; generating composition parameters based on the music composition constraint information; generating a target music style skeleton based on the music composition constraint information and the composition parameters, the target music style skeleton including track information of multiple instruments; generating a mixing reference audio based on the track information of the multiple instruments; and generating target music based on the lyrics text, the composition parameters, and the mixing reference audio.

[0092] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0093] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0094] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A music generation method, characterized in that it comprises: acquiring music generation instructions and lyrics text, wherein the music generation instructions include music composition constraint information; generating composition parameters based on the music composition constraint information; generating a target music style skeleton based on the music composition constraint information and the composition parameters, wherein the target music style skeleton includes track information of multiple instruments; generating a mixing reference audio based on the track information of the multiple instruments; and generating target music based on the lyrics text, the composition parameters, and the mixing reference audio.

2. The music generation method according to claim 1, characterized in that the music composition constraint information includes at least one of the following: musical style, emotional inclination, tempo, musical phrase structure, and language type.

3. The music generation method according to claim 1, characterized in that, The generation of composition parameters based on the music composition constraint information includes: The music composition constraint information is input into the large language model to obtain the composition parameters output by the large language model. The composition parameters include at least one of the following: tonality, meter, tempo, music style, musical phrase structure, language type, and vocal gender.

4. The music generation method according to claim 1, characterized in that, The process of generating a target music style skeleton based on the music composition constraint information and the composition parameters includes: The music composition constraint information and the composition parameters are input into the music style skeleton generation model to obtain at least two candidate music style skeletons output by the music style skeleton generation model. The candidate music style skeleton selected by the user from the at least two candidate music style skeletons is determined as the target music style skeleton.

5. The music generation method according to claim 1, characterized in that, The process of generating a mixing reference audio based on the track information of the multiple instruments includes: The track information of the multiple instruments is decomposed to obtain the independent instrument tracks of each instrument; Based on the beat in the creation parameters, each of the independent instrument tracks is synchronized to obtain the synchronized instrument tracks. For each synchronized instrument track, based on the correspondence between the instrument track and the virtual instrument sound source, the synchronized instrument track is assigned to the corresponding target virtual instrument sound source to obtain the multi-track audio corresponding to the synchronized instrument track. At least one audio track edited by the user from all the said audio tracks is identified as the target audio track. The target audio tracks are mixed using an automatic mixing engine to obtain the mixing reference audio. The mixing process includes at least one of the following: dynamic equalization, image localization, compression and limiting, and loudness normalization.

6. The music generation method according to claim 1, characterized in that, The process of generating target music based on the lyrics text, the creation parameters, and the mixing reference audio includes: The lyrics text, the creation parameters, and the mixing reference audio are input into the song generation model to obtain the target music output by the song generation model.

7. The music generation method according to claim 1, characterized in that, The method further includes: The correspondence between target information and task identifiers in the entire music generation process is stored in the cloud server, and the task identifiers are added to the local history interface. The target information includes at least one of the following: music generation instructions, creation parameters, candidate music style skeletons, track audio, mixing reference audio, lyrics text, and target music.

8. The music generation method according to claim 7, characterized in that, The method further includes: Upon receiving a user's selection operation of the task identifier on the history interface, an information retrieval request is sent to the cloud server, the information retrieval request including the task identifier. Receive the target information corresponding to the task identifier sent by the cloud server; Load the target information into the editing interface; Receive user modification operations on the target information in the editing interface, and obtain the modified target information; New music is generated based on the modified target information.

9. A music generation device, characterized in that it comprises: an acquisition unit, used to acquire music generation instructions and lyrics text, wherein the music generation instructions include music composition constraint information; a first generation unit, used to generate composition parameters based on the music composition constraint information; a second generation unit, used to generate a target music style skeleton based on the music composition constraint information and the composition parameters, wherein the target music style skeleton includes track information of multiple instruments; a third generation unit, used to generate a mixing reference audio based on the track information of the multiple instruments; and a fourth generation unit, used to generate target music based on the lyrics text, the composition parameters, and the mixing reference audio.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the music generation method according to any one of claims 1 to 8.