Text generation model generation method, text generation method and device thereof

By training a text generation model using rhythm and format information extracted from lyrics sample data, the problems of poor sentence structure and high annotation costs in traditional lyrics generation applications are solved. This enables rhythm-matched lyrics generation, reduces the need for manual annotation, and improves the applicability and structural control of lyrics generation.

CN115391595B · Active · Publication Date: 2026-06-16 · BEIJING YOUZHUJU NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING YOUZHUJU NETWORK TECH CO LTD
Filing Date: 2022-08-22
Publication Date: 2026-06-16

Abstract

Embodiments of the present disclosure provide a method for generating a text generation model, a text generation method, and an apparatus thereof. The method can include determining rhythm information of text units in text sample data. The method can also include determining format information of the text units in the text sample data. The method can further include training the text generation model based at least on the text sample data, the rhythm information, and the format information. A text generation model trained in the manner of the present disclosure can generate text conforming to given rhythm information, and the training reduces the demand for manual annotation of lyrics text data.

Description

Technical Field

[0001] The embodiments of this disclosure relate to the field of data processing, and more specifically, to methods for generating text generation models, text generation methods, apparatus, electronic devices, and computer-readable storage media.

Background Technology

[0002] Lyrics are an essential component of music, especially pop music. Currently, the vast majority of music lyrics are written by the creators themselves. With the rapid development of artificial intelligence technology, human-machine collaboration has become possible. Creators can utilize their life experiences, literary skills, and AI-powered lyric generation applications to create the lyrics they desire. However, lyrics generated by traditional lyric generation applications often suffer from poor sentence structure, offering little beyond simple control over sentence length and number of lines. Therefore, traditional lyric generation applications typically only provide inspirational prompts, requiring creators to adjust and modify the lyrics themselves.

Summary of the Invention

[0003] The embodiments of this disclosure provide a text generation model generation scheme.

[0004] In a first aspect of this disclosure, a method for generating a text generation model is provided. The method may include determining rhythmic information of text units in text sample data. The method may also include determining format information of text units in the text sample data. Furthermore, the method may further include training the text generation model based at least on the text sample data, the rhythmic information, and the format information.

[0005] In a second aspect of this disclosure, a text generation method is provided. The method may include acquiring predetermined rhythmic information associated with text units in the text. Furthermore, the method may also include applying at least the rhythmic information to a trained text generation model to determine the text.

[0006] In a third aspect of this disclosure, a text generation model generation apparatus is provided, the apparatus comprising: a rhythm information determination module configured to determine rhythm information of text units in text sample data; a format information determination module configured to determine format information of text units in the text sample data; and a model generation module configured to train the text generation model based at least on the text sample data, the rhythm information, and the format information.

[0007] In a fourth aspect of this disclosure, a text generation apparatus is provided, comprising: a rhythm information acquisition module configured to acquire predetermined rhythm information associated with text units in the text; and a text determination module configured to apply at least the rhythm information to a trained text generation model to determine the text.

[0008] In a fifth aspect of this disclosure, an electronic device is provided, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein, the instructions causing the electronic device to perform actions when executed by the processor, the actions including: determining rhythm information of text units in text sample data; determining format information of text units in the text sample data; and training the text generation model based at least on the text sample data, the rhythm information, and the format information.

[0009] In a sixth aspect of this disclosure, an electronic device is provided, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein, the instructions, when executed by the processor, causing the electronic device to perform actions, the actions including: acquiring predetermined rhythm information associated with text units in the text; and at least applying the rhythm information to a trained text generation model to determine the text.

[0010] In a seventh aspect of this disclosure, a computer program product is provided, which is tangibly stored on a computer-readable medium and includes machine-executable instructions that, when executed, cause a machine to perform any step of the method according to the first or second aspect.

[0011] This Summary is provided to introduce, in simplified form, a selection of concepts that are further described in the detailed embodiments below. This Summary is not intended to identify key or major features of this disclosure, nor is it intended to limit the scope of this disclosure.

Attached Figure Description

[0012] The above and other objects, features, and advantages of this disclosure will become more apparent from the accompanying drawings, in which the same or similar reference numerals generally represent the same or similar parts. In the drawings:

[0013] Figure 1 shows a schematic diagram of an example environment in which several embodiments of the present disclosure can be implemented;

[0014] Figure 2 shows a schematic diagram of a detailed example environment for training and applying a model according to embodiments of the present disclosure;

[0015] Figure 3 shows a flowchart of the generation process for a text generation model according to an embodiment of the present disclosure;

[0016] Figure 4 shows a schematic diagram of determining composite template sample data based on lyrics text sample data according to an embodiment of the present disclosure;

[0017] Figure 5 shows a schematic diagram of generating a text generation model based on composite template sample data according to an embodiment of the present disclosure;

[0018] Figure 6 shows a schematic diagram of generating a self-supervised text generation model according to an embodiment of the present disclosure;

[0019] Figure 7 shows a flowchart of a text generation process according to an embodiment of the present disclosure;

[0020] Figure 8 shows a schematic diagram of a generation apparatus for a text generation model according to an embodiment of the present disclosure;

[0021] Figure 9 shows a schematic diagram of a text generation apparatus according to an embodiment of the present disclosure; and

[0022] Figure 10 shows a schematic block diagram of an example device that can be used to implement embodiments of the present disclosure.

Detailed Implementation

[0023] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0024] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message.

[0025] As an optional but non-limiting implementation, in response to a user's active request, sending a prompt message to the user can be done via a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.

[0026] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0027] It is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0028] The principles of this disclosure will now be described with reference to several exemplary embodiments shown in the accompanying drawings.

[0029] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0030] In the description of embodiments of this disclosure, the term "text" can be a combination of words or sentences with structured features, such as lyrics in a song, as well as poems, lyrics, couplets, parallel sentences, etc. The term "rhythm information" is used to indicate the temporal and duration relationships between each word in the text. For example, each word in a lyric can correspond to a specific beat in a specific measure of a musical score; therefore, encoding the lyric based on the measure position and beat position corresponding to each word in the lyric determines the rhythm information of the lyric. The term "format information" typically refers to the positional relationships between each word in the text. For example, each word in a lyric can correspond to a specific word in a specific line of text; therefore, encoding the lyric based on the line position and word position corresponding to each word in the lyric determines the format information of the lyric.

[0031] As described above, with the continuous development of computer technology, artificial intelligence technology is widely applied to all aspects of people's lives. To better perform text generation tasks such as song lyrics, the training process of traditional text generation models needs to be optimized. In traditional text generation systems, text generation typically focuses on rhyme and emotional information, thus generating song lyrics that basically meet rhyme requirements and convey the specified emotion. However, song lyrics often have certain structural information, such as multi-segment structures, repetitive structures, and verse-chorus structures. For related sentences, even highly consistent grammatical word order is required in the generated text. This structural information is often limited by the melody to be paired with; therefore, there is an urgent need for a text generation solution with controllable structure. Furthermore, since text generation solutions require a certain amount of labeled song lyric text data, there are high human and time costs. Therefore, the training process of traditional text generation models needs to be improved to reduce the need for labeled song lyric text data.

[0032] According to embodiments of this disclosure, a text generation scheme based on rhythm information is proposed. Upon receiving a sample dataset, this scheme extracts both format information and rhythm information from the sample dataset, thereby enabling the training of a text generation model based on the sample dataset, rhythm information, and format information.

[0033] It should be understood that the verse-chorus structure in lyrics refers to a phenomenon in popular songs, where a song can be structurally divided into multiple sections, some called verses and others choruses. Verses often carry the narrative portion of the song, and their lyrical format and correspondence requirements are relatively lower. Unlike verses, choruses correspond to the climax of the lyrics, requiring the generated lyrics to have semantic relationships such as repetition, progression, and echoing, while also demanding a higher degree of rhythm and emotional tension. Furthermore, these requirements for verse-chorus lyrics are often difficult to capture directly at the textual level, but they exhibit relatively stable consistency when mapped to rhythmic information. Therefore, rhythmic information can be used to assist the model in learning the correspondences in the lyrics.

[0034] Therefore, the text generation model obtained by the model generation method of this scheme can generate text, such as lyrics, that conforms to given rhythm information. Since rhythm information is usually a relatively low-level control signal, using it reduces the need for manual annotation of lyric text data, thereby solving the above-mentioned problems and/or other potential problems.

[0035] The embodiments of this disclosure will be described in detail below with reference to example scenarios. It should be understood that this is for illustrative purposes only and is not intended to limit the scope of this disclosure in any way.

[0036] Figure 1 shows a block diagram of an example system 100 for generating a text generation model according to an embodiment of the present disclosure. It should be understood that the system 100 shown in Figure 1 is merely one example in which embodiments of this disclosure can be implemented and is not intended to limit the scope of this disclosure. The embodiments of this disclosure are equally applicable to other systems or architectures.

[0037] As shown in Figure 1, system 100 may include computing device 120. Computing device 120 may be configured to receive user input 110 from a user side, such as a creator or video/music producer. Computing device 120 generates the user-desired lyrics text 130 based on the user input 110. Specifically, computing device 120 may generate the user-desired lyrics text using a text generation model 140 disposed therein.

[0038] In some embodiments, the user input 110 acquired by the computing device 120 may be rhythmic information directly extracted from a given melody, enabling the generated lyrics to directly adapt to the given melody. Alternatively or additionally, the user input 110 acquired by the computing device 120 may come from a given lyric text, making the generated lyric text highly relevant to the given text. Alternatively or additionally, the user input 110 acquired by the computing device 120 may come from editing information of a video stream or short video, thereby enabling the generated lyric text to adapt to the rhythm of the video.

[0039] In this disclosure, the text generation model 140 can be designed to perform text generation tasks. Examples of text generation models include, but are not limited to, various deep neural networks (DNNs), convolutional neural networks (CNNs), support vector machines (SVMs), decision trees, random forest models, and so on. In the implementation of this disclosure, the text generation model may also be referred to as a "neural network," "learning model," "learning network," "model," or "network," and these terms are used interchangeably.

[0040] In some embodiments, the computing device 120 may include, but is not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), media players, etc.), consumer electronics, minicomputers, mainframe computers, cloud computing resources, etc.

[0041] It should be understood that the devices and/or units included in system 100 are exemplary only and are not intended to limit the scope of this disclosure. It should be understood that system 100 may also include additional devices and/or units not shown. For example, in some embodiments, the computing device 120 of system 100 may further include a storage unit (not shown) for storing pre-input hyperparameters, etc.

[0042] The generation and use of models in computing device 120 are described below with reference to Figure 2.

[0043] Figure 2 shows a schematic diagram of a detailed example environment 200 according to an embodiment of the present disclosure. Similar to Figure 1, example environment 200 may include computing device 220, user input 210 input to computing device 220, and text 230 output from computing device 220. The difference is that example environment 200 may generally include model generation system 260 and model application system 270. As an example, model generation system 260 and/or model application system 270 may be implemented in the computing device 120 shown in Figure 1 or the computing device 220 shown in Figure 2. It should be understood that the description of the structure and functionality of the example environment 200 is for illustrative purposes only and is not intended to limit the scope of the subject matter described herein. The subject matter described herein may be implemented with different structures and/or functionality.

[0044] As mentioned earlier, the process of processing user input to determine the lyrics text the user expects can be divided into two stages: the model generation stage and the model application stage. For example, as shown in Figure 2, in the model generation stage, the model generation system 260 can train the model 240 using the sample dataset 250. It should be understood that the sample dataset 250 may be lyrics text and other relevant information from currently released popular music. In the model application stage, the model application system 270 can receive the trained model 240. Thus, the model 240 loaded into the computing device 220 of the model application system 270 can determine the text 230 based on user input 210.

[0045] In other embodiments, model 240 can be constructed as a learning network. In some embodiments, the learning network may include multiple networks, each of which may be a multi-layer neural network composed of a large number of neurons. Through a training process, the corresponding parameters of the neurons in each network can be determined. The parameters of the neurons in these networks are collectively referred to as the parameters of model 240.

[0046] The training process of model 240 can be performed iteratively until at least some of the parameters of model 240 converge or until a predetermined number of iterations is reached, thereby obtaining the final model parameters.

[0047] The technical solutions described above are for illustrative purposes only and are not intended to limit this disclosure. It should be understood that the networks can also be arranged in other ways and with other connections. To explain the principles of the above solutions more clearly, the training process of the text generation model is described in more detail below with reference to Figure 3.

[0048] Figure 3 shows a flowchart of a generation process 300 for a text generation model according to an embodiment of the present disclosure. In some embodiments, process 300 may be implemented in the computing device 120 of Figure 1 and the computing device 220 of Figure 2. The generation process 300 according to an embodiment of this disclosure is now described with reference to Figure 3. For ease of understanding, the specific examples mentioned in the following description are exemplary and are not intended to limit the scope of protection of this disclosure.

[0049] At 302, computing device 120 can determine the rhythmic information of text units in the text sample data. It should be understood that text sample data is a combination of words or sentences with structured features, such as lyrics from currently released popular songs, as well as poems, lyrics, couplets, parallel sentences, etc. In some embodiments, the text sample data is lyrics text sample data, and the text unit is each character in the lyrics text sample data. For example, if a line of lyrics has eight characters, this can be described as the line of text sample data containing eight text units.

[0050] In some embodiments, to determine rhythm information, the computing device 120 can apply text sample data to a trained rhythm determination model to determine the rhythm information. In other words, the computing device 120 can input a line or section of lyrics from a published popular song into the trained rhythm determination model, which can determine the rhythm information of each word in the line or section of the song based on the melody corresponding to that line or section.

[0051] It should be understood that in the application scenario of lyric generation, the lyrics correspond to the melody of the song, and each note in the melody has a specific duration. As an example, in 4/4 time, each note in each measure can correspond to a specific beat. For instance, if the first word in the lyrics corresponds to one beat (a quarter note), that word corresponds to the first beat of the measure and can be encoded as 0. If the second word corresponds to half a beat (an eighth note), that word corresponds to the second beat of the measure and can be encoded as 1. Proceeding in this way, the encoding of each word in the lyrics, i.e., the rhythm information, can be determined. It should also be understood that in other application scenarios such as poetry, lyrics, couplets, and parallel sentences, the duration of each word or unit can likewise be used for encoding to determine the rhythm information.

[0052] In some embodiments, in order to train the text generation model, it is necessary to extract useful information such as rhythm information from the lyrics text sample data. Figure 4 shows a schematic diagram of scenario 400, illustrating the determination of composite template sample data based on lyrics text sample data according to an embodiment of the present disclosure. As shown in Figure 4, in scenario 400, lyrics text sample data 410 is input to computing device 420. More specifically, computing device 420 includes multiple modules, such as masking module 421, rhythm extraction module 423, format extraction module 425, and autoregressive prediction module 427.

[0053] To extract rhythm information from the lyrics text sample data 410, the rhythm extraction module 423 in the computing device 420 can be configured to determine the beat position information of each word in the lyrics text sample data 410 within the current measure, as part B of the rhythm information. As an example, in Figure 4, the beat position information b1, b2, ..., bn indicates the code, within the current measure, of each character in a line of lyrics in the lyrics text sample data 410.

[0054] Furthermore, to simplify the encoding rules, the rhythm extraction module 423 in the computing device 420 can be configured to determine the measure position information of each word in the lyrics text sample data 410, as another part A of the rhythm information. As an example, in Figure 4, the measure position information a1, a2, ..., an indicates the code of the measure containing each word in a line of lyrics in the lyrics text sample data 410. This method allows rhythm information to be obtained quickly and accurately from the sample data, thereby improving model performance and reducing the need for sample data annotation.
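
As an illustration of the beat position (B) and measure position (A) encodings described above, the following is a minimal sketch and not the disclosed rhythm extraction module 423. It assumes that each word's note duration is given in beats, that words are sung back to back without rests, that positions are zero-based, and that a word's beat position is the beat in which its note starts. The function name `rhythm_info` and its parameters are hypothetical.

```python
from fractions import Fraction


def rhythm_info(durations, beats_per_measure=4):
    """Return (B, A): beat position and measure position of each word.

    durations: note length sung on each word, in beats
    (1 = quarter note in 4/4 time, 0.5 = eighth note).
    Assumption: no rests, zero-based codes, position = beat of note onset.
    """
    beat_positions, measure_positions = [], []
    onset = Fraction(0)  # running start time of the current word, in beats
    for duration in durations:
        measure_positions.append(int(onset // beats_per_measure))
        beat_positions.append(int(onset % beats_per_measure))
        onset += Fraction(duration).limit_denominator(64)
    return beat_positions, measure_positions


# Eight words: quarter, eighth, eighth, quarter, quarter, half, quarter, quarter
B, A = rhythm_info([1, 0.5, 0.5, 1, 1, 2, 1, 1])
print(B)  # [0, 1, 1, 2, 3, 0, 2, 3]
print(A)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Here the first word (a quarter note) falls on the first beat and is coded 0, and the second word (an eighth note) starts on the second beat and is coded 1, matching the 4/4 example given earlier.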

[0055] Returning to Figure 3, at 304, computing device 120 can determine the format information of text units in the text sample data. In some embodiments, as shown in Figure 4, in order to determine the format information, the format extraction module 425 of the computing device 420 can determine the character position information of each character in the lyrics text sample data 410 within the current line, as part P of the format information. As an example, in Figure 4, the character position information p1, p2, ..., pn indicates the code, within the current line, of each character in a line of lyrics in the lyrics text sample data 410.

[0056] Furthermore, to simplify the encoding rules, the format extraction module 425 of the computing device 420 can determine the line position information of each character in the lyrics text sample data 410 as another part S of the format information. As an example, in Figure 4, the line position information s1, s2, ..., sn indicates the code of the line containing each character in a line of lyrics in the lyrics text sample data 410.
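
The character position (P) and line position (S) encodings can be sketched as follows. This is an illustrative example only, assuming zero-based positions and lyrics already split into lines of text units; the function name `format_info` and the sample lines are hypothetical.

```python
def format_info(lines):
    """Return (P, S) for every text unit, flattened across all lines.

    P[i]: position of unit i within its own line (zero-based, assumed).
    S[i]: index of the line that contains unit i (zero-based, assumed).
    """
    char_positions, line_positions = [], []
    for line_index, line in enumerate(lines):
        for char_index, _unit in enumerate(line):
            char_positions.append(char_index)
            line_positions.append(line_index)
    return char_positions, line_positions


# Two made-up lines of three and four text units
P, S = format_info([["shine", "on", "me"], ["all", "through", "the", "night"]])
print(P)  # [0, 1, 2, 0, 1, 2, 3]
print(S)  # [0, 0, 0, 1, 1, 1, 1]
```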

[0057] Returning to Figure 3, at 306, computing device 120 can train a text generation model based at least on text sample data, rhythm information, and format information. In some embodiments, as shown in Figure 4, in order to train the text generation model, the masking module 421 in the computing device 420 can determine the masked text sample data M based on the lyrics text sample data 410. Specifically, the masking module 421 randomly performs a masking operation on a portion of the lyrics in a line of lyrics in the lyrics text sample data 410 according to a predetermined probability, and replaces that portion of the lyrics text with mask labels, yielding (m1, m2, ..., mn). Furthermore, the masked sample data is used as input to the model, and ground truth values are determined based on the masked characters to train the text generation model. It should be understood that the masking module 421 is designed to accelerate model convergence, and its specific technical details are known in the art, so they will not be elaborated here.
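
A minimal sketch of the masking step follows. It assumes that each text unit is masked independently with the predetermined probability, which is one common reading and not necessarily the scheme used by masking module 421. The `[MASK]` token and the function name `mask_units` are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask label


def mask_units(units, mask_prob=0.3, seed=None):
    """Return (masked, targets) for one line of text units.

    masked: copy of `units` with some entries replaced by MASK.
    targets: {position: original unit} for every masked position,
             usable as ground truth during training.
    Assumption: each unit is masked independently with `mask_prob`.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for index, unit in enumerate(units):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[index] = unit
        else:
            masked.append(unit)
    return masked, targets


masked, targets = mask_units(["all", "through", "the", "night"], mask_prob=0.5, seed=0)
print(masked, targets)
```

Keeping the original units in `targets` makes the ground truth for the masked positions available without any manual labeling.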

[0058] Furthermore, the computing device 420 can train a text generation model based on the lyrics text sample data 410, the masked text sample data M, the rhythm information B and A, and the format information S and P.

[0059] As shown in Figure 4, the computing device 420 also includes an autoregressive prediction module 427. This module converts the lyrics text sample data 410 (i.e., the original lyrics) into sequence labels L (l1, l2, ..., ln) according to a preset text encoder. Thus, the computing device 420 can process the lyrics text sample data 410 into composite template sample data 430. It should be understood that, in addition to the six feature data mentioned above, information such as rhyme, emotion, and melody can also be introduced into the composite template sample data 430. In this way, batch processing of sample data is achieved, and the training process is simplified.

[0060] Figure 5 shows a schematic diagram of a generation scenario 500 for a text generation model based on composite template sample data according to an embodiment of the present disclosure. As shown in Figure 5, the masked text sample data M, rhythm information B, A, and format information S, P in the composite template sample data 430 are determined as the first data segment 510, and the rhythm information B, A, format information S, P, and the sequence labels L of the lyrics text sample data 410 are determined as the second data segment 520.
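
The split of the composite template into the two data segments can be sketched as below. The container `CompositeTemplate`, the function `split_segments`, and the sample values (including the label IDs) are hypothetical and only illustrate which of the six features goes to which segment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompositeTemplate:
    M: List[str]  # masked text sample data
    B: List[int]  # beat position of each unit
    A: List[int]  # measure position of each unit
    S: List[int]  # line position of each unit
    P: List[int]  # character position of each unit
    L: List[int]  # sequence labels of the original lyrics


def split_segments(template):
    """Return (first segment for the encoder, second segment for the decoder)."""
    n = len(template.L)
    features = (template.M, template.B, template.A, template.S, template.P)
    if any(len(feature) != n for feature in features):
        raise ValueError("every feature needs one entry per text unit")
    shared = {"B": template.B, "A": template.A, "S": template.S, "P": template.P}
    first = {"M": template.M, **shared}   # data segment 510
    second = {**shared, "L": template.L}  # data segment 520
    return first, second


template = CompositeTemplate(
    M=["[MASK]", "on", "[MASK]"],
    B=[0, 1, 2], A=[0, 0, 0], S=[0, 0, 0], P=[0, 1, 2],
    L=[17, 4, 9],  # made-up label IDs
)
first_510, second_520 = split_segments(template)
print(sorted(first_510))   # ['A', 'B', 'M', 'P', 'S']
print(sorted(second_520))  # ['A', 'B', 'L', 'P', 'S']
```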

[0061] To train the text generation model 530, the computing device 420 can apply the first data segment 510 to the encoder 531 of the text generation model 530 and apply the second data segment 520 to the decoder 533 of the text generation model 530. For example, as shown in Figure 5, decoder 533 can be configured to receive the processing result of encoder 531. Therefore, decoder 533 can output text prediction data 540. Then, computing device 420 can determine the loss function value of text generation model 530 based on the text prediction data 540 output by decoder 533 and the sequence labels L of lyrics text sample data 410, thereby adjusting the parameters of text generation model 530. It should be understood that text generation model 530 can be a pre-trained T5 model, mT5 model, or other model. In this way, model training can be achieved with only a small amount of labeled information.
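
The disclosure does not name the loss function. As one plausible choice, the sketch below assumes a token-level cross-entropy between the decoder's per-position scores and the sequence labels L, which is the usual objective for encoder-decoder models of this kind. The function name `cross_entropy` and the toy scores are hypothetical.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy between per-position scores and label IDs.

    logits: one list of unnormalized vocabulary scores per text unit
            (a stand-in for text prediction data 540).
    labels: the sequence labels L, one vocabulary ID per text unit.
    Assumption: cross-entropy is used; the source does not specify the loss.
    """
    total = 0.0
    for scores, label in zip(logits, labels):
        peak = max(scores)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
    return total / len(labels)


# Toy vocabulary of 4 IDs, two positions
uncertain = cross_entropy([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [2, 1])
confident = cross_entropy([[0.0, 0.0, 9.0, 0.0], [0.0, 9.0, 0.0, 0.0]], [2, 1])
print(uncertain > confident)  # True
```

A uniform prediction over four IDs gives a loss of ln 4, and the loss falls as the scores concentrate on the correct labels, which is the signal used to adjust the model parameters.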

[0062] To further reduce the cost of manual annotation, a self-supervised training framework can be built. Figure 6 shows a schematic diagram of a system 600 for generating a self-supervised text generation model according to an embodiment of the present disclosure.

[0063] As shown in Figure 6, system 600 includes composite template sample data 610, text generation model 620, and rhythm determination model 640. In Figure 6, the composite template sample data 610 is shown to include at least text sample data 611, masked text sample data 613, and rhythm information 615. It should be understood that the composite template sample data 610 in Figure 6 is similar to the composite template sample data 430 in Figures 4 and 5, so the details will not be repeated here. Furthermore, similar to the text generation model 530 in Figure 5, the text generation model 620 also includes an encoder 621 and a decoder 623, and the decoder 623 is configured to output text prediction data 630. As described above, the computing device 120 can determine the loss function value 670 of the text generation model 620 based on the text prediction data 630 and the text sample data 611, thereby adjusting the parameters of the text generation model 620 until it converges.

[0064] In addition, in Figure 6, the text prediction data 630 is also applied to the rhythm determination model 640, thereby determining the predicted rhythm information 650. Thus, the computing device 120 can determine the loss function value 660 of the rhythm determination model 640 based on the rhythm information 615 and the predicted rhythm information 650 output by the rhythm determination model 640, thereby adjusting the parameters of the rhythm determination model 640 until it converges. In this way, the system of Figure 6 eliminates the need for manual annotation of rhythm information in lyrics, as well as the need for manual annotation of lyrics corresponding to rhythm information, thus achieving a self-supervised model training method. Furthermore, by implementing cyclical training between lyrics and rhythm, the correlation between lyrics and rhythm information can be further improved.
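
The data flow of this lyrics-to-rhythm cycle can be sketched with stand-in models. The sketch below only shows how the two loss values are derived from one composite template. The models, the toy mismatch-rate loss, and all names are hypothetical placeholders, not the disclosed networks or loss functions.

```python
def mismatch_rate(predicted, target):
    """Toy loss: fraction of positions where the two sequences differ."""
    return sum(p != t for p, t in zip(predicted, target)) / len(target)


def cycle_losses(text_model, rhythm_model, template, loss_fn=mismatch_rate):
    """Return (text loss, rhythm loss) for one composite template.

    text_model(masked_text, rhythm) -> predicted text   (cf. 630)
    rhythm_model(text)              -> predicted rhythm (cf. 650)
    """
    predicted_text = text_model(template["masked_text"], template["rhythm"])
    text_loss = loss_fn(predicted_text, template["text"])        # cf. 670
    predicted_rhythm = rhythm_model(predicted_text)
    rhythm_loss = loss_fn(predicted_rhythm, template["rhythm"])  # cf. 660
    return text_loss, rhythm_loss


def fill_with_x(masked_text, rhythm):
    """Stand-in text model: replaces every mask with the unit 'x'."""
    return [unit if unit != "[MASK]" else "x" for unit in masked_text]


def count_up(text):
    """Stand-in rhythm model: assigns beats 0, 1, 2, ... in order."""
    return list(range(len(text)))


template = {
    "text": ["a", "b", "c", "d"],
    "masked_text": ["a", "[MASK]", "c", "[MASK]"],
    "rhythm": [0, 1, 2, 3],
}
losses = cycle_losses(fill_with_x, count_up, template)
print(losses)  # (0.5, 0.0)
```

Both targets (the original text and its rhythm information) come from the sample itself, which is why no additional manual labels are needed in this setup.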

[0065] Once the parameters of the text generation model converge or a predetermined number of training iterations is reached, the computing device 120 can apply at least given rhythm information to the trained text generation model to generate text that matches the given rhythm information, or apply at least a given first text to the trained text generation model to generate second text that matches the first text. In this way, lyrics that match the given rhythm or the given text in terms of rhythm information can be generated.

[0066] Furthermore, during model application (inference), the definitions of B, A, S, and P are consistent with those used in the training process, and M can be set to consist entirely of mask labels. The model output is the sampling probability of the current lyrics. Generally, a sampling algorithm is needed to sample from this output to obtain the generated lyrics text; this is a standard operation in natural language generation (NLG) and will not be elaborated here. When keyword or topic control is needed, some labels in M can be selected and replaced with the corresponding control text to achieve the control effect. When a specific line in the lyrics needs to be modified, it suffices to replace the text of that line in M with mask labels, and the model will automatically generate the corresponding replacement text.
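
As a non-limiting illustration, the handling of M at inference time can be sketched as follows. The mask label string `"[MASK]"`, the helper names, and the choice of top-k sampling are assumptions made for this sketch; this disclosure only states that a sampling algorithm is used and does not name one.

```python
import random

MASK = "[MASK]"  # hypothetical mask label


def build_mask_sequence(length, control=None):
    """Return M made of mask labels, with optional control text at chosen positions."""
    m = [MASK] * length
    for position, token in (control or {}).items():
        m[position] = token  # keyword or topic control
    return m


def mask_line(lines, line_index):
    """Replace every character of one lyrics line with the mask label."""
    return [[MASK] * len(line) if i == line_index else list(line)
            for i, line in enumerate(lines)]


def sample_token(probs, vocab, top_k=3, rng=random):
    """Top-k sampling: keep the k most probable tokens and draw one by weight."""
    ranked = sorted(zip(probs, vocab), reverse=True)[:top_k]
    weights = [p for p, _ in ranked]
    tokens = [t for _, t in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

For example, `build_mask_sequence(4, {1: "rain"})` keeps the control text at position 1 and leaves the remaining positions for the model to fill.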

[0067] Figure 7 shows a flowchart of a process 700 for text generation according to embodiments of this disclosure. In some embodiments, process 700 can be implemented in the computing device 120 of Figure 1 and the computing device 220 of Figure 2. Process 700 is now described with reference to Figure 7. For ease of understanding, the specific examples mentioned in the following description are exemplary and are not intended to limit the scope of protection of this disclosure.

[0068] At 702, computing device 120 can acquire predetermined rhythm information associated with text units in the text. It should be understood that the text is a combination of words or sentences with structured features, such as lyrics that the user expects to generate, as well as poems, couplets, parallel sentences, and the like, and the text unit is each word in the lyrics.

[0069] At 704, computing device 120 can apply at least the rhythm information to a trained text generation model to determine the text.

[0070] In some embodiments, the acquired rhythm information may be rhythm information extracted from the melody of a given song. Alternatively or additionally, the acquired rhythm information may be rhythm information extracted from a given lyrics text. Alternatively or additionally, the acquired rhythm information may be rhythm information extracted from a given video stream.

[0071] Through the aforementioned embodiments, this disclosure can generate lyrics or poems with well-defined relationships of repetition, similarity, and progression, enhancing the ability to produce music based on the generated text. Furthermore, by considering the rhythmic information of the lyrics, this disclosure significantly reduces the need for extensive rhythmic data annotation of the generated lyrics, thereby saving labor costs. Moreover, the lyrics generated based on the text generation model of this disclosure have good structural controllability and can be used in various scenarios requiring strict structural control of the generated lyrics, fully enhancing the applicability of the lyrics generation technology.

[0072] This disclosure also provides a model training apparatus. Specifically, Figure 8 shows a schematic diagram of an apparatus 800 for generating a text generation model according to an embodiment of the present disclosure. As shown in Figure 8, the apparatus 800 may include at least: a rhythm information determination module 802, configured to determine the rhythm information of text units in text sample data; a format information determination module 804, configured to determine the format information of the text units in the text sample data; and a model generation module 806, configured to train the text generation model based at least on the text sample data, the rhythm information, and the format information.

[0073] In some embodiments, the rhythm information determination module 802 may be configured to apply text sample data to the rhythm determination model to determine rhythm information.

[0074] In some embodiments, the text sample data may be lyrics text sample data, and the text unit may be each word in the lyrics text sample data.

[0075] In some embodiments, the rhythm information determination module 802 may include: a beat position information determination module configured to determine the beat position information of each word in the lyrics text sample data in the current measure, as part of the rhythm information; and a measure position information determination module configured to determine the measure position information of the current measure in which each word in the lyrics text sample data is located, as another part of the rhythm information.
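
As a non-limiting illustration, the two parts of the rhythm information can be sketched as follows. The sketch assumes that an onset time, counted in beats from the start of the song, is already known for each word, and that the meter is fixed; both assumptions, along with the function name `rhythm_info`, are hypothetical and are not taken from this disclosure, in which the rhythm information may instead be output by the rhythm determination model.

```python
def rhythm_info(onsets_in_beats, beats_per_measure=4):
    """Return (beat positions, measure positions) for each word.

    onsets_in_beats: onset of each word in beats from the song start (assumed input).
    beats_per_measure: fixed meter assumed for this sketch.
    """
    beat_positions = []
    measure_positions = []
    for onset in onsets_in_beats:
        # Index of the measure in which the word is located.
        measure_positions.append(int(onset // beats_per_measure))
        # Beat of the word within that measure.
        beat_positions.append(int(onset % beats_per_measure))
    return beat_positions, measure_positions
```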

[0076] In some embodiments, the format information determination module 804 may include: a character position information determination module configured to determine the character position information of each character in the lyrics text sample data in the current line, as part of the format information; and a line position information determination module configured to determine the line position information of the current line in which each character in the lyrics text sample data is located, as another part of the format information.
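
As a non-limiting illustration, the two parts of the format information can be sketched as follows, assuming that the lyrics are provided as a single string in which lines are separated by newline characters. The function name `format_info` and the zero-based indexing are assumptions of this sketch.

```python
def format_info(lyrics):
    """Return (character positions, line positions) for each character.

    Character position: index of the character within its current line.
    Line position: index of the line in which the character is located.
    """
    char_positions = []
    line_positions = []
    for line_index, line in enumerate(lyrics.splitlines()):
        for char_index, _ in enumerate(line):
            char_positions.append(char_index)
            line_positions.append(line_index)
    return char_positions, line_positions
```

For a two-line input such as `"ab\ncde"`, the sketch yields one pair of indices per character, so both returned lists have the same length as the number of characters in the lyrics.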

[0077] In some embodiments, the model generation module 806 may be configured to: determine masked text sample data based on the text sample data; and train the text generation model based on the text sample data, the masked text sample data, the rhythm information, and the format information.
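
As a non-limiting illustration, determining masked text sample data from text sample data can be sketched as follows. The masking strategy (uniform random selection of positions), the mask ratio, and the mask label string are assumptions of this sketch and are not specified in this disclosure.

```python
import random

MASK = "[MASK]"  # hypothetical mask label


def mask_text(tokens, mask_ratio=0.5, seed=0):
    """Return a copy of tokens with a random subset replaced by the mask label."""
    if not tokens:
        return []
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else tok
            for i, tok in enumerate(tokens)]
```

The unmasked original tokens then serve as the training target, while the masked copy is the input that the model has to complete.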

[0078] In some embodiments, the model generation module 806 may be configured to: apply the masked text sample data, the rhythm information, and the format information to the encoder of the text generation model; apply the text sample data, the rhythm information, and the format information to the decoder of the text generation model, the decoder being configured to receive the processing result of the encoder; and determine the loss function value of the text generation model based on the text prediction data output by the decoder and the text sample data.

[0079] In some embodiments, the model generation module 806 may be configured to determine the loss function value of the text generation model based on the text sample data and the text prediction data output by the text generation model, and the apparatus 800 may further be configured to: apply the text prediction data to the rhythm determination model; and determine the loss function value of the rhythm determination model based on the rhythm information and the predicted rhythm information output by the rhythm determination model.

[0080] In some embodiments, the apparatus 800 may further include: a first application module configured to apply at least given rhythm information to the trained text generation model to generate text matching the given rhythm information; or a second application module configured to apply at least given first text to the trained text generation model to generate second text matching the first text.

[0081] Furthermore, this disclosure also provides a text generation apparatus. Specifically, Figure 9 shows a schematic diagram of a text generation apparatus 900 according to an embodiment of the present disclosure. As shown in Figure 9, the apparatus 900 may include at least: a rhythm information acquisition module 902, configured to acquire predetermined rhythm information associated with text units in the text; and a text determination module 904, configured to apply at least the rhythm information to a trained text generation model to determine the text.

[0082] In some embodiments, the acquired rhythm information may include at least one of the following: rhythm information extracted from the melody of a given song; rhythm information extracted from a given lyrics text; and rhythm information extracted from a given video stream.

[0083] Figure 10 shows a schematic block diagram of an example device 1000 that can be used to implement embodiments of the present disclosure. For example, the computing device 120 shown in Figure 1 and the computing device 220 shown in Figure 2 can be implemented by device 1000. As shown, device 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) 1002 or loaded from storage unit 1008 into random access memory (RAM) 1003. The RAM 1003 may also store various programs and data required for the operation of device 1000. The CPU 1001, ROM 1002, and RAM 1003 are interconnected via bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

[0084] Multiple components in device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or mouse; an output unit 1007, such as various types of monitors and speakers; a storage unit 1008, such as a magnetic disk or optical disk; and a communication unit 1009, such as a network card, modem, or wireless transceiver. The communication unit 1009 allows device 1000 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks. It should be understood that this disclosure can utilize output unit 1007 to display real-time dynamic changes in user satisfaction, key factor identification information for group or individual users of satisfaction, optimization strategy information, and strategy implementation effectiveness evaluation information, etc.

[0085] The CPU 1001 may be implemented by one or more processing circuits. The CPU 1001 may be configured to perform the various processes and procedures described above, such as processes 300 and 700. For example, in some embodiments, processes 300 and 700 may be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed on device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by CPU 1001, one or more steps of processes 300 and 700 described above may be performed.

[0086] This disclosure can be a system, method, and/or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of this disclosure.

[0087] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination thereof. A computer-readable storage medium as used herein is not to be construed as a transient signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0088] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

[0089] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing state information from the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement various aspects of this disclosure.

[0090] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0091] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium, where they cause a computer, programmable data processing apparatus, and/or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0092] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0093] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions that contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0094] According to one or more embodiments of this disclosure, the following examples are provided. Example 1. A method for generating a text generation model, comprising: determining rhythm information of text units in text sample data; determining format information of the text units in the text sample data; and training the text generation model based at least on the text sample data, the rhythm information, and the format information.

[0095] Example 2. According to the method of Example 1, determining the rhythm information includes: applying the text sample data to a rhythm determination model to determine the rhythm information.

[0096] Example 3. According to the method described in Example 1, wherein the text sample data is lyrics text sample data, and the text unit is each word in the lyrics text sample data.

[0097] Example 4. According to the method described in Example 3, determining the rhythm information includes: determining the beat position information of each word in the lyrics text sample data in the current measure as part of the rhythm information; and determining the measure position information of the current measure in which each word in the lyrics text sample data is located as another part of the rhythm information.

[0098] Example 5. According to the method described in Example 4, determining the format information includes: determining the character position information of each character in the lyrics text sample data in the current line as part of the format information; and determining the line position information of the current line in which each character in the lyrics text sample data is located as another part of the format information.

[0099] Example 6. According to the method of Example 1, training the text generation model includes: determining masked text sample data based on the text sample data; and training the text generation model based on the text sample data, the masked text sample data, the rhythm information, and the format information.

[0100] Example 7. According to the method of Example 6, training the text generation model includes: applying the masked text sample data, the rhythm information, and the format information to an encoder of the text generation model; applying the text sample data, the rhythm information, and the format information to a decoder of the text generation model, the decoder being configured to receive the processing result of the encoder; and determining the loss function value of the text generation model based on the text prediction data output by the decoder and the text sample data.

[0101] Example 8. According to the method of Example 2, training the text generation model includes: determining the loss function value of the text generation model based on the text sample data and the text prediction data output by the text generation model, and the method further includes: applying the text prediction data to the rhythm determination model; and determining the loss function value of the rhythm determination model based on the rhythm information and the predicted rhythm information output by the rhythm determination model.

[0102] Example 9. The method according to Example 1 further includes: applying at least the given rhythm information to the trained text generation model to generate text matching the given rhythm information, or applying at least the given first text to the trained text generation model to generate second text matching the first text.

[0103] Example 10. A text generation method, comprising: acquiring predetermined rhythmic information associated with text units in the text; and applying at least the rhythmic information to a trained text generation model to determine the text.

[0104] Example 11. The method according to Example 10, wherein the obtained rhythm information includes at least one of the following: rhythm information extracted from the melody of a given song; rhythm information extracted from a given lyrics text; and rhythm information extracted from a given video stream.

[0105] Example 12. A text generation model generation apparatus, comprising: a rhythm information determination module configured to determine rhythm information of text units in text sample data; a format information determination module configured to determine format information of text units in the text sample data; and a model generation module configured to train the text generation model based at least on the text sample data, the rhythm information, and the format information.

[0106] Example 13. The apparatus according to Example 12, wherein the rhythm information determination module is configured to apply the text sample data to a rhythm determination model to determine the rhythm information.

[0107] Example 14. The apparatus according to Example 12, wherein the text sample data is lyrics text sample data, and the text unit is each word in the lyrics text sample data.

[0108] Example 15. The apparatus according to Example 14, wherein the rhythm information determination module comprises: a beat position information determination module configured to determine the beat position information of each word in the lyrics text sample data in the current measure as part of the rhythm information; and a measure position information determination module configured to determine the measure position information of the current measure in which each word in the lyrics text sample data is located as another part of the rhythm information.

[0109] Example 16. The apparatus according to Example 15, wherein the format information determination module comprises: a character position information determination module configured to determine the character position information of each character in the lyrics text sample data in the current line as part of the format information; and a line position information determination module configured to determine the line position information of the current line in which each character in the lyrics text sample data is located as another part of the format information.

[0110] Example 17. The apparatus according to Example 12, wherein the model generation module is configured to: determine masked text sample data based on the text sample data; and train the text generation model based on the text sample data, the masked text sample data, the rhythm information, and the format information.

[0111] Example 18. The apparatus according to Example 17, wherein the model generation module is configured to: apply the masked text sample data, the rhythm information, and the format information to an encoder of the text generation model; apply the text sample data, the rhythm information, and the format information to a decoder of the text generation model, the decoder being configured to receive the processing result of the encoder; and determine a loss function value of the text generation model based on text prediction data output by the decoder and the text sample data.

[0112] Example 19. The apparatus according to Example 13, wherein the model generation module is configured to: determine the loss function value of the text generation model based on the text sample data and the text prediction data output by the text generation model, and the apparatus is configured to: apply the text prediction data to the rhythm determination model; and determine the loss function value of the rhythm determination model based on the rhythm information and the predicted rhythm information output by the rhythm determination model.

[0113] Example 20. The apparatus according to Example 12 further includes: a first application module configured to apply at least given rhythm information to the trained text generation model to generate text matching the given rhythm information; or a second application module configured to apply at least given first text to the trained text generation model to generate second text matching the first text.

[0114] Example 21. A text generation apparatus, comprising: a rhythm information acquisition module configured to acquire predetermined rhythm information associated with text units in the text; and a text determination module configured to apply at least the rhythm information to a trained text generation model to determine the text.

[0115] Example 22. The apparatus according to Example 21, wherein the acquired rhythm information includes at least one of the following: rhythm information extracted from the melody of a given song; rhythm information extracted from a given lyrics text; and rhythm information extracted from a given video stream.

[0116] Example 23. An electronic device, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein, the instructions causing the electronic device to perform actions when executed by the processor, the actions including: determining rhythm information of text units in text sample data; determining format information of text units in the text sample data; and training the text generation model based at least on the text sample data, the rhythm information, and the format information.

[0117] Example 24. An electronic device, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein, the instructions causing the electronic device to perform actions when executed by the processor, the actions including: acquiring predetermined rhythm information associated with text units in the text; and applying at least the rhythm information to a trained text generation model to determine the text.

[0118] Example 25. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method as described in any one of Examples 1-11.

[0119] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical applications, or their technical improvements over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for generating a text generation model, comprising: determining rhythm information of text units in text sample data; determining format information of the text units in the text sample data; and training the text generation model based at least on the text sample data, the rhythm information, and the format information; wherein the text sample data is lyrics text sample data, and the rhythm information is an encoding of each character in the lyrics text sample data; and wherein the rhythm information includes the beat position information of each word in the current measure and the measure position information of the current measure.

2. The method according to claim 1, wherein determining the rhythm information comprises: applying the text sample data to a rhythm determination model to determine the rhythm information.

3. The method according to claim 1, wherein the text unit is each character in the lyrics text sample data.

4. The method of claim 1, wherein determining the rhythm information comprises: determining the beat position information as part of the rhythm information; and determining the measure position information as another part of the rhythm information.

5. The method of claim 4, wherein determining the format information comprises: determining the character position information of each character in the lyrics text sample data in the current line as part of the format information; and determining the line position information of the current line in which each character in the lyrics text sample data is located as another part of the format information.

6. The method of claim 1, wherein training the text generation model comprises: determining masked text sample data based on the text sample data; and training the text generation model based on the text sample data, the masked text sample data, the rhythm information, and the format information.

7. The method of claim 6, wherein training the text generation model comprises: applying the masked text sample data, the rhythm information, and the format information to an encoder of the text generation model; applying the text sample data, the rhythm information, and the format information to a decoder of the text generation model, the decoder being configured to receive the processing result of the encoder; and determining a loss function value of the text generation model based on the text prediction data output by the decoder and the text sample data.

8. The method of claim 2, wherein training the text generation model comprises: determining a loss function value of the text generation model based on the text sample data and the text prediction data output by the text generation model, and the method further comprises: applying the text prediction data to the rhythm determination model; and determining a loss function value of the rhythm determination model based on the rhythm information and the predicted rhythm information output by the rhythm determination model.

9. The method according to claim 1, further comprising: applying at least given rhythm information to the trained text generation model to generate text that matches the given rhythm information; or applying at least a given first text to the trained text generation model to generate a second text that matches the first text.

10. A text generation method, comprising: acquiring predetermined rhythm information associated with text units in a text; and applying at least the rhythm information to a trained text generation model to determine the text; wherein the text is lyrics text, and the rhythm information is an encoding of each word in the lyrics text; and wherein the rhythm information includes the beat position information of each word in the current measure and the measure position information of the current measure.

11. The method of claim 10, wherein the acquired rhythm information comprises at least one of the following: rhythm information extracted from the melody of a given song; rhythm information extracted from a given lyrics text; and rhythm information extracted from a given video stream.

12. An apparatus for generating a text generation model, comprising: a rhythm information determination module configured to determine rhythm information of text units in text sample data; a format information determination module configured to determine format information of the text units in the text sample data; and a model generation module configured to train the text generation model based at least on the text sample data, the rhythm information, and the format information; wherein the text sample data is lyrics text sample data, and the rhythm information is an encoding of each character in the lyrics text sample data; and wherein the rhythm information includes the beat position information of each word in the current measure and the measure position information of the current measure.

13. A text generation apparatus, comprising: a rhythm information acquisition module configured to acquire predetermined rhythm information associated with text units in a text; and a text determination module configured to apply at least the rhythm information to a trained text generation model to determine the text; wherein the text is lyrics text, and the rhythm information is an encoding of each word in the lyrics text; and wherein the rhythm information includes beat position information of each word within a current measure and measure position information of the current measure.

14. An electronic device, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions comprising: determining rhythm information of text units in text sample data; determining format information of the text units in the text sample data; and training a text generation model based at least on the text sample data, the rhythm information, and the format information; wherein the text sample data is lyrics text sample data, and the rhythm information is an encoding of each character in the lyrics text sample data; and wherein the rhythm information includes beat position information of each word within a current measure and measure position information of the current measure.

15. An electronic device, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions comprising: acquiring predetermined rhythm information associated with text units in a text; and applying at least the rhythm information to a trained text generation model to determine the text; wherein the text is lyrics text, and the rhythm information is an encoding of each word in the lyrics text; and wherein the rhythm information includes beat position information of each word within a current measure and measure position information of the current measure.

16. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-11.
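
For illustration only, the rhythm information recited in the claims (a beat position within the current measure plus a measure position, per lyric unit) could be encoded as in the following minimal sketch. The claims do not prescribe a concrete encoding; the function name `encode_rhythm`, the onset-in-beats input, and the default of four beats per measure are assumptions made for this example and are not taken from the disclosure.

```python
from typing import List, Tuple


def encode_rhythm(
    onsets_in_beats: List[float], beats_per_measure: int = 4
) -> List[Tuple[int, float]]:
    """Map each lyric unit's onset (in beats from the start of the song)
    to a (measure_index, beat_in_measure) pair.

    Hypothetical helper: the input representation and the 4-beat default
    are illustrative assumptions, not part of the claimed method.
    """
    if beats_per_measure <= 0:
        raise ValueError("beats_per_measure must be positive")
    encoding = []
    for onset in onsets_in_beats:
        # Measure position: which measure the unit falls in (0-based).
        measure_index = int(onset // beats_per_measure)
        # Beat position: offset of the unit inside that measure.
        beat_in_measure = onset - measure_index * beats_per_measure
        encoding.append((measure_index, beat_in_measure))
    return encoding


# Five lyric units with assumed onsets, in 4/4 time.
onsets = [0.0, 1.0, 2.5, 4.0, 5.5]
rhythm_info = encode_rhythm(onsets)
print(rhythm_info)
```

In this sketch, each pair would accompany the corresponding lyric unit as a conditioning input, both when training the text generation model and when applying given rhythm information to the trained model.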