Large model-based visual content generation and target large model training method and apparatus
By introducing a thinking phase and multimodal processing into a large model, and using maximum likelihood estimation and reinforcement learning to train the target large model, the problem of insufficient visual content generation quality is solved, and higher quality visual content generation and smoother user interaction are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING BAIDU NETCOM SCI & TECH CO LTD
- Filing Date
- 2025-06-03
- Publication Date
- 2026-06-16
AI Technical Summary
Existing large models produce poor-quality results in visual content generation, failing to meet users' high-quality requirements.
By introducing a thinking phase into the large model, using a multimodal large model to process user input instructions, and training the target large model through maximum likelihood estimation and reinforcement learning, the accuracy of the generated results is improved.
It improves the accuracy and quality of visual content generation, enabling the generated results to better meet user needs and enhance the smoothness of user interaction with large models.
Smart Images

Figure CN120893504B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, specifically to the fields of deep learning, large models, computer vision, and natural language processing, and particularly to methods and apparatus for visual content generation and target large model training based on large models. Background Technology
[0002] Large-scale models refer to deep learning models trained on large amounts of text data. They can generate natural language text or understand the meaning of natural language text, and can simulate human language cognition and generation processes to a certain extent. Currently, large-scale models are widely used in various scenarios, such as visual content generation. Visual content generation refers to generating corresponding visual content, such as images or videos, based on user-input instructions using large-scale models. Summary of the Invention
[0003] This disclosure provides methods and apparatus for large-model-based visual content generation and target large-model training.
[0004] A visual content generation method based on a large model includes:
[0005] Obtain target instruction information;
[0006] The target instruction information is input into the target large model to obtain the corresponding target result information and output it. The target result information includes target visual content. The target result information is generated by the target large model based on the target thinking information. The target thinking information is the thinking process information generated by the target large model in response to the target instruction information.
[0007] A method for training a large target model includes:
[0008] Obtain the pre-trained base model;
[0009] Acquire first training data, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information. The first sample thinking information is thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content.
[0010] The basic large model is trained based on the first training data, and the target large model is determined based on the training results.
[0011] A visual content generation device based on a large model includes: an instruction acquisition module and a result generation module;
[0012] The instruction acquisition module is used to acquire target instruction information;
[0013] The result generation module is used to input the target instruction information into the target large model, obtain the corresponding target result information and output it. The target result information includes target visual content. The target result information is generated by the target large model based on the target thinking information. The target thinking information is the thinking process information generated by the target large model in response to the target instruction information.
[0014] A target large model training device includes: a model acquisition module, a data acquisition module, and a model training module;
[0015] The model acquisition module is used to acquire the pre-trained basic large model;
[0016] The data acquisition module is used to acquire first training data, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information. The first sample thinking information is thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content.
[0017] The model training module is used to train the basic large model based on the first training data, and determine the target large model based on the training results.
[0018] An electronic device, comprising:
[0019] At least one processor; and
[0020] A memory communicatively connected to the at least one processor; wherein,
[0021] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method described above.
[0022] A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the methods described above.
[0023] A computer program product includes a computer program / instructions that, when executed by a processor, implement the method described above.
[0024] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description. Attached Figure Description
[0025] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0026] Figure 1 This is a flowchart of an embodiment of the visual content generation method based on a large model as described in this disclosure;
[0027] Figure 2 This is a schematic diagram illustrating the interaction between the user and the target large model as described in this disclosure;
[0028] Figure 3 This is a first schematic diagram of the visual content generation process based on a large model as described in this disclosure;
[0029] Figure 4 This is a second schematic diagram of the visual content generation process based on a large model as described in this disclosure;
[0030] Figure 5 This is a flowchart of the first embodiment of the target large model training method described in this disclosure;
[0031] Figure 6 This is a flowchart of the second embodiment of the target large model training method described in this disclosure;
[0032] Figure 7 This is a schematic diagram of the composition structure of Embodiment 700 of the visual content generation device based on a large model as described in this disclosure;
[0033] Figure 8 This is a schematic diagram of the composition structure of Embodiment 800 of the target large model training device described in this disclosure;
[0034] Figure 9 A schematic block diagram of an electronic device 900 that can be used to implement embodiments of the present disclosure is shown. Detailed Implementation
[0035] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0036] Furthermore, it should be understood that the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0037] Figure 1This is a flowchart illustrating an embodiment of the large-model-based visual content generation method described in this disclosure. Figure 1 As shown, the specific implementation methods are as follows.
[0038] In step 101, the target instruction information (query) is obtained.
[0039] In step 102, the target instruction information is input into the target large model, the corresponding target result information is obtained and output, the target result information includes target visual content, the target result information is generated by the target large model based on the target thinking information, and the target thinking information is the thinking process information generated by the target large model in response to the target instruction information.
[0040] Currently, although large models can be used to generate visual content corresponding to user input instructions, the generated visual content is usually of poor quality and cannot meet user requirements well.
[0041] The above-described method and implementation scheme explicitly adds a thinking stage for the target instruction information input by the user to the target large model. This allows for the generation of thinking process information based on the target instruction information, and then the generation of the required target result information based on the thinking process information. This improves the accuracy of the generated target result information and enables the target result information to better meet user requirements.
[0042] In some embodiments of this disclosure, the target large model may include: a multimodal large model, and the target instruction information may include: first generation requirement description information, or the first generation requirement description information and a first image corresponding to the first generation requirement description information.
[0043] A multimodal large model is a model architecture capable of simultaneously processing multimodal data (such as text, images, audio, and video) input and output, and achieving cross-modal understanding and generation. Its core objective is to integrate the understanding and generation capabilities of traditional multimodal large models through a unified framework, thereby improving task generalization efficiency and interaction flexibility. The solution described in this disclosure can use a multimodal large model as the target large model, which can further improve the accuracy of the generated target result information.
[0044] The target instruction information can include only the first generation requirement description information (such as the text-to-image task), or it can include both the first generation requirement description information and the first image corresponding to the first generation requirement description information (such as the image editing task). That is, the target instruction information can include only text information, or it can include both text information and image information, which is very flexible and convenient.
[0045] Accordingly, the visual content generation process described in this disclosure can be divided into three stages: the instruction acquisition stage, the thinking stage, and the response stage. The instruction acquisition stage refers to the stage of acquiring the target instruction information input by the user; the thinking stage refers to the stage of the target large model's internal thinking process in response to the target instruction information; and the response stage refers to the stage of generating and outputting the target result information.
[0046] The target result information may include target visual content, which may be an image or a video.
[0047] In some embodiments of this disclosure, the target result information may further include: a response script that matches the target instruction information; in addition, target thinking information may also be output at the same time as the target result information is output.
[0048] In other words, while generating target visual content, it can also generate response scripts that match the target instruction information to enrich the information content returned to the user and improve the smoothness of the interaction between the user and the target large model. In addition, it can also return target thinking information to the user to further enrich the information content returned to the user.
[0049] Accordingly, Figure 2 This is a schematic diagram illustrating the interaction between the user and the target large model as described in this disclosure. Figure 2 As shown, users can input target instructions into the target model. The target model can sequentially execute the instruction acquisition stage, the thinking stage, and the response stage, and can return the target result information to the user. The target result information may include target visual content and response scripts that match the target instruction information.
[0050] in addition, Figure 3 This is a first schematic diagram of the visual content generation process based on a large model as described in this disclosure. Figure 3 As shown, assuming the target instruction information only includes the first generation requirement description information, specifically: "Draw me a futuristic clock to place on a wooden table," then the target large model can be generated using a left-to-right token-by-token generation method, such as... Figure 3The white squares at the bottom represent text tokens, and the gray squares represent image tokens. Whether to generate text tokens or image tokens is determined by the target model itself. For example, the target model can first generate the text content "Draw a futuristic floating mechanical clock, cobalt blue metal..." and generate the corresponding image 'a'. Specifically, it can first generate the token for image 'a', and then use an image decoder to generate image 'a' based on the token for image 'a'. Further, it can generate the text content "I need a wooden table, with brown wood grain..." and generate the corresponding image 'b'. Then, it can generate the text content "I need to put the clock on the wooden table and show it to the user" and generate the corresponding image 'c'. Image 'c' is the target visual content. In addition, it can also generate the text content "Hello, this is the clock and wooden table image you requested" as a response.
[0051] Figure 4 This is a second schematic diagram illustrating the large-model-based visual content generation process described in this disclosure. For example... Figure 4 As shown, assuming the target instruction information includes both a first generation requirement description and a corresponding first image, the first generation requirement description is specifically: "Add a banana next to the apple in this image." The first image is the "image" mentioned in the first generation requirement description. The token of the first image can be obtained through an image encoder. The target model can first generate the text "I need to draw a banana first" and generate the corresponding image a'. Then, it can generate the text "The banana is not drawn well, redraw it" and generate the corresponding image b'. After that, it can generate the text "I want to add the banana to the original image" and generate the corresponding image c'. Image c' is the target visual content. In addition, it can also generate the text "The banana has been added. Do you have any other requirements?" as a response.
[0052] The target large model can be obtained through pre-training. The following is an explanation of the training process of the target large model.
[0053] Figure 5 This is a flowchart of the first embodiment of the target large model training method described in this disclosure. Figure 5 As shown, the specific implementation methods are as follows.
[0054] In step 501, the pre-trained base model is obtained.
[0055] In step 502, first training data is obtained, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information. The first sample thinking information is the thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content.
[0056] In step 503, the basic large model is trained based on the first training data, and the target large model is determined based on the training results.
[0057] Based on the training data mentioned above, the target large model can learn how to generate thought process information, so as to generate target result information corresponding to the target instruction information input by the user based on the thought process information, thereby improving the accuracy of the generated target result information and enabling the target result information to better meet the user's requirements.
[0058] In some embodiments of this disclosure, the target large model may include a multimodal large model. Accordingly, the base large model may be a multimodal large model, such as a pre-trained multimodal large model that can be directly reused, such as a hybrid modal base model (Chameleon), thereby improving training efficiency and leveraging the powerful reasoning capabilities of the multimodal large model to improve the accuracy of the obtained target result information.
[0059] There are no restrictions on how the first training data is obtained; for example, it can be collected and labeled manually. The first training data may include: first sample instruction information, first sample result information, and first sample thinking information. The first sample thinking information is the thinking process information generated in response to the first sample instruction information, and the first sample result information includes the first visual content.
[0060] That is, the first training data may include: <query> ……< / query> <thinking> ……< / thinking> <response> ……< / response> ,in, <query> ……< / query> This indicates the first sample instruction information. <thinking> ……< / thinking> < indicates the first sample of thinking information. <response> ……< / response> This indicates the results of the first sample.
[0061] In some embodiments of this disclosure, the first sample instruction information may include: second generation requirement description information, or, the second generation requirement description information and a second image corresponding to the second generation requirement description information. Accordingly, any first sample consideration information may include one of the following: 1) refined requirement description information obtained by refining the second generation requirement description information; 2) step description information for generating the first sample result information; 3) initial result information and optimization description information, wherein the first sample result information is obtained by performing optimization processing corresponding to the optimization description information on the initial result information; 4) M candidate result information and selection reason information corresponding to the first sample instruction information, wherein M is a positive integer greater than 1, the M candidate result information includes the first sample result information, and the selection reason information is used to explain the reasons why the first sample result information is superior to other candidate result information.
[0062] At least four of the above methods can be used to generate thought process information. The following uses an image as the first-view content as an example to further explain the four methods.
[0063] In Method 1), the first sample instruction information can be enriched and rewritten in the form of text, that is, the second generation requirement description information is refined to obtain the refined requirement description information. Compared with the second generation requirement description information, the refined requirement description information can improve the details, style, layout, etc. of the image to be generated.
[0064] In method 2), the second generation requirement description information can be broken down to obtain the step description information for generating the first sample result information, such as which text content to generate first, which image to generate next, ..., and how to combine them to obtain the final required image, etc.
[0065] In method 3), the generated image content can be repeatedly modified. For example, initial result information and optimized description information (textual reflection information) can be provided. The image in the first sample result information can be obtained by performing the optimization processing corresponding to the optimized description information on the image in the initial result information. That is, the image in the initial result information can be modified in detail through the optimized description information to obtain the image in the first sample result information.
[0066] In method 4), M candidate result information corresponding to the first sample instruction information can be given simultaneously, where M is a positive integer greater than 1, and the specific value can be determined according to actual needs. Selection reason information can also be given. The first sample result information is included among the M candidate result information. The selection reason information is used to explain why the first sample result information is superior to other candidate result information. That is, the selection reason information is used to explain why the first sample result information is selected from the M candidate result information as the final required result.
[0067] As can be seen, through the above processing, the target large model can learn various different ways of thinking, thereby improving the learning effect of the target large model, that is, improving the performance of the target large model. Subsequently, when using the target large model for actual reasoning applications, the target large model can determine the specific thinking process information on its own.
[0068] In some embodiments of this disclosure, when training a basic large model based on first training data, the basic large model can be trained autoregressively using the maximum likelihood estimation method based on the first training data.
[0069] Maximum likelihood estimation is a mature training method. Accordingly, the maximum likelihood estimation method can be used to perform autoregressive training on the basic large model according to the process of the first sample instruction information, the first sample thinking information, and the first sample result information, thereby improving the training efficiency and learning effect of the target large model.
[0070] After training the basic large model based on the first training data, the target large model can be determined based on the training results.
[0071] In some embodiments of this disclosure, after training the basic large model based on the first training data, an intermediate large model can be obtained. Then, the intermediate large model can be directly determined as the target large model. Alternatively, second training data can be obtained, which may include second sample instruction information. The intermediate large model can be trained using reinforcement learning based on the second training data to obtain the target large model.
[0072] Training the base model based on the first training data refers to supervised fine-tuning (SFT) training of the base model. The base model is obtained through pre-training. By combining pre-training and fine-tuning, the desired target model can be obtained. Alternatively, to further improve the performance of the target model, reinforcement learning training can be performed on it using the second training data after obtaining the intermediate model.
[0073] Specifically, the reinforcement learning can employ algorithms such as Reinforcement Learning from Human Feedback (RLHF).
[0074] In some embodiments of this disclosure, the method of training the intermediate large model with reinforcement learning based on the second training data may include: inputting the second sample instruction information into the intermediate large model to obtain the output intermediate result information, the intermediate result information including the second visual content, determining the comprehensive evaluation result based on the intermediate result information and the second sample instruction information, and updating the intermediate large model according to the principle of improving the comprehensive evaluation result.
[0075] The comprehensive evaluation result can refer to the comprehensive score, i.e. the comprehensive score of the reward model. The optimization goal of reinforcement learning is to improve the comprehensive score of the output result. Accordingly, after determining the comprehensive score based on the intermediate result information and the second sample instruction information, the intermediate large model can be updated (i.e. optimized) according to the principle of improving the comprehensive score.
[0076] In some embodiments of this disclosure, the second sample instruction information may include: third generation requirement description information, or third generation requirement description information and a third image corresponding to the third generation requirement description information; accordingly, in response to the second visual content being an image, the method for determining the comprehensive evaluation result based on the intermediate result information and the second sample instruction information may include: obtaining a similarity score between the second visual content and the third generation requirement description information, and obtaining an aesthetic score for the second visual content; in response to determining that the second sample instruction information does not include a third image, determining a comprehensive score based on the similarity score and the aesthetic score; in response to determining that the second sample instruction information includes a third image, obtaining the sum of squares of the differences between corresponding pixels in the second visual content and the third image, where corresponding pixels are pixels with the same coordinate position, and determining a comprehensive score based on the similarity score, the aesthetic score, and the sum of squares.
[0077] The scheme described in this disclosure can employ a multi-objective reinforcement learning approach that includes model scoring and rule calculation. Model scoring refers to the aforementioned similarity score and aesthetic score. For example, a pre-trained text-image similarity model can be used to determine the similarity score between the second visual content and the third generated requirement description information, and a pre-trained image aesthetic evaluation model can be used to determine the aesthetic score of the second visual content. Rule calculation refers to calculating the sum of squares of the differences (pixel-by-pixel differences) between corresponding pixels in the second visual content and the third image. The similarity score reflects the degree to which the target large model obeys the user's instructions; the higher the similarity score, the higher the degree to which the target large model obeys the user's instructions. The aesthetic score reflects the aesthetic quality of the generated second visual content; the higher the aesthetic score, the higher the aesthetic quality of the second visual content. The sum of squares reflects whether the image editing process follows the original image; the larger the sum of squares, the higher the degree of adherence. Accordingly, combining the similarity score, aesthetic score, and sum of squares to determine a comprehensive score can improve the accuracy of the comprehensive score, thereby improving the optimization efficiency of the intermediate large model.
[0078] There are no restrictions on how to combine similarity scores, aesthetic scores, and sums of squares to determine the comprehensive score. For example, the comprehensive score can be calculated according to a predetermined calculation formula.
[0079] Based on the above introduction, Figure 6This is a flowchart of a second embodiment of the target large model training method described in this disclosure. Figure 6 As shown, the specific implementation methods are as follows.
[0080] In step 601, the pre-trained base model is obtained.
[0081] The basic large model can be a multimodal large model.
[0082] In step 602, first training data is obtained, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information. The first sample thinking information is the thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content.
[0083] The first sample instruction information may include: second generated requirement description information, or the second generated requirement description information and the second image corresponding to the second generated requirement description information.
[0084] In addition, any first sample thinking information may include one of the following: refined requirement description information obtained by refining the second generated requirement description information; step description information for generating the first sample result information; initial result information and optimization description information, wherein the first sample result information is obtained by performing optimization processing corresponding to the optimization description information on the initial result information; M candidate result information and selection reason information corresponding to the first sample instruction information, wherein M is a positive integer greater than 1, the first sample result information is included among the M candidate result information, and the selection reason information is used to explain why the first sample result information is superior to other candidate result information.
[0085] In step 603, the basic large model is trained based on the first training data to obtain the intermediate large model.
[0086] For example, the basic large model can be trained autoregressively using the maximum likelihood estimation method based on the first training data.
[0087] In step 604, second training data is obtained, which includes second sample instruction information.
[0088] The second sample instruction information includes: third generated requirement description information, or third generated requirement description information and the third image corresponding to the third generated requirement description information.
[0089] In step 605, reinforcement learning training is performed on the intermediate large model based on the second training data to obtain the target large model.
[0090] For example, the second sample instruction information can be input into the intermediate large model to obtain the output intermediate result information, which includes the second visual content. Then, the comprehensive evaluation result can be determined based on the intermediate result information and the second sample instruction information. Subsequently, the intermediate large model can be updated according to the principle of improving the comprehensive evaluation result.
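The idea of updating the model "according to the principle of improving the comprehensive evaluation result" can be illustrated with a toy policy-gradient sketch. This is an assumption-laden simplification: the intermediate large model is replaced by a softmax policy over a fixed set of candidate outputs, each with a fixed evaluation score, and the update uses the exact gradient of the expected score rather than sampled outputs. The disclosure does not name a specific reinforcement learning algorithm, and all function names here are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(logits: List[float], scores: List[float]) -> float:
    """Expected evaluation score under the current policy."""
    return sum(p * s for p, s in zip(softmax(logits), scores))


def policy_gradient_step(logits: List[float], scores: List[float],
                         lr: float = 0.5) -> List[float]:
    """One gradient-ascent step on the expected score.

    For a softmax policy, d(expected score)/d(logit_k) = p_k * (score_k - baseline),
    where the baseline is the current expected score.
    """
    probs = softmax(logits)
    baseline = sum(p * s for p, s in zip(probs, scores))
    return [x + lr * p * (s - baseline) for x, p, s in zip(logits, probs, scores)]


# Three candidate outputs with fixed (hypothetical) evaluation scores.
scores = [0.2, 0.9, 0.5]
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = policy_gradient_step(logits, scores)
```

After the loop, the policy places most of its probability on the highest-scoring candidate. A real system would instead sample outputs from the intermediate large model, score each one, and estimate this gradient from the samples.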
[0091] Once the target large model is obtained, it can be applied to practical inference applications, such as the visual content generation method shown in Figure 1, to generate corresponding target result information based on the input target instruction information.
[0092] In addition, during inference, after the target large model generates target result information, the target result information and the corresponding target instruction information can be used to perform further reinforcement learning training on the target large model, so as to further improve its performance.
[0093] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this disclosure. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this disclosure. Furthermore, for parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0094] The above is an introduction to the method embodiments. The following describes the solution described in this disclosure further through device embodiments.
[0095] Figure 7 is a schematic diagram of the structural composition of Embodiment 700 of the visual content generation device based on a large model as described in this disclosure. As shown in Figure 7, it includes: an instruction acquisition module 701 and a result generation module 702.
[0096] The instruction acquisition module 701 is used to acquire target instruction information.
[0097] The result generation module 702 is used to input the target instruction information into the target large model, obtain the corresponding target result information and output it. The target result information includes the target visual content. The target result information is generated by the target large model based on the target thinking information. The target thinking information is the thinking process information generated by the target large model in response to the target instruction information.
[0098] In some embodiments of this disclosure, the target large model may include: a multimodal large model, and the target instruction information may include: first generation requirement description information, or the first generation requirement description information and a first image corresponding to the first generation requirement description information.
[0099] In some embodiments of this disclosure, the target result information may further include: a response script that matches the target instruction information, and / or, the result generation module 702 may output target thinking information while outputting the target result information.
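The division of work between the two modules can be sketched as follows. This is a minimal sketch, assuming the target large model is available as a callable that returns the thinking information together with the result; the names `TargetResult`, `ModelFn`, `ResultGenerationModule`, `output_thinking`, and `stub_model` are hypothetical, and the stub merely stands in for a real multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TargetResult:
    visual_content: bytes                   # target visual content (e.g. encoded image)
    response_script: Optional[str] = None   # optional text matching the instruction


# Hypothetical model interface: (requirement text, optional image) -> (thinking, result).
ModelFn = Callable[[str, Optional[bytes]], Tuple[str, TargetResult]]


class ResultGenerationModule:
    """Feeds the instruction to the model and outputs the result."""

    def __init__(self, model: ModelFn, output_thinking: bool = False) -> None:
        self.model = model
        self.output_thinking = output_thinking

    def generate(self, requirement: str,
                 image: Optional[bytes] = None) -> Tuple[TargetResult, Optional[str]]:
        # The model produces thinking information first and bases the result on it.
        thinking, result = self.model(requirement, image)
        # The thinking information is output only if this option is enabled.
        return result, (thinking if self.output_thinking else None)


def stub_model(requirement: str, image: Optional[bytes]) -> Tuple[str, TargetResult]:
    """Placeholder for the target large model; returns fixed dummy content."""
    thinking = f"Refined requirement: {requirement}, high detail"
    return thinking, TargetResult(b"dummy-image-bytes", "Here is your image.")


module = ResultGenerationModule(stub_model, output_thinking=True)
result, thinking = module.generate("a red fox in snow")
```

Returning the thinking information as an optional second value mirrors the "and / or" option of outputting it alongside the result.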
[0100] Figure 8 is a schematic diagram of the structural composition of Embodiment 800 of the target large model training device described in this disclosure. As shown in Figure 8, it includes: a model acquisition module 801, a data acquisition module 802, and a model training module 803.
[0101] The model acquisition module 801 is used to acquire the pre-trained basic large model.
[0102] The data acquisition module 802 is used to acquire first training data, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information. The first sample thinking information is the thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content.
[0103] The model training module 803 is used to train the basic large model based on the first training data and determine the target large model based on the training results.
[0104] In some embodiments of this disclosure, the target large model may include: a multimodal large model; the first sample instruction information includes: second generation requirement description information, or, the second generation requirement description information and a second image corresponding to the second generation requirement description information.
[0105] In some embodiments of this disclosure, any first sample thinking information may include one of the following: refined requirement description information obtained by refining the second generation requirement description information; step description information for generating the first sample result information; initial result information and optimization description information, wherein the first sample result information is obtained by performing optimization processing corresponding to the optimization description information on the initial result information; M candidate result information and selection reason information corresponding to the first sample instruction information, wherein M is a positive integer greater than 1, the first sample result information is included among the M candidate result information, and the selection reason information is used to explain why the first sample result information is superior to the other candidate result information.
[0106] In some embodiments of this disclosure, when the model training module 803 trains the basic large model based on the first training data, it may perform autoregressive training on the basic large model using the maximum likelihood estimation method based on the first training data.
[0107] In some embodiments of this disclosure, after the model training module 803 trains the basic large model based on the first training data, an intermediate large model can be obtained. Then, the intermediate large model can be directly determined as the target large model. Alternatively, second training data can be obtained, which may include second sample instruction information. The intermediate large model can be trained using reinforcement learning based on the second training data to obtain the target large model.
[0108] In some embodiments of this disclosure, the model training module 803 may perform reinforcement learning training on the intermediate large model based on the second training data in the following ways: inputting the second sample instruction information into the intermediate large model to obtain the output intermediate result information, the intermediate result information including the second visual content, determining the comprehensive evaluation result based on the intermediate result information and the second sample instruction information, and updating the intermediate large model according to the principle of improving the comprehensive evaluation result.
[0109] In some embodiments of this disclosure, the second sample instruction information may include: third generation requirement description information, or third generation requirement description information and a third image corresponding to the third generation requirement description information; the comprehensive evaluation result may include: a comprehensive score; correspondingly, in response to the second visual content being an image, the model training module 803 may determine the comprehensive evaluation result based on the intermediate result information and the second sample instruction information in the following ways: obtaining a similarity score between the second visual content and the third generation requirement description information, and obtaining an aesthetic score for the second visual content; in response to determining that the second sample instruction information does not include a third image, determining a comprehensive score based on the similarity score and the aesthetic score; in response to determining that the second sample instruction information includes a third image, obtaining the sum of squares of the differences between corresponding pixels in the second visual content and the third image, where corresponding pixels are pixels with the same coordinate position, and determining a comprehensive score based on the similarity score, the aesthetic score, and the sum of squares.
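The two scoring branches described above can be sketched in a few lines. The text states which quantities the comprehensive score is based on but not how they are combined, so the weighted sum below, in which the sum of squares is subtracted as a penalty, is purely an assumption, as are the weight parameters and function names. Images are represented as nested lists of per-pixel channel tuples with identical dimensions.

```python
from typing import Optional, Sequence

Pixel = Sequence[float]
Image = Sequence[Sequence[Pixel]]  # rows of pixels, each pixel a tuple of channels


def sum_of_squared_differences(generated: Image, reference: Image) -> float:
    """Sum of squared differences between pixels at the same coordinates."""
    if len(generated) != len(reference):
        raise ValueError("images must have the same height")
    total = 0.0
    for row_g, row_r in zip(generated, reference):
        if len(row_g) != len(row_r):
            raise ValueError("images must have the same width")
        for px_g, px_r in zip(row_g, row_r):
            total += sum((g - r) ** 2 for g, r in zip(px_g, px_r))
    return total


def comprehensive_score(similarity: float, aesthetic: float,
                        ssd: Optional[float] = None,
                        w_sim: float = 1.0, w_aes: float = 1.0,
                        w_ssd: float = 1.0) -> float:
    """Combine the scores; the weighted-sum form is an assumption.

    ssd is None when the instruction contains no third image, in which case
    only the similarity and aesthetic scores contribute.
    """
    score = w_sim * similarity + w_aes * aesthetic
    if ssd is not None:
        score -= w_ssd * ssd  # a larger pixel difference lowers the score
    return score


generated = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
reference = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
ssd = sum_of_squared_differences(generated, reference)
text_only = comprehensive_score(0.8, 0.6)
with_image = comprehensive_score(0.8, 0.6, ssd, w_ssd=0.1)
```

Subtracting the sum of squares reflects the intuition that a generated image closer to the supplied third image should score higher; other monotonically decreasing transforms would serve the same purpose.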
[0110] The specific workflow of each of the above device embodiments can be found in the relevant descriptions in the foregoing method embodiments, and will not be repeated here.
[0111] In summary, the solution described in this disclosure can utilize the chain-of-thought technology of multimodal large models to improve the accuracy of visual content generation results, and it is applicable to different visual content generation scenarios, thus having wide applicability.
[0112] The solutions described in this disclosure can be applied to the field of artificial intelligence, particularly deep learning, large-scale models, computer vision, and natural language processing. Artificial intelligence is the study of enabling computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). It involves both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning / deep learning, big data processing, and knowledge graph technologies.
[0113] Furthermore, the instruction and result information in the embodiments described in this disclosure are not targeted at any specific user and do not reflect the personal information of any specific user. The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solutions of this disclosure all comply with relevant laws and regulations and do not violate public order and good morals.
[0114] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0115] Figure 9 shows a schematic block diagram of an electronic device 900 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0116] As shown in Figure 9, the electronic device 900 includes a computing unit 901, which can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 may also store various programs and data required for the operation of the electronic device 900. The computing unit 901, ROM 902, and RAM 903 are interconnected via a bus 904. An input / output (I / O) interface 905 is also connected to the bus 904.
[0117] Multiple components in electronic device 900 are connected to I / O interface 905, including: input unit 906, such as keyboard, mouse, etc.; output unit 907, such as various types of displays, speakers, etc.; storage unit 908, such as disk, optical disk, etc.; and communication unit 909, such as network card, modem, wireless transceiver, etc. Communication unit 909 allows electronic device 900 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0118] The computing unit 901 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as those described in this disclosure. For example, in some embodiments, the methods described in this disclosure can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program can be loaded and / or installed on the electronic device 900 via ROM 902 and / or communication unit 909. When the computer program is loaded into RAM 903 and executed by the computing unit 901, one or more steps of the methods described in this disclosure can be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the methods described herein by any other suitable means (e.g., by means of firmware).
[0119] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard parts (ASSPs), systems-on-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0120] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0121] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory, read-only memory, erasable programmable read-only memory (EPROM), flash memory, optical fiber, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0122] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0123] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication (e.g., a communication network) of any form or medium. Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0124] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0125] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.
[0126] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A method for training a target large model, comprising: obtaining a pre-trained basic large model; acquiring first training data, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information, wherein the first sample thinking information is thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content; training the basic large model based on the first training data to obtain an intermediate large model; and performing reinforcement learning training on the intermediate large model based on second training data, which includes second sample instruction information, to obtain the target large model, wherein the reinforcement learning training includes: in response to the second sample instruction information including third generation requirement description information and a third image corresponding to the third generation requirement description information, inputting the second sample instruction information into the intermediate large model to obtain intermediate result information including second visual content; in response to the second visual content being an image, using a text-image similarity model to determine a similarity score between the second visual content and the third generation requirement description information, using an image aesthetics evaluation model to determine an aesthetics score of the second visual content, and obtaining the sum of squares of the differences between corresponding pixels in the second visual content and the third image, where corresponding pixels are pixels with the same coordinate position; determining a comprehensive score based on the similarity score, the aesthetics score, and the sum of squares; and updating the intermediate large model according to the principle of improving the comprehensive score.
The first sample instruction information includes: second generation requirement description information and a second image corresponding to the second generation requirement description information; each piece of first sample thinking information includes: refined requirement description information obtained by refining the second generation requirement description information; step description information for generating the first sample result information; initial result information and optimization description information, wherein the first sample result information is obtained by performing optimization processing corresponding to the optimization description information on the initial result information; M candidate result information and selection reason information corresponding to the first sample instruction information, where M is a positive integer greater than 1, the M candidate result information includes the first sample result information, and the selection reason information is used to explain why the first sample result information is superior to the other candidate result information.
2. The method according to claim 1, wherein the step of training the basic large model based on the first training data includes: performing autoregressive training on the basic large model using the maximum likelihood estimation method based on the first training data.
3. A visual content generation method based on a large model, comprising: Obtain target instruction information; The target instruction information is input into the target large model to obtain the corresponding target result information and output it. The target result information includes target visual content. The target result information is generated by the target large model based on the target thinking information. The target thinking information is the thinking process information generated by the target large model in response to the target instruction information. The target large model is trained according to the method of any one of claims 1-2.
4. The method according to claim 3, wherein, The target large model includes: a multimodal large model; The target instruction information includes: first generation requirement description information and a first image corresponding to the first generation requirement description information.
5. The method according to claim 3, wherein, The target result information also includes: a response script that matches the target instruction information; And / or, the method further includes: outputting the target thinking information while outputting the target result information.
6. A visual content generation device based on a large model, comprising: Instruction acquisition module and result generation module; The instruction acquisition module is used to acquire target instruction information; The result generation module is used to input the target instruction information into the target large model, obtain the corresponding target result information and output it. The target result information includes target visual content. The target result information is generated by the target large model based on the target thinking information. The target thinking information is the thinking process information generated by the target large model in response to the target instruction information. The target large model is trained according to the method of any one of claims 1-2.
7. A target large model training device, comprising: Model acquisition module, data acquisition module, and model training module; The model acquisition module is used to acquire the pre-trained basic large model; The data acquisition module is used to acquire first training data, which includes: first sample instruction information, first sample result information corresponding to the first sample instruction information, and first sample thinking information. The first sample thinking information is thinking process information generated in response to the first sample instruction information, and the first sample result information includes first visual content. The model training module is used to train the basic large model based on the first training data to obtain an intermediate large model, and to perform reinforcement learning training on the intermediate large model based on the second training data including the second sample instruction information to obtain the target large model. 
The reinforcement learning training performed by the model training module includes: in response to the second sample instruction information including third generation requirement description information and a third image corresponding to the third generation requirement description information, inputting the second sample instruction information into the intermediate large model to obtain intermediate result information including second visual content; in response to the second visual content being an image, using a text-image similarity model to determine a similarity score between the second visual content and the third generation requirement description information, using an image aesthetics evaluation model to determine an aesthetics score of the second visual content, and obtaining the sum of squares of the differences between corresponding pixels in the second visual content and the third image, where corresponding pixels are pixels with the same coordinate position; determining a comprehensive score based on the similarity score, the aesthetics score, and the sum of squares; and updating the intermediate large model according to the principle of improving the comprehensive score.
The first sample instruction information includes: second generation requirement description information and a second image corresponding to the second generation requirement description information; each piece of first sample thinking information includes: refined requirement description information obtained by refining the second generation requirement description information; step description information for generating the first sample result information; initial result information and optimization description information, wherein the first sample result information is obtained by performing optimization processing corresponding to the optimization description information on the initial result information; M candidate result information and selection reason information corresponding to the first sample instruction information, where M is a positive integer greater than 1, the M candidate result information includes the first sample result information, and the selection reason information is used to explain why the first sample result information is superior to the other candidate result information.
8. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
9. A non-transitory computer-readable storage medium storing computer instructions, wherein, The computer instructions are used to cause the computer to perform the method according to any one of claims 1-5.
10. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the method of any one of claims 1-5.