A method, apparatus, device and storage medium for generating media content
By constructing a media generation model and combining the feature representations of the first and second models with adversarial loss adjustment, the problem of insufficient image quality generated by machine learning models is solved, and high-quality image generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- FACE CUTE CO LTD
- Filing Date
- 2024-12-31
- Publication Date
- 2026-06-30
AI Technical Summary
The image materials generated by existing machine learning models cannot meet people's needs for image quality, especially when the required image materials cannot be collected.
By constructing a media generation model, training prompt words are processed using a first model associated with a first effect and a second model associated with a second effect to generate a first feature representation and a second feature representation. A discriminator is used to determine the adversarial loss, and the parameters of the first model are adjusted to construct a media generation model with both the first and second effects.
It improves the image quality of image materials generated by machine learning models, enabling them to accurately present the details of preset objects while also achieving an aesthetically pleasing effect.
Figure CN122309779A_ABST
Abstract
Description
Technical Field
[0001] The exemplary embodiments disclosed herein generally relate to the field of computers, and particularly to a method, apparatus, device, and computer-readable storage medium for generating media content.
Background Technology
[0002] With the advancement of computing power, machine learning models have been widely applied in various fields, such as image processing. Specifically, machine learning models can be used to generate, process, or enhance images. In some scenarios, when it is impossible to collect the required image materials, machine learning models are often used to generate them.
[0003] However, the image materials generated using machine learning models sometimes fail to meet people's demands for image quality.
Summary of the Invention
[0004] In a first aspect of this disclosure, a method for generating media content is provided. The method includes: acquiring prompt words; and processing the prompt words using a media generation model to generate media content associated with a first effect and a second effect, wherein the media generation model is constructed based on the following process: processing training prompt words using a first model associated with the first effect to generate a first feature representation; processing the training prompt words using a second model associated with the second effect to generate a second feature representation; processing the first and second feature representations using a discriminator to determine an adversarial loss; and adjusting the parameters of the first model based on the adversarial loss to construct the media generation model.
[0005] In a second aspect of this disclosure, an apparatus for generating media content is provided. The apparatus includes: an acquisition module configured to acquire prompt words; and a generation module configured to process the prompt words using a media generation model to generate media content associated with a first effect and a second effect, wherein the media generation model is constructed based on the following process: processing training prompt words using a first model associated with the first effect to generate a first feature representation; processing the training prompt words using a second model associated with the second effect to generate a second feature representation; processing the first and second feature representations using a discriminator to determine an adversarial loss; and adjusting the parameters of the first model based on the adversarial loss to construct the media generation model.
[0006] In a third aspect of this disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. When executed by the at least one processing unit, the instructions cause the device to perform the method of the first aspect.
[0007] In a fourth aspect of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that can be executed by a processor to implement the method of the first aspect.
[0008] It should be understood that the content described in this Summary section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Brief Description of the Drawings
[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:
[0010] Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure may be implemented;
[0011] Figures 2A to 2B show example interfaces according to some embodiments of the present disclosure;
[0012] Figure 3 shows a flowchart of an example process for generating media content according to some embodiments of the present disclosure;
[0013] Figure 4 shows a flowchart of an example process for constructing a media generation model according to some embodiments of the present disclosure;
[0014] Figure 5 shows a flowchart of an example process for constructing a media generation model according to some embodiments of the present disclosure;
[0015] Figure 6 shows a schematic structural block diagram of an example apparatus for generating media content according to some embodiments of the present disclosure; and
[0016] Figure 7 shows a block diagram of an electronic device capable of implementing several embodiments of the present disclosure.
Detailed Description
[0017] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0018] It should be noted that the headings of any section / subsection provided herein are not limiting. Various embodiments are described throughout this document, and embodiments of any type may be included under any section / subsection. Furthermore, embodiments described in any section / subsection may be combined in any way with any other embodiments described in the same section / subsection and / or different sections / subsections.
[0019] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
[0020] The embodiments of this disclosure may involve user data, data acquisition, and / or use. All of these aspects comply with applicable laws, regulations, and relevant provisions. In the embodiments of this disclosure, all data collection, acquisition, processing, manipulation, forwarding, and use are conducted with the user's knowledge and confirmation. Accordingly, in implementing the embodiments of this disclosure, the type, scope of use, and usage scenarios of any data or information that may be involved should be communicated to the user and their authorization obtained in accordance with relevant laws and regulations through appropriate means. The specific methods of notification and / or authorization may vary depending on the actual situation and application scenario, and the scope of this disclosure is not limited in this respect.
[0021] In this specification and the embodiments, any processing of personal information will be carried out only on a lawful basis (such as obtaining the consent of the personal information subject, or being necessary for the performance of a contract), and only within the scope stipulated or agreed upon. A user's refusal to allow the processing of personal information other than that necessary for basic functions will not affect the user's use of basic functions.
[0022] As mentioned above, people often need to collect images of the same type or theme as a basis for operations such as model training. When people cannot collect the required images, or cannot collect enough images, they use machine learning models to generate the necessary images. However, the images generated by machine learning models often fail to meet people's requirements for image quality.
[0023] Embodiments of this disclosure propose a scheme for generating media content. The scheme includes: acquiring prompt words; and processing the prompt words using a media generation model to generate media content associated with a first effect and a second effect. The media generation model is constructed based on the following process: processing training prompt words using a first model associated with the first effect to generate a first feature representation; processing the training prompt words using a second model associated with the second effect to generate a second feature representation; processing the first and second feature representations using a discriminator to determine an adversarial loss; and adjusting the parameters of the first model based on the adversarial loss to construct the media generation model.
[0024] In this manner, embodiments of the present disclosure enable the first model to learn a second effect from the second model while retaining its first effect, thereby constructing a media generation model with both the first and second effects. This allows for an improvement in the image quality of image materials generated using this media generation model.
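The construction process described above can be sketched as a toy training loop. This is only an illustration under strong assumptions, not the disclosed implementation: the scalar parameter `theta`, the functions `first_model`, `second_model`, and `discriminator`, and the squared-difference loss are all hypothetical stand-ins. The discriminator is held fixed for simplicity, and a finite-difference gradient replaces backpropagation.

```python
def first_model(theta, prompt_vec):
    # Hypothetical stand-in: the "first feature representation" is a scaled prompt vector.
    return [theta * x for x in prompt_vec]


def second_model(prompt_vec):
    # Hypothetical stand-in for the fixed second model (its parameters are not adjusted).
    return [2.0 * x for x in prompt_vec]


def discriminator(feature):
    # Hypothetical fixed discriminator: scores a feature by its mean value.
    return sum(feature) / len(feature)


def adversarial_loss(first_feature, second_feature):
    # Assumed loss: squared gap between the two discrimination scores.
    return (discriminator(first_feature) - discriminator(second_feature)) ** 2


def construct_media_generation_model(prompt_vec, theta=0.5, lr=0.1, steps=200, eps=1e-4):
    """Adjust only the first model's parameter so its features resemble the second model's."""
    second_feature = second_model(prompt_vec)
    for _ in range(steps):
        loss_plus = adversarial_loss(first_model(theta + eps, prompt_vec), second_feature)
        loss_minus = adversarial_loss(first_model(theta - eps, prompt_vec), second_feature)
        grad = (loss_plus - loss_minus) / (2 * eps)  # central finite difference
        theta -= lr * grad
    return theta


tuned_theta = construct_media_generation_model([1.0, 1.0])
```

In this toy setting the adjusted parameter approaches the second model's scale factor, which mirrors the idea that the first model's outputs become hard for the discriminator to tell apart from the second model's.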
[0025] The following section provides a detailed description of various example implementations of this scheme, with reference to the accompanying drawings.
[0026] Example Environment
[0027] Figure 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. As shown in Figure 1, example environment 100 may include terminal device 110.
[0028] In this example environment 100, terminal device 110 may run an application 120 that supports the generation of media content. Application 120 may be any suitable type of application for generating media content, examples of which may include, but are not limited to, image processing applications or other suitable applications. User 140 may interact with application 120 via terminal device 110 and / or its attached devices.
[0029] In environment 100 of Figure 1, if application 120 is active, terminal device 110 can use application 120 to present interface 150 for supporting the generation of media content.
[0030] In some embodiments, terminal device 110 communicates with server 130 to provide services to application 120. Terminal device 110 can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, handheld computers, portable gaming terminals, VR / AR devices, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. In some embodiments, terminal device 110 can also support any type of interface for user 140 (such as "wearable" circuitry).
[0031] Server 130 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms. Server 130 may include, for example, computing systems / servers such as mainframes, edge computing nodes, computing devices in a cloud environment, etc. Server 130 can provide backend services for applications 120 in terminal devices 110 that support the generation of media content.
[0032] A communication connection can be established between server 130 and terminal device 110. This communication connection can be established via wired or wireless means. The communication connection may include, but is not limited to, Bluetooth, mobile network, Universal Serial Bus (USB), and Wireless Fidelity (WiFi) connections; the embodiments of this disclosure are not limited in this respect. In the embodiments of this disclosure, server 130 and terminal device 110 can achieve signaling interaction through the communication connection between them.
[0033] It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of this disclosure.
[0034] The following description will continue with reference to the accompanying drawings, which will provide some exemplary embodiments of this disclosure.
[0035] Example Interaction
[0036] Figures 2A to 2B show example interfaces 200A to 200B according to some embodiments of the present disclosure. Interfaces 200A to 200B may, for example, be provided by the terminal device 110 shown in Figure 1.
[0037] As shown in Figure 2A, in some embodiments, application 120 can provide a function for generating media content. As an example, the main interface of application 120 can be configured with a corresponding control. User 140 can invoke the media content generation function of application 120 by clicking the control. Specifically, when terminal device 110 receives operation information indicating that user 140 has clicked the control, terminal device 110 can present interface 200A, which user 140 uses to input prompt words.
[0038] In some embodiments, interface 200A may include an input box for user 140 to input prompt words and a control for generating media content. The input box may display hint text for user 140, such as "Please describe the image you want to generate...". Terminal device 110 may support multiple input methods, such as handwriting input and voice input, to make it easier for user 140 to input prompt words. Additionally, the input box may be configured with a control for voice input, allowing user 140 to input prompt words by voice. The control for generating media content may display the word "Generate".
[0039] Furthermore, after user 140 enters prompt words in the input box and terminal device 110 receives operation information indicating that user 140 has clicked the control for generating media content, terminal device 110 can display interface 200B as shown in Figure 2B. Terminal device 110 can provide the media content by displaying information related to it through interface 200B. As an example, the information related to the media content may be at least one of a preview image of the media content and a download link for the media content.
[0040] As shown in Figure 2B, in some embodiments, interface 200B may include a preview area 210 for user 140 to preview the media content and a control for user 140 to download the media content. The download control may, for example, display the word "Download".
[0041] Additionally, interface 200B may also include a control for regenerating media content. This control may, for example, display the word "Regenerate". When user 140 is dissatisfied with the media content in preview area 210, the control for regenerating media content allows application 120 to regenerate the media content based on the prompt. Specifically, when terminal device 110 receives operation information from user 140 clicking the control for regenerating media content, terminal device 110 causes application 120 to regenerate the media content based on the prompt.
[0042] It should be understood that the media content generation interfaces shown in Figures 2A to 2B are merely examples; other suitable interfaces can also be used to generate and provide media content. The various graphic elements in the interface can have different arrangements and different visual representations, one or more elements can be omitted or replaced, and one or more other elements may also be present. The embodiments of this disclosure are not limited in this respect.
[0043] Example process
[0044] Figure 3 shows a flowchart of an example process 300 for generating media content according to some embodiments of the present disclosure. Process 300 can be implemented at terminal device 110. Process 300 is described below with reference to Figure 1.
[0045] As shown in Figure 3, at block 310, terminal device 110 obtains the prompt word.
[0046] In some embodiments, the prompt word may indicate the image content, image style, etc., of the media content to be generated. The terminal device 110 can obtain the prompt word through an input device communicatively connected to it. The input device may be, for example, a keyboard, a touch screen, or a microphone.
[0047] At block 320, terminal device 110 uses a media generation model to process the prompt word to generate media content. This media content is associated with both a first effect and a second effect.
[0048] In some embodiments, the media content may be an image or video, which may include a preset object. The preset object may be, for example, a person, an animal, a plant, or an object. For media content containing a preset object, both the first effect and the second effect can be effects applied to the preset object. As an example, the first effect may be an effect applied to a component of the preset object. The second effect may be an effect applied to the overall preset object. For a specific example, when the preset object in the media content is a person, the first effect may affect the number and position of the person's facial features, while the second effect may affect the overall aesthetic appeal of the person.
[0049] The specific construction process of the media generation model is described below with reference to Figure 4 and Figure 5. Figure 4 shows a flowchart of an example process 400 for constructing a media generation model according to some embodiments of the present disclosure. Figure 5 shows a flowchart of an example process 500 for constructing a media generation model according to some embodiments of the present disclosure. It should be understood that process 400 and / or process 500 can be performed by suitable electronic devices, such as terminal device 110 or server 130. Process 400 is described below using terminal device 110 as an example.
[0050] At block 410, terminal device 110 uses a first model 550 associated with the first effect to process training prompt words 540 to generate a first feature representation 555.
[0051] In some embodiments, the training prompt words 540 correspond to the prompt words mentioned above and can indicate the image content, image style, etc. of the media content to be generated by the first model 550 or the second model 560.
[0052] In some embodiments, the first model 550 may be a diffusion model that generates images based on text, which can generate media content based on training prompts 540. The media content generated by the first model 550 may include a first effect. For example, the first effect may ensure that the number and position of facial features of a preset object are correct in the generated media content. In this case, the number and position of facial features of the preset object are correct in the media content generated by the first model 550.
[0053] It should be understood that the principle by which the first model 550 generates media content based on the training prompt word 540 is as follows: the first model 550 first generates an initial noise representation 520, and then performs noise reduction processing on the initial noise representation 520 in association with the training prompt word 540, ultimately generating media content corresponding to the training prompt word 540. When training the first model 550, the initial noise representation 520 can also be determined based on adding noise to the training image 510. Specifically, the noise intensity of the initial noise representation 520 can be set by configuring the first model 550, or by setting the noise addition step size. During this process, the clarity of the media content depends on the step size of the noise reduction processing performed by the first model 550.
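The noise-addition step mentioned above can be sketched as follows. This is a minimal illustration that assumes a linear blending schedule between the training image and Gaussian noise; the source does not specify the schedule, and the function name `add_noise`, the flat-list image format, and the default `max_step` of 1000 are hypothetical.

```python
import random


def add_noise(training_image, noise_step, max_step=1000, seed=0):
    """Blend a training image (flat list of pixel values) with Gaussian noise.

    Assumption: noise intensity grows linearly with noise_step / max_step.
    Returns the noisy representation and the noise that was mixed in.
    """
    if not 0 <= noise_step <= max_step:
        raise ValueError("noise_step must lie within [0, max_step]")
    rng = random.Random(seed)
    alpha = noise_step / max_step
    noise = [rng.gauss(0.0, 1.0) for _ in training_image]
    noisy = [(1.0 - alpha) * p + alpha * n for p, n in zip(training_image, noise)]
    return noisy, noise


# With noise_step=0 the image is unchanged; with noise_step=max_step it is pure noise.
initial_noise_representation, mixed_noise = add_noise([0.2, 0.8, 0.5], noise_step=1000)
```

A larger noise addition step gives a noisier starting point, which corresponds to the statement that the noise intensity can be set through the noise addition step size.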
[0054] In some embodiments, the first feature representation 555 can be determined based on the initial noise representation 520. Specifically, based on the above principle, the process by which the terminal device 110 processes the training prompt words 540 using the first model 550 to generate the first feature representation 555 can be as follows: the first model 550 performs denoising on the initial noise representation 520 with a first step size to generate the first feature representation 555. The first step size is determined from a preset step size range 530.
[0055] As mentioned earlier, when the first model 550 performs no denoising on the initial noise representation 520, the initial noise representation 520 remains in its initial state. When the first model 550 performs denoising on the initial noise representation 520 with the maximum step size, the initial noise representation 520 becomes media content. When the first model 550 performs denoising on the initial noise representation 520 with the first step size, the initial noise representation 520 becomes the first feature representation 555. Therefore, it can be understood that when the first step size takes different values within the step size range 530, the first model 550 can generate first feature representations 555 corresponding to different amounts of denoising. The first feature representation 555 can be an image with a certain noise intensity.
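The three regimes described above (no denoising, partial denoising, full denoising) can be illustrated with a toy function. This is a sketch under assumptions: a real diffusion model predicts the clean content iteratively, whereas here `clean_estimate` is simply passed in, and the linear interpolation and the name `partial_denoise` are hypothetical.

```python
def partial_denoise(initial_noise, clean_estimate, step_size, max_step=1000):
    """Move the initial noise representation toward a clean estimate.

    Assumption: denoising progress is linear in step_size / max_step.
    step_size == 0        -> the initial noise representation, unchanged
    step_size == max_step -> the fully denoised media content
    otherwise             -> an intermediate feature representation
    """
    if not 0 <= step_size <= max_step:
        raise ValueError("step_size must lie within [0, max_step]")
    frac = step_size / max_step
    return [n + frac * (c - n) for n, c in zip(initial_noise, clean_estimate)]


noise = [1.0, -1.0]
clean = [0.0, 0.5]
untouched = partial_denoise(noise, clean, 0)          # same as noise
feature = partial_denoise(noise, clean, 500)          # halfway: still noisy
media_content = partial_denoise(noise, clean, 1000)   # same as clean
```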
[0056] In some embodiments, the step size range 530 may depend on the noise intensity of the initial noise representation 520. For example, if the noise intensity of the initial noise representation 520 is 1000, then the step size range 530 may be 0 to 1000, and the first step size may be any value between 0 and 1000. As an example, the first step size may be any value between 50 and 100 to facilitate training of the first model 550.
[0057] Furthermore, once a set of first step sizes is determined, the first model 550 can generate a corresponding set of first feature representations 555.
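Choosing a set of step sizes from the step size range and producing one feature representation per step size could look like the sketch below. The uniform random sampling, the helper names `sample_step_sizes` and `feature_set`, and the lambda used as a denoiser are assumptions for illustration; the source only states that the step size is determined from the preset range.

```python
import random


def sample_step_sizes(step_range, count, seed=0):
    """Draw `count` integer step sizes from an inclusive (low, high) range."""
    low, high = step_range
    if low > high:
        raise ValueError("step_range must be (low, high) with low <= high")
    rng = random.Random(seed)
    return sorted(rng.randint(low, high) for _ in range(count))


def feature_set(initial_noise, step_sizes, denoise):
    """Map each step size to the feature representation produced at that step size."""
    return {step: denoise(initial_noise, step) for step in step_sizes}


# Hypothetical denoiser: shrinks the noise in proportion to the step size (max 1000).
toy_denoise = lambda noise, step: [(1.0 - step / 1000) * n for n in noise]

first_step_sizes = sample_step_sizes((50, 100), count=4)
first_features = feature_set([1.0, -1.0], first_step_sizes, toy_denoise)
```

Keying the result by step size makes it straightforward to match features from two models that were denoised by the same amount.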
[0058] At block 420, terminal device 110 uses a second model 560 associated with the second effect to process the training prompt words 540 to generate a second feature representation 565.
[0059] In some embodiments, the second model 560 may be a diffusion model that implements text-to-image functionality, which can generate media content based on training prompts 540. The media content generated by the second model 560 may include a second effect. For example, the second effect may make a preset object in the generated media content aesthetically pleasing. In this case, the preset object is aesthetically pleasing in the media content generated by the second model 560.
[0060] The principle by which the second model 560 generates media content based on the training prompts 540 is the same as that of the first model 550, and will not be elaborated further here. Similarly, the noise intensity of the initial noise representation 520 can be set by configuring the second model 560. Furthermore, the clarity of the media content depends on the step size of the noise reduction process performed by the second model 560.
[0061] In some embodiments, the second feature representation 565 can be determined based on the initial noise representation 520. Specifically, based on the above principle, the process by which the terminal device 110 processes the training prompt words 540 using the second model 560 to generate the second feature representation 565 can be as follows: the second model 560 performs noise reduction processing on the initial noise representation 520 with a second step size to generate the second feature representation 565. The second step size is determined from a preset step size range 530.
[0062] As mentioned earlier, when the second model 560 performs no denoising on the initial noise representation 520, the initial noise representation 520 remains in its initial state. When the second model 560 performs denoising on the initial noise representation 520 with the maximum step size, the initial noise representation 520 becomes media content. When the second model 560 performs denoising on the initial noise representation 520 with the second step size, the initial noise representation 520 becomes the second feature representation 565. Therefore, it can be understood that when the second step size takes different values within the step size range 530, the second model 560 can generate second feature representations 565 corresponding to different amounts of denoising. The second feature representation 565 can be an image with a certain noise intensity.
[0063] As an example, when the initial noise representation 520 has a noise intensity of 1000, the step size range 530 can be 0 to 1000, and the second step size can be any value between 0 and 1000. For example, the second step size can be any value between 50 and 100 to facilitate training the second model 560.
[0064] Furthermore, once a set of second step sizes is determined, the second model 560 can generate a corresponding set of second feature representations 565.
[0065] In some embodiments, the terminal device 110 may execute the steps in block 410 and the steps in block 420 simultaneously.
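One way to run the two blocks at the same time is with a thread pool, as in the sketch below. The functions `run_block_410` and `run_block_420` are hypothetical placeholders that return labeled tuples instead of real feature representations; the source does not say how simultaneous execution is achieved.

```python
from concurrent.futures import ThreadPoolExecutor


def run_block_410(training_prompt):
    # Placeholder for: first model processes the prompt into a first feature representation.
    return ("first_feature", training_prompt)


def run_block_420(training_prompt):
    # Placeholder for: second model processes the prompt into a second feature representation.
    return ("second_feature", training_prompt)


def run_both_blocks(training_prompt):
    """Submit both blocks to a two-worker pool and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(run_block_410, training_prompt)
        second_future = pool.submit(run_block_420, training_prompt)
        return first_future.result(), second_future.result()


first_feature, second_feature = run_both_blocks("a portrait in soft light")
```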
[0066] At block 430, terminal device 110 uses discriminator 570 to process the first feature representation 555 and the second feature representation 565 to determine an adversarial loss.
[0067] In some embodiments, the terminal device 110 may use the discriminator 570 to process the first feature representation 555 to generate a first discrimination result, and use the discriminator 570 to process the second feature representation 565 to generate a second discrimination result. The first discrimination result indicates whether the first feature representation 555 was generated by the first model 550 or by the second model 560. The second discrimination result indicates whether the second feature representation 565 was generated by the first model 550 or by the second model 560. By configuring the discriminator 570, the terminal device 110 can distinguish between the first feature representation 555 and the second feature representation 565 to assess the degree to which the first model 550 has learned the knowledge of the second model 560.
[0068] In some embodiments, the adversarial loss can indicate whether the first feature representation 555 and the second feature representation 565 are generated by the same model. Specifically, when the first discrimination result is consistent with the second discrimination result, it means that for the discriminator 570, the first feature representation 555 and the second feature representation 565 are generated by the same model, and thus the first feature representation 555 and the second feature representation 565 can be considered similar.
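The idea of consistent discrimination results can be sketched with a toy discriminator. Everything here is an assumption for illustration: the logistic scoring, the weight vector, the 0.5 threshold, and the names `discriminate` and `same_source` are hypothetical, and the source does not describe the discriminator's architecture.

```python
import math


def discriminate(feature, weights, bias=0.0):
    """Return a toy probability that `feature` was generated by the second model."""
    z = sum(w * x for w, x in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def same_source(first_result, second_result, threshold=0.5):
    """True when both discrimination results point to the same generating model."""
    return (first_result >= threshold) == (second_result >= threshold)


weights = [1.0, 1.0]
first_result = discriminate([-2.0, -2.0], weights)   # looks like the first model
second_result = discriminate([2.0, 2.0], weights)    # looks like the second model
consistent = same_source(first_result, second_result)
```

When the two results agree, the discriminator can no longer separate the two feature representations, which is the condition the paragraph above treats as the features being similar.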
[0069] It can be understood that when terminal device 110 trains the first model 550 and the second model 560 using the same training prompts 540, the first model 550 can learn the knowledge of the second model 560, so that the first feature representation 555 or media content generated by the first model 550 presents not only the first effect but also additional effects. As the number of training iterations increases, once the discriminator 570 considers the first feature representation 555 generated by the first model 550 to be similar to the second feature representation 565 generated by the second model 560, the additional effects presented by the first feature representation 555 or media content generated by the first model 550 can be regarded as equivalent to the second effect. At this point, the first model 550 has fully learned the knowledge of the second model 560.
[0070] In some embodiments, the terminal device 110 can determine the adversarial loss based on the difference between the first discrimination result and the second discrimination result. Specifically, the terminal device 110 can determine the difference between the first discrimination result and the second discrimination result by quantifying them.
[0071] As an example, the quantified values of the first discrimination result and the second discrimination result can be determined by the discriminator 570 based on a set of indicators associated with the first feature representation 555 or the second feature representation 565. That is, the terminal device 110 can determine the parameter value corresponding to the first feature representation 555 or the second feature representation 565 based on a set of indicators.
[0072] Specifically, a set of indicators may include, for example, one or more of the following: the initial noise representation 520, the noise addition step size corresponding to the initial noise representation 520, the first step size or the second step size, the first feature representation 555 or the second feature representation 565, and the training prompt words 540. When a set of indicators corresponds to the first feature representation 555, it may include the first step size; when it corresponds to the second feature representation 565, it may include the second step size.
[0073] Taking as an example a set of indicators that includes all of the items listed above, the terminal device 110 can use the initial noise representation 520, the noise addition step size corresponding to the initial noise representation 520, the first step size or the second step size, the first feature representation 555 or the second feature representation 565, and the training prompt words 540 to determine the parameter value.
[0074] As an example, when the terminal device 110 uses the set of indicators to determine a parameter value, it can first preprocess some of the indicators to make the parameter value easier to determine. For example, it can determine a weighted value corresponding to the first feature representation 555 or the second feature representation 565.
[0075] Specifically, the weighted value may include a first part and a second part. A first weighting coefficient corresponding to the first step size or the second step size is applied to the first part, and a second weighting coefficient corresponding to the first step size or the second step size is applied to the second part. The first part may, for example, be the product of the training image 510 and the first weighting coefficient. The second part may, for example, be the product of the second weighting coefficient and a predicted noise representation associated with the initial noise representation 520, the noise addition step size, and the training prompt word 540. The predicted noise representation is the difference between the first feature representation 555 or the second feature representation 565 and the initial noise representation 520. The first and second weighting coefficients can be adjusted adaptively according to actual conditions.
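The two-part weighting described above can be sketched as follows. This is only an illustration under stated assumptions: flat lists of floats stand in for image and noise tensors, the two parts are combined by addition (the text does not state how they are combined), and the names `predicted_noise` and `weighted_value` as well as the coefficients `w1` and `w2` are hypothetical.

```python
from typing import List

Vector = List[float]


def predicted_noise(feature: Vector, initial_noise: Vector) -> Vector:
    """Predicted noise: feature representation minus initial noise representation."""
    return [f - n for f, n in zip(feature, initial_noise)]


def weighted_value(
    training_image: Vector,
    feature: Vector,
    initial_noise: Vector,
    w1: float,
    w2: float,
) -> Vector:
    """Combine the two weighted parts element by element.

    Summing the parts is an assumption made for this sketch.
    """
    noise = predicted_noise(feature, initial_noise)
    first_part = [w1 * x for x in training_image]  # training image times first coefficient
    second_part = [w2 * e for e in noise]          # predicted noise times second coefficient
    return [a + b for a, b in zip(first_part, second_part)]
```

In practice the coefficients would depend on the first or second step size; here they are passed in directly so that the arithmetic stays visible.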
[0076] The terminal device 110 can determine, in the above manner, the parameter value corresponding to the first discrimination result and the parameter value corresponding to the second discrimination result, and can then take the difference between these two parameter values as the difference between the first discrimination result and the second discrimination result.
[0077] It should be understood that a set of first step sizes yields a set of first feature representations 555, and a set of second step sizes yields a set of second feature representations 565. When the terminal device 110 groups a first feature representation 555 and a second feature representation 565 that correspond to a matching first step size and second step size into a feature pair, a set of feature pairs and a corresponding set of differences can be obtained. As an example, the terminal device 110 can take the minimum value in the set of differences as the adversarial loss.
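The pairing and minimum selection described above can be sketched as follows. This is a minimal illustration, assuming that each discrimination result has already been quantified as a single float keyed by its step size and that "difference" means absolute difference; the function name `adversarial_loss` and the dictionary layout are hypothetical.

```python
from typing import Dict


def adversarial_loss(
    first_scores: Dict[int, float],
    second_scores: Dict[int, float],
) -> float:
    """Pair scores that share a step size and return the smallest difference."""
    shared_steps = sorted(set(first_scores) & set(second_scores))
    if not shared_steps:
        raise ValueError("no step size is shared by both score sets")
    # One difference per feature pair; the absolute value is an assumption.
    differences = [abs(first_scores[s] - second_scores[s]) for s in shared_steps]
    return min(differences)
```

For example, with scores `{1: 0.5, 2: 0.75}` for the first model and `{1: 0.25, 2: 0.25}` for the second model, the differences are 0.25 and 0.5, and the function returns 0.25.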
[0078] At block 440, the terminal device 110 adjusts the parameters of the first model 550 based on the adversarial loss to construct the media generation model.
[0079] In some embodiments, the terminal device 110 may adjust the parameters of the first model 550 based on the adversarial loss and the generation loss associated with the first model 550. The generation loss indicates that the first feature representation 555 is considered to be generated by the second model 560.
[0080] As an example, the process by which the terminal device 110 determines the generation loss can be as follows: First, based on a set of indicators corresponding to the first feature representation 555, parameter values are determined; then, based on the parameter values, the generation loss is determined. The parameter values determined by the set of indicators corresponding to the first feature representation 555 can be the parameter values mentioned above. The process of determining these parameter values is consistent with the above process and will not be elaborated further here.
[0081] It should be understood that a set of first step sizes yields a set of first feature representations 555. Based on the above calculation method, the terminal device 110 can determine a set of parameter values corresponding to the set of first feature representations 555. As an example, the terminal device 110 can use the maximum value among the set of parameter values as the generation loss.
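The selection of the generation loss described above can be sketched as follows. This is a minimal illustration, assuming that each parameter value is a single float; the function name `generation_loss` is hypothetical.

```python
from typing import Iterable


def generation_loss(parameter_values: Iterable[float]) -> float:
    """Return the largest parameter value among the first feature representations."""
    values = list(parameter_values)
    if not values:
        raise ValueError("at least one parameter value is required")
    return max(values)
```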
[0082] Furthermore, the terminal device 110 can adjust the parameters of the first model 550 by training to maximize the generation loss and minimize the adversarial loss until training converges. The trained first model 550 not only retains the first effect but also learns the knowledge of the second model 560 and thereby achieves the second effect. Using the trained first model 550, a media generation model can be constructed so that the media content it generates has both the first and second effects, thereby improving the image quality of the generated media content.
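One common way to express "maximize one term while minimizing another" as a single scalar is to subtract the term being maximized from the term being minimized. The sketch below illustrates that idea only; the disclosure does not state how the two losses are combined, and the names `combined_objective`, `generation_weight`, `has_converged`, and `tolerance` are hypothetical.

```python
def combined_objective(
    adversarial_loss: float,
    generation_loss: float,
    generation_weight: float = 1.0,
) -> float:
    """Scalar to minimize: a lower adversarial loss and a higher generation loss both reduce it."""
    return adversarial_loss - generation_weight * generation_loss


def has_converged(previous: float, current: float, tolerance: float = 1e-4) -> bool:
    """Treat training as converged once the objective changes by less than the tolerance."""
    return abs(previous - current) < tolerance
```

A training loop would evaluate `combined_objective` after each parameter update and stop once `has_converged` returns `True` for consecutive evaluations.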
[0083] Example devices and equipment
[0084] Embodiments of this disclosure also provide corresponding apparatuses for implementing the above methods or processes. Figure 6 shows a schematic structural block diagram of an example apparatus 600 for generating media content according to certain embodiments of the present disclosure. Apparatus 600 may be implemented as, or included in, terminal device 110. The various modules/components in apparatus 600 may be implemented by hardware, software, firmware, or any combination thereof.
[0085] As shown in Figure 6, the apparatus 600 includes: an acquisition module 610 configured to acquire prompt words; and a generation module 620 configured to process the prompt words using a media generation model to generate media content associated with a first effect and a second effect, wherein the media generation model is constructed based on the following process: processing training prompt words using a first model associated with the first effect to generate a first feature representation; processing the training prompt words using a second model associated with the second effect to generate a second feature representation; processing the first and second feature representations using a discriminator to determine an adversarial loss; and adjusting the parameters of the first model based on the adversarial loss to construct the media generation model.
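The two-module structure of apparatus 600 can be sketched as follows. This is a minimal illustration, assuming that each module is modeled as a plain callable; the class name `MediaApparatus` and its field names are hypothetical, and the lambda in the usage note merely stands in for a trained media generation model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MediaApparatus:
    """Hypothetical stand-in for apparatus 600 with its two modules."""

    acquire: Callable[[], str]      # role of acquisition module 610: returns prompt words
    generate: Callable[[str], str]  # role of generation module 620: prompt words to media content

    def run(self) -> str:
        """Acquire the prompt words, then pass them to the generation module."""
        prompt = self.acquire()
        return self.generate(prompt)
```

For instance, `MediaApparatus(acquire=lambda: "a red fox", generate=lambda p: f"media for: {p}").run()` passes the acquired prompt words straight to the placeholder generator.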
[0086] In some embodiments, the first model and the second model are diffusion models, and the first feature representation and the second feature representation are also determined based on an initial noise representation, which is determined based on adding noise to the training images.
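Adding noise to a training image to obtain an initial noise representation can be sketched as follows. The disclosure does not give a noising formula, so this sketch assumes the forward-noising rule widely used for diffusion models, in which the image is scaled by the square root of a signal-retention factor and Gaussian noise is scaled by the square root of its complement. The names `add_noise` and `alpha_bar` are hypothetical, and a flat list of floats stands in for an image tensor.

```python
import math
import random
from typing import List


def add_noise(image: List[float], alpha_bar: float, rng: random.Random) -> List[float]:
    """Mix an image with Gaussian noise; alpha_bar in [0, 1] is the share of signal kept."""
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in image]
```

With `alpha_bar` equal to 1.0 the image is returned unchanged, and smaller values yield progressively noisier representations. In a real system the noise addition step size would determine this factor.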
[0087] In some embodiments, the first feature representation is generated by using the first model to perform noise reduction with a first step size on the initial noise representation, and the second feature representation is generated by using the second model to perform noise reduction with a second step size on the initial noise representation.
[0088] In some embodiments, the first step size and the second step size are determined from a preset step size range.
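Determining the two step sizes from a preset range can be sketched as follows. This is a minimal illustration, assuming integer step sizes drawn uniformly and independently from an inclusive range; the disclosure does not specify the selection rule, and the function name `sample_step_sizes` is hypothetical.

```python
import random
from typing import Tuple


def sample_step_sizes(step_range: Tuple[int, int], rng: random.Random) -> Tuple[int, int]:
    """Draw a first and a second step size from the inclusive preset range."""
    low, high = step_range
    if low > high:
        raise ValueError("step_range must be given as (low, high) with low <= high")
    first_step = rng.randint(low, high)   # used by the first model
    second_step = rng.randint(low, high)  # used by the second model
    return first_step, second_step
```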
[0089] In some embodiments, the adversarial loss is also determined by the discriminator based on at least one of the following: the initial noise representation; the noise addition step size corresponding to the initial noise representation; the first step size or the second step size; and the training prompt words.
[0090] In some embodiments, adjusting the parameters of the first model based on the adversarial loss includes adjusting the parameters of the first model based on the adversarial loss and the generation loss associated with the first model.
[0091] In some embodiments, determining the adversarial loss by processing the first feature representation and the second feature representation using a discriminator includes: processing the first feature representation using the discriminator to generate a first discrimination result; processing the second feature representation using the discriminator to generate a second discrimination result; and determining the adversarial loss based on the difference between the first discrimination result and the second discrimination result.
[0092] In some embodiments, the adversarial loss indicates whether the first feature representation and the second feature representation were generated by the same model.
[0093] As shown in Figure 7, electronic device 700 takes the form of a general-purpose electronic device. Components of electronic device 700 may include, but are not limited to, one or more processors or processing units 710, memory 720, storage device 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760. Processing unit 710 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 720. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of electronic device 700.
[0094] Electronic device 700 typically includes multiple computer storage media. Such media can be any available media accessible to electronic device 700, including but not limited to volatile and non-volatile media and removable and non-removable media. Memory 720 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 730 can be removable or non-removable media and can include machine-readable media, such as flash drives, disks, or any other media that can be used to store information and/or data and can be accessed within electronic device 700.
[0095] Electronic device 700 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 7, a disk drive for reading from or writing to a removable, non-volatile disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces. Memory 720 may include computer program product 725 having one or more program modules configured to perform various methods or actions of various embodiments of this disclosure.
[0096] The communication unit 740 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 700 can be implemented using a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the electronic device 700 can operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.
[0097] Input device 750 can be one or more input devices, such as a mouse, keyboard, or trackball. Output device 760 can be one or more output devices, such as a monitor, speaker, or printer. As needed, electronic device 700 can also communicate via communication unit 740 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with electronic device 700, or with any device (e.g., a network card or modem) that enables electronic device 700 to communicate with one or more other electronic devices. Such communication can be performed via an input/output (I/O) interface (not shown).
[0098] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above.
[0099] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
[0100] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium, where they cause a computer, programmable data processing apparatus, and/or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
[0101] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
[0102] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0103] Various implementations of this disclosure have been described above. These descriptions are exemplary, not exhaustive, and are not limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of the implementations, their practical applications, or their technical improvements over technologies in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.
Claims
1. A method for generating media content, comprising: acquiring prompt words; and processing the prompt words using a media generation model to generate media content, the media content being associated with a first effect and a second effect, wherein the media generation model is constructed based on the following process: processing training prompt words using a first model associated with the first effect to generate a first feature representation; processing the training prompt words using a second model associated with the second effect to generate a second feature representation; processing the first feature representation and the second feature representation using a discriminator to determine an adversarial loss; and adjusting parameters of the first model based on the adversarial loss to construct the media generation model.
2. The method according to claim 1, wherein the first model and the second model are diffusion models, and the first feature representation and the second feature representation are further determined based on an initial noise representation, wherein the initial noise representation is determined by adding noise to a training image.
3. The method according to claim 2, wherein the first feature representation is generated by the first model performing noise reduction processing with a first step size on the initial noise representation, and the second feature representation is generated by the second model performing noise reduction processing with a second step size on the initial noise representation.
4. The method according to claim 3, wherein the first step size and the second step size are determined from a preset step size range.
5. The method according to claim 2, wherein the adversarial loss is further determined by the discriminator based on at least one of the following: the initial noise representation; a noise addition step size corresponding to the initial noise representation; the first step size or the second step size; and the training prompt words.
6. The method according to claim 1, wherein adjusting the parameters of the first model based on the adversarial loss comprises: adjusting the parameters of the first model based on the adversarial loss and a generation loss associated with the first model.
7. The method according to claim 1, wherein processing the first feature representation and the second feature representation using the discriminator to determine the adversarial loss comprises: processing the first feature representation using the discriminator to generate a first discrimination result; processing the second feature representation using the discriminator to generate a second discrimination result; and determining the adversarial loss based on a difference between the first discrimination result and the second discrimination result.
8. The method according to claim 1, wherein the adversarial loss indicates whether the first feature representation and the second feature representation are generated by the same model.
9. An apparatus for generating media content, comprising: an acquisition module configured to acquire prompt words; and a generation module configured to process the prompt words using a media generation model to generate media content associated with a first effect and a second effect, wherein the media generation model is constructed based on the following process: processing training prompt words using a first model associated with the first effect to generate a first feature representation; processing the training prompt words using a second model associated with the second effect to generate a second feature representation; processing the first feature representation and the second feature representation using a discriminator to determine an adversarial loss; and adjusting parameters of the first model based on the adversarial loss to construct the media generation model.
10. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the electronic device to perform the method according to any one of claims 1 to 8.
11. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 8.