Visual generation method and apparatus, and device, medium and program product

By combining machine learning models that generate segment descriptions and images or video clips, and by optimizing the prompt configuration, the disclosure addresses unstable generation of miniature-landscape-style content, enabling high-quality visual content generation and an improved user experience.

WO2026123267A1 · PCT designated stage · Publication Date: 2026-06-18 · BEIJING ZITIAO NETWORK TECH CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date: 2024-12-11
Publication Date: 2026-06-18


IPC: G06F16/54

Application Domain (AI tagging): Still image data browsing/visualisation


Abstract

Provided in the embodiments of the present disclosure are a visual generation method and apparatus, and a device, a medium and a program product. The method comprises: in response to receiving a topic input by a user, using a first machine learning model to generate, on the basis of the topic and a first prompt, segment descriptions corresponding to a plurality of segments, wherein each segment description indicates an action of a corresponding object in the respective segment, and the first prompt comprises at least one of the following: an objectivity requirement for the segment description, an object action requirement for the segment description, and a format requirement for the segment description; using a second machine learning model to generate, on the basis of the segment descriptions respectively corresponding to the plurality of segments and a second prompt, images or video clips respectively corresponding to the plurality of segments, wherein the second machine learning model is configured to generate images or video clips on the basis of input text; and merging the images or video clips respectively corresponding to the plurality of segments, so as to generate a target video corresponding to the topic.


Description

Visual generation method and apparatus, device, medium, and program product

Technical Field

[0001] The exemplary embodiments disclosed herein generally relate to the field of computers, and particularly to methods, apparatuses, devices, computer-readable storage media, and computer program products for visual generation.

Background

[0002] With the development of machine learning technology, models trained on large amounts of data can be used to automatically generate various forms of content such as text, images, audio, and video. In practical applications that generate visual content such as images or videos, it is desirable to obtain high-quality visual materials that meet varying expectations more consistently.

Summary of the Invention

[0003] In a first aspect of this disclosure, a method for visual generation is provided. The method includes: in response to receiving a topic input by a user, generating, using a first machine learning model and based on the topic and a first prompt word, segment descriptions respectively corresponding to multiple segments, each segment description indicating the action of a corresponding object within the segment, and the first prompt word including at least one of the following: an objectivity requirement for the segment description, an object action requirement for the segment description, and a format requirement for the segment description; generating, using a second machine learning model and based on the respective segment descriptions and a second prompt word, images or video clips respectively corresponding to the multiple segments, the second machine learning model being configured to generate images or video clips based on input text; and generating a target video corresponding to the topic by merging the images or video clips respectively corresponding to the multiple segments.

[0004] In a second aspect of this disclosure, a method for visual generation is provided. The method includes: receiving a topic input by a user; generating segment descriptions corresponding to multiple segments based on the topic and style-related segment generation requirements, each segment description indicating the action of a corresponding object within the segment; generating image or video segments of a style corresponding to each of the multiple segments based on the segment descriptions corresponding to each segment; and generating a target video corresponding to the topic by merging the image or video segments of a style corresponding to each of the multiple segments.

[0005] In a third aspect of this disclosure, an apparatus for visual generation is provided. The apparatus includes: a segment description generation module configured to, in response to receiving a topic input by a user, generate, using a first machine learning model and based on the topic and a first prompt word, segment descriptions respectively corresponding to multiple segments, each segment description indicating the action of a corresponding object within the segment, and the first prompt word including at least one of the following: an objectivity requirement for the segment description, an object action requirement for the segment description, and a format requirement for the segment description; a visual generation module configured to generate, using a second machine learning model and based on the respective segment descriptions and a second prompt word, images or video clips respectively corresponding to the multiple segments, the second machine learning model being configured to generate images or video clips based on input text; and a target video generation module configured to generate a target video corresponding to the topic by merging the images or video clips respectively corresponding to the multiple segments.

[0006] In a fourth aspect of this disclosure, an apparatus for visual generation is provided. The apparatus includes: a topic receiving module configured to receive a topic input by a user; a segment description generation module configured to generate segment descriptions respectively corresponding to multiple segments based on the topic and style-related segment generation requirements, each segment description indicating the action of a corresponding object within the segment; a visual generation module configured to generate, based on the segment descriptions respectively corresponding to the multiple segments, images or video clips of the style respectively corresponding to the multiple segments; and a target video generation module configured to generate a target video corresponding to the topic by merging the images or video clips of the style respectively corresponding to the multiple segments.

[0007] In a fifth aspect of this disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. When executed by the at least one processing unit, the instructions cause the device to perform the method of the first aspect.

[0008] In a sixth aspect of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that can be executed by a processor to implement the method of the first aspect.

[0009] In a seventh aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the method of the first aspect.

[0010] It should be understood that this Summary is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0011] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

[0012] Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

[0013] Figure 2 shows a flowchart of a process for visual generation according to some embodiments of the present disclosure;

[0014] Figure 3 shows a flowchart of a process for visual generation according to some embodiments of the present disclosure;

[0015] Figure 4 shows a flowchart of a process for visual generation according to some embodiments of the present disclosure;

[0016] Figure 5 shows a block diagram of an apparatus for visual generation according to some embodiments of the present disclosure;

[0017] Figure 6 shows a block diagram of an apparatus for visual generation according to some embodiments of the present disclosure; and

[0018] Figure 7 shows a block diagram of an electronic device capable of implementing one or more embodiments of the present disclosure.

Detailed Description

[0019] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0020] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions may also be included below.

[0021] In this document, unless explicitly stated otherwise, performing a step in response to A does not mean that the step is performed immediately after A, but may include one or more intermediate steps.

[0022] It is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition, use, storage or deletion of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0023] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, relevant users should be informed of the type, scope of use, and usage scenarios of the information involved in this disclosure through appropriate means in accordance with relevant laws and regulations, and authorization should be obtained from the relevant users. Among them, relevant users may include any type of rights holder, such as individuals, enterprises, and groups.

[0024] For example, in response to receiving an active request from a user, a prompt message is sent to the relevant user to clearly inform the user that the requested operation will require obtaining and using the user's information, thereby enabling the relevant user to choose whether to provide information to the software or hardware such as the electronic device, application, server, or storage medium that performs the operation of the technical solution disclosed herein based on the prompt message.

[0025] As an optional but non-restrictive implementation, in response to a user's active request, a prompt message can be sent to the user, such as a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide information to the electronic device.

[0026] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0027] As used herein, the term "model" refers to a model that learns the relationship between inputs and outputs from training data, enabling it to generate corresponding outputs for a given input after training. Models can be generated based on machine learning techniques. Deep learning is a class of machine learning algorithms that process inputs and provide corresponding outputs using multiple layers of processing units. A neural network model is an example of a deep learning-based model. In this document, "model" may also be referred to as a "machine learning model," "learning model," "machine learning network," or "learning network," and these terms are used interchangeably.

[0028] A neural network is a machine learning network based on deep learning. A neural network processes an input and provides a corresponding output, and typically consists of an input layer, an output layer, and one or more hidden layers between them. Neural networks used in deep learning applications often include many hidden layers, thus increasing the network's depth. The layers of a neural network are connected sequentially, so that the output of the previous layer is provided as the input to the next layer. The input layer receives the input to the neural network, while the output layer's output serves as the final output. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), each node processing the input from the previous layer.
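For illustration only, the layer-by-layer processing described in [0028] can be sketched as follows. The layer sizes, the weights, and the ReLU activation are arbitrary assumptions chosen for the example and are not taken from the disclosure.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(values: Vector) -> Vector:
    """Element-wise ReLU activation (an illustrative choice)."""
    return [max(0.0, v) for v in values]


def dense(inputs: Vector, weights: Matrix, biases: Vector) -> Vector:
    """One layer: each node combines all outputs of the previous layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs: Vector, layers: List[Tuple[Matrix, Vector]]) -> Vector:
    """Feed the output of each layer to the next; hidden layers use ReLU."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # no activation on the output layer
            activations = relu(activations)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 2.0]], [0.1]),                    # output layer
]
```

Calling `forward([2.0, 1.0], LAYERS)` passes the input through the hidden layer and then the output layer, mirroring the sequential connection between layers.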

[0029] Machine learning typically comprises three phases: training, testing, and application (also known as inference). In the training phase, a given model is trained using a large amount of training data, iteratively updating its parameter values until the model can consistently generate inferences that meet the expected goals from the training data. Through training, the model can be considered to have learned the relationship between inputs and outputs (also known as the input-output mapping) from the training data. The parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether it can provide the correct output, thus determining the model's performance. In the application phase, the model can be used to process actual inputs based on the trained parameter values to determine the corresponding output.
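As a minimal sketch of the three phases in [0029], the example below fits a one-variable linear model by gradient descent, checks it on held-out inputs, and then applies it to a new input. The model form, the data, the learning rate, and the epoch count are all assumptions made for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]
Params = Tuple[float, float]


def train(samples: List[Sample], epochs: int = 2000, lr: float = 0.05) -> Params:
    """Training phase: iteratively update parameters (w, b)."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mean_squared_error(params: Params, samples: List[Sample]) -> float:
    """Testing phase: measure error on inputs not used for training."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


def infer(params: Params, x: float) -> float:
    """Application (inference) phase: use the fixed, trained parameters."""
    w, b = params
    return w * x + b


# Hypothetical data following y = 2x + 1.
TRAIN_DATA = [(float(x), 2.0 * x + 1.0) for x in range(5)]
TEST_DATA = [(5.0, 11.0), (6.0, 13.0)]
PARAMS = train(TRAIN_DATA)
```

After training, `mean_squared_error(PARAMS, TEST_DATA)` plays the role of the testing phase and `infer(PARAMS, x)` the application phase.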

[0030] Figure 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. Example environment 100 may include an electronic device 110. Electronic device 110 may run an application 115 that supports the generation of visual content. User 140 may interact with application 115 via electronic device 110 and / or its attached devices.

[0031] In embodiments of this disclosure, application 115 may have intelligent dialogue and task processing capabilities. Typically, application 115 allows user 140 to input questions in natural language and performs tasks based on its understanding of the natural language input and its logical reasoning ability, generating corresponding visual content stories, such as target video 130. For example, application 115 may support text dialogue services, voice dialogue services, and content dialogue in other modalities with user 140.

[0032] In some embodiments, some hardware and software devices of electronic device 110 may work with application 115 to perform tasks so that application 115 can provide user 140 with at least a response about visual content based on the operation of such hardware and software devices.

[0033] In embodiments of this disclosure, application 115 may also be another application capable of providing interactive capabilities. For example, application 115 may provide interactive capabilities other than intelligent dialogue. In such embodiments, some hardware and software devices of electronic device 110 may perform tasks based on electronic device 110's perception of the behavior of user 140 and/or the interactive operations of user 140 on electronic device 110, without relying on application 115.

[0034] In some embodiments, electronic device 110 or its application 115 may utilize one or more machine learning models, such as machine learning model 120-1, machine learning model 120-2, ..., machine learning model 120-N, where N is a positive integer, to support interaction with user 140. For ease of description, the one or more machine learning models are collectively referred to herein as machine learning model 120. For example, electronic device 110 or its application 115 may utilize one or more machine learning models 120 to enable certain hardware or software devices of electronic device 110 to start working in order to perform tasks.

[0035] In some embodiments, electronic device 110 can communicate with a server device to provide services to application 115 and support the operation of some hardware and software devices of electronic device 110. For example, the server device can invoke machine learning model 120 and, based on its output, support the operation of some hardware and software devices of electronic device 110.

[0036] Electronic device 110 can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio broadcast receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. In some embodiments, electronic device 110 may also support any type of user-facing interface (such as "wearable" circuitry).

[0037] The server device can be any of various types of computing systems or servers capable of providing computing power, including but not limited to mainframes, edge computing nodes, and computing devices in cloud environments. The server device can, for example, be implemented in a cloud environment.

[0038] It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of this disclosure.

[0039] As briefly mentioned earlier, current machine learning techniques can generate a large number of visual materials in different styles and forms, providing diverse creative options. However, in the application field of visual data generation, miniature landscape images have unique structures and details, making it relatively difficult to generate miniature landscape images from real photographs.

[0040] Specifically, miniature landscapes typically contain highly detailed features and unique proportions, all of which need to be accurately preserved during the generation process. Furthermore, images of miniature landscapes are relatively scarce in existing public datasets, so image generation models are prone to instability and inaccuracy in rendering image style, scene structure, and image content when generating miniature landscape images.

[0041] Therefore, when generating images or videos using AI-generated content (AIGC) technology, parameters must be iterated continuously to obtain images that sufficiently match the miniature landscape style. The goal is to stably produce high-quality miniature landscape materials based on AIGC technology.

[0042] In view of this, embodiments of the present disclosure propose an improved scheme for visual generation. According to various embodiments of the present disclosure, in response to receiving a topic input by a user, a first machine learning model is used to generate, based on the topic and a first prompt word, segment descriptions respectively corresponding to multiple segments. Each segment description indicates the action of a corresponding object within the segment. The first prompt word includes at least one of the following: an objectivity requirement for the segment description, an object action requirement for the segment description, and a format requirement for the segment description. A second machine learning model is used to generate, based on the respective segment descriptions and a second prompt word, images or video clips respectively corresponding to the multiple segments. The second machine learning model is configured to generate images or video clips based on input text. Furthermore, by merging the images or video clips respectively corresponding to the multiple segments, a target video corresponding to the topic is generated.

[0043] In this way, by combining machine learning models to explore specific prompts and gradually adjusting and optimizing model selection and prompt configuration, relatively stable image or video generation results can be obtained. This enables the stable production of a series of high-quality, coherent visual content stories based on the input theme, improving the user experience.

[0044] It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of this disclosure.

[0045] The following description will continue with reference to the accompanying drawings, which will provide some exemplary embodiments of this disclosure.

[0046] Figure 2 illustrates a flowchart of a process 200 for visual generation according to some embodiments of the present disclosure. For ease of discussion, these embodiments will be described with reference to the environment 100 of Figure 1. These embodiments may be implemented in the electronic device 110 of Figure 1.

[0047] As shown in Figure 2, electronic device 110 receives topic 210 input by user 140. In some embodiments, electronic device 110 supports interaction with user 140 to receive topic 210 input by user 140. In such embodiments, electronic device 110 may have an application 115 installed that supports user 140 inputting topic 210 in natural language. In some embodiments, the input topic 210 may be a story topic. For example, application 115 may support text input, voice input, and other modal input from user 140. It should be understood that these methods of electronic device 110 receiving input from user 140 are merely examples and are not intended to be limiting.

[0048] In some embodiments, topic 210 may include, for example, the setting, style, plot, protagonist, etc., of a story, without limitation. As an example, topic 210 may be a topic for visual content, such as images or videos, about miniature landscapes. The visual content of a miniature landscape can refer to images or videos that depict scenes that already exist or have been described in the real world, scaled down proportionally. The miniature landscapes discussed herein may also be referred to as the miniature style. For example, topic 210 about miniature landscapes could be "A group of people traveling in [location]", "A group of people in a colorful [port]", etc. Generally, the content length of topic 210 is relatively short. In some embodiments, the content length of topic 210 may be within a preset length range. It should be understood that the topic must comply with the requirements of relevant laws, regulations, and related provisions.

[0049] Furthermore, in response to receiving topic 210 from user 140, electronic device 110 uses a first machine learning model to generate, based on topic 210 and the first prompt word, segment descriptions respectively corresponding to multiple segments (220). In this document, the segments generated based on topic 210 and the first prompt word can also be referred to as storyboards. Each segment description indicates the action of the corresponding object within the segment. The first machine learning model can be constructed based on a language model. The first machine learning model at least supports outputting natural language text based on input natural language text, i.e., it has text generation capabilities. However, it should be noted that the embodiments of this disclosure do not limit the scalability of the first machine learning model. Specific embodiments regarding the first prompt word and segment description will be discussed in detail below.
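One way to picture step 220 is sketched below: a first prompt word is assembled from whichever of the three requirement types are present, combined with the topic, and passed to a text generation model. The class and function names, the prompt wording, and the one-description-per-line output format are hypothetical; the language model is represented by a plain callable rather than any real API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FirstPrompt:
    """Hypothetical container for the three optional requirement types."""
    objectivity: str = ""
    object_action: str = ""
    output_format: str = ""

    def render(self, topic: str, num_segments: int) -> str:
        requirements = [
            r for r in (self.objectivity, self.object_action, self.output_format) if r
        ]
        if not requirements:
            # The first prompt word includes at least one requirement.
            raise ValueError("the first prompt needs at least one requirement")
        lines = [
            f"Topic: {topic}",
            f"Write {num_segments} segment descriptions, one per line.",
        ]
        lines += [f"- {r}" for r in requirements]
        return "\n".join(lines)


def generate_segment_descriptions(
    topic: str,
    prompt: FirstPrompt,
    language_model: Callable[[str], str],
    num_segments: int = 4,
) -> List[str]:
    """Ask the (assumed) text model for one description per segment."""
    text = language_model(prompt.render(topic, num_segments))
    return [line.strip() for line in text.splitlines() if line.strip()]


def fake_language_model(request: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "(people board a ferry)\n\n(people walk along the pier)\n"
```

Passing the model as a callable keeps the sketch independent of any particular language model, in line with the statement that the first model only needs text generation capability.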

[0050] In applications related to specific styles, electronic device 110 generates segment descriptions respectively corresponding to multiple segments based on a topic and style-related segment generation requirements. For example, in miniature-style applications, segment descriptions respectively corresponding to multiple segments can be generated based on a topic and segment generation requirements related to the miniature style. In some embodiments, when generating these segment descriptions, electronic device 110 can utilize a first machine learning model to generate them based on the topic and a first prompt word. Here, the first prompt word can indicate the style-related segment generation requirements.

[0051] Furthermore, the electronic device 110 utilizes a second machine learning model to generate, based on the respective segment descriptions and a second prompt word, images or video clips respectively corresponding to the multiple segments (230). The second machine learning model is configured to generate images or video clips based on input text. The second machine learning model can be constructed based on an image generation model or a video generation model. The second machine learning model is at least capable of generating images or video clips based on input text, i.e., it has text-to-image generation capability. As an example, the second machine learning model can be constructed based on a model structure with strong image generation capabilities, such as a diffusion model.

[0052] Furthermore, based on the segment descriptions respectively corresponding to the multiple segments, the electronic device 110 generates images or video clips of the style respectively corresponding to the multiple segments.

[0053] In some embodiments, when generating the images or video clips of the style respectively corresponding to the multiple segments, the electronic device 110 can utilize a second machine learning model to generate the images or video clips based on the respective segment descriptions and a second prompt word. In miniature-style applications, miniature images or miniature video clips respectively corresponding to the multiple segments can be generated based on the segment descriptions and the second prompt word. Here, the second prompt word can indicate style-related image or video generation requirements. Specific embodiments regarding the second prompt word will be discussed in detail below.
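The flow of Figure 2 described so far (segment descriptions at 220, images or video clips at 230, followed by merging into a target video) can be sketched as a simple chain. This is a sketch under stated assumptions: both models and the merge step are stand-in callables, and the way the second prompt word is prepended to each description is an assumed convention, not the disclosed one.

```python
from typing import Callable, List


def generate_target_video(
    topic: str,
    first_prompt: str,
    second_prompt: str,
    text_model: Callable[[str], str],
    visual_model: Callable[[str], str],
    merge: Callable[[List[str]], str],
) -> str:
    """Chain the three steps: describe segments, render them, merge them."""
    # Step 220: the first model turns topic + first prompt word into
    # one segment description per line (an assumed output format).
    raw = text_model(f"{first_prompt}\nTopic: {topic}")
    descriptions = [line.strip() for line in raw.splitlines() if line.strip()]

    # Step 230: the second model renders each description, with the
    # second prompt word prepended to carry the style requirement.
    clips = [visual_model(f"{second_prompt}, {d}") for d in descriptions]

    # Final step: merge the per-segment clips in order.
    return merge(clips)


# Stand-ins so the sketch runs without any real model.
def fake_text_model(prompt: str) -> str:
    return "(people board a ferry)\n(people walk along the pier)\n"


def fake_visual_model(text: str) -> str:
    return f"<clip:{text}>"


def fake_merge(clips: List[str]) -> str:
    return "+".join(clips)
```

In a real system the stand-ins would be replaced by a language model, a text-to-image or text-to-video model, and a video concatenation routine.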

[0054] The first prompt word may include at least an objectivity requirement for the segment description. In miniature-style applications, the first prompt word may include an objectivity requirement for segment descriptions related to the miniature style. In some embodiments, the objectivity requirement may indicate a requirement for the sentence structure of the segment description. As an example, the sentence structure requirement may include a requirement to use declarative sentences. For example, it may be required to describe the segment objectively using declarative sentences from a third-person perspective (i.e., an observer's perspective). Additionally, it may be required to describe each segment concisely. In other examples, the sentence structure requirement may also include a requirement not to use interrogative sentences, or other sentence structure requirements, without limitation.

[0055] Alternatively or additionally, in some embodiments, the objectivity requirement for the segment description may indicate requirements for the use of adjectives in the segment description. As an example, requirements for adjective use may include requirements regarding the number of adjectives used, or other requirements. For example, it may be required that the segment description not contain too many adjectives, or that the number of adjectives be limited to a certain range. In other embodiments, the objectivity requirement may also indicate requirements for the use of other word classes in the segment description, such as verbs and nouns. Accordingly, requirements for the use of other word classes may also include requirements regarding the number of such words used, or other requirements.

[0056] Therefore, through the above example requirements on sentence structure and word class usage, segment descriptions can be made more objective, which helps prevent the final output from deviating from the topic.

[0057] Alternatively or additionally, in some embodiments, the objectivity requirement for the segment description may also indicate content requirements for each of the multiple segments. As an example, the content of each segment may be required to be independent, meaning that each segment must have unique content that is not identical to the content of any other segment. This can enrich the generated visual content and enhance user engagement.
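The objectivity requirements in [0054] to [0057] are expressed as prompt instructions, but they can also be pictured as checks on the returned descriptions. The sketch below flags interrogative sentences, too many adjectives, and duplicated content. The adjective lexicon, the threshold, and the use of a question mark as the interrogative test are simplifying assumptions for illustration; the disclosure does not specify such a checker.

```python
import re
from typing import List, Tuple

# Hypothetical, tiny adjective lexicon; a real system might use a POS tagger.
ADJECTIVES = {"beautiful", "colorful", "tiny", "bright", "lovely", "busy"}


def count_adjectives(description: str) -> int:
    """Count words that appear in the illustrative adjective lexicon."""
    words = re.findall(r"[a-z']+", description.lower())
    return sum(1 for word in words if word in ADJECTIVES)


def check_objectivity(
    descriptions: List[str], max_adjectives: int = 2
) -> List[Tuple[int, str]]:
    """Return (segment index, problem) pairs for descriptions that fail a check."""
    problems: List[Tuple[int, str]] = []
    seen = set()
    for index, description in enumerate(descriptions):
        normalized = " ".join(description.lower().split())
        if "?" in description:
            problems.append((index, "interrogative sentence"))
        if count_adjectives(description) > max_adjectives:
            problems.append((index, "too many adjectives"))
        if normalized in seen:
            problems.append((index, "duplicate content"))
        seen.add(normalized)
    return problems
```

A checker of this kind could be used to decide whether to regenerate a description, although whether any such post-check is performed is not stated in the source.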

[0058] Alternatively or additionally, the first prompt word may include at least an object action requirement for the segment description. In miniature-style applications, the first prompt word may include an object action requirement for segment descriptions related to the miniature style. The object can be any suitable type of character, such as a person, animal, or cartoon character. The object action requirement may include, for example, a requirement that only human actions or activities be described, or a requirement that only animal actions or activities be described. The object action requirement may also include a requirement to describe both human and animal actions, and so on. As an example, for miniature-style visual content, the object action requirement may include a requirement to describe only human actions. In this way, small human figures can better highlight the characteristics of the miniature image. Thus, by imposing an object action requirement on the segment description, the generated visual content can be made more vivid and lifelike.

[0059] Alternatively or additionally, the first prompt word may include at least a format requirement for the segment description. In some embodiments, the format requirement for the segment description may be determined based at least on the input requirements of the second machine learning model. Since the generated segment description can be input into the second machine learning model, the format of the segment description may be required to conform to the input format expected by the second machine learning model.

[0060] In some embodiments, the format requirements for fragment descriptions may indicate that predefined operators are used to identify descriptive information about object actions within the fragment description. Predefined operators can be any suitable type of operator that conforms to the input requirements of the second machine learning model. The type of operator may include, for example, logical operators, unary operators, conditional operators, etc., without limitation. Exemplarily, predefined operators may include brackets, such as parentheses, square brackets, angle brackets, etc. In some embodiments, one or more predefined operators may be used to identify descriptive information about object actions. In some embodiments, for each fragment description, the same or different predefined operators may be used to identify descriptive information about different object actions. For example, for two different object actions, "multiple children playing games" and "a group of people having a picnic," the corresponding descriptive information may both be identified by parentheses, or may be identified by parentheses and square brackets respectively.

[0061] In some embodiments, when the segment description corresponding to each segment is input into the second machine learning model, the second machine learning model can be configured to assign a first reference weight to the descriptive information identified by a predetermined operator, wherein the first reference weight is higher than the reference weight assigned to other descriptive information in the segment description. That is, a higher reference weight can be assigned to the descriptive information identified by the predetermined operator. This means that when generating image or video segments, the second machine learning model will pay more attention to the descriptive information identified by the predetermined operator to ensure that the generated image or video segments can more accurately represent this descriptive information.
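As an illustration of how operator-identified descriptive information might receive a higher reference weight, the sketch below splits a segment description into spans and attaches a weight to each. It assumes parentheses are the predetermined operator and that they are not nested; the weight values `EMPHASIS_WEIGHT` and `DEFAULT_WEIGHT` and the function name are hypothetical, since the disclosure does not specify how the second machine learning model applies the weights internally.

```python
import re

EMPHASIS_WEIGHT = 1.1  # hypothetical "first reference weight"
DEFAULT_WEIGHT = 1.0   # hypothetical weight for all other descriptive information


def split_weighted_spans(description):
    """Split a description into (text, weight) spans.

    Text enclosed in parentheses gets EMPHASIS_WEIGHT; everything else gets
    DEFAULT_WEIGHT. Nested parentheses are not handled in this sketch.
    """
    spans = []
    # With a capturing group, re.split keeps the parenthesized text at odd indices.
    parts = re.split(r"\(([^()]*)\)", description)
    for index, part in enumerate(parts):
        text = part.strip()
        if not text:
            continue
        weight = EMPHASIS_WEIGHT if index % 2 == 1 else DEFAULT_WEIGHT
        spans.append((text, weight))
    return spans
```

Here the only property taken from the text is that the weight of the identified span is higher than that of the remaining spans; how a given text-to-image model consumes such weights is model-specific.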

[0062] Therefore, for descriptions that the second machine learning model perceives less effectively, if the goal is to emphasize these descriptions in the generated visual content, predefined operators can be added to the descriptive information in the first prompt. For example, if the second machine learning model is constructed from a miniature landscape drawing model, it may not be very sensitive to object descriptions. Therefore, as discussed above, predefined operators can be used in the fragment descriptions to identify descriptive information about object actions. This makes the generated image or video clips more likely to meet user needs.

[0063] Alternatively or additionally, in some embodiments, the format requirements for the fragment description may specify the language used in the fragment description. The required language may conform to the language characteristics of the second machine learning model regarding the input portion. For example, the fragment description may be required to be in English, or in another language such as Chinese. The language used for the fragment description can be set according to actual needs and is not limited herein.

[0064] Alternatively or additionally, in some embodiments, the format requirements for the fragment description may indicate the data format of the fragment description. The required data format may be consistent with the data format characteristics of the second machine learning model regarding the input portion. The specific data format is not limited herein; the data format may, for example, be JSON format, CSV format, or another format. Taking JSON format as an example, if the multiple fragments comprise four fragments, the data format of the fragment description may, for example, be {"1":"...","2":"...","3":"...","4":"..."}.
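A JSON-formatted output of this kind can be validated before it is passed on. The following is a minimal sketch, assuming the first machine learning model returns a JSON object keyed "1" through "N"; the function name `parse_fragment_descriptions` and the specific checks are illustrative assumptions.

```python
import json


def parse_fragment_descriptions(raw, expected_count):
    """Parse a JSON object keyed "1".."N" into an ordered list of descriptions."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("fragment descriptions must be a JSON object")
    expected_keys = [str(i) for i in range(1, expected_count + 1)]
    if set(data) != set(expected_keys):
        raise ValueError(f"expected keys {expected_keys}, got {sorted(data)}")
    ordered = [data[key] for key in expected_keys]
    if not all(isinstance(text, str) and text.strip() for text in ordered):
        raise ValueError("every fragment description must be a non-empty string")
    return ordered
```

Rejecting malformed output early makes it possible to retry the first model rather than feed an incomplete set of descriptions to the second model.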

[0065] It should be understood that the format requirements for fragment descriptions are not limited to those discussed in the above embodiments. Furthermore, the format requirements for fragment descriptions are not limited to those conforming to the input portion of the second machine learning model. In other embodiments, there may be other requirements for the format of fragment descriptions.

[0066] In some embodiments, the first prompt may further include an object setting for generating fragment descriptions. In some embodiments, the object setting for the first machine learning model when generating fragment descriptions may indicate style-related (e.g., miniature style) story generation. As an example, the object setting may indicate that theme 210 is style-related, or it may indicate the number of fragments for multiple segments, or other style-related story generation information. For example, the object setting may specifically indicate: "You are a miniature landscape feature film director. Given a theme, write a story around this theme, requiring four storyboards involving coherent scenes." The first machine learning model may generate fragment descriptions corresponding to four segments based at least on theme 210 and such an object setting.

[0067] Alternatively or additionally, in some embodiments, the first prompt may also include a logical relationship for generating the fragment description. In some embodiments, such a logical relationship may be a chain of thought to be followed. In some embodiments, the logical relationship for generating the fragment description may at least indicate that the scenes corresponding to multiple fragments are related. For example, the story scenes should be coherent across the multiple fragments. Alternatively or additionally, the logical relationship for generating the fragment description may also indicate that each fragment has a corresponding environmental time and scene subject. The environmental time may be, for example, morning, evening, or other times, or modern, ancient, or other eras, etc., without limitation. The scene subject may include, for example, any suitable scene subject such as buildings, natural landscapes, city streets, rural fields, etc. Alternatively or additionally, the logical relationship for generating the fragment description may also indicate the length of the descriptive information for each fragment description. For example, it may indicate "simplify the description of each fragment, condensing the descriptive information into one sentence".

[0068] Alternatively or additionally, in some embodiments, the first prompt word may also include restrictions on the fragment description. These restrictions may include any appropriate limitations on what occurs in the story, such as restrictions on the scene subjects, the actions of objects, or other aspects. For example, a restriction could be "The scene contains people and the environment (scenery, architecture), the actions are limited to people, and animals should not appear."

[0069] Alternatively or additionally, in some embodiments, the first prompt word may also include an example topic and an example fragment description corresponding to the example topic. Taking Chinese as the language of the fragment description as an example, the following are examples of an example topic and an example fragment description corresponding to the example topic:

[0070] Input: A group of people are active in a log cabin in the European countryside.

[0071] Output:

[0072] {

[0073] "1": "In the morning, (a group of people were packing their backpacks) and discussing routes in a field with Scottish-style log cabins as a backdrop.",

[0074] "2": "In the morning, a group of people are taking pictures among wildflowers, with open fields and distant mountains as the background.",

[0075] "3": "At noon, by the stream (a group of people were having a picnic), sunlight filtered through the leaves, casting dappled shadows.",

[0076] "4": "At night, in front of the cabin (a group of people lit a campfire) and sat together telling stories."

[0077] }

[0078] Therefore, by using a first prompt word that includes the object setting, logical relationships, restrictions, and examples for fragment descriptions, more accurate fragment descriptions can be generated, which is beneficial for consistently generating visual content that conforms to the theme. For miniature-style visual content stories, this is conducive to presenting more details, thereby improving the quality of the generated miniature landscape materials.
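To make the composition of the first prompt concrete, the sketch below assembles the components discussed above (object setting, logical relationships, restrictions, and an example topic with its example fragment descriptions) into a single text prompt. The layout of the prompt, the function name `build_first_prompt`, and the sample topic are assumptions for illustration; the quoted component texts are taken from the examples in this description.

```python
import json


def build_first_prompt(topic, object_setting, logical_relationships,
                       restrictions, example_topic, example_descriptions):
    """Assemble a first prompt from its components (hypothetical layout)."""
    lines = [object_setting, "", "Logical relationships:"]
    lines += [f"- {item}" for item in logical_relationships]
    lines += ["", "Restrictions:"]
    lines += [f"- {item}" for item in restrictions]
    lines += [
        "",
        "Example input: " + example_topic,
        "Example output: " + json.dumps(example_descriptions, ensure_ascii=False),
        "",
        "Input: " + topic,
        "Output:",
    ]
    return "\n".join(lines)


prompt = build_first_prompt(
    topic="A night market by the river",  # hypothetical user topic
    object_setting=(
        "You are a miniature landscape feature film director. Given a theme, "
        "write a story around this theme, requiring four storyboards "
        "involving coherent scenes."
    ),
    logical_relationships=[
        "The scenes of the fragments are related and coherent.",
        "Each fragment has an environmental time and a scene subject.",
        "Condense the descriptive information of each fragment into one sentence.",
    ],
    restrictions=[
        "The scene contains people and the environment (scenery, architecture), "
        "the actions are limited to people, and animals should not appear.",
    ],
    example_topic="A group of people are active in a log cabin in the European countryside.",
    example_descriptions={
        "4": "At night, in front of the cabin (a group of people lit a campfire) "
             "and sat together telling stories.",
    },
)
```

Keeping each component as a separate argument makes it straightforward to adjust one part, for example the restrictions, without rewriting the rest of the prompt.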

[0079] Regarding the second prompt word, in some embodiments, the second prompt word may indicate style-related image or video generation requirements. For example, in some embodiments, the second prompt word may indicate the image quality requirements for each of multiple segments corresponding to different image or video segments. The image quality requirements may be style-related (e.g., miniature style). As an example, the image quality requirements may include one or more first keywords related to image quality improvement. The first keywords may include, for example, "best quality," "soft colors," "8K," etc., without limitation.

[0080] Alternatively or additionally, in some embodiments, the second prompt word may indicate the style requirements for the respective image or video segments corresponding to multiple segments. The style requirements may be related to a miniature style. As an example, the style requirements may include one or more second keywords related to style. Second keywords may include, for example, "tilt-shift photography," "macro photography," "miniature model," etc., without limitation. In some embodiments, the second prompt word may include a phrase formed by concatenating one or more first keywords with one or more second keywords.
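The concatenation of quality and style keywords can be sketched as follows. The keywords are the ones listed above; joining them to the fragment description with commas, and the function name `build_image_prompt`, are assumptions about how the text input of the second machine learning model might be formed.

```python
def build_image_prompt(fragment_description, quality_keywords, style_keywords):
    """Join a fragment description with first (quality) and second (style) keywords."""
    parts = [fragment_description.strip()]
    parts += list(quality_keywords)  # e.g. "best quality", "soft colors", "8K"
    parts += list(style_keywords)    # e.g. "tilt-shift photography", "miniature model"
    return ", ".join(part for part in parts if part)
```

For example, `build_image_prompt("(a group of people lit a campfire)", ["best quality", "8K"], ["tilt-shift photography"])` yields a single comma-separated text prompt in which the operator-identified action comes first.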

[0081] Referring again to Figure 2, in some embodiments, when generating image or video segments (230) corresponding to multiple segments, for a target segment among the multiple segments, the electronic device 110 can utilize a second machine learning model to generate an image (232) corresponding to the target segment based on the segment description and a second prompt word. Based on the image corresponding to the target segment, the electronic device 110 can generate a video segment (235) corresponding to the target segment. The animation amplitude parameter value of the video segment can be less than a preset parameter value. As an example, such a video segment can be a micro-motion video, or micro-motion animation.

[0082] In such an embodiment, the second machine learning model can be constructed based on an image generation model, capable of generating images based on text. As an example, the second machine learning model can be constructed based on a model structure with powerful image generation capabilities, such as a diffusion model. Furthermore, for the generated image corresponding to the target segment, the electronic device 110 can utilize a tool or machine learning model with the function of generating video from the image to generate a corresponding video segment (235).

[0083] In other embodiments, the second machine learning model may be constructed based on an image or video generation model, capable of generating video segments directly based on text. Here, the electronic device 110 may utilize such a second machine learning model to generate video segments (232) corresponding to each of the plurality of segments based on the segment description and second prompt words corresponding to each of the plurality of segments respectively.

[0084] Furthermore, the electronic device 110 generates a target video 130 corresponding to the theme by merging the image or video segments (240) corresponding to each of the multiple segments. Here, the electronic device 110 can generate the target video 130 by merging the images corresponding to each of the multiple segments. Alternatively, the electronic device 110 can generate the target video 130 by merging the video segments corresponding to each of the multiple segments. In such cases, the video segments to be merged can be obtained through the above two methods of generating video segments (i.e., the step in box 232 or the step combining box 232 and box 235).

[0085] In specific style applications, the electronic device 110 generates a target video corresponding to a theme by merging the image or video clips of the style corresponding to each of the multiple clips. In some embodiments, when generating a target video corresponding to a theme, the electronic device 110 can add specified media content to the merged image or video clips corresponding to the multiple clips to generate the target video. In the application of a miniature style, the target video can be a miniature video. Specified media content can include, but is not limited to, background music, video end credits, and artistic lettering. This makes the generated target video richer and more interesting.

[0086] In summary, in this embodiment of the disclosure, by combining machine learning models to explore drawing prompts for specific stylized themes, and by gradually adjusting and optimizing the model selection and drawing prompt configuration, a relatively stable visual content generation effect can be obtained. This enables the stable production of a series of high-quality, coherent images or videos based on the input theme, thereby improving the user experience.

[0087] Figure 3 shows a flowchart of a process 300 for vision generation according to some embodiments of the present disclosure. Process 300 can be implemented at electronic device 110. Process 300 will now be described with reference to Figure 1.

[0088] As shown in Figure 3, in box 310, in response to receiving a topic input by the user, electronic device 110 uses a first machine learning model to generate fragment descriptions corresponding to multiple fragments based on the topic and a first prompt word. Each fragment description indicates the action of the corresponding object in the fragment. The first prompt word includes at least one of the following: an objectivity requirement for the fragment description, an action requirement for the object in the fragment description, and a format requirement for the fragment description.

[0089] In box 320, electronic device 110 uses a second machine learning model to generate image or video clips corresponding to multiple segments based on the segment descriptions and second prompt words corresponding to each segment. The second machine learning model is configured to generate image or video clips based on the input text.

[0090] In box 330, electronic device 110 generates a target video corresponding to a theme by merging the image or video segments corresponding to multiple segments.
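The three boxes of process 300 can be sketched as a small pipeline. The sketch below is not an implementation of any particular model: the first model, the second model, and the merging step are passed in as plain callables, and the stub functions (`fake_describe`, `fake_render`, `fake_merge`) are hypothetical placeholders that only make the control flow runnable.

```python
from typing import Callable, Dict, List


def generate_target_video(
    topic: str,
    first_prompt: str,
    second_prompt: str,
    describe: Callable[[str, str], Dict[str, str]],
    render: Callable[[str, str], str],
    merge: Callable[[List[str]], str],
) -> str:
    """Run boxes 310, 320 and 330 in order with injected components."""
    # Box 310: the first model turns the topic into per-fragment descriptions.
    descriptions = describe(topic, first_prompt)
    # Box 320: the second model renders one image or clip per fragment, in key order.
    ordered_keys = sorted(descriptions, key=int)
    clips = [render(descriptions[key], second_prompt) for key in ordered_keys]
    # Box 330: the clips are merged into the target video.
    return merge(clips)


# Hypothetical stand-ins for the two models and the merge step.
def fake_describe(topic, first_prompt):
    return {"2": f"{topic}, scene two", "1": f"{topic}, scene one"}


def fake_render(description, second_prompt):
    return f"clip[{description} | {second_prompt}]"


def fake_merge(clips):
    return " + ".join(clips)


result = generate_target_video(
    "cabin", "first prompt", "miniature model",
    fake_describe, fake_render, fake_merge,
)
```

Injecting the components keeps the ordering logic testable without access to real text or image generation models.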

[0091] In some embodiments, the first prompt word may further include at least one of the following: object settings for generating fragment descriptions, logical relationships for generating fragment descriptions, constraints on fragment descriptions, example topics, and example fragment descriptions corresponding to example topics.

[0092] In some embodiments, the format requirements for fragment descriptions are determined based at least on the input requirements of a second machine learning model.

[0093] In some embodiments, the format requirements for a fragment description indicate at least one of the following: using predetermined operators to identify descriptive information about object actions in the fragment description, the language used in the fragment description, or the data format of the fragment description.

[0094] In some embodiments, when the fragment description corresponding to each fragment is input into the second machine learning model, the second machine learning model is configured to assign a first reference weight to the description information identified by a predetermined operator, wherein the first reference weight is higher than the reference weight assigned to other description information in the fragment description.

[0095] In some embodiments, the objectivity requirements for the fragment description indicate: requirements for the sentence structure of the fragment description, and requirements for the use of adjectives in the fragment description.

[0096] In some embodiments, the object setting for the first machine learning model when generating fragment descriptions indicates style-related story generation.

[0097] In some embodiments, generating image or video clips corresponding to multiple segments includes: for a target segment among multiple segments, using a second machine learning model to generate an image corresponding to the target segment based on the segment description and a second prompt word corresponding to the target segment; and generating a video clip corresponding to the target segment based on the image corresponding to the target segment, wherein the animation amplitude parameter value of the video clip is less than a preset parameter value.

[0098] In some embodiments, the second prompt word indicates style-related image or video generation requirements.

[0099] In some embodiments, the second prompt indicates at least one of the following: the image quality requirements for the respective image or video segments of the plurality of segments, and the style requirements for the respective image or video segments of the plurality of segments.

[0100] Figure 4 shows a flowchart of a process 400 for vision generation according to some embodiments of the present disclosure. Process 400 can be implemented at electronic device 110. Process 400 will now be described with reference to Figure 1.

[0101] As shown in Figure 4, in box 410, electronic device 110 receives a topic input by the user.

[0102] In box 420, electronic device 110 generates fragment descriptions corresponding to multiple fragments based on the theme and style-related fragment generation requirements, with each fragment description indicating the action of the corresponding object in the fragment.

[0103] In box 430, electronic device 110 generates image or video clips of a style corresponding to each of the multiple clips based on the clip descriptions corresponding to each of the multiple clips.

[0104] In box 440, electronic device 110 generates a target video corresponding to the theme by merging the image or video clips of the style corresponding to each of the multiple clips.

[0105] In some embodiments, generating fragment descriptions corresponding to multiple fragments includes: using a first machine learning model to generate fragment descriptions corresponding to multiple fragments based on a topic and a first prompt word, wherein the first prompt word indicates style-related fragment generation requirements.

[0106] In some embodiments, generating image or video clips of a style corresponding to each of the multiple segments includes: using a second machine learning model to generate image or video clips corresponding to each of the multiple segments based on segment descriptions and second prompt words, wherein the second prompt words indicate style-related image or video generation requirements.

[0107] In some embodiments, generating a target video corresponding to a theme includes adding specified media content to the image or video segments corresponding to each of the merged multiple segments to generate the target video.

Figure 5 shows a schematic structural block diagram of an apparatus 500 for visual generation according to certain embodiments of the present disclosure. The apparatus 500 may be implemented as or included in an electronic device 110. The various modules/components in the apparatus 500 may be implemented by hardware, software, firmware, or any combination thereof.

As shown in Figure 5, the device 500 includes: a segment description generation module, configured to, in response to receiving a topic input by a user, generate segment descriptions corresponding to multiple segments based on the topic and a first prompt word using a first machine learning model, wherein each segment description indicates the action of a corresponding object within the segment, and the first prompt word includes at least one of the following: an objectivity requirement for the segment description, an action requirement for the object in the segment description, and a format requirement for the segment description; a visual generation module, configured to, using a second machine learning model, generate image or video segments corresponding to each of the multiple segments based on their respective segment descriptions and a second prompt word, wherein the second machine learning model is configured to generate image or video segments based on input text; and a target video generation module, configured to generate a target video corresponding to the topic by merging the image or video segments corresponding to each of the multiple segments.

[0108] In some embodiments, the first prompt word may further include at least one of the following: object settings for generating fragment descriptions, logical relationships for generating fragment descriptions, constraints on fragment descriptions, example topics, and example fragment descriptions corresponding to example topics.

[0109] In some embodiments, the format requirements for fragment descriptions are determined based at least on the input requirements of a second machine learning model.

[0110] In some embodiments, the format requirements for a fragment description indicate at least one of the following: using predetermined operators to identify descriptive information about object actions in the fragment description, the language used in the fragment description, or the data format of the fragment description.

[0111] In some embodiments, when the fragment description corresponding to each fragment is input into the second machine learning model, the second machine learning model is configured to assign a first reference weight to the description information identified by a predetermined operator, wherein the first reference weight is higher than the reference weight assigned to other description information in the fragment description.

[0112] In some embodiments, the objectivity requirements for the fragment description indicate: requirements for the sentence structure of the fragment description, and requirements for the use of adjectives in the fragment description.

[0113] In some embodiments, the object setting for the first machine learning model when generating fragment descriptions indicates style-related story generation.

[0114] In some embodiments, the visual generation module 520 is further configured to, for a target segment among multiple segments, use a second machine learning model to generate an image corresponding to the target segment based on the segment description and a second prompt word corresponding to the target segment; and generate a video segment corresponding to the target segment based on the image corresponding to the target segment, wherein the animation amplitude parameter value of the video segment is less than a preset parameter value.

[0115] In some embodiments, the second prompt word indicates style-related image or video generation requirements.

[0116] In some embodiments, the second prompt indicates at least one of the following: the image quality requirements for the respective image or video segments of the plurality of segments, and the style requirements for the respective image or video segments of the plurality of segments.

[0117] Figure 6 shows a schematic structural block diagram of a vision generation apparatus 600 according to certain embodiments of the present disclosure. The apparatus 600 may be implemented as or included in an electronic device 110. The various modules/components in the apparatus 600 may be implemented by hardware, software, firmware, or any combination thereof.

[0118] As shown in Figure 6, the device 600 includes a topic receiving module configured to receive a topic input by a user; a fragment description generation module configured to generate fragment descriptions corresponding to multiple fragments based on the topic and style-related fragment generation requirements, each fragment description indicating the action of a corresponding object in the fragment; a visual generation module configured to generate image or video fragments of a style corresponding to each of the multiple fragments based on the fragment descriptions corresponding to each of the multiple fragments; and a target video generation module configured to generate a target video corresponding to the topic by merging the image or video fragments of a style corresponding to each of the multiple fragments.

[0119] In some embodiments, the fragment description generation module 620 is further configured to use a first machine learning model to generate fragment descriptions corresponding to multiple fragments based on a topic and a first prompt word, wherein the first prompt word indicates style-related fragment generation requirements.

[0120] In some embodiments, the visual generation module 630 is further configured to utilize a second machine learning model to generate image or video clips corresponding to multiple segments based on segment descriptions and second prompt words, respectively, wherein the second prompt words indicate style-related image or video generation requirements.

[0121] In some embodiments, the target video generation module 640 is further configured to add specified media content to the merged image or video segments corresponding to the multiple segments to generate the target video.

[0122] The units and/or modules included in device 500 or device 600 can be implemented in various ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more units and/or modules can be implemented using software and/or firmware, such as machine-executable instructions stored on a storage medium. In addition to or as an alternative to machine-executable instructions, some or all of the units and/or modules in device 500 can be implemented at least partially by one or more hardware logic components. By way of example and not limitation, exemplary types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-chips (SoCs), complex programmable logic devices (CPLDs), and so on.

[0123] Figure 7 illustrates a block diagram of an electronic device 700 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 700 shown in Figure 7 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The electronic device 700 shown in Figure 7 can be used to implement the electronic device 110 of Figure 1, the device 500 of Figure 5, or the device 600 of Figure 6.

[0124] As shown in Figure 7, the electronic device 700 is in the form of a general-purpose electronic device. Components of the electronic device 700 may include, but are not limited to, one or more processors or processing units 710, memory 720, storage devices 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760. The processing unit 710 may be a physical or virtual processor and is capable of performing various processes according to programs stored in the memory 720. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of the electronic device 700.

[0125] Electronic device 700 typically includes multiple computer storage media. Such media can be any available media accessible to electronic device 700, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 720 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 730 can be a removable or non-removable medium and can include machine-readable media, such as flash drives, disks, or any other media that can be used to store information and/or data (e.g., training data for training) and can be accessed within electronic device 700.

[0126] Electronic device 700 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 7, disk drives for reading from or writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks may be provided. In these cases, each drive may be connected to a bus (not shown) via one or more data media interfaces. Memory 720 may include computer program product 725 having one or more program modules configured to perform various methods or actions of various embodiments of the present disclosure.

[0127] The communication unit 740 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 700 can be implemented using a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the electronic device 700 can operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.

[0128] Input device 750 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 760 can be one or more output devices, such as a monitor, speaker, printer, etc. As needed, electronic device 700 can also communicate via communication unit 740 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with electronic device 700, or with any device (e.g., a network card, a modem, etc.) that enables electronic device 700 to communicate with one or more other electronic devices. Such communication can be performed via an input/output (I/O) interface (not shown).

[0129] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above.

[0130] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0131] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and/or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0132] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0133] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0134] Various implementations of this disclosure have been described above. These descriptions are exemplary, not exhaustive, and are not limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles of the implementations, their practical applications, or their improvements over technologies in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.

Claims

1. A method for visual generation, comprising: in response to receiving a topic input by a user, using a first machine learning model to generate, based on the topic and a first prompt word, segment descriptions respectively corresponding to a plurality of segments, wherein each segment description indicates an action of a corresponding object in the respective segment, and the first prompt word comprises at least one of the following: an objectivity requirement for the segment description, an object action requirement for the segment description, or a format requirement for the segment description; using a second machine learning model to generate, based on the segment descriptions respectively corresponding to the plurality of segments and a second prompt word, images or video clips respectively corresponding to the plurality of segments, wherein the second machine learning model is configured to generate images or video clips based on input text; and merging the images or video clips respectively corresponding to the plurality of segments to generate a target video corresponding to the topic.
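As a reading aid only, the three steps recited in claim 1 can be sketched as follows. This is a minimal sketch, not the claimed implementation: the names `Clip`, `generate_target_video`, `fake_describe`, and `fake_render` are hypothetical, both machine learning models are replaced by stub callables, and "merging" is reduced to ordering the clips by segment index.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    segment_index: int
    description: str
    payload: str  # stand-in for real image or video data


def generate_target_video(
    topic: str,
    first_prompt_word: str,
    second_prompt_word: str,
    describe: Callable[[str, str], List[str]],
    render: Callable[[str, str], str],
) -> List[Clip]:
    """Run the three steps with caller-supplied model callables (assumed interfaces)."""
    # Step 1: the first model turns the topic and first prompt word
    # into one description per segment.
    descriptions = describe(topic, first_prompt_word)
    # Step 2: the second model renders each description under the second prompt word.
    clips = [
        Clip(index, text, render(text, second_prompt_word))
        for index, text in enumerate(descriptions)
    ]
    # Step 3: merge; here merging is only ordered concatenation of the clips.
    return sorted(clips, key=lambda clip: clip.segment_index)


# Hypothetical stand-ins for the two models, for illustration only.
def fake_describe(topic: str, first_prompt_word: str) -> List[str]:
    return [f"A figure plants a {topic}", f"A figure waters the {topic}"]


def fake_render(description: str, second_prompt_word: str) -> str:
    return f"<clip: {description} | {second_prompt_word}>"


video = generate_target_video(
    "tree",
    "objective wording, one action per segment",
    "miniature landscape style",
    fake_describe,
    fake_render,
)
for clip in video:
    print(clip.segment_index, clip.payload)
```

Passing the models in as callables keeps the sketch independent of any particular text or image generation service; a real system would replace the stubs and the final ordering step with actual model calls and video composition.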

2. The method of claim 1, wherein the first prompt word further comprises at least one of the following: an object setting for generating the segment description; a logical relationship for generating the segment description; a restriction on the segment description; or example topics and example segment descriptions corresponding to the example topics.
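One way to picture how the components listed in claims 1 and 2 might be assembled into a single first prompt word is sketched below. This is an illustrative assumption, not the disclosed format: the class `FirstPromptWord`, its field names, the ordering of the parts, and the example strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FirstPromptWord:
    # Core requirements: at least one of these three must be present.
    objectivity: Optional[str] = None
    object_action: Optional[str] = None
    output_format: Optional[str] = None
    # Optional additions.
    object_setting: Optional[str] = None
    logical_relationship: Optional[str] = None
    restriction: Optional[str] = None
    examples: List[Tuple[str, str]] = field(default_factory=list)

    def render(self, topic: str) -> str:
        """Join the populated components and the topic into one prompt string."""
        core = [self.objectivity, self.object_action, self.output_format]
        if not any(core):
            raise ValueError("at least one core requirement is needed")
        # Assumed ordering: setting first, then core requirements, then the rest.
        ordered = [self.object_setting] + core + [self.logical_relationship, self.restriction]
        lines = [part for part in ordered if part]
        for example_topic, example_description in self.examples:
            lines.append(f"Example topic: {example_topic} -> {example_description}")
        lines.append(f"Topic: {topic}")
        return "\n".join(lines)


prompt = FirstPromptWord(
    objectivity="Describe only what is visible.",
    object_setting="You write short scene scripts.",
)
print(prompt.render("harvest"))
```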

3. The method of claim 1, wherein the format requirement for the segment description is determined based at least on an input requirement of the second machine learning model.

4. The method of claim 3, wherein the format requirement for the segment description indicates at least one of the following: a predetermined operator used in the segment description to identify description information about the action of the object; a language used for the segment description; or a data format of the segment description.

5. The method of claim 4, wherein, when the segment description corresponding to each segment is input into the second machine learning model, the second machine learning model is configured to assign a first reference weight to the description information identified by the predetermined operator, the first reference weight being higher than a reference weight assigned to other description information in the segment description.
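The relationship between the predetermined operator of claim 4 and the reference weights of claim 5 can be illustrated with a small parser. This is a sketch under stated assumptions: the claims do not say which operator is used or what the weight values are, so the use of parentheses as the operator, the function name `weight_spans`, and the values 1.5 and 1.0 are hypothetical.

```python
import re
from typing import List, Tuple


def weight_spans(
    description: str,
    marked_weight: float = 1.5,
    default_weight: float = 1.0,
) -> List[Tuple[str, float]]:
    """Split a description into (text, weight) spans.

    Text wrapped in parentheses (the assumed operator) gets the higher weight.
    """
    if marked_weight <= default_weight:
        raise ValueError("marked text must receive the higher weight")
    spans = []
    # re.split with one capturing group places the marked text at odd indices.
    parts = re.split(r"\(([^()]*)\)", description)
    for index, part in enumerate(parts):
        text = part.strip()
        if not text:
            continue
        weight = marked_weight if index % 2 == 1 else default_weight
        spans.append((text, weight))
    return spans


spans = weight_spans("a tiny worker (climbs a ladder) beside a teacup")
print(spans)
```

In this sketch the action phrase is the only span carrying the higher weight, which mirrors the claim's requirement that operator-identified description information outweighs the rest of the segment description.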

6. The method of claim 1, wherein the objectivity requirement for the segment description indicates: a sentence structure requirement for the segment description; a requirement for the use of adjectives in the segment description.

7. The method of claim 2, wherein the object setting, used by the first machine learning model when generating the segment description, indicates style-related story generation.

8. The method of claim 1, wherein generating the images or video clips respectively corresponding to the plurality of segments comprises: for a target segment among the plurality of segments, using the second machine learning model to generate an image corresponding to the target segment based on the segment description corresponding to the target segment and the second prompt word; and generating, based on the image corresponding to the target segment, a video clip corresponding to the target segment, wherein an animation amplitude parameter value of the video clip is less than a preset parameter value.
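The constraint in claim 8, that the animation amplitude parameter value stays below a preset value when a video clip is generated from an image, can be sketched as a simple validation and clamping step. Everything concrete here is an assumption for illustration: the preset of 0.5, the safety margin, and the names `VideoRequest` and `image_to_video_request` do not come from the disclosure, and no real image-to-video model is called.

```python
from dataclasses import dataclass

PRESET_AMPLITUDE = 0.5  # hypothetical preset parameter value


@dataclass(frozen=True)
class VideoRequest:
    image_ref: str
    animation_amplitude: float

    def __post_init__(self) -> None:
        # Enforce the constraint: the amplitude must be strictly below the preset.
        if not 0.0 <= self.animation_amplitude < PRESET_AMPLITUDE:
            raise ValueError("animation amplitude must be in [0, preset)")


def image_to_video_request(
    image_ref: str,
    requested_amplitude: float,
    margin: float = 0.125,
) -> VideoRequest:
    """Build a request whose amplitude is clamped below the preset value."""
    if not 0.0 < margin <= PRESET_AMPLITUDE:
        raise ValueError("margin must be positive and no larger than the preset")
    ceiling = PRESET_AMPLITUDE - margin
    amplitude = min(max(requested_amplitude, 0.0), ceiling)
    return VideoRequest(image_ref, amplitude)


request = image_to_video_request("segment-0-image", requested_amplitude=0.9)
print(request)
```

Clamping rather than rejecting an oversized request is a design choice of this sketch; a system could equally refuse the request or fix the amplitude to a constant.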

9. The method of claim 1, wherein the second prompt word indicates an image or video generation requirement related to the style.

10. The method of claim 1, wherein the second prompt word indicates at least one of the following: an image quality requirement for the images or video clips respectively corresponding to the plurality of segments; or a style requirement for the images or video clips respectively corresponding to the plurality of segments.

11. A method for visual generation, comprising: receiving a topic input by a user; generating, based on the topic and a segment generation requirement related to a style, segment descriptions respectively corresponding to a plurality of segments, wherein each segment description indicates an action of a corresponding object in the respective segment; generating, based on the segment descriptions respectively corresponding to the plurality of segments, images or video clips of the style respectively corresponding to the plurality of segments; and merging the images or video clips of the style respectively corresponding to the plurality of segments to generate a target video corresponding to the topic.

12. The method of claim 11, wherein generating the segment descriptions respectively corresponding to the plurality of segments comprises: using a first machine learning model to generate the segment descriptions based on the topic and a first prompt word, wherein the first prompt word indicates the segment generation requirement related to the style.

13. The method of claim 11, wherein generating the images or video clips of the style respectively corresponding to the plurality of segments comprises: using a second machine learning model to generate, based on the segment descriptions and a second prompt word, the images or video clips respectively corresponding to the plurality of segments, wherein the second prompt word indicates an image or video generation requirement related to the style.

14. The method of claim 11, wherein generating the target video corresponding to the topic comprises: generating the target video by adding specified media content to the merged images or video clips respectively corresponding to the plurality of segments.

15. An apparatus for visual generation, comprising: a segment description generation module configured to, in response to receiving a topic input by a user, use a first machine learning model to generate, based on the topic and a first prompt word, segment descriptions respectively corresponding to a plurality of segments, wherein each segment description indicates an action of a corresponding object in the respective segment, and the first prompt word comprises at least one of the following: an objectivity requirement for the segment description, an object action requirement for the segment description, or a format requirement for the segment description; a visual generation module configured to use a second machine learning model to generate, based on the segment descriptions respectively corresponding to the plurality of segments and a second prompt word, images or video clips respectively corresponding to the plurality of segments, wherein the second machine learning model is configured to generate images or video clips based on input text; and a target video generation module configured to merge the images or video clips respectively corresponding to the plurality of segments to generate a target video corresponding to the topic.

16. An apparatus for visual generation, comprising: a topic receiving module configured to receive a topic input by a user; a segment description generation module configured to generate, based on the topic and a segment generation requirement related to a style, segment descriptions respectively corresponding to a plurality of segments, wherein each segment description indicates an action of a corresponding object in the respective segment; a visual generation module configured to generate, based on the segment descriptions respectively corresponding to the plurality of segments, images or video clips of the style respectively corresponding to the plurality of segments; and a target video generation module configured to merge the images or video clips of the style respectively corresponding to the plurality of segments to generate a target video corresponding to the topic.

17. An electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to any one of claims 1 to 10 or 11 to 14.

18. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 10 or 11 to 14.

19. A computer program product tangibly stored in a computer storage medium and comprising computer-executable instructions that, when executed by a device, cause the device to perform the method according to any one of claims 1 to 10 or 11 to 14.