Apparatus and method for generating video on basis of prompt
The method addresses the challenge of generating accurate and natural videos from non-expert prompts by converting user input into detailed frame-by-frame information, classifying attributes, and calculating bounding box coordinates to enhance video generation accuracy.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- INDUSTRY UNIVERSITY COOPERATION FOUNDATION HANYANG UNIVERSITY
- Filing Date
- 2025-12-09
- Publication Date
- 2026-06-18
Abstract
Description
Prompt-based video generation device and method
[0001] The present invention relates to a video generation device and method, and more specifically, to a device and method for generating video based on a prompt.
[0002]
[0003] Various methods for generating videos based on prompts (text) entered by the user are being studied, and various types of video generation neural networks are generating videos based on prompts.
[0004] However, user input prompts have the problem of being abbreviated and failing to fully reflect the user's intent in text; consequently, there are clear limitations to generating the video intended by the user using prompts that do not fully reflect the intended meaning.
[0005] For videos to be accurately generated through neural networks, the specific locations and interactions of subjects and objects must be clearly defined; however, since the text input by general users does not include such definitions, prompt-based video generation fails to meet user requirements.
[0006] In particular, when generating videos using prompts entered by users who lack sufficient knowledge of prompt creation, the unnaturalness of the video becomes even more pronounced when creating long videos.
[0007] Furthermore, if the object's behavior is not clearly defined in the prompt, the relationship between the subject and the object and its attributes are left undefined, so the attributes of that relationship may be reversed or omitted.
[0008]
[0009] Therefore, a video generation device and method are required that can generate the video intended by the user even through a prompt written by a non-expert.
[0010]
[0011] The present invention proposes an apparatus and method capable of generating a video that reflects the user's intent based on a prompt written by a non-expert user.
[0012] The present invention proposes an apparatus and method capable of generating a video in which the interaction between a subject and an object is clearly displayed.
[0013]
[0014] According to one aspect of the present invention, a prompt-based video generation method is provided, comprising the steps of: converting a prompt input by a user and inputting it into an LLM, and obtaining frame-by-frame text and frame-by-frame graph information for the prompt input by the user from the LLM (a); generating a scene graph using the frame-by-frame text information (b); classifying each attribute of the scene graph into a subject, an object, and an interaction (c); obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information (d); calculating bounding box information of the interaction using the bounding box coordinate information of the subject and the bounding box coordinate information of the object (e); inputting each attribute of the scene graph into a graph neural network to obtain an embedding vector for each attribute (f); synthesizing the features of the embedding vector for each attribute and the bounding box coordinates of each attribute to generate a synthetic token (g); and inputting the synthetic token into a video generation neural network to generate a video (h).
[0015] Step (a) above converts the prompt entered by the user to include a detailed request prompt requesting frame-by-frame text and frame-by-frame graph information, and an example prompt showing examples of frame-by-frame text and frame-by-frame graphs.
[0016] The above-mentioned frame-by-frame graph information includes object information contained in the above-mentioned frame-by-frame text and bounding box coordinate information of each object.
[0017] Step (d) determines the object as a subject or an object based on the classification result of Step (c), and then obtains the subject bounding box coordinate information and the object bounding box coordinate information.
[0018] Step (e) determines the larger of the left coordinates of the upper-left corners of the subject bounding box and the object bounding box as the left coordinate of the upper-left corner of the interaction bounding box, determines the smaller of the upper coordinates of those upper-left corners as the upper coordinate of the upper-left corner of the interaction bounding box, determines the smaller of the right coordinates of the lower-right corners of the subject bounding box and the object bounding box as the right coordinate of the lower-right corner of the interaction bounding box, and determines the larger of the lower coordinates of those lower-right corners as the lower coordinate of the lower-right corner of the interaction bounding box.
[0019] Step (e) swaps the left coordinate and the right coordinate if the left coordinate of the interaction bounding box is determined to be larger than the right coordinate, and swaps the lower coordinate and the upper coordinate if the lower coordinate of the interaction bounding box is determined to be larger than the upper coordinate.
[0020] The features of the bounding box coordinates in step (g) are obtained by Fourier mapping of the bounding box coordinates.
[0021] According to another aspect of the present invention, a prompt-based video generation device is provided, comprising: a processor; and a memory connected to the processor, wherein the processor performs: a step (a) of converting a prompt entered by a user, inputting the converted prompt into an LLM, and obtaining frame-by-frame text and frame-by-frame graph information for the prompt entered by the user from the LLM; a step (b) of generating a scene graph using the frame-by-frame text information; a step (c) of classifying each attribute of the scene graph into a subject, an object, and an interaction; a step (d) of obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information; a step (e) of calculating bounding box information of the interaction using the bounding box coordinate information of the subject and the bounding box coordinate information of the object; a step (f) of inputting each attribute of the scene graph into a graph neural network to obtain an embedding vector for each attribute; a step (g) of generating a synthetic token by synthesizing the embedding vector for each attribute and the features of the bounding box coordinates of each attribute; and a step (h) of generating a video by inputting the synthetic token into a video generation neural network.
[0022]
[0023] According to the present invention, there is an advantage in that a video reflecting the user's intention can be generated based on a prompt written by a non-expert user, and a video in which the interaction between the subject and the object is clearly displayed can be generated.
[0024]
[0025] FIG. 1 is a block diagram illustrating the overall structure of a prompt-based video generation device according to one embodiment of the present invention.
[0026] FIG. 2 is a diagram showing an example of a conversion prompt converted in a prompt conversion module according to one embodiment of the present invention.
[0027] FIG. 3 is a diagram showing an example of information output by an LLM when a converted prompt according to an embodiment of the present invention is input to the LLM.
[0028] FIG. 4 is a block diagram illustrating the structure of a graph generation module according to one embodiment of the present invention.
[0029] FIG. 5 is a diagram showing an example of a scene graph generated by a scene graph generation module according to an embodiment of the present invention.
[0030] FIG. 6 is a drawing showing an example of interaction bounding boxes set according to an embodiment of the present invention.
[0031] FIG. 7 is a block diagram showing the detailed structure of a bounding box coordinate acquisition module according to one embodiment of the present invention.
[0032] FIG. 8 is a diagram showing the operation of a feature synthesis module according to an embodiment of the present invention.
[0033] FIG. 9 is a flowchart illustrating the overall flow of a prompt-based video generation method according to an embodiment of the present invention.
[0034]
[0035] Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to facilitate a comprehensive understanding of the methods, devices, and/or systems described herein. However, this is merely illustrative and the present invention is not limited thereto.
[0036] In describing the embodiments of the present invention, if it is determined that a detailed description of known technology related to the present invention may unnecessarily obscure the essence of the embodiments, such detailed description will be omitted. Furthermore, the terms described below are defined in consideration of their functions in the present invention, and these may vary depending on the intentions or practices of the user or operator. Therefore, such definitions should be based on the content throughout this specification. Terms used in the detailed description are intended merely to describe specific embodiments and should not be limiting. Unless explicitly stated otherwise, expressions in the singular form include the meaning of the plural form. In this description, expressions such as “include” or “comprising” are intended to refer to certain characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof, and should not be interpreted to exclude the existence or possibility of one or more other characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof other than those described.
[0037] FIG. 1 is a block diagram illustrating the overall structure of a prompt-based video generation device according to one embodiment of the present invention.
[0038] Referring to FIG. 1, a prompt-based video generation device according to one embodiment of the present invention includes a prompt conversion module (100), a graph generation module (200), a bounding box coordinate acquisition module (300), a graph neural network (400), a feature synthesis module (500), and a video generation neural network (600). In addition, the present invention utilizes an existing Large Language Model (LLM, 1000) and uses information output through the LLM (1000) to acquire detailed descriptive information regarding a prompt input by a user, subject and object information included in the detailed descriptive information, and bounding box information thereof.
[0039] The prompt conversion module (100) converts a prompt entered by a user into a prompt that requests, for the user's prompt, frame-by-frame text information and graph information for each frame (the objects forming the nodes of the graph and the bounding box location information of those objects).
[0040] A video consists of multiple frames, and information regarding each frame is required to generate an accurate and natural video. However, the prompts entered by the user are mostly simple sentences. To utilize the inference function of the LLM, the present invention converts the prompt in a prompt conversion module (100) to obtain frame-specific text information and graph information of each frame (object information forming the nodes of the graph and bounding box location information of the object) from the LLM.
[0041] The prompt conversion module (100) converts the prompt entered by the user by adding a detailed request prompt and an example prompt to it. The detailed request prompt and the example prompt are predefined; the prompt conversion module (100) adds these pre-prepared prompts to the prompt entered by the user, and the converted prompt is input into the LLM.
[0042] The LLM outputs frame-by-frame text information and frame-by-frame graph information through inference based on the detailed request prompt and the example prompt included in the converted prompt.
[0043] FIG. 2 is a diagram showing an example of a conversion prompt converted in a prompt conversion module according to one embodiment of the present invention.
[0044] Referring to FIG. 2, when the sentence “A person is directing an airplane” is entered as input, a detailed request prompt and an example prompt are added in the prompt conversion module (100) as shown in FIG. 2.
[0045] The detailed request prompt describes to the LLM constraints on frame-specific text information, frame object information, and their bounding box information.
[0046] In addition, the example prompt illustrates the output format expected from the LLM by providing example Input and Output pairs.
[0047] The LLM (1000) outputs frame-by-frame text information and frame-by-frame graph information according to the detailed requirements and examples included in the converted prompt, and the subsequent processing of the present invention takes this frame-by-frame text information and frame-by-frame graph information as its starting point.
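The prompt conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wording of `DETAILED_REQUEST_PROMPT` and `EXAMPLE_PROMPT`, the function name `convert_prompt`, and the ordering of the three parts are hypothetical, since the actual predefined prompts appear only in FIG. 2.

```python
# Hypothetical wording: the real predefined prompts are shown only in FIG. 2.
DETAILED_REQUEST_PROMPT = (
    "Split the scene described by the input into frames. For every frame, "
    "write one descriptive sentence, list the objects it mentions, and give "
    "each object a bounding box as [left, top, right, bottom]."
)

# Hypothetical example prompt showing the expected Input/Output layout.
EXAMPLE_PROMPT = (
    "Input: A dog chases a ball\n"
    "Output: <frame-by-frame text and per-frame objects with bounding boxes>"
)


def convert_prompt(user_prompt: str) -> str:
    """Add the predefined request and example prompts to the user's prompt."""
    cleaned = user_prompt.strip()
    if not cleaned:
        raise ValueError("user prompt must not be empty")
    # The order (request, example, user input) is an assumption of this sketch.
    return "\n\n".join(
        [DETAILED_REQUEST_PROMPT, EXAMPLE_PROMPT, f"Input: {cleaned}\nOutput:"]
    )


converted = convert_prompt("A person is directing an airplane")
```

The resulting string is what would be sent to the LLM (1000); keeping the request and example prompts as constants reflects the statement that they are prepared in advance.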
[0048] FIG. 3 is a diagram showing an example of information output by an LLM when a converted prompt according to an embodiment of the present invention is input into the LLM.
[0049] FIG. 3 illustrates the information output by the LLM when the user's input prompt is "A woman drink water while holding her phone and then picks up a book" and the prompt is converted by the prompt conversion module (100) and input into the LLM.
[0050] The LLM outputs frame-by-frame text information and graph information for each frame (frame object information and coordinate information of each object). Since the converted prompt requests the output of the object information in the frame-by-frame text and the coordinate information of each object, the LLM outputs the object coordinate information in accordance with the request of the converted prompt. Various commercial LLMs can be utilized, and current LLMs possess the reasoning capability to understand the requests made in the prompt and output frame-by-frame text and graph information for each frame.
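As a rough sketch of how the LLM output might be held in memory, the snippet below parses a reply into per-frame records. The JSON layout, the field names (`frames`, `text`, `objects`, `name`, `bbox`), the `FrameInfo` type, and the sample coordinates are all assumptions made for illustration; the actual output format is shown only in FIG. 3.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class FrameInfo:
    text: str                 # frame-by-frame text
    objects: Dict[str, BBox]  # frame-by-frame graph information: object -> box


def parse_llm_output(raw: str) -> List[FrameInfo]:
    """Parse a JSON reply of the assumed shape into per-frame records."""
    frames = []
    for entry in json.loads(raw)["frames"]:
        objects = {}
        for obj in entry["objects"]:
            left, top, right, bottom = (float(v) for v in obj["bbox"])
            objects[obj["name"]] = (left, top, right, bottom)
        frames.append(FrameInfo(text=entry["text"], objects=objects))
    return frames


# Made-up reply for illustration; the values are not taken from FIG. 3.
sample_reply = json.dumps({
    "frames": [
        {
            "text": "A woman drinks water while holding her phone.",
            "objects": [
                {"name": "woman", "bbox": [0.2, 0.9, 0.6, 0.1]},
                {"name": "phone", "bbox": [0.5, 0.6, 0.6, 0.5]},
            ],
        }
    ]
})
frames = parse_llm_output(sample_reply)
```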
[0051] Referring again to FIG. 1, the graph generation module (200) has the function of generating a scene graph using frame-by-frame text information output by the LLM (1000).
[0052] FIG. 4 is a block diagram illustrating the structure of a graph generation module (200) according to one embodiment of the present invention.
[0053] Referring to FIG. 4, the graph generation module (200) includes a frame-by-frame text acquisition module (210), a scene graph generation module (220), and an information classification module (230).
[0054] The frame-by-frame text acquisition module (210) receives frame-by-frame text output from the LLM (1000).
[0055] The scene graph generation module (220) has the function of generating a scene graph for each frame using text for each frame. A scene graph is data that represents a specific image in a graph structure, and various software tools or artificial neural networks exist for generating scene graphs, and a scene graph for text is generated using such tools or neural networks. Since the generation of scene graphs is a widely known technique, a detailed description thereof is omitted. For example, a natural language processing parser may be used as a software tool for generating scene graphs.
[0056] FIG. 5 is a diagram showing an example of a scene graph generated by a scene graph generation module according to an embodiment of the present invention.
[0057] Referring to FIG. 5, the scene graph is configured with objects (person, bottle, phone) included in the text as nodes. Additionally, interactions included in the text are configured as edge attributes. FIG. 5 illustrates an example where the interactions "pick up" and "holding" are configured as edge attributes. The scene graph has objects and interactions as attribute information, and the attribute information of the scene graph later becomes input information for the graph neural network (400).
[0058] The information classification module (230) classifies each attribute of the generated scene graph as a subject, an object, or an interaction. The LLM detects objects but does not determine whether each detected object acts as a subject or as an object. The information classification module (230) determines that, among the nodes of the scene graph, persons are subjects and non-person things are objects. For example, the information classification module (230) may hold a table of words representing people in advance, and may determine that a node whose word is included in the table is a subject and a node whose word is not included is an object. Of course, the information classification module (230) may also determine whether a specific node is a subject or an object using various other judgment logics or artificial neural networks.
[0059] Additionally, the information classification module (230) obtains interaction information from the scene graph. The information classification module (230) determines edge attributes as interactions in the scene graph generated from frame-by-frame text.
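The word-table classification described above can be sketched as follows. The contents of `PERSON_WORDS`, the edge representation, and the pairing of each interaction with a particular object are hypothetical choices for this example; only the node and interaction names come from the FIG. 5 description.

```python
from typing import Dict, List, Tuple

# Hypothetical word table of words that denote people.
PERSON_WORDS = frozenset({"person", "man", "woman", "boy", "girl", "child"})

Edge = Tuple[str, str, str]  # (source node, interaction, target node)


def classify_attributes(nodes: List[str], edges: List[Edge]) -> Dict[str, List[str]]:
    """Split scene graph attributes into subjects, objects, and interactions."""
    # Nodes found in the word table are subjects; all other nodes are objects.
    subjects = [n for n in nodes if n.lower() in PERSON_WORDS]
    objects = [n for n in nodes if n.lower() not in PERSON_WORDS]
    # Edge attributes of the scene graph are treated as interactions.
    interactions = [interaction for _, interaction, _ in edges]
    return {"subject": subjects, "object": objects, "interaction": interactions}


nodes = ["person", "bottle", "phone"]
# Which object each interaction connects to is assumed here.
edges = [("person", "pick up", "bottle"), ("person", "holding", "phone")]
roles = classify_attributes(nodes, edges)
```

A table lookup is the simplest of the options the description mentions; a learned classifier could be substituted behind the same function signature.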
[0060] The graph generation module (200) provides the object's subject / object classification information and interaction information to the bounding box coordinate acquisition module (300). The bounding box coordinate acquisition module (300) acquires the bounding box of the subject, the bounding box of the object, and the bounding box of the interaction.
[0061] Since the bounding box of the subject and the bounding box of the object are already obtained from the LLM, no separate operation is required. It is only necessary to determine whether a specific object output from the LLM is a subject or an object, and information for determining this is provided from the graph generation module (200).
[0062] However, the coordinates of the bounding box of an interaction are not obtained from the LLM, and it is difficult to accurately infer the bounding box of an interaction with the current inference capabilities of the LLM. Accordingly, the present invention calculates the bounding box coordinates of an interaction using the bounding box location information of a subject and the bounding box location information of an object, and the configuration for calculating the bounding box coordinates of an interaction is one of the unique configurations of the present invention.
[0063] FIG. 6 is a diagram showing examples of interaction bounding boxes set according to an embodiment of the present invention, and FIG. 7 is a block diagram showing the detailed structure of a bounding box coordinate acquisition module according to an embodiment of the present invention.
[0064] Referring to FIG. 7, a bounding box coordinate acquisition module (300) according to one embodiment of the present invention includes a subject bounding box coordinate acquisition module (310), an object bounding box acquisition module (320), and an interaction bounding box coordinate calculation module (330). As previously described, the subject bounding box coordinate acquisition module (310) and the object bounding box acquisition module (320) acquire the bounding box of the subject and the bounding box of the object using the output information of the LLM (1000) and the information provided by the graph generation module (200).
[0065] The interaction bounding box coordinate calculation module (330) calculates the interaction bounding box using the coordinates of the subject bounding box and the coordinates of the object bounding box.
[0066] FIG. 6 illustrates various examples of setting an interaction bounding box using the position coordinates of a subject bounding box and an object bounding box, and the interaction bounding box coordinate calculation module (330) determines the upper-left coordinates of the interaction bounding box based on the upper-left coordinates of the subject bounding box and the object bounding box, and determines the lower-right coordinates of the interaction bounding box based on the lower-right coordinates of the subject bounding box and the object bounding box.
[0067] Specifically, the interaction bounding box coordinate calculation module (330) takes the larger of the left coordinates of the upper-left corners of the subject bounding box and the object bounding box as the left coordinate of the upper-left corner of the interaction bounding box, and takes the smaller of the upper coordinates of those upper-left corners as the upper coordinate of the upper-left corner of the interaction bounding box.
[0068] The operation of determining the upper-left coordinates of the interaction bounding box in this way can be expressed as Equation 1 below.
[0069]
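The equation itself does not survive in this text. A reconstruction from the rule stated in paragraph [0067] could read as follows, where the notation is an assumption of this sketch: $(x^{tl}, y^{tl})$ denotes an upper-left corner, and the subscripts $s$, $o$, and $i$ denote the subject, object, and interaction bounding boxes.

```latex
x_{i}^{tl} = \max\left(x_{s}^{tl},\, x_{o}^{tl}\right), \qquad
y_{i}^{tl} = \min\left(y_{s}^{tl},\, y_{o}^{tl}\right)
```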
[0070] Additionally, the interaction bounding box coordinate calculation module (330) takes the smaller of the right coordinates of the lower-right corners of the subject bounding box and the object bounding box as the right coordinate of the lower-right corner of the interaction bounding box, and takes the larger of the lower coordinates of those lower-right corners as the lower coordinate of the lower-right corner of the interaction bounding box.
[0071] The operation of determining the lower-right coordinates of the interaction bounding box in this way can be expressed as Equation 2 below.
[0072]
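The equation itself does not survive in this text. A reconstruction from the rule stated in paragraph [0070] could read as follows, where the notation is an assumption of this sketch: $(x^{br}, y^{br})$ denotes a lower-right corner, and the subscripts $s$, $o$, and $i$ denote the subject, object, and interaction bounding boxes.

```latex
x_{i}^{br} = \min\left(x_{s}^{br},\, x_{o}^{br}\right), \qquad
y_{i}^{br} = \max\left(y_{s}^{br},\, y_{o}^{br}\right)
```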
[0073] Meanwhile, as a result of calculating the top-left and bottom-right coordinates as described above, depending on the positional relationship between the subject bounding box and the object bounding box, there may be cases where the left coordinate is larger than the right coordinate or the top coordinate is smaller than the bottom coordinate. In this case, the right coordinate and the left coordinate are swapped, or the top coordinate and the bottom coordinate are swapped. This switching operation can be expressed as shown in the following Equation 3.
[0074]
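The equation itself does not survive in this text. A reconstruction from the swap rule stated in paragraph [0073] could read as follows, where the notation is an assumption of this sketch: $(x_{i}^{tl}, y_{i}^{tl})$ and $(x_{i}^{br}, y_{i}^{br})$ denote the upper-left and lower-right corners of the interaction bounding box, and the arrow denotes exchanging the two values.

```latex
\left(x_{i}^{tl},\, x_{i}^{br}\right) \leftarrow \left(x_{i}^{br},\, x_{i}^{tl}\right)
\quad \text{if } x_{i}^{tl} > x_{i}^{br}, \qquad
\left(y_{i}^{tl},\, y_{i}^{br}\right) \leftarrow \left(y_{i}^{br},\, y_{i}^{tl}\right)
\quad \text{if } y_{i}^{tl} < y_{i}^{br}
```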
[0075]
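The interaction bounding box calculation described in paragraphs [0067] to [0073] can be sketched in Python as below. The `Box` type and the function name `interaction_box` are hypothetical. The stated rules (smaller upper value, larger lower value, and a swap when the upper value is smaller than the lower value) are consistent with a vertical axis that increases upward, which this sketch assumes; normalized coordinates are likewise an assumption.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def interaction_box(subject: Box, obj: Box) -> Box:
    """Derive the interaction box from the subject and object boxes."""
    # Upper-left corner: larger left value, smaller upper value.
    left = max(subject.left, obj.left)
    top = min(subject.top, obj.top)
    # Lower-right corner: smaller right value, larger lower value.
    right = min(subject.right, obj.right)
    bottom = max(subject.bottom, obj.bottom)
    # Swap when the boxes do not overlap along an axis.
    if left > right:
        left, right = right, left
    if top < bottom:
        top, bottom = bottom, top
    return Box(left, top, right, bottom)


# Overlapping boxes: the result is their common region.
overlapping = interaction_box(Box(0.1, 0.9, 0.5, 0.1), Box(0.4, 0.6, 0.7, 0.3))
# Horizontally separated boxes: the swap yields the gap between them.
separated = interaction_box(Box(0.1, 0.9, 0.3, 0.1), Box(0.6, 0.6, 0.8, 0.3))
```

With these inputs, `overlapping` is `Box(0.4, 0.6, 0.5, 0.3)` and `separated` is `Box(0.3, 0.6, 0.6, 0.3)`, so the swap keeps the result a valid box even when the subject and object do not touch.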
[0076] The scene graph generated by the graph generation module (200) is input into the graph neural network (400), and the graph neural network (400) outputs an embedding vector for each attribute included in the scene graph. A commercial graph neural network may be used as the graph neural network (400).
[0077] For example, as shown in FIG. 5, when there are 5 attributes (person, bottle, phone, pick up, holding), an embedding vector for each attribute is output through the graph neural network (400). That is, 5 embedding vectors are output through the graph neural network (400).
[0078] The feature synthesis module (500) synthesizes an embedding vector for each attribute (object + interaction) and features for the bounding box location for each attribute. Since the bounding box information for each attribute is coordinate information, Fourier mapping is performed on the coordinate information to convert it into feature information, thereby converting the coordinate information of the bounding box into high-dimensional feature information. Since Fourier mapping is a well-known operation, a detailed explanation thereof is omitted.
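One common form of Fourier mapping is sketched below: each coordinate is expanded into sine and cosine values at several frequencies. The description does not specify the frequencies or the output dimension, so the power-of-two frequency schedule, the `num_bands` parameter, and the function name `fourier_features` are assumptions of this example.

```python
import math
from typing import List, Sequence


def fourier_features(coords: Sequence[float], num_bands: int = 4) -> List[float]:
    """Map each coordinate to sin/cos pairs at num_bands frequencies."""
    features = []
    for value in coords:
        for band in range(num_bands):
            # Assumed frequency schedule: pi, 2*pi, 4*pi, ...
            angle = (2.0 ** band) * math.pi * value
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


box = (0.4, 0.6, 0.5, 0.3)  # (left, top, right, bottom), assumed normalized
features = fourier_features(box)
# 4 coordinates x 4 bands x 2 functions = 32 feature values
```

The effect is the one the paragraph describes: four raw coordinates become a higher-dimensional feature vector that can be combined with an embedding vector.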
[0079] The feature synthesis module (500) synthesizes the embedding vector of each attribute forming the scene graph and the feature information of the bounding box corresponding to each attribute to generate synthesis tokens for each attribute.
[0080] FIG. 8 is a diagram showing the operation of a feature synthesis module according to one embodiment of the present invention.
[0081] Referring to FIG. 8, it can be seen that an embedding vector of each attribute constituting the graph is output through a graph neural network, and the output embedding vector is combined with the features of the bounding box corresponding to each attribute.
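The combination step can be sketched as a simple concatenation per attribute. The description says only that the embedding vector and the bounding box features are synthesized, so concatenation is an assumption of this example, as are the function name `synthesize_tokens` and the toy values.

```python
from typing import Dict, List


def synthesize_tokens(
    embeddings: Dict[str, List[float]],
    box_features: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Build one token per attribute from its embedding and box features."""
    tokens = {}
    for attribute, embedding in embeddings.items():
        if attribute not in box_features:
            raise KeyError(f"no bounding box features for attribute {attribute!r}")
        # Assumed combination rule: concatenate embedding and box features.
        tokens[attribute] = list(embedding) + list(box_features[attribute])
    return tokens


# Toy values for illustration only.
embeddings = {"person": [0.2, -0.1], "holding": [0.7, 0.3]}
box_features = {"person": [0.0, 1.0, 0.5], "holding": [1.0, 0.0, 0.5]}
tokens = synthesize_tokens(embeddings, box_features)
```

Raising an error for a missing box reflects the requirement that every attribute, including each interaction, has a bounding box before tokens are generated.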
[0082] Referring again to FIG. 1, the attribute-specific synthesis tokens output from the feature synthesis module (500) and the frame-specific text obtained from the LLM (1000) are input into the video generation neural network (600). The video generation neural network (600) receives the attribute-specific synthesis tokens and frame-specific text input for the corresponding frame and generates frame images, and these frame images are accumulated to generate a video.
[0083] The present invention enables the generation of natural and continuous video even for abbreviated text input by a user by obtaining bounding box coordinates for all attributes (subject, object, interaction) of a scene graph and providing a synthetic token, which combines the embedding vector of all attributes of the scene graph and the features of the bounding box corresponding to each embedding vector, to a video generation neural network (600).
[0084] FIG. 9 is a flowchart illustrating the overall flow of a prompt-based video generation method according to one embodiment of the present invention.
[0085] Referring to FIG. 9, the prompt entered by the user is modified (step 900). The prompt entered by the user is modified by adding a detailed request prompt and an example prompt to the prompt entered by the user. The detailed request prompt describes frame-specific text information and constraints on the object information of the frame and their bounding box information. Additionally, the example prompt includes an example of the required output.
[0086] The converted prompt is input into the LLM to obtain frame-by-frame text and graph information for each frame from the LLM (step 902). Here, the graph information of the frame includes objects and bounding box coordinate information for each object.
[0087] Using the frame-specific text information, a scene graph for each frame is generated through a scene graph generation module such as a natural language processing (NLP) parser (step 904). The scene graph includes multiple attributes.
[0088] Each attribute of the generated scene graph is classified as a subject, an object, or an interaction (step 906).
[0089] When the subject, object, and interaction are classified, bounding box coordinate information of the subject, object, and interaction is obtained (step 908). The bounding box coordinate information of the subject and the object is obtained from the output of the LLM, after determining whether each object output by the LLM is a subject or an object. The bounding box coordinate information of the interaction is determined as shown in Equations 1 to 3 above based on the subject bounding box coordinate information and the object bounding box coordinate information.
[0090] Each attribute of the generated scene graph is input into a graph neural network to obtain an embedding vector for each attribute (step 910).
[0091] A composite token for each attribute is generated by synthesizing the embedding vector for each attribute (subject, object, interaction) and the feature information of the bounding box coordinate information of each attribute (step 912). The feature information of the bounding box coordinate information can be obtained through Fourier mapping of the coordinate information.
[0092] The synthetic tokens for each attribute in each frame and the frame text obtained from the LLM are input into a video generation neural network, and the video generation neural network generates a video based on the prompt entered by the user (step 914).
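The flow of steps 900 to 914 can be summarized as a small orchestration function. Everything here is a hypothetical sketch: the function name, the parameter names, and the dictionary-based frame records are assumptions, and each stage is passed in as a callable because the description leaves the concrete LLM, parser, graph neural network, and video generation network open.

```python
from typing import Any, Callable, List


def generate_video(
    user_prompt: str,
    convert_prompt: Callable[[str], str],
    query_llm: Callable[[str], List[dict]],
    build_scene_graph: Callable[[str], Any],
    classify: Callable[[Any], Any],
    get_box_features: Callable[[dict, Any], dict],
    embed: Callable[[Any], dict],
    synthesize: Callable[[dict, dict], dict],
    render: Callable[[List[dict], List[str]], Any],
) -> Any:
    """Run the per-frame steps in order and hand the results to the renderer."""
    frames = query_llm(convert_prompt(user_prompt))           # steps 900-902
    tokens_per_frame, texts = [], []
    for frame in frames:
        graph = build_scene_graph(frame["text"])              # step 904
        roles = classify(graph)                               # step 906
        box_features = get_box_features(frame, roles)         # step 908
        embeddings = embed(graph)                             # step 910
        tokens_per_frame.append(synthesize(embeddings, box_features))  # step 912
        texts.append(frame["text"])
    return render(tokens_per_frame, texts)                    # step 914
```

Passing the stages as parameters keeps the sketch runnable with simple stand-ins and mirrors the description's point that existing LLMs and neural networks can be plugged in.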
[0093] The prompt-based video generation method of the present invention described above may be performed on a computing device including a processor and memory.
[0094] The present invention has been described with reference to embodiments illustrated in the drawings, but this is merely illustrative, and those skilled in the art will understand that various modifications and equivalent alternative embodiments are possible therefrom. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.
Claims
1. A step of converting a prompt entered by a user and inputting it into an LLM, and obtaining frame-by-frame text and frame-by-frame graph information regarding the prompt entered by the user from the LLM (a); Step (b) of generating a scene graph using the above-mentioned frame-by-frame text information; Step (c) of classifying each attribute of the above scene graph into subject, object, and interaction; Step (d) of obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information; A step (e) of calculating bounding box information of the interaction using bounding box coordinate information of the subject and bounding box coordinate information of the object; A step (f) of inputting each attribute of the above scene graph into a graph neural network to obtain an embedding vector for each attribute; A step (g) of generating a composite token by synthesizing the features of the embedding vector for each of the above attributes and the bounding box coordinates of each of the above attributes; and A prompt-based video generation method comprising the step (h) of generating a video by inputting the above synthetic token into a video generation neural network.
2. In Paragraph 1, The above step (a) is a prompt-based video generation method for converting a prompt entered by a user to include a detailed request prompt requesting frame-by-frame text and frame-by-frame graph information and an example prompt showing examples of frame-by-frame text and frame-by-frame graphs.
3. In Paragraph 1, A prompt-based video generation method in which the above-mentioned frame-by-frame graph information includes object information included in the above-mentioned frame-by-frame text and bounding box coordinate information of each object.
4. In Paragraph 3, The above step (d) is a prompt-based video generation method that determines the object as a subject or an object based on the classification result of the above step (c), and then obtains the subject bounding box coordinate information and the object bounding box coordinate information.
5. In Paragraph 4, The above step (e) is a prompt-based video generation method in which the larger value of the left coordinate of the upper-left corner of the subject bounding box and the object bounding box is determined as the upper-left left coordinate of the interaction bounding box, the smaller value of the upper coordinate of the upper-left corner of the subject bounding box and the object bounding box is determined as the upper-left upper coordinate of the interaction bounding box, the smaller value of the right coordinate of the lower-right corner of the subject bounding box and the object bounding box is determined as the lower-right right coordinate of the interaction bounding box, and the larger value of the lower coordinate of the lower-right corner of the subject bounding box and the object bounding box is determined as the lower-right lower coordinate of the interaction bounding box.
6. In Paragraph 5, The above step (e) is a prompt-based video generation method that swaps the left coordinate and the right coordinate when the left coordinate of the interaction bounding box is determined to be larger than the right coordinate, and swaps the lower coordinate and the upper coordinate when the lower coordinate of the interaction bounding box is determined to be larger than the upper coordinate.
7. The prompt-based video generation method of claim 1, wherein the bounding box coordinate features synthesized in step (g) are obtained by applying Fourier mapping to the bounding box coordinates.
8. A prompt-based video generation device comprising: a processor; and a memory connected to the processor, wherein the processor performs: step (a) of converting a prompt entered by a user, inputting the converted prompt into an LLM, and obtaining from the LLM frame-by-frame text and frame-by-frame graph information for the prompt entered by the user; step (b) of generating a scene graph using the frame-by-frame text information; step (c) of classifying each attribute of the scene graph as a subject, an object, or an interaction; step (d) of obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information; step (e) of calculating bounding box information of the interaction using the bounding box coordinate information of the subject and the bounding box coordinate information of the object; step (f) of inputting each attribute of the scene graph into a graph neural network to obtain an embedding vector for each attribute; step (g) of generating a composite token by synthesizing the features of the embedding vector for each attribute and the bounding box coordinates of each attribute; and step (h) of generating a video by inputting the composite token into a video generation neural network.
9. The prompt-based video generation device of claim 8, wherein step (a) converts the prompt entered by the user so that it includes a detailed request prompt, which requests frame-by-frame text and frame-by-frame graph information, and an example prompt, which shows examples of frame-by-frame text and frame-by-frame graphs.
10. The prompt-based video generation device of claim 8, wherein the frame-by-frame graph information includes information on the objects included in the frame-by-frame text and bounding box coordinate information for each object.
11. The prompt-based video generation device of claim 10, wherein step (d) determines whether each object is a subject or an object based on the classification result of step (c), and then obtains the subject bounding box coordinate information and the object bounding box coordinate information.
12. The prompt-based video generation device of claim 11, wherein step (e) determines: the larger of the left coordinates of the upper-left corners of the subject bounding box and the object bounding box as the left coordinate of the upper-left corner of the interaction bounding box; the smaller of the upper coordinates of the upper-left corners of the subject bounding box and the object bounding box as the upper coordinate of the upper-left corner of the interaction bounding box; the smaller of the right coordinates of the lower-right corners of the subject bounding box and the object bounding box as the right coordinate of the lower-right corner of the interaction bounding box; and the larger of the lower coordinates of the lower-right corners of the subject bounding box and the object bounding box as the lower coordinate of the lower-right corner of the interaction bounding box.
13. The prompt-based video generation device of claim 12, wherein step (e) swaps the left coordinate and the right coordinate when the left coordinate of the interaction bounding box is larger than the right coordinate, and swaps the lower coordinate and the upper coordinate when the lower coordinate of the interaction bounding box is larger than the upper coordinate.
14. The prompt-based video generation device of claim 8, wherein the bounding box coordinate features synthesized in step (g) are obtained by applying Fourier mapping to the bounding box coordinates.
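
As an illustrative sketch only, and not as part of the claims, the interaction bounding box rule recited for step (e) and the Fourier mapping recited for step (g) might be written in Python as follows. The `Box` type, the function names, the assumption that coordinates are normalized to [0, 1] with the vertical axis increasing upward (so that an upper coordinate is normally larger than a lower coordinate), and the sine/cosine frequency schedule (`num_freqs`, `base`) are all assumptions introduced here for illustration; the claims state only that a Fourier mapping is applied and do not fix its form.

```python
import math
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box given by its upper-left and lower-right corners.

    Assumption: coordinates are normalized to [0, 1] and the vertical
    axis increases upward, so normally upper > lower and right > left.
    """
    left: float
    upper: float
    right: float
    lower: float


def interaction_box(subject: Box, obj: Box) -> Box:
    """Interaction box from a subject box and an object box (step (e))."""
    left = max(subject.left, obj.left)      # larger of the two left coordinates
    upper = min(subject.upper, obj.upper)   # smaller of the two upper coordinates
    right = min(subject.right, obj.right)   # smaller of the two right coordinates
    lower = max(subject.lower, obj.lower)   # larger of the two lower coordinates

    # Swap rule: if the boxes do not overlap along an axis, the coordinates
    # come out in the wrong order, so they are exchanged.
    if left > right:
        left, right = right, left
    if lower > upper:
        lower, upper = upper, lower
    return Box(left, upper, right, lower)


def fourier_map(box: Box, num_freqs: int = 8, base: float = 100.0) -> List[float]:
    """Hypothetical Fourier mapping of the four box coordinates.

    Assumption: sin/cos features at geometrically spaced frequencies,
    a common positional-encoding pattern; the source does not specify this.
    """
    features: List[float] = []
    for k in range(num_freqs):
        freq = base ** (k / num_freqs)
        features.extend(math.sin(freq * c) for c in box)
        features.extend(math.cos(freq * c) for c in box)
    return features  # length = num_freqs * 2 * 4


if __name__ == "__main__":
    subject = Box(left=0.1, upper=0.9, right=0.4, lower=0.2)
    obj = Box(left=0.6, upper=0.7, right=0.9, lower=0.3)
    inter = interaction_box(subject, obj)
    print(inter)  # Box(left=0.4, upper=0.7, right=0.6, lower=0.3)
    print(len(fourier_map(inter)))  # 64
```

Under these assumptions, a subject box lying entirely to the left of an object box yields an interaction box that spans the horizontal gap between the two, because the left/right swap restores a valid coordinate ordering.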