A method for generating images based on an instance layout and a computing device using the same.

The method addresses the challenge of accurately reflecting user-expected instance layouts in image generation by using conditional models and object detection to recompose objects, enhancing image fidelity and alignment with user intent.

JP7876244B1 · Active · Publication Date: 2026-06-19 · SUPERB AI CO LTD


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: SUPERB AI CO LTD
Filing Date: 2025-10-27
Publication Date: 2026-06-19

Application Information

Patent Timeline

27 Oct 2025: Application
19 Jun 2026: Publication (JP7876244B1)

IPC: G06T11/00; G06T7/00; G06T7/70; G06T11/60; G06V10/70

AI Tagging

Application Domain

Image analysis; Editing/combining figures or text

Technology Topics

Computer graphics (images); Reference image




AI Technical Summary

Technical Problem

Existing image generation models struggle to accurately reflect user-expected instance layouts, particularly when there is significant overlap or many small layouts, leading to decreased fidelity in generated images.

Method used

A method that verifies and corrects instance layouts in composite images by using conditional image generation models and object detection models to recompose objects based on instance layouts and image captions, ensuring accurate reflection of user intent.

Benefits of technology

Generates completed composite images that accurately reflect intended instance layouts by iteratively recomposing objects, improving image fidelity and alignment with user expectations.



Abstract

[Problem] To provide a method for correcting cases where an instance layout is not properly reflected in a generated composite image. [Solution] The method includes: inputting a source control image containing first to nth instance layouts and an image caption into a conditional image generation model; generating an initial composite image by compositing an object into each instance layout of the source control image with reference to the image caption; checking, through a first object detection model, whether any of the first to nth instance layouts is falsely detected or undetected in the initial composite image; and, if so, inputting at least one sub-control image corresponding to the specific instance layout that is falsely detected or undetected into the conditional image generation model and recompositing at least one specific object corresponding to that specific instance layout in the initial composite image with reference to the image caption, thereby generating a completed composite image.


Description

Technical Field

[0001] The present invention relates to a method for generating an image based on an instance layout and a computing device using the same. More specifically, it relates to a method, and a computing device using the same, in which an initial composite image is generated by referring to a control image including an instance layout and an image caption; the result of object detection performed on the initial composite image through an object detection model is used to check whether any specific instance layout is falsely detected or undetected; and, if so, a completed composite image is generated by recombining the specific object corresponding to that specific instance layout.

Background Art

[0002] Conventionally, when attempting to generate a composite image through an image generation model, it has been difficult to generate the composite image expected by the user if the arrangements and combinations of the objects to be generated on the composite image are not sufficiently represented in the training data. Here, the arrangements and combinations of objects mean (i) the spatial arrangement of each object, (ii) the mutual relationship of the objects, (iii) the relative size and ratio of each object, and (iv) the combination of each object with its background. That is, if the model has not sufficiently learned in what form each object to be generated appears and is arranged on the composite image, it has been difficult to generate the composite image expected by the user through the image generation model.

[0003] To solve these problems, controllable image generation methods (e.g., the ControlNet model) and layout-guidance methods (e.g., the ReCo model) have been proposed and used, which use a control image containing an instance layout (where the instance layout represents the arrangement and structure of objects to be generated according to the class) as a condition for generating a composite image.

[0004] However, even when following these methods, there were still limitations in generating object composite images that corresponded to the instance layouts expected by the user, especially when there was significant overlap between instance layouts or when many small instance layouts were distributed.

[0005] Recent research, such as MIGC (Multi-Instance Generation Controller), which simultaneously generates multiple instances in a single image while controlling them in diverse ways, and InstanceDiffusion, which enables instance-level control within an image in text-image generation models, has attempted to solve the above problems by introducing separate module learning or advanced inference strategies. However, if appropriate configurations and placements are not set for each instance layout, the fidelity of the generated image (i.e., an indicator of how similar the generated image is to the actual image) decreases.

[0006] Therefore, the applicant proposes a new method that verifies whether at least one instance layout is properly reflected in the generated composite image and, if at least part of the instance layout is not reflected, corrects it.

Summary of the Invention

Problems to Be Solved by the Invention

[0007] The purpose of this invention is to solve all of the problems mentioned above.

[0008] Furthermore, the present invention aims to generate a completed composite image as follows: a source control image containing a first to an nth instance layout (where n is an integer of 1 or more) and an image caption are input into a conditional image generation model; the conditional image generation model generates an initial composite image by compositing the first to nth objects into the first to nth instance layouts of the source control image, respectively, with reference to the image caption; the results of object detection on the initial composite image through a first object detection model are used to check whether any of the first to nth instance layouts is falsely detected or not detected; and, if so, at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image is input into the conditional image generation model, which recombines at least one specific object corresponding to the at least one specific instance layout in the initial composite image with reference to the image caption.

[0009] Furthermore, the present invention also aims to generate a recombined image by recombining at least one specific object in the initial composite image, to further check, by referring to the results of object detection on the recombined image through a second object detection model, whether any of the at least one specific instance layout is still falsely detected or not detected, and, if so, to generate a completed composite image by repeating the process of further recombining, through the conditional image generation model, at least a portion of the specific objects that are falsely detected or not detected.

Means for Solving the Problem

[0010] According to one embodiment of the present invention, there is provided a method for generating an image based on an instance layout, the method including the steps of: (a) when a source control image including a first to an nth instance layout (where an instance layout represents the arrangement and structure of objects to be generated according to class, and n is an integer of 1 or more) and an image caption are obtained, a computing device inputs the source control image and the image caption into a conditional image generation model and uses the conditional image generation model to generate an initial composite image by compositing first to nth objects into the first to nth instance layouts of the source control image, respectively, with reference to the image caption; and (b) the computing device refers to the results of object detection performed on the initial composite image through a first object detection model to check whether any of the first to nth instance layouts is falsely detected or not detected, and (i) if no instance layout is falsely detected or not detected, takes the initial composite image as a completed composite image, and (ii) if an instance layout is falsely detected or not detected, inputs at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image into the conditional image generation model and uses the conditional image generation model to recombine at least one specific object corresponding to the at least one specific instance layout in the initial composite image with reference to the image caption, thereby generating the completed composite image.
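
As a reading aid, steps (a) and (b) can be sketched as a single generate-verify-correct pass. This is only an illustrative sketch: the function name and the three callables (`generate`, `find_failed`, `recompose`) are hypothetical stand-ins for the conditional image generation model and the first object detection model, and the image and layout types are left generic because the specification does not fix them.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")    # whatever image type the models use (assumption)
Layout = TypeVar("Layout")  # whatever represents one instance layout (assumption)


def generate_with_verification(
    layouts: Sequence[Layout],
    caption: str,
    generate: Callable[[Sequence[Layout], str], Image],
    find_failed: Callable[[Image, Sequence[Layout]], List[Layout]],
    recompose: Callable[[Image, Sequence[Layout], str], Image],
) -> Image:
    """One pass of step (a) followed by step (b)."""
    # Step (a): composite an object into every instance layout.
    initial = generate(layouts, caption)

    # Step (b): use detection results to find layouts that were
    # falsely detected or not detected in the initial composite image.
    failed = find_failed(initial, layouts)
    if not failed:
        # Nothing to correct: the initial image is the completed image.
        return initial

    # Otherwise recombine only the objects for the failed layouts.
    return recompose(initial, failed, caption)
```

Passing the models in as callables keeps the control flow separate from any particular generator or detector, which the specification leaves open.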

[0011] In one example, in step (b), the computing device generates a recombined image by recombining the at least one specific object in the initial composite image, further checks whether there are any instance layouts that are falsely detected or not detected among the at least one specific instance layouts by referring to the results of object detection on the recombined image through a second object detection model, and if there are instance layouts that are falsely detected or not detected, repeats the process of further recombining at least a portion of the specific objects that are falsely detected or not detected among the at least one specific instance layout through the conditional image generation model, thereby generating the completed composite image.
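
The repeated recombination described in this example might look like the loop below. It is a sketch under assumptions: `recompose` and `find_failed` are hypothetical callables standing in for the conditional image generation model and the second object detection model, and the `max_rounds` cap is an addition for safety, since the specification does not state when repetition stops if a layout never passes detection.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
Layout = TypeVar("Layout")


def refine_until_detected(
    image: Image,
    failed: Sequence[Layout],
    caption: str,
    recompose: Callable[[Image, Sequence[Layout], str], Image],
    find_failed: Callable[[Image, Sequence[Layout]], List[Layout]],
    max_rounds: int = 3,  # assumed cap, not taken from the specification
) -> Tuple[Image, List[Layout]]:
    """Recombine failed layouts until none remain or the cap is reached."""
    remaining = list(failed)
    rounds = 0
    while remaining and rounds < max_rounds:
        # Recombine only the objects whose layouts are still failing.
        image = recompose(image, remaining, caption)
        # Re-check only those specific layouts on the recombined image.
        remaining = find_failed(image, remaining)
        rounds += 1
    # Returns the latest image and any layouts that still fail.
    return image, remaining
```

Returning the still-failing layouts alongside the image lets a caller decide whether to accept the result or discard it.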

[0012] In one example, the result of object detection on the recombined image through the second object detection model includes at least one pseudo-label, which includes at least one predicted category name and at least one predicted bounding box corresponding to at least one predicted object predicted to be located on the recombined image. The computing device further checks whether any of the at least one specific instance layout is falsely detected or not detected by checking whether there is at least one specific pseudo-label that matches the at least one specific instance layout.

[0013] In one example, the computing device determines that each of the specific pseudo-labels is matched to each of the specific instance layouts if each of the specific category names corresponding to each of the specific instance layouts matches each of the specific predicted category names contained in each of the specific pseudo-labels, and each of the specific bounding boxes corresponding to each of the specific instance layouts matches each of the specific predicted bounding boxes contained in each of the specific pseudo-labels.
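
The specification requires that category names and bounding boxes "match" but does not say how box agreement is measured. A common choice, assumed here purely for illustration, is an intersection-over-union (IoU) threshold; the function names and the 0.5 default are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_matches(
    layout_category: str,
    layout_box: Box,
    predicted_category: str,
    predicted_box: Box,
    iou_threshold: float = 0.5,  # assumed value, not from the specification
) -> bool:
    """True when both the category name and the bounding box agree."""
    same_category = layout_category == predicted_category
    same_box = iou(layout_box, predicted_box) >= iou_threshold
    return same_category and same_box
```

A stricter threshold makes the check more demanding for small layouts, where a shift of a few pixels changes the IoU considerably.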

[0014] In one example, the computing device generates the recombined image by (i) separately generating at least one sub-composite image with reference to the at least one sub-control image and the image caption, and adding the generated at least one sub-composite image to the at least one specific instance layout, or (ii) performing either image inpainting or image editing on the at least one specific instance layout in the initial composite image.
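
Option (i) amounts to generating a patch separately and placing it into the region of the failed instance layout. The sketch below illustrates only the placement step, assuming the image is a row-major grid of pixel values and the layout region is an integer bounding box; `paste_patch` is a hypothetical helper, and option (ii) is not shown because it depends on a specific inpainting or editing model.

```python
from typing import List, Tuple

Pixel = int  # stand-in pixel type for the sketch (assumption)
Grid = List[List[Pixel]]


def paste_patch(image: Grid, patch: Grid, box: Tuple[int, int, int, int]) -> Grid:
    """Return a copy of `image` with `patch` written into `box`.

    `box` is (x_min, y_min, x_max, y_max) with exclusive upper bounds.
    """
    x0, y0, x1, y1 = box
    height = len(image)
    width = len(image[0]) if image else 0
    if not (0 <= x0 <= x1 <= width and 0 <= y0 <= y1 <= height):
        raise ValueError("box lies outside the image")
    if len(patch) != y1 - y0 or any(len(row) != x1 - x0 for row in patch):
        raise ValueError("patch size does not match box size")

    # Copy rows so the initial composite image is left untouched.
    out = [row[:] for row in image]
    for dy, patch_row in enumerate(patch):
        out[y0 + dy][x0:x1] = patch_row
    return out
```

A hard rectangular paste is the simplest form of "adding" a sub-composite image; a practical system would likely blend the seam, which is beyond this sketch.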

[0015] In one example, the first object detection model and the second object detection model are characterized by being either an open-set detection model or a closed-set detection model.

[0016] In one example, the first object detection model and the second object detection model are characterized by being identical to each other.

[0017] In one example, the result of object detection on the initial composite image through the first object detection model includes at least one pseudo-label, each of which includes at least one predicted category name and at least one predicted bounding box, corresponding to each of the at least one predicted object predicted to be located on the initial composite image, wherein in step (b), the computing device checks for each of the first instance layout to the nth instance layout whether there are first pseudo-labels to the nth pseudo-labels that match each of the first instance layout to the nth instance layout, thereby checking whether there are instance layouts among the first instance layout to the nth instance layout that are falsely detected or not detected.

[0018] In one example, the computing device determines that each of the first to nth pseudo-labels is matched to each of the first to nth instance layouts if each of the first to nth category names corresponding to each of the first to nth instance layouts matches each of the first to nth predicted category names included in each of the first to nth pseudo-labels, and each of the first to nth bounding boxes corresponding to each of the first to nth instance layouts matches each of the first to nth predicted bounding boxes included in each of the first to nth pseudo-labels.
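
Checking all n instance layouts against the detector output can be sketched as an assignment problem. The version below is an assumption-laden illustration: it uses a greedy, one-to-one assignment so that a single pseudo-label cannot satisfy two layouts of the same class, a detail the specification does not spell out. `is_match` is a hypothetical predicate that encodes the category and bounding-box test.

```python
from typing import Callable, List, Sequence, Set, TypeVar

Layout = TypeVar("Layout")
PseudoLabel = TypeVar("PseudoLabel")


def unmatched_layout_indices(
    layouts: Sequence[Layout],
    pseudo_labels: Sequence[PseudoLabel],
    is_match: Callable[[Layout, PseudoLabel], bool],
) -> List[int]:
    """Indices of layouts with no matching pseudo-label.

    Each pseudo-label is consumed by at most one layout (greedy, in order).
    """
    used: Set[int] = set()
    failed: List[int] = []
    for i, layout in enumerate(layouts):
        hit = next(
            (
                j
                for j, label in enumerate(pseudo_labels)
                if j not in used and is_match(layout, label)
            ),
            None,
        )
        if hit is None:
            # Falsely detected or not detected: no pseudo-label fits.
            failed.append(i)
        else:
            used.add(hit)
    return failed
```

The returned indices identify the specific instance layouts for which sub-control images would then be prepared.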

[0019] In one example, each of the source control image and the sub-control image is any one of an OpenPose-based image, a Canny-edge-based image, a depth-map-based image, a segmentation-based image, a line-art-based image, or a normal-map-based image.

[0020] In one example, the source control image and the sub-control image are images based on different methods.
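
The six control image types, and the possibility that a sub-control image uses a different type from the source control image, can be captured in a small configuration sketch. The enum and helper names are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Optional


class ControlType(Enum):
    """Control image types listed in the specification."""
    OPENPOSE = "openpose"
    CANNY_EDGE = "canny_edge"
    DEPTH_MAP = "depth_map"
    SEGMENTATION = "segmentation"
    LINE_ART = "line_art"
    NORMAL_MAP = "normal_map"


def choose_sub_control_type(
    source_type: ControlType,
    override: Optional[ControlType] = None,
) -> ControlType:
    """Use the source type unless a different one is requested."""
    return override if override is not None else source_type
```

For example, a source control image based on segmentation could be paired with a Canny-edge sub-control image by calling `choose_sub_control_type(ControlType.SEGMENTATION, ControlType.CANNY_EDGE)`.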

[0021] Furthermore, according to another embodiment of the present invention, there is provided a computing device for generating an image based on an instance layout, the computing device including one or more memories for storing instructions and one or more processors configured to execute the instructions, wherein the processor performs the following processes: (I) when a source control image including a first to an nth instance layout (where an instance layout represents the arrangement and structure of objects to be generated according to class, and n is an integer of 1 or more) and an image caption are obtained, the processor inputs the source control image and the image caption into a conditional image generation model and uses the conditional image generation model to generate an initial composite image by compositing first to nth objects into the first to nth instance layouts of the source control image, respectively, with reference to the image caption; and (II) the processor refers to the results of object detection performed on the initial composite image through a first object detection model to check whether any of the first to nth instance layouts is falsely detected or not detected, and (i) if no instance layout is falsely detected or not detected, takes the initial composite image as a completed composite image, and (ii) if an instance layout is falsely detected or not detected, inputs at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image into the conditional image generation model and uses the conditional image generation model to recombine at least one specific object corresponding to the at least one specific instance layout in the initial composite image with reference to the image caption, thereby generating the completed composite image.

[0022] In one example, in process (II), the processor generates a recombined image by recombining the at least one specific object in the initial composite image, further checks whether there are any instance layouts that are falsely detected or not detected among the at least one specific instance layouts by referring to the results of object detection on the recombined image through a second object detection model, and if there are instance layouts that are falsely detected or not detected, the processor repeats the process of further recombining at least a portion of the specific objects that are falsely detected or not detected among the at least one specific instance layout through the conditional image generation model, thereby generating the completed composite image.

[0023] In one example, the result of object detection on the recombined image through the second object detection model includes at least one pseudo-label, which includes at least one predicted category name and at least one predicted bounding box corresponding to at least one predicted object predicted to be located on the recombined image, and the processor further checks whether any of the at least one specific instance layout is falsely detected or not detected by checking whether there is at least one specific pseudo-label that matches the at least one specific instance layout.

[0024] In one example, the processor determines that each of the specific pseudo labels matches each of the specific instance layouts when each of the specific category names corresponding to each of the specific instance layouts matches each of the specific predicted category names included in each of the specific pseudo labels, and each of the specific bounding boxes corresponding to each of the specific instance layouts matches each of the specific predicted bounding boxes included in each of the specific pseudo labels.

[0025] In one example, the processor generates the recombined image by (i) using the conditional image generation model to separately generate at least one sub-composite image with reference to the at least one sub-control image and the image caption, and adding the generated at least one sub-composite image to the at least one specific instance layout, or (ii) performing either image inpainting or image editing on the at least one specific instance layout in the initial composite image.

[0026] In one example, the first object detection model and the second object detection model are either an open-set detection model or a closed-set detection model.

[0027] In one example, the first object detection model and the second object detection model are the same model as each other.

[0028] In one example, the result of object detection on the initial composite image through the first object detection model includes at least one pseudo-label, each of which includes at least one predicted category name and at least one predicted bounding box corresponding to each of at least one predicted object predicted to be located on the initial composite image. In process (II), the processor checks, for each of the first to nth instance layouts, whether a matching one of the first to nth pseudo-labels exists, thereby checking whether any of the first to nth instance layouts is falsely detected or not detected.

[0029] In one example, when each of the first category name to the nth category name corresponding to each of the first instance layout to the nth instance layout is consistent with each of the first predicted category name to the nth predicted category name included in each of the first pseudo-label to the nth pseudo-label, and each of the first bounding box to the nth bounding box corresponding to each of the first instance layout to the nth instance layout is consistent with each of the first predicted bounding box to the nth predicted bounding box included in each of the first pseudo-label to the nth pseudo-label, it is determined that each of the first pseudo-label to the nth pseudo-label is matched to each of the first instance layout to the nth instance layout.

[0030] In one example, each of the source control image and the sub-control image is any one of an OpenPose-based image, a Canny-edge-based image, a depth-map-based image, a segmentation-based image, a line-art-based image, or a normal-map-based image.

[0031] In one example, the source control image and the sub-control image are images based on different methods.

Effects of the Invention

[0032] The present invention has the following effect: a source control image including a first to an nth instance layout (where n is an integer of 1 or more) and an image caption are input into a conditional image generation model; the conditional image generation model generates an initial composite image by compositing the first to nth objects into the first to nth instance layouts of the source control image, respectively, with reference to the image caption; a first object detection model is used to check whether any of the first to nth instance layouts is falsely detected or not detected; and, if so, at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image is input into the conditional image generation model, which recombines at least one specific object corresponding to the at least one specific instance layout in the initial composite image with reference to the image caption, so that a completed composite image is generated.

[0033] Furthermore, the present invention generates a recombined image by recombining at least one specific object in the initial composite image, and further checks whether there are any instance layouts that are falsely detected or not detected among at least one specific instance layout by referring to the results of object detection on the recombined image through a second object detection model, and if there are instance layouts that are falsely detected or not detected, the process of further recombining at least a portion of the specific objects that are falsely detected or not detected among at least one specific instance layout through a conditional image generation model is repeated to generate a completed composite image.

Brief Description of the Drawings

[0034] The following drawings, attached for use in describing embodiments of the present invention, represent only a portion of the embodiments, and a person with ordinary skill in the art to which the present invention pertains (hereinafter referred to as a "person of ordinary skill") can obtain other drawings from these drawings without performing any inventive work.

[0035]
[Figure 1] A schematic configuration of a computing device for generating images based on an instance layout according to one embodiment of the present invention.
[Figure 2] A flowchart illustrating the sequence of steps for generating an image based on an instance layout according to one embodiment of the present invention.
[Figure 3] A process according to one embodiment of the present invention for generating a completed composite image, comprising: (i) generating an initial composite image; (ii) checking whether at least one specific instance layout is falsely detected or not detected in the initial composite image; (iii) if such a specific instance layout exists, generating a recombined image by recombining the specific object corresponding to it; and (iv) repeating the process of further recombining at least a portion of the specific objects that are falsely detected or not detected.

Modes for Carrying Out the Invention

[0036] The detailed description of the present invention below refers to the accompanying drawings, which illustrate specific embodiments in which the present invention may be carried out. These embodiments are described in sufficient detail to enable a person of ordinary skill to carry out the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment can be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each embodiment can be modified without departing from the spirit and scope of the present invention. Therefore, the detailed description below should not be taken as restrictive, and the scope of the present invention should be understood to encompass the scope claimed in the claims and all equivalents thereto. In the drawings, similar reference numerals indicate identical or similar components in various aspects.

[0037] In the following, several preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings, so that a person with ordinary skill in the art to which the present invention pertains can easily implement the present invention.

[0038] Figure 1 shows a schematic configuration of a computing device 100 for generating an image based on an instance layout according to one embodiment of the present invention.

[0039] As shown in Figure 1, the computing device 100 of the present invention may include a memory 110 and a processor 120.

[0040] Here, the memory 110 can store instructions to be executed by the processor 120. Specifically, the instructions are code generated to make the computing device 100 function in a particular manner, and they may be stored in computer-usable or computer-readable memory directed to a computer or other programmable data processing device. The instructions can carry out processes for performing the functions described in the specification of the present invention.

[0041] Furthermore, the processor 120 may include hardware configurations such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), cache memory, and a data bus. The computing device 100 may also further include software configurations for an operating system and applications that perform specific purposes.

[0042] Furthermore, the computing device 100 can cooperate with the conditional image generation model 200 and the object detection model 300. In Figure 1, the conditional image generation model 200 and the object detection model 300 are shown as being configured separately from the computing device 100, but the invention is not limited to this, and at least a part of the conditional image generation model 200 and the object detection model 300 may be included in the computing device 100. Also, in Figure 1, the conditional image generation model 200 and the object detection model 300 are shown as one each, but the present invention is not limited to this, and it goes without saying that they may be configured as multiple units in some cases. For example, the object detection model 300 may include a first object detection model 310 and a second object detection model 320, as shown in Figure 3. The specific functions of the conditional image generation model 200 and the object detection model 300 will be described later.

[0043] A method using the computing device 100 according to one embodiment of the present invention, configured in this manner, will be described below with reference to Figure 2.

[0044] Figure 2 is a flowchart illustrating the sequence of steps for generating an image based on an instance layout according to one embodiment of the present invention.

[0045] Referring to Figure 2, once a source control image including a first to an nth instance layout (where an instance layout represents the arrangement and structure of the objects to be generated according to class, and n is an integer of 1 or more) and an image caption are obtained, the computing device 100 inputs the source control image and the image caption into the conditional image generation model 200, and uses the conditional image generation model 200 to generate an initial composite image by compositing the first to nth objects into the first to nth instance layouts of the source control image, respectively, with reference to the image caption (S201).

[0046] Here, the source control image is the base image for generating the initial composite image; objects are composited onto it based on the natural language sentence contained in the image caption (where the image caption is a natural language sentence that describes a predetermined state, for example, "There is a person standing behind the car"). The source control image may preferably be any one of an OpenPose image, a Canny edge image, a depth map image, a segmentation image, a line art image, or a normal map image, but the present invention is not limited thereto. For example, to generate a composite image of a person in a specific pose, with a specific facial expression, or making a specific hand movement, it is preferable to use an OpenPose image as the source control image. To generate a composite image that includes not only a person but also surrounding objects, it is preferable to use a depth map image, a segmentation image, or a normal map image as the source control image, but the present invention is not limited thereto.

[0047] It should be noted that, as mentioned above, an instance layout represents the arrangement and structure of at least one object to be generated according to the class, and does not represent the original control image itself. For example, if the original control image is an image based on canny edges, and a doll-like outline and a car-shaped outline exist on the original control image, the original control image may contain instance layouts corresponding to the position information, size information, and shape information of the doll-like outline, and instance layouts corresponding to the position information, size information, and shape information of the car-shaped outline. In this case, each instance layout can be represented on the original control image as each bounding box, each semantic mask, each canny edge, each depth map, etc. It should also be noted that each of the at least one object does not necessarily have to be of a different class; the invention is valid even if they are of the same class. To explain this in more detail, please refer to Figure 3.

[0048] Figure 3 shows, according to one embodiment of the present invention, (i) a process for generating an initial composite image 33, (ii) a process for checking whether at least one specific instance layout is falsely detected or not detected in the initial composite image 33, (iii) a process for generating a re-composite image 35 by recompositing the specific objects corresponding to the specific instance layout if such a specific instance layout exists, and (iv) a process for generating a completed composite image 36 by repeatedly recompositing at least a portion of the specific objects that are still falsely detected or not detected.

[0049] Referring to Figure 3, the computing device 100 can obtain an original control image 31, which contains a first instance layout 31a, a second instance layout 31b, and a third instance layout 31c, together with an image caption 32 that includes text describing the original control image 31.

[0050] Furthermore, each of the first instance layout 31a, the second instance layout 31b, and the third instance layout 31c can be set with a corresponding category name and a corresponding bounding box. For example, in Figure 3, the category name corresponding to the first instance layout 31a is set to "person", the category name corresponding to the second instance layout 31b is set to "car", and the category name corresponding to the third instance layout 31c is set to "bicycle". Therefore, by referring to the image caption 32 described later, the conditional image generation model 200 can generate an image by compositing an image of the first object (i.e., "person") into the first instance layout 31a, an image of the second object (i.e., "car") into the second instance layout 31b, and an image of the third object (i.e., "bicycle") into the third instance layout 31c. Of course, in at least some of the first instance layout 31a, the second instance layout 31b, and the third instance layout 31c, the synthesized object image may not be generated at all, or an incorrect object image may be generated. This may vary depending on how well the conditional image generation model 200 has been trained beforehand, or on what kind of sentences the image caption 32 contains.
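As a non-limiting illustration, an instance layout that is set with a category name and a bounding box could be held in a small record such as the one sketched below. The class name `InstanceLayout`, the field names, the corner-based box format, and all pixel coordinates are assumptions made for this sketch only; the description does not prescribe any particular data format.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class InstanceLayout:
    """One instance layout: where an object of a given class should appear."""
    ref: str        # reference numeral used in Figure 3, e.g. "31a"
    category: str   # category name, e.g. "person"
    box: Box        # bounding box on the original control image


# The three layouts of Figure 3. The coordinates are invented for illustration.
layouts = [
    InstanceLayout("31a", "person", (260, 80, 340, 300)),
    InstanceLayout("31b", "car", (120, 160, 420, 340)),
    InstanceLayout("31c", "bicycle", (480, 200, 620, 340)),
]

for layout in layouts:
    print(layout.ref, layout.category, layout.box)
```

Keeping the category name and the bounding box together in one record makes the later comparison against detector output a simple field-by-field check.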

[0051] As described above, the image caption 32 according to the present invention includes a sentence describing the original control image 31. In the case of Figure 3, for example, an image caption 32 that includes the sentence "A person is standing behind the car, and the bicycle is positioned away from the car and the person" may be input to the conditional image generation model 200. In this case, the image caption 32 may be written by the user and input to the conditional image generation model 200, but the present invention is not limited thereto.

[0052] When the original control image 31 and the image caption 32 are input to the conditional image generation model 200, the conditional image generation model 200 can refer to the image caption 32 and generate an initial composite image 33 by compositing objects into the instance layouts of the original control image 31. In the example of Figure 3, objects are composited only into the second instance layout 31b and the third instance layout 31c, and both of them are generated as a "car". That is, in the initial composite image 33 in Figure 3, the first object (i.e., "person") is not generated, and the third object is incorrectly generated as "car" instead of "bicycle". A method to resolve this will be explained below.

[0053] Referring again to Figure 2, the computing device 100 according to the present invention refers to the results of object detection performed on the initial composite image by the first object detection model 310 to determine whether there are any instance layouts that are falsely detected or not detected among the first instance layout to the nth instance layout. If there are no falsely detected or not detected instance layouts, the initial composite image is used as the completed composite image. If there are falsely detected or not detected instance layouts, at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image is input to the conditional image generation model 200. The conditional image generation model 200 then refers to the image caption 32 to recomposite at least one specific object corresponding to the at least one specific instance layout in the initial composite image, thereby generating a completed composite image (S202).

[0054] To explain step S202 in detail, we will refer to Figure 3 again.

[0055] Referring again to Figure 3, once the initial composite image 33 is generated, the computing device 100 performs object detection on the initial composite image 33 using the first object detection model 310. At this time, the first object detection model 310 may be an open-set detection model or a closed-set detection model. The types and functions of open-set detection models and closed-set detection models are already known, so a detailed explanation is omitted here.

[0056] Furthermore, the object detection results may include at least one pseudo-label 34b, 34c, each containing at least one predicted category name and at least one predicted bounding box corresponding to at least one predicted object 33b, 33c predicted to be located on the initial composite image 33. For example, assuming that two objects having the shape of a car are actually generated on the initial composite image 33 as shown in Figure 3, two pseudo-labels 34b, 34c will be placed on the initial composite image 33, in which case each pseudo-label 34b, 34c may contain the predicted category name (e.g., "car") and the predicted bounding box corresponding to each predicted object having the shape of a car.

[0057] In other words, the computing device 100 can determine whether there are any instance layouts among the instance layouts 31a, 31b, and 31c that are falsely detected or not detected by checking whether pseudo-labels 34b and 34c that match the instance layouts 31a, 31b, and 31c exist on the initial composite image 33.

[0058] Here, a match between an instance layout and a pseudo-label means that (i) the category names corresponding to each instance layout 31a, 31b, and 31c match the predicted category names included in each pseudo-label 34b and 34c, and (ii) the bounding boxes corresponding to each instance layout 31a, 31b, and 31c match the predicted bounding boxes included in each pseudo-label 34b and 34c.
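As a non-limiting illustration, the two-part matching condition above can be sketched as follows. The description does not state how two bounding boxes are judged to "match"; this sketch assumes an intersection-over-union (IoU) test with a threshold of 0.5, which is a common convention but is an assumption here. The function names `iou` and `is_match`, the dictionary keys, and the coordinates are likewise hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(layout, pseudo_label, iou_threshold=0.5):
    """True only if (i) the category names agree and (ii) the boxes agree."""
    same_category = layout["category"] == pseudo_label["category"]
    same_box = iou(layout["box"], pseudo_label["box"]) >= iou_threshold  # assumed criterion
    return same_category and same_box


# Third instance layout 31c versus pseudo-label 34c (coordinates invented).
layout_31c = {"category": "bicycle", "box": (480, 200, 620, 340)}
label_34c = {"category": "car", "box": (478, 204, 622, 338)}

print(is_match(layout_31c, label_34c))  # False: the boxes agree but the categories differ
```

Any other box-agreement criterion (for example, a center-distance test) could be substituted for `iou` without changing the two-part structure of the check.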

[0059] As an example, in Figure 3, (i) the second category name corresponding to the second instance layout 31b (i.e., "car") matches the second predicted category name included in the second pseudo-label 34b (i.e., "car"), and the second bounding box corresponding to the second instance layout 31b matches the second predicted bounding box included in the second pseudo-label 34b, so the second instance layout 31b is neither falsely detected nor not detected. (ii) The third category name corresponding to the third instance layout 31c (i.e., "bicycle") does not match the third predicted category name included in the third pseudo-label 34c (i.e., "car"), while the third bounding box corresponding to the third instance layout 31c matches the third predicted bounding box included in the third pseudo-label 34c, so the third instance layout 31c is considered a falsely detected instance layout. (iii) The first predicted object corresponding to the first instance layout 31a does not have a corresponding first pseudo-label on the initial composite image 33, so it is not possible to determine whether the first category name (i.e., "person") corresponding to the first instance layout 31a matches a first predicted category name, and it is also not possible to determine whether the first bounding box corresponding to the first instance layout 31a matches a first predicted bounding box. Therefore, the first instance layout 31a is considered an undetected instance layout.
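As a non-limiting illustration, the three outcomes walked through above can be reproduced with the sketch below. It assumes an IoU threshold of 0.5 as the box-agreement criterion and treats a layout with no sufficiently overlapping pseudo-label as not detected; both choices, together with the function names, status strings, and coordinates, are assumptions made for this sketch rather than details taken from the description.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def classify_layouts(layouts, pseudo_labels, iou_threshold=0.5):
    """Label each layout as matched, falsely_detected, or not_detected."""
    status = {}
    for layout in layouts:
        # Pseudo-labels whose predicted box agrees with this layout's box.
        overlapping = [p for p in pseudo_labels
                       if iou(layout["box"], p["box"]) >= iou_threshold]
        if not overlapping:
            status[layout["ref"]] = "not_detected"
        elif any(p["category"] == layout["category"] for p in overlapping):
            status[layout["ref"]] = "matched"
        else:
            status[layout["ref"]] = "falsely_detected"
    return status


# Figure 3 scenario with invented coordinates.
layouts = [
    {"ref": "31a", "category": "person", "box": (260, 80, 340, 300)},
    {"ref": "31b", "category": "car", "box": (120, 160, 420, 340)},
    {"ref": "31c", "category": "bicycle", "box": (480, 200, 620, 340)},
]
pseudo_labels = [
    {"ref": "34b", "category": "car", "box": (118, 158, 424, 338)},
    {"ref": "34c", "category": "car", "box": (478, 204, 622, 338)},
]

print(classify_layouts(layouts, pseudo_labels))
# {'31a': 'not_detected', '31b': 'matched', '31c': 'falsely_detected'}
```

The layouts reported as `not_detected` or `falsely_detected` are the "specific instance layouts" that are passed on for recompositing.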

[0060] Thus, in the example shown in Figure 3, there is an instance layout that is not detected, namely the first instance layout 31a, and an instance layout that is falsely detected, namely the third instance layout 31c. Therefore, the process to resolve this, namely the process of generating a re-composite image 35 by recompositing the specific objects corresponding to the falsely detected or undetected instance layouts (i.e., the specific instance layouts), will be described below.

[0061] First, the computing device 100 can further acquire sub-control images (not shown) corresponding to specific instance layouts that are falsely detected or not detected, namely the first instance layout 31a and the third instance layout 31c, respectively. At this time, each of the sub-control images (not shown) may be any one of the following: an image based on open pose, an image based on canny edge, an image based on depth map, an image based on segmentation, an image based on line art, and an image based on normal map. Furthermore, it is preferable that each of the sub-control images (not shown) is an image based on a different method from the original control image 31, but the present invention is not limited thereto, and in some cases, they may be images based on the same method. That is, if the original control image 31 is an image based on depth map, it is preferable that each of the sub-control images (not shown) is any one of the following: each of the images based on open pose, each of the images based on canny edge, each of the images based on segmentation, each of the images based on line art, and each of the images based on normal map, but in some cases, they may be each of the images based on depth map. Furthermore, although the above description assumes that each of the sub-control images (not shown) is based on the same scheme, the present invention is not limited thereto, and it goes without saying that in some cases, each of the sub-control images (not shown) may be based on a different scheme.
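As a non-limiting illustration, the preference for a sub-control image based on a different method from the original control image can be expressed as a simple candidate filter. The identifiers in `CONTROL_TYPES` and the function `sub_control_candidates` are hypothetical names chosen for this sketch.

```python
# The six control-image methods listed in the description (identifier spellings assumed).
CONTROL_TYPES = (
    "open_pose", "canny_edge", "depth_map",
    "segmentation", "line_art", "normal_map",
)


def sub_control_candidates(original_type, allow_same=False):
    """Return control types usable for a sub-control image, preferred ones first."""
    if original_type not in CONTROL_TYPES:
        raise ValueError(f"unknown control type: {original_type}")
    # Preferred: any method other than the one used for the original control image.
    preferred = [t for t in CONTROL_TYPES if t != original_type]
    # The same method is permitted "in some cases", so it is appended last if allowed.
    return (preferred + [original_type]) if allow_same else preferred


print(sub_control_candidates("depth_map"))
# ['open_pose', 'canny_edge', 'segmentation', 'line_art', 'normal_map']
```

Because each sub-control image may be based on a different method, the function would simply be called once per specific instance layout.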

[0062] As described above, once each of the sub-control images (not shown) is acquired, the computing device 100, although not shown in Figure 3, inputs each of the sub-control images (not shown) into the conditional image generation model 200, and uses the conditional image generation model 200 to separately generate a sub-composite image by referring to each of the sub-control images (not shown) and the image caption. The re-composite image 35 can then be generated by adding each of the separately generated sub-composite images to specific instance layouts in the initial composite image, namely the first instance layout 31a and the third instance layout 31c, respectively. At this time, the image caption may be the image caption 32 that was input along with the original control image 31 when it was first input into the conditional image generation model 200, but the present invention is not limited thereto, and it goes without saying that it may also be an image caption that was added separately later. Furthermore, although the above has been described as if there is only one conditional image generation model 200, the present invention is not limited thereto.
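As a non-limiting illustration, "adding" a separately generated sub-composite image to a specific instance layout can be sketched as pasting a patch into the bounding-box region of the initial composite image. The description does not specify how the addition is performed; a plain rectangular paste without blending is assumed here, images are represented as nested lists of pixel values to keep the sketch dependency-free, and the function name `paste_patch` is hypothetical.

```python
def paste_patch(image, patch, box):
    """Return a copy of `image` with `patch` written into the region `box`.

    image: list of rows, each row a list of pixel values
    patch: list of rows sized exactly to the box
    box:   (x_min, y_min, x_max, y_max) in integer pixel coordinates
    """
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    if len(patch) != height or any(len(row) != width for row in patch):
        raise ValueError("patch size must equal the bounding-box size")
    out = [row[:] for row in image]          # leave the input image untouched
    for dy in range(height):
        out[y0 + dy][x0:x1] = patch[dy]      # overwrite one row of the region
    return out


# Toy 4x4 "initial composite image" and a 2x2 "sub-composite image".
initial = [[0] * 4 for _ in range(4)]
sub_composite = [[1, 1], [1, 1]]

recomposited = paste_patch(initial, sub_composite, (1, 1, 3, 3))
for row in recomposited:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

In practice the pasted region would usually be blended with its surroundings; that refinement is outside the scope of this sketch.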

[0063] On the other hand, although the above description assumes that the re-composite image 35 is generated using at least one separately generated sub-composite image, the present invention is not limited thereto, and in some cases the re-composite image 35 can be generated without separately generating at least one sub-composite image. For example, it is also possible to generate the re-composite image 35 by having the conditional image generation model 200 perform either image inpainting or image editing on each of the specific instance layouts in the initial composite image 33, namely the first instance layout 31a and the third instance layout 31c. Since the methods for performing image inpainting and image editing are already known, a detailed explanation is omitted here.

[0064] Once the re-composite image 35 is generated, the computing device 100 can refer to the results of object detection on the re-composite image 35 through the second object detection model 320 to further determine whether there are any instance layouts that are falsely detected or not detected among at least one specific instance layout. In this case, the second object detection model 320 may be either an open set detection model or a closed set detection model. Furthermore, although Figure 3 shows the first object detection model 310 and the second object detection model 320 as different models, it should be noted that the present invention is not limited thereto, and the first object detection model 310 and the second object detection model 320 may be the same model.

[0065] Here, the result of object detection on the re-composite image 35 through the second object detection model 320 may include at least one pseudo-label that includes at least one predicted category name and at least one predicted bounding box corresponding to at least one predicted object predicted to be located on the re-composite image 35. For example, assuming that an object having the shape of a bicycle is generated on the re-composite image 35 through the recompositing process as shown in Figure 3, a specific pseudo-label (not shown) that matches at least a portion of the specific instance layouts 31a and 31c, for example the third instance layout 31c, will be included on the re-composite image 35. In this case, the specific pseudo-label (not shown) may include a predicted category name (e.g., "bicycle") and a predicted bounding box corresponding to the object having the shape of a bicycle. As mentioned above, "specific instance layout" refers to the instance layouts that are falsely detected or not detected in the initial composite image 33, i.e., the first instance layout 31a and the third instance layout 31c, so it should be noted that the second instance layout 31b is not a specific instance layout.

[0066] In other words, the computing device 100 can further determine whether there are any instance layouts among the specific instance layouts 31a and 31c that are falsely detected or not detected by checking whether specific pseudo-labels that match the specific instance layouts 31a and 31c exist on the re-composite image 35.

[0067] Here, a match between a specific instance layout and a specific pseudo-label means that a specific category name corresponding to a specific instance layout, for example, the third instance layout 31c (e.g., "bicycle"), matches a specific predicted category name included in a specific pseudo-label (e.g., "bicycle"), and that a specific bounding box corresponding to a specific instance layout matches a specific predicted bounding box included in a specific pseudo-label.

[0068] On the other hand, as shown in Figure 3, it can be confirmed that an instance layout that is either falsely detected or not detected, i.e., the first instance layout 31a, still exists on the re-composite image 35. Therefore, the computing device 100 can cause the conditional image generation model 200 to further recomposite the object corresponding to the first instance layout 31a, which is the portion of the specific instance layouts 31a and 31c that is still falsely detected or not detected, to generate a completed composite image 36. In this case, the recompositing process can be repeated until the completed composite image 36 is generated.

[0069] That is, if the computing device determines that, on the image obtained through a further recompositing process, (i) the first category name (e.g., "person"), second category name (e.g., "car"), and third category name (e.g., "bicycle") corresponding to the first instance layout 31a, second instance layout 31b, and third instance layout 31c respectively match the first predicted category name (e.g., "person"), second predicted category name (e.g., "car"), and third predicted category name (e.g., "bicycle") contained in the first pseudo-label, second pseudo-label, and third pseudo-label respectively, and (ii) the first bounding box, second bounding box, and third bounding box corresponding to the first instance layout 31a, second instance layout 31b, and third instance layout 31c respectively match the first predicted bounding box, second predicted bounding box, and third predicted bounding box contained in the first pseudo-label, second pseudo-label, and third pseudo-label respectively, the computing device can determine that the image is a completed composite image 36 in which there are no instance layouts that are falsely detected or not detected.
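As a non-limiting illustration, the generate, detect, and recomposite cycle described above can be sketched as a loop around three pluggable callables. The generator, the recompositor, and the detector below are toy stand-ins rather than real models, the IoU threshold of 0.5 and the `max_rounds` safety cap are assumptions (the description only states that recompositing is repeated until the completed composite image is obtained), and every name in the sketch is hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def failing_layouts(layouts, labels, thr=0.5):
    """Layouts with no pseudo-label agreeing in both category name and box."""
    return [l for l in layouts
            if not any(p["category"] == l["category"]
                       and iou(l["box"], p["box"]) >= thr for p in labels)]


def generate_until_complete(layouts, generate, recompose, detect, max_rounds=5):
    image = generate(layouts)                          # initial composite image (S201)
    pending = failing_layouts(layouts, detect(image))  # first detection pass
    rounds = 0
    while pending and rounds < max_rounds:
        image = recompose(image, pending)              # re-composite image
        # Only the previously failing ("specific") layouts are re-checked.
        pending = failing_layouts(pending, detect(image))
        rounds += 1
    return image, pending                              # pending is empty when complete


# ---- Toy stand-ins reproducing the Figure 3 scenario (coordinates invented) ----
LAYOUTS = [
    {"ref": "31a", "category": "person", "box": (260, 80, 340, 300)},
    {"ref": "31b", "category": "car", "box": (120, 160, 420, 340)},
    {"ref": "31c", "category": "bicycle", "box": (480, 200, 620, 340)},
]
BOXES = {l["ref"]: l["box"] for l in LAYOUTS}
calls = {"recompose": 0}


def toy_generate(layouts):
    # A toy "image" maps each layout to what was drawn there: person missing, bicycle wrong.
    return {"31a": None, "31b": "car", "31c": "car"}


def toy_recompose(image, pending):
    calls["recompose"] += 1
    fixed = dict(image)
    for l in pending:
        # Round 1 repairs only the bicycle; the person is repaired from round 2 on.
        if l["category"] == "bicycle" or calls["recompose"] >= 2:
            fixed[l["ref"]] = l["category"]
    return fixed


def toy_detect(image):
    return [{"category": c, "box": BOXES[r]} for r, c in image.items() if c is not None]


final, unresolved = generate_until_complete(LAYOUTS, toy_generate, toy_recompose, toy_detect)
print(final, unresolved, calls["recompose"])
# {'31a': 'person', '31b': 'car', '31c': 'bicycle'} [] 2
```

The same `detect` callable is used for every pass here, which corresponds to the case in which the first and second object detection models are the same model; separate callables could be passed for the initial and subsequent passes instead.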

[0070] The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and may be recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, etc., individually or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or they may be known and available to those skilled in the art in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program instructions, such as ROMs, RAMs, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

[0071] Although the present invention has been described above with specific details such as concrete components, and with limited embodiments and drawings, these are provided only to aid in a more overall understanding of the invention, and the invention is not limited to the above embodiments. A person with ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.

[0072] Therefore, the concept of the present invention should not be limited to the embodiments described above, and not only the claims described below but also all modifications equal or equivalent to those claims shall fall within the scope of the concept of the present invention.

Claims

1. A method for generating images based on an instance layout, comprising: (a) when a source control image, which includes a first instance layout to an nth instance layout (where n is an integer of 1 or more, an instance layout represents the arrangement and structure of objects to be generated according to class, and each of the first to nth instance layouts is set with a corresponding category name and a corresponding bounding box), and an image caption are acquired, a computing device inputting the source control image and the image caption into a conditional image generation model, and causing the conditional image generation model to generate an initial composite image by compositing first to nth objects onto each of the first to nth instance layouts of the source control image, with reference to the image caption, each of the category names, and each of the bounding boxes; and (b) the computing device generating, as a result of object detection on the initial composite image using a first object detection model, at least one pseudo-label on the initial composite image where each of the first to nth objects is predicted to be located, checking whether there are any instance layouts among the first to nth instance layouts that are falsely detected or not detected by referring to each of the category names, each of the bounding boxes, and each of the at least one pseudo-label, and, if there are no instance layouts that are falsely detected or not detected, generating the initial composite image as a completed composite image, and, if there are instance layouts that are falsely detected or not detected, inputting at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image into the conditional image generation model, and causing the conditional image generation model to recomposite at least one specific object corresponding to the at least one specific instance layout in the initial composite image by referring to the image caption, thereby generating the completed composite image.

2. The method according to claim 1, characterized in that, in step (b), the computing device generates a re-composite image by recompositing the at least one specific object in the initial composite image, further checks whether there are any instance layouts that are falsely detected or not detected among the at least one specific instance layout by referring to the results of object detection on the re-composite image through a second object detection model, and, if there are instance layouts that are falsely detected or not detected, repeats the process of further recompositing, through the conditional image generation model, at least a portion of the specific objects that are falsely detected or not detected among the at least one specific instance layout, thereby generating the completed composite image.

3. The method according to claim 2, characterized in that the result of object detection on the re-composite image through the second object detection model includes at least one pseudo-label, which includes at least one predicted category name and at least one predicted bounding box, corresponding to at least one predicted object predicted to be located on the re-composite image, and the computing device further checks whether there are any instance layouts that are falsely detected or not detected among the at least one specific instance layout by checking whether there is at least one specific pseudo-label that matches the at least one specific instance layout.

4. The method according to claim 3, characterized in that the computing device determines that each of the specific pseudo-labels is matched to each of the specific instance layouts if each of the specific category names corresponding to each of the specific instance layouts matches each of the specific predicted category names included in each of the specific pseudo-labels, and each of the specific bounding boxes corresponding to each of the specific instance layouts matches each of the specific predicted bounding boxes included in each of the specific pseudo-labels.

5. The method according to claim 2, characterized in that the computing device generates the re-composite image either (i) by separately generating at least one sub-composite image with reference to the at least one sub-control image and the image caption, and adding the generated at least one sub-composite image to the at least one specific instance layout, or (ii) by performing either image inpainting or image editing on the at least one specific instance layout in the initial composite image.

6. The method according to claim 2, characterized in that the first object detection model and the second object detection model are either an open-set detection model or a closed-set detection model.

7. The method according to claim 6, characterized in that the first object detection model and the second object detection model are identical models.

8. The method according to claim 1, characterized in that the result of object detection on the initial composite image through the first object detection model includes first to nth pseudo-labels, each of which includes at least one predicted category name and at least one predicted bounding box, corresponding to each of the first to nth objects predicted to be located on the initial composite image, and, in step (b), the computing device checks whether there are any instance layouts among the first to nth instance layouts that are falsely detected or not detected by checking whether there are first to nth pseudo-labels that match each of the first to nth instance layouts.

9. The method according to claim 8, characterized in that the computing device determines that each of the first to nth pseudo-labels is matched to each of the first to nth instance layouts if each of the first to nth category names corresponding to each of the first to nth instance layouts matches each of the first to nth predicted category names included in each of the first to nth pseudo-labels, and each of the first to nth bounding boxes corresponding to each of the first to nth instance layouts matches each of the first to nth predicted bounding boxes included in each of the first to nth pseudo-labels.

10. The method according to claim 1, characterized in that the source control image and the sub-control image are one of the following: an image based on open pose, an image based on canny edge, an image based on depth map, an image based on segmentation, an image based on line art, and an image based on normal map.

11. The method according to claim 10, characterized in that the source control image and the sub-control image are images based on different methods.

12. A computing device that generates images based on an instance layout, comprising: one or more memories that store instructions; and one or more processors configured to execute the instructions, wherein the processor performs the following processes: (I) when a source control image, which includes a first instance layout to an nth instance layout (where n is an integer of 1 or more, an instance layout represents the arrangement and structure of objects to be generated according to class, and each of the first to nth instance layouts is set with a corresponding category name and a corresponding bounding box), and an image caption are obtained, a process of inputting the source control image and the image caption into a conditional image generation model, and causing the conditional image generation model to generate an initial composite image by compositing first to nth objects onto each of the first to nth instance layouts of the source control image, with reference to the image caption, each of the category names, and each of the bounding boxes; and (II) a process of generating, as a result of object detection on the initial composite image using a first object detection model, at least one pseudo-label on the initial composite image where each of the first to nth objects is predicted to be located, checking whether there are any instance layouts among the first to nth instance layouts that are falsely detected or not detected by referring to each of the category names, each of the bounding boxes, and each of the at least one pseudo-label, and, if there are no instance layouts that are falsely detected or not detected, generating the initial composite image as a completed composite image, and, if there are instance layouts that are falsely detected or not detected, inputting at least one sub-control image corresponding to at least one specific instance layout that is falsely detected or not detected in the initial composite image into the conditional image generation model, and causing the conditional image generation model to recomposite at least one specific object corresponding to the at least one specific instance layout in the initial composite image by referring to the image caption, thereby generating the completed composite image.

13. The computing device according to claim 12, characterized in that, in process (II), the processor generates a re-composite image by recompositing the at least one specific object in the initial composite image, further checks whether there are any instance layouts that are falsely detected or not detected among the at least one specific instance layout by referring to the results of object detection on the re-composite image through a second object detection model, and, if there are instance layouts that are falsely detected or not detected, repeats the process of further recompositing, through the conditional image generation model, at least a portion of the specific objects that are falsely detected or not detected among the at least one specific instance layout, thereby generating the completed composite image.

14. The computing device according to claim 13, characterized in that the result of object detection on the re-composite image through the second object detection model includes at least one pseudo-label, which includes at least one predicted category name and at least one predicted bounding box, corresponding to at least one predicted object predicted to be located on the re-composite image, and the processor further checks whether there are any instance layouts that are falsely detected or not detected among the at least one specific instance layout by checking whether there is at least one specific pseudo-label that matches the at least one specific instance layout.

15. The computing device according to claim 14, characterized in that the processor determines that each of the specific pseudo-labels is matched to each of the specific instance layouts if each of the specific category names corresponding to each of the specific instance layouts matches each of the specific predicted category names included in each of the specific pseudo-labels, and each of the specific bounding boxes corresponding to each of the specific instance layouts matches each of the specific predicted bounding boxes included in each of the specific pseudo-labels.

16. The computing device according to claim 13, characterized in that the processor generates the re-composite image either (i) by separately generating at least one sub-composite image with reference to the at least one sub-control image and the image caption, and adding the generated at least one sub-composite image to the at least one specific instance layout, or (ii) by performing either image inpainting or image editing on the at least one specific instance layout in the initial composite image.

17. The computing device according to claim 13, characterized in that the first object detection model and the second object detection model are either an open-set detection model or a closed-set detection model.

18. The computing device according to claim 17, characterized in that the first object detection model and the second object detection model are identical models.

19. The computing device according to claim 12, characterized in that the result of object detection on the initial composite image through the first object detection model includes first to nth pseudo-labels, each of which includes at least one predicted category name and at least one predicted bounding box, corresponding to each of the first to nth objects predicted to be located on the initial composite image, and, in process (II), the processor checks whether there are any instance layouts among the first to nth instance layouts that are falsely detected or not detected by checking whether there are first to nth pseudo-labels that match each of the first to nth instance layouts.

20. The computing device according to claim 19, characterized in that the processor determines that each of the first to nth pseudo-labels is matched to each of the first to nth instance layouts if each of the first to nth category names corresponding to each of the first to nth instance layouts matches each of the first to nth predicted category names included in each of the first to nth pseudo-labels, and each of the first to nth bounding boxes corresponding to each of the first to nth instance layouts matches each of the first to nth predicted bounding boxes included in each of the first to nth pseudo-labels.

21. The computing device according to claim 12, characterized in that the source control image and the sub-control image are one of the following: an image based on open pose, an image based on canny edge, an image based on depth map, an image based on segmentation, an image based on line art, and an image based on normal map.

22. The computing device according to claim 21, characterized in that the source control image and the sub-control image are images based on different methods.