A method for training and testing a controllable image generation model capable of reflecting fine-grained instance layouts, and a training device and test device using the same.

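The diffusion model uses semantic masks and a VAE to generate images that reflect fine-grained instance layouts, addressing the accuracy limits of box-based and depth-based methods. It improves generation speed by removing the dependence on external segmentation models, and extends control by balancing image fidelity against control information across the denoising time steps.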

JP7875640B1 · Active · Publication Date: 2026-06-18 · SUPERB AI CO LTD

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: SUPERB AI CO LTD
Filing Date: 2025-10-23
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Existing image generation models struggle to represent fine-grained instance layouts accurately. Box-based layout representations are inaccurate for thin or tilted objects and include background inside the box, while depth-based methods fail to capture small objects or fine instance masks.

Method used

A diffusion model that encodes semantic masks with a Variational Autoencoder (VAE) and combines the resulting mask latents with text embeddings as control information. A scheduler adds and removes noise according to the time step, and a denoising network predicts the noise to be removed at each step.

Benefits of technology

The model effectively reflects fine-grained instance layouts while reducing dependence on external segmentation models and enhancing control over image generation by balancing fidelity and control information.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To provide a method for training and testing an image generation model that reflects a fine-grained instance layout. [Solution] The learning method includes the steps of: generating a learning image latent from a learning image; generating a learning noisy image latent 14 from the learning image latent; generating learning control information from the learning noisy image latent and a learning semantic mask latent 13 generated from a learning semantic mask; generating learning text embeddings from a learning image caption; generating learning time step embeddings 17 from a learning time step; generating a learning composite image latent 19 from the learning time step embeddings, the learning text embeddings, and the learning control information using a denoising network 310; and generating a learning composite image 30 from the learning composite image latent and training the denoising network.

Description

【Technical Field】 【0001】 The present invention relates to a learning method and a test method of a controllable image generation model capable of reflecting a fine-grained instance layout, and a learning device and a test device using the same. More specifically, the present invention relates to a method for learning and testing a diffusion model, which is a controllable image generation model, using a semantic mask capable of reflecting a fine-grained instance layout and a caption corresponding thereto, and a learning device and a test device using the same. 【Background Art】 【0002】 With the development of AI technology, image generation models that generate new images according to inputs such as text and noise are widely used. Representative examples include the GAN (Generative Adversarial Network) model and the diffusion model. The GAN model is a model in which a generator that generates an image and a discriminator that determines whether the image is real or fake compete with each other for learning. The diffusion model is a model that gradually generates an image from noise. 【0003】 Among these, the diffusion model uses a forward process that makes data into complete noise while adding noise to the data, and a reverse process that generates data while gradually restoring it from noise, which is the reverse of the forward process. 【0004】 In recent years, research has progressed on layout-to-image systems for controlling the instance layout within images, but these primarily rely on box-based layout representations. However, boxes are inaccurate for representing thin or tilted objects, and because the background is also included within the box, there is a problem in that the accuracy of control at the instance level is reduced. 【0005】 Therefore, research is underway to utilize semantic masks, which provide more accurate instance shape and position information, enabling precise layout control. 
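The limitation of box-based layouts noted in paragraph 0004 can be made concrete with a small calculation. The following is a minimal sketch, not part of the patent: the helper `box_fill_ratio` and the two toy 8x8 masks are hypothetical, and the sketch only measures how much of a tight bounding box is actually covered by the object.

```python
def box_fill_ratio(mask):
    """Fraction of the tight bounding box covered by object pixels.

    `mask` is a list of rows of 0/1 values (a toy binary instance mask).
    """
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask contains no object pixels")
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return len(coords) / box_area


# A thin object tilted at 45 degrees on an 8x8 grid (hypothetical example).
diagonal = [[1 if r == c else 0 for c in range(8)] for r in range(8)]

# An axis-aligned 3x4 block, which a box describes well.
block = [[1 if 2 <= r < 5 and 1 <= c < 5 else 0 for c in range(8)] for r in range(8)]

diagonal_ratio = box_fill_ratio(diagonal)  # 8 object pixels in a 64-pixel box
block_ratio = box_fill_ratio(block)        # 12 object pixels in a 12-pixel box
```

For the tilted thin object, most of the box is background, which is the situation in which a mask carries information that a box cannot.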
【0006】 Recent work, 3DIS (Depth-Driven Decoupled Instance Synthesis), proposes a method that uses masks obtained from control images through a segmentation model to reflect the instance layout more accurately during the denoising process at the inference stage of a diffusion model.
【0007】 However, 3DIS relies on depth-based control images, which makes it difficult to reflect small objects or fine instance masks and limits it to coarse layout control.
【0008】 Therefore, the applicant proposes a diffusion model capable of generating images that reflect not only coarse instance layouts but also fine-grained instance layouts.
[Summary of the Invention] [Problems that the invention aims to solve]
【0009】 The purpose of the present invention is to solve all of the problems mentioned above.
【0010】 Another objective of the present invention is to provide a controllable image generation model that can reflect a fine-grained instance layout.
【0011】 Furthermore, another objective of the present invention is to improve image generation speed by eliminating dependence on external segmentation models.
【0012】 Furthermore, another objective of the present invention is to extend the degree of control over the image generation process by adjusting the balance between image fidelity and control information based on the time step during the denoising process.
[Means for solving the problem] 【0013】 According to one embodiment of the present invention, there is provided a method for learning a controllable image generation model capable of reflecting a fine-grained instance layout, the method including the steps of:
(a) when a learning image and a learning time step are acquired, a learning device inputting the learning image to the encoder of a Variational Autoencoder (VAE) so that the encoder of the VAE generates a learning image latent, and generating, via a scheduler, a learning noisy image latent by repeatedly adding noise to the learning image latent according to the learning time step;
(b) when a learning semantic mask and a learning image caption corresponding to the learning image are acquired (the learning image caption includes learning instance-level text for the learning classes corresponding to the learning semantic mask and learning global-level text corresponding to the learning image), the learning device performing a process of inputting the learning semantic mask to the encoder of the VAE so that the encoder of the VAE generates a learning semantic mask latent and generating learning control information using the learning noisy image latent and the learning semantic mask latent, a process of inputting the learning image caption to a text encoder so that the text encoder generates a learning text embedding including a learning instance-level text embedding and a learning global-level text embedding, and a process of inputting the learning time step to a time step encoder so that the time step encoder encodes the learning time step and generates a learning time step embedding;
(c) the learning device inputting the learning time step embedding, the learning text embedding, and the learning control information into a denoising network, causing the denoising network to generate learning prediction noise by referring to the learning control information, and causing the scheduler to remove noise from the learning control information by referring to the learning prediction noise, thereby generating a learning composite image latent, wherein, according to the learning time step embedding, a prediction noise generation process in which the denoising network generates learning intermediate prediction noise by referring to a learning intermediate composite image latent, and a denoising process in which the scheduler generates the learning intermediate composite image latent with the noise removed by referring to the learning intermediate prediction noise, are repeated to generate the learning composite image latent; and
(d) the learning device inputting the learning composite image latent into the decoder of the VAE, causing the decoder of the VAE to decode the learning composite image latent to generate a learning composite image, and training the denoising network to minimize a loss generated by referring to the learning composite image and the learning image.
【0014】 In one example, in step (b), the learning device concatenates the learning noisy image latent and the learning semantic mask latent channel by channel to generate the learning control information.
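The channel-by-channel concatenation of paragraph 0014 can be sketched as follows. This is an illustrative assumption rather than the patent's implementation: latents are modelled as nested Python lists in [channel][height][width] order, the 4-channel 8x8 shape is hypothetical, and the helpers `make_latent`, `concat_channels`, and `shape` are introduced only for this example.

```python
def make_latent(channels, height, width, fill):
    """Build a toy latent of shape [channels][height][width] filled with one value."""
    return [[[fill for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


def shape(latent):
    """Return (channels, height, width) of a nested-list latent."""
    return (len(latent), len(latent[0]), len(latent[0][0]))


def concat_channels(noisy_image_latent, mask_latent):
    """Stack the mask latent's channels after the image latent's channels.

    Both latents must share the same spatial size, which holds when the
    same encoder produced them from equally sized inputs.
    """
    if shape(noisy_image_latent)[1:] != shape(mask_latent)[1:]:
        raise ValueError("latents must have the same height and width")
    return noisy_image_latent + mask_latent


# Hypothetical shapes: 4 latent channels on an 8x8 grid for each input.
noisy_image_latent = make_latent(4, 8, 8, fill=0.5)
semantic_mask_latent = make_latent(4, 8, 8, fill=1.0)

control_information = concat_channels(noisy_image_latent, semantic_mask_latent)
control_shape = shape(control_information)
```

The spatial size is unchanged and only the channel count grows, which is why the first layer of the denoising network needs additional input weights.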
【0015】 In one example, in step (c), the learning device adds zero-initialized weights to the first layer of the denoising network, in a number corresponding to the channels added by concatenating the learning noisy image latent and the learning semantic mask latent, so that the prediction noise generation process is performed.
【0016】 In one example, in step (b), the learning device inputs the learning noisy image latent, the learning semantic mask latent, the learning text embedding, and the learning time step embedding into a ControlNet so that the ControlNet generates a control signal, and generates the learning control information including the control signal and the learning noisy image latent.
【0017】 In one example, the ControlNet is generated by duplicating some of the layers of a pre-trained diffusion model and adding zero convolution layers to the first and last layers of the ControlNet.
【0018】 In one example, the learning semantic mask is an RGB image in which each of the learning classes included in the learning image is assigned its own unique color.
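The effect of the zero-initialized weights in paragraph 0015 can be illustrated with a toy first layer. The sketch below is an assumption made for illustration: it replaces the real convolution with a pointwise linear map over channels at one spatial position, and the names `extend_first_layer` and `pointwise_layer` and all numeric values are hypothetical.

```python
def extend_first_layer(weights, extra_in_channels):
    """Append zero-initialized input weights to every output row.

    `weights[out][in]` holds the pretrained weights of a pointwise layer.
    """
    return [row + [0.0] * extra_in_channels for row in weights]


def pointwise_layer(weights, channel_values):
    """Apply the layer at one spatial position: one dot product per output."""
    return [sum(w * x for w, x in zip(row, channel_values)) for row in weights]


# Hypothetical pretrained layer: 2 outputs, 2 image-latent input channels.
pretrained = [[0.5, -1.0],
              [2.0, 0.25]]

# Concatenating a 2-channel mask latent adds 2 input channels.
extended = extend_first_layer(pretrained, extra_in_channels=2)

image_channels = [1.0, 2.0]
mask_channels = [7.0, -3.0]

before = pointwise_layer(pretrained, image_channels)
after = pointwise_layer(extended, image_channels + mask_channels)
```

Because the new weights start at zero, the extended layer initially produces the same output as the pretrained layer whatever the mask latent contains, so training starts from the pretrained behavior and learns to use the mask gradually.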
【0019】 Furthermore, according to another embodiment of the present invention, there is provided a method for testing a controllable image generation model capable of reflecting a fine-grained instance layout, the method including the steps of:
(a) in a state in which a denoising network has been trained by a learning device performing (i) a subprocess of, when a learning image and a learning time step are acquired, inputting the learning image to the encoder of a Variational Autoencoder (VAE) so that the encoder of the VAE generates a learning image latent, and generating, via a scheduler, a learning noisy image latent by repeatedly adding noise to the learning image latent according to the learning time step; (ii) a subprocess of, when a learning semantic mask and a learning image caption corresponding to the learning image are acquired (the learning image caption includes learning instance-level text for the learning classes corresponding to the learning semantic mask and learning global-level text corresponding to the learning image), performing a process of inputting the learning semantic mask to the encoder of the VAE so that the encoder of the VAE generates a learning semantic mask latent and generating learning control information using the learning noisy image latent and the learning semantic mask latent, a process of inputting the learning image caption to a text encoder so that the text encoder generates a learning text embedding including a learning instance-level text embedding and a learning global-level text embedding, and a process of inputting the learning time step to a time step encoder so that the time step encoder encodes the learning time step and generates a learning time step embedding; (iii) a subprocess of inputting the learning time step embedding, the learning text embedding, and the learning control information into the denoising network, causing the denoising network to generate learning prediction noise by referring to the learning control information, and causing the scheduler to remove noise from the learning control information by referring to the learning prediction noise to generate a learning composite image latent, wherein, according to the learning time step embedding, a prediction noise generation process in which the denoising network generates learning intermediate prediction noise and a denoising process in which the scheduler generates a learning intermediate composite image latent with the noise removed by referring to the learning intermediate prediction noise are repeated to generate the learning composite image latent; and (iv) a subprocess of inputting the learning composite image latent into the decoder of the VAE, causing the decoder of the VAE to decode the learning composite image latent to generate a learning composite image, and training the denoising network to minimize a loss generated by referring to the learning composite image and the learning image, a test apparatus acquiring a test noisy latent, a test semantic mask, a test image caption corresponding to the test semantic mask (the test image caption includes test instance-level text for the test classes corresponding to the test semantic mask and test global-level text corresponding to the test semantic mask), and a test time step;
(b) the test apparatus performing a process of generating a test semantic noisy latent using the test semantic mask and the test noisy latent, a process of inputting the test image caption to the text encoder so that the text encoder generates a test text embedding including a test instance-level text embedding and a test global-level text embedding, and a process of inputting the test time step to the time step encoder so that the time step encoder encodes the test time step and generates a test time step embedding;
(c) the test apparatus inputting the test semantic noisy latent, the test text embedding, and the test time step embedding into the denoising network, causing the denoising network to generate test prediction noise by referring to the test semantic noisy latent, and causing the scheduler to remove noise from the test semantic noisy latent by referring to the test prediction noise to generate a test composite image latent, wherein, according to the test time step embedding, a prediction noise generation process in which the denoising network generates test intermediate prediction noise by referring to a test intermediate composite image latent, and a denoising process in which the scheduler generates the test intermediate composite image latent with the noise removed by referring to the test intermediate prediction noise, are repeated to generate the test composite image latent; and
(d) the test apparatus inputting the test composite image latent to the decoder of the VAE and causing the decoder of the VAE to decode the test composite image latent to generate a test composite image.
【0020】 In one example, in step (b), the test apparatus converts the test semantic mask to the same resolution as the test noisy latent, and then performs a 1:1 mapping operation with the test noisy latent to generate the test semantic noisy latent.
【0021】 In one example, in step (c), the test apparatus repeats the prediction noise generation process and the denoising process, and in doing so, according to the test time step, (i) in the initial denoising process, in which the prediction noise generation process and the denoising process are repeated up to the k-th time (where k is a preset integer of 1 or more), the output calculation of each layer included in the denoising network includes (i-1) an attention operation on the portion corresponding to an instance-level mask generated by mapping the test instance-level text embedding and the test noisy semantic mask latent, or the test instance-level text embedding and the test intermediate composite image latent, and (i-2) an attention operation on the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent; and (ii) in the later denoising process, in which the prediction noise generation process and the denoising process are repeated after the k-th time, the output calculation of each layer included in the denoising network includes an attention operation on the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent.
【0022】 In one example, the test semantic mask is an RGB image in which each test class to be generated is assigned its own unique color.
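The time-step-dependent switch between instance-level and global-level attention described in paragraph 0021 can be sketched as a simple schedule. This is a minimal sketch under stated assumptions: the function `attention_ops` and its string labels are hypothetical, iterations are counted from 1, and the k-th iteration is treated as belonging to the initial stage. The attention computations themselves are not modelled here.

```python
def attention_ops(iteration, k):
    """Return which attention operations a layer performs at a given iteration.

    Assumption: iterations 1..k form the initial denoising stage and
    iterations after k form the later stage.
    """
    if k < 1 or iteration < 1:
        raise ValueError("k and iteration must be integers of 1 or more")
    if iteration <= k:
        # Initial stage: instance-level text attends only inside its mask
        # region, alongside global-level text attention.
        return ("instance_masked", "global")
    # Later stage: only global-level text attention remains.
    return ("global",)


# Hypothetical run of 5 denoising iterations with k = 3.
schedule = [attention_ops(i, k=3) for i in range(1, 6)]
```

Under this reading, a larger k keeps the layout constraint active for longer, while a smaller k hands control to the global caption earlier, which is one way to trade control information against image fidelity.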
【0023】 Furthermore, according to one embodiment of the present invention, there is provided a learning device for a controllable image generation model capable of reflecting a fine-grained instance layout, the learning device including one or more memories for storing instructions and one or more processors configured to execute the instructions, wherein the processors perform:
(I) a process of, when a learning image and a learning time step are acquired, inputting the learning image to the encoder of a Variational Autoencoder (VAE) so that the encoder of the VAE generates a learning image latent, and generating, via a scheduler, a learning noisy image latent by repeatedly adding noise to the learning image latent according to the learning time step;
(II) when a learning semantic mask and a learning image caption corresponding to the learning image are acquired (the learning image caption includes learning instance-level text for the learning classes corresponding to the learning semantic mask and learning global-level text corresponding to the learning image), a process of inputting the learning semantic mask to the encoder of the VAE so that the encoder of the VAE generates a learning semantic mask latent and generating learning control information using the learning noisy image latent and the learning semantic mask latent, a process of inputting the learning image caption to a text encoder so that the text encoder generates a learning text embedding including a learning instance-level text embedding and a learning global-level text embedding, and a process of inputting the learning time step to a time step encoder so that the time step encoder encodes the learning time step and generates a learning time step embedding;
(III) a process of inputting the learning time step embedding, the learning text embedding, and the learning control information into a denoising network, causing the denoising network to generate learning prediction noise by referring to the learning control information, and causing the scheduler to remove noise from the learning control information by referring to the learning prediction noise to generate a learning composite image latent, wherein, according to the learning time step embedding, a prediction noise generation process in which the denoising network generates learning intermediate prediction noise by referring to a learning intermediate composite image latent, and a denoising process in which the scheduler generates the learning intermediate composite image latent with the noise removed by referring to the learning intermediate prediction noise, are repeated to generate the learning composite image latent; and
(IV) a process of inputting the learning composite image latent into the decoder of the VAE, causing the decoder of the VAE to decode the learning composite image latent to generate a learning composite image, and training the denoising network to minimize a loss generated by referring to the learning composite image and the learning image.
【0024】 In one example, in the (II) process, the processor generates the learning control information by concatenating the learning noisy image latent and the learning semantic mask latent channel by channel.
【0025】 In one example, in the (III) process, the processor concatenates the learning noisy image latent and the learning semantic mask latent, and corresponding to the increased number of channels, adds zero-initialized weights corresponding to the increased number of channels to the first layer of the denoising network, so that the prediction noise generation process is executed. 【0026】 In one example, in the (II) process, the processor inputs the learning noisy image latent, the learning semantic mask latent, the learning text embedding, and the learning time step embedding into a ControlNet, and uses the ControlNet to generate a Control Signal, and generates learning control information including the Control Signal and the learning noisy image latent. 【0027】 In one example, the ControlNet is generated by replicating a part of the layers of a pre-trained Diffusion Model, and adding zero convolutional layers to the first and last layers of the ControlNet. 【0028】 In one example, the learning semantic mask is an RGB image with unique colors assigned to each of the learning classes included in the learning image. 
【0029】 Furthermore, according to another embodiment of the present invention, there is provided a test apparatus for a controllable image generation model capable of reflecting a fine-grained instance layout, the test apparatus including one or more memories for storing instructions and one or more processors configured to execute the instructions, wherein, in a state in which a denoising network has been trained by a learning device performing (i) a subprocess of, when a learning image and a learning time step are acquired, inputting the learning image to the encoder of a Variational Autoencoder (VAE) so that the encoder of the VAE generates a learning image latent, and generating, via a scheduler, a learning noisy image latent by repeatedly adding noise to the learning image latent according to the learning time step; (ii) a subprocess of, when a learning semantic mask and a learning image caption corresponding to the learning image are acquired (the learning image caption includes learning instance-level text for the learning classes corresponding to the learning semantic mask and learning global-level text corresponding to the learning image), performing a process of inputting the learning semantic mask to the encoder of the VAE so that the encoder of the VAE generates a learning semantic mask latent and generating learning control information using the learning noisy image latent and the learning semantic mask latent, a process of inputting the learning image caption to a text encoder so that the text encoder generates a learning text embedding including a learning instance-level text embedding and a learning global-level text embedding, and a process of inputting the learning time step to a time step encoder so that the time step encoder encodes the learning time step and generates a learning time step embedding; (iii) a subprocess of inputting the learning time step embedding, the learning text embedding, and the learning control information into the denoising network, causing the denoising network to generate learning prediction noise by referring to the learning control information, and causing the scheduler to remove noise from the learning control information by referring to the learning prediction noise to generate a learning composite image latent, wherein, according to the learning time step embedding, a prediction noise generation process in which the denoising network generates learning intermediate prediction noise and a denoising process in which the scheduler generates a learning intermediate composite image latent with the noise removed by referring to the learning intermediate prediction noise are repeated to generate the learning composite image latent; and (iv) a subprocess of inputting the learning composite image latent into the decoder of the VAE, causing the decoder of the VAE to decode the learning composite image latent to generate a learning composite image, and training the denoising network to minimize a loss generated by referring to the learning composite image and the learning image, the processors perform:
(I) a process of acquiring a test noisy latent, a test semantic mask, a test image caption corresponding to the test semantic mask (the test image caption includes test instance-level text for the test classes corresponding to the test semantic mask and test global-level text corresponding to the test semantic mask), and a test time step;
(II) a process of generating a test semantic noisy latent using the test semantic mask and the test noisy latent, a process of inputting the test image caption to the text encoder so that the text encoder generates a test text embedding including a test instance-level text embedding and a test global-level text embedding, and a process of inputting the test time step to the time step encoder so that the time step encoder encodes the test time step and generates a test time step embedding;
(III) a process of inputting the test semantic noisy latent, the test text embedding, and the test time step embedding into the denoising network, causing the denoising network to generate test prediction noise by referring to the test semantic noisy latent, and causing the scheduler to remove noise from the test semantic noisy latent by referring to the test prediction noise to generate a test composite image latent, wherein, according to the test time step embedding, a prediction noise generation process in which the denoising network generates test intermediate prediction noise by referring to a test intermediate composite image latent, and a denoising process in which the scheduler generates the test intermediate composite image latent with the noise removed by referring to the test intermediate prediction noise, are repeated to generate the test composite image latent; and
(IV) a process of inputting the test composite image latent to the decoder of the VAE and causing the decoder of the VAE to decode the test composite image latent to generate a test composite image.
【0030】 In one example, in the (II) process, the processor converts the test semantic mask to the same resolution as the test noisy latent, and then performs a 1:1 mapping operation with the test noisy latent to generate the test semantic noisy latent. 【0031】 In one example, in the (III) process, the processor repeats the prediction noise generation process and the denoising process. At this time, according to the test time step, (i) in the initial denoising process where the prediction noise generation process and the denoising process are repeated up to the k-th time (where k is a preset integer greater than or equal to 1), the output calculation operation of each layer included in the denoising network includes (i-1) an attention operation on the part corresponding to the instance-level mask generated by mapping the test instance-level text embedding and the test noisy semantic mask latent, or the test instance-level text embedding and the test intermediate composite image latent, and (i-2) an attention operation on the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent. (ii) In the late denoising process where the prediction noise generation process and the denoising process are repeated after the k-th time, the output calculation operation of each layer included in the denoising network includes an attention operation on the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent. 【0032】 In one example, the test semantic mask is an RGB image to which each test class to be generated is assigned its own unique color. [Effects of the Invention] 【0033】 The present invention has the effect of providing a controllable image generation model that can reflect a fine-grained instance layout. 
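The resolution conversion and 1:1 mapping of paragraph 0030 can be sketched on toy data. The patent does not state which resampling method is used, so nearest-neighbor sampling is an assumption here, chosen because it keeps class labels discrete. The helpers `resize_nearest` and `pair_with_latent`, the single-channel latent, and the label grid are all hypothetical.

```python
def resize_nearest(mask, out_h, out_w):
    """Resize a 2D label grid with nearest-neighbor sampling.

    Nearest-neighbor never blends neighboring labels, so every output
    value is one of the input labels.
    """
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def pair_with_latent(mask_small, latent):
    """Pair each latent position with the mask label at the same position."""
    if len(mask_small) != len(latent) or len(mask_small[0]) != len(latent[0]):
        raise ValueError("mask and latent must have the same resolution")
    return [[(latent[r][c], mask_small[r][c]) for c in range(len(latent[0]))]
            for r in range(len(latent))]


# Hypothetical 4x4 mask of class ids and a 2x2 single-channel noisy latent.
test_mask = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [2, 2, 3, 3],
             [2, 2, 3, 3]]
test_noisy_latent = [[0.1, 0.2],
                     [0.3, 0.4]]

mask_at_latent_resolution = resize_nearest(test_mask, 2, 2)
semantic_noisy_latent = pair_with_latent(mask_at_latent_resolution, test_noisy_latent)
```

After resizing, each latent position has exactly one mask label, which is what a position-wise 1:1 mapping requires.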
【0034】 Furthermore, the present invention has the effect of improving image generation speed by eliminating dependence on external segmentation models. 【0035】 Furthermore, the present invention has the effect of expanding the degree of control over the image generation process by adjusting the balance between image fidelity and control information based on the time step during the denoising process. [Brief Description of the Drawings] 【0036】 The following drawings, attached for use in describing embodiments of the present invention, represent only some of the embodiments, and a person with ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art") can obtain other drawings based on these drawings without inventive effort. 【0037】 [Figure 1] This figure schematically illustrates a learning device that trains a diffusion model, which is a controllable image generation model capable of reflecting a fine-grained instance layout, according to one embodiment of the present invention. [Figure 2] This figure schematically shows a diffusion model, which is a controllable image generation model according to one embodiment of the present invention. [Figure 3A] This figure schematically illustrates a method for training a diffusion model according to one embodiment of the present invention. [Figure 3B] This figure schematically illustrates another method for training a diffusion model according to one embodiment of the present invention. [Figure 4] This figure schematically shows a test apparatus for testing a diffusion model, which is a controllable image generation model capable of reflecting a fine-grained instance layout, according to one embodiment of the present invention. [Figure 5] This figure schematically shows a diffusion model trained according to one embodiment of the present invention. [Figure 6] This figure schematically illustrates a method for testing a diffusion model according to one embodiment of the present invention. 
[Figure 7] This figure illustrates the processes of the initial denoising stage and the later denoising stage according to the progression of a time step, according to one embodiment of the present invention. [Figure 8] This figure schematically shows a semantic mask, which is a control image input according to one embodiment of the present invention. [Modes for carrying out the invention] 【0038】 The detailed description of the present invention, as described below, refers to the accompanying drawings illustrating specific embodiments in which the present invention may be carried out. These embodiments are described in sufficient detail so that a person of the ordinary skill can carry out the present invention. It should be understood that the various embodiments of the present invention are different from one another but do not need to be mutually exclusive. For example, certain shapes, structures and characteristics described herein can be realized by modifying one embodiment to another without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each embodiment can be modified without departing from the spirit and scope of the present invention. Therefore, the detailed description described below should not be taken as restrictive, and the scope of the present invention should be understood to encompass the scope claimed in the claims and all equivalent scopes thereto. Similar reference numerals in the drawings indicate identical or similar components across multiple aspects. 【0039】 For reference, throughout this specification, we have added "for learning" or "learning" to terms related to the learning process, and "for testing" or "test" to terms related to the testing process, in order to avoid confusion wherever possible. 
【0040】 In the following, several preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings, so that a person with ordinary skill in the art to which the present invention pertains can easily implement the present invention.
【0041】 Figure 1 schematically shows a learning device for learning a controllable image generation model capable of reflecting a fine-grained instance layout according to one embodiment of the present invention. Referring to Figure 1, the learning device 100 may include a memory 110 that stores instructions for learning a diffusion model 300, which is a controllable image generation model capable of reflecting a fine-grained instance layout, and a processor 120 that executes operations to train the diffusion model 300 according to the instructions stored in the memory 110. In this case, the diffusion model 300 is shown as being provided on the learning device 100, but it is not limited to this, and the diffusion model 300 may be provided in a cloud environment or on a computing device different from the learning device 100.
【0042】 Specifically, the learning device 100 may achieve desired system performance using a combination of a typical computing device (e.g., a device that may include a computer processor, memory, storage, input and output devices, and other components of conventional computing devices; electronic communication devices such as routers and switches; and electronic information storage systems such as network-attached storage (NAS) and storage area networks (SAN)) and computer software (i.e., instructions that enable the computing device to function in a particular manner).
【0043】 Furthermore, the processor 120 of the learning device 100 may include hardware components such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), cache memory, and a data bus.
The computing device may also further include an operating system and software components for applications that perform a specific purpose.
【0044】 However, this does not exclude the case in which the learning device 100 includes an integrated processor in which a medium, a processor, and a memory for carrying out the present invention are integrated.
【0045】 Referring to Figure 2, the diffusion model 300, a controllable image generation model according to one embodiment of the present invention, may include: a denoising network 310 that generates prediction noise, i.e., a prediction of the noise to be removed from its noisy input at the next time step; a scheduler 320 that performs a forward process, which turns the learning image into complete noise by adding noise to it, and a reverse process, which generates the learning image by gradually restoring it from the noise; a VAE (Variational AutoEncoder) 330 that includes an encoder for encoding an input image into an image latent and a decoder for decoding the image latent into an image; a text encoder 340 for encoding input text; and a time step encoder 350 for encoding an input time step. However, the present invention is not limited thereto, and the diffusion model 300 may be configured with various architectures depending on the user. In this case, the time step may be represented by an integer of 1 or more.
【0046】 Using the learning device 100 configured in this way, a learning method for a diffusion model according to one embodiment of the present invention will be described below with reference to Figures 2 and 3A. First, when the learning image 10 and the learning time step are acquired according to the forward process, the learning device 100 inputs the learning image 10 to the encoder of the VAE 330, and uses the encoder of the VAE 330 to encode the learning image 10 and generate a learning image latent.
The learning device 100 then generates a learning noisy image latent 14 by repeatedly adding noise to the learning image latent according to the learning time step via the scheduler 320. Here, the learning time step indicates the number of iterations for adding or removing noise, but is not limited to this. The learning noisy image latent 14 is complete noise generated by the repeated addition of noise, and may be Gaussian noise, i.e., random noise following a normal distribution, but is not limited to this.
【0047】 For example, if noise is added using 1000 learning time steps, the scheduler 320 can generate a schedule sequence containing 1000 numbers, such as [0,1,2,...,999], and then add Gaussian noise at each step according to the predefined value in the variance schedule to generate the learning noisy image latent. Here, the variance schedule is a fixed schedule of small variance values that determine how much noise to add at each time step of the schedule sequence. The values of the variance schedule are therefore defined to start with a small number, usually close to 0, and to increase gradually so that the data degrades more as time progresses. As another example, if noise is added using 50 learning time steps, the scheduler 320 can generate a schedule sequence containing 50 numbers, such as [0,20,...,940,960,980], and then add Gaussian noise according to the variance schedule to generate the learning noisy image latent. The numbers in the schedule sequence may be generated at uniform intervals, such as [0,...,940,960,980], or they may follow non-uniform intervals using other optimized sampling methods.
【0048】 Here, the learning time step may be determined randomly by the learning device 100, but the present invention is not limited thereto, and it may be determined by various methods, such as user settings.
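The schedule sequence and variance schedule of paragraph 【0047】 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the linear shape of the variance schedule, its endpoints (1e-4 and 0.02), the closed-form noising used in place of an explicit loop of noise additions (distributionally equivalent in a standard DDPM-style forward process), and the names `schedule_sequence`, `linear_variance_schedule`, and `add_noise` are all introduced here for illustration. The specification only states that the variance starts near 0 and gradually increases.

```python
import math
import random


def schedule_sequence(total_steps, num_steps):
    """Uniformly spaced time steps, e.g. (1000, 50) -> [0, 20, ..., 980]."""
    stride = total_steps // num_steps
    return [i * stride for i in range(num_steps)]


def linear_variance_schedule(total_steps, beta_start=1e-4, beta_end=0.02):
    """ASSUMPTION: a linear schedule with illustrative endpoints.
    Variances start close to 0 and increase with the time step."""
    step = (beta_end - beta_start) / (total_steps - 1)
    return [beta_start + i * step for i in range(total_steps)]


def add_noise(latent, t, betas, rng):
    """Noise a flat latent up to time step t in one shot.
    alpha_bar is the product of (1 - beta) over steps 0..t, so this matches
    (in distribution) adding Gaussian noise t + 1 times in sequence."""
    alpha_bar = 1.0
    for beta in betas[: t + 1]:
        alpha_bar *= 1.0 - beta
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(latent, noise)]
    return noisy, noise, alpha_bar


betas = linear_variance_schedule(1000)
steps = schedule_sequence(1000, 50)
rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.0]          # stand-in for a learning image latent
_, _, early = add_noise(latent, steps[1], betas, rng)
_, _, late = add_noise(latent, steps[-1], betas, rng)
```

Here `early` and `late` are the fractions of the original signal variance that remain at time steps 20 and 980. The later step retains far less signal, which corresponds to the statement that the data degrades more as time progresses.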
【0049】 Next, a reverse process may be performed to remove noise from the learning noisy image latent 14. First, the inputs required by the denoising network 310 of the diffusion model 300 are obtained through the following process.
【0050】 To this end, the learning device 100 can acquire a learning semantic mask 11 and a learning image caption 12 corresponding to the learning image 10. Here, the learning semantic mask 11 may be an RGB image in which each learning class included in the learning image 10 is assigned its own unique color, as shown in (a), (b), and (c) in Figure 8. Furthermore, the learning image caption 12 may include learning instance-level text for the learning class corresponding to the learning semantic mask 11 and learning global-level text corresponding to the learning image 10. For example, the learning instance-level text may be "car", "dog", and "person", and the learning global-level text may be "a dog and a person are walking next to a car".
【0051】 The learning device 100 can then perform the following processes: (i) input the learning semantic mask 11 to the encoder of the VAE 330, use the encoder of the VAE 330 to encode the learning semantic mask 11 and generate a learning semantic mask latent 13, and generate learning control information using the learning noisy image latent 14 and the learning semantic mask latent 13; (ii) input the learning image caption 12 to the text encoder 340 and use the text encoder 340 to encode the learning image caption 12 and generate a learning text embedding 16 that includes a learning instance-level text embedding and a learning global-level text embedding; and (iii) input the learning time step to the time step encoder 350 and use the time step encoder 350 to encode the learning time step and generate a learning time step embedding 17.
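As an illustration of the inputs described in paragraph 【0050】, the sketch below renders a class map as an RGB semantic mask with one unique color per class and pairs it with a two-level caption. The color scheme (the bit-interleaved colormap commonly used with PASCAL VOC), the black background, and the names `class_color` and `render_semantic_mask` are assumptions made for this example; the specification only requires that each class have its own unique color.

```python
def class_color(index):
    """ASSUMPTION: PASCAL VOC style colormap; distinct for indices 0..255."""
    r = g = b = 0
    for shift in range(7, -1, -1):
        r |= (index & 1) << shift
        g |= ((index >> 1) & 1) << shift
        b |= ((index >> 2) & 1) << shift
        index >>= 3
    return (r, g, b)


def render_semantic_mask(class_map, class_names):
    """Turn a grid of class names (None = background) into an RGB mask."""
    palette = {name: class_color(i + 1) for i, name in enumerate(class_names)}
    background = class_color(0)  # black
    mask = [[palette.get(cell, background) for cell in row] for row in class_map]
    return mask, palette


class_map = [
    [None, "car", "car"],
    ["dog", "person", "car"],
]
mask, palette = render_semantic_mask(class_map, ["car", "dog", "person"])

# Caption with instance-level and global-level text, as in the example above.
caption = {
    "instance_level": ["car", "dog", "person"],
    "global_level": "a dog and a person are walking next to a car",
}
```

Because the mask is an ordinary RGB image, it can be passed through the same VAE encoder as the learning image, which is what process (i) of paragraph 【0051】 relies on.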
【0052】 On the other hand, although the above description assumes that the learning device 100 acquires the learning image 10 and the learning time step for the forward process, and the learning semantic mask 11 and the learning image caption 12 for the reverse process, the present invention is not limited thereto, and the learning device 100 can also perform the forward and reverse processes after acquiring the learning image 10, the learning time step, the learning semantic mask 11, and the learning image caption 12 together.
【0053】 Subsequently, the learning device 100 can concatenate the learning noisy image latent 14 and the learning semantic mask latent 13 channel by channel to generate learning control information 15.
【0054】 Here, the learning control information 15 may be a latent obtained by concatenating the learning noisy image latent 14 and the learning semantic mask latent 13.
【0055】 Furthermore, concatenating the learning noisy image latent 14 and the learning semantic mask latent 13 increases the number of channels. By adding zero-initialized weights, as many as the number of added channels, to the first layer of the denoising network 310, the denoising network 310 can be enabled to process the learning control information 15.
【0056】 The learning device 100 inputs the learning time step embedding 17, the learning text embedding 16, and the learning control information 15 to the denoising network 310, causing the denoising network 310 to generate learning prediction noise 18 by referring to the learning control information 15 according to the learning time step embedding 17, and causing the scheduler 320 to remove noise from the learning control information 15 by referring to the learning prediction noise 18 to generate the learning composite image latent 19.
In this case, the learning prediction noise may be a prediction of the noise to be removed from the noisy input, i.e., the learning control information 15 or the learning intermediate composite image latent, according to the learning time step embedding.
【0057】 Specifically, the learning device 100 uses the denoising network 310 to generate learning intermediate prediction noise by referring to the learning control information 15, inputs the learning intermediate prediction noise to the scheduler 320, and causes the scheduler 320 to remove noise from the learning control information 15 by referring to the learning intermediate prediction noise, thereby generating a learning intermediate composite image latent. The learning device 100 then inputs the learning intermediate composite image latent back into the denoising network 310 as learning control information. The learning composite image latent 19 can be generated by repeating (i) a prediction noise generation process, in which the denoising network 310 generates learning intermediate prediction noise by referring to the learning intermediate composite image latent, and (ii) a denoising process, in which the learning intermediate prediction noise is input to the scheduler 320 and the scheduler 320 removes noise from the learning intermediate composite image latent by referring to the learning intermediate prediction noise, generating the learning intermediate composite image latent corresponding to the learning time step. Here, when the learning device 100 inputs the learning intermediate composite image latent to the denoising network 310, it can also concatenate the learning intermediate composite image latent and the learning semantic mask latent 13 to generate updated learning control information 15, and input the updated learning control information 15 to the denoising network 310.
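The channel-wise concatenation of paragraph 【0053】 and the zero-initialized first-layer weights of paragraph 【0055】 can be illustrated with a small sketch. It assumes, purely for brevity, that the first layer is a 1x1 convolution over nested Python lists; the function names are hypothetical. One consequence of zero initialization, which the specification does not state explicitly, is that the extended layer initially produces exactly the same output as the original layer, so the added mask channels only gain influence as training updates their weights.

```python
def concat_channels(noisy_latent, mask_latent):
    """Channel-wise concatenation; each input is a list of H x W channels."""
    return noisy_latent + mask_latent


def conv1x1(x, weight):
    """1x1 convolution. x: C_in channels of H x W; weight: C_out rows of C_in."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(row[c] * x[c][i][j] for c in range(len(x))) for j in range(width)]
         for i in range(height)]
        for row in weight
    ]


def extend_first_layer(weight, extra_in_channels):
    """Append zero-initialized input weights for the added channels."""
    return [row + [0.0] * extra_in_channels for row in weight]


# Two-channel noisy image latent and two-channel semantic mask latent (2 x 2).
noisy = [[[0.1, -0.4], [0.7, 0.2]], [[1.0, 0.5], [-0.3, 0.0]]]
mask = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 2.0], [2.0, 0.0]]]

first_layer = [[0.5, -1.0], [2.0, 0.25]]          # stand-in pretrained weights
control_info = concat_channels(noisy, mask)       # now four channels
extended = extend_first_layer(first_layer, extra_in_channels=len(mask))

before = conv1x1(noisy, first_layer)
after = conv1x1(control_info, extended)
```

In this example `after` equals `before`, since every weight attached to a mask channel is zero.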
【0058】 For reference, unlike the forward process, the reverse process is a noise-removing process. Therefore, if the scheduler 320 uses 50 steps as the learning time step, the scheduler 320 generates a schedule sequence containing 50 numbers, such as [980,960,...,20,0], and noise may be removed according to the predefined values included in the variance schedule.
【0059】 Once the learning composite image latent 19 is generated by the reverse process described above, the learning device 100 inputs the learning composite image latent 19 to the decoder of the VAE 330, and the decoder of the VAE 330 decodes the learning composite image latent 19 to generate the learning composite image 30. The denoising network 310 can then be trained to minimize the loss generated by referencing the learning composite image 30 and the learning image 10. Alternatively, instead of the loss generated by referencing the learning composite image 30 and the learning image 10, the denoising network 310 can be trained using a loss generated by referencing the noise added and the noise predicted at each time step. However, the present invention is not limited thereto, and the denoising network 310 can be trained using various loss functions so that it can predict the noise in the previous time step from the noise at each time step. Through such training of the denoising network 310, the diffusion model 300 becomes able to recognize a semantic mask containing fine-grained information.
【0060】 On the other hand, in the above method, the learning control information 15 was generated by concatenating the learning noisy image latent 14 and the learning semantic mask latent 13. However, it is also possible to generate learning control information using a control net, which will be explained below with reference to Figure 3B.
In the following, detailed explanations will be omitted for parts that can be easily understood from the explanation given with reference to Figure 3A.
【0061】 First, the learning device 100 inputs the learning noisy image latent 14, the learning semantic mask latent 13, the learning text embedding 16, and the learning time step embedding 17 to the control net 360, causing the control net 360 to generate a control signal (not shown), thereby generating learning control information that includes the control signal (not shown) and the learning noisy image latent 14. Here, the control net 360 may be generated by duplicating some of the layers of the pre-trained diffusion model 300 and adding zero convolution layers to the first and last layers of the control net 360.
【0062】 As described above, the learning device 100 inputs the learning noisy image latent 14, the learning semantic mask latent 13, the learning text embedding 16, and the learning time step embedding 17 to the control net 360 to generate a control signal (not shown). Then, the learning time step embedding 17, the learning text embedding 16, and the control signal (not shown) are input to the denoising network 310, which generates learning prediction noise according to the learning time step embedding 17 by referring to the learning control information, i.e., the control signal (not shown) and the learning noisy image latent 14. The scheduler 320 then refers to the learning prediction noise, removes noise from the learning noisy image latent 14, and generates a learning composite image latent 19. In this case, the learning prediction noise may be a prediction of the noise to be removed from the noisy input, i.e., the learning noisy image latent 14 or the learning intermediate composite image latent, according to the time step.
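The zero convolution layers of paragraph 【0061】 can be sketched in a heavily simplified form. In the sketch below, dense layers over flat feature vectors stand in for convolutions, a single duplicated layer stands in for the copied part of the pre-trained model, and the text and time step embeddings are omitted; the class `ControlNetSketch` and its attributes are hypothetical names. How the control signal is injected into the denoising network 310 is not specified in this passage, so the sketch stops at the control signal itself.

```python
def linear(x, weight, bias):
    """Dense layer on a flat feature vector (stand-in for a convolution)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def zero_layer(n_out, n_in):
    """Weights and bias of a zero-initialized ('zero convolution') layer."""
    return [[0.0] * n_in for _ in range(n_out)], [0.0] * n_out


class ControlNetSketch:
    """ASSUMPTION: a one-layer control net built from a copied layer plus
    zero-initialized layers at its input and output."""

    def __init__(self, copied_weight, copied_bias):
        size = len(copied_weight)
        self.copied = (copied_weight, copied_bias)  # duplicated pre-trained layer
        self.zero_in = zero_layer(size, size)       # first zero layer
        self.zero_out = zero_layer(size, size)      # last zero layer

    def control_signal(self, noisy_latent, mask_latent):
        condition = linear(mask_latent, *self.zero_in)
        hidden = [a + b for a, b in zip(noisy_latent, condition)]
        hidden = linear(hidden, *self.copied)
        return linear(hidden, *self.zero_out)


net = ControlNetSketch([[1.0, 0.5], [-0.5, 2.0]], [0.1, -0.1])
noisy_latent = [0.3, -1.2]
mask_latent = [1.0, 0.0]
initial_signal = net.control_signal(noisy_latent, mask_latent)

# Once training moves the zero layers away from zero, the signal becomes active.
net.zero_out = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
active_signal = net.control_signal(noisy_latent, mask_latent)
```

With both zero layers still at their initial values, the control signal is all zeros, so the control net does not disturb the pre-trained model at the start of training.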
【0063】 Specifically, the learning device 100 uses the denoising network 310 to generate learning intermediate prediction noise by referring to the learning control information, namely (i) the learning noisy image latent 14 and (ii) the control signal (not shown) output from the control net 360. The learning device 100 inputs the learning intermediate prediction noise to the scheduler 320, which refers to the learning intermediate prediction noise to remove noise from the learning noisy image latent 14 included in the learning control information, thereby generating a learning intermediate composite image latent. The learning device 100 then inputs the learning intermediate composite image latent back into the denoising network 310 as learning control information. The learning composite image latent 19 can be generated by repeatedly performing (i) a prediction noise generation process, in which the learning intermediate composite image latent is input to the denoising network 310 and the denoising network 310 generates learning intermediate prediction noise by referring to the learning intermediate composite image latent, and (ii) a denoising process, in which the learning intermediate prediction noise is input to the scheduler 320 and the scheduler 320 removes noise from the learning intermediate composite image latent by referring to the learning intermediate prediction noise, generating the learning intermediate composite image latent corresponding to the learning time step. Here, when the learning device 100 inputs the learning intermediate composite image latent to the denoising network 310, it can also input the learning intermediate composite image latent and the learning semantic mask latent 13 to the control net 360 to generate an updated control signal, and input updated learning control information, which includes the updated control signal, to the denoising network 310.
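The alternation between the prediction noise generation process and the denoising process described above can be sketched as a loop over a reversed schedule sequence such as [980, 960, ..., 20, 0]. The specification does not name a particular sampler, so the deterministic DDIM-style update in `scheduler_step`, the linear variance schedule, and the `denoiser(latent, t)` signature are assumptions made for this illustration; a real denoising network would also receive the text embedding and the control information.

```python
import math


def alpha_bars(total_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for an ASSUMED linear schedule."""
    values, product = [], 1.0
    for i in range(total_steps):
        beta = beta_start + (beta_end - beta_start) * i / (total_steps - 1)
        product *= 1.0 - beta
        values.append(product)
    return values


def reverse_timesteps(num_steps=50, total_steps=1000):
    stride = total_steps // num_steps
    return list(range(total_steps - stride, -1, -stride))  # [980, ..., 20, 0]


def scheduler_step(latent, predicted_noise, ab_t, ab_prev):
    """ASSUMPTION: deterministic DDIM-style denoising update."""
    result = []
    for x, eps in zip(latent, predicted_noise):
        x0 = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        result.append(math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * eps)
    return result


def reverse_process(noisy_latent, denoiser, num_steps=50, total_steps=1000):
    abar = alpha_bars(total_steps)
    timesteps = reverse_timesteps(num_steps, total_steps)
    latent = noisy_latent
    for i, t in enumerate(timesteps):
        predicted = denoiser(latent, t)                       # prediction noise generation
        last = i + 1 == len(timesteps)
        ab_prev = 1.0 if last else abar[timesteps[i + 1]]
        latent = scheduler_step(latent, predicted, abar[t], ab_prev)  # denoising
    return latent


# Sanity check with an oracle "denoiser" that always returns the true noise.
clean = [0.5, -1.0, 2.0]
true_noise = [0.3, -0.7, 1.1]
abar = alpha_bars()
start = reverse_timesteps()[0]
noisy = [math.sqrt(abar[start]) * x + math.sqrt(1.0 - abar[start]) * e
         for x, e in zip(clean, true_noise)]
restored = reverse_process(noisy, lambda latent, t: true_noise)
```

With a perfect noise prediction the loop recovers the clean latent up to floating-point error, which is why training the denoising network to predict the added noise is sufficient for generation.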
【0064】 Once the learning composite image latent 19 is generated by the reverse process described above, the denoising network 310 can be trained as explained with reference to Figure 3A.
【0065】 For reference, the control net 360 can be trained in a similar manner to the denoising network 310. That is, it can be trained to minimize the loss generated by referencing the learning composite image 30 and the learning image 10. Furthermore, using various loss functions, the control net 360 can be trained to generate a control signal (not shown) by predicting the noise in the previous time step from the noise in the current time step. Through such training, the control net 360 can generate a control signal (not shown) that follows both the learning text embedding 16 and the learning semantic mask latent 13 by minimizing the error in the predicted noise.
【0066】 Once the diffusion model 300 has been trained using the learning method described above, a test can be performed using the diffusion model 300 to generate a composite image that reflects the fine-grained instance layout.
【0067】 Figure 4 schematically shows a test device for testing a controllable image generation model capable of reflecting a fine-grained instance layout, using a diffusion model 300 trained according to one embodiment of the present invention. Referring to Figure 4, the test device 200 may include a memory 210 that stores instructions for testing the diffusion model 300, which is a controllable image generation model capable of reflecting a fine-grained instance layout, and a processor 220 that executes operations for testing the diffusion model 300 according to the instructions stored in the memory 210. In this case, the diffusion model 300 is shown as being provided in the test device 200, but it is not limited to this, and the diffusion model 300 may be provided in a cloud environment or on a computing device different from the test device 200.
【0068】 Specifically, the test device 200 may achieve the desired system performance using a combination of a typical computing device (e.g., a device that may include a computer processor, memory, storage, input and output devices, and other components of conventional computing devices; electronic communication devices such as routers and switches; and electronic information storage systems such as network-attached storage (NAS) and storage area networks (SAN)) and computer software (i.e., instructions that enable the computing device to function in a particular manner).
【0069】 Furthermore, the processor 220 of the test device 200 may include hardware components such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), cache memory, and a data bus. The computing device may also further include an operating system and software components for applications that perform a specific purpose.
【0070】 However, this does not exclude the case in which the test device 200 includes an integrated processor in which a medium, a processor, and a memory for carrying out the present invention are integrated.
【0071】 A method for testing a controllable image generation model capable of reflecting a fine-grained instance layout according to one embodiment of the present invention, using the test device 200 configured in this manner, will be described below with reference to Figures 5 and 6.
【0072】 With the diffusion model 300 trained, by the learning method described above with reference to Figures 3A and 3B, to recognize the fine-grained instance layout, i.e., the fine-grained semantic mask, the test device 200 can acquire the test noisy latent 20, the test semantic mask 21, the corresponding test image caption 22, and the test time step. Here, the test semantic mask 21 may be an RGB image, as shown in Figures 8(a), (b), and (c), in which each test class in the test composite image 31 to be generated has its own unique color.
Here, the test image caption 22 may include test instance-level text for the test class corresponding to the test semantic mask 21 and test global-level text corresponding to the test semantic mask 21. For example, the test instance-level text may be "car", "dog", and "person", and the test global-level text may be "a dog and a person are walking next to a car". Furthermore, the test noisy latent 20 may be generated using a random seed, but is not limited to this. The test time step may be determined randomly by the test device 200, but is not limited to this, and may be determined by various methods, such as user settings. In this case, the test time step indicates the maximum number of iterations for removing noise, but is not limited to this.
【0073】 Subsequently, the test device 200 can perform the following processes: (i) generate a test semantic noisy latent 25 using the test semantic mask 21 and the test noisy latent 20; (ii) input the test image caption 22 to the text encoder 340 and use the text encoder 340 to encode the test image caption 22 and generate a test text embedding 26 that includes a test instance-level text embedding and a test global-level text embedding; and (iii) input the test time step to the time step encoder 350 and use the time step encoder 350 to encode the test time step and generate a test time step embedding 27.
【0074】 Here, the test device 200 can convert the test semantic mask 21 to the same resolution as the test noisy latent 20, and then perform a 1:1 mapping operation with the test noisy latent 20 to generate the test semantic noisy latent 25.
【0075】 Next, the test device 200 inputs the test semantic noisy latent 25, the test text embedding 26, and the test time step embedding 27 to the denoising network 310, causing the denoising network 310 to generate test prediction noise 28 by referring to the test semantic noisy latent 25.
The scheduler 320 then refers to the test prediction noise 28 to remove noise from the test semantic noisy latent 25 and generate a test composite image latent 29. In this case, the test prediction noise may be a prediction of the noise to be removed from the noisy input, i.e., the test semantic noisy latent 25 or the test intermediate composite image latent, according to the test time step embedding.
【0076】 Specifically, the test device 200 uses the denoising network 310 to generate test intermediate prediction noise by referring to the test semantic noisy latent 25, inputs the test intermediate prediction noise to the scheduler 320, causes the scheduler 320 to remove noise from the test semantic noisy latent 25 by referring to the test intermediate prediction noise to generate a test intermediate composite image latent, and inputs the test intermediate composite image latent to the denoising network 310 as the test semantic noisy latent 25. The test composite image latent 29 can be generated by repeating (i) a prediction noise generation process, in which the denoising network 310 generates test intermediate prediction noise by referring to the test intermediate composite image latent, and (ii) a denoising process, in which the test intermediate prediction noise is input to the scheduler 320 and the scheduler 320 removes noise from the test intermediate composite image latent by referring to the test intermediate prediction noise, generating the test intermediate composite image latent corresponding to the test time step. Here, when the test device 200 inputs the test intermediate composite image latent to the denoising network 310, it can also perform a 1:1 mapping between the test intermediate composite image latent and the test semantic mask latent 23, and input the resulting updated test semantic noisy latent 25 to the denoising network 310.
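The resolution conversion and 1:1 mapping of paragraph 【0074】 admit more than one reading. The sketch below takes one plausible interpretation, stated here as an assumption: the mask is downsampled to the latent resolution with nearest-neighbor sampling, and each latent position is then paired with the mask label at the same position. The function names and the use of class-name labels instead of RGB values are illustrative choices, not details from the specification.

```python
def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D label grid to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def map_one_to_one(noisy_latent, mask_small):
    """ASSUMED reading of the 1:1 mapping: pair every latent position with
    the mask label at the same (row, column) position."""
    return [
        [(value, mask_small[i][j]) for j, value in enumerate(row)]
        for i, row in enumerate(noisy_latent)
    ]


# 4 x 4 test semantic mask with one class per quadrant.
test_mask = [
    ["car", "car", "dog", "dog"],
    ["car", "car", "dog", "dog"],
    [None, None, "person", "person"],
    [None, None, "person", "person"],
]
test_noisy_latent = [[0.12, -0.80], [1.30, 0.05]]   # 2 x 2 latent, one channel

mask_small = resize_nearest(test_mask, 2, 2)
semantic_noisy_latent = map_one_to_one(test_noisy_latent, mask_small)
```

Nearest-neighbor sampling is chosen here because it never blends two class labels into a third value, which matters when each color or label identifies exactly one class.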
【0077】 At this time, an attention mask can be applied to the attention calculations included in the denoising network 310 by referencing the test instance-level text embedding and the test global-level text embedding, so that the layout shown in the test semantic mask, such as (a), (b), and (c) in Figure 8, which is provided as control information for generating the test composite image, is appropriately reflected.
【0078】 Specifically, referring to Figure 7, the test device 200 repeats the prediction noise generation process and the denoising process as described above in accordance with the test time step. (i) In the initial denoising stage, in which these processes are repeated up to the k-th time, the output calculation of each layer included in the denoising network 310 includes (i-1) an attention calculation for the portion corresponding to the instance-level mask, generated by mapping the test instance-level text embedding and the test noisy semantic mask latent, or the test instance-level text embedding and the test intermediate composite image latent, and (i-2) an attention calculation of the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent. (ii) In the later denoising stage, in which the prediction noise generation process and the denoising process are repeated k times or more, the output calculation of each layer included in the denoising network 310 includes an attention calculation of the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent. In this way, the semantic mask can be reflected in fine grain, down to the instance level.
【0079】 In other words, the denoising network 310 can perform multiple attention operations. Through self-attention operations, it can organize how the parts of the image relate to one another while a composite image is generated from noise, and through cross-attention operations it can control generation according to external conditions such as text captions. Figure 7 shows an example of the self-attention operations that are among the operations the denoising network 310 performs to predict noise. Specifically, Figure 7 shows, in 2D form, the noisy latent space mapped by the method of the present invention. In the initial denoising stage shown in Figure 7(a), the process includes self-attention calculations performed by applying an attention mask to the portion corresponding to the instance-level mask in the noisy semantic mask latent, which is obtained by mapping the test semantic mask latent to the test noisy latent, together with self-attention calculations applied to the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent. In the later denoising stage shown in Figure 7(b), however, only the self-attention calculations applied to the test global-level text embedding and the test noisy semantic mask latent, or the test global-level text embedding and the test intermediate composite image latent, are performed.
【0080】 Here, k, the iteration count that separates the initial denoising stage from the later denoising stage, may be an integer greater than or equal to 1 and less than or equal to the test time step, and may be determined by the user, but is not limited to this.
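The two-stage attention behavior separated by k can be illustrated with a small mask-building sketch. It is one possible reading rather than the implementation of the invention: attention is modeled between flattened latent positions and text tokens, the global-level token is always visible, instance-level tokens are visible only inside their own mask region and only during the initial stage, and iteration k itself is assigned to the initial stage (the specification leaves this boundary ambiguous). All function names are hypothetical.

```python
import math


def build_attention_mask(label_map, instance_tokens, iteration, k):
    """Rows: flattened latent positions. Columns: [global token] + instance tokens.
    ASSUMPTION: iterations 1..k form the initial denoising stage."""
    initial_stage = iteration <= k
    rows = []
    for row in label_map:
        for label in row:
            allowed = [True]  # the global-level token is always attended to
            for token in instance_tokens:
                allowed.append(initial_stage and token == label)
            rows.append(allowed)
    return rows


def masked_softmax(scores, allowed):
    """Softmax over the allowed entries only; masked entries get weight 0."""
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


label_map = [["car", "dog"], [None, "person"]]   # mask at latent resolution
tokens = ["car", "dog", "person"]

initial_mask = build_attention_mask(label_map, tokens, iteration=1, k=10)
later_mask = build_attention_mask(label_map, tokens, iteration=11, k=10)

scores = [1.0, 2.0, 3.0, 4.0]                    # one position's raw scores
initial_weights = masked_softmax(scores, initial_mask[0])   # "car" position
later_weights = masked_softmax(scores, later_mask[0])
```

In the initial stage the "car" position distributes its attention between the global token and the "car" token only; in the later stage all of its attention goes to the global token, which matches the shift from instance-level layout control toward overall image fidelity described above.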
【0081】 After generating the test composite image latent 29 using the method described above, the test device 200 can input the test composite image latent 29 into the decoder of the VAE 330, and the decoder of the VAE 330 can decode the test composite image latent 29 to generate the test composite image 31.
【0082】 The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and may be recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, etc., individually or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or they may be publicly known and available to a person skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROMs, RAMs, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.
【0083】 Although the present invention has been described above with specific details such as concrete components, and with limited embodiments and drawings, these are provided only to aid a more general understanding of the invention, and the invention is not limited to the above embodiments.
A person with ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.
【0084】 Therefore, the concept of the present invention should not be limited to the embodiments described above, and the claims described below, as well as all modifications equal or equivalent to those claims, shall fall within the scope of the concept of the present invention.

Claims

[Claim 1] A training method for a controllable image generation model capable of reflecting a fine-grained instance layout, the method comprising:
(a) when a training image and a training time step are acquired, a training device inputs the training image to an encoder of a Variational Autoencoder (VAE), causes the encoder of the VAE to generate a training image latent, and generates, via a scheduler, a training noisy image latent by repeatedly adding noise to the training image latent according to the training time step;
(b) when a training semantic mask and a training image caption corresponding to the training image are acquired (the training image caption including training instance-level text for the training class corresponding to the training semantic mask and training global-level text corresponding to the training image), the training device performs: a process of inputting the training semantic mask to the encoder of the VAE, causing the encoder of the VAE to generate a training semantic mask latent, and generating training control information using the training noisy image latent and the training semantic mask latent; a process of inputting the training image caption to a text encoder and causing the text encoder to generate a training text embedding that includes a training instance-level text embedding and a training global-level text embedding; and a process of inputting the training time step to a time step encoder and causing the time step encoder to encode the training time step into a training time step embedding;
(c) the training device inputs the training time step embedding, the training text embedding, and the training control information to a denoising network, causes the denoising network to generate training prediction noise by referring to the training control information, and causes the scheduler to remove noise from the training control information by referring to the training prediction noise, thereby generating a training composite image latent, wherein the training composite image latent is generated by repeating, according to the training time step embedding, a prediction noise generation process in which the denoising network generates training intermediate prediction noise by referring to a training intermediate composite image latent, and a denoising process in which the scheduler generates the training intermediate composite image latent from which noise has been removed by referring to the training intermediate prediction noise; and
(d) the training device inputs the training composite image latent to a decoder of the VAE, causes the decoder of the VAE to decode the training composite image latent into a training composite image, and trains the denoising network to minimize a loss generated by referring to the training composite image and the training image.

[Claim 2] The method according to claim 1, wherein in step (b), the training device generates the training control information by concatenating the training noisy image latent and the training semantic mask latent channel by channel.

[Claim 3] The method according to claim 2, wherein in step (c), the training device causes the prediction noise generation process to be performed after adding zero-initialized weights to the first layer of the denoising network, the added weights corresponding to the number of channels increased by concatenating the training noisy image latent and the training semantic mask latent.
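
The channel-wise concatenation of claim 2 and the zero-initialized weights of claim 3 can be illustrated with a minimal sketch. The claims do not specify the layer type or channel counts, so the example assumes the first layer acts as a 1x1 convolution (a per-pixel linear map over channels) and uses hypothetical channel counts `C_IMG`, `C_MASK`, and `C_OUT`; all function names are illustrative only.

```python
import random

def conv1x1(weights, pixel):
    """Apply a 1x1 convolution (a per-pixel linear map) to one pixel's channels."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]

def expand_input_channels(weights, extra_channels):
    """Append zero-initialized input weights for the channels added by concatenation."""
    return [row + [0.0] * extra_channels for row in weights]

random.seed(0)
C_IMG, C_MASK, C_OUT = 4, 4, 3  # hypothetical channel counts, not from the claims

# Stand-in for the pretrained first-layer weights, shaped [C_OUT][C_IMG].
pretrained = [[random.uniform(-1, 1) for _ in range(C_IMG)] for _ in range(C_OUT)]
expanded = expand_input_channels(pretrained, C_MASK)

# One spatial position of each latent.
noisy_latent_px = [random.gauss(0, 1) for _ in range(C_IMG)]
mask_latent_px = [random.gauss(0, 1) for _ in range(C_MASK)]

# Channel-wise concatenation (claim 2): image channels first, mask channels after.
control_px = noisy_latent_px + mask_latent_px

before = conv1x1(pretrained, noisy_latent_px)  # original layer, image latent only
after = conv1x1(expanded, control_px)          # expanded layer, concatenated input
```

Under these assumptions the expanded layer initially produces the same output as the pretrained layer, because the new mask channels are multiplied by zero; the mask can then start to influence the output only as training updates those weights.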
[Claim 4] The method according to claim 1, wherein in step (b), the training device inputs the training noisy image latent, the training semantic mask latent, the training text embedding, and the training time step embedding to a control net, causes the control net to generate a control signal, and generates the training control information including the control signal and the training noisy image latent.

[Claim 5] The method according to claim 4, wherein the control net is generated by replicating some of the layers of a pre-trained diffusion model and adding zero convolution layers to the first layer and the last layer of the control net.

[Claim 6] The method according to claim 1, wherein the training semantic mask is an RGB image in which each of the training classes included in the training image is assigned its own unique color.

[Claim 7] A test method for a controllable image generation model capable of reflecting a fine-grained instance layout, the method comprising:
(a) in a state in which a training device has trained a denoising network by performing
(i) a subprocess of, when a training image and a training time step are acquired, inputting the training image to an encoder of a Variational Autoencoder (VAE), causing the encoder of the VAE to generate a training image latent, and generating, via a scheduler, a training noisy image latent by repeatedly adding noise to the training image latent according to the training time step,
(ii) a subprocess of, when a training semantic mask and a training image caption corresponding to the training image are acquired (the training image caption including training instance-level text for the training class corresponding to the training semantic mask and training global-level text corresponding to the training image), performing a process of inputting the training semantic mask to the encoder of the VAE, causing the encoder of the VAE to generate a training semantic mask latent, and generating training control information using the training noisy image latent and the training semantic mask latent, a process of inputting the training image caption to a text encoder and causing the text encoder to generate a training text embedding that includes a training instance-level text embedding and a training global-level text embedding, and a process of inputting the training time step to a time step encoder and causing the time step encoder to encode the training time step into a training time step embedding,
(iii) a subprocess of inputting the training time step embedding, the training text embedding, and the training control information to the denoising network, causing the denoising network to generate training prediction noise by referring to the training control information, and causing the scheduler to remove noise from the training control information by referring to the training prediction noise, thereby generating a training composite image latent, wherein the training composite image latent is generated by repeating, according to the training time step embedding, a prediction noise generation process in which the denoising network generates training intermediate prediction noise by referring to a training intermediate composite image latent, and a denoising process in which the scheduler generates the training intermediate composite image latent from which noise has been removed by referring to the training intermediate prediction noise, and
(iv) a subprocess of inputting the training composite image latent to a decoder of the VAE, causing the decoder of the VAE to decode the training composite image latent into a training composite image, and training the denoising network to minimize a loss generated by referring to the training composite image and the training image,
a test device acquires a test noisy latent, a test semantic mask, a test image caption corresponding to the test semantic mask (the test image caption including test instance-level text for the test class corresponding to the test semantic mask and test global-level text corresponding to the test semantic mask), and a test time step;
(b) the test device performs: a process of generating a test semantic noisy latent using the test semantic mask and the test noisy latent; a process of inputting the test image caption to the text encoder and causing the text encoder to generate a test text embedding that includes a test instance-level text embedding and a test global-level text embedding; and a process of inputting the test time step to the time step encoder and causing the time step encoder to encode the test time step into a test time step embedding;
(c) the test device inputs the test semantic noisy latent, the test text embedding, and the test time step embedding to the denoising network, causes the denoising network to generate test prediction noise by referring to the test semantic noisy latent, and causes the scheduler to remove noise from the test semantic noisy latent by referring to the test prediction noise, thereby generating a test composite image latent, wherein the test composite image latent is generated by repeating, according to the test time step embedding, a prediction noise generation process in which the denoising network generates test intermediate prediction noise by referring to a test intermediate composite image latent, and a denoising process in which the scheduler generates the test intermediate composite image latent from which noise has been removed by referring to the test intermediate prediction noise; and
(d) the test device inputs the test composite image latent to the decoder of the VAE and causes the decoder of the VAE to decode the test composite image latent into a test composite image.

[Claim 8] The method according to claim 7, wherein in step (b), the test device converts the test semantic mask to the same resolution as the test noisy latent and then performs a 1:1 mapping operation with the test noisy latent to generate the test semantic noisy latent.

[Claim 9] The method according to claim 8, wherein in step (c), the test device repeats the prediction noise generation process and the denoising process according to the test time step such that:
(i) in an initial denoising stage, in which the prediction noise generation process and the denoising process are repeated up to k times (k being a preset integer of 1 or more), the output computation of each layer included in the denoising network includes (i-1) an attention operation on the portion corresponding to an instance-level mask generated by mapping the test instance-level text embedding to the test semantic noisy latent, or the test instance-level text embedding to the test intermediate composite image latent, and (i-2) an attention operation on the test global-level text embedding and the test semantic noisy latent, or on the test global-level text embedding and the test intermediate composite image latent; and
(ii) in a later denoising stage, in which the prediction noise generation process and the denoising process are repeated k times or more, the output computation of each layer included in the denoising network includes an attention operation on the test global-level text embedding and the test semantic noisy latent, or on the test global-level text embedding and the test intermediate composite image latent.

[Claim 10] The method according to claim 7, wherein the test semantic mask is an RGB image in which each test class to be generated is assigned its own unique color.

[Claim 11] A training device for a controllable image generation model capable of reflecting a fine-grained instance layout, the training device comprising:
one or more memories storing instructions; and
one or more processors configured to execute the instructions,
wherein the processor performs:
(I) a process of, when a training image and a training time step are acquired, inputting the training image to an encoder of a Variational Autoencoder (VAE), causing the encoder of the VAE to generate a training image latent, and generating, via a scheduler, a training noisy image latent by repeatedly adding noise to the training image latent according to the training time step;
(II) when a training semantic mask and a training image caption corresponding to the training image are acquired (the training image caption including training instance-level text for the training class corresponding to the training semantic mask and training global-level text corresponding to the training image), a process of inputting the training semantic mask to the encoder of the VAE, causing the encoder of the VAE to generate a training semantic mask latent, and generating training control information using the training noisy image latent and the training semantic mask latent, a process of inputting the training image caption to a text encoder and causing the text encoder to generate a training text embedding that includes a training instance-level text embedding and a training global-level text embedding, and a process of inputting the training time step to a time step encoder and causing the time step encoder to encode the training time step into a training time step embedding;
(III) a process of inputting the training time step embedding, the training text embedding, and the training control information to a denoising network, causing the denoising network to generate training prediction noise by referring to the training control information, and causing the scheduler to remove noise from the training control information by referring to the training prediction noise, thereby generating a training composite image latent, wherein the training composite image latent is generated by repeating, according to the training time step embedding, a prediction noise generation process in which the denoising network generates training intermediate prediction noise by referring to a training intermediate composite image latent, and a denoising process in which the scheduler generates the training intermediate composite image latent from which noise has been removed by referring to the training intermediate prediction noise; and
(IV) a process of inputting the training composite image latent to a decoder of the VAE, causing the decoder of the VAE to decode the training composite image latent into a training composite image, and training the denoising network to minimize a loss generated by referring to the training composite image and the training image.

[Claim 12] The training device according to claim 11, wherein in process (II), the processor generates the training control information by concatenating the training noisy image latent and the training semantic mask latent channel by channel.
[Claim 13] The training device according to claim 12, wherein in process (III), the processor causes the prediction noise generation process to be performed after adding zero-initialized weights to the first layer of the denoising network, the added weights corresponding to the number of channels increased by concatenating the training noisy image latent and the training semantic mask latent.

[Claim 14] The training device according to claim 11, wherein in process (II), the processor inputs the training noisy image latent, the training semantic mask latent, the training text embedding, and the training time step embedding to a control net, causes the control net to generate a control signal, and generates the training control information including the control signal and the training noisy image latent.

[Claim 15] The training device according to claim 14, wherein the control net is generated by replicating some of the layers of a pre-trained diffusion model and adding zero convolution layers to the first layer and the last layer of the control net.

[Claim 16] The training device according to claim 11, wherein the training semantic mask is an RGB image in which each training class included in the training image is assigned its own unique color.
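
The RGB semantic mask of claims 6 and 16 can be sketched as a class-to-color lookup applied to a per-pixel label map. The claims only require that each class has its own unique color; the palette construction, the black background, and the names `class_palette` and `render_mask` below are assumptions made for illustration.

```python
def class_palette(class_names):
    """Assign each class a distinct RGB color from a fixed grid of color levels."""
    levels = [0, 64, 128, 192, 255]
    # Enumerate the grid and skip pure black, which is reserved for background here.
    colors = [(r, g, b) for r in levels for g in levels for b in levels][1:]
    if len(class_names) > len(colors):
        raise ValueError("too many classes for this palette")
    return dict(zip(class_names, colors))

def render_mask(label_map, palette, background=(0, 0, 0)):
    """Turn a 2D map of class names (None = background) into an RGB mask."""
    return [[palette.get(label, background) for label in row] for row in label_map]

# Hypothetical 2x3 label map; every pixel of the same class shares one color.
label_map = [
    ["road", "road", "car"],
    [None, "person", "car"],
]
palette = class_palette(["road", "car", "person"])
rgb_mask = render_mask(label_map, palette)
```

Because the mask is an ordinary RGB image, it can be passed through the same VAE encoder as the training image, which is what steps (b) and (II) rely on.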
[Claim 17] A test device for a controllable image generation model capable of reflecting a fine-grained instance layout, the test device comprising:
one or more memories storing instructions; and
one or more processors configured to execute the instructions,
wherein, in a state in which a training device has trained a denoising network by performing
(i) a subprocess of, when a training image and a training time step are acquired, inputting the training image to an encoder of a Variational Autoencoder (VAE), causing the encoder of the VAE to generate a training image latent, and generating, via a scheduler, a training noisy image latent by repeatedly adding noise to the training image latent according to the training time step,
(ii) a subprocess of, when a training semantic mask and a training image caption corresponding to the training image are acquired (the training image caption including training instance-level text for the training class corresponding to the training semantic mask and training global-level text corresponding to the training image), performing a process of inputting the training semantic mask to the encoder of the VAE, causing the encoder of the VAE to generate a training semantic mask latent, and generating training control information using the training noisy image latent and the training semantic mask latent, a process of inputting the training image caption to a text encoder and causing the text encoder to generate a training text embedding that includes a training instance-level text embedding and a training global-level text embedding, and a process of inputting the training time step to a time step encoder and causing the time step encoder to encode the training time step into a training time step embedding,
(iii) a subprocess of inputting the training time step embedding, the training text embedding, and the training control information to the denoising network, causing the denoising network to generate training prediction noise by referring to the training control information, and causing the scheduler to remove noise from the training control information by referring to the training prediction noise, thereby generating a training composite image latent, wherein the training composite image latent is generated by repeating, according to the training time step embedding, a prediction noise generation process in which the denoising network generates training intermediate prediction noise by referring to a training intermediate composite image latent, and a denoising process in which the scheduler generates the training intermediate composite image latent from which noise has been removed by referring to the training intermediate prediction noise, and
(iv) a subprocess of inputting the training composite image latent to a decoder of the VAE, causing the decoder of the VAE to decode the training composite image latent into a training composite image, and training the denoising network to minimize a loss generated by referring to the training composite image and the training image,
the processor performs:
(I) a process of acquiring a test noisy latent, a test semantic mask, a test image caption corresponding to the test semantic mask (the test image caption including test instance-level text for the test class corresponding to the test semantic mask and test global-level text corresponding to the test semantic mask), and a test time step;
(II) a process of generating a test semantic noisy latent using the test semantic mask and the test noisy latent, a process of inputting the test image caption to the text encoder and causing the text encoder to generate a test text embedding that includes a test instance-level text embedding and a test global-level text embedding, and a process of inputting the test time step to the time step encoder and causing the time step encoder to encode the test time step into a test time step embedding;
(III) a process of inputting the test semantic noisy latent, the test text embedding, and the test time step embedding to the denoising network, causing the denoising network to generate test prediction noise by referring to the test semantic noisy latent, and causing the scheduler to remove noise from the test semantic noisy latent by referring to the test prediction noise, thereby generating a test composite image latent, wherein the test composite image latent is generated by repeating, according to the test time step embedding, a prediction noise generation process in which the denoising network generates test intermediate prediction noise by referring to a test intermediate composite image latent, and a denoising process in which the scheduler generates the test intermediate composite image latent from which noise has been removed by referring to the test intermediate prediction noise; and
(IV) a process of inputting the test composite image latent to the decoder of the VAE and causing the decoder of the VAE to decode the test composite image latent into a test composite image.

[Claim 18] The test device according to claim 17, wherein in process (II), the processor converts the test semantic mask to the same resolution as the test noisy latent and then performs a 1:1 mapping operation with the test noisy latent to generate the test semantic noisy latent.
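
The resolution conversion and 1:1 mapping of claims 8 and 18 can be sketched as follows. The claims do not define the resizing method or the mapping operation, so this example assumes nearest-neighbor resizing (which keeps class colors unblended) and interprets the 1:1 mapping as pairing each latent position with the mask value at the same position by appending it as extra channels. The helper names and the toy 4x4 mask and 2x2 latent are hypothetical, and raw RGB values stand in for whatever mask encoding an actual system would use.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbor resize of a 2D grid to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def map_one_to_one(latent, mask):
    """Pair each latent position with the mask value at the same position."""
    if len(latent) != len(mask) or any(len(a) != len(b) for a, b in zip(latent, mask)):
        raise ValueError("latent and mask must share the same resolution")
    return [
        [tuple(lat) + tuple(m) for lat, m in zip(lat_row, mask_row)]
        for lat_row, mask_row in zip(latent, mask)
    ]

RED, GREEN, BLACK = (255, 0, 0), (0, 255, 0), (0, 0, 0)
test_mask = [  # 4x4 RGB semantic mask, one color per class
    [RED, RED, GREEN, GREEN],
    [RED, RED, GREEN, GREEN],
    [BLACK, BLACK, GREEN, GREEN],
    [BLACK, BLACK, GREEN, GREEN],
]
noisy_latent = [  # 2x2 latent with 4 channels per position
    [(0.3, -1.2, 0.8, 0.1), (1.1, 0.4, -0.6, 0.0)],
    [(-0.9, 0.2, 0.5, 1.3), (0.7, -0.3, -1.0, 0.6)],
]

small_mask = resize_nearest(test_mask, 2, 2)
semantic_noisy_latent = map_one_to_one(noisy_latent, small_mask)
```

Each position of `semantic_noisy_latent` then carries both the noise channels and the class color for that location, so the layout constraint is available to the denoising network from the first iteration.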
[Claim 19] The test device according to claim 18, wherein in process (III), the processor repeats the prediction noise generation process and the denoising process according to the test time step such that:
(i) in an initial denoising stage, in which the prediction noise generation process and the denoising process are repeated up to k times (k being a preset integer of 1 or more), the output computation of each layer included in the denoising network includes (i-1) an attention operation on the portion corresponding to an instance-level mask generated by mapping the test instance-level text embedding to the test semantic noisy latent, or the test instance-level text embedding to the test intermediate composite image latent, and (i-2) an attention operation on the test global-level text embedding and the test semantic noisy latent, or on the test global-level text embedding and the test intermediate composite image latent; and
(ii) in a later denoising stage, in which the prediction noise generation process and the denoising process are repeated k times or more, the output computation of each layer included in the denoising network includes an attention operation on the test global-level text embedding and the test semantic noisy latent, or on the test global-level text embedding and the test intermediate composite image latent.

[Claim 20] The test device according to claim 17, wherein the test semantic mask is an RGB image in which each test class to be generated is assigned its own unique color.
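
The two-stage attention schedule of claims 9 and 19 can be sketched for a single latent position. Several points are assumptions rather than claim language: iterations are counted from 1 and the first k iterations form the initial stage; the instance-level and global-level attention outputs are combined by addition; token vectors serve as both keys and values; and `attend` and `layer_output` are hypothetical names.

```python
import math

def attend(query, keys, values):
    """Single-query softmax attention over the given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def layer_output(query, instance_id, iteration, k, instance_tokens, global_tokens):
    """Text conditioning for one latent position at a 1-based denoising iteration.

    instance_tokens maps an instance id to its instance-level token vectors;
    global_tokens holds the global-level token vectors (hypothetical layout).
    """
    if k < 1:
        raise ValueError("k must be an integer of 1 or more")
    # Global-level attention is used in both stages.
    out = attend(query, global_tokens, global_tokens)
    # Initial stage only: add attention restricted to the instance covering this position.
    if iteration <= k and instance_id in instance_tokens:
        tokens = instance_tokens[instance_id]
        inst = attend(query, tokens, tokens)
        out = [g + i for g, i in zip(out, inst)]  # assumed way of combining the two
    return out

instance_tokens = {"car": [[1.0, 0.0]], "person": [[0.0, 1.0]]}
global_tokens = [[0.5, 0.5]]
query = [1.0, 0.0]

early = layer_output(query, "car", 1, 2, instance_tokens, global_tokens)
late = layer_output(query, "car", 3, 2, instance_tokens, global_tokens)
background = layer_output(query, None, 1, 2, instance_tokens, global_tokens)
```

In this sketch the instance-level text shapes only the early iterations, when the layout is being established, and the later iterations are conditioned on the global-level text alone.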