Method, apparatus, medium, and program product for generating bounding boxes for image targets
By utilizing self-attention and cross-attention features with latent vectors, accurate bounding boxes are generated, addressing the issue of limited training data and enhancing target detection performance.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- NTT DOCOMO INC
- Filing Date
- 2025-11-21
- Publication Date
- 2026-06-15
AI Technical Summary
Target detection performance degrades when training data is limited, and diffusion models generate images lacking bounding boxes, hindering their application in target detection tasks.
Generate bounding boxes using self-attention and cross-attention features, along with latent vectors, to obtain accurate edge information for targets, enhancing feature maps and improving bounding box accuracy.
Accurate bounding boxes are generated, enabling effective training of target detectors and improving target detection capabilities.
Description
【Technical Field】 【0001】 Embodiments of the present disclosure relate to the field of computer vision, and more specifically, to a method for generating a bounding box of an image target, an electronic device, a non-transitory computer-readable storage medium, and a computer program product. 【Background Art】 【0002】 In the field of computer vision, the main purpose of target detection is to enable a computer to automatically identify the position of a target in an image or video frame and recognize its category. Recognition may be assisted by drawing a bounding box (bounding box, bbox) around the target to indicate the position of each target, or a bounding box may be drawn around the target after the target is recognized. The generation of the bounding box is an important step in the target detection task. 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0003】 Target detection technology has various application scenarios in the field of computer vision, such as unlocking a smartphone by face authentication, detecting products in a vending machine, and detecting pedestrians during autonomous driving. However, the performance of target detection significantly degrades when the training data is limited. In recent years, diffusion models have been used to generate more diverse images and have become an effective data augmentation technique. However, the images generated by the diffusion model lack the bounding box of the corresponding target and thus cannot be directly applied to the training of target detection. This is one of the main problems faced by the current target detection field. 
【Means for Solving the Problems】
【0004】 According to one aspect of the present disclosure, at least one embodiment provides a method for generating a bounding box for an image target. The method includes: obtaining a feature map based on at least one of a self-attention feature and a cross-attention feature, together with one or more latent vectors, which are obtained in a process of generating an image containing the image target or in a process of adding noise to an image and regenerating the image; and obtaining a bounding box for the image target based on the feature map.
【0005】 According to one aspect of the present disclosure, at least one embodiment provides a bounding box generation system for an image target, comprising: a feature map acquisition means configured to acquire a feature map based on at least one of self-attention features and cross-attention features, together with one or more latent vectors, which are acquired in a process of generating an image containing the image target or in a process of adding noise to an image and regenerating the image; and a bounding box acquisition means configured to acquire a bounding box for the image target based on the feature map.
【0006】 According to another aspect of the present disclosure, at least one embodiment provides an electronic device including a memory for storing computer instructions and a processor for reading the computer instructions stored in the memory and performing a method according to at least one embodiment of the present disclosure.
【0007】 According to another aspect of the present disclosure, at least one embodiment provides a non-transitory computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform a method according to at least one embodiment of the present disclosure.
【0008】 According to another aspect of the present disclosure, at least one embodiment provides a computer program product that includes computer instructions which, when executed by a processor, cause the processor to perform a method according to at least one embodiment of the present disclosure.
【Effects of the Invention】
【0009】 According to the aspects and embodiments of this disclosure, the self-attention features and / or cross-attention features of the image, together with latent vector features, are acquired simultaneously in a process of generating an image containing an image target or a process of adding noise to an image and regenerating the image. Using these features yields more accurate edge information of the target and therefore a feature map that more accurately represents the edge features of the target, from which a more accurate bounding box of the target can be generated.
【0010】 To illustrate the embodiments of this disclosure or related technologies more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of this disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
【Brief Description of the Drawings】
【0011】
[Figure 1] Figure 1 illustrates the bounding box of an image target according to at least one embodiment of the present disclosure.
[Figure 2] Figure 2 shows a schematic diagram of an embodiment for generating a bounding box for an image target according to at least one embodiment of the present disclosure.
[Figure 3] Figure 3 shows a flowchart of a method for generating a bounding box for an image target according to at least one embodiment of the present disclosure.
[Figure 4] Figure 4 shows exemplary images of a self-attention feature map, cross-attention feature map, and latent vector feature map of an exemplary image relating to at least one embodiment of the present disclosure.
[Figure 5] Figure 5 shows an exemplary latent vector map of one or more latent vectors relating to at least one embodiment of the present disclosure.
[Figure 6A] Figure 6A shows an exemplary diagram illustrating how self-attention features and cross-attention features according to at least one embodiment of the present disclosure are integrated to generate integrated attention features, and how the target latent vector and integrated attention features are input to one or more neural networks to obtain a feature map.
[Figure 6B] Figure 6B shows a schematic diagram illustrating how a feature map is obtained by inputting a target latent vector and self-attention features according to at least one embodiment of the present disclosure into one or more neural networks.
[Figure 6C] Figure 6C shows a schematic diagram illustrating how a feature map is obtained by inputting a target latent vector and cross-attention features according to at least one embodiment of the present disclosure into one or more neural networks.
[Figure 6D] Figure 6D shows a schematic diagram illustrating how a feature map is obtained by inputting a target latent vector, self-attention features, and cross-attention features according to at least one embodiment of the present disclosure into one or more neural networks.
[Figure 7A] Figure 7A shows an exemplary diagram of the quantitative distribution of feature maps and corresponding pixel values output from one or more neural networks relating to at least one embodiment of the present disclosure.
[Figure 7B] Figure 7B shows an exemplary diagram of the quantitative distribution of feature maps and corresponding pixel values output from one or more neural networks relating to at least one embodiment of the present disclosure.
[Figure 7C] Figure 7C shows an exemplary diagram of the quantitative distribution of feature maps and corresponding pixel values output from one or more neural networks relating to at least one embodiment of the present disclosure.
[Figure 8] Figure 8 shows the differences between the bounding box obtained by the technical solution of at least one embodiment of the present disclosure, the actual bounding box, and the bounding box obtained by the prior art solution.
[Figure 9] Figure 9 shows a block diagram of a system for generating a bounding box of an image target according to at least one embodiment of the present disclosure.
[Figure 10] Figure 10 shows a block diagram of an exemplary electronic device according to at least one embodiment of the present disclosure.
【DETAILED DESCRIPTION OF THE INVENTION】
【0012】 Reference will now be made in detail to specific embodiments of the present disclosure, examples of which are illustrated in the drawings. The present application is described with reference to specific embodiments, but it is not intended to be limited to the described embodiments. Rather, the present disclosure is intended to cover modifications, alterations, and equivalents that are within its spirit and scope. The method steps described herein may be implemented by any functional module or functional arrangement, and any functional module or functional arrangement may be implemented as a physical entity, a logical entity, or a combination of both.
【0013】 As used herein, "a plurality" means two or more. "And / or" describes the relationship between related objects and covers three cases. For example, A and / or B may mean that only A exists, that A and B exist simultaneously, or that only B exists. The symbol " / " generally indicates that the related objects before and after it are in an "or" relationship.
【0014】 Note that similar reference numerals and letters refer to similar items in the following drawings. Therefore, once an item is defined in one drawing, there is no need to re-explain the definition of that item in subsequent drawings. 【0015】 FIG. 1 is a diagram illustrating a bounding box of an image target according to at least one embodiment of the present disclosure. 【0016】 FIG. 1 shows images 1, 2, and 3 that include targets of multiple types of serows. As shown in FIG. 1, generation of bounding box 1 for the target in image 1, generation of bounding box 2 for the target in image 2, and generation of bounding box 3 for the target in image 3 are each desired. However, in practice, due to the complex background of the image and varying illumination conditions, it may not be possible to accurately generate the bounding box of the target. For example, the tip of the long horn of the animal in image 1 or the limbs covered with grass in image 2 may be difficult to recognize within the bounding box range of the target. How to accurately generate the bounding box of the image target is an issue that needs to be resolved urgently. 【0017】 According to at least one embodiment of the present disclosure, in the process of generating an image including an image target or the process of adding noise to an image to regenerate the image, the self-attention feature and / or cross-attention feature of the image obtained simultaneously are utilized, and in addition, the latent vector feature is utilized to obtain more accurate edge information of the target, thereby obtaining a feature map that more accurately represents the edge feature of the target, and more accurately generating the bounding box of the target based on the more accurate feature map. The generated image, the generated bounding box, and / or the prompt may be used as training data for training the target detector. 
In this way, the target detector can give the recognition result and / or the bounding box of the recognition result. 【0018】 FIG. 2 shows a schematic diagram of an embodiment for generating a bounding box of an image target according to at least one embodiment of the present disclosure. 【0019】 As shown in Figure 2, in the process of generating an image using a generative neural network 202, text 201 (also called a prompt) for generating the image may be input. Based on this text, the generative neural network may generate an image 206 corresponding to the text. For example, if "squirrel" is input, an image 206 containing the target "squirrel" may be generated, and if "antelope" is input, an image 206 containing the target "antelope" may be generated. Of course, the generative neural network may generate an image containing two or more targets based on two or more input texts, and there is no limit to the number here. 【0020】 The generative neural network 202 utilizes a self-attention layer 203, a cross-attention layer 204, and a latent vector layer 205 in the process of generating an image. The self-attention feature map and / or cross-attention feature map and latent vector feature map automatically generated by these layers may be used by an image bounding box generating technique (such as a method or system) according to at least one embodiment of this disclosure to output the bounding box of the generated image 206. In this way, the generative neural network 202 can simultaneously generate an image and its bounding box in the process of generating an image. The generated image, its bounding box, and / or recognition result information may be used as training data to train a target detector. 【0021】 Alternatively, if it is necessary to generate a generated image related to an existing input image, the existing image may include an antelope. 
When noise is added to the existing image and text for generating the image (e.g., antelope) 201' is also input to the generative neural network 202, an image 206 containing a single target, an antelope, may be generated. The image 206 may be related to the input 201', the same as the input 201', or different from the input 201'. The type of image to be generated is not limited here, as only a few features acquired in the image generation process need to be considered. That is, in the process by which the generative neural network 202 generates an image, the self-attention feature map and / or cross-attention feature map and latent vector feature map, which are automatically generated by the self-attention layer 203, cross-attention layer 204, latent vector layer 205, may still be used by a technique (method or system, etc.) for generating an image bounding box according to at least one embodiment of this disclosure, so as to output the bounding box of the generated image 206. 【0022】 Figure 3 shows a flowchart of a method 300 for generating a bounding box for an image target according to at least one embodiment of the present disclosure. 【0023】 The method 300 for generating a bounding box for an image target may include steps 310 and 320. 【0024】 In step 310, a feature map is obtained based on at least one of the self-attention features and cross-attention features, as well as one or more latent vectors, acquired in the process of generating an image containing the image target or in the process of adding noise to the image and regenerating the image. In step 320, a bounding box of the image target is obtained based on the feature map. 【0025】 In step 310, at least one of the self-attention features and cross-attention features of the image containing the image target, as well as one or more latent vectors, are obtained. 
【0026】 Self-attention and cross-attention mechanisms are techniques that improve deep learning models, particularly when processing sequential and cross-modal data, by allowing the models to focus on important information and ignore unimportant information.
【0027】 The self-attention mechanism is used in generative models, including diffusion models. (A diffusion model first gradually adds noise in a forward diffusion process, transforming the data distribution into a more manageable prior distribution, usually a standard Gaussian distribution, and then trains a neural network to gradually remove the noise, thereby generating realistic data from noise.) The self-attention mechanism captures dependencies within a sequence by allowing each position to attend to all other positions in the sequence as the model processes it. Specifically, in the self-attention feature calculation, a correlation score is calculated between each position in the input image and every other position, and the self-attention features are obtained by weighting with these scores. The self-attention mechanism is useful for capturing global contextual information. Its operation may be divided into the following steps:
- Linear mapping: Each input vector in the sequence (for example, one block of an image in image processing) is linearly mapped to a query vector (q), a key vector (k), and a value vector (v) using three different weight matrices (W^q, W^k, W^v).
- Similarity calculation: The similarity between each query vector and the key vectors of all elements is calculated using the dot product. This similarity indicates the degree of attention given to elements at different positions in the sequence.
- Normalization: The similarities are normalized with the softmax function into a valid probability distribution, which gives the attention weight of each element relative to the other elements.
- Weighted addition: The value vectors are weighted by the normalized attention weights and summed to obtain the output vector for each element. This output vector contains information not only about the element itself but also about other related elements.
【0028】 Thus, the self-attention mechanism can dynamically capture the dependencies between elements at different positions within a sequence as self-attention features.
【0029】 The cross-attention mechanism is a key mechanism that connects encoders and decoders in generative models, and may also be used in diffusion models. It allows the decoder to focus on the encoder's output when generating the output sequence, which gives the model a deep understanding of the input sequence. The cross-attention mechanism differs from the self-attention mechanism in that it operates between two different inputs. In image processing, this typically means that, based on information from one input (e.g., a text prompt), a model can focus on the relevant portion of another input (e.g., an image) and capture it as a cross-attention feature.
【0030】 The general workflow of a cross-attention mechanism is as follows:
- Feature extraction: Initial feature maps are extracted from the two inputs (e.g., an image and text, such as a prompt for generating an image) using a convolutional neural network or another method.
- Similarity calculation: A similarity matrix is calculated between the two initial feature maps, usually with a measure such as the dot product or cosine similarity.
- Normalization: The similarity matrix is normalized to obtain an attention weight matrix.
- Weighted addition: The attention weight matrix is used to perform weighted addition on the initial feature map of the second input to obtain a cross-attention feature representation.
【0031】 In this specification, self-attention features and cross-attention features may be automatically acquired in the process of generating an image using a diffusion model with the help of prompts (and images).
【0032】 Specifically, in a diffusion model, the following steps may be performed to obtain self-attention features:
- Initialization: Define convolution kernels for the query, key, and value at each position in the input image, and one output projection convolution kernel. These convolution kernels are used to extract features from the input image and calculate attention weights.
- Feature extraction: Extract features from the input image using convolution operations to obtain feature representations of the query, key, and value.
- Attention weight calculation: Calculate the similarity between the query and key to obtain attention weights. This usually involves matrix multiplication of the query and key feature representations, followed by normalization with a scaling factor and the softmax function.
- Weighted addition: Apply the attention weights to the feature representation of the value at each position and perform weighted addition to obtain the self-attention feature representation at each position.
【0033】 In diffusion models, cross-attention mechanisms may be used to guide the stepwise removal of noise and the stepwise generation of data. For example, in an image generation task, a cross-attention mechanism may be used with text prompts and image data to generate an image closely related to the text content. In this case, the text prompt may be regarded as an additional input sequence, which the cross-attention mechanism relates to the image data to guide the image generation process.
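
The attention steps described above (mapping to queries, keys, and values, scaled dot-product similarity, softmax normalization, and weighted addition) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation of the disclosure: the function names, the identity projection matrices, and the toy inputs are all hypothetical. Passing the same sequence twice gives self-attention; passing two different sequences gives cross-attention, and which input supplies the queries is simply the argument order here.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query_input, context_input, w_q, w_k, w_v):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    q = matmul(query_input, w_q)    # linear mapping to queries
    k = matmul(context_input, w_k)  # linear mapping to keys
    v = matmul(context_input, w_v)  # linear mapping to values
    scale = math.sqrt(len(q[0]))
    # Similarity: dot product of every query with every key, then scaling.
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]  # normalization
    return matmul(weights, v), weights          # weighted addition of values

# Hypothetical toy data: three image patches and two prompt tokens, 2-D each.
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned W^q, W^k, W^v
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 2.0]]

self_out, self_weights = attention(patches, patches, identity, identity, identity)
cross_out, cross_weights = attention(patches, tokens, identity, identity, identity)
```

Each row of `self_weights` is one patch's attention over all patches, and each row of `cross_weights` is one patch's attention over the prompt tokens; reshaped to the image grid, such rows are what an attention feature map visualizes.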
【0034】 In a diffusion model, the following steps may be performed to obtain cross-attention features:
- Feature extraction: Extract feature representations from the input sequence (e.g., text prompt) and the noisy image (or latent representation), respectively.
- Calculation of cross-attention weights: Calculate cross-attention weights using the feature representations of the input sequence as queries and the feature representations of the noisy image as keys and values.
- Weighted addition: Apply the cross-attention weights to the value representations at each position in the noisy image and perform weighted addition to obtain, at each position, a cross-attention feature representation in which the information from the input sequence is integrated.
【0035】 Figure 4 shows exemplary diagrams of self-attention feature maps, cross-attention feature maps, and latent vector feature maps of exemplary images relating to at least one embodiment of the present disclosure. The self-attention feature map 410, which corresponds to the image on the left side of Figure 4, shows the dependency between pixels at different locations in the image. The cross-attention feature map 420, which corresponds to the same image, shows the dependency between two images or between an image and text.
【0036】 An image latent vector is a vector in a latent representation space within a deep learning model, typically a low-dimensional, dense, continuous vector representing an image. Such a vector can capture key features of an image and may be used to generate a new image similar to the original. The image latent vector may be obtained during the process by which the deep learning model generates the image.
【0037】 For example, the latent vector may be obtained by a Latent Diffusion Model (LDM).
The diffusion model gradually converts an image into noise through a diffusion process and generates an image from the noise through a reverse diffusion (denoising) process. In this process, the latent vector may serve as the starting point or an intermediate state of diffusion and reverse diffusion. Therefore, when generating an image, a new latent vector may be sampled from the latent distribution and a new image may be generated through the reverse diffusion process. Specifically, at the initial stage of the diffusion model, a random noise tensor is usually sampled from a standard normal distribution and used as the initial latent vector. The random noise tensor is converted into a latent representation by an encoder (which may be a simple mapping function or a complex neural network), and this latent representation is a latent vector containing initial information of the generated data. At each time step, the model updates this representation based on time embeddings, text prompt embeddings, and the current latent vector so that it gradually approaches the latent representation of the target data, and after multiple iterations the final latent vector is obtained. In other words, latent vectors may be obtained in the process of generating images by diffusion (for example, in the process of generating images from text using a large-scale model).
【0038】 Alternatively, high-dimensional image data may be compressed into low-dimensional latent vectors using, for example, an autoencoder. Here, the autoencoder is trained on a large amount of image data so that it learns an effective mapping from images to latent vectors. Alternatively, for example, a variational autoencoder (VAE) may be used. Compared with an autoencoder, a variational autoencoder may have an added regularization term that ensures the latent vectors follow a specific distribution (e.g., a unit Gaussian distribution).
A variational autoencoder may be trained by jointly minimizing the reconstruction error and the KL (Kullback-Leibler) divergence between the latent distribution and the prior distribution. Alternatively, for example, a generative adversarial network (GAN) may be used. A generative adversarial network consists of two parts: a generator and a discriminator. The generator attempts to generate a realistic image from random noise. The discriminator attempts to distinguish between the generated image and a real image. GANs can gradually learn to generate realistic images by alternately training the generator and the discriminator. In GANs, random noise vectors are typically treated as latent vectors. By adjusting the values of these vectors, different styles of images can be generated.
【0039】 The different model structures described above differ in their ability to generate and represent latent vectors, and the parameters of the models (e.g., the number of dimensions of the latent vector, the weight of the regularization term, etc.) also affect the generation of latent vectors. In actual applications, an appropriate model and method must be selected based on the specific task and data characteristics to generate and utilize latent vectors. The specific process will not be described in detail here.
【0040】 This disclosure primarily uses a diffusion process of a diffusion model to acquire latent vectors, self-attention features, and / or cross-attention features. This allows the self-attention features and / or cross-attention features, and the latent vectors, to be acquired in a single image generation process using the diffusion model, thereby improving efficiency.
【0041】 The latent vector feature map 430, which corresponds to the image on the left side of Figure 4, represents the latent features or structure of the image.
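
The forward step that the description refers to as "adding noise to an image" can be illustrated with the closed-form noising used by DDPM-style diffusion models, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε. The sketch below is an assumption-laden illustration, not the procedure of the disclosure: the linear beta schedule, its endpoints, the step count, and the four-element "latent" are hypothetical choices.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def add_noise(latent, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * x + noise_scale * rng.gauss(0.0, 1.0) for x in latent]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
alpha_bars = alpha_bar_schedule(1000)
clean_latent = [0.5, -1.0, 2.0, 0.0]   # hypothetical flattened latent vector

slightly_noisy = add_noise(clean_latent, alpha_bars[0], rng)   # early step
mostly_noise = add_noise(clean_latent, alpha_bars[-1], rng)    # final step
```

At an early step almost all of the original signal is kept, while at the final step the sample is close to pure Gaussian noise; regenerating the image then runs the learned denoising in the opposite direction, which is where the attention features and latent vectors are collected.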
【0042】 As can be seen from Figure 4, if only one of the self-attention feature map 410 and the cross-attention feature map 420 is considered, or if only their combined feature map (see the combined feature map 440 in Figure 4) is considered, the grass-covered limb portions of the target may be ignored. In that case, the generated bounding box is the relatively small solid line box in the combined feature map 440, which leaves out most of the grass-covered limb portions. On the other hand, the actual bounding box is the relatively large solid line box in the combined feature map 440. Therefore, if a bounding box is generated using only the self-attention feature map 410 and / or the cross-attention feature map 420, it may not be possible to obtain a sufficiently accurate bounding box, and this will further affect the target detection capability.
【0043】 In some embodiments, the integrated feature map of the self-attention feature map 410 and the cross-attention feature map 420 may be obtained by several methods, such as direct integration, attention mechanism integration, and hierarchical integration. In direct integration, the self-attention feature map 410 and the cross-attention feature map 420 may be concatenated or added together. This method allows all information from the two features to be retained. In attention mechanism integration, association weights between the two features may be calculated, and the feature representations may be weighted by these weights and integrated.
This method allows for more flexible adjustment of the influence of different features on the final result and improves the performance of the integrated feature map. In hierarchical integration, the two features may first be integrated at a lower level, and then further integration and processing may be performed at a higher level. This method allows for the gradual extraction and integration of feature information from different levels and improves the performance of the integrated feature map. Here, the integration method for the integrated feature map of the self-attention feature map 410 and the cross-attention feature map 420 is not limited.
【0044】 However, at least one embodiment of the present invention uses, in addition to at least one of the self-attention feature map 410 and the cross-attention feature map 420 described above, a latent vector feature map 430 to capture latent features or structure of an image. These include basic content information such as the shape, color, and texture of objects, style features such as painting style or photographic style, and even the relationships between objects in the image and the atmosphere of the scene. Therefore, by constructing an image feature map that takes latent vectors into account, it is possible to obtain further information about the boundaries (edges) of the image target and generate a more accurate bounding box from the feature map.
【0045】 In some embodiments, step 310 may include obtaining a feature map by one or more neural networks based on at least one of self-attention features and cross-attention features, as well as one or more latent vectors. The one or more neural networks may be trained, and how they are trained will be described in detail later.
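
The direct integration options named for the self-attention and cross-attention feature maps (addition and concatenation) can be sketched as below, together with a simple weighted blend. This is an illustrative sketch only: the function names and the 2×2 maps are hypothetical, and the fixed blend weight is an assumed stand-in for association weights that would be computed from the features in attention mechanism integration.

```python
def add_maps(map_a, map_b):
    """Direct integration by element-wise addition."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

def concat_maps(map_a, map_b):
    """Direct integration by concatenation: each pixel keeps both values."""
    return [[[a, b] for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

def blend_maps(map_a, map_b, weight_a):
    """Weighted integration with a fixed weight (assumed, not learned)."""
    weight_b = 1.0 - weight_a
    return [[weight_a * a + weight_b * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

# Hypothetical 2x2 attention maps with values in [0, 1].
self_attention_map = [[0.0, 0.5], [0.25, 1.0]]
cross_attention_map = [[0.0, 0.75], [0.5, 0.5]]

added = add_maps(self_attention_map, cross_attention_map)
stacked = concat_maps(self_attention_map, cross_attention_map)
blended = blend_maps(self_attention_map, cross_attention_map, weight_a=0.5)
```

Addition keeps the spatial size and merges evidence from both maps, whereas concatenation keeps the two sources separate as channels so that a later network can decide how to weigh them.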
【0046】 In some embodiments, obtaining a feature map by one or more neural networks based on at least one of self-attention features and cross-attention features, and one or more latent vectors may include selecting at least one of the one or more latent vectors as a target latent vector for obtaining the feature map, based on a first predetermined rule, or generating a target latent vector for obtaining the feature map by combining at least two of the one or more latent vectors, based on a second predetermined rule. 【0047】 Figure 5 shows an exemplary latent vector map of one or more latent vectors relating to at least one embodiment of the present disclosure. 【0048】 In the latent vector generation process, one or more latent vectors may be generated due to iteration or other causes. Figure 5 shows a case where four layers of latent vectors are generated from the image on the left. Each layer of latent vector contains information about a target aspect of the image (e.g., texture, style, detail). Latent vectors may also be obtained by inputting the image into an autoencoder. Latent vectors represent multi-layered features of the image captured by different network layers. For example, shallow latent vectors may contain low-level features of the image (e.g., edges and texture), intermediate latent vectors may capture more complex patterns (e.g., shape and structure of objects), and deep latent vectors may contain high-level semantic information (e.g., style and detail). 【0049】 As shown in Figure 5, suppose, for example, that four latent vectors are generated in four layers. When considering these four latent vectors, at least one of them may be selected as the target latent vector, or these four latent vectors may be combined to generate a single target latent vector. Here, the selection or combination method may vary. For example, selection may involve randomly selecting at least one, or selecting based on a first predetermined rule. 
For example, combination may be based on a second predetermined rule. For example, the second predetermined rule may involve adding the four latent vector maps using a direct addition method. Alternatively, combination may involve using a weighted addition method based on the second predetermined rule to ultimately obtain the target latent vector for obtaining the feature map. 【0050】 In some embodiments, a first predetermined rule for selection may include random selection or based on the ratio of the size of the foreground region to the size of the background region in the latent vector map of one or more latent vectors, and / or the degree of concentration of the foreground region in the image. A second predetermined rule for combination may include direct addition or based on the ratio of the size of the foreground region to the size of the background region in the latent vector map of one or more latent vectors, and / or the degree of concentration of the foreground region in the image. 【0051】 For example, the larger the ratio of the foreground region size to the background region size in the latent vector map represented by the latent vector, and the higher the degree of concentration of the foreground region in the latent vector map represented by the latent vector, the more likely it is to be selected or combined with a higher weight. This is because a larger ratio of the foreground region size to the background region size, and a higher degree of concentration of the foreground region in the latent vector map, means that more accurate boundary or edge information of the target can be obtained from that one or more latent vectors. This compensates for any omissions or absences in the self-attention features and / or cross-attention features regarding the boundary or edge information mentioned above. 
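The first predetermined rule described above can be illustrated with a minimal sketch. The text does not define how the foreground is separated from the background or how the "degree of concentration" is measured, so the following are assumptions made only for illustration: foreground pixels are those above a fixed `threshold`, concentration is the fraction of the foreground's bounding rectangle that is foreground, and the two quantities are multiplied into one score. All function names are hypothetical.

```python
def foreground_ratio(latent_map, threshold=0.5):
    """Ratio of foreground size to background size.

    Assumption: a pixel is foreground when its value exceeds `threshold`.
    """
    pixels = [value for row in latent_map for value in row]
    if not pixels:
        return 0.0
    foreground = sum(1 for value in pixels if value > threshold)
    background = len(pixels) - foreground
    return foreground / background if background else float("inf")


def foreground_concentration(latent_map, threshold=0.5):
    """Assumed concentration metric: foreground pixels divided by the area
    of the smallest rectangle that contains all of them."""
    coords = [(r, c)
              for r, row in enumerate(latent_map)
              for c, value in enumerate(row)
              if value > threshold]
    if not coords:
        return 0.0
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return len(coords) / area


def select_target_latent(latent_maps, threshold=0.5):
    """Return the index of the best-scoring latent vector map and all scores."""
    scores = [foreground_ratio(m, threshold) * foreground_concentration(m, threshold)
              for m in latent_maps]
    best = max(range(len(latent_maps)), key=scores.__getitem__)
    return best, scores


# Two maps with the same foreground size; only the second is concentrated.
scattered = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
compact = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
best_index, scores = select_target_latent([scattered, compact])
print(best_index)  # 1
```

Multiplying the two quantities is only one way to rank the maps; the text allows either quantity to be used alone or both together.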
【0052】 Of the four latent vectors shown in Figure 5, the (best) fourth layer latent vector has a large ratio of foreground to background size, indicating a high degree of foreground concentration in the latent vector map. On the other hand, the (second best) third layer latent vector has a relatively large ratio of foreground to background size, indicating a relatively high degree of foreground concentration in the latent vector map. In this case, it is conceivable to select the best fourth layer latent vector as the target latent vector for obtaining the feature map (or, it is conceivable to select the best fourth layer latent vector and the second best third layer latent vector as the target latent vector for obtaining the feature map), or, in the case of weighted addition, it is conceivable to set the weight of the fourth layer latent vector to the highest and the weight of the third layer latent vector to the second highest, and by analogy, these four layer latent vectors can be combined to generate a target latent vector for obtaining the feature map. 【0053】 In this way, by selecting or combining latent vectors to obtain a target latent vector for acquiring feature maps, the accuracy of acquiring the bounding box of subsequent image targets can be further improved. 【0054】 In some embodiments, a feature map may be obtained by integrating at least one of the self-attention features and cross-attention features, as well as one or more latent vectors. Here, for example, the feature map is obtained by an integration method such as adding or weighting the at least one of the self-attention features and cross-attention features, as well as one or more latent vectors. 
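The direct addition and weighted addition mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the method of the disclosure: the maps are plain 2-D lists of equal size, and `weights_from_scores` (a hypothetical helper) simply normalizes per-map quality scores so that a better map receives a larger weight.

```python
def weighted_sum(maps, weights):
    """Element-wise weighted addition of equally sized 2-D maps."""
    if len(maps) != len(weights):
        raise ValueError("one weight per map is required")
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for m, w in zip(maps, weights))
             for c in range(width)]
            for r in range(height)]


def direct_sum(maps):
    """Direct addition is weighted addition with every weight equal to 1."""
    return weighted_sum(maps, [1.0] * len(maps))


def weights_from_scores(scores):
    """Assumed rule: normalize scores so the weights sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


layer3 = [[1.0, 2.0]]
layer4 = [[3.0, 4.0]]
weights = weights_from_scores([1.0, 3.0])          # layer 4 scored higher
target_latent = weighted_sum([layer3, layer4], weights)
print(target_latent)                               # [[2.5, 3.5]]
```

The same `weighted_sum` could also be applied to attention feature maps and the target latent vector map when the feature map is obtained by addition or weighting rather than by a neural network.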
【0055】 In some embodiments, obtaining a feature map by inputting at least one of self-attention features and cross-attention features, as well as one or more latent vectors, into one or more neural networks may include multiple embodiments such as inputting a target latent vector and self-attention features into one or more neural networks to obtain a feature map, inputting a target latent vector and cross-attention features into one or more neural networks to obtain a feature map, inputting a target latent vector, self-attention features, and cross-attention features into one or more neural networks to obtain a feature map, or integrating self-attention features and cross-attention features to generate integrated attention features, and inputting the target latent vector and integrated attention features into one or more neural networks to obtain a feature map. 【0056】 Figure 6A shows a schematic diagram illustrating how a self-attention feature and a cross-attention feature according to at least one embodiment of the present disclosure are integrated to generate an integrated attention feature, and how the target latent vector and the integrated attention feature are input to one or more neural networks to obtain a feature map. 【0057】 As shown in Figure 6A, first, the self-attention feature map 610 and the cross-attention feature map 620 are integrated to generate an integrated attention feature map 640, and then the target latent vector map 630 and the integrated attention feature map 640 are input to one or more neural networks 660 to obtain a feature map 650. 【0058】 Figure 6B shows a schematic diagram illustrating how a feature map is obtained by inputting a target latent vector and self-attention features according to at least one embodiment of the present disclosure into one or more neural networks. 
【0059】 As shown in Figure 6B, the target latent vector map 630 and the self-attention feature map 610 are input to one or more neural networks 660 to obtain the feature map 650. 【0060】 Figure 6C shows a schematic diagram illustrating how a feature map is obtained by inputting a target latent vector and cross-attention features according to at least one embodiment of the present disclosure into one or more neural networks. 【0061】 As shown in Figure 6C, the target latent vector map 630 and the cross-attention feature map 620 are input to one or more neural networks 660 to obtain the feature map 650. 【0062】 Figure 6D shows a schematic diagram illustrating how a feature map is obtained by inputting a target latent vector, self-attention features, and cross-attention features according to at least one embodiment of the present disclosure into one or more neural networks. 【0063】 As shown in Figure 6D, the target latent vector map 630 is input to one or more neural networks 660 along with the self-attention feature map 610 and the cross-attention feature map 620 to obtain the feature map 650. 【0064】 The one or more neural networks described above may be trained based on the training data input to each. For example, the one or more neural networks shown in Figure 6A may be trained using integrated attention feature samples generated by integrating self-attention feature samples and cross-attention feature samples, and target latent vector samples (abbreviated as latent vector samples). The one or more neural networks shown in Figure 6B may be trained using self-attention feature samples and latent vector samples. The one or more neural networks shown in Figure 6C may be trained using cross-attention feature samples and latent vector samples. The one or more neural networks shown in Figure 6D may be trained using self-attention feature samples, cross-attention feature samples, and latent vector samples. 
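The four input configurations of Figures 6A to 6D differ only in which maps are handed to the one or more neural networks. A minimal sketch of assembling those inputs is given below. The function names are hypothetical, the networks themselves are omitted, and the integration of the two attention maps is assumed here to be a simple element-wise average, which is only one of the integration methods the description permits.

```python
def average_maps(a, b):
    """Assumed integration method: element-wise average of two 2-D maps."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def build_network_inputs(target_latent, self_attn=None, cross_attn=None,
                         integrate=False):
    """Return the list of maps passed to the network(s).

    integrate=True          -> Figure 6A (latent + integrated attention)
    self_attn only          -> Figure 6B
    cross_attn only         -> Figure 6C
    both, integrate=False   -> Figure 6D
    """
    if self_attn is None and cross_attn is None:
        raise ValueError("at least one attention feature map is required")
    if integrate:
        if self_attn is None or cross_attn is None:
            raise ValueError("integration needs both attention feature maps")
        return [target_latent, average_maps(self_attn, cross_attn)]
    inputs = [target_latent]
    if self_attn is not None:
        inputs.append(self_attn)
    if cross_attn is not None:
        inputs.append(cross_attn)
    return inputs


latent = [[0.0, 1.0]]
self_map = [[2.0, 4.0]]
cross_map = [[4.0, 8.0]]
inputs_6a = build_network_inputs(latent, self_map, cross_map, integrate=True)
inputs_6d = build_network_inputs(latent, self_map, cross_map)
```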
【0065】 In some embodiments, one or more neural networks may have edge enhancement capabilities. These capabilities may be implemented using, for example, edge enhancement algorithms such as the Sobel operator, Prewitt operator, Canny edge detection, Laplacian operator, LoG operator, or SIFT algorithm. Of course, the edge enhancement capabilities may be implemented by a single neural network, but this network does not need to be trained independently; it is sufficient that the final output after all neural networks have been trained meets the expectations of edge enhancement. 【0066】 In some embodiments, one or more neural networks may be trained by a heuristic training policy. 【0067】 In conventional neural network training methods using loss calculation and backpropagation, there are cases where the loss cannot be backpropagated. This is because, although the neural network outputs a predicted feature map, the bounding box is obtained by performing calculations on the feature map using a series of non-differentiable operations (e.g., torch.nonzero (a function provided by PyTorch that returns the coordinates of non-zero elements in the input tensor) and coords.min / max (methods used when processing the coordinate tensor)), together with cv2.threshold with the Otsu flag (an adaptive method for determining the binarization threshold of an image) and cv2.findContours (for detecting contours in a binary image). In this case, the propagation process of the neural network is already complete, and it is not possible to optimize the parameters of the neural network (e.g., feature integration networks, edge enhancement networks, etc.) by backpropagating based on the loss between the predicted bounding box and the actual bounding box. 
On the other hand, training the neural network of the proposed technology with a heuristic training policy focuses on improving the feature maps that the neural network itself outputs, and therefore avoids the problem that the loss cannot be backpropagated under conventional training methods based on loss calculation and backpropagation. 【0068】 Specifically, in this proposed technology, the objective of the heuristic training policy for the neural network is to generate a well-integrated feature map for generating an accurate bounding box. A "well-integrated feature map" is one with a very high distinction between the foreground target and the background; that is, when the feature map is a grayscale image, as is typical, the pixel values of the foreground target are concentrated around 255 and the pixel values of the background are concentrated around 0. In other words, the pixel value distribution curve has two peaks, concentrated around 0 and 255. Such a feature map has clearer distinctions between black and white and sharper boundaries, so in the subsequent step 320, where the bounding box of the image target is obtained based on the feature map, a more accurate bounding box can be obtained. 【0069】 Accordingly, in some embodiments, the evaluation criteria for the heuristic training policy include the number of pixels in the image whose pixel value is within a predetermined range around 0 exceeding a first threshold, and the number of pixels in the image whose pixel value is within a predetermined range around 255 exceeding a second threshold. Here, "exceeding the first threshold" or "exceeding the second threshold" indicates a large number. The first threshold or the second threshold may be set to a relatively high value, for example, a number that approximates the result of dividing the total number of pixels in the feature map by 2. 
The first threshold and the second threshold may be the same or different. 【0070】 In some embodiments, the heuristic training policy may include the following steps: 【0071】 First, we obtain the quantity distribution of pixel values in the feature maps output from one or more neural networks. In the quantity distribution, the horizontal axis represents the pixel value, and the vertical axis represents the quantity. Here, "quantity" refers to the number of times each pixel value appears in the image. For example, the quantity of the pixel value 233 is 134,450, which indicates the number of times the pixel value 233 appears, i.e., the number of pixels with the pixel value 233 is 134,450. The quantity distribution diagram can intuitively show the distribution of pixel values in an image, for example, which pixel values appear frequently and which pixel values appear rarely. 【0072】 Figures 7A–7C show schematic diagrams of several examples of feature maps and corresponding pixel value quantity distributions output from one or more neural networks relating to at least one embodiment of the present disclosure. 【0073】 The lower part of Figure 7A shows that the quantity distribution of pixel values in the feature map at the top of Figure 7A has multiple (more than two) peaks, and that the peak values of multiple waves are concentrated around 0. The feature map at the top of Figure 7A clearly has a lot of black and the target edges are not clear, so the bounding box generated based on this feature map (see the frame in the feature map at the top of Figure 7A) will be smaller than the actual bounding box. 【0074】 The lower part of Figure 7B shows that the quantity distribution of pixel values in the feature map at the top of Figure 7B has multiple (more than two) peaks, with the peak value of one wave concentrated around 0 and the peak values of several other waves concentrated around 255. 
The feature map at the top of Figure 7B clearly has a lot of white and the target edges are over-exaggerated and inaccurate, so the bounding box generated based on this feature map (see the frame in the feature map at the top of Figure 7B) is larger than the actual bounding box. 【0075】 The lower part of Figure 7C shows that the quantity distribution of pixel values in the feature map at the top of Figure 7C has two peaks, and the two peak values are concentrated around 0 and 255. The edges of the feature map at the top of Figure 7C are appropriate, and the bounding box generated based on this feature map (see the frame in the feature map at the top of Figure 7C) usually matches the actual bounding box. 【0076】 Therefore, in order to achieve the effect shown in Figure 7C, the heuristic training policy may further include the following steps. 【0077】 Determine one or more waves with peaks in the quantity distribution. Here, a wave with a peak is a wave that has only the peaks that are of interest in this specification and has no troughs. In this case, only protruding waves are considered, not concave waves, because protruding waves represent the parts where the number of pixels is most abundant. For example, here, each of the one or more waves is a wave where the pixel value of the corresponding peak is 0 and the right side of the corresponding peak is decreasing, a wave where the pixel value of the corresponding peak is 255 and the left side of the corresponding peak is increasing, or a wave where the pixel value of the corresponding peak is between 0 and 255 and the left side of the corresponding peak is increasing and the right side of the corresponding peak is decreasing. 【0078】 The mean (Mean), variance or standard deviation (Std), x-axis amplitude (AmpX), and y-axis amplitude (AmpY) of one or more waves may be calculated. 
The x-axis amplitude (AmpX) and y-axis amplitude (AmpY) may be calculated using parameters such as the polarization state of the wave, the propagation direction, and the number of waves. The mean and variance or standard deviation may be calculated using many methods well known in the art, and the variance and standard deviation may be converted to each other. 【0079】 A total score may be calculated for the feature map output from the neural network based on the number of waves (one or more), the mean value, variance or standard deviation of each wave, the amplitude on the x-axis, and the amplitude on the y-axis. The closer the number of waves is to 2, and the more the pixel values of the two waves are concentrated at 0 and 255, respectively, the higher the total score. The concentration of pixel values of the two waves at 0 and 255 may include the mean values of the two waves being near 0 and 255, respectively, and the shape of the two waves being elongated. 【0080】 In the specific scoring method, an initial score is set, and the less the above conditions are met, the more points are deducted. 【0081】 For example, the initial score could be 100. 【0082】 First, scoring may be based on the number of waves. Based on the number of waves (one or more), the first score is calculated as initial score - abs(number of waves - 2)*a. Here, abs indicates that the absolute value is being calculated, meaning that having more or fewer than two waves is inappropriate, and the further the number of waves is from two, the more points will be deducted. 【0083】 Next, for example, scoring may be based on the mean value of the waves. From the one or more waves, find two waves whose mean values lie within a predetermined range of 0 and of 255, respectively. If no such waves are found, it indicates that the feature map is not appropriate and may be ignored. 
If such waves are found, a second score is calculated, for the wave whose mean value lies within the predetermined range of 0, as initial score - MSE(mean value of the wave - 0)*b, and a third score is calculated, for the wave whose mean value lies within the predetermined range of 255, as initial score - MSE(mean value of the wave - 255)*. Here, MSE indicates calculating the mean squared error. Here, the predetermined range may be set to 10 or another number so that the mean value is as close as possible to 0 or 255. In other words, for the wave whose mean value is near 0, points are deducted as the mean value moves away from 0, and for the wave whose mean value is near 255, points are deducted as the mean value moves away from 255. Alternatively, one score may be calculated for each of the two waves. 【0084】 Scoring may be based on the shape of the waves. As mentioned above, search all waves for two waves whose mean values are around 0 and 255, respectively. If no such waves are found, it indicates that the feature map is not appropriate and may be ignored. If such waves are found, calculate a fourth score for each of the two waves based on the variance or standard deviation of the wave, the amplitude on the x-axis, and the amplitude on the y-axis, as initial score - MSE(aspect ratio or standard deviation of the wave - given aspect ratio or given standard deviation)*c. Here, the aspect ratio of the wave is the amplitude on the y-axis of the wave divided by the amplitude on the x-axis of the wave, and the standard deviation is the square root of the variance. The fourth score is calculated for each wave. Here, the given aspect ratio is, for example, 10, and the given standard deviation is, for example, 0.2, but is not limited to these. That is, the shape of each wave is preferably elongated, and the greater the deviation from the given aspect ratio or given standard deviation, the more points will be deducted. 
Elongated waves indicate that the pixel values are concentrated around 0 or 255. 【0085】 Here, a, b, and c are weights, which may be set such that a > b > c. Of course, other weight relationships may be set based on importance. 【0086】 The first score, the second score, the third score, and the fourth scores of the two waves may be added to obtain a total score. 【0087】 One or more neural networks may be trained up to a predetermined number of times (e.g., 100 times), and the model parameters of the one or more neural networks with the highest total score may be retained to determine the trained one or more neural networks. 【0088】 Figures 7A-7C show the quality scores (i.e., the total scores above) of several exemplary curves, respectively. The reason the scores are negative is that the initial scores were set low. 【0089】 Therefore, the above policy allows for relatively accurate training of appropriate neural networks. 【0090】 Thus, heuristic training policies allow for the efficient and low-cost discovery of optimal model parameters for neural networks, avoiding the problems associated with conventional loss function and backpropagation-based neural network training policies. 【0091】 Next, in step 320, the bounding box of the image target may be obtained based on the feature map. In some embodiments, the bounding box may be obtained by binarizing the feature map and detecting contours in the binary image. For example, the feature map may be binarized using cv2.threshold with the Otsu flag (an adaptive method for determining the image binarization threshold), contours may be detected using cv2.findContours (for detecting contours in a binary image), and the bounding box may be determined based on the maximum box range of the contour. As an example of a bounding box, refer to the box in the feature map at the top of Figure 7C. 
Of course, the contour of the target may be obtained using other methods such as edge detection, and the bounding box may be determined based on the maximum box range of the contour; the methods for obtaining the bounding box will not be listed here one by one. 【0092】 Figure 8 shows the difference between a bounding box obtained by a technical proposal of at least one embodiment of the present disclosure, a real bounding box, and a bounding box obtained by a conventional technical proposal. 【0093】 In the three images on the left side of Figure 8, the solid bounding boxes represent the bounding boxes obtained using the conventional technology, while the dashed bounding boxes represent the actual bounding boxes. As shown in the figure, there is a significant difference between the obtained bounding boxes and the actual bounding boxes in the three images on the left side of Figure 8, indicating that the obtained bounding boxes do not realistically reproduce the actual bounding boxes. 【0094】 In the three images on the right side of Figure 8, the solid bounding boxes represent bounding boxes obtained by the technical proposal of at least one embodiment of this disclosure, while the dashed bounding boxes represent actual bounding boxes. As shown in the figure, in the three images on the right side of Figure 8, the difference between the obtained bounding boxes and the actual bounding boxes is not large, and they are at least closer to the actual bounding boxes than the bounding boxes on the left side of Figure 8. Therefore, the obtained bounding boxes reproduce the actual bounding boxes relatively realistically. 
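Step 320 (binarize the feature map, then take the maximum extent of the target) can be sketched without any library as follows. This is a simplified stand-in, not the OpenCV pipeline named in the description: the Otsu threshold is computed directly from the histogram, and instead of detecting contours the box is taken from the minimum and maximum coordinates of all foreground pixels, which assumes the feature map contains a single target. The function names are hypothetical.

```python
def otsu_threshold(feature_map):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value > t are treated as foreground.
    """
    counts = [0] * 256
    for row in feature_map:
        for value in row:
            counts[value] += 1
    total = sum(counts)
    sum_all = sum(v * n for v, n in enumerate(counts))
    best_t, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += counts[t]
        sum_low += t * counts[t]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


def bounding_box(feature_map):
    """Return (x_min, y_min, x_max, y_max) of the foreground, or None."""
    t = otsu_threshold(feature_map)
    coords = [(x, y)
              for y, row in enumerate(feature_map)
              for x, value in enumerate(row)
              if value > t]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


feature_map = [
    [0, 0, 0, 0, 0],
    [0, 0, 255, 255, 0],
    [0, 0, 255, 255, 0],
    [0, 0, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
print(bounding_box(feature_map))  # (2, 1, 3, 3)
```

Because every operation here selects or compares coordinates rather than computing a smooth function of the pixel values, no gradient can flow from the box back to the feature map, which is the non-differentiability the description refers to.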
【0095】 Thus, according to at least one embodiment of the present disclosure, by utilizing the self-attention features and / or cross-attention features, as well as latent vector features, of the image acquired simultaneously in a process of generating an image containing an image target or a process of adding noise to an image and regenerating the image, more accurate target edge information can be obtained, thereby obtaining a feature map that more accurately represents the edge features of the target, and based on the more accurate feature map, the bounding box of the target can be generated more accurately. 【0096】 The bounding box generated by the technical proposal of at least one embodiment of this disclosure may be used, along with the image (and prompt or recognition result), as training data for a neural network that generates bounding boxes, in order to further train a target detector. This allows the recognition result of the target to be directly detected and the bounding box of the target to be obtained simply by inputting an image. The image here may be a generated image that is generated at the same time as the self-attention features and / or cross-attention features are acquired; in this way, the bounding box is obtained while the image is being generated, and training can be performed using the generated image at the same time. Alternatively, the image here may be any input image. According to the technical proposal of at least one embodiment of this disclosure, the arbitrary image is processed to acquire self-attention features and / or cross-attention features to generate a bounding box, and training is performed using the arbitrary image at the same time. This disclosure does not limit the various subsequent applications. 
【0097】 As can be seen by comparing the bounding boxes on the left and right sides of Figure 8, the bounding box obtained by the technical proposal of at least one embodiment of this disclosure is more accurate and closer to the actual bounding box than the bounding box obtained by the conventional technical proposal. 【0098】 Figure 9 shows a block diagram of an image target bounding box generation system 900 according to at least one embodiment of the present disclosure. 【0099】 The image target bounding box generation system 900 may include feature map acquisition means 910 and bounding box acquisition means 920. 【0100】 The feature map acquisition means 910 may be configured to acquire a feature map based on at least one of self-attention features and cross-attention features, as well as one or more latent vectors, acquired in a process of generating an image containing an image target or a process of adding noise to an image and regenerating the image. 【0101】 The bounding box acquisition means 920 may be configured to acquire the bounding box of an image target based on a feature map. 【0102】 In some embodiments, the feature map acquisition means 910 may be configured to acquire a feature map by inputting at least one of self-attention features and cross-attention features, as well as one or more latent vectors, into one or more neural networks. 【0103】 In some embodiments, the feature map acquisition means 910 may be configured to select at least one latent vector from one or more latent vectors as a target latent vector for acquiring a feature map based on a first predetermined rule, or it may be configured to acquire a latent vector for acquiring a feature map by combining at least two latent vectors from one or more latent vectors based on a second predetermined rule. 
【0104】 In some embodiments, a first predetermined rule for selection may include random selection, or based on the ratio of the size of the foreground region to the size of the background region in the latent vector map of one or more latent vectors, and / or the degree of concentration of the foreground region in the image. A second predetermined rule for combination may include direct addition, or based on the ratio of the size of the foreground region to the size of the background region in the latent vector map of one or more latent vectors, and / or the degree of concentration of the foreground region in the image. 【0105】 In some embodiments, the feature map acquisition means 910 may be configured to acquire a feature map by inputting the target latent vector and self-attention features into one or more neural networks, to acquire a feature map by inputting the target latent vector and cross-attention features into one or more neural networks, to acquire a feature map by inputting the target latent vector, self-attention features and cross-attention features into one or more neural networks, or to integrate the self-attention features and cross-attention features to generate integrated attention features, and then input the target latent vector and integrated attention features into one or more neural networks to acquire a feature map. 【0106】 In some embodiments, one or more neural networks may have functions for edge enhancement. 【0107】 In some embodiments, one or more neural networks may be trained by a heuristic training policy. 【0108】 In some embodiments, the evaluation criteria for the heuristic training policy may include the number of pixels in the image whose pixel value is within a predetermined range around 0 exceeding a first threshold, and the number of pixels in the image whose pixel value is within a predetermined range around 255 exceeding a second threshold. 
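The evaluation criteria above amount to two pixel counts compared against two thresholds. A minimal sketch follows; the `margin` parameter stands for the "predetermined range" around 0 and 255, and its default of 10 as well as the function name are assumptions for illustration.

```python
def meets_evaluation_criteria(feature_map, first_threshold, second_threshold,
                              margin=10):
    """True when more than `first_threshold` pixels lie within `margin` of 0
    and more than `second_threshold` pixels lie within `margin` of 255."""
    pixels = [value for row in feature_map for value in row]
    near_black = sum(1 for value in pixels if value <= margin)
    near_white = sum(1 for value in pixels if value >= 255 - margin)
    return near_black > first_threshold and near_white > second_threshold


sharp = [[0, 3, 250, 255], [1, 0, 255, 252]]
washed_out = [[90, 120, 130, 140], [100, 110, 150, 160]]
threshold = len(sharp) * len(sharp[0]) // 2 - 1   # just under half the pixels
print(meets_evaluation_criteria(sharp, threshold, threshold))       # True
print(meets_evaluation_criteria(washed_out, threshold, threshold))  # False
```

Both thresholds are set slightly below half of the pixel count here because the two counts cannot both strictly exceed exactly half of the pixels.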
【0109】 In some embodiments, the heuristic training policy may include: obtaining the quantity distribution of pixel values of feature maps output from one or more neural networks; determining one or more waves having peaks in the quantity distribution, where each of the waves is a wave whose corresponding peak pixel value is 0 and the right side of the corresponding peak is decreasing, a wave whose corresponding peak pixel value is 255 and the left side of the corresponding peak is increasing, or a wave whose corresponding peak pixel value is between 0 and 255 and the left side of the corresponding peak is increasing and the right side of the corresponding peak is decreasing; calculating the mean value, the variance or standard deviation, the amplitude on the x-axis, and the amplitude on the y-axis of each of the waves; calculating a total score for a feature map output from a neural network based on the number of waves, the mean value of each wave, the variance or standard deviation, the amplitude on the x-axis, and the amplitude on the y-axis, wherein the total score is higher the more closely the conditions are met that there are 2 waves and that the pixel values of the two waves are concentrated at 0 and 255, respectively; and determining a trained neural network by training one or more neural networks up to a predetermined number of times and retaining the model parameters of the neural network with the highest total score. 
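The heuristic training policy summarized above can be sketched as a scoring function plus a "keep the best run" loop. The sketch makes several assumptions that are not in the description: a wave is taken to end at an empty histogram bin or at a trough; the MSE of a single value is its squared error; only the standard deviation (not the aspect ratio) is used for the shape score; the weight `b` is reused for the third score; and the weights `a`, `b`, `c` and `target_std` are example values. All function names are hypothetical.

```python
def histogram(feature_map):
    """Count how often each pixel value 0..255 occurs (quantity distribution)."""
    counts = [0] * 256
    for row in feature_map:
        for value in row:
            counts[value] += 1
    return counts


def find_waves(counts):
    """Split the histogram into single-peak waves (lists of pixel values)."""
    waves, current, falling = [], [], False
    for value, n in enumerate(counts):
        if n == 0:                      # an empty bin ends the current wave
            if current:
                waves.append(current)
            current, falling = [], False
            continue
        if current:
            previous = counts[value - 1]
            if n < previous:
                falling = True
            elif n > previous and falling:   # trough passed: start a new wave
                waves.append(current)
                current, falling = [], False
        current.append(value)
    if current:
        waves.append(current)
    return waves


def wave_stats(counts, wave):
    """Return (mean, standard deviation) of one wave, weighted by count."""
    total = sum(counts[v] for v in wave)
    mean = sum(v * counts[v] for v in wave) / total
    variance = sum(counts[v] * (v - mean) ** 2 for v in wave) / total
    return mean, variance ** 0.5


def total_score(feature_map, initial=100.0, a=10.0, b=1.0, c=0.1,
                near=10, target_std=0.2):
    """Total score of a feature map, or None if no wave lies near 0 or 255."""
    counts = histogram(feature_map)
    waves = find_waves(counts)
    stats = [wave_stats(counts, w) for w in waves]
    low = [s for s in stats if s[0] <= near]
    high = [s for s in stats if s[0] >= 255 - near]
    if not low or not high:
        return None                     # feature map is ignored
    low_mean, low_std = min(low)
    high_mean, high_std = max(high)
    first = initial - abs(len(waves) - 2) * a
    second = initial - (low_mean - 0) ** 2 * b
    third = initial - (high_mean - 255) ** 2 * b   # weight b is an assumption
    fourth = sum(initial - (std - target_std) ** 2 * c
                 for std in (low_std, high_std))
    return first + second + third + fourth


def keep_best(candidates):
    """candidates: iterable of (params, feature_map); keep the best-scoring."""
    best_params, best_score = None, None
    for params, feature_map in candidates:
        score = total_score(feature_map)
        if score is not None and (best_score is None or score > best_score):
            best_params, best_score = params, score
    return best_params, best_score


sharp = [[0, 0, 255, 255] for _ in range(4)]
blurry = [[0, 0, 5, 5], [128, 128, 128, 128], [250, 250, 255, 255]]
gray = [[128, 128, 128, 128] for _ in range(3)]
best_params, best_score = keep_best(
    [("run1", blurry), ("run2", sharp), ("run3", gray)])
print(best_params)  # run2
```

In an actual training run, each candidate would be the model parameters after one training iteration together with the feature map those parameters produce; here the strings `"run1"` to `"run3"` stand in for the parameters.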
【0110】 Thus, according to at least one embodiment of the present disclosure, by utilizing the self-attention features and / or cross-attention features of an image acquired simultaneously in a process of generating an image containing an image target or a process of adding noise to an image and regenerating the image, as well as by utilizing latent vector features, more accurate edge information of the target can be obtained, thereby obtaining a feature map that more accurately represents the edge features of the target, and a more accurate bounding box of the target can be generated based on the more accurate feature map. 【0111】 Furthermore, the defects present in the prior art described above were the result of thorough examination through practical and creative work, and the process of discovering these problems, as well as the solutions proposed in at least one embodiment of this disclosure, are all creative contributions to the invention in the process of invention. 【0112】 Figure 10 shows a block diagram of an exemplary electronic device relating to at least one embodiment of the present disclosure. 【0113】 The electronic device may include a processor 1010 and a memory 1020. The memory 1020 is connected to the processor 1010 and stores computer instructions for performing steps of each method of at least one embodiment of the present disclosure when executed by the processor 1010. 【0114】 The processor 1010 may include, but is not limited to, one or more processors or microprocessors. 【0115】 Memory 1020 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (e.g., hard disks, flexible disks, solid-state drives, removable disks, CD-ROMs, DVD-ROMs, Blu-ray discs, etc.). 
【0116】 In addition, the electronic device may further include, but is not limited to, a data bus 1030, an input / output (I / O) bus 1040, a display 1050, and input / output devices 1060 (e.g., a keyboard, mouse, speaker, etc.). 【0117】 The processor 1010 may communicate with external displays 1050 and input / output devices 1060, etc., via the I / O bus 1040. 【0118】 In one embodiment, the at least one computer instruction may be compiled or configured as a computer program product or software product, and when one or more of these computer instructions are executed by the processor, the steps of each function and / or method in the embodiments described herein are performed. 【0119】 According to at least one embodiment of the present disclosure, a non-temporary computer-readable storage medium is further provided. The non-temporary computer-readable storage medium stores instructions, such as computer instructions. When a computer instruction is executed by a processor, the methods described above may be performed. The non-temporary computer-readable storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (e.g., hard disks, flexible disks, solid-state drives, removable disks, CD-ROMs, DVD-ROMs, Blu-ray discs, etc.). For example, the non-temporary computer-readable storage medium may be connected to a computing device such as a computer, and when the computing device executes a computer instruction stored in the computer-readable storage medium, the methods described above may be performed. 【0120】 This disclosure may further include computer program products. These computer program products may perform the methods, steps, and operations provided in this disclosure. 
Such computer program products may be, for example, computer software packages, computer code instructions, or computer-readable tangible media having tangibly stored (and/or encoded) computer instructions, which may be executed by a processor to perform the operations described herein. Computer program products may also include packaging materials. 【0121】 The block diagrams of the components, devices, equipment, and systems relating to this disclosure are illustrative only and do not require or imply that they must be connected, installed, or arranged in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, devices, equipment, and systems may be connected, installed, or arranged in any manner. Words such as “include,” “contain,” and “have” are open-ended terms and can be used interchangeably to mean “include, but not limited to.” The phrase “for example” as used in this disclosure means “for example, but not limited to.” 【0122】 The step flowcharts and method descriptions in this disclosure are illustrative only and do not require or imply that the steps of each embodiment must be performed in the order described. As those skilled in the art will recognize, the steps in the embodiments may be performed in any order. Words such as "then," "and," and "next" do not limit the order of the steps, but merely help the reader follow the method description. Furthermore, any reference to an element in the singular, for example using the articles "a," "an," or "the," shall not be construed as limiting that element to the singular. 【0123】 Furthermore, the steps and apparatus in each embodiment of this specification are not limited to being performed in only one embodiment. 
In fact, new embodiments may be conceived by combining some of the relevant steps and apparatus in each embodiment of this specification based on the concepts of this disclosure, and these new embodiments are also included in the scope of this disclosure. 【0124】 The above method may be implemented by hardware, software, firmware, or any combination thereof. 【0125】 Furthermore, modules and/or other suitable means for carrying out the methods and techniques described herein may be downloaded from a server via wireless communication as appropriate. Alternatively, the various methods described herein may be provided via storage means, so that an apparatus obtains them when the storage means is coupled to the apparatus. Furthermore, any other suitable technique for providing the methods and techniques described herein to the apparatus may be utilized. 【0126】 The above description is provided for illustrative and explanatory purposes only. Furthermore, this description is not intended to limit at least one embodiment of the disclosure to the form disclosed herein. While several exemplary forms and embodiments have been discussed above, those skilled in the art will understand that variations, modifications, changes, additions, and subcombinations thereof are possible.
Claims
[Claim 1] A method for generating a bounding box for an image target, comprising: a step of obtaining a feature map based on at least one of self-attention features and cross-attention features, and one or more latent vectors, obtained in a process of generating an image containing the image target or a process of adding noise to the image and regenerating the image; and a step of obtaining the bounding box for the image target based on the feature map.
[Claim 2] The method for generating a bounding box for an image target according to claim 1, wherein the step of obtaining the feature map includes obtaining the feature map by one or more neural networks based on at least one of the self-attention features and the cross-attention features, and the one or more latent vectors.
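The second step of claim 1 does not state how the box is derived from the feature map. The following is a minimal sketch of one possible reading, assuming the feature map is an 8-bit grayscale array in which target pixels are bright; the function name `bbox_from_feature_map` and the threshold value of 128 are assumptions for illustration and do not appear in the claims.

```python
import numpy as np

def bbox_from_feature_map(feature_map, threshold=128):
    """Return (x_min, y_min, x_max, y_max) enclosing all pixels >= threshold.

    Assumption: bright pixels mark the target. Returns None if no pixel
    reaches the threshold.
    """
    mask = np.asarray(feature_map) >= threshold
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing foreground
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing foreground
    return (int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1]))

# Toy feature map: a bright rectangle on a dark background.
fm = np.zeros((8, 8), dtype=np.uint8)
fm[2:5, 3:7] = 255
box = bbox_from_feature_map(fm)
```

A feature map whose pixel values sit close to 0 and 255 makes this kind of thresholding insensitive to the exact threshold chosen.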
[Claim 3] The method for generating a bounding box for an image target according to claim 2, wherein the step of obtaining the feature map includes: selecting, based on a first prescribed rule, at least one of the one or more latent vectors as a target latent vector for obtaining the feature map; or generating the target latent vector by combining at least two of the one or more latent vectors based on a second prescribed rule, wherein the first prescribed rule includes: selecting randomly; or selecting based on the ratio of the size of the foreground region to the size of the background region in the latent vector map of each of the one or more latent vectors, and/or the degree of concentration of the foreground region in the image, and wherein the second prescribed rule includes: direct addition; or combining based on the ratio of the size of the foreground region to the size of the background region in the latent vector map of each of the one or more latent vectors, and/or the degree of concentration of the foreground region in the image.
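The two prescribed rules of claim 3 can be illustrated with a short sketch. Several details here are assumptions not stated in the claim: each latent vector map is treated as a 2-D array, the foreground is taken to be the values above the map's mean, ratio-based selection picks the map with the largest foreground-to-background ratio, and ratio-based combination is a weighted sum with the normalized ratios as weights. The function names are hypothetical.

```python
import numpy as np

def foreground_ratio(latent_map):
    """Foreground size divided by background size (assumed split: above mean)."""
    fg = np.count_nonzero(latent_map > latent_map.mean())
    bg = latent_map.size - fg
    return fg / max(bg, 1)

def select_latent(latents, rule="ratio", rng=None):
    """First rule: pick one latent map randomly or by foreground ratio."""
    if rule == "random":
        rng = rng or np.random.default_rng()
        return latents[rng.integers(len(latents))]
    ratios = [foreground_ratio(z) for z in latents]
    return latents[int(np.argmax(ratios))]  # assumption: largest ratio wins

def combine_latents(latents, rule="add"):
    """Second rule: direct addition or a ratio-weighted sum."""
    stack = np.stack(latents)
    if rule == "add":
        return stack.sum(axis=0)
    weights = np.array([foreground_ratio(z) for z in latents])
    total = weights.sum()
    # Fall back to uniform weights if no map has any foreground.
    weights = weights / total if total > 0 else np.full(len(latents), 1 / len(latents))
    return np.tensordot(weights, stack, axes=1)

a = np.zeros((4, 4)); a[0, 0] = 1.0   # tiny foreground
b = np.zeros((4, 4)); b[:2, :] = 1.0  # half foreground
target = select_latent([a, b])
```

The degree of concentration of the foreground region, which the claim lists as an alternative or additional basis, is not modeled in this sketch.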
[Claim 4] The method for generating a bounding box for an image target according to claim 3, wherein obtaining the feature map by the one or more neural networks includes: inputting the target latent vector and the self-attention features into the one or more neural networks to obtain the feature map; inputting the target latent vector and the cross-attention features into the one or more neural networks to obtain the feature map; inputting the target latent vector, the self-attention features, and the cross-attention features into the one or more neural networks to obtain the feature map; or integrating the self-attention features and the cross-attention features to generate an integrated attention feature, and inputting the target latent vector and the integrated attention feature into the one or more neural networks to obtain the feature map.
[Claim 5] The method for generating a bounding box for an image target according to claim 2, wherein the one or more neural networks have an edge enhancement function.
[Claim 6] The method for generating a bounding box for an image target according to claim 2, wherein the one or more neural networks are trained by a heuristic training policy.
[Claim 7] The method for generating a bounding box for an image target according to claim 6, wherein the evaluation criteria for the heuristic training policy include: the number of pixels in the image whose pixel value is within a predetermined range around 0 exceeding a first threshold; and the number of pixels in the image whose pixel value is within a predetermined range around 255 exceeding a second threshold.
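The evaluation criteria of claim 7 amount to two pixel counts compared against two thresholds. A minimal sketch follows, assuming an 8-bit feature map; the parameter names `margin`, `first_threshold`, and `second_threshold` and their default values are illustrative assumptions, since the claim leaves the predetermined range and the thresholds unspecified.

```python
import numpy as np

def meets_criteria(feature_map, margin=10, first_threshold=100, second_threshold=100):
    """Check both counts: enough near-0 pixels and enough near-255 pixels."""
    fm = np.asarray(feature_map)
    near_0 = np.count_nonzero(fm <= margin)          # pixels within `margin` of 0
    near_255 = np.count_nonzero(fm >= 255 - margin)  # pixels within `margin` of 255
    return near_0 > first_threshold and near_255 > second_threshold

# A half-black, half-white map satisfies both counts; a uniform gray map does not.
binary_map = np.zeros((20, 20), dtype=np.uint8)
binary_map[:10] = 255
gray_map = np.full((20, 20), 128, dtype=np.uint8)
```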
[Claim 8] The method for generating a bounding box for an image target according to claim 7, wherein the heuristic training policy includes: obtaining the quantity distribution of pixel values of the feature map output from the one or more neural networks; determining one or more waves having peaks in the quantity distribution, wherein each of the one or more waves is a wave in which the pixel value of the corresponding peak is 0 and the right side of the corresponding peak is decreasing, a wave in which the pixel value of the corresponding peak is 255 and the left side of the corresponding peak is decreasing, or a wave in which the pixel value of the corresponding peak is between 0 and 255 and the left side of the corresponding peak is increasing and the right side of the corresponding peak is decreasing; calculating, for each of the one or more waves, the mean value, the variance or standard deviation, the amplitude on the x-axis, and the amplitude on the y-axis; calculating a total score for the feature map output from the one or more neural networks based on the number of the one or more waves and on the mean value, the variance or standard deviation, the amplitude on the x-axis, and the amplitude on the y-axis of each wave, wherein the total score is higher the more closely the following condition is met: the number of waves is 2 and the pixel values of the two waves are concentrated at 0 and 255, respectively; and training the one or more neural networks up to a predetermined number of times and determining the trained one or more neural networks by retaining the model parameters of the one or more neural networks with the highest total score.
[Claim 9] An electronic device comprising: a memory in which computer instructions are stored; and a processor that performs the method for generating a bounding box for an image target according to any one of claims 1 to 8 by executing the computer instructions stored in the memory.
[Claim 10] A computer program product including computer instructions, wherein, when the computer instructions are executed by a processor, the computer instructions cause the processor to execute the method for generating a bounding box for an image target according to any one of claims 1 to 8.
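The scoring procedure recited in claim 8 can be sketched as follows. The claim names the inputs to the total score but not the formula, so the formula in `total_score` is purely an assumption chosen so that two narrow waves at 0 and 255 score highest. The peak test is also simplified (strict local maxima of the raw histogram, so flat-topped peaks are ignored), and the training step itself is outside the sketch: `keep_best` takes hypothetical (parameters, feature map) pairs, one per training round.

```python
import numpy as np

def find_waves(feature_map):
    """Split the 0..255 pixel-value histogram into waves around strict peaks."""
    hist = np.bincount(np.asarray(feature_map, dtype=np.uint8).ravel(), minlength=256)
    waves = []
    for p in range(256):
        if hist[p] == 0:
            continue
        left_ok = p == 0 or hist[p - 1] < hist[p]      # rising into the peak
        right_ok = p == 255 or hist[p + 1] < hist[p]   # falling after the peak
        if not (left_ok and right_ok):
            continue
        lo = p
        while lo > 0 and 0 < hist[lo - 1] <= hist[lo]:
            lo -= 1
        hi = p
        while hi < 255 and 0 < hist[hi + 1] <= hist[hi]:
            hi += 1
        values, counts = np.arange(lo, hi + 1), hist[lo:hi + 1]
        mean = np.average(values, weights=counts)
        std = np.sqrt(np.average((values - mean) ** 2, weights=counts))
        waves.append({"peak": p, "mean": float(mean), "std": float(std),
                      "x_amplitude": hi - lo, "y_amplitude": int(hist[p])})
    return waves

def total_score(waves, n_pixels):
    """Assumed formula: 1.0 for two zero-width waves at 0 and 255, lower otherwise."""
    if not waves:
        return 0.0
    count_term = 1.0 / (1 + abs(len(waves) - 2))
    means = [w["mean"] for w in waves]
    spread = (max(means) - min(means)) / 255.0
    quality = 0.0
    for w in waves:
        closeness = 1.0 - min(w["mean"], 255.0 - w["mean"]) / 127.5
        narrowness = 1.0 - w["x_amplitude"] / 255.0
        quality += (w["y_amplitude"] / n_pixels) * closeness * narrowness / (1.0 + w["std"])
    return count_term * spread * quality

def keep_best(rounds):
    """Retain the parameters whose feature map has the highest total score."""
    best_params, best_score = None, -1.0
    for params, feature_map in rounds:
        score = total_score(find_waves(feature_map), np.asarray(feature_map).size)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

binary_map = np.zeros((20, 20), dtype=np.uint8)
binary_map[:10] = 255
gray_map = np.full((20, 20), 128, dtype=np.uint8)
best_params, best_score = keep_best([("round_1", gray_map), ("round_2", binary_map)])
```

In this toy run the half-black, half-white map produces exactly two waves, at 0 and 255, so its parameters are the ones retained.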