Methods, apparatus, and computer program products for filtering images to be filtered.
By filtering images based on similarity calculations and using mixup mixing to generate extended bounding box images, the method enhances the diversity and performance of deep learning target detection models, addressing the issues of insufficient labeled data and overfitting.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- NTT DOCOMO INC
- Filing Date
- 2025-11-21
- Publication Date
- 2026-06-15
Abstract
Description
【Technical Field】
【0001】 The present disclosure relates to the field of data processing, and particularly to a method, an apparatus, and a computer program product for filtering an image to be filtered.
【Background Art】
【0002】 Target detection is an important technology in the field of computer vision; it aims to recognize and locate one or more target objects in an image or video frame. Target detection is applied very widely, for example in autonomous driving, face recognition, and diseased leaf detection. In a target detection task, a bounding box is an essential tool for determining the position of a target object in an image: the position of the target is labeled by a rectangular area. The bounding box may be determined by a person manually labeling the image, or automatically by methods such as edge detection, clustering algorithms, or rule-based methods.
【Summary of the Invention】
【Problems to be Solved by the Invention】
【0003】 With the rapid development of deep learning, target detection algorithms based on deep learning have made remarkable progress. YOLO (You Only Look Once) and Faster R-CNN (Region-based Convolutional Neural Network) are two popular deep learning target detection models. Deep learning models rely on a large amount of labeled data for training, but problems such as insufficient labeled data and non-uniform data distribution often lead to poor model performance.
【0004】 To achieve better target detection, the generalization ability of a model can be improved by expanding or enhancing the training dataset. Generative models are a type of machine learning model. They can learn the distribution of existing data and generate new data from it, which makes them very useful in tasks such as image generation, text generation, and audio generation.
Diffusion models and GANs (Generative Adversarial Networks) are common examples of generative models.
【Means for Solving the Problem】
【0005】 This disclosure relates to a method, an apparatus, and a computer program product for filtering an image to be filtered. Based on a reference bounding box image set, they can determine one or more bounding box image similarities corresponding to the one or more bounding box images contained in the image to be filtered, and can then determine whether or not to filter the image to be filtered based on these similarities. Since the reference bounding box image set contains more bounding box images than the true bounding box image set, images to be filtered that are beneficial to a downstream task (for example, various machine learning or neural network models, such as the deep learning target detection models described above) are more likely to be retained for training the downstream task rather than filtered out. This increases the diversity of the images that remain after filtering, reduces overfitting, and further improves the performance of the downstream task.
【0006】 One aspect of the present disclosure provides a method for filtering an image to be filtered. The method includes the steps of: deriving a reference bounding box image set based on a true bounding box image set, wherein the reference bounding box images in the reference bounding box image set include true bounding box images and extended bounding box images different from the true bounding box images; calculating the similarity between each of one or more bounding box images included in the image to be filtered and the reference bounding box image set, to obtain one or more bounding box image similarities corresponding to the one or more bounding box images; and deciding whether or not to filter the image to be filtered based on the one or more bounding box image similarities.
【0007】 According to some embodiments of the present disclosure, the step of calculating the similarity between each bounding box image and the reference bounding box image set to obtain the bounding box image similarity corresponding to that bounding box image includes the steps of: calculating the similarity between the bounding box image and each reference bounding box image in the reference bounding box image set to obtain a plurality of first similarities; and setting the largest of the plurality of first similarities as the bounding box image similarity corresponding to the bounding box image.
【0008】 According to some embodiments of the present disclosure, the step of determining whether to filter the image to be filtered based on the one or more bounding box image similarities includes the step of filtering the image to be filtered when all of the one or more bounding box image similarities are lower than a first threshold.
【0009】 According to some embodiments of the present disclosure, the step of determining whether to filter the image to be filtered based on the one or more bounding box image similarities includes the step of filtering the image to be filtered when at least one of the one or more bounding box image similarities is lower than a second threshold.
【0010】 According to some embodiments of the present disclosure, the step of deriving the reference bounding box image set based on the true bounding box image set includes the step of deriving additional bounding box images as extended bounding box images using a mixup mixing method based on true bounding box images in the true bounding box image set.
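The aggregation and decision rules of paragraphs 【0007】 to 【0009】 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes that each bounding box image has already been converted to a feature vector and that cosine similarity is the chosen measure; the function names, the two-dimensional vectors, and the threshold value 0.9 are all hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def bounding_box_similarity(box_feature, reference_features):
    """Largest of the first similarities between one box and every reference image."""
    return max(cosine_similarity(box_feature, ref) for ref in reference_features)

def should_filter_all_below(similarities, first_threshold):
    """Filter only when every bounding box image similarity is below the threshold."""
    return all(s < first_threshold for s in similarities)

def should_filter_any_below(similarities, second_threshold):
    """Stricter rule: filter when at least one similarity is below the threshold."""
    return any(s < second_threshold for s in similarities)

# Hypothetical 2-D feature vectors; real features would come from an image encoder.
reference_features = [[1.0, 0.0], [0.0, 1.0]]
box_features = [[1.0, 0.1], [1.0, 1.0]]
similarities = [bounding_box_similarity(f, reference_features) for f in box_features]

print(should_filter_all_below(similarities, 0.9))  # False: the first box is close to a reference
print(should_filter_any_below(similarities, 0.9))  # True: the second box falls below 0.9
```

With the same threshold, the first rule retains the image because one bounding box image is close to a reference image, while the stricter rule filters it because another is not.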
【0011】 According to some embodiments of the present disclosure, the step of deriving an additional bounding box image as an extended bounding box image using a mixup mixing method based on true bounding box images in the true bounding box image set includes the step of performing pixel-level mixing or feature-level mixing on the true bounding box images using the mixup mixing method to derive the additional bounding box image as an extended bounding box image.
【0012】 According to some embodiments of the present disclosure, the step of deriving an additional bounding box image as an extended bounding box image using a mixup mixing method based on true bounding box images in the true bounding box image set includes the step of mixing, using the mixup mixing method, a plurality of true bounding box images in the true bounding box image set whose similarity is higher than a third threshold, to derive an additional bounding box image as an extended bounding box image.
【0013】 According to some embodiments of the present disclosure, the step of deriving an additional bounding box image as an extended bounding box image using a mixup mixing method based on true bounding box images in the true bounding box image set includes the step of mixing, using the mixup mixing method, a plurality of true bounding box images in the true bounding box image set whose similarity is lower than a fourth threshold, to derive an additional bounding box image as an extended bounding box image.
【0014】 According to some embodiments of this disclosure, the image to be filtered is an image generated by a generative model.
【0015】 Another aspect of the present disclosure provides an apparatus for filtering an image to be filtered. The apparatus includes a processor and a memory storing one or more computer programs, wherein, when the one or more computer programs are executed by the processor, the processor performs the method for filtering an image to be filtered described above.
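The mixing options of paragraphs 【0011】 to 【0013】 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method as actually implemented: the crops are assumed to be already resized to a common shape, `pixel_similarity` is a toy stand-in for whatever similarity measure is used, the `encode` and `decode` callables passed to `mix_features` are hypothetical (the choice of mixing layer is left to the caller), and the ratio 0.5 and thresholds 0.9 and 0.1 are arbitrary.

```python
from itertools import combinations

import numpy as np

def mix_pixels(image_a, image_b, ratio):
    """Pixel-level mixup: ratio * a + (1 - ratio) * b for same-shaped uint8 images."""
    if image_a.shape != image_b.shape:
        raise ValueError("resize the crops to a common shape before mixing")
    mixed = ratio * image_a.astype(float) + (1.0 - ratio) * image_b.astype(float)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

def mix_features(image_a, image_b, ratio, encode, decode):
    """Feature-level mixup: linearly interpolate encoder features, then decode."""
    mixed = ratio * encode(image_a) + (1.0 - ratio) * encode(image_b)
    return decode(mixed)

def pixel_similarity(image_a, image_b):
    """Toy similarity in [0, 1]: one minus the normalized mean absolute difference."""
    diff = np.abs(image_a.astype(float) - image_b.astype(float)).mean()
    return 1.0 - diff / 255.0

def select_pairs(images, similarity, threshold, above=True):
    """Index pairs whose similarity is above (or, with above=False, below) the threshold."""
    pairs = []
    for i, j in combinations(range(len(images)), 2):
        s = similarity(images[i], images[j])
        if (s > threshold) if above else (s < threshold):
            pairs.append((i, j))
    return pairs

# Three hypothetical true bounding box images with uniform pixel values.
crops = [np.full((4, 4, 3), value, dtype=np.uint8) for value in (0, 10, 250)]

similar_pairs = select_pairs(crops, pixel_similarity, 0.9, above=True)      # third threshold
dissimilar_pairs = select_pairs(crops, pixel_similarity, 0.1, above=False)  # fourth threshold

# Mix each highly similar pair in equal proportion to obtain extended images.
extended = [mix_pixels(crops[i], crops[j], 0.5) for i, j in similar_pairs]
```

Mixing pairs from `similar_pairs` yields extended images close to the true ones, while mixing pairs from `dissimilar_pairs` yields more diverse ones; either list can be passed to `mix_pixels` or `mix_features`.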
【0016】 According to a further aspect of the present disclosure, a computer program product is provided that stores instructions which, when executed by a processor, cause the processor to perform the method for filtering an image to be filtered described above.
【Effects of the Invention】
【0017】 According to embodiments of this disclosure, one or more bounding box image similarities corresponding to the one or more bounding box images included in the image to be filtered can be determined based on a reference bounding box image set, and whether or not to filter the image to be filtered can then be determined based on these similarities. Since the reference bounding box image set contains more bounding box images than the true bounding box image set, more images that are beneficial to the downstream task can be retained for training the downstream task rather than filtered out. This increases the diversity of the images that remain after filtering, reduces overfitting, and further improves the performance of the downstream task.
【0018】 The embodiments of this disclosure will become clearer and easier to understand from the following description given in conjunction with the drawings.
【Brief Description of the Drawings】
【0019】
[Figure 1] Figure 1 is a schematic diagram illustrating an application scenario of the method for filtering an image to be filtered according to an embodiment of this disclosure.
[Figure 2] Figure 2 shows a flowchart of a method for filtering an image to be filtered according to an embodiment of the present disclosure.
[Figure 3] Figure 3 shows a schematic diagram of bounding box images.
[Figure 4] Figure 4 shows a schematic diagram of pixel-level mixing.
[Figure 5] Figure 5 shows a schematic diagram of feature-level mixing.
[Figure 6] Figure 6 shows a schematic diagram of an apparatus for filtering an image to be filtered according to an embodiment of the present disclosure.
【Modes for Carrying Out the Invention】
【0020】 Hereinafter, embodiments of the present disclosure will be described in more detail with reference to the drawings. Although several embodiments of the present disclosure are shown in the drawings, the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments described below. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. The drawings and embodiments of the present disclosure are exemplary and are not intended to limit the scope of protection of the present disclosure.
【0021】 It should be understood that the steps described in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. The method embodiments may also include other steps, and/or some steps may be omitted.
【0022】 As used herein, the term "comprising" and its variations are open-ended, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The term "some embodiments" means "at least some embodiments". Definitions of other terms will be given later.
【0023】 It should be understood that terms such as "first" and "second" used in the present disclosure merely distinguish different devices, modules, or units, and do not limit the order of, or the interdependence between, the functions performed by these devices, modules, or units.
【0024】 Note that the modifiers "one" and "a plurality" used in the present disclosure are illustrative rather than restrictive. As those skilled in the art will understand, unless the context specifies otherwise, they should be understood as "one or a plurality".
【0025】 As mentioned above, generative models can generate new data based on existing data to augment or extend the training dataset for downstream tasks (e.g., the deep learning target detection models described above). However, data generation is difficult to control precisely, so it is necessary to select, from the large amount of generated data, the data that is useful for the downstream task, and to train the downstream task on that data so that its performance improves. For example, for a target detection task for diseased leaves, augmented images may be generated using a generative model based on images of true diseased leaves. To improve the performance of the downstream target detection model, it is necessary to select augmented images that are of the correct type (i.e., the diseased leaf type), are sharp, and are highly realistic. However, since the disease features of diseased leaves are usually not distinct, it is difficult to determine which augmented images are highly realistic and contain diseased leaf features. If all augmented images are used to train the downstream target detection model without selection, adverse effects will occur and the performance of the downstream target detection model will be degraded.
【0026】 Conventionally, it is known to calculate a similarity between each true image and each augmented image and to decide, based on a similarity threshold, whether to retain the augmented image as training data or to filter it out. The problem with such an approach is that only augmented images similar to the true images are selected, which limits diversity and causes overfitting, degrading the performance of downstream tasks.
【0027】 This disclosure has been made in view of the above issues.
The object of this disclosure is to provide a method, an apparatus, and a computer program product for filtering an image to be filtered that can determine, based on a reference bounding box image set, one or more bounding box image similarities corresponding to the one or more bounding box images contained in the image to be filtered, and can then determine whether or not to filter the image to be filtered based on these similarities. Since the reference bounding box image set contains more bounding box images than the true bounding box image set, more images to be filtered that are useful for the downstream task can be retained for training the downstream task rather than filtered out. This increases the diversity of the images that remain after filtering, reduces overfitting, and further improves the performance of the downstream task.
【0028】 Figure 1 is a schematic diagram of an application scenario 100 of the method for filtering an image to be filtered according to an embodiment of the present disclosure. As shown in Figure 1, the method 102 according to an embodiment of the present disclosure can receive images to be filtered and filter them. In some embodiments, the images to be filtered may be, for example, images generated by the generative model 101 shown in the dotted box in Figure 1, and are intended to be used as training data for the downstream task 103 shown in the dotted box in Figure 1. The generative model 101 is, for example, a diffusion model. To improve the performance of the downstream task 103, the images to be filtered are filtered using the method 102, and the images that remain after filtering are input to the downstream task 103 as training data. In some embodiments, the downstream task 103 may be, for example, a target detection model, which is trained so that it can subsequently perform target detection tasks such as diseased leaf detection and face recognition.
【0029】 For convenience, the embodiments of this disclosure are described below primarily for application scenarios in which the target detection task is diseased leaf detection. However, it should be understood that this disclosure is not limited to such application scenarios.
【0030】 Figure 2 shows a flowchart of the method 200 for filtering an image to be filtered according to an embodiment of the present disclosure. In some embodiments, the image to be filtered may be, for example, an image generated by the generative model 101 shown in the dotted box in Figure 1. The generative model 101 may be, for example, a diffusion model. As shown in Figure 2, the method 200 includes steps S210 to S230.
【0031】 In step S210, a reference bounding box image set may be derived based on a true bounding box image set, where the true bounding box image set is a group of true bounding box images, each of which is a bounding box image determined from a true image. In this disclosure, an image delimited by a bounding box is referred to as a bounding box image. As described above, a bounding box is a rectangular area used to determine the position of a target object in an image. Bounding boxes can be determined manually or automatically by various conventional methods, and the bounding box images are determined accordingly.
【0032】 Figure 3 shows a schematic diagram of bounding box images. For example, image 300 shown in Figure 3 is an image of a single leaf and includes three bounding box images 301, 302, and 303, each labeled by a rectangular frame. As an example, if the target detection task is diseased leaf detection, the three bounding box images 301, 302, and 303 are, for example, images of yellowed leaves, curled leaves, rotten leaves, withered leaves, or variegated leaves that are labeled in, extracted from, or otherwise determined from image 300.
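The relationship between an image and its bounding box images in paragraphs 【0031】 and 【0032】 can be illustrated with a short Python sketch. It assumes that the image is a NumPy array and that each box is given as hypothetical pixel coordinates (x_min, y_min, x_max, y_max); the disclosure does not prescribe any particular representation.

```python
import numpy as np

def crop_bounding_box_images(image, boxes):
    """Return one cropped array per (x_min, y_min, x_max, y_max) box."""
    height, width = image.shape[:2]
    crops = []
    for x_min, y_min, x_max, y_max in boxes:
        # Clamp to the image so a slightly oversized label cannot index out of range.
        x0, x1 = max(0, x_min), min(width, x_max)
        y0, y1 = max(0, y_min), min(height, y_max)
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"empty bounding box: {(x_min, y_min, x_max, y_max)}")
        crops.append(image[y0:y1, x0:x1].copy())
    return crops

# Three hypothetical boxes on a blank 100 x 120 RGB image; the first two overlap.
image = np.zeros((100, 120, 3), dtype=np.uint8)
boxes = [(10, 10, 50, 40), (30, 20, 70, 60), (80, 70, 120, 100)]
crops = crop_bounding_box_images(image, boxes)
```

Each crop is copied, so a later operation on one bounding box image does not alter the source image or any other crop.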
【0033】 Bounding box images may overlap (e.g., bounding box images 301 and 302) or may not overlap (e.g., bounding box images 301 and 303). Although Figure 3 shows three bounding box images in one image, this is merely an example, and it should be understood that this disclosure places no limitation on the number of bounding box images.
【0034】 Referring again to Figure 2, as described above, in step S210, a reference bounding box image set may be derived based on a true bounding box image set. The true bounding box image set is a group of true bounding box images, and the reference bounding box image set is a group of reference bounding box images. The reference bounding box images in the reference bounding box image set may include (1) true bounding box images in the true bounding box image set, and (2) extended bounding box images that are different from the true bounding box images. In other words, the reference bounding box image set contains more images than the true bounding box image set. By introducing additional extended bounding box images, embodiments of this disclosure allow more of the images to be filtered that are beneficial to the downstream task to be retained for training rather than filtered out. For example, an image to be filtered that has good diversity can be kept from being filtered out merely because none of the bounding box images it contains is similar to a true bounding box image. This increases the diversity of the images that remain after filtering, reduces overfitting, and further improves the performance of downstream tasks.
【0035】 In some embodiments, the true bounding box images may be augmented using a known mixup mixing method to derive the extended bounding box images described above.
The mixup mixing method is a data augmentation method used mainly in the field of computer vision. Its central idea is to augment the dataset and improve the generalization ability of the model by mixing input data through simple linear interpolation. Therefore, in some embodiments, step S210 may include deriving an additional bounding box image as an extended bounding box image using the mixup mixing method based on the true bounding box images in the true bounding box image set. In other words, the extended bounding box image may be obtained by mixing multiple true bounding box images using the mixup mixing method. This makes it possible to obtain features that are not captured in the true images. For example, in diseased leaf detection, the true bounding box images obtained from the true images may include only images of yellowed leaves and images of withered leaves, whereas the extended bounding box images may include not only those images but also images of leaves that are both yellowed and withered.
【0036】 The mixup mixing method may be pixel-level mixing or feature-level mixing. Therefore, in some embodiments, the step of deriving an additional bounding box image as an extended bounding box image using the mixup mixing method based on the true bounding box images in the true bounding box image set may include the step of performing pixel-level mixing or feature-level mixing on the true bounding box images using the mixup mixing method to derive the additional bounding box image as an extended bounding box image. Figure 4 shows a schematic diagram of pixel-level mixing. Pixel-level mixing of images may refer to randomly selecting two images and mixing them in a predetermined ratio to generate a new image. As shown in Figure 4, a third bounding box image 403 can be obtained from the first bounding box image 401 and the second bounding box image 402 through pixel-level mixing.
This third bounding box image 403 is an additional bounding box image that can serve as the extended bounding box image described above. Figure 5 shows a schematic diagram of feature-level mixing. Feature-level mixing of images may refer to performing the mixing in a randomly selected hidden layer (intermediate layer) of a neural network. As shown in Figure 5, for the first bounding box image 501 and the second bounding box image 502, an encoder and a decoder may be used: one layer is randomly selected as the mixing layer, the features of that layer are linearly interpolated to mix the images at the feature level, and an additional bounding box image is then output as an extended bounding box image based on the mixed features.
【0037】 The input data for the mixup mixing method described above may include two or more true bounding box images, for example, the first bounding box image 401 and the second bounding box image 402 described above with reference to Figure 4, or the first bounding box image 501 and the second bounding box image 502 described above with reference to Figure 5. These true bounding box images may be mixed in pairs, or more than two true bounding box images may be mixed at once. They may be selected randomly from the true bounding box image set, or they may be multiple true bounding box images in the set that are highly similar to one another, or multiple true bounding box images in the set that differ greatly from one another.
【0038】 In some embodiments, the extended bounding box image described above may be derived from multiple true bounding box images selected for their high similarity, i.e., small differences.
Deriving from highly similar true bounding box images yields an extended bounding box image that is similar to, but slightly different from, the true bounding box images. This helps reduce the downstream task's dependence on particular attributes and improves its generalization ability. Furthermore, deriving the extended bounding box image from multiple highly similar true bounding box images helps the downstream task learn more robust feature representations, so that it maintains stable output performance even when the input image changes slightly. Accordingly, in some embodiments, deriving an additional bounding box image as an extended bounding box image using the mixup mixing method based on true bounding box images in the true bounding box image set may include mixing, using the mixup mixing method, multiple true bounding box images in the true bounding box image set whose similarity is higher than the third threshold, to derive an additional bounding box image as an extended bounding box image.
【0039】 In some embodiments, the extended bounding box image described above may be derived from multiple true bounding box images selected for their low similarity, i.e., large differences. Deriving from true bounding box images with low similarity can increase the diversity of the reference bounding box image set, allowing downstream tasks to learn more general feature representations and to perform better when they encounter novel, unseen data.
Thus, in some embodiments, deriving an additional bounding box image as an extended bounding box image using the mixup mixing method based on true bounding box images in the true bounding box image set may include mixing, using the mixup mixing method, multiple true bounding box images in the true bounding box image set whose similarity is below a fourth threshold, to derive an additional bounding box image as an extended bounding box image.
【0040】 Referring again to Figure 2, after step S210, the process proceeds to step S220. In step S220, the similarity between each of the one or more bounding box images included in the image to be filtered and the reference bounding box image set derived in step S210 is calculated, yielding one or more bounding box image similarities corresponding to the one or more bounding box images. As mentioned above, bounding boxes can be determined manually or automatically by various conventional methods, and the bounding box images are determined accordingly. The one or more bounding box images included in the image to be filtered can be determined in the same way. In step S220, the similarity between each of these bounding box images and the reference bounding box image set is determined. In other words, for each bounding box image included in the image to be filtered, one bounding box image similarity is calculated, which represents the degree of similarity between that bounding box image and the reference bounding box image set. This yields the one or more bounding box image similarities corresponding to the one or more bounding box images included in the image to be filtered.
In other words, if the number of bounding box images included in the image to be filtered is N (where N is an integer greater than or equal to 1), then N bounding box image similarities are calculated in step S220, each corresponding to one bounding box image.
【0041】 There may be multiple methods for calculating the similarity between each bounding box image and the reference bounding box image set. As a simple example, the bounding box image similarity corresponding to one bounding box image may be calculated as the maximum of the similarities between that bounding box image and all reference bounding box images in the reference bounding box image set. Therefore, in some embodiments, step S220 may include the steps of: obtaining a plurality of first similarities by calculating the similarity between each bounding box image included in the image to be filtered and each reference bounding box image in the reference bounding box image set; and setting the maximum of the plurality of first similarities as the bounding box image similarity corresponding to the bounding box image.
【0042】 It can be understood that the first similarity between each bounding box image in the image to be filtered and each reference bounding box image in the reference bounding box image set may be calculated using any method known in this field for calculating the similarity between two images.
For example, the similarity between two images may be calculated from the cosine similarity between the contrastive language-image pre-training (CLIP) feature vectors of the two images; from measures such as the structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and learned perceptual image patch similarity (LPIPS) of the two images; or from distances such as the Euclidean distance and Manhattan distance between the feature vectors of the two images. Assuming that the number of reference bounding box images in the reference bounding box image set is M (where M is an integer greater than or equal to 1), M first similarities can be calculated for each bounding box image, and the highest of these M first similarities may be used as the bounding box image similarity corresponding to that bounding box image.
【0043】 In other embodiments, the similarity between each bounding box image and the reference bounding box image set may be calculated using other methods. For example, the bounding box image similarity corresponding to a single bounding box image may be calculated as the average or median of the similarities between that bounding box image and all reference bounding box images in the reference bounding box image set, although this disclosure is not limited to this.
【0044】 In step S230, whether or not to filter the image to be filtered may be decided based on the one or more bounding box image similarities obtained in step S220. In the embodiments of this disclosure, this decision depends on similarity at the level of bounding box images rather than at the level of the overall image.
In this way, the decision to retain or filter the image to be filtered is made at a finer granularity, so that more images useful to the downstream task are retained as its training data, thereby expanding the diversity of the training data.
【0045】 In some embodiments, a given image to be filtered is retained as long as the bounding box image similarity corresponding to at least one bounding box image in it is sufficiently high. In other words, the image to be filtered is filtered only if none of the bounding box image similarities corresponding to its bounding box images is sufficiently high. In such cases, step S230 may include filtering the image to be filtered when the one or more bounding box image similarities obtained in step S220 are all less than the first threshold. In this embodiment, the entire image to be filtered is retained as long as at least one bounding box image satisfies the first threshold. Compared to conventional methods that decide whether or not to filter based on the similarity of the entire image (i.e., requiring that the similarities corresponding to all of the bounding box images satisfy the threshold), the method of this embodiment can retain a wider variety of images. This increases the diversity of the images that remain after filtering and improves the performance of downstream tasks.
【0046】 Since the reference bounding box image set is an extended set compared with the true bounding box image set, in some embodiments a stricter filtering condition may be applied to an image to be filtered: the image is filtered as long as at least one of the bounding box image similarities corresponding to its bounding box images is below the threshold.
In such cases, step S230 may include filtering the image to be filtered when at least one of the one or more bounding box image similarities obtained in step S220 is below the second threshold. Since every image that remains after filtering can be used as training data for downstream tasks, this embodiment can reduce outlier data in the training data and improve the accuracy of downstream tasks.
【0047】 The first, second, third, and fourth thresholds mentioned above are all similarity thresholds. Their values may be the same or different, and they may be set according to experience or need.
【0048】 The method for filtering an image to be filtered according to the embodiments of this disclosure can increase the diversity of the images that remain after filtering, reduce overfitting, and further improve the performance of downstream tasks.
【0049】 Embodiments of this disclosure further provide a device for filtering an image to be filtered. Figure 6 shows a schematic diagram of a device 600 for filtering an image to be filtered according to an embodiment of this disclosure.
【0050】 As shown in Figure 6, the device 600 according to this embodiment comprises a processor 610 and a memory 620. One or more computer programs are stored in the memory 620.
【0051】 The processor 610 is a program-controlled device such as a microprocessor, and operates, for example, according to a program stored in the memory 620. The memory 620 is a storage element such as a ROM or RAM. Programs and the like executed by the processor 610 are stored in the memory 620. The device 600 shown in Figure 6 can be used to implement the method for filtering an image to be filtered disclosed in this application.
【0052】 The device according to the embodiments of this disclosure can increase the diversity of the images that remain after filtering, reduce overfitting, and further improve the performance of downstream tasks.
【0053】 Embodiments of the present disclosure further provide a computer program product storing instructions which, when executed by a processor, cause the processor to perform the method for filtering images according to the embodiments of the present disclosure. 【0054】 The hardware computing devices described in this disclosure, either as a whole or as components thereof, may be implemented by a variety of suitable hardware means, including but not limited to FPGAs, ASICs, SoCs, discrete gates or transistor logic, discrete hardware components, or any combination thereof. The devices, equipment, methods, and systems relating to this disclosure are not limited to any particular hardware architecture or configuration. The components of the disclosed devices, equipment, and systems may be separate or integrated, combined in different ways, and/or replaced or complemented by other components. It should be understood that this disclosure can be implemented in a variety of forms, including hardware, software, firmware, dedicated processors, or combinations thereof. 【0055】 The block diagrams of the apparatus, equipment, methods, and systems relating to this disclosure are merely illustrative and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will understand, these circuits, devices, apparatus, equipment, and systems may be connected, arranged, and configured in any manner that achieves the desired purpose. 【0056】 In the above description, the present invention has been explained based on examples. These examples are for illustrative purposes only, and it should be understood by those skilled in the art that the components and combinations of processes in these examples can be modified in various ways, and that such modifications are also within the scope of the present invention.
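The similarity calculation of step S220 and the two filtering conditions of step S230 (paragraphs 【0045】 and 【0046】) can be illustrated with a minimal sketch. It is not taken from the disclosure: it assumes each bounding box image has already been converted to a feature vector, it uses cosine similarity as the similarity measure (the disclosure does not fix a particular measure here), and the function names and threshold values are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]  # assumed feature vector of one bounding box image


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumed stand-in for the similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def box_similarity(box: Vector, references: Sequence[Vector]) -> float:
    """Step S220: the similarity of a bounding box image is the largest of
    its similarities to the reference bounding box images."""
    return max(cosine_similarity(box, ref) for ref in references)


def should_filter(box_sims: Sequence[float], threshold: float,
                  strict: bool = False) -> bool:
    """Step S230: return True if the image to be filtered is filtered out.

    strict=False: filter only if every similarity is below the threshold
                  (the condition of paragraph 0045, first threshold).
    strict=True:  filter if at least one similarity is below the threshold
                  (the condition of paragraph 0046, second threshold).
    """
    if strict:
        return any(s < threshold for s in box_sims)
    return all(s < threshold for s in box_sims)


# Hypothetical data: two reference vectors and an image with two boxes.
references = [[1.0, 0.0], [0.0, 1.0]]
boxes = [[1.0, 0.0], [1.0, 1.0]]
sims = [box_similarity(b, references) for b in boxes]

keep_lenient = not should_filter(sims, threshold=0.9)              # retained
keep_strict = not should_filter(sims, threshold=0.9, strict=True)  # filtered
```

In this example the first box matches a reference exactly while the second does not reach the threshold, so the image is retained under the condition of paragraph 【0045】 and filtered out under the stricter condition of paragraph 【0046】.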
Claims
[Claim 1] A method for filtering an image to be filtered, comprising: a step of deriving a set of reference bounding box images based on a set of true bounding box images, wherein the reference bounding box images in the set of reference bounding box images include true bounding box images in the set of true bounding box images and extended bounding box images that are different from the true bounding box images; a step of calculating the similarity between each of one or more bounding box images included in the image to be filtered and the set of reference bounding box images, and obtaining one or more bounding box image similarities corresponding to the one or more bounding box images; and a step of determining whether or not to filter the image to be filtered based on the one or more bounding box image similarities.
[Claim 2] The method for filtering an image to be filtered according to claim 1, wherein the step of calculating the similarity between each bounding box image and the set of reference bounding box images and obtaining the bounding box image similarity corresponding to that bounding box image includes: a step of calculating the similarity between the bounding box image and each reference bounding box image in the set of reference bounding box images to obtain a plurality of first similarities; and a step of setting the largest of the plurality of first similarities as the bounding box image similarity corresponding to the bounding box image.
[Claim 3] The method for filtering an image to be filtered according to claim 1, wherein the step of determining whether or not to filter the image to be filtered based on the one or more bounding box image similarities includes: a step of filtering the image to be filtered based on the fact that each of the one or more bounding box image similarities is lower than a first threshold.
[Claim 4] The method for filtering an image to be filtered according to claim 1, wherein the step of determining whether or not to filter the image to be filtered based on the one or more bounding box image similarities includes: a step of filtering the image to be filtered based on the fact that at least one of the one or more bounding box image similarities is lower than a second threshold.
[Claim 5] The method for filtering an image to be filtered according to claim 1, wherein the step of deriving the set of reference bounding box images based on the set of true bounding box images includes: a step of deriving an additional bounding box image as the extended bounding box image using a mixup mixing method based on the true bounding box images in the set of true bounding box images.
[Claim 6] The method for filtering an image to be filtered according to claim 5, wherein the step of deriving an additional bounding box image as the extended bounding box image using a mixup mixing method based on the true bounding box images in the set of true bounding box images includes: a step of performing pixel-level mixing or feature-level mixing on the true bounding box images using the mixup mixing method to derive the additional bounding box image as the extended bounding box image.
[Claim 7] The method for filtering an image to be filtered according to claim 5, wherein the step of deriving an additional bounding box image as the extended bounding box image using a mixup mixing method based on the true bounding box images in the set of true bounding box images includes: a step of mixing, using the mixup mixing method, multiple true bounding box images in the set of true bounding box images whose similarity is higher than a third threshold, to derive the additional bounding box image as the extended bounding box image.
[Claim 8] The method for filtering an image to be filtered according to claim 5, wherein the step of deriving an additional bounding box image as the extended bounding box image using a mixup mixing method based on the true bounding box images in the set of true bounding box images includes: a step of mixing, using the mixup mixing method, multiple true bounding box images in the set of true bounding box images whose similarity is lower than a fourth threshold, to derive the additional bounding box image as the extended bounding box image.
[Claim 9] A device for filtering images, comprising: a processor; and a memory storing one or more computer programs, wherein, when the one or more computer programs are executed by the processor, the processor performs the method for filtering an image to be filtered according to any one of claims 1 to 8.
[Claim 10] A computer program product in which instructions are stored, wherein, when the instructions are executed by a processor, the processor executes the method according to any one of claims 1 to 8.
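For illustration only, and not as part of the claims, the pixel-level mixup mixing referred to in claims 5 to 8 can be sketched as follows. The sketch assumes grayscale bounding box images that have already been resized to a common shape, a fixed mixing coefficient `lam` (mixup as commonly described samples this coefficient from a Beta distribution), and a hypothetical `select` predicate standing in for the similarity-based pair selection of claims 7 and 8. All names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]  # assumed: grayscale image as rows of pixels


def mixup(a: Image, b: Image, lam: float = 0.5) -> List[List[float]]:
    """Pixel-level mixup: lam * a + (1 - lam) * b, computed per pixel."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    return [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def build_reference_set(
    true_boxes: Sequence[Image],
    lam: float = 0.5,
    select: Callable[[Image, Image], bool] = lambda a, b: True,
) -> List[List[List[float]]]:
    """Reference set = true bounding box images + mixed (extended) images.

    `select` decides which pairs are mixed, e.g. pairs whose similarity is
    above a third threshold or below a fourth threshold.
    """
    extended = [mixup(a, b, lam)
                for a, b in combinations(true_boxes, 2) if select(a, b)]
    return [[list(row) for row in box] for box in true_boxes] + extended


# Hypothetical 1x2 images.
box_a = [[0.0, 1.0]]
box_b = [[1.0, 0.0]]
reference_set = build_reference_set([box_a, box_b], lam=0.5)
```

Here `reference_set` contains the two true bounding box images followed by one extended image obtained by mixing them; passing a different `select` restricts which pairs contribute extended images.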