Method, apparatus, and computer program for selecting generated samples
By grouping and selecting generated samples based on their contribution to the learning target model, the method enhances the effectiveness of machine learning by improving model performance and reducing the impact of non-contributory or degrading data.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- NTT DOCOMO INC
- Filing Date
- 2025-11-26
- Publication Date
- 2026-06-15
AI Technical Summary
The newly generated data in machine learning is not always effective for model training, often containing non-contributory or degrading data, making it challenging to select effective generated samples for downstream model training.
A method is provided that divides a sample set into groups (for example, by clustering) and selects generated samples from each group based on the contribution of that group's generated samples to a learning target model. In one embodiment, the contribution serves as the fitness supervisor function of a genetic algorithm, which identifies and prioritizes the samples that enhance model performance.
This approach improves the performance of downstream models by selecting generated samples that contribute effectively to training, reducing overfitting and enhancing generalization.
Abstract
Description
【Technical Field】 【0001】 The present application relates to the field of artificial intelligence, and particularly to a method, an apparatus, and a computer program product for selecting generated samples. 【Background Art】 【0002】 In the field of machine learning, often the cost of obtaining labeled data is very high. Expanding the training dataset has been a long-standing goal in this field. Currently, it is common to perform data expansion by generating new data based on real data. 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0003】 However, the newly generated data, i.e., the expanded data, is not necessarily effective for model training. Among them, there are many data that do not contribute, and even data that can degrade performance. Therefore, selecting effective generated data and using it for the training of downstream models has become a problem to be solved in the field of machine learning. 【Means for Solving the Problems】 【0004】 In view of the above problems, the present disclosure provides a method, an apparatus, and a computer program product for selecting generated samples, which improve the performance of downstream models. 【0005】 According to one aspect of the present disclosure, there is provided a method for selecting generated samples, including dividing a sample set including real samples and generated samples into a plurality of groups, and selecting a part of the generated samples within each group based on the contribution degree of the generated samples in each of the plurality of groups to a learning target model, thereby selecting the generated samples within the sample set. 【0006】 According to one example of this disclosure, the contribution of each group's generated samples is positively correlated with the proportion of generated samples selected from the generated samples of that group. 
【0007】 According to one example of the present disclosure, dividing the sample set including real samples and generated samples into a plurality of groups includes clustering the samples in the sample set to generate a plurality of sample clusters as the plurality of groups. 【0008】 According to one example of the present disclosure, the contribution of the generated samples of each group is determined based on the proportion of real samples within that group. 【0009】 According to one example of the present disclosure, selecting a part of the generated samples within a group based on the contribution of the generated samples of that group includes: selecting a first proportion of the generated samples within the group when the proportion of real samples within the group exceeds a first threshold; selecting a second proportion of the generated samples within the group when the proportion of real samples within the group is lower than a second threshold and greater than zero; and selecting a third proportion of the generated samples within the group when the proportion of real samples within the group is zero, wherein the first threshold is greater than the second threshold, the second proportion is greater than the first proportion, and the first proportion is greater than the third proportion. 【0010】 According to one example of the present disclosure, the contribution of the generated samples of each of the plurality of groups is determined based on the performance of a trained model obtained by training with the samples within that group, and the performance of the trained model is positively correlated with the contribution of the generated samples of that group.
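The threshold rule of paragraphs 【0008】 and 【0009】 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the threshold values (0.6 and 0.4), the proportions (50%, 90%, 10%), and the use of random sampling within a group are all assumptions. The disclosure does not state what happens when the proportion of real samples lies between the two thresholds; the sketch treats that band like the first-threshold case, which is also an assumption. Group labels are taken as given (for example, from any clustering step).

```python
import random
from collections import defaultdict


def selection_ratio(real_fraction, first_threshold=0.6, second_threshold=0.4,
                    first_ratio=0.5, second_ratio=0.9, third_ratio=0.1):
    """Map a group's share of real samples to the share of generated samples to keep.

    All default values are illustrative assumptions; the disclosure only
    requires first_threshold > second_threshold and
    second_ratio > first_ratio > third_ratio.
    """
    if real_fraction == 0:
        return third_ratio      # no real samples: generated samples are likely outliers
    if real_fraction > first_threshold:
        return first_ratio      # many real samples: little new knowledge
    if real_fraction < second_threshold:
        return second_ratio     # few real samples: rich new knowledge
    return first_ratio          # band between thresholds: assumption, not in the source


def select_generated(groups, is_real, seed=0):
    """Return sorted indices of the generated samples kept from every group.

    groups[i] is the group label of sample i; is_real[i] is True for real samples.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(groups):
        members[label].append(index)

    selected = []
    for indices in members.values():
        generated = [i for i in indices if not is_real[i]]
        if not generated:
            continue
        real_fraction = (len(indices) - len(generated)) / len(indices)
        keep = round(selection_ratio(real_fraction) * len(generated))
        # Random choice within the group is an assumption of this sketch.
        selected.extend(rng.sample(generated, keep))
    return sorted(selected)
```

With these assumed values, a group dominated by real samples keeps half of its generated samples, a group with few real samples keeps most of them, and a group with no real samples keeps only a small fraction.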
【0011】 According to one example of the present disclosure, selecting a part of the generated samples within a group based on the contribution of the generated samples of each group to the learning target model includes using the contribution of the generated samples of each group as the fitness supervisor function of a genetic algorithm, so that the genetic algorithm automatically identifies the generated samples to be selected from the generated samples of the group. 【0012】 According to one example of the present disclosure, the real samples and the generated samples in the sample set are one of the following: images, audio, or text. 【0013】 According to another aspect of the present disclosure, there is provided an apparatus for selecting generated samples, which includes a processor and a memory storing one or more computer program modules, wherein the method for selecting generated samples described above is performed when the one or more computer program modules are executed by the processor. 【0014】 According to another aspect of the present disclosure, there is provided a computer program product including computer instructions, wherein the method for selecting generated samples described above is performed when the computer instructions are executed by a processor. 【Effects of the Invention】 【0015】 According to the method for selecting generated samples of the present disclosure, generated samples are selected using the contribution of the generated samples within each group, so that samples effective for training the downstream model are selected and the performance of the downstream model can be improved. 【0016】 The above and other objects, features, and advantages of the present disclosure will become clearer through a more detailed description of the embodiments with reference to the accompanying drawings. The accompanying drawings are provided to further the understanding of the embodiments of the present disclosure and constitute part of this specification.
The accompanying drawings are used to illustrate this disclosure together with the embodiments of this disclosure and are not intended to limit this disclosure. In the drawings, the same reference numerals generally represent the same component or step. [Brief explanation of the drawing] 【0017】 [Figure 1] Figure 1 is a flowchart showing a method for selecting a generated sample according to the embodiment of this disclosure. [Figure 2] Figure 2 is a flowchart showing a method for selecting a generated sample according to one embodiment of the present disclosure. [Figure 3] Figure 3 is a schematic diagram showing a method for selecting a generated sample according to one embodiment of the present disclosure. [Figure 4] Figure 4 is a flowchart showing a method for selecting a generated sample according to another embodiment of the present disclosure. [Figure 5] Figure 5 is a schematic diagram showing a method for selecting a generated sample according to one embodiment of the present disclosure. [Figure 6] Figure 6 shows a schematic diagram of an apparatus for selecting generated samples according to the embodiment of this disclosure. [Modes for carrying out the invention] 【0018】 The technical means in the embodiments of this disclosure will be described clearly and completely below with reference to the drawings of the embodiments of this disclosure. Clearly, the embodiments described are only a selection of the embodiments of this disclosure, not all embodiments. All other embodiments that a person skilled in the art can obtain by basing on the embodiments of this disclosure without requiring any creative work are all within the scope of this disclosure. 【0019】 In this application, a flowchart is used to describe the steps of the method according to the embodiments. It should be understood that the preceding and subsequent steps are not necessarily executed strictly in order. On the contrary, various steps may be processed in reverse order or simultaneously. 
Also, other operations may be added to these processes, or one or more steps may be deleted from these processes. 【0020】 First, referring to FIG. 1, a flowchart of a method 100 for screening generated samples according to an embodiment of the present disclosure will be described. As shown in FIG. 1, the method 100 used for image recognition learning includes the following steps S102 to S104. 【0021】 In step S102, a sample set including real samples and generated samples is divided into a plurality of groups. In some embodiments of the present disclosure, it may be grouped based on the similarity of the samples. For example, dividing the sample set into a plurality of groups may be realized by a clustering method. In other words, the samples in the sample set are divided into a plurality of sample clusters as a plurality of groups. In some embodiments of the present disclosure, the samples in the sample set may be divided into different groups based on certain characteristics of the samples. Those skilled in the art can understand that other methods may be used to group the samples in the sample set. Also, in some embodiments, the generated samples in the sample set have been preliminarily screened. For example, screening based on authenticity is performed, and samples that clearly do not match reality are excluded. 【0022】 In step S104, the generated samples in the sample set are screened by selecting a part of the generated samples within each group based on the contribution degree of the generated samples of each of the plurality of groups to the learning target model. In other words, first, the contribution degree of the generated samples of each of the plurality of groups to the learning target model is specified, and a part of the generated samples within the group is selected based on the contribution degree. The contribution degree refers to the degree to which a sample contributes to the performance of the learning target model. 
For example, a model is learned using a sample. If the learned model improves performance during actual use, the sample contributes to the model. In some embodiments of the present disclosure, the contribution degree of the generated samples of each group to the learning target model is positively correlated with the ratio of the generated samples selected from the generated samples of the group. In other words, the higher the contribution degree, the more generated samples are selected from that group. As a result, more samples with a high contribution degree are used for model learning, and the performance of the model is improved. 【0023】 Hereinafter, referring to FIG. 2, taking the grouping by the clustering method as an example, the method 100 for screening the generated samples will be further described. FIG. 2 is a flowchart showing a method 200 for screening the generated samples according to an embodiment of the present disclosure. As shown in FIG. 2, the method 200 for screening the generated samples includes the following steps S202 to S206. 【0024】 In step S202, a sample set including real samples and generated samples is clustered into a plurality of categories. In other words, a plurality of sample clusters are generated as the plurality of groups. The clustering may use any clustering method corresponding to the data type according to the data type, for example, the K-Means clustering algorithm, the hierarchical clustering algorithm, the spectral clustering algorithm, and the like. 【0025】 In step S204, the contribution of the generated samples within each category is identified based on the proportion of real samples within that category. In some embodiments, the proportion of real samples within each category may refer to the proportion of real samples in all samples within that category. In some other embodiments, the proportion of real samples within each category may refer to the proportion of real samples in the entire sample set. 
The proportion of real samples within each category reflects, to some extent, how much new knowledge the generated samples within that category contain. 【0026】 In step S206, generated samples are selected from each category in a proportion corresponding to its contribution. The contribution reflects the extent to which the generated samples contain new knowledge. Therefore, by selecting generated samples in accordance with the contribution, more effective new knowledge is selected. 【0027】 Next, with reference to Figure 3, the method 200 for selecting generated samples will be further described. Figure 3 is a schematic diagram showing a method for selecting generated samples according to one embodiment of the present disclosure. As shown in Figure 3, first, the samples in the sample set are clustered to obtain three types of sample clusters: Type 1, Type 2, and Type 3. In Figure 3, triangles represent real samples, circles represent generated samples, dashed circles represent the clustering results, and stars represent the cluster centers of the real samples. 【0028】 After clustering the samples, it is necessary to identify the contribution of the generated samples within each sample cluster to the downstream model. In a Type 1 sample cluster, many real samples are present, so the new knowledge contained in the generated samples is limited, resulting in a low contribution to the downstream model. In some embodiments, a large number of real samples may mean that the proportion of real samples in a Type 1 sample cluster exceeds a first threshold, e.g., 50%. In a Type 2 sample cluster, few real samples are present, so the generated samples contain abundant new knowledge, resulting in a high contribution to the downstream model.
In some embodiments, a small number of real samples may mean that the proportion of real samples in a Type 2 sample cluster does not exceed a second threshold, e.g., 50%. It should be understood that the first threshold may be equal to or greater than the second threshold. In Type 3 sample clusters, there are no real samples, and the generated samples contain a large amount of new knowledge. However, they are likely to be outliers because they are too far from the cluster center of the actual data. Such generated samples usually also have a low contribution to the downstream model. 【0029】 Finally, based on the contribution of each generated sample within a sample cluster to the downstream model, a proportion of generated samples corresponding to that contribution is selected from that sample cluster. In other words, the number of generated samples selected from a sample cluster is positively correlated with the contribution. That is, sample clusters containing generated samples that contribute highly to improving the performance of the downstream model will have a higher selection proportion. As shown on the right side of Figure 3, the shaded circles represent the generated samples selected for use in training the downstream model. Specifically, a first proportion of generated samples is selected from the generated samples within a Type 1 sample cluster. The first proportion may be, for example, 50%. This is because, if a large number of real samples already exist, selecting a large number of generated samples that are very close to the real samples will only yield limited new knowledge, potentially leading to overfitting. A second proportion of generated samples is selected from the generated samples within a Type 2 sample cluster. The second proportion may be, for example, 90%. As a general principle, it should be understood that the second proportion is greater than the first proportion. 
The second proportion may be set to any other value greater than the first proportion, whether above or below 90%. This is because, if there are few real samples of this type, there are insufficient samples of that type to train the downstream model. In this case, the generated samples within this type of sample cluster are rich in new knowledge, and selecting a large number of them is effective for training the downstream model. In addition, a third proportion of generated samples is selected from the generated samples within a Type 3 sample cluster. The third proportion may be, for example, 10%. As a general principle, it should be understood that the third proportion is smaller than the first proportion. The third proportion may be set to any other value smaller than the first proportion, whether above or below 10%. This is because there are no real samples in Type 3; in other words, the generated samples are too far from the cluster center of the real samples, so the generated samples within this type are likely to be outliers. Such data can be detrimental to training the downstream model. 【0030】 Figure 3 illustrates an example with three types of sample clusters. Those skilled in the art will understand that a sample set may be clustered into four, five, or more types of clusters (i.e., more sample clusters). 【0031】 As can be seen from Figures 2 and 3, the method for selecting generated samples of the present disclosure improves the performance of downstream models by retaining a variety of hard examples and reducing the selection of duplicate samples and outliers. A variety of hard examples improves the generalization of the model. Duplicate samples exacerbate overfitting, and outliers are likely to degrade the performance of trained models because they are far removed from real-world situations. 【0032】 Figure 4 is a flowchart of method 400 for selecting generated samples according to another embodiment of the present disclosure.
As shown in Figure 4, method 400 includes the following steps S402 to S410. In step S402, the sample set including real samples and generated samples is divided into a plurality of groups, as in step S102 of Figure 1. The grouping may be performed by clustering, as described for step S202 of Figure 2, or by other methods, such as grouping by similarity to a training / validation set (e.g., CLIP score or LPIPS perceptual similarity). 【0033】 After grouping is complete, in step S404, a downstream model (also referred to herein as the model to be trained) is trained using the samples within each group to obtain a plurality of different trained models. Next, in step S406, the performance of each of these trained models is obtained. The performance of a model trained using the samples within a group substantially reflects the contribution of the samples within that group to the downstream model. In some embodiments, the performance of a trained model may be obtained by evaluating that trained model on a validation set. 【0034】 After the performance of each trained model is determined, in step S408, the contribution of each group of samples is determined based on that performance. In some embodiments, the performance of the trained model may be used directly as the contribution of the corresponding group of samples. In some other embodiments, the performance of the trained model may be normalized to obtain a contribution score. Those skilled in the art will understand that, provided the performance of the trained model remains positively correlated with the contribution, other appropriate algorithms may be used to convert the performance into the contribution. 【0035】 After the contributions are identified, in step S410, the contribution of each group's samples is used as the fitness supervisor function of a genetic algorithm. Based on the genetic algorithm, the generated samples to be selected are identified from the generated samples of the corresponding group.
In other words, the contribution of each group's samples is used as a fitness supervisor signal, and the genetic algorithm automatically increases the number of difficult samples that contribute highly to the downstream model based on that fitness. At the same time, it reduces duplicate samples and outliers that have a low contribution or even a negative impact. Specifically, in some embodiments, selection, crossover, and mutation are performed on the generated data within each group to produce a new group corresponding to that group. Next, the fitness of these new groups is calculated, and this process is repeated until the fitness meets the requirements, thereby obtaining a candidate sample set for each new group. Finally, the optimal candidate sample set is obtained by collecting the final new groups corresponding to each group. 【0036】 Conventional methods for selecting generated samples can be divided into two types. The first type selects the optimal generated samples using the performance of each generated sample in the downstream model as the fitness function. The second type selects the optimal generated samples using the distance between the generated sample and the cluster center as the fitness function. The first type of method is computationally very expensive: to select the best-performing subset from $n$ generated samples, the performance of the downstream model must be calculated $2^n$ times. In other words, the computational complexity is $2^n$. Therefore, it is only suitable for structured data, where the amount of data is small and the learning speed is very fast. In the second type of method, the unsupervised genetic algorithm uniformly selects generated samples that are close to the centroid of the real data. However, it discards samples that are far from the centroid of the real samples but have a high contribution to the downstream model.
Furthermore, because the unsupervised genetic algorithm cannot accurately distinguish the contribution of generated samples to the downstream model, it selects samples with low contribution, which results in a decrease in the performance of the downstream model. 【0037】 In contrast, the sample selection method 400 shown in Figure 4 uses the performance impact of each group on the downstream model as a fitness supervisor signal. This reduces computational complexity on the one hand, and on the other hand, by adopting fitness as the supervisor function, it automatically increases the number of difficult samples that contribute highly to the downstream model, while reducing duplicate samples and outliers that have a low contribution or even a negative impact. 【0038】 Referring to Figure 5, the method 400 for selecting generated samples will be further explained using a sample set of diseased leaves as an example. Figure 5 is a schematic diagram showing a method for selecting generated samples according to one embodiment of the present disclosure. As shown in Figure 5, the sample is an image sample of diseased leaves. The goal that the downstream model should achieve is to detect whether the captured leaf images contain diseased leaves. Since labeled data for images of diseased leaves is difficult to obtain, a data augmentation technique is used to increase the number of images of diseased leaves. First, bounding box filtering is performed on the images in the image set of diseased leaves to extract bounding box images. Then, the extracted bounding box images are grouped to obtain multiple groups. For example, curled leaves, yellow leaves, perforated leaves, rotten leaves, etc., are each made into one group. Next, each group of bounding box images calculates a score for the performance of the downstream model. 
Then, by using this performance score (or a contribution score normalized from it) as a fitness supervisor function, a genetic algorithm is applied to the samples of each group to obtain k sets of generated leaf bounding box images. 【0039】 Although Figure 5 illustrates image samples, those skilled in the art will understand that the samples in this disclosure may be images, audio, or text. 【0040】 The apparatus 600 for selecting generated samples according to the embodiment of this disclosure will be described below with reference to Figure 6. Figure 6 is a schematic diagram of the apparatus for selecting generated samples according to the embodiment of this disclosure. The functions of the apparatus for selecting generated samples according to this embodiment are the same as those of the method described above with reference to Figures 1 to 5, so for simplicity, a detailed explanation of the same content is omitted here. 【0041】 The apparatus 600 for selecting generated samples according to the embodiments of this disclosure includes a processor 602 and a memory 601 for storing computer-readable instructions. When the computer-readable instructions are executed by the processor, a method for selecting generated samples is performed. This method includes dividing a sample set, which includes real samples and generated samples, into a plurality of groups, and selecting generated samples in the sample set by selecting a portion of the generated samples within each of the plurality of groups based on the contribution of the generated samples in each group to the model to be trained. 【0042】 For the technical effects of the apparatus 600 for selecting generated samples in different embodiments, reference may be made to the technical effects of the method for selecting generated samples according to the embodiments of this disclosure. A detailed explanation is omitted here. 【0043】 The apparatus 600 for selecting generated samples can be used in a variety of suitable electronic devices.
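As an illustration of step S410 of method 400 described earlier, the genetic search within one group can be sketched as below. This is a hedged sketch under stated assumptions rather than the disclosed implementation: the bitmask encoding, tournament selection, one-point crossover, bit-flip mutation, elitism, and every parameter value are choices made for the example, and `fitness_fn` is a hypothetical callable standing in for "train the downstream model on the masked samples and score it on a validation set".

```python
import random


def evolve_subset(n_samples, fitness_fn, pop_size=20, generations=30,
                  mutation_rate=0.05, target=None, seed=0):
    """Search for a high-fitness subset of one group's generated samples.

    Each individual is a list of booleans (a mask) over the group's generated
    samples. fitness_fn(mask) is a hypothetical supervision signal; in the
    disclosure it would come from downstream-model performance.
    Requires n_samples >= 2 and pop_size >= 3.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_samples)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness_fn)

    for _ in range(generations):
        # Stop early once the fitness meets the (optional) requirement.
        if target is not None and fitness_fn(best) >= target:
            break
        next_population = [best[:]]  # elitism: carry the best mask forward
        while len(next_population) < pop_size:
            # Selection: two tournaments of size 3.
            parent_a = max(rng.sample(population, 3), key=fitness_fn)
            parent_b = max(rng.sample(population, 3), key=fitness_fn)
            # Crossover: splice the parents at one cut point.
            cut = rng.randrange(1, n_samples)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness_fn)
    return best
```

In practice `fitness_fn` would be expensive, so caching its results per mask would be a natural refinement. Because the best mask is always carried into the next generation, the fitness of the returned mask never falls below that of the best mask in the initial population.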
【0044】 This disclosure further provides a computer program product that includes computer-readable instructions. When the computer-readable instructions are executed by a processor, a method for selecting generated samples is performed. The method includes dividing a sample set, which includes real samples and generated samples, into a plurality of groups, and selecting generated samples in the sample set by selecting a subset of the generated samples within each of the plurality of groups based on the contribution of the generated samples in each group to a model under study. 【0045】 Each aspect / embodiment described herein may be used individually, in combination, or switched between during execution. Furthermore, the processing procedures, sequences, flowcharts, etc., of each aspect / embodiment described herein may be rearranged in order, provided they are consistent. For example, the methods described herein present various step elements in an exemplary order and are not limited to that specific order. 【0046】 As used herein, the phrase "based on" does not mean "based solely on" unless otherwise specified. In other words, the phrase "based on" means both "based solely on" and "based at least on." 【0047】 Any reference to elements using designations such as “first,” “second,” etc., as used herein, does not generally limit the quantity or order of those elements. These designations may be used herein as a convenient way to distinguish between two or more elements. Therefore, references to a first element and a second element do not imply that only two elements may be employed or that the first element must precede the second element in any way. 【0048】 Where the terms “include,” “including,” and variations thereof are used in this specification or in the claims, these terms are intended to be inclusive, as is the term “comprising.” Furthermore, where the term “or” is used in this specification or in the claims, it is not intended to be exclusive OR. 
【0049】 Those skilled in the art will understand that each aspect of this Application may be described and represented by several types or situations that are patentable, including any novel and useful combination of processes, machines, products or materials, or novel and useful improvements thereto. Accordingly, each aspect of this Application may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. Any of the above hardware or software may be referred to as “data blocks,” “modules,” “engines,” “units,” “components,” or “systems.” Furthermore, each aspect of this Application may be represented as a computer product on one or more computer-readable media containing computer-readable program code. 【0050】 This application uses specific terminology to describe embodiments thereof. For example, “one embodiment,” “one example,” and / or “several embodiments” means a component, structure, or feature relating to at least one embodiment of this application. Therefore, it should be noted that “one embodiment,” “one example,” or “one alternative embodiment” mentioned more than once in different places in this specification do not necessarily refer to the same embodiment. Furthermore, components, structures, or features of one or more embodiments of this application may be combined appropriately. 【0051】 Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as those generally understood by an ordinary technician in the art to which this disclosure pertains. Furthermore, terms defined in, for example, a standard dictionary should not be interpreted in an idealized or overly formalized sense, unless explicitly defined herein, but rather in their contextual meaning within the relevant technology. 
【0052】 Although the present disclosure has been described in detail above, it will be clear to those skilled in the art that the present disclosure is not limited to the embodiments described herein. The present disclosure may be implemented in the form of amendments and modifications without departing from the spirit and scope of the present disclosure as defined by the claims. Accordingly, the descriptions herein are for illustrative purposes only and do not imply any limitation of the present disclosure.
Claims
[Claim 1] A method for selecting generated samples, comprising: dividing a sample set including real samples and generated samples into a plurality of groups; and selecting generated samples within the sample set by selecting a part of the generated samples within each group based on the contribution of the generated samples of each of the plurality of groups to a model to be trained.
[Claim 2] The method according to claim 1, wherein the contribution of the generated samples of each group is positively correlated with the proportion of generated samples selected from the generated samples of that group.
[Claim 3] The method according to claim 1, wherein dividing the sample set including real samples and generated samples into a plurality of groups includes clustering the samples in the sample set to generate a plurality of sample clusters as the plurality of groups.
[Claim 4] The method according to claim 3, wherein the contribution of the generated samples of each group is determined based on the proportion of real samples within that group.
[Claim 5] The method according to claim 4, wherein selecting a part of the generated samples within each group based on the contribution of the generated samples of that group includes: selecting a first proportion of the generated samples within the group in response to the proportion of real samples within the group exceeding a first threshold; selecting a second proportion of the generated samples within the group in response to the proportion of real samples within the group being lower than a second threshold and greater than zero; and selecting a third proportion of the generated samples within the group in response to the proportion of real samples within the group being zero, wherein the first threshold is greater than the second threshold, the second proportion is greater than the first proportion, and the first proportion is greater than the third proportion.
[Claim 6] The method according to claim 1, wherein the contribution of the generated samples of each of the plurality of groups is determined based on the performance of a trained model that has been trained using the samples within that group, and the performance of the trained model is positively correlated with the contribution of the generated samples of that group.
[Claim 7] The method according to claim 6, wherein selecting a part of the generated samples within a group based on the contribution of the generated samples of each group to the model to be trained includes using the contribution of the generated samples of each group as a fitness supervisor function of a genetic algorithm to automatically identify, according to the genetic algorithm, the generated samples to be selected from the generated samples of the group.
[Claim 8] The method according to claim 1, wherein the real samples and the generated samples in the sample set are one of images, audio, or text.
[Claim 9] An apparatus for selecting generated samples, comprising: a processor; and a memory for storing one or more computer program modules, wherein the method for selecting generated samples according to any one of claims 1 to 8 is executed when the one or more computer program modules are executed by the processor.
[Claim 10] A computer program product including computer instructions, wherein the method for selecting generated samples according to any one of claims 1 to 8 is executed when the computer instructions are executed by a processor.