Apparatus and, in particular, computer-implemented method for classifying datasets
The method supports transfer learning by classifying candidate datasets as suitable or unsuitable for pre-training, based on their distance to a target dataset and on whether pre-training on nearby datasets improved model quality. This addresses the problem that pre-training on insufficiently similar datasets leads to suboptimal model performance.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- ROBERT BOSCH GMBH
- Filing Date
- 2022-03-15
- Publication Date
- 2026-06-22
AI Technical Summary
Existing artificial intelligence algorithms require a significant amount of training data, and transfer learning methods often rely on datasets that are not sufficiently similar, leading to suboptimal performance.
A method and apparatus for classifying datasets using a computer-implemented approach that determines the suitability of training datasets for pre-training by evaluating their similarity and quality through multiple distance scales, identifying nearest neighbors, and classifying them as suitable or unsuitable for pre-training based on model performance improvements.
Enhances the effectiveness of transfer learning by selecting the most appropriate datasets for pre-training, thereby improving model performance on the main task without the need for extensive labeled data.
Figure 0007877025000001 
Figure 0007877025000002 
Figure 0007877025000003
Abstract
Description
Technical Field
[0001] The present invention relates to an apparatus and, in particular, a computer-implemented method for classifying datasets.
Background Art
[0002] Artificial intelligence algorithms require a very large amount of training data. One way to reduce the amount of training data required is so-called transfer learning.
[0003] In transfer learning, a pre-trained model is used and then applied to the actual main task. This reduces the amount of data that has to be labeled for the main task, because the pre-trained model brings generally applicable knowledge. For best results, the model should be pre-trained on data that is as similar as possible to the data of the main task. This similarity is determined, for example, with a similarity scale or a distance scale. A distance scale of this type is disclosed in "Exploring and Predicting Transferability across NLP Tasks" by Vu et al., 2020, https://arxiv.org/abs/2005.00770.
[0004] By using a distance scale, an order of available datasets can be created, and the most similar dataset, i.e., the dataset having the shortest distance to the target dataset, can be found for training for the main task and used for pre-training the model.
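The ordering described in paragraph [0004] can be sketched as follows. This is a minimal illustration only: the distance used here (Euclidean distance between mean embeddings), the function names, and the toy data are assumptions, since the source does not prescribe a particular distance scale.

```python
import math

def mean_embedding(dataset):
    """Average the embedding vectors of a dataset (a list of equal-length lists)."""
    dim = len(dataset[0])
    return [sum(vec[i] for vec in dataset) / len(dataset) for i in range(dim)]

def distance(a, b):
    """Assumed distance scale: Euclidean distance between mean embeddings."""
    return math.dist(mean_embedding(a), mean_embedding(b))

def rank_by_distance(target, candidates):
    """Return candidate names ordered from shortest to longest distance to the target."""
    return sorted(candidates, key=lambda name: distance(target, candidates[name]))

target = [[0.0, 0.0], [2.0, 0.0]]       # mean embedding (1, 0)
candidates = {
    "A": [[1.0, 1.0], [1.0, 1.0]],      # mean embedding (1, 1)
    "B": [[4.0, 4.0], [4.0, 4.0]],      # mean embedding (4, 4)
    "C": [[1.0, 0.5]],                  # mean embedding (1, 0.5)
}
ranking = rank_by_distance(target, candidates)
most_similar = ranking[0]               # candidate that would be used for pre-training
```

The first entry of the ranking is the dataset with the shortest distance to the target dataset.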
Prior Art Documents
Non-Patent Documents
[0005]
Non-Patent Document 1
Disclosure of the Invention
[0006] A method, in particular a computer-implemented method, for classifying datasets provides the following: a model for solving a task is pre-configured; a plurality of training datasets is pre-configured; for each training dataset of the plurality of training datasets, one trained model for solving the task is determined by pre-training the model based on that training dataset and then training the model based on a reference training dataset; one trained reference model for solving the task is determined by training the model based on the reference training dataset without pre-training using the plurality of training datasets; a quality of task solving is determined for each trained model; a reference quality of task solving is determined for the trained reference model; the trained models are classified, in particular as suitable or unsuitable for pre-training, according to the deviation of the respective model's quality from the reference quality; the nearest neighbors of a dataset are determined among the plurality of training datasets; and either the dataset is classified, in particular as suitable or unsuitable for pre-training, according to how the models trained using the nearest neighbors are classified, or the nearest neighbors of the dataset are classified as suitable for pre-training. Each training dataset is a candidate for a transfer test. A transfer test comprises pre-training based on a selected dataset, i.e., one of the training datasets, and subsequent training based on the main task, i.e., on the reference training dataset. Positive transfer means that pre-training improves the performance achieved when training on the main task, compared with training on the main task without pre-training. On the one hand, it is predicted whether a new, unknown dataset is suitable for pre-training. On the other hand, it is predicted whether an unknown dataset is suitable or unsuitable as a training dataset for pre-training.
[0007] In one approach to training the model, if the dataset is classified by the classifier as suitable for pre-training, the model is pre-trained using that dataset to solve the task; if it is not classified as suitable for pre-training, the model is not pre-trained using that dataset. The dataset is thus used for pre-training only if it is selected. Subsequently, the model can be further trained based on a reference dataset.
[0008] In another approach to training the model, the model is pre-trained using at least one training dataset that has been classified as suitable for pre-training with respect to the dataset, and in particular, the model is subsequently trained to solve the task using the dataset. In this case, one or more selected training datasets are used for pre-training.
[0009] One possible approach is that, for each of the multiple training datasets mentioned above, at least one distance to the dataset is determined. Then either a predetermined number of training datasets whose distance to the dataset is shorter than that of the other training datasets are determined as nearest neighbors, or every training dataset whose distance to the dataset is shorter than a predetermined distance is determined as a nearest neighbor. Depending on the distance, each training dataset is thus either assigned to the nearest neighbors or not.
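The two selection rules of paragraph [0009] can be sketched as follows. The function names and the distance values are hypothetical; the distances are assumed to have been computed beforehand with some distance scale.

```python
def k_nearest(distances, k):
    """Rule 1: the k training datasets with the shortest distance to the dataset."""
    return sorted(distances, key=distances.get)[:k]

def within_distance(distances, max_distance):
    """Rule 2: every training dataset closer than a predetermined distance."""
    return [name for name, d in distances.items() if d < max_distance]

# Hypothetical distances from four training datasets to the dataset of interest.
distances = {"A": 0.8, "B": 0.2, "C": 1.5, "D": 0.5}

by_count = k_nearest(distances, k=2)
by_threshold = within_distance(distances, max_distance=1.0)
```

The first rule always returns a fixed number of neighbors, while the second can return any number, including none, depending on the chosen threshold.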
[0010] Preferably, for each of the multiple training datasets, multiple distances to the dataset are determined using different distance scales, and those training datasets for which at least one of the multiple distances is shorter than a predetermined distance are determined as nearest neighbors. Depending on the distance scale used, the nearest neighbors can differ. Each training dataset is evaluated based on multiple distances using different distance scales. This allows multiple different distance scales to be combined. In this way, a training dataset can be assigned to the nearest neighbors based on one distance scale even if it would not be a nearest neighbor when another distance scale is used alone.
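A minimal sketch of this combination rule is given below. It assumes that distances have already been computed per distance scale and, as an additional assumption not stated in the source, that each scale may have its own threshold. All names and values are illustrative.

```python
def neighbors_any_scale(distances_by_scale, thresholds):
    """Collect training datasets that are closer than the threshold
    under at least one distance scale."""
    neighbors = set()
    for scale, distances in distances_by_scale.items():
        for name, d in distances.items():
            if d < thresholds[scale]:
                neighbors.add(name)
    return sorted(neighbors)

# Hypothetical distances under two different distance scales.
distances_by_scale = {
    "scale_1": {"A": 0.1, "B": 0.9, "C": 0.8},
    "scale_2": {"A": 0.7, "B": 0.2, "C": 0.9},
}
thresholds = {"scale_1": 0.5, "scale_2": 0.5}

neighbors = neighbors_any_scale(distances_by_scale, thresholds)
```

Here "B" becomes a nearest neighbor because of scale_2 alone, although it would not qualify if only scale_1 were used.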
[0011] The device for classifying the dataset is configured to implement the method described above.
[0012] Preferably, the device includes a classifier and a training device, wherein the classifier is configured to classify a dataset, and the training device is configured to determine a model trained or pre-trained using the dataset for task solving if the classifier classifies the dataset as suitable for training or pre-training, and to determine a model without training or pre-training using the dataset if the classifier classifies the dataset as unsuitable for training or pre-training.
[0013] A computer program includes computer-readable instructions, and the execution of these computer-readable instructions allows the computer to carry out the methods described above.
[0014] Further advantageous embodiments will become apparent from the following description and drawings.
Brief Description of the Drawings
[0015] [Figure 1] This figure schematically shows a device for classifying datasets. [Figure 2] This figure shows a first method for classifying a dataset. [Figure 3] This figure shows a second method for classifying a dataset.
Modes for Carrying Out the Invention
[0016] Figure 1 schematically shows a device 100 for classifying datasets. In this embodiment, the datasets are a training dataset, a reference dataset, or an unknown dataset. The datasets can be labeled to enable supervised training. In this embodiment, each dataset includes multiple embeddings. Each embedding represents a digital image, metadata about that digital image, or a portion of a corpus, i.e., text, using numerical or alphanumeric characters.
[0017] The device 100 includes an input 102 for a dataset and a classifier 104 configured to classify the dataset.
[0018] Device 100 includes an input 106 for a training dataset and a training device 108, which is configured to determine a trained model 112 from a model 110 for solving a certain task by pre-training based on the training dataset and training based on a reference dataset. The training device 108 is configured to determine a reference model 114 from the model 110 by training based on a reference dataset without pre-training based on the training dataset.
[0019] The device 100 includes a unit 116 configured to determine the reference quality of the reference model 114 and the quality of the trained model 112 in solving the task. The unit 116 is configured to classify the training dataset used for pre-training the trained model 112 as suitable or unsuitable for pre-training according to the deviation of the quality from the reference quality. In this embodiment, the training dataset is classified as suitable for pre-training when the quality is higher than the reference quality, and as unsuitable for pre-training when the quality is lower than or equal to the reference quality. For the case of equal quality, the unit can alternatively be configured to classify the training dataset as neither suitable nor unsuitable.
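The labeling rule of unit 116 can be sketched as follows. The function name, the `tie_label` parameter, and the quality values are hypothetical; the quality measure itself (for example, accuracy) is an assumption, since the source leaves it open.

```python
def label_training_dataset(quality, reference_quality, tie_label="unsuitable"):
    """Label a training dataset from the result of its transfer test.

    tie_label covers the case of equal quality: "unsuitable" as in the
    described embodiment, or "neither" as the alternative configuration.
    """
    if quality > reference_quality:
        return "suitable"
    if quality < reference_quality:
        return "unsuitable"
    return tie_label

reference_quality = 0.80  # hypothetical quality of the reference model
qualities = {"A": 0.85, "B": 0.74, "C": 0.80}  # hypothetical transfer-test results

labels = {name: label_training_dataset(q, reference_quality)
          for name, q in qualities.items()}
```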
[0020] The classifier 104 is configured to classify the dataset according to how the model 112 trained using the training dataset is classified.
[0021] The device 100 is configured to classify a plurality of training datasets in this way.
[0022] In this embodiment, the classifier 104 includes a k-nearest neighbor classifier configured to determine a number k of nearest neighbors of the dataset from among the plurality of training datasets.
[0023] In one embodiment, the classifier 104 is configured to classify the dataset as suitable or unsuitable for pre-training the model 110 for training based on the reference dataset, according to how its nearest neighbors are classified. In this embodiment, the model 110 is an artificial neural network that can be trained in a supervised manner using the dataset to determine the solution for a certain task.
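A k-nearest neighbor classification of this kind can be sketched as follows. The source only states that the dataset is classified according to how its nearest neighbors are classified; the majority vote and the handling of ties used here are assumptions, as are all names and values.

```python
from collections import Counter

def classify_dataset(distances, labels, k=3):
    """Classify a dataset from the labels of its k nearest training datasets.

    Assumption: majority vote, with ties counted as "unsuitable".
    """
    nearest = sorted(distances, key=distances.get)[:k]
    votes = Counter(labels[name] for name in nearest)
    return "suitable" if votes["suitable"] > votes["unsuitable"] else "unsuitable"

# Hypothetical distances to the dataset and labels from earlier transfer tests.
distances = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.9}
labels = {"A": "suitable", "B": "unsuitable", "C": "suitable", "D": "unsuitable"}

result_k3 = classify_dataset(distances, labels, k=3)
result_k2 = classify_dataset(distances, labels, k=2)
```

With k=3 the neighbors A, B, and C give two "suitable" votes against one, while with k=2 the vote is tied and falls back to "unsuitable" under the stated assumption.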
[0024] In this embodiment, the training device 108 is configured, when the dataset is classified as suitable for pre-training, to determine a model 118 that is pre-trained using the dataset and trained using a reference dataset. When the dataset is not classified as suitable for pre-training, the training device 108 can be configured not to perform pre-training or to perform pre-training using another dataset.
[0025] In one embodiment, the classifier 104 is configured to identify a number k of nearest neighbors that are classified as suitable for pre-training the model 110 for training based on the dataset.
[0026] In this embodiment, the training device 108 is configured to determine a model 118 that is pre-trained using at least one of the training datasets determined as nearest neighbors and trained using the dataset.
[0027] The device 108 can be configured to output the number k of nearest neighbors and/or the training datasets determined as nearest neighbors. The device 108 can be configured to sort and output the nearest neighbors according to their distances to the dataset.
[0028] The device 100 is configured to implement one or both of the methods described below.
[0029] A first method for classifying a dataset will be described with reference to FIG. 2. According to the first method, it is determined whether the dataset is suitable for pre-training. Optionally, when the dataset is suitable for pre-training, pre-training is performed using the dataset.
[0030] In step 202, model 110 for task resolution is pre-configured.
[0031] In step 204, multiple training datasets are pre-configured.
[0032] In step 206, the reference model 114 for solving the task is determined by training model 110 based on the reference training dataset, without pre-training using the multiple training datasets mentioned above.
[0033] In step 208, the reference quality of task solving is determined for the reference model trained in this manner.
[0034] In step 210, for each of the training datasets belonging to the above-mentioned multiple training datasets, one trained model 112 is determined for task solving by pre-training model 110 based on this training dataset and training model 110 based on the reference training dataset.
[0035] In step 212, the quality of task solving is determined for each of the models 112 trained in this manner.
[0036] In step 214, the models 112 thus trained are classified, in particular as suitable or unsuitable for pre-training, according to the deviation of their quality from the reference quality.
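Steps 202 to 214 can be sketched as a loop over transfer tests. The sketch below is illustrative only: the training, pre-training, and evaluation routines are passed in as callables, and the toy stand-ins (a "model" that is a single number, and a quality defined as negative error) are assumptions chosen so that the example runs without any machine learning library.

```python
from statistics import mean

def run_transfer_tests(make_model, pretrain, train, evaluate,
                       training_datasets, reference_dataset):
    """Label each training dataset by comparing its transfer test to the reference."""
    # Steps 206 and 208: reference model and reference quality, no pre-training.
    reference_quality = evaluate(train(make_model(), reference_dataset))
    labels = {}
    for name, dataset in training_datasets.items():
        # Step 210: pre-train on the candidate, then train on the reference dataset.
        model = train(pretrain(make_model(), dataset), reference_dataset)
        # Steps 212 and 214: compare the quality with the reference quality.
        quality = evaluate(model)
        labels[name] = "suitable" if quality > reference_quality else "unsuitable"
    return reference_quality, labels

# Toy stand-ins (assumptions): the "model" is one number that moves halfway
# toward the mean of the data it sees; quality is the negative error to 10.0.
def toy_fit(model, data):
    return model + 0.5 * (mean(data) - model)

def toy_quality(model):
    return -abs(model - 10.0)

reference_quality, labels = run_transfer_tests(
    make_model=lambda: 0.0, pretrain=toy_fit, train=toy_fit, evaluate=toy_quality,
    training_datasets={"near": [9.0, 11.0], "far": [-10.0, -10.0]},
    reference_dataset=[10.0, 10.0],
)
```

In this toy setting, pre-training on the "near" dataset moves the model closer to the target before the main training and therefore yields positive transfer, whereas the "far" dataset moves it away.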
[0037] In step 216, the nearest neighbors of the dataset are determined among the multiple training datasets mentioned above.
[0038] In this embodiment, for each of the multiple training datasets mentioned above, at least one distance to the dataset is identified.
[0039] One possible approach here is to determine, as the nearest neighbors, a predetermined number of training datasets whose distance to the dataset is shorter than that of the other training datasets belonging to the same set of training datasets.
[0040] Another possible approach here is to determine, as a nearest neighbor, each training dataset whose distance to the dataset is shorter than a predetermined distance.
[0041] Depending on the distance, for example, each training dataset is either assigned to the nearest neighbors or not.
[0042] Multiple distance scales can be used. For each of the multiple training datasets mentioned above, multiple distances to the dataset can be determined using different distance scales. In this embodiment, those training datasets for which at least one of the multiple distances is shorter than a predetermined distance are determined as nearest neighbors.
[0043] Different distance scales can be predetermined or made selectable by the user.
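One way to make distance scales predetermined or user-selectable is a lookup table of distance functions. The sketch below is an assumption throughout: it represents each dataset by a single summary vector and offers Euclidean and cosine distance as two example scales, neither of which is named in the source.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two summary vectors."""
    return math.dist(u, v)

def cosine(u, v):
    """Cosine distance (1 minus cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical registry from which distance scales can be selected by name.
DISTANCE_SCALES = {"euclidean": euclidean, "cosine": cosine}

def compute_distances(dataset_vector, training_vectors, selected_scales):
    """Return, per selected scale, the distance from each training dataset to the dataset."""
    return {
        scale: {name: DISTANCE_SCALES[scale](dataset_vector, vec)
                for name, vec in training_vectors.items()}
        for scale in selected_scales
    }

training_vectors = {"A": [2.0, 0.0], "B": [0.0, 1.0]}
distances = compute_distances([1.0, 0.0], training_vectors, ["euclidean", "cosine"])
```

In this example, "A" points in the same direction as the dataset vector, so its cosine distance is zero even though its Euclidean distance is not, which illustrates how the nearest neighbors can differ between scales.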
[0044] In step 218, the dataset is classified according to how the models 112 trained using the nearest neighbors are classified. Specifically, the dataset is classified as either suitable or unsuitable for pre-training.
[0045] Optionally, in step 220, it is checked whether the dataset is classified as suitable for pre-training.
[0046] If the dataset is classified as suitable for pre-training, step 222 is performed. If it is not, the process can be configured either to repeat step 216 for other datasets, in particular until one dataset is classified as suitable for pre-training, or to perform step 224 without pre-training the model.
[0047] In step 222, model 110 is pre-trained using the dataset. Then, step 224 is performed.
[0048] In step 224, a model 118 is determined that has been pre-trained on the dataset and subsequently trained on the reference dataset. If no dataset is selected, the model 118 can be determined by training on the reference dataset without pre-training.
[0049] Step 224 is optional. Alternatively, a model 118 can be determined that has been pre-trained solely on the dataset, without being further trained on the reference dataset.
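The control flow of steps 216 to 224 can be sketched as follows. All names are hypothetical: the suitability check stands in for the nearest neighbor classification, and the toy "model" is simply a list that records which training stages were applied.

```python
def pretrain_if_suitable(candidates, is_suitable, make_model, pretrain, train,
                         reference_dataset):
    """Pre-train on the first candidate classified as suitable, then train."""
    model = make_model()
    chosen = None
    for name, dataset in candidates.items():
        if is_suitable(name):                 # steps 216 to 220
            model = pretrain(model, dataset)  # step 222
            chosen = name
            break
    # Step 224: train on the reference dataset, with or without pre-training.
    model = train(model, reference_dataset)
    return chosen, model

# Toy stand-in (assumption): training appends a stage name to the model.
def record(stage):
    return lambda model, dataset: model + [stage]

suitable_names = {"B"}
chosen, model = pretrain_if_suitable(
    candidates={"A": [1, 2], "B": [3, 4]},
    is_suitable=lambda name: name in suitable_names,
    make_model=list, pretrain=record("pretrained"), train=record("trained"),
    reference_dataset=[5, 6],
)
```

If no candidate is classified as suitable, the function returns a model that was trained on the reference dataset only.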
[0050] A second method for classifying datasets will be explained with reference to Figure 3. According to the second method, it is determined which of the multiple training datasets are suitable for pre-training when the dataset is to be used for training model 110 for a task. Optionally, pre-training is performed using at least one of the training datasets that are suitable for pre-training.
[0051] In step 302, model 110 for task resolution is pre-configured.
[0052] In step 304, multiple training datasets are pre-configured.
[0053] In step 306, the reference model 114 for solving the task is determined by training model 110 based on the reference training dataset, without pre-training using the multiple training datasets mentioned above.
[0054] In step 308, the reference quality of task solving is determined for the reference model trained in this manner.
[0055] In step 310, for each of the training datasets belonging to the above-mentioned multiple training datasets, one trained model 112 is determined for task solving by pre-training model 110 based on this training dataset and training model 110 based on a reference training dataset.
[0056] In step 312, the quality of task solving is determined for each of the models 112 trained in this manner.
[0057] In step 314, the models 112 thus trained are classified, in particular as suitable or unsuitable for pre-training, according to the deviation of their quality from the reference quality.
[0058] In step 316, the nearest neighbors of the dataset are determined among the multiple training datasets mentioned above.
[0059] In this embodiment, for each of the multiple training datasets mentioned above, at least one distance to the dataset is identified.
[0060] One possible approach here is to determine, as a nearest neighbor, each training dataset whose distance to the dataset is shorter than a predetermined distance.
[0061] Depending on the distance, for example, each training dataset is either assigned to the nearest neighbors or not.
[0062] Multiple distance scales can be used. For each of the multiple training datasets mentioned above, multiple distances to the dataset can be determined using different distance scales. In this embodiment, those training datasets for which at least one of the multiple distances is shorter than a predetermined distance are determined as nearest neighbors.
[0063] Different distance scales can be predetermined or made selectable by the user.
[0064] In step 318, the nearest neighbors of the dataset are classified as suitable for pre-training.
[0065] Optionally, in step 320, model 110 is pre-trained using at least one training dataset that has been classified as suitable for pre-training.
[0066] Optionally, in step 322, a model 118 for solving the task is determined that has been pre-trained using the at least one training dataset and trained using the dataset.
[0067] Because step 322 is optional, a model 118 can alternatively be determined that has been pre-trained on the at least one training dataset without being further trained on the dataset.
[0068] Instead of determining the pre-trained and trained model 118 in this way, the system can be configured to output the number or the identifiers of the training datasets suitable for pre-training. The training datasets can be output sorted according to their distance to the dataset. Preferably, the training datasets are sorted according to their suitability for improving the quality of task solving through pre-training. Preferably, the most suitable training dataset is placed at the top of the order.
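The sorted output described in paragraph [0068] can be sketched as follows. The source does not define how suitability is quantified; using the quality gain over the reference quality as the sort key is an assumption, as are the names and values.

```python
def rank_suitable_neighbors(neighbors, quality, reference_quality):
    """Keep neighbors with positive transfer and sort them by quality gain,
    most suitable first."""
    gains = {name: quality[name] - reference_quality for name in neighbors}
    suitable = [name for name in neighbors if gains[name] > 0]
    return sorted(suitable, key=lambda name: gains[name], reverse=True)

# Hypothetical nearest neighbors and transfer-test qualities.
neighbors = ["A", "B", "C"]
quality = {"A": 0.82, "B": 0.78, "C": 0.90}
reference_quality = 0.80

ranking = rank_suitable_neighbors(neighbors, quality, reference_quality)
```

Sorting by distance instead would only require replacing the sort key with the distance of each neighbor to the dataset.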
Claims
1. A method, in particular a computer-implemented method, for classifying datasets, characterized in that: a model (110) for solving a task is pre-configured (202, 302); a plurality of training datasets is pre-configured (204, 304); for each training dataset of the plurality of training datasets, one trained model (112) for solving the task is determined by pre-training the model (110) based on that training dataset and training the model (110) based on a reference training dataset (210, 310); one trained reference model for solving the task is determined by training the model (110) based on the reference training dataset without pre-training using the plurality of training datasets (206, 306); for each trained model (112), a quality of task solving is determined (212, 312); for the trained reference model, a reference quality of task solving is determined (208, 308); the trained models (112) are classified, in particular as suitable or unsuitable for pre-training, according to the deviation of the respective quality of the models (112) from the reference quality (214, 314); the nearest neighbors of a dataset are determined among the plurality of training datasets (216, 316); and either the dataset is classified, in particular as suitable or unsuitable for pre-training, according to how the trained models (112) trained using the nearest neighbors are classified (218), or the nearest neighbors of the dataset are classified as suitable for pre-training (318).
2. The method according to claim 1, wherein, in training the model (110), if the dataset is classified by the classifier as suitable for pre-training (220), the model (110) is pre-trained using the dataset for solving the task (222), and, if it is not classified as suitable for pre-training, the model (110) is not pre-trained using the dataset for solving the task.
3. The method according to claim 1, wherein, in training the model (110), the model (110) is pre-trained using at least one training dataset that is classified as suitable for pre-training with respect to the dataset (320), and in particular, the model (110) is subsequently trained using the dataset to solve the task (322).
4. The method according to any one of claims 1 to 3, wherein, for each of the training datasets in the plurality of training datasets, at least one distance to the dataset is identified (216, 316), and wherein either a predetermined number of training datasets belonging to the plurality of training datasets, whose distance to the dataset is shorter than that of the other training datasets belonging to the plurality of training datasets, are determined as the nearest neighbors, or each training dataset belonging to the plurality of training datasets whose distance to the dataset is shorter than a predetermined distance is determined as a nearest neighbor.
5. The method according to claim 4, wherein, for each of the training datasets in the plurality of training datasets, multiple distances to the dataset are determined using different distance scales (216, 316), and each training dataset belonging to the plurality of training datasets for which at least one of the multiple distances is shorter than the predetermined distance is determined as a nearest neighbor.
6. A device (100) for classifying a dataset (102), characterized in that the device (100) is configured to carry out the method according to any one of claims 1 to 5.
7. The device (100) according to claim 6, wherein the device (100) includes a classifier (104) and a training device (108), the classifier (104) is configured to classify the dataset, and the training device (108) is configured to determine a model (118) for solving the task that is trained or pre-trained using the dataset if the classifier classifies the dataset as suitable for training or pre-training, and to determine a model (118) without training or pre-training using the dataset if the classifier classifies the dataset as unsuitable for training or pre-training.
8. A computer program including computer-readable instructions, characterized in that execution of the computer-readable instructions by a computer causes the method according to any one of claims 1 to 5 to be performed.