Data distillation system, data distillation method, and data distillation program

By defining an upper bound value for the variation range of the second model parameter, the dual optimization problem in kernel models is reduced to a single optimization problem, enabling efficient data distillation and maintaining classification accuracy.

WO2026133425A1 · PCT designated stage · Publication Date: 2026-06-25 · DENSO CORP +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
DENSO CORP
Filing Date
2024-12-17
Publication Date
2026-06-25


Abstract

A data distillation system (1) generates, from a dataset used for training a kernel model, a synthetic dataset having a smaller number of pieces of data than the number of pieces of data constituting the dataset. The data distillation system comprises an information acquisition unit (20) and an information processing unit (30). The information acquisition unit acquires a dataset, a first model parameter obtained by training a kernel model using the dataset, and a synthetic data count indicating the number of the synthetic datasets generated. It is assumed that a model parameter obtained by training the kernel model using the synthetic dataset of the synthetic data count is a second model parameter. In this case, the information processing unit defines an upper bound value of a variation range of the second model parameter with respect to the first model parameter by using a predetermined upper bound evaluation method, and generates the synthetic dataset through processing for minimizing the upper bound value.

Description

Data distillation system, data distillation method, data distillation program

[0001] The present disclosure relates to a data distillation system, a data distillation method, and a data distillation program.

[0002] Conventionally, a technique is known for distilling a large dataset into a smaller synthetic dataset in order to train a kernel model efficiently (see, for example, Non-Patent Document 1). Data distillation is formulated as a meta-learning problem that performs double (bilevel) optimization, namely data optimization and model optimization. Double optimization has a high computational cost and is therefore difficult to solve in a realistic time. Non-Patent Document 1 addresses this for kernel ridge regression: because the solution of the model optimization can be derived analytically, the double optimization problem is reduced to a single optimization problem.
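The reduction available for kernel ridge regression can be sketched as follows. This is a standard way to write the closed-form solution and is not copied from the published text; the symbols (X_o, y_o) for the original dataset, (X_s, y_s) for the synthetic dataset, the kernel matrices K, and the regularization coefficient λ are notation assumed here for illustration.

```latex
% Inner problem (model optimization) has a closed-form solution on the synthetic set:
\alpha_s = (K_{ss} + \lambda I)^{-1} y_s,
\qquad (K_{ss})_{ij} = k(x^{s}_i, x^{s}_j)

% Substituting it removes the inner problem, leaving one problem over (X_s, y_s):
\min_{X_s,\, y_s}\ \frac{1}{2}\,\bigl\lVert\, y_o - K_{os}\,(K_{ss} + \lambda I)^{-1} y_s \,\bigr\rVert^2,
\qquad (K_{os})_{ij} = k(x^{o}_i, x^{s}_j)
```

For losses such as the hinge loss or the logistic loss no comparable closed form exists, which is why this substitution is not available for those models.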

[0003] Non-Patent Document 1: Timothy Nguyen, et al., “DATASET META-LEARNING FROM KERNEL RIDGE-REGRESSION”, [online], 2021/3/22, [searched on 2024/11/6], Internet, <URL: https://arxiv.org/pdf/2011.00050>

[0004] By the way, in kernel models other than kernel ridge regression (for example, support vector machines, logistic regression models), it is difficult to analytically derive the solution of model optimization. Therefore, the data distillation technique described in Non-Patent Document 1 is limited to the kernel ridge regression model and cannot be applied to kernel models other than the kernel ridge regression model.

[0005] An object of the present disclosure is to provide a data distillation system, a data distillation method, and a data distillation program that are also applicable to kernel models other than kernel ridge regression.

[0006] The inventors diligently investigated methods for reducing a dual optimization problem to a single optimization problem in kernel models other than kernel ridge regression. As a result, they devised a method to reduce a dual optimization problem to a single optimization problem by defining an upper bound value for the variation range of the second model parameter relative to the first model parameter using an upper bound evaluation method.

[0007] According to one aspect of this disclosure, a data distillation system generates, from a dataset used to train a kernel model, a synthetic dataset having fewer data points than the number of data points constituting the dataset. The system comprises an information acquisition unit and an information processing unit. The information acquisition unit acquires the dataset, a first model parameter obtained by training the kernel model using the dataset, and a synthetic data count indicating the number of synthetic data points to be generated. When the model parameter obtained by training the kernel model using a synthetic dataset of the synthetic data count is taken as a second model parameter, the information processing unit uses a predetermined upper bound evaluation method to define an upper bound value for the variation range of the second model parameter with respect to the first model parameter, and generates the synthetic dataset through a process of minimizing the upper bound value.

[0008] According to another aspect of this disclosure, a data distillation method generates, from a dataset used to train a kernel model, a synthetic dataset having fewer data points than the number of data points constituting the dataset. The method includes acquiring the dataset, a first model parameter obtained by training the kernel model using the dataset, and a synthetic data count indicating the number of synthetic data points to be generated. The method further includes, when the model parameter obtained by training the kernel model using a synthetic dataset of the synthetic data count is taken as a second model parameter, using a predetermined upper bound evaluation method to define an upper bound value for the variation range of the second model parameter with respect to the first model parameter, and generating the synthetic dataset through a process of minimizing the upper bound value.

[0009] According to yet another aspect of this disclosure, a data distillation program causes a computer to generate, from a dataset used to train a kernel model, a synthetic dataset having fewer data points than the number of data points constituting the dataset. The program causes the computer to perform an acquisition process and a generation process. The acquisition process acquires the dataset, a first model parameter obtained by training the kernel model using the dataset, and a synthetic data count indicating the number of synthetic data points to be generated. In the generation process, when the model parameter obtained by training the kernel model using a synthetic dataset of the synthetic data count is taken as a second model parameter, a predetermined upper bound evaluation method is used to define an upper bound value for the variation range of the second model parameter with respect to the first model parameter, and the synthetic dataset is generated through a process of minimizing the upper bound value.

[0010] As disclosed herein, by defining an upper bound value for the range of variation of the second model parameter relative to the first model parameter using an upper bound evaluation method, it becomes possible to reduce the dual optimization problem of data distillation in kernel models other than kernel ridge regression to a single optimization problem. Furthermore, the generation of a synthetic dataset can be performed in a realistic time through the minimization process of the aforementioned upper bound value.

[0011] Thus, according to this disclosure, data distillation can be performed on kernel models other than kernel ridge regression.

[0012] Figure 1 is an explanatory diagram for explaining data distillation. Figure 2 is a schematic diagram of the data distillation system according to the embodiment. Figure 3 is an explanatory diagram for explaining the dual optimization of data distillation. Figure 4 is a flowchart of the control processing executed by the computer constituting the data distillation system. Figure 5 is an explanatory diagram for explaining the symbols, etc., in [Equation 6]. Figure 6 is an explanatory diagram for explaining the performance of a kernel model trained on a synthetic dataset obtained by data distillation of the present disclosure.

[0013] Hereinafter, an embodiment of this disclosure will be described with reference to Figures 1 to 6. In this embodiment, an example of performing data distillation on a kernel model other than kernel ridge regression (KRR) will be described.

[0014] A kernel model is a linear model that uses kernel methods and assumes a strongly convex loss function. Examples of such kernel models include support vector machines (SVM) and kernel logistic regression (LR), which are widely used for classification problems.
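As a concrete reading of this definition, the training objective of such a kernel model can be written as a regularized empirical loss. The form below is a common textbook formulation, not one of the published equations; the symbols n (number of data points), φ (feature map of the kernel, with k(x, x') = ⟨φ(x), φ(x')⟩), ℓ (per-sample loss), and λ (L2 regularization coefficient) are assumptions introduced for illustration.

```latex
P(\beta) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\ \langle \beta, \phi(x_i) \rangle\bigr)
           + \frac{\lambda}{2}\,\lVert \beta \rVert^2

\ell_{\mathrm{SVM}}(y, f) = \max(0,\ 1 - y f),
\qquad
\ell_{\mathrm{LR}}(y, f) = \log\bigl(1 + e^{-y f}\bigr)
```

With a convex loss ℓ and λ > 0, the L2 term makes P a λ-strongly convex function of β, which is the property the definition relies on.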

[0015] As shown in Figure 1, the data distillation system 1 is a system that generates a synthetic dataset consisting of a small number of data points with which, for example, a kernel model achieves performance similar to that obtained when it is trained with the original training dataset. The synthetic dataset is also called support data.

[0016] As shown in Figure 2, the data distillation system 1 of this embodiment is configured to include a computer 10 that includes one or more processors and memory. The memory is composed of a non-transitory tangible storage medium. The computer 10 reads and executes various computer programs, including a data distillation program, stored in the memory. When the various computer programs, including the data distillation program, are executed, a method corresponding to the data distillation program (i.e., a data distillation method) is performed.

[0017] The computer 10 constituting the data distillation system 1 of this embodiment functions as an information acquisition unit 20, an information processing unit 30, and an information output unit 40 by executing a program stored in memory. In other words, the data distillation system 1 comprises an information acquisition unit 20, an information processing unit 30, and an information output unit 40.

[0018] The information acquisition unit 20 acquires various types of information through a large-capacity database DB1, a data input unit DI, etc., located outside the system. In this embodiment, the large-capacity database DB1 and the data input unit DI are located outside the data distillation system 1, but this is not limited to this configuration. The data distillation system 1 may be configured to include at least one of the large-capacity database DB1 and the data input unit DI.

[0019] Specifically, the information acquisition unit 20 acquires at least a portion of the large amount of data stored in the large-capacity database DB1 as a dataset to be used for training the kernel model. This dataset is a collection of labeled data. The dataset includes raw data in which content such as personal information has not been modified or altered.

[0020] Furthermore, the information acquisition unit 20 acquires, through the data input unit DI, information such as the first model parameter obtained by training the kernel model using the dataset and the number of synthetic data points input to the data input unit DI. The number of synthetic data points is the number of data points in the synthetic dataset generated by data distillation. The number of synthetic data points is specified by the user or other relevant parties.

[0021] The information processing unit 30 generates a synthetic dataset from the dataset through data distillation. Data distillation is formulated as a dual optimization problem involving data optimization and model optimization, as shown in [Equation 1] and Figure 3 below.

[0022] Here, in [Equation 1] and Figure 3, the first model parameter is denoted "βo", the second model parameter "βs", the synthetic dataset "Xs" and "ys", and the loss function of the synthetic dataset "Ps".
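Using these symbols, the dual optimization problem described in this paragraph and the preceding one can be sketched as follows. The published [Equation 1] is not reproduced in this text, so the exact form below is an assumption based on the surrounding description (an outer data optimization over Xs and ys, and an inner model optimization over the model parameter).

```latex
\min_{X_s,\, y_s}\ \bigl\lVert \beta_o - \beta_s \bigr\rVert
\quad \text{subject to} \quad
\beta_s = \operatorname*{arg\,min}_{\beta}\ P_s(\beta;\ X_s, y_s)
```

The outer problem cannot be evaluated without solving the inner one, which is the source of the computational cost of the dual optimization.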

[0023] In this disclosure, in order to reduce the above-mentioned dual optimization problem to a single optimization problem, the upper bound of the parameter change ||βo - βs|| is minimized instead of the parameter change itself. Specifically, the information processing unit 30 uses a predetermined upper bound evaluation method to define an upper bound for the range of variation of the second model parameter relative to the first model parameter, and generates a synthetic dataset through a process of minimizing the defined upper bound.

[0024] As shown in [Equation 2], the information processing unit 30 of this embodiment defines an upper bound value of the variation range of the second model parameter βs with respect to the first model parameter βo using the duality gap Gs. In other words, in this embodiment, the upper bound evaluation method employed is a method in which the upper bound value of the distance between the first model parameter βo and the second model parameter βs is defined using the duality gap Gs.

[0025] The duality gap Gs is the difference between the evaluation value Ps(βo), obtained by applying the first model parameter βo to the loss function Ps of the synthetic dataset, and the evaluation value Ds(αs), obtained by applying the second model parameter αs to the loss function Ds of the synthetic dataset, as shown in [Equation 3]. Here, the loss function Ps is the objective function of the primal problem, and the loss function Ds is the objective function of the corresponding dual problem. The second model parameter αs is the dual variable corresponding to the second model parameter βs. By generating, as the synthetic dataset, data that minimizes the duality gap Gs, the performance degradation of the kernel model due to data distillation can be sufficiently suppressed.
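The reason a duality gap bounds the parameter change can be sketched with a standard argument. The published [Equation 2] and [Equation 3] are not reproduced in this text; the derivation below assumes that Ps is λ-strongly convex (λ is a symbol introduced here) and that βs is its minimizer, and it may differ in constants from the published form.

```latex
% Duality gap as described in the text
G_s = P_s(\beta_o) - D_s(\alpha_s)

% Strong convexity of P_s around its minimizer \beta_s
P_s(\beta_o) \ \ge\ P_s(\beta_s) + \frac{\lambda}{2}\,\lVert \beta_o - \beta_s \rVert^2

% Weak duality: P_s(\beta_s) \ge D_s(\alpha) for every dual-feasible \alpha
\frac{\lambda}{2}\,\lVert \beta_o - \beta_s \rVert^2
  \ \le\ P_s(\beta_o) - P_s(\beta_s)
  \ \le\ P_s(\beta_o) - D_s(\alpha_s) = G_s
\quad\Longrightarrow\quad
\lVert \beta_o - \beta_s \rVert \ \le\ \sqrt{\frac{2\,G_s}{\lambda}}
```

The bound holds for any dual-feasible αs, not only the exact dual optimum, which is why an approximate αs can still be used to bound the parameter change.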

[0026]

[0027] Here, the second model parameter αs included in [Equation 2] and [Equation 3] is difficult to obtain by training a kernel model using a synthetic dataset. Therefore, the information processing unit 30 of this embodiment approximates the second model parameter βs by replacing it with the first model parameter βo in the Karush-Kuhn-Tucker condition (hereinafter also referred to as the KKT condition) that arises when training a kernel model using a synthetic dataset. Note that the KKT condition naturally holds for the trained model parameters. In this embodiment, the KKT condition includes constraints that the synthetic dataset "Xs, ys" and the second model parameters αs and βs obtained by training the kernel model must satisfy. Specifically, in this embodiment, the KKT condition shown in [Equation 4] below is used.

[0028] The “[0, 1]” shown in [Equation 4] means a value between 0 and 1, inclusive.

[0029] The second model parameter βs shown in [Equation 4] is expected to coincide with the first model parameter βo. Therefore, in this embodiment, the second model parameter βs shown in [Equation 4] is approximated by replacing it with the first model parameter βo. By doing so, the second model parameter αs can be determined.

[0030] However, with the above approximation, the second model parameter αs shown in [Equation 4] becomes discontinuous with respect to the synthetic dataset "Xs, ys". For this reason, it is desirable to approximate αs smoothly using, for example, a sigmoid function, as shown in [Equation 5].
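The discontinuity and its smoothing can be illustrated with a minimal Python sketch. It assumes the hinge loss of a support vector machine, for which the KKT condition makes the dual variable a step function of the margin y·f(x); the published [Equation 4] and [Equation 5] are not reproduced in this text, so the exact form may differ. The function names and the `temperature` parameter are hypothetical.

```python
import math


def hard_alpha(margin):
    """Step-function dual variable implied by the hinge-loss KKT condition."""
    if margin < 1.0:
        return 1.0   # point violates the margin
    if margin > 1.0:
        return 0.0   # point lies strictly outside the margin
    return 0.5       # at margin == 1 any value in [0, 1] is admissible


def smooth_alpha(margin, temperature=0.1):
    """Sigmoid relaxation of hard_alpha (temperature is an assumed knob)."""
    z = (1.0 - margin) / temperature
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Margins computed with beta_o in place of beta_s, per the approximation above.
for m in (0.0, 0.9, 1.0, 1.1, 2.0):
    print(m, hard_alpha(m), round(smooth_alpha(m), 4))
```

A smaller `temperature` follows the step function more closely but makes the gradient with respect to the synthetic data steeper near a margin of 1.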

[0031] The information output unit 40 outputs the synthetic dataset generated by the information processing unit 30 to the external database DB2. The synthetic dataset is a compressed form of the raw data, which reduces storage costs and helps protect privacy.

[0032] Next, the control processing performed by the computer 10 of the data distillation system 1 will be explained with reference to Figure 4. The control routine shown in Figure 4 is realized by having the processor of the computer 10 execute the data distillation program stored in the memory of the computer 10.

[0033] In step S10, the computer 10 performs an acquisition process to obtain the dataset, the first model parameter βo, the number of synthetic data points, etc. The computer 10 stores the acquired information in memory or elsewhere as appropriate.

[0034] Next, in step S20, the computer 10 uses a predetermined upper bound evaluation method to define an upper bound value for the variation range of the second model parameter βs with respect to the first model parameter βo, and executes a generation process to generate a synthetic dataset through a process of minimizing the upper bound value. For example, if the kernel model is a support vector machine (SVM), the computer 10 generates the synthetic dataset by performing a minimization calculation of the duality gap Gs shown in [Equation 6] to [Equation 8] below.

[0035]

[0036]

[0037] Here, since [Equation 6] to [Equation 8] are obtained by expanding known definitions related to a support vector machine (SVM) or the like, detailed descriptions thereof are omitted. Note that the various symbols shown in [Equation 6] are as shown in Figure 5.
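The generation process of step S20 can be illustrated with a small self-contained Python sketch. It is not the published [Equation 6] to [Equation 8]: it assumes a linear kernel, a textbook primal and dual pair for the L2-regularized hinge loss, a sigmoid-smoothed αs computed with βo in place of βs, and a simple finite-difference descent swapped in as the minimizer. All function names, `lam`, `temperature`, and the toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def primal(beta, X, y, lam):
    # P_s(beta): mean hinge loss plus L2 regularization (lam-strongly convex).
    hinge = sum(max(0.0, 1.0 - yi * dot(beta, xi)) for xi, yi in zip(X, y))
    return hinge / len(X) + 0.5 * lam * dot(beta, beta)


def dual(alpha, X, y, lam):
    # D_s(alpha): dual objective of the problem above, for alpha_i in [0, 1].
    n, d = len(X), len(X[0])
    v = [sum(a * yi * xi[j] for a, xi, yi in zip(alpha, X, y)) for j in range(d)]
    return sum(alpha) / n - dot(v, v) / (2.0 * lam * n * n)


def duality_gap(beta_o, X_s, y_s, lam, temperature=0.1):
    # alpha_s from the KKT condition with beta_s replaced by beta_o, smoothed.
    alpha = [sigmoid((1.0 - yi * dot(beta_o, xi)) / temperature)
             for xi, yi in zip(X_s, y_s)]
    return primal(beta_o, X_s, y_s, lam) - dual(alpha, X_s, y_s, lam)


def distill(beta_o, X_s, y_s, lam, steps=200, lr=0.05, eps=1e-5):
    # Finite-difference descent on X_s; a step is kept only if the gap decreases.
    X = [list(x) for x in X_s]
    best = duality_gap(beta_o, X, y_s, lam)
    for _ in range(steps):
        grad = []
        for i in range(len(X)):
            row = []
            for j in range(len(X[i])):
                X[i][j] += eps
                up = duality_gap(beta_o, X, y_s, lam)
                X[i][j] -= eps
                row.append((up - best) / eps)
            grad.append(row)
        cand = [[x - lr * g for x, g in zip(xr, gr)] for xr, gr in zip(X, grad)]
        val = duality_gap(beta_o, cand, y_s, lam)
        if val < best:
            X, best = cand, val
        else:
            lr *= 0.5  # shrink the step when no improvement is found
    return X, best


beta_o = [2.0, -1.0]            # toy first model parameter
X0 = [[0.3, 0.1], [-0.2, 0.4]]  # toy initial synthetic inputs
y_s = [1, -1]                   # synthetic labels, kept fixed here
g0 = duality_gap(beta_o, X0, y_s, lam=0.1)
X1, g1 = distill(beta_o, X0, y_s, lam=0.1)
print("gap before:", g0, "gap after:", g1)
```

Because a step is accepted only when it lowers the gap, the returned gap never exceeds the initial one. In practice an automatic-differentiation framework would replace the finite-difference gradient, and a kernel such as the NTK would replace the linear kernel.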

[0038] Through these processes, a synthetic dataset is obtained by distilling the dataset. Using the synthetic dataset generated as described above, the inventors compared the classification accuracy of a support vector machine (SVM) and of a kernel logistic regression (LR) with the classification accuracy of a kernel ridge regression (KRR).

[0039] Figure 6 shows the verification results of the classification accuracy of each model. These results were obtained by using 2,000 images of the "airplane" and "automobile" classes in CIFAR-10 as the dataset and comparing the accuracy of classifying "airplane" versus "automobile". In the comparative verification, an NTK (Neural Tangent Kernel) corresponding to a two-layer MLP with a width of 1024 is used as the kernel function. The initial value of the synthetic dataset is data randomly selected from the dataset in equal numbers for each class. The L2 regularization coefficient is set to "1e-5".
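The class-balanced random initialization mentioned in this paragraph can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name `init_synthetic`, its parameters, and the placeholder data are hypothetical and do not come from the published text.

```python
import random
from collections import defaultdict


def init_synthetic(X, y, n_per_class, seed=0):
    """Pick n_per_class random samples from each class as initial synthetic data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        # Sample without replacement within each class.
        chosen.extend(rng.sample(by_class[label], n_per_class))
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Placeholder data: integers stand in for images.
X = list(range(20))
y = ["airplane"] * 10 + ["automobile"] * 10
X_s, y_s = init_synthetic(X, y, n_per_class=5)
print(len(X_s), y_s.count("airplane"), y_s.count("automobile"))
```

Starting from real samples gives the minimization a feasible starting point with valid labels; the synthetic inputs are then updated by the minimization process.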

[0040] As shown in Figure 6, the comparative verification by the inventors found that, for both the support vector machine (SVM) and the kernel logistic regression (LR), classification accuracy comparable to that of the kernel ridge regression (KRR) can be obtained.

[0041] The verification also found that even when the synthetic dataset is compressed to a size of about 10 by data distillation, the classification accuracy of the kernel model can be maintained in the 80% range, comparable to the case where data distillation is not performed. Note that, when data distillation is not performed, the classification accuracy of the kernel ridge regression (KRR) is "86.7%", that of the support vector machine (SVM) is "87.1%", and that of the kernel logistic regression (LR) is "87.4%".

[0042] In the data distillation system 1, data distillation method, and data distillation program described above, the upper bound evaluation method is used to define the upper bound value of the variation range of the second model parameter βs with respect to the first model parameter βo, whereby the double optimization problem of data distillation is reduced to a single optimization problem. As a result, the synthetic dataset can be generated in a realistic time through minimization processing of the upper bound value.

[0043] Therefore, according to the data distillation system 1, data distillation method, and data distillation program of the present embodiment, data distillation can be performed on a kernel model other than the kernel ridge regression KRR.

[0044] Further, the data distillation system 1 of the present embodiment has the following features.

[0045] (1) The above upper bound evaluation method is a method of defining the upper bound value of the variation range of the second model parameter βs with respect to the first model parameter βo by the duality gap Gs. By thus defining the upper bound value of the variation range of the second model parameter βs with respect to the first model parameter βo by the duality gap Gs, it becomes possible to reduce the double optimization problem of data distillation to a single optimization problem.

[0046] (2) The information processing unit 30 approximates the second model parameter βs by replacing it with the known first model parameter βo in the KKT conditions that arise when the kernel model is trained using the synthetic dataset. By performing such approximation, the computational cost of data distillation can be suppressed.

[0047] (Other Embodiments) Although the representative embodiments of the present disclosure have been described above, the present disclosure is not limited to the above-described embodiments and can be variously modified, for example, as follows.

[0048] In the above-described embodiment, as the upper bound evaluation method, a method of defining the upper bound value of the variation range of the second model parameter βs with respect to the first model parameter βo by the duality gap Gs has been exemplified, but the upper bound evaluation method is not limited thereto. For example, a method called hypersphere bound may be adopted as the upper bound evaluation method.

[0049] As described in the above embodiment, it is desirable that the information processing unit 30 approximates the second model parameter βs with a known first model parameter βo in the KKT condition, but it is not required to do so.

[0050] In the embodiments described above, it goes without saying that the elements constituting the embodiments are not necessarily essential unless explicitly stated to be particularly essential or considered to be fundamentally essential. In the embodiments described above, when numerical values such as the number, quantity, or range of the components of the embodiments are mentioned, the embodiments are not limited to those specific numbers unless explicitly stated to be particularly essential or considered to be fundamentally limited to a specific number. Furthermore, although detailed examples of mathematical formulas are given in the embodiments described above, the embodiments are not limited to those described above, and some parts of the formulas may differ from those described above.

[0051] The control unit and its method of this disclosure may be implemented in a dedicated computer provided by configuring a processor and memory programmed to perform one or more functions embodied by a computer program. The control unit and its method of this disclosure may be implemented in a dedicated computer provided by configuring a processor by one or more dedicated hardware logic circuits. The control unit and its method of this disclosure may be implemented in one or more dedicated computers configured by a combination of a processor and memory programmed to perform one or more functions and a processor configured by one or more hardware logic circuits. The computer program may also be stored as instructions executed by the computer on a computer-readable non-transitory tangible recording medium.

Claims

1. A data distillation system that generates a synthetic dataset from a dataset used for training a kernel model, wherein the system comprises an information acquisition unit (20) and an information processing unit (30), wherein the information acquisition unit acquires the dataset, a first model parameter obtained by training the kernel model using the dataset, and a number of synthetic data indicating the number of synthetic datasets to be generated, and when the model parameter obtained by training the kernel model using the synthetic dataset is defined as a second model parameter, the information processing unit defines an upper bound value for the variation range of the second model parameter with respect to the first model parameter using a predetermined upper bound evaluation method, and generates the synthetic dataset through a process to minimize the upper bound value.

2. The data distillation system according to claim 1, wherein the difference between the evaluation value obtained by applying the first model parameter to the loss function of the synthetic dataset and the evaluation value obtained by applying the second model parameter to the loss function of the synthetic dataset is defined as the dual gap, and the upper bound evaluation method is a method that defines the upper bound value by the dual gap.

3. The data distillation system according to claim 1 or 2, wherein the information processing unit approximates the Karush-Kuhn-Tucker condition that arises when the kernel model is trained using the synthetic dataset by replacing the second model parameter with the first model parameter.

4. A data distillation method for generating, from a dataset used to train a kernel model, a synthetic dataset having fewer data points than the number of data points constituting the dataset, the method comprising: obtaining the dataset, a first model parameter obtained by training the kernel model using the dataset, and a number of synthetic data points indicating the number of synthetic datasets to be generated; and, when the model parameter obtained by training the kernel model using the synthetic dataset is defined as a second model parameter, defining an upper bound value for the variation range of the second model parameter with respect to the first model parameter using a predetermined upper bound evaluation method, and generating the synthetic dataset through a process to minimize the upper bound value.

5. A data distillation program that causes a computer (10) to generate a synthetic dataset from a dataset used for training a kernel model, the synthetic dataset having fewer data points than the number of data points constituting the dataset, the program comprising: an acquisition process to acquire the dataset, a first model parameter obtained by training the kernel model using the dataset, and a number of synthetic data points indicating the number of synthetic datasets to be generated; and a generation process in which, when the model parameter obtained by training the kernel model using the synthetic dataset of the number of synthetic data points is defined as a second model parameter, a predetermined upper bound evaluation method is used to define an upper bound value for the variation range of the second model parameter with respect to the first model parameter, and the synthetic dataset is generated through a process to minimize the upper bound value.