Neural network training method, neural network training architecture, and program
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- WASEDA UNIV
- Filing Date
- 2025-11-27
- Publication Date
- 2026-06-25
Figure JP2025041345_25062026_PF_FP_ABST
Abstract
Claims
1. A method for training a neural network having a plurality of parameters, comprising: a training data acquisition step, in which data to be input to the neural network and a teacher probability distribution associated with the data are acquired; and an output error minimization step, in which, based on the error between the teacher probability distribution and a pseudo-probability distribution that the neural network outputs for the data, a gradient for minimizing the error is calculated so that the minimization can reflect the properties of the teacher probability distribution, and each of the plurality of parameters is updated using the calculated gradient.
2. A method for training a neural network according to claim 1, wherein the teacher probability distribution and the pseudo-probability distribution are probability vectors consisting of probability values for each class.
3. A method for training a neural network according to claim 1 or claim 2, wherein minimization that reflects the properties of the teacher probability distribution means minimizing a loss defined such that each component of the gradient includes at least a coefficient that depends on the teacher probability value of the class corresponding to each component, or a coefficient based on the difference between the teacher probability distribution and the pseudo-probability distribution.
4. A training method according to any one of claims 1 to 3, wherein the output error minimization step involves calculating an alpha divergence based on the error and calculating the gradient such that the alpha divergence is minimized.
5. The training method according to claim 4, wherein the alpha divergence is defined to include the product of the i-th component of the teacher data and the output of the i-th output node (where i is any natural number).
6. A training method according to claim 4 or claim 5, wherein in the output error minimization step, the gradient is calculated such that the alpha divergence is minimized with a parameter α > -1, where α is the parameter that adjusts the alpha divergence.
7. A training method according to any one of claims 4 to 6, wherein the alpha divergence comprises a plurality of alpha divergences, and in the output error minimization step, the gradient is calculated by selecting one of the plurality of alpha divergences according to the values of the plurality of alpha divergences.
8. A training method according to claim 7, wherein in the output error minimization step, the gradient is calculated by selecting the alpha divergence with the largest value from among the plurality of alpha divergences.
9. A training method according to any one of claims 4 to 8, wherein in the training data acquisition step, the teacher probability distribution is configured such that only one component of the teacher vector is one and the other components are zero, and in the output error minimization step, a gradient that reflects the output position corresponding to that single component is calculated.
10. A training method according to any one of claims 4 to 8, wherein in the training data acquisition step, a second matrix different from a first matrix is applied as the teacher probability distribution, where the first matrix is a matrix in which only one component of the teacher vector holds the total probability and the other components are zero, and the second matrix is a matrix in which a portion of the total probability of the single component of the first matrix is distributed to the other components, and in the output error minimization step, a gradient that reflects the output position corresponding to the single component is calculated.
11. A training method according to any one of claims 4 to 8, wherein in the training data acquisition step, a third matrix different from both a first matrix and a second matrix is applied as the teacher probability distribution, where the first matrix is a matrix in which only one component of the teacher vector holds the total probability and the other components are zero, the second matrix is a matrix in which a portion of the total probability of the single component of the first matrix is distributed to the other components, and the third matrix is a matrix obtained by quenching or annealing at least the second matrix, and in the output error minimization step, a gradient that reflects the output position corresponding to the single component is calculated.
12. A training method according to any one of claims 4 to 11, wherein in the output error minimization step, processing by an optimizer is performed before or after calculating the gradient based on the alpha divergence.
13. A neural network training architecture comprising a processor capable of executing a program so that each step of the training method according to any one of claims 1 to 12 is performed.
14. A program for causing at least one computer to perform each step of the neural network training method described in any one of claims 1 to 12.
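
The output error minimization of claims 1 to 6 can be illustrated with a minimal sketch. The claims do not state a formula for the alpha divergence, so this example assumes the standard Amari form, D_α(p‖q) = 4/(1 − α²) · (1 − Σ_i p_i^((1−α)/2) · q_i^((1+α)/2)), restricted to −1 < α < 1, and assumes the pseudo-probability distribution q is a softmax over output logits z. The function names `softmax`, `alpha_divergence`, and `alpha_grad_logits` are hypothetical and are not taken from the application.

```python
import math

def softmax(z):
    """Map logits to a pseudo-probability distribution (assumed output layer)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def alpha_divergence(p, q, alpha):
    """Amari alpha divergence D_alpha(p || q), assumed form, for -1 < alpha < 1.

    Each summand is a product of a power of the i-th teacher probability
    and a power of the i-th output value.
    """
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    s = sum((pi ** a) * (qi ** b) for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def alpha_grad_logits(p, z, alpha):
    """Gradient of D_alpha(p || softmax(z)) with respect to the logits z.

    Component k is 2/(1-alpha) * (q_k * S - p_k**a * q_k**b), so it carries
    a coefficient that depends on the teacher probability of class k.
    """
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    q = softmax(z)
    t = [(pi ** a) * (qi ** b) for pi, qi in zip(p, q)]
    s = sum(t)
    scale = 2.0 / (1.0 - alpha)
    return [scale * (qk * s - tk) for qk, tk in zip(q, t)]

# Plain gradient descent on the logits of a single sample (illustrative values).
p = [0.7, 0.2, 0.1]      # teacher probability distribution
z = [0.0, 0.0, 0.0]      # initial logits, i.e. uniform output
for _ in range(200):
    g = alpha_grad_logits(p, z, alpha=0.5)
    z = [zk - 0.5 * gk for zk, gk in zip(z, g)]
q_final = softmax(z)     # approaches p as the divergence shrinks
```

Under this assumed form, letting α approach −1 reduces the gradient to q − p, the familiar softmax cross-entropy gradient, so α can be read as adjusting how strongly the teacher probabilities weight each component. In a full network the logit gradient would be backpropagated to the parameters rather than applied to z directly.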
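
The selection described in claims 7 and 8 can be sketched as follows: several alpha divergences are evaluated for the same teacher and output distributions, and the gradient is taken from the one with the largest value. As an assumption not stated in the claims, the sketch uses the Amari form of the alpha divergence with −1 < α < 1 and a softmax output; the candidate α values, the learning rate, and all function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def alpha_divergence(p, q, alpha):
    # Assumed Amari form; valid here for -1 < alpha < 1.
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    s = sum((pi ** a) * (qi ** b) for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def alpha_grad_logits(p, q, alpha):
    # Gradient of the divergence with respect to the logits behind q.
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    t = [(pi ** a) * (qi ** b) for pi, qi in zip(p, q)]
    s = sum(t)
    return [2.0 / (1.0 - alpha) * (qk * s - tk) for qk, tk in zip(q, t)]

def select_largest(p, q, alphas):
    """Return (alpha, value) of the candidate divergence with the largest value."""
    values = [alpha_divergence(p, q, al) for al in alphas]
    best = max(range(len(alphas)), key=lambda i: values[i])
    return alphas[best], values[best]

def training_step(p, z, alphas, lr):
    """One update of the logits using the gradient of the selected divergence."""
    q = softmax(z)
    alpha, value = select_largest(p, q, alphas)
    g = alpha_grad_logits(p, q, alpha)
    return [zk - lr * gk for zk, gk in zip(z, g)], alpha, value

# Illustrative run on one sample.
p = [0.7, 0.2, 0.1]
alphas = [-0.5, 0.0, 0.5]
z = [0.0, 0.0, 0.0]
history = []
for _ in range(300):
    z, chosen_alpha, chosen_value = training_step(p, z, alphas, lr=0.5)
    history.append((chosen_alpha, chosen_value))
```

Taking the largest divergence at each step amounts to descending on the worst case among the candidates. Every candidate is zero only when the output equals the teacher distribution, so they share the same minimizer.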
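
The teacher distributions of claims 9 to 11 can be illustrated for a single sample, that is, for one row of each matrix. The first matrix corresponds to a one-hot vector. For the second matrix, the sketch assumes the redistributed portion is spread uniformly over the other components, which the claims do not require. The claims do not define "quenched or annealed"; purely as a hypothetical reading, the sketch models it as temperature rescaling, where a temperature below 1 sharpens the distribution and a temperature above 1 flattens it. All function names and numeric values are illustrative.

```python
def one_hot(num_classes, target):
    """First matrix row: the total probability sits on a single component."""
    return [1.0 if k == target else 0.0 for k in range(num_classes)]

def redistribute(first, epsilon):
    """Second matrix row: move a share epsilon of the single component's
    probability to the other components (uniform split is an assumption)."""
    k = len(first)
    target = max(range(k), key=lambda i: first[i])
    share = epsilon * first[target]
    return [
        first[i] - share if i == target else first[i] + share / (k - 1)
        for i in range(k)
    ]

def rescale(dist, temperature):
    """Third matrix row under a hypothetical reading of quenching/annealing:
    raise each probability to the power 1/temperature and renormalize."""
    powered = [v ** (1.0 / temperature) for v in dist]
    total = sum(powered)
    return [v / total for v in powered]

first = one_hot(4, target=2)
second = redistribute(first, epsilon=0.1)
third_sharp = rescale(second, temperature=0.5)   # sharper than `second`
third_flat = rescale(second, temperature=2.0)    # flatter than `second`
```

In all three cases the largest component stays at the position of the original single component, so a gradient whose components are weighted by the teacher probabilities still singles out that output position.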