Learning device and learning method

The learning device stabilizes Early Fusion by pre-learning and averaging model parameters across different input data types, addressing accuracy loss due to size mismatches and enhancing inference accuracy.

JP2026100129A · Pending · Publication Date: 2026-06-19 · NEC CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
NEC CORP
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Early fusion in multimodal neural network inference requires ensuring that multiple input data are the same size, leading to information loss and decreased accuracy when data sizes differ, especially in dimensions such as channels, height, and width.

Method used

A learning device and method that includes pre-learning to determine initial parameter values by averaging parameters of models trained on different input data types, using model averaging to stabilize the fusion process, ensuring consistent initial values for subsequent training.

Benefits of technology

This approach suppresses the decrease in inference accuracy by stabilizing the training process, enabling accurate Early Fusion even when input data sizes vary.

Abstract

[Problem] To suppress a decrease in inference accuracy even when Early Fusion is used. [Solution] The learning device includes a learning means that learns a model corresponding to each of multiple types of input data, and a model averaging means that calculates the average value of the parameters of the learned models.

Description

Technical Field

[0001] The present disclosure relates to a learning device and a learning method related to multimodal machine learning.

Background Art

[0002] One method of neural network inference is multimodal inference, which handles multiple types of input data simultaneously. With multimodal inference, the inference accuracy can be improved by processing the multiple input data in an integrated manner.

[0003] As typical methods related to the integration of input data, there are Early fusion and Late fusion. Early fusion is a method of combining multiple input data before the inference by a neural network is executed.

Prior Art Documents

Patent Documents

[0004]

Patent Document 1

Non-Patent Documents

[0005]

Non-Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0006] Late fusion, on the other hand, integrates the data after neural network inference has been performed. Late fusion offers higher accuracy but is computationally intensive; using Early fusion reduces the computational complexity compared to Late fusion.

[0007] When using Early Fusion, it is necessary to ensure that the multiple input data are the same size. The size of input data can be represented by its number of channels, height, and width. Specifically, ensuring that multiple input data are the same size means that at least two of the three dimensions (channels, height, and width) are the same for every input data. Hereafter, channels, height, and width will be referred to as "dimensions."
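The size condition in paragraph [0007] can be illustrated with a short check. The following is a minimal sketch in Python and is not part of the disclosure: the function name `fusion_axis` and the choice to return the channel axis when all three dimensions already match are assumptions made for illustration.

```python
from typing import Optional, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def fusion_axis(a: Shape, b: Shape) -> Optional[int]:
    """Return the axis along which two inputs can be joined, or None.

    Joining is possible when at least two of the three dimensions match;
    the inputs are then joined along the remaining dimension.
    """
    differing = [axis for axis in range(3) if a[axis] != b[axis]]
    if not differing:
        return 0  # assumption: identical sizes are joined along channels
    if len(differing) == 1:
        return differing[0]
    return None  # two or more dimensions differ: resizing is needed first


print(fusion_axis((3, 256, 256), (1, 256, 256)))  # only the channels differ
print(fusion_axis((3, 256, 256), (1, 128, 128)))  # height and width differ too
```

The first call returns 0 (the channel axis), so the two inputs can be joined as they are. The second returns `None`, which corresponds to the case where the input data would have to be enlarged or reduced before fusion.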

[0008] For example, when given multiple input data points that differ in both height and width, or in either of these, it may be necessary to enlarge or reduce the input data to make the sizes of the multiple input data points the same.

[0009] This can lead to information loss and a decrease in inference accuracy when using Early Fusion.

[0010] Furthermore, Patent Document 1 describes the merging of input data in the field of machine learning. Patent Document 1 also describes a method of applying a predetermined transformation process (preprocessing) to input data and then inputting the preprocessed data into a learning model. Non-Patent Document 1 proposes multimodal approaches, including Early fusion and Late fusion, as well as Joint fusion and Common Space fusion.

[0011] The present invention aims to provide a learning device, a learning method, and a learning program that can suppress a decrease in inference accuracy even when using Early Fusion.

Means for Solving the Problem

[0012] The learning device according to the present disclosure includes learning means for learning a model corresponding to each of a plurality of types of input data, and model averaging means for calculating an average value of parameters of the learned model.

[0013] The learning method according to the present disclosure learns a model corresponding to each of a plurality of types of input data, and calculates an average value of parameters of the learned model.

[0014] The learning program according to the present disclosure causes a computer to learn a model corresponding to each of a plurality of types of input data, and calculate an average value of parameters of the learned model.

Advantages of the Invention

[0015] According to the present invention, even when Early fusion is used, a decrease in inference accuracy can be suppressed.

Brief Description of the Drawings

[0016]
[Figure 1] A block diagram showing an example of the configuration of the learning device.
[Figure 2] An explanatory diagram showing an example of input data.
[Figure 3] An explanatory diagram for explaining pre-learning.
[Figure 4] A schematic diagram for explaining the processing of the layer weight combining unit in the model averaging unit.
[Figure 5] An explanatory diagram for explaining the learning process using the learned model after pre-learning.
[Figure 6] A flowchart showing the operation of the learning device.
[Figure 7] A block diagram showing a configuration example of the information processing system.
[Figure 8] A block diagram showing the main part of the learning device.

Embodiments for Carrying Out the Invention

[0017] Hereinafter, embodiments will be described with reference to the drawings.

[0018] FIG. 1 is a block diagram showing an example of the configuration of a learning device. The learning device 100 shown in FIG. 1 includes an initialization unit 110, a data merging unit 101, and a model learning unit 102.

[0019] The initialization unit 110 includes a model holding unit 120 and a model averaging unit 130. The model averaging unit 130 includes a layer weight combining unit 131 and a layer weight averaging unit 132.

[0020] The model holding unit 120 in the initialization unit 110 is a storage unit (memory) that holds the learned model. The layer weight combining unit 131 in the model averaging unit 130 reads out all the models held in the model holding unit 120 and combines the parameters of a predetermined layer of each model. The layer weight averaging unit 132 in the model averaging unit 130 calculates the average value of the parameters over all the models for each layer in the model.

[0021] In this embodiment, before the learning of the model is performed, the learning device 100 performs pre-learning to determine the initial values of the parameters of the model. The parameters are mainly weights. Note that the learning executed using the initial values of the parameters determined by the pre-learning is referred to as the main learning or learning process.

[0022] In the pre-learning, in the learning device 100, for each of a plurality of types of input data, learning of the corresponding model is performed. Specifically, the model learning unit 102 performs pre-learning and main learning. Assume that there are two types of input data. During pre-learning, the model learning unit 102 performs learning on one of the input data and stores the learned model (corresponding to model A described later) in the model holding unit 120. Next, learning is performed on the other input data, and the learned model (corresponding to model B described later) is stored in the model holding unit 120.

[0023] The learning device 100 then calculates the average value of the parameters across the learned models. The learning device 100 uses this average value as the initial value for the main learning.

[0024] In pre-training, the structure of the model trained in each modal is the same as the structure of the model used in the main training. However, if the number of channels in the input data used in pre-training differs from the number of channels in the input data used in the main training, then in pre-training, a model with a different structure for only the first layer will be used. For example, if the first layer is a convolutional layer, the number of input channels in the convolutional layer will be matched to the number of channels in the input data.

[0025] The following explains pre-training using a concrete example.

[0026] Figure 2 is an explanatory diagram showing an example of input data. Figure 2 illustrates two input data (input data A and input data B). Note that the number of input data is not limited to two; three or more types of input data may be input to the learning device 100.

[0027] In the following, we will use color image data (hereinafter referred to as "color image") as input data A and monochrome image data (hereinafter referred to as "monochrome image") as input data B as an example. That is, input data A and input data B are both images, but their formats are different. However, input data A and input data B may be in the same format (in this example, either a color image or a monochrome image). Furthermore, the input data to the learning device 100 is not limited to image data. For example, the input data may be audio data, text data, wireless signals, etc.

[0028] The number of channels, height, and width of the input data are represented as [number of channels, height, width]. Each type of input data is also referred to as a modal.

[0029] In the example shown in Figure 2, the number of channels, height, and width of input data A are [3,256,256]. The number of channels, height, and width of input data B are [1,256,256].

[0030] Figure 3 is an explanatory diagram for explaining pre-training. In Figure 3, Model A is shown as an example of a model corresponding to Modal A (Input Data A). Model B is shown as an example of a model corresponding to Modal B (Input Data B).

[0031] The structure of the model trained for each modal is the same as the structure of the model used in the main training. However, if the number of channels in the modal used in the main training differs from the number of channels in modals A and B, then models A and B differ from the model used in the main training in the structure of the first layer only.

[0032] Taking a convolutional neural network (CNN) as an example model, if the first layer is a convolutional layer, the number of input channels in the convolutional layer should be matched to the number of channels in the input data. In the case of modals A and B shown in Figure 2, modal A has 3 input channels, while modal B has 1 input channel.
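As an illustration of the point that only the first layer depends on the modal, the sketch below lists kernel shapes for a small CNN. It is a hedged example, not the disclosed model: the function name `cnn_kernel_shapes` and the second and third layers are hypothetical, while the first-layer values (16 output channels, 3 x 3 kernels) follow the Figure 4 example.

```python
from typing import List, Tuple

# (output channels, input channels, kernel height, kernel width)
KernelShape = Tuple[int, int, int, int]


def cnn_kernel_shapes(input_channels: int) -> List[KernelShape]:
    """Kernel shapes of a small CNN whose first layer follows the input data."""
    return [
        (16, input_channels, 3, 3),  # first layer: input channels match the modal
        (32, 16, 3, 3),              # hypothetical second layer, same for every modal
        (64, 32, 3, 3),              # hypothetical third layer, same for every modal
    ]


shapes_a = cnn_kernel_shapes(3)  # modal A: color image
shapes_b = cnn_kernel_shapes(1)  # modal B: monochrome image

print(shapes_a[0], shapes_b[0])       # first layers differ in input channels
print(shapes_a[1:] == shapes_b[1:])   # remaining layers have identical shapes
```

Because every layer after the first has the same shape in both models, their parameters can be averaged element by element.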

[0033] The learning device 100 calculates the average value of the parameters of model A and the corresponding parameters of model B. If there are multiple parameters, the learning device 100 calculates the average value for each parameter. As described above, the average value is used as the initial value in the learning process. In the learning device 100, each model that has completed pre-training is stored in the model holding unit 120.
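The per-parameter averaging described in paragraph [0033] might look like the following. This is a minimal sketch using plain nested lists; the helper names `average_tensors` and `average_models`, the layer names, and the numeric values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Any, Dict, List


def average_tensors(tensors: List[Any]) -> Any:
    """Element-wise mean of equally shaped nested lists of floats."""
    first = tensors[0]
    if isinstance(first, list):
        return [
            average_tensors([t[i] for t in tensors])
            for i in range(len(first))
        ]
    return sum(tensors) / len(tensors)


def average_models(models: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Average each named parameter over all models."""
    return {
        name: average_tensors([m[name] for m in models])
        for name in models[0]
    }


# Hypothetical parameters of two pre-trained models (second layer onward).
model_a = {"layer2": [[1.0, 2.0], [3.0, 4.0]], "layer3": [1.0, 3.0]}
model_b = {"layer2": [[3.0, 0.0], [1.0, 2.0]], "layer3": [3.0, 5.0]}

initial_values = average_models([model_a, model_b])
print(initial_values)
```

The same function accepts three or more models, in line with the statement that the number of input data types is not limited to two.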

[0034] Figure 4 is a schematic diagram illustrating the processing of the layer weight combining unit 131 in the model averaging unit 130. In Figure 4, the parameters of the first convolutional layer are schematically shown as cubes corresponding to each output channel. Figure 4 illustrates the number of output channels (Output ch), number of input channels (Input ch), kernel height (number of rows), and kernel width (number of columns) of the first convolutional layer for modal A (input data A) and modal B (input data B). In the example shown in Figure 4, the number of output channels, number of input channels, kernel height, and kernel width of the first layer for modal A are (16,3,3,3). Those of the first layer for modal B are (16,1,3,3).

[0035] The number of input channels of each set of parameters corresponds to the number of channels in the respective input data. In the example shown in Figure 4, the number of input channels is 3 for modal A and 1 for modal B. The layer weight combining unit 131 in the model averaging unit 130 combines the parameters of modal A and modal B along the dimension of the input channels. After the combination, the number of output channels, the number of input channels, the kernel height, and the kernel width are (16, 4, 3, 3); that is, the number of input channels in the convolutional layer becomes 4.
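The first-layer combination in paragraph [0035] can be sketched as a concatenation along the input-channel axis. The code below is an illustrative assumption rather than the disclosed implementation: the helper names (`make_kernel`, `shape_of`, `combine_input_channels`) and the constant kernel values are hypothetical, and the sketch checks only the number of output channels, not the kernel height and width.

```python
from typing import Any, List, Tuple


def make_kernel(out_ch: int, in_ch: int, kh: int, kw: int, value: float) -> List[Any]:
    """Build a constant kernel tensor of shape (out_ch, in_ch, kh, kw)."""
    return [
        [[[value] * kw for _ in range(kh)] for _ in range(in_ch)]
        for _ in range(out_ch)
    ]


def shape_of(tensor: Any) -> Tuple[int, ...]:
    """Shape of a rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def combine_input_channels(w_a: List[Any], w_b: List[Any]) -> List[Any]:
    """Join two first-layer kernels along the input-channel axis."""
    if len(w_a) != len(w_b):
        raise ValueError("both kernels need the same number of output channels")
    # For each output channel, append model B's input-channel slices to model A's.
    return [filters_a + filters_b for filters_a, filters_b in zip(w_a, w_b)]


w_a = make_kernel(16, 3, 3, 3, value=0.1)  # first layer of model A
w_b = make_kernel(16, 1, 3, 3, value=0.2)  # first layer of model B
combined = combine_input_channels(w_a, w_b)
print(shape_of(combined))
```

The printed shape is (16, 4, 3, 3): input channels 0 to 2 of each filter come from model A and input channel 3 comes from model B, matching the order in which the input data are later joined.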

[0036] In the model averaging unit 130, the layer weight averaging unit 132 calculates the average value of the parameters for each layer from the second layer onward.

[0037] Figure 5 is an explanatory diagram illustrating the learning process (main learning) using a pre-trained model. Figure 5 shows multiple input data (input data A and input data B) as examples. The number of channels, height, and width of input data A are [3,256,256]. The number of channels, height, and width of input data B are [1,256,256].

[0038] In the main learning, the data merging unit 101 combines multiple input data (for example, input data A and input data B). The data merging unit 101 combines input data A and input data B, for example, in the channel direction. Therefore, the number of channels, height, and width of the single input data after merging are [4,256,256].
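The channel-direction merging of paragraph [0038] can be sketched as follows. This is a hedged example with hypothetical helper names (`make_image`, `shape_of`, `merge_channels`) and constant pixel values; it assumes each input is a nested list indexed as [channel][row][column].

```python
from typing import Any, List, Tuple


def make_image(channels: int, height: int, width: int, fill: float = 0.0) -> List[Any]:
    """Build a constant image of shape [channels, height, width]."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]


def shape_of(image: Any) -> Tuple[int, ...]:
    """Shape of a rectangular nested list."""
    dims = []
    while isinstance(image, list):
        dims.append(len(image))
        image = image[0]
    return tuple(dims)


def merge_channels(*images: List[Any]) -> List[Any]:
    """Join images along the channel axis; height and width must already match."""
    height_width = shape_of(images[0])[1:]
    for image in images:
        if shape_of(image)[1:] != height_width:
            raise ValueError("height and width must match before merging")
    merged: List[Any] = []
    for image in images:
        merged.extend(image)  # append this image's channels after the previous ones
    return merged


input_a = make_image(3, 256, 256, fill=0.5)  # color image
input_b = make_image(1, 256, 256, fill=0.9)  # monochrome image
merged = merge_channels(input_a, input_b)
print(shape_of(merged))
```

The merged data has shape (4, 256, 256), which is the number of input channels expected by the combined first layer.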

[0039] The model learning unit 102 reads the initial values, i.e., the average values for each input channel of each layer, from the model holding unit 120. The model learning unit 102 uses the read average values as the parameters for each layer. Subsequently, the model is trained using a single input data formed by combining the multiple input data (for example, input data A and input data B) that are input sequentially.

[0040] Next, the operation of the learning device 100 will be explained with reference to the flowchart in Figure 6. The processes in steps S101 to S105 are processes in pre-learning. The processes in steps S106 to S107 are processes in main learning.

[0041] Preprocessing is preferably performed before pre-learning. Preprocessing involves normalization, resizing, clipping, inversion, and other operations on the input data. In addition, preprocessing ensures that the data of all modals are the same size in the dimensions other than the joining direction (e.g., the same height and width when the data are joined along the channels).
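Two of the preprocessing operations named in paragraph [0041], resizing and normalization, can be sketched as below. The disclosure does not specify how they are performed, so nearest-neighbor resizing and division by 255 are assumptions made for this example, as are the function names.

```python
from typing import List

Channel = List[List[float]]  # one channel, indexed as [row][column]


def resize_nearest(channel: Channel, out_h: int, out_w: int) -> Channel:
    """Resize one channel by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(channel: Channel, max_value: float = 255.0) -> Channel:
    """Scale pixel values from [0, max_value] to [0, 1] (assumed range)."""
    return [[v / max_value for v in row] for row in channel]


def preprocess(image: List[Channel], out_h: int, out_w: int) -> List[Channel]:
    """Bring every channel of an image to a common height and width."""
    return [normalize(resize_nearest(ch, out_h, out_w)) for ch in image]


small = [[[0.0, 255.0], [255.0, 0.0]]]  # one channel, 2 x 2
resized = preprocess(small, 4, 4)
print(len(resized), len(resized[0]), len(resized[0][0]))
```

After this step the example image has one channel of 4 x 4 pixels, so it can be joined along the channel direction with any other 4 x 4 input.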

[0042] Although not explicitly shown in Figure 1, the learning device 100 may also include a preprocessing unit that performs the preprocessing described above.

[0043] In the pre-learning, the model learning unit 102 first initializes the model parameters using random numbers (step S101).

[0044] The model learning unit 102 learns a model using the input data of one modal (step S102). Then, the model learning unit 102 stores the learned model in the model holding unit 120 (step S103). After steps S102 and S103 have been performed for all modals (step S104), the process proceeds to step S105.

[0045] In step S105, the model averaging unit 130 reads all models from the model holding unit 120.

[0046] The model averaging unit 130 then combines or averages the parameters for each layer across all models. Specifically, the layer weight combining unit 131 combines the parameters for the first layer of the model (see Figure 4). For each layer from the second layer onward, the layer weight averaging unit 132 calculates the average value of the parameters corresponding to each input data (the parameters in each modal). Then, the model averaging unit 130 outputs a single model with the average value set as the parameters to the model holding unit 120 (step S105). The model holding unit 120 holds the model.

[0047] In the main learning, the model learning unit 102 reads the model from the model holding unit 120. Then, the parameters of this model are used as the initial values for the parameters of the model to be used in the main learning (step S106).

[0048] Subsequently, the model learning unit 102 performs the main learning (learning process) (step S107).
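The flow of steps S101 to S107 can be traced end to end on a deliberately tiny model. The sketch below is an assumption-laden toy, not the disclosed device: the two-layer scalar model, the SGD routine, the synthetic data, and all names are hypothetical. Each sample has one value per channel instead of a full image, so that the control flow stays visible.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[List[float], float]  # (one value per input channel, target)


def init_model(in_ch: int, rng: random.Random) -> Dict:
    """Step S101: initialize the parameters with random numbers."""
    return {
        "first": [rng.uniform(-0.5, 0.5) for _ in range(in_ch)],  # first layer
        "second": rng.uniform(0.5, 1.0),                          # second layer
    }


def train(model: Dict, samples: List[Sample], lr: float = 0.01, epochs: int = 50) -> Dict:
    """Plain SGD on squared error for the toy model second * (first . x)."""
    for _ in range(epochs):
        for x, y in samples:
            hidden = sum(w * xi for w, xi in zip(model["first"], x))
            error = model["second"] * hidden - y
            grad_first = [error * model["second"] * xi for xi in x]
            model["second"] -= lr * error * hidden
            model["first"] = [w - lr * g for w, g in zip(model["first"], grad_first)]
    return model


rng = random.Random(0)
data_a = [[rng.random() for _ in range(3)] for _ in range(20)]  # modal A, 3 channels
data_b = [[rng.random()] for _ in range(20)]                    # modal B, 1 channel
targets = [sum(xa) + xb[0] for xa, xb in zip(data_a, data_b)]   # synthetic labels

# Pre-learning (steps S101 to S104): one model per modal, kept in a "holding unit".
held = [
    train(init_model(3, rng), list(zip(data_a, targets))),
    train(init_model(1, rng), list(zip(data_b, targets))),
]

# Step S105: combine the first layer, average the second layer.
initial = {
    "first": held[0]["first"] + held[1]["first"],
    "second": sum(m["second"] for m in held) / len(held),
}

# Steps S106 and S107: main learning on channel-wise merged data.
merged = [(xa + xb, y) for xa, xb, y in zip(data_a, data_b, targets)]
final = train({"first": list(initial["first"]), "second": initial["second"]}, merged)
print(len(final["first"]))
```

The main learning starts from a model with four first-layer weights (three from model A, one from model B) and an averaged second layer, rather than from random numbers.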

[0049] As described above, when this learning is performed, the data merging unit 101 supplies the model learning unit 102 with a single input data generated by merging multiple input data (for example, input data A and input data B). Once learning is complete, the model learning unit 102 can provide the trained model as a model for actual operation.

[0050] Generally, in machine learning, the results differ with each training session. Therefore, a stable model can be obtained by creating multiple models under the same conditions and taking the average of their parameters. In this embodiment, each model is trained beforehand using the input data for each modal, the average of the parameters for each model is taken, and this average value is used as the initial parameter value for the main training. Then, in the main training, training is performed using input data that combines multiple modals, enabling stable training. As a result, even when using Early Fusion, the decrease in inference accuracy can be suppressed.

[0051] Therefore, the learning device 100 of this embodiment is expected to have the effect of improving the accuracy of models that perform inference in machine learning applications using Early Fusion, for example.

[0052] Furthermore, while each of the above embodiments can be implemented using hardware, they can also be realized using a computer equipped with a processor such as a CPU (Central Processing Unit) and memory.

[0053] For example, a program for performing the method (processing) in the above embodiment may be stored in a storage device (storage medium), and each function may be realized by executing the program stored in the storage device using the CPU.

[0054] Figure 7 is a block diagram showing an example of a computer having a CPU. The computer is implemented in the learning device 100. The CPU 1001 realizes each of the functions in the above embodiment by executing processing according to the program (software element: code) stored in the storage medium 1003. That is, it realizes the functions of the data merging unit 101, the model learning unit 102, and the model averaging unit 130 in the learning device 100 shown in Figure 1.

[0055] Multiple processors (computers) can work together to realize the functions of the learning device 100. Alternatively, a CPU and a GPU (Graphics Processing Unit) can work together to realize the functions of the learning device 100.

[0056] The storage medium 1003 is, for example, a non-transitory computer-readable medium. Non-transitory computer-readable media include various types of tangible storage media. Specific examples of non-transitory computer-readable media include magnetic recording media (e.g., hard disks), magneto-optical recording media (e.g., magneto-optical disks), CD-ROMs (Compact Disc-Read Only Memory), CD-Rs (Compact Disc-Recordable), CD-R/Ws (Compact Disc-ReWritable), and semiconductor memories (e.g., mask ROMs, PROMs (Programmable ROMs), EPROMs (Erasable PROMs), flash ROMs).

[0057] Furthermore, the program may be stored in various types of transitory computer-readable media. A transitory computer-readable medium may be supplied with the program, for example, via a wired or wireless communication channel, i.e., via electrical signals, optical signals, or electromagnetic waves.

[0058] Memory 1002 is implemented, for example, as RAM (Random Access Memory) and is a storage means that temporarily stores data when the CPU 1001 executes processing. It is also conceivable that a program held by the storage medium 1003 or a transitory computer-readable medium is transferred to memory 1002, and the CPU 1001 executes processing based on the program in memory 1002. Note that the storage medium 1003 and memory 1002 may be a single unit.

[0059] Furthermore, the model holding unit 120 can be implemented using memory 1002 or storage medium 1003.

[0060] Figure 8 is a block diagram showing the main components of the learning device. The learning device 10 shown in Figure 8 includes a learning means (implemented in the model learning unit 102 in this embodiment) that learns a model corresponding to each of the multiple types of input data, and a model averaging means (implemented in the model averaging unit 130 in this embodiment) that calculates the average value of the parameters of the learned model.

[0061] Some or all of the above embodiments may also be described as follows, but are not limited to the following:

[0062] (Appendix 1) A learning device comprising: learning means for training a model corresponding to each of multiple types of input data; and model averaging means for calculating the average value of the parameters of the trained models.

[0063] (Appendix 2) The learning device described in Appendix 1, further comprising model holding means for holding the trained models, wherein the model averaging means calculates the average value of the parameters of the multiple models held in the model holding means.

[0064] (Appendix 3) The learning device described in Appendix 1 or Appendix 2, wherein the model averaging means calculates, for each layer of a model having multiple layers, the average value of the parameters corresponding to each input data across all of the multiple models.

[0065] (Appendix 4) The learning device described in Appendix 3, wherein the model averaging means combines the parameters corresponding to each input data for the first layer of the model, and calculates the average value of the parameters for each layer from the second layer onward.

[0066] (Appendix 5) The learning device described in any one of Appendices 1 to 4, further comprising data merging means for combining multiple input data into a single data (implemented in the data merging unit 101 in this embodiment), wherein the learning means retrains the model with the combined single data, using the average value of the parameters as the initial value.

[0067] (Appendix 6) A learning method comprising: training, for each of multiple types of input data, a model corresponding to that type of input data; and calculating the average value of the parameters of the trained models.

[0068] (Appendix 7) The learning method described in Appendix 6, comprising calculating the average value of the parameters of the multiple models held in model holding means that holds the trained models.

[0069] (Appendix 8) The learning method described in Appendix 6 or Appendix 7, comprising calculating, for each layer of a model having multiple layers, the average value of the parameters corresponding to each input data across all of the multiple models.

[0070] (Appendix 9) The learning method described in Appendix 8, comprising combining, for the first layer of the model, the parameters corresponding to each input data, and calculating the average value of the parameters for each layer from the second layer onward.

[0071] (Appendix 10) The learning method described in any one of Appendices 6 to 9, comprising combining multiple input data into a single data, and retraining the model using the combined single data, with the average value of the parameters as the initial value.

[0072] (Appendix 11) A learning program that causes a computer to: train, for each of multiple types of input data, a model corresponding to that type of input data; and calculate the average value of the parameters of the trained models.

[0073] (Appendix 12) The learning program described in Appendix 11, which causes the computer to calculate the average value of the parameters of the multiple models held in model holding means that holds the trained models.

[0074] (Appendix 13) The learning program described in Appendix 11 or Appendix 12, which causes the computer to calculate, for each layer of a model having multiple layers, the average value of the parameters corresponding to each input data across all of the multiple models.

[0075] (Appendix 14) The learning program described in Appendix 13, which causes the computer to combine, for the first layer of the model, the parameters corresponding to each input data, and to calculate the average value of the parameters for each layer from the second layer onward.

[0076] (Appendix 15) The learning program described in any one of Appendices 11 to 14, which causes the computer to combine multiple input data into a single data, and to retrain the model using the combined single data, with the average value of the parameters as the initial value.

[0077] Some or all of the configurations described in Appendices 2 to 5, which are directly or indirectly dependent on Appendix 1 above, may be applied to various hardware, software, various recording means for recording software, or systems, provided that they do not deviate from the embodiments described above.

Explanation of Symbols

[0078]
10 Learning device
11 Data merging means
12 Model holding means
13 Model averaging means
100 Learning device
101 Data merging unit
102 Model learning unit
110 Initialization unit
120 Model holding unit
130 Model averaging unit
131 Layer weight combining unit
132 Layer weight averaging unit
1001 CPU
1002 Memory
1003 Storage medium

Claims

1. A learning device comprising: learning means for training a model corresponding to each of multiple types of input data; and model averaging means for calculating the average value of the parameters of the trained models.

2. The learning device according to claim 1, further comprising model holding means for holding the trained models, wherein the model averaging means calculates the average value of the parameters of the multiple models held in the model holding means.

3. The learning device according to claim 1, wherein the model averaging means calculates, for each layer of a model having multiple layers, the average value of the parameters corresponding to each input data across all of the multiple models.

4. The learning device according to claim 3, wherein the model averaging means combines the parameters corresponding to each input data for the first layer of the model, and calculates the average value of the parameters for each layer from the second layer onward.

5. The learning device according to any one of claims 1 to 4, further comprising data merging means for combining multiple input data into a single data, wherein the learning means retrains the model using the combined single data, with the average value of the parameters as the initial value.

6. A learning method comprising: training, for each of multiple types of input data, a model corresponding to that type of input data; and calculating the average value of the parameters of the trained models.

7. The learning method according to claim 6, further comprising: combining multiple input data into a single data; and retraining the model using the combined single data, with the average value of the parameters as the initial value.

8. A learning program that causes a computer to: train, for each of multiple types of input data, a model corresponding to that type of input data; and calculate the average value of the parameters of the trained models.

9. The learning program according to claim 8, which further causes the computer to: combine multiple input data into a single data; and retrain the model using the combined single data, with the average value of the parameters as the initial value.