Learning methods and electronic devices for multimodal models

The learning method for multimodal models focuses on local data using a training dataset with true and false responses and masking techniques to enhance model accuracy and efficiency.

JP2026110449A · Pending · Publication Date: 2026-07-02 · SIONIC AI INC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SIONIC AI INC
Filing Date: 2025-04-28
Publication Date: 2026-07-02

Application Information

IPC: G06N3/096; G06F18/25; G06N3/0895

AI Tagging

Technology Topics

Data set Learning methods

AI Technical Summary

Technical Problem

Multimodal models tend to learn from global data and overlook detailed local information. Retraining them requires extensive training data and resources, which makes it difficult to implement specific features.

Method used

A learning method that focuses on local data using a training dataset with true and false response data, employing masking techniques and loss functions to generate a new model optimized for specific functions.

Benefits of technology

Prevents overfitting and enhances model learning by focusing on specific data, resulting in a more accurate and efficient multimodal model.

Abstract

This disclosure relates to a method and electronic apparatus for learning a multimodal model, which is performed by at least one processor. [Solution] A method for learning a multimodal model may include the steps of acquiring an existing model which is a pre-trained multimodal model, acquiring a training dataset for training the multimodal model which includes true response data and false response data, and generating a new model by training the existing model based on the training dataset.

Description

[Technical Field]

[0001] This disclosure relates to a learning method and electronic device for multimodal models.

[Background Art]

[0002] In recent years, in the field of natural language processing, techniques have been developed to optimize the performance of large language models (LLMs) according to user requirements. Furthermore, as LLMs have become capable of processing multimodal data, there is a growing demand for training them to perform tasks such as extracting specific information from specific images.

[0003] However, a problem with multimodal models is that they are likely to learn only from global data (information encompassing the overall context and framework of the data, e.g., the composition of an entire image, or the style and theme of an entire text), and are likely to overlook detailed information about local data (specific locations and details within text or images, e.g., a specific object in an image, a specific word in text).

[0004] Furthermore, even if a multimodal model with these problems is retrained, it will require a massive amount of training data and resources. Moreover, even with this vast amount of training data, it will still only learn from global data, making it difficult to implement the features desired by the learner.

[0005] Therefore, there is a need to develop technologies that generate new models optimized for specific functions from existing multimodal models.

[Prior Art Documents]

[Patent Documents]

[0006] [Patent Document 1] Korean Published Patent Publication No. 10-2024-0030307

[Summary of the Invention]

[Problems to Be Solved by the Invention]

[0007] This disclosure provides a learning method and electronic device for a multimodal model to solve the problems described above.

[Means for Solving the Problem]

[0008] This disclosure can be implemented in a variety of forms, including methods, apparatus (systems), and / or computer programs.

[0009] According to one embodiment of the present disclosure, a method for learning a multimodal model, performed by at least one processor, may include the steps of: acquiring an existing model which is a pre-trained multimodal model; acquiring a training dataset for training the multimodal model, wherein the training dataset includes true response data and false response data; and generating a new model by training the existing model based on the training dataset.

[0010] A computer program may be provided for executing a multimodal model learning method according to one embodiment of the present disclosure on a computer.

[0011] According to one embodiment of the present disclosure, the electronic device includes a memory and at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory, wherein the at least one program includes instructions for: acquiring an existing model which is a pre-trained multimodal model; acquiring a training dataset for training the multimodal model, wherein the training dataset includes true response data and false response data; and generating a new model by training the existing model based on the training dataset.

[Effects of the Invention]

[0012] According to some embodiments of the present disclosure, by performing additional learning on an existing multimodal model using local data instead of global data to generate a new model, it is possible to support the generation of a multimodal model optimized for a specific function.

[0013] Also, according to some embodiments of the present disclosure, by using a masking technique to learn only specific data that is the target of learning, overfitting and errors can be prevented, and the model learning can be advanced more effectively.

[0014] The effects of the present disclosure are not limited to the effects mentioned above, and other effects not mentioned will also be clearly understood by those having ordinary knowledge in the technical field to which the present disclosure belongs from the matters described in the claims.

Brief Description of the Drawings

[0015] Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, where like reference numerals indicate like elements. However, the disclosure is not limited thereto.

[Figure 1] A diagram exemplarily showing an electronic device for generating a multimodal model according to an embodiment of the present disclosure.

[Figure 2] A schematic diagram showing a configuration in which an information processing system is communicably connected to a plurality of user terminals in relation to data processing according to an embodiment of the present disclosure.

[Figure 3] A block diagram showing the internal configurations of a user terminal and an information processing system according to an embodiment of the present disclosure.

[Figure 4] A diagram conceptually showing differences in learning methods when an existing model is trained to generate a new model in an embodiment of the present disclosure.

[Figure 5] A diagram conceptually showing a learning method when an existing model is trained to generate a new model in an embodiment of the present disclosure.

[Figure 6] A diagram exemplarily showing multimodal data (610) including true response data and multimodal data (620) including false response data according to an embodiment of the present disclosure.

[Figure 7] A diagram exemplarily showing normal data (710) and abnormal data (730) according to an embodiment of the present disclosure.

[Figure 8] A diagram for explaining a method for generating a multimodal model according to an embodiment of the present disclosure.

[Figure 9] A diagram for explaining a method for generating a multimodal model according to an embodiment of the present disclosure.

Mode for Carrying Out the Invention

[0016] <Summary of the Invention>

[0017] According to one embodiment, the false response data included in the training dataset may be data generated by changing at least some of the true response data.

[0018] According to one embodiment, the true response data included in the training dataset may be data generated by correcting at least some of the false response data.
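As a non-limiting illustration, the following Python sketch shows one way a false response could be derived by changing part of a true response. The names `ResponsePair` and `make_false_response`, the example prompt, and the substituted values are all hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ResponsePair:
    """One training example: a prompt with a true and a false response."""
    prompt: str
    true_response: str
    false_response: str


def make_false_response(true_response: str, target: str, wrong_value: str) -> str:
    """Return a copy of the true response with the first `target` replaced."""
    if target not in true_response:
        raise ValueError(f"target {target!r} not found in response")
    return true_response.replace(target, wrong_value, 1)


# Hypothetical example: only the time reading differs between the two responses.
true_text = "The clock shows 10:10."
pair = ResponsePair(
    prompt="What time does the clock in the image show?",
    true_response=true_text,
    false_response=make_false_response(true_text, "10:10", "2:50"),
)
```

Because the two responses differ only in the changed span, the pair isolates the local detail the model is meant to learn. The reverse direction described in [0018] (correcting a false response to obtain a true one) could reuse the same helper with the arguments swapped.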

[0019] According to one embodiment, the step of generating a new model may be executed using a loss function for suppressing the generation of false response data for input data and increasing the generation of true response data.

[0020] According to one embodiment, the loss function may be a function that uses the output of a reference model as an anchor point.

[0021] According to one embodiment, the first difference value and the change ratio may be calculated for corresponding layers of the base model, the functional model, and the target model.

[0022] According to one embodiment, the loss function may be a function that calculates a loss value based on the output values of the reference model and the existing model for true response data, and the output values of the reference model and the existing model for false response data.
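The disclosure does not name a specific loss function. As a non-limiting sketch, a loss of the shape described in [0019] to [0022] could resemble a direct preference optimization (DPO) style objective, in which the reference model's log-probabilities act as the anchor. The function name, its arguments, and the scaling factor `beta` are assumptions made for illustration only.

```python
import math


def preference_loss(policy_true: float, policy_false: float,
                    ref_true: float, ref_false: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss over sequence log-probabilities (assumed form).

    policy_* : log-probability the model being trained assigns to the
               true / false response.
    ref_*    : log-probability the frozen reference model assigns to them.
    """
    # How much more the trained model prefers the true response over the
    # false one, measured relative to the reference model (the anchor).
    margin = (policy_true - ref_true) - (policy_false - ref_false)
    x = beta * margin
    # -log(sigmoid(x)) written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# When the trained model matches the reference, the margin is zero.
baseline = preference_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the true response and lowering the false one reduces the loss.
improved = preference_loss(-4.0, -8.0, -5.0, -7.0)
```

Under this assumed form, the loss falls as the model assigns more probability to the true response and less to the false response than the reference model does, which matches the stated goal of increasing true responses and suppressing false ones.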

[0023] According to one embodiment, the training dataset may further include location information that indicates a specific location in the response data that is the target of training.

[0024] According to one embodiment, the location information may be represented by the insertion of a special token in the response data to indicate a specific location.
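As a non-limiting illustration, location information carried by special tokens could be converted into a per-token mask as sketched below. The marker tokens `<tgt>` and `</tgt>` and the function name are hypothetical; the disclosure does not specify the form of the special token.

```python
START, END = "<tgt>", "</tgt>"  # hypothetical marker tokens, not from the source


def extract_target_mask(tokens):
    """Strip marker tokens and return (clean_tokens, mask).

    mask[i] is 1 when clean_tokens[i] was inside a marked span, else 0.
    """
    clean, mask = [], []
    inside = False
    for tok in tokens:
        if tok == START:
            inside = True
        elif tok == END:
            inside = False
        else:
            clean.append(tok)
            mask.append(1 if inside else 0)
    return clean, mask


tokens = ["The", "clock", "shows", START, "10", ":", "10", END, "."]
clean, mask = extract_target_mask(tokens)
```

The markers themselves are dropped from the token sequence, so the mask stays aligned with the tokens the model actually sees.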

[0025] According to one embodiment, the step of generating a new model may include a step of ensuring that, in the process of learning the existing model using the loss value, the parameters of the existing model are not updated with respect to data other than the data indicated by the location information.
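One common way to keep data outside the indicated location from influencing parameter updates is to exclude those positions from the loss, so they contribute no gradient. The following is a minimal sketch under that assumption; the function name and the use of a mean negative log-likelihood are illustrative choices and are not stated in the disclosure.

```python
def masked_nll(token_logprobs, mask):
    """Mean negative log-likelihood over positions where mask is 1.

    Positions with mask 0 add no term to the loss, so in a gradient-based
    trainer they would not drive any parameter update.
    """
    if len(token_logprobs) != len(mask):
        raise ValueError("token_logprobs and mask must have the same length")
    selected = [-lp for lp, m in zip(token_logprobs, mask) if m]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


mask = [0, 1, 1, 0]
loss_a = masked_nll([-0.1, -2.0, -3.0, -0.2], mask)
# Changing only the masked-out positions leaves the loss unchanged.
loss_b = masked_nll([-9.0, -2.0, -3.0, -9.0], mask)
```

Here `loss_a` and `loss_b` are equal because the positions that differ are masked out, which is the behavior the embodiment asks for.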

[0026] According to one embodiment, the at least one program may further include instructions for generating the new model using a loss function to suppress the generation of false response data for input data and increase the generation of true response data.

[0027] According to one embodiment, the at least one program may further include instructions to prevent the parameters of the existing model from being updated with respect to data other than the data indicated by the location information, during the process of learning the existing model using the loss value.

[0028] <Detailed description of the invention>

[0029] The specific details for implementing this disclosure will be described below with reference to the attached drawings. However, in the following explanation, specific descriptions of well-known functions and configurations will be omitted if they may unnecessarily obscure the gist of this disclosure.

[0030] In the attached drawings, identical or corresponding components are assigned the same reference numerals. Furthermore, in the following descriptions of embodiments, redundant descriptions of identical or corresponding components may be omitted. However, the omission of a description of a component does not mean that the component is not included in that embodiment.

[0031] The advantages and features of the disclosed embodiments, as well as the methods for achieving them, will become clearer with reference to the embodiments described below in conjunction with the accompanying drawings. However, this disclosure is not limited to the embodiments disclosed below, and can be realized in a variety of mutually different forms. These embodiments are provided merely to complete the disclosure and to fully inform those of ordinary skill in the art of the scope of the invention.

[0032] This specification provides a brief explanation of the terminology used herein and a detailed description of the disclosed embodiments. The terms used herein have been selected, as far as possible, from general terms in wide current use, in view of their function in this disclosure; however, they may vary depending on the intent of those skilled in the relevant field, precedent, the emergence of new technologies, etc. In some cases, the applicant has arbitrarily selected terms, in which case their meaning will be described in detail in the relevant section describing the invention. Therefore, the terminology used in this disclosure should be defined not by the mere name of a term, but by the meaning of the term and its context throughout the disclosure.

[0033] In this specification, singular expressions include plurals unless the context clearly identifies them as singular. Similarly, plural expressions include singulars unless the context clearly identifies them as plural. Throughout the specification, when a part includes a component, it may include other components rather than excluding them, unless otherwise stated.

[0034] Furthermore, as used herein, the terms “module” or “part” refer to a software or hardware component, and a “module” or “part” may perform some function. However, the meaning of “module” or “part” is not limited to software or hardware. A “module” or “part” may be configured to reside on an addressable storage medium or to run on one or more processors. Thus, for example, a “module” or “part” may include components such as software components, object-oriented software components, class components, and task components, as well as at least one of processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The functionality provided by components and “modules” or “parts” may be combined into fewer components and “modules” or “parts,” or further divided into additional components and “modules” or “parts.”

[0035] According to one embodiment of this disclosure, a “module” or “part” may be realized by a processor and memory. “Processor” should be broadly interpreted to include general-purpose processors, CPUs (central processing units), microprocessors, digital signal processors (DSPs), controllers, microcontrollers, state machines, etc. In some environments, “processor” may also refer to ASICs, PLDs, FPGAs, etc. “Processor” may also refer to combinations of processing devices, such as a combination of a DSP and a microprocessor, a combination of multiple microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. “Memory” should also be broadly interpreted to include any electronic component capable of storing electronic information. “Memory” may refer to various types of processor-readable media, such as RAM, ROM, NVRAM, PROM, EPROM, EEPROM, flash memory, magnetic or optical data storage devices, registers, etc. Memory is said to be in electronic communication with the processor if the processor can read information from or write information to memory. Memory integrated into a processor is in electronic communication with the processor.

[0036] Furthermore, terms such as first, second, A, B, (a), (b), etc., used in the following embodiments are used merely to distinguish one component from another, and these terms do not limit the nature or order of the components.

[0037] Furthermore, in the following embodiments, when one component is described as being “linked,” “joined,” or “connected” to another component, the component may be directly linked or connected to that other component, or yet another component may be “linked,” “joined,” or “connected” between them.

[0038] Furthermore, the terms “comprises” and “comprising” as used in the following embodiments do not preclude the presence or addition of one or more other components, steps, operations, or elements mentioned.

[0039] Various embodiments of this disclosure will be described in detail below with reference to the accompanying drawings.

[0040] Figure 1 illustrates an electronic device 100 for generating a multimodal model according to one embodiment of the present disclosure. Referring to Figure 1, the electronic device 100 can generate a new model 120 by updating at least one parameter included in an existing model 110 through learning based on a training dataset for a multimodal model.

[0041] The electronic device 100 for learning a multimodal model may include memory and at least one processor. However, the configuration of the electronic device 100 is not limited thereto. According to various embodiments, the electronic device 100 may further include at least one other component in addition to the components described above. For example, the electronic device 100 may further include a communication circuit (or communication module) for communication with an external electronic device.

[0042] A processor may be connected to memory and configured to execute at least one computer-readable program contained in the memory. For example, the processor can execute software (or a program) to control at least one other component (e.g., hardware or software component) of an electronic device 100 connected to the processor and perform various data processing and calculations. According to one embodiment, as at least part of the data processing and calculations, the processor may load instructions or data received from other components (e.g., a communication circuit) into volatile memory, process the instructions or data stored in volatile memory, and store the resulting data in non-volatile memory.

[0043] The memory may store various data used by at least one component of the electronic device 100 (e.g., a processor). This data may include, for example, input and output data relating to software (or programs) and associated instructions. The memory may include volatile or non-volatile memory.

[0044] The processor may execute at least one program that includes instructions related to the learning and generation of a multimodal model. In the following, the processor will be described as performing a function, but this is for explanatory convenience; the execution of a function by the processor can be understood as the processor executing instructions contained in at least one program stored in memory.

[0045] Figure 2 is a schematic diagram showing a configuration in which an information processing system 230 is communicatively connected to a plurality of user terminals 210_1, 210_2, and 210_3 in relation to data processing according to one embodiment of the present disclosure. The information processing system 230 may include one or more systems capable of providing data processing services (e.g., multimodal model-based services). In one embodiment, the information processing system 230 may include one or more server devices or databases capable of storing, providing, and executing computer-executable programs (e.g., downloadable applications) and data related to the data processing services, or one or more distributed computing devices or distributed databases based on cloud computing services. For example, the information processing system 230 may include a separate system (e.g., a server) for the data processing services.

[0046] The data processing services provided by the information processing system 230 may be provided to users through data processing applications, web browser applications, etc., installed on each of the user terminals 210_1, 210_2, and 210_3.

[0047] Multiple user terminals 210_1, 210_2, and 210_3 can communicate with the information processing system 230 via the network 220. The network 220 may be configured to enable communication between the multiple user terminals 210_1, 210_2, and 210_3 and the information processing system 230. Depending on the installation environment, the network 220 may consist of wired networks such as Ethernet®, Power Line Communication, telephone line communication equipment and RS-serial communication, wireless networks such as mobile communication networks, WLAN (Wireless LAN), Wi-Fi, Bluetooth®, and ZigBee, or a combination thereof. The communication method is not limited and may include not only communication methods that utilize communication networks that the network 220 may include (e.g., mobile communication networks, wired internet, wireless internet, broadcasting networks, satellite networks, etc.), but also short-range wireless communication between user terminals 210_1, 210_2, and 210_3.

[0048] For example, multiple user terminals 210_1, 210_2, and 210_3 can send data processing requests and commands related to user requests for data processing to the information processing system 230 via the network 220, and the information processing system 230 can receive them.

[0049] In Figure 2, a mobile phone terminal 210_1, a tablet terminal 210_2, and a PC terminal 210_3 are shown as examples of user terminals, but are not limited to these. User terminals 210_1, 210_2, and 210_3 can be any computing device capable of wired and / or wireless communication, and on which data processing applications and the like can be installed and run. For example, user terminals may include smartphones, mobile phones, navigation systems, computers, laptops, digital broadcasting terminals, PDAs (Personal Digital Assistants), PMPs (Portable Multimedia Players), tablet PCs, game consoles, wearable devices, IoT (Internet of Things) devices, VR (virtual reality) devices, AR (augmented reality) devices, and the like. Furthermore, although Figure 2 shows three user terminals 210_1, 210_2, and 210_3 communicating with the information processing system 230 via the network 220, the configuration is not limited to this, and a different number of user terminals may be configured to communicate with the information processing system 230 via the network 220.

[0050] Figure 3 is a block diagram showing the internal configuration of a user terminal 210 and an information processing system 230 according to one embodiment of the present disclosure. The user terminal 210 can refer to any computing device capable of executing data processing applications and other similar tasks, and capable of wired / wireless communication. For example, it may include the mobile phone terminal 210_1, tablet terminal 210_2, and PC terminal 210_3 shown in Figure 2. As shown in the figure, the user terminal 210 may include a memory 312, a processor 314, a communication module 316, and an input / output interface 318. Similarly, the information processing system 230 may include a memory 332, a processor 334, a communication module 336, and an input / output interface 338. As shown in Figure 3, the user terminal 210 and the information processing system 230 may be configured to communicate information and / or data through the network 220 via their respective communication modules 316 and 336. The input / output device 320 may also be configured to input information and / or data to the user terminal 210 through the input / output interface 318, or to output information and / or data generated from the user terminal 210.

[0051] The memories 312 and 332 may include any non-transitory computer-readable recording medium. According to one embodiment, the memories 312 and 332 may include non-volatile mass storage devices such as ROM (read-only memory), disk drives, SSDs (solid-state drives), and flash memory. In another example, non-volatile mass storage devices such as ROM, SSDs, flash memory, and disk drives may be included in the user terminal 210 or information processing system 230 as separate persistent storage devices distinct from the memories. Furthermore, the memories 312 and 332 may store an operating system and at least one program code (e.g., code such as an application related to a data processing service).

[0052] These software components may be loaded from a computer-readable recording medium separate from the memories 312 and 332. This separate computer-readable recording medium may include recording media that can be directly connected to the user terminal 210 and the information processing system 230. For example, it may include computer-readable recording media such as floppy drives, disks, tapes, DVD / CD-ROM drives, and memory cards. In another example, software components may be loaded into the memories 312 and 332 via the communication modules 316 and 336 rather than through a computer-readable recording medium. For example, at least one program may be loaded into the memories 312 and 332 based on a computer program (e.g., an application related to data processing services) installed from a file provided through the network 220 by a developer or by a file distribution system that distributes application installation files.

[0053] Processors 314, 334 may be configured to process computer program instructions by performing basic arithmetic, logic, and input / output operations. Instructions may be provided to processors 314, 334 by memory 312, 332 or by communication modules 316, 336. For example, processors 314, 334 may be configured to execute received instructions according to program code stored in a recording device such as memory 312, 332.

[0054] Communication modules 316 and 336 provide configurations or functions for user terminals 210 and information processing systems 230 to communicate with each other via the network 220, and may also provide configurations or functions for user terminals 210 and / or information processing systems 230 to communicate with other user terminals or other systems (e.g., another cloud system). For example, requests or data (e.g., data processing requests or data) generated by the processor 314 of user terminal 210 according to program code stored in a recording device such as memory 312 may be transmitted to the information processing system 230 via the network 220 under the control of the communication module 316. Conversely, control signals or instructions provided under the control of the processor 334 of the information processing system 230 may be received by user terminal 210 via the communication module 336 and the network 220 through the communication module 316 of user terminal 210.

[0055] The input / output interface 318 may be a means for interfacing with the input / output device 320. For example, the input device may include a camera with audio and / or image sensors, a keyboard, a microphone, a mouse, etc., and the output device may include a display, a speaker, a haptic feedback device, etc. As another example, the input / output interface 318 may be a means for interfacing with a device that integrates input and output functions into one, such as a touchscreen. In Figure 3, the input / output device 320 is shown as not being included in the user terminal 210, but the configuration is not limited to this, and the input / output device 320 may be configured as a single device with the user terminal 210. Also, the input / output interface 338 of the information processing system 230 may be a means for interfacing with input and output devices (not shown) that are connected to, or included in, the information processing system 230. In Figure 3, the input / output interfaces 318 and 338 are shown as separate components from the processors 314 and 334. However, the configuration is not limited to this, and the input / output interfaces 318 and 338 may be included within the processors 314 and 334.

[0056] The user terminal 210 and the information processing system 230 may include more components than those shown in Figure 3. However, it is not necessary to explicitly show most of the conventional components. In one embodiment, the user terminal 210 may be implemented to include at least some of the input / output devices 320 described above. The user terminal 210 may also include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, and a database. For example, if the user terminal 210 is a smartphone, it may include components that are generally found in smartphones. As an example, the user terminal 210 may be implemented to further include a variety of components such as an accelerometer, a gyroscope, a microphone module, a camera module, various physical buttons, buttons using a touch panel, input / output ports, and a vibrator for vibration.

[0057] In one embodiment, the processor 314 of the user terminal 210 may be configured to run a data processing application or a web browser application that provides data processing services. In this case, program code related to the application may be loaded into the memory 312 of the user terminal 210. While the application is running, the processor 314 of the user terminal 210 may receive information and data provided by the input / output device 320 via the input / output interface 318, or receive information and data from the information processing system 230 via the communication module 316, process the received information and data, and store it in the memory 312. In addition, this information and data may be provided to the information processing system 230 via the communication module 316.

[0058] While the data processing application is running, the processor 314 may receive input or selected audio data, text, images, video, etc., through input devices such as a touchscreen, keyboard, camera including audio sensors and / or image sensors, and microphone, which are connected to the input / output interface 318. The received audio data, text, images, and / or video, etc., may be stored in the memory 312 or provided to the information processing system 230 via the communication module 316 and the network 220. In one embodiment, the processor 314 can receive user input entered through the input device and provide data / requests corresponding to the received user input to the information processing system 230 via the network 220 and the communication module 316.

[0059] The processor 314 of the user terminal 210 can transmit and output information and data to the input / output device 320 via the input / output interface 318. For example, the processor 314 of the user terminal 210 can output processed information and data through the output device 320, such as a display output device (e.g., touchscreen, display, etc.) or an audio output device (e.g., speaker).

[0060] The processor 334 of the information processing system 230 may be configured to manage, process, and / or store information and data received from multiple user terminals 210 and / or multiple external systems. The information and data processed by the processor 334 may be provided to the user terminals 210 through the communication module 336 and the network 220.

[0061] Figure 4 conceptually illustrates the differences in learning methods when generating a new model by training an existing model in one embodiment of this disclosure. Referring to Figure 4, conventional LLMs tended to refer to global data that includes information unnecessary for output generation, for example by training on the image region of the entire clock (420-1) or on the entire text (420-2) in order to output the current time. In contrast, the learning method of this disclosure can optimize the model around the local data that is essential for output generation, by training on the region of the hour and minute hands (410-1) and on the text token indicating the time (410-2).

[0062] Figure 5 is a conceptual diagram illustrating a learning method used in one embodiment of this disclosure to generate a new model by training an existing model. For example, suppose the new model needs to learn that "apples are better than citrus fruits." With conventional datasets, overfitting can occur because of differences unrelated to the part that needs to be learned (e.g., hands and glasses, and pens and ninjas), or errors can occur because learning proceeds on data unrelated to apples or citrus fruits. In contrast, the learning method of this disclosure clearly specifies the data to be trained on and masks the other data so that no learning occurs on it, and can thereby generate a new model that is an optimized version of the existing model.

[0063] In this disclosure, "ground truth data" refers to output data whose generation is to be induced when the electronic device 100 updates the existing model 110 to generate the new model 120, while "false data" (also called "hallucination data") refers to output data whose generation is to be suppressed when the electronic device 100 updates the existing model 110 to generate the new model 120.

[0064] The artificial neural network model described herein may be a multimodal model that processes multimodal data and generates output data.

[0065] A multimodal model is an artificial neural network model that can process and learn from diverse types of data simultaneously, thereby understanding the interactions and context between the data and generating output.

[0066] The multimodal data used for training the multimodal model described herein refers to data composed of information in different forms, such as image data, text data, audio data, and video data, each containing data with different characteristics.

[0067] A multimodal model that processes multimodal data can generate a final output by using separate encoders for each data type and fusing the features of each data extracted by the encoders.

[0068] For example, in the case of text data included in multimodal data, tokenization, which divides the text data into fixed units, and embedding, which converts each token into a vector value that a computer can recognize and process, may be performed. As an example, byte-pair encoding (BPE) can be used for tokenization. Generally, BPE is a technique that builds a vocabulary by splitting words into character or Unicode units, and generates tokens by repeatedly merging consecutive characters or Unicode units according to how frequently they appear together. Embedding is the process of converting each token generated by tokenization into an embedding vector, which can be generated by various technologies such as GloVe, FastText, and Word2Vec. Furthermore, in the case of image data included in multimodal data, image features can be extracted by encoder models such as a CNN (Convolutional Neural Network) or a ViT (Vision Transformer), and an embedding vector can be generated as a tensor or vector with a specific dimension.
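
The merge procedure of BPE described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy corpus, whitespace pre-splitting, and the helper names `most_frequent_pair`, `merge_pair`, and `learn_bpe` are hypothetical and are not taken from this disclosure.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are resolved in favor of the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent character sequence "low" has become a single token, while the rarer suffixes remain split into characters.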

[0069] When embedding vectors containing features are generated for multiple data points included in multimodal data, the features can be merged by concatenating the multiple embedding vectors or by adding them together after matching their dimensions.

[0070] As described above, a multimodal model can generate intermediate data for creating the final output by including a fusion layer that extracts features from each of the multiple input data via an encoder model and then fuses them together.
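
The two fusion options mentioned above (concatenation, or addition after matching dimensions) can be sketched as follows. This is an illustrative sketch only: the function names, the use of a linear projection to match dimensions, and the example values are assumptions, not details given in this disclosure.

```python
def concat_fusion(image_vec, text_vec):
    """Fuse two embedding vectors by concatenation."""
    return list(image_vec) + list(text_vec)


def project(vec, weight):
    """Apply a linear map; `weight` has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def additive_fusion(image_vec, text_vec, text_proj):
    """Project the text vector to the image dimension, then add element-wise."""
    projected = project(text_vec, text_proj)
    if len(projected) != len(image_vec):
        raise ValueError("dimension mismatch after projection")
    return [a + b for a, b in zip(image_vec, projected)]


image_vec = [1.0, 2.0, 3.0]          # hypothetical 3-dimensional image feature
text_vec = [0.5, -0.5]               # hypothetical 2-dimensional text feature
text_proj = [[1, 0], [0, 1], [1, 1]]  # hypothetical 2 -> 3 projection matrix

print(concat_fusion(image_vec, text_vec))               # [1.0, 2.0, 3.0, 0.5, -0.5]
print(additive_fusion(image_vec, text_vec, text_proj))  # [1.5, 1.5, 3.0]
```

Concatenation keeps both feature sets intact at the cost of a larger vector, whereas addition keeps the dimension fixed but requires the projection step.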

[0071] The electronic device 100 according to this disclosure can generate true response data (y_w) and false response data (y_l) in response to input data (x) which consists of image data and query data which is text data.

[0072] Figure 6 illustrates multimodal data (610) including true response data and multimodal data (620) including false response data according to one embodiment of the present disclosure.

[0073] Referring to Figure 6, for the query data “Who wrote this book?” input along with image data, the true response data could be “Donna Eden” and the false response data could be “John Smith”. Similarly, for the query data “What is the title of this book?”, the true response data could be “The Energies of Love: Using Energy Medicine to Keep Your Relationship Thriving” and the false response data could be “The Power of Energy”. Furthermore, for the query data “What type of book is this?”, the true response data could be “Health, Fitness & Dieting” and the false response data could be “Science Fiction”. And for the query data “Is this a fitness book?”, the true response data could be “Yes” and the false response data could be “No”.

[0074] The electronic device 100 according to this disclosure can generate a tuple of false response data and true response data by revising at least a portion of the false response data to generate true response data, or by hallucinating at least a portion of the true response data to generate false response data.

[0075] For example, suppose the image data included in the multimodal data is a photograph of a cat with white fur. The predetermined false response data may be something like, "This is a photograph of a dog. The dog in the photograph has brown fur." In this case, the electronic device 100 can obtain the corrected true response data, "This is a photograph of a cat. The cat in the photograph has white fur," based on the external user input. Alternatively, if the predetermined true response data is "This is a photograph of a cat. The cat in the photograph has white fur," the electronic device 100 can generate false response data such as, "This is a photograph of a dog. The dog in the photograph has brown fur."

[0076] In this disclosure, the process of generating false response data by transforming at least a portion of the true response data may be performed by an external user or as a result of computations performed by a separately trained natural language processing model to replace specific text tokens.
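
As a rough illustration of producing a tuple of true and false response data by replacing specific text tokens, the following sketch uses a fixed substitution table. The function names and the substitution table are hypothetical; the disclosure leaves the replacement to an external user or to a separately trained natural language processing model, which this sketch does not implement.

```python
import re


def make_false_response(true_response, substitutions):
    """Swap each whole word listed in `substitutions` for its replacement."""
    if not substitutions:
        raise ValueError("at least one substitution is required")
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in substitutions) + r")\b"
    )
    return pattern.sub(lambda m: substitutions[m.group(1)], true_response)


def make_preference_tuple(query, true_response, substitutions):
    """Return the query together with true (y_w) and false (y_l) responses."""
    false_response = make_false_response(true_response, substitutions)
    if false_response == true_response:
        raise ValueError("no token was replaced")
    return {"query": query, "y_w": true_response, "y_l": false_response}


sample = make_preference_tuple(
    "What is this photograph about?",
    "This is a photograph of a cat. The cat in the photograph has white fur.",
    {"cat": "dog", "white": "brown"},
)
print(sample["y_l"])
# This is a photograph of a dog. The dog in the photograph has brown fur.
```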

[0077] The electronic device 100 can use input data (x), true response data (y_w), and false response data (y_l) to train an existing model through a loss function as shown in Equation 1 below, and generate a new model.

[0078]

[Equation 1]

[0079] Here, L represents the loss function, x is the input multimodal data, y_w is the true response data, y_l is the false response data, θ is a learnable parameter, and r_θ(x, y) is the reward function calculated based on the difference between y and the output value that a model with parameter θ derives for the input x. The reward function may be a function that returns a larger absolute value the greater the difference between y and the model's output value for x.

[0080] The electronic device 100 can update an existing model using a loss function further based on the reference model and generate a new model. Specifically, the electronic device 100 can use the output of the reference model as an anchor to train the existing model to produce a better output than the reference model, thereby generating a new model. The reference model may be a separate multimodal model trained to produce a similar output to the existing model for the same input data.

[0081] For example, the electronic device 100 may use a loss function like the one shown in Equation 2 below.

[0082]

[Equation 2]

[0083] Here, L represents the loss function, x is the input multimodal data, y_w is the true response data, y_l is the false response data, θ is a learnable parameter, and r_θ(x, y) is the reward function calculated based on the difference between y and the output value that a model with parameter θ derives for the input x. The reward function may be configured to return a larger absolute value the greater the difference between y and the model's output value for x. Furthermore, σ may be a variable used to determine the magnitude of the loss value derived by the loss function. β_KL,w and β_KL,l are the Kullback-Leibler (KL) divergence terms for the true response data and the false response data, respectively, and can be expressed, for example, as shown in Equations 3 and 4 below.

[0084]

[Equation 3]

[0085]

[Equation 4]

[0086] The KL divergence is a measure of how similar two probability distributions are, and takes a smaller value the more similar the two distributions are. In Equation 3, π_θ(y_w|x) represents the probability distribution in which the existing model outputs the true response data (y_w) for the input data (x), and π_ref(y_w|x) represents the probability distribution in which the reference model outputs the true response data (y_w) for the input data (x). Similarly, in Equation 4, π_θ(y_l|x) represents the probability distribution in which the existing model outputs the false response data (y_l) for the input data (x), and π_ref(y_l|x) represents the probability distribution in which the reference model outputs the false response data (y_l) for the input data (x).

[0087] By training the existing model using the loss functions expressed by Equations 2 to 4, the electronic device 100 according to the present disclosure can prevent the existing model from generating output data similar to the true response data with a lower probability than the reference model, and from generating output data similar to the false response data with a higher probability than the reference model. Further, by training the existing model using the KL divergence terms described above, the electronic device 100 can induce the existing model not to learn data that does not require learning.
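
To make the idea of a reference-anchored loss concrete, the following sketch implements the well-known Direct Preference Optimization (DPO) objective, which likewise compares the trained model and a reference model on a true response (y_w) and a false response (y_l). It is swapped in here purely as an illustration: Equations 2 to 4 of this disclosure are not reproduced in this text and may differ from it. The function names, the scaling factor `beta`, and the example log-probabilities are assumptions.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) tuple.

    logp_w, logp_l:         log pi_theta(y_w|x) and log pi_theta(y_l|x)
    ref_logp_w, ref_logp_l: the same quantities under the reference model
    """
    margin_w = logp_w - ref_logp_w  # how much more likely y_w became
    margin_l = logp_l - ref_logp_l  # how much more likely y_l became
    return -log_sigmoid(beta * (margin_w - margin_l))


# The trained model still matches the reference: the loss equals log 2.
baseline = dpo_style_loss(-5.0, -7.0, -5.0, -7.0)
# The trained model raised y_w and lowered y_l: the loss decreases.
improved = dpo_style_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline > improved)  # True
```

Because only the differences from the reference model enter the loss, the reference output acts as the anchor: the loss falls below its baseline only when the trained model prefers the true response more strongly than the reference does.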

[0088] The electronic device 100 disclosed herein can train an existing model using a training dataset that includes location information indicating a specific location in the response data to be trained on, and generate a new model. For example, suppose the image data included in the multimodal data is a photograph of a "cat," and the query data is "What is this photograph about?" Here, the true response data could be "This photograph is about a cat," and the false response data could be "This photograph is about a dog." The training dataset may then further include location information about the specific location containing the key information.

[0089] For example, location information can be represented by inserting special tokens into text data. For instance, true response data containing location information might be expressed as "This photo is about [CLS]cat[CLS]" or "This photo is about ≪cat≫", while false response data containing location information might be expressed as "This photo is about [CLS]dog[CLS]" or "This photo is about ≪dog≫".

[0090] As another example, location information can be represented as a masking vector with the same dimensions as the text data. For instance, if the size of the embedding vector is L as a result of embedding text data, the masking vector may be a binary vector of size L. Specifically, assuming a sequence length of 5, if the masking vector is [0,0,1,0,0], this may be a masking vector that activates only the third token.
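
The two representations of location information described above (special tokens in the text, or a binary masking vector) can be related by a small conversion routine. The following is a sketch under stated assumptions: whitespace tokenization, the ≪ ≫ marker pair, and the function name `parse_location_markers` are chosen for illustration only.

```python
def parse_location_markers(text, open_mark="≪", close_mark="≫"):
    """Strip location markers and return (tokens, binary mask).

    A mask entry of 1 marks a token that lay between the markers.
    """
    tokens, mask = [], []
    inside = False
    for raw in text.split():
        if raw.startswith(open_mark):
            inside = True
            raw = raw[len(open_mark):]
        closes = raw.endswith(close_mark)
        if closes:
            raw = raw[:-len(close_mark)]
        if raw:
            tokens.append(raw)
            mask.append(1 if inside else 0)
        if closes:
            inside = False
    return tokens, mask


tokens, mask = parse_location_markers("This photo is about ≪cat≫")
print(tokens)  # ['This', 'photo', 'is', 'about', 'cat']
print(mask)    # [0, 0, 0, 0, 1]
```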

[0091] The electronic device 100 according to this disclosure uses a response dataset that includes location information indicating a specific location in the response data, thereby preventing the parameters of the existing model from being updated for data other than the data indicated by the location information. In other words, it is possible to avoid additional training on model parameters that are unrelated to the data that the existing model must learn. As a result, this disclosure can avoid unnecessary updates to model parameters and overfitting by reflecting only the local data that the existing model should learn in the training process, and can generate a more accurate new model.

[0092] The electronic device 100 according to this disclosure can generate multiple pieces of training data by masking specific positions in the input data, thereby generating a training dataset for training an existing model.

[0093] Figure 7 is a diagram illustrating normal data (710) and abnormal data (720) according to one embodiment of the present disclosure. Referring to Figure 7, assume that the electronic device 100 trains a multimodal model that receives input data including specific image data and query data requesting an explanation of a specific location within the image data, and outputs response data.

[0094] Normal data (710) is input data that includes the original image data, in which the specific location for which an explanation is requested is displayed as is, while abnormal data (720) may be an image in which that specific location is masked. In this case, the same response text, "man on end black suit," may be learned as true response data with the normal data (710) and as false response data with the abnormal data (720). A multimodal model trained with such a training dataset can learn more accurately which location in the image should be targeted to generate output for an input query.
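
The construction of a normal sample and a masked (abnormal) sample from one image can be sketched as follows. The image is modeled as a nested list of pixel values, and the function names, the box format (top, left, height, width), and the fill value are assumptions made for this illustration.

```python
def mask_region(image, top, left, height, width, fill=0):
    """Return a copy of `image` with the given rectangle overwritten by `fill`."""
    masked = [row[:] for row in image]
    for r in range(top, min(top + height, len(masked))):
        for c in range(left, min(left + width, len(masked[r]))):
            masked[r][c] = fill
    return masked


def build_pair(image, box, response):
    """Pair the same response text with the original and the masked image."""
    return [
        {"image": image, "response": response, "label": "true"},
        {"image": mask_region(image, *box), "response": response, "label": "false"},
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
normal, abnormal = build_pair(image, (0, 1, 2, 2), "man on end black suit")
print(abnormal["image"])  # [[1, 0, 0], [4, 0, 0], [7, 8, 9]]
```

The original image is left unmodified, so the same response text is attached to both versions and only the label differs.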

[0095] Figure 8 is a diagram illustrating a method for generating a multimodal model according to one embodiment of the present disclosure.

[0096] Referring to Figure 8, the processor of the electronic device (e.g., electronic device 100 in Figure 1) can acquire an existing model, which is a pre-trained multimodal model, in step S810. The existing model may be a model that takes an image and text as input and generates response data for the input query. For example, the existing model may take a clock image and query data about the time and generate response data that describes the current time, or take a movie poster image and query data about the movie title and generate response data that gives the title of the movie shown on the poster.

[0097] In step S820, the processor can acquire a training dataset for training a multimodal model. The training dataset may include true response data and false response data. For example, true response data could be text data such as "The title of this movie is Parasite," which is the text that the multimodal model should output. On the other hand, false response data could be text data such as "The title of this movie is Pascon," which is data in which at least part of the true response data has been replaced.

[0098] In step S830, the processor can train an existing model based on the training dataset and generate a new model. The processor can train an existing model by using a loss function that suppresses the generation of false response data for input data and increases the generation of true response data. At this time, the processor can backpropagate the loss value based on a loss function that simultaneously uses a tuple of false response data and true response data for the input data.

[0099] Figure 9 is a diagram illustrating a method for generating a multimodal model according to one embodiment of the present disclosure.

[0100] Referring to Figure 9, the processor of the electronic device (e.g., electronic device 100 in Figure 1) can acquire an existing model, which is a pre-trained multimodal model, in step S910. Step S910 can be performed identically or similarly to step S810 in Figure 8.

[0101] In step S920, the processor can acquire a training dataset for training a multimodal model. The training dataset may include true response data, false response data, and positional information indicating specific locations in the response data to be trained. Here, the positional information may be represented by inserting special tokens into the text data. Furthermore, the positional information may be represented as a masking vector of the same dimension as the text data.

[0102] In step S930, the processor can generate a new model by ensuring that, while the existing model is trained using the loss value, the parameters of the existing model are not updated for data other than the data indicated by the positional information. For example, when the processor computes a softmax-based loss, it can prevent the parameters from being updated by assigning 0 to the regions of irrelevant information. In this way, the learning method of this disclosure can focus on learning only the specific positions of the response data necessary to generate the output.
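
The effect of assigning 0 to irrelevant positions can be illustrated with a masked token-level cross-entropy. This is a minimal sketch, assuming a softmax cross-entropy loss and a binary masking vector over response tokens; the function names and example values are hypothetical, and the actual loss of this disclosure may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_entropy(logits_per_token, targets, mask):
    """Mean cross-entropy over positions where mask == 1.

    Also returns the gradient of the loss with respect to every logit, so
    that the zero gradient at masked-out positions can be inspected.
    """
    active = sum(mask)
    if active == 0:
        raise ValueError("mask selects no position")
    loss, grads = 0.0, []
    for logits, target, m in zip(logits_per_token, targets, mask):
        probs = softmax(logits)
        loss += -m * math.log(probs[target]) / active
        grads.append([
            m * (p - (1.0 if k == target else 0.0)) / active
            for k, p in enumerate(probs)
        ])
    return loss, grads


logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]  # 3 tokens, vocabulary of size 2
targets = [0, 1, 1]
mask = [0, 1, 0]                               # only the second token is trained
loss, grads = masked_cross_entropy(logits, targets, mask)
print(all(g == 0.0 for g in grads[0] + grads[2]))  # True
```

Since the gradient is exactly zero wherever the mask is 0, backpropagation from those positions contributes nothing to the parameter update, which matches the behavior described for step S930.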

[0103] The flowchart and explanation described above are examples, and different implementations are possible in some embodiments. For example, in some embodiments, the order of the steps may be changed, some steps may be repeated, some steps may be omitted, or some steps may be added.

[0104] The methods described above may be provided as computer programs stored on computer-readable storage media for execution on a computer. These media may be media that permanently store computer-executable programs, or media that temporarily store them for execution or download. Furthermore, these media can be a variety of recording or storage means in the form of single hardware components or combinations of multiple hardware components, and are not limited to media directly connected to a computer system; they may also be distributed across a network. Examples of media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, and flash memory, which may be configured to store program instructions. Other examples of media include recording or storage media managed by app stores that distribute applications, or by other sites, servers, etc., that supply or distribute various other software.

[0105] The methods, operations, or techniques described herein can be implemented by a variety of means. For example, these techniques can be implemented in hardware, firmware, software, or a combination thereof. A person of ordinary skill will understand that the various exemplary logic blocks, modules, circuits, and algorithmic steps described in connection with the disclosure herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate such interchangeability between hardware and software, the various exemplary components, blocks, modules, circuits, and steps are described above from their respective functional standpoints. Whether such functions are implemented in hardware or software depends on the design requirements imposed on the particular application and the overall system. A person of ordinary skill may implement the functions described in a variety of ways for each specific application, but these implementations should not be construed as being outside the scope of this disclosure.

[0106] In hardware implementations, the processing units used to execute the technology may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, computers, or combinations thereof.

[0107] Accordingly, the various exemplary logic blocks, modules, and circuits described in connection with the disclosure herein may be implemented or executed by general-purpose processors, DSPs, ASICs, FPGAs, other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but alternatively, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a DSP and a microprocessor, multiple microprocessors, one or more microprocessors combined with a DSP core, or any other such configuration.

[0108] In firmware and/or software implementations, these techniques may be implemented as instructions stored on computer-readable media such as RAM, ROM, NVRAM, PROM, EPROM, EEPROM, flash memory, compact discs (CDs), and magnetic or optical data storage devices. The instructions are executable by one or more processors and may cause the processors to perform specific aspects of the functions described herein.

[0109] When implemented in software, the above techniques may be stored on or transmitted via computer-readable media as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates the transfer of computer programs from one location to another. Storage media can be any available medium accessible by a computer. As a non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to transfer or store desired program code in the form of instructions or data structures and that is accessible by a computer. Furthermore, any connection is properly termed a computer-readable medium.

[0110] For example, if software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, digital subscriber line, or wireless technologies such as infrared, radio, and microwave are included in the definition of media. As used herein, disk and disc include CD, laserdisc, optical disc, DVD (digital versatile disc), floppy disk, and Blu-ray disc, where disks typically reproduce data magnetically, while discs reproduce data optically using a laser. Combinations of the above should also be included in the scope of computer-readable media.

[0111] Software modules may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disks, removable disks, CD-ROMs, or any other known form of storage medium. An exemplary storage medium may be connected to the processor in a way that allows the processor to read information from or write information to the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and storage medium may reside within an ASIC. The ASIC may reside within a user terminal. Or, the processor and storage medium may exist as separate components within a user terminal.

[0112] The embodiments described above are described as utilizing aspects of the currently disclosed subject matter in one or more independent computer systems, but the disclosure is not limited thereto and may be implemented in conjunction with any computing environment, such as a network or a distributed computing environment. Furthermore, aspects of the subject matter of the disclosure may be implemented on multiple processing chips or devices, and storage may similarly be effected across multiple devices. Such devices may include PCs, network servers, and mobile devices.

[0113] While this disclosure is described in relation to several embodiments, various modifications and variations can be made that do not depart from the scope of this disclosure, to the extent that they can be understood by a person with ordinary skill in the art to which the invention of this disclosure pertains. Furthermore, such modifications and variations should be considered to be included in the claims appended to this specification.

Claims

1. A learning method for a multimodal model, executed by at least one processor, the method comprising: a step of acquiring an existing model, which is a pre-trained multimodal model; a step of acquiring a training dataset for training the multimodal model, the training dataset including true response data and false response data; and a step of generating a new model by training the existing model based on the training dataset.

2. The learning method for a multimodal model according to claim 1, wherein the false response data included in the training dataset is data generated by modifying at least a portion of the true response data.

3. The learning method for a multimodal model according to claim 2, wherein the true response data included in the training dataset is data generated by modifying at least a portion of the false response data.

4. The learning method for a multimodal model according to claim 1, wherein the step of generating the new model is performed using a loss function that suppresses the generation of false response data for input data and increases the generation of true response data.

5. The learning method for a multimodal model according to claim 4, wherein the loss function is a function based on a reference model, with the output of the reference model as its anchor point.

6. The learning method for a multimodal model according to claim 5, wherein the loss function is a function that calculates a loss value based on the output values of the reference model and the existing model for the true response data, and the output values of the reference model and the existing model for the false response data.

7. The learning method for a multimodal model according to claim 1, wherein the training dataset further includes location information that indicates a specific location in the response data to be trained on.

8. The learning method for a multimodal model according to claim 7, wherein the location information is information represented by inserting a special token to indicate the specific location within the response data.

9. The learning method for a multimodal model according to claim 7, wherein the step of generating the new model includes, in the process of training the existing model using a loss value, ensuring that the parameters of the existing model are not updated with respect to data other than the data indicated by the location information.

10. A computer-readable computer program for causing a computer to execute the method according to any one of claims 1 to 9.

11. An electronic device comprising: a memory; and at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory, wherein the at least one program includes instructions for acquiring an existing model, which is a pre-trained multimodal model, acquiring a training dataset for training the multimodal model, the training dataset including true response data and false response data, and generating a new model by training the existing model based on the training dataset.

12. The electronic device according to claim 11, wherein the at least one program further includes instructions for generating the new model using a loss function that suppresses the generation of false response data for input data and increases the generation of true response data.

13. The electronic device according to claim 12, wherein the loss function is a function based on a reference model, with the output of the reference model as its anchor point.

14. The electronic device according to claim 11, wherein the training dataset further includes location information that indicates a specific location in the response data to be trained on.

15. The electronic device according to claim 14, wherein the at least one program further includes instructions for preventing the parameters of the existing model from being updated with respect to data other than the data indicated by the location information, in the process of training the existing model using a loss value.