Information processing device and information processing method

The information processing apparatus enhances VLMs by using an attention mechanism and prompt optimization to integrate domain-specific knowledge, improving the relevance of image-text processing and text generation.

JP7873322B1 · Active · Publication Date: 2026-06-11 · NTT COMWARE CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
NTT COMWARE CORP
Filing Date
2025-02-04
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Conventional Vision Language Models (VLMs) face issues with knowledge fragmentation between heterogeneous data domains, leading to insufficient reflection of target domain-specific knowledge in generated text.

Method used

An information processing apparatus incorporating a model with a first module for image-text token association using an attention mechanism and a second module for prompt optimization, along with input/output units to process images and questions, and convert semantic spaces to enhance domain-specific knowledge reflection.

🎯Benefits of technology

The model effectively incorporates target domain-specific knowledge into generated text, addressing knowledge fragmentation and enhancing the relevance of image-text processing.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides an information processing device and an information processing method that reflect target domain-specific knowledge in the text generated by a Vision Language Model (VLM). [Solution] In the information processing system 1, the learning unit 10 includes a preprocessing unit that acquires training images and text from the data storage unit 30, divides the images and text into tokens, and converts them into embedding representations, and a calculation processing unit that learns the model parameters so as to maximize the objective function and stores the updated parameters in the calculation result storage unit 40. [Equation image: TIFF0007873322000011.tif]

Description

[Technical Field]
【0001】 This disclosure relates to an information processing device and an information processing method.
[Background Art]
【0002】 In recent years, text generation using large language models (LLMs) has attracted attention in the field of AI, and its range of applications is expanding. LLMs are evolving into Vision Language Models (VLMs), which handle not only language comprehension and text generation but also other types of data such as images. For example, given an image and a question about that image, a VLM can return an answer.
[Prior Art Documents]
[Non-Patent Literature]
【0003】
[Non-Patent Document 1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, "Visual Instruction Tuning", NeurIPS 2023.
[Non-Patent Document 2] Xiaoyi Dong et al., "InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models", CoRR abs/2401.16420 (2024).
[Non-Patent Document 3] Jinze Bai et al., "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond", CoRR abs/2308.12966 (2023).
[Summary of the Invention]
【Problems to be Solved by the Invention】
【0004】 Conventional VLMs use an LLM as the base model and acquire image and language processing capabilities afterwards through fine-tuning. In typical fine-tuning, for example, images are projected into the same semantic space as the LLM. The pre-trained model (source domain) therefore has a strong influence, and knowledge becomes fragmented between heterogeneous data. As a result, knowledge specific to the target domain, such as its images and text, is not sufficiently reflected.
【0005】 This disclosure has been made in view of the above and aims to reflect knowledge specific to the target domain in the text generated by a VLM.
【Means for Solving the Problems】
【0006】 An information processing apparatus according to an aspect of this disclosure includes a model obtained by adding to a language model a first module that associates image and text tokens using an attention mechanism and a second module that optimizes a prompt to the language model, and an input/output unit that inputs an image and a question text to the model and outputs an answer text. The attention mechanism determines the strength of the relationship between image and text tokens and, according to that strength, applies a first mask that allows all tokens to be referenced and a second mask that allows image tokens to reference only image tokens and text tokens to reference only text tokens, thereby calculating the relationship between tokens. In the stage preceding the second module, the apparatus converts the text-side output of the first module from the semantic spaces of the source domain and the target domain to the semantic space of the source domain, inputs the image-side output of the first module and the output of the second module to the decoder of the language model, and converts the output of the decoder from the semantic space of the source domain to the semantic spaces of the source domain and the target domain.
【Effects of the Invention】
【0007】 According to this disclosure, a VLM can reflect target domain-specific knowledge in the generated text.
[Brief Description of the Drawings]
【0008】
[Figure 1] Figure 1 shows an example of the configuration of an information processing system using a VLM.
[Figure 2] Figure 2 shows an example of the VLM of this embodiment.
[Figure 3] Figure 3 shows an example of the tokens to be updated for the source domain and the target domain.
[Figure 4] Figure 4 shows an example of the masks used in the Modal Alignment Attention Mechanism.
[Figure 5] Figure 5 is a flowchart showing an example of the learning process flow.
[Modes for Carrying Out the Invention]
【0009】 [System Configuration] Figure 1 shows an example of the configuration of the information processing system 1 using a VLM in this embodiment. The information processing system 1 learns from images and the text related to those images, receives an image and a question text from the user terminal 5, inputs them into the trained model, and outputs an answer to the question.
【0010】 The information processing system 1 shown in Figure 1 comprises a learning unit 10, an input/output unit 20, a data storage unit 30, and a calculation result storage unit 40. Each unit of the information processing system 1 may be configured by at least one computer equipped with an arithmetic processing unit, a storage device, and the like, and the processing of each unit may be executed by a program. This program is stored in the storage device of the information processing system 1; it can be recorded on a computer-readable non-transitory recording medium such as a magnetic disk, optical disk, or semiconductor memory, or it can be provided via a network.
【0011】 The learning unit 10 further trains a VLM (or LLM) using the VLM control framework described later. The learning unit 10 comprises a preprocessing unit 11 and a calculation processing unit 12.
【0012】 The preprocessing unit 11 uses an ENCODER to split the image and text data into tokens and convert them into embedding representations.
【0013】 The calculation processing unit 12 inputs the token embedding representations (hereinafter sometimes simply referred to as tokens) into the VLM/LLM and learns the model parameters so as to maximize the objective function described later. In this embodiment, the trained model is a VLM/LLM to which two modules have been added: Knowledge effective Fine-tuning (KEFT), which performs token alignment between images and text on a sequence-by-sequence basis using an attention mechanism, and Knowledge effective Prompt optimization (KEPO), which optimizes prompts to the VLM/LLM.
【0014】 The input/output unit 20 receives an image and a question text from the user terminal 5, feeds them into the trained model, and returns the answer obtained from the model.
【0015】 The data storage unit 30 stores training data for the target domain. Training data includes, for example, images and text related to those images.
【0016】 The calculation result storage unit 40 stores the parameters of the model learned by the learning unit 10.
【0017】 [Proposed Model] The language model (VLM/LLM) to which the VLM control framework of this embodiment is applied is described with reference to Figure 2.
【0018】 The language model of this embodiment is an existing VLM/LLM with KEFT and KEPO added, and it is tuned using KEFT and KEPO. Specifically, KEFT aligns multimodal differences (between images and text) and semantic differences between domains at different levels, such as tokens and sequences, and learns the relationships between them. KEPO optimizes the prompts input to the VLM/LLM.
【0019】 The proposed framework defines an Augmented Embedding Space (AES) by focusing on the difference between the source domain and the target domain. The definition of AES is as follows:
【0020】 【Equation】
【0021】 Here, L is the number of tokens in the source domain, d_h is the number of embedding dimensions, and T is the number of tokens included only in the target domain.
【0022】 When fine-tuning a VLM/LLM, it is not necessary to update all tokens in the source domain. As shown in Figure 3, only a portion of the source domain tokens (E_D) and the tokens included only in the target domain (E_T) are updated. In other words, fine-tuning does not update the entire semantic space (E_L) of the VLM/LLM; it updates only the differences E_D and E_T.
【0023】 The ENCODER receives the image tokens i_d and the text tokens x_d as input and uses AES to convert each token into an embedding. The output of the l-th block of the ENCODER is expressed by the following formula.
【0024】 【Equation】
【0025】 Here, h_V is the embedding representation of an image token, and h_T is the embedding representation of a text token.
【0026】 KEFT takes the image-side output H_V and the text-side output H_T of the ENCODER as inputs and performs alignment between the image and the text.
【0027】 KEFT uses the Modal Alignment Attention Mechanism (MAAM), which combines two types of masks, Mc and Md. An attention mechanism calculates the relevance between tokens using three elements: Query (Q), Key (K), and Value (V). Each token has Q, K, and V vectors. Taking the inner product of Q and K gives a relevance score, and multiplying V by that score gives an output weighted toward important information.
【0028】 【Equation】
【0029】 The MAAM output is a weighted sum of V, with the weights calculated from Q and K. MAAM uses the two masks Mc and Md, weighted by Bc, to dynamically optimize the reference range of tokens (the range of tokens used for calculating relevance).
【0030】 Figure 4 shows the reference ranges of the two masks Mc and Md. Figure 4 is an example in which three image tokens e_V and three text tokens e_T are input, but the numbers are not limited to this. Tokens on the referencing side are arranged vertically, and tokens on the referenced side are arranged horizontally. With mask Mc, both the image tokens e_V and the text tokens e_T reference all tokens. With mask Md, the image tokens e_V reference only image tokens e_V, and the text tokens e_T reference only text tokens e_T. Element (i, j) of each mask is set to 0 for a referable token and to negative infinity (-∞) for an unreferable token.
【0031】 Bc is obtained as follows. The ENCODER outputs H_V and H_T for the image and text tokens are concatenated, and the result is multiplied by its transpose to give a matrix A. A is passed through the sigmoid function σ to obtain the matrix I(σ(A) > μ), where I(σ(A) > μ) is 1 if σ(A) is greater than μ and 0 otherwise. In other words, Bc expresses the strength of the association between text and image tokens. In MAAM, where the association between text and image tokens is strong, the mask Mc, which references all tokens, is applied; where the association is weak, the mask Md, which references only image tokens or only text tokens, is applied.
【0032】 As a result, KEFT outputs a token representation (vector) that incorporates, in addition to the local information of each token (the original embedding representation), multimodal contextual information that takes the relationship between the image and the text into account.
【0033】 KEPO is a prompt optimizer that uses an LLM (the same one as the TEXT DECODER). It combines the text-side output of KEFT, H_T ∈ R^(e_v × d_h), with a learnable vector (soft prompt) h_p ∈ R^(d_h) to obtain the optimal prompt H_P ∈ R^(l_p × d_h).
【0034】 【Equation】
【0035】 The prompt H_P obtained from KEPO and the image-side output of KEFT, H_V ∈ R^(e_v × d_h), are combined and input into the TEXT DECODER to obtain H_L.
【0036】 【Equation】
【0037】 Immediately before the text-side output H_T of KEFT is input into KEPO, KnG converts it from the semantic spaces of the source domain (the LLM, i.e., the TEXT DECODER) and the target domain into the semantic space of the source domain, which is the same as that of the LLM.
【0038】 【Equation】
【0039】 Here, h_t is the output for each token on the text side of KEFT, i.e., an element of H_T.
【0040】 Furthermore, immediately after the TEXT DECODER, KnG transforms the semantic space of the source domain into the semantic spaces of the source and target domains. This transformation uses the transposed matrix of the above formula.
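The mask selection described for MAAM in paragraphs 【0027】 to 【0031】 can be illustrated with a minimal sketch. The formula in 【0028】 is not reproduced in this text, so the sketch makes several assumptions that are not taken from the patent: scores are scaled by the square root of the vector dimension (standard scaled dot-product attention), Bc selects between Mc and Md element by element, and all function names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    # Numerically stable form for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def modal_masks(n_img, n_txt):
    """Mc lets every token reference every token; Md allows only same-modality references."""
    n = n_img + n_txt
    m_c = [[0.0] * n for _ in range(n)]
    m_d = [[0.0 if (i < n_img) == (j < n_img) else NEG_INF for j in range(n)]
           for i in range(n)]
    return m_c, m_d

def association_gate(h, mu):
    """Bc = I(sigmoid(A) > mu), with A = H H^T for the concatenated encoder outputs H."""
    return [[sigmoid(dot(hi, hj)) > mu for hj in h] for hi in h]

def maam(q, k, v, h_v, h_t, mu=0.5):
    """Masked attention: element (i, j) uses Mc where Bc is 1 and Md otherwise (assumed rule)."""
    h = h_v + h_t  # concatenate image-side and text-side token vectors
    m_c, m_d = modal_masks(len(h_v), len(h_t))
    b_c = association_gate(h, mu)
    scale = math.sqrt(len(q[0]))  # assumption: scaled dot product
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        scores = [dot(q_i, k_j) / scale + (m_c[i][j] if b_c[i][j] else m_d[i][j])
                  for j, k_j in enumerate(k)]
        top = max(scores)  # finite: a token can always reference itself
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                        for c in range(len(v[0]))])
    return outputs, weights
```

With a threshold μ that no sigmoid value can exceed (for example μ = 1.0), every element falls back to Md and the attention weights between image and text tokens become exactly zero; with μ = 0.0, every element uses Mc and all tokens attend to each other.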
【0041】 H_L, converted into the semantic spaces of the source domain and the target domain, is input into the LM HEAD to obtain the answer y_d.
【0042】 [Model training] The objective of the Image Text Matching (ITM) training task for training KEFT is to learn image-text alignment at the instance level (e.g., entire images or sentences) rather than at the token level. Conventional methods learn matches based on words or partial information, whereas this embodiment learns matches based on the entire image and the entire text.
【0043】 In this embodiment, the degree of matching between image and text is evaluated using a triplet loss function. The triplet loss uses the image vector v_a as the anchor (reference), the vector v_x of the positive text corresponding to the anchor (text that correctly corresponds to the image), and the vector v_y of a negative text that does not correspond to the anchor (text unrelated to the image). The model is trained so that the distance between v_a and v_x becomes smaller than the distance between v_a and v_y.
【0044】 【Equation】
【0045】 Here, B is the batch, ||·|| is the distance metric, and ε is a margin guaranteeing that v_x is closer to v_a than v_y is by at least ε. Each of the vectors v_a, v_x, and v_y is calculated as the average of the hidden-layer vector representations. Sampling is performed within the batch.
【0046】 For the ENCODER outputs H_V and H_T of each image and text, vectors calculated using the mean and using the maximum were compared, and the mean, which gave the best result, is used for each vector. This mitigates the gap between visual and semantic knowledge at the instance level and optimizes the corresponding parameters.
【0047】 The objective of the Prompt Reasoning Matching (PRM) training task for training KEPO is to match prompts with their generated text so that KEPO returns the most appropriate prompt for a given input.
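The ITM triplet objective of paragraphs 【0043】 to 【0045】 can be sketched as follows. The equation in 【0044】 is not reproduced in this text, so the hinge form max(0, d(v_a, v_x) - d(v_a, v_y) + ε), the Euclidean distance, the default margin value, and the function names are assumptions chosen for illustration, not the patent's definition.

```python
import math

def mean_pool(hidden):
    """Average the token vectors (rows) of a hidden-layer output into one instance vector."""
    n = len(hidden)
    return [sum(column) / n for column in zip(*hidden)]

def distance(a, b):
    """Euclidean distance, standing in for the unspecified metric ||.||."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(batch, margin=0.2):
    """batch: list of (anchor, positive, negative) hidden-state matrices.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`, and positive otherwise.
    """
    total = 0.0
    for anchor, positive, negative in batch:
        v_a, v_x, v_y = mean_pool(anchor), mean_pool(positive), mean_pool(negative)
        total += max(0.0, distance(v_a, v_x) - distance(v_a, v_y) + margin)
    return total / len(batch)

# Toy hidden states: one image (anchor), one matching text, one unrelated text.
image = [[1.0, 0.0], [1.0, 0.2]]
good_text = [[0.9, 0.1]]
bad_text = [[-1.0, 0.0]]
loss_ok = triplet_loss([(image, good_text, bad_text)])
loss_swapped = triplet_loss([(image, bad_text, good_text)])
```

In this toy case `loss_ok` is zero because the matching text already sits well inside the margin, while swapping the positive and negative texts gives a large loss. The same shape of objective applies to PRM, with prompt vectors in place of image and text vectors.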
【0048】 As with KEFT, KEPO training uses a triplet loss function, here to optimize the prompt (the vector representation of the hidden layer at the input level) that is input to the TEXT DECODER. KEPO's triplet loss uses, as the anchor, the vector v_a of a prompt meaning "rewrite the given input into a more appropriate prompt"; the vector v_x of the correct prompt, which is the output of the LLM (KEPO) for the initial prompt; and the vector v_y of an incorrect prompt, which is the output of the LLM (KEPO) for a different initial prompt. The model is trained so that the distance between v_a and v_x becomes smaller than the distance between v_a and v_y.
【0049】 【Equation】
【0050】 Each of the vectors v_a, v_x, and v_y is calculated as the average of the hidden-layer vector representations.
【0051】 By optimizing the learnable vector h_p, KEPO learns the relationship between inputs and appropriate prompts, which enables it to generate an appropriate prompt for a given input.
【0052】 The objective function, which covers language learning, ITM, and PRM, is expressed by the following equation. L(θ) is the cross-entropy function for output tokens, which is also used in standard LLMs.
【0053】 【Equation】
【0054】 The learning unit 10 updates the parameters so as to maximize the objective function described above.
【0055】 [Operation] An example of the learning process flow is described with reference to the flowchart in Figure 5.
【0056】 In step S11, the learning unit 10 retrieves training images and text from the data storage unit 30, divides the images and text into tokens, and converts them into embedding representations.
【0057】 In step S12, the learning unit 10 updates the model parameters so as to maximize the objective function described above and stores the updated parameters in the calculation result storage unit 40.
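According to paragraph 【0022】, the update in step S12 touches only part of the embedding table: a subset of source-domain tokens (E_D) and the tokens that exist only in the target domain (E_T). The sketch below illustrates that selective update. It assumes that AES is laid out as a table of L + T rows of dimension d_h (the definition in 【0020】 is not reproduced in this text), and the row indices, learning rate, and function name are hypothetical.

```python
def masked_embedding_update(embeddings, grads, trainable_rows, lr=0.1):
    """Apply one gradient-ascent step to the selected rows only.

    Ascent is used because the source maximizes its objective; rows not in
    `trainable_rows` are copied through unchanged.
    """
    updated = []
    for index, (row, grad) in enumerate(zip(embeddings, grads)):
        if index in trainable_rows:
            updated.append([w + lr * g for w, g in zip(row, grad)])
        else:
            updated.append(list(row))
    return updated

# Hypothetical layout: L = 4 source-domain tokens followed by T = 2 target-only tokens.
L_SOURCE, T_TARGET, DIM = 4, 2, 3
aes = [[0.0] * DIM for _ in range(L_SOURCE + T_TARGET)]  # assumed (L + T) x d_h table
e_d = {1, 3}                                              # assumed subset of source tokens
e_t = set(range(L_SOURCE, L_SOURCE + T_TARGET))           # rows of target-only tokens
grads = [[1.0] * DIM for _ in aes]                        # placeholder gradients
new_aes = masked_embedding_update(aes, grads, e_d | e_t)
```

Rows 0 and 2 (source tokens outside E_D) keep their original values, while rows 1, 3, 4, and 5 move along the gradient. In a deep-learning framework the same effect is usually obtained by zeroing the gradient of the frozen rows before the optimizer step.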
【0058】 During inference, the input/output unit 20 receives an image and a question text from the user terminal 5, divides the image and the question into tokens, and feeds them into the trained model. The input/output unit 20 returns the answer obtained from the model to the user terminal 5.
【0059】 As described above, the information processing system 1 of this embodiment includes a model in which KEFT, which associates image and text tokens using MAAM, and KEPO, which optimizes prompts, are added to a VLM/LLM, and an input/output unit 20 that inputs an image and a question text to the model and outputs an answer text. When MAAM calculates the association between tokens, it uses the mask Mc, which allows all tokens to be referenced, where the association between image and text is strong at the token level, and the mask Md, which allows image tokens to reference only image tokens and text tokens to reference only text tokens, where the association is weak. In the stage before KEPO, the text-side output of KEFT is converted from the semantic spaces of the source domain and the target domain to the semantic space of the source domain; the image-side output of KEFT and the output of KEPO are input to the TEXT DECODER; and the output of the TEXT DECODER is converted from the semantic space of the source domain to the semantic spaces of the source domain and the target domain. This makes it possible for a VLM to reflect target domain-specific knowledge in the generated text.
[Explanation of Symbols]
【0060】
1 Information processing system
10 Learning unit
11 Preprocessing unit
12 Calculation processing unit
20 Input/output unit
30 Data storage unit
40 Calculation result storage unit
5 User terminal

Claims

[Claim 1] An information processing device comprising: a model obtained by adding, to a language model, a first module that associates image and text tokens using an attention mechanism and a second module that optimizes prompts to the language model; and an input/output unit that inputs an image and question text to the model and outputs answer text, wherein the attention mechanism calculates the relationship between tokens by determining the strength of the relationship between image and text tokens and applying, according to that strength, a first mask that allows all image and text tokens to be referenced and a second mask that allows image tokens to reference only image tokens and text tokens to reference only text tokens, and wherein, in the stage preceding the second module, the text-side output of the first module is converted from the semantic spaces of the source domain and the target domain to the semantic space of the source domain, the image-side output of the first module and the output of the second module are input to the decoder of the language model, and the output of the decoder is converted from the semantic space of the source domain to the semantic spaces of the source domain and the target domain.
[Claim 2] The information processing device according to claim 1, further comprising a learning unit that takes, as training data, images related to the target domain and text about those images, and updates the parameters of the model such that the distance between the vector of the image tokens and the vector of the corresponding text tokens is smaller than the distance between the vector of the image tokens and the vector of text tokens unrelated to the image, and such that the distance between the vector of a prompt instructing that the given input be rewritten into a more appropriate prompt and the output of the second module for the initial prompt is smaller than the distance between the vector of that prompt and the output of the second module for a prompt other than the initial prompt.
[Claim 3] The information processing device according to claim 2, wherein the learning unit updates some of the tokens of the source domain and the tokens that are included only in the target domain.
[Claim 4] An information processing method in which a computer inputs an image and question text into a model and outputs answer text, wherein the model is obtained by adding, to a language model, a first module that associates image and text tokens using an attention mechanism and a second module that optimizes prompts to the language model, wherein the attention mechanism calculates the relationship between tokens by determining the strength of the relationship between image and text tokens and applying, according to that strength, a first mask that allows all image and text tokens to be referenced and a second mask that allows image tokens to reference only image tokens and text tokens to reference only text tokens, and wherein, in the stage preceding the second module, the text-side output of the first module is converted from the semantic spaces of the source domain and the target domain to the semantic space of the source domain, the image-side output of the first module and the output of the second module are input to the decoder of the language model, and the output of the decoder is converted from the semantic space of the source domain to the semantic spaces of the source domain and the target domain.
[Claim 5] The information processing method according to claim 4, wherein images related to the target domain and text about those images are input as training data, and the parameters of the model are updated such that the distance between the vector of the image tokens and the vector of the corresponding text tokens is smaller than the distance between the vector of the image tokens and the vector of text tokens unrelated to the image, and such that the distance between the vector of a prompt instructing that the given input be rewritten into a more appropriate prompt and the output of the second module for the initial prompt is smaller than the distance between the vector of that prompt and the output of the second module for a prompt other than the initial prompt.
[Claim 6] The information processing method according to claim 5, wherein some of the tokens of the source domain and the tokens that are included only in the target domain are updated.