Information processing device, information processing method, and information processing program

The information processing device enhances LLMs by combining a response processing unit, preference and reason receiving units, and a learning processing unit that performs RLHF/DPO learning. It addresses the challenge of learning user preferences from limited signal information and aims to improve response accuracy.

JP2026100170A · Pending · Publication Date: 2026-06-19 · LY CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: LY CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing question-and-answer systems using large language models (LLMs) struggle to effectively learn user preferences and reasons for those preferences from limited signal information, leading to suboptimal responses.

Method used

An information processing device that includes a response processing unit, preference receiving unit, reason receiving unit, and learning processing unit to perform RLHF or DPO learning using binary preferences and reasons for preferences, enhancing the learning process with methods like SFT and continuous pre-training.

Benefits of technology

Effectively learns and personalizes user preferences and reasons for preferences, improving the responsiveness and accuracy of LLMs in generating appropriate responses.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To provide a method for effectively learning binary preferences regarding a model's responses and the reasons for those preferences. [Solution] The information processing device according to the present invention is characterized by comprising: a response processing unit that receives a prompt from a user, inputs the prompt into a model, and presents to the user two objects obtained as the output of the model; a preference receiving unit that receives a binary preference for the two objects from the user; a reason receiving unit that receives the reason for the binary preference from the user; and a learning processing unit that causes the model to generate learning data from the binary preference and the reason for the preference, and performs RLHF or DPO learning using the learning data.

Description

Technical Field

[0001] The present invention relates to an information processing apparatus, an information processing method, and an information processing program.

Background Art

[0002] A question-and-answer system technology using large language models (LLMs) has been disclosed (see Patent Document 1).

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] However, the above prior art optimizes the prompt in order to output a more appropriate response sentence using a general-purpose large language model. There is room for improvement in generating a large amount of learning data that captures the intention from a small amount of signal information.

[0005] The present application has been made in view of the above, and an object thereof is to provide a method for effectively learning the binary preference of the model's response and the reasons for the preference.

Means for Solving the Problems

[0006] The information processing device according to the present application is characterized by comprising: a response processing unit that receives a prompt from a user, inputs the prompt into a model, and presents the user with two objects obtained as the output of the model; a preference receiving unit that receives a binary preference for the two objects from the user; a reason receiving unit that receives the reason for the binary preference from the user; and a learning processing unit that causes the model to generate learning data from the binary preference and the reason for the preference, and performs RLHF or DPO learning using the learning data.

Effects of the Invention

[0007] According to one embodiment, a method can be provided for effectively learning the binary preferences and reasons for preferences of a model's response.

Brief Description of the Drawings

[0008]
[Figure 1] Figure 1 is an explanatory diagram showing an overview of the information processing system according to the embodiment.
[Figure 2] Figure 2 is an explanatory diagram illustrating an overview of a rapid learning method for achieving personalization and individuality using LLM.
[Figure 3] Figure 3 is an explanatory diagram illustrating the general case where preferences change after an experience.
[Figure 4A] Figure 4A is an explanatory diagram illustrating the structure of the training data for RLHF.
[Figure 4B] Figure 4B is an explanatory diagram illustrating the general structure of the training data for SFT.
[Figure 5] Figure 5 is an explanatory diagram illustrating the overview of the training data augmentation.
[Figure 6] Figure 6 shows an example of the configuration of a terminal device according to the embodiment.
[Figure 7] Figure 7 shows an example of the configuration of a server device according to this embodiment.
[Figure 8] Figure 8 is a flowchart showing the processing procedure according to the embodiment.
[Figure 9] Figure 9 shows an example of a hardware configuration.

Modes for Carrying Out the Invention

[0009] The following describes in detail, with reference to the drawings, embodiments for implementing the information processing device, information processing method, and information processing program according to the present application (hereinafter referred to as "embodiments"). Note that these embodiments do not limit the information processing device, information processing method, and information processing program according to the present application. Furthermore, the same parts are denoted by the same reference numerals in the following embodiments, and redundant descriptions are omitted.

[0010] [1. Overview of the Information Processing System] First, with reference to Figure 1, an overview of the information processing system according to the embodiment will be described. Figure 1 is an explanatory diagram showing an overview of the information processing system according to the embodiment. As shown in Figure 1, the information processing system 1 according to the embodiment includes a terminal device 10 and a server device 100. The terminal device 10 and the server device 100 are connected to each other via a network N, either by wired or wireless means, enabling communication between them. This allows the terminal device 10 to cooperate with the server device 100. The network N is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet.

[0011] Terminal device 10 is an information processing device used by user U. For example, terminal device 10 may be a smart device such as a smartphone or tablet, a PC (Personal Computer) such as a desktop or notebook (laptop), a mobile phone such as a feature phone, a PDA (Personal Digital Assistant), a game console or AV equipment with communication functions, an information appliance or digital appliance, a car navigation system, a wearable device such as a smartwatch, head-mounted display, or smart glasses. Alternatively, terminal device 10 may be a house or building compatible with the Internet of Things (IoT), a car, a home appliance, an electronic device, etc.

[0012] In this embodiment, the terminal device 10 is a smart device such as a smartphone or tablet used by user U, and is a mobile terminal device that can communicate with any server device via wireless communication networks such as LTE (Long Term Evolution), 4G (4th Generation), 5G (5th Generation), Bluetooth (registered trademark), or wireless LAN. The terminal device 10 also has a screen such as a liquid crystal display with touch panel functionality, and accepts various operations on displayed data such as content from user U using a finger or stylus, such as tapping, sliding, and scrolling. Operations performed on the area of the screen where content is displayed may also be considered as operations on the content. Furthermore, the terminal device 10 may be an information processing device such as a desktop PC or notebook PC, not just a smart device.

[0013] The server device 100 is, for example, a computer such as a PC or blade server, or a mainframe or workstation. The server device 100 may also be implemented through cloud computing.

[0014] In this embodiment, the server device 100 is an information processing device that works in conjunction with each user U's terminal device 10 and provides each user U's terminal device 10 with API (Application Programming Interface) services for various applications (hereinafter referred to as "apps") and various data, and is implemented by a computer or cloud system.

[0015] Further, the server device 100 may be an information processing device that provides some online service to the terminal device 10 of each user U. For example, as an online service, the server device 100 may provide services such as Internet connection, search service, chat service, dialogue service by voice, image, video, etc., SNS (Social Networking Service), electronic commerce (EC: Electronic Commerce), electronic payment, online game, online banking, online trading, accommodation and ticket reservation, video and music distribution, news, map, route search, route guidance, route information, operation information, weather forecast, etc. In reality, the server device 100 may cooperate with various servers that provide the above online services and mediate the online services, or be responsible for the processing of the online services.

[0016] In addition, the server device 100 can acquire user information regarding the user U. For example, as user information, the server device 100 acquires attribute information of the user U such as gender, age, and residential area. The server device 100 can also acquire the demographic, psychographic, geographic, and behavioral attributes of the user U. The server device 100 may also acquire, as user information, the marketing segment or persona to which the user U belongs. The server device 100 then stores and manages the attribute information of the user U together with identification information (a user ID, etc.) indicating the user U.

[0017] In addition, the server device 100 acquires various types of history information (log data) indicating the actions of the user U from the terminal device 10 of the user U or from various servers, etc. based on the user ID, etc. For example, the server device 100 acquires a location history, which is a history of the location and time of the user U, from the terminal device 10. In addition, the server device 100 acquires a search history, which is a history of search queries input by the user U, from a search server (search engine). In addition, the server device 100 acquires a browsing history, which is a history of the content browsed by the user U, from a content server. In addition, the server device 100 acquires a purchase history (settlement history), which is a history of the user U's product purchases and settlement processes, from an e-commerce server or a settlement processing server. In addition, the server device 100 may acquire a listing history or a sales history, which is a history of the user U's listings on the marketplace, from an e-commerce server or a settlement processing server. In addition, the server device 100 acquires a posting history, which is a history of the user U's posts, from a posting server or an SNS server that provides a word-of-mouth posting service. Note that each of the above-mentioned various servers, etc. may be the server device 100 itself. That is, the server device 100 may function as each of the above-mentioned various servers, etc.

[0018] The number of each device included in the information processing system 1 shown in Figure 1 is not limited to that shown. For example, Figure 1 shows only one terminal device 10 for simplicity, but this is merely an example, and two or more terminal devices 10 may be used.

[0019] [2. Two-choice + Reason RLHF Augmentation] In this embodiment, a powerful tuning method for learning personalization and individuality using an LLM (Large Language Model) or an LMM (Large Multimodal Model) is proposed. The LLM processes text information. The LMM integrates and processes different types of information such as text, images, and voice. Note that the LMM may be an LMM trained on the basis of an LLM.

[0020] As exemplified by the scaling law, it has been shown that performance continues to improve as the amount of training data and the model size increase (a saturation point has not yet been reached). In other words, preparing a large amount of training data is one of the keys to improving performance.

[0021] When providing services to users on the internet, the amount of signal information (what information, products, and services they like or dislike) is relatively small. Especially at the beginning of service use, there is little signal information (known as the cold start problem), making it crucial to quickly learn the user's preferences. In other words, one of the keys is to generate a large amount of training data that captures the user's intentions from limited signal information.

[0022] In services using LLM, it is expected that services such as avatars and agents where the LLM itself has personality will grow, in addition to understanding user preferences. Such LLMs will also need to be able to quickly learn and implement their own personalities.

[0023] Among LLM technologies, there are methods such as RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) that compare two objects obtained as LLM responses to a prompt and perform alignment learning on paired datasets of preferred objects (chosen) and non-preferred objects (rejected). However, RLHF and DPO are just examples. Many similar methods exist, and the approach discussed here is not limited to RLHF and DPO but can be applied generally to methods with similar objectives.
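As background for the chosen/rejected pairing described above, the following is a minimal sketch of the commonly used DPO objective for a single pair. The source does not specify a loss function; the function name dpo_loss and the default beta value are assumptions for illustration, and the inputs are assumed to be summed sequence log-probabilities under the trained model and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence log-probabilities; beta=0.1 is an assumed default.
    """
    # How much more the policy favours "chosen" over "rejected" than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
improved = dpo_loss(-3.0, -5.0, -5.0, -5.0)
```

The loss shrinks as the model assigns relatively more probability to the chosen object than to the rejected one, which is the behavior the paired dataset is meant to induce.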

[0024] In this embodiment, the preference between a preferred object (chosen) and a non-preferred object (rejected) is referred to as a "two-choice" preference, and a method is proposed that considers not only the comparison of response content but also the preference evaluation after actually experiencing the response content. For example, when an LLM (or LMM) is prompted with "Recommend a restaurant," the method considers not only the two-choice evaluation of the response content (referred to as the "response evaluation") but also the two-choice evaluation after actually going to the restaurant and eating there (referred to as the "post-experience evaluation"). The object not preferred in the two-choice evaluation is normally assumed not to have been experienced, but if both objects have been experienced, a "post-experience two-choice" evaluation can also exist. In addition to these two-choice evaluations, this embodiment, as shown in Figure 1, proposes a method that compares the two responses and takes into account the "reasons" for preference and the "reasons" for non-preference. Not all reasons need to be present; only the reasons that can be obtained are used. For example, it is acceptable for only the response evaluation to exist and the post-experience evaluation not to exist, or conversely, for only the post-experience evaluation to exist and the response evaluation not to exist.

[0025] Furthermore, the server device 100 may implement the above method using AI (Artificial Intelligence) such as GPT (Generative Pre-trained Transformer). GPT is a text generation AI and a language model capable of generating text using natural language processing.

[0026] For example, as shown in Figure 1, the server device 100 receives a prompt from the user U's terminal device 10, inputs it to the LLM (or LMM), and presents the two objects (or their descriptions) output as a response from the LLM to the user U's terminal device 10 (step S1).

[0027] Next, the server device 100 receives preferences (a binary response) from user U's terminal device 10 for two objects: a preferred object (chosen) and an undesirable object (rejected) (step S2).

[0028] Next, the server device 100 receives from the user U's terminal device 10, as an evaluation of the LLM response (response evaluation), favorable reasons and negative reasons for both preferred items (chosen) and unfavorable items (rejected) (step S3). Note that, regarding the reasons, it is not necessary for all reasons to be present, such as having only favorable reasons and no negative reasons, or even having no reasons at all.
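The data collected in steps S1 to S3 can be pictured as a single record. The following is a minimal sketch under assumed names (Reasons, PreferenceRecord, and all field names are hypothetical and do not appear in the source); it only illustrates that every reason is optional while the binary preference itself is required.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reasons:
    """Favorable / negative reasons for one object; either may be missing."""
    positive: Optional[str] = None
    negative: Optional[str] = None


@dataclass
class PreferenceRecord:
    """One response evaluation: prompt, two presented objects, choice, reasons."""
    prompt: str
    object_a: str
    object_b: str
    chosen: str  # "a" or "b"
    chosen_reasons: Reasons = field(default_factory=Reasons)
    rejected_reasons: Reasons = field(default_factory=Reasons)

    def __post_init__(self):
        if self.chosen not in ("a", "b"):
            raise ValueError("chosen must be 'a' or 'b'")

    @property
    def rejected(self) -> str:
        # In a binary preference the rejected object is simply the other one.
        return "b" if self.chosen == "a" else "a"


record = PreferenceRecord(
    prompt="Recommend a restaurant",
    object_a="Description of Restaurant A",
    object_b="Description of Restaurant B",
    chosen="a",
    chosen_reasons=Reasons(positive="quiet seating"),
)
```

Here only a favorable reason for the chosen object was supplied; the remaining reason fields stay empty, matching the note that not all reasons need to be present.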

[0029] Next, the server device 100 inputs the preference and reason data obtained so far (from the binary response through the response evaluation) into the LLM as source data to generate augmented data, and then performs RLHF / DPO learning using the source data and the generated augmented data (step S4).

[0030] Prior to RLHF / DPO learning, the SFT (Supervised Fine-Tuning) learning step is generally performed. In other words, SFT and RLHF / DPO learning are performed together. However, SFT can be omitted. SFT is a method for learning data in pairs of Question (Q) and Answer (A). The Question (Q) part consists of text data of the prompt entered by user U and the full text of the two answer choices (including detailed explanatory information about the content of each choice that is not displayed to user U). The Answer (A) part consists of textual explanations of user U's evaluation reasons at the time of the response and after the experience, followed by the result of which of the two choices user U preferred. The preference result should be a proper noun or symbol that clearly distinguishes the two choices. For example, for a restaurant, this could be the main location and the official name of the restaurant, or if symbols are attached to the choices, those symbols can also be used. To summarize SFT, the task involves training a model to predict user U's reasons for preference and the outcome, with Q being the user U's prompt and the two-choice response options.
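The Question/Answer split described above can be sketched as follows. The function name build_sft_example, its parameters, and the exact line prefixes are assumptions for illustration; the source only fixes which information belongs to Q (the prompt and the full text of both choices) and which to A (the reasons, followed by the preference result).

```python
def build_sft_example(prompt, options, chosen_label,
                      response_reasons=None, experience_reasons=None):
    """Build one SFT sample.

    options maps a distinguishing label (e.g. a symbol or official name)
    to the full description of that choice, including details not shown to the user.
    """
    if chosen_label not in options:
        raise ValueError("chosen_label must be one of the option labels")

    # Q part: the user's prompt followed by the full text of both choices.
    question = [f"U: {prompt}"]
    question += [f"A[{label}]: {text}" for label, text in options.items()]

    # A part: available reasons first, then the preference result last.
    answer = []
    if response_reasons:
        answer.append(f"P: {response_reasons}")
    if experience_reasons:
        answer.append(f"E: {experience_reasons}")
    answer.append(f"Result: {chosen_label}")

    return {"question": "\n".join(question), "answer": "\n".join(answer)}


example = build_sft_example(
    prompt="Recommend a restaurant",
    options={"A": "Full description of Restaurant A", "B": "Full description of Restaurant B"},
    chosen_label="A",
    response_reasons="A looked quieter; B seemed far from the station.",
)
```

Placing the result after the reasons means a model trained on this sample predicts the reasons before committing to a choice, in line with the ordering the paragraph describes.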

[0031] Furthermore, conducting continuous pre-training prior to SFT is particularly effective in terms of understanding new information (for example, restaurants that did not exist at the time of initial learning for LLM). In this case, continuous pre-training uses the same data created for SFT, but without distinguishing between questions and answers, and trains the model on it (next token prediction). The difference from SFT is that continuous pre-training trains the model on next token prediction for all sentences, including the parts corresponding to questions, whereas SFT treats the questions as given and trains only on the subsequent answer parts using next token prediction. Note that continuous pre-training is optional.
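The difference between the two objectives comes down to which token positions contribute to the next-token loss. The sketch below shows this with plain token-id lists; the function name make_labels and the mode strings are hypothetical, and the ignore value of -100 is an assumption borrowed from the common convention of marking positions to be excluded from a cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def make_labels(question_ids, answer_ids, mode):
    """Return (input_ids, labels) for continuous pre-training or SFT.

    mode="cpt": every token is a prediction target (question and answer alike).
    mode="sft": question tokens are treated as given; only answer tokens are targets.
    """
    input_ids = list(question_ids) + list(answer_ids)
    if mode == "cpt":
        labels = list(input_ids)
    elif mode == "sft":
        labels = [IGNORE_INDEX] * len(question_ids) + list(answer_ids)
    else:
        raise ValueError("mode must be 'cpt' or 'sft'")
    return input_ids, labels


cpt_inputs, cpt_labels = make_labels([11, 12, 13], [21, 22], mode="cpt")
sft_inputs, sft_labels = make_labels([11, 12, 13], [21, 22], mode="sft")
```

Both modes see the same input text; only the set of positions that are scored changes.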

[0032] Next, after user U has experienced one or both of the two objects, the server device 100 receives a post-experience evaluation from user U's terminal device 10, requesting a second preference for the two objects, indicating which are preferred (chosen) and which are undesirable (rejected) (step S5). If user U has only experienced one of the two objects, the server device 100 may request a second preference only for the object that user U experienced. In this case, the evaluation of the object that was not experienced may be automatically the opposite of the evaluation of the experienced object. For example, if the experienced object is deemed undesirable (rejected) in the second preference request, the object that was not experienced may be designated as preferred (chosen). Furthermore, the results of the second preference request after the experience may be the same as or different from the results of the LLM response.
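The rule for filling in the object that was not experienced can be sketched as below. The function name complete_post_experience and the label and verdict strings are hypothetical; the source says only that the unexperienced object may automatically receive the opposite evaluation, so this is one possible reading rather than a required behavior.

```python
OPPOSITE = {"chosen": "rejected", "rejected": "chosen"}


def complete_post_experience(verdicts, labels=("A", "B")):
    """Fill in the post-experience two-choice result.

    verdicts maps each experienced object label to "chosen" or "rejected".
    Returns None when nothing was experienced.
    """
    known = {label: v for label, v in verdicts.items() if label in labels}
    if not known:
        return None
    if any(v not in OPPOSITE for v in known.values()):
        raise ValueError("verdicts must be 'chosen' or 'rejected'")
    if len(known) == 1:
        # Only one object was experienced: give the other the opposite verdict.
        (label, verdict), = known.items()
        other = labels[1] if label == labels[0] else labels[0]
        known[other] = OPPOSITE[verdict]
    if sorted(known.values()) != ["chosen", "rejected"]:
        raise ValueError("exactly one object must be chosen")
    return known


only_a_visited = complete_post_experience({"A": "rejected"})
```

In this example the user visited only Restaurant A and disliked it, so Restaurant B is designated as chosen.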

[0033] Next, the server device 100 receives from user U's terminal device 10, as the post-experience evaluation, favorable and negative reasons for the preferred object (chosen) and the non-preferred object (rejected) resulting from the above-mentioned re-selection (step S6). Note that if user U has only experienced one of the two objects, the server device 100 may only accept reasons for the object that user U experienced.

[0034] Next, the server device 100 inputs the data on preferences and reasons after the experience (post-experience data) into the LLM to generate augmented data, and then performs RLHF / DPO learning using the post-experience data and the generated augmented data (step S7). In addition, as in step S4, it is more effective, as a general rule, to perform an SFT (Supervised Fine-Tuning) learning step prior to RLHF / DPO learning, and to add continuous pre-training.

[0035] [2-1. Methods for effectively learning binary preferences and reasons for those preferences] Referring to Figure 2, a rapid learning method for achieving personalization and individuality using LLM (or LMM) will be explained. Figure 2 is an explanatory diagram showing an overview of a rapid learning method for achieving personalization and individuality using LLM. In this embodiment, as a rapid learning method for achieving personalization and individuality using LLM, we propose a method that effectively learns binary preferences + reasons for preferences, particularly using RLHF / DPO as the basic technology.

[0036] For example, as shown in Figure 2, the server device 100 receives a prompt from the user's terminal device 10, "Please recommend a restaurant for this purpose," and inputs this prompt into the LLM. The server device 100 then receives a "description of Restaurant A" and a "description of Restaurant B" as responses from the LLM and presents them to the user's terminal device 10. The "description of Restaurant A" and the "description of Restaurant B" each contain written information about the restaurant, such as genre, atmosphere, price range, menu, location, interior, exterior, reviews, etc. In the case of an LMM (Multimodal Model) response, it may also include information such as images and audio (or video) in addition to text information. Note that the above information is just an example. In reality, it is not limited to the above information.

[0037] Next, the server device 100 receives a choice of two responses from the user's terminal device 10, which has referred to the "Description of Restaurant A" and the "Description of Restaurant B": a preferred item (chosen) and an undesirable item (rejected). Here, "Restaurant A" is the preferred item (chosen), and "Restaurant B" is the undesirable item (rejected).

[0038] Furthermore, the server device 100 receives, as part of the response evaluation, favorable reasons (referred to as [positive]) and negative reasons (referred to as [negative]) for both preferred and unfavorable items (rejected) from the user's terminal device 10. Here, for the preferred item (chosen), "Restaurant A," the server device 100 receives [positive] reasons for preference and [negative] points of concern. Also, for the unfavorable item (rejected), "Restaurant B," the server device 100 receives [positive] points that seem good and [negative] reasons for inferiority (reasons for not choosing it).

[0039] Subsequently, the server device 100 receives post-experience preferences from the user's terminal device 10, indicating which restaurant was preferred (chosen) and which was rejected. Here, "Restaurant A" is considered the preferred (chosen) restaurant, and "Restaurant B" is considered the rejected restaurant.

[0040] Furthermore, if the user only uses "Restaurant A," the server device 100 may accept a choice between preferred (chosen) and undesirable (rejected) items for "Restaurant A" from the user's terminal device 10 after the user's experience with "Restaurant A." The same applies if the user only uses "Restaurant B."

[0041] Next, the server device 100 receives positive and negative reasons from the user's terminal device 10 as part of the post-experience evaluation for both preferred (chosen) and non-preferred (rejected) items. Reasons can be entered manually by the user or selected from pre-prepared options. Here, the server device 100 receives positive and negative feedback for "Restaurant A," which is the preferred (chosen) item. The server device 100 also receives positive and negative feedback for "Restaurant B," which is the non-preferred (rejected) item.

[0042] Furthermore, if the user only uses "Restaurant A," the server device 100 may only receive feedback on "Restaurant A," specifically regarding positive aspects and negative aspects. The same applies if the user only uses "Restaurant B."

[0043] Here, we will explain the meaning of the abbreviations shown in Figure 2. U (User Input Prompt) indicates the prompt entered by the user. A (Agent Answer) indicates the LLM's response. P (Preference for LLM responses (user input prompt)) indicates the user's evaluation of the two objects presented in the LLM response. E (Experience-based preference (user input prompt)) indicates the user's evaluation of the two objects after actually experiencing them. C (Chosen) indicates the preferred object of the two objects presented in the LLM response. R (Rejected) indicates the non-preferred object of the two objects presented in the LLM response. G (Good point) indicates the positive reasons ([positive]) in the user's evaluation. B (Bad point) indicates the negative reasons ([negative]) in the user's evaluation.

[0044] Here, we will explain the processing flow shown in Figure 2. The server device 100 performs in-context learning on the data from the binary response to the response evaluation (original data) and causes the LLM to generate similar RLHF and SFT training data (generated augmented data). At this time, the server device 100 causes the LLM to generate n data items from one data item, where n is a parameter representing the augmentation multiplier.
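The augmentation step can be sketched as an in-context prompt that shows the LLM one original record and asks for n similar ones. Everything named here is an assumption: the prompt wording, the "---" separator, and the generate callable, which stands in for whatever LLM call an implementation actually uses.

```python
from typing import Callable, List

SEPARATOR = "\n---\n"  # assumed delimiter between generated records


def build_augmentation_prompt(seed_record: str, n: int) -> str:
    """Compose an in-context prompt from one original record."""
    return (
        "Below is one preference record (prompt, two responses, choice, reasons).\n"
        f"{seed_record}\n"
        f"Write {n} new records in the same format that reflect the same user "
        "preferences. Separate records with a line containing only '---'."
    )


def augment(seed_record: str, n: int, generate: Callable[[str], str]) -> List[str]:
    """Ask the model for n similar records and split its output."""
    raw = generate(build_augmentation_prompt(seed_record, n))
    records = [part.strip() for part in raw.split(SEPARATOR) if part.strip()]
    return records[:n]  # keep at most n, the augmentation multiplier


# A stub in place of a real LLM call, for demonstration only.
def fake_generate(prompt: str) -> str:
    return "record 1\n---\nrecord 2\n---\nrecord 3"


augmented = augment("U: ... A: ... P: ...", n=2, generate=fake_generate)
```

Passing the model call in as a parameter keeps the sketch independent of any particular LLM API.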

[0045] Next, the server device 100 performs RLHF / DPO training using the original data and the generated extended data. Prior to RLHF / DPO training, SFT is performed as a general rule.

[0046] Next, when the server device 100 has the data from the post-experience two-choice selection to the post-experience evaluation (post-experience data), it causes the LLM to expand the post-experience data into m data items and similarly performs RLHF / DPO training. The basic setting is m > n, placing more emphasis on the post-experience evaluation.

[0047] [2-2. If preferences change after the experience] Referring to Figure 3, the case where preferences change after an experience will be explained. Figure 3 is an explanatory diagram showing an overview of the case where preferences change after an experience. The meaning of the abbreviations in Figure 3 is the same as in Figure 2.

[0048] The process from prompting to response evaluation is the same as in Figure 2. That is, the server device 100 accepts response evaluations for "Restaurant A" as a preferred target (chosen) and "Restaurant B" as an undesirable target (rejected).

[0049] Subsequently, the server device 100 receives post-experience preferences from the user's terminal device 10, which has used "Restaurant A" and "Restaurant B," as a choice between a preferred item (chosen) and an undesirable item (rejected). At this point, the user changes "Restaurant A" to an undesirable item (rejected) and "Restaurant B" to a preferred item (chosen).

[0050] Next, the server device 100 receives positive and negative reasons from the user's terminal device 10 as post-experience evaluations for both preferred (chosen) and non-preferred (rejected) items. Here, for the non-preferred (rejected) item "Restaurant A," the server device 100 receives positive points and negative reasons for the negative post-experience evaluation (reasons for the poor evaluation). The server device 100 also receives positive reasons for the preferred (chosen) item "Restaurant B" and negative points that were of concern.

[0051] Here, we will explain the processing flow shown in Figure 3. The server device 100 performs in-context learning on the data from the binary response to the response evaluation (original data) and causes the LLM to generate similar RLHF and SFT training data (generated extended data). At this time, the server device 100 causes the LLM to generate n data items from one data item, where n is a parameter representing the augmentation multiplier.

[0052] Next, the server device 100 performs RLHF / DPO learning using the original data and the generated extended data. Note that prior to RLHF / DPO learning, SFT is preferably performed. The processing flow up to this point is the same as that in FIG. 2.

[0053] Next, when the post-experience data (data from post-experience binary choice to post-experience evaluation) is available, the server device 100 causes the LLM to expand the post-experience data into r data items and similarly performs RLHF / DPO learning. The basic setting is r > m > n, placing more emphasis on the post-experience evaluation.
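One way to read the multipliers n, m, and r is as a small lookup keyed on whether post-experience data exists and whether the preference changed. The sketch below follows that reading, which is an interpretation of Figures 2 and 3 rather than something the text states outright; the function name augmentation_plan and the default values 2, 4, and 8 are hypothetical.

```python
def augmentation_plan(response_choice, post_choice=None, n=2, m=4, r=8):
    """Return how many augmented items to generate per original item.

    response_choice: object chosen from the LLM response alone.
    post_choice: object chosen after the experience, or None if not available.
    """
    if not (r > m > n >= 1):
        raise ValueError("expected r > m > n >= 1")
    plan = {"response": n}  # response-evaluation data is always expanded n times
    if post_choice is not None:
        # Assumed reading: m when the preference held, r when it flipped.
        plan["post_experience"] = m if post_choice == response_choice else r
    return plan


unchanged = augmentation_plan("Restaurant A", "Restaurant A")
flipped = augmentation_plan("Restaurant A", "Restaurant B")
```

Under this reading, a preference that reverses after the experience is weighted most heavily, since it carries information the response text alone could not supply.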

[0054] [2-3. Structure of Learning Data] Referring to FIGS. 4A and 4B, the structure of the learning data will be described. FIG. 4A is an explanatory diagram showing an overview of the structure of the learning data for RLHF. FIG. 4B is an explanatory diagram showing an overview of the structure of the learning data for SFT. The learning data is a single text formed by separating the response text and the reasons for the binary choice with delimiters or keys. The meanings of the abbreviations shown in FIGS. 4A and 4B are the same as those in FIG. 2.

[0055] For example, as shown in FIG. 4A, the server device 100 causes the LLM (LMM is also acceptable) to generate the following preferred target (chosen) side dataset example (text data·chat_format) and non-preferred target (rejected) side dataset example (text data·chat_format) as the learning data for RLHF.

[0056] <Learning data for RLHF: Example of dataset for the preferred (chosen) side>
U: Please introduce a restaurant with the following conditions
A: I recommend the restaurant "C{store name}".
The reasons are as follows.
{When E information exists. Description for the Chosen side}
The reason after actual experience is that it meets your preference in terms of ECG. However, there is also a concerning point about ECB.
{Additional part when E information exists. Up to here}
The reasons for preference based on only response information are as follows. The reason is that it is PCG and I think it meets your preference. However, there is also a concerning point about PCB.
This restaurant "C{store name}" offers authentic Italian cuisine along with a stylish in-store atmosphere... {Store introduction text}... The price range is {¥} for lunch and {¥} for dinner. Word-of-mouth reviews mention {word-of-mouth information} and so on. {Since this text is displayed to user U, do not include too much information}

[0057] <Learning data for RLHF: Example of dataset for the non-preferred (rejected) side>
U: Please introduce a restaurant that meets the {conditions described}
A: I recommend the restaurant "R{store name}".
The reasons are as follows.
{When E information exists. Description for the Rejected side}
The reason after actual experience is that it meets your preference in terms of ERG. However, there is also a concerning point about ERB.
{Additional part when E information exists. Up to here}
The reasons for preference based on only response information are as follows. The reason is that it is PRG and I think it meets your preference. However, there is also a concerning point about PRB.
This restaurant "R{store name}" offers authentic Italian cuisine along with a stylish in-store atmosphere... {Store introduction text}... The price range is {¥} for lunch and {¥} for dinner. Word-of-mouth reviews mention {word-of-mouth information} and so on. {Since this text is displayed to user U, do not include too much information}
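The two templates above differ only in the side letter (C or R) used in the store name and in the reason keys. The following is a minimal sketch of rendering both sides from one set of reasons keyed in that style (PCG, PCB, ECG, ECB, PRG, PRB, ERG, ERB). The function names, the shortened sentence wording, and the sample values are assumptions; experience sentences are emitted only when E information exists.

```python
def _reason_lines(intro, good, bad):
    """Sentences for one evaluation stage, skipping reasons that are missing."""
    lines = []
    if good:
        lines.append(f"{intro} it meets your preference in terms of {good}.")
    if bad:
        lines.append(f"However, there is also a concerning point about {bad}.")
    return lines


def render_side(prompt, side, store, description, reasons):
    """Render one side ("C" for chosen, "R" for rejected) as a single text."""
    lines = [f"U: {prompt}", f'A: I recommend the restaurant "{store}".']
    # Post-experience reasons come first and appear only when E information exists.
    lines += _reason_lines("After actual experience,",
                           reasons.get(f"E{side}G"), reasons.get(f"E{side}B"))
    lines += _reason_lines("Based on the response alone,",
                           reasons.get(f"P{side}G"), reasons.get(f"P{side}B"))
    lines.append(description)
    return "\n".join(lines)


def build_pair(prompt, chosen, rejected, reasons):
    """chosen / rejected are (store name, description) tuples."""
    return {
        "chosen": render_side(prompt, "C", chosen[0], chosen[1], reasons),
        "rejected": render_side(prompt, "R", rejected[0], rejected[1], reasons),
    }


pair = build_pair(
    "Recommend a restaurant",
    chosen=("Trattoria X", "Authentic Italian cuisine in a stylish interior."),
    rejected=("Bistro Y", "Casual French dishes near the park."),
    reasons={"PCG": "quiet seating", "PRB": "distance from the station"},
)
```

The resulting chosen and rejected texts share the same user prompt, which is what a paired preference dataset requires.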

[0058] Also, as shown in Figure 4B, the server device 100 causes the LLM (or LMM) to generate the following SFT dataset example (text data·chat format) as the learning data for SFT, together with the above-described RLHF learning data.

[0059] <Example of SFT dataset>
{From here, the Q part of SFT}
U: Please introduce a {condition description} restaurant.
A: The restaurant "C{store name}" offers authentic Italian cuisine with a stylish interior atmosphere... {store introduction text}... The price range is {¥} for lunch and {¥} for dinner. {Also, summarize and include in the text the store details that are not directly shown to user U (the same applies for SFT).} It is said in reviews that {review information}, etc. {Also summarize and include the review information that is not directly shown to user U.}
The restaurant "R{store name}" has all the explanatory texts in the same way as above...
{Up to here, the Q part of SFT}
{From here, the A part of SFT}
The reasons for preference based only on the response information are as follows.
P: The good point of Restaurant C is PCG, and the bad point of Restaurant R is PRB, so I chose C. However, there is also a concerning point PCB in C, and R, which I did not choose, has a good point PRG that seems nice.
The evaluation after actually experiencing the restaurant is as follows.
E: The good point of Restaurant C was ECG, and the concerning point was ECB. For Restaurant R, the ERB point was not to my liking, but it also had a good point ERG.
Based on the above, the result of comparing the two is to choose Restaurant C.
{Up to here, the A part of SFT}

[0060] Regarding the above SFT dataset example, the differences from RLHF / DPO and the characteristics unique to SFT will be explained.

[0061] SFT datasets are generated in the form of Question (Q) and Answer (A). Here, the Question (Q) portion of the SFT is referred to as the "prompt response," and the Answer (A) portion is referred to as the "preference reason." The Question (Q) portion of the SFT (prompt response) ranges from U to A. The Answer (A) portion of the SFT (preference reason) covers P (the predicted reason) and E (the evaluation after the experience).

[0062] In the section that reads, "Restaurant 'R{Store Name}' has the same description as above...", both of the two options are listed side by side. That is, both the preferred option (chosen) and the undesirable option (rejected) are listed together.

[0063] In the section "Preference reasons based solely on response information," all of the C and R items for P and all of the C and R items for E are listed (PCG, PCB, PRG, and PRB for P; ECG, ECB, ERG, and ERB for E). Information that is unavailable, such as E information, may be omitted.

[0064] In the "Evaluation after actually experiencing the restaurant" section, if the preference changed after the experience, the change is described as follows: "After the experience, my evaluation changed, and I now prefer Restaurant R over Restaurant C. The reason is..."

[0065] Finally, the final selection results, including the reasoning behind them, will be stated. If the evaluation changed after the experience, the updated final evaluation will be stated.
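The SFT layout described in paragraphs [0061] to [0065] can be sketched as a small builder. This is an illustrative sketch under stated assumptions, not the method of the application: the function name, the dictionary keys `question` and `answer`, and the use of the codes PCG, PCB, PRG, PRB, ECG, ECB, ERG, and ERB as dictionary keys are hypothetical choices.

```python
from typing import Dict, Optional


def build_sft_record(prompt: str, intro_c: str, intro_r: str,
                     p: Dict[str, str], e: Optional[Dict[str, str]] = None,
                     final_choice: str = "C") -> Dict[str, str]:
    """Build one SFT example: Q = prompt plus both responses, A = reasons plus final choice."""
    # Q part: both options are listed side by side (chosen and rejected together).
    question = f"U: {prompt}\nA: {intro_c}\n{intro_r}"
    answer = [
        "The reasons for preference based only on the response information are as follows.",
        f"P: The good point of Restaurant C is {p['PCG']}, and the bad point of "
        f"Restaurant R is {p['PRB']}, so I chose C. However, C also has a concerning "
        f"point, {p['PCB']}, and R has a good point, {p['PRG']}.",
    ]
    if e is not None:  # E information may be omitted when it is unavailable
        answer.append("The evaluation after actually experiencing the restaurant is as follows.")
        answer.append(
            f"E: The good point of Restaurant C was {e['ECG']}, and the concerning point "
            f"was {e['ECB']}. For Restaurant R, {e['ERB']} was not to my liking, but it "
            f"also had a good point, {e['ERG']}."
        )
    if final_choice == "R":
        # The initial choice was C, so choosing R means the evaluation changed.
        answer.append("After the experience, my evaluation changed, and I now prefer "
                      "Restaurant R over Restaurant C.")
    answer.append("Based on the above, the result of comparing the two is to choose "
                  f"Restaurant {final_choice}.")
    return {"question": question, "answer": "\n".join(answer)}
```

The final line of the answer always states the updated selection, which corresponds to the requirement that the final evaluation be stated last.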

[0066] Furthermore, if data augmentation is performed using the structure shown in Figure 2 or Figure 3 and the result is then converted to the above format, the augmented data can be used for both SFT and RLHF / DPO.

[0067] [2-4. Expanding training data (methods for increasing data volume through generation)] The augmentation of the training data will be described with reference to Figure 5. Figure 5 is an explanatory diagram showing an overview of the augmentation of the training data.

[0068] In this embodiment, the server device 100 inputs both the actual store data and the user's two-choice reasoning results into the LLM (or LMM) to generate augmented training data. For example, as shown in Figure 5, the server device 100 causes the LLM to generate similar RLHF and SFT training data (generated augmented data) for pairs of preferred (chosen) and non-preferred (rejected) objects.

[0069] At this time, the server device 100 gathers training data for several users (multiple responses and two-choice reasons), including training data for other users' chosen and rejected pairs. However, to implement the program, training data for at least one user is sufficient. In other words, it is sufficient to gather training data for one or more users.

[0070] Furthermore, the server device 100 extracts text data from the actual store data, including customer reviews. This text data is simply the same store data as in the training data, converted into sentences.

[0071] The server device 100 then generates the following prompt from the training data of several people and the text data of the actual store.

[0072] <prompt> For the stores listed in the [actual store data], multiple people have provided the following reasons for selecting or not selecting a particular store: Here is a list of [multiple people's two-choice reasons]. *List both chosen and rejected entries. *Specify the output format. Based on the above, consider the store's characteristics, strengths, and weaknesses, and generate, in the specified output format, diverse data that differs from the [multiple people's two-choice reasons] but represents how other people might evaluate the store.

[0073] The server device 100 sends the above prompt to the LLM and has it generate data (generated augmented data) in exactly the same format as the training data described above. As a result, the server device 100 obtains augmented training data for chosen and rejected pairs.
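The prompt assembly in paragraphs [0071] to [0073] can be sketched as plain string construction. The sketch below is illustrative only: the function name, the `label` / `store` / `reason` keys, and the exact wording of the assembled prompt are assumptions, and the call that sends the prompt to an LLM is deliberately left out.

```python
from typing import Dict, List


def build_augmentation_prompt(store_text: str, reasons: List[Dict[str, str]],
                              output_format: str) -> str:
    """Assemble the augmentation prompt from store text and collected two-choice reasons."""
    # Both chosen and rejected entries are listed, as the prompt template requires.
    listed = "\n".join(
        f"- [{r['label']}] {r['store']}: {r['reason']}" for r in reasons
    )
    return (
        "For the stores listed in the actual store data below, multiple people have "
        "provided reasons for selecting or not selecting a particular store.\n\n"
        f"[Actual store data]\n{store_text}\n\n"
        f"[Multiple people's two-choice reasons]\n{listed}\n\n"
        f"[Output format]\n{output_format}\n\n"
        "Based on the above, consider the store's characteristics, strengths, and "
        "weaknesses, and generate diverse data that differs from the reasons listed "
        "above but represents how other people might evaluate the store."
    )


# Hypothetical sample input gathered from two users.
sample_prompt = build_augmentation_prompt(
    "Trattoria C: authentic Italian cuisine, stylish interior.",
    [
        {"label": "chosen", "store": "Trattoria C", "reason": "quiet atmosphere"},
        {"label": "rejected", "store": "Osteria R", "reason": "noisy interior"},
    ],
    "prompt / chosen / rejected, one record per line",
)
```

Passing the output format explicitly is what lets the generated data be parsed back into the same chosen and rejected structure as the original training data.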

[0074] [2-5. Supplement] As described above, this embodiment proposes an augmentation of RLHF that uses a binary choice plus its reason. The server device 100 acquires the selection result and the reason for the proposed targets and uses them for RLHF / DPO training.

[0075] The server device 100 learns the reasons for the positive / negative ratings of the selected options and the reasons for the positive / negative ratings of the options not selected. The server device 100 further learns store information. The server device 100 may further learn advertising information.

[0076] The server device 100 learns the selection results and reasons before and after an action. For example, the server device 100 learns the selection results and reasons both at the time of the LLM's response (at the time of introduction) and after the experience.

[0077] It is also possible to assign one LLM per user. In this case, the server device 100 assigns one LLM to each user. That is, there is a one-to-one relationship between users and LLMs.

[0078] When there are n users and 1 LLM (where n is arbitrary), the server device 100 inputs the user attribute as "Human". The user attribute may be demographic, estimated, or the user's context (situation, background).

[0079] The server device 100 may infer reasons from reviews and posts. Since reviews contain the content of "selected" items, the reasons for positive / negative reactions can be inferred from them. Alternatively, the server device 100 may train an LLM to output reasons.

[0080] [3. Example of terminal device configuration] Next, the configuration of the terminal device 10 will be described using Figure 6. Figure 6 is a diagram showing an example of the configuration of the terminal device 10 according to this embodiment. As shown in Figure 6, the terminal device 10 comprises a communication unit 11, a display unit 12, an input unit 13, a positioning unit 14, a sensor unit 20, a control unit 30 (controller), and a storage unit 40.

[0081] (Communication unit 11) The communication unit 11 is connected to the network N by wire or wireless connection and transmits and receives information to and from the server device 100 via the network N. For example, the communication unit 11 can be implemented using a NIC (Network Interface Card) or an antenna.

[0082] (Display unit 12) The display unit 12 is a display device that displays various information such as location information. For example, the display unit 12 may be a liquid crystal display (LCD) or an organic electro-luminescent display (OLED). The display unit 12 may also be a touch panel display, but is not limited to this.

[0083] (Input unit 13) The input unit 13 is an input device that receives various operations from the user U. For example, the input unit 13 has buttons for inputting characters, numbers, etc. The input unit 13 may also be an input / output port (I / O port) or a USB (Universal Serial Bus) port. If the display unit 12 is a touch panel display, a part of the display unit 12 functions as the input unit 13. The input unit 13 may also be a microphone that receives voice input from the user U. The microphone may be wireless.

[0084] (Positioning unit 14) The positioning unit 14 receives signals (radio waves) transmitted from GPS (Global Positioning System) satellites and, based on the received signals, acquires position information (e.g., latitude and longitude) indicating the current position of the terminal device 10. In other words, the positioning unit 14 determines the position of the terminal device 10. Note that GPS is just one example of a GNSS (Global Navigation Satellite System).

[0085] Furthermore, the positioning unit 14 can determine its position using various methods other than GPS. For example, the positioning unit 14 may use various communication functions of the terminal device 10 to determine its position as an auxiliary positioning means for position correction, etc., as described below.

[0086] (Wi-Fi positioning) For example, the positioning unit 14 determines the location of the terminal device 10 by utilizing the Wi-Fi® communication function of the terminal device 10 and the communication network provided by each telecommunications company. Specifically, the positioning unit 14 determines the location of the terminal device 10 by performing Wi-Fi communication, etc., and determining the distance to nearby base stations and access points.

[0087] (Beacon positioning) Furthermore, the positioning unit 14 may determine the location using the Bluetooth® function of the terminal device 10. For example, the positioning unit 14 determines the location of the terminal device 10 by connecting to a beacon transmitter via the Bluetooth® function.

[0088] (Geomagnetic positioning) Furthermore, the positioning unit 14 determines the position of the terminal device 10 based on the geomagnetic pattern of the structure, which has been measured in advance, and the geomagnetic sensor provided in the terminal device 10.

[0089] (RFID positioning) Furthermore, if, for example, the terminal device 10 is equipped with an RFID (Radio Frequency Identification) tag function equivalent to that of a contactless IC card used at a train station ticket gate or in a store, or if it is equipped with a function to read RFID tags, the location where it was used will be recorded along with the information on the payment or other transactions made by the terminal device 10. The positioning unit 14 may determine the location of the terminal device 10 by acquiring such information. Alternatively, the location may be determined by an optical sensor or infrared sensor equipped in the terminal device 10.

[0090] The positioning unit 14 may, if necessary, determine the position of the terminal device 10 using one or a combination of the positioning means described above.

[0091] (Sensor unit 20) The sensor unit 20 includes various sensors mounted on or connected to the terminal device 10. The connection can be wired or wireless. For example, the sensors may be other detection devices besides the terminal device 10, such as wearable devices or wireless devices. In the example shown in Figure 6, the sensor unit 20 includes an acceleration sensor 21, a gyro sensor 22, a barometric pressure sensor 23, a temperature sensor 24, a sound sensor 25, a light sensor 26, a magnetic sensor 27, and an image sensor (camera) 28.

[0092] The sensors 21-28 described above are merely examples and not limiting. In other words, the sensor unit 20 may be configured to include some of the sensors 21-28, or it may include other sensors such as humidity sensors in addition to or instead of the sensors 21-28.

[0093] The acceleration sensor 21 is, for example, a 3-axis acceleration sensor and detects the physical movement of the terminal device 10, such as its direction of movement, velocity, and acceleration. The gyro sensor 22 detects the physical movement of the terminal device 10, such as its tilt in the three axes, based on its angular velocity. The barometric pressure sensor 23 detects the atmospheric pressure around the terminal device 10, for example.

[0094] Since the terminal device 10 is equipped with the acceleration sensor 21, gyro sensor 22, barometric pressure sensor 23, etc., it becomes possible to determine the position of the terminal device 10 using technologies such as pedestrian dead reckoning (PDR) that utilize these sensors 21 to 23. This makes it possible to obtain indoor location information that is difficult to obtain with positioning systems such as GPS.

[0095] For example, a pedometer using the acceleration sensor 21 can calculate the number of steps, walking speed, and distance walked. Additionally, the gyro sensor 22 can be used to determine the user U's direction of movement, gaze direction, and body tilt. Furthermore, the barometric pressure detected by the barometric pressure sensor 23 can be used to determine the altitude and floor number of the user U's terminal device 10.
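As a rough illustration of how such sensor readings could be turned into position information, the sketch below combines the standard international barometric formula with a single dead-reckoning step. It is not taken from the application: the function names, the 3.0 m floor height, the sea-level reference pressure default, and the heading convention (radians, clockwise from north) are all assumptions.

```python
import math
from typing import Tuple


def altitude_from_pressure(p_hpa: float, p0_hpa: float = 1013.25) -> float:
    """Approximate altitude in meters from pressure (international barometric formula)."""
    return 44330.0 * (1.0 - (p_hpa / p0_hpa) ** (1.0 / 5.255))


def floors_between(altitude_m: float, reference_altitude_m: float,
                   floor_height_m: float = 3.0) -> int:
    """Estimate how many floors above the reference level the device is.

    The floor height is an assumed constant; real buildings vary.
    """
    return round((altitude_m - reference_altitude_m) / floor_height_m)


def pdr_step(x: float, y: float, heading_rad: float,
             step_length_m: float) -> Tuple[float, float]:
    """Advance an (east, north) position by one detected step.

    The heading is measured clockwise from north, so east grows with sin
    and north grows with cos.
    """
    return (x + step_length_m * math.sin(heading_rad),
            y + step_length_m * math.cos(heading_rad))
```

In such a scheme the step count would come from the acceleration sensor 21, the heading from the gyro sensor 22, and the floor estimate from the barometric pressure sensor 23, with the reference altitude measured at a known floor.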

[0096] The temperature sensor 24 detects, for example, the ambient temperature around the terminal device 10. The sound sensor 25 detects, for example, the ambient sound around the terminal device 10. The light sensor 26 detects the ambient illumination around the terminal device 10. The magnetic sensor 27 detects, for example, the Earth's magnetic field around the terminal device 10. The image sensor 28 captures an image of the area around the terminal device 10.

[0097] The aforementioned barometric pressure sensor 23, temperature sensor 24, sound sensor 25, light sensor 26, and image sensor 28 can detect the surrounding environment and conditions of the terminal device 10 by detecting atmospheric pressure, temperature, sound, and illuminance, respectively, and by capturing images of the surroundings. Furthermore, it becomes possible to improve the accuracy of the location information of the terminal device 10 based on the surrounding environment and conditions.

[0098] (Control unit 30) The control unit 30 includes, for example, a microcomputer having a CPU (Central Processing Unit) or MPU (Micro Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), input / output ports, and various circuits. Alternatively, the control unit 30 may be composed of hardware such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). The control unit 30 includes a transmission unit 31, a reception unit 32, and a processing unit 33.

[0099] (Transmission unit 31) The transmission unit 31 can transmit various information, such as information input by the user U using the input unit 13, various information detected by sensors 21-28 mounted on or connected to the terminal device 10, and location information of the terminal device 10 determined by the positioning unit 14, to the server device 100 via the communication unit 11.

[0100] (Reception unit 32) The reception unit 32 can receive various information provided by the server device 100, as well as requests for various information from the server device 100, via the communication unit 11.

[0101] (Processing unit 33) The processing unit 33 controls the entire terminal device 10, including the display unit 12. For example, the processing unit 33 can output and display various information transmitted by the transmission unit 31 and various information received from the server device 100 by the reception unit 32 to the display unit 12.

[0102] (Storage unit 40) The storage unit 40 is implemented by, for example, semiconductor memory elements such as RAM (Random Access Memory) and flash memory, or by storage devices such as HDD (Hard Disk Drive), SSD (Solid State Drive), and optical discs. Various programs and various data are stored in this storage unit 40.

[0103] [4. Example of Server Device Configuration] Next, the configuration of the server device 100 according to the embodiment will be described using Figure 7. Figure 7 is a diagram showing an example of the configuration of the server device 100 according to the embodiment. As shown in Figure 7, the server device 100 includes a communication unit 110, a storage unit 120, and a control unit 130.

[0104] (Communication unit 110) The communication unit 110 is implemented, for example, by a NIC (Network Interface Card). The communication unit 110 is connected to the network N by wire or wireless connection.

[0105] (Storage unit 120) The storage unit 120 is implemented by, for example, semiconductor memory elements such as RAM (Random Access Memory) and flash memory, or by storage devices such as HDDs, SSDs, and optical discs. The storage unit 120 may store identification information (such as a user ID) indicating user U, as well as attribute information and history information (log data) of user U.

[0106] (Control unit 130) The control unit 130 is a controller, and is realized by executing various programs (corresponding to an example of an information processing program) stored in the internal storage device of the server device 100 using a storage area such as RAM as a working area, for example, by a CPU (Central Processing Unit), MPU (Micro Processing Unit), GPU (Graphics Processing Unit), ASIC (Application Specific Integrated Circuit), or FPGA (Field Programmable Gate Array). In the example shown in Figure 7, the control unit 130 has an acquisition unit 131, a response processing unit 132, a preference receiving unit 133, a reason receiving unit 134, and a learning processing unit 135.

[0107] (Acquisition unit 131) The acquisition unit 131 acquires the search query entered by the user U. For example, when the user U enters a search query into a search engine or the like and performs a keyword search, the acquisition unit 131 acquires the search query via the communication unit 110. In other words, the acquisition unit 131 acquires the keyword entered by the user U into the search box of a search engine, website, or application via the communication unit 110.

[0108] Furthermore, the acquisition unit 131 acquires user information about user U via the communication unit 110. For example, the acquisition unit 131 acquires identification information (such as user ID), location information, and attribute information of user U from user U's terminal device 10. The acquisition unit 131 may also acquire identification information and attribute information of user U when user U is registered. The acquisition unit 131 then stores the user information in the storage unit 120.

[0109] Furthermore, the acquisition unit 131 acquires various historical information (log data) indicating the user U's actions via the communication unit 110. For example, the acquisition unit 131 acquires various historical information indicating the user U's actions from the user U's terminal device 10, or from various servers based on the user ID, etc. The acquisition unit 131 then stores the various historical information in the storage unit 120.

[0110] (Response processing unit 132) The response processing unit 132 receives a prompt from user U via the communication unit 110, inputs the prompt into the model, and presents the two objects obtained as the model's output to user U. The response processing unit 132 may also automatically generate prompts from inquiry information from user U, or from user U's attribute information, history information, etc.

[0111] (Preference receiving unit 133) The preference receiving unit 133 receives binary preferences for two objects from user U via the communication unit 110. For example, the preference receiving unit 133 receives binary preferences for two objects at the time the two objects are presented to user U. The preference receiving unit 133 also receives further binary preferences for the two objects (or only the objects experienced) after user U has experienced at least one of the two objects. For example, the preference receiving unit 133 receives further binary preferences for the two objects (or only the objects experienced) after user U has experienced an object that user U evaluated as favorable. Alternatively, the preference receiving unit 133 receives further binary preferences for the two objects (or only the objects experienced) after user U has experienced an object that user U evaluated as unfavorable.

[0112] (Reason receiving unit 134) The reason receiving unit 134 receives the reason for a binary preference from user U via the communication unit 110. For example, the reason receiving unit 134 receives the reason for a binary preference for two objects at the time the two objects are presented to user U. The reason receiving unit 134 also receives the reason for a further binary preference for the two objects (or only the objects experienced) after user U has experienced at least one of the two objects. For example, the reason receiving unit 134 receives the reason for a further binary preference for the two objects (or only the objects experienced) after user U has experienced an object that user U evaluated as favorable. Alternatively, the reason receiving unit 134 receives the reason for a further binary preference for the two objects (or only the objects experienced) after user U has experienced an object that user U evaluated as unfavorable.

[0113] At this time, the reason receiving unit 134 receives positive and negative reasons for each of the two objects as reasons for preference in a binary choice between the two objects.

[0114] (Learning processing unit 135) The learning processing unit 135 generates training data for the model from binary preferences and reasons for those preferences, and performs RLHF or DPO learning using the training data. The learning processing unit 135 generally performs SFT prior to RLHF or DPO learning; however, SFT can be omitted. For example, the learning processing unit 135 generates training data for the model from the binary preferences and reasons for those preferences for the two objects presented to user U, performs SFT using the training data, and then performs RLHF or DPO learning. Alternatively, the learning processing unit 135 generates training data for the model from the binary preferences and reasons for those preferences for the two objects (or only the object experienced) after user U has experienced at least one of the two objects, performs SFT using the training data, and then performs RLHF or DPO learning.
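The application does not spell out the DPO objective, so the following sketch uses the standard published form of the DPO loss for a single chosen and rejected pair as an assumption. The function name, the argument names, and the default `beta` of 0.1 are hypothetical; the inputs are the summed log-probabilities that the trained policy and a frozen reference model assign to each response.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Scaled difference of the policy-vs-reference log-ratios for the two sides.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, which is how a pair of chosen and rejected records with their attached reasons would shape the model under this objective.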

[0115] For example, the learning processing unit 135 generates training data for the model from the user U's binary preferences and reasons for preferences for two objects (or only the objects experienced) after the user U has experienced the object that was evaluated as favorable. The unit then performs SFT using the training data, and subsequently performs RLHF or DPO learning.

[0116] Alternatively, the learning processing unit 135 generates training data for the model from the user U's further binary preference and the reason for that preference for the two objects (or only the objects experienced) after the user U has experienced the object that was evaluated as undesirable, performs SFT using the training data, and then performs RLHF or DPO training.

[0117] In other words, when the learning processing unit 135 receives the user U's binary preferences and reasons for preference for the two objects presented to the user U, it causes the model to generate training data from the binary preferences and reasons, performs SFT using the training data, and then performs RLHF or DPO learning. Furthermore, when the learning processing unit 135 receives the user U's binary preferences and reasons for preference again after the user U has experienced at least one of the two objects presented to the user U, it causes the model to generate training data from the binary preferences and reasons, performs SFT using the training data, and then performs RLHF or DPO learning.

[0118] At this time, the learning processing unit 135 performs in-context learning on the training data, which includes binary preferences and reasons for those preferences, to cause the model to generate generative augmentation data, which is similar training data, and then performs RLHF or DPO learning using the training data and the generative augmentation data.

[0119] Furthermore, the learning processing unit 135 causes the model to generate multiple pieces of generative augmentation data from a single piece of training data, and performs RLHF or DPO learning using that single piece of training data and the multiple pieces of generative augmentation data.

[0120] [5. Processing Procedure] Next, the processing procedure by the server device 100 according to the embodiment will be described using Figure 8. Figure 8 is a flowchart of the processing procedure according to the embodiment. Note that the processing procedure shown below is repeatedly executed by the control unit 130 of the server device 100.

[0121] For example, as shown in Figure 8, the response processing unit 132 of the server device 100 receives a prompt from user U via the communication unit 110, inputs the prompt into the model, and presents the two objects obtained as the output of the model to user U (step S101).

[0122] Next, the preference receiving unit 133 of the server device 100 receives the user U's two-choice preference for the two objects at the time the two objects are presented to the user U (step S102).

[0123] Next, the reason receiving unit 134 of the server device 100 receives the reason for the two-choice preference for the two objects at the time the two objects were presented to the user U (step S103).

[0124] Next, the learning processing unit 135 of the server device 100 generates training data for the model from the user U's binary preference for the two objects and the reasons for that preference at the time the two objects are presented to the user U, and then performs RLHF or DPO training using the training data (step S104).

[0125] Next, the preference receiving unit 133 of the server device 100 receives a further binary preference for the two objects (or only the objects experienced) from user U after user U has experienced at least one of the two objects (step S105).

[0126] Next, the reason receiving unit 134 of the server device 100 receives the reason for the further binary preference for the two objects (or only the objects experienced) after user U has experienced at least one of the two objects (step S106).

[0127] Next, the learning processing unit 135 of the server device 100 generates training data for the model from the user U's further binary preference and the reason for that preference for the two objects (or only the objects experienced) after the user U has experienced at least one of the two objects, and then performs RLHF or DPO training using the training data (step S107).
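Steps S101 to S107 can be summarized as one loop that runs the same receive-and-train sequence twice, once at presentation and once after the experience. The sketch below is a simplified illustration, not the claimed implementation: all four callables, the phase names, and the record keys are hypothetical, and it assumes both objects are compared again after the experience rather than only the experienced one.

```python
from typing import Callable, Dict, List, Tuple


def run_cycle(prompt: str,
              generate_two: Callable[[str], Tuple[str, str]],
              get_preference: Callable[[str, str, str], str],
              get_reason: Callable[[str, str, str], str],
              train: Callable[[Dict[str, str]], None]) -> List[Dict[str, str]]:
    """Run one pass of the procedure and return the training records it produced."""
    records: List[Dict[str, str]] = []
    first, second = generate_two(prompt)                      # S101: present two objects
    for phase in ("at_presentation", "after_experience"):
        preferred = get_preference(first, second, phase)      # S102 / S105
        reason = get_reason(first, second, phase)             # S103 / S106
        rejected = second if preferred == first else first
        record = {"prompt": prompt, "chosen": preferred, "rejected": rejected,
                  "reason": reason, "phase": phase}
        train(record)                                         # S104 / S107
        records.append(record)
    return records
```

Because the preference is queried separately in each phase, a reversal after the experience simply yields a second record with the chosen and rejected roles swapped.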

[0128] [6. Variant Example] The terminal device 10 and server device 100 described above may be implemented in various other forms besides those of the embodiment described above. Therefore, the following describes modifications of the embodiment.

[0129] In the above embodiment, some or all of the processing performed by the server device 100 may actually be performed by the terminal device 10 (or an application running on the terminal device 10). For example, the processing may be completed in a standalone manner (by the terminal device 10 alone). In this case, the terminal device 10 is assumed to have the same functions as the server device 100 in the above embodiment. Furthermore, in the above embodiment, since the terminal device 10 is in cooperation with the server device 100, from the perspective of the user U, it appears as if the processing of the server device 100 is also being performed by the terminal device 10. In other words, from another perspective, it can be said that the terminal device 10 is equipped with the server device 100.

[0130] Furthermore, in the above embodiment, the server device 100 accepts binary preferences for two targets, Restaurant A and Restaurant B, but in reality, it may accept binary preferences for three or more targets. For example, the server device 100 may accept preferences for each of the three targets, Restaurant A, Restaurant B, and Restaurant C, indicating whether each is preferred (chosen) or non-preferred (rejected). That is, there may be multiple preferred (chosen) and non-preferred (rejected) targets. The same applies to accepting reasons for preference.
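One way to reuse pairwise RLHF / DPO training when three or more targets are labeled is to expand the labels into every chosen and rejected combination. This expansion is an assumption made for illustration and is not stated in the application; the function name and the `"chosen"` / `"rejected"` label strings are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_to_pairs(labels: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn per-target chosen/rejected labels into (chosen, rejected) pairs."""
    chosen = [name for name, label in labels.items() if label == "chosen"]
    rejected = [name for name, label in labels.items() if label == "rejected"]
    # Every chosen target is paired with every rejected target.
    return list(product(chosen, rejected))


pairs = expand_to_pairs({
    "Restaurant A": "chosen",
    "Restaurant B": "rejected",
    "Restaurant C": "chosen",
})
```

With two chosen targets and one rejected target, this yields two pairs, each of which can be rendered into the same pairwise training format as in the two-target case.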

[0131] [7. Effects] As described above, the information processing device (terminal device 10 and server device 100) according to the present application is characterized by comprising: a response processing unit 132 that receives a prompt from the user, inputs the prompt into the model, and presents the two objects obtained as the output of the model to the user; a preference receiving unit 133 that receives a binary preference for the two objects from the user; a reason receiving unit 134 that receives the reason for the binary preference from the user; and a learning processing unit 135 that causes the model to generate learning data from the binary preference and the reason for the preference, and performs RLHF or DPO learning using the learning data.

[0132] This allows us to provide a method for effectively learning the binary preferences and reasons for those preferences in a model's response. In particular, we can provide a method for effectively learning binary preferences plus reasons for those preferences, using RLHF / DPO as the basic technology.

[0133] The preference receiving unit 133 receives the user's binary preference for the two objects at the time the two objects are presented to the user. The reason receiving unit 134 receives the reason for the binary preference for the two objects at the time the two objects are presented to the user. The learning processing unit 135 generates training data for the model from the binary preference for the two objects and the reason for the preference at the time the two objects are presented to the user, and performs RLHF or DPO learning using the training data.

[0134] This allows the model to learn the binary preference between the preferred (chosen) and non-preferred (rejected) objects at the time of its response, as well as the reasons for that preference.

[0135] The preference receiving unit 133 receives the user's further binary preference for the two objects after the user has experienced the object that was evaluated as favorable. The reason receiving unit 134 receives the reason for the further binary preference for the two objects after the user has experienced the object that was evaluated as favorable. The learning processing unit 135 generates training data for the model from the further binary preference and the reason for that preference for the two objects after the user has experienced the object that was evaluated as favorable, and performs RLHF or DPO learning using the training data.

[0136] This allows the system to learn the user's preference for a preferred object (chosen) and the reasons for that preference after the user has actually experienced it. The evaluation may change after the user's actual experience, and a preferred object (chosen) may become an undesirable object (rejected).

[0137] The preference receiving unit 133 receives the user's further binary preference for the two objects after the user has experienced the object that was evaluated as undesirable. The reason receiving unit 134 receives the reason for the further binary preference for the two objects after the user has experienced the object that was evaluated as undesirable. The learning processing unit 135 generates training data for the model from the further binary preference and the reason for that preference for the two objects after the user has experienced the object that was evaluated as undesirable, and performs RLHF or DPO learning using the training data.

[0138] This allows the system to learn the user's binary preference between the two objects and the reasons for that preference after the user has actually experienced the undesirable (rejected) object. The evaluation may change after the user's actual experience, and an undesirable (rejected) object may become a preferred (chosen) object.

[0139] When the learning processing unit 135 receives the user's binary preference and reason for preference for two objects presented to the user, it causes the model to generate training data from the binary preference and reason, and performs RLHF or DPO learning using the training data. When the user has experienced at least one of the two objects presented to the user and the learning processing unit 135 receives the user's binary preference and reason for preference again, it causes the model to generate training data from the binary preference and reason, and performs RLHF or DPO learning using the training data.

[0140] This allows the system to learn not only the binary preferences and the reasons for those preferences at the time of a response from a model such as an LLM, but also the binary preferences and the reasons for those preferences after the user has actually experienced the objects.
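The two feedback points described above (at presentation, and again after the user has experienced at least one object) can be sketched as follows. This is only an illustration under stated assumptions: the `Feedback` type, the stage labels, and the helper names are hypothetical, and the sketch simply records each stage as its own training row so that a reversed preference after the experience is preserved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    """Binary preference and reason given at one point in time (hypothetical type)."""
    stage: str                         # "presented" or "experienced" (assumed labels)
    chosen: str                        # object the user prefers at this stage
    rejected: str                      # the other object
    reason: str                        # reason given at this stage
    experienced: Optional[str] = None  # object the user actually tried, if any


def build_rows(prompt: str, initial: Feedback, later: Optional[Feedback] = None) -> list:
    """Return one training row per feedback stage that was received."""
    rows = []
    for fb in (initial, later):
        if fb is None:
            continue
        rows.append({
            "prompt": prompt,
            "stage": fb.stage,
            "chosen": fb.chosen,
            "rejected": fb.rejected,
            "reason": fb.reason,
        })
    return rows


def preference_flipped(initial: Feedback, later: Feedback) -> bool:
    """True when the chosen and rejected objects swapped after the experience."""
    return initial.chosen == later.rejected and initial.rejected == later.chosen
```

Keeping both rows rather than overwriting the first is a design choice of this sketch; it lets a later step weigh the pre-experience and post-experience evaluations separately.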

[0141] The learning processing unit 135 performs in-context learning on training data including binary preferences and reasons for those preferences, causes the model to generate similar training data, which is generative augmentation data, and then performs RLHF or DPO learning using the training data and generative augmentation data.

[0142] This allows the model to predict how other users would evaluate something and generate similar training data. For example, by inputting data about the subject, along with users' binary preferences and the reasons for those preferences, the model can generate similar training data.
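One way to picture the in-context step is as a few-shot prompt assembled from existing training data and information about the subject. The sketch below only builds that prompt text; the four-line example format, the wording of the instruction, and the function names are assumptions for illustration, and the call to the model itself is left out.

```python
from typing import Mapping, Sequence


def format_example(ex: Mapping[str, str]) -> str:
    """Render one training example as four labeled lines (assumed format)."""
    return (
        f"Prompt: {ex['prompt']}\n"
        f"Chosen: {ex['chosen']}\n"
        f"Rejected: {ex['rejected']}\n"
        f"Reason: {ex['reason']}"
    )


def build_augmentation_prompt(
    seeds: Sequence[Mapping[str, str]],
    subject_info: str,
    n_new: int = 3,
) -> str:
    """Assemble a few-shot prompt asking the model for similar training data."""
    shots = "\n\n".join(format_example(ex) for ex in seeds)
    return (
        "Below are examples of a user's binary preference between two objects, "
        "with the reason for each preference.\n\n"
        f"{shots}\n\n"
        f"Information about the subject: {subject_info}\n\n"
        f"Write {n_new} new examples in the same four-line format, "
        "predicting how other users might evaluate similar objects."
    )
```

The returned string would be passed to the model, and its output parsed back into rows of the same shape as the seeds.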

[0143] The learning processing unit 135 causes the model to generate multiple items of generative augmentation data from a single item of training data, and performs RLHF or DPO learning using the single item of training data and the multiple items of generative augmentation data.

[0144] This makes it possible to generate a large amount of training data that captures intent from a small amount of signal information.
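The fan-out from one item of training data to several generated items can be sketched as below. The generator is passed in as a callable so that the example runs without a model; `stub_generate` is a stand-in that only rewrites the prompt text, and all names and the `source` tag are assumptions of this sketch.

```python
from typing import Callable


def expand_dataset(seed: dict, generate: Callable[[dict, int], dict], n_aug: int = 5) -> list:
    """Return the user-provided seed followed by n_aug generated variants."""
    dataset = [dict(seed, source="user")]       # copy, so the seed is not mutated
    for i in range(n_aug):
        variant = generate(seed, i)              # in practice, a call to the model
        dataset.append(dict(variant, source="generated"))
    return dataset


def stub_generate(seed: dict, index: int) -> dict:
    """Stand-in for a model call: it only rewrites the prompt text."""
    return dict(seed, prompt=f"{seed['prompt']} (variant {index + 1})")
```

Tagging each row with its origin makes it possible to weight or filter generated rows separately from the user's own feedback during learning.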

[0145] The reason receiving unit 134 receives, as the reason for the binary preference, both a favorable reason and an unfavorable reason for each of the two objects presented to the user.

[0146] This provides a fast learning method for achieving personalization and individualization.
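Receiving both favorable and unfavorable reasons for each object suggests a record with four reason slots per comparison. The following sketch shows one possible layout; the type names, the list-of-strings representation, and the one-line summary format are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectReasons:
    """Reasons the user gave about one object (hypothetical type)."""
    positive: list[str] = field(default_factory=list)  # what the user liked
    negative: list[str] = field(default_factory=list)  # what the user disliked


@dataclass
class TwoSidedFeedback:
    chosen: str
    rejected: str
    chosen_reasons: ObjectReasons
    rejected_reasons: ObjectReasons


def _join(items: list[str]) -> str:
    return "; ".join(items) if items else "none"


def summarize(fb: TwoSidedFeedback) -> str:
    """Render one line per object listing its favorable and unfavorable reasons."""
    rows = (
        ("chosen", fb.chosen, fb.chosen_reasons),
        ("rejected", fb.rejected, fb.rejected_reasons),
    )
    return "\n".join(
        f"{name} ({label}): + {_join(r.positive)} / - {_join(r.negative)}"
        for label, name, r in rows
    )
```

Even the chosen object can carry unfavorable reasons, and the rejected object favorable ones, which is the extra signal this layout is meant to keep.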

[0147] By any one or a combination of the above-described processes, the information processing device according to the present invention can provide a method for effectively learning binary preferences regarding a model's responses and the reasons for those preferences.
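The text names DPO as one of the learning methods but does not spell out its objective. For orientation only, the sketch below computes the commonly published per-pair DPO loss, -log σ(β[(log π(y_chosen|x) - log π_ref(y_chosen|x)) - (log π(y_rejected|x) - log π_ref(y_rejected|x))]), from four sequence log-probabilities. The function name, the argument names, and the default β = 0.1 are assumptions of this example, not values taken from the embodiment.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,   # assumed strength of the pull toward the reference model
) -> float:
    """Per-pair DPO loss computed from summed sequence log-probabilities."""
    # How much more the policy favors chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow for large |z|.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy favors the chosen object more strongly than the reference does.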

[0148] [8. Hardware Configuration] Furthermore, the terminal device 10 and server device 100 according to the above-described embodiment are realized by a computer 1000 having a configuration such as that shown in Figure 9. The following explanation will use the server device 100 as an example. Figure 9 is a diagram showing an example of the hardware configuration. The computer 1000 is connected to an output device 1010 and an input device 1020, and has a configuration in which an arithmetic unit 1030, a primary storage device 1040, a secondary storage device 1050, an output interface 1060, an input interface 1070, and a network interface 1080 are connected by a bus 1090.

[0149] The arithmetic unit 1030 operates based on programs stored in the primary storage device 1040 and the secondary storage device 1050, as well as programs read from the input device 1020, and executes various processes. The arithmetic unit 1030 can be implemented using, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array).

[0150] The primary storage device 1040 is a memory device, such as RAM (Random Access Memory), that temporarily stores data used by the arithmetic unit 1030 for various calculations. The secondary storage device 1050 is a storage device where data used by the arithmetic unit 1030 for various calculations and various databases are registered, and can be implemented using ROM (Read Only Memory), HDD (Hard Disk Drive), SSD (Solid State Drive), flash memory, etc. The secondary storage device 1050 may be internal storage or external storage. The secondary storage device 1050 may also be a removable storage medium such as USB (Universal Serial Bus) memory or SD (Secure Digital) memory card. The secondary storage device 1050 may also be cloud storage (online storage), NAS (Network Attached Storage), file server, etc.

[0151] The output interface 1060 is an interface for transmitting information to be output to output devices 1010, such as displays, projectors, and printers, and is implemented using connectors of standards such as USB (Universal Serial Bus), DVI (Digital Visual Interface), and HDMI (High Definition Multimedia Interface). The input interface 1070 is an interface for receiving information from various input devices 1020, such as mice, keyboards, keypads, buttons, and scanners, and is implemented using, for example, USB.

[0152] Furthermore, the output interface 1060 and input interface 1070 may be wirelessly connected to the output device 1010 and input device 1020, respectively. In other words, the output device 1010 and input device 1020 may be wireless devices.

[0153] Furthermore, the output device 1010 and the input device 1020 may be integrated as a touch panel. In this case, the output interface 1060 and the input interface 1070 may also be integrated as an input/output interface.

[0154] The input device 1020 may also be a device that reads information from, for example, an optical recording medium such as a CD (Compact Disc), DVD (Digital Versatile Disc), or PD (Phase Change Rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

[0155] The network interface 1080 receives data from other devices via network N and sends it to the arithmetic unit 1030, and also transmits data generated by the arithmetic unit 1030 to other devices via network N.

[0156] The arithmetic unit 1030 controls the output device 1010 and the input device 1020 via the output interface 1060 and the input interface 1070. For example, the arithmetic unit 1030 loads a program from the input device 1020 or the secondary storage device 1050 onto the primary storage device 1040 and executes the loaded program.

[0157] For example, when the computer 1000 functions as the server device 100, the arithmetic unit 1030 of the computer 1000 realizes the functions of the control unit 130 by executing a program loaded onto the primary storage device 1040. Alternatively, the arithmetic unit 1030 of the computer 1000 may load a program obtained from another device via the network interface 1080 onto the primary storage device 1040 and execute the loaded program. Furthermore, the arithmetic unit 1030 of the computer 1000 may cooperate with other devices via the network interface 1080 and call and use program functions, data, and the like from other programs on those devices.

[0158] [9. Other] Although embodiments of the present invention have been described above, the present invention is not limited by the content of these embodiments. Furthermore, the aforementioned components include those that can be easily conceived by those skilled in the art, those that are substantially the same, and those that fall within the so-called equivalent range. Moreover, the aforementioned components can be combined as appropriate. Furthermore, various omissions, substitutions, or modifications of the components can be made without departing from the gist of the embodiments described above.

[0159] Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above document and drawings can be arbitrarily changed unless otherwise specified. For example, the various information shown in each figure is not limited to the information shown.

[0160] Furthermore, the components of each illustrated device are functionally conceptual and do not necessarily need to be physically configured as shown. In other words, the specific forms of distribution and integration of each device are not limited to those shown, and all or part of them can be functionally or physically distributed and integrated in any unit according to various loads and usage conditions.

[0161] For example, the server device 100 described above may be implemented using multiple server computers, and the configuration can be flexibly changed, such as by calling external platforms via APIs (Application Programming Interfaces) or network computing depending on the function.

[0162] Furthermore, the embodiments and modifications described above can be combined as appropriate, provided that the processing content is not inconsistent.

[0163] Furthermore, the terms "section," "module," and "unit" mentioned above can be replaced with "means," "circuit," or the like. For example, the acquisition unit can be replaced with acquisition means or an acquisition circuit.

[0164] [Explanation of symbols]
1 Information processing system
10 Terminal device
100 Server device
110 Communication unit
120 Storage unit
130 Control unit
131 Acquisition unit
132 Response processing unit
133 Preference receiving unit
134 Reason receiving unit
135 Learning processing unit

Claims

1. An information processing device comprising: a response processing unit that receives a prompt from a user, inputs the prompt into a model, and presents to the user two objects obtained as the output of the model; a preference receiving unit that receives a binary preference for the two objects from the user; a reason receiving unit that receives the reason for the binary preference from the user; and a learning processing unit that causes the model to generate training data from the binary preference and the reason for the preference, and performs RLHF or DPO learning using the training data.

2. The information processing device according to claim 1, wherein the preference receiving unit receives the user's binary preference for the two objects at the time the two objects are presented to the user, the reason receiving unit receives the reason for the binary preference for the two objects at the time the two objects are presented to the user, and the learning processing unit causes the model to generate training data from the binary preference and the reason for the preference given at the time the two objects are presented to the user, and performs RLHF or DPO learning using the training data.

3. The information processing device according to claim 2, wherein the preference receiving unit receives the user's binary preference for the two objects after the user has experienced the object that was evaluated as favorable, the reason receiving unit receives the reason for the binary preference for the two objects after the user has experienced the object that was evaluated as favorable, and the learning processing unit causes the model to generate training data from the binary preference and the reason for the preference given after the user has experienced the object that was evaluated as favorable, and performs RLHF or DPO learning using the training data.

4. The information processing device according to claim 2, wherein the preference receiving unit receives the user's binary preference for the two objects after the user has experienced the object that was evaluated as undesirable, the reason receiving unit receives the reason for the binary preference for the two objects after the user has experienced the object that was evaluated as undesirable, and the learning processing unit causes the model to generate training data from the binary preference and the reason for the preference given after the user has experienced the object that was evaluated as undesirable, and performs RLHF or DPO learning using the training data.

5. The information processing device according to claim 1, wherein the learning processing unit: when the two objects are presented to the user and the user's binary preference and reason for preference for the two objects are received, causes the model to generate training data from the binary preference and the reason, and performs RLHF or DPO learning using the training data; and when, after the user has experienced at least one of the two objects presented to the user, the user's binary preference and reason for preference for the two objects are received again, causes the model to generate training data from that binary preference and reason, and performs RLHF or DPO learning using the training data.

6. The information processing device according to claim 1, wherein the learning processing unit performs in-context learning on training data including binary preferences and the reasons for those preferences, causes the model to generate generative augmentation data, which is similar training data, and performs RLHF or DPO learning using the training data and the generative augmentation data.

7. The information processing device according to claim 6, wherein the learning processing unit causes the model to generate multiple items of generative augmentation data from a single item of training data, and performs RLHF or DPO learning using the single item of training data and the multiple items of generative augmentation data.

8. The information processing device according to claim 1, wherein the reason receiving unit receives, as the reason for the binary preference between the two objects, a favorable reason and an unfavorable reason for each object.

9. An information processing method performed by an information processing device, the method comprising: a response processing step of receiving a prompt from a user, inputting the prompt into a model, and presenting to the user two objects obtained as the output of the model; a preference receiving step of receiving a binary preference for the two objects from the user; a reason receiving step of receiving the reason for the binary preference from the user; and a learning processing step of causing the model to generate training data from the binary preference and the reason for the preference, and performing RLHF or DPO learning using the training data.

10. An information processing program that causes a computer to execute: a response processing procedure of receiving a prompt from a user, inputting the prompt into a model, and presenting to the user two objects obtained as the output of the model; a preference receiving procedure of receiving a binary preference for the two objects from the user; a reason receiving procedure of receiving the reason for the binary preference from the user; and a learning processing procedure of causing the model to generate training data from the binary preference and the reason for the preference, and performing RLHF or DPO learning using the training data.