Conversation device, conversation method, and conversation program
The dialogue device uses a trained model to generate and retain summaries by topic completion, addressing the challenge of inadequate summarization in conventional systems.
Patent Information
- Authority / Receiving Office
- WO
- Patent Type
- Applications
- Current Assignee / Owner
- PIONEER IP
- Filing Date
- 2025-07-25
- Publication Date
- 2026-07-02
Figure JP2025026546_02072026_PF_FP_ABST
Abstract
Description
Dialogue device, dialogue method, and dialogue program
[0001] The present invention relates to a dialogue device, a dialogue method, and a dialogue program.
[0002] With the development of AI and related technologies, dialogue devices that interact with a user by responding appropriately to the user's utterances have become known. To make the content of a dialogue with the user available as a history, such a dialogue device may generate a summary of the dialogue content (see, for example, Patent Document 1).
[0003] Japanese Patent Application Laid-Open No. 2005-309185
[0004] However, the prior art cannot always generate a reasonable summary sentence. For example, the prior art generates a summary sentence from the dialogue history preceding an interruption of voice input, but it cannot separate that history by topic, so it may fail to generate a reasonable summary sentence for each topic.
[0005] The present invention has been made in view of the above, and an object thereof is to provide a dialogue device, a dialogue method, and a dialogue program that can reasonably generate a summary sentence.
[0006] In order to solve the above-described problems and achieve the object, a dialogue device according to the present invention is a dialogue device that interacts with a user and includes: a generation unit that, in the dialogue with the user, generates a summary, or a summary and tags, of the dialogue content with the user using a trained model; a determination unit that determines from the dialogue content whether or not the topic in the dialogue has ended; and a holding unit that holds the summary generated by the generation unit when the determination unit determines that the topic has ended.
[0007] Further, a dialogue method according to the present invention is a dialogue method executed by a dialogue device and includes: a generation step of generating, in the dialogue with the user, a summary, or a summary and tags, of the dialogue content with the user using a trained model; a determination step of determining from the dialogue content whether or not the topic in the dialogue has ended; and a holding step of holding the summary generated in the generation step when it is determined in the determination step that the topic has ended.
[0008] Furthermore, a dialogue program according to the present invention causes a computer to execute: a generation procedure that, in a dialogue with a user, generates a summary, or a summary and tags, of the dialogue content with the user using a trained model; a determination procedure that determines from the dialogue content whether or not the topic in the dialogue has ended; and a holding procedure that holds the summary generated by the generation procedure when the determination procedure determines that the topic has ended.
[0009] Figure 1 is a diagram showing an example of a scenario in which the dialogue device according to the embodiment is utilized.
Figure 2 is a diagram illustrating the processing content of the dialogue device according to the embodiment.
Figure 3 is a block diagram showing an example of the configuration of the dialogue device according to the embodiment.
Figure 4 is a diagram showing an example of dialogue data stored in the dialogue device according to the embodiment.
Figure 5 is a diagram showing an example of summary data stored in the dialogue device according to the embodiment.
Figure 6 is a diagram showing an example of the processing content of the generation unit according to the embodiment.
Figure 7 is a diagram showing an example of the processing content of the determination unit according to the embodiment.
Figure 8 is a diagram showing an example of the processing content of the determination unit according to the embodiment.
Figure 9 is a diagram showing an example of the processing content of the determination unit according to the embodiment.
Figure 10 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 11 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 12 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 13 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 14 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 15 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 16 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 17 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 18 is a diagram showing an example of the processing content of the dialogue device according to the embodiment.
Figure 19 is a flowchart showing an example of the processing flow performed by the dialogue device according to the embodiment.
Figure 20 is a flowchart showing an example of the processing flow performed by the dialogue device according to the embodiment.
Figure 21 is a diagram showing an example of the application of the usage of the dialogue device according to the embodiment.
Figure 22 is a hardware configuration diagram showing an example of a computer that realizes the functions of the dialogue device according to the embodiment.
[0010] Embodiments of the dialogue device, dialogue method, and dialogue program according to the present invention will be described in detail below with reference to the drawings. However, the present invention is not limited to the embodiments described below.
[0011] [Embodiment 1] (Introduction) Figure 1 is a diagram showing an example of a scenario in which the dialogue device according to this embodiment is utilized. As shown in Figure 1, the dialogue device 1 according to this embodiment is, for example, a device mounted on a vehicle and has a navigation function that guides the user riding in the vehicle to their destination. Note that the contents shown in Figure 1 are merely an example, and the dialogue device 1 does not have to be mounted on a vehicle. For example, the dialogue device 1 may be installed in a predetermined facility for the purpose of conversing with children or the elderly, or it may be installed in a commercial facility, etc., for the purpose of guiding the user to their desired store.
[0012] The dialogue device 1 is implemented, for example, by a computer mounted in a vehicle. It may be a dedicated navigation device built into or mounted in the vehicle, or it may consist of a navigation device and a dialogue processing device. The dialogue device 1 converses with the user by responding appropriately to the user's utterances, and summarizes the dialogue content topic by topic using a trained machine learning model.
[0013] (Processing details of the dialogue device 1) Next, the processing details of the dialogue device 1 according to the embodiment will be described. As mentioned above, in the prior art, there is a known device that generates a summary of the dialogue content in order to utilize the content of the dialogue with the user as a history.
[0014] However, the conventional technology generates a summary from the dialogue history preceding an interruption of voice input, and because it cannot separate that history by topic, it cannot generate a reasonable summary for each topic.
[0015] Therefore, in the embodiment, the dialogue device 1 generates a summary of the dialogue content with the user, or a summary and tags, using a trained model during the dialogue with the user. The dialogue device 1 then determines from the dialogue content whether the topic in the dialogue has ended, and if it is determined that the topic has ended, it retains the generated summary.
[0016] Here, we will explain the processing content of the dialogue device 1 with an example. Figure 2 is a diagram illustrating the processing content of the dialogue device according to the embodiment. Figure 2 shows an example of dialogue on two different topics: "a birthday present for my daughter" and "tomorrow's weather."
[0017] As shown in Figure 2, the dialogue device 1 first converses with the user about the topic of "a birthday present for my daughter." For this topic, the dialogue device 1 generates a summary and tags using a trained model such as a large language model, either when the conversation pauses for a predetermined amount of time or after each exchange.
[0018] Then, if the dialogue is interrupted for a predetermined time, the dialogue device 1 inputs a summary of the previous dialogue content, specifically "My daughter's birthday present," and the user's subsequent utterance of "Tomorrow's weather" after the interruption into the trained model to determine that the topic of "My daughter's birthday present" has ended. Having determined that the topic of "My daughter's birthday present" has ended, the dialogue device 1 stores the generated summary and tags related to "My daughter's birthday present."
[0019] As a result, the dialogue device 1 can finalize and retain the summary generated during the dialogue with the user when the topic has ended, thus enabling it to rationally generate a summary for each topic in the dialogue with the user.
[0020] The following explanation describes an example in which the dialogue device 1 generates both a summary and tags, but is not limited to this example. For example, the dialogue device 1 may generate only a summary of the dialogue content without generating tags.
[0021] (Configuration of the dialogue device 1) Next, an example of the functional configuration of the dialogue device 1 according to Embodiment 1 will be described. Figure 3 is a block diagram showing an example of the configuration of the dialogue device according to the embodiment. As shown in Figure 3, the dialogue device 1 includes, for example, a communication unit 2, a voice input unit 3, an output unit 4, a control unit 5, and a storage unit 6.
[0022] The communication unit 2, under the control of the control unit 5, connects to a server via a network and sends and receives various types of information. For example, the dialogue data 61 and trained model 63 contained in the storage unit 6 of the dialogue device 1 may instead be stored in an external device such as a server, in which case the communication unit 2 sends and receives that data via the network.
[0023] The voice input unit 3 includes a microphone for capturing sound and generates voice information based on the captured sound. For example, under the control of the control unit 5, the voice input unit 3 generates voice information when it detects utterances from users who are conversation partners of the dialogue device 1, such as the driver or passengers. In addition to the user's utterances, the voice input unit 3 can include in the voice information the attributes of the speaking user and the speaking environment. For example, in addition to the recorded data of the dialogue between the user and the dialogue device 1, the voice input unit 3 can generate voice information that associates each utterance with user attribute information determined from the type of voice, or that links the recorded data to speaking environment information, such as "driving," determined from the sounds that occur while the vehicle is in operation. The voice input unit 3 then outputs the generated voice information to the control unit 5 or stores it in the storage unit 6.
[0024] The output unit 4 outputs information under the control of the control unit 5. For example, the output unit 4 may include an audio output device and output the information input from the control unit 5 as audio. Alternatively, the output unit 4 may include a display device and, in addition to audio output, may also output by displaying characters, symbols, etc.
[0025] The storage unit 6 stores various programs executed by the control unit 5 (dialogue programs according to this embodiment), as well as data necessary for the control unit 5 to perform processing. For example, as shown in Figure 3, the storage unit 6 includes dialogue data 61, summary data 62, and a trained model 63.
[0026] The dialogue data 61 stores the original text of past dialogues between the user and the dialogue device 1 in association with a date and time. Here, the date and time indicates when the dialogue took place, and the original text is a record of the dialogue content exactly as it was spoken, which is distinct from the dialogue summary described later.
[0027] Figure 4 shows an example of dialogue data stored in the dialogue device according to the embodiment. As shown in Figure 4, the dialogue data 61 stores that at "10 / 17 / 11:55", the following dialogue took place: "What would be a good birthday present for my daughter who is in the first year of junior high school? That's a great idea for a present for your daughter who is in the first year of junior high school!..."
[0028] The summary data 62 stores summaries and tags of past dialogues between the user and the dialogue device 1 in association with a date and time. Here, the date and time indicates when the summary and tags were generated, the summary is a summary of the dialogue, and the tags are words that the trained model 63 (described later) determined to be related to the dialogue content and output accordingly.
[0029] Figure 5 shows an example of summary data stored in the dialogue device according to the embodiment. As shown in Figure 5, for the dialogue that took place at "10 / 17 / 11:55", the summary data 62 stores a summary that reads "The user is looking for a birthday present for his junior high school daughter...", and tags generated as being related to the summary, such as "birthday present, daughter,...".
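The two stores described above can be pictured as simple records. The following is a minimal sketch only: the class and field names (`DialogueRecord`, `SummaryRecord`, `occurred_at`, and so on) are hypothetical and do not appear in the source, and the year in the timestamps is an arbitrary placeholder because Figures 4 and 5 give only "10 / 17 / 11:55".

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DialogueRecord:
    """One entry of dialogue data 61: a timestamp plus the verbatim dialogue text."""
    occurred_at: datetime  # when the dialogue took place
    original_text: str     # dialogue recorded exactly as spoken


@dataclass
class SummaryRecord:
    """One entry of summary data 62: a timestamp, a summary, and related tags."""
    generated_at: datetime  # when the summary and tags were generated
    summary: str
    tags: list[str] = field(default_factory=list)  # empty when only a summary is generated


# Values modeled on Figures 4 and 5; the year 2024 is a placeholder assumption.
dialogue = DialogueRecord(
    occurred_at=datetime(2024, 10, 17, 11, 55),
    original_text="What would be a good birthday present for my daughter ...",
)
summary = SummaryRecord(
    generated_at=datetime(2024, 10, 17, 11, 55),
    summary="The user is looking for a birthday present for his junior high school daughter...",
    tags=["birthday present", "daughter"],
)
```

Keeping the tags as a list with an empty default reflects the option, stated in [0020], of generating a summary without tags.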
[0030] The trained model 63 is, for example, a Large Language Model (LLM). Alternatively, the trained model 63 may be a group of language models containing multiple language models. Furthermore, the configuration and type of the trained model 63 are not particularly limited, as long as it is a machine learning model trained to output a summary of the utterance and tags from the original text of the input utterance in the processing described later.
[0031] The control unit 5 is implemented by a controller such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit) executing various programs stored in the storage unit 6, and controls the operation of the entire dialogue device 1. Note that the control unit 5 is not limited to a CPU or MPU; it may also be composed of integrated circuits such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). For example, as shown in Figure 3, the control unit 5 includes a generation unit 51, a determination unit 52, and a holding unit 53.
[0032] The generation unit 51 generates a summary and tags of the dialogue content in the dialogue between the user and the dialogue device 1 using the trained model 63. For example, the generation unit 51 inputs the voice information (original text) related to the dialogue content between the user and the dialogue device 1, which was generated by the voice input unit 3 mentioned above, into the trained model 63, and sets a prompt sentence to generate a summary and tags of the input content. The generation unit 51 then obtains the summary and tags of the voice information output by the trained model 63. The generation unit 51 can also generate only the summary by changing the prompt sentence, for example.
[0033] Here, the processing of the generation unit 51 will be explained. Figure 6 is a diagram showing an example of the processing content of the generation unit according to the embodiment. Figure 6 shows, in order, examples (1) to (4) of generating a summary and tags of the dialogue content of "birthday present for my daughter" between the user and the dialogue device 1. In Figure 6, the explanation is based on a series of dialogue contents in which the user says, "What would be a good birthday present for my daughter who is in the first year of junior high school?" and the dialogue device 1 responds with choices regarding a birthday present for the daughter who is in the first year of junior high school.
[0034] Here, the "Q" in the prompt statement in the following embodiment indicates that the following sentence is an instruction to be set for the trained model 63. Also, the "A" in the output of the trained model 63 indicates that the following sentence is a response to the prompt statement output by the trained model 63.
[0035] As shown in Figure 6, the generation unit 51 first sets the prompt "Q: Please think of a summary of the following dialogue and several tags." for the trained model 63 (Figure 6(1)). The generation unit 51 also inputs the original text of the dialogue between the user and the dialogue device 1 regarding the topic of "birthday present for my daughter": "(User) For my daughter who is in the first year of junior high school..." (Figure 6(2)). The generation unit 51 then obtains the summary "A: ### Summary_The user is thinking of a birthday present for her daughter who is in the first year of junior high school..." (Figure 6(3)) and the tags "### Tags_-Birthday present--Junior high school student--Daughter..." (Figure 6(4)).
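The prompt-and-response exchange of Figure 6 might be handled along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the parser assumes the response is laid out as a "### Summary" heading followed by text and a "### Tags" heading followed by "- tag" lines, which is one reading of the output quoted above; an actual trained model may format its answer differently.

```python
import re

SUMMARY_PROMPT = "Q: Please think of a summary of the following dialogue and several tags."


def build_summary_input(original_text: str) -> str:
    """Combine the prompt sentence with the verbatim dialogue text."""
    return f"{SUMMARY_PROMPT}\n{original_text}"


def parse_summary_output(output: str) -> tuple[str, list[str]]:
    """Split a model response into (summary, tags).

    Assumed layout: '### Summary' + text, then optionally '### Tags' + '- tag' lines.
    """
    body = output.removeprefix("A:").strip()
    match = re.search(r"###\s*Summary\s*(.*?)(?:###\s*Tags\s*(.*))?$", body, re.DOTALL)
    if match is None:
        return body, []  # no recognizable headings: treat the whole answer as the summary
    summary = match.group(1).strip()
    tag_block = match.group(2) or ""
    # Drop the leading '- ' bullet marker from each non-empty tag line.
    tags = [line.lstrip("- ").strip() for line in tag_block.splitlines() if line.strip()]
    return summary, tags
```

A response without a "### Tags" section yields an empty tag list, which corresponds to the summary-only variant obtained by changing the prompt sentence.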
[0036] Furthermore, the generation unit 51 can generate the summary, or the summary and tags, so that they include information about the environment in which the dialogue between the user and the dialogue device 1 took place. For example, if the user is driving alone when the dialogue with the dialogue device 1 takes place, the generation unit 51 generates the summary and tags so that they include the phrase "user is driving alone," indicating the environment in which the dialogue took place. In this case, the dialogue device 1 can identify the user taking part in the dialogue from the type of voice through processing by the voice input unit 3, etc., and obtain information such as the number of users and the user's attributes (father, mother).
[0037] Furthermore, the dialogue device 1 can also determine information about the environment in which the interaction between the user and the dialogue device 1 took place, for example, by using information from external sensors in addition to the type of voice. To give a specific example, if the user is driving alone, the dialogue device 1 can use in-car cameras, GPS information, etc., to determine that the user is alone and that the user is driving.
[0038] An example of the timing at which the generation unit 51 generates a summary of the dialogue content is described here. For example, the generation unit 51 generates a summary and tags when the dialogue between the user and the dialogue device 1 is interrupted for a predetermined period of time or longer. Specifically, the generation unit 51 determines that the dialogue has been interrupted for the predetermined period or longer if, for example, no utterance from the user is detected for 10 seconds or more after the dialogue device 1 responds to the user's utterance. When the generation unit 51 determines that the dialogue has been interrupted, it generates a summary and tags from the dialogue content (original text) detected before the interruption.
[0039] As another example, the generation unit 51 generates a summary and tags when the dialogue device 1 responds to a user's utterance. Specifically, each time the dialogue device 1 responds to a user's utterance, the generation unit 51 generates a summary and tags from the dialogue content (original text) of the one round trip, from the user's utterance to the response of the dialogue device 1, detected at the time of the response.
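The two generation timings can be expressed as a single decision function. The sketch below is illustrative only: `Trigger`, `should_summarize`, and the parameter names are hypothetical, and the 10-second value is the example figure from the text rather than a fixed requirement.

```python
from enum import Enum

INTERRUPTION_SECONDS = 10.0  # example value from the text; an actual device may use another


class Trigger(Enum):
    ON_INTERRUPTION = "interruption"    # summarize after a silent gap
    ON_EACH_RESPONSE = "each_response"  # summarize after every round trip


def should_summarize(
    trigger: Trigger,
    seconds_since_response: float,
    device_just_responded: bool,
) -> bool:
    """Decide whether the generation unit should produce a summary now."""
    if trigger is Trigger.ON_EACH_RESPONSE:
        return device_just_responded
    # ON_INTERRUPTION: no user utterance for the predetermined period or longer.
    return seconds_since_response >= INTERRUPTION_SECONDS
```

For instance, `should_summarize(Trigger.ON_INTERRUPTION, 12.0, False)` evaluates to `True`, while the same call with `3.0` seconds evaluates to `False`.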
[0040] The determination unit 52 determines from the dialogue content whether the topic in the dialogue has ended. For example, the determination unit 52 inputs to the trained model 63 the summary and tags generated in the dialogue with the user, as well as the user's new utterance (original text) after the timing of generating the summary (summary generation time) and a prompt sentence to determine whether the new utterance is related to the summary and tags. The determination unit 52 then obtains the relationship determination result and the reason for the determination output from the trained model 63.
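As a rough illustration of this determination, the sketch below passes the summary, tags, and new utterance to a model and reads back the judgment and its reason. The trained model is represented by any callable that maps text to text, which is an assumption made for illustration. The prompt wording is taken from the specific example of Figure 13; the "Content:", "Summary:", and "Tags:" labels and the rule that a related answer begins with "Yes" are likewise assumptions.

```python
from typing import Callable

RELEVANCE_PROMPT = (
    "Q: Please tell me if the following content is related to the summary and tags."
)


def determine_topic_end(
    model: Callable[[str], str],
    summary: str,
    tags: list[str],
    new_utterance: str,
) -> tuple[bool, str]:
    """Return (topic_ended, reason) based on the model's relevance judgment."""
    model_input = "\n".join(
        [
            RELEVANCE_PROMPT,
            f"Content: {new_utterance}",
            f"Summary: {summary}",
            "Tags: " + ", ".join(tags),
        ]
    )
    answer = model(model_input).removeprefix("A:").strip()
    # Assumption: the answer opens with "Yes" when related, e.g. "Yes, it is related."
    related = answer.lower().startswith("yes")
    return (not related), answer
```

In practice a stricter output format (for example, a fixed first token) would make the judgment easier to parse than free text.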
[0041] In the processing of the determination unit 52 described above, an example was described in which the generated summary and tags, and the new utterance content are input to the trained model 63 to determine the relevance, but the present invention is not limited to this. For example, when the determination unit 52 generates summaries and tags for new utterances after the summary generation time point, the determination unit 52 can determine the presence or absence of relevance using both the summary and tags before the summary generation time point and the summary and tags after the summary generation time point. Further, for example, the determination unit 52 can calculate the degree of relevance between the tags of the conversation content before the summary generation time point and the tags of the new utterance content after the summary generation time point, and determine that there is no relevance when the degree of relevance is less than or equal to a threshold value.
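The source does not say how the degree of relevance between tag sets is calculated. As one possible stand-in, the sketch below uses the Jaccard index (shared tags divided by all distinct tags); the function names and the threshold of 0.2 are hypothetical values chosen for illustration.

```python
def tag_relevance(previous_tags: list[str], new_tags: list[str]) -> float:
    """Jaccard index of two tag sets, compared case-insensitively (an assumed measure)."""
    a = {t.strip().lower() for t in previous_tags}
    b = {t.strip().lower() for t in new_tags}
    if not a or not b:
        return 0.0  # nothing to compare: treat as no overlap
    return len(a & b) / len(a | b)


def is_unrelated(previous_tags: list[str], new_tags: list[str], threshold: float = 0.2) -> bool:
    """No relevance when the degree of relevance is less than or equal to the threshold."""
    return tag_relevance(previous_tags, new_tags) <= threshold
```

Exact string overlap is a crude measure; an embedding-based similarity could be substituted without changing the threshold comparison.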
[0042] Specific examples of how the determination unit 52 determines the end of a topic are described below. For example, the determination unit 52 determines that the topic has ended when the summary and tags generated by the generation unit 51 are not related to the content of a new utterance made after the dialogue has been interrupted for a predetermined time or longer. Figure 7 is a diagram showing an example of the processing content of the determination unit according to the embodiment. For the dialogue content (1) to (6) between the user and the dialogue device 1, Figure 7 shows an example of determining the relevance between the summary of (1) to (4), generated before the predetermined time elapsed, and the new utterance (5), made after the predetermined time elapsed.
[0043] As shown in Figure 7, the determination unit 52 obtains the summary "The user is playing the piano..." of the dialogue content (1) to (4), which was generated upon detecting that 10 seconds or more elapsed between (4) and (5). The determination unit 52 then obtains the original text of the new utterance (5), "What is the weather in Tokyo tomorrow?", made after the 10 seconds or more elapsed.
[0044] The determination unit 52 then sets a prompt sentence for determining the relevance between the summary of (1) to (4) and the new utterance (5), inputs both texts to the trained model 63, and obtains an output indicating that the two texts are not related. Because the two texts are not related, the determination unit 52 determines that the topic switched across the gap of 10 seconds or more and that the topic preceding the gap has ended.
[0045] As another example, the determination unit 52 determines that the topic has ended when the summary and tags generated by the generation unit 51 are not related to the content of the user's new utterance following a response by the dialogue device 1. Figure 8 is a diagram showing an example of the processing content of the determination unit according to the embodiment. For the dialogue content (1) to (4) between the user and the dialogue device 1, Figure 8 shows an example of determining the relevance between the summary of one exchange, (1) to (2), and the new utterance (3) that follows that exchange.
[0046] As shown in Figure 8, the determination unit 52 obtains the summary "The user is playing the piano..." of the dialogue content (1) to (2), which was generated after the single exchange of (1) and (2). The determination unit 52 then obtains the original text of the new utterance (3), "Let's make it a metronome.", made after that exchange.
[0047] The determination unit 52 then sets a prompt sentence for determining the relevance between the summary of (1) to (2) and the new utterance (3), inputs both texts into the trained model 63, and obtains an output indicating that the two texts are related. Because the two texts are related, the determination unit 52 determines that the topic did not switch with the new utterance and that the topic of the exchange has not ended.
[0048] When it is determined by the determination unit 52 that the topic has not ended, for example, the dialogue device 1 continues to respond regarding the ongoing topic and continues the dialogue until the timing when the next summary is generated. Then, when the timing when the next summary is generated arrives, the dialogue device 1 generates a summary of the dialogue content from the beginning to the current point of the ongoing topic and determines the presence or absence of relevance with the new dialogue. That is, when it is determined that the topic has not ended, the dialogue device 1 continues the dialogue with the user and generates a summary of all the dialogue content of the ongoing topic including the content of the previously generated summary for the generation of the next summary.
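The behavior described here, where each new summary covers the ongoing topic from its beginning rather than only the latest turns, can be sketched as a small buffer. The class name `OngoingTopic` and its methods are hypothetical, and the summarizer is any callable from text to text standing in for the trained model 63.

```python
from typing import Callable, Optional


class OngoingTopic:
    """Accumulates the turns of the current topic and re-summarizes all of them."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize
        self._turns: list[str] = []
        self.latest_summary: Optional[str] = None

    def add_turn(self, text: str) -> None:
        """Record one utterance or response belonging to the ongoing topic."""
        self._turns.append(text)

    def refresh_summary(self) -> str:
        """Summarize every turn from the start of the topic up to now."""
        self.latest_summary = self._summarize("\n".join(self._turns))
        return self.latest_summary
```

Because the buffer is never trimmed while the topic continues, the content covered by an earlier summary is always included in the next one.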
[0049] Furthermore, the determination unit 52 can also determine that the conversation has ended when the dialogue device 1 begins guiding the user to their destination. Figure 9 is a diagram showing an example of the processing content of the determination unit according to the embodiment. Figure 9 shows an example of processing in which the dialogue device 1, which has a route guidance function, determines the end of the conversation when it conducts a conversation to guide the user to the target store.
[0050] As shown in Figure 9, in response to the user's utterance, "Tell me about ramen restaurants near Kawagoe Station," the dialogue device 1 introduces "1. Ramen ○○ya Kawagoe branch, 2. Kawagoe △△ken, 3. Maguro Ramen □□." Then, in response to the user's utterance, "I think I'll go to number 2," the dialogue device 1 responds, "I will guide you to Kawagoe △△ken." In this situation, the determination unit 52 determines that the response, "I will guide you to Kawagoe △△ken," indicates the start of guidance to the destination, and determines that the conversation up to the start of guidance has ended.
[0051] For example, the determination unit 52 determines that a topic has ended when it detects a predetermined word that suggests the end of the topic. To explain with a specific example, in a conversation with the user, the determination unit 52 determines that a topic has ended when it detects a word that suggests the end of the previous topic, such as "thank you," "I understand," or "it's okay."
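A word-based check of this kind could look like the sketch below. The three phrases are the examples given in the text; the function name, the case-insensitive substring matching, and the apostrophe normalization are assumptions made for illustration.

```python
# Example phrases from the text; an actual device would use its own predetermined list.
END_PHRASES = ("thank you", "i understand", "it's okay")


def suggests_topic_end(utterance: str) -> bool:
    """Return True when the utterance contains a phrase that suggests the topic has ended."""
    normalized = utterance.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return any(phrase in normalized for phrase in END_PHRASES)
```

Substring matching is deliberately simple and can misfire, for example on "no thank you, show me more", so it may be combined with the model-based relevance check.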
[0052] The holding unit 53 holds the summary generated by the generation unit 51 when the determination unit 52 determines that the topic has ended. For example, when the determination unit 52 determines that the topic has ended, the holding unit 53 stores the summary and tags used in the processing by the determination unit 52 in the summary data 62. In this way, the holding unit 53 can retain the summary and tags that have been finalized by the determination that the topic has ended.
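The role of unit 53, committing a summary only once the topic is judged to have ended, can be sketched as follows. The class name `HoldingUnit`, the method `on_determination`, and the dictionary layout are hypothetical, and an in-memory list stands in for the summary data 62.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class HoldingUnit:
    """Commits a summary and its tags to the summary data only when the topic has ended."""
    summary_data: list[dict] = field(default_factory=list)  # stand-in for summary data 62

    def on_determination(
        self,
        topic_ended: bool,
        summary: str,
        tags: list[str],
        generated_at: datetime,
    ) -> bool:
        """Store the record if the topic ended; return whether anything was stored."""
        if not topic_ended:
            return False  # topic continues: the summary stays provisional
        self.summary_data.append(
            {"generated_at": generated_at, "summary": summary, "tags": list(tags)}
        )
        return True
```

Summaries produced while the topic is still ongoing are never written, so the stored data contains one finalized entry per completed topic.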
[0053] (Specific Example) Next, a specific example of the processing content of the dialogue device 1 according to this embodiment will be described. Figures 10 to 18 are diagrams showing an example of the processing content of the dialogue device according to this embodiment. In the example in Figures 10 to 18, the content of the dialogue between the user and the dialogue device 1 and the processing content of the dialogue device 1 according to the content of the dialogue are shown in chronological order. In the specific example described below, an example of processing when a summary and tags are generated when the dialogue is interrupted for a predetermined time or longer is shown.
[0054] Figure 10 shows the content of the dialogue from the start of the conversation initiated by the user's utterance until the dialogue is interrupted for a predetermined period of time. As shown in Figure 10, the dialogue device 1 responds to the user's utterance, "What would be a good birthday present for my daughter who is in the first year of junior high school?" with, "That's a great idea for a birthday present for your daughter who is in the first year of junior high school! I've thought of a few options: 1. **Books or manga** 2. **Art supplies** 3. **Sports equipment** 4. **Accessories** 5. **Gadgets** 6. **Experience gifts**."
[0055] Then, in response to the user's utterance, "My daughter is learning piano, do you have any music-related suggestions?", the dialogue device 1 responds, "A music-related gift for your daughter who is learning piano sounds wonderful! Here are a few ideas: 1. **Sheet music** 2. **Metronome** 3. **Piano accessories** 4. **Concert tickets** 5. **Lesson notebooks**."
[0056] The dialogue device 1 then determines that no new utterances from the user have been detected for a predetermined period of time following the aforementioned response of "a music-related gift for your daughter who is learning piano."
[0057] Figure 11 shows an example of the process of generating a summary and tags for the dialogue content shown in Figure 10, based on the determination that no new utterances have been made by the user for a predetermined period of time. As shown in Figure 11, the dialogue device 1 sets the trained model 63 with the prompt "Q: Please think of a summary of the following dialogue and several tags," and inputs the dialogue content shown in Figure 10.
[0058] Then, the dialogue device 1 obtains the summary and tags of the dialogue content shown in Figure 10, which is output by the trained model 63: "A: ###Summary_The user consulted about a birthday present for his junior high school daughter, and the dialogue device suggested ideas such as books, comics, art supplies, sports equipment, accessories, gadgets, and experience gifts. Furthermore, when the user mentioned that his daughter was taking piano lessons, the dialogue device suggested music-related presents such as sheet music, a metronome, piano accessories, concert tickets, and lesson notebooks. ###Tags_-Birthday present-Junior high school student-Music-Piano-Gift ideas-Presents for children-Parent-child conversation".
[0059] If a new utterance is detected after a predetermined time has elapsed since the dialogue shown in Figure 10, the processes shown in Figure 12 and subsequent figures are executed. Figure 12 shows the content of the new utterance by the user detected after the predetermined time has elapsed. As shown in Figure 12, the dialogue device 1 detects the following new utterance from the user after the predetermined time has elapsed: "A metronome, huh? We don't have one, but if it has a cute design, maybe they'll like it. Let's get that!"
[0060] Figure 13 shows an example of a process that determines whether a topic has ended based on the detection of new user utterances. As shown in Figure 13, the dialogue device 1 sets the trained model 63 with the prompt "Q: Please tell me if the following content is related to the summary and tags," and inputs the new utterance shown in Figure 12 and the summary and tags shown in Figure 11.
[0061] The dialogue device 1 then obtains the judgment output by the trained model 63, "A: Yes, it is related. The additional information that the user selected the metronome complements the content of the summary," which indicates that the new utterance is related to the summary of the topic up to a certain time ago, and determines that the topic has not ended.
[0062] Figure 14 shows the dialogue content when the device responds to the same topic after determining that the topic has not ended. As shown in Figure 14, the dialogue device 1, upon determining that the topic has not ended, responds to the new utterance from the user shown in Figure 12 with "That's a great choice! A cute metronome would make practicing more fun. Here are some popular designs: 1. **Animal motifs** 2. **Colorful designs** 3. **Character designs**".
[0063] The dialogue device 1 then determines that no new utterances from the user have been detected for a predetermined period of time following the response regarding "popular metronome designs" mentioned above.
[0064] Figure 15 shows an example of a process that generates a summary and tags for the series of dialogues shown in Figures 10, 12, and 14, when it is determined that there have been no new utterances from the user for a predetermined period of time. As shown in Figure 15, the dialogue device 1 sets the trained model 63 with the prompt "Q: Please think of a summary of the following dialogue and several tags," and inputs the series of dialogues shown in Figures 10, 12, and 14.
[0065] Furthermore, if the dialogue device 1 has already generated a summary and tags for part of a series of dialogue contents, it can input the previously generated summary and tags together with the additional dialogue contents. For example, because the dialogue device 1 generated a summary and tags at the time shown in Figure 11, before the generation process at the time shown in Figure 15, it can input the summary and tags shown in Figure 11 in place of the dialogue contents shown in Figure 10. In other words, the dialogue device 1 can input the summary and tags from Figure 11, together with the additional dialogue contents shown in Figures 12 and 14 that follow the time shown in Figure 11, into the trained model 63.
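The alternative input described in paragraph [0065] might be assembled as in the following sketch. The labels "Previous summary:" and "Previous tags:" and the function name are assumptions made for illustration; the source only states that the earlier summary and tags are input together with the additional dialogue contents.

```python
from typing import List, Tuple

# Instruction text quoted in paragraph [0064].
SUMMARY_PROMPT = "Q: Please think of a summary of the following dialogue and several tags"


def build_incremental_prompt(
    prev_summary: str,
    prev_tags: List[str],
    new_turns: List[Tuple[str, str]],
) -> str:
    """Use the earlier summary and tags in place of the earlier turns.

    Only the turns that occurred after the earlier summary was generated
    are passed verbatim, which keeps the model input short.
    """
    parts = [
        SUMMARY_PROMPT,
        f"Previous summary: {prev_summary}",  # hypothetical label
        "Previous tags: " + ", ".join(prev_tags),  # hypothetical label
    ]
    parts += [f"{speaker}: {text}" for speaker, text in new_turns]
    return "\n".join(parts)
```

With this arrangement the input grows with the number of new turns rather than with the length of the whole dialogue.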
[0066] Then, the dialogue device 1 obtains the summary and tags of the series of dialogue contents shown in Figures 10, 12, and 14, which are output by the trained model 63: "A: ### Summary_The user consulted about a birthday present for their junior high school daughter, and the dialogue device suggested ideas such as books, comics, art supplies, sports equipment, accessories, gadgets, and experience gifts. Furthermore, when the user mentioned that their daughter was taking piano lessons, the dialogue device suggested music-related presents such as sheet music, metronomes, piano accessories, concert tickets, and lesson notebooks. The user chose a metronome, and the dialogue device introduced popular designs. ### Tags_-Birthday present-Junior high school student-Music-Piano-Gift ideas-Gifts for children-Parent-child conversation-Metronome-Design".
[0067] If a new utterance is detected after a predetermined time has elapsed since the response shown in Figure 14, the processing shown in Figure 16 and subsequent figures is executed. Figure 16 shows the content of the new utterance by the user detected after the predetermined time has elapsed. As shown in Figure 16, the dialogue device 1 detects a new utterance from the user, "What will the weather be like in Tokyo tomorrow?", after the predetermined time has elapsed.
[0068] Figure 17 shows an example of a process that determines whether a topic has ended based on the detection of new user utterances. As shown in Figure 17, the dialogue device 1 sets the trained model 63 with the prompt "Q: Please tell me if the following content is related to the summary and tags," and inputs the new utterance shown in Figure 16 and the summary and tags shown in Figure 15.
[0069] Then, the dialogue device 1 obtains the judgment output by the trained model 63, "A: No, it is not related. ### Reason for relevance - **Summary**: The user's question about the weather in Tokyo is unrelated to the summary about a birthday present for his daughter. - **Tags**: Questions about the weather are not related to birthday presents or music-related tags," which indicates that the new utterance is unrelated to the summary of the topic up to a certain time ago, and determines that the topic has ended.
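The relevance judgment of paragraphs [0060], [0061], [0068], and [0069] can be illustrated with the sketch below. It assumes that the trained model 63 is available as a callable returning a string and that its reply begins with "Yes" or "No" after an optional "A:" prefix, as in the two quoted outputs. The input labels and function names are hypothetical, and a real reply may need more robust interpretation.

```python
import re
from typing import Callable, List

# Instruction text quoted in paragraphs [0060] and [0068].
RELEVANCE_PROMPT = "Q: Please tell me if the following content is related to the summary and tags"


def build_relevance_prompt(utterance: str, summary: str, tags: List[str]) -> str:
    """Combine the instruction, the new utterance, and the stored summary and tags."""
    return "\n".join([
        RELEVANCE_PROMPT,
        f"Content: {utterance}",  # hypothetical label
        f"Summary: {summary}",
        "Tags: " + ", ".join(tags),
    ])


def topic_has_ended(
    model: Callable[[str], str],
    utterance: str,
    summary: str,
    tags: List[str],
) -> bool:
    """Return True when the model judges the new utterance unrelated."""
    reply = model(build_relevance_prompt(utterance, summary, tags)).strip()
    answer = re.sub(r"^A:\s*", "", reply)
    # A reply that starts with the word "No" is read as "not related".
    return re.match(r"no\b", answer, re.IGNORECASE) is not None
```

Any reply that does not start with "No" is treated as related, so an ambiguous reply keeps the topic open rather than finalizing the summary early.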
[0070] Figure 18 shows the dialogue content when responding with a different topic after determining that the topic has ended. As shown in Figure 18, the dialogue device 1, upon determining that the topic has ended, stores the summary and tag of "Birthday present for my daughter" shown in Figure 15 as "Summary #1" and "Tag #1" in the summary data 62.
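As a rough illustration of how finalized summaries could be numbered and held as in paragraph [0070], the sketch below keeps them in an in-memory list. The class name, method names, and storage layout are assumptions; the source does not specify how the summary data 62 is organized.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SummaryData:
    """Hypothetical stand-in for the summary data 62."""

    entries: List[Tuple[str, List[str]]] = field(default_factory=list)

    def hold(self, summary: str, tags: List[str]) -> int:
        """Store a finalized summary and tags; return its 1-based number."""
        self.entries.append((summary, list(tags)))
        return len(self.entries)

    def get(self, number: int) -> Tuple[str, List[str]]:
        """Return the summary and tags stored under 'Summary #number'."""
        return self.entries[number - 1]
```

The first call to `hold` corresponds to "Summary #1" and "Tag #1", the second to "Summary #2" and "Tag #2", and so on.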
[0071] Then, based on the determination that the topic has ended, the dialogue device 1 responds to the new utterance from the user shown in Figure 16 with "Tomorrow's weather in Tokyo will be 'partly cloudy and sunny,' with a high of 33 degrees and a low of 28 degrees."
[0072] As described above regarding the processing of the series of dialogue contents, the dialogue device 1 can appropriately determine the end of a topic by generating a summary when the dialogue is interrupted for a predetermined period of time or longer, and by determining whether or not it is related to a new utterance. As a result, the dialogue device 1 finalizes and retains the summary when it determines that a topic has ended, so it can be said that it can rationally generate summary sentences for each topic.
[0073] (Processing flow of the dialogue device 1) Next, an example of the processing procedure by the dialogue device 1 will be described using Figures 19 and 20. Figures 19 and 20 are flowcharts showing an example of the processing flow executed by the dialogue device according to the embodiment. Note that the steps in the flowcharts shown in Figures 19 and 20 can be executed in a different order, and processes may be added or omitted.
[0074] First, using Figure 19, we will explain an example of the process for generating a summary when the dialogue between the user and the dialogue device 1 is interrupted for a predetermined period of time or longer. The dialogue device 1 detects the user's utterance (S101) and then responds to the user's utterance (S102).
[0075] The dialogue device 1 then determines whether a predetermined amount of time has elapsed without a new utterance from the user (S103). If it determines that the predetermined amount of time has elapsed (S103; Yes), the dialogue device 1 generates a summary of the conversation content up to that point (S104). On the other hand, if it determines that the predetermined amount of time has not elapsed (S103; No), the dialogue device 1 returns to S101 and continues processing.
[0076] After the processing in S104, the dialogue device 1 detects a new utterance from the user after a predetermined time has elapsed (S105). The dialogue device 1 then determines whether the content of the summary and the new utterance are related (S106). If it is determined that the content of the summary and the new utterance are not related (S106; No), the dialogue device 1 retains the summary (S107). After the processing in S107, or if it is determined that the content of the summary and the new utterance are related (S106; Yes), the dialogue device 1 returns to S102 and continues processing.
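The flow of Figure 19 (S101 to S107) can be simulated with the following sketch. It is a simplification under stated assumptions: utterances arrive as a list of (time in seconds, text) pairs, the interruption is measured between consecutive utterances rather than from the response, and the summary of S104 is generated when the next utterance arrives rather than at the moment the timeout expires. The callables `respond`, `summarize`, and `is_related` are hypothetical stand-ins for the response generation and the trained model 63.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def run_timeout_flow(
    events: List[Tuple[float, str]],
    timeout: float,
    respond: Callable[[str], str],
    summarize: Callable[[List[Turn]], str],
    is_related: Callable[[str, str], bool],
) -> List[str]:
    """Return the summaries held (S107) while processing the events."""
    held: List[str] = []
    history: List[Turn] = []
    last_time = None
    for time, utterance in events:  # S101 / S105: utterance detected
        interrupted = last_time is not None and time - last_time >= timeout
        if interrupted:  # S103; Yes
            summary = summarize(history)  # S104
            if not is_related(summary, utterance):  # S106; No
                held.append(summary)  # S107: topic ended, hold summary
                history = []  # the new topic starts with an empty history
        history.append(("user", utterance))
        history.append(("device", respond(utterance)))  # S102
        last_time = time
    return held
```

When the utterance after an interruption is related (S106; Yes), nothing is held and the history keeps growing, so the next summary covers the whole topic.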
[0077] Next, using Figure 20, we will explain an example of the process for generating a summary when one round of dialogue has taken place between the user and the dialogue device 1. First, the dialogue device 1 detects the user's utterance (S201). Then, the dialogue device 1 responds to the user's utterance (S202).
[0078] The dialogue device 1 then generates a summary of the conversation with the user (S203). The dialogue device 1 then detects any new utterances from the user after the response (S204). The dialogue device 1 then determines whether the content of the summary and the new utterances are related (S205). If it determines that the content of the summary and the new utterances are not related (S205; No), the dialogue device 1 retains the summary (S206) and returns to S202 to continue processing.
[0079] On the other hand, if the dialogue device 1 determines that the content of the summary and the new utterance are related (S205; Yes), it responds to the user's new utterance (S207). The dialogue device 1 then generates a new summary from the dialogue content regarding the summary and the new utterance (S208), and returns to S204 to continue processing.
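The per-round flow of Figure 20 (S201 to S208) might look like the sketch below. As before, `respond`, `summarize`, and `is_related` are hypothetical callables standing in for the response generation and the trained model 63, and the dialogue is reduced to a list of user utterances for illustration.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def run_per_turn_flow(
    utterances: List[str],
    respond: Callable[[str], str],
    summarize: Callable[[List[Turn]], str],
    is_related: Callable[[str, str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Return (held summaries, summary of the still-open topic)."""
    held: List[str] = []
    history: List[Turn] = []
    summary: Optional[str] = None
    for utterance in utterances:  # S201 / S204: utterance detected
        if summary is not None and not is_related(summary, utterance):
            held.append(summary)  # S205; No -> S206: hold the summary
            history = []  # the unrelated utterance opens a new topic
        history.append(("user", utterance))
        history.append(("device", respond(utterance)))  # S202 / S207
        summary = summarize(history)  # S203 / S208: summary after each round
    return held, summary
```

Compared with the timeout-based flow, this variant calls the model after every round, trading more model invocations for a topic boundary that does not depend on a pause in the dialogue.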
[0080] (Effects) As described above, the dialogue device 1 according to the embodiment is a dialogue device that engages in dialogue with a user and includes a generation unit 51, a determination unit 52, and a holding unit 53. The generation unit 51 uses the trained model 63 to generate a summary, or a summary and tags, of the content of the dialogue between the user and the dialogue device 1. The determination unit 52 determines from the dialogue content whether or not the topic in the dialogue has ended. The holding unit 53 holds the summary, or the summary and tags, generated by the generation unit 51 when the determination unit 52 determines that the topic has ended.
[0081] As a result, the dialogue device 1 can appropriately determine the end of a topic by generating a summary during the dialogue with the user and determining whether or not it is related to any new utterances after the summary generation. Thus, the dialogue device 1 can rationally generate summary sentences for each topic, as it determines that the topic has ended and then finalizes and retains the summary.
[0082] Furthermore, the generation unit 51 generates a summary and tags when the dialogue is interrupted for a predetermined period of time or longer, and the determination unit 52 determines that the topic has ended if the summary and tags generated by the generation unit 51 are not related to the content of any new utterances after the dialogue has been interrupted for a predetermined period of time or longer.
[0083] As a result, the dialogue device 1 can appropriately determine the end of a topic when the dialogue has been interrupted for a predetermined amount of time, since the topic usually changes when the dialogue is interrupted for a long period of time.
[0084] Furthermore, the generation unit 51 generates a summary and tags in response to a user's utterance, and the determination unit 52 determines that the topic has ended if the summary and tags generated by the generation unit 51 are not related to the content of any new utterance after the response.
[0085] As a result, the dialogue device 1 determines whether the topic has ended after each round of dialogue with the user, thus enabling it to appropriately determine when the topic has ended.
[0086] Furthermore, if the dialogue device 1 has a route guidance function that guides the user to their destination, the determination unit 52 determines that the topic has ended when it starts guiding the user to their destination. As a result, when guidance to the destination is started, the dialogue device 1 can appropriately determine the end of the topic because it considers that the previous dialogue content was related to guidance to the destination, and the topic has ended with the start of the guidance.
[0087] Furthermore, the determination unit 52 determines that a topic has ended when it detects a predetermined word that suggests the end of the topic. As a result, the dialogue device 1 can determine that the previous topic has ended based on words such as "thank you" or "I understand" from the user, thus appropriately determining the end of a topic.
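A minimal sketch of the word-based check in paragraph [0087] is given below. The two phrases are the examples named in that paragraph; the case-insensitive substring matching and the function name are assumptions, and a practical word list would need to be chosen for the target language.

```python
from typing import Sequence

# Example end-of-topic phrases taken from paragraph [0087].
END_WORDS = ("thank you", "i understand")


def contains_end_word(utterance: str, end_words: Sequence[str] = END_WORDS) -> bool:
    """Return True if the utterance contains a predetermined end-of-topic phrase."""
    text = utterance.lower()
    return any(word in text for word in end_words)
```

Plain substring matching is deliberately naive: it would also fire on a phrase such as "No, thank you, show me other options", so a stricter rule may be preferable in practice.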
[0088] Furthermore, the generation unit 51 generates the summary, or the summary and tags, such that information about the environment in which the dialogue with the user took place is included in the summary, in the tags, or in both. As a result, the summaries and tags held by the dialogue device 1 contain content indicating the situation in which the dialogue with the user took place. This allows the dialogue device 1 to easily find the summaries and tags of a specific dialogue situation when searching the stored summaries and tags.
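The idea in paragraph [0088] of attaching environment information and later searching by it can be sketched as follows. The environment keys ("place", "time_of_day"), the "key:value" tag format, and both function names are hypothetical; the source does not specify how the environment information is represented.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, List[str]]  # (summary, tags)


def add_environment_tags(tags: List[str], environment: Dict[str, str]) -> List[str]:
    """Append one 'key:value' tag per environment item (assumed format)."""
    return list(tags) + [f"{key}:{value}" for key, value in sorted(environment.items())]


def search_by_tag(entries: List[Entry], tag: str) -> List[str]:
    """Return the summaries whose tag list contains the given tag."""
    return [summary for summary, tags in entries if tag in tags]
```

For example, a summary stored with the tag "place:car" can be retrieved later by searching for that tag, independently of its topic tags.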
[0089] [Embodiment 2] Embodiments of the present invention have been described above, but the present invention may be implemented in various forms other than those described.
[0090] (Examples of Application) In the above embodiment, the dialogue device 1 was described as an in-vehicle device, but it is not limited to this. The dialogue device 1 can be used not only as an in-vehicle device, but also as various other devices that can interact with a user. Here, we will illustrate some examples of applications for the dialogue device 1.
[0091] Figure 21 shows an example of how the dialogue device 1 according to the embodiment is used. For example, as shown in Figure 21(1), the dialogue device 1 can be used with a smart speaker. In this case, the user activates a smart speaker having the same functions as the dialogue device 1 and speaks to the smart speaker. Upon receiving this speech, the smart speaker executes the processes described in the above embodiment and responds to the user's speech.
[0092] Furthermore, as shown in Figure 21 (2), for example, the dialogue device 1 can be used in a smartphone dialogue application or the like. In this case, the user launches a dialogue application or the like that has the same functions as the dialogue device 1 and speaks to the smartphone. The dialogue application or the like that receives this utterance executes the processes described in the above embodiment and responds to the user's utterance. Note that the dialogue device 1 is not limited to smartphones, but can also be applied to mobile phones, tablets, etc.
[0093] Furthermore, as shown in Figure 21 (3), for example, the dialogue device 1 can be used in a personal computer. In this case, the user starts dialogue software or the like that has the same functions as the dialogue device 1 and speaks to the personal computer. The dialogue software or the like that receives this speech executes the processes described in the above embodiment and responds to the user's speech. Note that the dialogue device 1 is not limited to personal computers, but can be similarly applied to various computers such as game consoles with internet communication capabilities.
[0094] (Variation) The trained model 63 of the dialogue device 1 can determine the content of the instruction from the context and respond even without inputting the instruction "Q:". Furthermore, the symbols "A", "#", and "*" in the output of the trained model 63 are merely examples and do not limit the content of the output of the trained model 63.
[0095] (Numerical values, etc.) The numerical values, graphs, and thresholds (for example, arbitrary numerical values) used in the above embodiment are merely examples and can be changed as desired.
[0096] [Hardware Configuration] The dialogue device 1 according to the above embodiment is also realized by a computer 1000 having the configuration shown in Figure 22. Figure 22 is a hardware configuration diagram showing an example of a computer that realizes the functions of the dialogue device according to the embodiment. The computer 1000 has a CPU 1100, RAM 1200, ROM 1300, HDD 1400, communication interface (I/F) 1500, input/output interface (I/F) 1600, and media interface (I/F) 1700.
[0097] The CPU 1100 operates based on programs stored in the ROM 1300 or HDD 1400, and controls various parts. The ROM 1300 stores boot programs executed by the CPU 1100 when the computer 1000 starts up, as well as programs that depend on the computer 1000's hardware.
[0098] The HDD 1400 stores programs executed by the CPU 1100, and data used by such programs. The communication interface 1500 receives data from other devices via a predetermined communication network and sends it to the CPU 1100, and transmits data generated by the CPU 1100 to other devices via the predetermined communication network.
[0099] The CPU 1100 controls output devices such as displays and printers, and input devices such as keyboards and mice, via the input/output interface 1600. The CPU 1100 acquires data from input devices via the input/output interface 1600. The CPU 1100 also outputs the generated data to output devices via the input/output interface 1600.
[0100] The media interface 1700 reads a program or data stored in the recording medium 1800 and provides it to the CPU 1100 via the RAM 1200. The CPU 1100 loads the program from the recording medium 1800 onto the RAM 1200 via the media interface 1700 and executes the loaded program. The recording medium 1800 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase Change Rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.
[0101] For example, when the computer 1000 functions as the dialogue device 1, the CPU 1100 of the computer 1000 implements functions such as those of the control unit 5 by executing programs loaded onto the RAM 1200. The CPU 1100 of the computer 1000 reads and executes these programs from the recording medium 1800, but as another example, these programs may be obtained from other devices via a predetermined communication network.
[0102] [Other] Furthermore, among the processes described in each of the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above document and drawings can be changed at will unless otherwise specified. For example, the various information shown in each figure is not limited to the information shown.
[0103] Furthermore, the components of each illustrated device are functionally conceptual and do not necessarily need to be physically configured as shown. In other words, the specific forms of distribution and integration of each device are not limited to those shown, and all or part of them can be functionally or physically distributed and integrated in any unit according to various loads and usage conditions.
[0104] Furthermore, the above embodiments can be combined as appropriate, provided that the processing content is not contradictory.
[0105] Although some embodiments of the present invention have been described in detail above with reference to the drawings, these are illustrative examples, and the present invention can be implemented in various other forms with modifications and improvements based on the knowledge of those skilled in the art, starting with the embodiments described in the disclosure section of the invention.
[0106] Furthermore, the terms "section, module, unit" mentioned above can be replaced with "means" or "circuit," etc. For example, the receiving unit can be replaced with receiving means or receiving circuit.
[0107]
1 Dialogue device
2 Communication unit
3 Voice input unit
4 Output unit
5 Control unit
51 Generation unit
52 Determination unit
53 Holding unit
6 Memory unit
61 Dialogue data
62 Summary data
63 Trained model
Claims
1. A dialogue device for interacting with a user, comprising: a generation unit that generates a summary of the content of the dialogue with the user, or a summary and tags, using a trained model; a determination unit that determines from the content of the dialogue whether or not the topic in the dialogue has ended; and a holding unit that holds the summary or summary and tags generated by the generation unit when the determination unit determines that the topic has ended.
2. The dialogue device according to claim 1, characterized in that the generation unit generates the summary or summary and tags when the dialogue is interrupted for a predetermined time or longer, and the determination unit determines that the topic has ended when the summary or summary and tags generated by the generation unit is not related to the content of a new utterance by the user after the dialogue has been interrupted for a predetermined time or longer.
3. The dialogue device according to claim 1, characterized in that the generation unit generates the summary or summary and tags when the dialogue device responds to the user's utterance, and the determination unit determines that the topic has ended when the summary or summary and tags generated by the generation unit is not related to the content of a new utterance by the user after the response.
4. The dialogue device according to claim 1, wherein the dialogue device has a route guidance function that provides directions to the user's destination, and the determination unit determines that the topic has ended when it starts guiding the user to their destination.
5. The dialogue device according to claim 1, characterized in that the determination unit determines that the topic has ended when a predetermined word indicating the end of the topic is detected.
6. The dialogue device according to claim 1, characterized in that the generation unit generates the summary, or the summary and tags, such that information about the environment in which the dialogue between the user and the dialogue device took place is included in the summary, or in either or both of the summary and the tags.
7. A dialogue method performed by a dialogue device, comprising: a generation step of generating a summary of the content of a dialogue with a user, or a summary and tags, using a trained model; a determination step of determining from the content of the dialogue whether or not the topic in the dialogue has ended; and a holding step of holding the summary generated by the generation step if the determination step determines that the topic has ended.
8. A dialogue program characterized by causing a computer to execute the following steps in an interaction with a user: a generation step that generates a summary of the content of the interaction with the user, or a summary and tags, using a trained model; a determination step that determines from the content of the interaction whether or not the topic in the interaction has ended; and a retention step that, if the determination step determines that the topic has ended, retains the summary generated by the generation step.