system
A wearable device uses speech recognition and context analysis to generate conversation topics, addressing silences and interruptions, ensuring smooth and engaging interactions.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-19
AI Technical Summary
Silences or interruptions in conversations lead to uncomfortable situations and degrade the quality of communication, making it difficult to continue the conversation naturally.
A wearable device that records ambient sounds, converts them into text data using speech recognition, analyzes the conversation context, acquires information about the conversation partner, and generates new topics to facilitate smooth conversation.
Enables users to continue conversations naturally without interruptions by suggesting appropriate topics based on context and partner information, improving communication quality.
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In human communication, silence or interruption of the conversation often creates an uncomfortable situation and is a factor that degrades the quality of the conversation experience. In particular, when the content of the conversation runs out, it is difficult to immediately find a new topic, and the conversation may break off. It is desirable to provide a method for improving such a situation and enabling users to continue a natural and smooth conversation.
Means for Solving the Problems
[0005] This invention provides a device worn by the user that records ambient sounds and converts the audio into text data using speech recognition technology. Based on this text data, the device analyzes the context of the conversation and obtains information about the conversation partner from external sources to generate new topics suitable for the current conversation. Furthermore, by presenting the generated topics to the user, the conversation can continue naturally without interruption. This makes it possible to facilitate smooth conversation and improve the quality of communication.
[0006] "Voice acquisition means" refers to a function that allows a device worn by the user to record ambient sounds in real time and acquire that voice information.
[0007] "Speech recognition means" refers to a technology that converts recorded speech data into text data, enabling language analysis using a speech recognition engine.
[0008] "Contextual analysis means" refers to a function that uses natural language processing technology to understand the flow and topic of a conversation based on text data converted by speech recognition means.
[0009] "Information acquisition means" refers to the function of retrieving data via a communication network in order to obtain the profile and related information of the person being spoken to from an external information source.
[0010] "Topic generation means" refers to a technology that generates new conversation topics suitable for the user based on the results of the context analysis means and the data obtained by the information acquisition means.
[0011] "Presentation means" refers to an interface that facilitates conversation by naturally suggesting the generated topics to the user, either audibly or visually.
Brief Description of the Drawings
[0012]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.
Modes for Carrying Out the Invention
[0013] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.
[0014] First, the terms used in the following description will be explained.
[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as the "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0016] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.
[0017] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0019] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes it on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes it on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] This invention is a system that uses an ear-worn device worn by the user to prevent silence during conversations and enable smooth communication with the other party. This system consists of a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means.
[0034] Description of the Embodiment
[0035] When a user wears the device and begins a casual conversation, the device first records the conversation using a voice acquisition tool. The recorded audio is then converted into text data in real time by a speech recognition tool. This text data is sent to a server, where a context analysis tool analyzes the flow of the conversation and important themes.
[0036] The server further uses information retrieval methods to obtain public profile information and related data of the conversation partner via external networks. This data can be obtained from public domain sources such as social networking services and business profiles.
[0037] Based on the acquired conversation context and information about the conversation partner, the server uses a topic generation mechanism to generate new conversation topics that can be suggested to the user. For example, new topics may be generated based on places the conversation partner has recently visited or topics they are interested in. This process allows the user to receive appropriate topics naturally when the conversation tends to stall.
[0038] The generated topics are communicated to the user through presentation methods. The device utilizes audio and display devices to present new topics to the user in an easy-to-understand and natural way.
[0039] Specific example
[0040] One evening, a user was talking with a friend at a cafe. When the conversation seemed to falter, the device started recording and transcribed the audio in real time. The server extracted the keywords "coffee" and "travel," and through its information acquisition tools, discovered that the friend had made a social media post about coffee that morning. The server used this information to generate a topic about "new coffee beans." The device then suggested to the user via voice, "Why don't you ask your friend about the coffee they were talking about on social media this morning?" The user was then able to restart the conversation based on that topic and continue communicating in a natural flow.
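The matching between conversation keywords and the partner's public posts in this example can be illustrated with a minimal sketch. It assumes a simple frequency count over the transcript and plain substring matching against the posts; the stopword list, the function names `extract_keywords` and `match_posts`, and the sample data are all hypothetical and are not taken from the disclosure. A real context analysis means would likely use a fuller natural language processing pipeline.

```python
import re
from collections import Counter

# Hypothetical stopword list (an assumption for this sketch only).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "about", "are", "was", "you"}


def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the most frequent non-stopword tokens in the transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def match_posts(keywords: list[str], posts: list[str]) -> list[tuple[str, str]]:
    """Pair each keyword with the public posts that mention it (substring match)."""
    matches = []
    for keyword in keywords:
        for post in posts:
            if keyword in post.lower():
                matches.append((keyword, post))
    return matches


# Made-up transcript and posts mirroring the cafe scenario.
transcript = "The coffee here is great. I want more coffee before we plan travel. Travel is fun."
posts = ["Tried new coffee beans this morning!", "Rainy day at the office."]

keywords = extract_keywords(transcript)
matches = match_posts(keywords, posts)
print(matches)
```

Substring matching is chosen here only for brevity; it can produce false positives on short keywords, which is one reason a production system would prefer token-level or semantic matching.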
[0041] The following describes the processing flow.
[0042] Step 1:
[0043] The device monitors the conversation and starts recording when it detects specific keywords or silences. Recording proceeds in real time and continues even in the middle of a conversation.
[0044] Step 2:
[0045] The device sends the recorded audio data to the speech recognition engine. The speech recognition engine converts the audio into text data, thereby digitizing the linguistic information.
[0046] Step 3:
[0047] The server receives text data from the terminal. The server uses context analysis tools and applies natural language processing techniques to understand the topic and flow of the conversation.
[0048] Step 4:
[0049] The server uses information retrieval methods to obtain relevant information from the conversation partner's social media accounts and public profiles via an external network. This allows the server to collect information about the other person's interests and recent activities.
[0050] Step 5:
[0051] The server combines the analysis results of the context analysis means with the data collected by the information acquisition means, and the topic generation means creates a new topic. This involves using a generative AI model to form a topic suitable for conversation.
[0052] Step 6:
[0053] The terminal receives generated topics from the server and presents them to the user using speech synthesis technology and a display. The presentation is designed to be natural.
[0054] Step 7:
[0055] Based on the device's suggestions, the user can resume the conversation with the other party and continue communication smoothly. This prevents interruptions in the conversation and improves the quality of interaction.
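The silence trigger mentioned in Step 1 can be sketched as follows. This is only one possible approach: it assumes audio arrives as frames of samples normalized to the range [-1, 1] and treats a run of low-energy frames as silence. The threshold, the frame count, and the function names are hypothetical values chosen for illustration; the disclosure does not specify how silence is detected, and the keyword trigger is not covered here.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def silence_detected(
    frames: list[list[float]],
    threshold: float = 0.02,      # assumed energy threshold
    min_silent_frames: int = 3,   # assumed run length that counts as a silence
) -> bool:
    """Return True once enough consecutive frames fall below the threshold."""
    run = 0
    for frame in frames:
        run = run + 1 if rms(frame) < threshold else 0
        if run >= min_silent_frames:
            return True
    return False


# Made-up frames: one "speech" frame and one near-silent frame.
speech = [0.5, -0.5, 0.4, -0.4]
quiet = [0.001, -0.001, 0.0, 0.001]

print(silence_detected([speech, quiet, quiet, speech, quiet]))  # silence run is interrupted
print(silence_detected([speech, quiet, quiet, quiet]))          # three quiet frames in a row
```

Requiring several consecutive quiet frames, rather than a single one, avoids triggering on the short pauses that occur naturally between words.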
[0056] (Example 1)
[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0058] In everyday conversations, there is a challenge in keeping the conversation flowing smoothly without interruptions. In particular, when a conversation stalls, it is not easy to quickly find a topic appropriate for that moment. There is a need for a system that allows users to naturally introduce topics and continue the conversation smoothly in such situations.
[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0060] In this invention, the server includes: a voice acquisition means worn by the user to acquire ambient sounds; a voice recognition means to convert the voice obtained by the voice acquisition means into digital data; a context analysis means to analyze the digital data and identify the flow of conversation and main topics; an information acquisition means to collect publicly available information of the conversation partner from an external information source via a communication means; a topic generation means to create a new topic based on the data obtained by the context analysis means and the information acquisition means; and a display means to present the topic constructed by the topic generation means to the user. This makes it possible to prevent conversational stagnation and smoothly provide the user with appropriate topics.
[0061] "Voice acquisition means" refers to a device or mechanism worn by the user that physically collects ambient sounds.
[0062] "Speech recognition means" refers to a technology or system that analyzes acquired speech data and converts it into corresponding text data.
[0063] "Contextual analysis means" refers to a process or method that analyzes text data generated by speech recognition means to identify the flow of conversation and the main topics.
[0064] "Information acquisition means" refers to technology or equipment for collecting publicly available profile information and related data of a conversation partner via an external communication network.
[0065] "Topic generation means" refers to a process or algorithm that generates new conversation topics based on data obtained by context analysis means and information acquisition means.
[0066] "Display means" refers to a device or interface for presenting a topic constructed by a topic generation means to a user.
[0067] This invention is a dialogue support system for maintaining a smooth flow of conversation, and comprises a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means. This system is implemented as a device worn by the user, and its embodiments are shown below.
[0068] The user wears an ear-worn device while holding everyday conversations. The device uses the voice acquisition means to capture ambient sounds, collecting voice data through a built-in microphone. The acquired voice data is converted into text data in real time by the speech recognition means. High-precision speech recognition can be achieved using, for example, the Google® Cloud Speech-to-Text API.
[0069] Subsequently, the terminal sends the converted text data to the server, which analyzes the text data using contextual analysis tools to identify the flow of the conversation and the main topics. This analysis utilizes natural language processing techniques to extract context and keywords.
[0070] Furthermore, the server uses information acquisition methods to collect public profile information and related data of the person it is interacting with via an external network. This is done by utilizing SNS APIs and public databases.
[0071] Based on the acquired information and contextual analysis results, the server uses a topic generation mechanism and a generative AI model to create new and appropriate conversation topics. This process generates highly adaptable prompt sentences that users can use immediately when they are struggling to find a topic. For example, a prompt sentence such as "Please suggest some new and interesting topics about recent travel destinations or events that have been in the news" might be used.
[0072] Finally, the generated topic is notified to the user verbally or visually through a presentation mechanism. The device uses speech synthesis technology to present the suggested content to the user in a natural way. In this way, the user can prevent silences during the conversation and maintain smooth communication.
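As an illustration of how the context analysis results and the partner information might be combined into a prompt for the generative AI model, the following sketch assembles a plain-text prompt. The data classes `ConversationContext` and `PartnerProfile`, their fields, and the prompt wording are assumptions made for this example; the disclosure gives only the sample prompt sentence quoted above and does not define a prompt format.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationContext:
    """Hypothetical output of the context analysis means."""
    main_topics: list[str] = field(default_factory=list)


@dataclass
class PartnerProfile:
    """Hypothetical output of the information acquisition means."""
    interests: list[str] = field(default_factory=list)
    recent_activities: list[str] = field(default_factory=list)


def build_topic_prompt(ctx: ConversationContext, profile: PartnerProfile) -> str:
    """Assemble a prompt asking the generative model for one new topic."""
    lines = ["The conversation has stalled. Please suggest one new and interesting topic."]
    if ctx.main_topics:
        lines.append("Topics discussed so far: " + ", ".join(ctx.main_topics) + ".")
    if profile.interests:
        lines.append("The partner is interested in: " + ", ".join(profile.interests) + ".")
    if profile.recent_activities:
        lines.append("Recent public activity: " + "; ".join(profile.recent_activities) + ".")
    return "\n".join(lines)


ctx = ConversationContext(main_topics=["coffee", "travel"])
profile = PartnerProfile(
    interests=["camping"],
    recent_activities=["posted about coffee this morning"],
)
prompt = build_topic_prompt(ctx, profile)
print(prompt)
```

Empty fields are simply omitted from the prompt, so the same function still works when no public information about the partner could be obtained.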
[0073] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0074] Step 1:
[0075] Voice acquisition
[0076] When a user starts a conversation, the device activates its voice acquisition system and captures surrounding conversational audio in high quality via the microphone. The input is real-time audio, and the output is digital audio data. Specifically, it uses noise cancellation technology to suppress ambient noise while clearly recording the necessary audio.
[0077] Step 2:
[0078] Speech recognition and text conversion
[0079] The terminal supplies the acquired digital audio data to a speech recognition system, which then converts this data into text data. The input is digital audio data, and the output is real-time text data. This process uses, for example, the Google Cloud Speech-to-Text API to accurately transcribe the audio content.
[0080] Step 3:
[0081] Contextual analysis
[0082] The server receives text data sent from the terminal and analyzes its content using contextual analysis tools. The input is text data obtained through speech recognition, and the output is analytical data including the flow of the conversation and its main themes. This analysis uses natural language processing techniques to identify the central themes and related keywords of the conversation.
[0083] Step 4:
[0084] Information acquisition
[0085] The server collects additional data about the conversation partner from publicly available sources via the external network. Inputs include the conversation partner's identification information and related keywords, while outputs include their public profile and related information. Specifically, it sends queries to SNS APIs and online databases to explore the conversation partner's recent activities and interests.
[0086] Step 5:
[0087] Topic generation
[0088] The server creates a new topic using a topic generation mechanism based on the context analysis results and acquired information. The input is the analyzed context data and collected information, and the output is the proposed conversation topic. A generative AI model is used to generate prompt sentences appropriate for the user. One example of such a prompt is "Please suggest some interesting recent travel destinations."
[0089] Step 6:
[0090] Topic presentation
[0091] The generated topic is sent from the server to the terminal and notified to the user via a presentation mechanism. The input is the generated conversation topic, and the output is the presented suggestion. The terminal uses speech synthesis technology to convey the topic to the user in a natural voice, preventing interruptions in the conversation.
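The input and output of each step above can be sketched as a chain of function calls. In this sketch the speech recognition service, the context analysis, the external information source, the generative AI model, and the presentation device are all passed in as callables, so no real API is invoked; the function name `run_topic_pipeline`, the dictionary keys, and the stand-in lambdas are hypothetical and serve only to show the data flow. Step 1 (voice acquisition) is represented by the `audio` argument.

```python
from typing import Callable


def run_topic_pipeline(
    audio: bytes,
    partner_id: str,
    transcribe: Callable[[bytes], str],
    analyze: Callable[[str], dict],
    fetch_profile: Callable[[str, list], dict],
    generate: Callable[[dict, dict], str],
    present: Callable[[str], None],
) -> str:
    """Run Steps 2 to 6 once and return the suggested topic."""
    text = transcribe(audio)                                         # Step 2: audio -> text
    context = analyze(text)                                          # Step 3: text -> context data
    profile = fetch_profile(partner_id, context.get("keywords", []))  # Step 4: id + keywords -> profile
    topic = generate(context, profile)                               # Step 5: context + profile -> topic
    present(topic)                                                   # Step 6: topic -> user
    return topic


# Stand-in components for illustration; real services would replace these.
shown: list[str] = []
topic = run_topic_pipeline(
    b"\x00\x01",
    "partner-123",
    transcribe=lambda audio: "we talked about travel",
    analyze=lambda text: {"keywords": ["travel"], "flow": text},
    fetch_profile=lambda pid, keywords: {"id": pid, "interests": keywords},
    generate=lambda context, profile: "Ask about " + profile["interests"][0],
    present=shown.append,
)
print(topic)
```

Injecting each stage as a callable keeps the terminal-side and server-side responsibilities swappable, which matches the description's division of work between the terminal and the server without fixing any particular vendor.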
[0092] (Application Example 1)
[0093] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0094] In modern households, conversations among family members often become monotonous or silent, frequently hindering smooth communication. This can weaken family bonds and lead to insufficient mutual understanding in daily life. Furthermore, it's difficult to naturally generate and introduce new topics to revitalize conversation, highlighting the need for technologies to support this process.
[0095] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0096] In this invention, the server includes means for acquiring sound information, means for interpreting sound information, means for analyzing background information, means for acquiring external information, means for generating topics, means for presenting topics, and a topic detection and generation function incorporated into a device for improving in-home communication. This makes it possible to prevent stagnation in conversations among family members and to stimulate conversation in a natural way.
[0097] "Sound information acquisition means" refers to a device worn by the user that has the function of collecting ambient sounds.
[0098] A "sound information interpretation means" is a system that analyzes acoustic data collected by a sound information acquisition means and converts it into linguistic data.
[0099] A "background analysis device" is a device that has the function of understanding the context and important themes of a dialogue using converted language data.
[0100] "External information acquisition means" refers to a system that acquires information related to the interlocutor from external sources via the internet or other communication means.
[0101] A "topic generation device" is a device that has the function of generating new conversation themes based on data obtained from background analysis devices and external information acquisition devices.
[0102] A "topic presentation device" is a device that has the function of presenting or providing users with newly generated conversation topics.
[0103] "Devices for improving domestic communication" are devices designed to facilitate conversations and smooth communication among family members in a specific home environment.
[0104] This invention provides a system to support natural conversations within the home. The user-worn device incorporates a sound information acquisition means that captures ambient sounds. This device also has a sound information interpretation means that converts the acquired audio data into language data in real time. Specifically, technologies such as Google Cloud Speech-to-Text are used as the sound information interpretation means.
[0105] The language data obtained by the sound information interpretation means is sent to the server. The server uses natural language processing libraries (e.g., spaCy or NLTK) to perform background analysis and understand the flow of the conversation and important themes. Furthermore, the server obtains relevant information from publicly available APIs on the internet (e.g., Twitter API or Wikipedia API) through external information acquisition means.
[0106] Based on these analysis results and the acquired information, the server generates new conversation topics using topic generation tools, which use a generative AI model (e.g., OpenAI's GPT-4®). The generated topics are presented to users within the home through topic presentation tools. Users can then use these topics to make conversations more engaging.
[0107] As a concrete example, when a conversation between a parent and child in the living room pauses, the system suggests, "How about talking about recent trends in home cooking?" providing a natural conversation starter. An example of a prompt used in the generative AI model is, "The conversation between parent and child has paused. Please suggest a conversation topic that will interest the child. Please use recent trends in home cooking."
[0108] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0109] Step 1:
[0110] The terminal collects ambient sound using sound information acquisition means. During this process, audio data is input in real time via a microphone. The collected audio data is then prepared for processing in the next step.
[0111] Step 2:
[0112] The device uses Google Cloud Speech-to-Text as a means of interpreting sound information, converting the input speech data into language data. This process converts the speech data into text data, preparing it for transmission to the server.
[0113] Step 3:
[0114] The server receives text data sent from the terminal and analyzes it using background analysis tools. Specifically, it uses natural language processing libraries (such as spaCy and NLTK) to extract context and important themes from the text data and understand conversation patterns.
[0115] Step 4:
[0116] The server uses publicly available APIs on the internet (e.g., Twitter API and Wikipedia API) as means of acquiring external information to collect external information related to the interlocutor. It uses keywords obtained by the background analysis means as input and receives related information based on these keywords as output.
[0117] Step 5:
[0118] The server generates new conversation topics using a generative AI model (e.g., OpenAI's GPT-4) based on data obtained from background analysis and external information acquisition. It uses analysis results and external information as input and outputs a newly constructed topic.
[0119] Step 6:
[0120] The server sends topics generated through topic suggestion mechanisms to the terminal, which then presents them to the user. Specifically, it uses voice suggestions and on-screen displays to inform the user of new topics in a natural way.
[0121] Step 7:
[0122] The conversation resumes based on the topic suggested to the user. In this step, the user's responses become input for the next analysis, and a mechanism is implemented to allow the system to continuously learn and improve.
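The feedback described in Step 7, where the user's responses become input for the next analysis, can be sketched with a bounded history buffer. This is an assumption about one simple way to carry state between cycles; the class name `ConversationHistory`, the turn limit, and the idea of remembering past suggestions to avoid repeats are hypothetical and are not stated in the disclosure. No model training is shown.

```python
from collections import deque


class ConversationHistory:
    """Keeps the most recent turns and the topics already suggested."""

    def __init__(self, max_turns: int = 20) -> None:
        # Oldest turns are discarded automatically once max_turns is reached.
        self.turns = deque(maxlen=max_turns)
        self.suggested = []

    def add_turn(self, speaker: str, text: str) -> None:
        """Record one utterance, e.g. the user's response to a suggestion."""
        self.turns.append((speaker, text))

    def record_suggestion(self, topic: str) -> None:
        """Remember a topic that was presented to the user."""
        self.suggested.append(topic)

    def is_new(self, topic: str) -> bool:
        """True if this topic has not been suggested before."""
        return topic not in self.suggested

    def analysis_input(self) -> str:
        """Text handed to the next round of context analysis."""
        return "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)


history = ConversationHistory(max_turns=2)
history.add_turn("user", "How was your weekend?")
history.add_turn("friend", "Quiet, mostly.")
history.record_suggestion("new coffee beans")
history.add_turn("user", "Did you try those new coffee beans?")
print(history.analysis_input())
```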
[0123] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0124] This invention is a system that uses an ear-worn device worn by the user to compensate for silences and lack of topics during conversations, and in particular provides dialogue support that takes the user's emotional state into consideration. This invention includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, a presentation means, and an emotion engine that analyzes the user's emotional state.
[0125] Description of the Embodiment
[0126] Consider a scenario where a user wears a device and engages in a normal conversation. During the conversation, the device records surrounding sounds using a voice acquisition device. The recorded audio data is immediately converted into text data by a speech recognition device, and this text data is sent to a server.
[0127] The server uses contextual analysis tools to analyze the flow of the conversation based on this text data. Furthermore, it uses external communication networks via information acquisition tools to collect publicly available information about the conversation partner. This includes social media posts and publicly available project information.
[0128] A key feature of this invention is that the emotion engine monitors the user's voice and speech patterns to identify their emotional state. This emotional state, combined with the aforementioned analysis results, helps the topic generation means determine the topic best suited to the user's mood. For example, if the user is relaxed, adjustments such as selecting a light topic are made.
[0129] The emotion engine is also designed to analyze the user's facial expressions in real time, using this additional information to improve the accuracy of the emotional state estimate. This data is collected on a server and used to support the generation of topics that best reflect the user's current situation.
[0130] The generated topics are delivered to the user via the device. The device suggests topics to the user in a natural way through speech synthesis and display. This allows the user to continue the conversation smoothly with the other person.
[0131] Specific example
[0132] Consider a scenario where a user is talking to a friend in a park. If there is a silence in the conversation, the device starts recording the audio and transcribes it into text using speech recognition. The server extracts specific keywords and learns from social media that the person the user is talking to has recently been planning a camping trip. The emotion engine also detects that the user is in a calm mood. Based on this, the server generates a topic about "the latest camping gear," and the device suggests to the user via voice, "Why don't you ask your friend about the latest camping items?", allowing the user to resume the conversation in a natural way.
[0133] The following describes the processing flow.
[0134] Step 1:
[0135] The device detects conversational audio around the user and starts recording when it detects trigger keywords or silence. Recording continues in real time, even while the conversation is ongoing.
[0136] Step 2:
[0137] The device sends the recorded audio data to a speech recognition engine, which converts the audio into text data. This text data is then used for later analysis and information extraction.
[0138] Step 3:
[0139] The server analyzes the text data received from the terminal through contextual analysis tools to understand the flow of the conversation and its main topics. This allows for an understanding of the direction of the conversation.
[0140] Step 4:
[0141] The server uses information retrieval methods to obtain the conversation partner's social media and public information via an external network. This reveals the conversation partner's recent activities and areas of interest.
[0142] Step 5:
[0143] The device sends the user's voice and voice patterns to the emotion engine in real time to evaluate their emotional state. If necessary, a facial expression sensor is also used for more accurate emotion recognition.
[0144] Step 6:
[0145] The server combines the results of the context analysis means, the data from the information acquisition means, and the emotional state determination by the emotion engine to activate the topic generation means and generate new conversation topics suitable for the user.
[0146] Step 7:
[0147] The device presents generated topics to the user audibly or visually, and suggests new conversation topics. The presentation takes into account the user's emotional state and the flow of the conversation.
[0148] Step 8:
[0149] By accepting the device's suggestions and incorporating new topics into the conversation, users can maintain smooth communication with the other person, improving the continuity and quality of the conversation.
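The trigger condition in Step 1 (start recording on a trigger keyword or a silence) can be sketched as follows. This is only a minimal illustration: the RMS threshold, the number of quiet frames, the keyword list, and all function names are hypothetical and do not appear in the source, and the transcript is assumed to come from some lightweight on-device keyword spotter.

```python
import math
from typing import Iterable, Sequence

# Hypothetical values; the source does not specify thresholds or keywords.
SILENCE_RMS_THRESHOLD = 500.0   # amplitude below which a frame counts as quiet
SILENCE_FRAMES_REQUIRED = 3     # consecutive quiet frames that count as a silence
TRIGGER_KEYWORDS = ("well", "anyway")


def frame_rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silence_detected(frames: Iterable[Sequence[int]]) -> bool:
    """True once enough consecutive frames fall below the RMS threshold."""
    quiet_run = 0
    for frame in frames:
        if frame_rms(frame) < SILENCE_RMS_THRESHOLD:
            quiet_run += 1
            if quiet_run >= SILENCE_FRAMES_REQUIRED:
                return True
        else:
            quiet_run = 0  # any loud frame resets the run
    return False


def keyword_detected(transcript: str) -> bool:
    """True if any trigger keyword appears as a whole word in the transcript."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return any(k in words for k in TRIGGER_KEYWORDS)


def should_start_recording(frames: Iterable[Sequence[int]], transcript: str) -> bool:
    """Step 1: start recording on a trigger keyword or a silence."""
    return keyword_detected(transcript) or silence_detected(frames)
```

Counting consecutive quiet frames rather than reacting to a single quiet frame is one way to avoid treating a short pause between words as a conversational silence.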
[0150] (Example 2)
[0151] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0152] Silences and a lack of topics during conversations can make it difficult for users to engage in dialogue. Furthermore, offering topics that don't align with the user's emotional state can lead to unnatural conversations. There is a need for a system that addresses these challenges and supports smooth and natural dialogue.
[0153] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0154] In this invention, the server includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, an emotion analysis means, a topic generation means, and a presentation means. This reduces silence during the user's conversation and enables natural topic provision according to their emotional state.
[0155] "Voice acquisition means" refers to a means of collecting ambient sounds using a device worn by the user.
[0156] "Voice recognition means" refers to a means of converting the collected voice data into text data.
[0157] "Context analysis means" refers to a means of analyzing the text data to understand the flow and context of a conversation.
[0158] "Information acquisition means" refers to a means of collecting information about the conversation partner from external sources.
[0159] "Emotion analysis means" refers to a means of analyzing the user's voice and speech patterns to identify their emotional state.
[0160] "Topic generation means" refers to a means of generating new topics based on the context analysis results, the information acquisition results, and the emotion analysis results.
[0161] "Presentation means" refers to a means of suggesting the generated topics to the user in a natural way.
[0162] This invention is a system that utilizes an ear-worn device worn by the user to facilitate smooth conversation. In particular, it focuses on providing dialogue support while taking into account the user's emotional state. A detailed embodiment is shown below.
[0163] Hardware and software to be used
[0164] A terminal device is used. This includes a voice acquisition means for collecting and recording ambient sounds. This voice data is converted into text data in real time by a speech recognition means. The converted text data is sent to a server.
[0165] The server performs advanced analysis. Based on the received text data, it analyzes the flow of the conversation using contextual analysis tools. Furthermore, it collects publicly available information about the conversation partner via external communication networks through information acquisition tools.
[0166] Meanwhile, the emotion analysis means analyzes the user's voice patterns to identify the user's emotional state. Information about the emotional state is an important factor in determining the topic of conversation.
[0167] Based on these results, the server uses a topic generation mechanism to have the generative AI model generate topics that match the user's emotions and the context of the conversation.
[0168] The generated topics are presented to the user through the device's display mechanisms. Because they are displayed or spoken in a natural way using speech synthesis technology, the user can engage in smooth conversations.
[0169] Specific example
[0170] For example, consider a situation where a user is talking with a friend in a park and a silence occurs. At this point, the device immediately starts recording the audio, converts it into text data, and sends it to the server. The server analyzes the context of the conversation and the user's emotional state, and selects a topic, for example, "the latest camping gear." Based on this, the device can suggest to the user, "Why don't you ask your friend about the latest camping items?"
[0171] Example of a prompt
[0172] An example of a prompt to input into the generative AI model is, "Generate an appropriate topic for a situation in which the user is relaxed and the conversation partner is known to have recently planned a camping trip." Based on this, topics that maintain a natural flow of conversation will be generated.
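As a rough illustration of how such a prompt might be assembled from the emotional state, the partner information, and the conversation context, consider the sketch below. The class name, field names, and wording of the generated prompt are all hypothetical; the source gives only the example prompt text and does not describe a prompt template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicRequest:
    """Inputs gathered before topic generation (all field names are hypothetical)."""
    emotion: str                                            # e.g. "relaxed"
    partner_facts: List[str] = field(default_factory=list)  # from information acquisition
    context_keywords: List[str] = field(default_factory=list)  # from context analysis


def build_prompt(req: TopicRequest) -> str:
    """Assemble a single prompt string for a generative AI model."""
    lines = [f"The user currently seems {req.emotion}."]
    if req.partner_facts:
        lines.append(
            "Known about the conversation partner: " + "; ".join(req.partner_facts) + "."
        )
    if req.context_keywords:
        lines.append(
            "Recent conversation keywords: " + ", ".join(req.context_keywords) + "."
        )
    lines.append(
        "Suggest one conversation topic that fits this mood and keeps the "
        "conversation flowing naturally."
    )
    return "\n".join(lines)
```

For instance, `build_prompt(TopicRequest(emotion="relaxed", partner_facts=["recently planned a camping trip"]))` would produce a three-line prompt covering the same situation as the camping example.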
[0173] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0174] Step 1:
[0175] The terminal collects ambient sound in real time through the attached device. After audio data is obtained as input, the terminal records the data using an audio acquisition means and converts it into text data using a speech recognition means. The converted text data becomes the output. Specifically, this is a process in which the waveform of the sound is captured and converted into text information.
[0176] Step 2:
[0177] The server receives text data sent from the terminal. Using contextual analysis techniques, it analyzes the flow of the conversation and important topics based on this text data as input. The output provides analysis results regarding the themes of the conversation and the user's intentions. Specifically, it utilizes natural language processing techniques to extract keywords and understand the context.
[0178] Step 3:
[0179] Based on the information obtained through contextual analysis, the server uses information acquisition methods to collect publicly available information about the conversation partner from external communication networks. The context analysis results are the input, and information such as SNS posts and public profiles is the output. Specifically, this is the operation of collecting digital content from the internet via an API.
[0180] Step 4:
[0181] The device uses emotion analysis techniques to analyze the user's voice and speech patterns, identifying their emotional state in real time. The user's voice data is input, and the analyzed emotional state is output. Specifically, it evaluates the tone and tempo of the voice to identify the emotional state (e.g., joy, calmness).
[0182] Step 5:
[0183] The server uses a topic generation mechanism based on all the information acquired so far to input prompt sentences into the generative AI model. This results in the output of new topics optimized for the user's emotions and the context of the conversation. Specifically, the generative AI model constructs topics that reflect the appropriate context and prepares natural suggestions.
[0184] Step 6:
[0185] The terminal receives new topics sent from the server and presents them to the user. Using the presentation means, it presents topics via speech synthesis as needed. The suggested topics become the output, and the user can resume the conversation based on them. Specifically, speech synthesis technology is used to produce speech output that closely resembles human speech.
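Step 4 describes evaluating the tone and tempo of the voice to identify an emotional state. A deliberately simple rule-based sketch of that idea is shown below. The feature names, the thresholds, and the extra labels "tension" and "neutral" are assumptions made for illustration only; they are not validated values, and a trained model (such as the emotion identification model 59 mentioned in the source) may well be used instead.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical features extracted from the user's speech."""
    pitch_range_hz: float     # spread between lowest and highest pitch (tone)
    words_per_minute: float   # speaking rate (tempo)


def estimate_emotion(f: VoiceFeatures) -> str:
    """Map tone and tempo to a coarse label using illustrative thresholds."""
    fast = f.words_per_minute > 170
    slow = f.words_per_minute < 120
    lively = f.pitch_range_hz > 80
    if fast and lively:
        return "joy"
    if fast and not lively:
        return "tension"
    if slow and not lively:
        return "calmness"
    return "neutral"
```

The resulting label is the kind of value that Step 5 could then combine with the context analysis and partner information when forming a prompt.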
[0186] (Application Example 2)
[0187] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0188] In conversations, silences or a lack of topics often lead to a breakdown in communication. Furthermore, finding appropriate topics while considering the emotional state of the person being spoken to is difficult in such situations. Especially in situations with limited interaction, the person being spoken to is more likely to feel anxious or tense, leading to communication breakdowns. To solve this, a comprehensive system is needed that accurately analyzes the user's emotional state and suggests topics appropriate to the situation.
[0189] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0190] In this invention, the server includes an acoustic acquisition means worn by the user to record ambient sounds, a speech recognition means to convert the sounds acquired by the acoustic acquisition means into textual information, and a context analysis means to analyze the textual information to understand the conversation situation. This makes it possible to maintain smooth conversation by accurately selecting and presenting topics to the user based on the user's emotional state.
[0191] "Acoustic acquisition means" refers to a device worn by the user to record ambient sounds.
[0192] "Speech recognition means" refers to a technology that converts the sounds acquired by the acoustic acquisition means into text information.
[0193] "Context analysis means" refers to a mechanism that analyzes the text information to understand the flow and situation of a conversation.
[0194] "Information gathering means" refers to a mechanism for obtaining information about the conversation partner from external sources.
[0195] "Topic creation means" refers to a technique that generates new topics of conversation based on information collected by context analysis means and information gathering means.
[0196] A "presentation device" is a device for suggesting conversation topics generated by topic creation means to the user.
[0197] An "emotion analysis engine" is an engine that determines the user's emotional state from voice and facial expression data.
[0198] This invention is a system designed for use with consumer robots and is intended to facilitate smooth interaction with users. It begins by recording ambient sounds using an ear-worn device worn by a human and converting them into text data.
[0199] The server uses a microphone attached to the device as a means of acquiring sound. For speech recognition, it utilizes a speech recognition service such as Google Cloud Speech-to-Text to convert the recorded audio into text information in real time. The context analysis means analyzes the obtained text information and plays a role in understanding the context of the conversation.
[0200] Furthermore, the server utilizes information gathering tools to collect publicly available information about the conversation partner via external communication networks. This information is extracted from data such as social media posts and publicly available project information. This makes it possible to enhance the context of the conversation.
[0201] The emotion analysis engine analyzes voice and user facial expressions in real time to determine the user's emotional state. This allows it to suggest topics that are most appropriate to the user's emotions at that moment.
[0202] The presentation device provides a way to naturally suggest topics generated by a topic generation mechanism to the user through speech synthesis and display. For example, if silence occurs during a conversation with family, the system will determine that the user is relaxed and generate an appropriate topic. It will then restart the conversation in a natural way by suggesting, for instance, "How about talking about recent camping gear?" through the speaker.
[0203] One of the main processing technologies used by the server is a generative AI model, which generates and suggests information using a prompt such as, "Please suggest some light, family-oriented topics that will interest the user when they are relaxing."
[0204] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0205] Step 1:
[0206] An acoustic acquisition device attached to the terminal collects ambient sound and transmits it to a server as digital audio data. The input is ambient sound, and the output is digital audio data. Specifically, it continuously records sound using a microphone and stores it in a buffer.
[0207] Step 2:
[0208] The server uses speech recognition to convert digital audio data into text. This process utilizes speech recognition services such as Google Cloud Speech-to-Text. The input is digital audio data, and the output is text. Specifically, the audio data is sent to the cloud service, and the returned text is received.
[0209] Step 3:
[0210] The server uses contextual analysis tools to analyze textual information and understand the flow and situation of the conversation. The input is textual information, and the output is the analyzed conversational context. Specifically, it uses natural language processing techniques to perform keyword extraction and relevance analysis.
[0211] Step 4:
[0212] The server uses information gathering tools to obtain publicly available information about the conversation partner from an external communication network. The input is the conversation partner's identifier, and the output is the obtained publicly available information. Specifically, it uses the SNS API to retrieve the other party's latest SNS posts, project information, etc.
[0213] Step 5:
[0214] The server uses an emotion analysis engine to determine the user's emotional state from their voice and facial expressions. The input is voice data and facial expression data, and the output is the analyzed emotional state. Specifically, it performs voice tone analysis and facial expression recognition from camera footage.
[0215] Step 6:
[0216] The server uses a topic generation mechanism to generate new conversation topics based on the information obtained in the previous steps, utilizing a generative AI model. This process uses prompts. The input consists of conversation context information, public information, and emotional state, and the output is the generated conversation topic. A specific prompt for the generative AI model is, "Suggest a light, family-oriented topic that will interest the user when they are relaxed."
[0217] Step 7:
[0218] The server proposes generated conversation topics to the user via a presentation device. The presentation device uses speech synthesis and a display. The input is the generated conversation topic, and the output is the conversation topic proposal to the user. Specifically, it outputs synthesized speech from a speaker and displays supplementary information on a display.
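Because Steps 2 to 7 each name an input and an output, the flow can be pictured as a chain of functions in which each output feeds a later input. The sketch below shows only that chaining. Every name in it is hypothetical, and each step is an injected callable so that a real speech recognition service, SNS API client, emotion analysis engine, or generative AI model could be substituted; none of those integrations is implemented here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineSteps:
    """Pluggable stand-ins for Steps 2 to 7; every callable here is hypothetical."""
    recognize: Callable[[bytes], str]                # Step 2: digital audio -> text
    analyze_context: Callable[[str], List[str]]      # Step 3: text -> conversation context
    fetch_public_info: Callable[[str], List[str]]    # Step 4: partner identifier -> public info
    analyze_emotion: Callable[[bytes, bytes], str]   # Step 5: voice + face data -> emotion
    generate_topic: Callable[[List[str], List[str], str], str]  # Step 6: all of the above -> topic
    present: Callable[[str], None]                   # Step 7: topic -> speaker/display


def run_pipeline(steps: PipelineSteps, audio: bytes, face: bytes, partner_id: str) -> str:
    """Chain the steps so that each output feeds the next step's input."""
    text = steps.recognize(audio)
    context = steps.analyze_context(text)
    public_info = steps.fetch_public_info(partner_id)
    emotion = steps.analyze_emotion(audio, face)
    topic = steps.generate_topic(context, public_info, emotion)
    steps.present(topic)
    return topic
```

Passing the steps in as callables keeps the orchestration independent of any particular service, which also makes it straightforward to exercise the flow with simple stand-in functions.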
[0219] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0220] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0221] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0222] [Second Embodiment]
[0223] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0224] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0225] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0226] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0227] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0228] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0229] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0230] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0231] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0232] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0233] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0234] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0235] This invention is a system that uses an ear-worn device worn by the user to prevent silence during conversations and enable smooth communication with the other party. This system consists of a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means.
[0236] Description of the Embodiment
[0237] When a user wears the device and begins a casual conversation, the device first records the conversation using a voice acquisition tool. The recorded audio is then converted into text data in real time by a speech recognition tool. This text data is sent to a server, where a context analysis tool analyzes the flow of the conversation and important themes.
[0238] The server further uses information retrieval methods to obtain public profile information and related data of the conversation partner via external networks. This data can be obtained from publicly available sources such as social networking services and business profiles.
[0239] Based on the acquired conversation context and information about the conversation partner, the server uses a topic generation mechanism to generate new conversation topics that can be suggested to the user. For example, new topics may be generated based on places the conversation partner has recently visited or topics they are interested in. This process allows the user to receive appropriate topics naturally when the conversation tends to stall.
[0240] The generated topics are communicated to the user through presentation methods. The device utilizes audio and display devices to present new topics to the user in an easy-to-understand and natural way.
[0241] Specific example
[0242] One evening, a user was talking with a friend at a cafe. When the conversation seemed to falter, the device started recording and transcribed the audio in real time. The server extracted the keywords "coffee" and "travel," and through its information acquisition tools, discovered that the friend had made a social media post about coffee that morning. The server used this information to generate a topic about "new coffee beans." The device then suggested to the user via voice, "Why don't you ask your friend about the coffee they were talking about on social media this morning?" The user was then able to restart the conversation based on that topic and continue communicating in a natural flow.
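The matching in this example, where the keywords "coffee" and "travel" are related to the friend's post about coffee, can be illustrated with a simple keyword-overlap score. This is only one possible approach under stated assumptions: the function name is hypothetical, the posts are assumed to be already retrieved as plain English strings, and the source does not say how the server actually relates keywords to posts.

```python
from typing import List, Optional


def pick_post_for_topic(keywords: List[str], posts: List[str]) -> Optional[str]:
    """Return the post sharing the most keywords with the conversation, if any."""
    best_post: Optional[str] = None
    best_score = 0
    for post in posts:
        # Normalize the post into a set of lowercase words without punctuation.
        words = {w.strip(".,!?").lower() for w in post.split()}
        score = sum(1 for k in keywords if k.lower() in words)
        if score > best_score:  # strict comparison: the earliest post wins ties
            best_post, best_score = post, score
    return best_post
```

The selected post could then be passed to the topic generation step as partner information, while a result of `None` would indicate that no retrieved post relates to the current conversation.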
[0243] The following describes the processing flow.
[0244] Step 1:
[0245] The device monitors the conversation and starts recording when it detects specific keywords or silences. Recording proceeds in real time and continues even in the middle of a conversation.
[0246] Step 2:
[0247] The device sends the recorded audio data to the speech recognition engine. The speech recognition engine converts the audio into text data, thereby digitizing the linguistic information.
[0248] Step 3:
[0249] The server receives text data from the terminal. The server uses context analysis tools and applies natural language processing techniques to understand the topic and flow of the conversation.
[0250] Step 4:
[0251] The server uses information retrieval methods to obtain relevant information from the conversation partner's social media accounts and public profiles via an external network. This allows the server to collect information about the other person's interests and recent activities.
[0252] Step 5:
[0253] The server combines the analysis results of the context analysis means with the data collected by the information acquisition means, and the topic generation means creates a new topic. This involves using a generative AI model to form a topic suitable for conversation.
[0254] Step 6:
[0255] The terminal receives generated topics from the server and presents them to the user using speech synthesis technology and a display. The presentation is designed to be natural.
[0256] Step 7:
[0257] Based on the device's suggestions, the user can resume the conversation with the other party and continue communication smoothly. This prevents interruptions in the conversation and improves the quality of interaction.
[0258] (Example 1)
[0259] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0260] In everyday conversations, there is a challenge in keeping the conversation flowing smoothly without interruptions. In particular, when a conversation stalls, it is not easy to quickly find a topic appropriate for that moment. There is a need for a system that allows users to naturally introduce topics and continue the conversation smoothly in such situations.
[0261] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0262] In this invention, the server includes: a voice acquisition means worn by the user to acquire ambient sounds; a voice recognition means to convert the voice obtained by the voice acquisition means into digital data; a context analysis means to analyze the digital data and identify the flow of conversation and main topics; an information acquisition means to collect publicly available information of the conversation partner from an external information source via a communication means; a topic generation means to create a new topic based on the data obtained by the context analysis means and the information acquisition means; and a display means to present the topic constructed by the topic generation means to the user. This makes it possible to prevent conversational stagnation and smoothly provide the user with appropriate topics.
[0263] "Voice acquisition means" refers to a device or mechanism worn by the user that physically collects ambient sounds.
[0264] "Speech recognition means" refers to a technology or system that analyzes acquired speech data and converts it into corresponding text data.
[0265] "Contextual analysis means" refers to a process or method that analyzes text data generated by speech recognition means to identify the flow of conversation and the main topics.
[0266] "Information acquisition means" refers to technology or equipment for collecting publicly available profile information and related data of a conversation partner via an external communication network.
[0267] "Topic generation means" refers to a process or algorithm that generates new conversation topics based on data obtained by context analysis means and information acquisition means.
[0268] "Display means" refers to a device or interface for presenting a topic constructed by a topic generation means to a user.
[0269] This invention is a dialogue support system for maintaining a smooth flow of conversation, and comprises a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means. This system is implemented as a device worn by the user, and its embodiments are shown below.
[0270] The user wears an ear-worn device during everyday conversations. The device uses a voice acquisition mechanism to capture ambient sounds and collects voice data through a built-in microphone. The acquired voice data is converted into text data in real time via a speech recognition mechanism. High-precision speech recognition can be achieved using, for example, the Google Cloud Speech-to-Text API.
[0271] Subsequently, the terminal sends the converted text data to the server, which analyzes the text data using contextual analysis tools to identify the flow of the conversation and the main topics. This analysis utilizes natural language processing techniques to extract context and keywords.
[0272] Furthermore, the server uses information acquisition methods to collect public profile information and related data of the person it is interacting with via an external network. This is done by utilizing SNS APIs and public databases.
[0273] Based on the acquired information and contextual analysis results, the server uses a topic generation mechanism and a generative AI model to create new and appropriate conversation topics. This process generates highly adaptable prompt sentences that users can use immediately when they are struggling to find a topic. For example, a prompt sentence such as "Please suggest some new and interesting topics about recent travel destinations or events that have been in the news" might be used.
[0274] Finally, the generated topic is notified to the user verbally or visually through a presentation mechanism. The device uses speech synthesis technology to present the suggested content to the user in a natural way. In this way, the user can prevent silences during the conversation and maintain smooth communication.
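The keyword extraction mentioned as part of the context analysis can be sketched with a simple frequency count. This is a minimal stand-in, not the natural language processing the source has in mind: the stopword list and function name are hypothetical, and the tokenization assumes English text, so transcripts in other languages would need a different tokenizer.

```python
import re
from collections import Counter
from typing import List

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "i", "we", "you",
    "it", "is", "was", "that", "this", "for", "on", "about", "my", "our",
}


def extract_keywords(text: str, top_n: int = 3) -> List[str]:
    """Return the most frequent non-stopword tokens, most common first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```

The extracted keywords could serve both as a summary of the main topics and as query terms when collecting information about the conversation partner.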
[0275] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0276] Step 1:
[0277] Voice acquisition
[0278] When a user starts a conversation, the device activates its voice acquisition system and captures surrounding conversational audio in high quality via the microphone. The input is real-time audio, and the output is digital audio data. Specifically, it uses noise cancellation technology to suppress ambient noise while clearly recording the necessary audio.
[0279] Step 2:
[0280] Speech recognition and text conversion
[0281] The terminal supplies the acquired digital voice data to the speech recognition means and converts this data into text data. The input is digital voice data and the output is real-time text data. For this process, for example, the Google Cloud Speech-to-Text API is used to perform accurate transcription of the voice content.
[0282] Step 3:
[0283] Context analysis
[0284] The server receives the text data transmitted from the terminal and analyzes its content by the context analysis means. The input is the text data obtained by speech recognition, and the output is the analysis data including the flow of the conversation and the main theme. In this analysis, natural language processing technology is used to identify the central theme of the conversation and related keywords.
[0285] Step 4:
[0286] Information acquisition
[0287] The server collects additional data about the conversation partner from information sources publicly available through an external network. The input is the identification information of the conversation partner and related keywords, and the output is the public profile and related information of the partner. Specifically, queries are sent to the SNS API and online databases to find the recent activities and interests of the partner.
[0288] Step 5:
[0289] Topic generation
[0290] The server creates a new topic using a topic generation mechanism based on the context analysis results and acquired information. The input is the analyzed context data and collected information, and the output is the proposed conversation topic. A generative AI model is used to generate prompt sentences appropriate for the user. One example of such a prompt is "Please suggest some interesting recent travel destinations."
[0291] Step 6:
[0292] Topic presentation
[0293] The generated topic is sent from the server to the terminal and notified to the user via a presentation mechanism. The input is the generated conversation topic, and the output is the presented suggestion. The terminal uses speech synthesis technology to convey the topic to the user in a natural voice, preventing interruptions in the conversation.
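The six steps above can be sketched as a chain of interchangeable stages. This is a minimal illustration only: the `Pipeline` class and every `fake_*` stage are hypothetical stand-ins introduced here, not the disclosed implementation, and no real speech recognition, SNS, or generative AI service is called. Step 1 (voice acquisition) is represented by the `audio` bytes passed to `run`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Chains Steps 2 to 6; each stage is a plain callable so it can be swapped out."""
    recognize: Callable[[bytes], str]                # Step 2: audio -> text
    analyze: Callable[[str], List[str]]              # Step 3: text -> keywords
    acquire: Callable[[str, List[str]], List[str]]   # Step 4: partner id + keywords -> public info
    generate: Callable[[List[str], List[str]], str]  # Step 5: keywords + info -> topic
    present: Callable[[str], str]                    # Step 6: topic -> message for the user

    def run(self, audio: bytes, partner_id: str) -> str:
        text = self.recognize(audio)
        keywords = self.analyze(text)
        info = self.acquire(partner_id, keywords)
        topic = self.generate(keywords, info)
        return self.present(topic)


# Stub stages (all hypothetical) standing in for the real services.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_analyze(text: str) -> List[str]:
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if len(w) > 4]


def fake_acquire(partner_id: str, keywords: List[str]) -> List[str]:
    profiles = {"friend-1": ["camping", "hiking"]}  # made-up public profile data
    return profiles.get(partner_id, [])


def fake_generate(keywords: List[str], info: List[str]) -> str:
    subject = info[0] if info else (keywords[0] if keywords else "the weekend")
    return f"recent news about {subject}"


def fake_present(topic: str) -> str:
    return f"How about talking about {topic}?"


pipeline = Pipeline(fake_recognize, fake_analyze, fake_acquire, fake_generate, fake_present)
print(pipeline.run(b"We went to the mountains last month.", "friend-1"))
```

Keeping each stage behind a simple callable makes it possible to replace a stub with a cloud service or an on-device model without changing the overall flow.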
[0294] (Application Example 1)
[0295] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0296] In modern households, conversations among family members often become monotonous or silent, frequently hindering smooth communication. This can weaken family bonds and lead to insufficient mutual understanding in daily life. Furthermore, it's difficult to naturally generate and introduce new topics to revitalize conversation, highlighting the need for technologies to support this process.
[0297] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0298] In this invention, the server includes sound information acquisition means, sound information interpretation means, background analysis means, external information acquisition means, topic generation means, and topic presentation means, and its topic detection and generation function is incorporated into a device for improving in-home communication. This makes it possible to prevent stagnation in conversations among family members and to stimulate conversation in a natural way.
[0299] "Sound information acquisition means" refers to a device worn by the user that has the function of collecting ambient sounds.
[0300] A "sound information interpretation means" is a system that analyzes acoustic data collected by a sound information acquisition means and converts it into linguistic data.
[0301] "Background analysis means" refers to means having the function of understanding the context and important themes of a dialogue using the converted language data.
[0302] "External information acquisition means" refers to a system that acquires information related to the interlocutor from external sources via the internet or other communication means.
[0303] "Topic generation means" refers to means having the function of generating new conversation themes based on data obtained from the background analysis means and the external information acquisition means.
[0304] "Topic presentation means" refers to means having the function of presenting or providing users with newly generated conversation topics.
[0305] "Devices for improving domestic communication" are devices designed to facilitate conversations and smooth communication among family members in a specific home environment.
[0306] This invention provides a system for supporting natural conversations within a household. The device worn by the user incorporates sound information acquisition means for obtaining ambient voices. The device also has sound information interpretation means for converting the acquired voice data into language data in real time. For example, a technology such as Google Cloud Speech-to-Text is used as the sound information interpretation means.
[0307] The language data obtained by the sound information interpretation means is transmitted to the server. The server implements background analysis means by leveraging natural language processing libraries (e.g., spaCy or NLTK) to understand the flow of the conversation and important themes. Furthermore, the server obtains relevant information from public APIs on the Internet (e.g., Twitter API or Wikipedia API) through external information acquisition means.
[0308] Based on these analysis results and the acquired information, the server generates new conversation topics using topic generation means. In doing so, a generative AI model (e.g., OpenAI's GPT-4) is used. The generated topics are presented to the users within the household through topic presentation means, and the users can use them to liven up the conversation.
[0309] As a specific example, during a conversation between a parent and a child in the living room, when the conversation breaks off, the system proposes "How about talking about the recent trends in home cooking?" to provide an opportunity for a natural conversation. An example of the prompt sentence used in the generative AI model is "The conversation between the parent and the child has broken off. Please propose a conversation topic that will catch the child's interest. Use the recent trends in home cooking."
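A prompt of the kind quoted above could be assembled from the detected situation and the themes obtained by analysis. The following is a minimal sketch; the function `build_topic_prompt` and its parameters are hypothetical names introduced for illustration, and the disclosure does not specify how the prompt is actually composed.

```python
from typing import Sequence


def build_topic_prompt(situation: str, audience: str, themes: Sequence[str]) -> str:
    """Compose a prompt in the shape of the example in the text (hypothetical helper)."""
    parts = [
        situation.strip(),
        f"Please propose a conversation topic that will catch {audience}'s interest.",
    ]
    if themes:
        # Themes would come from background analysis and external information.
        parts.append("Use " + ", ".join(themes) + ".")
    return " ".join(parts)


prompt = build_topic_prompt(
    "The conversation between the parent and the child has broken off.",
    "the child",
    ["the recent trends in home cooking"],
)
print(prompt)
```

The resulting string would then be passed to whichever generative AI model the server uses.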
[0310] The flow of the specific processing in Application Example 1 will be described using Figure 12.
[0311] Step 1:
[0312] The terminal collects ambient sound using sound information acquisition means. During this process, audio data is input in real time via a microphone. The collected audio data is then prepared for processing in the next step.
[0313] Step 2:
[0314] The device uses Google Cloud Speech-to-Text as a means of interpreting sound information, converting the input speech data into language data. This process converts the speech data into text data, preparing it for transmission to the server.
[0315] Step 3:
[0316] The server receives text data sent from the terminal and analyzes it using background analysis tools. Specifically, it uses natural language processing libraries (such as spaCy and NLTK) to extract context and important themes from the text data and understand conversation patterns.
[0317] Step 4:
[0318] The server uses publicly available APIs on the internet (e.g., Twitter API and Wikipedia API) as means of acquiring external information to collect external information related to the interlocutor. It uses keywords obtained by the background analysis means as input and receives related information based on these keywords as output.
[0319] Step 5:
[0320] The server generates new conversation topics using a generative AI model (e.g., OpenAI's GPT-4) based on data obtained from background analysis and external information acquisition. It uses analysis results and external information as input and outputs a newly constructed topic.
[0321] Step 6:
[0322] The server sends topics generated through topic suggestion mechanisms to the terminal, which then presents them to the user. Specifically, it uses voice suggestions and on-screen displays to inform the user of new topics in a natural way.
[0323] Step 7:
[0324] The conversation resumes based on the topic suggested to the user. In this step, the user's responses become input for the next analysis, and a mechanism is implemented that allows the system to continuously learn and improve.
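The keyword extraction in Step 3 of the flow above can be illustrated with a simple frequency count. This sketch deliberately uses only the Python standard library as a stand-in for an NLP library such as spaCy or NLTK; the stopword list and the function `extract_keywords` are assumptions made for illustration and are far cruder than real context analysis.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "on", "is", "was",
    "it", "we", "i", "you", "that", "this", "for", "with", "at", "last", "so",
}


def extract_keywords(text: str, top_n: int = 3) -> List[str]:
    """Return the most frequent non-stopword terms in the transcribed text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


transcript = "We tried a new curry recipe last week. The curry was great, and the recipe was easy."
print(extract_keywords(transcript, top_n=2))
```

The extracted keywords would serve as the input to the external information acquisition in Step 4.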
[0325] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0326] This invention is a system that uses an ear-worn device worn by the user to compensate for silences and lack of topics during conversations, and in particular provides dialogue support that takes the user's emotional state into consideration. This invention includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, a presentation means, and an emotion engine that analyzes the user's emotional state.
[0327] Description of the Embodiment
[0328] Consider a scenario where a user wears a device and engages in a normal conversation. During the conversation, the device records surrounding sounds using a voice acquisition device. The recorded audio data is immediately converted into text data by a speech recognition device, and this text data is sent to a server.
[0329] The server uses contextual analysis tools to analyze the flow of the conversation based on this text data. Furthermore, it uses external communication networks via information acquisition tools to collect publicly available information about the conversation partner. This includes social media posts and publicly available project information.
[0330] A key feature of this invention is that the emotion engine monitors the user's voice and speech patterns to identify their emotional state. This emotional state, combined with the aforementioned analysis results, helps the topic generation means determine the topic best suited to the user's mood. For example, if the user is relaxed, adjustments such as selecting a light topic are made.
[0331] The emotion engine is also designed to analyze the user's facial expressions in real time, using this additional information to improve the accuracy of the estimated emotional state. This data is collected on a server and used to support the generation of topics that best reflect the user's current situation.
[0332] The generated topics are delivered to the user via the device. The device suggests topics to the user in a natural way through speech synthesis and display. This allows the user to continue the conversation smoothly with the other person.
[0333] Specific example
[0334] Consider a scenario where a user is talking to a friend in a park. If there is a silence in the conversation, the device starts recording the audio and transcribes it into text using speech recognition. The server extracts specific keywords and learns from social media that the person the user is talking to has recently been planning a camping trip. The emotion engine also detects that the user is in a calm mood. Based on this, the server generates a topic about "the latest camping gear," and the device suggests to the user via voice, "Why don't you ask your friend about the latest camping items?", allowing the user to resume the conversation in a natural way.
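The way the partner's interests and the user's mood combine in the example above can be sketched as follows. The mood-to-tone table and the function `choose_topic` are hypothetical; in the described system the topic itself would be produced by a generative AI model rather than by a fixed template.

```python
from typing import Dict, List

# Hypothetical mapping from the estimated mood to the tone of the suggestion.
TONE_BY_MOOD: Dict[str, str] = {
    "calm": "light",
    "relaxed": "light",
    "excited": "lively",
    "tense": "reassuring",
}


def choose_topic(partner_interests: List[str], mood: str) -> Dict[str, str]:
    """Pick a subject from the partner's public interests and a tone from the mood."""
    tone = TONE_BY_MOOD.get(mood, "neutral")
    subject = partner_interests[0] if partner_interests else "recent events"
    return {
        "tone": tone,
        "topic": subject,
        "suggestion": f"Why don't you ask your friend about {subject}?",
    }


result = choose_topic(["the latest camping items"], "calm")
print(result["suggestion"])
```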
[0335] The following describes the processing flow.
[0336] Step 1:
[0337] The device detects conversational audio around the user and starts recording when it detects trigger keywords or silence. Recording continues in real time, even while the conversation is ongoing.
[0338] Step 2:
[0339] The device sends the recorded audio data to a speech recognition engine, which converts the audio into text data. This text data is then used for later analysis and information extraction.
[0340] Step 3:
[0341] The server analyzes the text data received from the terminal through contextual analysis tools to understand the flow of the conversation and its main topics. This allows for an understanding of the direction of the conversation.
[0342] Step 4:
[0343] The server uses information retrieval methods to obtain the conversation partner's social media and public information via an external network. This reveals the conversation partner's recent activities and areas of interest.
[0344] Step 5:
[0345] The device sends the user's voice and voice patterns to the emotion engine in real time to evaluate their emotional state. If necessary, a facial expression sensor is also used for more accurate emotion recognition.
[0346] Step 6:
[0347] The server combines the results of the context analysis means, the data from the information acquisition means, and the emotional state determination by the emotion engine to activate the topic generation means and generate new conversation topics suitable for the user.
[0348] Step 7:
[0349] The device presents generated topics to the user audibly or visually, and suggests new conversation topics. The presentation takes into account the user's emotional state and the flow of the conversation.
[0350] Step 8:
[0351] By accepting the device's suggestions and incorporating new topics into the conversation, users can maintain smooth communication with the other person, improving the continuity and quality of the conversation.
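The silence trigger in Step 1 of the flow above can be illustrated with a simple energy threshold. This is a sketch under stated assumptions: the per-frame loudness values, the threshold of 0.05, and the three-frame minimum are all hypothetical, and the disclosure does not specify how silence is detected.

```python
from typing import Iterable, Optional


def first_silence(
    frame_levels: Iterable[float],
    threshold: float = 0.05,
    min_frames: int = 3,
) -> Optional[int]:
    """Return the index of the frame at which silence has lasted `min_frames`
    consecutive frames, or None if no such run occurs."""
    quiet_run = 0
    for index, level in enumerate(frame_levels):
        # A loud frame resets the run; a quiet frame extends it.
        quiet_run = quiet_run + 1 if level < threshold else 0
        if quiet_run >= min_frames:
            return index
    return None


levels = [0.40, 0.30, 0.01, 0.02, 0.50, 0.01, 0.02, 0.03, 0.01]
print(first_silence(levels))
```

Requiring several consecutive quiet frames avoids triggering on the short pauses that occur naturally between words.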
[0352] (Example 2)
[0353] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0354] Silences and a lack of topics during conversations can make it difficult for users to engage in dialogue. Furthermore, offering topics that don't align with the user's emotional state can lead to unnatural conversations. There is a need for a system that addresses these challenges and supports smooth and natural dialogue.
[0355] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0356] In this invention, the server includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, an emotion analysis means, a topic generation means, and a presentation means. This reduces silence during the user's conversation and enables natural topic provision according to their emotional state.
[0357] "Voice acquisition means" refers to a means of collecting ambient sounds using a device worn by the user.
[0358] "Speech recognition means" refers to means for converting collected speech data into text data.
[0359] "Context analysis means" refers to means for analyzing text data to understand the flow and context of a conversation.
[0360] "Information acquisition means" refers to means for collecting information about the conversation partner from external sources.
[0361] "Emotion analysis means" refers to means for analyzing the user's voice and speech patterns to identify the user's emotional state.
[0362] "Topic generation means" refers to means for generating new topics based on the context analysis results, the information acquisition results, and the emotion analysis results.
[0363] "Presentation means" refers to means for suggesting generated topics to the user in a natural way.
[0364] This invention is a system that utilizes an ear-worn device worn by the user to facilitate smooth conversation. In particular, it focuses on providing dialogue support while taking into account the user's emotional state. A detailed embodiment is shown below.
[0365] Hardware and software to be used
[0366] A terminal device is used. This includes a voice acquisition means for collecting and recording ambient sounds. This voice data is converted into text data in real time by a speech recognition means. The converted text data is sent to a server.
[0367] The server performs advanced analysis. Based on the received text data, it analyzes the flow of the conversation using contextual analysis tools. Furthermore, it collects publicly available information about the conversation partner via external communication networks through information acquisition tools.
[0368] On the other hand, emotion analysis tools analyze the user's voice patterns to identify the user's emotional state. Information about the emotional state is an important factor in determining the topic of conversation.
[0369] Based on these results, the server uses a topic generation mechanism and the generative AI model to generate topics that match the user's emotions and the context of the conversation.
[0370] The generated topics are presented to the user through the device's display mechanisms. Because they are displayed or spoken in a natural way using speech synthesis technology, the user can engage in smooth conversations.
[0371] Specific example
[0372] For example, consider a situation where a user is talking with a friend in a park and a silence occurs. At this point, the device immediately starts recording the audio, converts it into text data, and sends it to the server. The server analyzes the context of the conversation and the user's emotional state, and selects a topic, for example, "the latest camping gear." Based on this, the device can suggest to the user, "Why don't you ask your friend about the latest camping items?"
[0373] Example of a prompt
[0374] An example of a prompt to input into the generative AI model is, "Generate an appropriate topic given that the user is relaxed and that the conversation partner has recently planned a camping trip." Based on this, topics that maintain a natural flow of conversation will be generated.
[0375] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0376] Step 1:
[0377] The terminal collects ambient sound in real time through the attached device. After audio data is obtained as input, the terminal records the data using an audio acquisition means and converts it into text data using a speech recognition means. The converted text data becomes the output. Specifically, this is a process in which the waveform of the sound is captured and converted into text information.
[0378] Step 2:
[0379] The server receives text data sent from the terminal. Using contextual analysis techniques, it analyzes the flow of the conversation and important topics based on this text data as input. The output provides analysis results regarding the themes of the conversation and the user's intentions. Specifically, it utilizes natural language processing techniques to extract keywords and understand the context.
[0380] Step 3:
[0381] Based on the information obtained through contextual analysis, the server uses information acquisition methods to collect publicly available information about the conversation partner from external communication networks. The context analysis results are the input, and information such as SNS posts and public profiles is the output. Specifically, digital content is collected from the internet via an API.
[0382] Step 4:
[0383] The device uses emotion analysis techniques to analyze the user's voice and speech patterns, identifying their emotional state in real time. The user's voice data is input, and the analyzed emotional state is output. Specifically, it evaluates the tone and tempo of the voice to identify the emotional state (e.g., joy, calmness).
[0384] Step 5:
[0385] The server uses a topic generation mechanism based on all the information acquired so far to input prompt sentences into the generative AI model. This results in the output of new topics optimized for the user's emotions and the context of the conversation. Specifically, the generative AI model constructs topics that reflect the appropriate context and prepares natural suggestions.
[0386] Step 6:
[0387] The terminal receives new topics sent from the server and presents them to the user. Using the presentation means, it presents topics via speech synthesis as needed. The suggested topics become the output, and the user can resume the conversation based on them. Specifically, speech synthesis technology is used to produce speech output that closely resembles human speech.
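The evaluation of tone and tempo in Step 4 of the flow above can be illustrated with a rule-based classifier. The features, thresholds, and the names `VoiceFeatures` and `estimate_emotion` are hypothetical and unvalidated; a real system would more likely use a trained model such as the emotion identification model described elsewhere in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_variation_hz: float  # spread of the speaker's pitch (tone)
    words_per_minute: float    # speaking rate (tempo)


def estimate_emotion(features: VoiceFeatures) -> str:
    """Map tone and tempo to a coarse emotion label using illustrative thresholds."""
    if features.words_per_minute > 180 and features.pitch_variation_hz > 40:
        return "joy"       # fast and animated speech
    if features.words_per_minute < 130 and features.pitch_variation_hz < 25:
        return "calmness"  # slow and even speech
    return "neutral"


print(estimate_emotion(VoiceFeatures(pitch_variation_hz=15.0, words_per_minute=110.0)))
```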
[0388] (Application Example 2)
[0389] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0390] In conversations, silences or a lack of topics often lead to a breakdown in communication. Furthermore, finding appropriate topics while considering the emotional state of the person being spoken to is difficult in such situations. Especially in situations with limited interaction, the person being spoken to is more likely to feel anxious or tense, leading to communication breakdowns. To solve this, a comprehensive system is needed that accurately analyzes the user's emotional state and suggests topics appropriate to the situation.
[0391] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0392] In this invention, the server includes an acoustic acquisition means worn by the user to record ambient sounds, a speech recognition means to convert the sounds acquired by the acoustic acquisition means into textual information, and a context analysis means to analyze the textual information to understand the conversation situation. This makes it possible to maintain smooth conversation by accurately selecting and presenting topics to the user based on the user's emotional state.
[0393] "Sound acquisition means" refers to a device worn by the user to record ambient sounds.
[0394] "Speech recognition means" refers to a technology that converts speech acquired by sound acquisition means into text information.
[0395] A "contextual analysis tool" is a mechanism that analyzes textual information to understand the flow and situation of a conversation.
[0396] An "information gathering tool" is a device used to obtain information about the person you are interacting with from external sources.
[0397] "Topic creation means" refers to a technique that generates new topics of conversation based on information collected by context analysis means and information gathering means.
[0398] A "presentation device" is a device for suggesting conversation topics generated by topic creation means to the user.
[0399] An "emotion analysis engine" is an engine that determines the user's emotional state from voice and facial expression data.
[0400] This invention is a system designed for use with consumer robots and is intended to facilitate smooth interaction with users. It begins by recording ambient sounds using an ear-worn device worn by a human and converting them into text data.
[0401] The server uses a microphone attached to the device as a means of acquiring sound. For speech recognition, it utilizes a speech recognition service such as Google Cloud Speech-to-Text to convert the recorded audio into text information in real time. The context analysis means analyzes the obtained text information and plays a role in understanding the context of the conversation.
[0402] Furthermore, the server utilizes information gathering tools to collect publicly available information about the conversation partner via external communication networks. This information is extracted from data such as social media posts and publicly available project information. This makes it possible to enhance the context of the conversation.
[0403] The emotion analysis engine analyzes voice and user facial expressions in real time to determine the user's emotional state. This allows it to suggest topics that are most appropriate to the user's emotions at that moment.
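One way to combine the voice-based and expression-based estimates is a weighted average of per-emotion scores. This is only a sketch: the score dictionaries, the default weight, and the function `fuse_emotions` are assumptions introduced here, and the disclosure does not state how the two modalities are combined.

```python
from typing import Dict


def fuse_emotions(
    voice_scores: Dict[str, float],
    face_scores: Dict[str, float],
    voice_weight: float = 0.5,
) -> str:
    """Return the emotion label with the highest weighted score across both modalities."""
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be between 0 and 1")
    labels = set(voice_scores) | set(face_scores)
    if not labels:
        raise ValueError("at least one score is required")
    combined = {
        label: voice_weight * voice_scores.get(label, 0.0)
        + (1.0 - voice_weight) * face_scores.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes ties resolve deterministically (alphabetical order).
    return max(sorted(combined), key=lambda label: combined[label])


voice = {"relaxed": 0.6, "tense": 0.4}
face = {"relaxed": 0.8, "tense": 0.2}
print(fuse_emotions(voice, face))
```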
[0404] The presentation device provides a way to naturally suggest topics generated by a topic generation mechanism to the user through speech synthesis and display. For example, if silence occurs during a conversation with family, the system will determine that the user is relaxed and generate an appropriate topic. It will then restart the conversation in a natural way by suggesting, for instance, "How about talking about recent camping gear?" through the speaker.
[0405] One of the main processing technologies used by the server is a generative AI model, which generates and suggests information using a prompt such as, "Please suggest some light, family-oriented topics that will interest the user when they are relaxing."
[0406] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0407] Step 1:
[0408] An acoustic acquisition device attached to the terminal collects ambient sound and transmits it to a server as digital audio data. The input is ambient sound, and the output is digital audio data. Specifically, it continuously records sound using a microphone and stores it in a buffer.
[0409] Step 2:
[0410] The server uses speech recognition to convert digital audio data into text. This process utilizes speech recognition services such as Google Cloud Speech-to-Text. The input is digital audio data, and the output is text. Specifically, the audio data is sent to the cloud service, and the returned text is received.
[0411] Step 3:
[0412] The server uses contextual analysis tools to analyze textual information and understand the flow and situation of the conversation. The input is textual information, and the output is the analyzed conversational context. Specifically, it uses natural language processing techniques to perform keyword extraction and relevance analysis.
[0413] Step 4:
[0414] The server uses information gathering tools to obtain publicly available information about the conversation partner from an external communication network. The input is the conversation partner's identifier, and the output is the obtained publicly available information. Specifically, it uses the SNS API to retrieve the other party's latest SNS posts, project information, etc.
[0415] Step 5:
[0416] The server uses an emotion analysis engine to determine the user's emotional state from their voice and facial expressions. The input is voice data and facial expression data, and the output is the analyzed emotional state. Specifically, it performs voice tone analysis and facial expression recognition from camera footage.
[0417] Step 6:
[0418] The server uses a topic generation mechanism to generate new conversation topics based on the information obtained in the previous steps, utilizing a generative AI model. This process uses prompts. The input consists of conversation context information, public information, and emotional state, and the output is the generated conversation topic. A specific prompt for the generative AI model is, "Suggest a light, family-oriented topic that will interest the user when they are relaxed."
[0419] Step 7:
[0420] The server proposes generated conversation topics to the user via a presentation device. The presentation device uses speech synthesis and a display. The input is the generated conversation topic, and the output is the conversation topic proposal to the user. Specifically, it outputs synthesized speech from a speaker and displays supplementary information on a display.
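The buffering described in Step 1 of the flow above can be sketched with a bounded queue that keeps only the most recent audio frames. The class `AudioBuffer` and its frame-count limit are hypothetical; frames here are arbitrary byte strings rather than real microphone samples.

```python
from collections import deque


class AudioBuffer:
    """Keeps the most recent `max_frames` audio frames; older frames are dropped."""

    def __init__(self, max_frames: int) -> None:
        self._frames = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        # When the buffer is full, deque discards the oldest frame automatically.
        self._frames.append(frame)

    def drain(self) -> bytes:
        """Return all buffered audio as one byte string and empty the buffer."""
        data = b"".join(self._frames)
        self._frames.clear()
        return data


buffer = AudioBuffer(max_frames=3)
for frame in (b"a", b"b", b"c", b"d"):
    buffer.push(frame)
payload = buffer.drain()
print(payload)
```

Bounding the buffer keeps memory use fixed during continuous recording, at the cost of discarding audio older than the configured window.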
[0421] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0422] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, a prompt containing instructions and inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0423] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0424] [Third Embodiment]
[0425] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0426] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0427] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0428] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0429] The microphone 238 receives instructions and other input from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0430] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0431] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0432] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0433] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0434] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0435] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0436] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0437] This invention is a system that uses an ear-worn device worn by the user to prevent silence during conversations and enable smooth communication with the other party. This system consists of a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means.
[0438] Description of the Embodiment
[0439] When a user wears the device and begins a casual conversation, the device first records the conversation using the voice acquisition means. The recorded audio is then converted into text data in real time by the voice recognition means. This text data is sent to the server, where the context analysis means analyzes the flow of the conversation and its important themes.
[0440] The server further uses the information acquisition means to obtain public profile information and related data about the conversation partner via external networks. This data can be obtained from publicly available sources such as social networking services and business profiles.
[0441] Based on the acquired conversation context and information about the conversation partner, the server uses a topic generation mechanism to generate new conversation topics that can be suggested to the user. For example, new topics may be generated based on places the conversation partner has recently visited or topics they are interested in. This process allows the user to receive appropriate topics naturally when the conversation tends to stall.
[0442] The generated topics are communicated to the user through the presentation means. The device uses audio output and a display to present new topics to the user in an easy-to-understand and natural way.
[0443] Specific example
[0444] One evening, a user was talking with a friend at a cafe. When the conversation seemed to falter, the device started recording and transcribed the audio in real time. The server extracted the keywords "coffee" and "travel," and through its information acquisition tools, discovered that the friend had made a social media post about coffee that morning. The server used this information to generate a topic about "new coffee beans." The device then suggested to the user via voice, "Why don't you ask your friend about the coffee they were talking about on social media this morning?" The user was then able to restart the conversation based on that topic and continue communicating in a natural flow.
[0445] The following describes the processing flow.
[0446] Step 1:
[0447] The device monitors the conversation and starts recording when it detects specific keywords or silences. Recording proceeds in real time and continues even in the middle of a conversation.
[0448] Step 2:
[0449] The device sends the recorded audio data to the speech recognition engine. The speech recognition engine converts the audio into text data, thereby digitizing the linguistic information.
[0450] Step 3:
[0451] The server receives text data from the terminal. The server uses context analysis tools and applies natural language processing techniques to understand the topic and flow of the conversation.
[0452] Step 4:
[0453] The server uses information retrieval methods to obtain relevant information from the conversation partner's social media accounts and public profiles via an external network. This allows the server to collect information about the other person's interests and recent activities.
[0454] Step 5:
[0455] The server combines the analysis results of the context analysis means with the data collected by the information acquisition means, and the topic generation means creates a new topic. This involves using a generative AI model to form a topic suitable for conversation.
[0456] Step 6:
[0457] The terminal receives generated topics from the server and presents them to the user using speech synthesis technology and a display. The presentation is designed to be natural.
[0458] Step 7:
[0459] Based on the device's suggestions, the user can resume the conversation with the other party and continue communication smoothly. This prevents interruptions in the conversation and improves the quality of interaction.
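The recording trigger in Step 1 of the flow above (start recording when a specific keyword or a silence is detected) could be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the `Frame` structure, the loudness scale, and the `silence_rms` and `silence_frames` thresholds are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Frame:
    """One short slice of monitored audio, reduced to simple features (hypothetical)."""
    rms: float       # loudness of the frame, assumed to lie between 0.0 and 1.0
    text: str = ""   # words recognized in this frame, if any


def find_trigger(frames: Iterable[Frame],
                 keywords: Set[str],
                 silence_rms: float = 0.05,
                 silence_frames: int = 3) -> Optional[int]:
    """Return the index of the frame at which recording should start, or None.

    A trigger fires on the first frame containing a keyword, or on the frame
    that completes a run of `silence_frames` consecutive quiet frames.
    """
    quiet_run = 0
    for i, frame in enumerate(frames):
        words = {w.strip(".,!?").lower() for w in frame.text.split()}
        if words & keywords:
            return i
        # Count consecutive quiet frames; any loud frame resets the run.
        quiet_run = quiet_run + 1 if frame.rms < silence_rms else 0
        if quiet_run >= silence_frames:
            return i
    return None
```

With these assumed defaults, three quiet frames in a row are treated as a silence. In practice the thresholds would need to be tuned to the microphone and the environment.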
[0460] (Example 1)
[0461] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0462] In everyday conversations, there is a challenge in keeping the conversation flowing smoothly without interruptions. In particular, when a conversation stalls, it is not easy to quickly find a topic appropriate for that moment. There is a need for a system that allows users to naturally introduce topics and continue the conversation smoothly in such situations.
[0463] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0464] In this invention, the server includes: a voice acquisition means worn by the user to acquire ambient sounds; a voice recognition means to convert the voice obtained by the voice acquisition means into digital data; a context analysis means to analyze the digital data and identify the flow of conversation and main topics; an information acquisition means to collect publicly available information of the conversation partner from an external information source via a communication means; a topic generation means to create a new topic based on the data obtained by the context analysis means and the information acquisition means; and a display means to present the topic constructed by the topic generation means to the user. This makes it possible to prevent conversational stagnation and smoothly provide the user with appropriate topics.
[0465] "Voice acquisition means" refers to a device or mechanism worn by the user that physically collects ambient sounds.
[0466] "Voice recognition means" refers to a technology or system that analyzes acquired voice data and converts it into corresponding text data.
[0467] "Context analysis means" refers to a process or method that analyzes text data generated by the voice recognition means to identify the flow of conversation and the main topics.
[0468] "Information acquisition means" refers to technology or equipment for collecting publicly available profile information and related data of a conversation partner via an external communication network.
[0469] "Topic generation means" refers to a process or algorithm that generates new conversation topics based on data obtained by context analysis means and information acquisition means.
[0470] "Display means" refers to a device or interface for presenting a topic constructed by a topic generation means to a user.
[0471] This invention is a dialogue support system for maintaining a smooth flow of conversation, and comprises a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means. This system is implemented as a device worn by the user, and its embodiments are shown below.
[0472] The user starts an everyday conversation while wearing the ear-worn device. The device uses the voice acquisition means to capture ambient sounds, collecting voice data through a built-in microphone. The acquired voice data is converted into text data in real time by the voice recognition means. High-precision speech recognition can be achieved here by using, for example, the Google Cloud Speech-to-Text API.
[0473] Subsequently, the terminal sends the converted text data to the server, which analyzes the text data using contextual analysis tools to identify the flow of the conversation and the main topics. This analysis utilizes natural language processing techniques to extract context and keywords.
[0474] Furthermore, the server uses information acquisition methods to collect public profile information and related data of the person it is interacting with via an external network. This is done by utilizing SNS APIs and public databases.
[0475] Based on the acquired information and contextual analysis results, the server uses a topic generation mechanism and a generative AI model to create new and appropriate conversation topics. This process generates highly adaptable prompt sentences that users can use immediately when they are struggling to find a topic. For example, a prompt sentence such as "Please suggest some new and interesting topics about recent travel destinations or events that have been in the news" might be used.
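As an illustration of how the context analysis results and the acquired partner information might be combined into a prompt sentence for the generative AI model, consider the following sketch. The function name `build_topic_prompt`, the wording of the prompt, and the dictionary layout of `partner_info` are assumptions made for this example. The call to the generative AI model itself is deliberately left out.

```python
from typing import Dict, List


def build_topic_prompt(keywords: List[str],
                       partner_info: Dict[str, List[str]]) -> str:
    """Assemble a prompt from conversation keywords and public partner information.

    `partner_info` maps a hypothetical label such as "recent posts" or
    "interests" to a list of short text items. Empty entries are skipped.
    """
    lines = ["The conversation has stalled. Please suggest one new, natural topic."]
    if keywords:
        lines.append("Recent conversation keywords: " + ", ".join(keywords) + ".")
    for label, items in partner_info.items():
        if items:
            lines.append(f"Conversation partner's {label}: " + "; ".join(items) + ".")
    return "\n".join(lines)
```

The returned string would then be passed to whichever generative AI model the server uses. Keeping prompt assembly in a separate function makes it easy to inspect exactly which partner information is being sent.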
[0476] Finally, the generated topic is notified to the user verbally or visually through a presentation mechanism. The device uses speech synthesis technology to present the suggested content to the user in a natural way. In this way, the user can prevent silences during the conversation and maintain smooth communication.
[0477] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0478] Step 1:
[0479] Voice acquisition
[0480] When a user starts a conversation, the device activates its voice acquisition system and captures surrounding conversational audio in high quality via the microphone. The input is real-time audio, and the output is digital audio data. Specifically, it uses noise cancellation technology to suppress ambient noise while clearly recording the necessary audio.
[0481] Step 2:
[0482] Speech recognition and text conversion
[0483] The terminal supplies the acquired digital audio data to a speech recognition system, which then converts this data into text data. The input is digital audio data, and the output is real-time text data. This process uses, for example, the Google Cloud Speech-to-Text API to accurately transcribe the audio content.
[0484] Step 3:
[0485] Contextual analysis
[0486] The server receives text data sent from the terminal and analyzes its content using contextual analysis tools. The input is text data obtained through speech recognition, and the output is analytical data including the flow of the conversation and its main themes. This analysis uses natural language processing techniques to identify the central themes and related keywords of the conversation.
[0487] Step 4:
[0488] Information acquisition
[0489] The server collects additional data about the conversation partner from publicly available sources via the external network. Inputs include the conversation partner's identification information and related keywords, while outputs include their public profile and related information. Specifically, it sends queries to SNS APIs and online databases to explore the conversation partner's recent activities and interests.
[0490] Step 5:
[0491] Topic generation
[0492] The server creates a new topic using a topic generation mechanism based on the context analysis results and acquired information. The input is the analyzed context data and collected information, and the output is the proposed conversation topic. A generative AI model is used to generate prompt sentences appropriate for the user. One example of such a prompt is "Please suggest some interesting recent travel destinations."
[0493] Step 6:
[0494] Topic presentation
[0495] The generated topic is sent from the server to the terminal and notified to the user via a presentation mechanism. The input is the generated conversation topic, and the output is the presented suggestion. The terminal uses speech synthesis technology to convey the topic to the user in a natural voice, preventing interruptions in the conversation.
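The inputs and outputs named in Steps 2 to 6 above can be made explicit as typed stage functions. The sketch below is only one possible arrangement: `ContextResult`, `Stages`, `run_once`, and `demo_stages` are hypothetical names, and the stub stages return canned values rather than calling any real speech recognition, SNS, or generative AI service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContextResult:
    """Output of Step 3: a summary of the conversation flow and its main themes."""
    flow: str
    themes: List[str]


@dataclass
class Stages:
    """Hypothetical stage functions; each would wrap a real service in practice."""
    recognize: Callable[[bytes], str]                    # Step 2: audio -> text
    analyze: Callable[[str], ContextResult]              # Step 3: text -> context
    acquire: Callable[[str, List[str]], List[str]]       # Step 4: partner id + keywords -> public info
    generate: Callable[[ContextResult, List[str]], str]  # Step 5: context + info -> topic
    present: Callable[[str], None]                       # Step 6: topic -> user


def run_once(audio: bytes, partner_id: str, stages: Stages) -> str:
    """Run Steps 2 to 6 once for a captured audio segment and return the topic."""
    text = stages.recognize(audio)
    context = stages.analyze(text)
    info = stages.acquire(partner_id, context.themes)
    topic = stages.generate(context, info)
    stages.present(topic)
    return topic


def demo_stages(shown: List[str]) -> Stages:
    """Stub stages with canned outputs, only to show how the pieces connect."""
    return Stages(
        recognize=lambda audio: "we talked about travel",
        analyze=lambda text: ContextResult(flow="stalled", themes=["travel"]),
        acquire=lambda pid, kws: [f"{pid} recently posted about {kws[0]}"],
        generate=lambda ctx, info: f"Ask about: {info[0]}",
        present=shown.append,
    )
```

Calling `run_once(b"...", "friend", demo_stages(shown))` appends the suggested topic to the list `shown`, which stands in for the terminal's speech synthesis or display output.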
[0496] (Application Example 1)
[0497] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0498] In modern households, conversations among family members often become monotonous or silent, frequently hindering smooth communication. This can weaken family bonds and lead to insufficient mutual understanding in daily life. Furthermore, it's difficult to naturally generate and introduce new topics to revitalize conversation, highlighting the need for technologies to support this process.
[0499] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0500] In this invention, the server includes means for acquiring sound information, means for interpreting sound information, means for analyzing background information, means for acquiring external information, means for generating topics, means for presenting topics, and a topic detection and generation function incorporated into a device for improving in-home communication. This makes it possible to prevent stagnation in conversations among family members and to stimulate conversation in a natural way.
[0501] "Sound information acquisition means" refers to a device worn by the user that has the function of collecting ambient sounds.
[0502] A "sound information interpretation means" is a system that analyzes acoustic data collected by a sound information acquisition means and converts it into linguistic data.
[0503] A "background analysis means" is a means that has the function of understanding the context and important themes of a dialogue using the converted language data.
[0504] "External information acquisition means" refers to a system that acquires information related to the interlocutor from external sources via the internet or other communication means.
[0505] A "topic generation means" is a means that has the function of generating new conversation themes based on data obtained from the background analysis means and the external information acquisition means.
[0506] A "topic presentation means" is a means that has the function of presenting or providing users with newly generated conversation topics.
[0507] "Devices for improving domestic communication" are devices designed to facilitate conversations and smooth communication among family members in a specific home environment.
[0508] This invention provides a system to support natural conversations within the home. The user-worn device incorporates a sound information acquisition means that captures ambient sounds. This device also has a sound information interpretation means that converts the acquired audio data into language data in real time. In terms of configuration, technologies such as Google Cloud Speech-to-Text are used as the sound information interpretation means.
[0509] The language data obtained by the sound information interpretation means is sent to the server. The server uses natural language processing libraries (e.g., spaCy or NLTK) to perform background analysis and understand the flow of the conversation and important themes. Furthermore, the server obtains relevant information from publicly available APIs on the internet (e.g., Twitter API or Wikipedia API) through external information acquisition means.
[0510] Based on these analysis results and acquired information, the server generates new conversation topics using the topic generation means. A generative AI model (e.g., OpenAI's GPT-4) is used for this purpose. The generated topics are presented to users within the home through the topic presentation means. Users can then use these topics to make conversations more engaging.
[0511] As a concrete example, when a conversation between a parent and child in the living room pauses, the system suggests, "How about talking about recent trends in home cooking?" providing a natural conversation starter. An example of a prompt used in the generative AI model is, "The conversation between parent and child has paused. Please suggest a conversation topic that will interest the child. Please use recent trends in home cooking."
[0512] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0513] Step 1:
[0514] The terminal collects ambient sound using sound information acquisition means. During this process, audio data is input in real time via a microphone. The collected audio data is then prepared for processing in the next step.
[0515] Step 2:
[0516] The device uses Google Cloud Speech-to-Text as a means of interpreting sound information, converting the input speech data into language data. This process converts the speech data into text data, preparing it for transmission to the server.
[0517] Step 3:
[0518] The server receives text data sent from the terminal and analyzes it using background analysis tools. Specifically, it uses natural language processing libraries (such as spaCy and NLTK) to extract context and important themes from the text data and understand conversation patterns.
[0519] Step 4:
[0520] The server uses publicly available APIs on the internet (e.g., Twitter API and Wikipedia API) as means of acquiring external information to collect external information related to the interlocutor. It uses keywords obtained by the background analysis means as input and receives related information based on these keywords as output.
[0521] Step 5:
[0522] The server generates new conversation topics using a generative AI model (e.g., OpenAI's GPT-4) based on data obtained from background analysis and external information acquisition. It uses analysis results and external information as input and outputs a newly constructed topic.
[0523] Step 6:
[0524] The server sends topics generated through topic suggestion mechanisms to the terminal, which then presents them to the user. Specifically, it uses voice suggestions and on-screen displays to inform the user of new topics in a natural way.
[0525] Step 7:
[0526] The conversation resumes based on the topic suggested to the user. In this step, the user's responses become input for the next analysis, and a mechanism is implemented to allow the system to continuously learn and improve.
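Steps 3 and 4 above rely on keywords extracted from the transcribed text. The text names spaCy and NLTK for this analysis. The sketch below is not a use of either library but a dependency-free stand-in based on simple word counts, intended only to show the shape of the input and output. The stop-word list and the `extract_keywords` function are assumptions made for the example.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list; a real system would use a fuller one,
# for example the lists shipped with an NLP library.
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "on", "is", "was",
    "were", "it", "we", "you", "that", "this", "for", "with", "about",
    "had", "have", "so",
}


def extract_keywords(transcript: str, top_n: int = 3) -> List[str]:
    """Return the most frequent non-stop-words in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```

The resulting keywords could then be passed to the external information acquisition step as search terms. Frequency counting ignores grammar and meaning, which is why the text's suggestion of a full natural language processing library is the more realistic choice.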
[0527] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0528] This invention is a system that uses an ear-worn device worn by the user to compensate for silences and lack of topics during conversations, and in particular provides dialogue support that takes the user's emotional state into consideration. This invention includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, a presentation means, and an emotion engine that analyzes the user's emotional state.
[0529] Description of the Embodiment
[0530] Consider a scenario where a user wears a device and engages in a normal conversation. During the conversation, the device records surrounding sounds using a voice acquisition device. The recorded audio data is immediately converted into text data by a speech recognition device, and this text data is sent to a server.
[0531] The server uses contextual analysis tools to analyze the flow of the conversation based on this text data. Furthermore, it uses external communication networks via information acquisition tools to collect publicly available information about the conversation partner. This includes social media posts and publicly available project information.
[0532] A key feature of this invention is that the emotion engine monitors the user's voice and speech patterns to identify their emotional state. This emotional state, combined with the aforementioned analysis results, helps the topic generation means determine the topic best suited to the user's mood. For example, if the user is relaxed, adjustments such as selecting a light topic are made.
[0533] The emotion engine is also designed to analyze the user's facial expressions in real time, using this additional information to improve the accuracy of the emotional-state estimate. This data is collected on the server and used to support the generation of topics that best reflect the user's current situation.
[0534] The generated topics are delivered to the user via the device. The device suggests topics to the user in a natural way through speech synthesis and the display. This allows the user to continue the conversation smoothly with the other person.
[0535] Specific example
[0536] Consider a scenario where a user is talking to a friend in a park. If there is a silence in the conversation, the device starts recording the audio and transcribes it into text using speech recognition. The server extracts specific keywords and learns from social media that the person the user is talking to has recently been planning a camping trip. The emotion engine also detects that the user is in a calm mood. Based on this, the server generates a topic about "the latest camping gear," and the device suggests to the user via voice, "Why don't you ask your friend about the latest camping items?", allowing the user to resume the conversation in a natural way.
[0537] The following describes the processing flow.
[0538] Step 1:
[0539] The device detects conversational audio around the user and starts recording when it detects trigger keywords or silence. Recording continues in real time, even while the conversation is ongoing.
[0540] Step 2:
[0541] The device sends the recorded audio data to a speech recognition engine, which converts the audio into text data. This text data is then used for later analysis and information extraction.
[0542] Step 3:
[0543] The server analyzes the text data received from the terminal through contextual analysis tools to understand the flow of the conversation and its main topics. This allows for an understanding of the direction of the conversation.
[0544] Step 4:
[0545] The server uses information retrieval methods to obtain the conversation partner's social media and public information via an external network. This reveals the conversation partner's recent activities and areas of interest.
[0546] Step 5:
[0547] The device sends the user's voice and speech patterns to the emotion engine in real time to evaluate the user's emotional state. If necessary, a facial expression sensor is also used for more accurate emotion recognition.
[0548] Step 6:
[0549] The server combines the results of the context analysis means, the data from the information acquisition means, and the emotional state determination by the emotion engine to activate the topic generation means and generate new conversation topics suitable for the user.
[0550] Step 7:
[0551] The device presents generated topics to the user audibly or visually, and suggests new conversation topics. The presentation takes into account the user's emotional state and the flow of the conversation.
[0552] Step 8:
[0553] By accepting the device's suggestions and incorporating new topics into the conversation, users can maintain smooth communication with the other person, improving the continuity and quality of the conversation.
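Step 6 of the flow above combines the context analysis, the partner information, and the estimated emotion when choosing a topic. One simple way to picture this is to score candidate topics for relevance and then filter them by a tone suited to the user's mood, as in the "light topic when relaxed" adjustment mentioned earlier. Everything in the sketch below is hypothetical: the `Candidate` structure, the tone labels, the `PREFERRED_TONES` table, and the relevance scores are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Candidate:
    topic: str
    tone: str         # hypothetical label: "light", "neutral", or "serious"
    relevance: float  # hypothetical score from context and partner info, 0.0 to 1.0


# Hypothetical mapping from an estimated emotion to the tones considered suitable.
PREFERRED_TONES: Dict[str, Set[str]] = {
    "relaxed": {"light", "neutral"},
    "calm": {"light", "neutral"},
    "tense": {"neutral"},
}


def pick_topic(candidates: List[Candidate], emotion: str) -> Optional[Candidate]:
    """Pick the most relevant candidate whose tone suits the estimated emotion.

    If no candidate matches the preferred tones, fall back to the most
    relevant candidate overall. Returns None when there are no candidates.
    """
    allowed = PREFERRED_TONES.get(emotion, {"neutral"})
    matching = [c for c in candidates if c.tone in allowed]
    pool = matching or candidates
    return max(pool, key=lambda c: c.relevance) if pool else None
```

With a calm user, a light candidate such as "latest camping gear" would be chosen over a more relevant but serious one.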
[0554] (Example 2)
[0555] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0556] Silences and a lack of topics during conversations can make it difficult for users to engage in dialogue. Furthermore, offering topics that don't align with the user's emotional state can lead to unnatural conversations. There is a need for a system that addresses these challenges and supports smooth and natural dialogue.
[0557] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0558] In this invention, the server includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, an emotion analysis means, a topic generation means, and a presentation means. This reduces silence during the user's conversation and enables natural topic provision according to their emotional state.
[0559] "Voice acquisition means" refers to a means of collecting ambient sounds using a device worn by the user.
[0560] "Speech recognition means" refers to means for converting collected speech data into text data.
[0561] "Context analysis means" refers to means for analyzing text data to understand the flow and context of a conversation.
[0562] "Information acquisition means" refers to means for collecting information about the conversation partner from external sources.
[0563] "Emotion analysis means" refers to means for analyzing a user's voice and speech patterns to identify the user's emotional state.
[0564] "Topic generation means" refers to means for generating new topics based on the context analysis results, the information acquisition results, and the emotion analysis results.
[0565] "Presentation means" refers to means for suggesting generated topics to the user in a natural way.
[0566] This invention is a system that utilizes an ear-worn device worn by the user to facilitate smooth conversation. In particular, it focuses on providing dialogue support while taking into account the user's emotional state. A detailed embodiment is shown below.
[0567] Hardware and software to be used
[0568] A terminal device is used. This includes a voice acquisition means for collecting and recording ambient sounds. This voice data is converted into text data in real time by a speech recognition means. The converted text data is sent to a server.
[0569] The server performs advanced analysis. Based on the received text data, it analyzes the flow of the conversation using contextual analysis tools. Furthermore, it collects publicly available information about the conversation partner via external communication networks through information acquisition tools.
[0570] On the other hand, emotion analysis tools analyze the user's voice patterns to identify the user's emotional state. Information about the emotional state is an important factor in determining the topic of conversation.
[0571] Based on these results, the server uses the topic generation means to generate, with a generative AI model, topics that match the user's emotions and the context of the conversation.
[0572] The generated topics are presented to the user through the device's presentation means. Because they are displayed, or spoken in a natural voice using speech synthesis technology, the user can engage in smooth conversations.
[0573] Specific example
[0574] For example, consider a situation where a user is talking with a friend in a park and a silence occurs. At this point, the device immediately starts recording the audio, converts it into text data, and sends it to the server. The server analyzes the context of the conversation and the user's emotional state, and selects a topic, for example, "the latest camping gear." Based on this, the device can suggest to the user, "Why don't you ask your friend about the latest camping items?"
[0575] Example of a prompt
[0576] An example of a prompt to input into the generative AI model is, "Generate an appropriate topic when the user is relaxed and knows that the conversation partner has recently planned a camping trip." Based on this, topics that maintain a natural flow of conversation will be generated.
[0577] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0578] Step 1:
[0579] The terminal collects ambient sound in real time through the attached device. After audio data is obtained as input, the terminal records the data using an audio acquisition means and converts it into text data using a speech recognition means. The converted text data becomes the output. Specifically, this is a process in which the waveform of the sound is captured and converted into text information.
[0580] Step 2:
[0581] The server receives text data sent from the terminal. Using contextual analysis techniques, it analyzes the flow of the conversation and important topics based on this text data as input. The output provides analysis results regarding the themes of the conversation and the user's intentions. Specifically, it utilizes natural language processing techniques to extract keywords and understand the context.
[0582] Step 3:
[0583] Based on the information obtained through contextual analysis, the server uses information acquisition methods to collect publicly available information about the conversation partner from external communication networks. The context analysis results serve as the input, and the collected information, such as SNS posts and public profiles, is the output. Specifically, this is the operation of collecting digital content from the internet via an API.
[0584] Step 4:
[0585] The device uses emotion analysis techniques to analyze the user's voice and speech patterns, identifying their emotional state in real time. The user's voice data is input, and the analyzed emotional state is output. Specifically, it evaluates the tone and tempo of the voice to identify the emotional state (e.g., joy, calmness).
[0586] Step 5:
[0587] The server uses a topic generation mechanism based on all the information acquired so far to input prompt sentences into the generative AI model. This results in the output of new topics optimized for the user's emotions and the context of the conversation. Specifically, the generative AI model constructs topics that reflect the appropriate context and prepares natural suggestions.
[0588] Step 6:
[0589] The terminal receives new topics sent from the server and presents them to the user. Using the presentation means, it presents topics via speech synthesis as needed. The suggested topics become the output, and the user can resume the conversation based on them. Specifically, speech synthesis technology is used to produce speech output that closely resembles human speech.
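Step 4 above evaluates the tone and tempo of the user's voice to identify an emotional state such as joy or calmness. The disclosure leaves the method to the emotion identification model 59, so the following is only a rule-of-thumb sketch under stated assumptions: the `VoiceFeatures` fields, the per-user baseline, the 15% margin, and the mapping from features to labels are all illustrative and not validated.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical summary of a speech segment."""
    mean_pitch_hz: float     # stands in for "tone"
    words_per_minute: float  # stands in for "tempo"


def estimate_emotion(current: VoiceFeatures,
                     baseline: VoiceFeatures,
                     margin: float = 0.15) -> str:
    """Label a segment by comparing it with the user's usual voice.

    Returns "joy" when both pitch and tempo are clearly above the baseline,
    "calmness" when both stay within the margin, and "undetermined" otherwise.
    """
    pitch_ratio = current.mean_pitch_hz / baseline.mean_pitch_hz
    tempo_ratio = current.words_per_minute / baseline.words_per_minute
    if pitch_ratio > 1 + margin and tempo_ratio > 1 + margin:
        return "joy"
    if abs(pitch_ratio - 1) <= margin and abs(tempo_ratio - 1) <= margin:
        return "calmness"
    return "undetermined"
```

Comparing against a per-user baseline, rather than fixed absolute thresholds, is a design choice made here because pitch and speaking rate vary widely between speakers. A trained model would be expected to replace these hand-written rules.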
[0590] (Application Example 2)
[0591] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0592] In conversations, silences or a lack of topics often lead to a breakdown in communication. Furthermore, finding appropriate topics while considering the emotional state of the person being spoken to is difficult in such situations. Especially in situations with limited interaction, the person being spoken to is more likely to feel anxious or tense, leading to communication breakdowns. To solve this, a comprehensive system is needed that accurately analyzes the user's emotional state and suggests topics appropriate to the situation.
[0593] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0594] In this invention, the server includes a sound acquisition means worn by the user to record ambient sounds, a speech recognition means to convert the sounds acquired by the sound acquisition means into text information, and a context analysis means to analyze the text information to understand the conversation situation. This makes it possible to maintain smooth conversation by accurately selecting and presenting topics to the user based on the user's emotional state.
[0595] "Sound acquisition means" refers to a device worn by the user to record ambient sounds.
[0596] "Speech recognition means" refers to a technology that converts speech acquired by sound acquisition means into text information.
[0597] A "contextual analysis tool" is a mechanism that analyzes textual information to understand the flow and situation of a conversation.
[0598] An "information gathering tool" is a device used to obtain information about the person you are interacting with from external sources.
[0599] "Topic creation means" refers to a technique that generates new topics of conversation based on information collected by context analysis means and information gathering means.
[0600] A "presentation device" is a device for suggesting conversation topics generated by topic creation means to the user.
[0601] An "emotion analysis engine" is an engine that determines the user's emotional state from voice and facial expression data.
[0602] This invention is a system designed for use with consumer robots and is intended to facilitate smooth interaction with users. It begins by recording ambient sounds using an ear-worn device worn by the user and converting them into text data.
[0603] The server uses a microphone attached to the device as a means of acquiring sound. For speech recognition, it utilizes a speech recognition service such as Google Cloud Speech-to-Text to convert the recorded audio into text information in real time. The context analysis means analyzes the obtained text information and plays a role in understanding the context of the conversation.
[0604] Furthermore, the server utilizes information gathering tools to collect publicly available information about the conversation partner via external communication networks. This information is extracted from data such as social media posts and publicly available project information. This makes it possible to enhance the context of the conversation.
[0605] The emotion analysis engine analyzes voice and user facial expressions in real time to determine the user's emotional state. This allows it to suggest topics that are most appropriate to the user's emotions at that moment.
[0606] The presentation device provides a way to naturally suggest topics generated by a topic generation mechanism to the user through speech synthesis and display. For example, if silence occurs during a conversation with family, the system will determine that the user is relaxed and generate an appropriate topic. It will then restart the conversation in a natural way by suggesting, for instance, "How about talking about recent camping gear?" through the speaker.
[0607] One of the main processing technologies used by the server is a generative AI model, which generates and suggests information using a prompt such as, "Please suggest some light, family-oriented topics that will interest the user when they are relaxing."
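As a rough illustration of how the estimated emotional state could shape such a prompt, the sketch below assembles a prompt string in the style of the example above. The function name, the emotion-to-tone mapping, and the optional keyword suffix are hypothetical and are not taken from this disclosure.

```python
from typing import Sequence

# Hypothetical mapping from emotional state to the tone requested in the prompt.
TONE_BY_EMOTION = {"relaxed": "light", "tense": "reassuring", "joyful": "lively"}


def build_topic_prompt(emotional_state: str, setting: str,
                       keywords: Sequence[str] = ()) -> str:
    """Build a topic-suggestion prompt from emotion, setting, and keywords."""
    # Unknown emotional states fall back to a neutral tone (an assumption).
    tone = TONE_BY_EMOTION.get(emotional_state, "neutral")
    prompt = (f"Please suggest some {tone}, {setting}-oriented topics "
              f"that will interest the user when they are {emotional_state}.")
    if keywords:
        # Optionally pass along keywords from the context analysis.
        prompt += " Recent conversation keywords: " + ", ".join(keywords) + "."
    return prompt
```

Calling `build_topic_prompt("relaxed", "family")` yields a prompt close to the one quoted above. The resulting string would then be passed to whichever generative AI model the server uses.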
[0608] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0609] Step 1:
[0610] An acoustic acquisition device attached to the terminal collects ambient sound and transmits it to a server as digital audio data. The input is ambient sound, and the output is digital audio data. Specifically, it continuously records sound using a microphone and stores it in a buffer.
[0611] Step 2:
[0612] The server uses speech recognition to convert digital audio data into text. This process utilizes speech recognition services such as Google Cloud Speech-to-Text. The input is digital audio data, and the output is text. Specifically, the audio data is sent to the cloud service, and the returned text is received.
[0613] Step 3:
[0614] The server uses contextual analysis tools to analyze textual information and understand the flow and situation of the conversation. The input is textual information, and the output is the analyzed conversational context. Specifically, it uses natural language processing techniques to perform keyword extraction and relevance analysis.
[0615] Step 4:
[0616] The server uses information gathering tools to obtain publicly available information about the conversation partner from an external communication network. The input is the conversation partner's identifier, and the output is the obtained publicly available information. Specifically, it uses the SNS API to retrieve the other party's latest SNS posts, project information, etc.
[0617] Step 5:
[0618] The server uses an emotion analysis engine to determine the user's emotional state from their voice and facial expressions. The input is voice data and facial expression data, and the output is the analyzed emotional state. Specifically, it performs voice tone analysis and facial expression recognition from camera footage.
[0619] Step 6:
[0620] The server uses a topic generation mechanism to generate new conversation topics based on the information obtained in the previous steps, utilizing a generative AI model. This process uses prompts. The input consists of conversation context information, public information, and emotional state, and the output is the generated conversation topic. A specific prompt for the generative AI model is, "Suggest a light, family-oriented topic that will interest the user when they are relaxed."
[0621] Step 7:
[0622] The server proposes generated conversation topics to the user via a presentation device. The presentation device uses speech synthesis and a display. The input is the generated conversation topic, and the output is the conversation topic proposal to the user. Specifically, it outputs synthesized speech from a speaker and displays supplementary information on a display.
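The server-side flow of Steps 2 to 6 above can be sketched as a chain of interchangeable stages. This is only an illustrative outline under stated assumptions: the class name, the stage signatures, and the stand-in lambdas are hypothetical, and the emotion stage here takes audio only, whereas Step 5 also uses facial expression data. A real system would plug a speech recognition service, an SNS API client, and a generative AI model into the corresponding slots.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TopicPipeline:
    """Hypothetical container for the server-side stages of Steps 2 to 6."""
    transcribe: Callable[[bytes], str]
    analyze_context: Callable[[str], List[str]]
    fetch_public_info: Callable[[str], List[str]]
    estimate_emotion: Callable[[bytes], str]
    generate_topic: Callable[[List[str], List[str], str], str]

    def run(self, audio: bytes, partner_id: str) -> str:
        text = self.transcribe(audio)                      # Step 2
        context = self.analyze_context(text)               # Step 3
        public_info = self.fetch_public_info(partner_id)   # Step 4
        emotion = self.estimate_emotion(audio)             # Step 5
        return self.generate_topic(context, public_info, emotion)  # Step 6


# Stand-in stages so the sketch runs without any external service.
pipeline = TopicPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    analyze_context=lambda text: text.lower().split(),
    fetch_public_info=lambda partner_id: [f"{partner_id} posted about camping"],
    estimate_emotion=lambda audio: "relaxed",
    generate_topic=lambda ctx, info, emo: f"[{emo}] {info[0]} (context: {', '.join(ctx)})",
)
print(pipeline.run(b"Weekend plans", "friend"))
```

Keeping each stage behind a plain callable makes it straightforward to swap a stub for a real service one step at a time.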
[0623] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.
[0624] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0625] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0626] [Fourth Embodiment]
[0627] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0628] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0629] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0630] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0631] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0632] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0633] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0634] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0635] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0636] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0637] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0638] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0639] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0640] This invention is a system that uses an ear-worn device worn by the user to prevent silence during conversations and enable smooth communication with the other party. This system consists of a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means.
[0641] Description of the Embodiment
[0642] When a user wears the device and begins a casual conversation, the device first records the conversation using a voice acquisition tool. The recorded audio is then converted into text data in real time by a speech recognition tool. This text data is sent to a server, where a context analysis tool analyzes the flow of the conversation and important themes.
[0643] The server further uses information retrieval methods to obtain public profile information and related data of the conversation partner via external networks. This data can be obtained from public domain sources such as social networking services and business profiles.
[0644] Based on the acquired conversation context and information about the conversation partner, the server uses a topic generation mechanism to generate new conversation topics that can be suggested to the user. For example, new topics may be generated based on places the conversation partner has recently visited or topics they are interested in. This process allows the user to receive appropriate topics naturally when the conversation tends to stall.
[0645] The generated topics are communicated to the user through presentation methods. The device utilizes audio and display devices to present new topics to the user in an easy-to-understand and natural way.
[0646] Specific example
[0647] One evening, a user was talking with a friend at a cafe. When the conversation seemed to falter, the device started recording and transcribed the audio in real time. The server extracted the keywords "coffee" and "travel," and through its information acquisition tools, discovered that the friend had made a social media post about coffee that morning. The server used this information to generate a topic about "new coffee beans." The device then suggested to the user via voice, "Why don't you ask your friend about the coffee they were talking about on social media this morning?" The user was then able to restart the conversation based on that topic and continue communicating in a natural flow.
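The matching step in this example, where conversation keywords are related to the friend's public posts, could be approximated by a simple word-overlap ranking. The sketch below is an assumption-laden illustration: the function name and the scoring rule are hypothetical, and the posts are taken to have already been retrieved by the information acquisition means.

```python
import re
from typing import Optional, Sequence


def pick_relevant_post(keywords: Sequence[str],
                       posts: Sequence[str]) -> Optional[str]:
    """Return the post sharing the most words with the keywords, or None."""
    wanted = {k.lower() for k in keywords}
    best_post, best_score = None, 0
    for post in posts:
        # Tokenize the post into lowercase words.
        words = set(re.findall(r"[a-z']+", post.lower()))
        score = len(wanted & words)
        # Keep the first post with the strictly highest overlap.
        if score > best_score:
            best_post, best_score = post, score
    return best_post
```

With the keywords "coffee" and "travel", a post mentioning coffee would be selected over an unrelated one, and the selected post could then be included in the prompt given to the generative AI model.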
[0648] The following describes the processing flow.
[0649] Step 1:
[0650] The device monitors the conversation and starts recording when it detects specific keywords or silences. Recording proceeds in real time and continues even in the middle of a conversation.
[0651] Step 2:
[0652] The device sends the recorded audio data to the speech recognition engine. The speech recognition engine converts the audio into text data, thereby digitizing the linguistic information.
[0653] Step 3:
[0654] The server receives text data from the terminal. The server uses context analysis tools and applies natural language processing techniques to understand the topic and flow of the conversation.
[0655] Step 4:
[0656] The server uses information retrieval methods to obtain relevant information from the conversation partner's social media accounts and public profiles via an external network. This allows the server to collect information about the other person's interests and recent activities.
[0657] Step 5:
[0658] The server combines the analysis results of the context analysis means with the data collected by the information acquisition means, and the topic generation means creates a new topic. This involves using a generative AI model to form a topic suitable for conversation.
[0659] Step 6:
[0660] The terminal receives generated topics from the server and presents them to the user using speech synthesis technology and a display. The presentation is designed to be natural.
[0661] Step 7:
[0662] Based on the device's suggestions, the user can resume the conversation with the other party and continue communication smoothly. This prevents interruptions in the conversation and improves the quality of interaction.
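Step 1 above relies on detecting silence. One common and simple approach, shown here only as a sketch, is to compute the root-mean-square (RMS) energy of short audio frames and report silence once several consecutive frames fall below a threshold. The function names, the threshold, and the frame count are hypothetical, and samples are assumed to be floats normalized to the range -1.0 to 1.0.

```python
import math
from typing import Iterable, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def silence_detected(frames: Iterable[Sequence[float]],
                     rms_threshold: float = 0.01,
                     min_silent_frames: int = 3) -> bool:
    """True once min_silent_frames consecutive frames fall below the threshold."""
    run = 0
    for frame in frames:
        if frame_rms(frame) < rms_threshold:
            run += 1
            if run >= min_silent_frames:
                return True
        else:
            # Any loud frame resets the count of consecutive quiet frames.
            run = 0
    return False
```

Requiring several consecutive quiet frames avoids triggering on the short pauses that occur naturally between words.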
[0663] (Example 1)
[0664] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0665] In everyday conversations, there is a challenge in keeping the conversation flowing smoothly without interruptions. In particular, when a conversation stalls, it is not easy to quickly find a topic appropriate for that moment. There is a need for a system that allows users to naturally introduce topics and continue the conversation smoothly in such situations.
[0666] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0667] In this invention, the server includes: a voice acquisition means worn by the user to acquire ambient sounds; a voice recognition means to convert the voice obtained by the voice acquisition means into digital data; a context analysis means to analyze the digital data and identify the flow of conversation and main topics; an information acquisition means to collect publicly available information of the conversation partner from an external information source via a communication means; a topic generation means to create a new topic based on the data obtained by the context analysis means and the information acquisition means; and a display means to present the topic constructed by the topic generation means to the user. This makes it possible to prevent conversational stagnation and smoothly provide the user with appropriate topics.
[0668] "Voice acquisition means" refers to a device or mechanism worn by the user that physically collects ambient sounds.
[0669] "Speech recognition means" refers to a technology or system that analyzes acquired speech data and converts it into corresponding text data.
[0670] "Contextual analysis means" refers to a process or method that analyzes text data generated by speech recognition means to identify the flow of conversation and the main topics.
[0671] "Information acquisition means" refers to technology or equipment for collecting publicly available profile information and related data of a conversation partner via an external communication network.
[0672] "Topic generation means" refers to a process or algorithm that generates new conversation topics based on data obtained by context analysis means and information acquisition means.
[0673] "Display means" refers to a device or interface for presenting a topic constructed by a topic generation means to a user.
[0674] This invention is a dialogue support system for maintaining a smooth flow of conversation, and comprises a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, and a presentation means. This system is implemented as a device worn by the user, and its embodiments are shown below.
[0675] The user uses an ear-worn device to initiate everyday conversations. The device uses a voice acquisition mechanism to capture ambient sounds and collects voice data through a built-in microphone. The acquired voice data is converted into text data in real time via a speech recognition mechanism. This speech recognition is performed using, for example, the Google Cloud Speech-to-Text API.
[0676] Subsequently, the terminal sends the converted text data to the server, which analyzes the text data using contextual analysis tools to identify the flow of the conversation and the main topics. This analysis utilizes natural language processing techniques to extract context and keywords.
[0677] Furthermore, the server uses information acquisition methods to collect public profile information and related data of the person it is interacting with via an external network. This is done by utilizing SNS APIs and public databases.
[0678] Based on the acquired information and contextual analysis results, the server uses a topic generation mechanism and a generative AI model to create new and appropriate conversation topics. This process generates highly adaptable prompt sentences that users can use immediately when they are struggling to find a topic. For example, a prompt sentence such as "Please suggest some new and interesting topics about recent travel destinations or events that have been in the news" might be used.
[0679] Finally, the generated topic is notified to the user verbally or visually through a presentation mechanism. The device uses speech synthesis technology to present the suggested content to the user in a natural way. In this way, the user can prevent silences during the conversation and maintain smooth communication.
[0680] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0681] Step 1:
[0682] Voice acquisition
[0683] When a user starts a conversation, the device activates its voice acquisition system and captures surrounding conversational audio in high quality via the microphone. The input is real-time audio, and the output is digital audio data. Specifically, it uses noise cancellation technology to suppress ambient noise while clearly recording the necessary audio.
[0684] Step 2:
[0685] Speech recognition and text conversion
[0686] The terminal supplies the acquired digital audio data to a speech recognition system, which then converts this data into text data. The input is digital audio data, and the output is real-time text data. This process uses, for example, the Google Cloud Speech-to-Text API to accurately transcribe the audio content.
[0687] Step 3:
[0688] Contextual analysis
[0689] The server receives text data sent from the terminal and analyzes its content using contextual analysis tools. The input is text data obtained through speech recognition, and the output is analytical data including the flow of the conversation and its main themes. This analysis uses natural language processing techniques to identify the central themes and related keywords of the conversation.
[0690] Step 4:
[0691] Information acquisition
[0692] The server collects additional data about the conversation partner from publicly available sources via the external network. Inputs include the conversation partner's identification information and related keywords, while outputs include their public profile and related information. Specifically, it sends queries to SNS APIs and online databases to explore the conversation partner's recent activities and interests.
[0693] Step 5:
[0694] Topic generation
[0695] The server creates a new topic using a topic generation mechanism based on the context analysis results and acquired information. The input is the analyzed context data and collected information, and the output is the proposed conversation topic. A prompt sentence appropriate for the user is generated and input into a generative AI model. One example of such a prompt is "Please suggest some interesting recent travel destinations."
[0696] Step 6:
[0697] Topic presentation
[0698] The generated topic is sent from the server to the terminal and notified to the user via a presentation mechanism. The input is the generated conversation topic, and the output is the presented suggestion. The terminal uses speech synthesis technology to convey the topic to the user in a natural voice, preventing interruptions in the conversation.
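The contextual analysis in Step 3 above identifies central themes and related keywords. As a minimal stand-in for a full natural language processing pipeline, the sketch below extracts keywords by word frequency after removing common words. The function name, the stopword list, and the minimum word length are hypothetical assumptions, and a production system would likely use a dedicated NLP library instead.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "was", "had", "before", "again",
             "with", "this", "that", "for", "are"}


def extract_keywords(text: str, top_n: int = 3) -> List[str]:
    """Return the top_n most frequent non-stopword words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords and very short words before counting.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


transcript = ("We had coffee before the trip. The coffee was great, "
              "and the trip to Kyoto was fun. Coffee again tomorrow?")
print(extract_keywords(transcript, top_n=2))  # ['coffee', 'trip']
```

The extracted keywords could then serve as the "related keywords" input to the information acquisition in Step 4.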
[0699] (Application Example 1)
[0700] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0701] In modern households, conversations among family members often become monotonous or silent, frequently hindering smooth communication. This can weaken family bonds and lead to insufficient mutual understanding in daily life. Furthermore, it's difficult to naturally generate and introduce new topics to revitalize conversation, highlighting the need for technologies to support this process.
[0702] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0703] In this invention, the server includes means for acquiring sound information, means for interpreting sound information, means for analyzing background information, means for acquiring external information, means for generating topics, means for presenting topics, and a topic detection and generation function incorporated into a device for improving in-home communication. This makes it possible to prevent stagnation in conversations among family members and to stimulate conversation in a natural way.
[0704] "Sound information acquisition means" refers to a device worn by the user that has the function of collecting ambient sounds.
[0705] A "sound information interpretation means" is a system that analyzes acoustic data collected by a sound information acquisition means and converts it into linguistic data.
[0706] A "background analysis device" is a device that has the function of understanding the context and important themes of a dialogue using converted language data.
[0707] "External information acquisition means" refers to a system that acquires information related to the interlocutor from external sources via the internet or other communication means.
[0708] A "topic generation device" is a device that has the function of generating new conversation themes based on data obtained from background analysis devices and external information acquisition devices.
[0709] A "topic presentation device" is a device that has the function of presenting or providing users with newly generated conversation topics.
[0710] "Devices for improving domestic communication" are devices designed to facilitate conversations and smooth communication among family members in a specific home environment.
[0711] This invention provides a system to support natural conversations within the home. The user-worn device incorporates a sound information acquisition means that captures ambient sounds. This device also has a sound information interpretation means that converts the acquired audio data into language data in real time. Specifically, technologies such as Google Cloud Speech-to-Text are used as the sound information interpretation means.
[0712] The language data obtained by the sound information interpretation means is sent to the server. The server uses natural language processing libraries (e.g., spaCy or NLTK) to perform background analysis and understand the flow of the conversation and important themes. Furthermore, the server obtains relevant information from publicly available APIs on the internet (e.g., Twitter API or Wikipedia API) through external information acquisition means.
[0713] Based on these analysis results and acquired information, the server generates new conversation topics using topic generation tools. A generative AI model (e.g., OpenAI's GPT-4) is used for this purpose. The generated topics are presented to users within the home through topic presentation tools. Users can then use these topics to make conversations more engaging.
[0714] As a concrete example, when a conversation between a parent and child in the living room pauses, the system suggests, "How about talking about recent trends in home cooking?" providing a natural conversation starter. An example of a prompt used in the generative AI model is, "The conversation between parent and child has paused. Please suggest a conversation topic that will interest the child. Please use recent trends in home cooking."
[0715] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0716] Step 1:
[0717] The terminal collects ambient sound using sound information acquisition means. During this process, audio data is input in real time via a microphone. The collected audio data is then prepared for processing in the next step.
[0718] Step 2:
[0719] The device uses Google Cloud Speech-to-Text as a means of interpreting sound information, converting the input speech data into language data. This process converts the speech data into text data, preparing it for transmission to the server.
[0720] Step 3:
[0721] The server receives text data sent from the terminal and analyzes it using background analysis tools. Specifically, it uses natural language processing libraries (such as spaCy and NLTK) to extract context and important themes from the text data and understand conversation patterns.
[0722] Step 4:
[0723] The server uses publicly available APIs on the internet (e.g., Twitter API and Wikipedia API) as means of acquiring external information to collect external information related to the interlocutor. It uses keywords obtained by the background analysis means as input and receives related information based on these keywords as output.
[0724] Step 5:
[0725] The server generates new conversation topics using a generative AI model (e.g., OpenAI's GPT-4) based on data obtained from background analysis and external information acquisition. It uses analysis results and external information as input and outputs a newly constructed topic.
[0726] Step 6:
[0727] The server sends topics generated through topic suggestion mechanisms to the terminal, which then presents them to the user. Specifically, it uses voice suggestions and on-screen displays to inform the user of new topics in a natural way.
[0728] Step 7:
[0729] The conversation resumes based on the topic suggested to the user. In this step, the user's responses become input for the next analysis, and a mechanism is implemented to allow the system to continuously learn and improve.
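The feedback in Step 7, where the user's responses become input for the next analysis, could be supported by a bounded history of recent turns. The sketch below is a hypothetical illustration only: the class name and the five-turn default are assumptions, and it covers just the carrying-over of recent turns, not any learning mechanism.

```python
from collections import deque
from typing import Deque


class ConversationHistory:
    """Keeps the most recent turns as input for the next analysis round."""

    def __init__(self, max_turns: int = 5) -> None:
        # A bounded deque discards the oldest turn automatically.
        self._turns: Deque[str] = deque(maxlen=max_turns)

    def add_turn(self, text: str) -> None:
        """Record one transcribed utterance, such as the user's response."""
        self._turns.append(text)

    def analysis_input(self) -> str:
        """Concatenate the retained turns for the background analysis."""
        return " ".join(self._turns)
```

Bounding the history keeps the analysis focused on the current stretch of conversation rather than on topics that were dropped long ago.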
[0730] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0731] This invention is a system that uses an ear-worn device worn by the user to compensate for silences and lack of topics during conversations, and in particular provides dialogue support that takes the user's emotional state into consideration. This invention includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, a topic generation means, a presentation means, and an emotion engine that analyzes the user's emotional state.
[0732] Description of the Embodiment
[0733] Consider a scenario where a user wears a device and engages in a normal conversation. During the conversation, the device records surrounding sounds using a voice acquisition device. The recorded audio data is immediately converted into text data by a speech recognition device, and this text data is sent to a server.
[0734] The server uses contextual analysis tools to analyze the flow of the conversation based on this text data. Furthermore, it uses external communication networks via information acquisition tools to collect publicly available information about the conversation partner. This includes social media posts and publicly available project information.
[0735] A key feature of this invention is that the emotion engine monitors the user's voice and speech patterns to identify their emotional state. This emotional state, combined with the aforementioned analysis results, helps the topic generation means determine the topic best suited to the user's mood. For example, if the user is relaxed, adjustments such as selecting a light topic are made.
[0736] The emotion engine is also designed to analyze the user's facial expressions in real time and to use this additional information to estimate the emotional state more accurately. This data is collected on a server and used to support the generation of topics that best reflect the user's current situation.
[0737] The generated topics are delivered to the user via the device, which suggests them in a natural manner through speech synthesis and display. This allows the user to continue the conversation smoothly with the other person.
[0738] Specific example
[0739] Consider a scenario where a user is talking to a friend in a park. If there is a silence in the conversation, the device starts recording the audio and transcribes it into text using speech recognition. The server extracts specific keywords and learns from social media that the person the user is talking to has recently been planning a camping trip. The emotion engine also detects that the user is in a calm mood. Based on this, the server generates a topic about "the latest camping gear," and the device suggests to the user via voice, "Why don't you ask your friend about the latest camping items?", allowing the user to resume the conversation in a natural way.
[0740] The following describes the processing flow.
[0741] Step 1:
[0742] The device detects conversational audio around the user and starts recording when it detects trigger keywords or silence. Recording continues in real time, even while the conversation is ongoing.
[0743] Step 2:
[0744] The device sends the recorded audio data to a speech recognition engine, which converts the audio into text data. This text data is then used for later analysis and information extraction.
[0745] Step 3:
[0746] The server uses the context analysis means to analyze the text data received from the device, identifying the flow of the conversation and its main topics. This makes it possible to understand the direction of the conversation.
[0747] Step 4:
[0748] The server uses the information acquisition means to obtain the conversation partner's social media posts and other public information via an external network. This reveals the conversation partner's recent activities and areas of interest.
[0749] Step 5:
[0750] The device sends the user's voice and voice patterns to the emotion engine in real time to evaluate their emotional state. If necessary, a facial expression sensor is also used for more accurate emotion recognition.
[0751] Step 6:
[0752] The server combines the results of the context analysis means, the data from the information acquisition means, and the emotional state determination by the emotion engine to activate the topic generation means and generate new conversation topics suitable for the user.
[0753] Step 7:
[0754] The device presents the generated topics to the user audibly or visually as suggestions for new conversation topics. The presentation takes into account the user's emotional state and the flow of the conversation.
[0755] Step 8:
[0756] By accepting the device's suggestions and incorporating new topics into the conversation, users can maintain smooth communication with the other person, improving the continuity and quality of the conversation.
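As a non-limiting illustration of the trigger condition in Step 1, the following Python sketch decides when topic generation might begin. The `Utterance` type, the 4-second silence threshold, and the trigger keyword list are assumptions introduced only for this illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Utterance:
    text: str
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds


# Hypothetical values; the source does not define thresholds or keywords.
SILENCE_THRESHOLD_S = 4.0
TRIGGER_KEYWORDS = ("by the way", "anyway")


def should_trigger(utterances: Sequence[Utterance],
                   now_s: float,
                   silence_threshold_s: float = SILENCE_THRESHOLD_S,
                   keywords: Sequence[str] = TRIGGER_KEYWORDS) -> Optional[str]:
    """Return "keyword" or "silence" when a trigger fires, otherwise None."""
    if not utterances:
        return None  # nothing has been said yet, so there is no gap to measure
    last = utterances[-1]
    lowered = last.text.lower()
    if any(keyword in lowered for keyword in keywords):
        return "keyword"
    if now_s - last.end_s >= silence_threshold_s:
        return "silence"
    return None


history = [Utterance("Nice weather today", 0.0, 2.0)]
print(should_trigger(history, now_s=3.0))  # gap of 1 s: no trigger
print(should_trigger(history, now_s=7.0))  # gap of 5 s: silence trigger
```

In this sketch the keyword check runs before the silence check, so an explicit cue from the speaker takes priority over elapsed time. That ordering is a design choice of the example, not something stated in the disclosure.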
[0757] (Example 2)
[0758] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0759] Silences and a lack of topics during conversations can make it difficult for users to engage in dialogue. Furthermore, offering topics that don't align with the user's emotional state can lead to unnatural conversations. There is a need for a system that addresses these challenges and supports smooth and natural dialogue.
[0760] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0761] In this invention, the server includes a voice acquisition means, a voice recognition means, a context analysis means, an information acquisition means, an emotion analysis means, a topic generation means, and a presentation means. This reduces silence during the user's conversation and enables natural topic provision according to their emotional state.
[0762] "Voice acquisition means" refers to a means of collecting ambient sounds using a device worn by the user.
[0763] "Voice recognition means" refers to a means of converting the collected voice data into text data.
[0764] "Context analysis means" refers to a means of analyzing the text data to understand the flow and context of the conversation.
[0765] "Information acquisition means" refers to a means of collecting information about the conversation partner from external sources.
[0766] "Emotion analysis means" refers to a means of analyzing the user's voice and speech patterns to identify the user's emotional state.
[0767] "Topic generation means" refers to a means of generating new topics based on the context analysis results, the information acquisition results, and the emotion analysis results.
[0768] "Presentation means" refers to a means of suggesting the generated topics to the user in a natural way.
[0769] This invention is a system that utilizes an ear-worn device worn by the user to facilitate smooth conversation. In particular, it focuses on providing dialogue support while taking into account the user's emotional state. A detailed embodiment is shown below.
[0770] Hardware and software to be used
[0771] A terminal device is used. This includes a voice acquisition means for collecting and recording ambient sounds. This voice data is converted into text data in real time by a speech recognition means. The converted text data is sent to a server.
[0772] The server performs the more advanced analysis. Based on the received text data, it analyzes the flow of the conversation using the context analysis means. Furthermore, it uses the information acquisition means to collect publicly available information about the conversation partner via an external communication network.
[0773] Meanwhile, the emotion analysis means analyzes the user's voice patterns to identify the user's emotional state. Information about the emotional state is an important factor in determining the topic of conversation.
[0774] Based on these results, the server uses the topic generation means to have the generative AI model generate topics that match the user's emotions and the context of the conversation.
[0775] The generated topics are presented to the user through the device's presentation means. Because they are displayed, or spoken using speech synthesis technology, in a natural way, the user can engage in smooth conversations.
[0776] Specific example
[0777] For example, consider a situation where a user is talking with a friend in a park and a silence occurs. At this point, the device immediately starts recording the audio, converts it into text data, and sends it to the server. The server analyzes the context of the conversation and the user's emotional state, and selects a topic, for example, "the latest camping gear." Based on this, the device can suggest to the user, "Why don't you ask your friend about the latest camping items?"
[0778] Example of a prompt
[0779] An example of a prompt to input into the generative AI model is, "Generate an appropriate topic given that the user is relaxed and the conversation partner has recently planned a camping trip." Based on this prompt, topics that maintain a natural flow of conversation are generated.
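As a hedged illustration, a prompt of this kind could be assembled from the emotion estimate, the collected partner information, and the context keywords. In the Python sketch below, the function name `build_topic_prompt`, its parameters, and the prompt wording are assumptions for illustration only, and no particular generative AI service is called.

```python
from typing import Sequence


def build_topic_prompt(emotion: str,
                       partner_facts: Sequence[str],
                       context_keywords: Sequence[str]) -> str:
    """Assemble a plain-text prompt for a generative AI model."""
    lines = [
        "Suggest one conversation topic that keeps the conversation natural.",
        f"The user's estimated emotional state: {emotion}.",
    ]
    # Optional sections are only added when there is something to report.
    if context_keywords:
        lines.append("Recent conversation keywords: " + ", ".join(context_keywords) + ".")
    if partner_facts:
        lines.append("Publicly available information about the conversation partner:")
        lines.extend(f"- {fact}" for fact in partner_facts)
    return "\n".join(lines)


prompt = build_topic_prompt(
    emotion="relaxed",
    partner_facts=["recently planned a camping trip"],
    context_keywords=["park", "weekend"],
)
print(prompt)
```

Keeping prompt assembly in a single function makes it straightforward to vary the wording per emotional state without touching the rest of the processing.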
[0780] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0781] Step 1:
[0782] The terminal collects ambient sound in real time through the worn device. With the audio data as input, the terminal records the data using the voice acquisition means and converts it into text data using the voice recognition means. The converted text data is the output. Specifically, the sound waveform is captured and converted into text information.
[0783] Step 2:
[0784] The server receives text data sent from the terminal. Using contextual analysis techniques, it analyzes the flow of the conversation and important topics based on this text data as input. The output provides analysis results regarding the themes of the conversation and the user's intentions. Specifically, it utilizes natural language processing techniques to extract keywords and understand the context.
[0785] Step 3:
[0786] Based on the information obtained through context analysis, the server uses the information acquisition means to collect publicly available information about the conversation partner from external communication networks. The context analysis results are the input, and the collected information, such as SNS posts and public profiles, is the output. Specifically, digital content is collected from the internet via an API.
[0787] Step 4:
[0788] The device uses emotion analysis techniques to analyze the user's voice and speech patterns, identifying their emotional state in real time. The user's voice data is input, and the analyzed emotional state is output. Specifically, it evaluates the tone and tempo of the voice to identify the emotional state (e.g., joy, calmness).
[0789] Step 5:
[0790] Based on all the information acquired so far, the server uses the topic generation means to input a prompt into the generative AI model. The output is a set of new topics suited to the user's emotions and the context of the conversation. Specifically, the generative AI model constructs topics that reflect the appropriate context and prepares natural suggestions.
[0791] Step 6:
[0792] The terminal receives new topics sent from the server and presents them to the user. Using the presentation means, it presents topics via speech synthesis as needed. The suggested topics become the output, and the user can resume the conversation based on them. Specifically, speech synthesis technology is used to produce speech output that closely resembles human speech.
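The data flow of Steps 2 to 6 above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword extraction is a naive frequency count standing in for the natural language processing mentioned in Step 2, and the information acquisition, emotion analysis, and generative AI model are passed in as hypothetical callables rather than real services. All names (`TopicInputs`, `extract_keywords`, `suggest_topic`) are introduced here for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

# Tiny illustrative stopword list; a real system would use an NLP library.
STOPWORDS = {"a", "an", "the", "and", "was", "were", "we", "i", "it", "to", "of", "in"}


@dataclass
class TopicInputs:
    keywords: List[str]      # Step 2 output (context analysis)
    partner_info: List[str]  # Step 3 output (information acquisition)
    emotion: str             # Step 4 output (emotion analysis)


def extract_keywords(transcript: str, limit: int = 3) -> List[str]:
    """Naive stand-in for context analysis: most frequent non-stopwords."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def suggest_topic(transcript: str,
                  fetch_partner_info: Callable[[List[str]], List[str]],
                  estimate_emotion: Callable[[], str],
                  generate: Callable[[TopicInputs], str]) -> str:
    """Combine Steps 2 to 5 and return the topic that Step 6 would present."""
    keywords = extract_keywords(transcript)
    inputs = TopicInputs(
        keywords=keywords,
        partner_info=fetch_partner_info(keywords),
        emotion=estimate_emotion(),
    )
    return generate(inputs)


# Stub callables stand in for the external network, emotion analysis and model.
topic = suggest_topic(
    "We went hiking last month. The hiking trail was quiet.",
    fetch_partner_info=lambda keywords: ["recently planned a camping trip"],
    estimate_emotion=lambda: "calm",
    generate=lambda x: f"Ask about {x.partner_info[0]} ({x.emotion} mood)",
)
print(topic)
```

Passing the three external dependencies as callables keeps the sketch runnable offline and mirrors the separation between the terminal, the server, and the generative AI model described in the steps.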
[0793] (Application Example 2)
[0794] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0795] In conversations, silences or a lack of topics often lead to a breakdown in communication. Furthermore, finding appropriate topics while considering the emotional state of the person being spoken to is difficult in such situations. Especially in situations with limited interaction, the person being spoken to is more likely to feel anxious or tense, leading to communication breakdowns. To solve this, a comprehensive system is needed that accurately analyzes the user's emotional state and suggests topics appropriate to the situation.
[0796] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0797] In this invention, the server includes a sound acquisition means worn by the user to record ambient sounds, a speech recognition means to convert the sounds acquired by the sound acquisition means into text information, and a context analysis means to analyze the text information to understand the conversation situation. This makes it possible to maintain smooth conversation by accurately selecting and presenting topics to the user based on the user's emotional state.
[0798] "Sound acquisition means" refers to a device worn by the user to record ambient sounds.
[0799] "Speech recognition means" refers to a technology that converts speech acquired by sound acquisition means into text information.
[0800] A "context analysis means" is a mechanism that analyzes the text information to understand the flow and situation of the conversation.
[0801] An "information gathering means" is a means of obtaining information about the conversation partner from external sources.
[0802] "Topic creation means" refers to a technique that generates new topics of conversation based on information collected by context analysis means and information gathering means.
[0803] A "presentation device" is a device for suggesting conversation topics generated by topic creation means to the user.
[0804] An "emotion analysis engine" is an engine that determines the user's emotional state from voice and facial expression data.
[0805] This invention is a system designed for use with consumer robots and is intended to facilitate smooth interaction with users. It begins by recording ambient sounds using an ear-worn device worn by a human and converting them into text data.
[0806] The server uses a microphone attached to the device as the sound acquisition means. For speech recognition, it uses a speech recognition service such as Google Cloud Speech-to-Text to convert the recorded audio into text information in real time. The context analysis means analyzes the obtained text information to understand the context of the conversation.
[0807] Furthermore, the server uses the information gathering means to collect publicly available information about the conversation partner via an external communication network. This information is extracted from data such as social media posts and publicly available project information, which enriches the context of the conversation.
[0808] The emotion analysis engine analyzes the user's voice and facial expressions in real time to determine the user's emotional state. This allows the system to suggest the topics most appropriate to the user's emotions at that moment.
[0809] The presentation device suggests the topics generated by the topic creation means to the user in a natural way through speech synthesis and display. For example, if a silence occurs during a conversation with family, the system determines that the user is relaxed and generates an appropriate topic. It then restarts the conversation in a natural way by suggesting through the speaker, for instance, "How about talking about recent camping gear?"
[0810] One of the main processing technologies used by the server is a generative AI model, which generates and suggests information using a prompt such as, "Please suggest some light, family-oriented topics that will interest the user when they are relaxing."
[0811] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0812] Step 1:
[0813] An acoustic acquisition device attached to the terminal collects ambient sound and transmits it to a server as digital audio data. The input is ambient sound, and the output is digital audio data. Specifically, it continuously records sound using a microphone and stores it in a buffer.
[0814] Step 2:
[0815] The server uses speech recognition to convert digital audio data into text. This process utilizes speech recognition services such as Google Cloud Speech-to-Text. The input is digital audio data, and the output is text. Specifically, the audio data is sent to the cloud service, and the returned text is received.
[0816] Step 3:
[0817] The server uses contextual analysis tools to analyze textual information and understand the flow and situation of the conversation. The input is textual information, and the output is the analyzed conversational context. Specifically, it uses natural language processing techniques to perform keyword extraction and relevance analysis.
[0818] Step 4:
[0819] The server uses information gathering tools to obtain publicly available information about the conversation partner from an external communication network. The input is the conversation partner's identifier, and the output is the obtained publicly available information. Specifically, it uses the SNS API to retrieve the other party's latest SNS posts, project information, etc.
[0820] Step 5:
[0821] The server uses an emotion analysis engine to determine the user's emotional state from their voice and facial expressions. The input is voice data and facial expression data, and the output is the analyzed emotional state. Specifically, it performs voice tone analysis and facial expression recognition from camera footage.
[0822] Step 6:
[0823] The server uses a topic generation mechanism to generate new conversation topics based on the information obtained in the previous steps, utilizing a generative AI model. This process uses prompts. The input consists of conversation context information, public information, and emotional state, and the output is the generated conversation topic. A specific prompt for the generative AI model is, "Suggest a light, family-oriented topic that will interest the user when they are relaxed."
[0824] Step 7:
[0825] The server proposes generated conversation topics to the user via a presentation device. The presentation device uses speech synthesis and a display. The input is the generated conversation topic, and the output is the conversation topic proposal to the user. Specifically, it outputs synthesized speech from a speaker and displays supplementary information on a display.
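Step 1 above describes continuously recording sound and storing it in a buffer. One common way to bound memory during continuous recording is a fixed-length ring buffer. The Python sketch below assumes that design, which the disclosure does not specify; the class name `AudioRingBuffer`, the sample rate, and the retention window are hypothetical.

```python
from collections import deque
from typing import Iterable, List


class AudioRingBuffer:
    """Keeps only the most recent `seconds` of audio samples."""

    def __init__(self, sample_rate_hz: int, seconds: float) -> None:
        self.sample_rate_hz = sample_rate_hz
        # A deque with maxlen silently discards the oldest samples when full.
        self._samples = deque(maxlen=int(sample_rate_hz * seconds))

    def write(self, chunk: Iterable[int]) -> None:
        """Append newly captured samples from the microphone."""
        self._samples.extend(chunk)

    def snapshot(self) -> List[int]:
        """Return buffered samples, oldest first, e.g. to send to the server."""
        return list(self._samples)

    def duration_s(self) -> float:
        """Length of the buffered audio in seconds."""
        return len(self._samples) / self.sample_rate_hz


buf = AudioRingBuffer(sample_rate_hz=4, seconds=2)  # tiny values for the demo
buf.write(range(10))                                # 10 samples into 8 slots
print(buf.snapshot())    # [2, 3, 4, 5, 6, 7, 8, 9]
print(buf.duration_s())  # 2.0
```

With a real microphone the sample rate would be far higher (for example 16000 Hz), but the discard-oldest behavior is the same.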
[0826] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0827] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0828] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0829] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0830] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0831] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0832] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible (expressed in actions) it becomes.
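The layout described for the emotion map 400 can be illustrated with a small Python sketch that labels a point by its position: left or right for reaction or situation, upper or lower for pleasure or displeasure, and inner or outer for inner thoughts or actions. The coordinate convention (center at the origin, x to the right, y upward) and the `inner_radius` boundary are assumptions made for this illustration and are not taken from Figure 9.

```python
import math
from typing import Dict


def describe_position(x: float, y: float, inner_radius: float = 0.5) -> Dict[str, str]:
    """Label a point on a hypothetical emotion map centered at (0, 0)."""
    if x > 0:
        origin = "situation"   # right side: induced by situational judgment
    elif x < 0:
        origin = "reaction"    # left side: reactions occurring in the brain
    else:
        origin = "both"        # vertical axis: both sources

    if y > 0:
        valence = "pleasure"
    elif y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"

    # Inner region: inner thoughts; outer region: emotions expressed as actions.
    layer = "inner" if math.hypot(x, y) < inner_radius else "outer"
    return {"origin": origin, "valence": valence, "layer": layer}


print(describe_position(1.0, 0.0))   # the 3 o'clock position
print(describe_position(-0.1, 0.2))  # near the center, upper left
```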
[0833] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
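The idea that deviation from an ideal balance yields discomfort and proximity to it yields pleasure can be expressed numerically in many ways. The following Python sketch shows one hypothetical scoring, not a method stated in the disclosure: each balance is compared with its ideal, the deviation is normalized by an assumed tolerance, and the mean is mapped to a score between -1.0 (discomfort) and 1.0 (pleasure). The balance names, ideals, and tolerances are illustrative assumptions.

```python
from typing import Dict


def balance_valence(levels: Dict[str, float],
                    ideals: Dict[str, float],
                    tolerances: Dict[str, float]) -> float:
    """Return a score in [-1.0, 1.0]; higher means closer to the ideal."""
    if not ideals:
        raise ValueError("at least one balance is required")
    deviations = []
    for name, ideal in ideals.items():
        # Normalized deviation, clamped to 1.0 at or beyond the tolerance.
        deviation = abs(levels[name] - ideal) / tolerances[name]
        deviations.append(min(deviation, 1.0))
    mean_deviation = sum(deviations) / len(deviations)
    return 1.0 - 2.0 * mean_deviation


# Hypothetical robot balances: battery level (1.0 = full) and posture tilt.
ideals = {"battery": 1.0, "posture": 0.0}
tolerances = {"battery": 0.5, "posture": 1.0}

print(balance_valence({"battery": 1.0, "posture": 0.0}, ideals, tolerances))   # 1.0
print(balance_valence({"battery": 0.75, "posture": 0.5}, ideals, tolerances))  # 0.0
print(balance_valence({"battery": 0.5, "posture": 1.0}, ideals, tolerances))   # -1.0
```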
[0834] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0835] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
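The disclosure does not state how the neural network is trained so that nearby emotions receive similar values. One possible approach, shown below purely as an assumption, is to build soft training targets that decay with distance on the emotion map, so that the labeled emotion gets the value 1.0 and its neighbors get values close to it. The coordinates in `POSITIONS`, the Gaussian decay, and `sigma` are hypothetical and are not taken from Figure 9 or Figure 10.

```python
import math
from typing import Dict, Tuple

# Hypothetical 2-D positions on an emotion map; not taken from the figures.
POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (0.50, 0.10),
    "calm": (0.55, 0.00),
    "confident": (0.60, 0.15),
    "anxious": (0.50, -0.60),
}


def soft_targets(label: str, sigma: float = 0.2) -> Dict[str, float]:
    """Target emotion values that decay with map distance from `label`."""
    label_x, label_y = POSITIONS[label]
    targets = {}
    for name, (x, y) in POSITIONS.items():
        squared_distance = (x - label_x) ** 2 + (y - label_y) ** 2
        # Gaussian decay: 1.0 at the labeled emotion, smaller farther away.
        targets[name] = math.exp(-squared_distance / (2 * sigma ** 2))
    return targets


for name, value in soft_targets("calm").items():
    print(f"{name}: {value:.2f}")
```

With these illustrative positions, "reassured" and "confident" receive targets close to that of "calm", while "anxious" receives a much smaller one, which matches the behavior described for emotions located close together.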
[0836] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0837] In the above embodiment, an example was given in which the specific processing is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific processing may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.
[0838] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.
[0839] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0840] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0841] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing the specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which are processors with circuit configurations designed specifically to perform the specific processing. Every one of these processors has built-in or connected memory and performs the specific processing by using that memory.
[0842] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0843] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0844] More specifically, electrical circuits that combine circuit elements such as semiconductor devices can be used as the hardware structure of these various processors. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps may be deleted, new steps added, or the processing order rearranged, as long as this does not deviate from the main purpose.
[0845] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0846] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0847] The following is further disclosed regarding the embodiments described above.
[0848] (Claim 1)
[0849] A voice acquisition device worn by the user to record ambient sounds,
[0850] A speech recognition means that converts the speech acquired by the aforementioned speech acquisition means into text data,
[0851] A context analysis means for analyzing the aforementioned text data to understand the conversation context,
[0852] Information acquisition means for obtaining information about the conversation partner from external sources,
[0853] A topic generation means that generates a new topic based on the information obtained by the context analysis means and the information acquisition means,
[0854] A presentation means that proposes topics generated by the topic generation means to the user,
[0855] A system comprising the above means.
[0856] (Claim 2)
[0857] The system according to claim 1, wherein the voice recognition means starts recording upon a predetermined trigger.
[0858] (Claim 3)
[0859] The system according to claim 1, wherein the information acquisition means acquires publicly available information of the conversation partner via an external communication network.
[0860] "Example 1"
[0861] (Claim 1)
[0862] A voice acquisition means that is worn by the user to acquire ambient sounds,
[0863] A speech recognition means that converts the sound obtained by the aforementioned speech acquisition means into digital data,
[0864] A contextual analysis means for analyzing the aforementioned digital data and identifying the flow of the conversation and main topics,
[0865] Information acquisition means for collecting publicly available information of the conversation partner from external sources via communication means,
[0866] A topic generation means that creates a new topic based on the data obtained by the context analysis means and the information acquisition means,
[0867] A display means for presenting a topic constructed by the topic generation means to the user,
[0868] A system comprising the above means.
[0869] (Claim 2)
[0870] The system according to claim 1, wherein the voice recognition means starts data collection in response to a specific action.
[0871] (Claim 3)
[0872] The system according to claim 1, wherein the information acquisition means acquires public profile data of the interlocutor using a variety of communication networks.
[0873] "Application Example 1"
[0874] (Claim 1)
[0875] A sound information acquisition means that is worn by the user and acquires voice information,
[0876] A sound information interpretation means that converts acoustic data acquired by the sound information acquisition means into language data,
[0877] A background analysis means for analyzing the aforementioned language data to understand the background of the dialogue,
[0878] An external information acquisition means that obtains information about the interlocutor from an external information source,
[0879] A topic creation means that generates new topics based on the data obtained by the background analysis means and the external information acquisition means,
[0880] A topic presentation means that presents topics generated by the topic creation means to the user,
[0881] A system that includes topic detection and creation functions incorporated into devices designed to improve communication within the home.
[0882] (Claim 2)
[0883] The system according to claim 1, wherein the sound information interpretation means starts recording sound by a specific operation.
[0884] (Claim 3)
[0885] The system according to claim 1, wherein the external information acquisition means acquires publicly known information of the interlocutor via an external communication structure.
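The behavior recited in claim 2 above, in which sound recording starts only upon a specific operation, can be illustrated with a small Python sketch. The class `TriggeredRecorder` and its methods are hypothetical names chosen for this illustration, and the nature of the trigger (for example a button press) is an assumption, since the claim does not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class TriggeredRecorder:
    """Buffers audio frames only after a trigger has been received."""
    recording: bool = False
    frames: list[bytes] = field(default_factory=list)

    def on_trigger(self) -> None:
        # Hypothetical trigger handler, e.g. called on a button press.
        self.recording = True

    def feed(self, frame: bytes) -> bool:
        """Offer one audio frame; return True if it was recorded."""
        if not self.recording:
            return False  # frames arriving before the trigger are dropped
        self.frames.append(frame)
        return True

    def stop(self) -> list[bytes]:
        """Stop recording and hand back the captured frames."""
        self.recording = False
        captured, self.frames = self.frames, []
        return captured
```

In this sketch, frames offered before the trigger are discarded rather than buffered, so nothing is retained until the user performs the operation.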
[0886] "Example 2 of combining an emotion engine"
[0887] (Claim 1)
[0888] A voice acquisition device worn by the user to collect ambient sounds,
[0889] A speech recognition means that converts the speech collected by the aforementioned voice acquisition device into text data,
[0890] A contextual analysis means for analyzing the aforementioned text data to understand the flow of the conversation,
[0891] Information acquisition means for collecting information about the conversation partner from external sources,
[0892] An emotion analysis means for analyzing the user's voice and speech patterns to identify the user's emotional state,
[0893] A topic generation means that generates a new topic based on the information obtained by the context analysis means, the information acquisition means, and the emotion analysis means,
[0894] A presentation means that proposes topics generated by the topic generation means to the user,
[0895] An information processing system comprising the above.
[0896] (Claim 2)
[0897] The information processing system according to claim 1, wherein the speech recognition means starts recording upon a predetermined trigger, and the topic is optimized based on the analysis of the user's emotion by the emotion analysis means.
[0898] (Claim 3)
[0899] The information processing system according to claim 1, wherein the information acquisition means acquires publicly available digital information of the conversation partner via an external communication network.
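One way to picture how a topic could be optimized based on the user's emotional state, as recited in claim 2 above, is to rank candidate topics by how well their tone suits that state. The following Python sketch is an assumption-laden illustration: the `Emotion` values, the tone tags on `Topic`, and the `PREFERRED_TONES` table are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from enum import Enum


class Emotion(Enum):
    CALM = "calm"
    TENSE = "tense"
    CHEERFUL = "cheerful"


@dataclass(frozen=True)
class Topic:
    text: str
    tone: str  # hypothetical tag: "light", "neutral", or "deep"


# Assumed preference table: which tones to try first for each emotional state.
PREFERRED_TONES = {
    Emotion.TENSE: ["light", "neutral", "deep"],
    Emotion.CALM: ["neutral", "deep", "light"],
    Emotion.CHEERFUL: ["light", "deep", "neutral"],
}


def rank_topics(topics: list[Topic], emotion: Emotion) -> list[Topic]:
    """Order candidate topics so the best fit for the emotion comes first."""
    order = PREFERRED_TONES[emotion]

    def fit(topic: Topic) -> int:
        # Topics with an unknown tone sort last.
        return order.index(topic.tone) if topic.tone in order else len(order)

    # sorted() is stable, so equally ranked topics keep their input order.
    return sorted(topics, key=fit)
```

A lookup table is used here only for readability; the emotion analysis means in the disclosure may weigh emotional state in a different way.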
[0900] "Application example 2 when combining with an emotional engine"
[0901] (Claim 1)
[0902] An acoustic acquisition device worn by the user to record ambient sounds,
[0903] A speech recognition means that converts sound acquired by the aforementioned acoustic acquisition device into text information,
[0904] A contextual analysis means for analyzing the aforementioned textual information to understand the conversation situation,
[0905] Information gathering means for obtaining information about the conversation partner from external sources,
[0906] A topic creation means that generates a new topic of conversation based on the information obtained by the context analysis means and the information gathering means,
[0907] A presentation device that proposes conversation topics generated by the topic creation means to the user,
[0908] An emotion analysis engine having means for determining the user's state from voice and facial expressions,
[0909] Wherein the aforementioned presentation device includes means for presenting appropriate conversation topics selected based on the result of the emotion analysis engine,
[0910] A system comprising the above.
[0911] (Claim 2)
[0912] The system according to claim 1, wherein the speech recognition means starts recording upon a predetermined trigger and selects a topic based on the emotional state.
[0913] (Claim 3)
[0914] The system according to claim 1, wherein the information gathering means obtains publicly available information of the conversation partner via an external communication network and generates more appropriate conversation topics based on the results of emotion analysis.
Explanation of Symbols
[0915]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A voice acquisition device worn by the user to record ambient sounds,
A speech recognition means that converts the speech acquired by the aforementioned voice acquisition device into text data,
A context analysis means for analyzing the aforementioned text data to understand the conversation context,
Information acquisition means for obtaining information about the conversation partner from external sources,
A topic generation means that generates a new topic based on the information obtained by the context analysis means and the information acquisition means,
A presentation means that proposes topics generated by the topic generation means to the user,
A system comprising the above.
2. The system according to claim 1, wherein the speech recognition means starts recording upon a predetermined trigger.
3. The system according to claim 1, wherein the information acquisition means acquires publicly available information of the conversation partner via an external communication network.