system
A system that converts user voice to text, analyzes emotions, and generates natural responses addresses the challenge of loneliness among the elderly by providing continuous and emotionally supportive conversations.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-19
AI Technical Summary
Elderly individuals often experience loneliness due to limited communication opportunities, and existing technologies struggle to understand their needs and provide appropriate, human-like conversations.
The system captures the user's voice in real time, converts it to text, analyzes intentions and emotions, generates natural responses using generative AI, and returns them as voice. It also retrieves information on request and improves its responses through user feedback.
The system provides continuous and natural conversations, reducing feelings of loneliness by offering personalized and emotionally supportive interactions.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In modern society, many elderly people suffer from loneliness and lack sufficient opportunities for communication. Elderly people living alone in particular have few conversations and interactions in their daily lives, which raises the risk of mental and physical health problems. Alleviating this loneliness requires a solution that can provide continuous and natural conversation. With existing technologies, however, it is difficult to understand the varied needs of the elderly and to provide appropriate, human-like conversation.
Means for Solving the Problems
[0005] This invention provides a terminal device that captures a user's voice in real time and transmits the voice data to a server. The server converts the voice data into text and analyzes the user's intentions and emotions through natural language processing. Based on the analysis results, the server uses generative AI to generate natural conversational responses. It then converts the generated responses into voice data and transmits them to the terminal, thereby providing voice responses to elderly users. The server also supports information access by retrieving information from the internet in response to user requests, summarizing it appropriately, and providing it. The terminal has a function to collect user feedback, which the server uses to improve the accuracy of its responses. This configuration realizes an environment in which elderly people can enjoy conversations with peace of mind.
[0006] A "user" is an individual, primarily an elderly person, who uses this system to enjoy communication.
[0007] "Audio data" refers to audio input from a user, represented as a digital signal, and is data in the stage before it is transmitted to the server.
[0008] "Terminal means" refers to a device that includes hardware and software for capturing the user's voice in real time and transmitting it to a server.
[0009] A "server system" is a central processing unit for processing and analyzing received audio data, and it has the functionality to perform text conversion and natural language processing.
[0010] "Natural language processing" is a technology for analyzing user intent and emotions from speech and text data, and includes the process by which computers understand human language.
[0011] "Generative AI means" refers to a system that utilizes artificial intelligence technology to generate appropriate conversational responses based on analysis results obtained from the user.
[0012] "Speech synthesis means" refers to a technology that converts generated text responses into speech data and outputs it in a natural tone.
[0013] "Output means" refers to a speaker or similar device installed in the terminal to provide the generated audio data to the user.
[0014] "Retrieving information from the internet" refers to the process of obtaining data from external sources based on a user's request and providing it to the user.
[0015] "Feedback" refers to the reactions and opinions collected from users, which are useful for improving the system's response accuracy and functionality.
Brief Description of the Drawings
[0016]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Embodiment 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Embodiment 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
Mode for Carrying Out the Invention
[0017] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0018] First, the language used in the following description will be explained.
[0019] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0020] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0021] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs and various parameters. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0022] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0023] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."
[0024] [First Embodiment]
[0025] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0026] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0027] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0028] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0029] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0030] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0031] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0032] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0033] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0034] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0035] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0036] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0037] This invention provides a specific embodiment for realizing natural dialogue with users in a communication system for the elderly. This system is mainly composed of terminals, a server, and a network connecting them.
[0038] First, when a user begins interacting with this system, they speak into the terminal, which then acquires their voice. The terminal is equipped with a highly sensitive microphone designed to accurately capture the user's speech while filtering out external noise. The acquired voice data is then transmitted to the server in real time using compression technology.
[0039] Next, the server converts the received audio data into text format using speech recognition technology. At this stage, the server uses natural language processing to analyze the context behind the audio, the user's emotions, and even the intent of the conversation. For example, if the user asks, "What's the weather like today?", the server understands that the user is looking for the latest weather information.
[0040] Once the analysis is complete, the server uses generative AI to generate an appropriate response. This AI is trained on a massive dataset and is designed to produce natural, human-like responses. For example, it might generate a response such as, "It's sunny today. Do you have any plans to go out?"
[0041] The generated text responses are converted into natural-sounding voice data by a speech synthesis engine on the server. This voice data is sent to the terminal and delivered to the user through the speaker. Because the entire conversation is conducted by voice, with no text-based interaction required, the user is given a more immersive and personal dialogue experience.
[0042] Furthermore, as an information provision function, the server retrieves information from external sources in response to user requests and presents it in a format optimized for the user's needs. This information covers a wide range of topics, such as news, health information, and activities related to hobbies.
[0043] Finally, the terminal stores the history of user interactions and feedback, and sends it to the server. The server uses this feedback for machine learning to improve the system's future response performance.
[0044] The goal of this entire system is to improve the quality of life for users by creating an environment where elderly people can enjoy conversations on a daily basis and by reducing feelings of loneliness.
[0045] The following describes the processing flow.
[0046] Step 1:
[0047] The user initiates communication by speaking into the terminal. The terminal's built-in microphone captures the user's voice in real time.
[0048] Step 2:
[0049] The terminal converts the captured audio into a digital signal, compresses it, and prepares it for transmission to the server. To facilitate speech recognition processing, it performs noise reduction and processes the data to ensure the audio remains clear.
[0050] Step 3:
[0051] The server processes the received audio data using a speech recognition engine and converts it into text data. The server then passes this text data to a natural language processing module for analysis, which extracts the user's intent and emotions.
[0052] Step 4:
[0053] The server uses generative AI to generate appropriate responses based on the analyzed intent and emotional information. This process also takes conversation history and context into consideration, resulting in more natural and human-like responses.
[0054] Step 5:
[0055] The server passes the generated text response to a speech synthesis module, which converts it into speech data. The converted speech is synthesized to have a natural intonation and a human-like sound.
[0056] Step 6:
[0057] The server sends the generated audio data to the terminal. The audio data is delivered to the user quickly while maintaining high quality.
[0058] Step 7:
[0059] The terminal plays the received audio data through its speaker, providing a response to the user. Through this output, the terminal establishes its presence as a companion to the user.
[0060] Step 8:
[0061] The terminal tracks the user's responses and additional interactions, and sends this data to the server as feedback. This feedback is used to improve the system's response accuracy.
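The eight steps above can be sketched as a single round trip between a terminal and a server. The following Python sketch is illustrative only and is not the implementation described in this disclosure: the class names `Server`, `Terminal`, and `Analysis` are hypothetical, the "audio" is plain UTF-8 text standing in for real voice data, and each server stage is a placeholder for the corresponding engine.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    intent: str
    emotion: str


@dataclass
class Server:
    # Conversation history, kept so later responses can take context into account (Step 4).
    history: list = field(default_factory=list)
    feedback_log: list = field(default_factory=list)

    def speech_to_text(self, audio: bytes) -> str:
        # Placeholder for a speech recognition engine (Step 3).
        return audio.decode("utf-8")

    def analyze(self, text: str) -> Analysis:
        # Placeholder for the natural language processing module (Step 3).
        intent = "weather_query" if "weather" in text.lower() else "small_talk"
        return Analysis(intent=intent, emotion="neutral")

    def generate(self, text: str, analysis: Analysis) -> str:
        # Placeholder for the generative AI (Step 4).
        self.history.append(text)
        if analysis.intent == "weather_query":
            return "It's sunny today. Do you have any plans to go out?"
        return "I see. Please tell me more."

    def synthesize(self, reply: str) -> bytes:
        # Placeholder for the speech synthesis module (Step 5).
        return reply.encode("utf-8")

    def handle(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)
        analysis = self.analyze(text)
        reply = self.generate(text, analysis)
        return self.synthesize(reply)


class Terminal:
    def __init__(self, server: Server):
        self.server = server

    def converse(self, spoken: str) -> str:
        audio = spoken.encode("utf-8")            # Steps 1-2: capture and encode
        reply_audio = self.server.handle(audio)   # Steps 3-6: server round trip
        return reply_audio.decode("utf-8")        # Step 7: playback

    def send_feedback(self, rating: int) -> None:
        # Step 8: report the user's reaction to the server.
        self.server.feedback_log.append(rating)


server = Server()
terminal = Terminal(server)
reply = terminal.converse("What's the weather like today?")
terminal.send_feedback(5)
```

Keeping each stage behind its own method means a real speech recognizer, language model, or synthesizer could replace a placeholder without changing the terminal-side code.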
[0062] (Example 1)
[0063] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0064] For the elderly and users who communicate via voice, it is crucial to create communication systems that provide natural and human-like responses. However, conventional technologies often fall short in terms of convenience and response accuracy, with particular challenges in noise reduction of voice input, emotion analysis, and appropriate response generation. Furthermore, there is currently a lack of effective methods for improving system performance by utilizing user feedback.
[0065] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0066] In this invention, the server includes an analysis means that converts voice data into text data and performs natural language processing to analyze the user's intentions and emotions; a generation model means that generates human-like responses; and a synthesis means that converts the generated responses into acoustic data and outputs it. This enables smooth interaction with the user, provides highly accurate responses, and allows for continuous improvement of system performance through the use of feedback.
[0067] "Acquisition means for acquiring sound, compressing the sound data, and transmitting it to an information processing device" refers to a device or system that has the function of capturing the sound emitted by a user, compressing it for effective data transfer, and then transmitting it to an information processing device such as a server via a network.
[0068] "An analytical means that converts audio data into text data and performs natural language processing to analyze the user's intent and emotions" refers to a technology or system that converts audio data into text format and uses language processing techniques to understand the context, emotions, and desired content from the text.
[0069] A "generative model means for generating human-like responses" is a model or program that uses AI technology to generate natural conversations in response to user input, creating responses that sound as if a human were responding.
[0070] "A synthesis means that converts the generated response into acoustic data and outputs it" refers to a device or system that uses speech synthesis technology to convert text generated in natural language into speech and output it in a format that can be heard by the user.
[0071] "Recording means for recording the history of conversations with users and feedback from users, and transmitting them for use by the analysis means" refers to a device or software that has the function of saving the content of past conversations with users and feedback information such as suggestions for improvement obtained from users, and transmitting it to a server for use in future analysis and system performance improvement.
[0072] This invention is a communication system that enables natural, voice-based dialogue for users, including the elderly. The system primarily consists of terminals, a server, and a network.
[0073] When a user initiates a voice interaction, the terminal device first receives the audio. This terminal is equipped with a high-sensitivity microphone to accurately capture the user's speech. Noise cancellation technology is employed to effectively remove external noise. The acquired audio data is compressed in real time and transferred to a server. AAC and Opus are commonly used for this audio compression.
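As a rough illustration of real-time compression and transfer, the sketch below converts samples to 16-bit PCM, cuts them into short frames, and compresses each frame independently so that it can be sent as soon as it is ready. It does not use AAC or Opus: `zlib` from the Python standard library is swapped in purely as a stand-in so the example runs without a codec library, and the sampling rate, frame length, and function names are assumptions.

```python
import math
import struct
import zlib

SAMPLE_RATE = 16_000   # assumed sampling rate in Hz
FRAME_MS = 20          # assumed frame duration in milliseconds


def to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def frames(pcm: bytes):
    """Yield consecutive frames of FRAME_MS milliseconds each."""
    frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 2 bytes per sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def encode_stream(samples):
    """Compress each frame on its own (zlib stands in for an audio codec)."""
    return [zlib.compress(frame) for frame in frames(to_pcm16(samples))]


def decode_stream(packets):
    """Server side: decompress the packets and rebuild the PCM stream."""
    return b"".join(zlib.decompress(p) for p in packets)


# 100 ms of a 440 Hz tone as example input.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
packets = encode_stream(tone)
restored = decode_stream(packets)
```

Unlike this lossless stand-in, AAC and Opus are lossy audio codecs, so a real decoder would return an approximation of the input rather than identical bytes.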
[0074] The server decodes the received compressed audio data. The decoded audio data is converted into text data using speech recognition software. Speech recognition APIs, which are already common technologies on the market, are used for speech recognition. Subsequently, natural language processing (NLP) libraries are used to analyze the user's intent and emotions. Natural Language Toolkit (NLTK) or spaCy are often suitable for this purpose.
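The intent and emotion analysis can be pictured with a deliberately simple keyword-overlap classifier. This is a sketch under stated assumptions, not the analysis the disclosure performs: it uses only the Python standard library rather than NLTK or spaCy, and the keyword tables, label names, and the `analyze` function are hypothetical.

```python
import re

# Hypothetical keyword tables; a real system would learn these from data.
INTENT_KEYWORDS = {
    "weather_query": {"weather", "rain", "sunny", "forecast"},
    "schedule_query": {"schedule", "appointment", "plan", "plans"},
}
EMOTION_KEYWORDS = {
    "anxious": {"worried", "anxious", "afraid", "scared"},
    "lonely": {"lonely", "alone"},
    "happy": {"glad", "happy", "fun", "wonderful"},
}


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def best_label(tokens: set, table: dict, default: str) -> str:
    """Return the label whose keywords overlap the tokens most, or the default."""
    scores = {label: len(tokens & words) for label, words in table.items()}
    label = max(scores, key=scores.get)
    return label if scores[label] > 0 else default


def analyze(text: str) -> dict:
    tokens = tokenize(text)
    return {
        "intent": best_label(tokens, INTENT_KEYWORDS, "small_talk"),
        "emotion": best_label(tokens, EMOTION_KEYWORDS, "neutral"),
    }


result = analyze("What's the weather like today?")
```

The returned dictionary plays the role of the analysis result that is handed to response generation; a production system would replace the keyword tables with trained models.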
[0075] Based on the analysis results, the server generates a response using a generative AI. This generative AI is a model trained on a large dataset. Examples of such models include widely known natural language generation technologies, such as the GPT series. The generated response is prepared in an optimized text format.
[0076] Text-based responses are converted to speech using a speech synthesis engine, which adjusts the output so that it sounds like natural speech with natural intonation and emotion. The synthesized speech data is then compressed again and sent to the terminal, where it is delivered to the user through the speaker.
[0077] In addition, in response to specific user requests, the server retrieves the latest information from a wide-area communication network and provides summarized information through NLP and AI services. This includes a variety of content, such as news summaries and weather information.
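To give a feel for the summarization step, the sketch below implements a small frequency-based extractive summarizer. It is a stand-in chosen for illustration, not the NLP and AI services referred to above: retrieval from the network is omitted and the article text is hard-coded, and the stopword list and function names are assumptions.

```python
import re
from collections import Counter

# Small assumed stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "will", "be", "on", "for", "with"}


def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])   # restore original order
    return " ".join(sentences[i] for i in keep)


article = (
    "Rain is expected on Monday. Sunny weather returns on Tuesday. "
    "Sunny weather continues through the weekend."
)
summary = summarize(article)
```

An extractive method only selects existing sentences; a generative model could instead rephrase the retrieved content in a conversational tone.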
[0078] Furthermore, the terminal records user feedback and sends it to the server. This feedback is analyzed by machine learning algorithms, contributing to improvements in the accuracy and quality of responses. Through this feedback process, the system continuously evolves and can respond quickly and flexibly to user needs.
[0079] As a concrete example, when a user asks, "What's the weather like this week?", the system retrieves weather data and provides a specific response such as, "There will be many sunny days this week. Do you have any plans to go out?". To enable this scenario, an example of a prompt to input into the generative AI model is, "Generate what the user will say next to have a natural conversation with an elderly person: 'What's the weather like this week?'"
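The way such a prompt might be assembled can be sketched as follows. The template text is quoted from the example above; the `build_prompt` function, its parameters, and the "Reference information" line used to pass retrieved weather data are assumptions made for illustration.

```python
# Template quoted from the example prompt in the text.
PROMPT_TEMPLATE = (
    "Generate what the user will say next to have a natural conversation "
    "with an elderly person: '{utterance}'"
)


def build_prompt(utterance: str, retrieved: str = "") -> str:
    """Fill the template and, if available, append retrieved information."""
    prompt = PROMPT_TEMPLATE.format(utterance=utterance)
    if retrieved:
        # Hypothetical convention for passing retrieved data to the model.
        prompt += f"\nReference information: {retrieved}"
    return prompt


prompt = build_prompt("What's the weather like this week?", "Mostly sunny this week.")
```

The resulting string would then be passed to whichever generative model the server uses.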
[0080] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0081] Step 1:
[0082] Voice input begins when the user speaks into the device. The device's built-in microphone captures this voice. The input voice is processed by a noise-canceling algorithm to reduce background noise. The result is clear voice data. This voice data is compressed in real time in AAC format and immediately sent to the server.
[0083] Step 2:
[0084] The server receives compressed audio data sent from the terminal. At this stage, the audio data is in AAC format, so it is first decompressed. The decompressed audio data is converted into text format using a speech recognition engine, such as a speech recognition API. The input is the decompressed audio data, and the output is the corresponding text data.
[0085] Step 3:
[0086] The server further analyzes the text data. Using a natural language processing toolkit, it interprets the grammatical structure, sentiment, and intent of the text. This analysis yields contextual information and intent as output, based on the input text data. This constitutes the sentiment and intent parameters necessary for subsequent response generation.
[0087] Step 4:
[0088] The server uses a generative AI model to generate specific responses based on the intentions and emotions obtained in the analysis step. Here, the generative AI model responds to prompt sentences and forms a natural dialogue. In this step, the analysis results are taken as input and human-like response text is output.
[0089] Step 5:
[0090] The synthesis engine on the server converts the generated text response into speech data. This process uses high-quality speech synthesis technology to transform the text data into smooth speech. The input is the generated response text, and the output is the corresponding speech data.
[0091] Step 6:
[0092] The server encodes the synthesized speech data and sends it to the terminal in a compressed format. The terminal then decompresses this data and provides it to the user through its speakers. For the user, this results in natural, flowing speech.
[0093] Step 7:
[0094] The terminal records user interaction history and feedback. This feedback is sent to the server to be used for system improvement. The input is user feedback information, and the output is analytical data based on that feedback. This allows the system to be improved to provide even more user-friendly responses.
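One possible shape for the feedback step is sketched below: the terminal buffers feedback records and serializes them for upload, and the server derives a simple statistic from them. The record fields, the 1 to 5 rating scale, and all names are hypothetical; the disclosure does not specify a feedback format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FeedbackRecord:
    user_utterance: str
    system_response: str
    rating: int   # assumed scale: 1 (poor) to 5 (good)


class FeedbackRecorder:
    """Terminal side: buffer feedback until it is sent to the server."""

    def __init__(self):
        self._pending = []

    def record(self, utterance: str, response: str, rating: int) -> None:
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self._pending.append(FeedbackRecord(utterance, response, rating))

    def flush(self) -> str:
        """Serialize pending records as JSON for upload and clear the buffer."""
        payload = json.dumps([asdict(r) for r in self._pending])
        self._pending.clear()
        return payload


def average_rating(payload: str) -> float:
    """Server side: one simple piece of analytical data derived from feedback."""
    records = json.loads(payload)
    if not records:
        return 0.0
    return sum(r["rating"] for r in records) / len(records)


recorder = FeedbackRecorder()
recorder.record("What's the weather like this week?", "There will be many sunny days.", 4)
recorder.record("Thank you.", "You're welcome.", 5)
payload = recorder.flush()
```

An average rating is only one example of analytical data; low-rated exchanges could also be collected as material for retraining.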
[0095] (Application Example 1)
[0096] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0097] To alleviate feelings of loneliness and improve the quality of life for certain groups, such as the elderly, natural and personalized information provision and health management support are necessary. However, current voice dialogue systems have challenges in understanding emotions and managing health information, often failing to provide sufficient satisfaction. Furthermore, there is a lack of efficient support methods for regular and individual schedule management and health information recording.
[0098] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0099] In this invention, the server includes means for converting voice information into document format and performing natural language analysis to analyze the user's purpose and emotions; means for generating information and generating natural responses based on the analysis results; and means for recording the user's health information and assisting with daily schedule management. This makes it possible for elderly people to enjoy conversations naturally in their daily lives while receiving personalized information and health management support.
[0100] "User utterance" refers to the act of a user communicating information using their voice. The device acquires this voice information through this action.
[0101] A "central processing unit" refers to a device that converts audio information into document format and further analyzes the user's intent and emotions through natural language processing. This device is connected to terminals via a network.
[0102] "Information generation processing means" refers to technology for generating natural-sounding responses based on the results analyzed by the central processing unit. The generated information is provided to the user as audio.
[0103] "Voice conversion means" refers to a technology that converts generated text information into voice data and transmits it to a terminal. This enables voice interaction.
[0104] "Output means" refers to a device or function incorporated into the terminal to provide audio information to the user. Using this means, the generated audio is played back through the speaker.
[0105] "Support features" refer to functions that record the user's health information and assist with daily schedule management. These features provide users with appropriate support.
[0106] This invention is a natural voice dialogue system for supporting the daily lives of the elderly. The system mainly consists of three elements: a "terminal," a "central processing unit," and a "user."
[0107] The device is equipped with a highly sensitive microphone and speaker, allowing for high-precision collection of user-generated audio. The audio data is converted to a digital format and transmitted to a central processing unit using data compression technology. Hardware used includes smartphones and dedicated electronic terminals.
[0108] The central processing unit first converts speech information into text using a speech recognition API (e.g., open-source speech recognition software or a cloud service). Next, it uses a natural language processing framework to analyze the user's intent and emotions. Based on this analysis, a generative AI model generates a response appropriate to the user. This model is trained on a large amount of dialogue data, enabling it to provide natural and user-friendly responses.
[0109] Furthermore, it includes health management and schedule management functions, recording and analyzing the user's daily health information as needed. This allows for personalized health advice to be provided to each individual user.
[0110] For example, if an elderly person asks, "What's my schedule for today?", the system will respond, "You have an exercise session scheduled for this morning, followed by a relaxation session." This kind of daily voice support allows users to live their daily lives with peace of mind.
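A schedule lookup of this kind could be turned into a spoken sentence roughly as follows. This is a minimal sketch assuming a hypothetical `Event` record and fixed time-of-day boundaries; in the described system the final wording would come from the generative AI model rather than from a template.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Event:
    start: time
    title: str


def part_of_day(t: time) -> str:
    # Assumed boundaries: before 12:00 is morning, before 18:00 is afternoon.
    if t.hour < 12:
        return "this morning"
    if t.hour < 18:
        return "this afternoon"
    return "this evening"


def describe_schedule(events) -> str:
    """Build one sentence that lists today's events in chronological order."""
    if not events:
        return "You have nothing scheduled for today."
    ordered = sorted(events, key=lambda e: e.start)
    first = ordered[0]
    sentence = f"You have {first.title} scheduled for {part_of_day(first.start)}"
    for event in ordered[1:]:
        sentence += f", followed by {event.title}"
    return sentence + "."


today = [
    Event(time(10, 30), "a relaxation session"),
    Event(time(9, 0), "an exercise session"),
]
answer = describe_schedule(today)
```

Sorting by start time means the events can be stored in any order and are still read out chronologically.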
[0111] As an example of a prompt, the generative AI model would be instructed as follows: "Generate a response that provides gentle-toned health management advice for elderly people to enjoy everyday conversation."
[0112] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0113] Step 1:
[0114] The user speaks into the device. The device is equipped with a high-sensitivity microphone that accurately captures the user's voice. The input is the user's raw voice data. The device converts this voice data into a digital format, compresses the data, and transmits it to the central processing unit in real time. The output is compressed voice data.
[0115] Step 2:
[0116] The server (central processing unit) converts the received audio data into text format using a speech recognition API. The input is digital audio data. This conversion transforms the audio into text information. The output is text data. Next, a natural language processing algorithm is used to analyze the user's intent and emotions. This process helps understand what the user wants and their emotional state. The output of the analysis is metadata indicating the user's intent.
[0117] Step 3:
[0118] The server uses a generative AI model to generate appropriate responses to the user based on the analyzed data. The input is metadata containing the user's intent. The generative AI model is trained on a large-scale dialogue dataset and has the ability to generate natural and appropriate responses. The output of this process is the text data of the response.
[0119] Step 4:
[0120] The server converts the generated text response into speech data using a speech synthesis engine. The input is text data. Through speech synthesis, the text information is transformed into natural-sounding speech. This speech data is sent to the terminal. The output is the converted speech data.
[0121] Step 5:
[0122] The device plays the received audio data to the user through its speaker. The input is audio data from the server. This allows the user to receive natural-sounding voice responses from the system. The output is physical sound. The user can experience voice interaction and make decisions about their daily actions based on the responses.
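The five steps above can be read as a chain of stages in which each stage consumes the output of the previous one. The following is a minimal Python sketch under stated assumptions: every function name and the `Analysis` type are hypothetical, and the recognition, analysis, generation, and synthesis stages are stubs standing in for the speech recognition API, natural language processing algorithm, generative AI model, and speech synthesis engine, which the description does not name. `zlib` is used only as a placeholder for the unspecified audio compression.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Analysis:
    """Metadata produced in Step 2 (hypothetical structure)."""
    text: str
    intent: str
    emotion: str


def capture_and_compress(raw_audio: bytes) -> bytes:
    # Step 1: the device digitizes and compresses the voice data.
    # zlib is only a placeholder for a real audio codec.
    return zlib.compress(raw_audio)


def recognize_speech(compressed_audio: bytes) -> str:
    # Step 2 (first half): placeholder for a speech recognition API.
    # The "audio" here is plain UTF-8 text so that the sketch can run.
    return zlib.decompress(compressed_audio).decode("utf-8")


def analyze(text: str) -> Analysis:
    # Step 2 (second half): placeholder for intent and emotion analysis.
    intent = "ask_schedule" if "schedule" in text.lower() else "chat"
    return Analysis(text=text, intent=intent, emotion="neutral")


def generate_response(analysis: Analysis) -> str:
    # Step 3: placeholder for the generative AI model.
    if analysis.intent == "ask_schedule":
        return "You have an exercise session scheduled for this morning."
    return "I'm listening. Please tell me more."


def synthesize_speech(response_text: str) -> bytes:
    # Step 4: placeholder for a speech synthesis engine.
    return response_text.encode("utf-8")


def play(audio: bytes) -> str:
    # Step 5: a real device would route this to its speaker;
    # the sketch returns the decoded text instead.
    return audio.decode("utf-8")


def run_pipeline(raw_audio: bytes) -> str:
    compressed = capture_and_compress(raw_audio)
    analysis = analyze(recognize_speech(compressed))
    return play(synthesize_speech(generate_response(analysis)))
```

Keeping each stage behind its own function makes the input and output of every step explicit, which mirrors the way the steps are described.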
[0123] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0124] The present invention aims to provide more natural and psychologically supportive communication in a dialogue system designed to alleviate feelings of loneliness among the elderly by incorporating a function that recognizes the user's emotions and reflects them in the response. Specific embodiments are shown below.
[0125] The system consists of terminals, servers, an emotion engine, and a network connecting them.
[0126] When a user speaks into the device, the device's microphone captures the audio in real time. The audio data is compressed and sent to the server.
[0127] The server converts the received audio data into text data using speech recognition technology. Furthermore, an emotion engine located within the server analyzes the user's emotions using this audio data. This analysis takes into account the tone, speed, pitch, and resonance of the user's voice. For example, if the user speaks slowly in a calm tone, the emotion engine recognizes this as a "stable" state, while if they speak quickly and at a high pitch, it determines that they are "excited."
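The mapping from voice characteristics to an emotional state described above can be illustrated with a simple rule-based sketch. This is an assumption for illustration only: the description does not state how the emotion engine is implemented, the `VoiceFeatures` type is hypothetical, and the numeric thresholds are invented placeholders rather than values taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Prosodic features of one utterance (hypothetical structure)."""
    speech_rate_wps: float  # speaking speed in words per second
    mean_pitch_hz: float    # average pitch of the voice


# Illustrative thresholds only; the description gives no numeric values.
FAST_RATE_WPS = 3.0
HIGH_PITCH_HZ = 220.0


def classify_emotion(features: VoiceFeatures) -> str:
    """Return "excited", "stable", or "unknown" from speed and pitch."""
    fast = features.speech_rate_wps >= FAST_RATE_WPS
    high = features.mean_pitch_hz >= HIGH_PITCH_HZ
    if fast and high:
        return "excited"   # quick speech at a high pitch
    if not fast and not high:
        return "stable"    # slow speech in a calm tone
    return "unknown"       # mixed signals are left undecided
```

A trained model could replace `classify_emotion` without changing its interface, since the rest of the flow only needs the resulting label.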
[0128] Next, the server uses natural language processing to combine the user's statements with emotional information, and the generative AI generates the optimal response. In this process, emotional data influences the tone and content of the response. For example, if the server detects that the user is feeling anxious, it will generate a more supportive and reassuring response such as, "It's okay. Is there anything I can do to help?"
[0129] The generated text response is converted into speech by a speech synthesis engine on the server. The speech data is adjusted for intonation and tone to make it sound as close to human speech as possible. The completed speech response is sent to the terminal and delivered to the user.
[0130] The device plays the received audio response through its speaker and provides it to the user. At this time, the device verifies that the response is appropriate to the user's state and environment, and plays it back at an appropriate volume and speed.
[0131] This system allows users to have a deeper conversational experience, not just exchange information, and functions as a source of daily emotional support. The entire system constantly evolves by incorporating user feedback, improving the accuracy and quality of responses.
[0132] The following describes the processing flow.
[0133] Step 1:
[0134] The user initiates a conversation by speaking into the device. The device's built-in microphone filters out ambient noise and captures the user's voice clearly.
[0135] Step 2:
[0136] The terminal converts the captured audio into digital data and compresses it. The compressed audio data is then ready to be sent to the server.
[0137] Step 3:
[0138] The server converts the audio data received from the terminal into text data using a speech recognition engine. An emotion engine installed within the server then analyzes the user's emotions from this audio data.
[0139] Step 4:
[0140] The server's emotion engine analyzes elements such as the tone, speed, and pitch of the speech to identify the user's emotions. For example, if the speech is hesitant or choppy, the emotion engine may classify the user's state as "anxiety."
[0141] Step 5:
[0142] The server sends the user's text data and sentiment information to a natural language processing module. Here, contextual analysis and sentiment data are integrated to gain a more detailed understanding of the speaker's intent.
[0143] Step 6:
[0144] The server uses generative AI to generate natural-sounding responses based on the analyzed data. The tone and content of the response are adjusted based on information obtained from the emotion engine. For example, if "anxiety" is detected, it will generate a response in a gentle tone such as, "Don't worry, I'm here to support you."
[0145] Step 7:
[0146] The server converts the generated text response into speech data using a speech synthesis engine. During this process, natural intonation is added, and the speech is optimized for easy understanding.
[0147] Step 8:
[0148] The server sends audio data to the terminal. Once the terminal receives the data, it delivers the audio to the user through its speaker.
[0149] Step 9:
[0150] The device monitors user responses in real time and sends the collected feedback to the server. This feedback is used as training data to further improve future responses.
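Step 9 describes feedback being collected and reused as training data. One way this could be organized is sketched below, as an assumption rather than the disclosed method: the `FeedbackRecord` fields, the numeric `rating`, and the JSON Lines storage format are all hypothetical, and an in-memory `io.StringIO` stands in for whatever store the server would use.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class FeedbackRecord:
    """One interaction plus the user's reaction (hypothetical fields)."""
    user_text: str
    detected_emotion: str
    response_text: str
    rating: int  # hypothetical score, 1 (poor) to 5 (good)


def append_feedback(stream, record: FeedbackRecord) -> None:
    # Store each record as one JSON object per line (JSON Lines).
    stream.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def load_training_examples(stream, min_rating: int = 4) -> list:
    # Keep only well-rated interactions as candidate training examples.
    examples = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["rating"] >= min_rating:
            examples.append({
                "prompt": record["user_text"],
                "emotion": record["detected_emotion"],
                "completion": record["response_text"],
            })
    return examples


# Usage: an in-memory log stands in for the server-side store.
log = io.StringIO()
append_feedback(log, FeedbackRecord(
    "I feel uneasy.", "anxiety", "Don't worry, I'm here to support you.", 5))
append_feedback(log, FeedbackRecord("Good morning.", "stable", "Hello.", 2))
log.seek(0)
examples = load_training_examples(log)
```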
[0151] (Example 2)
[0152] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0153] In modern society, the elderly often experience feelings of loneliness, and there is a need for psychological support. However, conventional dialogue systems have struggled to adequately recognize users' emotions and provide natural and supportive responses. As a result, there is a challenge in that opportunities for the elderly to feel secure are limited.
[0154] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0155] In this invention, the server includes means for converting voice information into text information and performing machine processing to analyze the user's intentions and emotions; means for generating a harmonious response; and means for converting the response into voice information and transmitting it to an information communication device. This makes it possible to appropriately recognize the user's emotions and provide psychological support through natural dialogue based on those emotions.
[0156] An "information and communication device" is a device that has the function of acquiring a user's voice and transmitting that voice information to a processing device.
[0157] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions through machine processing.
[0158] A "generative modeling tool" is a tool that has the function of generating a harmonious response based on analyzed emotional information and intentions.
[0159] A "voice conversion means" is a means that has the function of converting the generated response into voice information and transmitting it to an information communication device.
[0160] An "output device" is a device that has the function of playing back audio information received by an information and communication device and providing an audio response to the user.
[0161] This invention is a dialogue system designed to alleviate feelings of loneliness among the elderly. It aims to provide natural and psychologically supportive communication by recognizing the user's emotions and reflecting them in its responses.
[0162] When a user speaks into the device to initiate a conversation, the device acquires the voice using its built-in microphone. This voice information is processed using compression technology (e.g., AAC or MP3 format) to enable efficient data transfer. The voice data is then transmitted to the server via the information and communication network.
[0163] The server converts the received audio data into text data using a speech recognition engine (e.g., general speech recognition software). This ensures that the user's speech is treated as textual information. Subsequently, an emotion analysis module within the server analyzes the user's voice characteristics, such as tone, speed, pitch, and resonance, to identify their emotional state. Machine learning techniques are used in this emotion analysis.
[0164] Using the analyzed emotion information, the server generates the optimal response using a generative AI model (for example, a proprietary generative AI). By inputting prompts into the generative AI, an appropriate response tailored to the user's emotions is elicited. A concrete example of a prompt might be, "The user is calm, so respond in a relaxed tone."
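The way an emotion label can be turned into a prompt may be sketched as follows. This is a minimal illustration under assumptions: only the "calm" instruction is taken from the description, while the other table entries, the default instruction, and the names `TONE_BY_EMOTION` and `build_prompt` are hypothetical.

```python
# Tone instruction per emotion label. Only the "stable" entry comes
# from the description; the other entries are illustrative assumptions.
TONE_BY_EMOTION = {
    "stable": "The user is calm, so respond in a relaxed tone.",
    "excited": "The user is excited, so respond in a calm and steady tone.",
    "anxious": "The user is anxious, so respond in a gentle and reassuring tone.",
}

# Fallback used when the emotion label is not in the table (assumption).
DEFAULT_TONE = "Respond in a friendly and polite tone."


def build_prompt(user_text: str, emotion: str) -> str:
    """Combine the tone instruction with the user's utterance."""
    tone = TONE_BY_EMOTION.get(emotion, DEFAULT_TONE)
    return f'{tone}\nThe user said: "{user_text}"'
```

For example, `build_prompt("Good morning.", "stable")` yields the calm-tone instruction followed by the quoted utterance, and the resulting string would then be passed to the generative AI model.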
[0165] The generated text response is converted into speech data by a speech synthesis engine (e.g., common speech synthesis software). This speech data is then adjusted for intonation and tone and transmitted to the terminal over the network.
[0166] The device provides the user with the received audio data through its speaker. At this time, the device adjusts the volume and playback speed according to the surrounding environment and the user's state. This allows the user to have a natural and supportive conversational experience.
[0167] This system allows users to enjoy not just information exchange, but more personalized psychological support. The terminal sends user feedback collected during the interaction to the server, allowing the system to continuously improve the accuracy and quality of its responses.
[0168] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0169] Step 1:
[0170] The user speaks into the device, inputting voice. The device uses its built-in microphone to capture this voice. The captured voice data is compressed using compression techniques such as AAC format and sent to the server via the information and communication network. The input is raw voice data, and the output is a compressed audio file.
[0171] Step 2:
[0172] The server decompresses the received compressed audio data and converts it into text data using a speech recognition engine. The input is compressed audio data, and the output is text-based character information. Specifically, it performs analysis using widely available speech recognition software.
[0173] Step 3:
[0174] The server analyzes the user's emotional state using the emotion analysis module. The input is the text information, together with the voice data on which the voice analysis is performed, and the output is the user's emotional state (e.g., "stable" or "excited"). The emotion analysis uses techniques such as voice analysis, which evaluates factors like tone, speed, and pitch.
[0175] Step 4:
[0176] The server uses a generative AI model to generate responses based on emotional states and text data. Here, prompts such as "respond in a relaxed tone that matches the current emotional state" are input, and optimized response text is output. A concrete example would be reassuring words appropriate to the situation.
[0177] Step 5:
[0178] The server converts the generated response text into speech data using a speech synthesis engine. The input is response data in text format, and the output is response data in speech format. In speech synthesis, the intonation and tone are adjusted to be as close to human as possible.
[0179] Step 6:
[0180] The terminal receives audio data transmitted from the server and plays it through the speaker. The input is response data in audio format, and the output is delivered to the user as actual audio. The terminal adjusts the volume according to the user's environment and plays the audio in the optimal way.
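The volume adjustment mentioned in Step 6 could, for instance, scale the playback volume with the ambient noise level. The sketch below is an assumption for illustration: the description does not say how the environment is measured, and the function name, the decibel bounds, and the volume range are hypothetical values.

```python
def choose_volume_percent(
    ambient_noise_db: float,
    quiet_db: float = 30.0,   # assumed noise level of a quiet room
    loud_db: float = 70.0,    # assumed noise level of a noisy room
    min_percent: int = 30,    # assumed volume in a quiet room
    max_percent: int = 100,   # assumed volume in a noisy room
) -> int:
    """Map ambient noise linearly onto a playback volume in percent."""
    if loud_db <= quiet_db:
        raise ValueError("loud_db must be greater than quiet_db")
    ratio = (ambient_noise_db - quiet_db) / (loud_db - quiet_db)
    ratio = max(0.0, min(1.0, ratio))  # clamp outside the assumed range
    return round(min_percent + ratio * (max_percent - min_percent))
```

A linear mapping with clamping is only one possible design; playback speed could be handled by a similar function driven by the user's state.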
[0181] (Application Example 2)
[0182] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0183] The challenge is to provide a dialogue system that allows for natural communication in order to alleviate feelings of loneliness and mental stress among the elderly. Furthermore, this system is required to generate supportive responses that are tailored to the emotional state of the elderly and to provide psychological comfort.
[0184] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0185] In this invention, the server includes a processing device that converts the user's voice into text information and performs natural language processing to analyze the user's intentions and emotions; a generative model that generates a natural response based on the analysis results; and a synthesis device that converts the generated response into voice information and transmits it to an input/output device. This makes it possible to generate responses based on the user's emotions and provide psychological support to the elderly.
[0186] An "input/output device" is a device that acquires the user's voice in real time and plays back the received voice information to the user.
[0187] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions using natural language processing.
[0188] A "generative model" is an artificial intelligence model designed to generate natural responses based on analyzed user intentions and emotions.
[0189] A "synthesis device" is a device that converts text information generated by a generative model into speech information and transmits it to an input/output device.
[0190] "Psychological support" refers to providing elderly people with a sense of security and peace of mind through responses based on the user's emotions.
[0191] The embodiments for carrying out the invention are as follows:
[0192] This system consists of a terminal, a server, a generative AI model, and a network connecting them. The terminal is responsible for receiving the voice of elderly people in real time and transmitting this voice data to the server.
[0193] The server first uses a speech processing engine to convert speech data into text data. This process utilizes "speech recognition software." Subsequently, a "natural language processing engine" is used to analyze the user's intent and emotions using natural language processing technology. Based on the emotional information analyzed here, a generative AI model generates an appropriate response. For example, a "generative AI engine" is used for this response generation.
[0194] The generated text information is converted into speech data by a speech synthesis engine. At this stage, "speech synthesis software" is used, and the generated speech data is sent back to the terminal.
[0195] The device plays back the received audio data and provides it to the elderly user. During playback, the device is required to deliver the audio at a volume and speed suited to the user's environment and condition.
[0196] For example, if an elderly person using the device says, "I feel lonely today," the system analyzes the utterance, determines that the emotion is "loneliness," and generates a response such as, "Shall we watch your favorite movie together?" This response aims to provide comfortable and reassuring communication.
[0197] An example of a prompt would be: "Generate a reassuring response when the user is feeling down. The user said, 'I'm feeling kind of lonely today.'"
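The path from the utterance to the prompt in this example can be sketched with a small keyword-based detector. This is a deliberately simple stand-in, not the natural language processing engine of the description: the keyword table, the "neutral" fallback instruction, and the names `detect_emotion` and `build_support_prompt` are hypothetical.

```python
import re

# Hypothetical keyword table; a real system would use an NLP engine.
EMOTION_KEYWORDS = {
    "loneliness": {"lonely", "alone"},
    "anxiety": {"worried", "anxious", "afraid"},
}


def detect_emotion(text: str) -> str:
    """Return the first emotion whose keywords appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    for emotion, keywords in EMOTION_KEYWORDS.items():
        if words & keywords:
            return emotion
    return "neutral"


def build_support_prompt(text: str) -> str:
    """Build a prompt in the style of the example in the description."""
    if detect_emotion(text) == "neutral":
        # Fallback instruction is an assumption, not from the description.
        instruction = "Generate a friendly response to continue the conversation."
    else:
        instruction = "Generate a reassuring response when the user is feeling down."
    return f"{instruction} The user said, '{text}'"
```

Matching whole words rather than substrings avoids false hits such as a keyword appearing inside a longer, unrelated word.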
[0198] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0199] Step 1:
[0200] The device captures the user's voice in real time and takes the voice data as input. The acquired voice data is compressed and output to the server over the network.
[0201] Step 2:
[0202] The server passes the received audio data as input to the speech processing engine. Speech recognition software is used to convert the audio data into text data. The converted text data is then output to the next stage of processing.
[0203] Step 3:
[0204] The server inputs text data into a natural language processing engine to analyze the user's intent and emotions. The analysis results are output as emotion data, which is then passed to a generative AI model.
[0205] Step 4:
[0206] The server inputs the analysis results into the generative AI model and generates a natural response based on the user's emotions. It then outputs the generated response text and sends it to the speech synthesis engine.
[0207] Step 5:
[0208] The server uses speech synthesis software to convert the generated response text into audio data. It then outputs the converted audio data and transmits it to the terminal via the network.
[0209] Step 6:
[0210] The device receives audio data as input and plays it back to the user through the speaker. The volume and speed are optimized according to the user's environment.
[0211] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0212] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0213] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0214] [Second Embodiment]
[0215] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0216] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0217] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0218] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0219] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0220] The camera 42 is a small digital camera equipped with an optical system including a lens, an aperture, and a shutter, and with an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. The camera 42 captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0221] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0222] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0223] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0224] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0225] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0226] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0227] This invention provides a specific embodiment for realizing natural dialogue with users in a communication system for the elderly. This system is mainly composed of terminals, a server, and a network connecting them.
[0228] First, when a user begins interacting with this system, they speak into the terminal, which then acquires their voice. The terminal is equipped with a highly sensitive microphone designed to accurately capture the user's speech while filtering out external noise. The acquired voice data is then transmitted to the server in real time using compression technology.
[0229] Next, the server converts the received audio data into text format using speech recognition technology. At this stage, the server uses natural language processing to analyze the context behind the audio, the user's emotions, and even the intent of the conversation. For example, if the user asks, "What's the weather like today?", the server understands that the user is looking for the latest weather information.
[0230] Once the analysis is complete, the server uses generative AI to generate an appropriate response. This AI is trained on a massive dataset and is designed to produce natural, human-like responses. For example, it might generate a response such as, "It's sunny today. Do you have any plans to go out?"
[0231] The generated text response is converted into natural-sounding voice data by a speech synthesis engine on the server. This voice data is sent to the terminal and delivered to the user through the speaker. Because the entire conversation is conducted by voice, without any text-based interaction, the system provides the user with a more immersive and personalized dialogue experience.
[0232] Furthermore, as an information provision function, the server retrieves information from external sources in response to user requests and presents it in a format optimized for the user's needs. This information covers a wide range of topics, such as news, health information, and activities related to hobbies.
[0233] Finally, the device stores the history of user interactions and feedback, and sends it to the server. This feedback is processed by the server as part of machine learning and used to improve the system's future response performance.
[0234] The goal of this entire system is to improve the quality of life for users by creating an environment where elderly people can enjoy conversations on a daily basis and by reducing feelings of loneliness.
[0235] The following describes the processing flow.
[0236] Step 1:
[0237] The user initiates communication by speaking into the device. The device's built-in microphone captures the user's voice in real time.
[0238] Step 2:
[0239] The terminal converts the captured audio into a digital signal, compresses it, and prepares it for transmission to the server. To facilitate speech recognition processing, it performs noise reduction and processes the data to ensure the audio remains clear.
[0240] Step 3:
[0241] The server processes the received audio data using a speech recognition engine and converts it into text data. The server then passes this text data to a natural language processing module for analysis, which extracts the user's intent and emotions.
[0242] Step 4:
[0243] The server uses generative AI to generate appropriate responses based on the analyzed intent and emotional information. This process also takes conversation history and context into consideration, resulting in more natural and human-like responses.
[0244] Step 5:
[0245] The server passes the generated text response to a speech synthesis module, which converts it into speech data. The converted speech is synthesized to have a natural intonation and a human-like sound.
[0246] Step 6:
[0247] The server sends the generated audio data to the terminal. The audio data is delivered to the user quickly while maintaining high quality.
[0248] Step 7:
[0249] The device plays the received audio data through its speaker, providing a response to the user. Through this output, the device establishes its presence as a companion to the user.
[0250] Step 8:
[0251] The device tracks user responses and additional interactions, sending this data to the server as feedback. This feedback is used to improve the system's response accuracy.
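Step 4 states that conversation history and context are taken into consideration when generating a response. One simple way to do this is to keep a bounded window of recent turns and prepend it to the prompt, as in the sketch below. The class and function names, the window size, and the prompt layout are assumptions for illustration and are not taken from the description.

```python
from collections import deque


class ConversationHistory:
    """Keeps only the most recent turns (hypothetical helper)."""

    def __init__(self, max_turns: int = 5):
        # A deque with maxlen discards the oldest turn automatically.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, user_text: str, system_text: str) -> None:
        self._turns.append((user_text, system_text))

    def as_context(self) -> str:
        lines = []
        for user_text, system_text in self._turns:
            lines.append(f"User: {user_text}")
            lines.append(f"System: {system_text}")
        return "\n".join(lines)


def build_prompt_with_history(
    history: ConversationHistory, user_text: str, emotion: str
) -> str:
    """Assemble emotion, recent turns, and the new utterance."""
    parts = [f"[Detected emotion: {emotion}]"]
    context = history.as_context()
    if context:
        parts.append(context)
    parts.append(f"User: {user_text}")
    parts.append("System:")
    return "\n".join(parts)
```

Bounding the window keeps the prompt short while still giving the generative AI the most recent context.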
[0252] (Example 1)
[0253] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0254] For the elderly and users who communicate via voice, it is crucial to create communication systems that provide natural and human-like responses. However, conventional technologies often fall short in terms of convenience and response accuracy, with particular challenges in noise reduction of voice input, emotion analysis, and appropriate response generation. Furthermore, there is currently a lack of effective methods for improving system performance by utilizing user feedback.
[0255] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0256] In this invention, the server includes an analysis means that converts voice data into text data and performs natural language processing to analyze the user's intentions and emotions; a generation model means that generates human-like responses; and a synthesis means that converts the generated responses into acoustic data and outputs it. This enables smooth interaction with the user, provides highly accurate responses, and allows for continuous improvement of system performance through the use of feedback.
[0257] "Acquisition means for acquiring sound, compressing the sound data, and transmitting it to an information processing device" refers to a device or system that has the function of capturing the sound emitted by a user, compressing it for effective data transfer, and then transmitting it to an information processing device such as a server via a network.
[0258] "An analytical means that converts audio data into text data and performs natural language processing to analyze the user's intent and emotions" refers to a technology or system that converts audio data into text format and uses language processing techniques to understand the context, emotions, and desired content from the text.
[0259] A "generative model means for generating human-like responses" is a model or program that uses AI technology to generate natural conversations in response to user input, creating responses that sound as if a human were responding.
[0260] "A synthesis means that converts the generated response into acoustic data and outputs it" refers to a device or system that uses speech synthesis technology to convert text generated in natural language into speech and output it in a format that can be heard by the user.
[0261] "Recording means for recording the history of conversations with users and feedback from users, and transmitting them for use by the analysis means" refers to a device or software that has the function of saving the content of past conversations with users and feedback information such as suggestions for improvement obtained from users, and transmitting it to a server for use in future analysis and system performance improvement.
[0262] This invention is a communication system that enables natural, voice-based dialogue for users, including the elderly. The system primarily consists of terminals, a server, and a network.
[0263] When a user initiates a voice interaction, the terminal device first receives the audio. This terminal is equipped with a high-sensitivity microphone to accurately capture the user's speech. Noise cancellation technology is employed to effectively remove external noise. The acquired audio data is compressed in real time and transferred to a server. AAC and Opus are commonly used for this audio compression.
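As a rough illustration of removing external noise before compression, the sketch below applies an energy-based noise gate to PCM samples. This is a simplification swapped in for illustration: it is not the noise cancellation technology of the description, which is not specified, and the function names, frame size, and threshold are hypothetical. Compression with AAC or Opus would follow this step and is not shown, since it requires a codec library.

```python
import math


def rms(frame: list) -> float:
    """Root-mean-square energy of a frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def noise_gate(samples: list, frame_size: int = 4,
               threshold: float = 500.0) -> list:
    """Silence frames whose energy falls below the threshold."""
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if rms(frame) >= threshold:
            gated.extend(frame)             # keep frames with speech energy
        else:
            gated.extend([0] * len(frame))  # mute low-energy background
    return gated
```

For example, a quiet frame such as `[10, -20, 15, -5]` is replaced with zeros, while a loud frame such as `[3000, -2800, 2500, -2600]` passes through unchanged.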
[0264] The server decodes the received compressed audio data. The decoded audio data is converted into text data using speech recognition software; speech recognition APIs that are already widely available on the market are used for this purpose. Subsequently, a natural language processing (NLP) library, for example the Natural Language Toolkit (NLTK) or spaCy, is used to analyze the user's intent and emotions.
[0265] Based on the analysis results, the server generates a response using a generative AI. This generative AI is a model trained on a large dataset. Examples of such models include widely known natural language generation technologies, such as the GPT series. The generated response is prepared in an optimized text format.
[0266] The text response is converted into speech by a speech synthesis engine, and the synthesized speech is adjusted so that it has natural intonation and emotion. The synthesized speech data is then compressed again and sent to the device, which delivers it to the user through the speaker.
[0267] In addition, in response to specific user requests, the server retrieves the latest information from a wide-area communication network and provides summarized information through NLP and AI services. This includes a variety of content, such as news summaries and weather information.
[0268] Furthermore, the terminal records user feedback and sends it to the server. This feedback is analyzed by machine learning algorithms, contributing to improvements in the accuracy and quality of responses. Through this feedback process, the system continuously evolves and can respond quickly and flexibly to user needs.
[0269] As a concrete example, when a user asks, "What's the weather like this week?", the system retrieves weather data and provides a specific response such as, "There will be many sunny days this week. Do you have any plans to go out?". To enable this scenario, an example of a prompt to input into the generative AI model is, "Generate what the user will say next to have a natural conversation with an elderly person: 'What's the weather like this week?'"
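The prompt in this example can be assembled from a fixed template and the recognized utterance. The sketch below is a minimal illustration: the template text is quoted from the description, while the function name `build_prompt` and the optional `retrieved` field for passing retrieved weather data are hypothetical.

```python
from typing import Optional

# Template wording quoted from the description.
PROMPT_TEMPLATE = (
    "Generate what the user will say next to have a natural conversation "
    "with an elderly person: '{utterance}'"
)


def build_prompt(utterance: str, retrieved: Optional[str] = None) -> str:
    """Wrap the recognized utterance in the template.

    `retrieved` is a hypothetical slot for information fetched by the server
    (for example, a weather summary); it is appended only when present.
    """
    prompt = PROMPT_TEMPLATE.format(utterance=utterance.strip())
    if retrieved:
        prompt += f"\nReference information: {retrieved.strip()}"
    return prompt


print(build_prompt("What's the weather like this week?",
                   retrieved="Mostly sunny this week."))
```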
[0270] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0271] Step 1:
[0272] Voice input begins when the user speaks into the device. The device's built-in microphone captures this voice. The input voice is processed by a noise-canceling algorithm to reduce background noise. The result is clear voice data. This voice data is compressed in real time in AAC format and immediately sent to the server.
[0273] Step 2:
[0274] The server receives compressed audio data sent from the terminal. At this stage, the audio data is in AAC format, so it is first decompressed. The decompressed audio data is converted into text format using a speech recognition engine, such as a speech recognition API. The input is the decompressed audio data, and the output is the corresponding text data.
[0275] Step 3:
[0276] The server further analyzes the text data. It uses a natural language processing toolkit to interpret the grammar structure, sentiment, and intent of the text. In this analysis, based on the input text data, context information and intent are obtained as output. This constitutes the sentiment and intent parameters necessary for subsequent response generation.
[0277] Step 4:
[0278] The server uses a generative AI model to generate a specific response based on the intent and sentiment obtained in the analysis step. Here, the generative AI model responds to the prompt text to form a natural conversation. In this step, the analysis result is used as the input, and a human-like response text is output.
[0279] Step 5:
[0280] The synthesis engine in the server converts the generated text response into audio data. In this process, high-quality speech synthesis technology is used to turn the text data into smooth audio. The input is the generated response text, and the output is the corresponding audio data.
[0281] Step 6:
[0282] The server encodes the synthesized audio data and transmits it to the terminal in a compressed format. The terminal decompresses this data and provides it to the user through the speaker. For the user, natural-sounding audio is played back.
[0283] Step 7:
[0284] The terminal records the interaction history and feedback with the user. This feedback is sent to the server for utilization in system improvement. The input is the user's feedback information, and the output is the analysis data based on it. Thus, the system is continuously improved to provide responses that are more user-oriented.
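The server-side portion of Steps 2 to 6 can be sketched as a chain of replaceable stages. This is only a structural outline under stated assumptions: the class and field names are hypothetical, and each stage is injected as a plain callable so that an actual decoder, speech recognition API, NLP toolkit, generative AI model, and speech synthesis engine could be substituted for the stand-ins used here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Analysis:
    intent: str
    sentiment: str


@dataclass
class DialoguePipeline:
    decode: Callable[[bytes], bytes]          # Step 2: decompress received audio
    transcribe: Callable[[bytes], str]        # Step 2: speech recognition
    analyze: Callable[[str], Analysis]        # Step 3: intent and sentiment
    generate: Callable[[str, Analysis], str]  # Step 4: generative AI response
    synthesize: Callable[[str], bytes]        # Step 5: speech synthesis
    encode: Callable[[bytes], bytes]          # Step 6: compress for transmission

    def handle(self, compressed_audio: bytes) -> bytes:
        """Run one utterance through every stage in order."""
        audio = self.decode(compressed_audio)
        text = self.transcribe(audio)
        analysis = self.analyze(text)
        reply = self.generate(text, analysis)
        return self.encode(self.synthesize(reply))


# Stand-in stages for demonstration only; none of them does real audio work.
pipeline = DialoguePipeline(
    decode=lambda data: data,
    transcribe=lambda audio: audio.decode("utf-8"),
    analyze=lambda text: Analysis(intent="ask_weather", sentiment="neutral"),
    generate=lambda text, a: f"[{a.intent}/{a.sentiment}] reply to: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
    encode=lambda audio: audio,
)
print(pipeline.handle(b"What's the weather like this week?"))
```

Keeping each stage behind a callable makes it possible to test the control flow without any audio hardware or external service.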
[0285] (Application Example 1)
[0286] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".
[0287] Reducing the daily sense of loneliness of specific people such as the elderly, and improving their quality of life, requires natural, personalized information provision and health management support. However, current voice dialogue systems have difficulty understanding emotions and managing health information, and often fail to satisfy users. There is also a lack of efficient means of supporting schedule management and the regular, individualized recording of health information.
[0288] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0289] In this invention, the server includes means for converting voice information into a document format and performing natural language analysis to analyze the user's purpose and emotion; generation information processing means for generating a natural response based on the analysis result; and support means having a function of recording the user's health information and assisting with daily schedule management. As a result, the elderly can individually receive the information and health management support they need while naturally enjoying conversations in their daily lives.
[0290] "User's utterance" refers to the act of a user transmitting information using voice. By this act, the terminal acquires voice information.
[0291] "Central processing unit" refers to a device for converting voice information into a document format and further analyzing the user's intention and emotion by natural language processing. This device is connected to the terminal through a network.
[0292] "Generation information processing means" refers to a technology for generating a natural response based on the result analyzed by the central processing unit. The generated information is provided to the user as voice.
[0293] "Voice conversion means" refers to a technology that converts generated text information into voice data and transmits it to a terminal. This enables voice interaction.
[0294] "Output means" refers to a device or function incorporated into the terminal to provide audio information to the user. Using this means, the generated audio is played back through the speaker.
[0295] "Support features" refer to functions that record the user's health information and assist with daily schedule management. These features provide users with appropriate support.
[0296] This invention is a natural voice dialogue system for supporting the daily lives of the elderly. The system mainly consists of three elements: a "terminal," a "central processing unit," and a "user."
[0297] The device is equipped with a highly sensitive microphone and speaker, allowing for high-precision collection of user-generated audio. The audio data is converted to a digital format and transmitted to a central processing unit using data compression technology. Hardware used includes smartphones and dedicated electronic terminals.
[0298] The central processing unit first converts speech information into text using a speech recognition API (e.g., open-source speech recognition software or a cloud service). Next, it uses a natural language processing framework to analyze the user's intent and emotions. Based on this analysis, a generative AI model generates a response appropriate to the user. This model is trained on a large amount of dialogue data, enabling it to provide natural and user-friendly responses.
[0299] Furthermore, it includes health management and schedule management functions, recording and analyzing the user's daily health information as needed. This allows for personalized health advice to be provided to each individual user.
[0300] As a specific example, when an elderly person asks, "Tell me today's schedule," the system responds, "There is a planned exercise in the morning. After that, there is a relaxation session." In this way, daily support in voice is provided, and the user can live their daily life with peace of mind.
[0301] As an example of a prompt sentence, the generative AI model is instructed as follows. "Please generate a response that gives advice on today's health management in a kind tone for the elderly to enjoy daily conversations."
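The support function and the prompt described above could be combined as in the following sketch. It assumes a simple in-memory store; the class names, the health metric name, and the way the schedule and health records are appended to the prompt are all hypothetical. Only the base prompt text is taken from the description.

```python
from dataclasses import dataclass, field
from datetime import date, time
from typing import Dict, List, Tuple

# Prompt text quoted from the description; everything else is hypothetical.
BASE_PROMPT = (
    "Please generate a response that gives advice on today's health "
    "management in a kind tone for the elderly to enjoy daily conversations."
)


@dataclass(order=True)
class ScheduleEntry:
    start: time
    title: str


@dataclass
class SupportStore:
    schedule: Dict[date, List[ScheduleEntry]] = field(default_factory=dict)
    health: List[Tuple[date, str, float]] = field(default_factory=list)

    def add_event(self, day: date, start: time, title: str) -> None:
        self.schedule.setdefault(day, []).append(ScheduleEntry(start, title))

    def record_health(self, day: date, metric: str, value: float) -> None:
        self.health.append((day, metric, value))

    def schedule_summary(self, day: date) -> str:
        """List the day's events in chronological order."""
        entries = sorted(self.schedule.get(day, []))
        if not entries:
            return "Nothing is scheduled today."
        return " ".join(f"{e.start:%H:%M} {e.title}." for e in entries)

    def build_prompt(self, day: date) -> str:
        """Append the day's schedule and health records to the base prompt."""
        records = [f"{m}={v}" for d, m, v in self.health if d == day]
        health = ", ".join(records) if records else "no records"
        return (
            f"{BASE_PROMPT}\nToday's schedule: {self.schedule_summary(day)}"
            f"\nToday's health records: {health}"
        )


store = SupportStore()
today = date(2025, 1, 1)
store.add_event(today, time(11, 0), "Relaxation session")
store.add_event(today, time(9, 0), "Exercise")
store.record_health(today, "systolic_bp", 128.0)
print(store.build_prompt(today))
```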
[0302] The flow of the specific process in Application Example 1 will be described using Figure 12.
[0303] Step 1:
[0304] The user speaks to the terminal. The terminal is equipped with a high-sensitivity microphone to accurately collect the user's voice. The input is the user's raw voice data. The terminal converts this voice data into digital format, performs data compression, and transmits it to the central processing unit in real time. The output is the compressed voice data.
[0305] Step 2:
[0306] The server (central processing unit) converts the received voice data into text format using a voice recognition API. The input is the voice data in digital format, and the output is text data. Next, a natural language processing algorithm analyzes the user's intention and emotion. This process identifies what the user is asking for and what the user's emotional state is. The output of the analysis is metadata indicating the user's intention.
[0307] Step 3:
[0308] The server uses a generative AI model to generate appropriate responses to the user based on the analyzed data. The input is metadata containing the user's intent. The generative AI model is trained on a large-scale dialogue dataset and has the ability to generate natural and appropriate responses. The output of this process is the text data of the response.
[0309] Step 4:
[0310] The server converts the generated text response into speech data using a speech synthesis engine. The input is text data. Through speech synthesis, the text information is transformed into natural-sounding speech. This speech data is sent to the terminal. The output is the converted speech data.
[0311] Step 5:
[0312] The device plays the received audio data to the user through its speaker. The input is audio data from the server. This allows the user to receive natural-sounding voice responses from the system. The output is physical sound. The user can experience voice interaction and make decisions about their daily actions based on the responses.
[0313] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0314] The present invention aims to provide more natural and psychologically supportive communication in a dialogue system designed to alleviate feelings of loneliness among the elderly by incorporating a function that recognizes the user's emotions and reflects them in the response. Specific embodiments are shown below.
[0315] The system consists of terminals, servers, an emotion engine, and a network connecting them.
[0316] When a user speaks into the device, the device's microphone captures the audio in real time. The audio data is compressed and sent to the server.
[0317] The server converts the received audio data into text data using speech recognition technology. Furthermore, an emotion engine located within the server analyzes the user's emotions using this audio data. This analysis takes into account the tone, speed, pitch, and resonance of the user's voice. For example, if the user speaks slowly in a calm tone, the emotion engine recognizes this as a "stable" state, while if they speak quickly and at a high pitch, it determines that they are "excited."
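The two cases given above (slow and calm speech mapped to "stable", fast and high-pitched speech mapped to "excited") can be written as a simple rule-based classifier. This is a minimal sketch, not the emotion engine itself: the feature set is reduced to speech rate and mean pitch, a lower pitch is used as a stand-in for a calm tone, and all threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    speech_rate_wps: float  # words per second
    mean_pitch_hz: float    # average fundamental frequency


# Hypothetical thresholds chosen only for illustration.
FAST_RATE_WPS = 3.0
SLOW_RATE_WPS = 2.0
HIGH_PITCH_HZ = 220.0


def estimate_emotion(features: VoiceFeatures) -> str:
    """Map prosodic features to a coarse emotion label."""
    fast = features.speech_rate_wps >= FAST_RATE_WPS
    slow = features.speech_rate_wps <= SLOW_RATE_WPS
    high = features.mean_pitch_hz >= HIGH_PITCH_HZ
    if fast and high:
        return "excited"
    if slow and not high:
        return "stable"
    return "undetermined"


print(estimate_emotion(VoiceFeatures(speech_rate_wps=1.5, mean_pitch_hz=160.0)))
print(estimate_emotion(VoiceFeatures(speech_rate_wps=3.5, mean_pitch_hz=260.0)))
```

A deployed engine would more plausibly use a trained model over many more features; the rules here only make the mapping in the text concrete.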
[0318] Next, the server uses natural language processing to combine the user's statements with emotional information, and the generative AI generates the optimal response. In this process, emotional data influences the tone and content of the response. For example, if the server detects that the user is feeling anxious, it will generate a more supportive and reassuring response such as, "It's okay. Is there anything I can do to help?"
[0319] The generated text response is converted into speech by a speech synthesis engine on the server. The speech data is adjusted for intonation and tone to make it sound as close to human speech as possible. The completed speech response is sent to the terminal and delivered to the user.
[0320] The device plays the received audio response through its speaker and provides it to the user. At this time, the device verifies that the response is appropriate to the user's state and environment, and plays it back at an appropriate volume and speed.
[0321] This system allows users to have a deeper conversational experience, not just exchange information, and functions as a source of daily emotional support. The entire system constantly evolves by incorporating user feedback, improving the accuracy and quality of responses.
[0322] The following describes the processing flow.
[0323] Step 1:
[0324] The user initiates a conversation by speaking into the device. The device's built-in microphone filters out ambient noise and captures the user's voice clearly.
[0325] Step 2:
[0326] The terminal converts the captured audio into digital data and compresses it. The compressed audio data is then ready to be sent to the server.
[0327] Step 3:
[0328] The server converts the audio data received from the terminal into text data using a speech recognition engine. An emotion engine installed within the server then analyzes the user's emotions from this audio data.
[0329] Step 4:
[0330] The server's emotion engine analyzes elements such as tone, speed, and pitch of speech to identify the user's emotions. For example, if speech is hesitant or choppy, it might map this to "anxiety."
[0331] Step 5:
[0332] The server sends the user's text data and sentiment information to a natural language processing module. Here, contextual analysis and sentiment data are integrated to gain a more detailed understanding of the speaker's intent.
[0333] Step 6:
[0334] The server uses generative AI to generate natural-sounding responses based on the analyzed data. The tone and content of the response are adjusted based on information obtained from the emotion engine. For example, if "anxiety" is detected, it will generate a response in a gentle tone such as, "Don't worry, I'm here to support you."
[0335] Step 7:
[0336] The server converts the generated text response into speech data using a speech synthesis engine. During this process, natural intonation is added, and the speech is optimized for easy understanding.
[0337] Step 8:
[0338] The server sends audio data to the terminal. Once the terminal receives the data, it delivers the audio to the user through its speaker.
[0339] Step 9:
[0340] The device monitors user responses in real time and sends the collected feedback to the server. This feedback is used as training data to further improve future responses.
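The feedback collected in Step 9 could be stored as one JSON object per line so that it can later be used as training data. The record fields and function names below are assumptions for illustration; the description does not define a feedback format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class FeedbackRecord:
    user_utterance: str
    detected_emotion: str
    system_response: str
    user_reaction: str  # hypothetical labels such as "positive" or "negative"


def to_jsonl(records: Iterable[FeedbackRecord]) -> str:
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> List[FeedbackRecord]:
    """Parse JSON Lines back into records, skipping blank lines."""
    return [FeedbackRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]


log = to_jsonl([
    FeedbackRecord(
        user_utterance="I could not sleep well.",
        detected_emotion="anxiety",
        system_response="Don't worry, I'm here to support you.",
        user_reaction="positive",
    )
])
print(log)
```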
[0341] (Example 2)
[0342] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0343] In modern society, the elderly often experience feelings of loneliness, and there is a need for psychological support. However, conventional dialogue systems have struggled to adequately recognize users' emotions and provide natural and supportive responses. As a result, there is a challenge in that opportunities for the elderly to feel secure are limited.
[0344] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0345] In this invention, the server includes means for converting voice information into text information and performing machine processing to analyze the user's intentions and emotions; means for generating a harmonious response; and means for converting the response into voice information and transmitting it to an information communication device. This makes it possible to appropriately recognize the user's emotions and provide psychological support through natural dialogue based on those emotions.
[0346] An "information and communication device" is a device that has the function of acquiring a user's voice and transmitting that voice information to a processing device.
[0347] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions through machine processing.
[0348] A "generative modeling tool" is a tool that has the function of generating a harmonious response based on analyzed emotional information and intentions.
[0349] A "voice conversion means" is a means that has the function of converting the generated response into voice information and transmitting it to an information communication device.
[0350] An "output device" is a device that has the function of playing back audio information received by an information and communication device and providing an audio response to the user.
[0351] This invention is a dialogue system designed to alleviate feelings of loneliness among the elderly. It aims to provide natural and psychologically supportive communication by recognizing the user's emotions and reflecting them in its responses.
[0352] When a user speaks into the device to initiate a conversation, the device acquires the voice using its built-in microphone. This voice information is processed using compression technology (e.g., AAC or MP3 format) to enable efficient data transfer. The voice data is then transmitted to the server via the information and communication network.
[0353] The server converts the received audio data into text data using a speech recognition engine (e.g., general speech recognition software). This ensures that the user's speech is treated as textual information. Subsequently, an emotion analysis module within the server analyzes the user's voice characteristics, such as tone, speed, pitch, and resonance, to identify their emotional state. Machine learning techniques are used in this emotion analysis.
[0354] Using the analyzed emotion information, the server generates the optimal response using a generative AI model (for example, a proprietary generative AI). By inputting prompts into the generative AI, an appropriate response tailored to the user's emotions is elicited. A concrete example of a prompt might be, "The user is calm, so respond in a relaxed tone."
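One way to tailor the prompt to the detected emotion is a lookup from emotion label to tone instruction. In the sketch below, only the "calm" entry follows the example prompt in the text; the other entries, the default tone, and the function name are hypothetical.

```python
# Emotion label -> tone instruction. Only "calm" comes from the description.
TONE_BY_EMOTION = {
    "calm": "a relaxed tone",
    "anxious": "a gentle, reassuring tone",  # hypothetical
    "excited": "an upbeat tone",             # hypothetical
}
DEFAULT_TONE = "a warm, neutral tone"        # hypothetical fallback


def emotion_prompt(emotion: str, utterance: str) -> str:
    """Build a prompt that states the emotion and the matching tone."""
    label = emotion.strip().lower()
    tone = TONE_BY_EMOTION.get(label, DEFAULT_TONE)
    return (
        f"The user is {label}, so respond in {tone}. "
        f"The user said: '{utterance}'"
    )


print(emotion_prompt("calm", "I had a nice walk this morning."))
```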
[0355] The generated text response is converted into speech data by a speech synthesis engine (e.g., common speech synthesis software). This speech data is then adjusted for intonation and tone and transmitted to the terminal over the network.
[0356] The device provides the user with the received audio data through its speaker. At this time, the device adjusts the volume and playback speed according to the surrounding environment and the user's state. This allows the user to have a natural and supportive conversational experience.
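The volume and speed adjustment could, for instance, be derived from a measured ambient noise level and a flag describing the user's state. The linear mapping, the decibel range, the slower speed value, and the `user_tired` flag are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class PlaybackSettings:
    volume: float  # 0.0 (silent) to 1.0 (maximum)
    speed: float   # 1.0 is normal speed


def choose_playback(ambient_db: float, user_tired: bool = False) -> PlaybackSettings:
    """Pick volume from ambient noise and speed from the user's state.

    Hypothetical mapping: 30 dB or quieter gives volume 0.4, 70 dB or louder
    gives 1.0, with linear interpolation in between.
    """
    volume = 0.4 + (ambient_db - 30.0) * (0.6 / 40.0)
    volume = max(0.4, min(1.0, volume))
    speed = 0.85 if user_tired else 1.0
    return PlaybackSettings(volume=round(volume, 2), speed=speed)


print(choose_playback(ambient_db=50.0))
print(choose_playback(ambient_db=75.0, user_tired=True))
```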
[0357] This system allows users to enjoy not just information exchange, but more personalized psychological support. The terminal sends user feedback collected during the interaction to the server, allowing the system to continuously improve the accuracy and quality of its responses.
[0358] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0359] Step 1:
[0360] The user speaks into the device, inputting voice. The device uses its built-in microphone to capture this voice. The captured voice data is compressed using compression techniques such as AAC format and sent to the server via the information and communication network. The input is raw voice data, and the output is a compressed audio file.
[0361] Step 2:
[0362] The server decompresses the received compressed audio data and converts it into text data using a speech recognition engine. The input is compressed audio data, and the output is text-based character information. Specifically, it performs analysis using widely available speech recognition software.
[0363] Step 3:
[0364] The server processes text data using an emotion analysis module to analyze the user's emotional state. The input is textual information, and the output is the user's emotional state (e.g., "stable," "excited," etc.). Emotion analysis utilizes techniques such as voice analysis, evaluating factors like tone, speed, and pitch.
[0365] Step 4:
[0366] The server uses a generative AI model to generate responses based on emotional states and text data. Here, prompts such as "respond in a relaxed tone that matches the current emotional state" are input, and optimized response text is output. A concrete example would be reassuring words appropriate to the situation.
[0367] Step 5:
[0368] The server converts the generated response text into speech data using a speech synthesis engine. The input is response data in text format, and the output is response data in speech format. In speech synthesis, the intonation and tone are adjusted to be as close to human as possible.
[0369] Step 6:
[0370] The terminal receives audio data transmitted from the server and plays it through the speaker. The input is response data in audio format, and the output is delivered to the user as actual audio. The terminal adjusts the volume according to the user's environment and plays the audio in the optimal way.
[0371] (Application Example 2)
[0372] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0373] The challenge is to provide a dialogue system that allows for natural communication in order to alleviate feelings of loneliness and mental stress among the elderly. Furthermore, this system is required to generate supportive responses that are tailored to the emotional state of the elderly and to provide psychological comfort.
[0374] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0375] In this invention, the server includes a processing device that converts the user's voice into text information and performs natural language processing to analyze the user's intentions and emotions; a generation model that generates a natural response based on the analysis results; and a synthesis device that converts the generated response into voice information and transmits it to an input / output device. This makes it possible to generate responses based on the user's emotions and provide psychological support to the elderly.
[0376] An "input / output device" is a device that acquires the user's voice in real time and plays back the received voice information to the user.
[0377] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions using natural language processing.
[0378] A "generative model" is an artificial intelligence model designed to generate natural responses based on analyzed user intentions and emotions.
[0379] A "synthesis device" is a device that converts text information generated by a generative model into speech information and transmits it to an input / output device.
[0380] "Psychological support" refers to providing elderly people with a sense of security and peace of mind through responses based on the user's emotions.
[0381] The embodiments for carrying out the invention are as follows:
[0382] This system consists of a terminal, a server, a generative AI model, and a network connecting them. The terminal is responsible for receiving the voice of elderly people in real time and transmitting this voice data to the server.
[0383] The server first uses a speech processing engine to convert speech data into text data. This process utilizes "speech recognition software." Subsequently, a "natural language processing engine" is used to analyze the user's intent and emotions using natural language processing technology. Based on the emotional information analyzed here, a generative AI model generates an appropriate response. For example, a "generative AI engine" is used for this response generation.
[0384] The generated text information is converted into speech data by a speech synthesis engine. At this stage, "speech synthesis software" is used, and the generated speech data is sent back to the terminal.
[0385] The device plays back received audio data and provides it to the elderly. During playback, the audio should be delivered at a volume and speed suited to the user's environment and condition.
[0386] For example, if an elderly person using the device says, "I feel lonely today," the system analyzes the utterance, determines that the emotion is "loneliness," and generates a response such as, "Shall we watch your favorite movie together?" This response aims to provide comfortable and reassuring communication.
[0387] An example of a prompt would be: "Generate a reassuring response when the user is feeling down. The user said, 'I'm feeling kind of lonely today.'"
[0388] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0389] Step 1:
[0390] The device captures the user's voice in real time and takes the voice data as input. The acquired voice data is compressed and output to the server over the network.
[0391] Step 2:
[0392] The server passes the received audio data as input to the speech processing engine. Speech recognition software is used to convert the audio data into text data. The converted text data is then output to the next stage of processing.
[0393] Step 3:
[0394] The server inputs text data into a natural language processing engine to analyze the user's intent and emotions. The analysis results are output as emotion data, which is then passed to a generative AI model.
[0395] Step 4:
[0396] The server inputs the analysis results into the generative AI model and generates a natural response based on the user's emotions. It then outputs the generated response text and sends it to the speech synthesis engine.
[0397] Step 5:
[0398] The server uses speech synthesis software to convert the generated response text into audio data. It then outputs the converted audio data and transmits it to the terminal via the network.
[0399] Step 6:
[0400] The device receives audio data as input and plays it back to the user through the speaker. The volume and speed are optimized according to the user's environment.
[0401] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0402] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
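The input to the data generation model 58, a prompt with instructions plus inference data of several kinds, can be represented as a small request structure. The sketch below is an assumed data layout for illustration only; the type names and the validation rules are hypothetical and do not correspond to any particular model's API.

```python
from dataclasses import dataclass
from typing import List

# Kinds of inference data named in the description.
ALLOWED_KINDS = {"audio", "text", "image"}


@dataclass
class InferenceData:
    kind: str       # one of ALLOWED_KINDS
    payload: bytes  # raw content, e.g. encoded audio or UTF-8 text


@dataclass
class ModelRequest:
    instruction: str            # the instruction carried by the prompt
    data: List[InferenceData]   # inference data passed along with it

    def validate(self) -> None:
        """Reject empty instructions and unknown data kinds."""
        if not self.instruction.strip():
            raise ValueError("prompt must contain an instruction")
        for item in self.data:
            if item.kind not in ALLOWED_KINDS:
                raise ValueError(f"unsupported inference data kind: {item.kind}")


request = ModelRequest(
    instruction="Summarize what the user said.",
    data=[InferenceData(kind="text", payload="I feel lonely today.".encode("utf-8"))],
)
request.validate()
print(len(request.data), request.data[0].kind)
```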
[0403] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0404] [Third Embodiment]
[0405] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0406] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0407] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0408] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0409] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0410] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0411] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0412] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0413] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0414] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0415] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0416] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0417] This invention provides a specific embodiment for realizing natural dialogue with users in a communication system for the elderly. This system is mainly composed of terminals, a server, and a network connecting them.
[0418] First, when a user begins interacting with this system, they speak into the terminal, which then acquires their voice. The terminal is equipped with a highly sensitive microphone designed to accurately capture the user's speech while filtering out external noise. The acquired voice data is then transmitted to the server in real time using compression technology.
[0419] Next, the server converts the received audio data into text format using speech recognition technology. At this stage, the server uses natural language processing to analyze the context behind the audio, the user's emotions, and even the intent of the conversation. For example, if the user asks, "What's the weather like today?", the server understands that the user is looking for the latest weather information.
[0420] Once the analysis is complete, the server uses generative AI to generate an appropriate response. This AI is trained on a massive dataset and is designed to produce natural, human-like responses. For example, it might generate a response such as, "It's sunny today. Do you have any plans to go out?"
[0421] The generated text responses are converted into natural-sounding voice data by a speech synthesis engine on the server. This generated voice data is sent to the terminal and delivered to the user through the speaker. Because the user never has to read or type text and the entire conversation is conducted by voice, the system provides a more immersive and personalized dialogue experience.
[0422] Furthermore, as an information provision function, the server retrieves information from external sources in response to user requests and presents it in a format optimized for the user's needs. This information covers a wide range of topics, such as news, health information, and activities related to hobbies.
[0423] Finally, the terminal stores the history of user interactions and feedback and sends it to the server. The server uses this feedback as machine learning input to improve the system's future response performance.
[0424] The goal of this entire system is to improve the quality of life for users by creating an environment where elderly people can enjoy conversations on a daily basis and by reducing feelings of loneliness.
[0425] The following describes the processing flow.
[0426] Step 1:
[0427] The user initiates communication by speaking into the device. The device's built-in microphone captures the user's voice in real time.
[0428] Step 2:
[0429] The terminal converts the captured audio into a digital signal, compresses it, and prepares it for transmission to the server. To facilitate speech recognition processing, it performs noise reduction and processes the data to ensure the audio remains clear.
[0430] Step 3:
[0431] The server processes the received audio data using a speech recognition engine and converts it into text data. The server then passes this text data to a natural language processing module for analysis, which extracts the user's intent and emotions.
[0432] Step 4:
[0433] The server uses generative AI to generate appropriate responses based on the analyzed intent and emotional information. This process also takes conversation history and context into consideration, resulting in more natural and human-like responses.
[0434] Step 5:
[0435] The server passes the generated text response to a speech synthesis module, which converts it into speech data. The converted speech is synthesized to have a natural intonation and a human-like sound.
[0436] Step 6:
[0437] The server sends the generated audio data to the terminal. The audio data is delivered to the user quickly while maintaining high quality.
[0438] Step 7:
[0439] The device plays the received audio data through its speaker, providing a response to the user. Through this output, the device establishes its presence as a companion to the user.
[0440] Step 8:
[0441] The device tracks user responses and additional interactions, sending this data to the server as feedback. This feedback is used to improve the system's response accuracy.
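The server-side part of Steps 3 to 6 can be sketched as a chain of replaceable stages. The following is a minimal illustration under stated assumptions, not the implementation of this disclosure: the names DialogueServer, Turn, and the stub_* functions are hypothetical, and the stubs treat UTF-8 text as a stand-in for real speech data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """One exchange kept as conversation history (used as context in Step 4)."""
    user_text: str
    intent: str
    emotion: str
    reply_text: str


@dataclass
class DialogueServer:
    # Each stage is injected so that a real engine can replace the stub.
    recognize: Callable[[bytes], str]                      # Step 3: speech -> text
    analyze: Callable[[str], Tuple[str, str]]              # Step 3: text -> (intent, emotion)
    generate: Callable[[str, str, str, List[Turn]], str]   # Step 4: response text
    synthesize: Callable[[str], bytes]                     # Step 5: text -> speech data
    history: List[Turn] = field(default_factory=list)

    def handle(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        intent, emotion = self.analyze(text)
        reply = self.generate(text, intent, emotion, self.history)
        self.history.append(Turn(text, intent, emotion, reply))
        return self.synthesize(reply)  # Step 6: returned to the terminal


# Stub stages for illustration only: "audio" is UTF-8 text standing in for speech.
def stub_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_analyze(text: str) -> Tuple[str, str]:
    intent = "weather" if "weather" in text.lower() else "chat"
    return intent, "neutral"


def stub_generate(text: str, intent: str, emotion: str, history: List[Turn]) -> str:
    return f"[{intent}/{emotion}] reply to: {text}"


def stub_synthesize(reply: str) -> bytes:
    return reply.encode("utf-8")


server = DialogueServer(stub_recognize, stub_analyze, stub_generate, stub_synthesize)
speech = server.handle(b"What's the weather like today?")
```

Passing the stages in as callables keeps the flow independent of any particular speech recognition, generative AI, or speech synthesis product.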
[0442] (Example 1)
[0443] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0444] For the elderly and users who communicate via voice, it is crucial to create communication systems that provide natural and human-like responses. However, conventional technologies often fall short in terms of convenience and response accuracy, with particular challenges in noise reduction of voice input, emotion analysis, and appropriate response generation. Furthermore, there is currently a lack of effective methods for improving system performance by utilizing user feedback.
[0445] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0446] In this invention, the server includes an analysis means that converts voice data into text data and performs natural language processing to analyze the user's intentions and emotions; a generation model means that generates human-like responses; and a synthesis means that converts the generated responses into acoustic data and outputs it. This enables smooth interaction with the user, provides highly accurate responses, and allows for continuous improvement of system performance through the use of feedback.
[0447] "Acquisition means for acquiring sound, compressing the sound data, and transmitting it to an information processing device" refers to a device or system that has the function of capturing the sound emitted by a user, compressing it for effective data transfer, and then transmitting it to an information processing device such as a server via a network.
[0448] "An analytical means that converts audio data into text data and performs natural language processing to analyze the user's intent and emotions" refers to a technology or system that converts audio data into text format and uses language processing techniques to understand the context, emotions, and desired content from the text.
[0449] A "generative model means for generating human-like responses" is a model or program that uses AI technology to generate natural conversations in response to user input, creating responses that sound as if a human were responding.
[0450] "A synthesis means that converts the generated response into acoustic data and outputs it" refers to a device or system that uses speech synthesis technology to convert text generated in natural language into speech and output it in a format that can be heard by the user.
[0451] "Recording means for recording the history of conversations with users and feedback from users, and transmitting them for use by the analysis means" refers to a device or software that has the function of saving the content of past conversations with users and feedback information such as suggestions for improvement obtained from users, and transmitting it to a server for use in future analysis and system performance improvement.
[0452] This invention is a communication system that enables natural, voice-based dialogue for users, including the elderly. The system primarily consists of terminals, a server, and a network.
[0453] When a user initiates a voice interaction, the terminal device first receives the audio. This terminal is equipped with a high-sensitivity microphone to accurately capture the user's speech. Noise cancellation technology is employed to effectively remove external noise. The acquired audio data is compressed in real time and transferred to a server. AAC and Opus are commonly used for this audio compression.
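The real-time transfer described above can be illustrated by splitting captured audio into short frames and sending each frame as soon as it is compressed. The sketch below is an assumption-laden illustration: it uses zlib, a lossless general-purpose compressor, purely as a stand-in for an audio codec such as AAC or Opus, and the frame size, packet layout, and function names are hypothetical.

```python
import struct
import zlib

FRAME_BYTES = 640  # 20 ms of 16 kHz, 16-bit mono PCM (assumed capture format)


def encode_frames(pcm: bytes):
    """Yield packets of [sequence number][payload length][compressed payload]."""
    for seq, start in enumerate(range(0, len(pcm), FRAME_BYTES)):
        payload = zlib.compress(pcm[start:start + FRAME_BYTES])
        # "!IH" = network byte order, 4-byte sequence number, 2-byte length
        yield struct.pack("!IH", seq, len(payload)) + payload


def decode_frames(packets) -> bytes:
    """Reorder packets by sequence number and rebuild the PCM stream."""
    frames = {}
    for packet in packets:
        seq, length = struct.unpack("!IH", packet[:6])
        frames[seq] = zlib.decompress(packet[6:6 + length])
    return b"".join(frames[seq] for seq in sorted(frames))
```

The sequence number lets the server restore the original order even if packets arrive out of order on the network.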
[0454] The server decodes the received compressed audio data. The decoded audio data is converted into text data using speech recognition software, for which commercially available speech recognition APIs can be used. Subsequently, natural language processing (NLP) libraries are used to analyze the user's intent and emotions. The Natural Language Toolkit (NLTK) and spaCy are examples of libraries suitable for this purpose.
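As a rough illustration of what the analysis stage outputs, the following sketch labels recognized text with an intent and an emotion by keyword matching. This is a deliberately simple stand-in for an NLP library or trained model; the lexicons, label names, and function names are all hypothetical.

```python
import re

# Hypothetical keyword lexicons; a deployed system would use a trained model.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "sunny", "forecast"},
    "schedule": {"schedule", "appointment", "plan", "plans"},
    "news": {"news", "headline", "headlines"},
}
EMOTION_KEYWORDS = {
    "loneliness": {"lonely", "alone"},
    "anxiety": {"worried", "anxious", "afraid"},
    "joy": {"happy", "glad", "fun"},
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def best_label(tokens, lexicon, default):
    """Pick the label whose keyword set overlaps the tokens most."""
    scores = {label: len(tokens & words) for label, words in lexicon.items()}
    label = max(scores, key=scores.get)
    return label if scores[label] > 0 else default


def analyze(text):
    tokens = tokenize(text)
    return {
        "intent": best_label(tokens, INTENT_KEYWORDS, "small_talk"),
        "emotion": best_label(tokens, EMOTION_KEYWORDS, "neutral"),
    }
```

For the utterance "What's the weather like today?", this sketch yields the intent "weather" and the emotion "neutral".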
[0455] Based on the analysis results, the server generates a response using a generative AI. This generative AI is a model trained on a large dataset. Examples of such models include widely known natural language generation technologies, such as the GPT series. The generated response is prepared in an optimized text format.
[0456] Text-based responses are converted to speech using a speech synthesis engine, which adjusts the output so that it sounds like natural speech with natural intonation and emotion. The synthesized speech data is then compressed again and sent to the terminal, where it is delivered to the user through the speaker.
[0457] In addition, in response to specific user requests, the server retrieves the latest information from a wide-area communication network and provides summarized information through NLP and AI services. This includes a variety of content, such as news summaries and weather information.
[0458] Furthermore, the terminal records user feedback and sends it to the server. This feedback is analyzed by machine learning algorithms, contributing to improvements in the accuracy and quality of responses. Through this feedback process, the system continuously evolves and can respond quickly and flexibly to user needs.
[0459] As a concrete example, when a user asks, "What's the weather like this week?", the system retrieves weather data and provides a specific response such as, "There will be many sunny days this week. Do you have any plans to go out?". To enable this scenario, an example of a prompt to input into the generative AI model is, "Generate what the user will say next to have a natural conversation with an elderly person: 'What's the weather like this week?'"
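One possible way to combine retrieved weather data with the user's utterance before it is passed to the generative AI model is sketched below. The forecast values are placeholder data, the prompt wording differs from the example prompt above, and the function names summarize_week and build_prompt are hypothetical.

```python
from collections import Counter


def summarize_week(daily_conditions):
    """Reduce a list of per-day conditions to one factual sentence."""
    condition, days = Counter(daily_conditions).most_common(1)[0]
    return f"{days} of {len(daily_conditions)} days this week are forecast to be {condition}."


def build_prompt(user_utterance, retrieved_fact):
    """Combine retrieved information with the utterance into one prompt."""
    return (
        "You are talking with an elderly person. Reply naturally and warmly.\n"
        f"Retrieved information: {retrieved_fact}\n"
        f"User said: '{user_utterance}'"
    )


forecast = ["sunny"] * 5 + ["rainy"] * 2  # placeholder data, not a real forecast
fact = summarize_week(forecast)
prompt = build_prompt("What's the weather like this week?", fact)
```

Keeping the retrieved fact in a separate line of the prompt makes it easy to see which part of the response is grounded in external data.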
[0460] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0461] Step 1:
[0462] Voice input begins when the user speaks into the device. The device's built-in microphone captures this voice. The input voice is processed by a noise-canceling algorithm to reduce background noise. The result is clear voice data. This voice data is compressed in real time in AAC format and immediately sent to the server.
[0463] Step 2:
[0464] The server receives compressed audio data sent from the terminal. At this stage, the audio data is in AAC format, so it is first decompressed. The decompressed audio data is converted into text format using a speech recognition engine, such as a speech recognition API. The input is the decompressed audio data, and the output is the corresponding text data.
[0465] Step 3:
[0466] The server further analyzes the text data. Using a natural language processing toolkit, it interprets the grammatical structure, sentiment, and intent of the text. This analysis yields contextual information and intent as output, based on the input text data. This constitutes the sentiment and intent parameters necessary for subsequent response generation.
[0467] Step 4:
[0468] The server uses a generative AI model to generate specific responses based on the intentions and emotions obtained in the analysis step. Here, the generative AI model responds to prompt sentences and forms a natural dialogue. In this step, the analysis results are taken as input and human-like response text is output.
[0469] Step 5:
[0470] The synthesis engine on the server converts the generated text response into speech data. This process uses high-quality speech synthesis technology to transform the text data into smooth speech. The input is the generated response text, and the output is the corresponding speech data.
[0471] Step 6:
[0472] The server encodes the synthesized speech data and sends it to the terminal in a compressed format. The terminal then decompresses this data and provides it to the user through its speakers. For the user, this results in natural, flowing speech.
[0473] Step 7:
[0474] The terminal records user interaction history and feedback. This feedback is sent to the server to be used for system improvement. The input is user feedback information, and the output is analytical data based on that feedback. This allows the system to be improved to provide even more user-friendly responses.
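Step 7 takes user feedback as input and produces analytical data as output. A minimal sketch of such an analysis, assuming a hypothetical 1 to 5 rating attached to each response and grouped by the intent label from Step 3, might look like this:

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Feedback:
    intent: str  # intent label assigned in Step 3
    rating: int  # hypothetical 1 (poor) to 5 (good) score from the user


def summarize_feedback(records, threshold=3.0):
    """Return the mean rating per intent and the intents below the threshold."""
    by_intent = defaultdict(list)
    for record in records:
        by_intent[record.intent].append(record.rating)
    means = {intent: mean(ratings) for intent, ratings in by_intent.items()}
    needs_work = sorted(i for i, m in means.items() if m < threshold)
    return means, needs_work
```

The list of low-scoring intents indicates where response generation would benefit most from improvement; the threshold value is an arbitrary assumption.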
[0475] (Application Example 1)
[0476] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0477] To alleviate feelings of loneliness and improve the quality of life for certain groups, such as the elderly, natural and personalized information provision and health management support are necessary. However, current voice dialogue systems have challenges in understanding emotions and managing health information, often failing to provide sufficient satisfaction. Furthermore, there is a lack of efficient support methods for regular and individual schedule management and health information recording.
[0478] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0479] In this invention, the server includes means for converting voice information into document format and performing natural language analysis to analyze the user's purpose and emotions; means for generating information and generating natural responses based on the analysis results; and means for recording the user's health information and assisting with daily schedule management. This makes it possible for elderly people to enjoy conversations naturally in their daily lives while receiving personalized information and health management support.
[0480] "User utterance" refers to the act of a user communicating information using their voice. The device acquires this voice information through this action.
[0481] A "central processing unit" refers to a device that converts audio information into document format and further analyzes the user's intent and emotions through natural language processing. This device is connected to terminals via a network.
[0482] "Information generation processing means" refers to technology for generating natural-sounding responses based on the results analyzed by the central processing unit. The generated information is provided to the user as audio.
[0483] "Voice conversion means" refers to a technology that converts generated text information into voice data and transmits it to a terminal. This enables voice interaction.
[0484] "Output means" refers to a device or function incorporated into the terminal to provide audio information to the user. Using this means, the generated audio is played back through the speaker.
[0485] "Support features" refer to functions that record the user's health information and assist with daily schedule management. These features provide users with appropriate support.
[0486] This invention is a natural voice dialogue system for supporting the daily lives of the elderly. The system mainly consists of three elements: a "terminal," a "central processing unit," and a "user."
[0487] The device is equipped with a highly sensitive microphone and speaker, allowing for high-precision collection of user-generated audio. The audio data is converted to a digital format and transmitted to a central processing unit using data compression technology. Hardware used includes smartphones and dedicated electronic terminals.
[0488] The central processing unit first converts speech information into text using a speech recognition API (e.g., open-source speech recognition software or a cloud service). Next, it uses a natural language processing framework to analyze the user's intent and emotions. Based on this analysis, a generative AI model generates a response appropriate to the user. This model is trained on a large amount of dialogue data, enabling it to provide natural and user-friendly responses.
[0489] Furthermore, it includes health management and schedule management functions, recording and analyzing the user's daily health information as needed. This allows for personalized health advice to be provided to each individual user.
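The recording and analysis of daily health information could, for instance, be organized as a dated log with a simple summary over recent days. The sketch below is illustrative only: the HealthLog class, the metric name "sleep_hours", and the seven-day window are assumptions, and no medical thresholds are implied.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from statistics import mean


@dataclass
class HealthLog:
    entries: list = field(default_factory=list)  # (day, metric, value) tuples

    def record(self, day: date, metric: str, value: float) -> None:
        self.entries.append((day, metric, value))

    def recent_mean(self, metric: str, today: date, days: int = 7):
        """Mean of a metric over the last `days` days, or None if no data."""
        start = today - timedelta(days=days - 1)
        values = [v for d, m, v in self.entries if m == metric and start <= d <= today]
        return mean(values) if values else None


log = HealthLog()
log.record(date(2024, 11, 20), "sleep_hours", 4.0)  # outside the 7-day window
log.record(date(2024, 12, 1), "sleep_hours", 6.0)
log.record(date(2024, 12, 7), "sleep_hours", 8.0)
weekly_sleep = log.recent_mean("sleep_hours", today=date(2024, 12, 7))
```

A summary value such as this could be included in the prompt so that the generated advice reflects the individual user's recent records.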
[0490] For example, if an elderly person asks, "What's my schedule for today?", the system will respond, "You have an exercise session scheduled for this morning, followed by a relaxation session." This kind of daily voice support allows users to live their daily lives with peace of mind.
[0491] As an example of a prompt, the generative AI model would be instructed as follows: "Generate a response that provides gentle-toned health management advice for elderly people to enjoy everyday conversation."
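The schedule response in the example above can be derived from stored schedule entries. The following sketch, which assumes entries are stored as hypothetical (start time, title) pairs, builds a spoken-style sentence from them; the function name describe_schedule is also an assumption.

```python
from datetime import time


def describe_schedule(entries):
    """Turn (start_time, title) pairs into one spoken-style sentence."""
    if not entries:
        return "You have nothing scheduled for today."
    ordered = sorted(entries)  # earliest start time first
    first_time, first_title = ordered[0]
    part = "this morning" if first_time < time(12, 0) else "this afternoon"
    sentence = f"You have {first_title} scheduled for {part}"
    if len(ordered) > 1:
        rest = ", then ".join(title for _, title in ordered[1:])
        sentence += f", followed by {rest}"
    return sentence + "."


today = [
    (time(11, 0), "a relaxation session"),
    (time(9, 30), "an exercise session"),
]
answer = describe_schedule(today)
```

With the two entries above, the sketch produces the same sentence as the schedule example in this description. In practice the sentence could also be passed to the generative AI model for rephrasing in a gentler tone.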
[0492] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0493] Step 1:
[0494] The user speaks into the device. The device is equipped with a high-sensitivity microphone that accurately captures the user's voice. The input is the user's raw voice data. The device converts this voice data into a digital format, compresses the data, and transmits it to the central processing unit in real time. The output is compressed voice data.
[0495] Step 2:
[0496] The server (central processing unit) converts the received audio data into text format using a speech recognition API. The input is digital audio data. This conversion transforms the audio into text information. The output is text data. Next, a natural language processing algorithm is used to analyze the user's intent and emotions. This process helps understand what the user wants and their emotional state. The output of the analysis is metadata indicating the user's intent.
[0497] Step 3:
[0498] The server uses a generative AI model to generate appropriate responses to the user based on the analyzed data. The input is metadata containing the user's intent. The generative AI model is trained on a large-scale dialogue dataset and has the ability to generate natural and appropriate responses. The output of this process is the text data of the response.
[0499] Step 4:
[0500] The server converts the generated text response into speech data using a speech synthesis engine. The input is text data. Through speech synthesis, the text information is transformed into natural-sounding speech. This speech data is sent to the terminal. The output is the converted speech data.
[0501] Step 5:
[0502] The device plays the received audio data to the user through its speaker. The input is audio data from the server. This allows the user to receive natural-sounding voice responses from the system. The output is physical sound. The user can experience voice interaction and make decisions about their daily actions based on the responses.
[0503] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0504] The present invention aims to provide more natural and psychologically supportive communication in a dialogue system designed to alleviate feelings of loneliness among the elderly by incorporating a function that recognizes the user's emotions and reflects them in the response. Specific embodiments are shown below.
[0505] The system consists of terminals, servers, an emotion engine, and a network connecting them.
[0506] When a user speaks into the device, the device's microphone captures the audio in real time. The audio data is compressed and sent to the server.
[0507] The server converts the received audio data into text data using speech recognition technology. Furthermore, an emotion engine located within the server analyzes the user's emotions using this audio data. This analysis takes into account the tone, speed, pitch, and resonance of the user's voice. For example, if the user speaks slowly in a calm tone, the emotion engine recognizes this as a "stable" state, while if they speak quickly and at a high pitch, it determines that they are "excited."
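The two cases in the example above (slow and calm speech versus fast and high-pitched speech) can be expressed as a rule over voice features. The sketch below is a simplified illustration, not the emotion identification model 59 itself: the feature set, the comparison against a per-user baseline, and every threshold value are assumptions.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    words_per_minute: float  # speaking speed
    mean_pitch_hz: float     # average fundamental frequency


def classify_emotion(features, baseline):
    """Compare an utterance with the same user's baseline features."""
    rate_ratio = features.words_per_minute / baseline.words_per_minute
    pitch_ratio = features.mean_pitch_hz / baseline.mean_pitch_hz
    if rate_ratio > 1.2 and pitch_ratio > 1.15:    # fast and high-pitched
        return "excited"
    if rate_ratio <= 1.0 and pitch_ratio <= 1.05:  # slow and calm
        return "stable"
    return "neutral"


baseline = VoiceFeatures(words_per_minute=120, mean_pitch_hz=180)
calm = classify_emotion(VoiceFeatures(100, 175), baseline)
lively = classify_emotion(VoiceFeatures(160, 220), baseline)
```

Comparing against the user's own baseline rather than fixed absolute values is one way to account for natural differences in voice between individuals.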
[0508] Next, the server uses natural language processing to combine the user's statements with emotional information, and the generative AI generates the optimal response. In this process, emotional data influences the tone and content of the response. For example, if the server detects that the user is feeling anxious, it will generate a more supportive and reassuring response such as, "It's okay. Is there anything I can do to help?"
[0509] The generated text response is converted into speech by a speech synthesis engine on the server. The speech data is adjusted for intonation and tone to make it sound as close to human speech as possible. The completed speech response is sent to the terminal and delivered to the user.
[0510] The device plays the received audio response through its speaker and provides it to the user. At this time, the device verifies that the response is appropriate to the user's state and environment, and plays it back at an appropriate volume and speed.
[0511] This system allows users to have a deeper conversational experience, not just exchange information, and functions as a source of daily emotional support. The entire system constantly evolves by incorporating user feedback, improving the accuracy and quality of responses.
[0512] The following describes the processing flow.
[0513] Step 1:
[0514] The user initiates a conversation by speaking into the device. The device's built-in microphone filters out ambient noise and captures the user's voice clearly.
[0515] Step 2:
[0516] The terminal converts the captured audio into digital data and compresses it. The compressed audio data is then ready to be sent to the server.
[0517] Step 3:
[0518] The server converts the audio data received from the terminal into text data using a speech recognition engine. An emotion engine installed within the server then analyzes the user's emotions from this audio data.
[0519] Step 4:
[0520] The server's emotion engine analyzes elements such as tone, speed, and pitch of speech to identify the user's emotions. For example, if speech is hesitant or choppy, it might map this to "anxiety."
[0521] Step 5:
[0522] The server sends the user's text data and sentiment information to a natural language processing module. Here, contextual analysis and sentiment data are integrated to gain a more detailed understanding of the speaker's intent.
[0523] Step 6:
[0524] The server uses generative AI to generate natural-sounding responses based on the analyzed data. The tone and content of the response are adjusted based on information obtained from the emotion engine. For example, if "anxiety" is detected, it will generate a response in a gentle tone such as, "Don't worry, I'm here to support you."
[0525] Step 7:
[0526] The server converts the generated text response into speech data using a speech synthesis engine. During this process, natural intonation is added, and the speech is optimized for easy understanding.
[0527] Step 8:
[0528] The server sends audio data to the terminal. Once the terminal receives the data, it delivers the audio to the user through its speaker.
[0529] Step 9:
[0530] The device monitors user responses in real time and sends the collected feedback to the server. This feedback is used as training data to further improve future responses.
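The mapping of hesitant or choppy speech to "anxiety" in Step 4 can be illustrated by measuring how much of an utterance consists of pauses. This is a minimal sketch under stated assumptions: it takes a list of per-frame energy values as input, and the silence threshold, the maximum pause ratio, and the function names are hypothetical.

```python
def pause_ratio(frame_energies, silence_threshold=0.02):
    """Fraction of frames between the first and last voiced frame that are silent."""
    voiced = [i for i, e in enumerate(frame_energies) if e >= silence_threshold]
    if len(voiced) < 2:
        return 0.0
    span = frame_energies[voiced[0]:voiced[-1] + 1]
    silent = sum(1 for e in span if e < silence_threshold)
    return silent / len(span)


def label_hesitancy(frame_energies, max_ratio=0.4):
    """Label an utterance as "anxiety" when pauses dominate it."""
    return "anxiety" if pause_ratio(frame_energies) > max_ratio else "stable"


fluent = [0.0, 0.3, 0.4, 0.3, 0.35, 0.0]
choppy = [0.3, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.3]
```

Leading and trailing silence is excluded so that only pauses inside the utterance count toward the ratio.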
[0531] (Example 2)
[0532] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0533] In modern society, the elderly often experience feelings of loneliness, and there is a need for psychological support. However, conventional dialogue systems have struggled to adequately recognize users' emotions and provide natural and supportive responses. As a result, opportunities for the elderly to feel secure are limited.
[0534] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0535] In this invention, the server includes means for converting voice information into text information and performing machine processing to analyze the user's intentions and emotions; means for generating a harmonious response; and means for converting the response into voice information and transmitting it to an information communication device. This makes it possible to appropriately recognize the user's emotions and provide psychological support through natural dialogue based on those emotions.
[0536] An "information and communication device" is a device that has the function of acquiring a user's voice and transmitting that voice information to a processing device.
[0537] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions through machine processing.
[0538] A "generative modeling tool" is a tool that has the function of generating a harmonious response based on analyzed emotional information and intentions.
[0539] A "voice conversion means" is a means that has the function of converting the generated response into voice information and transmitting it to an information communication device.
[0540] An "output device" is a device that has the function of playing back audio information received by an information and communication device and providing an audio response to the user.
[0541] This invention is a dialogue system designed to alleviate feelings of loneliness among the elderly. It aims to provide natural and psychologically supportive communication by recognizing the user's emotions and reflecting them in its responses.
[0542] When a user speaks into the device to initiate a conversation, the device acquires the voice using its built-in microphone. This voice information is processed using compression technology (e.g., AAC or MP3 format) to enable efficient data transfer. The voice data is then transmitted to the server via the information and communication network.
[0543] The server converts the received audio data into text data using a speech recognition engine (e.g., general speech recognition software). This ensures that the user's speech is treated as textual information. Subsequently, an emotion analysis module within the server analyzes the user's voice characteristics, such as tone, speed, pitch, and resonance, to identify their emotional state. Machine learning techniques are used in this emotion analysis.
[0544] Using the analyzed emotion information, the server generates the optimal response using a generative AI model (for example, a proprietary generative AI). By inputting prompts into the generative AI, an appropriate response tailored to the user's emotions is elicited. A concrete example of a prompt might be, "The user is calm, so respond in a relaxed tone."
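A prompt of the kind described above can be assembled from the detected emotion. In the sketch below, the mapping table and the function name build_emotion_prompt are hypothetical; only the "calm" entry mirrors the example prompt in this description.

```python
# Hypothetical mapping from detected emotion to (description, tone instruction).
TONE_BY_EMOTION = {
    "stable": ("calm", "respond in a relaxed tone"),
    "excited": ("excited", "respond warmly and at an unhurried pace"),
    "anxiety": ("anxious", "respond in a gentle, reassuring tone"),
}


def build_emotion_prompt(emotion, user_text):
    """Build a prompt that states the user's emotion and the desired tone."""
    description, instruction = TONE_BY_EMOTION.get(
        emotion, ("in an unknown mood", "respond in a neutral, friendly tone")
    )
    return f"The user is {description}, so {instruction}. The user said: '{user_text}'"


prompt = build_emotion_prompt("stable", "Good morning")
```

An emotion label that is not in the table falls back to a neutral instruction, so the generative AI still receives a complete prompt.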
[0545] The generated text response is converted into speech data by a speech synthesis engine (e.g., common speech synthesis software). This speech data is then adjusted for intonation and tone and transmitted to the terminal over the network.
[0546] The device provides the user with the received audio data through its speaker. At this time, the device adjusts the volume and playback speed according to the surrounding environment and the user's state. This allows the user to have a natural and supportive conversational experience.
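The adjustment of volume and playback speed could be realized, for example, by a simple rule on the terminal. The following sketch is hypothetical in every numeric choice: the 40 dB quiet-room reference, the gain of 2 volume points per dB, the 0 to 100 volume scale, and the slower rate of 0.85 are assumptions for illustration.

```python
def playback_settings(ambient_db, prefers_slow=False):
    """Choose speaker volume (0-100) and playback rate for one response."""
    # Raise volume 2 points per dB of ambient noise above a 40 dB quiet room.
    volume = 50 + 2 * max(0.0, ambient_db - 40.0)
    volume = int(min(100, max(0, round(volume))))
    rate = 0.85 if prefers_slow else 1.0  # 1.0 means normal speed
    return {"volume": volume, "rate": rate}
```

Clamping the volume to a fixed range prevents a noisy environment from driving the speaker to an uncomfortable level.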
[0547] This system allows users to enjoy not just information exchange, but more personalized psychological support. The terminal sends user feedback collected during the interaction to the server, allowing the system to continuously improve the accuracy and quality of its responses.
[0548] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0549] Step 1:
[0550] The user speaks into the device, inputting voice. The device uses its built-in microphone to capture this voice. The captured voice data is compressed using compression techniques such as AAC format and sent to the server via the information and communication network. The input is raw voice data, and the output is a compressed audio file.
[0551] Step 2:
[0552] The server decompresses the received compressed audio data and converts it into text data using a speech recognition engine. The input is compressed audio data, and the output is text-based character information. Specifically, it performs analysis using widely available speech recognition software.
[0553] Step 3:
[0554] The server uses an emotion analysis module to analyze the user's emotional state. The input is the textual information together with the voice characteristics of the utterance, and the output is the user's emotional state (e.g., "stable," "excited," etc.). Emotion analysis utilizes techniques such as voice analysis, evaluating factors like tone, speed, and pitch.
[0555] Step 4:
[0556] The server uses a generative AI model to generate responses based on emotional states and text data. Here, prompts such as "respond in a relaxed tone that matches the current emotional state" are input, and optimized response text is output. A concrete example would be reassuring words appropriate to the situation.
[0557] Step 5:
[0558] The server converts the generated response text into speech data using a speech synthesis engine. The input is response data in text format, and the output is response data in speech format. In speech synthesis, the intonation and tone are adjusted to be as close to human as possible.
[0559] Step 6:
[0560] The terminal receives audio data transmitted from the server and plays it through the speaker. The input is response data in audio format, and the output is delivered to the user as actual audio. The terminal adjusts the volume according to the user's environment and plays the audio in the optimal way.
[0561] (Application Example 2)
[0562] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0563] The challenge is to provide a dialogue system that allows for natural communication in order to alleviate feelings of loneliness and mental stress among the elderly. Furthermore, this system is required to generate supportive responses that are tailored to the emotional state of the elderly and to provide psychological comfort.
[0564] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0565] In this invention, the server includes a processing device that converts the user's voice into text information and performs natural language processing to analyze the user's intentions and emotions; a generation model that generates a natural response based on the analysis results; and a synthesis device that converts the generated response into voice information and transmits it to an input / output device. This makes it possible to generate responses based on the user's emotions and provide psychological support to the elderly.
[0566] An "input / output device" is a device that acquires the user's voice in real time and plays back the received voice information to the user.
[0567] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions using natural language processing.
[0568] A "generative model" is an artificial intelligence model designed to generate natural responses based on analyzed user intentions and emotions.
[0569] A "synthesis device" is a device that converts text information generated by a generative model into speech information and transmits it to an input / output device.
[0570] "Psychological support" refers to providing elderly people with a sense of security and peace of mind through responses based on the user's emotions.
[0571] The embodiments for carrying out the invention are as follows:
[0572] This system consists of a terminal, a server, a generative AI model, and a network connecting them. The terminal is responsible for receiving the voice of elderly people in real time and transmitting this voice data to the server.
[0573] The server first uses a speech processing engine to convert speech data into text data. This process utilizes "speech recognition software." Subsequently, a "natural language processing engine" is used to analyze the user's intent and emotions using natural language processing technology. Based on the emotional information analyzed here, a generative AI model generates an appropriate response. For example, a "generative AI engine" is used for this response generation.
[0574] The generated text information is converted into speech data by a speech synthesis engine. At this stage, "speech synthesis software" is used, and the generated speech data is sent back to the terminal.
[0575] The terminal plays back the received audio data to the elderly user. During playback, the audio must be delivered at a volume and speed suited to the user's environment and condition.
[0576] For example, if an elderly person using the terminal says, "I feel lonely today," the system analyzes the utterance, determines that the emotion is "loneliness," and generates a response such as, "Shall we watch your favorite movie together?" This response aims to provide comfortable and reassuring communication.
[0577] An example of a prompt would be: "Generate a reassuring response when the user is feeling down. The user said, 'I'm feeling kind of lonely today.'"
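As an illustration of how such a prompt might be assembled, the following minimal Python sketch combines a detected emotion label with the user's utterance. The function name `build_prompt`, the lookup table, and the default instruction are assumptions made for this illustration only; only the wording of the "reassuring response" instruction is taken from the example above.

```python
# Hypothetical mapping from an emotion label to an instruction sentence.
# Only the first instruction's wording comes from the example prompt above.
INSTRUCTIONS = {
    "loneliness": "Generate a reassuring response when the user is feeling down.",
}
DEFAULT_INSTRUCTION = "Generate a natural, friendly response."


def build_prompt(emotion: str, utterance: str) -> str:
    """Return an instruction sentence followed by the quoted user utterance."""
    instruction = INSTRUCTIONS.get(emotion, DEFAULT_INSTRUCTION)
    return f"{instruction} The user said, '{utterance}'"


prompt = build_prompt("loneliness", "I'm feeling kind of lonely today.")
print(prompt)
```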
[0578] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0579] Step 1:
[0580] The terminal captures the user's voice in real time as input. The acquired voice data is compressed and output to the server over the network.
[0581] Step 2:
[0582] The server passes the received audio data as input to the speech processing engine. Speech recognition software is used to convert the audio data into text data. The converted text data is then output to the next stage of processing.
[0583] Step 3:
[0584] The server inputs text data into a natural language processing engine to analyze the user's intent and emotions. The analysis results are output as emotion data, which is then passed to a generative AI model.
[0585] Step 4:
[0586] The server inputs the analysis results into the generative AI model and generates a natural response based on the user's emotions. It then outputs the generated response text and sends it to the speech synthesis engine.
[0587] Step 5:
[0588] The server uses speech synthesis software to convert the generated response text into audio data. It then outputs the converted audio data and transmits it to the terminal via the network.
[0589] Step 6:
[0590] The terminal receives audio data as input and plays it back to the user through the speaker. The volume and speed are optimized according to the user's environment.
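The server-side portion of Steps 2 to 5 can be sketched as a chain of interchangeable stages. The following Python sketch is only one possible arrangement under stated assumptions: the class `ServerPipeline`, the `Analysis` record, and the four `fake_*` stand-in engines are hypothetical and exist solely to make the example runnable; a real system would substitute actual speech recognition, natural language processing, generative AI, and speech synthesis engines.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Analysis:
    intent: str
    emotion: str


@dataclass
class ServerPipeline:
    # Each stage is injected so that any engine can be substituted.
    recognize: Callable[[bytes], str]          # Step 2: audio -> text
    analyze: Callable[[str], Analysis]         # Step 3: text -> intent and emotion
    generate: Callable[[str, Analysis], str]   # Step 4: text + analysis -> reply
    synthesize: Callable[[str], bytes]         # Step 5: reply -> audio

    def handle(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        analysis = self.analyze(text)
        reply = self.generate(text, analysis)
        return self.synthesize(reply)


# Stand-in engines used only to make the sketch runnable.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_analyze(text: str) -> Analysis:
    emotion = "loneliness" if "lonely" in text.lower() else "neutral"
    return Analysis(intent="chat", emotion=emotion)


def fake_generate(text: str, analysis: Analysis) -> str:
    if analysis.emotion == "loneliness":
        return "Shall we watch your favorite movie together?"
    return "I see. Please tell me more."


def fake_synthesize(reply: str) -> bytes:
    return reply.encode("utf-8")


pipeline = ServerPipeline(fake_recognize, fake_analyze, fake_generate, fake_synthesize)
response_audio = pipeline.handle(b"I feel lonely today")
```

Injecting each stage as a callable keeps the control flow independent of any particular engine, which matches the way the steps above describe only inputs and outputs.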
[0591] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0592] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions and inference data, such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0593] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0594] [Fourth Embodiment]
[0595] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0596] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0597] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0598] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0599] The microphone 238 receives instructions and other input from the user 20 by capturing the user's voice. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0600] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0601] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0602] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
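One way to picture this control is a lookup from an emotion label to LED and motor targets. The Python sketch below is purely illustrative: the `Expression` record, the emotion labels, and all numeric values are assumptions and do not describe the actual control interface of the robot 414.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    led_state: str      # illumination state of the eye LEDs
    arm_angle_deg: int  # target angle for the arm motors


# Hypothetical emotion-to-expression table; the values are placeholders.
EXPRESSIONS = {
    "joy": Expression(led_state="bright", arm_angle_deg=60),
    "sadness": Expression(led_state="dim", arm_angle_deg=-20),
}
NEUTRAL = Expression(led_state="steady", arm_angle_deg=0)


def expression_for(emotion: str) -> Expression:
    """Look up the LED and motor targets for an emotion, defaulting to neutral."""
    return EXPRESSIONS.get(emotion, NEUTRAL)
```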
[0603] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0604] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0605] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0606] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0607] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0608] This invention provides a specific embodiment for realizing natural dialogue with users in a communication system for the elderly. This system is mainly composed of terminals, a server, and a network connecting them.
[0609] First, when a user begins interacting with this system, they speak into the terminal, which then acquires their voice. The terminal is equipped with a highly sensitive microphone designed to accurately capture the user's speech while filtering out external noise. The acquired voice data is then transmitted to the server in real time using compression technology.
[0610] Next, the server converts the received audio data into text format using speech recognition technology. At this stage, the server uses natural language processing to analyze the context behind the audio, the user's emotions, and even the intent of the conversation. For example, if the user asks, "What's the weather like today?", the server understands that the user is looking for the latest weather information.
[0611] Once the analysis is complete, the server uses generative AI to generate an appropriate response. This AI is trained on a massive dataset and is designed to produce natural, human-like responses. For example, it might generate a response such as, "It's sunny today. Do you have any plans to go out?"
[0612] The generated text responses are converted into natural-sounding voice data by a speech synthesis engine on the server. This generated voice data is sent to the terminal and delivered to the user through the speaker. Because the entire conversation is conducted by voice, with no text-based interaction required, it provides the user with a more immersive and personalized dialogue experience.
[0613] Furthermore, as an information provision function, the server retrieves information from external sources in response to user requests and presents it in a format optimized for the user's needs. This information covers a wide range of topics, such as news, health information, and activities related to hobbies.
[0614] Finally, the terminal stores the history of user interactions and feedback, and sends it to the server. This feedback is processed by the server as part of machine learning and used to improve the system's future response performance.
[0615] The goal of this entire system is to improve the quality of life for users by creating an environment where elderly people can enjoy conversations on a daily basis and by reducing feelings of loneliness.
[0616] The following describes the processing flow.
[0617] Step 1:
[0618] The user initiates communication by speaking into the terminal. The terminal's built-in microphone captures the user's voice in real time.
[0619] Step 2:
[0620] The terminal converts the captured audio into a digital signal, compresses it, and prepares it for transmission to the server. To facilitate speech recognition processing, it performs noise reduction and processes the data to ensure the audio remains clear.
[0621] Step 3:
[0622] The server processes the received audio data using a speech recognition engine and converts it into text data. The server then passes this text data to a natural language processing module for analysis, which extracts the user's intent and emotions.
[0623] Step 4:
[0624] The server uses generative AI to generate appropriate responses based on the analyzed intent and emotional information. This process also takes conversation history and context into consideration, resulting in more natural and human-like responses.
[0625] Step 5:
[0626] The server passes the generated text response to a speech synthesis module, which converts it into speech data. The converted speech is synthesized to have a natural intonation and a human-like sound.
[0627] Step 6:
[0628] The server sends the generated audio data to the terminal. The audio data is delivered to the user quickly while maintaining high quality.
[0629] Step 7:
[0630] The terminal plays the received audio data through its speaker, providing a response to the user. Through this output, the terminal establishes its presence as a companion to the user.
[0631] Step 8:
[0632] The terminal tracks user responses and additional interactions, sending this data to the server as feedback. This feedback is used to improve the system's response accuracy.
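Step 4 above refers to taking conversation history and context into consideration. A minimal way to do so, sketched below in Python, is to keep a bounded buffer of recent turns and render it as text that can be placed ahead of the next prompt. The class name `ConversationHistory` and the `max_turns` limit are assumptions for illustration; the disclosure does not specify how history is stored.

```python
from collections import deque


class ConversationHistory:
    """Keeps the most recent turns so they can be prepended to a prompt."""

    def __init__(self, max_turns: int = 3) -> None:
        self._turns = deque(maxlen=max_turns)  # oldest turns drop automatically

    def add(self, speaker: str, text: str) -> None:
        self._turns.append((speaker, text))

    def as_context(self) -> str:
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._turns)


history = ConversationHistory(max_turns=2)
history.add("user", "What's the weather like today?")
history.add("system", "It's sunny today. Do you have any plans to go out?")
history.add("user", "Maybe a short walk.")
context = history.as_context()
```

With `max_turns=2`, the first user turn has already been discarded by the time `as_context` is called, which bounds the prompt length at the cost of forgetting older context.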
[0633] (Example 1)
[0634] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0635] For the elderly and users who communicate via voice, it is crucial to create communication systems that provide natural and human-like responses. However, conventional technologies often fall short in terms of convenience and response accuracy, with particular challenges in noise reduction of voice input, emotion analysis, and appropriate response generation. Furthermore, there is currently a lack of effective methods for improving system performance by utilizing user feedback.
[0636] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0637] In this invention, the server includes an analysis means that converts voice data into text data and performs natural language processing to analyze the user's intentions and emotions; a generation model means that generates human-like responses; and a synthesis means that converts the generated responses into acoustic data and outputs it. This enables smooth interaction with the user, provides highly accurate responses, and allows for continuous improvement of system performance through the use of feedback.
[0638] "Acquisition means for acquiring sound, compressing the sound data, and transmitting it to an information processing device" refers to a device or system that has the function of capturing the sound emitted by a user, compressing it for effective data transfer, and then transmitting it to an information processing device such as a server via a network.
[0639] "An analytical means that converts audio data into text data and performs natural language processing to analyze the user's intent and emotions" refers to a technology or system that converts audio data into text format and uses language processing techniques to understand the context, emotions, and desired content from the text.
[0640] A "generative model means for generating human-like responses" is a model or program that uses AI technology to generate natural conversations in response to user input, creating responses that sound as if a human were responding.
[0641] "A synthesis means that converts the generated response into acoustic data and outputs it" refers to a device or system that uses speech synthesis technology to convert text generated in natural language into speech and output it in a format that can be heard by the user.
[0642] "Recording means for recording the history of conversations with users and feedback from users, and transmitting them for use by the analysis means" refers to a device or software that has the function of saving the content of past conversations with users and feedback information such as suggestions for improvement obtained from users, and transmitting it to a server for use in future analysis and system performance improvement.
[0643] This invention is a communication system that enables natural, voice-based dialogue for users, including the elderly. The system primarily consists of terminals, a server, and a network.
[0644] When a user initiates a voice interaction, the terminal device first receives the audio. This terminal is equipped with a high-sensitivity microphone to accurately capture the user's speech. Noise cancellation technology is employed to effectively remove external noise. The acquired audio data is compressed in real time and transferred to a server. AAC and Opus are commonly used for this audio compression.
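The terminal-side handling described above (noise removal followed by compression) can be sketched as follows. This Python example is a simplified stand-in under stated assumptions: a basic amplitude noise gate replaces real noise cancellation, and the standard-library `zlib` module replaces an audio codec such as AAC or Opus. Unlike those codecs, `zlib` is lossless, so the round trip here is exact; the function names are hypothetical.

```python
import struct
import zlib


def noise_gate(samples: list[int], threshold: int) -> list[int]:
    """Zero out samples whose magnitude is below the threshold."""
    return [s if abs(s) >= threshold else 0 for s in samples]


def encode_frame(samples: list[int]) -> bytes:
    """Pack 16-bit little-endian PCM and compress it (zlib stands in for a codec)."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return zlib.compress(pcm)


def decode_frame(payload: bytes) -> list[int]:
    """Server-side inverse of encode_frame."""
    pcm = zlib.decompress(payload)
    return list(struct.unpack(f"<{len(pcm) // 2}h", pcm))


raw = [3, -5, 1200, -900, 2, 0, 15000]
gated = noise_gate(raw, threshold=100)
restored = decode_frame(encode_frame(gated))
```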
[0645] The server decodes the received compressed audio data. The decoded audio data is converted into text data using speech recognition software. Commercially available speech recognition APIs can be used for this purpose. Subsequently, natural language processing (NLP) libraries are used to analyze the user's intent and emotions. The Natural Language Toolkit (NLTK) and spaCy are examples of libraries suitable for this purpose.
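To make the shape of this analysis output concrete, the following Python sketch derives coarse intent and emotion labels from keyword matches. It deliberately does not use NLTK or spaCy; the keyword tables, label names, and the `analyze` function are assumptions for illustration, and a production system would more likely rely on a trained model.

```python
import re

# Hypothetical keyword tables; a real system would use a trained model.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "sunny", "forecast"},
    "schedule": {"schedule", "appointment", "plan"},
}
EMOTION_KEYWORDS = {
    "loneliness": {"lonely", "alone"},
    "anxiety": {"worried", "anxious", "afraid"},
}


def _first_match(tokens: set[str], table: dict[str, set[str]], default: str) -> str:
    """Return the first label whose keyword set overlaps the tokens."""
    for label, keywords in table.items():
        if tokens & keywords:
            return label
    return default


def analyze(text: str) -> dict[str, str]:
    """Return coarse intent and emotion labels for one utterance."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "intent": _first_match(tokens, INTENT_KEYWORDS, "chat"),
        "emotion": _first_match(tokens, EMOTION_KEYWORDS, "neutral"),
    }
```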
[0646] Based on the analysis results, the server generates a response using a generative AI. This generative AI is a model trained on a large dataset. Examples of such models include widely known natural language generation technologies, such as the GPT series. The generated response is prepared in an optimized text format.
[0647] Text-based responses are converted to speech using a speech synthesis engine. The synthesized speech data is then compressed and sent to the terminal. The synthesized speech is adjusted to have natural intonation and emotion, and is delivered to the user through the speaker.
[0648] In addition, in response to specific user requests, the server retrieves the latest information from a wide-area communication network and provides summarized information through NLP and AI services. This includes a variety of content, such as news summaries and weather information.
[0649] Furthermore, the terminal records user feedback and sends it to the server. This feedback is analyzed by machine learning algorithms, contributing to improvements in the accuracy and quality of responses. Through this feedback process, the system continuously evolves and can respond quickly and flexibly to user needs.
[0650] As a concrete example, when a user asks, "What's the weather like this week?", the system retrieves weather data and provides a specific response such as, "There will be many sunny days this week. Do you have any plans to go out?". To enable this scenario, an example of a prompt to input into the generative AI model is, "Generate the next utterance for a natural conversation with an elderly person in response to the user's statement: 'What's the weather like this week?'"
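The weather scenario can be sketched as two small steps: condense the retrieved data, then place the condensed facts in the prompt together with the user's utterance. In the Python sketch below, the forecast list is hard-coded as a stand-in for externally retrieved data, and the function names and prompt wording are assumptions made for illustration.

```python
# Hypothetical stand-in for data retrieved from an external weather source.
WEEKLY_FORECAST = ["sunny", "sunny", "cloudy", "sunny", "rain", "sunny", "sunny"]


def summarize_forecast(days: list[str]) -> str:
    """Reduce a daily forecast to one sentence naming the most common condition."""
    most_common = max(set(days), key=days.count)
    count = days.count(most_common)
    return f"{count} of the next {len(days)} days are expected to be {most_common}."


def build_weather_prompt(utterance: str, days: list[str]) -> str:
    """Combine retrieved facts with the utterance so the model can ground its reply."""
    return (
        "Reply naturally to an elderly user, using the facts provided.\n"
        f"Facts: {summarize_forecast(days)}\n"
        f"User: {utterance}"
    )


weather_prompt = build_weather_prompt("What's the weather like this week?", WEEKLY_FORECAST)
```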
[0651] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0652] Step 1:
[0653] Voice input begins when the user speaks into the terminal. The terminal's built-in microphone captures this voice. The input voice is processed by a noise-canceling algorithm to reduce background noise, yielding clear voice data. This voice data is compressed in real time in AAC format and immediately sent to the server.
[0654] Step 2:
[0655] The server receives compressed audio data sent from the terminal. At this stage, the audio data is in AAC format, so it is first decompressed. The decompressed audio data is converted into text format using a speech recognition engine, such as a speech recognition API. The input is the decompressed audio data, and the output is the corresponding text data.
[0656] Step 3:
[0657] The server further analyzes the text data. Using a natural language processing toolkit, it interprets the grammatical structure, sentiment, and intent of the text. This analysis yields contextual information and intent as output, based on the input text data. This constitutes the sentiment and intent parameters necessary for subsequent response generation.
[0658] Step 4:
[0659] The server uses a generative AI model to generate specific responses based on the intentions and emotions obtained in the analysis step. Here, the generative AI model responds to prompt sentences and forms a natural dialogue. In this step, the analysis results are taken as input and human-like response text is output.
[0660] Step 5:
[0661] The synthesis engine on the server converts the generated text response into speech data. This process uses high-quality speech synthesis technology to transform the text data into smooth speech. The input is the generated response text, and the output is the corresponding speech data.
[0662] Step 6:
[0663] The server encodes the synthesized speech data and sends it to the terminal in a compressed format. The terminal then decompresses this data and provides it to the user through its speakers. For the user, this results in natural, flowing speech.
[0664] Step 7:
[0665] The terminal records user interaction history and feedback. This feedback is sent to the server to be used for system improvement. The input is user feedback information, and the output is analytical data based on that feedback. This allows the system to be improved to provide even more user-friendly responses.
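Step 7 takes user feedback as input and produces analytical data as output. One simple form of such analytical data is an average rating per intent, sketched below in Python. The `FeedbackRecord` fields and the 1-to-5 rating scale are assumptions for illustration; the disclosure does not define the feedback format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FeedbackRecord:
    intent: str    # intent label assigned in the analysis step
    response: str  # response text that was played back
    rating: int    # hypothetical user rating, 1 (poor) to 5 (good)


def average_rating_by_intent(records: list[FeedbackRecord]) -> dict[str, float]:
    """Aggregate feedback into a per-intent mean rating."""
    ratings: dict[str, list[int]] = defaultdict(list)
    for record in records:
        ratings[record.intent].append(record.rating)
    return {intent: sum(values) / len(values) for intent, values in ratings.items()}


log = [
    FeedbackRecord("weather", "There will be many sunny days this week.", 5),
    FeedbackRecord("weather", "It may rain tomorrow.", 3),
    FeedbackRecord("chat", "I see. Please tell me more.", 2),
]
summary = average_rating_by_intent(log)
```

A low average for one intent would indicate where response generation most needs improvement.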
[0666] (Application Example 1)
[0667] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0668] To alleviate feelings of loneliness and improve the quality of life for certain groups, such as the elderly, natural and personalized information provision and health management support are necessary. However, current voice dialogue systems have challenges in understanding emotions and managing health information, often failing to provide sufficient satisfaction. Furthermore, there is a lack of efficient support methods for regular and individual schedule management and health information recording.
[0669] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0670] In this invention, the server includes means for converting voice information into document format and performing natural language analysis to analyze the user's purpose and emotions; means for generating information and generating natural responses based on the analysis results; and means for recording the user's health information and assisting with daily schedule management. This makes it possible for elderly people to enjoy conversations naturally in their daily lives while receiving personalized information and health management support.
[0671] "User utterance" refers to the act of a user communicating information using their voice. The device acquires this voice information through this action.
[0672] A "central processing unit" refers to a device that converts audio information into document format and further analyzes the user's intent and emotions through natural language processing. This device is connected to terminals via a network.
[0673] "Information generation processing means" refers to technology for generating natural-sounding responses based on the results analyzed by the central processing unit. The generated information is provided to the user as audio.
[0674] "Voice conversion means" refers to a technology that converts generated text information into voice data and transmits it to a terminal. This enables voice interaction.
[0675] "Output means" refers to a device or function incorporated into the terminal to provide audio information to the user. Using this means, the generated audio is played back through the speaker.
[0676] "Support features" refer to functions that record the user's health information and assist with daily schedule management. These features provide users with appropriate support.
[0677] This invention is a natural voice dialogue system for supporting the daily lives of the elderly. The system mainly consists of three elements: a "terminal," a "central processing unit," and a "user."
[0678] The terminal is equipped with a highly sensitive microphone and a speaker, allowing the user's speech to be captured with high precision. The audio data is converted to a digital format and transmitted to the central processing unit using data compression technology. Examples of hardware that can serve as the terminal include smartphones and dedicated electronic terminals.
[0679] The central processing unit first converts speech information into text using a speech recognition API (e.g., open-source speech recognition software or a cloud service). Next, it uses a natural language processing framework to analyze the user's intent and emotions. Based on this analysis, a generative AI model generates a response appropriate to the user. This model is trained on a large amount of dialogue data, enabling it to provide natural and user-friendly responses.
[0680] Furthermore, it includes health management and schedule management functions, recording and analyzing the user's daily health information as needed. This allows for personalized health advice to be provided to each individual user.
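The recording and analysis of health information can be pictured with the following minimal Python sketch, which stores daily sleep duration and returns a gently worded message. The `HealthRecord` fields, the `min_sleep` threshold, and the message wording are all assumptions for illustration and are not medical guidance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class HealthRecord:
    day: date
    sleep_hours: float


def advice(records: list[HealthRecord], min_sleep: float = 6.0) -> str:
    """Return gentle advice based on average sleep over the recorded days."""
    if not records:
        return "No health data has been recorded yet."
    avg_sleep = sum(r.sleep_hours for r in records) / len(records)
    if avg_sleep < min_sleep:  # hypothetical threshold, for illustration only
        return "You seem to be sleeping a little less lately. How about resting early tonight?"
    return "Your sleep looks steady. Keep up the good rhythm."


week = [HealthRecord(date(2024, 12, 1), 5.0), HealthRecord(date(2024, 12, 2), 6.0)]
message = advice(week)
```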
[0681] For example, if an elderly person asks, "What's my schedule for today?", the system will respond, "You have an exercise session scheduled for this morning, followed by a relaxation session." This kind of daily voice support allows users to live their daily lives with peace of mind.
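A schedule query of this kind could be answered by sorting the day's entries and rendering them as one spoken sentence. The Python sketch below assumes a hypothetical in-memory schedule; the entry format, times, and function name are illustrative only.

```python
from datetime import time

# Hypothetical schedule entries for one day: (start time, activity).
SCHEDULE = [
    (time(14, 0), "a relaxation session"),
    (time(10, 0), "an exercise session"),
]


def describe_schedule(entries: list[tuple[time, str]]) -> str:
    """Render schedule entries in chronological order as one spoken sentence."""
    if not entries:
        return "You have nothing scheduled for today."
    ordered = sorted(entries, key=lambda entry: entry[0])
    parts = [f"{activity} at {start.strftime('%H:%M')}" for start, activity in ordered]
    return "You have " + ", followed by ".join(parts) + "."


spoken = describe_schedule(SCHEDULE)
```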
[0682] As an example of a prompt, the generative AI model would be instructed as follows: "Generate a response that provides gentle-toned health management advice for elderly people to enjoy everyday conversation."
[0683] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0684] Step 1:
[0685] The user speaks into the terminal. The terminal is equipped with a high-sensitivity microphone that accurately captures the user's voice. The input is the user's raw voice data. The terminal converts this voice data into a digital format, compresses the data, and transmits it to the central processing unit in real time. The output is compressed voice data.
[0686] Step 2:
[0687] The server (central processing unit) converts the received audio data into text format using a speech recognition API. The input is digital audio data. This conversion transforms the audio into text information. The output is text data. Next, a natural language processing algorithm is used to analyze the user's intent and emotions. This process helps understand what the user wants and their emotional state. The output of the analysis is metadata indicating the user's intent.
[0688] Step 3:
[0689] The server uses a generative AI model to generate appropriate responses to the user based on the analyzed data. The input is metadata containing the user's intent. The generative AI model is trained on a large-scale dialogue dataset and has the ability to generate natural and appropriate responses. The output of this process is the text data of the response.
[0690] Step 4:
[0691] The server converts the generated text response into speech data using a speech synthesis engine. The input is text data. Through speech synthesis, the text information is transformed into natural-sounding speech. This speech data is sent to the terminal. The output is the converted speech data.
[0692] Step 5:
[0693] The terminal plays the received audio data to the user through its speaker. The input is audio data from the server. The output is physical sound, so the user receives natural-sounding voice responses from the system. The user can experience voice interaction and make decisions about their daily actions based on the responses.
[0694] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0695] The present invention aims to provide more natural and psychologically supportive communication in a dialogue system designed to alleviate feelings of loneliness among the elderly by incorporating a function that recognizes the user's emotions and reflects them in the response. Specific embodiments are shown below.
[0696] The system consists of terminals, servers, an emotion engine, and a network connecting them.
[0697] When a user speaks into the terminal, the terminal's microphone captures the audio in real time. The audio data is compressed and sent to the server.
[0698] The server converts the received audio data into text data using speech recognition technology. Furthermore, an emotion engine located within the server analyzes the user's emotions using this audio data. This analysis takes into account the tone, speed, pitch, and resonance of the user's voice. For example, if the user speaks slowly in a calm tone, the emotion engine recognizes this as a "stable" state, while if they speak quickly and at a high pitch, it determines that they are "excited."
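The "stable" and "excited" examples above can be expressed as a simple rule over speech rate and pitch. The following Python sketch is a rule-based stand-in for the emotion engine: the `VoiceFeatures` fields and all threshold values are assumptions for illustration, and combinations that match neither rule are reported as "undetermined" rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    speech_rate_wps: float  # words per second
    mean_pitch_hz: float    # average fundamental frequency


def classify_voice(features: VoiceFeatures,
                   slow_rate: float = 2.0,
                   fast_rate: float = 3.0,
                   high_pitch: float = 220.0) -> str:
    """Map speech rate and pitch to a coarse state; thresholds are illustrative."""
    if features.speech_rate_wps >= fast_rate and features.mean_pitch_hz >= high_pitch:
        return "excited"
    if features.speech_rate_wps <= slow_rate and features.mean_pitch_hz < high_pitch:
        return "stable"
    return "undetermined"
```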
[0699] Next, the server uses natural language processing to combine the user's statements with emotional information, and the generative AI generates the optimal response. In this process, emotional data influences the tone and content of the response. For example, if the server detects that the user is feeling anxious, it will generate a more supportive and reassuring response such as, "It's okay. Is there anything I can do to help?"
[0700] The generated text response is converted into speech by a speech synthesis engine on the server. The speech data is adjusted for intonation and tone to make it sound as close to human speech as possible. The completed speech response is sent to the terminal and delivered to the user.
[0701] The terminal plays the received audio response through its speaker and provides it to the user. At this time, the terminal verifies that the response is appropriate to the user's state and environment, and plays it back at an appropriate volume and speed.
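How the playback volume and speed might be chosen is not specified. As one possibility, the Python sketch below scales volume with ambient noise and slows speech when the user prefers it. The input parameters, the 30 dB and 70 dB reference points, and the output ranges are all assumptions for illustration.

```python
def playback_settings(ambient_db: float, prefers_slow_speech: bool) -> dict[str, float]:
    """Choose volume (0.0 to 1.0) and a speed multiplier from simple inputs."""
    # Scale volume linearly between a quiet room (30 dB) and a noisy one (70 dB).
    fraction = (ambient_db - 30.0) / (70.0 - 30.0)
    fraction = min(1.0, max(0.0, fraction))  # clamp to the valid range
    volume = 0.4 + 0.6 * fraction
    speed = 0.85 if prefers_slow_speech else 1.0
    return {"volume": round(volume, 2), "speed": speed}
```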
[0702] This system allows users to have a deeper conversational experience, not just exchange information, and functions as a source of daily emotional support. The entire system constantly evolves by incorporating user feedback, improving the accuracy and quality of responses.
[0703] The following describes the processing flow.
[0704] Step 1:
[0705] The user initiates a conversation by speaking into the terminal. The terminal's built-in microphone filters out ambient noise and captures the user's voice clearly.
[0706] Step 2:
[0707] The terminal converts the captured audio into digital data and compresses it. The compressed audio data is then ready to be sent to the server.
[0708] Step 3:
[0709] The server converts the audio data received from the terminal into text data using a speech recognition engine. An emotion engine installed within the server then analyzes the user's emotions from this audio data.
[0710] Step 4:
[0711] The server's emotion engine analyzes elements such as tone, speed, and pitch of speech to identify the user's emotions. For example, if speech is hesitant or choppy, it might map this to "anxiety."
[0712] Step 5:
[0713] The server sends the user's text data and sentiment information to a natural language processing module. Here, contextual analysis and sentiment data are integrated to gain a more detailed understanding of the speaker's intent.
[0714] Step 6:
[0715] The server uses generative AI to generate natural-sounding responses based on the analyzed data. The tone and content of the response are adjusted based on information obtained from the emotion engine. For example, if "anxiety" is detected, it will generate a response in a gentle tone such as, "Don't worry, I'm here to support you."
[0716] Step 7:
[0717] The server converts the generated text response into speech data using a speech synthesis engine. During this process, natural intonation is added, and the speech is optimized for easy understanding.
[0718] Step 8:
[0719] The server sends audio data to the terminal. Once the terminal receives the data, it delivers the audio to the user through its speaker.
[0720] Step 9:
[0721] The device monitors user responses in real time and sends the collected feedback to the server. This feedback is used as training data to further improve future responses.
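Steps 4 and 6 above can be illustrated with a minimal sketch. It assumes the emotion engine has already reduced an utterance to three prosodic features; the `Prosody` fields, the thresholds, and the tone labels are hypothetical values chosen for illustration and are not specified in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Hypothetical prosodic features extracted from one utterance."""
    pause_ratio: float     # fraction of the utterance that is silence, 0..1
    speech_rate: float     # syllables per second
    pitch_variance: float  # normalized pitch variation, 0..1


def classify_emotion(p: Prosody) -> str:
    """Toy rule set for Step 4; every threshold is an illustrative assumption."""
    if p.pause_ratio > 0.4 and p.speech_rate < 3.0:
        return "anxiety"   # hesitant or choppy speech
    if p.speech_rate > 5.5 and p.pitch_variance > 0.6:
        return "excited"
    return "calm"


# Tone instruction handed to response generation in Step 6 (assumed labels).
TONE_BY_EMOTION = {
    "anxiety": "gentle and reassuring",
    "excited": "warm and upbeat",
    "calm": "relaxed",
}


def tone_for(p: Prosody) -> str:
    """Return the tone the response generator should adopt."""
    return TONE_BY_EMOTION[classify_emotion(p)]
```

A learned emotion engine would replace the hand-written rules; the sketch only shows how a detected emotion can be turned into a tone instruction for the generative AI.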
[0722] (Example 2)
[0723] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0724] In modern society, the elderly often experience feelings of loneliness, and there is a need for psychological support. However, conventional dialogue systems have struggled to adequately recognize users' emotions and provide natural and supportive responses. As a result, there is a challenge in that opportunities for the elderly to feel secure are limited.
[0725] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0726] In this invention, the server includes means for converting voice information into text information and performing machine processing to analyze the user's intentions and emotions; means for generating a harmonious response; and means for converting the response into voice information and transmitting it to an information communication device. This makes it possible to appropriately recognize the user's emotions and provide psychological support through natural dialogue based on those emotions.
[0727] An "information and communication device" is a device that has the function of acquiring a user's voice and transmitting that voice information to a processing device.
[0728] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions through machine processing.
[0729] A "generative modeling tool" is a tool that has the function of generating a harmonious response based on analyzed emotional information and intentions.
[0730] A "voice conversion means" is a means that has the function of converting the generated response into voice information and transmitting it to an information communication device.
[0731] An "output device" is a device that has the function of playing back audio information received by an information and communication device and providing an audio response to the user.
[0732] This invention is a dialogue system designed to alleviate feelings of loneliness among the elderly. It aims to provide natural and psychologically supportive communication by recognizing the user's emotions and reflecting them in its responses.
[0733] When a user speaks into the device to initiate a conversation, the device acquires the voice using its built-in microphone. This voice information is processed using compression technology (e.g., AAC or MP3 format) to enable efficient data transfer. The voice data is then transmitted to the server via the information and communication network.
[0734] The server converts the received audio data into text data using a speech recognition engine (e.g., general speech recognition software). This ensures that the user's speech is treated as textual information. Subsequently, an emotion analysis module within the server analyzes the user's voice characteristics, such as tone, speed, pitch, and resonance, to identify their emotional state. Machine learning techniques are used in this emotion analysis.
[0735] Using the analyzed emotion information, the server generates the optimal response using a generative AI model (for example, a proprietary generative AI). By inputting prompts into the generative AI, an appropriate response tailored to the user's emotions is elicited. A concrete example of a prompt might be, "The user is calm, so respond in a relaxed tone."
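The prompt construction described in this paragraph might look like the following sketch. The function name `build_prompt`, the state-to-tone table, and the "User said:" suffix are assumptions introduced for illustration; only the sentence pattern "The user is calm, so respond in a relaxed tone." comes from the text above.

```python
# Assumed mapping from an analyzed emotional state to a response tone.
TONE_BY_STATE = {
    "calm": "relaxed",
    "anxious": "gentle",
    "excited": "cheerful",
}


def build_prompt(emotional_state: str, user_text: str) -> str:
    """Compose a prompt for the generative AI from emotion and utterance."""
    # Fall back to a neutral tone for states missing from the table.
    tone = TONE_BY_STATE.get(emotional_state, "neutral")
    instruction = f"The user is {emotional_state}, so respond in a {tone} tone."
    return f"{instruction}\nUser said: {user_text}"
```

For example, `build_prompt("calm", "Good morning.")` yields the instruction sentence quoted above followed by the user's utterance on a second line.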
[0736] The generated text response is converted into speech data by a speech synthesis engine (e.g., common speech synthesis software). This speech data is then adjusted for intonation and tone and transmitted to the terminal over the network.
[0737] The device provides the user with the received audio data through its speaker. At this time, the device adjusts the volume and playback speed according to the surrounding environment and the user's state. This allows the user to have a natural and supportive conversational experience.
[0738] This system allows users to enjoy not just information exchange, but more personalized psychological support. The terminal sends user feedback collected during the interaction to the server, allowing the system to continuously improve the accuracy and quality of its responses.
[0739] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0740] Step 1:
[0741] The user speaks into the device, inputting voice. The device uses its built-in microphone to capture this voice. The captured voice data is compressed using compression techniques such as AAC format and sent to the server via the information and communication network. The input is raw voice data, and the output is a compressed audio file.
[0742] Step 2:
[0743] The server decompresses the received compressed audio data and converts it into text data using a speech recognition engine. The input is compressed audio data, and the output is text-based character information. Specifically, it performs analysis using widely available speech recognition software.
[0744] Step 3:
[0745] The server processes text data using an emotion analysis module to analyze the user's emotional state. The input is textual information, and the output is the user's emotional state (e.g., "stable," "excited," etc.). Emotion analysis utilizes techniques such as voice analysis, evaluating factors like tone, speed, and pitch.
[0746] Step 4:
[0747] The server uses a generative AI model to generate responses based on emotional states and text data. Here, prompts such as "respond in a relaxed tone that matches the current emotional state" are input, and optimized response text is output. A concrete example would be reassuring words appropriate to the situation.
[0748] Step 5:
[0749] The server converts the generated response text into speech data using a speech synthesis engine. The input is response data in text format, and the output is response data in speech format. In speech synthesis, the intonation and tone are adjusted to be as close to human as possible.
[0750] Step 6:
[0751] The terminal receives audio data transmitted from the server and plays it through the speaker. The input is response data in audio format, and the output is delivered to the user as actual audio. The terminal adjusts the volume according to the user's environment and plays the audio in the optimal way.
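The six steps above can be sketched as a single function whose arguments stand in for the engines on the server. This is an illustration under stated assumptions, not the implementation of this disclosure: `zlib` is used only as a lossless stand-in for the AAC compression mentioned in Step 1, the four engine callables are hypothetical, and the emotion analysis is given both the audio and the text because Step 3 refers to both.

```python
import zlib
from typing import Callable


def run_turn(
    raw_audio: bytes,
    recognize: Callable[[bytes], str],      # speech recognition engine
    analyze: Callable[[bytes, str], str],   # emotion analysis module
    generate: Callable[[str, str], str],    # generative AI model
    synthesize: Callable[[str], bytes],     # speech synthesis engine
) -> bytes:
    """Run one conversational turn and return the audio to be played back."""
    compressed = zlib.compress(raw_audio)   # Step 1: terminal compresses and sends
    audio = zlib.decompress(compressed)     # Step 2: server decompresses
    text = recognize(audio)                 # Step 2: audio -> text
    emotion = analyze(audio, text)          # Step 3: emotional state
    reply = generate(emotion, text)         # Step 4: response text
    return synthesize(reply)                # Step 5: text -> audio (Step 6 plays it)


def demo() -> bytes:
    """Exercise the pipeline with trivial stub engines."""
    return run_turn(
        b"hello",
        recognize=lambda audio: audio.decode("utf-8"),
        analyze=lambda audio, text: "stable",
        generate=lambda emotion, text: f"[{emotion}] You said: {text}",
        synthesize=lambda reply: reply.encode("utf-8"),
    )
```

Passing the engines as callables keeps each step's input and output explicit, which mirrors the input/output description given for each step.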
[0752] (Application Example 2)
[0753] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0754] The challenge is to provide a dialogue system that allows for natural communication in order to alleviate feelings of loneliness and mental stress among the elderly. Furthermore, this system is required to generate supportive responses that are tailored to the emotional state of the elderly and to provide psychological comfort.
[0755] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0756] In this invention, the server includes a processing device that converts the user's voice into text information and performs natural language processing to analyze the user's intentions and emotions; a generation model that generates a natural response based on the analysis results; and a synthesis device that converts the generated response into voice information and transmits it to an input / output device. This makes it possible to generate responses based on the user's emotions and provide psychological support to the elderly.
[0757] An "input / output device" is a device that acquires the user's voice in real time and plays back the received voice information to the user.
[0758] A "processing device" is a device that converts audio information into text information and analyzes the user's intentions and emotions using natural language processing.
[0759] A "generative model" is an artificial intelligence model designed to generate natural responses based on analyzed user intentions and emotions.
[0760] A "synthesis device" is a device that converts text information generated by a generative model into speech information and transmits it to an input / output device.
[0761] "Psychological support" refers to providing elderly people with a sense of security and peace of mind through responses based on the user's emotions.
[0762] The embodiments for carrying out the invention are as follows:
[0763] This system consists of a terminal, a server, a generative AI model, and a network connecting them. The terminal is responsible for receiving the voice of elderly people in real time and transmitting this voice data to the server.
[0764] The server first uses a speech processing engine to convert speech data into text data. This process utilizes "speech recognition software." Subsequently, a "natural language processing engine" is used to analyze the user's intent and emotions using natural language processing technology. Based on the emotional information analyzed here, a generative AI model generates an appropriate response. For example, a "generative AI engine" is used for this response generation.
[0765] The generated text information is converted into speech data by a speech synthesis engine. At this stage, "speech synthesis software" is used, and the generated speech data is sent back to the terminal.
[0766] The device plays back received audio data and provides it to the elderly. During playback, it is required to deliver the audio at a volume and speed that is optimal for the user's environment and condition.
[0767] For example, if an elderly person using the device says, "I feel lonely today," the system can analyze the utterance, determine that the emotion is "loneliness," and then generate a response such as, "Shall we watch your favorite movie together?" This response aims to provide comfortable and reassuring communication.
[0768] An example of a prompt would be: "Generate a reassuring response when the user is feeling down. The user said, 'I'm feeling kind of lonely today.'"
[0769] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0770] Step 1:
[0771] The device captures the user's voice in real time and takes the voice data as input. The acquired voice data is compressed and output to the server over the network.
[0772] Step 2:
[0773] The server passes the received audio data as input to the speech processing engine. Speech recognition software is used to convert the audio data into text data. The converted text data is then output to the next stage of processing.
[0774] Step 3:
[0775] The server inputs text data into a natural language processing engine to analyze the user's intent and emotions. The analysis results are output as emotion data, which is then passed to a generative AI model.
[0776] Step 4:
[0777] The server inputs the analysis results into the generative AI model and generates a natural response based on the user's emotions. It then outputs the generated response text and sends it to the speech synthesis engine.
[0778] Step 5:
[0779] The server uses speech synthesis software to convert the generated response text into audio data. It then outputs the converted audio data and transmits it to the terminal via the network.
[0780] Step 6:
[0781] The device receives audio data as input and plays it back to the user through the speaker. The volume and speed are optimized according to the user's environment.
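The volume and speed optimization in Step 6 is not specified further in this disclosure. One possible sketch, assuming the terminal can measure ambient noise in decibels and knows whether the user prefers slower speech, is shown below; the 30 dB and 70 dB endpoints, the volume range, and the 0.85 speed factor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackSettings:
    volume: float  # 0.0 (mute) to 1.0 (maximum)
    speed: float   # 1.0 is the normal speaking rate


def choose_playback(ambient_db: float, prefers_slow_speech: bool = False) -> PlaybackSettings:
    """Pick playback volume and speed from the environment and the user's state."""
    # Linear ramp: a quiet room (30 dB) maps to 0.4, a noisy one (70 dB) to 1.0.
    ratio = (ambient_db - 30.0) / 40.0
    clamped = min(max(ratio, 0.0), 1.0)
    volume = round(0.4 + 0.6 * clamped, 2)
    speed = 0.85 if prefers_slow_speech else 1.0
    return PlaybackSettings(volume=volume, speed=speed)
```

Clamping the ramp keeps the volume within a fixed range even when the measured noise level falls outside the assumed endpoints.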
[0782] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0783] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0784] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0785] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0786] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0787] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0788] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible (expressed in actions) it becomes.
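The layout of the emotion map 400 described above can be modeled, as a rough sketch, with polar coordinates. The class name `MapPoint`, the angle convention (0 degrees at the 3 o'clock position, increasing counterclockwise), and the labels returned for points lying exactly on an axis are assumptions for illustration; the actual positions of individual emotions are given only in Figure 9 and are not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MapPoint:
    radius: float     # 0 = center (primitive); larger = more outwardly expressed
    angle_deg: float  # 0 = 3 o'clock, 90 = 12 o'clock (assumed convention)

    def xy(self) -> Tuple[float, float]:
        a = math.radians(self.angle_deg)
        return (self.radius * math.cos(a), self.radius * math.sin(a))


def region(p: MapPoint) -> Tuple[str, str]:
    """Return (side, valence) for a point on the map."""
    a = p.angle_deg % 360.0
    if a < 90.0 or a > 270.0:
        side = "situation"    # right half: situational judgment
    elif 90.0 < a < 270.0:
        side = "reaction"     # left half: reactions occurring in the brain
    else:
        side = "both"         # directly above or below the center
    if 0.0 < a < 180.0:
        valence = "pleasure"      # upper side
    elif 180.0 < a < 360.0:
        valence = "displeasure"   # lower side
    else:
        valence = "neutral"       # on the horizontal axis (assumed label)
    return side, valence


def distance(p: MapPoint, q: MapPoint) -> float:
    """Euclidean distance; emotions likely to co-occur are mapped close together."""
    (px, py), (qx, qy) = p.xy(), q.xy()
    return math.hypot(px - qx, py - qy)
```

Under this convention, the radius captures how visibly an emotion is expressed in actions, and the distance between two points gives a simple measure of how likely two emotions are to occur together.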
[0789] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0790] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0791] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
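The final step of this paragraph, determining the user's emotion from the emotion values output by the neural network, might be sketched as follows. The function name `dominant_emotions`, the `margin` parameter, and the numeric values in the usage example are hypothetical; the disclosure states only that neighboring emotions such as "reassured," "calm," and "confident" receive similar values.

```python
from typing import Dict, List


def dominant_emotions(values: Dict[str, float], margin: float = 0.1) -> List[str]:
    """Return the strongest emotion and any others within `margin` of it.

    `values` maps each emotion on the map to the value produced by the
    (assumed, pre-trained) neural network for one user input.
    """
    top = max(values.values())
    close = [name for name, v in values.items() if top - v <= margin]
    # Strongest first, so the first element is the determined emotion.
    return sorted(close, key=lambda name: -values[name])
```

With illustrative values `{"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.05}`, the function returns the three neighboring emotions with "reassured" first and excludes "anxious".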
[0792] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0793] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0794] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0795] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0796] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0797] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0798] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0799] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0800] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0801] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0802] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0803] The following is further disclosed regarding the embodiments described above.
[0804] (Claim 1)
[0805] A terminal device that captures the user's voice in real time and transmits the voice data to a server,
[0806] A server means that converts the aforementioned audio data into text and performs natural language processing to analyze the user's intent and emotions,
[0807] A generation AI means that generates a natural response based on the aforementioned analysis results,
[0808] A speech synthesis means that converts the generated response into audio data and transmits it to the terminal,
[0809] The aforementioned terminal includes an output means that plays back the received audio data and provides an audio response to the user,
[0810] A system that includes this.
[0811] (Claim 2)
[0812] The system according to claim 1, wherein the server means obtains information from the internet in response to a user's request and provides the information summarized by the generation AI means.
[0813] (Claim 3)
[0814] The system according to claim 1, wherein the terminal device has a function to collect user feedback and transmit it to the server means to improve the accuracy of future responses.
[0815] "Example 1"
[0816] (Claim 1)
[0817] An acquisition means for acquiring audio, compressing the audio data, and transmitting it to an information processing device,
[0818] An analysis means that converts the aforementioned audio data into text data and analyzes the user's intentions and emotions through natural language processing,
[0819] A generative model means that generates a human-like response based on the aforementioned analysis results,
[0820] A synthesis means that converts the generated response into acoustic data and transmits it to the acquisition means,
[0821] The acquisition means includes an output means that plays back the received audio data and provides an audio response to the user,
[0822] A recording means that records the history of conversations with the user and feedback from the user, and transmits it for use by the analysis means,
[0823] A system that includes this.
[0824] (Claim 2)
[0825] The system according to claim 1, wherein the analysis means acquires information from a wide-area information network in response to a user's request and provides the information summarized by the generation model means.
[0826] (Claim 3)
[0827] The system according to claim 1, wherein the acquisition means has a function of collecting user-generated feedback that contributes to improving quality and transmitting it for processing by the analysis means.
[0828] "Application Example 1"
[0829] (Claim 1)
[0830] A terminal means that collects user speech in real time and transmits the voice information to a central processing unit,
[0831] A central processing unit that converts the aforementioned audio information into a document format and performs natural language analysis to analyze the user's purpose and emotions,
[0832] A generation information processing means that generates a natural response based on the analysis results,
[0833] A voice conversion means that converts the generated response into voice information and transmits it to the terminal,
[0834] The aforementioned terminal includes an output means that plays back the received audio information and provides an audio response to the user,
[0835] A support system equipped with functions to record the user's health information and assist in managing their daily schedule,
[0836] A system that includes this.
[0837] (Claim 2)
[0838] The system according to claim 1, wherein the central processing unit acquires knowledge from an information network in response to a user's request and provides the knowledge summarized by the generation information processing means.
[0839] (Claim 3)
[0840] The system according to claim 1, wherein the terminal means has a function of collecting user evaluations and transmitting them to the central processing unit to improve future response accuracy.
[0841] "Example 2 of combining an emotion engine"
[0842] (Claim 1)
[0843] An information communication device that acquires the user's voice using an acquisition means and transmits the voice information to a processing device,
[0844] A processing device that converts the aforementioned audio information into text information and performs machine processing to analyze the user's intentions and emotions,
[0845] A generation model means for generating a harmonious response based on the analysis results,
[0846] A voice conversion means that converts the generated response into voice information and transmits it to the information communication device,
[0847] The aforementioned information communication device includes an output device that plays back received audio information and provides an audio response to the user,
[0848] A system that includes this.
[0849] (Claim 2)
[0850] The system according to claim 1, wherein the processing device acquires information from a wide-area information network in response to a user's request and provides information simplified by the generation model means.
[0851] (Claim 3)
[0852] The system according to claim 1, wherein the information communication device has a function of collecting user evaluation information and transmitting it to the processing device to improve future response accuracy.
[0853] "Application example 2 when combining with an emotional engine"
[0854] (Claim 1)
[0855] An input / output device that acquires the user's voice in real time and transmits that voice information to a processing device,
[0856] A processing device that converts the aforementioned audio information into text information and performs natural language processing to analyze the user's intent and emotions,
[0857] A generative model that generates a natural response based on the aforementioned analysis results,
[0858] A synthesis device that converts the generated response into audio information and transmits it to the input / output device,
[0859] The input / output device includes an output device that plays back the received audio information and provides an audio response to the user,
[0860] A means for generating user-based responses to provide psychological support to the elderly in the field of elderly care,
[0861] A system that includes this.
[0862] (Claim 2)
[0863] The system according to claim 1, wherein the processing device acquires information from a communication network in response to a user inquiry and provides information summarized by the generation model.
[0864] (Claim 3)
[0865] The system according to claim 1, wherein the input / output device has a function of recording the user's response and transmitting it to the processing device to improve the accuracy of future responses.
[Explanation of Symbols]
[0866] 10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: a terminal device that captures the user's voice in real time and transmits the voice data to a server; a server means that converts the aforementioned audio data into text and performs natural language processing to analyze the user's intent and emotions; a generation AI means that generates a natural response based on the aforementioned analysis results; and a speech synthesis means that converts the generated response into audio data and transmits it to the terminal, wherein the aforementioned terminal includes an output means that plays back the received audio data and provides an audio response to the user.
2. The system according to claim 1, wherein the server means obtains information from the internet in response to a user's request and provides the information summarized by the generation AI means.
3. The system according to claim 1, wherein the terminal device has a function to collect user feedback and transmit it to the server means to improve future response accuracy.