system

The system addresses isolation and cognitive decline in the elderly by providing personalized interactions and real-time support through natural language processing, speech synthesis, and data management, enhancing family connections and emotional security.

JP2026100655A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19


IPC: G10L13/00; G10L21/028; G10L13/08; G10L15/00; G06F3/16; G10L13/02; G10L15/22; G10L13/06; G10L19/02; G10L19/00; G10L25/48; G10L19/16; G10L13/10; G10L13/04; G10L15/30; G10L25/00; G10L15/10

AI Tagging

Application Domain

Sound input/output; Speech recognition


Abstract

[Problem] To provide a system. [Solution] A system including: natural language processing means; storage means that records and uses past conversation content; speech synthesis means that outputs a generated response as speech; speech recognition means that converts acquired speech into text; data management means that manages the user's schedule and health information; and communication means that periodically reports information to the user's family.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] The sense of isolation experienced by the elderly, the decline in cognitive function due to lack of conversation, and the resulting increase in stress have become social problems. There is a need for means to help the elderly maintain connections with their families and society and enrich their daily lives. However, with current technologies, it is difficult to provide individualized responses based on individual needs and past conversation histories, and there are limited efficient ways for families to understand the living conditions of the elderly. There is a need for technology that comprehensively addresses these issues.

Means for Solving the Problems

[0005] This invention provides a system comprising natural language processing means, storage means for recording and utilizing past conversation content, speech synthesis means for outputting generated responses as speech, speech recognition means for converting the obtained speech into text, data management means for managing the user's schedule and health information, and communication means for periodically reporting information to the user's family. This enables personalized responses through dialogue with elderly individuals and realizes natural interactions that reflect past conversation history. Furthermore, by reporting the user's living situation to family members, it strengthens connections with family members living far away and provides a sense of security. This system is an integrated solution for preventing isolation and improving the quality of life for the elderly.

[0006] "Natural language processing" refers to technologies that analyze text data to understand the user's intent and context.

[0007] "Storage means" refers to a database function that records past conversations with the user and allows that data to be referenced in subsequent interactions.

[0008] "Speech synthesis means" refers to technology that converts generated text data into speech and outputs it to the user.

[0009] "Speech recognition means" refers to technology that converts speech input from a user into text data.

[0010] "Data management means" refers to functions for organizing and managing users' schedules and health information, and updating the information as needed.

[0011] "Communication means" refers to communication technologies used to periodically report information to external parties, such as the user's family.

Brief Description of the Drawings

[0012]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which combines an emotion engine.

Modes for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0014] First, the terms used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0016] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is memory in which information is temporarily stored and which the processor uses as work memory.

[0017] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same reading applies in this specification when three or more items are linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] The AI support system for the elderly, as described in this invention, is implemented through the cooperation of a server, a terminal, and a user. A specific embodiment is shown below.

[0034] The server first uses natural language processing to convert the user's voice input into text data and analyzes its intent. The analyzed data is then compared with past conversation history and recorded in memory. Based on this information, the server uses a generative model to create an appropriate response. This response is then sent back to the terminal in text format.

[0035] The terminal converts text data sent from the server into speech data using speech synthesis and outputs it audibly to the user. In this process, the terminal adjusts the tone and speed of the speech to provide it in a natural and easy-to-understand manner for the user. The terminal also converts newly inputted speech from the user into text using speech recognition and sends that information to the server.

[0036] Users engage in everyday conversations through this system. For example, if a user asks, "What did I eat for dinner yesterday?", the server retrieves past data from its memory and generates a specific response such as, "Yesterday's meal was fish." In this way, users can receive support that contributes to reducing feelings of loneliness and maintaining cognitive function through continuous conversation.
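The recall behavior described in paragraph [0036] can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `MemoryEntry`, `ConversationMemory`, and `answer_dinner_question` are hypothetical, the topic-label lookup stands in for whatever retrieval the storage means actually uses (the specification does not say), and the stored entry is made-up sample data.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class MemoryEntry:
    day: date    # day the fact was recorded
    topic: str   # coarse label such as "dinner" (hypothetical scheme)
    text: str    # what the user said or what was noted


@dataclass
class ConversationMemory:
    """Hypothetical in-memory stand-in for the storage means."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def remember(self, day: date, topic: str, text: str) -> None:
        self.entries.append(MemoryEntry(day, topic, text))

    def recall(self, day: date, topic: str) -> Optional[str]:
        # Walk backwards so the most recently stored match wins.
        for entry in reversed(self.entries):
            if entry.day == day and entry.topic == topic:
                return entry.text
        return None


def answer_dinner_question(memory: ConversationMemory, today: date) -> str:
    yesterday = today - timedelta(days=1)
    meal = memory.recall(yesterday, "dinner")
    if meal is None:
        return "I don't have a record of yesterday's dinner."
    return f"Yesterday's meal was {meal}."


memory = ConversationMemory()
memory.remember(date(2024, 12, 8), "dinner", "fish")  # made-up sample entry
reply = answer_dinner_question(memory, today=date(2024, 12, 9))
print(reply)
```

In a real deployment the lookup would more plausibly be driven by the intent analysis on the server; the fixed "dinner" label here only keeps the example short.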

[0037] Furthermore, the data management means manages the user's schedule and health information in real time and sets reminders as needed. For example, if a user says, "I want to check my doctor's appointment for next week," the system checks the schedule information and responds, "It's next Monday at 10:00 AM."
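A schedule lookup of the kind described in paragraph [0037] might look like the sketch below. The `Appointment` type, the `next_appointment` and `describe` helpers, and the sample schedule are all assumptions made for illustration; the specification does not define a data model or a matching rule, and simple keyword matching is used here in place of real intent analysis.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


@dataclass
class Appointment:
    title: str
    when: datetime


def next_appointment(items: List[Appointment], keyword: str,
                     now: datetime) -> Optional[Appointment]:
    """Return the earliest future appointment whose title contains the keyword."""
    upcoming = [a for a in items
                if a.when > now and keyword.lower() in a.title.lower()]
    return min(upcoming, key=lambda a: a.when) if upcoming else None


def describe(appt: Appointment) -> str:
    # Format by hand to avoid locale-dependent strftime output.
    hour12 = appt.when.hour % 12 or 12
    suffix = "AM" if appt.when.hour < 12 else "PM"
    weekday = WEEKDAYS[appt.when.weekday()]
    return f"It's on {weekday} at {hour12}:{appt.when.minute:02d} {suffix}."


# Made-up sample schedule.
schedule = [
    Appointment("Dentist", datetime(2024, 12, 20, 14, 30)),
    Appointment("Doctor's appointment", datetime(2024, 12, 16, 10, 0)),
]
now = datetime(2024, 12, 9, 9, 0)
found = next_appointment(schedule, "doctor", now)
reply = describe(found) if found else "I could not find that appointment."
print(reply)
```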

[0038] Reports to family members are made via communication methods. The server periodically analyzes the user's lifestyle and health status and sends the results to the family as a report. This makes it easier for family members to understand the elderly person's condition, even when they are in a remote location, and provides them with peace of mind.
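To make the periodic family report of paragraph [0038] concrete, the sketch below summarizes a few daily records into a short text report. The `DailyRecord` fields and the report wording are hypothetical; the specification does not state which lifestyle or health indicators are analyzed, and the sample records are invented for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class DailyRecord:
    day: str                 # ISO date string
    conversation_turns: int  # hypothetical activity indicator
    medication_taken: bool   # hypothetical health indicator


def build_family_report(user_name: str, records: List[DailyRecord]) -> str:
    """Summarize a reporting period as plain text for the family."""
    if not records:
        return f"No records for {user_name} in this period."
    avg_turns = mean(r.conversation_turns for r in records)
    missed = [r.day for r in records if not r.medication_taken]
    lines = [
        f"Report for {user_name} ({records[0].day} to {records[-1].day})",
        f"Average conversation turns per day: {avg_turns:.1f}",
        f"Days with missed medication: {', '.join(missed) if missed else 'none'}",
    ]
    return "\n".join(lines)


# Made-up sample data.
records = [
    DailyRecord("2024-12-09", 5, True),
    DailyRecord("2024-12-10", 3, False),
]
report = build_family_report("the user", records)
print(report)
```

How the report is delivered (e-mail, messaging app, or another channel) is left open here, as it is in the text.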

[0039] This invention provides an effective means for elderly people to live fulfilling lives with peace of mind while strengthening their ties with their families.

[0040] The following describes the processing flow.

[0041] Step 1:

[0042] The terminal receives voice input from the user. A speech recognition system is used to convert the voice data into text data. This text data is then sent to a server for analysis of the user's intent.

[0043] Step 2:

[0044] The server analyzes the received text data using natural language processing techniques. The analysis helps understand the user's intent and the context of the conversation, and retrieves relevant information from past conversation history.

[0045] Step 3:

[0046] Based on the information acquired by the server, a generative model is used to generate an appropriate response to the user's utterance. In this process, past history is also considered to enable context-specific responses.

[0047] Step 4:

[0048] The server sends the generated text response to the terminal. The terminal then uses a speech synthesis system to convert this text data into speech and outputs it to the user.

[0049] Step 5:

[0050] The user listens to the terminal's audio output and, if they wish to continue the conversation, provides new voice input. Repeating this process enables continuous conversation.

[0051] Step 6:

[0052] The data management system updates schedules and health information based on the user's new utterances and sets reminders as needed.

[0053] Step 7:

[0054] The server periodically reports the user's living situation and health information to the family via a communication method. This allows the family to properly understand the user's condition.
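Steps 1 to 5 above form one conversational turn, which can be sketched as a single function with the speech and language components passed in as callables. Everything named here (`run_turn`, the `fake_*` stubs, the `History` alias) is hypothetical; the stubs only echo text so that the control flow is visible, and Steps 6 and 7 (schedule updates and family reports) are deliberately left out.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (user text, system reply) pairs


def run_turn(audio: bytes,
             recognize: Callable[[bytes], str],
             generate: Callable[[str, History], str],
             synthesize: Callable[[str], bytes],
             history: History) -> bytes:
    user_text = recognize(audio)               # Step 1: speech to text
    reply_text = generate(user_text, history)  # Steps 2-3: analyze and respond
    history.append((user_text, reply_text))    # keep the turn for later context
    return synthesize(reply_text)              # Step 4: text to speech


# Stub components standing in for real speech and language services.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(text: str, history: History) -> str:
    return f"You said: {text} (turn {len(history) + 1})"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


history: History = []
first = run_turn(b"hello", fake_recognize, fake_generate, fake_synthesize, history)
# Step 5: the user speaks again, so the same function runs with the grown history.
second = run_turn(b"how are you", fake_recognize, fake_generate, fake_synthesize, history)
print(first, second)
```

Passing the components in as arguments keeps the loop independent of where recognition and synthesis run, which the text places on the terminal in some paragraphs and on the server in others.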

[0055] (Example 1)

[0056] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0057] As society ages, many elderly people experience loneliness in their daily lives and face challenges in managing their health. To address these challenges, there is a need for support systems that enable the elderly to live their daily lives with peace of mind and communicate effectively with their families. In addition, there is a need for technology that reduces noise and unnatural speech when using voice interfaces, providing a more comfortable and natural conversational environment.

[0058] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0059] In this invention, the server includes a natural language processing unit, a storage device for recording and referencing past conversation information, a speech synthesis device for outputting generated responses as speech, means for filtering noise when converting speech data to text data, and means for adjusting the tone and speed of the generated responses. This makes it possible for elderly people to receive daily support while reducing loneliness, and to provide a sense of security to their families. Furthermore, even when using a voice interface, noise reduction and natural speech output enable smoother use.

[0060] A "natural language processing device" is a device that utilizes technology to understand and analyze human language in order to generate appropriate responses.

[0061] A "memory device" is a device that stores past conversation information and uses it for future matching and response generation.

[0062] A "speech synthesis device" is a device that converts generated text data into speech and provides it to the user audibly.

[0063] A "speech recognition device" is a device that converts a user's speech into digital data, making it possible to process it as text.

[0064] An "information management device" is a device that systematically manages users' schedules and health-related information, and provides appropriate instructions and notifications.

[0065] A "communication device" is a device used to report the user's status and necessary information to family members in a remote location according to certain standards.

[0066] "Means for filtering noise" refers to methods for removing unwanted acoustic noise from received audio data, making it easier to analyze the speech content itself.

[0067] "Means for adjusting tone and speed" refers to methods for adjusting generated audio data to provide users with natural and easily understandable audio.
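The specification does not name a particular noise filtering method. As one very simple possibility, the sketch below applies an amplitude noise gate, which silences samples quieter than a threshold. The function name `noise_gate`, the threshold value, and the sample values are assumptions for illustration; practical systems typically use more elaborate techniques.

```python
from typing import List


def noise_gate(samples: List[float], threshold: float) -> List[float]:
    """Zero out samples whose absolute amplitude is below the threshold.

    Samples are assumed to be normalized to the range [-1.0, 1.0].
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return [s if abs(s) >= threshold else 0.0 for s in samples]


# Made-up samples: low-level hiss around two louder speech peaks.
raw = [0.01, -0.5, 0.02, 0.8, -0.03]
cleaned = noise_gate(raw, threshold=0.05)
print(cleaned)
```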

[0068] This invention, an AI support system for the elderly, is built on the interaction between a server, a terminal, and the user.

[0069] The server first receives voice data input from the user and converts it into text data using a speech recognition device. This process utilizes a natural language processing unit to understand human speech. Specifically, common cloud-based speech recognition technology is used to convert speech to text. The converted text is stored in memory and used to refer to past conversation information.

[0070] The terminal receives text data sent from the server and converts it into speech data using a speech synthesis device. During this process, the tone and speed of the synthesized speech are adjusted to enhance the naturalness of the generated response. For example, the terminal can use speech synthesis technology to convey specific details such as "Yesterday we had fish" in a soft, natural way.

[0071] Users can use this system to engage in everyday conversations and alleviate feelings of loneliness. For example, if a user says, "I want to check my doctor's appointment for next week," the server will use its information management device to check the schedule data and generate an appropriate response. Specifically, it can provide a response such as, "It's next Monday at 10:00 AM."

[0072] Furthermore, an example of a prompt message sent to the AI generation model is, "Considering that the user is elderly, please gently describe what they had for lunch yesterday." This model then generates a more appropriate response, which is delivered to the user.
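A prompt of the kind quoted in paragraph [0072] could be assembled from a style instruction, retrieved history, and the user's question, as in the sketch below. The function `build_prompt`, the layout of the prompt, and the sample history line are hypothetical; the specification gives only the example sentence and does not describe a prompt template.

```python
from typing import List

DEFAULT_STYLE = "Considering that the user is elderly, please answer gently."


def build_prompt(question: str, history: List[str],
                 style_note: str = DEFAULT_STYLE) -> str:
    """Combine the style instruction, past conversation, and the new question."""
    if history:
        history_lines = [f"- {item}" for item in history]
    else:
        history_lines = ["- (no relevant history)"]
    parts = [style_note, "Relevant past conversation:",
             *history_lines, f"User: {question}"]
    return "\n".join(parts)


prompt = build_prompt(
    "What did I have for lunch yesterday?",
    ["2024-12-08 lunch: udon noodles"],  # made-up sample history
)
print(prompt)
```

The resulting string would then be passed to whichever generative model the server uses; that call is omitted because the text does not identify the model or its interface.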

[0073] This invention allows users to receive daily support while their families gain peace of mind through regular reports on the user's condition. The system maintains the quality of voice data through noise filtering, creating a natural and easy-to-understand conversational environment for the user.

[0074] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0075] Step 1:

[0076] The user speaks into the terminal's microphone. The terminal can process this voice data using cloud-based speech recognition technology. The input is the user's spoken voice, and the output is digitized voice data. The terminal sends this voice data to the server for further processing.

[0077] Step 2:

[0078] The server uses the received audio data to perform analysis with a natural language processing unit. This process uses speech recognition technology to convert the audio data into text data, thereby understanding the user's intent. Here, the input is digital audio data, and the output is the analyzed text data. Furthermore, the server retrieves past conversation information from its storage device and compares it with historical information related to the intent.

[0079] Step 3:

[0080] The server references text data and conversation history and uses a generative AI model to generate appropriate prompt sentences. Based on these prompt sentences, it generates responses that match the user's requests. The input is history information matched with text data, and the output is a natural language response text based on the prompt sentences.

[0081] Step 4:

[0082] The generated response text is sent from the server to the terminal, which uses speech synthesis technology to convert this text into speech data. The input here is a response text in natural language, and the output is speech data with adjusted tone and speed. The terminal then provides this speech data to the user audibly through its speaker.

[0083] Step 5:

[0084] The user listens to the audio output from the terminal and decides whether to continue the conversation. In this step, the input is the audio response, and the output is the information the user obtains. If additional information is needed, the user restarts the conversation from Step 1.

[0085] (Application Example 1)

[0086] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0087] There is a need to provide effective support to alleviate the loneliness and anxiety that elderly people face in their daily lives and to maintain their cognitive function. Furthermore, a system is needed that allows family members living remotely to easily monitor the health and living conditions of elderly individuals. Current technology lacks consistency in audio and visual information delivery, and real-time activity recording and reporting automation is insufficient; therefore, improved solutions are desired.

[0088] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0089] In this invention, the server includes natural language processing means, a storage device for recording and utilizing past conversation information, and a speech synthesis device for outputting generated responses as speech. This enables elderly people to naturally give instructions by voice and receive appropriate responses. The invention also includes a data management device for managing the user's schedule and health information, a communication device for periodically reporting information to the user's family, a display device for providing information audibly and visually, and means for recording the user's activities and reporting the situation to the family. This enables real-time notifications of schedules and reports on health status.

[0090] "Natural language processing means" refers to technologies that convert voice input from users into text data and analyze its intent.

[0091] A "memory device" is a device that records past conversation information and uses it to generate appropriate responses.

[0092] A "speech synthesis device" is a device that converts generated responses into speech data and outputs it to the user audibly.

[0093] A "speech recognition device" is a device that converts obtained speech into text data.

[0094] A "data management device" is a device that manages a user's schedule and health information and sends reminders as needed.

[0095] A "communication device" is a device used to periodically report information to the user's family.

[0096] A "display device" is a device that provides information in both audio and visual form.

[0097] "Means of reporting" refers to methods for recording the user's activities and notifying family members in remote locations of the situation.

[0098] The system for realizing this invention mainly consists of three elements: a server, a terminal, and a user.

[0099] The server uses natural language processing technology to convert user voice input into text data. For example, when a user asks an everyday question, the voice is quickly converted into text, and the intent is analyzed using a generative AI model. The analyzed text data is stored in a memory device that also contains past conversation information, and an appropriate response is generated based on this. This response is sent to the terminal in text format. The hardware and software used include speech recognition APIs (e.g., Google® Cloud Speech-to-Text) and natural language processing libraries (e.g., NLTK, spaCy).

[0100] The terminal converts text responses sent from the server into speech using a speech synthesis device and provides it to the user. The tone and speed of the speech are adjusted to ensure the user can easily understand it. The terminal can also manage the user's schedule and health information using a data management device and send reminders as needed. A speech synthesis API (e.g., Amazon Polly) is used in this process.
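One common way to request tone and speed adjustments from a speech synthesis service is SSML (Speech Synthesis Markup Language), whose `prosody` element carries `rate` and `pitch` attributes. The sketch below only builds the SSML string; the helper name `to_ssml` and the default values are assumptions, no synthesis API is called, and whether a given service or voice honors both attributes should be checked against that service's documentation.

```python
from xml.sax.saxutils import escape

# Label values defined for the SSML prosody element.
ALLOWED_RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
ALLOWED_PITCHES = {"x-low", "low", "medium", "high", "x-high"}


def to_ssml(text: str, rate: str = "slow", pitch: str = "medium") -> str:
    """Wrap reply text in SSML that requests a speaking rate and pitch."""
    if rate not in ALLOWED_RATES:
        raise ValueError(f"unsupported rate: {rate}")
    if pitch not in ALLOWED_PITCHES:
        raise ValueError(f"unsupported pitch: {pitch}")
    # escape() protects &, < and > so the reply cannot break the markup.
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f"{escape(text)}</prosody></speak>")


ssml = to_ssml("Fish & rice")
print(ssml)
```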

[0101] Users interact with the system through a device that provides voice and visual information. For example, they can ask questions like, "What time should I take my medicine?" to their smart device to check their schedule and receive instructions. The user's activities are regularly recorded and reported to their family via a communication device. Family members, even those living far away, can gain peace of mind through this information.

[0102] For example, prompt statements include the following:

[0103] "Tell me when to take my medicine."

[0104] "What's next?"

[0105] "What kind of exercise did you do today?"

[0106] By using these prompts, users can receive support to make their daily lives more comfortable.

[0107] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0108] Step 1:

[0109] The server receives audio input from the user. This audio data is given as input and converted into text data by a speech recognition API (e.g., Google Cloud Speech-to-Text). The converted text data is then passed to the next parsing step.

[0110] Step 2:

[0111] The server analyzes the converted text data using natural language processing libraries (e.g., NLTK, spaCy). It extracts the user's intent from the input text data and searches for relevant information from a memory storage based on past conversation information. In this process, a generative AI model derives an appropriate response to the generated text data. If any other information is needed, it searches again, organizes the information, and passes it on to the next step.

[0112] Step 3:

[0113] The server sends the generated response to the speech synthesis device. Using a speech synthesis API (e.g., Amazon Polly), the text data is converted into speech data. The converted speech data is then sent to the terminal as output.

[0114] Step 4:

[0115] The terminal receives audio data transmitted from the server and outputs it to the user. By having the terminal play the audio for the user, the user receives responses and advice from the system in audio format. The tone and speed of the audio are appropriately adjusted to provide natural and easy-to-listen-to audio.

[0116] Step 5:

[0117] The user listens to the voice output from the terminal and, if necessary, asks additional questions by voice. This new voice input returns the process to Step 1, and processing continues in the same manner. Furthermore, if the user's status or schedule is updated, the terminal uses the data management device to record the information in real time and periodically reports it to the family via the communication device. This activity log enables continuous data management and provides peace of mind.
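The combination of real-time recording and periodic reporting in Step 5 can be sketched as a log that accumulates events and releases them only when a reporting interval has elapsed. The `ActivityLog` class, its method names, the weekly interval, and the sample event are all hypothetical; the specification does not state how often reports are sent or in what form.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class ActivityLog:
    report_interval: timedelta
    last_report: datetime
    events: List[Tuple[datetime, str]] = field(default_factory=list)

    def record(self, when: datetime, what: str) -> None:
        """Store an event as soon as it happens."""
        self.events.append((when, what))

    def collect_if_due(self, now: datetime) -> List[str]:
        """Return pending events as report lines once the interval has elapsed."""
        if now - self.last_report < self.report_interval:
            return []
        lines = [f"{t:%Y-%m-%d %H:%M} {what}" for t, what in self.events]
        self.events.clear()
        self.last_report = now
        return lines


log = ActivityLog(report_interval=timedelta(days=7),
                  last_report=datetime(2024, 12, 9, 8, 0))
log.record(datetime(2024, 12, 10, 9, 30), "Took morning medicine")  # made-up event

too_early = log.collect_if_due(datetime(2024, 12, 12, 8, 0))  # only 3 days later
weekly = log.collect_if_due(datetime(2024, 12, 16, 8, 0))     # 7 days later
print(too_early, weekly)
```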

[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0119] The AI support system for the elderly, which incorporates the emotion engine of this invention, is realized through the interaction of a server, a terminal, and a user. A specific embodiment is shown below.

[0120] The server first receives the user's voice transmitted from the terminal and converts it into text using natural language processing. This text data is then input into the emotion engine, which analyzes the user's emotional state. This emotional information is reflected in the content and tone of the response generated by the server.

[0121] The emotion engine has the function of analyzing the user's emotions based on the content of their speech, tone of voice, tempo, etc. Specifically, if it determines that the user is feeling down, the server generates words of comfort and encouragement and provides a response in a corresponding tone of voice. In addition, by comparing this with past conversation history, it takes into account the user's past emotional state, enabling it to respond more appropriately.

[0122] The device interacts with the user using text data obtained through speech recognition and responses from the server. The text responses sent from the server are converted into speech by a speech synthesis system and output to the user. This speech is given a tone influenced by an emotion engine, enabling natural and emotionally resonant expressions.

[0123] Users communicate with the AI through everyday conversations. For example, if a user says, "I'm feeling a little down today," the server analyzes this based on its emotion engine and generates a response such as, "Is something wrong? Shall we talk?" By providing responses that match the user's emotions in this way, it fosters a sense of security and trust.

[0124] This invention goes beyond simply recording and managing information; it provides a new form of support that is empathetic to the feelings of the elderly through emotion recognition and responses based on those emotions. The system aims to enhance emotional support for users and provide them with a richer life experience through intimate dialogue.

[0125] The following describes the processing flow.

[0126] Step 1:

[0127] The terminal receives voice input from the user. This data is converted into text data using speech recognition technology and sent to the server.

[0128] Step 2:

[0129] The server analyzes the received text data using natural language processing techniques to understand the user's utterances. This analysis reveals the user's requests and intentions.

[0130] Step 3:

[0131] The server inputs text data into the emotion engine, which then analyzes the user's emotional state. Emotions are inferred from factors such as tone of voice, speed, and the words used.

[0132] Step 4:

[0133] The emotion engine returns the analysis results to the server, providing data based on the user's emotions. This data is then used in the subsequent response generation process.

[0134] Step 5:

[0135] The server references the results of the emotion engine and past conversation history to generate an appropriate response. It adjusts the content and tone of voice of the response according to the user's emotional state.

[0136] Step 6:

[0137] The server sends the generated text response to the terminal. During this process, a speech synthesis system is used to convert the text into speech data.

[0138] Step 7:

[0139] The device uses the converted voice data to output a response to the user. Emotion-based tone adjustments are applied, resulting in a natural and friendly voice for the user.

[0140] Step 8:

[0141] The user hears this response and, if they wish to continue the conversation, provides further voice input. This cycle is repeated, enabling continuous conversation.
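The eight steps above can be sketched as a single conversational turn in which each stage is supplied as a function. This is only an illustration: the stage signatures and the name `run_turn` are assumptions, and the sketch ignores where each stage runs (terminal or server) and folds the intent analysis of step 2 into the response generator.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],                 # step 1: speech recognition
    estimate_emotion: Callable[[str], str],            # steps 3-4: emotion engine
    generate: Callable[[str, str, list[str]], str],    # steps 2 and 5: understand and respond
    synthesize: Callable[[str, str], bytes],           # steps 6-7: speech synthesis with tone
    history: list[str],
) -> bytes:
    """Process one user utterance and return the audio to play back."""
    text = recognize(audio)
    emotion = estimate_emotion(text)
    reply = generate(text, emotion, history)
    # Record both sides of the exchange so that later turns can refer to them.
    history.extend([text, reply])
    return synthesize(reply, emotion)
```

Step 8 corresponds to calling `run_turn` again with the next utterance and the same `history` list.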

[0142] (Example 2)

[0143] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0144] There is a need for technologies that alleviate the emotional and informational anxieties and feelings of loneliness that older adults face in their daily lives, and that provide smoother and more approachable communication. Furthermore, there is a need to strengthen emotional support for older adults through appropriate voice responses that take their emotions into consideration.

[0145] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0146] In this invention, the server includes natural language processing means, memory means for recording and utilizing past utterances, and speech synthesis means for outputting the generated response as sound. This enables the detection of the user's emotional state and the generation of natural speech responses accordingly, thereby fostering a sense of security and trust in the elderly and providing emotional support.

[0147] "Natural language processing means" refers to technologies that analyze input character data, understand its content, and generate appropriate responses.

[0148] "Memory means" refers to a function that stores past utterances and allows them to be referenced and used as needed.

[0149] "Speech synthesis means" refers to a technology for converting a generated response into sound and outputting it.

[0150] "Speech recognition means" refers to technology that converts input audio data into text data.

[0151] "Information management means" refers to functions for organizing and managing users' schedule information and health information.

[0152] "Emotion analysis means" refers to technologies used to determine a user's emotional state from the content of their speech and the intonation of their voice.

[0153] "Communication means" refers to a function that periodically reports information to the user's relatives and related parties.

[0154] The AI support system for the elderly according to the present invention is realized through the interaction of a server, a terminal, and a user. Its main objective is to provide elderly individuals with natural, empathetic dialogue and to support their emotional stability.

[0155] The server first receives the audio data sent from the terminal and converts it into text data using speech recognition. This conversion typically uses a speech recognition API. Subsequently, natural language processing technology (e.g., a generative AI model) is used to analyze the text data and identify the emotions contained in the user's utterances. The emotion analysis refers to the user's past conversation history, and its result is reflected in the content of the generated response.

[0156] Specifically, if a user says something like, "I'm feeling a little lonely today," the server analyzes that statement using sentiment analysis tools and generates a response that empathizes with the user's feelings, such as, "How was your day?" This process utilizes a generative AI model, and an example of a prompt would be, "Generate a kind response for when the user is feeling lonely."

[0157] The terminal receives text data from the server, converts it into speech using speech synthesis technology, and outputs it to the user. This conversion enables a natural conversational format with an acoustic tone that reflects emotions.

[0158] Users can communicate with the system through everyday conversations and receive emotional support based on the responses. This allows users to gain a sense of security and trust, and receive support to enjoy a better life experience.

[0159] Overall, this system aims to improve the quality of life for the elderly, providing a new form of support through technological elements and user interaction.

[0160] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0161] Step 1:

[0162] The server receives audio data transmitted from the terminal. The input is the user's speech, which is captured as audio data through the terminal's microphone. The server records this audio data as digital data and prepares it for the next processing step.

[0163] Step 2:

[0164] The server converts the received audio data into text data using speech recognition. For example, this process involves converting audio data into a string using a speech recognition API. The input here is audio data, and the output is text data that can be processed by a machine.

[0165] Step 3:

[0166] The server uses a generative AI model to analyze text data and determine the user's emotional state. In this step, the input is text data obtained through speech recognition. The generative AI model uses prompt sentences to analyze emotions and outputs the user's emotional state (e.g., joy, sadness, etc.).

[0167] Step 4:

[0168] The server generates an appropriate response using natural language processing based on the results of sentiment analysis. The input consists of the user's emotional state and past conversation history, and based on this, it outputs a response text that is emotionally empathetic to the user. Specifically, it creates a response using the prompt example "Generate a gentle response when the user is feeling lonely."

[0169] Step 5:

[0170] The terminal converts the response text sent from the server into speech using speech synthesis technology and plays it back to the user. The input is the generated response text, which is delivered to the user as a human-like voice output using speech synthesis technology. In this step, the user can continue the conversation with the system by listening to the voice response from the terminal.

[0171] Step 6:

[0172] After receiving a response from the system, the user provides further voice input. This interaction is continuous, and the server repeats the process from step 1 as it receives new voice data. User feedback and new utterances help improve sentiment analysis and the quality of responses.
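Steps 3 and 4 above can be sketched as two prompt-driven calls to a generative AI model. In this sketch the model is passed in as a plain function from prompt text to answer text. That function, the label set `EMOTIONS`, the fallback to "neutral", and the five-entry history window are all assumptions made for illustration. Only the wording of the response prompt is taken from the description.

```python
from typing import Callable

# Assumed label set; the description only gives "joy, sadness, etc." as examples.
EMOTIONS = ("happy", "sad", "lonely", "angry", "neutral")


def classify_emotion(text: str, model: Callable[[str], str]) -> str:
    """Step 3: ask the model for a single emotion label."""
    prompt = (
        "Classify the emotion of the following utterance as one of: "
        + ", ".join(EMOTIONS)
        + ".\nUtterance: "
        + text
        + "\nAnswer with one word."
    )
    answer = model(prompt).strip().lower()
    # Fall back to a safe default when the model answers outside the label set.
    return answer if answer in EMOTIONS else "neutral"


def build_response_prompt(emotion: str, history: list[str], text: str) -> str:
    """Step 4: combine the emotional state and past conversation into one prompt."""
    lines = [f"Generate a gentle response when the user is feeling {emotion}."]
    if history:
        lines.append("Past conversation:")
        lines.extend(f"- {item}" for item in history[-5:])
    lines.append(f"User: {text}")
    return "\n".join(lines)
```

The prompt returned by `build_response_prompt` would then be passed to the same model to obtain the response text that the terminal reads aloud in step 5.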

[0173] (Application Example 2)

[0174] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0175] There is a problem of insufficient emotional support in the daily lives of the elderly. Conventional information provision systems have difficulty generating responses that are sensitive to the user's feelings, and thus have the challenge of not being able to adequately provide the elderly with a sense of psychological security and trust.

[0176] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0177] In this invention, the server includes a natural language processing means, an emotion analysis device means, and a generation means for generating responses that correspond to the user's emotional state. This makes it possible to accurately grasp the emotions of elderly people through their daily conversations and provide emotionally expressive responses based on that understanding.

[0178] "Natural language processing" refers to technologies that convert information input via speech or text into a format that a computer can understand, and analyze the user's intent and meaning.

[0179] A "memory device" is a device that has the function of storing past dialogue information and data, and retrieving and using it as needed.

[0180] "Speech synthesis device means" refers to a technology for converting generated text data into speech and outputting it in a natural form.

[0181] "Speech recognition device means" refers to a technology that analyzes speech collected from users and converts it into text data.

[0182] A "data management device means" is a technology that provides convenience by organizing and managing information related to users' schedules and health.

[0183] A "communication device" is a device that has the function of periodically reporting and sharing information with the user's family and related parties.

[0184] An "emotion analysis device means" is a technology that analyzes a user's emotional state from their voice or text information and identifies that emotion.

[0185] "Generation means" refers to techniques for generating appropriate responses based on the analyzed results.

[0186] The system implementing this invention mainly consists of a server, a terminal, and a user. First, the server uses a speech recognition device to convert the audio data transmitted from the terminal into text data. Next, this text data is processed by a natural language analysis device, and the user's emotional state is identified by an emotion analysis device. Based on this, a generation device generates a response, which is then output again as audio by a speech synthesis device.

[0187] The terminal receives speech synthesis results from the server and provides responses to the user through voice output. This allows the user to communicate with the system through dialogue. In particular, emotion analysis makes it possible to naturally express words of comfort and encouragement to a depressed user.

[0188] The hardware used includes smart speakers and smartphones, while the software includes natural language processing libraries, speech recognition engines, and speech synthesis engines. Specifically, the "SpeechRecognition" library can be used as the speech recognition engine, and generative AI models for natural language processing can be used.

[0189] For example, if a user says, "Yesterday was a good day," the server will generate a response such as, "That's great. What are your plans for today?" A specific prompt might be, "Generate words of encouragement based on the user's mood today." This system allows elderly people to receive emotional support in their daily lives and gain a sense of security and trust.

[0190] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0191] Step 1:

[0192] The user provides voice input to the terminal. The terminal captures this voice data through its microphone and passes it to the speech recognition engine within the terminal.

[0193] Step 2:

[0194] The terminal uses the speech recognition engine to convert the speech data into text data. This process captures the content of the speech as text information. In this case, the input is speech data, and the output is the text data of that speech.

[0195] Step 3:

[0196] The server receives text data from the terminal and analyzes the text using natural language processing (NLP) tools. Through this analysis, the user's utterances and intentions are understood. The input for the analysis is text data, and the output is intent data derived from the analysis of that text.

[0197] Step 4:

[0198] The server uses an emotion analysis device to identify the user's emotional state from intent data. This allows the server to determine the user's current emotions. The input is the analyzed intent data, and the output is the estimated emotional state.

[0199] Step 5:

[0200] The server uses a generation mechanism to generate a response based on the identified emotional state. This process utilizes a generative AI model. The input is the emotional state, and the output is an appropriate response sentence corresponding to that emotion.

[0201] Step 6:

[0202] The server uses a speech synthesis device to convert the generated response text into audio data. This prepares an audio response that the user can hear. The input is a response text in text format, and the output is the audio data of that text.

[0203] Step 7:

[0204] The terminal receives audio data from the server and plays the response to the user via the speaker. The user receives the response as audio. The output from the speaker is audio data generated by the server.
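In the flow above, the terminal sends recognized text to the server (steps 2 and 3) and later receives synthesized audio back (steps 6 and 7). The description does not specify a wire format, so the following is only a sketch that assumes JSON messages with base64-encoded audio. The field names `user_id`, `text`, `reply_text`, and `audio_b64` are hypothetical.

```python
import base64
import json


def make_request(user_id: str, text: str) -> str:
    """Terminal to server: the text produced by speech recognition."""
    return json.dumps({"user_id": user_id, "text": text}, ensure_ascii=False)


def make_response(reply_text: str, audio: bytes) -> str:
    """Server to terminal: synthesized audio, base64-encoded so it fits in JSON."""
    return json.dumps(
        {
            "reply_text": reply_text,
            "audio_b64": base64.b64encode(audio).decode("ascii"),
        },
        ensure_ascii=False,
    )


def read_audio(response_json: str) -> bytes:
    """Terminal side: recover the raw audio bytes to play through the speaker."""
    return base64.b64decode(json.loads(response_json)["audio_b64"])
```

Carrying the reply text alongside the audio is a design choice of this sketch. It would let a terminal with a display show the response as well as play it.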

[0205] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0206] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
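The input and output relationship described for the data generation model 58 (a prompt plus inference data in, an inference result out) can be written down as a small interface. The following is a sketch only: the names `InferenceData`, `DataGenerationModel`, and `infer` are hypothetical, and `UppercaseModel` is a toy stand-in rather than a generative model.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class InferenceData:
    kind: str       # "audio", "text", or "image"
    payload: bytes


class DataGenerationModel(Protocol):
    """Anything that maps (prompt, inference data) to an inference result."""

    def infer(self, prompt: str, data: InferenceData) -> InferenceData:
        ...


class UppercaseModel:
    """Toy stand-in: ignores the prompt and upper-cases text input."""

    def infer(self, prompt: str, data: InferenceData) -> InferenceData:
        if data.kind != "text":
            raise ValueError("this stand-in only handles text input")
        text = data.payload.decode("utf-8")
        return InferenceData("text", text.upper().encode("utf-8"))


def run_inference(
    model: DataGenerationModel, prompt: str, data: InferenceData
) -> InferenceData:
    """Call site that depends only on the interface, not on a specific model."""
    return model.infer(prompt, data)
```

Writing the call site against the interface keeps the surrounding processing unchanged when one model is exchanged for another.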

[0207] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0208] [Second Embodiment]

[0209] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0210] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0211] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0212] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0213] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0214] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0215] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0216] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0217] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0218] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0219] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0220] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0221] The AI support system for the elderly, as described in this invention, is implemented through the cooperation of a server, a terminal, and a user. A specific embodiment is shown below.

[0222] The server first uses natural language processing to convert the user's voice input into text data and analyzes its intent. The analyzed data is then compared with past conversation history and recorded in memory. Based on this information, the server uses a generative model to create an appropriate response. This response is then sent back to the terminal in text format.

[0223] The terminal converts text data sent from the server into speech data using speech synthesis and outputs it audibly to the user. In this process, the terminal adjusts the tone and speed of the speech to provide it in a natural and easy-to-understand manner for the user. The terminal also converts newly inputted speech from the user into text using speech recognition and sends that information to the server.

[0224] Users engage in everyday conversations through this system. For example, if a user asks, "What did I eat for dinner yesterday?", the server retrieves past data from its memory and generates a specific response such as, "Yesterday's meal was fish." In this way, users can receive support that contributes to reducing feelings of loneliness and maintaining cognitive function through continuous conversation.
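The memory lookup in the dinner example can be sketched as a small store keyed by date and topic. Tagging each remembered item with a topic such as "dinner" is an assumption of this sketch. The description only states that past data is retrieved from memory, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class MemoryEntry:
    day: date
    topic: str      # e.g. "dinner"; topic tagging is an assumption of this sketch
    content: str


@dataclass
class ConversationMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def remember(self, day: date, topic: str, content: str) -> None:
        self.entries.append(MemoryEntry(day, topic, content))

    def recall(self, day: date, topic: str) -> Optional[str]:
        # Search newest first so that a later correction wins.
        for entry in reversed(self.entries):
            if entry.day == day and entry.topic == topic:
                return entry.content
        return None


def answer_yesterday(memory: ConversationMemory, topic: str, today: date) -> str:
    """Answer a question of the form 'What did I have for <topic> yesterday?'."""
    content = memory.recall(today - timedelta(days=1), topic)
    if content is None:
        return f"I don't have a record of yesterday's {topic}."
    return f"Yesterday's {topic} was {content}."
```

Returning an explicit "no record" sentence, rather than letting a generative model guess, is a design choice of this sketch.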

[0225] Furthermore, the data management system manages the user's schedule and health information in real time and sets reminders as needed. For example, if a user says, "I want to check my doctor's appointment for next week," the device checks the schedule information and responds, "It's next Monday at 10:00 AM."
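The schedule check in this example can be sketched as a lookup followed by phrasing the appointment relative to the current week. The dictionary-based schedule, the keyword lookup, and the function names are assumptions for illustration. Weeks are taken to start on Monday.

```python
from datetime import date, datetime, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def describe_appointment(when: datetime, today: date) -> str:
    """Phrase an appointment time relative to the current calendar week."""
    this_monday = today - timedelta(days=today.weekday())
    appt_monday = when.date() - timedelta(days=when.weekday())
    weeks_ahead = (appt_monday - this_monday).days // 7
    hour12 = when.hour % 12 or 12
    clock = f"{hour12}:{when.minute:02d} {'AM' if when.hour < 12 else 'PM'}"
    day_name = DAY_NAMES[when.weekday()]
    if weeks_ahead == 0:
        return f"It's this {day_name} at {clock}."
    if weeks_ahead == 1:
        return f"It's next {day_name} at {clock}."
    # Outside this week and next week, fall back to the calendar date.
    return f"It's on {when.date().isoformat()} at {clock}."


def answer_schedule_query(schedule: dict[str, datetime], keyword: str, today: date) -> str:
    """Look up an appointment by keyword (hypothetical schedule layout)."""
    when = schedule.get(keyword)
    if when is None:
        return f"I could not find a {keyword} appointment."
    return describe_appointment(when, today)
```

With a doctor's appointment stored for Monday, 13 January 2025 at 10:00 and a query made on Wednesday, 8 January 2025, `answer_schedule_query` returns "It's next Monday at 10:00 AM."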

[0226] Reports to family members are made via communication methods. The server periodically analyzes the user's lifestyle and health status and sends the results to the family as a report. This makes it easier for family members to understand the elderly person's condition, even when they are in a remote location, and provides them with peace of mind.

[0227] This invention provides an effective means for elderly people to live fulfilling lives with peace of mind while strengthening their ties with their families.

[0228] The following describes the processing flow.

[0229] Step 1:

[0230] The terminal receives voice input from the user. A speech recognition system is used to convert the voice data into text data. This text data is then sent to a server for analysis of the user's intent.

[0231] Step 2:

[0232] The server analyzes the received text data using natural language processing techniques. The analysis helps understand the user's intent and the context of the conversation, and retrieves relevant information from past conversation history.

[0233] Step 3:

[0234] Based on the information it has acquired, the server uses a generative model to generate an appropriate response to the user's utterance. In this process, past history is also considered to enable context-specific responses.

[0235] Step 4:

[0236] The server sends the generated text response to the terminal. The terminal then uses a speech synthesis system to convert this text data into speech and outputs it to the user.

[0237] Step 5:

[0238] The user listens to the device's audio output and, if they wish to continue the conversation, provides new audio input. This iterative process enables continuous conversation.

[0239] Step 6:

[0240] The data management system updates schedules and health information based on the user's new utterances and sets reminders as needed.

[0241] Step 7:

[0242] The server periodically reports the user's living situation and health information to the family via a communication method. This allows the family to properly understand the user's condition.
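Steps 6 and 7 can be sketched as an activity log that is summarized into a report once a reporting interval has elapsed. The weekly interval, the entry categories, the report layout, and all names are assumptions. The description only states that reports are sent periodically, and it does not specify how they are delivered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    at: datetime
    kind: str   # "health", "schedule", or "conversation" (assumed categories)
    note: str


def report_due(
    last_sent: datetime,
    now: datetime,
    interval: timedelta = timedelta(days=7),   # assumed weekly reporting
) -> bool:
    """Return True when the reporting interval has elapsed."""
    return now - last_sent >= interval


def build_report(user_name: str, log: list[LogEntry], since: datetime) -> str:
    """Summarize log entries recorded at or after `since`, oldest first."""
    recent = sorted((e for e in log if e.at >= since), key=lambda e: e.at)
    lines = [f"Report for {user_name}", f"Entries: {len(recent)}"]
    for entry in recent:
        lines.append(f"{entry.at:%Y-%m-%d %H:%M} [{entry.kind}] {entry.note}")
    return "\n".join(lines)
```

A scheduler on the server would call `report_due` at intervals and, when it returns True, pass the output of `build_report` to whatever channel the communication means provides.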

[0243] (Example 1)

[0244] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0245] As society ages, many elderly people experience loneliness in their daily lives and face challenges in managing their health. To address these challenges, there is a need for support systems that enable the elderly to live their daily lives with peace of mind and communicate effectively with their families. In addition, there is a need for technology that reduces noise and unnatural speech when using voice interfaces, providing a more comfortable and natural conversational environment.

[0246] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0247] In this invention, the server includes a natural language processing unit, a storage device for recording and referencing past conversation information, a speech synthesis device for outputting generated responses as speech, means for filtering noise when converting speech data to text data, and means for adjusting the tone and speed of the generated responses. This makes it possible for elderly people to receive daily support while reducing loneliness, and to provide a sense of security to their families. Furthermore, even when using a voice interface, noise reduction and natural speech output enable smoother use.

[0248] A "natural language processing device" is a device that utilizes technology to understand and analyze human language in order to generate appropriate responses.

[0249] A "memory device" is a device that stores past conversation information and uses it for future matching and response generation.

[0250] A "speech synthesis device" is a device that converts generated text data into speech and provides it to the user audibly.

[0251] A "speech recognition device" is a device that converts a user's speech into digital data, making it possible to process it as text.

[0252] An "information management device" is a device that systematically manages users' schedules and health-related information, and provides appropriate instructions and notifications.

[0253] A "communication device" is a device used to report the user's status and necessary information to family members in a remote location according to certain standards.

[0254] "Methods for filtering noise" refer to methods for removing unwanted acoustic noise from received audio data, making it easier to analyze the pure speech content.

[0255] "Means of adjusting tone and speed" refer to methods for adjusting generated audio data to provide users with natural and easily understandable audio.
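The description does not say which noise filtering method is used. As one simple possibility, the following sketch applies a frame-based noise gate that silences frames whose energy falls below a threshold. The frame size of 160 samples (10 ms at an assumed 16 kHz sampling rate), the threshold, and the function name are illustrative assumptions. Production systems typically use more elaborate techniques.

```python
import math


def noise_gate(
    samples: list[float],
    frame_size: int = 160,      # 10 ms at an assumed 16 kHz sampling rate
    threshold: float = 0.02,    # assumed RMS level separating noise from speech
) -> list[float]:
    """Zero out frames whose RMS energy is below the threshold."""
    gated: list[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            gated.extend(frame)
        else:
            gated.extend([0.0] * len(frame))
    return gated
```

The gated samples would then be passed to speech recognition, so that quiet background noise between utterances is not transcribed.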

[0256] This invention, an AI support system for the elderly, is built on the interaction between a server, a terminal, and the user.

[0257] The server first receives voice data input from the user and converts it into text data using a speech recognition device. This process utilizes a natural language processing unit to understand human speech. Specifically, common cloud-based speech recognition technology is used to convert speech to text. The converted text is stored in memory and used to refer to past conversation information.

[0258] The terminal receives text data sent from the server and converts it into speech data using a speech synthesis device. During this process, the tone and speed of the synthesized speech are adjusted to enhance the naturalness of the generated response. For example, the terminal can use speech synthesis technology to convey specific details such as "Yesterday we had fish" in a soft, natural way.
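One common way to request a particular tone and speed from a speech synthesis engine is SSML (Speech Synthesis Markup Language), whose `prosody` element carries `rate` and `pitch` attributes. Whether the speech synthesis device described here accepts SSML is an assumption, as are the default values "slow" and "low" and the function name `to_ssml`.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate: str = "slow", pitch: str = "low") -> str:
    """Wrap response text in an SSML prosody element.

    `rate` and `pitch` take SSML prosody values such as "slow" or "low".
    The text is XML-escaped so that characters like "&" stay valid markup.
    """
    return (
        f'<speak><prosody rate="{rate}" pitch="{pitch}">'
        f"{escape(text)}"
        "</prosody></speak>"
    )
```

For the example sentence, `to_ssml("Yesterday we had fish")` yields `<speak><prosody rate="slow" pitch="low">Yesterday we had fish</prosody></speak>`.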

[0259] Users can use this system to engage in everyday conversations and alleviate feelings of loneliness. For example, if a user says, "I want to check my doctor's appointment for next week," the server will use its information management device to check the schedule data and generate an appropriate response. Specifically, it can provide a response such as, "It's next Monday at 10:00 AM."

[0260] Furthermore, an example of a prompt sent to the generative AI model is, "Considering that the user is elderly, please gently describe what they had for lunch yesterday." Based on this prompt, the model generates a more appropriate response, which is delivered to the user.

[0261] This invention allows users to receive daily support while their families gain peace of mind through regular reports on the user's condition. The system maintains the quality of voice data through noise filtering, creating a natural and easy-to-understand conversational environment for the user.

[0262] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0263] Step 1:

[0264] The user inputs voice into the device's microphone. The device has the capability to process this voice data based on cloud-based speech recognition technology. The input is the user's spoken voice, and the output is digitized voice data. The device sends this voice data to a server for further processing.

[0265] Step 2:

[0266] The server uses the received audio data to perform analysis with a natural language processing unit. This process uses speech recognition technology to convert the audio data into text data, thereby understanding the user's intent. Here, the input is digital audio data, and the output is the analyzed text data. Furthermore, the server retrieves past conversation information from its storage device and compares it with historical information related to the intent.

[0267] Step 3:

[0268] The server references the text data and conversation history, composes an appropriate prompt, and inputs it to a generative AI model. Based on this prompt, the model generates a response that matches the user's request. The input is the text data together with the matched history information, and the output is a natural language response text based on the prompt.

[0269] Step 4:

[0270] The generated response text is sent from the server to the terminal, which uses speech synthesis technology to convert this text into speech data. The input here is a response text in natural language, and the output is speech data with adjusted tone and speed. The terminal then provides this speech data to the user audibly through its speaker.

[0271] Step 5:

[0272] The user listens to the audio output from the device and decides whether to continue the conversation based on it. In this step, the input is the audio response, and the output is the user's experience and information. If additional information is needed, the user restarts the conversation from step 1.
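Step 3 of this flow combines the recognized text with matched history before calling the generative AI model. The following sketch shows one way to assemble such a prompt. The instruction sentence is adapted from the prompt example in the description, while the layout of the prompt, the "no matching record" line, and the function name are assumptions.

```python
def build_grounded_prompt(question: str, facts: list[str]) -> str:
    """Combine the user's question with records retrieved from conversation history."""
    header = "Considering that the user is elderly, please answer gently and briefly."
    if facts:
        context = "\n".join(f"- {fact}" for fact in facts)
    else:
        # Assumption: tell the model explicitly when nothing was found.
        context = "- (no matching record; say so rather than guessing)"
    return f"{header}\nKnown records:\n{context}\nUser question: {question}"
```

For example, the question "What did I have for lunch yesterday?" with the retrieved record "Lunch yesterday: fish" produces a four-line prompt whose last line repeats the user's question.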

[0273] (Application Example 1)

[0274] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0275] There is a need to provide effective support to alleviate the loneliness and anxiety that elderly people face in their daily lives and to maintain their cognitive function. Furthermore, a system is needed that allows family members living remotely to easily monitor the health and living conditions of elderly individuals. Current technology lacks consistency in audio and visual information delivery, and real-time activity recording and reporting automation is insufficient; therefore, improved solutions are desired.

[0276] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0277] In this invention, the server includes natural language processing means, a storage device for recording and utilizing past conversation information, and a speech synthesis device for outputting generated responses as speech. This enables elderly people to naturally give instructions by voice and receive appropriate responses. The invention also includes a data management device for managing the user's schedule and health information, a communication device for periodically reporting information to the user's family, a display device for providing information audibly and visually, and means for recording the user's activities and reporting the situation to the family. This enables real-time notifications of schedules and reports on health status.

[0278] "Natural language processing means" refers to technologies that convert voice input from users into text data and analyze its intent.

[0279] A "memory device" is a device that records past conversation information and uses it to generate appropriate responses.

[0280] A "speech synthesis device" is a device that converts generated responses into speech data and outputs it to the user audibly.

[0281] A "speech recognition device" is a device that converts obtained speech into text data.

[0282] The "data management device" is a device for managing users' schedules and health information and notifying reminders as needed.

[0283] The "communication device" is a device for regularly reporting information to the user's family members.

[0284] The "display device" is a device for providing information both audibly and visually.

[0285] The "means for reporting" is a means for recording the user's activities and notifying the situation to family members in a remote location.

[0286] The system for realizing this invention mainly consists of three elements: a server, a terminal, and a user.

[0287] The server uses natural language processing technology to convert the user's voice input into text data. For example, when the user makes an everyday query, the voice is quickly converted into text, and the intent is analyzed using a generative AI model. The analyzed text data is stored in a storage device containing past conversation information, and an appropriate response is generated based on this. This response is sent to the terminal in text format. The hardware and software used include a speech recognition API (e.g., Google Cloud Speech-to-Text) and a natural language processing library (e.g., NLTK, spaCy).

[0288] The terminal converts the text response sent from the server into voice using a speech synthesis device and provides it to the user. The tone and speed of the voice are adjusted so that the user can easily understand the voice. Also, the terminal can manage the user's schedule and health information using the data management device and notify reminders as needed. In this process, a speech synthesis API (e.g., Amazon Polly) is used.
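One common way to express tone and speed adjustments to a speech synthesis engine is SSML markup. The sketch below only builds the markup string and does not call any API; the function name `to_ssml` and the default values are hypothetical, and whether a particular engine accepts these exact `rate` and `pitch` values should be checked against that engine's documentation.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate_percent: int = 90, pitch_percent: int = 0) -> str:
    """Wrap response text in an SSML prosody element.

    rate_percent: speaking rate relative to normal (100 = unchanged).
    pitch_percent: relative pitch change, written with an explicit sign.
    """
    pitch = f"{pitch_percent:+d}%"
    return (
        f'<speak><prosody rate="{rate_percent}%" pitch="{pitch}">'
        f"{escape(text)}</prosody></speak>"  # escape &, <, > in the text
    )
```

A slightly slower rate, as in the default above, is one plausible setting for listeners who benefit from unhurried speech.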

[0289] Users interact with the system through a device that provides voice and visual information. For example, they can ask questions like, "What time should I take my medicine?" to their smart device to check their schedule and receive instructions. The user's activities are regularly recorded and reported to their family via a communication device. Family members, even those living far away, can gain peace of mind through this information.

[0290] For example, prompt statements include the following:

[0291] "Tell me when to take my medicine."

[0292] "What's next?"

[0293] "What kind of exercise did you do today?"

[0294] By using these prompts, users can receive support to make their daily lives more comfortable.
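As an illustration of how a request such as "Tell me when to take my medicine" might be answered from schedule data, the following sketch looks up the next scheduled time for a label. The `SCHEDULE` table, the labels, and the reply wording are hypothetical; the disclosure does not specify how the data management device stores schedules.

```python
from datetime import datetime, time
from typing import Optional

# Hypothetical schedule store: label -> times of day.
SCHEDULE = {"medicine": [time(8, 0), time(20, 0)], "walk": [time(15, 30)]}


def next_occurrence(label: str, now: datetime) -> Optional[time]:
    """Return the next time today for the label, else the first time tomorrow."""
    times = sorted(SCHEDULE.get(label, []))
    if not times:
        return None
    for t in times:
        if t > now.time():
            return t
    return times[0]  # nothing left today, so wrap to the first entry


def answer(label: str, now: datetime) -> str:
    """Build a short reply text for the speech synthesis step."""
    t = next_occurrence(label, now)
    if t is None:
        return f"I have no {label} scheduled."
    return f"Your next {label} is at {t.strftime('%H:%M')}."
```

For example, `answer("medicine", datetime(2024, 12, 9, 9, 0))` returns `"Your next medicine is at 20:00."`.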

[0295] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0296] Step 1:

[0297] The server receives audio input from the user. This audio data is given as input and converted into text data by a speech recognition API (e.g., Google Cloud Speech-to-Text). The converted text data is then passed to the next parsing step.

[0298] Step 2:

[0299] The server analyzes the converted text data using natural language processing libraries (e.g., NLTK, spaCy). It extracts the user's intent from the input text data and searches a storage device for relevant information based on past conversation information. In this process, a generative AI model derives an appropriate response to the converted text data. If further information is needed, the server searches again, organizes the information, and passes it on to the next step.

[0300] Step 3:

[0301] The server sends the generated response to the speech synthesis device. Using a speech synthesis API (e.g., Amazon Polly), the text data is converted into speech data. The converted speech data is then sent to the terminal as output.

[0302] Step 4:

[0303] The terminal receives audio data transmitted from the server and outputs it to the user. By having the terminal play the audio for the user, the user receives responses and advice from the system in audio format. The tone and speed of the audio are appropriately adjusted to provide natural and easy-to-listen-to audio.

[0304] Step 5:

[0305] The user hears the voice output from the device and, if necessary, asks additional questions via voice. This new voice input returns to step 1, and the process continues in the same manner. Furthermore, if the user's status or schedule is updated, the device uses a data management device to record the information in real time and periodically reports it to the family via a communication device. This activity log enables continuous data management and provides peace of mind.
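The flow of Steps 1 to 5 can be sketched as a single conversational turn with pluggable stages. This is a minimal sketch, not the implementation of the invention: `run_turn` and the three `fake_*` stand-ins are hypothetical, and in practice the stages would be backed by a speech recognition API, a generative AI model, and a speech synthesis API.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],
    respond: Callable[[str, list[str]], str],
    synthesize: Callable[[str], bytes],
    memory: list[str],
) -> bytes:
    """Process one voice input and return audio for the terminal to play."""
    text = recognize(audio)        # Step 1: speech to text
    reply = respond(text, memory)  # Step 2: analyze using past conversation
    memory.extend([text, reply])   # record the exchange for later turns
    return synthesize(reply)       # Step 3: text to speech (Step 4 plays it)


# Stand-in stages so the sketch runs without external services.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str, memory: list[str]) -> str:
    return f"You said: {text} (turn {len(memory) // 2 + 1})"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `run_turn` again with the same `memory` list corresponds to Step 5, where a follow-up question returns to Step 1.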

[0306] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0307] The AI support system for the elderly combined with the emotional engine of the present invention is realized by the interaction of the server, the terminal, and the user. The following shows its specific embodiments.

[0308] The server first receives the user's voice transmitted from the terminal and converts it into text using natural language processing means. This text data is input into the emotional engine to analyze the user's emotional state. The emotional information is reflected in the content and tone of the response generated by the server.

[0309] The emotional engine analyzes emotions based on the content of the user's utterances, voice tone, tempo, and so on. Specifically, when it is determined that the user is depressed, the server generates words of comfort and encouragement and delivers the response in a matching voice tone. Also, by comparing with the past conversation history, the server can take into account the user's past emotional states and respond more appropriately.

[0310] The terminal uses the text data obtained by voice recognition and the response from the server to interact with the user. The text response sent from the server is converted into voice by voice synthesis means and output to the user. The voice tone influenced by the emotional engine is applied to this voice, enabling a natural and empathetic expression.

[0311] The user communicates with the AI through daily conversations. For example, when the user says "I'm a little lacking in energy today", the server analyzes this using the emotional engine and generates a response such as "Did something happen? Shall we talk about it?". Returning a response that matches the user's emotions in this way fosters a sense of security and trust.

[0312] This invention goes beyond simply recording and managing information; it provides a new form of support that is empathetic to the feelings of the elderly through emotion recognition and responses based on those emotions. The system aims to enhance emotional support for users and provide them with a richer life experience through intimate dialogue.

[0313] The following describes the processing flow.

[0314] Step 1:

[0315] The terminal receives voice input from the user. This data is converted into text data using speech recognition technology and sent to the server.

[0316] Step 2:

[0317] The server analyzes the received text data using natural language processing techniques to understand the user's utterances. This analysis reveals the user's requests and intentions.

[0318] Step 3:

[0319] The server inputs text data into the emotion engine, which then analyzes the user's emotional state. Emotions are inferred from factors such as tone of voice, speed, and the words used.

[0320] Step 4:

[0321] The emotion engine returns the analysis results to the server, providing data based on the user's emotions. This data is then used in the subsequent response generation process.

[0322] Step 5:

[0323] The server references the results of the emotion engine and past conversation history to generate an appropriate response. It adjusts the content and tone of voice of the response according to the user's emotional state.

[0324] Step 6:

[0325] The server sends the generated text response to the terminal. During this process, a speech synthesis system is used to convert the text into speech data.

[0326] Step 7:

[0327] The terminal uses the converted voice data to output a response to the user. Emotion-based tone adjustments are applied, resulting in a natural and friendly voice for the user.

[0328] Step 8:

[0329] The user hears this response and, if they wish to continue the conversation, provides further voice input. This cycle is repeated, enabling continuous conversation.
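The emotion-dependent tone adjustment in Steps 3 to 7 can be illustrated with a deliberately simple stand-in. The sketch below estimates emotion from words only, whereas the text also mentions tone of voice and speed; the word lists, the three labels, and the `Voice` parameter values are all hypothetical, and a real system would use the emotion identification model instead.

```python
from dataclasses import dataclass

# Hypothetical word lists used only for this illustration.
NEGATIVE = {"lonely", "tired", "sad", "worried"}
POSITIVE = {"good", "happy", "great", "fun"}


@dataclass
class Voice:
    rate: float   # speaking rate, 1.0 = normal
    pitch: float  # pitch offset in semitones


VOICE_BY_EMOTION = {
    "sad": Voice(0.9, -1.0),     # slower and lower: calm, comforting
    "happy": Voice(1.05, 1.0),   # slightly brighter
    "neutral": Voice(1.0, 0.0),
}


def estimate_emotion(text: str) -> str:
    """Return "sad", "happy", or "neutral" by counting listed words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    neg, pos = len(words & NEGATIVE), len(words & POSITIVE)
    if neg > pos:
        return "sad"
    if pos > neg:
        return "happy"
    return "neutral"


def voice_for(text: str) -> Voice:
    """Pick synthesis parameters that match the estimated emotion."""
    return VOICE_BY_EMOTION[estimate_emotion(text)]
```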

[0330] (Example 2)

[0331] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0332] There is a need for technologies that alleviate the emotional and informational anxieties and feelings of loneliness that older adults face in their daily lives, and that provide smoother and more approachable communication. Furthermore, there is a need to strengthen emotional support for older adults through appropriate voice responses that take their emotions into consideration.

[0333] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0334] In this invention, the server includes natural language processing means, memory means for recording and utilizing past utterances, and speech synthesis means for outputting the generated response as sound. This enables the detection of the user's emotional state and the generation of natural speech responses accordingly, thereby fostering a sense of security and trust in the elderly and providing emotional support.

[0335] "Natural language processing means" refers to technologies that analyze input character data, understand its content, and generate appropriate responses.

[0336] "Memory means" refers to a function that stores past utterances and allows them to be referenced and used as needed.

[0337] "Speech synthesis means" refers to a technology for converting a generated response into sound and outputting it.

[0338] "Speech recognition means" refers to technology that converts input audio data into text data.

[0339] "Information management means" refers to functions for organizing and managing users' schedule information and health information.

[0340] "Emotion analysis means" refers to technologies used to determine a user's emotional state from the content of their speech and the intonation of their voice.

[0341] "Communication means" refers to a function that periodically reports information to the user's relatives and related parties.

[0342] The AI support system for the elderly according to the present invention consists of a server, a terminal, and user interaction. Its main objective is to provide elderly individuals with natural, empathetic dialogue and support their emotional stability.

[0343] The server first receives the audio data sent from the terminal and converts it into text data using speech recognition software. A commonly used speech recognition technology for this process is a "speech recognition API." Subsequently, natural language processing technology (e.g., generative AI models) is used to analyze the text data and identify the emotions contained in the user's utterances. The emotion analysis refers to the user's past conversation history and reflects this in the content of the generated response.

[0344] Specifically, if a user says something like, "I'm feeling a little lonely today," the server analyzes that statement using sentiment analysis tools and generates a response that empathizes with the user's feelings, such as, "How was your day?" This process utilizes a generative AI model, and an example of a prompt would be, "Generate a kind response for when the user is feeling lonely."

[0345] The terminal receives text data from the server, converts it into speech using speech synthesis technology, and outputs it to the user. This conversion enables a natural conversational format with an acoustic tone that reflects emotions.

[0346] Users can communicate with the system through everyday conversations and receive emotional support based on the responses. This allows users to gain a sense of security and trust, and receive support to enjoy a better life experience.

[0347] Overall, this system aims to improve the quality of life for the elderly, providing a new form of support through technological elements and user interaction.

[0348] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0349] Step 1:

[0350] The server receives audio data transmitted from the terminal. The input is the user's speech, which is captured as audio data through the terminal's microphone. The server records this audio data as digital data and prepares it for the next processing step.

[0351] Step 2:

[0352] The server converts the received audio data into text data using speech recognition. For example, this process involves converting audio data into a string using a speech recognition API. The input here is audio data, and the output is text data that can be processed by a machine.

[0353] Step 3:

[0354] The server uses a generative AI model to analyze text data and determine the user's emotional state. In this step, the input is text data obtained through speech recognition. The generative AI model uses prompt sentences to analyze emotions and outputs the user's emotional state (e.g., joy, sadness, etc.).

[0355] Step 4:

[0356] The server generates an appropriate response using natural language processing based on the results of sentiment analysis. The input consists of the user's emotional state and past conversation history, and based on this, it outputs a response text that is emotionally empathetic to the user. Specifically, it creates a response using the prompt example "Generate a gentle response when the user is feeling lonely."

[0357] Step 5:

[0358] The terminal converts the response text sent from the server into speech using speech synthesis technology and plays it back to the user. The input is the generated response text, which is delivered to the user as a human-like voice output using speech synthesis technology. In this step, the user can continue the conversation with the system by listening to the voice response from the terminal.

[0359] Step 6:

[0360] After receiving a response from the system, the user provides further voice input. This interaction is continuous, and the server repeats the process from step 1 as it receives new voice data. User feedback and new utterances help improve sentiment analysis and the quality of responses.
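Steps 3 and 4 describe using prompt sentences first to classify the emotion and then to generate a response. A minimal sketch of that two-prompt pattern follows, assuming the model is any callable that maps a prompt string to a reply string; the label set, the prompt wording, and all function names are hypothetical.

```python
from typing import Callable

# Hypothetical label set; the text gives joy and sadness as examples.
LABELS = ("joy", "sadness", "loneliness", "neutral")


def emotion_prompt(utterance: str) -> str:
    """Build the prompt that asks the model for a single emotion label."""
    return (
        "Classify the emotion of the following utterance as one of "
        f"{', '.join(LABELS)}. Answer with one word.\nUtterance: {utterance}"
    )


def parse_label(raw: str) -> str:
    """Normalize the model output; fall back to neutral if unrecognized."""
    token = raw.strip().strip(".").lower()
    return token if token in LABELS else "neutral"


def analyze(utterance: str, model: Callable[[str], str]) -> str:
    """Step 3: return the emotional state estimated by the model."""
    return parse_label(model(emotion_prompt(utterance)))


def response_prompt(emotion: str, utterance: str) -> str:
    """Step 4: build the prompt for an empathetic response."""
    return (
        f"The user seems to feel {emotion}. "
        f"Generate a gentle response to: {utterance}"
    )
```

Falling back to a neutral label when the output cannot be parsed is a design choice made here so that an unexpected model reply does not stop the conversation.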

[0361] (Application Example 2)

[0362] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0363] There is a problem of insufficient emotional support in the daily lives of the elderly. Conventional information provision systems have difficulty generating responses that are sensitive to the user's feelings, and thus have the challenge of not being able to adequately provide the elderly with a sense of psychological security and trust.

[0364] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0365] In this invention, the server includes a natural language processing means, an emotion analysis device means, and a generation means for generating responses that correspond to the user's emotional state. This makes it possible to accurately grasp the emotions of elderly people through their daily conversations and provide emotionally expressive responses based on that understanding.

[0366] "Natural language processing" refers to technologies that convert information input via speech or text into a format that a computer can understand, and analyze the user's intent and meaning.

[0367] A "memory device" is a device that has the function of storing past dialogue information and data, and retrieving and using it as needed.

[0368] "Speech synthesis device means" refers to a technology for converting generated text data into speech and outputting it in a natural form.

[0369] "Speech recognition device means" refers to a technology that analyzes speech collected from users and converts it into text data.

[0370] A "data management device means" is a technology that provides convenience by organizing and managing information related to users' schedules and health.

[0371] A "communication device" is a device that has the function of periodically reporting and sharing information with the user's family and related parties.

[0372] An "emotion analysis device means" is a technology that analyzes a user's emotional state from their voice or text information and identifies that emotion.

[0373] "Generation means" refers to techniques for generating appropriate responses based on the analyzed results.

[0374] The system implementing this invention mainly consists of a server, a terminal, and a user. First, the server uses a speech recognition device to convert the audio data transmitted from the terminal into text data. Next, this text data is processed by a natural language analysis device, and the user's emotional state is identified by an emotion analysis device. Based on this, a generation device generates a response, which is then output again as audio by a speech synthesis device.

[0375] The terminal receives speech synthesis results from the server and provides responses to the user through voice output. This allows the user to communicate with the system through dialogue. In particular, emotion analysis makes it possible to naturally express words of comfort and encouragement to a depressed user.

[0376] The hardware used includes smart speakers and smartphones, while the software includes natural language processing libraries, speech recognition engines, and speech synthesis engines. Specifically, the "SpeechRecognition" library can be used as the speech recognition engine, and generative AI models for natural language processing can be used.

[0377] For example, if a user says, "Yesterday was a good day," the server will generate a response such as, "That's great. What are your plans for today?" A specific prompt might be, "Generate words of encouragement based on the user's mood today." This system allows elderly people to receive emotional support in their daily lives and gain a sense of security and trust.

[0378] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0379] Step 1:

[0380] The user provides voice input to the device. The device receives this voice data through its microphone. The voice data is input and sent to the speech recognition engine within the device.

[0381] Step 2:

[0382] The device uses a speech recognition engine to convert speech data into text data. This process captures the content of the speech as text information. In this case, the input is speech data, and the output is the text data of that speech.

[0383] Step 3:

[0384] The server receives text data from the terminal and analyzes the text using natural language processing (NLP) tools. Through this analysis, the user's utterances and intentions are understood. The input for the analysis is text data, and the output is intent data derived from the analysis of that text.

[0385] Step 4:

[0386] The server uses an emotion analysis device to identify the user's emotional state from intent data. This allows the server to determine the user's current emotions. The input is the analyzed intent data, and the output is the estimated emotional state.

[0387] Step 5:

[0388] The server uses a generation mechanism to generate a response based on the identified emotional state. This process utilizes a generative AI model. The input is the emotional state, and the output is an appropriate response sentence corresponding to that emotion.

[0389] Step 6:

[0390] The server uses a speech synthesis device to convert the generated response text into audio data. This prepares an audio response that the user can hear. The input is a response text in text format, and the output is the audio data of that text.

[0391] Step 7:

[0392] The terminal receives audio data from the server and plays the response to the user via the speaker. The user receives the response as audio. The output from the speaker is audio data generated by the server.
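The division of work between the terminal (Steps 1, 2, and 7) and the server (Steps 3 to 6) implies some message exchange between them. The sketch below shows one possible shape for that exchange using JSON; the message fields and function names are hypothetical, and the keyword rules are placeholders standing in for the natural language processing, emotion analysis, generation, and speech synthesis stages.

```python
import base64
import json


def terminal_request(text: str) -> str:
    """Steps 1-2 (terminal): send the recognized text to the server."""
    return json.dumps({"type": "utterance", "text": text})


def server_handle(message: str) -> str:
    """Steps 3-6 (server): analyze, respond, and return encoded audio."""
    text = json.loads(message)["text"].lower()
    intent = "greeting" if "hello" in text else "statement"      # Step 3 placeholder
    emotion = "positive" if "good" in text else "neutral"        # Step 4 placeholder
    reply = "That's great." if emotion == "positive" else "I see."  # Step 5 placeholder
    audio = reply.encode("utf-8")  # Step 6 placeholder for synthesized audio
    return json.dumps({
        "type": "response",
        "intent": intent,
        "emotion": emotion,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    })


def terminal_play(message: str) -> bytes:
    """Step 7 (terminal): decode the audio bytes to hand to the speaker."""
    return base64.b64decode(json.loads(message)["audio_b64"])
```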

[0393] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0394] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0395] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0396] [Third Embodiment]

[0397] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0398] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0399] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0400] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0401] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0402] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0403] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0404] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0405] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0406] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0407] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0408] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0409] The AI support system for the elderly, as described in this invention, is implemented through the cooperation of a server, a terminal, and a user. A specific embodiment is shown below.

[0410] The server first uses natural language processing to convert the user's voice input into text data and analyzes its intent. The analyzed data is then compared with past conversation history and recorded in memory. Based on this information, the server uses a generative model to create an appropriate response. This response is then sent back to the terminal in text format.

[0411] The terminal converts text data sent from the server into speech data using speech synthesis and outputs it audibly to the user. In this process, the terminal adjusts the tone and speed of the speech to provide it in a natural and easy-to-understand manner for the user. The terminal also converts newly inputted speech from the user into text using speech recognition and sends that information to the server.

[0412] Users engage in everyday conversations through this system. For example, if a user asks, "What did I eat for dinner yesterday?", the server retrieves past data from its memory and generates a specific response such as, "Yesterday's meal was fish." In this way, users can receive support that contributes to reducing feelings of loneliness and maintaining cognitive function through continuous conversation.
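The lookup behind a question such as "What did I eat for dinner yesterday?" can be sketched as a small memory store keyed by date and topic. This is an assumption made for illustration: `MemoryEntry`, `ConversationMemory`, and the idea of tagging entries with a topic are hypothetical, and the disclosure does not state how past conversation data is indexed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MemoryEntry:
    day: date
    topic: str  # e.g. "dinner" (hypothetical tagging scheme)
    text: str


class ConversationMemory:
    """Records past facts and recalls the latest match for a day and topic."""

    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def record(self, day: date, topic: str, text: str) -> None:
        self._entries.append(MemoryEntry(day, topic, text))

    def recall(self, day: date, topic: str) -> Optional[str]:
        for entry in reversed(self._entries):  # newest entry wins
            if entry.day == day and entry.topic == topic:
                return entry.text
        return None


def dinner_answer(memory: ConversationMemory, today: date) -> str:
    """Answer the example question using yesterday's record, if any."""
    meal = memory.recall(today - timedelta(days=1), "dinner")
    if meal is None:
        return "I don't have a record of yesterday's meal."
    return f"Yesterday's meal was {meal}."
```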

[0413] Furthermore, the data management system manages the user's schedule and health information in real time and sets reminders as needed. For example, if a user says, "I want to check my doctor's appointment for next week," the device checks the schedule information and responds, "It's next Monday at 10:00 AM."

[0414] Reports to family members are made via communication methods. The server periodically analyzes the user's lifestyle and health status and sends the results to the family as a report. This makes it easier for family members to understand the elderly person's condition, even when they are in a remote location, and provides them with peace of mind.

[0415] This invention provides an effective means for elderly people to live fulfilling lives with peace of mind while strengthening their ties with their families.

[0416] The following describes the processing flow.

[0417] Step 1:

[0418] The terminal receives voice input from the user. A speech recognition system is used to convert the voice data into text data. This text data is then sent to a server for analysis of the user's intent.

[0419] Step 2:

[0420] The server analyzes the received text data using natural language processing techniques. The analysis helps understand the user's intent and the context of the conversation, and retrieves relevant information from past conversation history.

[0421] Step 3:

[0422] Based on the acquired information, the server uses a generative model to generate an appropriate response to the user's utterance. In this process, past history is also considered to enable context-specific responses.

[0423] Step 4:

[0424] The server sends the generated text response to the terminal. The terminal then uses a speech synthesis system to convert this text data into speech and outputs it to the user.

[0425] Step 5:

[0426] The user listens to the device's audio output and, if they wish to continue the conversation, provides new audio input. This iterative process enables continuous conversation.

[0427] Step 6:

[0428] The data management system updates schedules and health information based on the user's new utterances and sets reminders as needed.

[0429] Step 7:

[0430] The server periodically reports the user's living situation and health information to the family via a communication method. This allows the family to properly understand the user's condition.
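The periodic family report in Step 7 could, for instance, be built by counting logged activities over a reporting period. The sketch below assumes a simple activity log; the `Activity` record, the activity kinds, and the report wording are hypothetical, and how the report is actually delivered to the family is left out.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Activity:
    day: date
    kind: str  # e.g. "medicine", "walk", "conversation"


def period_report(log: list[Activity], start: date, end: date) -> str:
    """Summarize activities recorded between start and end, inclusive."""
    counts = Counter(a.kind for a in log if start <= a.day <= end)
    if not counts:
        return f"No activity recorded from {start} to {end}."
    parts = [f"{kind}: {n}" for kind, n in sorted(counts.items())]
    return f"Report {start} to {end}: " + ", ".join(parts)
```

Sorting the activity kinds keeps the report text stable from one period to the next, which makes successive reports easier for family members to compare.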

[0431] (Example 1)

[0432] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0433] As society ages, many elderly people experience loneliness in their daily lives and face challenges in managing their health. To address these challenges, there is a need for support systems that enable the elderly to live their daily lives with peace of mind and communicate effectively with their families. In addition, there is a need for technology that reduces noise and unnatural speech when using voice interfaces, providing a more comfortable and natural conversational environment.

[0434] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0435] In this invention, the server includes a natural language processing unit, a storage device for recording and referencing past conversation information, a speech synthesis device for outputting generated responses as speech, means for filtering noise when converting speech data to text data, and means for adjusting the tone and speed of the generated responses. This makes it possible for elderly people to receive daily support while reducing loneliness, and to provide a sense of security to their families. Furthermore, even when using a voice interface, noise reduction and natural speech output enable smoother use.

[0436] A "natural language processing unit" is a unit that utilizes technology to understand and analyze human language in order to generate appropriate responses.

[0437] A "storage device" is a device that stores past conversation information and uses it for future matching and response generation.

[0438] A "speech synthesis device" is a device that converts generated text data into speech and provides it to the user audibly.

[0439] A "speech recognition device" is a device that converts a user's speech into digital data, making it possible to process it as text.

[0440] An "information management device" is a device that systematically manages users' schedules and health-related information, and provides appropriate instructions and notifications.

[0441] A "communication device" is a device used to report the user's status and necessary information to family members in a remote location according to certain standards.

[0442] "Means for filtering noise" refers to methods for removing unwanted acoustic noise from received audio data, making it easier to analyze the speech content itself.

[0443] "Means for adjusting tone and speed" refers to methods for adjusting generated audio data to provide users with natural and easily understandable audio.

[0444] This invention, an AI support system for the elderly, is built on the interaction between a server, a terminal, and the user.

[0445] The server first receives voice data input from the user and converts it into text data using a speech recognition device. A natural language processing unit is then used to understand what the user said. Specifically, common cloud-based speech recognition technology is used to convert speech to text. The converted text is stored in the storage device and used when referring to past conversation information.

[0446] The terminal receives text data sent from the server and converts it into speech data using a speech synthesis device. During this process, the tone and speed of the synthesized speech are adjusted to enhance the naturalness of the generated response. For example, the terminal can use speech synthesis technology to convey specific details such as "Yesterday we had fish" in a soft, natural way.
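
One common way to express tone and speed adjustments is the prosody element of SSML (Speech Synthesis Markup Language). The following is a minimal sketch under that assumption; the specification does not state that SSML is used, and the function name to_ssml and the default values are hypothetical.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate: str = "slow", pitch: str = "low") -> str:
    """Wrap response text in an SSML prosody element.

    `rate` and `pitch` take SSML prosody labels such as "slow" or "low".
    The defaults are illustrative assumptions, not values from the text.
    """
    # escape() protects characters such as "&" and "<" inside the markup.
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{escape(text)}</prosody></speak>')
```

The resulting string would be passed to an SSML-capable speech synthesis engine on the terminal. Which prosody attributes are honored depends on the engine and voice, so the pitch setting in particular should be treated as a request rather than a guarantee.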

[0447] Users can use this system to engage in everyday conversations and alleviate feelings of loneliness. For example, if a user says, "I want to check my doctor's appointment for next week," the server will use its information management device to check the schedule data and generate an appropriate response. Specifically, it can provide a response such as, "It's next Monday at 10:00 AM."

[0448] Furthermore, an example of a prompt message sent to the AI generation model is, "Considering that the user is elderly, please gently describe what they had for lunch yesterday." This model then generates a more appropriate response, which is delivered to the user.
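
As a rough illustration of how such a prompt might be assembled from an instruction, the stored conversation history, and the new utterance, consider the sketch below. The function name build_prompt, the layout of the prompt, and the limit of three history entries are assumptions made for this example only.

```python
def build_prompt(instruction: str, history: list[str], utterance: str,
                 max_turns: int = 3) -> str:
    """Combine an instruction, recent history, and the new utterance."""
    # Keep only the most recent entries so the prompt stays short.
    recent = history[-max_turns:] if max_turns > 0 else []
    parts = [instruction]
    if recent:
        parts.append("Past conversation:")
        parts.extend(f"- {turn}" for turn in recent)
    parts.append(f"User: {utterance}")
    return "\n".join(parts)
```

For example, calling build_prompt with the instruction quoted above, a history entry noting that lunch was grilled fish, and the utterance "What did I eat yesterday?" yields a single text block that can be sent to the generative model.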

[0449] This invention allows users to receive daily support while their families gain peace of mind through regular reports on the user's condition. The system maintains the quality of voice data through noise filtering, creating a natural and easy-to-understand conversational environment for the user.

[0450] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0451] Step 1:

[0452] The user inputs voice into the device's microphone. The device has the capability to process this voice data based on cloud-based speech recognition technology. The input is the user's spoken voice, and the output is digitized voice data. The device sends this voice data to a server for further processing.

[0453] Step 2:

[0454] The server uses the received audio data to perform analysis with a natural language processing unit. This process uses speech recognition technology to convert the audio data into text data, thereby understanding the user's intent. Here, the input is digital audio data, and the output is the analyzed text data. Furthermore, the server retrieves past conversation information from its storage device and compares it with historical information related to the intent.

[0455] Step 3:

[0456] The server references text data and conversation history and uses a generative AI model to generate appropriate prompt sentences. Based on these prompt sentences, it generates responses that match the user's requests. The input is history information matched with text data, and the output is a natural language response text based on the prompt sentences.

[0457] Step 4:

[0458] The generated response text is sent from the server to the terminal, which uses speech synthesis technology to convert this text into speech data. The input here is a response text in natural language, and the output is speech data with adjusted tone and speed. The terminal then provides this speech data to the user audibly through its speaker.

[0459] Step 5:

[0460] The user listens to the audio output from the terminal and decides whether to continue the conversation based on it. In this step, the input is the audio response, and the output is the user's experience and information. If additional information is needed, the user restarts the conversation from step 1.
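
The flow of steps 1 to 4 can be sketched as follows. The specification does not say which noise filtering method is used, so a simple energy-based noise gate is swapped in here purely for illustration. The recognizer and the generative model are passed in as callables because the text names no concrete service for Example 1; noise_gate, run_turn, the frame size, and the threshold are all hypothetical.

```python
import math
from typing import Callable


def noise_gate(samples: list[float], frame_size: int = 4,
               threshold: float = 0.1) -> list[float]:
    """Zero out frames whose RMS energy is below `threshold`."""
    out: list[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        out.extend(frame if rms >= threshold else [0.0] * len(frame))
    return out


def run_turn(samples: list[float], history: list[str],
             recognize: Callable[[list[float]], str],
             generate: Callable[[str, list[str]], str]) -> str:
    """One pass through steps 1 to 4; returns the response text."""
    cleaned = noise_gate(samples)               # filter noise before recognition
    text = recognize(cleaned)                   # step 2: speech to text
    response = generate(text, list(history))    # step 3: response from text and history
    history.append(text)                        # keep the utterance for later turns
    return response                             # step 4: sent to the terminal for synthesis
```

Step 5 corresponds to calling run_turn again with the next utterance, so the history grows by one entry per turn.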

[0461] (Application Example 1)

[0462] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0463] There is a need to provide effective support to alleviate the loneliness and anxiety that elderly people face in their daily lives and to maintain their cognitive function. Furthermore, a system is needed that allows family members living remotely to easily monitor the health and living conditions of elderly individuals. Current technology lacks consistency in audio and visual information delivery, and real-time activity recording and reporting automation is insufficient; therefore, improved solutions are desired.

[0464] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0465] In this invention, the server includes natural language processing means, a storage device for recording and utilizing past conversation information, and a speech synthesis device for outputting generated responses as speech. This enables elderly people to naturally give instructions by voice and receive appropriate responses. The invention also includes a data management device for managing the user's schedule and health information, a communication device for periodically reporting information to the user's family, a display device for providing information audibly and visually, and means for recording the user's activities and reporting the situation to the family. This enables real-time notifications of schedules and reports on health status.

[0466] "Natural language processing means" refers to technologies that convert voice input from users into text data and analyze its intent.

[0467] A "memory device" is a device that records past conversation information and uses it to generate appropriate responses.

[0468] A "speech synthesis device" is a device that converts generated responses into speech data and outputs it to the user audibly.

[0469] A "speech recognition device" is a device that converts obtained speech into text data.

[0470] A "data management device" is a device that manages a user's schedule and health information and sends reminders as needed.

[0471] A "communication device" is a device used to periodically report information to the user's family.

[0472] A "display device" is a device that provides information in both audio and visual form.

[0473] "Means of reporting" refers to methods for recording the user's activities and notifying family members in remote locations of the situation.

[0474] The system for realizing this invention mainly consists of three elements: a server, a terminal, and a user.

[0475] The server uses natural language processing technology to convert user voice input into text data. For example, when a user asks an everyday question, the voice is quickly converted into text, and the intent is analyzed using a generative AI model. The analyzed text data is stored in a memory device that also contains past conversation information, and an appropriate response is generated based on this. This response is sent to the terminal in text format. The hardware and software used include speech recognition APIs (e.g., Google Cloud Speech-to-Text) and natural language processing libraries (e.g., NLTK, spaCy).

[0476] The terminal converts text responses sent from the server into speech using a speech synthesis device and provides it to the user. The tone and speed of the speech are adjusted to ensure the user can easily understand it. The terminal can also manage the user's schedule and health information using a data management device and send reminders as needed. A speech synthesis API (e.g., Amazon Polly) is used in this process.

[0477] Users interact with the system through a device that provides voice and visual information. For example, they can ask questions like, "What time should I take my medicine?" to their smart device to check their schedule and receive instructions. The user's activities are regularly recorded and reported to their family via a communication device. Family members, even those living far away, can gain peace of mind through this information.

[0478] For example, prompt statements include the following:

[0479] "Tell me when to take my medicine."

[0480] "What's next?"

[0481] "What kind of exercise did you do today?"

[0482] By using these prompts, users can receive support to make their daily lives more comfortable.
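
To make the handling of such utterances concrete, the sketch below routes each example to an intent and looks up the next schedule entry. Simple keyword matching is swapped in for the natural language processing libraries mentioned in the text; the intent names, the keyword table, and the functions detect_intent and next_event are hypothetical.

```python
from datetime import datetime

# Hypothetical keyword table; a real system would use an NLP library or model.
INTENT_KEYWORDS = {
    "medication_time": ("medicine", "medication", "pill"),
    "next_event": ("next", "schedule", "appointment"),
    "exercise_log": ("exercise", "walk", "workout"),
}


def detect_intent(utterance: str) -> str:
    """Return the first intent whose keyword occurs in the utterance."""
    lowered = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return intent
    return "small_talk"


def next_event(schedule: list, now: datetime):
    """Return the earliest (when, title) entry at or after `now`, or None."""
    upcoming = [(when, title) for when, title in schedule if when >= now]
    return min(upcoming) if upcoming else None
```

Substring matching is deliberately crude (it would match "pill" inside "pillow," for instance), which is one reason a tokenizer or a generative model would be preferable in practice.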

[0483] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0484] Step 1:

[0485] The server receives audio input from the user. This audio data is given as input and converted into text data by a speech recognition API (e.g., Google Cloud Speech-to-Text). The converted text data is then passed to the next parsing step.

[0486] Step 2:

[0487] The server analyzes the converted text data using natural language processing libraries (e.g., NLTK, spaCy). It extracts the user's intent from the input text data and searches the memory device for relevant past conversation information. A generative AI model then derives an appropriate response to the text data. If any other information is needed, the server searches again, organizes the information, and passes it on to the next step.

[0488] Step 3:

[0489] The server sends the generated response to the speech synthesis device. Using a speech synthesis API (e.g., Amazon Polly), the text data is converted into speech data. The converted speech data is then sent to the terminal as output.

[0490] Step 4:

[0491] The terminal receives audio data transmitted from the server and outputs it to the user. By having the terminal play the audio for the user, the user receives responses and advice from the system in audio format. The tone and speed of the audio are appropriately adjusted to provide natural and easy-to-listen-to audio.

[0492] Step 5:

[0493] The user listens to the audio output from the terminal and, if necessary, asks additional questions by voice. This new voice input returns to step 1, and the process continues in the same manner. Furthermore, if the user's status or schedule is updated, the terminal uses a data management device to record the information in real time and periodically reports it to the family via a communication device. This activity log enables continuous data management and provides peace of mind.

[0494] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0495] The AI support system for the elderly, which incorporates the emotion engine of this invention, is realized through the interaction of a server, a terminal, and a user. A specific embodiment is shown below.

[0496] The server first receives the user's voice transmitted from the terminal and converts it into text using natural language processing. This text data is then input into the emotion engine, which analyzes the user's emotional state. This emotional information is reflected in the content and tone of the response generated by the server.

[0497] The emotion engine has the function of analyzing the user's emotions based on the content of their speech, tone of voice, tempo, etc. Specifically, if it determines that the user is feeling down, the server generates words of comfort and encouragement and provides a response in a corresponding tone of voice. In addition, by comparing this with past conversation history, it takes into account the user's past emotional state, enabling it to respond more appropriately.

[0498] The device interacts with the user using text data obtained through speech recognition and responses from the server. The text responses sent from the server are converted into speech by a speech synthesis system and output to the user. This speech is given a tone influenced by an emotion engine, enabling natural and emotionally resonant expressions.

[0499] Users communicate with the AI through everyday conversations. For example, if a user says, "I'm feeling a little down today," the server analyzes this based on its emotion engine and generates a response such as, "Is something wrong? Shall we talk?" By providing responses that match the user's emotions in this way, it fosters a sense of security and trust.

[0500] This invention goes beyond simply recording and managing information; it provides a new form of support that is empathetic to the feelings of the elderly through emotion recognition and responses based on those emotions. The system aims to enhance emotional support for users and provide them with a richer life experience through intimate dialogue.

[0501] The following describes the processing flow.

[0502] Step 1:

[0503] The terminal receives voice input from the user. This data is converted into text data using speech recognition technology and sent to the server.

[0504] Step 2:

[0505] The server analyzes the received text data using natural language processing techniques to understand the user's utterances. This analysis reveals the user's requests and intentions.

[0506] Step 3:

[0507] The server inputs text data into the emotion engine, which then analyzes the user's emotional state. Emotions are inferred from factors such as tone of voice, speed, and the words used.

[0508] Step 4:

[0509] The emotion engine returns the analysis results to the server, providing data based on the user's emotions. This data is then used in the subsequent response generation process.

[0510] Step 5:

[0511] The server references the results of the emotion engine and past conversation history to generate an appropriate response. It adjusts the content and tone of voice of the response according to the user's emotional state.

[0512] Step 6:

[0513] The server sends the generated text response to the terminal. During this process, a speech synthesis system is used to convert the text into speech data.

[0514] Step 7:

[0515] The device uses the converted voice data to output a response to the user. Emotion-based tone adjustments are applied, resulting in a natural and friendly voice for the user.

[0516] Step 8:

[0517] The user hears this response and, if they wish to continue the conversation, provides further voice input. This cycle is repeated, enabling continuous conversation.
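
Steps 3 to 5 can be sketched with a small rule-based estimator. This is only an illustration: the specification relies on the emotion identification model 59, whose internals are not described, so word lists and a speaking-rate threshold are swapped in here. The names EmotionResult, estimate_emotion, and choose_opening, as well as the word lists and the 1.5 words-per-second threshold, are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical word lists standing in for a trained emotion model.
NEGATIVE_WORDS = {"down", "lonely", "sad", "tired"}
POSITIVE_WORDS = {"good", "happy", "great", "fun"}


@dataclass
class EmotionResult:
    label: str   # "negative", "positive", or "neutral"
    rate: str    # suggested speaking rate for the reply
    pitch: str   # suggested pitch for the reply


def estimate_emotion(text: str, words_per_second: float) -> EmotionResult:
    """Estimate emotion from the words used and the speaking tempo."""
    tokens = {w.strip(".,!?'\"").lower() for w in text.split()}
    score = len(tokens & POSITIVE_WORDS) - len(tokens & NEGATIVE_WORDS)
    if words_per_second < 1.5:   # assumed threshold: slow speech lowers the score
        score -= 1
    if score < 0:
        return EmotionResult("negative", "slow", "low")
    if score > 0:
        return EmotionResult("positive", "medium", "medium")
    return EmotionResult("neutral", "medium", "medium")


def choose_opening(emotion: EmotionResult) -> str:
    """Pick an opening phrase that matches the estimated emotion."""
    if emotion.label == "negative":
        return "Is something wrong? Shall we talk?"
    if emotion.label == "positive":
        return "That's great."
    return "I see."
```

The rate and pitch fields show how the result could feed the tone adjustment in step 7. In a fuller implementation the opening phrase would be replaced by a response from the generative model.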

[0518] (Example 2)

[0519] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0520] There is a need for technologies that alleviate the emotional and informational anxieties and feelings of loneliness that older adults face in their daily lives, and that provide smoother and more approachable communication. Furthermore, there is a need to strengthen emotional support for older adults through appropriate voice responses that take their emotions into consideration.

[0521] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0522] In this invention, the server includes natural language processing means, memory means for recording and utilizing past utterances, and speech synthesis means for outputting the generated response as sound. This enables the detection of the user's emotional state and the generation of natural speech responses accordingly, thereby fostering a sense of security and trust in the elderly and providing emotional support.

[0523] "Natural language processing means" refers to technologies that analyze input character data, understand its content, and generate appropriate responses.

[0524] "Memory means" refers to a function that stores past utterances and allows them to be referenced and used as needed.

[0525] "Speech synthesis means" refers to a technology for converting a generated response into sound and outputting it.

[0526] "Speech recognition means" refers to technology that converts input audio data into text data.

[0527] "Information management means" refers to functions for organizing and managing users' schedule information and health information.

[0528] "Emotion analysis methods" refer to technologies used to determine a user's emotional state from the content of their speech and the intonation of their voice.

[0529] "Communication means" refers to a function that periodically reports information to the user's relatives and related parties.

[0530] The AI support system for the elderly according to the present invention is built on the interaction between a server, a terminal, and the user. Its main objective is to provide elderly individuals with natural, empathetic dialogue and support their emotional stability.

[0531] The server first receives the audio data sent from the terminal and converts it into text data using speech recognition software. A commonly used speech recognition technology for this process is a "speech recognition API." Subsequently, natural language processing technology (e.g., generative AI models) is used to analyze the text data and identify the emotions contained in the user's utterances. The emotion analysis refers to the user's past conversation history and reflects this in the content of the generated response.

[0532] Specifically, if a user says something like, "I'm feeling a little lonely today," the server analyzes that statement using sentiment analysis tools and generates a response that empathizes with the user's feelings, such as, "How was your day?" This process utilizes a generative AI model, and an example of a prompt would be, "Generate a kind response for when the user is feeling lonely."

[0533] The terminal receives text data from the server, converts it into speech using speech synthesis technology, and outputs it to the user. This conversion enables a natural conversational format with an acoustic tone that reflects emotions.

[0534] Users can communicate with the system through everyday conversations and receive emotional support based on the responses. This allows users to gain a sense of security and trust, and receive support to enjoy a better life experience.

[0535] Overall, this system aims to improve the quality of life for the elderly, providing a new form of support through technological elements and user interaction.

[0536] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0537] Step 1:

[0538] The server receives audio data transmitted from the terminal. The input is the user's speech, which is captured as audio data through the terminal's microphone. The server records this audio data as digital data and prepares it for the next processing step.

[0539] Step 2:

[0540] The server converts the received audio data into text data using speech recognition. For example, this process involves converting audio data into a string using a speech recognition API. The input here is audio data, and the output is text data that can be processed by a machine.

[0541] Step 3:

[0542] The server uses a generative AI model to analyze text data and determine the user's emotional state. In this step, the input is text data obtained through speech recognition. The generative AI model uses prompt sentences to analyze emotions and outputs the user's emotional state (e.g., joy or sadness).

[0543] Step 4:

[0544] The server generates an appropriate response using natural language processing based on the results of sentiment analysis. The input consists of the user's emotional state and past conversation history, and based on this, it outputs a response text that is emotionally empathetic to the user. Specifically, it creates a response using the prompt example "Generate a gentle response when the user is feeling lonely."

[0545] Step 5:

[0546] The terminal converts the response text sent from the server into speech using speech synthesis technology and plays it back to the user. The input is the generated response text, which is delivered to the user as a human-like voice output using speech synthesis technology. In this step, the user can continue the conversation with the system by listening to the voice response from the terminal.

[0547] Step 6:

[0548] After receiving a response from the system, the user provides further voice input. This interaction is continuous, and the server repeats the process from step 1 as it receives new voice data. User feedback and new utterances help improve sentiment analysis and the quality of responses.
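
Steps 3 and 4 both rely on prompting a generative AI model. The sketch below shows one way the two prompts could be built and the model's answer checked. The model is passed in as a callable that takes a prompt string and returns text, since the specification names no concrete API here; the label set EMOTIONS, the prompt wording, and the functions classify_emotion and respond are assumptions.

```python
from typing import Callable

# Hypothetical label set; the text only gives joy and sadness as examples.
EMOTIONS = ("joy", "sadness", "loneliness", "anger", "neutral")


def classify_emotion(text: str, model: Callable[[str], str]) -> str:
    """Step 3: ask the model for one emotion label and validate the answer."""
    prompt = (
        "Classify the emotion of the following utterance. "
        f"Answer with one word from: {', '.join(EMOTIONS)}.\n"
        f"Utterance: {text}"
    )
    answer = model(prompt).strip().lower().rstrip(".")
    # Fall back to "neutral" when the model returns an unexpected label.
    return answer if answer in EMOTIONS else "neutral"


def respond(text: str, emotion: str, history: list[str],
            model: Callable[[str], str]) -> str:
    """Step 4: build a response prompt from emotion, history, and utterance."""
    lines = [f"The user's estimated emotion is {emotion}.",
             "Generate a kind response that empathizes with the user."]
    lines += [f"Earlier: {h}" for h in history[-3:]]
    lines.append(f"User: {text}")
    return model("\n".join(lines)).strip()
```

Validating the label before using it keeps an unexpected model output from propagating into the response prompt.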

[0549] (Application Example 2)

[0550] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0551] Emotional support in the daily lives of the elderly is insufficient. Conventional information provision systems have difficulty generating responses that are sensitive to the user's feelings, and therefore cannot adequately provide the elderly with a sense of psychological security and trust.

[0552] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0553] In this invention, the server includes a natural language processing means, an emotion analysis device means, and a generation means for generating responses that correspond to the user's emotional state. This makes it possible to accurately grasp the emotions of elderly people through their daily conversations and provide emotionally expressive responses based on that understanding.

[0554] "Natural language processing" refers to technologies that convert information input via speech or text into a format that a computer can understand, and analyze the user's intent and meaning.

[0555] A "memory device" is a device that has the function of storing past dialogue information and data, and retrieving and using it as needed.

[0556] "Speech synthesis device means" refers to a technology for converting generated text data into speech and outputting it in a natural form.

[0557] "Speech recognition device means" refers to a technology that analyzes speech collected from users and converts it into text data.

[0558] A "data management device means" is a technology that provides convenience by organizing and managing information related to users' schedules and health.

[0559] A "communication device" is a device that has the function of periodically reporting and sharing information with the user's family and related parties.

[0560] An "emotion analysis device means" is a technology that analyzes a user's emotional state from their voice or text information and identifies that emotion.

[0561] "Generation means" refers to techniques for generating appropriate responses based on the analyzed results.

[0562] The system implementing this invention mainly consists of a server, a terminal, and a user. First, the server uses a speech recognition device to convert the audio data transmitted from the terminal into text data. Next, this text data is processed by a natural language analysis device, and the user's emotional state is identified by an emotion analysis device. Based on this, a generation device generates a response, which is then output again as audio by a speech synthesis device.

[0563] The terminal receives speech synthesis results from the server and provides responses to the user through voice output. This allows the user to communicate with the system through dialogue. In particular, emotion analysis makes it possible to naturally express words of comfort and encouragement to a depressed user.

[0564] The hardware used includes smart speakers and smartphones, while the software includes natural language processing libraries, speech recognition engines, and speech synthesis engines. Specifically, the "SpeechRecognition" library can be used as the speech recognition engine, and generative AI models can be used for natural language processing.

[0565] For example, if a user says, "Yesterday was a good day," the server will generate a response such as, "That's great. What are your plans for today?" A specific prompt might be, "Generate words of encouragement based on the user's mood today." This system allows elderly people to receive emotional support in their daily lives and gain a sense of security and trust.

[0566] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0567] Step 1:

[0568] The user provides voice input to the terminal. The terminal receives this voice data through its microphone. The voice data is then passed to the speech recognition engine within the terminal.

[0569] Step 2:

[0570] The terminal uses a speech recognition engine to convert speech data into text data. This process captures the content of the speech as text information. In this case, the input is speech data, and the output is the text data of that speech.

[0571] Step 3:

[0572] The server receives text data from the terminal and analyzes the text using natural language processing (NLP) tools. Through this analysis, the user's utterances and intentions are understood. The input for the analysis is text data, and the output is intent data derived from the analysis of that text.

[0573] Step 4:

[0574] The server uses an emotion analysis device to identify the user's emotional state from intent data. This allows the server to determine the user's current emotions. The input is the analyzed intent data, and the output is the estimated emotional state.

[0575] Step 5:

[0576] The server uses a generation mechanism to generate a response based on the identified emotional state. This process utilizes a generative AI model. The input is the emotional state, and the output is an appropriate response sentence corresponding to that emotion.

[0577] Step 6:

[0578] The server uses a speech synthesis device to convert the generated response text into audio data. This prepares an audio response that the user can hear. The input is a response text in text format, and the output is the audio data of that text.

[0579] Step 7:

[0580] The terminal receives audio data from the server and plays the response to the user via the speaker. The user receives the response as audio. The output from the speaker is audio data generated by the server.
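
The seven steps above form a chain in which the output of each stage is the input of the next, with step 2 running on the terminal and steps 3 to 6 on the server. The sketch below makes that data flow explicit. Each stage is an injected callable, so no particular recognition, analysis, or synthesis engine is assumed; the names Stages, terminal_side, server_side, and run_pipeline are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stages:
    """Injected callables, one per processing stage."""
    recognize: Callable[[bytes], str]       # step 2 (terminal): audio to text
    analyze_intent: Callable[[str], str]    # step 3 (server): text to intent data
    analyze_emotion: Callable[[str], str]   # step 4 (server): intent data to emotion
    generate: Callable[[str], str]          # step 5 (server): emotion to response text
    synthesize: Callable[[str], bytes]      # step 6 (server): response text to audio


def terminal_side(audio: bytes, stages: Stages) -> str:
    """Steps 1 and 2: capture audio and convert it to text on the terminal."""
    return stages.recognize(audio)


def server_side(text: str, stages: Stages) -> bytes:
    """Steps 3 to 6: analyze, generate, and synthesize on the server."""
    intent = stages.analyze_intent(text)
    emotion = stages.analyze_emotion(intent)
    reply = stages.generate(emotion)
    return stages.synthesize(reply)


def run_pipeline(audio: bytes, stages: Stages,
                 play: Callable[[bytes], None]) -> bytes:
    """Run one full turn and play the result on the terminal (step 7)."""
    text = terminal_side(audio, stages)
    audio_out = server_side(text, stages)
    play(audio_out)
    return audio_out
```

Keeping the stages as separate callables mirrors the input and output stated for each step and allows any single stage to be replaced without touching the others.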

[0581] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0582] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0583] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset-type terminal 314.

[0584] [Fourth Embodiment]

[0585] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0586] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0587] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0588] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0589] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0590] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0591] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0592] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0593] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0594] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0595] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0596] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0597] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0598] The AI support system for the elderly, as described in this invention, is implemented through the cooperation of a server, a terminal, and a user. A specific embodiment is shown below.

[0599] The server first uses natural language processing to convert the user's voice input into text data and analyzes its intent. The analyzed data is then compared with past conversation history and recorded in memory. Based on this information, the server uses a generative model to create an appropriate response. This response is then sent back to the terminal in text format.

[0600] The terminal converts text data sent from the server into speech data using speech synthesis and outputs it audibly to the user. In this process, the terminal adjusts the tone and speed of the speech to provide it in a natural and easy-to-understand manner for the user. The terminal also converts newly inputted speech from the user into text using speech recognition and sends that information to the server.

[0601] Users engage in everyday conversations through this system. For example, if a user asks, "What did I eat for dinner yesterday?", the server retrieves past data from its memory and generates a specific response such as, "Yesterday's meal was fish." In this way, users can receive support that contributes to reducing feelings of loneliness and maintaining cognitive function through continuous conversation.
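The memory lookup in this example can be sketched as follows. This is a minimal illustration only: the class names `MemoryEntry` and `ConversationMemory`, the topic label "dinner", and the fallback sentence are assumptions introduced here, and the disclosure does not specify how the memory is structured.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class MemoryEntry:
    day: date      # date the remembered fact refers to
    topic: str     # hypothetical topic label, e.g. "dinner"
    detail: str    # what the user said about it, e.g. "fish"


@dataclass
class ConversationMemory:
    entries: list = field(default_factory=list)

    def record(self, day: date, topic: str, detail: str) -> None:
        """Store one fact extracted from a past conversation."""
        self.entries.append(MemoryEntry(day, topic, detail))

    def recall(self, day: date, topic: str) -> Optional[str]:
        """Return the most recently recorded detail for a day and topic."""
        for entry in reversed(self.entries):
            if entry.day == day and entry.topic == topic:
                return entry.detail
        return None


def answer_yesterday_meal(memory: ConversationMemory, today: date) -> str:
    """Answer "What did I eat for dinner yesterday?" from stored memory."""
    detail = memory.recall(today - timedelta(days=1), "dinner")
    if detail is None:
        # Hypothetical fallback wording when nothing was recorded.
        return "I'm sorry, I don't have a record of yesterday's dinner."
    return f"Yesterday's meal was {detail}."
```

In a fuller system the generative model would phrase the reply. The fixed template here only shows where the retrieved fact enters the response.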

[0602] Furthermore, the data management system manages the user's schedule and health information in real time and sets reminders as needed. For example, if a user says, "I want to check my doctor's appointment for next week," the device checks the schedule information and responds, "It's next Monday at 10:00 AM."
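A schedule query of this kind could be handled roughly as in the sketch below. It assumes that the schedule is held as a mapping from an appointment type to a list of date-times, and that "next week" means the Monday-to-Sunday week following the current one. Both are assumptions made for illustration, as are the function names.

```python
from datetime import datetime, timedelta
from typing import Optional

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def next_week_appointment(schedule: dict, kind: str,
                          now: datetime) -> Optional[datetime]:
    """Return the earliest appointment of the given kind in the following week."""
    this_monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0)
    start = this_monday + timedelta(days=7)   # Monday of next week
    end = start + timedelta(days=7)           # Monday of the week after
    candidates = [t for t in schedule.get(kind, []) if start <= t < end]
    return min(candidates) if candidates else None


def format_reply(when: Optional[datetime]) -> str:
    """Turn the lookup result into a spoken-style sentence."""
    if when is None:
        # Hypothetical wording for the case with no matching appointment.
        return "I could not find an appointment for next week."
    hour12 = when.hour % 12 or 12
    suffix = "AM" if when.hour < 12 else "PM"
    return f"It's next {DAY_NAMES[when.weekday()]} at {hour12}:{when.minute:02d} {suffix}."
```

The day names and the AM/PM suffix are built by hand rather than with `strftime` so that the output does not depend on the locale of the machine running the code.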

[0603] Reports to family members are made via communication methods. The server periodically analyzes the user's lifestyle and health status and sends the results to the family as a report. This makes it easier for family members to understand the elderly person's condition, even when they are in a remote location, and provides them with peace of mind.

[0604] This invention provides an effective means for elderly people to live fulfilling lives with peace of mind while strengthening their ties with their families.

[0605] The following describes the processing flow.

[0606] Step 1:

[0607] The terminal receives voice input from the user. A speech recognition system is used to convert the voice data into text data. This text data is then sent to a server for analysis of the user's intent.

[0608] Step 2:

[0609] The server analyzes the received text data using natural language processing techniques. The analysis helps understand the user's intent and the context of the conversation, and retrieves relevant information from past conversation history.

[0610] Step 3:

[0611] Based on the information acquired by the server, a generative model is used to generate an appropriate response to the user's utterance. In this process, past history is also considered to enable context-specific responses.

[0612] Step 4:

[0613] The server sends the generated text response to the terminal. The terminal then uses a speech synthesis system to convert this text data into speech and outputs it to the user.

[0614] Step 5:

[0615] The user listens to the device's audio output and, if they wish to continue the conversation, provides new audio input. This iterative process enables continuous conversation.

[0616] Step 6:

[0617] The data management system updates schedules and health information based on the user's new utterances and sets reminders as needed.

[0618] Step 7:

[0619] The server periodically reports the user's living situation and health information to the family via a communication method. This allows the family to properly understand the user's condition.
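The conversational part of this flow (Steps 1 to 5) can be sketched as a single turn passing through the terminal and the server. The sketch below is an assumption-laden outline, not the disclosed implementation: `Server`, `Terminal`, and the three `fake_*` functions are hypothetical stand-ins for the speech recognition, generative model, and speech synthesis components, which the disclosure does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Server:
    # Stand-in for the generative model: (user text, past history) -> reply text.
    generate: Callable[[str, list], str]
    history: list = field(default_factory=list)

    def handle_text(self, text: str) -> str:
        # Steps 2-3: consult the past history, then generate a response.
        reply = self.generate(text, list(self.history))
        self.history.append(f"user: {text}")
        self.history.append(f"system: {reply}")
        return reply


@dataclass
class Terminal:
    recognize: Callable[[bytes], str]    # stand-in for speech recognition
    synthesize: Callable[[str], bytes]   # stand-in for speech synthesis
    server: Server

    def handle_audio(self, audio: bytes) -> bytes:
        text = self.recognize(audio)            # Step 1: speech to text
        reply = self.server.handle_text(text)   # Steps 2-4: server round trip
        return self.synthesize(reply)           # Step 4: text to speech


# Trivial stand-ins so the loop can be exercised without any audio hardware.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


def fake_generate(text: str, history: list) -> str:
    return f"You said: {text} (turn {len(history) // 2 + 1})"
```

Calling `handle_audio` repeatedly corresponds to the repetition in Step 5: each call adds one user line and one system line to the history that the next turn can draw on.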

[0620] (Example 1)

[0621] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0622] As society ages, many elderly people experience loneliness in their daily lives and face challenges in managing their health. To address these challenges, there is a need for support systems that enable the elderly to live their daily lives with peace of mind and communicate effectively with their families. In addition, there is a need for technology that reduces noise and unnatural speech when using voice interfaces, providing a more comfortable and natural conversational environment.

[0623] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0624] In this invention, the server includes a natural language processing unit, a storage device for recording and referencing past conversation information, a speech synthesis device for outputting generated responses as speech, means for filtering noise when converting speech data to text data, and means for adjusting the tone and speed of the generated responses. This makes it possible for elderly people to receive daily support while reducing loneliness, and to provide a sense of security to their families. Furthermore, even when using a voice interface, noise reduction and natural speech output enable smoother use.

[0625] A "natural language processing device" is a device that utilizes technology to understand and analyze human language in order to generate appropriate responses.

[0626] A "memory device" is a device that stores past conversation information and uses it for future matching and response generation.

[0627] A "speech synthesis device" is a device that converts generated text data into speech and provides it to the user audibly.

[0628] A "speech recognition device" is a device that converts a user's speech into digital data, making it possible to process it as text.

[0629] An "information management device" is a device that systematically manages users' schedules and health-related information, and provides appropriate instructions and notifications.

[0630] A "communication device" is a device used to report the user's status and necessary information to family members in a remote location according to certain standards.

[0631] "Methods for filtering noise" refer to methods for removing unwanted acoustic noise from received audio data, making it easier to analyze the pure speech content.

[0632] "Means of adjusting tone and speed" refer to methods for adjusting generated audio data to provide users with natural and easily understandable audio.

[0633] This invention, an AI support system for the elderly, is built on the interaction between a server, a terminal, and the user.

[0634] The server first receives voice data input from the user and converts it into text data using a speech recognition device. This process utilizes a natural language processing unit to understand human speech. Specifically, common cloud-based speech recognition technology is used to convert speech to text. The converted text is stored in memory and used to refer to past conversation information.

[0635] The terminal receives text data sent from the server and converts it into speech data using a speech synthesis device. During this process, the tone and speed of the synthesized speech are adjusted to enhance the naturalness of the generated response. For example, the terminal can use speech synthesis technology to convey specific details such as "Yesterday we had fish" in a soft, natural way.
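One common way to express tone and speed adjustments is the `prosody` element of SSML (Speech Synthesis Markup Language). The sketch below wraps a response in such markup. It assumes that the terminal's speech synthesis device accepts SSML, which the disclosure does not state, and the function name `to_ssml` and the default values are illustrative.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text: str, rate: str = "slow", pitch: str = "low") -> str:
    """Wrap response text in SSML that requests a slower, lower voice.

    `rate` and `pitch` take SSML prosody labels such as "slow", "medium",
    "fast" and "low", "medium", "high".
    """
    return (
        f"<speak><prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"
        f"</prosody></speak>"
    )
```

Escaping the text matters because a generated response may contain characters such as `&` or `<` that would otherwise break the markup.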

[0636] Users can use this system to engage in everyday conversations and alleviate feelings of loneliness. For example, if a user says, "I want to check my doctor's appointment for next week," the server will use its information management device to check the schedule data and generate an appropriate response. Specifically, it can provide a response such as, "It's next Monday at 10:00 AM."

[0637] Furthermore, an example of a prompt message sent to the AI generation model is, "Considering that the user is elderly, please gently describe what they had for lunch yesterday." This model then generates a more appropriate response, which is delivered to the user.
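A prompt of this kind would typically be assembled from the instruction, the relevant conversation history, and the current utterance. The following sketch shows one possible layout. The function name, the section labels, and the `max_turns` limit are assumptions made for illustration, and the disclosure does not fix a prompt format.

```python
def build_prompt(instruction: str, history: list, utterance: str,
                 max_turns: int = 5) -> str:
    """Combine an instruction, recent history, and the new utterance."""
    # Keep only the most recent turns so the prompt stays short.
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = [instruction, "", "Past conversation:"]
    lines += [f"- {turn}" for turn in recent] or ["- (none)"]
    lines += ["", f"User: {utterance}"]
    return "\n".join(lines)
```

Limiting the history to the last few turns is a design choice for brevity. A system that must recall older facts would instead select history entries by relevance to the utterance.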

[0638] This invention allows users to receive daily support while their families gain peace of mind through regular reports on the user's condition. The system maintains the quality of voice data through noise filtering, creating a natural and easy-to-understand conversational environment for the user.
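The disclosure does not specify the noise filtering method. As one simple possibility, the sketch below applies an energy-based noise gate: audio is split into short frames, and frames whose root-mean-square level falls below a threshold are silenced. The technique, the function name, and the default parameters are all assumptions chosen for illustration.

```python
def noise_gate(samples: list, frame_size: int = 4,
               threshold: float = 0.1) -> list:
    """Silence frames whose RMS level is below `threshold`.

    `samples` are floats in the range -1.0 to 1.0. The output has the
    same length as the input.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    out = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        # Keep frames that carry speech-level energy and mute the rest.
        out.extend(frame if rms >= threshold else [0.0] * len(frame))
    return out
```

A gate like this only removes low-level background noise between utterances. Noise that overlaps speech would need a different approach, such as spectral filtering.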

[0639] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0640] Step 1:

[0641] The user inputs voice into the device's microphone. The device has the capability to process this voice data based on cloud-based speech recognition technology. The input is the user's spoken voice, and the output is digitized voice data. The device sends this voice data to a server for further processing.

[0642] Step 2:

[0643] The server uses the received audio data to perform analysis with a natural language processing unit. This process uses speech recognition technology to convert the audio data into text data, thereby understanding the user's intent. Here, the input is digital audio data, and the output is the analyzed text data. Furthermore, the server retrieves past conversation information from its storage device and compares it with historical information related to the intent.

[0644] Step 3:

[0645] The server references text data and conversation history and uses a generative AI model to generate appropriate prompt sentences. Based on these prompt sentences, it generates responses that match the user's requests. The input is history information matched with text data, and the output is a natural language response text based on the prompt sentences.

[0646] Step 4:

[0647] The generated response text is sent from the server to the terminal, which uses speech synthesis technology to convert this text into speech data. The input here is a response text in natural language, and the output is speech data with adjusted tone and speed. The terminal then provides this speech data to the user audibly through its speaker.

[0648] Step 5:

[0649] The user accepts the audio output from the device and decides whether to continue the conversation based on it. In this step, the input is the audio response, and the output is the user's experience and information. If additional information is needed, the user restarts the conversation from step 1.

[0650] (Application Example 1)

[0651] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0652] There is a need to provide effective support to alleviate the loneliness and anxiety that elderly people face in their daily lives and to maintain their cognitive function. Furthermore, a system is needed that allows family members living remotely to easily monitor the health and living conditions of elderly individuals. Current technology lacks consistency in audio and visual information delivery, and real-time activity recording and reporting automation is insufficient; therefore, improved solutions are desired.

[0653] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0654] In this invention, the server includes natural language processing means, a storage device for recording and utilizing past conversation information, and a speech synthesis device for outputting generated responses as speech. This enables elderly people to naturally give instructions by voice and receive appropriate responses. The invention also includes a data management device for managing the user's schedule and health information, a communication device for periodically reporting information to the user's family, a display device for providing information audibly and visually, and means for recording the user's activities and reporting the situation to the family. This enables real-time notifications of schedules and reports on health status.

[0655] "Natural language processing means" refers to technologies that convert voice input from users into text data and analyze its intent.

[0656] A "memory device" is a device that records past conversation information and uses it to generate appropriate responses.

[0657] A "speech synthesis device" is a device that converts generated responses into speech data and outputs it to the user audibly.

[0658] A "speech recognition device" is a device that converts obtained speech into text data.

[0659] A "data management device" is a device that manages a user's schedule and health information and sends reminders as needed.

[0660] A "communication device" is a device used to periodically report information to the user's family.

[0661] A "display device" is a device that provides information in both audio and visual form.

[0662] "Means of reporting" refers to methods for recording the user's activities and notifying family members in remote locations of the situation.

[0663] The system for realizing this invention mainly consists of three elements: a server, a terminal, and a user.

[0664] The server uses natural language processing technology to convert user voice input into text data. For example, when a user asks an everyday question, the voice is quickly converted into text, and the intent is analyzed using a generative AI model. The analyzed text data is stored in a memory device that also contains past conversation information, and an appropriate response is generated based on this. This response is sent to the terminal in text format. The hardware and software used include speech recognition APIs (e.g., Google Cloud Speech-to-Text) and natural language processing libraries (e.g., NLTK, spaCy).

[0665] The terminal converts text responses sent from the server into speech using a speech synthesis device and provides it to the user. The tone and speed of the speech are adjusted to ensure the user can easily understand it. The terminal can also manage the user's schedule and health information using a data management device and send reminders as needed. A speech synthesis API (e.g., Amazon Polly) is used in this process.

[0666] Users interact with the system through a device that provides voice and visual information. For example, they can ask questions like, "What time should I take my medicine?" to their smart device to check their schedule and receive instructions. The user's activities are regularly recorded and reported to their family via a communication device. Family members, even those living far away, can gain peace of mind through this information.

[0667] For example, prompt statements include the following:

[0668] "Tell me when to take my medicine."

[0669] "What's next?"

[0670] "What kind of exercise did you do today?"

[0671] By using these prompts, users can receive support to make their daily lives more comfortable.

[0672] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0673] Step 1:

[0674] The server receives audio input from the user. This audio data is given as input and converted into text data by a speech recognition API (e.g., Google Cloud Speech-to-Text). The converted text data is then passed to the next parsing step.

[0675] Step 2:

[0676] The server analyzes the converted text data using natural language processing libraries (e.g., NLTK, spaCy). It extracts the user's intent from the input text data and searches for relevant information from a memory storage based on past conversation information. In this process, a generative AI model derives an appropriate response to the generated text data. If any other information is needed, it searches again, organizes the information, and passes it on to the next step.

[0677] Step 3:

[0678] The server sends the generated response to the speech synthesis device. Using a speech synthesis API (e.g., Amazon Polly), the text data is converted into speech data. The converted speech data is then sent to the terminal as output.

[0679] Step 4:

[0680] The terminal receives audio data transmitted from the server and outputs it to the user. By having the terminal play the audio for the user, the user receives responses and advice from the system in audio format. The tone and speed of the audio are appropriately adjusted to provide natural and easy-to-listen-to audio.

[0681] Step 5:

[0682] The user listens to the voice output from the terminal and, if necessary, asks additional questions by voice. This new voice input returns the flow to step 1, and the process continues in the same manner. Furthermore, if the user's status or schedule is updated, the terminal uses the data management device to record the information in real time and periodically reports it to the family via the communication device. This activity log enables continuous data management and provides peace of mind.
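The activity recording and periodic reporting mentioned in Step 5 can be sketched as a log of dated entries that is summarized over a reporting period. The `Activity` record, its category labels, and the report layout are hypothetical. The disclosure does not define the content or format of the family report.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Activity:
    day: date
    kind: str   # hypothetical category, e.g. "medication" or "exercise"
    note: str   # free-text detail taken from the conversation


def build_family_report(log: list, start: date, end: date) -> str:
    """Summarize logged activities between `start` and `end`, inclusive."""
    in_range = [a for a in log if start <= a.day <= end]
    counts = Counter(a.kind for a in in_range)
    lines = [f"Report for {start.isoformat()} to {end.isoformat()}"]
    for kind in sorted(counts):
        lines.append(f"- {kind}: {counts[kind]}")
    if not counts:
        lines.append("- no activity recorded")
    return "\n".join(lines)
```

The resulting text would then be handed to the communication device for delivery. The delivery channel is outside the scope of this sketch.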

[0683] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0684] The AI support system for the elderly, which incorporates the emotion engine of this invention, is realized through the interaction of a server, a terminal, and a user. A specific embodiment is shown below.

[0685] The server first receives the user's voice transmitted from the terminal and converts it into text using natural language processing. This text data is then input into the emotion engine, which analyzes the user's emotional state. This emotional information is reflected in the content and tone of the response generated by the server.

[0686] The emotion engine has the function of analyzing the user's emotions based on the content of their speech, tone of voice, tempo, etc. Specifically, if it determines that the user is feeling down, the server generates words of comfort and encouragement and provides a response in a corresponding tone of voice. In addition, by comparing this with past conversation history, it takes into account the user's past emotional state, enabling it to respond more appropriately.

[0687] The device interacts with the user using text data obtained through speech recognition and responses from the server. The text responses sent from the server are converted into speech by a speech synthesis system and output to the user. This speech is given a tone influenced by an emotion engine, enabling natural and emotionally resonant expressions.

[0688] Users communicate with the AI through everyday conversations. For example, if a user says, "I'm feeling a little down today," the server analyzes this based on its emotion engine and generates a response such as, "Is something wrong? Shall we talk?" By providing responses that match the user's emotions in this way, it fosters a sense of security and trust.
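The behavior in this example can be imitated with a very small keyword-based estimator, shown below. This is a deliberately simplified stand-in for the emotion identification model 59: the keyword lexicon, the label names, and the neutral reply are assumptions, and a trained model would replace the word matching in practice.

```python
import re

# Hypothetical keyword lexicon used only for this illustration.
EMOTION_KEYWORDS = {
    "sadness": {"down", "lonely", "sad", "tired"},
    "joy": {"good", "happy", "glad", "fun"},
}

# Replies keyed by estimated emotion. The neutral reply is an assumption.
REPLIES = {
    "sadness": "Is something wrong? Shall we talk?",
    "joy": "That's great.",
    "neutral": "I see. Please tell me more.",
}


def estimate_emotion(utterance: str) -> str:
    """Return the emotion label whose keywords match the utterance most often."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    scores = {emotion: len(words & keywords)
              for emotion, keywords in EMOTION_KEYWORDS.items()}
    best = max(scores, key=scores.get)   # ties resolve to the first label listed
    return best if scores[best] > 0 else "neutral"


def reply_for(utterance: str) -> str:
    """Pick a reply that matches the estimated emotion."""
    return REPLIES[estimate_emotion(utterance)]
```

Text alone cannot capture tone of voice or tempo, which the emotion engine is also described as using, so this sketch covers only the content-based part of the estimate.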

[0689] This invention goes beyond simply recording and managing information; it provides a new form of support that is empathetic to the feelings of the elderly through emotion recognition and responses based on those emotions. The system aims to enhance emotional support for users and provide them with a richer life experience through intimate dialogue.

[0690] The following describes the processing flow.

[0691] Step 1:

[0692] The terminal receives voice input from the user. This data is converted into text data using speech recognition technology and sent to the server.

[0693] Step 2:

[0694] The server analyzes the received text data using natural language processing techniques to understand the user's utterances. This analysis reveals the user's requests and intentions.

[0695] Step 3:

[0696] The server inputs text data into the emotion engine, which then analyzes the user's emotional state. Emotions are inferred from factors such as tone of voice, speed, and the words used.

[0697] Step 4:

[0698] The emotion engine returns the analysis results to the server, providing data based on the user's emotions. This data is then used in the subsequent response generation process.

[0699] Step 5:

[0700] The server references the results of the emotion engine and past conversation history to generate an appropriate response. It adjusts the content and tone of voice of the response according to the user's emotional state.

[0701] Step 6:

[0702] The server sends the generated text response to the terminal. During this process, a speech synthesis system is used to convert the text into speech data.

[0703] Step 7:

[0704] The device uses the converted voice data to output a response to the user. Emotion-based tone adjustments are applied, resulting in a natural and friendly voice for the user.

[0705] Step 8:

[0706] The user hears this response and, if they wish to continue the conversation, provides further voice input. This cycle is repeated, enabling continuous conversation.
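The emotion-based tone adjustment in Steps 5 to 7 amounts to choosing voice parameters from the estimated emotion before synthesis. A minimal sketch is given below. The `VoiceStyle` fields, the emotion labels, and every numeric value are hypothetical and would need tuning for a real speech synthesis system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStyle:
    rate: float     # 1.0 = normal speaking rate
    pitch: float    # offset in semitones from the default voice
    volume: float   # 1.0 = normal loudness


# Hypothetical settings: slower and softer for sadness, brighter for joy.
STYLE_BY_EMOTION = {
    "sadness": VoiceStyle(rate=0.85, pitch=-1.0, volume=0.9),
    "joy": VoiceStyle(rate=1.0, pitch=1.0, volume=1.0),
}

# Used when the emotion is unknown; slightly slow for ease of listening.
DEFAULT_STYLE = VoiceStyle(rate=0.9, pitch=0.0, volume=1.0)


def style_for(emotion: str) -> VoiceStyle:
    """Select synthesis parameters for the estimated emotion."""
    return STYLE_BY_EMOTION.get(emotion, DEFAULT_STYLE)
```

Keeping the mapping in a table separates the emotion estimate from the synthesis settings, so either side can be changed without touching the other.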

[0707] (Example 2)

[0708] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0709] There is a need for technologies that alleviate the emotional and informational anxieties and feelings of loneliness that older adults face in their daily lives, and that provide smoother and more approachable communication. Furthermore, there is a need to strengthen emotional support for older adults through appropriate voice responses that take their emotions into consideration.

[0710] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0711] In this invention, the server includes natural language processing means, memory means for recording and utilizing past utterances, and speech synthesis means for outputting the generated response as sound. This enables the detection of the user's emotional state and the generation of natural speech responses accordingly, thereby fostering a sense of security and trust in the elderly and providing emotional support.

[0712] "Natural language processing means" refers to technologies that analyze input character data, understand its content, and generate appropriate responses.

[0713] "Memory means" refers to a function that stores past utterances and allows them to be referenced and used as needed.

[0714] "Speech synthesis means" refers to a technology for converting a generated response into sound and outputting it.

[0715] "Speech recognition means" refers to technology that converts input audio data into text data.

[0716] "Information management means" refers to functions for organizing and managing users' schedule information and health information.

[0717] "Emotion analysis methods" refer to technologies used to determine a user's emotional state from the content of their speech and the intonation of their voice.

[0718] "Communication means" refers to a function that periodically reports information to the user's relatives and related parties.

[0719] The AI support system for the elderly according to the present invention is realized through the interaction of a server, a terminal, and a user. Its main objective is to provide elderly individuals with natural, empathetic dialogue and to support their emotional stability.

[0720] The server first receives the audio data sent from the terminal and converts it into text data using speech recognition software. A commonly used speech recognition technology for this process is a "speech recognition API." Subsequently, natural language processing technology (e.g., generative AI models) is used to analyze the text data and identify the emotions contained in the user's utterances. The emotion analysis refers to the user's past conversation history and reflects this in the content of the generated response.

[0721] Specifically, if a user says something like, "I'm feeling a little lonely today," the server analyzes that statement using sentiment analysis tools and generates a response that empathizes with the user's feelings, such as, "How was your day?" This process utilizes a generative AI model, and an example of a prompt would be, "Generate a kind response for when the user is feeling lonely."

[0722] The terminal receives text data from the server, converts it into speech using speech synthesis technology, and outputs it to the user. This conversion enables a natural conversational format with an acoustic tone that reflects emotions.

[0723] Users can communicate with the system through everyday conversations and receive emotional support based on the responses. This allows users to gain a sense of security and trust, and receive support to enjoy a better life experience.

[0724] Overall, this system aims to improve the quality of life for the elderly, providing a new form of support through technological elements and user interaction.

[0725] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0726] Step 1:

[0727] The server receives audio data transmitted from the terminal. The input is the user's speech, which is captured as audio data through the terminal's microphone. The server records this audio data as digital data and prepares it for the next processing step.

[0728] Step 2:

[0729] The server converts the received audio data into text data using speech recognition. For example, this process involves converting audio data into a string using a speech recognition API. The input here is audio data, and the output is text data that can be processed by a machine.

[0730] Step 3:

[0731] The server uses a generative AI model to analyze text data and determine the user's emotional state. In this step, the input is text data obtained through speech recognition. The generative AI model uses prompt sentences to analyze emotions and outputs the user's emotional state (e.g., joy, sadness, etc.).

[0732] Step 4:

[0733] The server generates an appropriate response using natural language processing based on the results of sentiment analysis. The input consists of the user's emotional state and past conversation history, and based on this, it outputs a response text that is emotionally empathetic to the user. Specifically, it creates a response using the prompt example "Generate a gentle response when the user is feeling lonely."

[0734] Step 5:

[0735] The terminal converts the response text sent from the server into speech using speech synthesis technology and plays it back to the user. The input is the generated response text, which is delivered to the user as a human-like voice output using speech synthesis technology. In this step, the user can continue the conversation with the system by listening to the voice response from the terminal.

[0736] Step 6:

[0737] After receiving a response from the system, the user provides further voice input. This interaction is continuous, and the server repeats the process from step 1 as it receives new voice data. User feedback and new utterances help improve sentiment analysis and the quality of responses.
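Step 3 of this flow, in which a generative AI model is prompted to name the user's emotional state, can be sketched as a prompt builder plus a strict parser for the model's answer. The label set, the prompt wording, and the function names are assumptions. The model itself is passed in as a plain callable so that the sketch does not depend on any particular service.

```python
from typing import Callable

# Hypothetical label set. The source names joy and sadness as examples.
ALLOWED_EMOTIONS = ("joy", "sadness", "anger", "anxiety", "neutral")


def build_emotion_prompt(text: str) -> str:
    """Ask the model to answer with exactly one label."""
    labels = ", ".join(ALLOWED_EMOTIONS)
    return (
        f"Classify the emotional state of the following utterance as one of: "
        f"{labels}. Answer with the label only.\nUtterance: {text}"
    )


def parse_emotion(model_output: str) -> str:
    """Normalize the model's answer and fall back to "neutral" if it is unusable."""
    cleaned = model_output.strip().lower().strip(".!\"'")
    return cleaned if cleaned in ALLOWED_EMOTIONS else "neutral"


def classify_emotion(text: str, model: Callable[[str], str]) -> str:
    """Run one utterance through the prompt, the model, and the parser."""
    return parse_emotion(model(build_emotion_prompt(text)))
```

Restricting the answer to a closed label set, and falling back to a neutral label otherwise, keeps an unexpected model output from propagating into the response generation in Step 4.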

[0738] (Application Example 2)

[0739] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0740] There is a problem of insufficient emotional support in the daily lives of the elderly. Conventional information provision systems have difficulty generating responses that are sensitive to the user's feelings, and thus have the challenge of not being able to adequately provide the elderly with a sense of psychological security and trust.

[0741] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0742] In this invention, the server includes a natural language processing means, an emotion analysis device means, and a generation means for generating responses that correspond to the user's emotional state. This makes it possible to accurately grasp the emotions of elderly people through their daily conversations and provide emotionally expressive responses based on that understanding.

[0743] "Natural language processing" refers to technologies that convert information input via speech or text into a format that a computer can understand, and analyze the user's intent and meaning.

[0744] A "memory device" is a device that has the function of storing past dialogue information and data, and retrieving and using it as needed.

[0745] "Speech synthesis device means" refers to a technology for converting generated text data into speech and outputting it in a natural form.

[0746] "Speech recognition device means" refers to a technology that analyzes speech collected from users and converts it into text data.

[0747] A "data management device means" is a technology that provides convenience by organizing and managing information related to users' schedules and health.

[0748] A "communication device" is a device that has the function of periodically reporting and sharing information with the user's family and related parties.

[0749] An "emotion analysis device means" is a technology that analyzes a user's emotional state from their voice or text information and identifies that emotion.

[0750] "Generation means" refers to techniques for generating appropriate responses based on the analyzed results.

[0751] The system implementing this invention mainly consists of a server, a terminal, and a user. First, the server uses a speech recognition device to convert the audio data transmitted from the terminal into text data. Next, this text data is processed by a natural language analysis device, and the user's emotional state is identified by an emotion analysis device. Based on this, a generation device generates a response, which is then output again as audio by a speech synthesis device.

[0752] The terminal receives speech synthesis results from the server and provides responses to the user through voice output. This allows the user to communicate with the system through dialogue. In particular, emotion analysis makes it possible to naturally express words of comfort and encouragement to a depressed user.

[0753] The hardware used includes smart speakers and smartphones, while the software includes natural language processing libraries, speech recognition engines, and speech synthesis engines. Specifically, the "SpeechRecognition" library can be used as the speech recognition engine, and generative AI models for natural language processing can be used.
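As a minimal sketch of how the "SpeechRecognition" library named above might be used for the speech-to-text step, the following Python function transcribes a WAV file. The function name `transcribe_wav`, the file-based input, the choice of the `recognize_google` backend, and the `ja-JP` default are assumptions made for illustration; the source names only the library.

```python
from typing import Optional


def transcribe_wav(path: str, language: str = "ja-JP") -> Optional[str]:
    """Return the recognized text for a WAV file, or None if nothing was understood.

    Hypothetical helper: the source names the library, not this function.
    """
    # Imported lazily so this module still loads where the third-party
    # SpeechRecognition package (import name: speech_recognition) is absent.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    try:
        # Assumed backend: the Google Web Speech API, which needs network access.
        # A network or API failure raises sr.RequestError, left to the caller.
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The audio was readable but no speech could be recognized.
        return None
```

Returning `None` for unintelligible audio lets the caller decide whether to ask the user to repeat the utterance.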

[0754] For example, if a user says, "Yesterday was a good day," the server will generate a response such as, "That's great. What are your plans for today?" A specific prompt might be, "Generate words of encouragement based on the user's mood today." This system allows elderly people to receive emotional support in their daily lives and gain a sense of security and trust.
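The prompt in the example above could be assembled from the recognized utterance and the estimated emotion roughly as follows. This is only a sketch: the `Turn` class, the `build_prompt` function, the field layout of the prompt, and the final length instruction are hypothetical, and only the instruction sentence itself is taken from the text.

```python
from dataclasses import dataclass

# Instruction sentence quoted from the example in the description.
BASE_INSTRUCTION = "Generate words of encouragement based on the user's mood today."


@dataclass
class Turn:
    utterance: str  # text produced by speech recognition
    emotion: str    # label produced by emotion analysis, e.g. "happy"


def build_prompt(turn: Turn) -> str:
    """Combine the instruction with the user's utterance and estimated mood."""
    return (
        f"{BASE_INSTRUCTION}\n"
        f"Estimated mood: {turn.emotion}\n"
        f"User said: \"{turn.utterance}\"\n"
        # Assumed extra constraint, not part of the source prompt.
        "Reply in one or two short, warm sentences."
    )


prompt = build_prompt(Turn("Yesterday was a good day", "happy"))
```

The resulting string would then be passed to whichever generative AI model the server uses.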

[0755] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0756] Step 1:

[0757] The user provides voice input to the terminal. The terminal receives this voice data through its microphone and passes it to the speech recognition engine within the terminal.

[0758] Step 2:

[0759] The terminal uses a speech recognition engine to convert the speech data into text data, capturing the content of the utterance as text. The input is speech data, and the output is the corresponding text data.

[0760] Step 3:

[0761] The server receives text data from the terminal and analyzes the text using natural language processing (NLP) tools. Through this analysis, the user's utterances and intentions are understood. The input for the analysis is text data, and the output is intent data derived from the analysis of that text.

[0762] Step 4:

[0763] The server uses an emotion analysis device to identify the user's emotional state from intent data. This allows the server to determine the user's current emotions. The input is the analyzed intent data, and the output is the estimated emotional state.

[0764] Step 5:

[0765] The server uses a generation mechanism to generate a response based on the identified emotional state. This process utilizes a generative AI model. The input is the emotional state, and the output is an appropriate response sentence corresponding to that emotion.

[0766] Step 6:

[0767] The server uses a speech synthesis device to convert the generated response text into audio data. This prepares an audio response that the user can hear. The input is a response text in text format, and the output is the audio data of that text.

[0768] Step 7:

[0769] The terminal receives audio data from the server and plays the response to the user via the speaker. The user receives the response as audio. The output from the speaker is audio data generated by the server.
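Steps 2 through 6 above form a chain in which each stage's output is the next stage's input. The following sketch shows that data flow with pluggable stages. The class name `DialoguePipeline`, the field names, and the stub lambdas are all hypothetical stand-ins; real engines for recognition, analysis, generation, and synthesis would replace them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialoguePipeline:
    recognize: Callable[[bytes], str]         # Step 2: audio -> text
    analyze_intent: Callable[[str], dict]     # Step 3: text -> intent data
    estimate_emotion: Callable[[dict], str]   # Step 4: intent data -> emotional state
    generate: Callable[[str], str]            # Step 5: emotional state -> response text
    synthesize: Callable[[str], bytes]        # Step 6: response text -> audio

    def respond(self, audio_in: bytes) -> bytes:
        """Run one utterance through every stage in order."""
        text = self.recognize(audio_in)
        intent = self.analyze_intent(text)
        emotion = self.estimate_emotion(intent)
        reply = self.generate(emotion)
        return self.synthesize(reply)


# Stub stages for illustration only: audio is faked as UTF-8 bytes.
pipeline = DialoguePipeline(
    recognize=lambda audio: audio.decode("utf-8"),
    analyze_intent=lambda text: {"text": text, "intent": "share_experience"},
    estimate_emotion=lambda intent: "happy" if "good" in intent["text"] else "neutral",
    generate=lambda emotion: (
        "That's great. What are your plans for today?"
        if emotion == "happy"
        else "I see. Tell me more."
    ),
    synthesize=lambda reply: reply.encode("utf-8"),
)

audio_out = pipeline.respond(b"Yesterday was a good day")
```

Keeping each stage behind a plain callable makes it straightforward to move a stage between the terminal and the server without changing the rest of the flow.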

[0770] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0771] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0772] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0773] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0774] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0775] These emotions are distributed around the 3 o'clock position on the emotion map 400, and usually fluctuate between security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal sensation, giving a calm impression.

[0776] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible (expressed in actions) it becomes.
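The qualitative layout of the emotion map (left and right halves, upper and lower sides, inner and outer rings) can be illustrated with a small coordinate model. This is an illustrative sketch only: the source gives no coordinates, so the `MapPoint` class, the unit scale, and the `outer_threshold` value are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPoint:
    x: float  # negative = left ("reaction" side), positive = right ("situation" side)
    y: float  # positive = upper side ("pleasure"), negative = lower side ("displeasure")

    @property
    def radius(self) -> float:
        """Distance from the center; small = primitive, large = expressed in action."""
        return math.hypot(self.x, self.y)


def describe(point: MapPoint, outer_threshold: float = 0.5) -> dict:
    """Classify a point by side, valence, and layer (threshold is an assumed value)."""
    if point.x > 0:
        side = "situation"
    elif point.x < 0:
        side = "reaction"
    else:
        side = "both"  # the vertical axis mixes reaction and situational judgment
    if point.y > 0:
        valence = "pleasure"
    elif point.y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"
    layer = "expressed in action" if point.radius >= outer_threshold else "inner"
    return {"side": side, "valence": valence, "layer": layer}


three_o_clock = describe(MapPoint(0.8, 0.0))
near_center = describe(MapPoint(-0.1, 0.2))
```

A point at the 3 o'clock position falls on the situation side with neutral valence, which matches the description of that region as calm and driven by situational awareness.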

[0777] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
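The idea that moving toward an ideal balance yields pleasure and moving away yields discomfort can be expressed as a simple signed quantity. The sketch below is one possible formulation under stated assumptions: the function names, the use of absolute deviation, and the battery readings are hypothetical and are not given in the source.

```python
def valence_from_balance(previous: float, current: float, ideal: float) -> float:
    """Positive when the reading moved toward the ideal, negative when it moved away."""
    return abs(previous - ideal) - abs(current - ideal)


def label(valence: float) -> str:
    """Map the signed change onto the two outcomes named in the description."""
    if valence > 0:
        return "pleasure"
    if valence < 0:
        return "discomfort"
    return "neutral"


# Assumed example: battery level as a fraction, with a full charge as the ideal.
charging = label(valence_from_balance(previous=0.5, current=0.8, ideal=1.0))
draining = label(valence_from_balance(previous=0.8, current=0.3, ideal=1.0))
```

Several such balances (for example posture and battery level) could be combined, but how they would be weighted is not specified in the source.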

[0778] The emotion map defines two emotions that promote learning. One is a negative emotion located around the middle of "repentance" and "reflection" on the situation side; that is, when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is a positive emotion located around "desire" on the reaction side; that is, when the robot has positive feelings such as "I want more" or "I want to know more."

[0779] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
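Once the neural network has produced an emotion value for each emotion on the map, determining the user's emotion can be as simple as choosing the largest value. The sketch below shows that selection, plus one assumed way of measuring whether neighboring emotions have similar values. The function names, the example values, and the squared-difference penalty are hypothetical; the source states only that nearby emotions are trained to have similar values, not how.

```python
def dominant_emotion(values: dict[str, float]) -> str:
    """Return the emotion with the highest emotion value."""
    return max(values, key=values.get)


def neighbor_penalty(values: dict[str, float], neighbors: list[tuple[str, str]]) -> float:
    """Sum of squared differences between emotions adjacent on the map.

    Assumed regularizer: a small result means neighboring emotions have
    similar values, as the description requires of the trained network.
    """
    return sum((values[a] - values[b]) ** 2 for a, b in neighbors)


# Made-up emotion values standing in for the network's output.
emotion_values = {"reassured": 0.8, "calm": 0.7, "confident": 0.75, "anxious": 0.1}
adjacent = [("reassured", "calm"), ("calm", "confident")]

top = dominant_emotion(emotion_values)
penalty = neighbor_penalty(emotion_values, adjacent)
```

In training, a term like `neighbor_penalty` could be added to the loss, although the source does not say which mechanism is actually used.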

[0780] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0781] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0782] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0783] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0784] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0785] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing the specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which are processors with circuit configurations specifically designed to perform particular processing. Each of these processors has built-in or connected memory and performs the specific processing by using that memory.

[0786] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0787] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0788] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0789] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0790] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0791] The following is further disclosed regarding the embodiments described above.

[0792] (Claim 1)

[0793] Natural language processing means,

[0794] A memory device that records and uses past conversation content,

[0795] A speech synthesis means that outputs the generated response as speech,

[0796] A speech recognition means for converting the obtained audio into text,

[0797] A data management system for managing user schedules and health information,

[0798] A system that includes a means of communication to periodically report information to the user's family.

[0799] (Claim 2)

[0800] The system according to claim 1, wherein the storage means includes a model that adaptively generates responses by referring to conversation history.

[0801] (Claim 3)

[0802] The system according to claim 1, wherein the data management means has a function to notify the user of a reminder based on their schedule.

[0803] "Example 1"

[0804] (Claim 1)

[0805] Natural language processing device,

[0806] A memory device that records and references past conversation information,

[0807] A speech synthesizer that outputs the generated response as speech,

[0808] A speech recognition device that converts received audio into text,

[0809] An information management device that manages users' schedules and health information,

[0810] A communication device that periodically reports information to the user's family,

[0811] A method for filtering noise when converting audio data to text data,

[0812] A system that includes means for adjusting the tone and speed of the generated response.

[0813] (Claim 2)

[0814] The system according to claim 1, wherein the storage device has the function of referencing conversation history and creating an adaptive response using a generative AI model.

[0815] (Claim 3)

[0816] The system according to claim 1, wherein the information management device has a function to create a prompt message and notify a reminder based on the user's schedule.

[0817] "Application Example 1"

[0818] (Claim 1)

[0819] Natural language processing means,

[0820] A memory device that records and uses past conversation information,

[0821] A speech synthesizer that outputs the generated response as speech,

[0822] A speech recognition device that converts the obtained audio into text,

[0823] A data management device that manages users' schedules and health information,

[0824] A communication device that periodically reports information to the user's family,

[0825] A display device for providing information audibly and visually,

[0826] A system that includes means for recording the user's activities and reporting the situation to their family.

[0827] (Claim 2)

[0828] The system according to claim 1, further comprising a model in which the storage device adaptively generates a response by referring to the conversation history.

[0829] (Claim 3)

[0830] The system according to claim 1, wherein the data management device has a function to notify users of reminders based on their schedules.

[0831] "Example 2 of combining an emotion engine"

[0832] (Claim 1)

[0833] Natural language processing means,

[0834] A memory device that records and uses past speech content,

[0835] A speech synthesis means that outputs the generated response as sound,

[0836] A speech recognition means that converts the obtained sound into text,

[0837] Information management means for managing user schedule information and health information,

[0838] A means of analyzing the emotional state of a user,

[0839] A system that includes a means of communication to periodically report information to the user's relatives.

[0840] (Claim 2)

[0841] The system according to claim 1, wherein the storage means includes a model that adaptively generates responses by referring to the speech history.

[0842] (Claim 3)

[0843] The system according to claim 1, wherein the information management means has a function to notify users of alerts based on their schedule information.

[0844] "Application example 2 when combining with an emotional engine"

[0845] (Claim 1)

[0846] Natural language processing tools,

[0847] A storage device for recording and utilizing past dialogue information,

[0848] A speech synthesis device means that outputs the generated response as speech,

[0849] A speech recognition device means for converting collected audio into text,

[0850] A data management device means for managing the user's schedule and health information,

[0851] A communication device that periodically reports information to the user's family,

[0852] Emotion analysis device means,

[0853] A generation means for generating responses that correspond to the user's emotional state,

[0854] A system that includes this.

[0855] (Claim 2)

[0856] The system according to claim 1, wherein the storage device includes a model that adaptively generates a response by referring to the dialogue history.

[0857] (Claim 3)

[0858] The system according to claim 1, wherein the data management device means has a function of notifying the user of a reminder based on their schedule.

[Explanation of Symbols]

[0859]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: natural language processing means; storage means that records and uses past conversation content; speech synthesis means that outputs a generated response as speech; speech recognition means that converts acquired audio into text; data management means that manages the user's schedule and health information; and communication means that periodically reports information to the user's family.

2. The system according to claim 1, wherein the storage means includes a model that adaptively generates responses by referring to conversation history.

3. The system according to claim 1, wherein the data management means has a function to notify the user of a reminder based on the user's schedule.