system

A voice-operable system for the elderly converts audio signals into digital data, analyzes user intent, and provides personalized responses, addressing the lack of user-friendly interfaces in conventional support systems.

JP2026100712A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Conventional support systems for the elderly lack personalization and user-friendly interfaces, making it difficult for them to manage daily tasks and receive appropriate responses based on their individual needs and preferences.

Method used

A system that converts audio signals into digital data, analyzes user intent using speech recognition and natural language processing, and provides tailored responses through voice feedback, enabling memo creation and reminder settings.

Benefits of technology

Enables elderly individuals to manage daily tasks efficiently and comfortably using voice commands, providing personalized and intuitive support.


Figure 2026100712000001_ABST

Abstract

[Problem] To provide a system. [Solution] A system including: a terminal that converts audio signals into digital data; a means for receiving audio input from the user; a means for transmitting the digital data to a server, the server converting the audio data into text data; a means for analyzing the text data to determine the user's intent; a means for generating appropriate output based on the user's intent and transmitting it to the terminal as audio data; and a means for playing the audio data to the user to provide an audio response.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In an aging society, the elderly need appropriate support tailored to their individual needs and preferences in order to live secure and independent lives. However, conventional support systems are generalized and often cannot readily provide individualized responses. In terms of usability, they also lack an interface that is easy for the elderly to operate. Therefore, there is a need for a personalized, voice-operable support system that the elderly can use comfortably.

Means for Solving the Problems

[0005] This invention includes a terminal that converts audio signals into digital data and then converts the audio data into text data via a server. The server analyzes the text data using speech recognition software and a natural language processing engine to determine the user's intent. Subsequently, it generates appropriate output based on the user's intent, transmits it to the terminal as audio data, and plays it back. This provides a system that allows elderly people to easily create memos and set reminders using voice commands, thereby addressing individual needs.

[0006] An "audio signal" is an electrical signal obtained by converting sound, and it is information at a stage before it is processed as digital data.

[0007] "Digital data" refers to data that represents analog information in a numerical format and can be processed by a computer.

[0008] A "terminal" is a device that a user directly operates and that has the function of receiving voice input and sending it to a server.

[0009] A "server" is a central computer system that processes information and provides necessary services through a network.

[0010] "Speech recognition software" is a program that analyzes speech signals and converts them into text data.

[0011] A "natural language processing engine" is a system that analyzes text data to understand human language and automatically performs responses and information processing.

[0012] "User intent" refers to the specific wishes and objectives that the user requests from the system through voice commands.

[0013] "Memo creation" is a function that records important information based on voice input and saves it in a format that can be referenced later.

[0014] "Reminder settings" are scheduling functions that allow users to receive notifications based on specific times or conditions.

Brief Description of the Drawings

[0015]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, in which an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, in which an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0016] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0017] First, the terms used in the following description will be explained.

[0018] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0019] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0020] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0021] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0022] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0023] [First Embodiment]

[0024] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0025] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0026] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0027] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0028] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0029] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0030] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0031] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0032] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes it on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0033] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0034] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes it on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0035] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0036] The system according to the present invention provides personalized services for the elderly based on voice input. This system mainly consists of a terminal, a server, and data communication between them.

[0037] First, the terminal uses a high-sensitivity microphone to acquire the user's voice input. This voice is converted into digital data in real time. The converted voice data is then transmitted to the server via the internet.

[0038] The server is equipped with speech recognition software, which converts received speech data into text data. This text data is then analyzed by a natural language processing engine to identify the user's intent. The analysis determines what the user wants and what services should be provided.

[0039] For example, if a user says by voice input, "Set a reminder to take my medicine tomorrow at 3 PM," the server parses this request and generates data corresponding to the reminder setting. This data is then sent back to the terminal as needed, and the terminal's scheduling service sets the reminder and manages it across systems.
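The parsing of the reminder request in [0039] can be illustrated with a minimal Python sketch. It is not the method disclosed in this application: the `Reminder` type, the `parse_reminder` function, and the regular expression are hypothetical, and the pattern only covers English phrases shaped like the example sentence. It also does not handle a time that has already passed today.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Reminder:
    task: str       # what the user wants to be reminded of
    due: datetime   # when the reminder should fire


# Hypothetical pattern: "set a reminder to <task> [tomorrow] at <hour>[:<minute>] am/pm".
_PATTERN = re.compile(
    r"set a reminder to (?P<task>.+?)"
    r"(?P<day> tomorrow)?"
    r" at (?P<hour>\d{1,2})(?::(?P<minute>\d{2}))?\s*(?P<ampm>am|pm)",
    re.IGNORECASE,
)


def parse_reminder(text: str, now: datetime) -> Optional[Reminder]:
    """Extract the task and due time from recognized text, or return None."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    hour = int(match["hour"]) % 12          # 12 AM -> 0, 12 PM -> 0 (+12 below)
    if match["ampm"].lower() == "pm":
        hour += 12
    minute = int(match["minute"] or 0)
    # "tomorrow" shifts the date by one day relative to the reference time.
    day = now.date() + timedelta(days=1 if match["day"] else 0)
    due = datetime(day.year, day.month, day.day, hour, minute)
    return Reminder(task=match["task"].strip(), due=due)
```

Passing `now` explicitly keeps the function deterministic. A deployed system would more plausibly rely on the natural language processing engine than on a fixed pattern.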

[0040] Users can confirm the results of their requests through voice feedback from the terminal. The terminal plays back the voice data received from the server as synthesized speech, providing feedback to the user, such as, "I have set a reminder for tomorrow at 3 PM."

[0041] In this way, this system uses voice to determine the user's intentions and provides services tailored to individual needs in a timely manner. It enables elderly people to manage their daily lives more comfortably through a user-friendly interface.

[0042] The following describes the processing flow.

[0043] Step 1:

[0044] The terminal receives the user's voice input using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0045] Step 2:

[0046] The terminal transmits this digital audio data to the server via the internet.

[0047] Step 3:

[0048] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0049] Step 4:

[0050] The server inputs text data into a natural language processing engine and analyzes the user's intent.

[0051] Step 5:

[0052] The server determines the user's request based on the analysis results and selects the appropriate service or task.

[0053] Step 6:

[0054] The server creates a response based on the selected service, converts it into audio data, and then sends it to the terminal.

[0055] Step 7:

[0056] The terminal plays the audio data received from the server through its speaker and notifies the user of the response result via voice.

[0057] Step 8:

[0058] The user acts according to the voice feedback from the terminal and issues additional voice commands as needed.
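Steps 1 to 8 above can be sketched end to end in Python as follows. This only illustrates the data flow between terminal and server. The `Server` and `Terminal` classes, the intent label `set_reminder`, and the reply strings are hypothetical. The three engines are injected as plain callables because the application does not name any specific speech recognition, language processing, or speech synthesis product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Server:
    # All three engines are injected placeholders, not real products.
    recognize: Callable[[bytes], str]     # Step 3: audio data -> text data
    analyze: Callable[[str], str]         # Step 4: text data -> intent label
    synthesize: Callable[[str], bytes]    # Step 6: reply text -> audio data

    def handle(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        intent = self.analyze(text)
        # Step 5: select a service for the intent (toy lookup table).
        replies = {"set_reminder": "I have set your reminder."}
        reply = replies.get(intent, "Sorry, I did not understand.")
        return self.synthesize(reply)


@dataclass
class Terminal:
    server: Server

    def on_voice(self, digital_audio: bytes) -> bytes:
        # Steps 1-2 (capture, digitize, transmit) are assumed to have
        # produced digital_audio; Step 7 would play the returned bytes.
        return self.server.handle(digital_audio)


def demo_terminal() -> Terminal:
    """Wire the pipeline with toy stand-ins so the flow can be exercised."""
    server = Server(
        recognize=lambda audio: audio.decode("utf-8"),  # bytes carry the transcript
        analyze=lambda text: "set_reminder" if "reminder" in text.lower() else "unknown",
        synthesize=lambda reply: reply.encode("utf-8"),
    )
    return Terminal(server)
```

Injecting the engines keeps the orchestration testable without audio hardware or network access, so each stand-in could later be swapped for a real service.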

[0059] (Example 1)

[0060] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0061] For the elderly and those unfamiliar with technology, efficiently managing everyday tasks is often challenging. In particular, few conventional systems can process voice commands smoothly, which highlights the need for improved usability. It is essential to provide intuitively usable systems that leverage voice recognition and natural language processing technologies.

[0062] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0063] In this invention, the server includes means for converting speech into text data using a speech recognition application, means for analyzing the user's intent using a natural language processing engine, and means for providing a time management function based on voice commands. This makes it possible to analyze the user's intent appropriately as voice input is received and to manage daily tasks efficiently and intuitively.

[0064] An "audio signal" is an electronic representation of sound waves traveling through the air.

[0065] "Digital information" refers to a form of information that is recorded or processed by converting analog data into a binary format.

[0066] An "information processing device" is an electronic device, such as a computer, that has the function of receiving information, processing it, and generating output.

[0067] "User" refers to a person who uses this system to input voice commands and receive services.

[0068] A "speech recognition application" is a software program that analyzes speech signals and converts them into text data.

[0069] A "natural language processing engine" is a system that uses technology to analyze text data and understand the meaning of language and the intent of the user.

[0070] "Audio data" refers to data obtained by digitizing audio signals, in a format that can be processed by computers and other electronic devices.

[0071] "Character data" refers to data represented using characters such as letters of the alphabet and numbers.

[0072] "Time management features" refer to functions such as calendars and reminders that allow you to set and manage tasks based on specific times.

[0073] The system for implementing the present invention utilizes technology to provide a variety of services based on voice input. Specifically, it consists mainly of a terminal, a server, and a data communication infrastructure connecting them.

[0074] The terminal uses a highly sensitive voice input device to acquire voice signals from the user. These voice signals are converted from analog to digital information and transmitted to the server in digital format. To perform this operation, the terminal is equipped with general-purpose voice recording hardware and software for processing the voice data.
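The analog-to-digital conversion mentioned in [0074] can be illustrated with a short Python sketch. It assumes the analog signal has already been sampled into floating-point values in the range -1.0 to 1.0 and that the target format is 16-bit signed little-endian PCM. Neither the sample format nor the function names `quantize_pcm16` and `sine_wave` come from the application.

```python
import math
import struct
from typing import List


def quantize_pcm16(samples: List[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    frames = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))          # clip out-of-range input
        frames.append(int(round(clipped * 32767)))
    return struct.pack(f"<{len(frames)}h", *frames)


def sine_wave(freq_hz: float, rate_hz: int, n: int) -> List[float]:
    """Generate n samples of a test tone, standing in for microphone input."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]
```

For instance, `quantize_pcm16(sine_wave(440, 16000, 160))` yields 320 bytes, that is, 10 ms of audio at an assumed 16 kHz sampling rate with two bytes per sample.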

[0075] The server converts the received digital information into text data using a speech recognition application. High-precision speech recognition software is used for this process, such as a common speech recognition API. The generated text data is then analyzed using a natural language processing engine to identify the user's intent. Cloud-based services may be used for the natural language processing engine.

[0076] For example, if a user says, "Set a reminder to take my medicine tomorrow at 3pm," the system recognizes the instruction and sets an alarm for the appropriate time. The server processes the specified task, generates feedback, and sends it back to the terminal. The terminal then uses speech synthesis technology to provide the user with voice feedback such as, "I have set a reminder for tomorrow at 3pm."

[0077] To make this system function even more efficiently, prompts powered by generative AI models can be used. For example, a prompt such as, "When the user says, 'Set a reminder to take my medicine tomorrow at 3 pm,' show the results of speech recognition and natural language processing," can be used to enable quick and accurate responses. This allows users, such as the elderly, to manage various daily tasks with simple voice commands.
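A prompt of the kind described in [0077] could be assembled as in the following Python sketch. The instruction wording, the requested JSON fields, and the `build_prompt` function are assumptions made for illustration. The call to a generative AI model is deliberately not shown because the application does not specify a model or API.

```python
import json

# Hypothetical instruction text; the application does not fix the wording.
INSTRUCTION = (
    "You assist an elderly user through a voice interface. "
    "Read the transcribed utterance and reply with JSON containing "
    '"intent", "task", and "time".'
)


def build_prompt(utterance: str) -> str:
    """Combine the fixed instruction with the recognized utterance."""
    # json.dumps quotes the utterance so that quotation marks inside it
    # cannot be confused with the surrounding prompt layout.
    return f"{INSTRUCTION}\nUtterance: {json.dumps(utterance)}\nJSON:"
```

Requesting a structured reply is one possible design choice: it lets the server hand the model output directly to the reminder-setting logic.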

[0078] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0079] Step 1:

[0080] The terminal acquires the user's voice using a highly sensitive voice input device. The input is an analog audio signal. This audio signal is converted into digital information through a high-precision digital conversion process. This conversion allows the terminal to output a digital audio file. The audio data converted to digital format is then ready to be sent to subsequent processing steps.

[0081] Step 2:

[0082] The terminal transmits digital information to the server using a secure protocol. The input is digital information generated by the terminal. For security reasons, the terminal encrypts the data and transmits it to the server over the internet. During this process, the terminal performs a transmission confirmation to verify that the data was successfully received by the server. The output is the digital information transmitted to the server.

[0083] Step 3:

[0084] The server converts received digital information into text data using a speech recognition application. The input is digital information sent from the terminal. Speech recognition software analyzes the speech pattern and converts it into text format. Using a specific algorithm, the server generates text data with high accuracy. The output is text data.

[0085] Step 4:

[0086] The server inputs the generated text data into a natural language processing engine to analyze the user's intent. The input is text data generated by a speech recognition application. The server uses natural language processing techniques to evaluate keywords and context in the text and determine the user's intent. The output is the analyzed data regarding the user's intent.

[0087] Step 5:

[0088] The server determines the necessary actions based on the user's intent and generates appropriate feedback data. The input is the analyzed user intent data. Based on the analyzed data, the server assembles the necessary information, for example, for setting reminders. The generated feedback data is prepared to be sent back to the terminal. The output is the feedback data.

[0089] Step 6:

[0090] The terminal receives feedback data sent from the server and provides feedback to the user using speech synthesis technology. The input is the feedback data sent from the server. The terminal uses speech synthesis software to convert the text into speech and plays it back to the user through the speaker. This audio feedback allows the user to confirm that their instructions have been processed correctly. The output is the audio feedback to the user.
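The transmission confirmation described in Step 2 of this flow can be sketched in Python as follows. The sketch assumes that encryption is handled by the transport layer (for example TLS) and covers only the confirmation that the server received the data intact. The `Packet` type and the digest-based acknowledgement are hypothetical and not taken from the application.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    payload: bytes   # digital audio data
    digest: str      # SHA-256 hex digest of the payload


def make_packet(audio: bytes) -> Packet:
    """Terminal side: attach a digest so the server can check integrity."""
    return Packet(audio, hashlib.sha256(audio).hexdigest())


def server_receive(packet: Packet) -> Optional[str]:
    """Server side: return the digest as an acknowledgement if intact."""
    actual = hashlib.sha256(packet.payload).hexdigest()
    return actual if actual == packet.digest else None


def confirm_delivery(packet: Packet, ack: Optional[str]) -> bool:
    """Terminal side: the transfer is confirmed only by a matching ack."""
    return ack is not None and ack == packet.digest
```

If `confirm_delivery` returns `False`, the terminal could retransmit the packet. The retry policy is left open here.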

[0091] (Application Example 1)

[0092] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0093] The challenge is to reduce the burden on users, including the elderly, so that they can manage schedules and set reminders easily and intuitively in their daily lives and execute tasks efficiently through voice control. In particular, there is a need to provide convenience and intuitive operation for users who are unfamiliar with technology.

[0094] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0095] In this invention, the server includes a device for converting voice signals into digital data, means for receiving voice input from a user, means for transmitting the digital data to an information processing device, where the information processing device converts the voice data into text data, and means for analyzing the text data to determine the user's intent. This makes it possible to accurately analyze the user's intent via voice and to easily perform daily management functions such as appropriate schedule management and reminder setting through voice operation.

[0096] An "audio signal" is an electrical signal that represents audio information, and it is fundamental data used when processing audio physically and electronically.

[0097] "Digital data" refers to audio signals that have been converted into numerical data and formatted in a way that allows them to be processed by computers and other devices.

[0098] "Device" is a general term for machines or equipment that have a specific function or role.

[0099] "User" refers to any person who operates or uses this system.

[0100] "Voice input" refers to the operation or process of capturing a user's voice into a system.

[0101] An "information processing device" is a computer system that receives digital data and performs various calculations and processes.

[0102] A "speech recognition program" is software that analyzes speech signals and converts their content into text data.

[0103] A "natural language processing module" is a software component that analyzes text data to infer the user's intent and derive appropriate responses or actions.

[0104] "Schedule management" is the process of organizing, recording, and notifying users of specific appointments and tasks according to the dates and times they specify.

[0105] "Setting a reminder" is the process of instructing the system to send a notification at a time specified by the user.

[0106] "Voice response" refers to the process by which a system communicates its processing results to the user via voice.

[0107] This invention provides a system for assisting the elderly using a voice interface. The system includes a device for user voice input, an information processing device for processing digital data, and a device for providing voice responses.

[0108] The device uses a high-sensitivity microphone to capture the user's voice and converts it into digital data. The converted digital data is transferred to an information processing device via the internet. The information processing device uses a speech recognition program (e.g., a speech recognition API) to convert the voice into text data and then uses a natural language processing module (e.g., a natural language API) to analyze the user's intent.
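The intent analysis performed by the natural language processing module can be approximated with a minimal keyword-matching sketch in Python. It is a deliberately simple stand-in rather than the module used in this application. The intent labels and keyword sets in `INTENT_KEYWORDS` are hypothetical.

```python
import re
from typing import Dict, Set

# Hypothetical intent labels and trigger words.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "set_reminder": {"reminder", "remind"},
    "add_event": {"event", "schedule", "appointment"},
    "create_memo": {"memo", "note"},
}


def classify_intent(text: str) -> str:
    """Return the intent whose keywords overlap most with the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    scores = {
        intent: len(words & keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)   # first entry wins on ties
    return best if scores[best] > 0 else "unknown"
```

A production module would also use context, as the description notes. Keyword overlap alone cannot distinguish, for example, a request to set a reminder from a question about an existing one.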

[0109] The analysis identifies the necessary actions. For example, if a user enters the voice command "Set a reminder to take my medicine at 3 PM," the information processing device generates a task as a reminder setting, which is then transmitted to the device, and the user receives voice feedback confirming the task's success. A synthesized speech engine (e.g., a speech synthesis API) is used for the voice feedback.

[0110] For example, a user might voice-input, "Can you set a reminder to take my vitamin supplements at 8 PM?" Based on this request, the device will set a reminder at the specified time and notify the user in synthesized speech, "It's time to take your vitamin supplements at 8 PM." An example of a prompt might be, "Please add an event: I'm going for a walk tomorrow at 10 AM."

[0111] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0112] Step 1:

[0113] The device uses a high-sensitivity microphone to capture voice input from the user. The voice signal is input and converted into digital data. This converted digital data becomes the output and is passed on to subsequent processing.

[0114] Step 2:

[0115] The terminal sends the converted digital data to the server via the internet. The input is digital audio data, which the server receives and uses as input for the next processing step.

[0116] Step 3:

[0117] The server uses a speech recognition program to convert the digital audio data into text data. The audio content is output as text, which becomes the input for analysis by the natural language processing module.

[0118] Step 4:

[0119] The server analyzes text data using a natural language processing module to understand the user's intent. The input is converted text data, which is then analyzed to identify the task the user wishes to perform. The analysis results are then used as output for specific processing and responses.

[0120] Step 5:

[0121] The server generates an appropriate task (e.g., setting a reminder) based on the analysis results. The input is the result of analysis representing the user's intent, and the server generates and outputs a task based on appropriate program logic. This task information is then sent to the terminal.

[0122] Step 6:

[0123] The terminal receives task information from the server and manages the schedule based on the configured tasks. It performs data calculations internally and sets reminders according to the execution timing.

[0124] Step 7:

[0125] At the time requested by the user, the device uses a synthesized speech engine to provide voice feedback. The input is feedback information from the server, and the synthesized speech outputs the feedback content as voice, notifying the user.
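Steps 6 and 7 above, in which the terminal stores reminders and announces them at the requested time, can be sketched in Python with a priority queue. The `ReminderQueue` class is hypothetical. It returns the due messages as strings and leaves speech synthesis to whatever engine the terminal uses.

```python
import heapq
from datetime import datetime
from typing import List, Tuple


class ReminderQueue:
    """Keeps reminders ordered by due time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[datetime, int, str]] = []
        self._counter = 0   # tie-breaker so equal times keep insertion order

    def add(self, due: datetime, message: str) -> None:
        heapq.heappush(self._heap, (due, self._counter, message))
        self._counter += 1

    def pop_due(self, now: datetime) -> List[str]:
        """Remove and return every message whose due time has arrived."""
        due_messages = []
        while self._heap and self._heap[0][0] <= now:
            due_messages.append(heapq.heappop(self._heap)[2])
        return due_messages
```

The terminal could call `pop_due` periodically with the current time and pass each returned message to its synthesized speech engine.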

[0126] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0127] This invention is a system that provides customized services based on voice input for the elderly, and by recognizing the user's emotions, it enables more appropriate and user-friendly responses. This system mainly consists of terminals, servers, and a communication network connecting them.

[0128] First, the terminal uses a high-sensitivity microphone to acquire the user's voice input. The acquired voice is converted into digital data in real time and sent to the server. This transmission takes place over the internet.

[0129] The server first converts the received audio data into text data using speech recognition software, then passes the text data to a natural language processing engine to analyze the user's intent. In addition, an emotion engine recognizes the user's emotions from their tone of voice and speaking style. This emotion information is reflected in response generation, creating an output appropriate to the user's state.

[0130] For example, if a user says by voice input, "I feel a little tired, I'd like some relaxing music," the server uses its emotion engine to recognize not only the intent but also the emotion, such as "tired." As a result, it selects relaxing music and instructs the terminal to play it.

[0131] The terminal operates based on the instructions received from the server and plays the music through its speaker. This allows users to receive services that are sensitive to their emotions, improving their quality of life.

[0132] Thus, the present invention aims to provide a voice interface that reflects the user's emotions, enabling elderly people to enjoy digital services with peace of mind in an environment that suits them better.

[0133] The following describes the processing flow.

[0134] Step 1:

[0135] The terminal receives the voice spoken by the user using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0136] Step 2:

[0137] The terminal transmits the converted digital audio data to the server via the internet.

[0138] Step 3:

[0139] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0140] Step 4:

[0141] The server inputs the converted text data into a natural language processing engine to analyze the user's intent.

[0142] Step 5:

[0143] The server inputs the analyzed text data into an emotion engine, which then recognizes the user's emotions based on their tone of voice and speaking style.

[0144] Step 6:

[0145] The server generates an appropriate response based on the user's intent and perceived emotions, and sends it back to the terminal as audio data.

[0146] Step 7:

[0147] The device plays audio data received from the server through its speaker, providing the user with emotionally sensitive audio feedback.

[0148] Step 8:

[0149] The user can receive voice feedback from the device and issue additional voice commands as needed.
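The server-side portion of this flow (Steps 3 to 6) can be sketched as a chain of four stages. This is a minimal illustration, not the implementation described in this disclosure: the `Analysis` type, `handle_utterance`, and the `fake_*` stand-ins are hypothetical names, and the stand-ins replace the real speech recognition, natural language processing, emotion, and response engines with trivial string checks so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Analysis:
    text: str     # Step 3: output of speech recognition
    intent: str   # Step 4: result of intent analysis
    emotion: str  # Step 5: result of emotion recognition


def handle_utterance(
    audio: bytes,
    recognize: Callable[[bytes], str],
    analyze_intent: Callable[[str], str],
    recognize_emotion: Callable[[bytes, str], str],
    respond: Callable[[Analysis], bytes],
) -> bytes:
    """Run Steps 3 to 6 on the server: audio data in, audio response out."""
    text = recognize(audio)
    intent = analyze_intent(text)
    emotion = recognize_emotion(audio, text)
    return respond(Analysis(text, intent, emotion))


# Hypothetical stand-ins so the sketch runs without real engines.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the transcript


def fake_intent(text: str) -> str:
    return "play_music" if "music" in text else "unknown"


def fake_emotion(audio: bytes, text: str) -> str:
    return "tired" if "tired" in text else "neutral"


def fake_respond(analysis: Analysis) -> bytes:
    return f"{analysis.intent}:{analysis.emotion}".encode("utf-8")


reply = handle_utterance(
    "I feel a little tired, I'd like some relaxing music".encode("utf-8"),
    fake_recognize,
    fake_intent,
    fake_emotion,
    fake_respond,
)
```

Passing the stages in as callables keeps the flow itself independent of any particular engine, so each stage could be swapped without changing the order of the steps.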

[0150] (Example 2)

[0151] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0152] When providing personalized voice-input services to a diverse range of users, including the elderly, one challenge is that responses do not adequately take the user's emotions into account. Responses that incorporate the user's emotions are needed to achieve more user-friendly and effective interactions.

[0153] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0154] In this invention, the server includes means for converting voice data into text information, means for analyzing the text information to determine the user's intent, and means for analyzing voice tone and speaking style to recognize the user's emotions. This makes it possible to generate personalized responses based on the user's intent and emotions.

[0155] "Audio signals" are data obtained by converting human voices into electrical signals.

[0156] "Digital data" refers to data that has been digitized from an analog signal and converted into a format that can be processed by a computer.

[0157] An "information terminal" is a device that accepts input such as voice and has the function of communicating digital data via the internet.

[0158] A "user" is an individual who uses this system to perform voice input.

[0159] A "processing device" is a computing device that analyzes and processes received digital data and generates appropriate output.

[0160] "Speech recognition software" is a program that converts speech signals into text information.

[0161] A "natural language processing engine" is a mechanism that implements algorithms for analyzing the user's intent from textual information.

[0162] An "emotion recognition engine" is a technology that analyzes voice tone and word choice to evaluate the user's emotions.

[0163] The "Record Creation" and "Notification Settings" functions automatically create and manage memos and reminders based on the user's voice commands.

[0164] To implement this invention, an information terminal, a processing unit, and a communication network connecting them are primarily used. The terminal uses a high-sensitivity microphone to capture the user's voice signal. This voice signal is converted into digital data within the terminal and transmitted to a server via the internet.

[0165] The server converts speech data into text using speech recognition software. This process utilizes common speech recognition algorithms. The server then leverages a natural language processing engine to analyze the user's intent from the converted text. This engine interprets user commands and inquiries using techniques such as keyword extraction and syntactic analysis.

[0166] Furthermore, the server uses an emotion recognition engine to analyze the user's voice tone and speaking style, and evaluates the user's emotions. Emotion recognition includes analyzing speech waveforms and analyzing text using an emotion dictionary. Based on this evaluation, the server generates the most appropriate response and sends it to the information terminal as audio data or other information format.
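One way to picture the combination of emotion-dictionary text analysis and speech-waveform analysis is the sketch below. Everything in it is an assumption made for illustration: the dictionary entries and weights, the use of root-mean-square amplitude as the only waveform feature, and the rule that quiet speech counts as weak evidence of tiredness. The disclosure does not specify any of these.

```python
import math

# Hypothetical emotion dictionary: word -> (emotion label, weight).
EMOTION_DICTIONARY = {
    "tired": ("tired", 1.0),
    "exhausted": ("tired", 1.5),
    "happy": ("happy", 1.0),
    "down": ("sad", 1.0),
}


def text_emotion_scores(text):
    """Sum dictionary weights per emotion label over the words in the text."""
    scores = {}
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")
        if word in EMOTION_DICTIONARY:
            label, weight = EMOTION_DICTIONARY[word]
            scores[label] = scores.get(label, 0.0) + weight
    return scores


def rms_energy(samples):
    """Root-mean-square amplitude of a waveform with samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def evaluate_emotion(text, samples, quiet_threshold=0.1):
    """Combine text scores with one waveform feature and pick the top label."""
    scores = text_emotion_scores(text)
    # Assumption for illustration: quiet speech is weak evidence of tiredness.
    if samples and rms_energy(samples) < quiet_threshold:
        scores["tired"] = scores.get("tired", 0.0) + 0.5
    if not scores:
        return "neutral"
    return max(scores, key=scores.get)
```

For example, `evaluate_emotion("I feel a little tired today.", [0.05, -0.04, 0.03])` combines one dictionary hit with a low-energy waveform and returns `"tired"`.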

[0167] The terminal follows instructions from the server and uses its voice output device to play the voice response requested by the user. It also has functions for creating records and setting notifications based on voice commands, supporting the user's daily activities.

[0168] For example, if a user gives a voice command such as, "I feel like relaxing today, so play some relaxing music," the server analyzes the voice to understand the user's intention and emotion, and selects music related to "relaxation." Based on the instruction, the device plays relaxing music through its speakers.

[0169] Examples of prompts include, "I want to relax, so please play some gentle music," or "I'm in a good mood today, so I'd like to listen to some upbeat music." By using such prompts, the system can provide services that reflect the user's emotions and intentions, enabling elderly users and other users to receive information and responses that are tailored to them.

[0170] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0171] Step 1:

[0172] The terminal captures the user's voice signal using a high-sensitivity microphone. The input voice signal is converted into digital data by the voice processing unit installed in the terminal. The converted digital data is transmitted to the server via the internet.

[0173] Step 2:

[0174] The server inputs the received digital audio data into speech recognition software. This software uses a speech recognition algorithm to convert the audio data into text information. The output text information is then used for subsequent analysis.

[0175] Step 3:

[0176] The server passes the text information to the natural language processing engine. The engine analyzes the input text information and processes the data to understand the user's intent. For example, grammatical analysis and keyword extraction are performed, and the user's specific commands or questions are output.

[0177] Step 4:

[0178] The server transmits intent information obtained from the natural language processing engine to the emotion recognition engine. The emotion recognition engine analyzes the content of the input text information and the paralinguistic elements of the speech (e.g., intonation, stress) to recognize the user's emotions. This analysis outputs an evaluation of the user's emotions, such as whether they are relaxed or stressed.

[0179] Step 5:

[0180] The server integrates the intent and emotion information and inputs it into the response generation engine to generate the optimal response. This data processing uses a generative AI model to produce the most appropriate response for the user's situation. For example, it might output instructions for playlist creation or information provision.

[0181] Step 6:

[0182] The terminal executes specific services based on response instructions received from the server. For example, it might launch a music playback application and play music through the speaker according to the outputted instructions. As a result, the user can receive services that match their intentions and emotions.
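The integration of intent and emotion information in Step 5 can be illustrated as assembling a single prompt for a generative model. This is a sketch under stated assumptions: the prompt wording, the function names `build_prompt` and `generate_response`, and the `stand_in_model` callable are all hypothetical, and the stand-in simply inspects the prompt instead of calling a real generative AI model.

```python
from typing import Callable


def build_prompt(utterance: str, intent: str, emotion: str) -> str:
    """Combine the utterance, intent, and emotion into one model instruction."""
    return (
        "You are a voice assistant for an elderly user.\n"
        f"User utterance: {utterance}\n"
        f"Detected intent: {intent}\n"
        f"Detected emotion: {emotion}\n"
        "Reply with one instruction suited to the detected emotion."
    )


def generate_response(
    utterance: str, intent: str, emotion: str, model: Callable[[str], str]
) -> str:
    """Step 5: pass the integrated prompt to a response-generation model."""
    return model(build_prompt(utterance, intent, emotion))


# Hypothetical stand-in for a generative model; it only inspects the prompt.
def stand_in_model(prompt: str) -> str:
    if "Detected emotion: relaxed" in prompt:
        return "play_playlist:gentle"
    return "play_playlist:default"


instruction = generate_response(
    "I want to relax, so please play some gentle music",
    "play_music",
    "relaxed",
    stand_in_model,
)
```

The returned string stands for the response instruction that the terminal would act on in Step 6, for example by launching a music playback application.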

[0183] (Application Example 2)

[0184] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0185] When elderly people use voice-based digital services in their daily lives, accurate understanding of their intentions and responses that take their emotions into consideration are required. However, existing voice response systems are insufficient in providing appropriate services that take emotions into account, which can cause inconvenience for users. This problem needs to be solved.

[0186] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0187] In this invention, the server includes means for converting speech data into text data using speech recognition software, means for analyzing the user's intent from the text data using a natural language processing engine, and means for determining the user's emotions from their tone of voice and manner of speaking using an emotion recognition engine. This enables an appropriate response based on the user's intent and emotions.

[0188] An "audio signal" is waveform information represented by sound, including human speech, that is transmitted through a medium such as air.

[0189] "Digital data" refers to data that represents analog information, such as audio signals, in binary format.

[0190] A "communication device" is a device that has the function of converting voice signals into digital data and transmitting and receiving data.

[0191] A "user" is an individual who operates the system and receives services through voice input.

[0192] An "information processing device" is a computer system that processes audio signals and text data to analyze the user's intentions and emotions.

[0193] "Speech recognition software" is a program or application that converts speech signals into text data.

[0194] A "natural language processing engine" is a computer program used to analyze human language and understand its meaning and intent.

[0195] An "emotion recognition engine" is a software system that determines a user's emotions from their voice tone and speaking style.

[0196] "Audio data" refers to data in which audio signals are recorded in digital format.
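The relationship between an audio signal and digital data in these definitions can be illustrated by quantizing sampled amplitudes into 16-bit PCM, a common binary representation of audio. The 16 kHz sampling rate, the 16-bit depth, the 440 Hz test tone, and the function names are assumptions chosen for the example; the disclosure does not name a specific encoding.

```python
import math
import struct


def quantize_pcm16(samples):
    """Map samples in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip out-of-range amplitudes
        ints.append(int(round(s * 32767)))  # scale to the 16-bit range
    return struct.pack(f"<{len(ints)}h", *ints)


def sine_wave(freq_hz, sample_rate, n):
    """Stand-in for a sampled analog signal: n samples of a sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# 10 ms of a 440 Hz tone at an assumed 16 kHz sampling rate: 160 samples.
pcm = quantize_pcm16(sine_wave(440.0, 16000, 160))
```

Each sample occupies two bytes, so the 160 samples above yield 320 bytes of digital data ready for transmission.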

[0197] The system for implementing this invention mainly consists of a communication device, an information processing device, and a communication network connecting them. A portable terminal equipped with a high-sensitivity microphone is used as the communication device. This terminal has the function of acquiring voice input from the user and converting the voice signal into digital data. This converted digital data is transmitted to the information processing device via the internet.

[0198] The information processing device is equipped with speech recognition software, a natural language processing engine, and an emotion recognition engine. The speech recognition software converts speech data into text data. Subsequently, the natural language processing engine analyzes the text data and interprets the user's intent. Furthermore, the emotion recognition engine recognizes the user's emotions from their tone of voice and manner of speaking. This combination makes it possible to generate an appropriate response based on the user's intent and emotions. This generated response is then transmitted back to the communication device as audio data and presented to the user verbally.

[0199] As a concrete example, consider a scenario where a user requests via voice, "I'm feeling down today, so I'd like some advice to cheer me up." The information processing device recognizes the emotion of "feeling down" and generates appropriate advice for that emotion. Through the communication device, it can provide the user with encouraging words in a calm tone.

[0200] An example of a prompt message is, "How are you feeling today? Let me know if you need any music or advice." This is used as suitable input for a generative AI model. In this way, the present invention provides a user-friendly voice response system that is sensitive to the user's emotions.

[0201] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0202] Step 1:

[0203] The user provides voice input to the communication device. This voice signal is captured using a high-sensitivity microphone. The voice signal is converted into digital data through digital signal processing, and this digital data is passed to voice recognition software.

[0204] Step 2:

[0205] The terminal transmits the converted digital data to the information processing device. The transmitted digital data becomes input, and the speech recognition software on the information processing device starts operating. The speech recognition software converts the speech data into text data and passes this text data to the next processing step.

[0206] Step 3:

[0207] The server analyzes the text data via a natural language processing engine. Based on the input text data, it extracts the user's intent. For example, if the user says "I'm feeling down," the appropriate intent is analyzed based on that. This analysis result is then passed on to the next step.

[0208] Step 4:

[0209] The server uses an emotion recognition engine to determine the user's emotions from their voice tone and manner of speaking. In this step, the analyzed text data is used as input, and data calculations are performed to determine the emotion of "feeling down."

[0210] Step 5:

[0211] The server generates appropriate responses based on the user's intentions and emotions. Using a generative AI model, it generates output data in the form of speech based on the analysis results and emotional information. For example, it can generate encouraging words or advice.

[0212] Step 6:

[0213] The server sends the generated audio data to the terminal. The terminal plays this audio data, allowing the user to receive an audio response. This enables the user to experience a response that resonates with their emotions.

[0214] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0215] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0216] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0217] [Second Embodiment]

[0218] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0219] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0220] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0221] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0222] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0223] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0224] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0225] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0226] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0227] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0228] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0229] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0230] The system according to the present invention provides personalized services for the elderly based on voice input. This system mainly consists of a terminal, a server, and data communication between them.

[0231] First, the device uses a high-sensitivity microphone to acquire the user's voice input. This voice is converted into digital data in real time. The converted voice data is then transmitted to a server via the internet.

[0232] The server is equipped with speech recognition software, which converts received speech data into text data. This text data is then analyzed by a natural language processing engine to identify the user's intent. The analysis determines what the user wants and what services should be provided.

[0233] For example, if a user uses voice input to say, "Set a reminder to take my medicine tomorrow at 3 PM," the server parses this request and generates data corresponding to the reminder setting. This information is then sent back to the terminal as needed, and the terminal's scheduling service sets the reminder and manages it across systems.

[0234] Users can confirm the results of their requests through voice feedback from their device. The device plays back the voice data received from the server as synthesized speech, providing feedback to the user, such as, "I have set a reminder for tomorrow at 3 PM."
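As a rough illustration of how a reminder request might be turned into structured data and a spoken confirmation, the sketch below parses the example utterance with a regular expression. The pattern, the `Reminder` type, and both function names are hypothetical. The disclosure describes a natural language processing engine for this step, so the regular expression is a deliberately simple stand-in that handles only phrases of the form "Set a reminder to ... (today|tomorrow) at H(:MM) AM/PM".

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical pattern covering only the phrasing used in the example.
REMINDER_PATTERN = re.compile(
    r"set a reminder to (?P<task>.+?)(?:\s+(?P<day>today|tomorrow))?\s+at\s+"
    r"(?P<hour>\d{1,2})(?::(?P<minute>[0-5]\d))?\s*(?P<ampm>am|pm)",
    re.IGNORECASE,
)


@dataclass
class Reminder:
    task: str
    due: datetime


def parse_reminder(utterance: str, now: datetime) -> Optional[Reminder]:
    """Extract the task and due time, or return None if nothing matches."""
    m = REMINDER_PATTERN.search(utterance)
    if m is None:
        return None
    hour = int(m["hour"]) % 12          # 12 AM -> 0, 12 PM -> 12 after offset
    if m["ampm"].lower() == "pm":
        hour += 12
    minute = int(m["minute"] or 0)
    day_offset = 1 if (m["day"] or "").lower() == "tomorrow" else 0
    due = (now + timedelta(days=day_offset)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    return Reminder(task=m["task"], due=due)


def confirmation(reminder: Reminder) -> str:
    """Text that the terminal could play back as synthesized speech."""
    return f"I have set a reminder to {reminder.task} at {reminder.due:%Y-%m-%d %H:%M}."
```

The `now` argument is passed in explicitly so that "tomorrow" is resolved relative to a known reference time, which also makes the behavior easy to check.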

[0235] In this way, this system uses voice to determine the user's intentions and provides services tailored to individual needs in a timely manner. It enables elderly people to manage their daily lives more comfortably through a user-friendly interface.

[0236] The following describes the processing flow.

[0237] Step 1:

[0238] The device receives the user's voice input using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0239] Step 2:

[0240] The device transmits this digital audio data to the server via the internet.

[0241] Step 3:

[0242] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0243] Step 4:

[0244] The server inputs text data into a natural language processing engine and analyzes the user's intent.

[0245] Step 5:

[0246] The server determines the user's request based on the analysis results and selects the appropriate service or task.

[0247] Step 6:

[0248] The server creates a response based on the selected service, converts it into audio data, and then sends it to the terminal.

[0249] Step 7:

[0250] The terminal plays the audio data received from the server through its speaker and notifies the user of the response result via voice.

[0251] Step 8:

[0252] The user acts according to voice feedback from the device and issues additional voice commands as needed.
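Steps 4 and 5 of this flow, analyzing the intent and then selecting a service, can be sketched as a keyword lookup followed by a dispatch table. The keyword lists, handler functions, and reply strings are assumptions made for illustration, and a real natural language processing engine would replace the substring matching shown here.

```python
from typing import Callable, Dict, Tuple


def set_reminder(text: str) -> str:
    return "I have set a reminder."


def create_memo(text: str) -> str:
    return "I have saved your memo."


# Hypothetical intent labels mapped to trigger keywords and to handlers.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "reminder": ("remind",),
    "memo": ("memo", "note"),
}
SERVICES: Dict[str, Callable[[str], str]] = {
    "reminder": set_reminder,
    "memo": create_memo,
}


def analyze_intent(text: str) -> str:
    """Step 4: return the first intent whose keyword appears in the text."""
    lowered = text.lower()
    for intent, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return intent
    return "unknown"


def select_and_run(text: str) -> str:
    """Step 5: pick the service for the intent and return its reply text."""
    handler = SERVICES.get(analyze_intent(text))
    if handler is None:
        return "Sorry, I did not understand. Could you say that again?"
    return handler(text)
```

The reply text returned by `select_and_run` corresponds to the response that Step 6 converts into audio data before sending it to the terminal.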

[0253] (Example 1)

[0254] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0255] For the elderly and those unfamiliar with technology, efficiently managing everyday tasks is often challenging. In particular, few conventional systems can process voice commands smoothly, which highlights the need for improved usability. It is essential to provide intuitively usable systems that leverage voice recognition and natural language processing technologies.

[0256] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0257] In this invention, the server includes means for converting speech into text data using a speech recognition application, means for analyzing the user's intent using a natural language processing engine, and means for providing a time management function based on voice commands. This makes it possible to appropriately analyze the user's intent simultaneously with voice input and manage daily tasks efficiently and intuitively.

[0258] An "audio signal" is an electronic representation of sound waves traveling through the air.

[0259] "Digital information" refers to a form of information that is recorded or processed by converting analog data into a binary format.

[0260] An "information processing device" is an electronic device, such as a computer, that has the function of receiving information, processing it, and generating output.

[0261] "User" refers to a person who uses this system to input voice commands and receive services.

[0262] A "speech recognition application" is a software program that analyzes speech signals and converts them into text data.

[0263] A "natural language processing engine" is a system that uses technology to analyze text data and understand the meaning of language and the intent of the user.

[0264] "Audio data" refers to data obtained by digitizing audio signals, in a format that can be processed by computers and other electronic devices.

[0265] "Character data" refers to data represented using characters such as letters of the alphabet and numbers.

[0266] "Time management features" refer to functions such as calendars and reminders that allow you to set and manage tasks based on specific times.

[0267] The system for implementing the present invention utilizes technology to provide a variety of services based on voice input. Specifically, it consists mainly of a terminal, a server, and a data communication infrastructure connecting them.

[0268] The terminal uses a highly sensitive voice input device to acquire voice signals from the user. These voice signals are converted from analog to digital information and transmitted to the server in digital format. To perform this operation, the terminal is equipped with general-purpose voice recording hardware and software for processing the voice data.

[0269] The server converts the received digital information into text data using a speech recognition application. High-precision speech recognition software is used for this process, such as a common speech recognition API. The generated text data is then analyzed using a natural language processing engine to identify the user's intent. Cloud-based services may be used for the natural language processing engine.

[0270] For example, if a user says, "Set a reminder to take my medicine tomorrow at 3pm," the system recognizes the instruction and sets an alarm for the appropriate time. The server processes the specified task, generates feedback, and sends it back to the device. The device then uses speech synthesis technology to provide the user with voice feedback such as, "I have set a reminder for tomorrow at 3pm."

[0271] To make this system function even more efficiently, prompts powered by generative AI models can be used. For example, a prompt such as, "When the user says, 'Set a reminder to take my medicine tomorrow at 3 pm,' show the results of speech recognition and natural language processing," can be used to enable quick and accurate responses. This allows users, such as the elderly, to manage various daily tasks with simple voice commands.

[0272] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0273] Step 1:

[0274] The terminal acquires the user's voice using a highly sensitive voice input device. The input is an analog audio signal. This audio signal is converted into digital information through a high-precision digital conversion process. This conversion allows the terminal to output a digital audio file. The audio data converted to digital format is then ready to be sent to subsequent processing steps.

[0275] Step 2:

[0276] The terminal transmits digital information to the server using a secure protocol. The input is digital information generated by the terminal. For security reasons, the terminal encrypts the data and transmits it to the server over the internet. During this process, the terminal performs a transmission confirmation to verify that the data was successfully received by the server. The output is the digital information transmitted to the server.

[0277] Step 3:

[0278] The server converts the received digital information into character data using a speech recognition application. The input is the digital information transmitted from the terminal. The speech recognition software analyzes the speech pattern and converts it into text format. Using a specific algorithm, the server generates character data with high accuracy. The output is character data.

[0279] Step 4:

[0280] The server inputs the generated character data into a natural language processing engine to analyze the user's intention. The input is the character data generated by the speech recognition application. The server uses natural language processing technology to evaluate the keywords and context in the text and determine the user's intention. The output is analysis data regarding the user's intention.

[0281] Step 5:

[0282] Based on the user's intention, the server determines the necessary actions and generates appropriate feedback data. The input is the analyzed user intention data. The server assembles the necessary information for, for example, reminder settings based on the analysis data. The generated feedback data is prepared to be sent back to the terminal. The output is feedback data.

[0283] Step 6:

[0284] The terminal receives the feedback data transmitted from the server and provides feedback to the user using speech synthesis technology. The input is the feedback data transmitted from the server. The terminal uses speech synthesis software to convert text into speech and play it back to the user through a speaker. With this voice feedback, the user can confirm that their instructions have been processed properly. The output is voice feedback to the user.
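The transmission confirmation mentioned in Step 2 can be illustrated with a digest-based acknowledgement: the terminal attaches a SHA-256 digest to the data, the server echoes the digest only if the payload arrived intact, and the terminal compares the two. This scheme is an assumption chosen for the sketch, as the disclosure does not say how confirmation is performed. Encryption is not shown here and is assumed to be handled by the secure protocol carrying the data. The `Packet` type and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    payload: bytes  # digital audio information from the terminal
    digest: str     # SHA-256 hex digest computed by the terminal


def make_packet(payload: bytes) -> Packet:
    """Terminal side: attach a digest so the server can verify integrity."""
    return Packet(payload, hashlib.sha256(payload).hexdigest())


def server_receive(packet: Packet) -> Optional[str]:
    """Server side: return the digest as an acknowledgement if it matches."""
    actual = hashlib.sha256(packet.payload).hexdigest()
    return actual if actual == packet.digest else None


def confirm_transmission(packet: Packet, ack: Optional[str]) -> bool:
    """Terminal side: the transfer is confirmed only by a matching ack."""
    return ack is not None and ack == packet.digest
```

If the payload is altered in transit, `server_receive` returns `None`, `confirm_transmission` returns `False`, and the terminal could then retransmit.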

[0285] (Application Example 1)

[0286] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0287] Users, including the elderly, need to manage schedules and set reminders in daily life simply and intuitively so that their burden is reduced, but it is difficult to execute such tasks efficiently through voice operations. In particular, convenient and intuitive operation is required for users who are not familiar with technology.

[0288] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0289] In this invention, the server includes a device that converts an audio signal into digital data, means for receiving audio input from a user, means for transmitting the digital data to an information processing device, means for the information processing device to convert the audio data into character data, and means for analyzing the character data to determine the user's intention. This makes it possible to accurately analyze the user's intention from voice and to easily execute daily management functions, such as schedule management and reminder setting, by voice operation.

[0290] An "audio signal" represents audio information as an electrical signal and is the basic data used when processing audio physically and electronically.

[0291] "Digital data" is an audio signal digitized into a form that can be processed by a computer or the like.

[0292] A "device" is a general term for machines and equipment with specific functions and roles.

[0293] A "user" refers to a person who operates or uses this system.

[0294] "Audio input" is an operation or process of capturing a user's voice into the system.

[0295] An "information processing device" is a computer system that receives digital data and performs various calculations and processes.

[0296] A "speech recognition program" is software that analyzes speech signals and converts their content into text data.

[0297] A "natural language processing module" is a software component that analyzes text data to infer the user's intent and derive appropriate responses or actions.

[0298] "Schedule management" is the process of organizing, recording, and notifying users of specific appointments and tasks according to the dates and times they specify.

[0299] "Setting a reminder" is the process of instructing the system to send a notification at a time specified by the user.

[0300] "Voice response" refers to the process by which a system communicates its processing results to the user via voice.

[0301] This invention provides a system for assisting the elderly using a voice interface. The system includes a device for user voice input, an information processing device for processing digital data, and a device for providing voice responses.

[0302] The device uses a high-sensitivity microphone to capture the user's voice and converts it into digital data. The converted digital data is transferred to an information processing device via the internet. The information processing device uses a speech recognition program (e.g., a speech recognition API) to convert the voice into text data and then uses a natural language processing module (e.g., a natural language API) to analyze the user's intent.

[0303] As a result of the analysis, the necessary actions are identified. For example, when a user inputs a voice command such as "Set a reminder to take medicine at 3 PM", the information processing device generates a reminder-setting task and transfers it to the device, which then reports the success of the task to the user by voice. A synthetic voice engine (e.g., text-to-speech API) is used for the voice feedback.

[0304] As a specific example, there may be a case where a user inputs voice such as "Can you set a reminder to take vitamin supplements at 8 PM?". Based on this request, the device sets a reminder at the specified time and notifies the user in synthetic voice that "It's time to take vitamin supplements at 8 PM". Examples of prompt sentences include "Please add a schedule: Go for a walk at 10 AM tomorrow".
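As a non-limiting illustration of how recognized text might be turned into a reminder task, the following minimal sketch uses a regular expression. The pattern, the `ReminderTask` type, and the `parse_reminder` function are hypothetical and are not taken from the disclosure, which leaves the analysis to a natural language processing module.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReminderTask:
    action: str  # what the user wants to be reminded of
    hour: int    # 24-hour clock
    minute: int


# Hypothetical pattern: "... reminder to <action> at <H>[:MM] AM|PM"
_PATTERN = re.compile(
    r"reminder to (?P<action>.+?) at (?P<hour>\d{1,2})"
    r"(?::(?P<minute>\d{2}))?\s*(?P<ampm>AM|PM)",
    re.IGNORECASE,
)


def parse_reminder(text: str) -> Optional[ReminderTask]:
    """Return a ReminderTask if the recognized text looks like a reminder request."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    hour = int(match.group("hour")) % 12  # maps 12 to 0 before the PM offset
    if match.group("ampm").upper() == "PM":
        hour += 12
    minute = int(match.group("minute") or 0)
    return ReminderTask(match.group("action").strip(), hour, minute)
```

A production system would likely rely on a trained intent model rather than a fixed pattern; the sketch only shows the shape of the data passed from the analysis to task generation.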

[0305] The flow of the specific process in Application Example 1 will be described using Figure 12.

[0306] Step 1:

[0307] The terminal uses a high-sensitivity microphone to capture voice input from the user. The voice signal is input and converted into digital data. This converted digital data becomes the output and is passed to subsequent processing.

[0308] Step 2:

[0309] The terminal transmits the converted digital data to the server via the Internet. The input is digital voice data, and when the server receives this, it becomes the input for the next processing step.

[0310] Step 3:

[0311] The server uses a speech recognition program to convert the digital voice data into character data. The content of the voice is output as character data, which becomes the input for analysis by the natural language processing module.

[0312] Step 4:

[0313] The server analyzes text data using a natural language processing module to understand the user's intent. The input is converted text data, which is then analyzed to identify the task the user wishes to perform. The analysis results are then used as output for specific processing and responses.

[0314] Step 5:

[0315] The server generates an appropriate task (e.g., setting a reminder) based on the analysis results. The input is the result of analysis representing the user's intent, and the server generates and outputs a task based on appropriate program logic. This task information is then sent to the terminal.

[0316] Step 6:

[0317] The terminal receives task information from the server and manages the schedule based on the configured tasks. It performs data calculations internally and sets reminders according to the execution timing.

[0318] Step 7:

[0319] At the time requested by the user, the device uses a synthesized speech engine to provide voice feedback. The input is feedback information from the server, and the synthesized speech outputs the feedback content as voice, notifying the user.
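The terminal-side schedule management of Steps 6 and 7 can be sketched, under stated assumptions, as a priority queue ordered by due time. The `ReminderQueue` class and its methods are hypothetical names introduced for illustration; the disclosure does not specify the data structure used by the terminal.

```python
import heapq
from datetime import datetime
from typing import List, Tuple


class ReminderQueue:
    """Keeps reminders ordered by due time (earliest first)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[datetime, str]] = []

    def add(self, due: datetime, message: str) -> None:
        """Register a reminder received from the server."""
        heapq.heappush(self._heap, (due, message))

    def pop_due(self, now: datetime) -> List[str]:
        """Remove and return the messages whose due time has been reached."""
        due_messages = []
        while self._heap and self._heap[0][0] <= now:
            _, message = heapq.heappop(self._heap)
            due_messages.append(message)
        return due_messages
```

In such a sketch the terminal would call `pop_due` periodically and hand each returned message to the synthesized speech engine.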

[0320] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0321] This invention is a system that provides customized services based on voice input for the elderly, and by recognizing the user's emotions, it enables more appropriate and user-friendly responses. This system mainly consists of terminals, servers, and a communication network connecting them.

[0322] First, the terminal uses a high-sensitivity microphone to acquire the user's voice input. The acquired voice is converted into digital data in real time and sent to the server. This transmission takes place over the internet.

[0323] The server first converts the received audio data into text data using speech recognition software, then passes the text data to a natural language processing engine to analyze the user's intent. In addition, an emotion engine recognizes the user's emotions from their tone of voice and speaking style. This emotion information is reflected in response generation, creating an output appropriate to the user's state.

[0324] For example, if a user uses voice input to say, "I feel a little tired, I'd like some relaxing music," the server uses its emotion engine to recognize not only the intention but also the emotion, such as "tired." As a result, it selects relaxing music and sends the device an instruction to play it.

[0325] The device operates based on instructions received from the server and plays music through its speaker. This allows users to receive services that are sensitive to their emotions, improving their quality of life.

[0326] Thus, the present invention aims to provide a voice interface that reflects the user's emotions, enabling elderly people to enjoy digital services with peace of mind in an environment that suits them better.

[0327] The following describes the processing flow.

[0328] Step 1:

[0329] The device receives the voice spoken by the user using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0330] Step 2:

[0331] The terminal transmits the converted digital audio data to the server via the internet.

[0332] Step 3:

[0333] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0334] Step 4:

[0335] The server inputs the converted text data into a natural language processing engine to analyze the user's intent.

[0336] Step 5:

[0337] The server inputs the analyzed text data into an emotion engine, which then recognizes the user's emotions based on their tone of voice and speaking style.

[0338] Step 6:

[0339] The server generates an appropriate response based on the user's intent and perceived emotions, and sends it back to the terminal as audio data.

[0340] Step 7:

[0341] The device plays audio data received from the server through its speaker, providing the user with emotionally sensitive audio feedback.

[0342] Step 8:

[0343] The user can receive voice feedback from the device and issue additional voice commands as needed.
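The server-side portion of the above flow (Steps 3 to 6) can be sketched as a chain of interchangeable stages. This is a minimal illustration only: the `VoicePipeline` class and its field names are hypothetical, and each stage stands in for whatever speech recognition, natural language processing, emotion, and speech synthesis components are actually used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    recognize_speech: Callable[[bytes], str]        # Step 3: audio -> text
    analyze_intent: Callable[[str], str]            # Step 4: text -> intent
    recognize_emotion: Callable[[bytes, str], str]  # Step 5: audio and text -> emotion
    generate_response: Callable[[str, str], str]    # Step 6: intent and emotion -> reply text
    synthesize: Callable[[str], bytes]              # Step 6: reply text -> audio

    def handle(self, audio: bytes) -> bytes:
        """Run one request through all stages and return the audio reply."""
        text = self.recognize_speech(audio)
        intent = self.analyze_intent(text)
        emotion = self.recognize_emotion(audio, text)
        reply = self.generate_response(intent, emotion)
        return self.synthesize(reply)
```

Passing both the audio and the text to the emotion stage is a design choice of this sketch, made so that tone of voice and wording can both be considered.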

[0344] (Example 2)

[0345] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0346] When providing personalized voice-input services to a diverse range of users, including the elderly, a challenge is that responses are often provided without adequately considering the user's emotions. There is a need to generate responses that incorporate the user's emotions to achieve more user-friendly and effective interactions.

[0347] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0348] In this invention, the server includes means for converting voice data into text information, means for analyzing the text information to determine the user's intent, and means for analyzing voice tone and speaking style to recognize the user's emotions. This makes it possible to generate personalized responses based on the user's intent and emotions.

[0349] "Audio signals" are data obtained by converting human voices into electrical signals.

[0350] "Digital data" refers to data that has been digitized from an analog signal and converted into a format that can be processed by a computer.

[0351] An "information terminal" is a device that accepts input such as voice and has the function of communicating digital data via the internet.

[0352] A "user" is an individual who uses this system to perform voice input.

[0353] A "processing device" is a computing device that analyzes and processes received digital data and generates appropriate output.

[0354] "Speech recognition software" is a program that converts speech signals into text information.

[0355] A "natural language processing engine" is a mechanism that implements algorithms for analyzing the user's intent from textual information.

[0356] An "emotion recognition engine" is a technology that analyzes voice tone and word choice to evaluate the user's emotions.

[0357] The "Record Creation" and "Notification Settings" functions automatically create and manage memos and reminders based on the user's voice commands.

[0358] To implement this invention, an information terminal, a processing unit, and a communication network connecting them are primarily used. The terminal uses a high-sensitivity microphone to capture the user's voice signal. This voice signal is converted into digital data within the terminal and transmitted to a server via the internet.

[0359] The server converts speech data into text using speech recognition software. This process utilizes common speech recognition algorithms. The server then leverages a natural language processing engine to analyze the user's intent from the converted text. This engine interprets user commands and inquiries using techniques such as keyword extraction and syntactic analysis.

[0360] Furthermore, the server uses an emotion recognition engine to analyze the user's voice tone and speaking style, and evaluates the user's emotions. Emotion recognition includes analyzing speech waveforms and analyzing text using an emotion dictionary. Based on this evaluation, the server generates the most appropriate response and sends it to the information terminal as audio data or other information format.
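The text-based part of emotion recognition using an emotion dictionary can be illustrated with the following minimal sketch. The dictionary entries, labels, and function names are hypothetical examples; the disclosure does not specify the contents of the emotion dictionary, and a practical engine would combine this with speech waveform analysis.

```python
import re
from collections import Counter

# Hypothetical emotion dictionary: word -> emotion label.
EMOTION_DICTIONARY = {
    "tired": "fatigue",
    "exhausted": "fatigue",
    "down": "sadness",
    "sad": "sadness",
    "relax": "calm",
    "relaxing": "calm",
    "good": "joy",
    "happy": "joy",
    "upbeat": "joy",
}


def score_emotions(text: str) -> Counter:
    """Count dictionary hits per emotion label in the recognized text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(EMOTION_DICTIONARY[w] for w in words if w in EMOTION_DICTIONARY)


def dominant_emotion(text: str, default: str = "neutral") -> str:
    """Return the most frequent emotion label, or a default when none match."""
    scores = score_emotions(text)
    if not scores:
        return default
    return scores.most_common(1)[0][0]
```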

[0361] The terminal follows instructions from the server and uses its voice output device to play the voice response requested by the user. It also has functions for creating records and setting notifications based on voice commands, supporting the user's daily activities.

[0362] For example, if a user gives a voice command such as, "I feel like relaxing today, so play some relaxing music," the server analyzes the voice to understand the user's intention and emotion, and selects music related to "relaxation." Based on the instruction, the device plays relaxing music through its speakers.

[0363] Examples of prompts include, "I want to relax, so please play some gentle music," or "I'm in a good mood today, so I'd like to listen to some upbeat music." By using such prompts, the system can provide services that reflect the user's emotions and intentions, enabling elderly users and other users to receive information and responses that are tailored to them.

[0364] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0365] Step 1:

[0366] The terminal captures the user's voice signal using a high-sensitivity microphone. At this time, the voice signal is input. The input voice signal is converted into digital data by the voice processing unit installed in the terminal. The converted digital data is transmitted to the server via the internet.

[0367] Step 2:

[0368] The server inputs the received digital audio data into speech recognition software. This software uses a speech recognition algorithm to convert the audio data into text information. The output text information is then used for subsequent analysis.

[0369] Step 3:

[0370] The server passes the text information to the natural language processing engine. The engine analyzes the input text information and processes the data to understand the user's intent. For example, grammatical analysis and keyword extraction are performed, and the user's specific commands or questions are output.

[0371] Step 4:

[0372] The server transmits intent information obtained from the natural language processing engine to the emotion recognition engine. The emotion recognition engine analyzes the content of the input text information and the paralinguistic elements of the speech (e.g., intonation, stress) to recognize the user's emotions. This analysis outputs an evaluation of the user's emotions, such as whether they are relaxed or stressed.

[0373] Step 5:

[0374] The server integrates the intent and emotion information and inputs it into the response generation engine to generate the optimal response. This processing uses a generative AI model to produce the most appropriate response for the user's situation. For example, it might output instructions for playlist creation or information provision.

[0375] Step 6:

[0376] The terminal executes specific services based on response instructions received from the server. For example, it might launch a music playback application and play music through the speaker according to the outputted instructions. As a result, the user can receive services that match their intentions and emotions.
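The paralinguistic analysis mentioned in Step 4 is not detailed in the disclosure. As one possible illustration, the sketch below computes two commonly used low-level acoustic features, root-mean-square energy (a rough proxy for loudness or stress) and zero-crossing rate, from a sequence of audio samples. The choice of these features and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of the samples (0.0 for empty input)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)
```

Features of this kind would typically be fed to a trained classifier together with the text-based evaluation rather than interpreted directly.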

[0377] (Application Example 2)

[0378] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0379] When elderly people use voice-based digital services in their daily lives, accurate understanding of their intentions and responses that take their emotions into consideration are required. However, existing voice response systems are insufficient in providing appropriate services that take emotions into account, which can cause inconvenience for users. This problem needs to be solved.

[0380] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0381] In this invention, the server includes means for converting speech data into text data using speech recognition software, means for analyzing the user's intent from the text data using a natural language processing engine, and means for determining the user's emotions from their tone of voice and manner of speaking using an emotion recognition engine. This enables an appropriate response based on the user's intent and emotions.

[0382] An "audio signal" is waveform information represented by sound, including human speech, that is transmitted through a medium such as air.

[0383] "Digital data" refers to data that represents analog information, such as audio signals, in binary format.

[0384] A "communication device" is a device that has the function of converting voice signals into digital data and transmitting and receiving data.

[0385] A "user" is an individual who operates the system and receives services through voice input.

[0386] An "information processing device" is a computer system that processes audio signals and text data to analyze the user's intentions and emotions.

[0387] "Speech recognition software" is a program or application that converts speech signals into text data.

[0388] A "natural language processing engine" is a computer program used to analyze human language and understand its meaning and intent.

[0389] An "emotion recognition engine" is a software system that determines a user's emotions from their voice tone and speaking style.

[0390] "Audio data" refers to audio signals recorded in digital format.

[0391] The system for implementing this invention mainly consists of a communication device, an information processing device, and a communication network connecting them. A portable terminal equipped with a high-sensitivity microphone is used as the communication device. This terminal has the function of acquiring voice input from the user and converting the voice signal into digital data. This converted digital data is transmitted to the information processing device via the internet.

[0392] The information processing device is equipped with speech recognition software, a natural language processing engine, and an emotion recognition engine. The speech recognition software converts speech data into text data. Subsequently, the natural language processing engine analyzes the text data and interprets the user's intent. Furthermore, the emotion recognition engine recognizes the user's emotions from their tone of voice and manner of speaking. This combination makes it possible to generate an appropriate response based on the user's intent and emotions. This generated response is then transmitted back to the communication device as audio data and presented to the user verbally.

[0393] As a concrete example, consider a scenario where a user requests via voice, "I'm feeling down today, so I'd like some advice to cheer me up." The information processing device recognizes the emotion of "feeling down" and generates appropriate advice for that emotion. Through the communication device, it can provide the user with encouraging words in a calm tone.

[0394] An example of a prompt message is, "How are you feeling today? Let me know if you need any music or advice." This is used as suitable input for a generative AI model. In this way, the present invention provides a user-friendly voice response system that is sensitive to the user's emotions.

[0395] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0396] Step 1:

[0397] The user provides voice input to the communication device. This voice signal is captured using a high-sensitivity microphone. The voice signal is converted into digital data through digital signal processing, and this digital data is passed to the speech recognition software.

[0398] Step 2:

[0399] The terminal transmits the converted digital data to the information processing device. The transmitted digital data becomes input, and the speech recognition software on the information processing device starts operating. The speech recognition software converts the speech data into text data and passes this text data to the next processing step.

[0400] Step 3:

[0401] The server analyzes the text data via a natural language processing engine. Based on the input text data, it extracts the user's intent. For example, if the user says "I'm feeling down," the appropriate intent is analyzed based on that. This analysis result is then passed on to the next step.

[0402] Step 4:

[0403] The server uses an emotion recognition engine to determine the user's emotions from their voice tone and manner of speaking. In this step, the analyzed text data is used as input, and data calculations are performed to determine the emotion of "depressed."

[0404] Step 5:

[0405] The server generates appropriate responses based on the user's intentions and emotions. Using a generative AI model, it generates output data in the form of speech based on the analysis results and emotional information. For example, it can generate encouraging words or advice.

[0406] Step 6:

[0407] The server sends the generated audio data to the terminal. The terminal plays this audio data, allowing the user to receive an audio response. This enables the user to experience a response that resonates with their emotions.
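As a non-limiting illustration of Step 5, the following sketch shows how the analyzed intent and emotion might be assembled into a prompt for a generative AI model. The wording of the template and the `build_prompt` function are hypothetical; the disclosure does not specify the prompt format, and the call to the model itself is omitted.

```python
def build_prompt(user_text: str, intent: str, emotion: str) -> str:
    """Assemble a hypothetical instruction prompt for a generative AI model."""
    return (
        "You are a voice assistant for elderly users.\n"
        f"User utterance: {user_text}\n"
        f"Detected intent: {intent}\n"
        f"Detected emotion: {emotion}\n"
        "Reply in one or two short sentences, in a calm and encouraging tone."
    )
```

For the utterance "I'm feeling down today, so I'd like some advice to cheer me up," the intent might be labeled as a request for advice and the emotion as feeling down; the resulting text reply would then be converted to audio data before transmission to the terminal.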

[0408] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0409] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0410] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0411] [Third Embodiment]

[0412] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0413] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0414] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0415] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0416] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0417] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0418] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0419] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0420] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0421] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0422] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0423] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0424] The system according to the present invention provides personalized services for the elderly based on voice input. This system mainly consists of a terminal, a server, and data communication between them.

[0425] First, the device uses a high-sensitivity microphone to acquire the user's voice input. This voice is converted into digital data in real time. The converted voice data is then transmitted to a server via the internet.

[0426] The server is equipped with speech recognition software, which converts received speech data into text data. This text data is then analyzed by a natural language processing engine to identify the user's intent. The analysis determines what the user wants and what services should be provided.

[0427] For example, if a user uses voice input to say, "Set a reminder to take my medicine tomorrow at 3 PM," the server parses this request and generates data corresponding to the reminder setting. This information is then sent back to the terminal as needed, and the terminal's scheduling service sets the reminder and manages it across systems.

[0428] Users can confirm the results of their requests through voice feedback from their device. The device plays back the voice data received from the server as synthesized speech, providing feedback to the user, such as, "I have set a reminder for tomorrow at 3 PM."

[0429] In this way, this system uses voice to determine the user's intentions and provides services tailored to individual needs in a timely manner. It enables elderly people to manage their daily lives more comfortably through a user-friendly interface.

[0430] The following describes the processing flow.

[0431] Step 1:

[0432] The device receives the user's voice input using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0433] Step 2:

[0434] The device transmits this digital audio data to the server via the internet.

[0435] Step 3:

[0436] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0437] Step 4:

[0438] The server inputs text data into a natural language processing engine and analyzes the user's intent.

[0439] Step 5:

[0440] The server determines the user's request based on the analysis results and selects the appropriate service or task.

[0441] Step 6:

[0442] The server creates a response based on the selected service, converts it into audio data, and then sends it to the terminal.

[0443] Step 7:

[0444] The terminal plays the audio data received from the server through its speaker and notifies the user of the response result via voice.

[0445] Step 8:

[0446] The user acts according to voice feedback from the device and issues additional voice commands as needed.
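The service selection of Step 5 and the response creation of Step 6 can be sketched as a lookup from an intent label to a handler function. The registry, the intent labels, and the handler names below are hypothetical and serve only to illustrate one way of organizing such a selection.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]
_HANDLERS: Dict[str, Handler] = {}


def register(intent: str) -> Callable[[Handler], Handler]:
    """Decorator that associates a handler with an intent label."""
    def decorator(func: Handler) -> Handler:
        _HANDLERS[intent] = func
        return func
    return decorator


@register("set_reminder")
def set_reminder(slots: dict) -> str:
    return f"I have set a reminder for {slots['when']}."


@register("play_music")
def play_music(slots: dict) -> str:
    return f"Playing {slots.get('genre', 'some')} music."


def dispatch(intent: str, slots: dict) -> str:
    """Select the handler for the intent and return the response text."""
    handler = _HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I did not understand that. Could you say it again?"
    return handler(slots)
```

The returned text would then be converted into audio data by a speech synthesis component before being sent to the terminal.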

[0447] (Example 1)

[0448] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0449] For the elderly and those unfamiliar with technology, efficiently managing everyday tasks is often challenging. In particular, conventional systems capable of smoothly processing voice commands are limited, highlighting the need for improved usability. It is essential to provide intuitively usable systems that leverage voice recognition and natural language processing technologies.

[0450] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0451] In this invention, the server includes means for converting speech into text data using a speech recognition application, means for analyzing the user's intent using a natural language processing engine, and means for providing a time management function based on voice commands. This makes it possible to appropriately analyze the user's intent simultaneously with voice input and manage daily tasks efficiently and intuitively.

[0452] An "audio signal" is an electronic representation of sound waves traveling through the air.

[0453] "Digital information" refers to a form of information that is recorded or processed by converting analog data into a binary format.

[0454] An "information processing device" is an electronic device, such as a computer, that has the function of receiving information, processing it, and generating output.

[0455] "User" refers to a person who uses this system to input voice commands and receive services.

[0456] A "speech recognition application" is a software program that analyzes speech signals and converts them into text data.

[0457] A "natural language processing engine" is a system that uses technology to analyze text data and understand the meaning of language and the intent of the user.

[0458] "Audio data" refers to data obtained by digitizing audio signals, in a format that can be processed by computers and other electronic devices.

[0459] "Character data" refers to data represented using characters such as letters of the alphabet and numbers.

[0460] "Time management features" refer to functions such as calendars and reminders that allow you to set and manage tasks based on specific times.

[0461] The system for implementing the present invention provides a variety of services based on voice input. Specifically, it consists mainly of a terminal, a server, and a data communication infrastructure connecting them.

[0462] The terminal uses a highly sensitive voice input device to acquire voice signals from the user. These voice signals are converted from analog to digital information and transmitted to the server in digital format. To perform this operation, the terminal is equipped with general-purpose voice recording hardware and software for processing the voice data.
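The conversion from analog to digital information can be illustrated by sampling and quantization. The sketch below assumes a 16 kHz sampling rate and signed 16-bit samples, which are common choices for speech but are not specified in the disclosure; the function names are likewise hypothetical, and the analog signal is modeled as a function of time returning values between -1.0 and 1.0.

```python
import struct
from typing import Callable, List


def sample_and_quantize(
    signal: Callable[[float], float],
    duration_s: float,
    sample_rate: int = 16000,  # assumed rate, not taken from the disclosure
) -> List[int]:
    """Sample the signal at fixed intervals and quantize to signed 16-bit integers."""
    count = int(round(duration_s * sample_rate))
    samples = []
    for n in range(count):
        value = max(-1.0, min(1.0, signal(n / sample_rate)))  # clip to the valid range
        samples.append(int(round(value * 32767)))
    return samples


def to_pcm_bytes(samples: List[int]) -> bytes:
    """Pack the samples as little-endian 16-bit PCM for transmission."""
    return struct.pack(f"<{len(samples)}h", *samples)
```

In an actual terminal this conversion is performed by the audio hardware; the sketch only shows what the resulting digital information represents.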

[0463] The server converts the received digital information into text data using a speech recognition application. High-precision speech recognition software is used for this process, such as a common speech recognition API. The generated text data is then analyzed using a natural language processing engine to identify the user's intent. Cloud-based services may be used for the natural language processing engine.

[0464] For example, if a user says, "Set a reminder to take my medicine tomorrow at 3pm," the system recognizes the instruction and sets an alarm for the appropriate time. The server processes the specified task, generates feedback, and sends it back to the device. The device then uses speech synthesis technology to provide the user with voice feedback such as, "I have set a reminder for tomorrow at 3pm."

[0465] To make this system function even more efficiently, prompts for a generative AI model can be used. For example, a prompt such as, "When the user says, 'Set a reminder to take my medicine tomorrow at 3 pm,' show the results of speech recognition and natural language processing," can be given to the model to obtain quick and accurate responses. This allows users, such as the elderly, to manage various daily tasks with simple voice commands.

[0466] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0467] Step 1:

[0468] The terminal acquires the user's voice using a highly sensitive voice input device. The input is an analog audio signal. This audio signal is converted into digital information through a high-precision digital conversion process. This conversion allows the terminal to output a digital audio file. The audio data converted to digital format is then ready to be sent to subsequent processing steps.

[0469] Step 2:

[0470] The terminal transmits digital information to the server using a secure protocol. The input is digital information generated by the terminal. For security reasons, the terminal encrypts the data and transmits it to the server over the internet. During this process, the terminal performs a transmission confirmation to verify that the data was successfully received by the server. The output is the digital information transmitted to the server.

[0471] Step 3:

[0472] The server converts received digital information into text data using a speech recognition application. The input is digital information sent from the terminal. Speech recognition software analyzes the speech pattern and converts it into text format. Using a specific algorithm, the server generates text data with high accuracy. The output is text data.

[0473] Step 4:

[0474] The server inputs the generated text data into a natural language processing engine to analyze the user's intent. The input is text data generated by a speech recognition application. The server uses natural language processing techniques to evaluate keywords and context in the text and determine the user's intent. The output is the analyzed data regarding the user's intent.

[0475] Step 5:

[0476] The server determines the necessary actions based on the user's intent and generates appropriate feedback data. The input is the analyzed user intent data. Based on the analyzed data, the server assembles the necessary information, for example, for setting reminders. The generated feedback data is prepared to be sent back to the terminal. The output is the feedback data.

[0477] Step 6:

[0478] The terminal receives feedback data sent from the server and provides feedback to the user using speech synthesis technology. The input is the feedback data sent from the server. The terminal uses speech synthesis software to convert the text into speech and plays it back to the user through the speaker. This audio feedback allows the user to confirm that their instructions have been processed correctly. The output is the audio feedback to the user.
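
The six steps above can be sketched as a chain of small functions. This is a minimal sketch under stated assumptions: the speech recognizer and the intent analysis are stand-ins (the demo "audio" is plain UTF-8 text and intents are matched by keyword), the transmission confirmation is modeled as a SHA-256 digest comparison, and encryption is omitted. All function and class names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Feedback:
    intent: str
    text: str


def transmit(payload: bytes) -> tuple[bytes, str]:
    """Step 2 stand-in: return the payload with a SHA-256 digest for confirmation."""
    return payload, hashlib.sha256(payload).hexdigest()


def confirm_receipt(received: bytes, digest: str) -> bool:
    """Check that the received bytes match the digest computed by the sender."""
    return hashlib.sha256(received).hexdigest() == digest


def recognize_speech(audio: bytes) -> str:
    """Step 3 stand-in: a real system would call a speech recognition service."""
    return audio.decode("utf-8")  # the demo 'audio' is just UTF-8 text


def analyze_intent(text: str) -> str:
    """Step 4 stand-in: keyword matching in place of an NLP engine."""
    lowered = text.lower()
    if "reminder" in lowered:
        return "set_reminder"
    if "memo" in lowered or "note" in lowered:
        return "create_memo"
    return "unknown"


def build_feedback(intent: str) -> Feedback:
    """Step 5: choose the feedback text for the recognized intent."""
    messages = {
        "set_reminder": "I have set your reminder.",
        "create_memo": "I have saved your memo.",
    }
    return Feedback(intent, messages.get(intent, "Sorry, I did not understand."))


def handle_utterance(audio: bytes) -> Feedback:
    """Run Steps 2 to 5; Step 6 (speech synthesis) would speak feedback.text."""
    received, digest = transmit(audio)
    if not confirm_receipt(received, digest):
        raise IOError("transmission could not be confirmed")
    text = recognize_speech(received)
    return build_feedback(analyze_intent(text))
```

Keeping each step as a separate function mirrors the input and output described for each step, so that any stand-in can be replaced by a real service without changing the others.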

[0479] (Application Example 1)

[0480] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0481] The problem to be solved is to reduce the burden on users, including the elderly, so that they can easily and intuitively manage schedules and set reminders in their daily lives and efficiently execute tasks through voice control. In particular, there is a need to provide convenience and intuitive operation for users who are unfamiliar with technology.

[0482] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0483] In this invention, the server includes a device for converting voice signals into digital data, means for receiving voice input from a user, means for transmitting the digital data to an information processing device, where the information processing device converts the voice data into text data, and means for analyzing the text data to determine the user's intent. This makes it possible to accurately analyze the user's intent via voice and to easily perform daily management functions such as appropriate schedule management and reminder setting through voice operation.

[0484] An "audio signal" is an electrical signal that represents audio information, and it is fundamental data used when processing audio physically and electronically.

[0485] "Digital data" refers to audio signals that have been converted into numerical data and formatted in a way that allows them to be processed by computers and other devices.

[0486] "Device" is a general term for machines or equipment that have a specific function or role.

[0487] "User" refers to any person who operates or uses this system.

[0488] "Voice input" refers to the operation or process of capturing a user's voice into a system.

[0489] An "information processing device" is a computer system that receives digital data and performs various calculations and processes.

[0490] A "speech recognition program" is software that analyzes speech signals and converts their content into text data.

[0491] A "natural language processing module" is a software component that analyzes text data to infer the user's intent and derive appropriate responses or actions.

[0492] "Schedule management" is the process of organizing, recording, and notifying users of specific appointments and tasks according to the dates and times they specify.

[0493] "Setting a reminder" is the process of instructing the system to send a notification at a time specified by the user.

[0494] "Voice response" refers to the process by which a system communicates its processing results to the user via voice.

[0495] This invention provides a system for assisting the elderly using a voice interface. The system includes a device for user voice input, an information processing device for processing digital data, and a device for providing voice responses.

[0496] The device uses a high-sensitivity microphone to capture the user's voice and converts it into digital data. The converted digital data is transferred to an information processing device via the internet. The information processing device uses a speech recognition program (e.g., a speech recognition API) to convert the voice into text data and then uses a natural language processing module (e.g., a natural language API) to analyze the user's intent.

[0497] The analysis identifies the necessary actions. For example, if a user enters the voice command "Set a reminder to take my medicine at 3 PM," the information processing device generates a task as a reminder setting, which is then transmitted to the device, and the user receives voice feedback confirming the task's success. A synthesized speech engine (e.g., a speech synthesis API) is used for the voice feedback.

[0498] For example, a user might voice-input, "Can you set a reminder to take my vitamin supplements at 8 PM?" Based on this request, the device sets a reminder for the specified time and, when that time arrives, notifies the user in synthesized speech, "It's time to take your vitamin supplements at 8 PM." An example of a prompt might be, "Please add an event: I'm going for a walk tomorrow at 10 AM."

[0499] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0500] Step 1:

[0501] The device uses a high-sensitivity microphone to capture voice input from the user. The voice signal is input and converted into digital data. This converted digital data becomes the output and is passed on to subsequent processing.

[0502] Step 2:

[0503] The terminal sends the converted digital data to the server via the internet. The input is digital audio data, which the server receives and uses as input for the next processing step.

[0504] Step 3:

[0505] The server uses a speech recognition program to convert digital audio data into text data. After the data conversion, the audio content is output as text, which becomes input data for analysis by the natural language processing module.

[0506] Step 4:

[0507] The server analyzes text data using a natural language processing module to understand the user's intent. The input is converted text data, which is then analyzed to identify the task the user wishes to perform. The analysis results are then used as output for specific processing and responses.

[0508] Step 5:

[0509] The server generates an appropriate task (e.g., setting a reminder) based on the analysis results. The input is the result of analysis representing the user's intent, and the server generates and outputs a task based on appropriate program logic. This task information is then sent to the terminal.

[0510] Step 6:

[0511] The terminal receives task information from the server and manages the schedule based on the configured tasks. It performs data calculations internally and sets reminders according to the execution timing.

[0512] Step 7:

[0513] At the time requested by the user, the device uses a synthesized speech engine to provide voice feedback. The input is feedback information from the server, and the synthesized speech outputs the feedback content as voice, notifying the user.
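
Steps 6 and 7 above, in which the terminal holds the configured tasks and speaks each one when its time arrives, can be sketched with a priority queue ordered by due time. The class and method names are hypothetical, and the sketch only returns the messages that are due; a real terminal would pass them to its synthesized speech engine.

```python
import heapq
from datetime import datetime


class ReminderQueue:
    """Min-heap of (due_time, sequence, message) kept on the terminal."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal due times never compare messages

    def add(self, due: datetime, message: str) -> None:
        """Register a reminder received from the server."""
        heapq.heappush(self._heap, (due, self._seq, message))
        self._seq += 1

    def pop_due(self, now: datetime) -> list[str]:
        """Remove and return every message whose due time has arrived."""
        due_messages = []
        while self._heap and self._heap[0][0] <= now:
            _, _, message = heapq.heappop(self._heap)
            due_messages.append(message)
        return due_messages


queue = ReminderQueue()
queue.add(datetime(2024, 12, 9, 20, 0), "It's time to take your vitamin supplements.")
early = queue.pop_due(datetime(2024, 12, 9, 19, 59))    # nothing is due yet
on_time = queue.pop_due(datetime(2024, 12, 9, 20, 0))   # the reminder fires
```

A heap keeps the earliest reminder at the front, so the terminal only needs to compare the current time with the first entry on each check.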

[0514] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0515] This invention is a system that provides customized services based on voice input for the elderly, and by recognizing the user's emotions, it enables more appropriate and user-friendly responses. This system mainly consists of terminals, servers, and a communication network connecting them.

[0516] First, the terminal uses a high-sensitivity microphone to acquire the user's voice input. The acquired voice is converted into digital data in real time and sent to the server. This transmission takes place over the internet.

[0517] The server first converts the received audio data into text data using speech recognition software, then passes the text data to a natural language processing engine to analyze the user's intent. In addition, an emotion engine recognizes the user's emotions from their tone of voice and speaking style. This emotion information is reflected in response generation, creating an output appropriate to the user's state.

[0518] For example, if a user uses voice input to say, "I feel a little tired, I'd like some relaxing music," the server recognizes not only the intention but also the emotion, such as "tired," using its emotion engine. As a result, it selects calming, relaxing music and instructs the device to play it.

[0519] The device operates based on instructions received from the server and plays music through its speaker. This allows users to receive services that are sensitive to their emotions, improving their quality of life.

[0520] Thus, the present invention aims to provide a voice interface that reflects the user's emotions, enabling elderly people to enjoy digital services with peace of mind in an environment that suits them better.

[0521] The following describes the processing flow.

[0522] Step 1:

[0523] The device receives the voice spoken by the user using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0524] Step 2:

[0525] The terminal transmits the converted digital audio data to the server via the internet.

[0526] Step 3:

[0527] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0528] Step 4:

[0529] The server inputs the converted text data into a natural language processing engine to analyze the user's intent.

[0530] Step 5:

[0531] The server inputs the analyzed text data into an emotion engine, which then recognizes the user's emotions based on their tone of voice and speaking style.

[0532] Step 6:

[0533] The server generates an appropriate response based on the user's intent and perceived emotions, and sends it back to the terminal as audio data.

[0534] Step 7:

[0535] The device plays audio data received from the server through its speaker, providing the user with emotionally sensitive audio feedback.

[0536] Step 8:

[0537] The user can receive voice feedback from the device and issue additional voice commands as needed.
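
Step 6 of the flow above, in which the response depends on both the intent and the recognized emotion, can be sketched as a lookup with a fallback. The intent labels, emotion labels, and playlist names are hypothetical placeholders rather than values defined in the specification.

```python
# Hypothetical response table keyed by (intent, emotion).
RESPONSES = {
    ("play_music", "tired"): {"action": "play", "playlist": "relaxing"},
    ("play_music", "happy"): {"action": "play", "playlist": "upbeat"},
    ("play_music", None): {"action": "play", "playlist": "default"},
}


def select_response(intent, emotion):
    """Prefer an emotion-specific response, then fall back to the intent default."""
    for key in ((intent, emotion), (intent, None)):
        if key in RESPONSES:
            return RESPONSES[key]
    # Neither the emotion-specific nor the default entry exists for this intent.
    return {"action": "ask_again", "playlist": None}
```

For the utterance "I feel a little tired, I'd like some relaxing music," the pair `("play_music", "tired")` would select the relaxing playlist, while an unrecognized emotion falls back to the default playlist for that intent.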

[0538] (Example 2)

[0539] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0540] When providing personalized voice-input services to a diverse range of users, including the elderly, a challenge is that information is not provided with adequate consideration of the user's emotions. There is a need to generate responses that incorporate the user's emotions to achieve more user-friendly and effective interactions.

[0541] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0542] In this invention, the server includes means for converting voice data into text information, means for analyzing the text information to determine the user's intent, and means for analyzing voice tone and speaking style to recognize the user's emotions. This makes it possible to generate personalized responses based on the user's intent and emotions.

[0543] "Audio signals" are data obtained by converting human voices into electrical signals.

[0544] "Digital data" refers to data that has been digitized from an analog signal and converted into a format that can be processed by a computer.

[0545] An "information terminal" is a device that accepts input such as voice and has the function of communicating digital data via the internet.

[0546] A "user" is an individual who uses this system to perform voice input.

[0547] A "processing device" is a computing device that analyzes and processes received digital data and generates appropriate output.

[0548] "Speech recognition software" is a program that converts speech signals into text information.

[0549] A "natural language processing engine" is a mechanism that implements algorithms for analyzing the user's intent from textual information.

[0550] An "emotion recognition engine" is a technology that analyzes voice tone and word choice to evaluate the user's emotions.

[0551] The "Record Creation" and "Notification Settings" functions automatically create and manage memos and reminders based on the user's voice commands.

[0552] To implement this invention, an information terminal, a processing unit, and a communication network connecting them are primarily used. The terminal uses a high-sensitivity microphone to capture the user's voice signal. This voice signal is converted into digital data within the terminal and transmitted to a server via the internet.

[0553] The server converts speech data into text using speech recognition software. This process utilizes common speech recognition algorithms. The server then leverages a natural language processing engine to analyze the user's intent from the converted text. This engine interprets user commands and inquiries using techniques such as keyword extraction and syntactic analysis.

[0554] Furthermore, the server uses an emotion recognition engine to analyze the user's voice tone and speaking style, and evaluates the user's emotions. Emotion recognition includes analyzing speech waveforms and analyzing text using an emotion dictionary. Based on this evaluation, the server generates the most appropriate response and sends it to the information terminal as audio data or other information format.
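
The two cues named above, speech waveform analysis and text analysis with an emotion dictionary, can be combined as in the following minimal sketch. The dictionary entries, the use of root-mean-square amplitude as the only waveform feature, and the quietness threshold of 0.1 are all assumptions for illustration; the specification does not state which features or weights are used.

```python
import math

# Hypothetical emotion dictionary: word -> (emotion label, weight).
EMOTION_WORDS = {
    "tired": ("tired", 1.0),
    "relax": ("tired", 0.5),
    "relaxing": ("tired", 0.5),
    "happy": ("happy", 1.0),
    "good": ("happy", 0.5),
    "down": ("sad", 1.0),
}


def text_scores(text):
    """Sum dictionary weights per emotion over the words in the text."""
    scores = {}
    for word in text.lower().replace(",", " ").replace(".", " ").split():
        if word in EMOTION_WORDS:
            label, weight = EMOTION_WORDS[word]
            scores[label] = scores.get(label, 0.0) + weight
    return scores


def rms_energy(samples):
    """Root-mean-square amplitude of a waveform given as floats in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_emotion(text, samples, quiet_threshold=0.1):
    """Combine both cues: a quiet voice adds weight to low-arousal emotions."""
    scores = text_scores(text)
    if rms_energy(samples) < quiet_threshold:
        for label in ("tired", "sad"):
            if label in scores:
                scores[label] += 0.5
    if not scores:
        return "neutral"
    return max(scores, key=scores.get)
```

In this sketch the waveform only reinforces emotions already suggested by the text; a production emotion recognition engine would typically use richer acoustic features such as pitch and speaking rate.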

[0555] The terminal follows instructions from the server and uses its voice output device to play the voice response requested by the user. It also has functions for creating records and setting notifications based on voice commands, supporting the user's daily activities.

[0556] For example, if a user gives a voice command such as, "I feel like relaxing today, so play some relaxing music," the server analyzes the voice to understand the user's intention and emotion, and selects music related to "relaxation." Based on the instruction, the device plays relaxing music through its speakers.

[0557] Examples of prompts include, "I want to relax, so please play some gentle music," or "I'm in a good mood today, so I'd like to listen to some upbeat music." By using such prompts, the system can provide services that reflect the user's emotions and intentions, enabling elderly users and other users to receive information and responses that are tailored to them.

[0558] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0559] Step 1:

[0560] The terminal captures the user's voice signal using a high-sensitivity microphone. At this time, the voice signal is input. The input voice signal is converted into digital data by the voice processing unit installed in the terminal. The converted digital data is transmitted to the server via the internet.

[0561] Step 2:

[0562] The server inputs the received digital audio data into speech recognition software. This software uses a speech recognition algorithm to convert the audio data into text information. The output text information is then used for subsequent analysis.

[0563] Step 3:

[0564] The server passes the text information to the natural language processing engine. The engine analyzes the input text information and processes the data to understand the user's intent. For example, grammatical analysis and keyword extraction are performed, and the user's specific commands or questions are output.

[0565] Step 4:

[0566] The server transmits intent information obtained from the natural language processing engine to the emotion recognition engine. The emotion recognition engine analyzes the content of the input text information and the paralinguistic elements of the speech (e.g., intonation, stress) to recognize the user's emotions. This analysis outputs an evaluation of the user's emotions, such as whether they are relaxed or stressed.

[0567] Step 5:

[0568] The server integrates the intent and emotion information and inputs it into the response generation engine to generate the optimal response. This data processing uses a generative AI model to produce the most appropriate response for the user's situation. For example, it might output instructions for playlist creation or information provision.

[0569] Step 6:

[0570] The terminal executes specific services based on response instructions received from the server. For example, it might launch a music playback application and play music through the speaker according to the outputted instructions. As a result, the user can receive services that match their intentions and emotions.
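
Step 5 of the flow above, in which intent and emotion information are integrated and given to a generative AI model, can be sketched as a prompt builder. The prompt wording and the function names are hypothetical, and the model is represented by an arbitrary callable so that no particular vendor API is assumed.

```python
def build_prompt(user_text, intent, emotion):
    """Assemble one prompt string from the utterance and the two analysis results."""
    lines = [
        "You are a voice assistant for elderly users.",
        f"User utterance: {user_text}",
        f"Recognized intent: {intent}",
        f"Estimated emotion: {emotion}",
        "Reply in one or two short sentences suited to the user's emotion.",
    ]
    return "\n".join(lines)


def generate_response(prompt, model):
    """Call any text-generation callable; no specific vendor API is assumed."""
    return model(prompt)


def stub_model(prompt):
    """Placeholder model that returns a fixed reply for demonstration."""
    return "Playing some gentle music for you."


prompt = build_prompt(
    "I want to relax, so please play some gentle music", "play_music", "relaxed"
)
reply = generate_response(prompt, stub_model)
```

Passing the model in as a parameter keeps the prompt construction testable on its own and lets the stub be replaced by a call to whichever generative AI service is actually used.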

[0571] (Application Example 2)

[0572] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0573] When elderly people use voice-based digital services in their daily lives, accurate understanding of their intentions and responses that take their emotions into consideration are required. However, existing voice response systems are insufficient in providing appropriate services that take emotions into account, which can cause inconvenience for users. This problem needs to be solved.

[0574] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0575] In this invention, the server includes means for converting speech data into text data using speech recognition software, means for analyzing the user's intent from the text data using a natural language processing engine, and means for determining the user's emotions from their tone of voice and manner of speaking using an emotion recognition engine. This enables an appropriate response based on the user's intent and emotions.

[0576] An "audio signal" is waveform information represented by sound, including human speech, that is transmitted through a medium such as air.

[0577] "Digital data" refers to data that represents analog information, such as audio signals, in binary format.

[0578] A "communication device" is a device that has the function of converting voice signals into digital data and transmitting and receiving data.

[0579] A "user" is an individual who operates the system and receives services through voice input.

[0580] An "information processing device" is a computer system that processes audio signals and text data to analyze the user's intentions and emotions.

[0581] "Speech recognition software" is a program or application that converts speech signals into text data.

[0582] A "natural language processing engine" is a computer program used to analyze human language and understand its meaning and intent.

[0583] An "emotion recognition engine" is a software system that determines a user's emotions from their voice tone and speaking style.

[0584] "Audio data" refers to data in which audio signals are recorded in digital format.

[0585] The system for implementing this invention mainly consists of a communication device, an information processing device, and a communication network connecting them. A portable terminal equipped with a high-sensitivity microphone is used as the communication device. This terminal has the function of acquiring voice input from the user and converting the voice signal into digital data. This converted digital data is transmitted to the information processing device via the internet.

[0586] The information processing device is equipped with speech recognition software, a natural language processing engine, and an emotion recognition engine. The speech recognition software converts speech data into text data. Subsequently, the natural language processing engine analyzes the text data and interprets the user's intent. Furthermore, the emotion recognition engine recognizes the user's emotions from their tone of voice and manner of speaking. This combination makes it possible to generate an appropriate response based on the user's intent and emotions. This generated response is then transmitted back to the communication device as audio data and presented to the user verbally.

[0587] As a concrete example, consider a scenario where a user requests via voice, "I'm feeling down today, so I'd like some advice to cheer me up." The information processing device recognizes the emotion of "feeling down" and generates appropriate advice for that emotion. Through the communication device, it can provide the user with encouraging words in a calm tone.

[0588] An example of a prompt message is, "How are you feeling today? Let me know if you need any music or advice." This is used as suitable input for a generative AI model. In this way, the present invention provides a user-friendly voice response system that is sensitive to the user's emotions.

[0589] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0590] Step 1:

[0591] The user provides voice input to the communication device. This voice signal is captured using a high-sensitivity microphone. The voice signal is converted into digital data through digital signal processing, and this digital data is passed to speech recognition software.

[0592] Step 2:

[0593] The terminal transmits the converted digital data to the information processing device. The transmitted digital data becomes input, and the speech recognition software on the information processing device starts operating. The speech recognition software converts the speech data into text data and passes this text data to the next processing step.

[0594] Step 3:

[0595] The server analyzes the text data via a natural language processing engine. Based on the input text data, it extracts the user's intent. For example, if the user says "I'm feeling down," the appropriate intent is analyzed based on that. This analysis result is then passed on to the next step.

[0596] Step 4:

[0597] The server uses an emotion recognition engine to determine the user's emotions from their voice tone and manner of speaking. In this step, the analyzed text data is used as input, and data calculations are performed to determine the emotion of "depressed."

[0598] Step 5:

[0599] The server generates appropriate responses based on the user's intentions and emotions. Using a generative AI model, it generates output data in the form of speech based on the analysis results and emotional information. For example, it can generate encouraging words or advice.

[0600] Step 6:

[0601] The server sends the generated audio data to the terminal. The terminal plays this audio data, allowing the user to receive an audio response. This enables the user to experience a response that resonates with their emotions.

[0602] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0603] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions and inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0604] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0605] [Fourth Embodiment]

[0606] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0607] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0608] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0609] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0610] The microphone 238 receives instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0611] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0612] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0613] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
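
The relationship between an emotion and the controlled object 443 can be sketched as a table that maps each emotion to LED and motor targets. The emotion labels, colour values, and arm angles below are hypothetical; the specification states only that motors and eye LEDs are controlled, not which values are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    led_rgb: tuple      # eye LED colour as (red, green, blue), each 0-255
    arm_angle_deg: int  # target angle for the arm motors


# Hypothetical mapping from an emotion label to actuator targets.
EXPRESSIONS = {
    "happy": Expression((255, 200, 0), 60),
    "sad": Expression((0, 0, 255), -30),
    "neutral": Expression((255, 255, 255), 0),
}


def express(emotion):
    """Return LED and motor targets, defaulting to neutral for unknown emotions."""
    return EXPRESSIONS.get(emotion, EXPRESSIONS["neutral"])
```

A real robot would forward these targets to its LED and motor drivers; the sketch stops at selecting the values.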

[0614] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0615] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0616] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0617] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0618] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0619] The system according to the present invention provides personalized services for the elderly based on voice input. This system mainly consists of a terminal, a server, and data communication between them.

[0620] First, the device uses a high-sensitivity microphone to acquire the user's voice input. This voice is converted into digital data in real time. The converted voice data is then transmitted to a server via the internet.

[0621] The server is equipped with speech recognition software, which converts received speech data into text data. This text data is then analyzed by a natural language processing engine to identify the user's intent. The analysis determines what the user wants and what services should be provided.

[0622] For example, if a user uses voice input to say, "Set a reminder to take my medicine tomorrow at 3 PM," the server parses this request and generates data corresponding to the reminder setting. This information is then sent back to the terminal as needed, and the terminal's scheduling service sets the reminder and manages it across systems.

[0623] Users can confirm the results of their requests through voice feedback from the terminal. The terminal plays back the voice data received from the server as synthesized speech, providing feedback to the user, such as, "I have set a reminder for tomorrow at 3 PM."

[0624] In this way, this system uses voice to determine the user's intentions and provides services tailored to individual needs in a timely manner. It enables elderly people to manage their daily lives more comfortably through a user-friendly interface.

[0625] The following describes the processing flow.

[0626] Step 1:

[0627] The terminal receives the user's voice input using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0628] Step 2:

[0629] The terminal transmits this digital audio data to the server via the internet.

[0630] Step 3:

[0631] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0632] Step 4:

[0633] The server inputs text data into a natural language processing engine and analyzes the user's intent.

[0634] Step 5:

[0635] The server determines the user's request based on the analysis results and selects the appropriate service or task.

[0636] Step 6:

[0637] The server creates a response based on the selected service, converts it into audio data, and then sends it to the terminal.

[0638] Step 7:

[0639] The terminal plays the audio data received from the server through its speaker and notifies the user of the response result via voice.

[0640] Step 8:

[0641] The user acts according to the voice feedback from the terminal and issues additional voice commands as needed.
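As a non-limiting illustration, Steps 1 to 7 above can be sketched as a single round trip. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the function names, the `RESPONSES` table, and the keyword check are hypothetical, and speech recognition and speech synthesis are stubbed as UTF-8 encoding and decoding so that the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class ServerReply:
    intent: str   # task selected in Step 5
    text: str     # response wording created in Step 6
    audio: bytes  # response converted to audio data (stubbed as UTF-8 bytes)


def recognize_speech(audio: bytes) -> str:
    """Step 3 stand-in: real speech recognition software would go here."""
    return audio.decode("utf-8")


def analyze_intent(text: str) -> str:
    """Steps 4-5 stand-in: a keyword check in place of an NLP engine."""
    lowered = text.lower()
    if "reminder" in lowered:
        return "set_reminder"
    if "memo" in lowered:
        return "create_memo"
    return "unknown"


def synthesize(text: str) -> bytes:
    """Step 6 stand-in: real speech synthesis would produce audio samples."""
    return text.encode("utf-8")


# Hypothetical response wording for each intent.
RESPONSES = {
    "set_reminder": "I have set your reminder.",
    "create_memo": "I have saved your memo.",
    "unknown": "Sorry, could you say that again?",
}


def server_handle(audio: bytes) -> ServerReply:
    """Server side: Steps 3 to 6."""
    text = recognize_speech(audio)
    intent = analyze_intent(text)
    reply = RESPONSES[intent]
    return ServerReply(intent, reply, synthesize(reply))


def terminal_round_trip(utterance: str) -> str:
    """Terminal side: capture (Steps 1-2), send, then play back (Step 7)."""
    audio = utterance.encode("utf-8")   # stand-in for microphone capture
    reply = server_handle(audio)        # stand-in for the network exchange
    return reply.audio.decode("utf-8")  # stand-in for speaker playback
```

Calling `terminal_round_trip("Set a reminder to take my medicine")` returns the reminder confirmation text. Step 8 corresponds to the user calling the round trip again with a follow-up utterance.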

[0642] (Example 1)

[0643] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0644] For the elderly and those unfamiliar with technology, efficiently managing everyday tasks is often challenging. In particular, few conventional systems can process voice commands smoothly, which highlights the need for improved usability. It is essential to provide intuitively usable systems that leverage voice recognition and natural language processing technologies.

[0645] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0646] In this invention, the server includes means for converting speech into text data using a speech recognition application, means for analyzing the user's intent using a natural language processing engine, and means for providing a time management function based on voice commands. This makes it possible to appropriately analyze the user's intent simultaneously with voice input and manage daily tasks efficiently and intuitively.

[0647] An "audio signal" is an electronic representation of sound waves traveling through the air.

[0648] "Digital information" refers to a form of information that is recorded or processed by converting analog data into a binary format.

[0649] An "information processing device" is an electronic device, such as a computer, that has the function of receiving information, processing it, and generating output.

[0650] "User" refers to a person who uses this system to input voice commands and receive services.

[0651] A "speech recognition application" is a software program that analyzes speech signals and converts them into text data.

[0652] A "natural language processing engine" is a system that uses technology to analyze text data and understand the meaning of language and the intent of the user.

[0653] "Audio data" refers to data obtained by digitizing audio signals, in a format that can be processed by computers and other electronic devices.

[0654] "Character data" refers to data represented using characters such as letters of the alphabet and numbers.

[0655] "Time management features" refer to functions such as calendars and reminders that allow you to set and manage tasks based on specific times.

[0656] The system for implementing the present invention utilizes technology to provide a variety of services based on voice input. Specifically, it consists mainly of a terminal, a server, and a data communication infrastructure connecting them.

[0657] The terminal uses a highly sensitive voice input device to acquire voice signals from the user. These voice signals are converted from analog to digital information and transmitted to the server in digital format. To perform this operation, the terminal is equipped with general-purpose voice recording hardware and software for processing the voice data.
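The analog-to-digital conversion described above can be illustrated with a small quantization sketch. It assumes the analog signal has already been sampled into floating-point values in the range -1.0 to 1.0 and encodes them as 16-bit little-endian PCM. The 16-bit format and the function name `quantize_to_pcm16` are assumptions made for illustration; the disclosure does not specify a sample format.

```python
import struct


def quantize_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))          # guard against overload
        ints.append(int(round(clipped * 32767)))  # scale to the int16 range
    return struct.pack("<%dh" % len(ints), *ints)
```

Each input sample becomes two bytes, so the returned byte string is the digital information that the terminal would hand to its transmission step.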

[0658] The server converts the received digital information into text data using a speech recognition application. High-precision speech recognition software is used for this process, such as a common speech recognition API. The generated text data is then analyzed using a natural language processing engine to identify the user's intent. Cloud-based services may be used for the natural language processing engine.

[0659] For example, if a user says, "Set a reminder to take my medicine tomorrow at 3 PM," the system recognizes the instruction and sets an alarm for the appropriate time. The server processes the specified task, generates feedback, and sends it back to the terminal. The terminal then uses speech synthesis technology to provide the user with voice feedback such as, "I have set a reminder for tomorrow at 3 PM."

[0660] To make this system function even more efficiently, prompts powered by generative AI models can be used. For example, a prompt such as, "When the user says, 'Set a reminder to take my medicine tomorrow at 3 pm,' show the results of speech recognition and natural language processing," can be used to enable quick and accurate responses. This allows users, such as the elderly, to manage various daily tasks with simple voice commands.

[0661] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0662] Step 1:

[0663] The terminal acquires the user's voice using a highly sensitive voice input device. The input is an analog audio signal. This audio signal is converted into digital information through a high-precision digital conversion process. This conversion allows the terminal to output a digital audio file. The audio data converted to digital format is then ready to be sent to subsequent processing steps.

[0664] Step 2:

[0665] The terminal transmits digital information to the server using a secure protocol. The input is digital information generated by the terminal. For security reasons, the terminal encrypts the data and transmits it to the server over the internet. During this process, the terminal performs a transmission confirmation to verify that the data was successfully received by the server. The output is the digital information transmitted to the server.

[0666] Step 3:

[0667] The server converts received digital information into text data using a speech recognition application. The input is digital information sent from the terminal. Speech recognition software analyzes the speech pattern and converts it into text format. Using a specific algorithm, the server generates text data with high accuracy. The output is text data.

[0668] Step 4:

[0669] The server inputs the generated text data into a natural language processing engine to analyze the user's intent. The input is text data generated by a speech recognition application. The server uses natural language processing techniques to evaluate keywords and context in the text and determine the user's intent. The output is the analyzed data regarding the user's intent.

[0670] Step 5:

[0671] The server determines the necessary actions based on the user's intent and generates appropriate feedback data. The input is the analyzed user intent data. Based on the analyzed data, the server assembles the necessary information, for example, for setting reminders. The generated feedback data is prepared to be sent back to the terminal. The output is the feedback data.

[0672] Step 6:

[0673] The terminal receives feedback data sent from the server and provides feedback to the user using speech synthesis technology. The input is the feedback data sent from the server. The terminal uses speech synthesis software to convert the text into speech and plays it back to the user through the speaker. This audio feedback allows the user to confirm that their instructions have been processed correctly. The output is the audio feedback to the user.
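Steps 4 and 5 above, applied to the reminder utterance used in this example, can be sketched as follows. This is only an illustrative sketch: a regular expression stands in for the natural language processing engine, and the pattern, the dictionary keys, and the function names are hypothetical. It handles only requests of the form "set a reminder to ... today/tomorrow at N AM/PM".

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern covering only the phrasing used in this example.
PATTERN = re.compile(
    r"set a reminder to (?P<task>.+?) (?P<day>today|tomorrow) "
    r"at (?P<hour>\d{1,2})\s*(?P<ampm>am|pm)",
    re.IGNORECASE,
)


def parse_reminder(text, now):
    """Step 4: return structured intent data, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    hour = int(m.group("hour")) % 12          # 12 AM -> 0, 12 PM -> 12 below
    if m.group("ampm").lower() == "pm":
        hour += 12
    offset = 1 if m.group("day").lower() == "tomorrow" else 0
    day = now.date() + timedelta(days=offset)
    due = datetime(day.year, day.month, day.day, hour)
    return {"intent": "set_reminder", "task": m.group("task"), "due": due}


def feedback_text(reminder, now):
    """Step 5: build the feedback sentence the terminal will speak."""
    days_ahead = (reminder["due"].date() - now.date()).days
    if days_ahead == 0:
        day_word = "today"
    elif days_ahead == 1:
        day_word = "tomorrow"
    else:
        day_word = reminder["due"].strftime("%Y-%m-%d")
    hour12 = reminder["due"].hour % 12 or 12
    ampm = "AM" if reminder["due"].hour < 12 else "PM"
    return f"I have set a reminder for {day_word} at {hour12} {ampm}."
```

With `now` set to 10:00 on a given day, the utterance "Set a reminder to take my medicine tomorrow at 3 PM" yields a due time of 15:00 on the following day, and `feedback_text` produces the confirmation sentence quoted in the description.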

[0674] (Application Example 1)

[0675] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0676] The problem to be solved is to reduce the burden on users, including the elderly, so that they can manage schedules and set reminders easily and intuitively in their daily lives and execute tasks efficiently through voice control. In particular, there is a need to provide convenience and intuitive operation for users who are unfamiliar with technology.

[0677] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0678] In this invention, the server includes a device for converting voice signals into digital data, means for receiving voice input from a user, means for transmitting the digital data to an information processing device, where the information processing device converts the voice data into text data, and means for analyzing the text data to determine the user's intent. This makes it possible to accurately analyze the user's intent via voice and to easily perform daily management functions such as appropriate schedule management and reminder setting through voice operation.

[0679] An "audio signal" is an electrical signal that represents audio information, and it is fundamental data used when processing audio physically and electronically.

[0680] "Digital data" refers to audio signals that have been converted into numerical data and formatted in a way that allows them to be processed by computers and other devices.

[0681] "Device" is a general term for machines or equipment that have a specific function or role.

[0682] "User" refers to any person who operates or uses this system.

[0683] "Voice input" refers to the operation or process of capturing a user's voice into a system.

[0684] An "information processing device" is a computer system that receives digital data and performs various calculations and processes.

[0685] A "speech recognition program" is software that analyzes speech signals and converts their content into text data.

[0686] A "natural language processing module" is a software component that analyzes text data to infer the user's intent and derive appropriate responses or actions.

[0687] "Schedule management" is the process of organizing, recording, and notifying users of specific appointments and tasks according to the dates and times they specify.

[0688] "Setting a reminder" is the process of instructing the system to send a notification at a time specified by the user.

[0689] "Voice response" refers to the process by which a system communicates its processing results to the user via voice.

[0690] This invention provides a system for assisting the elderly using a voice interface. The system includes a device for user voice input, an information processing device for processing digital data, and a device for providing voice responses.

[0691] The device uses a high-sensitivity microphone to capture the user's voice and converts it into digital data. The converted digital data is transferred to an information processing device via the internet. The information processing device uses a speech recognition program (e.g., a speech recognition API) to convert the voice into text data and then uses a natural language processing module (e.g., a natural language API) to analyze the user's intent.

[0692] The analysis identifies the necessary actions. For example, if a user enters the voice command "Set a reminder to take my medicine at 3 PM," the information processing device generates a task as a reminder setting, which is then transmitted to the device, and the user receives voice feedback confirming the task's success. A synthesized speech engine (e.g., a speech synthesis API) is used for the voice feedback.

[0693] For example, a user might voice-input, "Can you set a reminder to take my vitamin supplements at 8 PM?" Based on this request, the device will set a reminder at the specified time and notify the user in synthesized speech, "It's time to take your vitamin supplements at 8 PM." An example of a prompt might be, "Please add an event: I'm going for a walk tomorrow at 10 AM."

[0694] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0695] Step 1:

[0696] The device uses a high-sensitivity microphone to capture voice input from the user. The voice signal is input and converted into digital data. This converted digital data becomes the output and is passed on to subsequent processing.

[0697] Step 2:

[0698] The terminal sends the converted digital data to the server via the internet. The input is digital audio data, which the server receives and uses as input for the next processing step.

[0699] Step 3:

[0700] The server uses a speech recognition program to convert digital audio data into text data. After the conversion, the audio content is output as text, which becomes input data for analysis by the natural language processing module.

[0701] Step 4:

[0702] The server analyzes text data using a natural language processing module to understand the user's intent. The input is converted text data, which is then analyzed to identify the task the user wishes to perform. The analysis results are then used as output for specific processing and responses.

[0703] Step 5:

[0704] The server generates an appropriate task (e.g., setting a reminder) based on the analysis results. The input is the result of analysis representing the user's intent, and the server generates and outputs a task based on appropriate program logic. This task information is then sent to the terminal.

[0705] Step 6:

[0706] The terminal receives task information from the server and manages the schedule based on the configured tasks. It performs data calculations internally and sets reminders according to the execution timing.

[0707] Step 7:

[0708] At the time requested by the user, the device uses a synthesized speech engine to provide voice feedback. The input is feedback information from the server, and the synthesized speech outputs the feedback content as voice, notifying the user.
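Steps 6 and 7 above, in which the terminal holds reminders and announces them at the requested time, can be sketched with a priority queue ordered by due time. This is one possible arrangement under stated assumptions: the class name `ReminderSchedule` and its methods are hypothetical, and the caller is assumed to poll `pop_due` periodically and pass each returned message to the synthesized speech engine.

```python
import heapq
from datetime import datetime


class ReminderSchedule:
    """Terminal-side store of reminders, ordered by due time."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so entries with equal times stay ordered

    def add(self, due: datetime, message: str) -> None:
        """Step 6: register a task received from the server."""
        heapq.heappush(self._heap, (due, self._counter, message))
        self._counter += 1

    def pop_due(self, now: datetime) -> list:
        """Step 7: return, earliest first, every message whose time has come."""
        due_messages = []
        while self._heap and self._heap[0][0] <= now:
            _, _, message = heapq.heappop(self._heap)
            due_messages.append(message)
        return due_messages
```

A heap keeps the earliest reminder at the front, so each poll only inspects entries that are actually due.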

[0709] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0710] This invention is a system that provides customized services based on voice input for the elderly, and by recognizing the user's emotions, it enables more appropriate and user-friendly responses. This system mainly consists of terminals, servers, and a communication network connecting them.

[0711] First, the terminal uses a high-sensitivity microphone to acquire the user's voice input. The acquired voice is converted into digital data in real time and sent to the server. This transmission takes place over the internet.

[0712] The server first converts the received audio data into text data using speech recognition software, then passes the text data to a natural language processing engine to analyze the user's intent. In addition, an emotion engine recognizes the user's emotions from their tone of voice and speaking style. This emotion information is reflected in response generation, creating an output appropriate to the user's state.

[0713] For example, if a user uses voice input to say, "I feel a little tired, I'd like some relaxing music," the server recognizes not only the intention but also the emotion, such as "tired," using its emotion engine. As a result, it selects relaxing music and sends the terminal an instruction to play it.

[0714] The device operates based on instructions received from the server and plays music through its speaker. This allows users to receive services that are sensitive to their emotions, improving their quality of life.

[0715] Thus, the present invention aims to provide a voice interface that reflects the user's emotions, enabling elderly people to enjoy digital services with peace of mind in an environment that suits them better.

[0716] The following describes the processing flow.

[0717] Step 1:

[0718] The device receives the voice spoken by the user using a high-sensitivity microphone and converts the voice signal into digital data in real time.

[0719] Step 2:

[0720] The terminal transmits the converted digital audio data to the server via the internet.

[0721] Step 3:

[0722] The server passes the received audio data to speech recognition software, which converts the audio signal into text data.

[0723] Step 4:

[0724] The server inputs the converted text data into a natural language processing engine to analyze the user's intent.

[0725] Step 5:

[0726] The server inputs the analyzed text data into an emotion engine, which then recognizes the user's emotions based on their tone of voice and speaking style.

[0727] Step 6:

[0728] The server generates an appropriate response based on the user's intent and perceived emotions, and sends it back to the terminal as audio data.

[0729] Step 7:

[0730] The device plays audio data received from the server through its speaker, providing the user with emotionally sensitive audio feedback.

[0731] Step 8:

[0732] The user can receive voice feedback from the device and issue additional voice commands as needed.
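Step 6 above, in which the response depends on both the intent and the recognized emotion, can be illustrated with a small lookup. The intent labels, emotion labels, genre names, and the function `choose_response` are all hypothetical; the disclosure does not enumerate them.

```python
# Hypothetical mapping from recognized emotion to a music genre.
GENRE_BY_EMOTION = {
    "tired": "relaxing",
    "sad": "gentle",
    "happy": "upbeat",
}


def choose_response(intent: str, emotion: str) -> dict:
    """Pick an instruction for the terminal from the intent and the emotion."""
    if intent == "play_music":
        genre = GENRE_BY_EMOTION.get(emotion, "favorite")
        return {
            "action": "play_music",
            "genre": genre,
            "speech": f"Playing some {genre} music for you.",
        }
    # Fallback when no supported intent was recognized.
    return {"action": "speak", "speech": "Sorry, could you say that again?"}
```

For the utterance about feeling tired and wanting relaxing music, an intent of `"play_music"` and an emotion of `"tired"` select the relaxing genre.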

[0733] (Example 2)

[0734] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0735] When providing personalized voice-input services to a diverse range of users, including the elderly, one challenge is that responses do not adequately take the user's emotions into account. There is a need to generate responses that incorporate the user's emotions to achieve more user-friendly and effective interactions.

[0736] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0737] In this invention, the server includes means for converting voice data into text information, means for analyzing the text information to determine the user's intent, and means for analyzing voice tone and speaking style to recognize the user's emotions. This makes it possible to generate personalized responses based on the user's intent and emotions.

[0738] "Audio signals" are data obtained by converting human voices into electrical signals.

[0739] "Digital data" refers to data that has been digitized from an analog signal and converted into a format that can be processed by a computer.

[0740] An "information terminal" is a device that accepts input such as voice and has the function of communicating digital data via the internet.

[0741] A "user" is an individual who uses this system to perform voice input.

[0742] A "processing device" is a computing device that analyzes and processes received digital data and generates appropriate output.

[0743] "Speech recognition software" is a program that converts speech signals into text information.

[0744] A "natural language processing engine" is a mechanism that implements algorithms for analyzing the user's intent from textual information.

[0745] An "emotion recognition engine" is a technology that analyzes voice tone and word choice to evaluate the user's emotions.

[0746] The "Record Creation" and "Notification Settings" functions automatically create and manage memos and reminders based on the user's voice commands.

[0747] To implement this invention, an information terminal, a processing unit, and a communication network connecting them are primarily used. The terminal uses a high-sensitivity microphone to capture the user's voice signal. This voice signal is converted into digital data within the terminal and transmitted to a server via the internet.

[0748] The server converts speech data into text using speech recognition software. This process utilizes common speech recognition algorithms. The server then leverages a natural language processing engine to analyze the user's intent from the converted text. This engine interprets user commands and inquiries using techniques such as keyword extraction and syntactic analysis.

[0749] Furthermore, the server uses an emotion recognition engine to analyze the user's voice tone and speaking style, and evaluates the user's emotions. Emotion recognition includes analyzing speech waveforms and analyzing text using an emotion dictionary. Based on this evaluation, the server generates the most appropriate response and sends it to the information terminal as audio data or other information format.

[0750] The terminal follows instructions from the server and uses its voice output device to play the voice response requested by the user. It also has functions for creating records and setting notifications based on voice commands, supporting the user's daily activities.

[0751] For example, if a user gives a voice command such as, "I feel like relaxing today, so play some relaxing music," the server analyzes the voice to understand the user's intention and emotion, and selects music related to "relaxation." Based on the instruction, the device plays relaxing music through its speakers.

[0752] Examples of prompts include, "I want to relax, so please play some gentle music," or "I'm in a good mood today, so I'd like to listen to some upbeat music." By using such prompts, the system can provide services that reflect the user's emotions and intentions, enabling elderly users and other users to receive information and responses that are tailored to them.

[0753] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0754] Step 1:

[0755] The terminal captures the user's voice signal using a high-sensitivity microphone. At this time, the voice signal is input. The input voice signal is converted into digital data by the voice processing unit installed in the terminal. The converted digital data is transmitted to the server via the internet.

[0756] Step 2:

[0757] The server inputs the received digital audio data into speech recognition software. This software uses a speech recognition algorithm to convert the audio data into text information. The output text information is then used for subsequent analysis.

[0758] Step 3:

[0759] The server passes the text information to the natural language processing engine. The engine analyzes the input text information and processes the data to understand the user's intent. For example, grammatical analysis and keyword extraction are performed, and the user's specific commands or questions are output.

[0760] Step 4:

[0761] The server transmits intent information obtained from the natural language processing engine to the emotion recognition engine. The emotion recognition engine analyzes the content of the input text information and the paralinguistic elements of the speech (e.g., intonation, stress) to recognize the user's emotions. This analysis outputs an evaluation of the user's emotions, such as whether they are relaxed or stressed.

[0762] Step 5:

[0763] The server integrates the intent information and the emotion information and inputs them into the response generation engine to generate the optimal response. This data processing uses a generative AI model to produce the most appropriate response for the user's situation. For example, it might output instructions for playlist creation or information provision.

[0764] Step 6:

[0765] The terminal executes specific services based on response instructions received from the server. For example, it might launch a music playback application and play music through the speaker according to the outputted instructions. As a result, the user can receive services that match their intentions and emotions.
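The emotion recognition of Step 4 above, which combines the text content with paralinguistic elements of the speech, can be sketched as a simple voting scheme. This is a toy illustration under stated assumptions, not the emotion recognition engine of the disclosure: the emotion dictionary entries, the two prosody features (assumed to be normalized to the range 0 to 1), the thresholds, and the function names are all hypothetical.

```python
import re

# Hypothetical emotion dictionary mapping words to emotion labels.
EMOTION_LEXICON = {
    "tired": "tired", "exhausted": "tired",
    "relax": "calm", "relaxing": "calm",
    "down": "sad", "sad": "sad",
    "good": "happy", "upbeat": "happy", "happy": "happy",
}


def text_votes(text):
    """Count one vote per dictionary word found in the recognized text."""
    votes = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        label = EMOTION_LEXICON.get(word)
        if label:
            votes[label] = votes.get(label, 0) + 1
    return votes


def prosody_votes(mean_energy, pitch_variation):
    """Vote from paralinguistic features; thresholds are illustrative only."""
    if mean_energy < 0.3 and pitch_variation < 0.3:
        return {"tired": 1, "sad": 1}   # quiet, flat speech
    if mean_energy > 0.7 and pitch_variation > 0.5:
        return {"happy": 1}             # loud, lively speech
    return {}


def recognize_emotion(text, mean_energy, pitch_variation):
    """Merge both vote sources and return the label with the most votes."""
    votes = text_votes(text)
    for label, n in prosody_votes(mean_energy, pitch_variation).items():
        votes[label] = votes.get(label, 0) + n
    if not votes:
        return "neutral"
    # sorted() makes ties resolve alphabetically, so results are deterministic.
    return max(sorted(votes), key=votes.get)
```

The returned label would then be combined with the intent in Step 5 to select the response.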

[0766] (Application Example 2)

[0767] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0768] When elderly people use voice-based digital services in their daily lives, accurate understanding of their intentions and responses that take their emotions into consideration are required. However, existing voice response systems are insufficient in providing appropriate services that take emotions into account, which can cause inconvenience for users. This problem needs to be solved.

[0769] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0770] In this invention, the server includes means for converting speech data into text data using speech recognition software, means for analyzing the user's intent from the text data using a natural language processing engine, and means for determining the user's emotions from their tone of voice and manner of speaking using an emotion recognition engine. This enables an appropriate response based on the user's intent and emotions.

[0771] An "audio signal" is waveform information represented by sound, including human speech, that is transmitted through a medium such as air.

[0772] "Digital data" refers to data that represents analog information, such as audio signals, in binary format.

[0773] A "communication device" is a device that has the function of converting voice signals into digital data and transmitting and receiving data.

[0774] A "user" is an individual who operates the system and receives services through voice input.

[0775] An "information processing device" is a computer system that processes audio signals and text data to analyze the user's intentions and emotions.

[0776] "Speech recognition software" is a program or application that converts speech signals into text data.

[0777] A "natural language processing engine" is a computer program used to analyze human language and understand its meaning and intent.

[0778] An "emotion recognition engine" is a software system that determines a user's emotions from their voice tone and speaking style.

[0779] "Audio data" refers to data obtained by recording audio signals in digital format.

[0780] The system for implementing this invention mainly consists of a communication device, an information processing device, and a communication network connecting them. A portable terminal equipped with a high-sensitivity microphone is used as the communication device. This terminal has the function of acquiring voice input from the user and converting the voice signal into digital data. This converted digital data is transmitted to the information processing device via the internet.

[0781] The information processing device is equipped with speech recognition software, a natural language processing engine, and an emotion recognition engine. The speech recognition software converts speech data into text data. Subsequently, the natural language processing engine analyzes the text data and interprets the user's intent. Furthermore, the emotion recognition engine recognizes the user's emotions from their tone of voice and manner of speaking. This combination makes it possible to generate an appropriate response based on the user's intent and emotions. This generated response is then transmitted back to the communication device as audio data and presented to the user verbally.

[0782] As a concrete example, consider a scenario where a user requests via voice, "I'm feeling down today, so I'd like some advice to cheer me up." The information processing device recognizes the emotion of "feeling down" and generates appropriate advice for that emotion. Through the communication device, it can provide the user with encouraging words in a calm tone.

[0783] An example of a prompt message is, "How are you feeling today? Let me know if you need any music or advice." This is used as suitable input for a generative AI model. In this way, the present invention provides a user-friendly voice response system that is sensitive to the user's emotions.

[0784] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0785] Step 1:

[0786] The user provides voice input to the communication device. This voice signal is captured using a high-sensitivity microphone. The voice signal is converted into digital data through digital signal processing, and this digital data is passed on for processing by the speech recognition software.

[0787] Step 2:

[0788] The terminal transmits the converted digital data to the information processing device. The transmitted digital data becomes input, and the speech recognition software on the information processing device starts operating. The speech recognition software converts the speech data into text data and passes this text data to the next processing step.

[0789] Step 3:

[0790] The server analyzes the text data via a natural language processing engine. Based on the input text data, it extracts the user's intent. For example, if the user says "I'm feeling down," the appropriate intent is analyzed based on that. This analysis result is then passed on to the next step.

[0791] Step 4:

[0792] The server uses an emotion recognition engine to determine the user's emotions from their voice tone and manner of speaking. In this step, the analyzed text data is used as input, and data calculations are performed to determine the emotion of "feeling down."

[0793] Step 5:

[0794] The server generates appropriate responses based on the user's intentions and emotions. Using a generative AI model, it generates output data in the form of speech based on the analysis results and emotional information. For example, it can generate encouraging words or advice.

[0795] Step 6:

[0796] The server sends the generated audio data to the terminal. The terminal plays this audio data, allowing the user to receive an audio response. This enables the user to experience a response that resonates with their emotions.
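Step 5 above, in which a generative AI model produces the response from the analysis results and the emotional information, can be sketched as prompt assembly. The prompt wording and the function names are hypothetical, and the model is represented by a plain callable named `generate` that takes a prompt string and returns text, so no particular model API is assumed.

```python
def build_prompt(utterance: str, intent: str, emotion: str) -> str:
    """Combine the utterance, intent, and emotion into one instruction."""
    return (
        "You are a voice assistant supporting an elderly user.\n"
        f"User said: {utterance}\n"
        f"Detected intent: {intent}\n"
        f"Detected emotion: {emotion}\n"
        "Reply in one or two short, calm sentences suited to that emotion."
    )


def respond(utterance: str, intent: str, emotion: str, generate) -> str:
    """Ask the supplied model callable for a reply and tidy the result."""
    return generate(build_prompt(utterance, intent, emotion)).strip()
```

In a deployment, `generate` would wrap whichever generative model is used, and the returned text would be converted to audio data before being sent to the terminal in Step 6.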

[0797] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0798] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0799] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0800] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0801] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0802] These emotions are distributed around the 3 o'clock position of the emotion map 400 and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0803] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
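One way to picture the layout of the emotion map 400 is as polar coordinates: a clock position for direction and a ring number for distance from the center. The sketch below is an assumption made for illustration only; the class name `MapPoint`, the ring numbering, and the helper functions are hypothetical and do not reproduce the actual coordinates of Figure 9. It encodes only what the description states: pleasure at the top, displeasure at the bottom, reaction on the left, situation on the right, and more visible emotions toward the outside.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPoint:
    clock: float  # clock position: 12 = top, 3 = right, 6 = bottom, 9 = left
    ring: int     # 1 = innermost ring; larger values lie further out

    def xy(self):
        # Convert the clock position to an angle measured from the x-axis.
        angle = math.radians(90 - self.clock * 30)
        return (self.ring * math.cos(angle), self.ring * math.sin(angle))


def side(point: MapPoint) -> str:
    """Left half = reaction, right half = situation, vertical axis = both."""
    x, _ = point.xy()
    if abs(x) < 1e-9:
        return "both"
    return "situation" if x > 0 else "reaction"


def valence(point: MapPoint) -> str:
    """Upper half = pleasure, lower half = displeasure."""
    _, y = point.xy()
    if abs(y) < 1e-9:
        return "neutral"
    return "pleasure" if y > 0 else "displeasure"


def distance(a: MapPoint, b: MapPoint) -> float:
    """Emotions likely to occur together are mapped close to each other."""
    return math.dist(a.xy(), b.xy())


three_oclock = MapPoint(clock=3, ring=2)  # hypothetical placement
print(side(three_oclock), valence(three_oclock))
```

With this representation, "close together on the map" becomes an ordinary Euclidean distance, which is convenient when comparing emotion values of neighboring emotions.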

[0804] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
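The balance-based view of pleasure and discomfort in this paragraph can be illustrated with a small comparison of how far a measured quantity lies from its ideal value. The function name `balance_valence`, the use of an absolute difference, and the battery figures below are assumptions made for this sketch; the disclosure does not specify how deviation is measured.

```python
def balance_valence(previous: float, current: float, ideal: float) -> str:
    """Classify a change in a balance (e.g. battery level) relative to its ideal.

    Moving toward the ideal yields "pleasure"; moving away yields "discomfort".
    """
    previous_gap = abs(previous - ideal)
    current_gap = abs(current - ideal)
    if current_gap < previous_gap:
        return "pleasure"
    if current_gap > previous_gap:
        return "discomfort"
    return "neutral"


# Hypothetical robot battery level, where 1.0 is treated as the ideal.
print(balance_valence(previous=0.5, current=0.3, ideal=1.0))  # battery draining
print(balance_valence(previous=0.3, current=0.8, ideal=1.0))  # battery charging
```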

[0805] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0806] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
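The disclosure does not state how the neural network is trained so that nearby emotions receive similar values. One common way to express such a constraint, shown here purely as an assumption, is a proximity-weighted penalty that grows when emotions close together on the map are assigned very different values. The function name `proximity_penalty`, the exponential weighting, and the map positions below are hypothetical.

```python
import math


def proximity_penalty(values, positions, scale=1.0):
    """Sum of squared value differences, weighted by closeness on the map.

    values:    emotion name -> emotion value
    positions: emotion name -> (x, y) position on the map (hypothetical)
    """
    names = sorted(values)
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = math.dist(positions[a], positions[b])
            weight = math.exp(-gap / scale)  # nearer pairs weigh more
            total += weight * (values[a] - values[b]) ** 2
    return total


# Hypothetical positions for three emotions that lie close together.
positions = {
    "reassured": (1.0, 0.0),
    "calm": (1.1, 0.1),
    "confident": (1.0, 0.2),
}
similar = {"reassured": 3.0, "calm": 3.0, "confident": 3.0}
uneven = {"reassured": 3.0, "calm": 0.0, "confident": 3.0}

print(proximity_penalty(similar, positions))
print(proximity_penalty(uneven, positions))
```

A term of this kind could be added to an ordinary training loss; the first call yields no penalty because the three neighboring emotions share the same value, while the second is penalized because "calm" differs sharply from its neighbors.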

[0807] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0808] In the above embodiment, an example was given in which the specific processing is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific processing may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.

[0809] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0810] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0811] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0812] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0813] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0814] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0815] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0816] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0817] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0818] The following is further disclosed regarding the embodiments described above.

[0819] (Claim 1)

[0820] A terminal that converts audio signals into digital data, and a means for receiving audio input from the user,

[0821] The means of transmitting the aforementioned digital data to a server, and the server converting the audio data into text data,

[0822] A means for analyzing the aforementioned text data to determine the user's intent,

[0823] A means for generating appropriate output based on the user's intent and transmitting it to the terminal as audio data,

[0824] A means for playing the aforementioned audio data to the user and providing an audio response,

[0825] A system that includes this.

[0826] (Claim 2)

[0827] The system according to claim 1, wherein the server includes speech recognition software and a natural language processing engine.

[0828] (Claim 3)

[0829] The system according to claim 1, wherein the terminal has a function for creating memos and setting reminders based on voice commands.

[0830] "Example 1"

[0831] (Claim 1)

[0832] It includes an information processing device that converts audio signals into digital information, and means for receiving audio input from the user,

[0833] A means for transmitting the aforementioned digital information to an information processing device, which then converts the audio data into text data,

[0834] A means for analyzing the aforementioned text data to determine the user's intent,

[0835] Means for generating appropriate output based on the user's intent and transmitting it as audio data to the information processing device,

[0836] A means for playing the aforementioned audio data to the user and providing an audio response,

[0837] A means that secondarily has a time management function for carrying out instructions based on the user's intent,

[0838] A system that includes this.

[0839] (Claim 2)

[0840] The system according to claim 1, wherein the information processing device includes a speech recognition application and a natural language processing engine.

[0841] (Claim 3)

[0842] The system according to claim 1, wherein the information processing device has a function for creating records and setting notifications based on voice commands.

[0843] "Application Example 1"

[0844] (Claim 1)

[0845] A device that converts audio signals into digital data, and a means for receiving audio input from users,

[0846] A means for transmitting the aforementioned digital data to an information processing device, and for the information processing device to convert the audio data into text data,

[0847] A means for analyzing the aforementioned text data to determine the user's intent,

[0848] Means for generating appropriate information based on the user's intent and transmitting it to the device as audio data,

[0849] A means for playing the aforementioned audio data to the user and providing an audio response,

[0850] A means of using smart devices to set reminders and perform other daily management functions via voice,

[0851] A system that includes this.

[0852] (Claim 2)

[0853] The system according to claim 1, wherein the information processing device includes a speech recognition program and a natural language processing module.

[0854] (Claim 3)

[0855] The system according to claim 1, wherein the device has a schedule management and notification setting function based on voice commands.

[0856] "Example 2 of combining an emotion engine"

[0857] (Claim 1)

[0858] It includes an information terminal that converts audio signals into digital data, and a means for receiving audio input from the user,

[0859] A means for transmitting the aforementioned digital data to a processing device, and for the processing device to convert the audio data into text information,

[0860] A means for analyzing the aforementioned textual information to determine the user's intent,

[0861] The aforementioned processing device includes means for analyzing voice and speaking style to recognize the user's emotions,

[0862] Means for generating appropriate output based on the user's intentions and emotions and transmitting it to the information terminal as audio data or other information format,

[0863] A means for reproducing the transmitted data for the user and responding to it,

[0864] A system that includes this.

[0865] (Claim 2)

[0866] The system according to claim 1, wherein the processing unit includes speech recognition software, a natural language processing engine, and an emotion recognition engine.

[0867] (Claim 3)

[0868] The system according to claim 1, wherein the information terminal has a function for creating records and setting notifications based on voice commands, and provides information services that respond to the user's emotions.

[0869] "Application example 2 when combining with an emotional engine"

[0870] (Claim 1)

[0871] A communication device that converts audio signals into digital data, and a means for receiving audio input from users,

[0872] A means for transmitting the aforementioned digital data to an information processing device, and for the information processing device to convert the audio data into text data,

[0873] A means for analyzing the aforementioned text data to determine the user's intent,

[0874] A means for recognizing the user's emotions from their tone of voice and manner of speaking,

[0875] A means for generating an appropriate output based on the user's intentions and emotions and transmitting it to the communication device as voice data,

[0876] A means for playing the aforementioned audio data to the user and providing an audio response,

[0877] A system that includes this.

[0878] (Claim 2)

[0879] The system according to claim 1, wherein the information processing device includes speech recognition software, a natural language processing engine, and an emotion recognition engine.

[0880] (Claim 3)

[0881] The system according to claim 1, wherein the communication device has a memo creation and time management function based on voice instructions.

[Explanation of symbols]

[0882]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A terminal that converts audio signals into digital data, and a means for receiving audio input from the user,
A means for transmitting the aforementioned digital data to a server, and for the server to convert the audio data into text data,
A means for analyzing the aforementioned text data to determine the user's intent,
A means for generating appropriate output based on the user's intent and transmitting it to the terminal as audio data,
A means for playing the aforementioned audio data to the user and providing an audio response,
A system that includes these.

2. The system according to claim 1, wherein the server includes speech recognition software and a natural language processing engine.

3. The system according to claim 1, wherein the terminal has a function for creating memos and setting reminders based on voice commands.