system

The system addresses the challenge of insufficient personalized responses in mobile devices by analyzing voice inputs and updating user profiles, enabling efficient and personalized information delivery.

JP2026096536A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

Abstract

A system is provided. [Solution] The system includes: means for acquiring voice input from a user and converting it into text data; means for analyzing the text data using natural language processing and identifying the user's request; means for obtaining necessary information, based on the user's request, by referring to past data and utilizing external information sources; means for generating a response to the user based on the acquired information, converting it into voice, and notifying the user; and means for recording and updating the response and the user's interaction history as a profile.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Modern mobile devices have substantial information processing capability, yet users must still manage multiple tasks individually in daily life. In particular, voice-based navigation and assistant functions may have difficulty providing appropriate information in response to user requests. In addition, because responses are insufficiently personalized to individual preferences, users may not receive satisfactory support from smart devices. Therefore, a mechanism is needed that efficiently acquires and provides information in real time based on the user's voice instructions and makes users' lives more convenient.

Means for Solving the Problems

[0005] This invention provides a system that accurately analyzes user requests by acquiring voice input from users and converting it into text. It uses natural language processing technology for analysis and effectively references past data and external information sources based on the results. This allows for the rapid acquisition of information tailored to the user's requests. Furthermore, by generating responses and notifying the user via voice, users can intuitively receive information. By recording and updating the user's interaction history as a profile, personalized information delivery is achieved for each user. This improves the efficiency of information acquisition and delivery, dramatically enhancing the user experience.

[0006] "Voice input" is the process of capturing the voice spoken by the user as a digital signal in a device.

[0007] "Text data" refers to character information expressed in digital format, including converted audio and handwritten information.

[0008] "Natural language processing" is a technology that allows computers to understand and analyze human language, and is used to extract information from speech and text.

[0009] "User requests" refer to the specific instructions or questions that users give to the device, and represent the tasks that should be performed.

[0010] "External information sources" refer to information services and databases that a device can access via the internet or other networks.

[0011] A "response" is information generated by a device based on a user's request, and is provided in various forms such as voice or text.

[0012] "Interaction history" refers to a record of past interactions and operations between the user and the device, and is data used for later analysis and service improvement.

[0013] A "profile" is a collection of information including an individual user's preferences, tendencies, and past interaction history, and is used to provide personalized services.

Brief Description of the Drawings

[0014]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Embodiments for Carrying Out the Invention

[0015] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0016] First, the terms used in the following description will be explained.

[0017] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0018] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0019] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0020] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0022] [First Embodiment]

[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
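
The exchange between the terminal and the server described above can be pictured as a small message envelope. The following is a minimal sketch, assuming JSON over the network; the class name `UserInputMessage` and its fields are hypothetical and do not appear in the disclosure.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class UserInputMessage:
    """Hypothetical envelope for data sent from the terminal to the server."""
    user_id: str
    source: str   # "microphone" or "touch_panel" (assumed labels)
    payload: str  # recognized text or a description of the touch input


def encode_message(msg: UserInputMessage) -> bytes:
    """Serialize the message to UTF-8 JSON for transmission over the network."""
    return json.dumps(asdict(msg)).encode("utf-8")


def decode_message(raw: bytes) -> UserInputMessage:
    """Rebuild the message on the server side."""
    return UserInputMessage(**json.loads(raw.decode("utf-8")))
```

A round trip through `encode_message` and `decode_message` returns an equal object, which is the only property this sketch is meant to illustrate.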

[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0031] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0035] This invention provides a system for coordinating diverse information provision and services based on voice input in mobile information terminals such as smartphones that users use on a daily basis. The aim of this system is to improve the user experience by smoothly coordinating a series of processes including speech recognition, natural language processing, data retrieval, information generation, and response.

[0036] User interface and speech recognition

[0037] The device is equipped with a microphone to receive voice commands from the user, and the acquired voice is converted into a digital signal. This signal is processed by a voice recognition module and converted into text data. Voice recognition is performed in real time, allowing the user to give commands in a natural way.

[0038] Natural Language Processing and Request Analysis

[0039] The device passes text data to a natural language processing engine to analyze the user's request. This identifies the information and tasks the user is seeking, allowing for efficient progress to the next processing step. This analysis is performed using advanced algorithms, enabling accurate semantic understanding.

[0040] Data retrieval and information generation

[0041] After a user's request is identified, the server collects data appropriate to that request from internal and external sources. This involves referencing the internet and specific databases as needed to obtain real-time and accurate information. The server can also refer to the user's past interaction history to generate individually customized information.

[0042] Response generation and user notifications

[0043] The collected and analyzed information is assembled into a user-specific response. The server formats this response, and the terminal uses a speech synthesis module to notify the user verbally. This allows users to obtain appropriate information without relying on visual cues.

[0044] Profile update and learning

[0045] After each interaction, the server records new information in the profile, learning the user's preferences and behavioral patterns. This profile update plays a crucial role in providing more personalized services in future interactions.
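
One way to picture the profile update is a record that accumulates interaction history and counts request topics. This is a minimal sketch under stated assumptions: the `Profile` class, its fields, and the idea of counting topics as a proxy for "learning preferences" are illustrative choices, not the method of the disclosure.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical user profile holding history and simple preference counts."""
    user_id: str
    history: list = field(default_factory=list)          # (request, response) pairs
    topic_counts: Counter = field(default_factory=Counter)

    def record(self, request: str, response: str, topic: str) -> None:
        """Append one interaction and update the topic frequency."""
        self.history.append((request, response))
        self.topic_counts[topic] += 1

    def preferred_topics(self, n: int = 3) -> list:
        """Return the n most frequently requested topics."""
        return [topic for topic, _ in self.topic_counts.most_common(n)]
```

A later response generator could consult `preferred_topics()` to decide what to mention first; a real system would likely use a richer preference model.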

[0046] Specific example

[0047] For example, if a user says, "Tell me the weather for tomorrow," the device recognizes the voice, the server retrieves the weather information, and responds with a voice message saying, "It will be sunny tomorrow." Also, if a user asks, "Tell me my insurance card number," the server retrieves the relevant information from the user profile and notifies the user through the device.
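
The two requests above differ in where the answer lives: weather comes from an external source, while the insurance card number comes from the user profile. A minimal sketch of that decision follows, assuming a simple substring match against stored profile field names; the function name `choose_source` and the matching rule are hypothetical.

```python
def choose_source(request_text: str, profile_fields: dict) -> str:
    """Return "profile" when the request names a stored field, else "external"."""
    text = request_text.lower()
    for field_name in profile_fields:
        if field_name.lower() in text:
            return "profile"
    return "external"


# Placeholder value only; no real data is implied.
fields = {"insurance card number": "HYPOTHETICAL-0000"}
print(choose_source("Tell me my insurance card number", fields))
print(choose_source("Tell me the weather for tomorrow", fields))
```

In practice the routing would follow from the request analysis result rather than from raw substring matching.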

[0048] In this way, this system realizes a consistent information delivery flow starting with the user's voice input, providing convenient and efficient support to the user.

[0049] The following describes the processing flow.

[0050] Step 1:

[0051] The device detects the user's voice input and captures it through the built-in microphone. The voice data is converted into a digital format and sent to the voice recognition system.

[0052] Step 2:

[0053] The device uses a speech recognition system to convert speech data into text data. This makes the user's spoken content ready for processing as text information.

[0054] Step 3:

[0055] The device passes text data to a natural language processing engine, which analyzes the user's request. At this stage, keywords and intent are extracted, and the content of the request is identified.

[0056] Step 4:

[0057] Based on the identified request, the server queries internal databases and external sources to find the necessary information. For example, weather information and personal settings information may be referenced here.

[0058] Step 5:

[0059] The server generates a response based on the acquired data, tailored to the user's request. This response may be customized to take the user's profile information into account.

[0060] Step 6:

[0061] The server sends the generated response to the terminal. The terminal uses a speech synthesis system to convert the text response into speech and delivers the information to the user through its speaker.

[0062] Step 7:

[0063] The server records user interaction details in a profile and updates the information. This improves the accuracy and personalization of future responses.
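
Steps 1 to 7 above can be sketched as a single function that chains the stages. This is a minimal sketch, assuming each stage is supplied as a callable; the function name `handle_utterance` and all parameter names are hypothetical, and the split of work between terminal and server is deliberately left out.

```python
from typing import Callable


def handle_utterance(
    audio: bytes,                             # Step 1: captured voice data
    recognize: Callable[[bytes], str],        # Step 2: speech to text
    analyze: Callable[[str], dict],           # Step 3: request analysis
    retrieve: Callable[[dict, list], str],    # Step 4: lookup using request and history
    generate: Callable[[dict, str], str],     # Step 5: response text
    synthesize: Callable[[str], bytes],       # Step 6: text to speech
    history: list,                            # Step 7: interaction history in the profile
) -> bytes:
    """Run one voice interaction end to end and record it in the history."""
    text = recognize(audio)
    request = analyze(text)
    info = retrieve(request, history)
    response = generate(request, info)
    speech = synthesize(response)
    history.append({"request": text, "response": response})
    return speech
```

Passing the stages in as callables keeps the sketch independent of any particular speech recognition, language processing, or speech synthesis product.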

[0064] (Example 1)

[0065] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0066] In daily life, there is a need for systems that allow users to quickly and accurately obtain diverse information through voice input. However, conventional technologies have shortcomings in the process from voice recognition to information provision, resulting in delayed or inaccurate responses to user requests. Furthermore, it has been difficult to provide personalized information that reflects individual user preferences and past interactions.

[0067] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0068] In this invention, the server includes means for collecting auditory input from the user and converting it into text data, means for analyzing the text data using natural language processing technology and identifying the user's needs, and means for referring to information history and utilizing external information sources to obtain necessary information. This enables the user to quickly and accurately obtain personalized information and services starting from voice input.

[0069] A "user" refers to an individual or organization that uses a system to obtain information.

[0070] "Auditory input" refers to the process of supplying information or instructions using sound.

[0071] "Text data" refers to information in text format that has been converted based on speech or auditory input.

[0072] "Natural language processing technology" refers to the technology that enables computers to understand, analyze, and process human language.

[0073] "Needs" refer to the requests or information that users want to resolve through the system.

[0074] "Information history" refers to records of a user's past interactions and requests.

[0075] "External information sources" refer to information obtained from databases or the internet that exist outside the system.

[0076] "Reply" refers to the response generated by the system in response to a user's request.

[0077] "Notification" refers to the act of delivering a generated reply to the user via audio or other means.

[0078] A "profile" refers to a collection of information that records a user's preferences and past usage history.

[0079] This invention relates to an information provision system that receives everyday voice commands from a user, processes information based on those commands, and returns a voice response. The system functions through the cooperation of a mobile information terminal and a server.

[0080] The terminal first has an audio input device (e.g., a microphone) to acquire voice input from the user. This voice signal is converted into text data using internal speech recognition software (e.g., a speech recognition API).

[0081] Next, the device uses a natural language processing engine (e.g., a natural language processing library) to analyze the text data and identify the user's needs. This process allows the system to accurately understand the information and tasks the user is seeking.

[0082] Once user needs are identified, the server collects the necessary information. The server retrieves appropriate data by referencing the information history and utilizing external sources (e.g., information provision APIs). It can also generate personalized information based on the user's past interactions.

[0083] Based on the acquired information, the server generates a reply to the user and converts this reply into audio data. The audio generated using speech synthesis software (e.g., a speech synthesis engine) is then delivered to the user via the terminal.

[0084] This process allows users to quickly and accurately obtain information without relying on their vision. For example, if a user says, "Tell me about Italian restaurants nearby," the system analyzes the voice, collects appropriate restaurant information, and provides a voice response such as, "The nearest Italian restaurant is in front of the train station."

[0085] A concrete example of a prompt for a generative AI model is: "Generate a response to the user's question, 'When will it rain next?'" This allows the system to produce natural and contextually appropriate responses.
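
The prompt quoted above can be assembled from the recognized question with a small helper. This is a minimal sketch; the function name `build_prompt` and the optional "Reference information" line are assumptions added for illustration and are not part of the disclosure.

```python
def build_prompt(question: str, context: str = "") -> str:
    """Wrap the user's question in the instruction wording quoted in the text."""
    prompt = f"Generate a response to the user's question, '{question}'"
    if context:
        # Assumed extension: attach retrieved information for the model to use.
        prompt += f"\nReference information: {context}"
    return prompt


print(build_prompt("When will it rain next?"))
```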

[0086] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0087] Step 1:

[0088] The user speaks into the device's microphone and inputs their request by voice. This input is in the format of, for example, "Tell me today's news." The input data is a raw voice signal.

[0089] Step 2:

[0090] The terminal digitizes the audio signal acquired from the user. Speech recognition software is used to generate text data from this digital audio signal. In this step, features are extracted from the audio waveform and matched with known audio to convert them into accurate text. The input is an audio signal, and the output is text data.

[0091] Step 3:

[0092] The device passes the generated text data to a natural language processing engine. The engine analyzes this data to determine the user's intent. For example, if the word "news" is included, it determines that the user is seeking the latest news information. The input is text data, and the output is data that indicates the user's needs.

[0093] Step 4:

[0094] The server searches for relevant information based on the user's needs. Here, it retrieves the most suitable data from existing information history and external sources. For example, it might retrieve the latest news articles through a news API. The input is the user's needs, and the output is the retrieved information data.

[0095] Step 5:

[0096] The server generates a response to the user based on the acquired information data. Using a generative AI model, it creates a response in natural, human-like language. For example, "Today's top news is an article about economic growth." The input is information data, and the output is the response text.

[0097] Step 6:

[0098] The terminal processes the response text received from the server using a speech synthesis module and converts it into audio data. This audio data is played back to the user in real time. The input is the response text, and the output is the audio data.
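
Step 3 above determines intent from words such as "news". The following is a minimal sketch of keyword-based intent detection, assuming a hand-written keyword table and prefix matching on words; the table contents and the function name `detect_intent` are hypothetical, and a real natural language processing engine would be considerably more capable.

```python
import re

# Hypothetical keyword table; checked in insertion order.
INTENT_KEYWORDS = {
    "news": ["news", "headline"],
    "weather": ["weather", "rain", "sunny"],
    "restaurant": ["restaurant", "cafe"],
}


def detect_intent(text: str) -> str:
    """Return the first intent whose keyword is a prefix of some word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word.startswith(k) for word in words for k in keywords):
            return intent
    return "unknown"
```

Prefix matching lets "restaurants" match "restaurant" while avoiding a false hit such as "rain" inside "train".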

[0099] (Application Example 1)

[0100] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0101] The problem that this invention aims to solve is to enable users in modern society to quickly and accurately access diverse information and services through voice input. In particular, it aims to improve convenience and efficiency in urban life by utilizing real-time information about the user's living environment.

[0102] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0103] In this invention, the server includes means for acquiring voice input from the user and converting it into text data, means for analyzing the text data using natural language processing to identify the user's requests, and means for acquiring information about the user's living environment in real time and providing optimal services. This enables users to quickly access the information they need, and enables appropriate services to be provided based on that information.

[0104] "Voice input" is the process of acquiring information using the user's voice and converting it into a digital format.

[0105] "Natural language processing" is a technology that analyzes text data to understand the meaning of language and identify user requests.

[0106] "User requests" refer to the requests for information and services that users express through voice input.

[0107] "Living environment information" refers to real-time information related to the environment in which a user lives, including urban traffic conditions and the operational status of public services.

[0108] "Real-time" refers to a function that processes and responds to the current state of affairs on the spot.

[0109] "Response generation" is the process of assembling appropriate information to provide to the user based on the information acquired.

[0110] A "profile" is a data structure used to record a user's interaction history and preferences, and to utilize this information for future service provision.

[0111] This invention relates to a system in which a user provides voice input via a mobile device such as a smartphone, and a service is provided based on that input. This system is implemented in the following form.

[0112] First, the device is equipped with a microphone to acquire voice input from the user. This voice input is converted into text data using speech recognition software such as Google® Speech-to-Text API. The device then analyzes the converted text data using a natural language processing library such as NLTK or spaCy to identify the user's request.

[0113] Based on user requests, the server utilizes various open APIs to acquire information about the living environment, such as traffic information and the operational status of public services, while referencing historical data. Furthermore, it generates the optimal response for the user based on this information, and converts this response into speech using the Google Text-to-Speech engine for notification. This allows users to obtain the necessary information without relying on visual information.

[0114] For example, if a user asks by voice, "Tell me the shortest route to my workplace," the terminal recognizes and analyzes the voice, and the server obtains real-time traffic information. The server calculates the most efficient route and notifies the user of the result by voice.
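
The disclosure does not say how the most efficient route is calculated. As one possible illustration, the following minimal sketch uses Dijkstra's algorithm on a graph whose edge weights stand in for travel times adjusted by real-time traffic information; the function name, graph format, and all node names are assumptions.

```python
import heapq


def shortest_route(graph: dict, start: str, goal: str):
    """Return (total_minutes, path) using Dijkstra's algorithm.

    graph maps node -> {neighbor: travel_minutes}. Weights must be non-negative.
    Returns (inf, []) when the goal cannot be reached.
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, minutes in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical road graph with travel times in minutes.
roads = {
    "home": {"a": 10, "b": 4},
    "a": {"office": 5},
    "b": {"a": 3, "office": 15},
    "office": {},
}
print(shortest_route(roads, "home", "office"))
```

When traffic conditions change, only the edge weights need to be refreshed before the route is recomputed.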

[0115] An example of a prompt for a generative AI model is: "Based on the user's voice instructions, design an application that contributes to improving life in a smart city by considering a process to provide optimal information and services."

[0116] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0117] Step 1:

[0118] The device acquires the user's voice input via the microphone. This voice data is sent to the Google Speech-to-Text API and converted into text data. The input is voice data, and the output is text data, which is the transcription of that voice.

[0119] Step 2:

[0120] The terminal analyzes the acquired text data using natural language processing libraries such as NLTK and spaCy. This process identifies the user's request. The input is the text data obtained in step 1, and the output is structured data containing the user's request.

[0121] Step 3:

[0122] The server receives structured data and, based on user requests, calls APIs to retrieve lifestyle information such as traffic information and public service information. The input is the user's request, and the output is a dataset containing the necessary information related to that request.

[0123] Step 4:

[0124] The server generates a response to the user based on the information it has obtained. This response is designed to include the most relevant information to the user's question. The input is the dataset obtained in step 3, and the output is a text message to notify the user.

[0125] Step 5:

[0126] The server converts the generated text message into speech using the Google Text-to-Speech engine and sends it to the device. The input is a text message, and the output is audio data designed to convey information to the user without relying on vision.

[0127] Step 6:

[0128] The device notifies the user of the received audio data through its speaker. The input is the audio data obtained in step 5, and the output is the audio information that the user can hear.
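
Steps 2 to 4 above pass structured data from request analysis to API calls and then to message generation. The following is a minimal sketch of that hand-off, assuming fetchers are registered per intent; the class `StructuredRequest`, the intent names, and the message wording are hypothetical, and no real open API is called.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredRequest:
    """Hypothetical output of Step 2: the identified intent and its parameters."""
    intent: str                                # e.g. "traffic" (assumed label)
    slots: dict = field(default_factory=dict)  # e.g. {"destination": "workplace"}


def fetch_living_environment_info(request: StructuredRequest, fetchers: dict) -> dict:
    """Step 3: call the fetcher registered for the request's intent."""
    fetcher = fetchers.get(request.intent)
    if fetcher is None:
        return {"status": "unsupported", "intent": request.intent}
    return {"status": "ok", "intent": request.intent, "data": fetcher(request.slots)}


def compose_message(dataset: dict) -> str:
    """Step 4: turn the dataset into the text message to be spoken."""
    if dataset["status"] != "ok":
        return "Sorry, I could not find that information."
    return f"Here is the latest {dataset['intent']} information: {dataset['data']}"
```

Each entry in `fetchers` would wrap one external API, so adding a new kind of living environment information only requires registering another callable.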

[0129] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0130] This invention relates to an advanced AI system that operates within a mobile device and has the ability to recognize user emotions and adjust interactions accordingly. By acquiring user voice input and incorporating an emotion engine, it is possible to detect emotions from voice data and use the results to generate personalized responses.

[0131] User interface and speech recognition

[0132] The device receives the user's voice through the microphone and converts the audio signal into a digital format. The speech recognition function receives this and processes it as text data, making it possible to analyze the content of the user's speech.

[0133] Emotion recognition and natural language processing

[0134] The device inputs voice data into an emotion engine, which analyzes features such as tone, speed, and pauses to recognize the user's emotional state. Simultaneously, a natural language processing engine analyzes text data to identify the content of the request.

[0135] Appropriate information acquisition and response generation

[0136] Based on the results of emotion recognition and information obtained from request analysis, the server retrieves appropriate data from internal databases and external sources to address the user's request. In this process, the communication style is adjusted according to the emotion information, and the tone and content of the response are customized.

[0137] Providing responses and adjusting to emotions

[0138] The server uses the acquired and generated information to create a response to the user and sends it to the terminal. The terminal uses speech synthesis to convert this response into an audio format and transmits it to the user. Furthermore, adjustments are made to the response to make it more acceptable, depending on the user's emotional state.

[0139] Personalized responses and profile updates

[0140] The server records and updates a profile of the user's interaction and emotional history. This update allows for more personalized responses in the future, improving the user experience. In particular, it has the ability to learn recurring request patterns and predict user needs.

[0141] Specific example

[0142] For example, if a user says "I'm tired today" in a slightly downcast tone, the device recognizes that emotion as "fatigue" or "stress." The server takes this information into account and generates a gentle response such as, "Why don't you take a short break?" In contrast, if a user says "What should I do tomorrow?" in a very cheerful voice, the server responds in an energetic tone, "The weather looks nice, so I recommend going out."

[0143] In this way, this system combines speech recognition, natural language processing, and emotion recognition to achieve flexible, context-aware interactions, providing a more user-friendly and effective user experience.

[0144] The following describes the processing flow.

[0145] Step 1:

[0146] The device acquires voice from the user through the microphone and converts the voice signal into digital data. This prepares it for the voice recognition process.

[0147] Step 2:

[0148] The device uses a speech recognition engine to convert digital speech data into text data, making the user's speech content analyzable.

[0149] Step 3:

[0150] The device inputs voice data into its emotion engine, which analyzes the tone, speed, and volume of the voice to recognize the user's emotions. This result is output as an emotion label.

[0151] Step 4:

[0152] The device uses a natural language processing engine to analyze text data and identify the intent behind the user's request. Keywords and other important information are then extracted.

[0153] Step 5:

[0154] Based on the emotion label and the content of the request, the server refers to the user's profile and searches internal databases and external sources for appropriate information. It collects the relevant information and generates a response tailored to the user.

[0155] Step 6:

[0156] The server adjusts the generated response content, taking emotional information into account, and refines the response to suit the user's tone and voice.

[0157] Step 7:

[0158] The server sends the final response to the terminal, which then uses speech synthesis to convert the response into speech format. The terminal then provides the user with appropriate audio information through its speaker.

[0159] Step 8:

[0160] The server records the interaction results and the user's emotion data in a profile, which is then updated to improve the quality of future interactions. This results in more personalized responses in subsequent interactions.
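The eight steps above can be sketched as a single pipeline. This is a minimal illustration, not the disclosed implementation: every function name, the rule-based stand-ins for the speech recognition engine, the emotion engine, and the natural language processing engine, the dictionary that represents digitized audio, and the in-memory profile are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical per-user record kept by the server (Step 8)."""
    history: list = field(default_factory=list)


def recognize_speech(audio: dict) -> str:
    # Stand-in for Step 2: a real engine would decode the waveform.
    return audio["transcript"]


def recognize_emotion(audio: dict) -> str:
    # Stand-in for Step 3: label derived from a toy "tone" feature.
    return "fatigue" if audio["tone"] < 0 else "cheerful"


def analyze_request(text: str) -> str:
    # Stand-in for Step 4: crude intent detection from one keyword.
    return "plan" if "tomorrow" in text.lower() else "smalltalk"


def retrieve(intent: str, profile: Profile) -> str:
    # Stand-in for Step 5: a real server would query databases or
    # external sources, possibly guided by the profile (unused here).
    return "The weather looks nice." if intent == "plan" else ""


def generate_response(info: str, emotion: str) -> str:
    # Steps 5-6: compose a reply and adjust its tone to the emotion label.
    if emotion == "fatigue":
        return "Why don't you take a short break?"
    return f"{info} I recommend going out.".strip()


def handle_utterance(audio: dict, profile: Profile) -> str:
    """Run Steps 2-8 for one digitized utterance (Step 1 is assumed done)."""
    text = recognize_speech(audio)
    emotion = recognize_emotion(audio)
    intent = analyze_request(text)
    info = retrieve(intent, profile)
    reply = generate_response(info, emotion)
    # Step 8: record the interaction and the emotion in the profile.
    profile.history.append({"text": text, "emotion": emotion, "reply": reply})
    return reply  # Step 7: the terminal would synthesize this as speech
```

Calling `handle_utterance` repeatedly with the same `Profile` object accumulates the interaction and emotion history that later steps could draw on.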

[0161] (Example 2)

[0162] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0163] Conventional interactive systems have struggled to generate appropriate responses that reflect the user's emotional state, making it difficult to optimize the user experience. Therefore, there is a need for technology that accurately recognizes user emotions and customizes responses accordingly.

[0164] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0165] In this invention, the server includes means for extracting features from audio signals and recognizing emotional states, means for converting audio signals into text data, and means for analyzing text data using natural language processing techniques and identifying requests. This enables the generation of dynamic and personalized responses that correspond to the user's emotional state.

[0166] An "audio signal" is input information used to represent sound as digital data.

[0167] A "digital signal" is a data format obtained by sampling and quantizing an analog signal, making it possible to process and analyze it using a computer.

[0168] "Emotional state" refers to a psychological state judged based on factors such as the user's way of speaking and tone of voice.

[0169] "Speech synthesis technology" is a technology that outputs text data as speech.

[0170] "Natural language processing technology" is a technology that enables computers to understand and analyze human language.

[0171] A "profile" is a database that records a user's past interactions and preferences, and is used to generate personalized responses.

[0172] A "source of information" is an internal or external database or API that is accessed to provide the requested information.

[0173] This invention is an interactive system that recognizes user emotions and optimizes responses based on those emotions. To achieve this, the user's device is equipped with hardware that acquires audio signals through a microphone and converts them into digital signals. Digital signal processing technology is used for this conversion.

[0174] The user's device uses the converted digital signal to convert speech into text data via a speech recognition module. A common speech recognition API is used as the software for this process. This text data is then analyzed using natural language processing techniques to identify the user's request. BERT models or equivalent techniques are used for natural language processing.

[0175] In parallel, the device extracts features such as voice tone and speed from the audio signal and inputs them into the emotion recognition engine. Here, an emotion analysis API is used as the algorithm for determining emotion.

[0176] The server retrieves appropriate information from internal and external sources based on the analyzed user requests and emotions. This information retrieval involves querying databases and making API calls to external services. Based on the retrieved information, a generative AI model is used to generate personalized responses for the user.

[0177] The response generated by the server is sent back to the user's terminal and output as speech using speech synthesis software. This allows the user to receive the audio output from their terminal. This entire process is also personalized, with adjustments made based on the user's emotional state.

[0178] For example, if a user says "I'm tired today," the device recognizes this as an emotion, and the server can generate a gentle response such as "Why don't you take a short break?". If the user cheerfully says "What should I do tomorrow?", the server can generate a response along the lines of "The weather looks nice, so I recommend going out."

[0179] An example of a prompt might be: "Create a prompt for a system that takes user emotions into consideration and generates an appropriate response. Offer comfort to tired users and active suggestions to energetic users."
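One way such a prompt could be assembled at run time is sketched below. The function name `build_prompt`, the style table, and the line-based prompt layout are hypothetical and are not taken from this disclosure; the emotion labels follow the examples given above.

```python
# Hypothetical mapping from emotion label to a style instruction.
STYLE_BY_EMOTION = {
    "fatigue": "Offer comfort in a gentle, calm tone.",
    "cheerful": "Make active suggestions in an energetic tone.",
}
DEFAULT_STYLE = "Answer in a neutral, polite tone."


def build_prompt(request_text: str, emotion: str, retrieved_info: str = "") -> str:
    """Assemble a text prompt for a generative AI model (layout is assumed)."""
    style = STYLE_BY_EMOTION.get(emotion, DEFAULT_STYLE)
    lines = [
        "You are a voice assistant that takes the user's emotions into consideration.",
        f"Detected emotion: {emotion}",
        f"Style instruction: {style}",
        f"User request: {request_text}",
    ]
    # Retrieved information is optional; include it only when present.
    if retrieved_info:
        lines.append(f"Reference information: {retrieved_info}")
    lines.append("Respond in one or two short sentences.")
    return "\n".join(lines)


print(build_prompt("What should I do tomorrow?", "cheerful", "Sunny tomorrow"))
```

Keeping the style instruction in a table separates the emotion-dependent wording from the rest of the prompt, so new emotion labels can be added without changing the function.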

[0180] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0181] Step 1:

[0182] The device acquires the user's voice through a microphone. This voice signal is input as an analog signal. The device uses digital signal processing technology to convert this analog signal into a digital format. Specifically, an analog-to-digital converter (ADC) samples and quantizes the signal to output a digital signal.
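The sampling and quantization performed by an ADC can be illustrated in software. The sketch below is only a model of the idea: the function name, the 16 kHz sampling rate, the 16-bit depth, and the 440 Hz test tone are assumptions chosen for illustration, and a real terminal performs this conversion in hardware.

```python
import math


def sample_and_quantize(signal, duration_s: float, sample_rate_hz: int, bits: int = 16):
    """Sample a continuous signal and quantize each sample to a signed integer.

    `signal` is a function of time (seconds) returning an amplitude in [-1.0, 1.0].
    """
    max_code = 2 ** (bits - 1) - 1              # e.g. 32767 for 16 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # sampling: discrete time instants
        amplitude = max(-1.0, min(1.0, signal(t)))   # clip to the input range
        samples.append(round(amplitude * max_code))  # quantization to an integer code
    return samples


def tone_440(t: float) -> float:
    """Test input: a 440 Hz sine wave at half of full scale."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


# 10 ms of audio at 16 kHz yields 160 integer samples.
pcm = sample_and_quantize(tone_440, duration_s=0.01, sample_rate_hz=16000)
print(len(pcm))
```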

[0183] Step 2:

[0184] The digitized audio signal is sent to the terminal's speech recognition module, where it is converted into text data. The input is a digital signal, and the output is text data. Here, a speech recognition API is used, making the content of the speech parseable as a string of characters. Specifically, feature extraction and phoneme analysis are performed on the audio waveform.

[0185] Step 3:

[0186] The terminal passes text data to a natural language processing engine to identify the user's request. The input for this step is text data, and the output is semantic data representing the user's request. The BERT model is used for natural language processing, analyzing the request based on keywords and context contained in the text. Sentence structure analysis and intent interpretation are performed here.

[0187] Step 4:

[0188] Simultaneously, the device analyzes the tone and speed of the voice from the audio signal and inputs this information into the emotion recognition engine. The input to this process is the original audio signal, and the output is the emotional state. Acoustic features are extracted and classified into emotion categories. Specific operations include tone analysis and speed evaluation.
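As a rough illustration of classifying acoustic features into emotion categories, the sketch below compares an utterance with a per-user baseline. It is a toy rule-based stand-in, not the emotion recognition engine of this disclosure: the feature set, the thresholds, and the labels are assumptions, and the features are taken as already extracted from the audio signal.

```python
from dataclasses import dataclass


@dataclass
class AcousticFeatures:
    mean_pitch_hz: float      # tone
    syllables_per_sec: float  # speed
    pause_ratio: float        # fraction of the utterance that is silence


def classify_emotion(f: AcousticFeatures, baseline: AcousticFeatures) -> str:
    """Toy rule-based classifier; all thresholds are illustrative assumptions."""
    # Relative change against the user's usual way of speaking.
    pitch_delta = f.mean_pitch_hz / baseline.mean_pitch_hz - 1.0
    rate_delta = f.syllables_per_sec / baseline.syllables_per_sec - 1.0
    if pitch_delta < -0.10 and rate_delta < -0.10:
        return "fatigue"      # lower and slower than usual
    if pitch_delta > 0.10 and rate_delta > 0.10:
        return "cheerful"     # higher and faster than usual
    if f.pause_ratio > baseline.pause_ratio + 0.20:
        return "stress"       # markedly more pauses than usual
    return "neutral"


usual = AcousticFeatures(mean_pitch_hz=200.0, syllables_per_sec=4.0, pause_ratio=0.15)
print(classify_emotion(AcousticFeatures(170.0, 3.2, 0.30), usual))
```

Comparing against a baseline rather than fixed absolute values is one possible design choice; it lets the same rules apply to users whose normal pitch and speaking rate differ.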

[0189] Step 5:

[0190] The server receives the user's request and emotional state, and retrieves appropriate information from an internal database or external source based on that information. In this step, the request and emotional state are input, and the retrieved information is output. Queries are executed against the database or external API calls are made.

[0191] Step 6:

[0192] Based on the acquired information, the server uses a generative AI model to generate a response that matches the user's emotions. In this step, the acquired information is the input, and the response text data is the output. The generative AI model generates the response as text data and adjusts the content and language style.

[0193] Step 7:

[0194] The response generated by the server is sent to the terminal and output as speech using speech synthesis technology. The input for this step is the text data of the response, and the output is the generated speech data. The speech synthesis software generates a speech waveform, which is transmitted to the user through the speaker.

[0195] Step 8:

[0196] The server records responses and user interaction history and updates the profile. Here, the user's request history and emotional state are inputs, and the updated profile data is stored as output. Database operations update the profile, which is used to improve the accuracy of future responses.
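A profile update of this kind could be modeled as follows. The class `UserProfile`, its fields, and the frequency-based prediction rule are hypothetical; an actual server would persist the profile in a database rather than in memory.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical in-memory profile holding request and emotion history."""
    request_counts: Counter = field(default_factory=Counter)
    emotion_history: list = field(default_factory=list)

    def record(self, intent: str, emotion: str) -> None:
        # Update the profile after one interaction.
        self.request_counts[intent] += 1
        self.emotion_history.append(emotion)

    def likely_next_request(self, min_count: int = 2):
        """Return the most frequent intent if it recurred at least min_count times."""
        if not self.request_counts:
            return None
        intent, count = self.request_counts.most_common(1)[0]
        return intent if count >= min_count else None


profile = UserProfile()
profile.record("weather", "cheerful")
profile.record("news", "neutral")
profile.record("weather", "fatigue")
print(profile.likely_next_request())
```

Counting intents is the simplest way to capture a recurring request pattern; a deployed system might also weight recent interactions more heavily.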

[0197] (Application Example 2)

[0198] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0199] In care settings, there is a need for a system that can accurately grasp the emotional state of users and provide information to care staff in real time, thereby enabling care tailored to individual needs. However, conventional methods have the challenge of not being able to quickly capture changes in emotions and effectively utilize each user's emotional history.

[0200] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0201] In this invention, the server includes means for recognizing the user's emotional state and adjusting responses based on emotional information, means for referring to past information and utilizing external information sources to obtain necessary data, and means for recording and updating the user's interaction history as a profile. This makes it possible to grasp the user's emotions in real time and to propose specific care to care staff that is appropriate to the user's condition.

[0202] An "audio signal" is an electrical representation of sound, converting human voices and other sounds into data that can be processed in analog or digital format.

[0203] "Digital data" refers to audio signals and other information structured in binary representation, converted into a format that can be processed by a computer.

[0204] "Natural language processing" is a field of technology that enables computers to analyze, understand, and generate language that humans use on a daily basis.

[0205] "User needs" refer to the demands and desires that users have for the system, and these are the things the system should address.

[0206] "Emotional state" refers to the psychological and emotional state of a user, inferred from factors such as tone of voice, speed, and pauses.

[0207] An "external information source" is an information source used to obtain necessary information from databases, online resources, and other sources that exist outside the system.

[0208] A "profile" is a collection of user-specific information created based on a user's past interaction history and emotional state.

[0209] The system based on this invention recognizes the user's emotional state in real time and provides that information to care staff, thereby enabling care tailored to the individual needs of each user in the care setting.

[0210] Hardware and software configuration

[0211] The terminal is a smart device equipped with a high-performance microphone. It converts the voice signal into digital data and then converts the voice data into text using the Google Speech-to-Text API. It recognizes emotional states using the Emotion AI SDK and performs natural language processing using the Google NLP API.

[0212] Data processing and calculations

[0213] The server analyzes the acquired voice data to assess the user's emotional state. Based on the emotional information, it adjusts its response and provides information to care staff. It also creates a profile based on past data to predict the user's needs. It retrieves necessary data by referring to external sources and generates the most appropriate response for the user.

[0214] Specific example

[0215] For example, if a user says in a weak voice, "I'm not feeling well today," the device recognizes this emotion as "unpleasant," and the server instructs the care staff to "prepare for a medical check." Furthermore, based on the user's emotional history, the system can suggest a meal "suitable for their physical condition" for their next visit.

[0216] Example of a prompt

[0217] "Analyze the user's emotions and propose the most suitable care method."

[0218] "Track comments about health conditions and guide actions to encourage medical attention."

[0219] The aim of these technologies is to enable more flexible and effective care delivery in nursing care settings.

[0220] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0221] Step 1:

[0222] The terminal acquires the user's speech as an audio signal through a high-performance microphone. The input is the user's speech, and the output is audio data in digital format. The audio signal is acquired as digital data and prepared as input data for the next step.

[0223] Step 2:

[0224] The device uses the Google Speech-to-Text API to convert the audio signal into text data. In this step, the input is the digital audio data obtained in the previous step, and the output is text data. The resulting text data represents the content of the user's speech and is passed on for emotion analysis.

[0225] Step 3:

[0226] The device inputs the text data and voice tone information into the Emotion AI SDK, which analyzes the user's emotional state. The output is emotional information, which is labeled as "positive," "negative," "neutral," etc. This emotional information is then used in the next step.

[0227] Step 4:

[0228] The server generates an appropriate response based on the emotional information obtained in the previous step. In this step, the emotional information and the user's profile are taken as input, and the output is text data of the appropriately adjusted response. By utilizing the profile data, a customized response is prepared that takes past history into account.

[0229] Step 5:

[0230] The server performs speech synthesis to provide the generated response to the care staff, converting it into an audio format and sending it to the terminal. The input is the text response generated in the previous step, and the output is the response in audio format. This makes it possible to provide care staff with quick and accurate instructions in real time.

[0231] Step 6:

[0232] The server records the user's interaction history as a profile and updates it for use in generating responses in the future. The input is the emotional information and response history up to the previous step, and the output is the updated user profile. This improves the accuracy of continuous care for the user.
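How emotion labels and emotional history might be turned into instructions for care staff can be sketched as follows. The class `CareMonitor`, the instruction texts, the window size, and the escalation rule are all assumptions made for illustration; the labels follow the "positive," "negative," and "neutral" categories mentioned in Step 3.

```python
from collections import deque

# Hypothetical instructions for care staff, keyed by emotion label.
STAFF_ACTIONS = {
    "negative": "Check on the user and prepare for a medical check.",
    "neutral": "No action needed.",
    "positive": "No action needed.",
}


class CareMonitor:
    """Tracks recent emotion labels for one user and suggests staff actions."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)  # most recent emotion labels only

    def observe(self, label: str) -> str:
        """Record a label and return an instruction for care staff."""
        self.recent.append(label)
        # Escalate only when the whole window is filled with negative labels.
        persistent = (
            len(self.recent) == self.recent.maxlen
            and all(x == "negative" for x in self.recent)
        )
        if persistent:
            return "Escalate: negative state has persisted; " + STAFF_ACTIONS["negative"]
        return STAFF_ACTIONS.get(label, "No action needed.")


monitor = CareMonitor(window=2)
for label in ["neutral", "negative", "negative"]:
    print(monitor.observe(label))
```

Using a bounded window means a single negative utterance triggers an ordinary check, while a sustained negative state is flagged separately.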

[0233] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0234] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0235] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0236] [Second Embodiment]

[0237] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0238] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0239] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0240] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0241] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0242] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0243] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0244] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0245] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0246] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0247] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0248] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0249] This invention provides a system for coordinating diverse information provision and services based on voice input in mobile information terminals such as smartphones that users use on a daily basis. The aim of this system is to improve the user experience by smoothly coordinating a series of processes including speech recognition, natural language processing, data retrieval, information generation, and response.

[0250] User interface and speech recognition

[0251] The device is equipped with a microphone to receive voice commands from the user, and the acquired voice is converted into a digital signal. This signal is processed by a voice recognition module and converted into text data. Voice recognition is performed in real time, allowing the user to give commands in a natural way.

[0252] Natural Language Processing and Request Analysis

[0253] The device passes text data to a natural language processing engine to analyze the user's request. This identifies the information and tasks the user is seeking, allowing for efficient progress to the next processing step. This analysis is performed using advanced algorithms, enabling accurate semantic understanding.

[0254] Data retrieval and information generation

[0255] After a user's request is identified, the server collects data appropriate to that request from internal and external sources. This involves referencing the internet and specific databases as needed to obtain real-time and accurate information. The server can also refer to the user's past interaction history to generate individually customized information.

[0256] Response generation and user notifications

[0257] The collected and analyzed information is assembled into a user-specific response. The server formats this response, and the terminal uses a speech synthesis module to notify the user verbally. This allows users to obtain appropriate information without relying on visual cues.

[0258] Profile update and learning

[0259] After each interaction, the server records new information in the profile, learning the user's preferences and behavioral patterns. This profile update plays a crucial role in providing more personalized services in future interactions.

[0260] Specific example

[0261] For example, if a user says, "Tell me the weather for tomorrow," the device recognizes the voice, the server retrieves the weather information, and responds with a voice message saying, "It will be sunny tomorrow." Also, if a user asks, "Tell me my insurance card number," the server retrieves the relevant information from the user profile and notifies the user through the device.

[0262] In this way, this system realizes a consistent information delivery flow starting with the user's voice input, providing convenient and efficient support to the user.

[0263] The following describes the processing flow.

[0264] Step 1:

[0265] The device detects the user's voice input and captures it through the built-in microphone. The voice data is converted into a digital format and sent to the voice recognition system.

[0266] Step 2:

[0267] The device uses a speech recognition system to convert speech data into text data. This makes the user's spoken content ready for processing as text information.

[0268] Step 3:

[0269] The device passes text data to a natural language processing engine, which analyzes the user's request. At this stage, keywords and intent are extracted, and the content of the request is identified.

[0270] Step 4:

[0271] Based on the specified request, the server queries internal databases and external sources to find the necessary information. For example, weather information and personal settings information may be referenced here.

[0272] Step 5:

[0273] The server generates a response based on the acquired data, tailored to the user's request. This response may be customized to take the user's profile information into account.

[0274] Step 6:

[0275] The server sends the generated response to the terminal. The terminal uses a speech synthesis system to convert the text response into speech and delivers the information to the user through its speaker.

[0276] Step 7:

[0277] The server records user interaction details in a profile and updates the information. This improves the accuracy and personalization of future responses.

[0278] (Example 1)

[0279] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0280] In daily life, there is a need for systems that allow users to quickly and accurately obtain diverse information through voice input. However, conventional technologies have shortcomings in the process from voice recognition to information provision, resulting in delayed or inaccurate responses to user requests. Furthermore, it has been difficult to provide personalized information that reflects individual user preferences and past interactions.

[0281] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0282] In this invention, the server includes means for collecting auditory input from a user and converting it into character data, means for analyzing the character data using natural language processing technology to identify the user's needs, and means for referring to the information history and obtaining necessary information by utilizing external information sources. As a result, it becomes possible for the user to quickly and accurately obtain personalized information and services starting from voice input.

[0283] "User" refers to an individual or group that acquires information using the system.

[0284] "Auditory input" refers to a procedure for supplying information or instructions using voice.

[0285] "Character data" refers to information in text format converted based on voice or auditory input.

[0286] "Natural language processing technology" refers to technology for a computer to understand, analyze, and process human language.

[0287] "Needs" refers to the problems a user wants solved, or the information a user requires, through the system.

[0288] "Information history" refers to a record regarding a user's past interactions and requests.

[0289] "External information source" refers to a source of information outside the system, such as an external database or the Internet.

[0290] "Reply" refers to the response content generated by the system for a user's request.

[0291] "Notification" refers to the act of delivering the generated reply to the user by voice or the like.

[0292] A "profile" refers to a collection of information that records a user's preferences and past usage history.

[0293] This invention relates to an information provision system that receives everyday voice commands from a user, processes information based on those commands, and returns a voice response. The system functions through the cooperation of a mobile information terminal and a server.

[0294] The terminal first has an audio input device (e.g., a microphone) to acquire voice input from the user. This voice signal is converted into text data using internal speech recognition software (e.g., a speech recognition API).

[0295] Next, the device uses a natural language processing engine (e.g., a natural language processing library) to analyze the text data and identify the user's needs. This process allows the system to accurately understand the information and tasks the user is seeking.

[0296] Once user needs are identified, the server collects the necessary information. The server retrieves appropriate data by referencing the information history and utilizing external sources (e.g., information provision APIs). It can also generate personalized information based on the user's past interactions.

[0297] Based on the acquired information, the server generates a reply to the user and converts this reply into audio data. The audio generated using speech synthesis software (e.g., a speech synthesis engine) is then delivered to the user via the terminal.

[0298] This process allows users to quickly and accurately obtain information without relying on their vision. For example, if a user says, "Tell me about Italian restaurants nearby," the system analyzes the voice, collects appropriate restaurant information, and provides a voice response such as, "The nearest Italian restaurant is in front of the train station."

[0299] An example of a specific prompt for the generative AI model is: "Please generate a response when the user asks 'When will it rain next?'" With such a prompt, the system can produce a natural and context-appropriate response.

[0300] The flow of the specific process in Example 1 will be described using Figure 11.

[0301] Step 1:

[0302] The user faces the microphone of the terminal and inputs their request by voice. This input is made, for example, in the form of "Tell me today's news". The input data is a raw voice signal.

[0303] Step 2:

[0304] The terminal digitizes the voice signal acquired from the user. Using speech recognition software, character data is generated from this digital voice signal. In this step, feature quantities are extracted from the voice waveform and compared with known voices to convert them into accurate text. The input is a voice signal, and the output is character data.

[0305] Step 3:

[0306] The terminal passes the generated character data to the natural language processing engine. The engine analyzes this data to identify the user's intention. For example, if the word "news" is included, it is determined that the latest news information is being sought. The input is character data, and the output is data indicating the user's needs.
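The keyword-based part of this step can be sketched as follows. The keyword table and the function name `identify_need` are hypothetical, and prefix matching on keywords is a simplification; a practical natural language processing engine would rely on a trained model rather than a fixed table.

```python
import re

# Hypothetical keyword table; a real system would use a trained model.
INTENT_KEYWORDS = {
    "news": ["news", "headline"],
    "weather": ["weather", "rain", "sunny"],
    "restaurant": ["restaurant", "lunch", "dinner"],
}


def identify_need(text: str) -> str:
    """Return the first intent whose keyword prefixes a word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    for intent, keywords in INTENT_KEYWORDS.items():
        # Prefix matching lets "restaurants" match the keyword "restaurant".
        if any(w.startswith(k) for w in words for k in keywords):
            return intent
    return "unknown"


print(identify_need("Tell me today's news"))
```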

[0307] Step 4:

[0308] The server searches for relevant information based on the user's needs. Here, it retrieves the most suitable data from existing information history and external sources. For example, it might retrieve the latest news articles through a news API. The input is the user's needs, and the output is the retrieved information data.
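The retrieval described in this step might be organized as a lookup in the stored information history followed by a call to an external source. In the sketch below, the function name, the dictionary-based history, the callables standing in for external APIs, and the history-first ordering are all assumptions; no real news API is called.

```python
from typing import Callable, Dict, Optional


def retrieve_information(
    need: str,
    history: Dict[str, str],
    external_sources: Dict[str, Callable[[], str]],
) -> Optional[str]:
    """Look in the stored information history first, then call an external source."""
    if need in history:
        return history[need]
    fetch = external_sources.get(need)
    if fetch is None:
        return None  # no source is registered for this need
    try:
        return fetch()
    except Exception:
        # An unreachable source yields no data rather than stopping the flow.
        return None


# Hypothetical data: a stored personal item and a stubbed news source.
history = {"insurance_card_number": "stored value"}
sources = {"news": lambda: "Today's top news is an article about economic growth."}
print(retrieve_information("news", history, sources))
```

Returning `None` when nothing is found leaves it to the response generation step to decide how to tell the user that no information is available.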

[0309] Step 5:

[0310] The server generates a response to the user based on the acquired information data. Using a generative AI model, it creates a response in natural, human-like language. For example, "Today's top news is an article about economic growth." The input is information data, and the output is the response text.

[0311] Step 6:

[0312] The terminal processes the response text received from the server using a speech synthesis module and converts it into audio data. This audio data is played back to the user in real time. The input is the response text, and the output is the audio data.
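The six steps above can be read as a chain of stages in which each stage's output is the next stage's input. The following is a minimal sketch of that chain, not the implementation of this disclosure. The class name `Pipeline` and all of the stand-in functions are hypothetical; real speech recognition, retrieval, generation, and speech synthesis would replace them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains Steps 2 to 6; every stage is an injected callable (assumption)."""
    recognize: Callable[[bytes], str]        # Step 2: voice signal -> character data
    identify: Callable[[str], str]           # Step 3: character data -> user need
    retrieve: Callable[[str], list[str]]     # Step 4: user need -> information data
    generate: Callable[[list[str]], str]     # Step 5: information data -> response text
    synthesize: Callable[[str], bytes]       # Step 6: response text -> audio data

    def run(self, voice_signal: bytes) -> bytes:
        text = self.recognize(voice_signal)
        need = self.identify(text)
        info = self.retrieve(need)
        response = self.generate(info)
        return self.synthesize(response)


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_recognize(signal: bytes) -> str:
    return signal.decode("utf-8")  # pretend the signal already carries its transcript


def keyword_identify(text: str) -> str:
    return "news" if "news" in text.lower() else "unknown"


def fake_retrieve(need: str) -> list[str]:
    articles = {"news": ["an article about economic growth"]}
    return articles.get(need, [])


def template_generate(info: list[str]) -> str:
    if not info:
        return "Sorry, I could not find anything."
    return f"Today's top news is {info[0]}."


def fake_synthesize(response: str) -> bytes:
    return response.encode("utf-8")  # placeholder for real audio data


pipeline = Pipeline(fake_recognize, keyword_identify, fake_retrieve,
                    template_generate, fake_synthesize)
audio = pipeline.run(b"Tell me today's news")
```

Passing the stages in as callables keeps the terminal-side steps (2, 3, 6) and the server-side steps (4, 5) replaceable independently, which matches the division of roles described above.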

[0313] (Application Example 1)

[0314] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0315] The problem that this invention aims to solve is to enable users in modern society to quickly and accurately access diverse information and services through voice input. In particular, it aims to improve convenience and efficiency in urban life by utilizing users' real-time information about their living environment.

[0316] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0317] In this invention, the server includes means for acquiring voice input from the user and converting it into text data, means for analyzing the text data using natural language processing to identify the user's requests, and means for acquiring information about the user's living environment in real time and providing optimal services. This enables users to quickly access the information they need and enables the system to provide appropriate services based on that information.

[0318] "Voice input" is the process of acquiring information using the user's voice and converting it into a digital format.

[0319] "Natural language processing" is a technology that analyzes text data to understand the meaning of language and identify user requests.

[0320] "User requests" refer to the requests for information and services that users express through voice input.

[0321] "Living environment information" refers to real-time information related to the environment in which a user lives, including urban traffic conditions and the operational status of public services.

[0322] "Real-time" refers to a function that processes and responds to the current state of affairs on the spot.

[0323] "Response generation" is the process of assembling appropriate information to provide to the user based on the information acquired.

[0324] A "profile" is a data structure used to record a user's interaction history and preferences, and to utilize this information for future service provision.

[0325] This invention relates to a system in which a user provides voice input via a mobile device such as a smartphone, and a service is provided based on that input. This system is implemented in the following form.

[0326] First, the device is equipped with a microphone to acquire voice input from the user. This voice input is converted into text data using speech recognition software such as the Google Speech-to-Text API. The device then parses the converted text data using a natural language processing library such as NLTK or spaCy to identify the user's request.

[0327] Based on user requests, the server utilizes various open APIs to acquire information about the living environment, such as traffic information and the operational status of public services, while referencing historical data. Furthermore, it generates the optimal response for the user based on this information, and converts this response into speech using the Google Text-to-Speech engine for notification. This allows users to obtain the necessary information without relying on visual information.

[0328] For example, if a user asks by voice, "Tell me the shortest route to my workplace," the terminal recognizes and analyzes the voice, and the server obtains real-time traffic information. The server calculates the most efficient route and notifies the user of the result by voice.

[0329] An example of a prompt for a generative AI model is: "Based on the user's voice instructions, design an application that contributes to improving life in a smart city by considering a process to provide optimal information and services."

[0330] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0331] Step 1:

[0332] The device acquires the user's voice input via the microphone. This voice data is sent to the Google Speech-to-Text API and converted into text data. The input is voice data, and the output is text data, which is the transcription of that voice.

[0333] Step 2:

[0334] The terminal analyzes the acquired text data using natural language processing libraries such as NLTK and spaCy. This process identifies the user's request. The input is the text data obtained in step 1, and the output is structured data containing the user's request.

[0335] Step 3:

[0336] The server receives structured data and, based on user requests, calls APIs to retrieve lifestyle information such as traffic information and public service information. The input is the user's request, and the output is a dataset containing the necessary information related to that request.

[0337] Step 4:

[0338] The server generates a response to the user based on the information it has obtained. This response is designed to include the most relevant information to the user's question. The input is the dataset obtained in step 3, and the output is a text message to notify the user.

[0339] Step 5:

[0340] The server converts the generated text message into speech using the Google Text-to-Speech engine and sends it to the device. The input is a text message, and the output is audio data that conveys the information to the user without relying on vision.

[0341] Step 6:

[0342] The device notifies the user of the received audio data through its speaker. The input is the audio data obtained in step 5, and the output is the audio information that the user can hear.
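Steps 2 to 4 above turn text into structured data, select an information source according to the request, and compose a text message. The following sketch illustrates one possible shape of that dispatch. The intent names, the dictionary layout, and the fetcher functions are all hypothetical; the keyword matching stands in for a natural language processing library and no real open API is called.

```python
from typing import Callable


# Hypothetical fetchers standing in for open-API calls (Step 3).
def fetch_traffic(params: dict) -> dict:
    destination = params.get("destination", "destination")
    return {"kind": "traffic",
            "summary": f"Light traffic on the route to your {destination}"}


def fetch_public_service(params: dict) -> dict:
    return {"kind": "public_service",
            "summary": "All services are operating normally"}


HANDLERS: dict[str, Callable[[dict], dict]] = {
    "traffic": fetch_traffic,
    "public_service": fetch_public_service,
}


def parse_request(text: str) -> dict:
    """Step 2 stand-in: naive keyword matching, not a real NLP library."""
    lowered = text.lower()
    if "route" in lowered or "traffic" in lowered:
        destination = "workplace" if "workplace" in lowered else "destination"
        return {"intent": "traffic", "params": {"destination": destination}}
    if "service" in lowered:
        return {"intent": "public_service", "params": {}}
    return {"intent": "unknown", "params": {}}


def handle(text: str) -> str:
    """Steps 2 to 4: structured request -> dataset -> text message."""
    request = parse_request(text)
    handler = HANDLERS.get(request["intent"])
    if handler is None:
        return "Sorry, I did not understand the request."
    dataset = handler(request["params"])
    return dataset["summary"] + "."


message = handle("Tell me the shortest route to my workplace")
```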

[0343] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0344] This invention relates to an advanced AI system that operates within a mobile device and has the ability to recognize user emotions and adjust interactions accordingly. By acquiring user voice input and incorporating an emotion engine, it is possible to detect emotions from voice data and use the results to generate personalized responses.

[0345] User interface and speech recognition

[0346] The device receives the user's voice through the microphone and converts the audio signal into a digital format. The speech recognition function receives this and processes it as text data, making it possible to analyze the content of the user's speech.

[0347] Emotion recognition and natural language processing

[0348] The device inputs voice data into an emotion engine, which analyzes features such as tone, speed, and pauses to recognize the user's emotional state. Simultaneously, a natural language processing engine analyzes text data to identify the content of the request.

[0349] Appropriate information acquisition and response generation

[0350] Based on the results of emotion recognition and information obtained from request analysis, the server retrieves appropriate data from internal databases and external sources to address the user's request. In this process, the communication style is adjusted according to the emotion information, and the tone and content of the response are customized.

[0351] Providing responses and adjusting to emotions

[0352] The server uses the acquired and generated information to create a response to the user and sends it to the terminal. The terminal uses speech synthesis to convert this response into an audio format and transmits it to the user. Furthermore, adjustments are made to the response to make it more acceptable, depending on the user's emotional state.

[0353] Personalized responses and profile updates

[0354] The server records and updates a profile of the user's interaction and emotional history. This update allows for more personalized responses in the future, improving the user experience. In particular, it has the ability to learn recurring request patterns and predict user needs.

[0355] Specific example

[0356] For example, if a user says "I'm tired today" in a slightly downcast tone, the device recognizes that emotion as "fatigue" or "stress." The server takes this information into account and generates a gentle response such as, "Why don't you take a short break?" In contrast, if a user says "What should I do tomorrow?" in a very cheerful voice, the server responds in an energetic tone, "The weather looks nice, so I recommend going out."

[0357] In this way, this system combines speech recognition, natural language processing, and emotion recognition to achieve flexible, context-aware interactions, providing a more user-friendly and effective user experience.
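As an illustration of how an emotion label might be derived from voice features and then used to pick a response style, consider the sketch below. It is a toy under stated assumptions: the feature names, the thresholds, and the label-to-style table are invented for illustration, and an actual emotion engine would typically be a trained model rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_hz: float          # average pitch, used here as a proxy for "tone"
    words_per_second: float  # speaking speed
    pause_ratio: float       # fraction of the utterance that is silence


def estimate_emotion(f: VoiceFeatures) -> str:
    """Toy rules with arbitrary thresholds (assumption, not a real engine)."""
    if f.words_per_second < 2.0 and f.pause_ratio > 0.3:
        return "fatigue"
    if f.pitch_hz > 220.0 and f.words_per_second > 3.0:
        return "cheerful"
    return "neutral"


# Hypothetical mapping from an emotion label to a response style.
STYLE_BY_EMOTION = {
    "fatigue": {"tone": "gentle",
                "text": "Why don't you take a short break?"},
    "cheerful": {"tone": "energetic",
                 "text": "The weather looks nice, so I recommend going out."},
}
DEFAULT_STYLE = {"tone": "neutral", "text": "How can I help you?"}


def adjust_response(emotion_label: str) -> dict:
    """Pick the tone and wording that match the estimated emotion."""
    return dict(STYLE_BY_EMOTION.get(emotion_label, DEFAULT_STYLE))


label = estimate_emotion(VoiceFeatures(pitch_hz=150.0,
                                       words_per_second=1.5,
                                       pause_ratio=0.4))
reply = adjust_response(label)
```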

[0358] The following describes the processing flow.

[0359] Step 1:

[0360] The device acquires voice from the user through the microphone and converts the voice signal into digital data. This prepares it for the voice recognition process.

[0361] Step 2:

[0362] The device uses a speech recognition engine to convert digital speech data into text data, making the user's speech content analyzable.

[0363] Step 3:

[0364] The device inputs voice data into its emotion engine, which analyzes the tone, speed, and volume of the voice to recognize the user's emotions. This result is output as an emotion label.

[0365] Step 4:

[0366] The device uses a natural language processing engine to analyze text data and identify the intent behind the user's request. Key keywords and important information are then extracted.

[0367] Step 5:

[0368] Based on the emotion label and the request content, the server refers to the user's profile and searches internal databases and external sources for appropriate information. It collects relevant information and generates a response tailored to the user.

[0369] Step 6:

[0370] The server adjusts the generated response content, taking emotional information into account, and refines the response to suit the user's tone and voice.

[0371] Step 7:

[0372] The server sends the final response to the terminal, which then uses speech synthesis to convert the response into speech format. The terminal then provides the user with appropriate audio information through its speaker.

[0373] Step 8:

[0374] The server records the interaction results and the user's emotion data in a profile, which is then updated to improve the quality of future interactions. This results in more personalized responses in subsequent interactions.
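The profile update in Step 8, together with the learning of recurring request patterns mentioned earlier, could be sketched as follows. The `Profile` class, its fields, and the "predict the most frequent intent" rule are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Profile:
    """Hypothetical per-user record of interactions and emotions."""
    interactions: list[dict] = field(default_factory=list)
    intent_counts: Counter = field(default_factory=Counter)

    def record(self, intent: str, emotion: str, response: str) -> None:
        """Step 8: append the interaction and update the request statistics."""
        self.interactions.append(
            {"intent": intent, "emotion": emotion, "response": response})
        self.intent_counts[intent] += 1

    def predicted_need(self, min_count: int = 2) -> Optional[str]:
        """Return the most frequent intent once it has recurred, else None."""
        if not self.intent_counts:
            return None
        intent, count = self.intent_counts.most_common(1)[0]
        return intent if count >= min_count else None


profile = Profile()
profile.record("weather", "neutral", "It will be sunny tomorrow.")
profile.record("weather", "cheerful", "The weather looks nice.")
profile.record("news", "neutral", "Today's top news is ...")
```

With the three interactions recorded above, `profile.predicted_need()` returns `"weather"`, because that intent has occurred at least twice.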

[0375] (Example 2)

[0376] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0377] Conventional interactive systems have struggled to generate appropriate responses that reflect the user's emotional state, making it difficult to optimize the user experience. Therefore, there is a need for technology that accurately recognizes user emotions and customizes responses accordingly.

[0378] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0379] In this invention, the server includes means for extracting features from audio signals and recognizing emotional states, means for converting audio signals into text data, and means for analyzing text data using natural language processing techniques and identifying requests. This enables the generation of dynamic and personalized responses that correspond to the user's emotional state.

[0380] An "audio signal" is input information used to represent sound as digital data.

[0381] A "digital signal" is a data format that quantifies an analog signal, making it possible to process and analyze it using a computer.

[0382] "Emotional state" refers to a psychological state judged based on factors such as the user's way of speaking and tone of voice.

[0383] "Speech synthesis technology" is a technology that outputs text data as speech.

[0384] "Natural language processing technology" is a technology that enables computers to understand and analyze human language.

[0385] A "profile" is a database that records a user's past interactions and preferences, and is used to generate personalized responses.

[0386] A "source of information" is an internal or external database or API that is accessed to provide the requested information.

[0387] This invention is an interactive system that recognizes user emotions and optimizes responses based on those emotions. To achieve this, the user's device is equipped with hardware that acquires audio signals through a microphone and converts them into digital signals. Digital signal processing technology is used for this conversion.
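The conversion from an analog audio signal to a digital signal amounts to sampling the signal at regular intervals and quantizing each sample to a finite number of levels. The sketch below illustrates this with a synthetic tone. The function name, the 16 kHz sample rate, and the 16-bit depth are illustrative assumptions and are not values taken from this disclosure.

```python
import math
from typing import Callable


def sample_and_quantize(signal: Callable[[float], float], duration_s: float,
                        sample_rate_hz: int, bits: int) -> list[int]:
    """Sample signal(t), expected in [-1, 1], and quantize to signed integers."""
    max_level = 2 ** (bits - 1) - 1              # e.g. 32767 for 16 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # sampling instant in seconds
        value = max(-1.0, min(1.0, signal(t)))   # clip to the input range
        samples.append(round(value * max_level)) # quantize to an integer level
    return samples


def tone(t: float) -> float:
    """A 440 Hz sine wave standing in for the microphone's analog output."""
    return math.sin(2 * math.pi * 440.0 * t)


pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=16000, bits=16)
```

Here 10 ms of signal at 16 kHz yields 160 integer samples, each between -32767 and 32767.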

[0388] The user's device uses the converted digital signal to convert speech into text data via a speech recognition module. A common speech recognition API is used as the software for this process. This text data is then analyzed using natural language processing techniques to identify the user's request. BERT models or equivalent techniques are used for natural language processing.

[0389] In parallel, the device extracts features such as voice tone and speed from the audio signal and inputs them into the emotion recognition engine. Here, an emotion analysis API is used as the algorithm for determining emotion.

[0390] The server retrieves appropriate information from internal and external sources based on the analyzed user requests and emotions. This information retrieval involves querying databases and making API calls to external services. Based on the retrieved information, a generative AI model is used to generate personalized responses for the user.

[0391] The response generated by the server is sent back to the user's terminal and output as speech using speech synthesis software. This allows the user to receive the audio output from their terminal. This entire process is also personalized, with adjustments made based on the user's emotional state.

[0392] For example, if a user says "I'm tired today," the device recognizes this as an emotion, and the server can generate a gentle response such as "Why don't you take a short break?". If the user cheerfully says "What should I do tomorrow?", the server can generate a response along the lines of "The weather looks nice, so I recommend going out."

[0393] An example of a prompt might be: "Create a prompt for a system that takes user emotions into consideration and generates an appropriate response. Offer comfort to tired users and active suggestions to energetic users."

[0394] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0395] Step 1:

[0396] The device acquires the user's voice through a microphone. This voice signal is input as an analog signal. The device uses digital signal processing technology to convert this analog signal into a digital format. Specifically, an analog-to-digital converter (ADC) samples and quantizes the signal to output a digital signal.

[0397] Step 2:

[0398] The digitized audio signal is sent to the terminal's speech recognition module, where it is converted into text data. The input is a digital signal, and the output is text data. Here, a speech recognition API is used, making the content of the speech parseable as a string of characters. Specifically, feature extraction and phoneme analysis are performed on the audio waveform.

[0399] Step 3:

[0400] The terminal passes text data to a natural language processing engine to identify the user's request. The input for this step is text data, and the output is semantic data representing the user's request. The BERT model is used for natural language processing, analyzing the request based on keywords and context contained in the text. Sentence structure analysis and intent interpretation are performed here.

[0401] Step 4:

[0402] Simultaneously, the device analyzes the tone and speed of the voice from the audio signal and inputs this information into the emotion recognition engine. The input to this process is the original audio signal, and the output is the emotional state. Acoustic features are extracted and classified into emotion categories. Specific operations include tone analysis and speed evaluation.

[0403] Step 5:

[0404] The server receives the user's request and emotional state, and retrieves appropriate information from an internal database or external source based on that information. In this step, the request and emotional state are input, and the retrieved information is output. Queries are executed against the database or external API calls are made.

[0405] Step 6:

[0406] Based on the acquired information, the server uses a generative AI model to generate a response that matches the user's emotions. In this step, the acquired information is the input, and the response text data is the output. The generative AI model generates the response as text data and adjusts the content and language style.

[0407] Step 7:

[0408] The response generated by the server is sent to the terminal and output as speech using speech synthesis technology. The input for this step is the text data of the response, and the output is the generated speech data. The speech synthesis software generates a speech waveform, which is transmitted to the user through the speaker.

[0409] Step 8:

[0410] The server records responses and user interaction history and updates the profile. Here, the user's request history and emotional state are inputs, and the updated profile data is stored as output. Database operations update the profile, which is used to improve the accuracy of future responses.
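Steps 2 to 4 describe text analysis and emotion recognition as proceeding simultaneously on the same audio signal. One way to express that concurrency is sketched below with a thread pool. The two recognizer functions are hypothetical stand-ins, and the choice of threads is an assumption; the disclosure does not specify a concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_speech(audio: bytes) -> str:
    """Hypothetical stand-in for a speech recognition API."""
    return audio.decode("utf-8")


def recognize_emotion(audio: bytes) -> str:
    """Hypothetical stand-in for an emotion recognition engine."""
    return "tired" if b"tired" in audio else "neutral"


def analyze(audio: bytes) -> dict:
    """Run both analyses on the same audio and collect their results."""
    # Both branches only read the audio, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize_speech, audio)
        emotion_future = pool.submit(recognize_emotion, audio)
        return {"text": text_future.result(),
                "emotion": emotion_future.result()}


result = analyze(b"I'm tired today")
```

The combined dictionary corresponds to the request and emotional state that the server receives in Step 5.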

[0411] (Application Example 2)

[0412] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0413] In care settings, there is a need for a system that can accurately grasp the emotional state of users and provide information to care staff in real time, thereby enabling care tailored to individual needs. However, conventional methods have the challenge of not being able to quickly capture changes in emotions and effectively utilize each user's emotional history.

[0414] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0415] In this invention, the server includes means for recognizing the user's emotional state and adjusting responses based on emotional information, means for referring to past information and utilizing external information sources to obtain necessary data, and means for recording and updating the user's interaction history as a profile. This makes it possible to grasp the user's emotions in real time and to propose specific care to care staff that is appropriate to the user's condition.

[0416] An "audio signal" is an electrical representation of sound, converting human voices and other sounds into data that can be processed in analog or digital format.

[0417] "Digital data" refers to audio signals and other information structured in binary representation, converted into a format that can be processed by a computer.

[0418] "Natural language processing" is a field of technology that enables computers to analyze, understand, and generate language that humans use on a daily basis.

[0419] "User needs" refer to the demands and desires that users have for the system, and these are the things the system should address.

[0420] "Emotional state" refers to the psychological and emotional state of a user, inferred from factors such as tone of voice, speed, and pauses.

[0421] An "external information source" is an information source used to obtain necessary information from databases, online resources, and other sources that exist outside the system.

[0422] A "profile" is a collection of user-specific information created based on a user's past interaction history and emotional state.

[0423] The system based on this invention recognizes the user's emotional state in real time and provides that information to care staff, thereby enabling care tailored to the individual needs of each user in the care setting.

[0424] Hardware and software configuration

[0425] The device is a smart device equipped with a high-performance microphone. It converts the voice signal into digital data and then converts the voice data into text using the Google Speech-to-Text API. The device uses the Emotion AI SDK to recognize emotional states and performs natural language processing using the Google NLP API.

[0426] Data processing and calculations

[0427] The server analyzes the acquired voice data to assess the user's emotional state. Based on the emotional information, it adjusts its response and provides information to care staff. It also creates a profile based on past data to predict the user's needs. It retrieves necessary data by referring to external sources and generates the most appropriate response for the user.

[0428] Specific example

[0429] For example, if a user says in a weak voice, "I'm not feeling well today," the device recognizes this emotion as "unpleasant," and the server instructs the care staff to "prepare for a medical check." Furthermore, based on the user's emotional history, the system can suggest a meal "suitable for their physical condition" for their next visit.

[0430] Example of a prompt

[0431] "Analyze the user's emotions and propose the most suitable care method."

[0432] "Track comments about health conditions and guide actions to encourage medical attention."

[0433] The aim of these technologies is to enable more flexible and effective care delivery in nursing care settings.

[0434] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0435] Step 1:

[0436] The terminal acquires the user's speech as an audio signal through a high-performance microphone. The input is the user's speech, and the output is audio data in digital format. The audio signal is acquired as digital data and prepared as input data for the next step.

[0437] Step 2:

[0438] The device uses the Google Speech-to-Text API to convert the audio signal into text data. In this step, the input is the digital audio data obtained in the previous step, and the output is text data. The converted text data represents the content of the user's speech and is prepared for emotion analysis.

[0439] Step 3:

[0440] The device inputs the text data and voice tone information into the Emotion AI SDK and analyzes the user's emotional state. The output is emotional information, labeled as "positive," "negative," "neutral," etc. This emotional information is then used in the next step.

[0441] Step 4:

[0442] The server generates an appropriate response based on the emotional information obtained in the previous step. In this step, the emotional information and the user's profile are taken as input, and the output is text data of the appropriately adjusted response. By utilizing the profile data, a customized response is prepared that takes past history into account.

[0443] Step 5:

[0444] The server performs speech synthesis to provide the generated response to the care staff, converting it into an audio format and sending it to the terminal. The input is the text response generated in the previous step, and the output is the response in audio format. This makes it possible to provide care staff with quick and accurate instructions in real time.

[0445] Step 6:

[0446] The server records the user's interaction history as a profile and updates it for use in generating responses in the future. The input is the emotional information and response history up to the previous step, and the output is the updated user profile. This improves the accuracy of continuous care for the user.
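The mapping from an emotion label to an instruction for care staff, and the use of emotion history, could look like the following sketch. The rule table, the fallback message, and the follow-up criterion (two negative labels within the last three interactions) are hypothetical and serve only to make the flow concrete.

```python
from collections import Counter

# Hypothetical rules mapping an emotion label to an instruction for care staff.
CARE_RULES = {
    "negative": "Prepare for a medical check.",
    "neutral": "Continue routine care.",
    "positive": "No special action needed.",
}


def care_instruction(emotion_label: str) -> str:
    """Step 4 stand-in: choose an instruction for the given emotion label."""
    return CARE_RULES.get(emotion_label, "Check on the user in person.")


def needs_follow_up(emotion_history: list[str], window: int = 3,
                    threshold: int = 2) -> bool:
    """Flag the user when recent history contains repeated negative labels."""
    recent = emotion_history[-window:]
    return Counter(recent)["negative"] >= threshold


instruction = care_instruction("negative")
follow_up = needs_follow_up(["positive", "negative", "negative"])
```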

[0447] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0448] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference result in a data format such as audio data or text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
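To make the pairing of a prompt with inference data concrete, the sketch below packs both into a single JSON payload. The payload layout, the key names, and the function `build_model_input` are assumptions for illustration and do not correspond to the API of any particular generative AI service.

```python
import json


def build_model_input(instruction: str, inference_data: dict) -> str:
    """Combine an instruction prompt and inference data into one JSON payload.

    The layout is hypothetical; a real model API defines its own format.
    """
    allowed = {"audio", "text", "image"}       # data kinds named in the text
    unknown = set(inference_data) - allowed
    if unknown:
        raise ValueError(f"unsupported inference data types: {sorted(unknown)}")
    return json.dumps({"prompt": instruction, "inference_data": inference_data},
                      ensure_ascii=False)


payload = build_model_input(
    "Summarize the user's request in one sentence.",
    {"text": "Tell me the weather for tomorrow"},
)
```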

[0449] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0450] [Third Embodiment]

[0451] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0452] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0453] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0454] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0455] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0456] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0457] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0458] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0459] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0460] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0461] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0462] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0463] This invention provides a system for coordinating diverse information provision and services based on voice input in mobile information terminals such as smartphones that users use on a daily basis. The aim of this system is to improve the user experience by smoothly coordinating a series of processes including speech recognition, natural language processing, data retrieval, information generation, and response.

[0464] User interface and speech recognition

[0465] The device is equipped with a microphone to receive voice commands from the user, and the acquired voice is converted into a digital signal. This signal is processed by a voice recognition module and converted into text data. Voice recognition is performed in real time, allowing the user to give commands in a natural way.

[0466] Natural Language Processing and Request Analysis

[0467] The device passes text data to a natural language processing engine to analyze the user's request. This identifies the information and tasks the user is seeking, allowing for efficient progress to the next processing step. This analysis is performed using advanced algorithms, enabling accurate semantic understanding.

[0468] Data retrieval and information generation

[0469] After a user's request is identified, the server collects data appropriate to that request from internal and external sources. This involves referencing the internet and specific databases as needed to obtain real-time and accurate information. The server can also refer to the user's past interaction history to generate individually customized information.

[0470] Response generation and user notifications

[0471] The collected and analyzed information is assembled into a user-specific response. The server formats this response, and the terminal uses a speech synthesis module to notify the user verbally. This allows users to obtain appropriate information without relying on visual cues.

[0472] Profile update and learning

[0473] After each interaction, the server records new information in the profile, learning the user's preferences and behavioral patterns. This profile update plays a crucial role in providing more personalized services in future interactions.

[0474] Specific example

[0475] For example, if a user says, "Tell me the weather for tomorrow," the device recognizes the voice, the server retrieves the weather information, and responds with a voice message saying, "It will be sunny tomorrow." Also, if a user asks, "Tell me my insurance card number," the server retrieves the relevant information from the user profile and notifies the user through the device.

[0476] In this way, this system realizes a consistent information delivery flow starting with the user's voice input, providing convenient and efficient support to the user.

[0477] The following describes the processing flow.

[0478] Step 1:

[0479] The device detects the user's voice input and captures it through the built-in microphone. The voice data is converted into a digital format and sent to the voice recognition system.

[0480] Step 2:

[0481] The device uses a speech recognition system to convert speech data into text data. This makes the user's spoken content ready for processing as text information.

[0482] Step 3:

[0483] The device passes text data to a natural language processing engine, which analyzes the user's request. At this stage, keywords and intent are extracted, and the content of the request is identified.

[0484] Step 4:

[0485] Based on the identified request, the server queries internal databases and external sources to find the necessary information. For example, weather information and personal settings information may be referenced here.

[0486] Step 5:

[0487] The server generates a response based on the acquired data, tailored to the user's request. This response may be customized to take the user's profile information into account.

[0488] Step 6:

[0489] The server sends the generated response to the terminal. The terminal uses a speech synthesis system to convert the text response into speech and delivers the information to the user through its speaker.

[0490] Step 7:

[0491] The server records user interaction details in a profile and updates the information. This improves the accuracy and personalization of future responses.
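The seven steps above can be sketched as a small pipeline. The following is a minimal Python sketch under stated assumptions, not the implementation of this disclosure: the names VoicePipeline and Profile, the stage signatures, and the stand-in lambdas are all hypothetical and introduced only for illustration. A real system would substitute actual speech recognition, natural language processing, and speech synthesis engines for the stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Profile:
    """Hypothetical per-user profile: a plain list of past interactions."""
    history: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class VoicePipeline:
    # Each stage is injected so that real engines can replace the stand-ins.
    recognize: Callable[[bytes], str]              # Steps 1-2: audio -> text
    analyze: Callable[[str], str]                  # Step 3: text -> request
    retrieve: Callable[[str, Profile], str]        # Step 4: request -> information
    generate: Callable[[str, str, Profile], str]   # Step 5: information -> response text
    synthesize: Callable[[str], bytes]             # Step 6: response text -> audio

    def handle(self, audio: bytes, profile: Profile) -> bytes:
        text = self.recognize(audio)
        request = self.analyze(text)
        info = self.retrieve(request, profile)
        response = self.generate(request, info, profile)
        speech = self.synthesize(response)
        # Step 7: record the interaction in the profile.
        profile.history.append({"request": request, "response": response})
        return speech


# Stand-in stages used only to exercise the flow (assumptions, not real engines).
demo = VoicePipeline(
    recognize=lambda audio: audio.decode("utf-8"),
    analyze=lambda text: "weather" if "weather" in text.lower() else "unknown",
    retrieve=lambda request, profile: {"weather": "sunny"}.get(request, ""),
    generate=lambda request, info, profile: (
        f"It will be {info} tomorrow." if info else "Sorry, I could not find that."
    ),
    synthesize=lambda response: response.encode("utf-8"),
)

profile = Profile()
speech = demo.handle(b"Tell me the weather for tomorrow", profile)
```

Injecting each stage as a callable keeps the terminal-side and server-side steps separable, which mirrors the division of roles described in the flow.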

[0492] (Example 1)

[0493] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0494] In daily life, there is a need for systems that allow users to quickly and accurately obtain diverse information through voice input. However, conventional technologies have shortcomings in the process from voice recognition to information provision, resulting in delayed or inaccurate responses to user requests. Furthermore, it has been difficult to provide personalized information that reflects individual user preferences and past interactions.

[0495] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0496] In this invention, the server includes means for collecting auditory input from the user and converting it into text data, means for analyzing the text data using natural language processing technology and identifying the user's needs, and means for referring to information history and utilizing external information sources to obtain necessary information. This enables the user to quickly and accurately obtain personalized information and services starting from voice input.

[0497] A "user" refers to an individual or organization that uses a system to obtain information.

[0498] "Auditory input" refers to the process of supplying information or instructions using sound.

[0499] "Text data" refers to information in text format that has been converted based on speech or auditory input.

[0500] "Natural language processing technology" refers to the technology that enables computers to understand, analyze, and process human language.

[0501] "Needs" refer to the requests or information that users want to resolve through the system.

[0502] "Information history" refers to records of a user's past interactions and requests.

[0503] "External information sources" refer to databases or internet resources that exist outside the system and from which information is obtained.

[0504] "Reply" refers to the response generated by the system in response to a user's request.

[0505] "Notification" refers to the act of delivering a generated reply to the user via audio or other means.

[0506] A "profile" refers to a collection of information that records a user's preferences and past usage history.

[0507] This invention relates to an information provision system that receives everyday voice commands from a user, processes information based on those commands, and returns a voice response. The system functions through the cooperation of a mobile information terminal and a server.

[0508] The terminal first has an audio input device (e.g., a microphone) to acquire voice input from the user. This voice signal is converted into text data using internal speech recognition software (e.g., a speech recognition API).

[0509] Next, the device uses a natural language processing engine (e.g., a natural language processing library) to analyze the text data and identify the user's needs. This process allows the system to accurately understand the information and tasks the user is seeking.

[0510] Once user needs are identified, the server collects the necessary information. The server retrieves appropriate data by referencing the information history and utilizing external sources (e.g., information provision APIs). It can also generate personalized information based on the user's past interactions.

[0511] Based on the acquired information, the server generates a reply to the user and converts this reply into audio data. The audio, generated using speech synthesis software (e.g., a speech synthesis engine), is then delivered to the user via the terminal.

[0512] This process allows users to quickly and accurately obtain information without relying on their vision. For example, if a user says, "Tell me about Italian restaurants nearby," the system analyzes the voice, collects appropriate restaurant information, and provides a voice response such as, "The nearest Italian restaurant is in front of the train station."

[0513] A concrete example of a prompt using a generative AI model would be: "Generate a response to the user's question, 'When will it rain next?'" This allows the system to simulate natural and contextually appropriate responses.

[0514] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0515] Step 1:

[0516] The user speaks into the device's microphone and inputs their request by voice. This input is in the format of, for example, "Tell me today's news." The input data is a raw voice signal.

[0517] Step 2:

[0518] The terminal digitizes the audio signal acquired from the user. Speech recognition software is used to generate text data from this digital audio signal. In this step, features are extracted from the audio waveform and matched with known audio to convert them into accurate text. The input is an audio signal, and the output is text data.

[0519] Step 3:

[0520] The device passes the generated text data to a natural language processing engine. The engine analyzes this data to determine the user's intent. For example, if the word "news" is included, it determines that the user is seeking the latest news information. The input is text data, and the output is data that indicates the user's needs.

[0521] Step 4:

[0522] The server searches for relevant information based on the user's needs. Here, it retrieves the most suitable data from existing information history and external sources. For example, it might retrieve the latest news articles through a news API. The input is the user's needs, and the output is the retrieved information data.

[0523] Step 5:

[0524] The server generates a response to the user based on the acquired information data. Using a generative AI model, it creates a response in natural, human-like language. For example, "Today's top news is an article about economic growth." The input is information data, and the output is the response text.

[0525] Step 6:

[0526] The terminal processes the response text received from the server using a speech synthesis module and converts it into audio data. This audio data is played back to the user in real time. The input is the response text, and the output is the audio data.
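Steps 3 and 4 above (determining the user's needs from keywords such as "news" and then retrieving matching information) can be illustrated as follows. This is a minimal Python sketch: the keyword table, the function names determine_need and fetch_information, and the stand-in news source are assumptions made for illustration. The source text only gives "news" as an example and does not specify how the natural language processing engine works.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical keyword table; only "news" is mentioned in the text.
NEED_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "news": ("news", "headlines"),
    "weather": ("weather", "rain", "forecast"),
}


def determine_need(text: str) -> str:
    """Step 3: map recognized text to a need by simple keyword matching."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    for need, keywords in NEED_KEYWORDS.items():
        if tokens.intersection(keywords):
            return need
    return "unknown"


def fetch_information(need: str, sources: Dict[str, Callable[[], str]]) -> str:
    """Step 4: call the source registered for the need, if there is one."""
    source = sources.get(need)
    return source() if source else ""


# Stand-in for an external news API (an assumption, not a real service call).
sources = {"news": lambda: "an article about economic growth"}

need = determine_need("Tell me today's news.")
info = fetch_information(need, sources)
```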

[0527] (Application Example 1)

[0528] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0529] The problem that this invention aims to solve is to enable users in modern society to quickly and accurately access diverse information and services through voice input. In particular, it aims to improve convenience and efficiency in urban life by utilizing users' real-time information about their living environment.

[0530] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0531] In this invention, the server includes means for acquiring voice input from the user and converting it into text data, means for analyzing the text data using natural language processing to identify the user's requests, and means for acquiring information about the user's living environment in real time and providing optimal services. This enables users to quickly access the information they need and enables appropriate services to be provided based on that information.

[0532] "Voice input" is the process of acquiring information using the user's voice and converting it into a digital format.

[0533] "Natural language processing" is a technology that analyzes text data to understand the meaning of language and identify user requests.

[0534] "User requests" refer to the requests for information and services that users express through voice input.

[0535] "Living environment information" refers to real-time information related to the environment in which a user lives, including urban traffic conditions and the operational status of public services.

[0536] "Real-time" refers to a function that processes and responds to the current state of affairs on the spot.

[0537] "Response generation" is the process of assembling appropriate information to provide to the user based on the information acquired.

[0538] A "profile" is a data structure used to record a user's interaction history and preferences, and to utilize this information for future service provision.

[0539] This invention relates to a system in which a user provides voice input via a mobile device such as a smartphone, and a service is provided based on that input. This system is implemented in the following form.

[0540] First, the device is equipped with a microphone to acquire voice input from the user. This voice input is converted into text data using speech recognition software such as the Google Speech-to-Text API. The device then parses the converted text data using a natural language processing library such as NLTK or spaCy to identify the user's request.

[0541] Based on user requests, the server utilizes various open APIs to acquire information about the living environment, such as traffic information and the operational status of public services, while referencing historical data. Furthermore, it generates the optimal response for the user based on this information, and converts this response into speech using the Google Text-to-Speech engine for notification. This allows users to obtain the necessary information without relying on visual information.

[0542] For example, if a user asks by voice, "Tell me the shortest route to my workplace," the terminal recognizes and analyzes the voice, and the server obtains real-time traffic information. The server calculates the most efficient route and notifies the user of the result by voice.
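The route calculation mentioned above is not specified further in the text. One possible approach, shown here purely as an assumption, is Dijkstra's shortest-path algorithm over a road graph whose edge weights are travel times already adjusted for real-time traffic. The graph contents, node names, and the function name fastest_route are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# node -> {neighbor: travel time in minutes, already reflecting current traffic}
Graph = Dict[str, Dict[str, float]]


def fastest_route(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm; returns (total minutes, path) or (inf, [])."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    settled: Dict[str, float] = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already reached this node at least as cheaply
        settled[node] = cost
        for neighbor, minutes in graph.get(node, {}).items():
            heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical traffic snapshot: the highway is congested at the moment.
graph: Graph = {
    "home": {"highway": 5.0, "main_street": 7.0},
    "highway": {"office": 20.0},
    "main_street": {"office": 10.0},
}

minutes, route = fastest_route(graph, "home", "office")
```

With these weights the sketch prefers the main street, because the congested highway segment makes that path slower overall.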

[0543] An example of a prompt for a generative AI model is: "Based on the user's voice instructions, design an application that contributes to improving life in a smart city by considering a process to provide optimal information and services."

[0544] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0545] Step 1:

[0546] The device acquires the user's voice input via the microphone. This voice data is sent to the Google Speech-to-Text API and converted into text data. The input is voice data, and the output is text data, which is the transcription of that voice.

[0547] Step 2:

[0548] The terminal analyzes the acquired text data using natural language processing libraries such as NLTK and spaCy. This process identifies the user's request. The input is the text data obtained in step 1, and the output is structured data containing the user's request.

[0549] Step 3:

[0550] The server receives structured data and, based on user requests, calls APIs to retrieve lifestyle information such as traffic information and public service information. The input is the user's request, and the output is a dataset containing the necessary information related to that request.

[0551] Step 4:

[0552] The server generates a response to the user based on the information it has obtained. This response is designed to include the most relevant information to the user's question. The input is the dataset obtained in step 3, and the output is a text message to notify the user.

[0553] Step 5:

[0554] The server converts the generated text message into speech using the Google Text-to-Speech engine and sends it to the device. The input is a text message, and the output is audio data that conveys the information to the user without relying on vision.

[0555] Step 6:

[0556] The device notifies the user of the received audio data through its speaker. The input is the audio data obtained in step 5, and the output is the audio information that the user can hear.
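Step 2 above outputs "structured data containing the user's request," which the server receives in Step 3. The text does not define that structure, so the following Python sketch assumes one: the StructuredRequest fields, the simple parsing rule, and the JSON wire format are all hypothetical. No call to the Google Speech-to-Text API, NLTK, or spaCy is made here; the parsing is a deliberately naive stand-in.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StructuredRequest:
    """Hypothetical shape of the structured data passed from terminal to server."""
    intent: str
    destination: str
    raw_text: str


def parse_request(text: str) -> StructuredRequest:
    """Naive stand-in for NLP parsing: detect a route request and its destination."""
    lowered = text.lower()  # assumes lowercasing keeps string length (true for ASCII)
    intent = "route" if "route" in lowered else "unknown"
    destination = ""
    marker = " to "
    if intent == "route" and marker in lowered:
        start = lowered.index(marker) + len(marker)
        destination = text[start:].strip(" .?!")
    return StructuredRequest(intent=intent, destination=destination, raw_text=text)


def to_wire(request: StructuredRequest) -> str:
    """Serialize the request for transmission from the terminal to the server."""
    return json.dumps(asdict(request), sort_keys=True)


def from_wire(payload: str) -> StructuredRequest:
    """Rebuild the request on the server side."""
    return StructuredRequest(**json.loads(payload))


request = parse_request("Tell me the shortest route to my workplace.")
received = from_wire(to_wire(request))
```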

[0557] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0558] This invention relates to an advanced AI system that operates within a mobile device and has the ability to recognize user emotions and adjust interactions accordingly. By acquiring user voice input and incorporating an emotion engine, it is possible to detect emotions from voice data and use the results to generate personalized responses.

[0559] User interface and speech recognition

[0560] The device receives the user's voice through the microphone and converts the audio signal into a digital format. The speech recognition function receives this and processes it as text data, making it possible to analyze the content of the user's speech.

[0561] Emotion recognition and natural language processing

[0562] The device inputs voice data into an emotion engine, which analyzes features such as tone, speed, and pauses to recognize the user's emotional state. Simultaneously, a natural language processing engine analyzes text data to identify the content of the request.

[0563] Appropriate information acquisition and response generation

[0564] Based on the results of emotion recognition and information obtained from request analysis, the server retrieves appropriate data from internal databases and external sources to address the user's request. In this process, the communication style is adjusted according to the emotion information, and the tone and content of the response are customized.

[0565] Providing responses and adjusting to emotions

[0566] The server uses the acquired and generated information to create a response to the user and sends it to the terminal. The terminal uses speech synthesis to convert this response into an audio format and transmits it to the user. Furthermore, adjustments are made to the response to make it more acceptable, depending on the user's emotional state.

[0567] Personalized responses and profile updates

[0568] The server records and updates a profile of the user's interaction and emotional history. This update allows for more personalized responses in the future, improving the user experience. In particular, it has the ability to learn recurring request patterns and predict user needs.
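The text does not say how recurring request patterns are learned. As one very simple possibility, offered only as an assumption, the server could count past requests in the profile and treat the most frequent one as the predicted need. The function name predict_next_request and the list-based history are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def predict_next_request(history: List[str]) -> Optional[str]:
    """Return the most frequent past request, or None if there is no history."""
    if not history:
        return None
    return Counter(history).most_common(1)[0][0]


history = ["weather", "news", "weather", "route"]
prediction = predict_next_request(history)
```

A frequency count ignores time of day and context, so it should be read as a baseline illustration rather than the learning method of this system.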

[0569] Specific example

[0570] For example, if a user says "I'm tired today" in a slightly downcast tone, the device recognizes that emotion as "fatigue" or "stress." The server takes this information into account and generates a gentle response such as, "Why don't you take a short break?" In contrast, if a user says "What should I do tomorrow?" in a very cheerful voice, the server responds in an energetic tone, "The weather looks nice, so I recommend going out."

[0571] In this way, this system combines speech recognition, natural language processing, and emotion recognition to achieve flexible, context-aware interactions, providing a more user-friendly and effective user experience.

[0572] The following describes the processing flow.

[0573] Step 1:

[0574] The device acquires voice from the user through the microphone and converts the voice signal into digital data. This prepares it for the voice recognition process.

[0575] Step 2:

[0576] The device uses a speech recognition engine to convert digital speech data into text data, making the user's speech content analyzable.

[0577] Step 3:

[0578] The device inputs voice data into its emotion engine, which analyzes the tone, speed, and volume of the voice to recognize the user's emotions. This result is output as an emotion label.

[0579] Step 4:

[0580] The device uses a natural language processing engine to analyze text data and identify the intent behind the user's request. Keywords and other important information are then extracted.

[0581] Step 5:

[0582] Based on the emotion label and the request content, the server refers to the user's profile and searches internal databases and external sources for appropriate information. It collects the relevant information and generates a response tailored to the user.

[0583] Step 6:

[0584] The server adjusts the generated response content, taking emotional information into account, and refines the response to suit the user's tone and voice.

[0585] Step 7:

[0586] The server sends the final response to the terminal, which then uses speech synthesis to convert the response into speech format. The terminal then provides the user with appropriate audio information through its speaker.

[0587] Step 8:

[0588] The server records the interaction results and user sentiment data in a profile, which is then updated to improve the quality of future interactions. This results in more personalized responses in subsequent interactions.
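Step 3 above outputs an emotion label from the tone, speed, and volume of the voice. The text does not disclose how the emotion engine decides, so the following Python sketch uses fixed thresholds purely as an assumption. The VoiceFeatures fields, the threshold values, and the label names are hypothetical, and an actual emotion engine would more likely be a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical summary of one utterance."""
    pitch_hz: float          # average pitch, standing in for "tone"
    words_per_minute: float  # speaking speed
    volume_db: float         # loudness


def label_emotion(features: VoiceFeatures) -> str:
    """Rule-based stand-in for the emotion engine (thresholds are illustrative)."""
    if features.words_per_minute < 110 and features.volume_db < 50:
        return "fatigue"
    if features.pitch_hz > 220 and features.words_per_minute > 160:
        return "cheerful"
    return "neutral"


tired = label_emotion(VoiceFeatures(pitch_hz=140, words_per_minute=90, volume_db=45))
bright = label_emotion(VoiceFeatures(pitch_hz=260, words_per_minute=180, volume_db=65))
```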

[0589] (Example 2)

[0590] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0591] Conventional interactive systems have struggled to generate appropriate responses that reflect the user's emotional state, making it difficult to optimize the user experience. Therefore, there is a need for technology that accurately recognizes user emotions and customizes responses accordingly.

[0592] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0593] In this invention, the server includes means for extracting features from audio signals and recognizing emotional states, means for converting audio signals into text data, and means for analyzing text data using natural language processing techniques and identifying requests. This enables the generation of dynamic and personalized responses that correspond to the user's emotional state.

[0594] An "audio signal" is input information used to represent sound as digital data.

[0595] A "digital signal" is a data format that quantifies an analog signal, making it possible to process and analyze it using a computer.

[0596] "Emotional state" refers to a psychological state judged based on factors such as the user's way of speaking and tone of voice.

[0597] "Speech synthesis technology" is a technology that outputs text data as speech.

[0598] "Natural language processing technology" is a technology that enables computers to understand and analyze human language.

[0599] A "profile" is a database that records a user's past interactions and preferences, and is used to generate personalized responses.

[0600] A "source of information" is an internal or external database or API that is accessed to provide the requested information.

[0601] This invention is an interactive system that recognizes user emotions and optimizes responses based on those emotions. To achieve this, the user's device is equipped with hardware that acquires audio signals through a microphone and converts them into digital signals. Digital signal processing technology is used for this conversion.

[0602] The user's device uses the converted digital signal to convert speech into text data via a speech recognition module. A common speech recognition API is used as the software for this process. This text data is then analyzed using natural language processing techniques to identify the user's request. BERT models or equivalent techniques are used for natural language processing.

[0603] In parallel, the device extracts features such as voice tone and speed from the audio signal and inputs them into the emotion recognition engine. Here, an emotion analysis API is used as the algorithm for determining emotion.
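The feature extraction described above can be illustrated with two basic acoustic measures. This Python sketch is an assumption about what such features might look like, not the disclosed method: it computes root-mean-square amplitude (a loudness measure) and the zero-crossing rate (a rough indicator of how quickly the waveform oscillates). The function name acoustic_features and the choice of these two measures are hypothetical.

```python
import math
from typing import Dict, Sequence


def acoustic_features(samples: Sequence[float], sample_rate: int) -> Dict[str, float]:
    """Compute RMS amplitude and zero crossings per second for one audio frame."""
    if not samples:
        return {"rms": 0.0, "zero_crossings_per_second": 0.0}
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # A crossing occurs whenever consecutive samples differ in sign.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration_s = len(samples) / sample_rate
    return {"rms": rms, "zero_crossings_per_second": crossings / duration_s}


# Synthetic frame alternating between +1 and -1 (8 samples at 8 samples/second).
frame = [1.0, -1.0] * 4
features = acoustic_features(frame, sample_rate=8)
```

Values like these would then be passed to the emotion recognition engine as input.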

[0604] The server retrieves appropriate information from internal and external sources based on the analyzed user requests and emotions. This information retrieval involves querying databases and making API calls to external services. Based on the retrieved information, a generative AI model is used to generate personalized responses for the user.

[0605] The response generated by the server is sent back to the user's terminal and output as speech using speech synthesis software. This allows the user to receive the audio output from their terminal. This entire process is also personalized, with adjustments made based on the user's emotional state.

[0606] For example, if a user says "I'm tired today," the device recognizes the emotion expressed, and the server can generate a gentle response such as "Why don't you take a short break?". If the user cheerfully says "What should I do tomorrow?", the server can generate a response along the lines of "The weather looks nice, so I recommend going out."

[0607] An example of a prompt might be: "Create a prompt for a system that takes user emotions into consideration and generates an appropriate response. Offer comfort to tired users and active suggestions to energetic users."
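To show how such a prompt might be assembled at run time, the following Python sketch combines the user's utterance with an estimated emotion label. The label names, the guidance table, and the function name build_emotion_prompt are assumptions for illustration. The prompt wording is adapted from the example above and is not a prescribed format.

```python
from typing import Dict

# Hypothetical mapping from emotion label to response guidance.
GUIDANCE_BY_EMOTION: Dict[str, str] = {
    "tired": "Offer comfort and suggest rest.",
    "energetic": "Offer active suggestions.",
}

DEFAULT_GUIDANCE = "Respond in a neutral, helpful tone."


def build_emotion_prompt(utterance: str, emotion: str) -> str:
    """Assemble a prompt for a generative AI model from utterance and emotion."""
    guidance = GUIDANCE_BY_EMOTION.get(emotion, DEFAULT_GUIDANCE)
    return (
        "Generate an appropriate response that takes the user's emotions "
        "into consideration.\n"
        f"Estimated emotion: {emotion}\n"
        f"Guidance: {guidance}\n"
        f"User utterance: {utterance}"
    )


prompt = build_emotion_prompt("I'm tired today", "tired")
```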

[0608] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0609] Step 1:

[0610] The device acquires the user's voice through a microphone. This voice signal is input as an analog signal. The device uses digital signal processing technology to convert this analog signal into a digital format. Specifically, an analog-to-digital converter (ADC) samples and quantizes the signal to output a digital signal.

[0611] Step 2:

[0612] The digitized audio signal is sent to the terminal's speech recognition module, where it is converted into text data. The input is a digital signal, and the output is text data. Here, a speech recognition API is used, making the content of the speech parseable as a string of characters. Specifically, feature extraction and phoneme analysis are performed on the audio waveform.

[0613] Step 3:

[0614] The terminal passes text data to a natural language processing engine to identify the user's request. The input for this step is text data, and the output is semantic data representing the user's request. The BERT model is used for natural language processing, analyzing the request based on keywords and context contained in the text. Sentence structure analysis and intent interpretation are performed here.

[0615] Step 4:

[0616] Simultaneously, the device analyzes the tone and speed of the voice from the audio signal and inputs this information into the emotion recognition engine. The input to this process is the original audio signal, and the output is the emotional state. Acoustic features are extracted and classified into emotion categories. Specific operations include tone analysis and speed evaluation.

[0617] Step 5:

[0618] The server receives the user's request and emotional state, and retrieves appropriate information from an internal database or external source based on that information. In this step, the request and emotional state are input, and the retrieved information is output. Queries are executed against the database or external API calls are made.

[0619] Step 6:

[0620] Based on the acquired information, the server uses a generative AI model to generate a response that matches the user's emotions. In this step, the acquired information is the input, and the response text data is the output. The generative AI model generates the response as text data and adjusts the content and language style.

[0621] Step 7:

[0622] The response generated by the server is sent to the terminal and output as speech using speech synthesis technology. The input for this step is the text data of the response, and the output is the generated speech data. The speech synthesis software generates a speech waveform, which is transmitted to the user through the speaker.

[0623] Step 8:

[0624] The server records responses and user interaction history and updates the profile. Here, the user's request history and sentiment state are inputs, and the updated profile data is stored as output. Database operations update the profile, which is used to improve the accuracy of future responses.
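Step 1 of this flow states that an analog-to-digital converter samples and quantizes the signal. The following Python sketch simulates that idea in software under stated assumptions: the analog signal is modeled as a function of time with values in the range -1.0 to 1.0, and the 16-bit depth, the 16 kHz sample rate, and the function names are illustrative choices not taken from the text.

```python
import math
from typing import Callable, List


def quantize(value: float, bits: int = 16) -> int:
    """Clip a value to [-1.0, 1.0] and map it to a signed integer level."""
    peak = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, value))
    return round(clipped * peak)


def sample_and_quantize(
    signal: Callable[[float], float],
    duration_s: float,
    sample_rate: int,
    bits: int = 16,
) -> List[int]:
    """Sample the signal at regular intervals and quantize each sample."""
    count = int(duration_s * sample_rate)
    return [quantize(signal(n / sample_rate), bits) for n in range(count)]


def tone(t: float) -> float:
    """Stand-in analog input: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440.0 * t)


pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate=16000)
```

In a real terminal this conversion is performed by the ADC hardware and audio driver, and the resulting integer samples are what the speech recognition module receives.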

[0625] (Application Example 2)

[0626] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0627] In care settings, there is a need for a system that can accurately grasp the emotional state of users and provide information to care staff in real time, thereby enabling care tailored to individual needs. However, conventional methods have the challenge of not being able to quickly capture changes in emotions and effectively utilize each user's emotional history.

[0628] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0629] In this invention, the server includes means for recognizing the user's emotional state and adjusting responses based on emotional information, means for referring to past information and utilizing external information sources to obtain necessary data, and means for recording and updating the user's interaction history as a profile. This makes it possible to grasp the user's emotions in real time and to propose specific care to care staff that is appropriate to the user's condition.

[0630] An "audio signal" is an electrical representation of sound, converting human voices and other sounds into data that can be processed in analog or digital format.

[0631] "Digital data" refers to audio signals and other information structured in binary representation, converted into a format that can be processed by a computer.

[0632] "Natural language processing" is a field of technology that enables computers to analyze, understand, and generate language that humans use on a daily basis.

[0633] "User needs" refer to the demands and desires that users have for the system, and these are the things the system should address.

[0634] "Emotional state" refers to the psychological and emotional state of a user, inferred from factors such as tone of voice, speed, and pauses.

[0635] An "external information source" is an information source used to obtain necessary information from databases, online resources, and other sources that exist outside the system.

[0636] A "profile" is a collection of user-specific information created based on a user's past interaction history and emotional state.

[0637] The system based on this invention recognizes the user's emotional state in real time and provides that information to care staff, thereby enabling care tailored to the individual needs of each user in the care setting.

[0638] Hardware and software configuration

[0639] The terminal is a smart device equipped with a high-performance microphone. It converts the voice signal into digital data and then converts the voice data into text using the Google Speech-to-Text API. The device uses the Emotion AI SDK to recognize emotional states and performs natural language processing using the Google NLP API.

[0640] Data processing and calculations

[0641] The server analyzes the acquired voice data to assess the user's emotional state. Based on the emotional information, it adjusts its response and provides information to care staff. It also creates a profile based on past data to predict the user's needs. It retrieves necessary data by referring to external sources and generates the most appropriate response for the user.

[0642] Specific example

[0643] For example, if a user says in a weak voice, "I'm not feeling well today," the device recognizes this emotion as "unpleasant," and the server instructs the care staff to "prepare for a medical check." Furthermore, based on the user's emotional history, the system can suggest a meal "suitable for their physical condition" for their next visit.

[0644] Example of a prompt

[0645] "Analyze the user's emotions and propose the most suitable care method."

[0646] "Track comments about health conditions and guide actions to encourage medical attention."

[0647] The aim of these technologies is to enable more flexible and effective care delivery in nursing care settings.

[0648] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0649] Step 1:

[0650] The terminal acquires the user's speech as an audio signal through a high-performance microphone. The input is the user's speech, and the output is audio data in digital format. The audio signal is acquired as digital data and prepared as input data for the next step.

[0651] Step 2:

[0652] The device uses the Google Speech-to-Text API to convert the audio signal into text data. In this step, the input is the digital audio data obtained in the previous step, and the output is text data. The resulting text data captures what the user said and is passed on for emotion analysis.

[0653] Step 3:

[0654] The device passes the text data and voice tone information to the Emotion AI SDK, which analyzes the user's emotional state. The output is emotional information labeled as "positive," "negative," "neutral," etc. This emotional information is then used in the next step.

[0655] Step 4:

[0656] The server generates an appropriate response based on the emotional information obtained in the previous step. In this step, the emotional information and the user's profile are taken as input, and the output is text data of the appropriately adjusted response. By utilizing the profile data, a customized response is prepared that takes past history into account.

[0657] Step 5:

[0658] The server performs speech synthesis to provide the generated response to the care staff, converting it into an audio format and sending it to the terminal. The input is the text response generated in the previous step, and the output is the response in audio format. This makes it possible to provide care staff with quick and accurate instructions in real time.

[0659] Step 6:

[0660] The server records the user's interaction history as a profile and updates it for use in generating responses in the future. The input is the emotional information and response history up to the previous step, and the output is the updated user profile. This improves the accuracy of continuous care for the user.
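
Steps 4 and 6 above can be illustrated with a minimal sketch. The class name CareProfile, the function names, and the escalation rule are hypothetical and are not taken from this disclosure; the emotion labels follow the "positive," "negative," and "neutral" labels of Step 3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CareProfile:
    """Hypothetical per-user profile holding emotion and response history (Step 6)."""
    user_id: str
    emotion_history: List[str] = field(default_factory=list)
    response_history: List[str] = field(default_factory=list)


def generate_care_instruction(emotion: str, profile: CareProfile) -> str:
    """Step 4: pick an instruction for care staff from the emotion label and profile.

    Assumed rule (not from the disclosure): a negative label that follows an
    earlier negative label escalates to a medical check.
    """
    if emotion == "negative":
        if profile.emotion_history and profile.emotion_history[-1] == "negative":
            return "Prepare for a medical check."
        return "Check on the user's condition."
    return "Continue routine care."


def update_profile(profile: CareProfile, emotion: str, response: str) -> None:
    """Step 6: record the latest emotion label and response in the profile."""
    profile.emotion_history.append(emotion)
    profile.response_history.append(response)
```

In this sketch the profile influences the response only through the most recent emotion label; an actual system could draw on a much richer history.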

[0661] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0662] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0663] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0664] [Fourth Embodiment]

[0665] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0666] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0667] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0668] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0669] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0670] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0671] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0672] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0673] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0674] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0675] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0676] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0677] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0678] This invention provides a system for coordinating diverse information provision and services based on voice input in mobile information terminals such as smartphones that users use on a daily basis. The aim of this system is to improve the user experience by smoothly coordinating a series of processes including speech recognition, natural language processing, data retrieval, information generation, and response.

[0679] User interface and speech recognition

[0680] The device is equipped with a microphone to receive voice commands from the user, and the acquired voice is converted into a digital signal. This signal is processed by a voice recognition module and converted into text data. Voice recognition is performed in real time, allowing the user to give commands in a natural way.

[0681] Natural Language Processing and Request Analysis

[0682] The device passes text data to a natural language processing engine to analyze the user's request. This identifies the information and tasks the user is seeking, allowing for efficient progress to the next processing step. This analysis is performed using advanced algorithms, enabling accurate semantic understanding.

[0683] Data retrieval and information generation

[0684] After a user's request is identified, the server collects data appropriate to that request from internal and external sources. This involves referencing the internet and specific databases as needed to obtain real-time and accurate information. The server can also refer to the user's past interaction history to generate individually customized information.

[0685] Response generation and user notifications

[0686] The collected and analyzed information is assembled into a user-specific response. The server formats this response, and the terminal uses a speech synthesis module to notify the user verbally. This allows users to obtain appropriate information without relying on visual cues.

[0687] Profile update and learning

[0688] After each interaction, the server records new information in the profile, learning the user's preferences and behavioral patterns. This profile update plays a crucial role in providing more personalized services in future interactions.
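
As one possible illustration of the profile update described above, the following sketch counts request types across interactions so that the most frequent one can be looked up later. The class UserProfile and its methods are hypothetical names introduced only for this example, and counting is an assumed, simplified stand-in for "learning" preferences.

```python
from collections import Counter
from typing import List, Optional, Tuple


class UserProfile:
    """Hypothetical profile that counts request types across interactions."""

    def __init__(self) -> None:
        self.history: List[Tuple[str, str]] = []  # (request type, response) pairs
        self.request_counts: Counter = Counter()

    def record(self, request_type: str, response: str) -> None:
        """Store one interaction and update the per-type count."""
        self.history.append((request_type, response))
        self.request_counts[request_type] += 1

    def most_frequent_request(self) -> Optional[str]:
        """Return the request type seen most often, or None if the profile is empty."""
        if not self.request_counts:
            return None
        return self.request_counts.most_common(1)[0][0]
```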

[0689] Specific example

[0690] For example, if a user says, "Tell me the weather for tomorrow," the device recognizes the voice, the server retrieves the weather information, and responds with a voice message saying, "It will be sunny tomorrow." Also, if a user asks, "Tell me my insurance card number," the server retrieves the relevant information from the user profile and notifies the user through the device.

[0691] In this way, this system realizes a consistent information delivery flow starting with the user's voice input, providing convenient and efficient support to the user.

[0692] The following describes the processing flow.

[0693] Step 1:

[0694] The device detects the user's voice input and captures it through the built-in microphone. The voice data is converted into a digital format and sent to the voice recognition system.

[0695] Step 2:

[0696] The device uses a speech recognition system to convert speech data into text data. This makes the user's spoken content ready for processing as text information.

[0697] Step 3:

[0698] The device passes text data to a natural language processing engine, which analyzes the user's request. At this stage, keywords and intent are extracted, and the content of the request is identified.

[0699] Step 4:

[0700] Based on the identified request, the server queries internal databases and external sources to find the necessary information. For example, weather information and personal settings information may be referenced here.

[0701] Step 5:

[0702] The server generates a response based on the acquired data, tailored to the user's request. This response may be customized to take the user's profile information into account.

[0703] Step 6:

[0704] The server sends the generated response to the terminal. The terminal uses a speech synthesis system to convert the text response into speech and delivers the information to the user through its speaker.

[0705] Step 7:

[0706] The server records user interaction details in a profile and updates the information. This improves the accuracy and personalization of future responses.
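
The seven steps above can be sketched as a single function that chains the stages. This is a minimal sketch under stated assumptions: the function run_pipeline and its parameters are hypothetical, each stage is passed in as a callable rather than tied to any particular speech or language engine, and Step 5 simply passes the retrieved information through unchanged.

```python
from typing import Callable, Dict, List


def run_pipeline(
    audio: bytes,
    recognize: Callable[[bytes], str],   # Step 2: speech data -> text
    analyze: Callable[[str], str],       # Step 3: text -> identified request
    retrieve: Callable[[str], str],      # Step 4: request -> information
    synthesize: Callable[[str], bytes],  # Step 6: response text -> speech
    profile: List[Dict[str, str]],       # Step 7: interaction history
) -> bytes:
    """Run Steps 2 to 7 in order; Step 1 (capture) is assumed to have produced `audio`."""
    text = recognize(audio)
    request = analyze(text)
    info = retrieve(request)
    # Step 5: a real system would tailor this text to the profile; here the
    # retrieved information is used as the response unchanged.
    response = info
    speech = synthesize(response)
    profile.append({"text": text, "request": request, "response": response})
    return speech
```

Passing the stages as callables keeps the sketch independent of whether a given stage runs on the terminal or on the server.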

[0707] (Example 1)

[0708] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0709] In daily life, there is a need for systems that allow users to quickly and accurately obtain diverse information through voice input. However, conventional technologies have shortcomings in the process from voice recognition to information provision, resulting in delayed or inaccurate responses to user requests. Furthermore, it has been difficult to provide personalized information that reflects individual user preferences and past interactions.

[0710] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0711] In this invention, the server includes means for collecting auditory input from the user and converting it into text data, means for analyzing the text data using natural language processing technology and identifying the user's needs, and means for referring to information history and utilizing external information sources to obtain necessary information. This enables the user to quickly and accurately obtain personalized information and services starting from voice input.

[0712] A "user" refers to an individual or organization that uses a system to obtain information.

[0713] "Auditory input" refers to the process of supplying information or instructions using sound.

[0714] "Text data" refers to information in text format that has been converted based on speech or auditory input.

[0715] "Natural language processing technology" refers to the technology that enables computers to understand, analyze, and process human language.

[0716] "Needs" refer to the requests or information that users want to resolve through the system.

[0717] "Information history" refers to records of a user's past interactions and requests.

[0718] "External information sources" refer to databases or internet resources outside the system from which information is obtained.

[0719] "Reply" refers to the response generated by the system in response to a user's request.

[0720] "Notification" refers to the act of delivering a generated reply to the user via audio or other means.

[0721] A "profile" refers to a collection of information that records a user's preferences and past usage history.

[0722] This invention relates to an information provision system that receives everyday voice commands from a user, processes information based on those commands, and returns a voice response. The system functions through the cooperation of a mobile information terminal and a server.

[0723] First, the terminal has an audio input device (e.g., a microphone) to acquire voice input from the user. This voice signal is converted into text data using internal speech recognition software (e.g., a speech recognition API).

[0724] Next, the device uses a natural language processing engine (e.g., a natural language processing library) to analyze the text data and identify the user's needs. This process allows the system to accurately understand the information and tasks the user is seeking.

[0725] Once user needs are identified, the server collects the necessary information. The server retrieves appropriate data by referencing the information history and utilizing external sources (e.g., information provision APIs). It can also generate personalized information based on the user's past interactions.

[0726] Based on the acquired information, the server generates a reply to the user and converts this reply into audio data. The audio, generated using speech synthesis software (e.g., a speech synthesis engine), is then delivered to the user via the terminal.

[0727] This process allows users to quickly and accurately obtain information without relying on their vision. For example, if a user says, "Tell me about Italian restaurants nearby," the system analyzes the voice, collects appropriate restaurant information, and provides a voice response such as, "The nearest Italian restaurant is in front of the train station."

[0728] A concrete example of a prompt using a generative AI model would be: "Generate a response to the user's question, 'When will it rain next?'" This allows the system to simulate natural and contextually appropriate responses.
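
A prompt of this form could be assembled programmatically, for example as in the sketch below. The function build_prompt is a hypothetical helper, and the optional second line that passes retrieved information to the model is an assumption added for illustration; the disclosure itself gives only the single-sentence prompt.

```python
def build_prompt(question: str, retrieved_info: str = "") -> str:
    """Wrap the user's question, and optionally retrieved data, in an instruction."""
    prompt = f"Generate a response to the user's question, '{question}'"
    if retrieved_info:
        # Assumed extension: give the model the retrieved information as context.
        prompt += f"\nBase the response on this information: {retrieved_info}"
    return prompt
```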

[0729] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0730] Step 1:

[0731] The user speaks into the device's microphone and inputs their request by voice. This input is in the format of, for example, "Tell me today's news." The input data is a raw voice signal.

[0732] Step 2:

[0733] The terminal digitizes the audio signal acquired from the user. Speech recognition software is used to generate text data from this digital audio signal. In this step, features are extracted from the audio waveform and matched with known audio to convert them into accurate text. The input is an audio signal, and the output is text data.

[0734] Step 3:

[0735] The device passes the generated text data to a natural language processing engine. The engine analyzes this data to determine the user's intent. For example, if the word "news" is included, it determines that the user is seeking the latest news information. The input is text data, and the output is data that indicates the user's needs.

[0736] Step 4:

[0737] The server searches for relevant information based on the user's needs. Here, it retrieves the most suitable data from existing information history and external sources. For example, it might retrieve the latest news articles through a news API. The input is the user's needs, and the output is the retrieved information data.

[0738] Step 5:

[0739] The server generates a response to the user based on the acquired information data. Using a generative AI model, it creates a response in natural, human-like language. For example, "Today's top news is an article about economic growth." The input is information data, and the output is the response text.

[0740] Step 6:

[0741] The terminal processes the response text received from the server using a speech synthesis module and converts it into audio data. This audio data is played back to the user in real time. The input is the response text, and the output is the audio data.
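
The keyword-based judgment in Step 3 can be illustrated with a minimal sketch. Only the "news" rule comes from the description above; the table KEYWORD_NEEDS, the other keywords, the need labels, and the function identify_need are hypothetical, and a real natural language processing engine would use far more than keyword lookup.

```python
import re

# Hypothetical keyword-to-need table; only the "news" entry reflects the text.
KEYWORD_NEEDS = {
    "news": "latest_news",
    "weather": "weather_forecast",
    "rain": "weather_forecast",
}


def identify_need(text: str) -> str:
    """Return the need for the first known keyword in the text, else "unknown"."""
    words = re.findall(r"[a-z']+", text.lower())
    for word in words:
        if word in KEYWORD_NEEDS:
            return KEYWORD_NEEDS[word]
    return "unknown"
```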

[0742] (Application Example 1)

[0743] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0744] The problem that this invention aims to solve is to enable users in modern society to quickly and accurately access diverse information and services through voice input. In particular, it aims to improve convenience and efficiency in urban life by utilizing users' real-time information about their living environment.

[0745] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0746] In this invention, the server includes means for acquiring voice input from the user and converting it into text data, means for analyzing the text data using natural language processing to identify the user's requests, and means for acquiring information about the user's living environment in real time and providing optimal services. This enables users to quickly access the information they need and enables appropriate services to be provided based on that information.

[0747] "Voice input" is the process of acquiring information using the user's voice and converting it into a digital format.

[0748] "Natural language processing" is a technology that analyzes text data to understand the meaning of language and identify user requests.

[0749] "User requests" refer to the requests for information and services that users express through voice input.

[0750] "Living environment information" refers to real-time information related to the environment in which a user lives, including urban traffic conditions and the operational status of public services.

[0751] "Real-time" refers to processing and responding to the current state of affairs on the spot.

[0752] "Response generation" is the process of assembling appropriate information to provide to the user based on the information acquired.

[0753] A "profile" is a data structure used to record a user's interaction history and preferences, and to utilize this information for future service provision.

[0754] This invention relates to a system in which a user provides voice input via a mobile device such as a smartphone, and a service is provided based on that input. This system is implemented in the following form.

[0755] First, the device is equipped with a microphone to acquire voice input from the user. This voice input is converted into text data using speech recognition software such as the Google Speech-to-Text API. The device then parses the converted text data using a natural language processing library such as NLTK or spaCy to identify the user's request.

[0756] Based on user requests, the server utilizes various open APIs to acquire information about the living environment, such as traffic information and the operational status of public services, while referencing historical data. Furthermore, it generates the optimal response for the user based on this information, and converts this response into speech using the Google Text-to-Speech engine for notification. This allows users to obtain the necessary information without relying on visual information.

[0757] For example, if a user asks by voice, "Tell me the shortest route to my workplace," the terminal recognizes and analyzes the voice, and the server obtains real-time traffic information. The server calculates the most efficient route and notifies the user of the result by voice.

[0758] An example of a prompt for a generative AI model is: "Based on the user's voice instructions, design an application that contributes to improving life in a smart city by considering a process to provide optimal information and services."

[0759] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0760] Step 1:

[0761] The device acquires the user's voice input via the microphone. This voice data is sent to the Google Speech-to-Text API and converted into text data. The input is voice data, and the output is text data, which is the transcription of that voice.

[0762] Step 2:

[0763] The terminal analyzes the acquired text data using natural language processing libraries such as NLTK and spaCy. This process identifies the user's request. The input is the text data obtained in step 1, and the output is structured data containing the user's request.

[0764] Step 3:

[0765] The server receives structured data and, based on user requests, calls APIs to retrieve lifestyle information such as traffic information and public service information. The input is the user's request, and the output is a dataset containing the necessary information related to that request.

[0766] Step 4:

[0767] The server generates a response to the user based on the information it has obtained. This response is designed to include the most relevant information to the user's question. The input is the dataset obtained in step 3, and the output is a text message to notify the user.

[0768] Step 5:

[0769] The server converts the generated text message into speech using the Google Text-to-Speech engine and sends it to the device. The input is a text message, and the output is audio data that conveys the information to the user without relying on vision.

[0770] Step 6:

[0771] The device notifies the user of the received audio data through its speaker. The input is the audio data obtained in step 5, and the output is the audio information that the user can hear.
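
Steps 2 and 3 above, in which the request becomes structured data and is then used to select an information source, can be sketched as follows. This is a simplified illustration: a regular expression stands in for NLTK or spaCy, and StructuredRequest, parse_request, fetch_information, and the providers table are hypothetical names that do not correspond to any real API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StructuredRequest:
    intent: str            # e.g. "route" or "unknown"
    destination: str = ""  # filled only for route requests


def parse_request(text: str) -> StructuredRequest:
    """Step 2 stand-in: a regular expression instead of NLTK or spaCy."""
    match = re.search(r"route to (.+?)[.?!]*$", text.strip(), flags=re.IGNORECASE)
    if match:
        return StructuredRequest(intent="route", destination=match.group(1))
    return StructuredRequest(intent="unknown")


def fetch_information(
    request: StructuredRequest,
    providers: Dict[str, Callable[[StructuredRequest], dict]],
) -> dict:
    """Step 3: call the provider registered for the intent, if there is one."""
    provider = providers.get(request.intent)
    if provider is None:
        return {"status": "unsupported"}
    return provider(request)
```

Each entry in providers would wrap one external API call, such as a traffic information service, so that new request types can be added without changing the dispatch logic.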

[0772] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing using the user's emotions.

[0773] This invention relates to an advanced AI system that operates within a mobile device and has the ability to recognize user emotions and adjust interactions accordingly. By acquiring user voice input and incorporating an emotion engine, it is possible to detect emotions from voice data and use the results to generate personalized responses.

[0774] User interface and speech recognition

[0775] The device receives the user's voice through the microphone and converts the audio signal into a digital format. The speech recognition function receives this and processes it as text data, making it possible to analyze the content of the user's speech.

[0776] Emotion recognition and natural language processing

[0777] The device inputs voice data into an emotion engine, which analyzes features such as tone, speed, and pauses to recognize the user's emotional state. Simultaneously, a natural language processing engine analyzes text data to identify the content of the request.

[0778] Appropriate information acquisition and response generation

[0779] Based on the results of emotion recognition and information obtained from request analysis, the server retrieves appropriate data from internal databases and external sources to address the user's request. In this process, the communication style is adjusted according to the emotion information, and the tone and content of the response are customized.

[0780] Providing responses and adjusting to emotions

[0781] The server uses the acquired and generated information to create a response to the user and sends it to the terminal. The terminal uses speech synthesis to convert this response into an audio format and transmits it to the user. Furthermore, adjustments are made to the response to make it more acceptable, depending on the user's emotional state.

[0782] Personalized responses and profile updates

[0783] The server records and updates a profile of the user's interaction and emotional history. This update allows for more personalized responses in the future, improving the user experience. In particular, it has the ability to learn recurring request patterns and predict user needs.

[0784] Specific example

[0785] For example, if a user says "I'm tired today" in a slightly downcast tone, the device recognizes that emotion as "fatigue" or "stress." The server takes this information into account and generates a gentle response such as, "Why don't you take a short break?" In contrast, if a user says "What should I do tomorrow?" in a very cheerful voice, the server responds in an energetic tone, "The weather looks nice, so I recommend going out."

[0786] In this way, this system combines speech recognition, natural language processing, and emotion recognition to achieve flexible, context-aware interactions, providing a more user-friendly and effective user experience.

[0787] The following describes the processing flow.

[0788] Step 1:

[0789] The device acquires voice from the user through the microphone and converts the voice signal into digital data. This prepares it for the voice recognition process.

[0790] Step 2:

[0791] The device uses a speech recognition engine to convert digital speech data into text data, making the user's speech content analyzable.

[0792] Step 3:

[0793] The device inputs voice data into its emotion engine, which analyzes the tone, speed, and volume of the voice to recognize the user's emotions. This result is output as an emotion label.

[0794] Step 4:

[0795] The device uses a natural language processing engine to analyze text data and identify the intent behind the user's request. Keywords and other important information are then extracted.

[0796] Step 5:

[0797] Based on the emotion label and the request content, the server refers to the user's profile and searches internal databases and external sources for appropriate information. It collects relevant information and generates a response tailored to the user.

[0798] Step 6:

[0799] The server adjusts the generated response content, taking emotional information into account, and refines the response to suit the user's tone and voice.

[0800] Step 7:

[0801] The server sends the final response to the terminal, which then uses speech synthesis to convert the response into speech format. The terminal then provides the user with appropriate audio information through its speaker.

[0802] Step 8:

[0803] The server records the interaction results and user sentiment data in a profile, which is then updated to improve the quality of future interactions. This results in more personalized responses in subsequent interactions.
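
The adjustment in Step 6 can be illustrated with a minimal sketch in which the emotion label selects wording that is wrapped around the generated response. The table EMOTION_STYLES, its labels and phrases, and the function adjust_response are hypothetical; the disclosure does not specify how the adjustment is carried out.

```python
# Hypothetical mapping from emotion label to (prefix, suffix) wording.
EMOTION_STYLES = {
    "fatigue": ("You sound tired. ", ""),
    "stress": ("Take your time. ", ""),
    "joy": ("", " Have a great time!"),
}


def adjust_response(response: str, emotion_label: str) -> str:
    """Step 6: wrap the generated response in wording suited to the emotion label."""
    # Unknown labels fall back to the unmodified response.
    prefix, suffix = EMOTION_STYLES.get(emotion_label, ("", ""))
    return f"{prefix}{response}{suffix}"
```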

[0804] (Example 2)

[0805] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0806] Conventional interactive systems have struggled to generate appropriate responses that reflect the user's emotional state, making it difficult to optimize the user experience. Therefore, there is a need for technology that accurately recognizes user emotions and customizes responses accordingly.

[0807] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0808] In this invention, the server includes means for extracting features from audio signals and recognizing emotional states, means for converting audio signals into text data, and means for analyzing text data using natural language processing techniques and identifying requests. This enables the generation of dynamic and personalized responses that correspond to the user's emotional state.

[0809] An "audio signal" is input information used to represent sound as digital data.

[0810] A "digital signal" is a data format obtained by quantizing an analog signal, making it possible to process and analyze it using a computer.

[0811] "Emotional state" refers to a psychological state judged based on factors such as the user's way of speaking and tone of voice.

[0812] "Speech synthesis technology" is a technology that outputs text data as speech.

[0813] "Natural language processing technology" is a technology that enables computers to understand and analyze human language.

[0814] A "profile" is a database that records a user's past interactions and preferences, and is used to generate personalized responses.

[0815] A "source of information" is an internal or external database or API that is accessed to provide the requested information.

[0816] This invention is an interactive system that recognizes user emotions and optimizes responses based on those emotions. To achieve this, the user's device is equipped with hardware that acquires audio signals through a microphone and converts them into digital signals. Digital signal processing technology is used for this conversion.

[0817] The user's device uses the converted digital signal to convert speech into text data via a speech recognition module. A common speech recognition API is used as the software for this process. This text data is then analyzed using natural language processing techniques to identify the user's request. BERT models or equivalent techniques are used for natural language processing.

[0818] In parallel, the device extracts features such as voice tone and speed from the audio signal and inputs them into the emotion recognition engine. Here, an emotion analysis API is used as the algorithm for determining emotion.

[0819] The server retrieves appropriate information from internal and external sources based on the analyzed user requests and emotions. This information retrieval involves querying databases and making API calls to external services. Based on the retrieved information, a generative AI model is used to generate personalized responses for the user.

[0820] The response generated by the server is sent back to the user's terminal and output as speech using speech synthesis software. This allows the user to receive the audio output from their terminal. This entire process is also personalized, with adjustments made based on the user's emotional state.

[0821] For example, if a user says "I'm tired today," the device recognizes the fatigue expressed in the utterance, and the server can generate a gentle response such as "Why don't you take a short break?". If the user cheerfully says "What should I do tomorrow?", the server can generate a response along the lines of "The weather looks nice, so I recommend going out."

[0822] An example of a prompt might be: "Create a prompt for a system that takes user emotions into consideration and generates an appropriate response. Offer comfort to tired users and active suggestions to energetic users."
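
As a minimal sketch of how such a prompt could be assembled at run time, the following Python function combines the recognized text and the detected emotion into one instruction for the generative AI model. The function name `build_prompt`, the table `STYLE_BY_EMOTION`, and the emotion labels "tired" and "energetic" are assumptions made for illustration and do not appear in the disclosure.

```python
# Hypothetical mapping from a detected emotion label to a style instruction.
STYLE_BY_EMOTION = {
    "tired": "Offer comfort and suggest rest.",
    "energetic": "Offer active suggestions.",
}


def build_prompt(user_text: str, emotion: str) -> str:
    """Combine the user's words and detected emotion into one prompt string."""
    # Fall back to a neutral instruction when the emotion label is unknown.
    style = STYLE_BY_EMOTION.get(emotion, "Respond in a neutral, helpful tone.")
    return (
        "You are a voice assistant that takes the user's emotion into account.\n"
        f"Detected emotion: {emotion}\n"
        f"Style instruction: {style}\n"
        f"User said: {user_text}"
    )


prompt = build_prompt("I'm tired today", "tired")
```

Keeping the style instruction in a lookup table is one possible design; it lets the emotion categories change without touching the prompt template.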

[0823] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0824] Step 1:

[0825] The device acquires the user's voice through a microphone. This voice signal is input as an analog signal. The device uses digital signal processing technology to convert this analog signal into a digital format. Specifically, an analog-to-digital converter (ADC) samples and quantizes the signal to output a digital signal.
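
The sampling and quantization performed by the ADC can be illustrated in software. The sketch below is an assumption-laden model, not the device's actual hardware behavior: the 16 kHz sample rate, the 16-bit depth, the function name `sample_and_quantize`, and the 440 Hz test tone are all chosen here for illustration only.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz=16000, bit_depth=16):
    """Sample a continuous signal and quantize each sample to a signed integer.

    `signal` is a function of time in seconds returning an amplitude
    nominally in the range [-1.0, 1.0].
    """
    max_code = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz  # sampling: evaluate at discrete time points
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip to the input range
        samples.append(round(amplitude * max_code))  # quantization
    return samples


def tone(t):
    """Hypothetical input: a 440 Hz tone standing in for the microphone signal."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


pcm = sample_and_quantize(tone, duration_s=0.01)
```

The resulting list of integers corresponds to the digital signal that is passed to the speech recognition module in the next step.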

[0826] Step 2:

[0827] The digitized audio signal is sent to the terminal's speech recognition module, where it is converted into text data. The input is a digital signal, and the output is text data. Here, a speech recognition API is used, making the content of the speech parseable as a string of characters. Specifically, feature extraction and phoneme analysis are performed on the audio waveform.

[0828] Step 3:

[0829] The terminal passes text data to a natural language processing engine to identify the user's request. The input for this step is text data, and the output is semantic data representing the user's request. The BERT model is used for natural language processing, analyzing the request based on keywords and context contained in the text. Sentence structure analysis and intent interpretation are performed here.

[0830] Step 4:

[0831] Simultaneously, the device analyzes the tone and speed of the voice from the audio signal and inputs this information into the emotion recognition engine. The input to this process is the original audio signal, and the output is the emotional state. Acoustic features are extracted and classified into emotion categories. Specific operations include tone analysis and speed evaluation.
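
The classification of acoustic features into emotion categories can be sketched as follows. This is a toy rule-based stand-in for the emotion recognition engine, not the engine itself: the feature names, the thresholds, and the category labels are assumptions introduced for illustration, and a practical engine would typically be a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    mean_pitch_hz: float      # assumed proxy for voice "tone"
    syllables_per_sec: float  # assumed proxy for speaking "speed"


def classify_emotion(features: VoiceFeatures) -> str:
    """Map tone and speed features to a coarse emotion category."""
    # Low pitch combined with slow speech is treated as fatigue (assumed rule).
    if features.mean_pitch_hz < 150 and features.syllables_per_sec < 3.0:
        return "tired"
    # High pitch combined with fast speech is treated as high energy.
    if features.mean_pitch_hz > 220 and features.syllables_per_sec > 4.5:
        return "energetic"
    return "neutral"


label = classify_emotion(VoiceFeatures(mean_pitch_hz=120.0, syllables_per_sec=2.5))
```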

[0832] Step 5:

[0833] The server receives the user's request and emotional state, and retrieves appropriate information from an internal database or external source based on that information. In this step, the request and emotional state are input, and the retrieved information is output. Queries are executed against the database or external API calls are made.

[0834] Step 6:

[0835] Based on the acquired information, the server uses a generative AI model to generate a response that matches the user's emotions. In this step, the acquired information is the input, and the response text data is the output. The generative AI model generates the response as text data and adjusts the content and language style.

[0836] Step 7:

[0837] The response generated by the server is sent to the terminal and output as speech using speech synthesis technology. The input for this step is the text data of the response, and the output is the generated speech data. The speech synthesis software generates a speech waveform, which is transmitted to the user through the speaker.

[0838] Step 8:

[0839] The server records responses and user interaction history and updates the profile. Here, the user's request history and emotional state are the inputs, and the updated profile data is stored as the output. The profile is updated through database operations and is used to improve the accuracy of future responses.
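
The profile update in Step 8 can be sketched with an in-memory record in place of a database. The class `Profile`, its fields, and the function `update_profile` are hypothetical names introduced here; the disclosure does not specify a schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Profile:
    user_id: str
    history: list = field(default_factory=list)            # past interactions
    emotion_counts: Counter = field(default_factory=Counter)  # emotion tallies


def update_profile(profile: Profile, request: str, emotion: str, response: str) -> Profile:
    """Append one interaction to the profile and tally the observed emotion."""
    profile.history.append(
        {"request": request, "emotion": emotion, "response": response}
    )
    profile.emotion_counts[emotion] += 1
    return profile


profile = update_profile(
    Profile(user_id="u1"),
    request="What should I do tomorrow?",
    emotion="energetic",
    response="The weather looks nice, so I recommend going out.",
)
```

A tally such as `emotion_counts` is one simple way the stored history could inform later responses, for example by noting which emotional state is most frequent for a given user.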

[0840] (Application Example 2)

[0841] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0842] In care settings, there is a need for a system that can accurately grasp the emotional state of users and provide information to care staff in real time, thereby enabling care tailored to individual needs. However, conventional methods have the challenge of not being able to quickly capture changes in emotions and effectively utilize each user's emotional history.

[0843] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0844] In this invention, the server includes means for recognizing the user's emotional state and adjusting responses based on emotional information, means for referring to past information and utilizing external information sources to obtain necessary data, and means for recording and updating the user's interaction history as a profile. This makes it possible to grasp the user's emotions in real time and to propose specific care to care staff that is appropriate to the user's condition.

[0845] An "audio signal" is an electrical representation of sound, converting human voices and other sounds into data that can be processed in analog or digital format.

[0846] "Digital data" refers to audio signals and other information structured in binary representation, converted into a format that can be processed by a computer.

[0847] "Natural language processing" is a field of technology that enables computers to analyze, understand, and generate language that humans use on a daily basis.

[0848] "User needs" refer to the demands and desires that users have for the system, and these are the things the system should address.

[0849] "Emotional state" refers to the psychological and emotional state of a user, inferred from factors such as tone of voice, speed, and pauses.

[0850] An "external information source" is an information source used to obtain necessary information from databases, online resources, and other sources that exist outside the system.

[0851] A "profile" is a collection of user-specific information created based on a user's past interaction history and emotional state.

[0852] The system based on this invention recognizes the user's emotional state in real time and provides that information to care staff, thereby enabling care tailored to the individual needs of each user in the care setting.

[0853] Hardware and software configuration

[0854] The terminal is a smart device equipped with a high-performance microphone. It converts the voice signal into digital data and then converts the voice data into text using the Google Speech-to-Text API. The device uses the Emotion AI SDK to recognize emotional states and performs natural language processing using the Google NLP API.

[0855] Data processing and calculations

[0856] The server analyzes the acquired voice data to assess the user's emotional state. Based on the emotional information, it adjusts its response and provides information to care staff. It also creates a profile based on past data to predict the user's needs. It retrieves necessary data by referring to external sources and generates the most appropriate response for the user.

[0857] Specific example

[0858] For example, if a user says in a weak voice, "I'm not feeling well today," the device recognizes this emotion as "unpleasant," and the server instructs the care staff to "prepare for a medical check." Furthermore, based on the user's emotional history, the system can suggest a meal "suitable for their physical condition" for their next visit.
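
The mapping from a recognized emotion to an instruction for care staff can be sketched as a simple rule. The function `care_instruction`, the keyword list, and the fallback messages are assumptions for illustration; in the described system the response is generated by the server using the profile and a generative AI model rather than by fixed rules.

```python
def care_instruction(emotion: str, utterance: str) -> str:
    """Return a staff instruction for a recognized emotion and utterance."""
    # Hypothetical phrases that suggest a comment about physical condition.
    health_terms = ("not feeling well", "pain", "dizzy")
    mentions_health = any(term in utterance.lower() for term in health_terms)
    if emotion == "unpleasant" and mentions_health:
        return "Prepare for a medical check."
    if emotion == "unpleasant":
        return "Check in with the user."
    return "No action needed."


instruction = care_instruction("unpleasant", "I'm not feeling well today")
```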

[0859] Example of a prompt

[0860] "Analyze the user's emotions and propose the most suitable care method."

[0861] "Track comments about health conditions and guide actions to encourage medical attention."

[0862] The aim of these technologies is to enable more flexible and effective care delivery in nursing care settings.

[0863] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0864] Step 1:

[0865] The terminal acquires the user's speech as an audio signal through a high-performance microphone. The input is the user's speech, and the output is audio data in digital format. The audio signal is acquired as digital data and prepared as input data for the next step.

[0866] Step 2:

[0867] The device uses the Google Speech-to-Text API to convert the audio signal into text data. In this step, the input is the digital audio data obtained in the previous step, and the output is text data. The resulting text data represents the content of the user's speech and is prepared for emotion analysis.

[0868] Step 3:

[0869] The device passes the text data and voice tone information to the Emotion AI SDK, which analyzes the user's emotional state. The output is emotional information labeled as "positive," "negative," "neutral," etc. This emotional information is then used in the next step.

[0870] Step 4:

[0871] The server generates an appropriate response based on the emotional information obtained in the previous step. In this step, the emotional information and the user's profile are taken as input, and the output is the text data of the adjusted response. By utilizing the profile data, a customized response is prepared that takes past history into account.

[0872] Step 5:

[0873] The server performs speech synthesis to provide the generated response to the care staff, converting it into an audio format and sending it to the terminal. The input is the text response generated in the previous step, and the output is the response in audio format. This makes it possible to provide care staff with quick and accurate instructions in real time.

[0874] Step 6:

[0875] The server records the user's interaction history as a profile and updates it for use in generating future responses. The input is the emotional information and response history up to the previous step, and the output is the updated user profile. This improves the accuracy of continuous care for the user.
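
The flow of Steps 2 to 6 can be sketched as a single function whose stages are supplied as callables. This is an architectural illustration only: the function `run_care_pipeline`, its parameter names, and the stub stages in the usage example are assumptions, and the stubs stand in for the speech recognition API, the emotion analysis SDK, the response generator, and the speech synthesizer without calling any real service.

```python
from typing import Callable


def run_care_pipeline(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    analyze_emotion: Callable[[str, bytes], str],
    generate_response: Callable[[str, str, dict], str],
    synthesize: Callable[[str], bytes],
    profile: dict,
) -> bytes:
    """Run one utterance through the pipeline and update the profile in place."""
    text = speech_to_text(audio)                          # Step 2
    emotion = analyze_emotion(text, audio)                # Step 3
    response = generate_response(text, emotion, profile)  # Step 4
    speech = synthesize(response)                         # Step 5
    profile.setdefault("history", []).append(             # Step 6
        {"text": text, "emotion": emotion, "response": response}
    )
    return speech


# Usage with stub stages (all hypothetical).
profile = {}
speech = run_care_pipeline(
    b"\x00\x01",
    speech_to_text=lambda audio: "I'm not feeling well today",
    analyze_emotion=lambda text, audio: "negative",
    generate_response=lambda text, emotion, prof: f"[{emotion}] Please check on the user.",
    synthesize=lambda response: response.encode("utf-8"),
    profile=profile,
)
```

Passing the stages in as parameters keeps the control flow independent of any particular vendor API, which is consistent with the statement that the processing may be distributed across devices.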

[0876] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0877] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0878] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0879] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0880] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0881] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0882] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).

[0883] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0884] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0885] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
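
The property that emotions located close together on the map take similar values can be illustrated with a small numerical sketch. The code below does not implement the neural network; it only shows one possible way to construct target values that decay with distance on a two-dimensional map. The coordinates in `POSITIONS`, the Gaussian weighting, and the function `soft_targets` are assumptions for illustration and are not taken from Figure 9 or Figure 10.

```python
import math

# Hypothetical 2D positions of emotions on an emotion map.
POSITIONS = {
    "reassured": (1.0, 0.2),
    "calm": (1.1, 0.0),
    "confident": (1.0, -0.2),
    "displeasure": (0.0, -1.5),
}


def soft_targets(primary: str, sigma: float = 0.5) -> dict:
    """Return emotion values that decay with map distance from `primary`."""
    px, py = POSITIONS[primary]
    return {
        name: math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in POSITIONS.items()
    }


targets = soft_targets("calm")
```

With these assumed coordinates, "reassured" and "confident" receive values close to that of "calm", while the distant "displeasure" receives a value near zero.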

[0886] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0887] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0888] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0889] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0890] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0891] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0892] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0893] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0894] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0895] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0896] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0897] The following is further disclosed regarding the embodiments described above.

[0898] (Claim 1)

[0899] A means of acquiring voice input from a user and converting it into text data,

[0900] A means for analyzing the text data using natural language processing and identifying the user's request,

[0901] Based on the user's request, a means of obtaining necessary information by referring to past data and utilizing external information sources,

[0902] A means of generating a response to the user based on the acquired information, converting it into voice, and notifying the user,

[0903] A system including means for recording and updating the aforementioned responses and user interaction history as a profile.

[0904] (Claim 2)

[0905] The system according to claim 1, which acquires the user's location information and generates a response based on the current environment information.

[0906] (Claim 3)

[0907] The system according to claim 1, which detects and predicts regular user requests and prepares responses in advance.

[0908] "Example 1"

[0909] (Claim 1)

[0910] A means of collecting auditory input from users and converting it into text data,

[0911] A means for analyzing the aforementioned character data using natural language processing technology and identifying user needs,

[0912] Based on the user's needs, a means of obtaining necessary information by referring to the information history and utilizing external information sources,

[0913] A means of generating a reply to the user based on the acquired information, converting it to voice, and communicating it,

[0914] A system including means for recording and updating the aforementioned replies and user interaction history as a profile.

[0915] (Claim 2)

[0916] The system according to claim 1, which collects user location information and generates a response based on current environmental information.

[0917] (Claim 3)

[0918] The system according to claim 1, which detects and predicts the regular needs of users and prepares replies in advance.

[0919] "Application Example 1"

[0920] (Claim 1)

[0921] A means of acquiring voice input from a user and converting it into text data,

[0922] A means for analyzing the text data using natural language processing and identifying the user's request,

[0923] Based on the user's request, a means of obtaining necessary information by referring to past data and utilizing external information sources,

[0924] A means of acquiring information about the user's living environment in real time and providing the optimal service,

[0925] A means of generating a response to the user based on the acquired information, converting it into voice, and notifying the user,

[0926] A system including means for recording and updating the aforementioned responses and user interaction history as a profile.

[0927] (Claim 2)

[0928] The system according to claim 1, which generates a response based on the user's location information and current urban environment information.

[0929] (Claim 3)

[0930] The system according to claim 1, which predicts periodic requests based on the user's past behavior patterns, prepares responses in advance, and adapts to various environmental conditions.

[0931] "Example 2 of combining an emotion engine"

[0932] (Claim 1)

[0933] A means for acquiring audio signals from a user and converting them into digital signals,

[0934] A means for extracting voice characteristics from the aforementioned digital signal and recognizing emotional state,

[0935] A means for analyzing the aforementioned audio signal and converting it into text data,

[0936] A means for analyzing the text data using natural language processing technology and identifying the user's request,

[0937] A means for obtaining appropriate information from a source based on the aforementioned emotional state and needs,

[0938] A means of generating a response that corresponds to the user's emotions based on the acquired information, converting it into speech using speech synthesis technology, and notifying the user;

[0939] A system including means for recording the aforementioned responses and user interaction history and updating the profile.

[0940] (Claim 2)

[0941] The system according to claim 1, which acquires the user's location information and generates a response based on the current environmental conditions.

[0942] (Claim 3)

[0943] The system according to claim 1, which detects and predicts recurring user requests and generates responses in advance.

[0944] "Application example 2 when combining with an emotional engine"

[0945] (Claim 1)

[0946] A means of acquiring audio signals from a user and converting them into digital data,

[0947] A means for analyzing the aforementioned digital data using natural language processing and identifying user needs,

[0948] A means for recognizing the user's emotional state and adjusting responses based on emotional information,

[0949] Based on the user's needs, a means of obtaining necessary data by referring to past information and utilizing external information sources,

[0950] A means of generating a response to the user based on the acquired data, converting it into speech, and providing it,

[0951] A system including means for recording and updating the aforementioned responses and user interaction history as a profile.

[0952] (Claim 2)

[0953] The system according to claim 1, which recognizes the emotions of users in real time in a care setting and provides emotional information to care staff.

[0954] (Claim 3)

[0955] The system according to claim 1, which predicts the needs of a user based on their past emotional history and proposes appropriate care in a care setting.

[Explanation of symbols]

[0956]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A means of acquiring voice input from a user and converting it into text data,
A means for analyzing the text data using natural language processing and identifying the user's request,
Based on the user's request, a means of obtaining necessary information by referring to past data and utilizing external information sources,
A means of generating a response to the user based on the acquired information, converting it into voice, and notifying the user,
A system including means for recording and updating the aforementioned responses and user interaction history as a profile.

2. The system according to claim 1, which acquires the user's location information and generates a response based on the current environment information.

3. The system according to claim 1, which detects and predicts regular user requests and prepares responses in advance.