system
A voice-to-text system with generative AI and emotional recognition enhances information access for elderly and technologically unfamiliar users, offering personalized and emotionally responsive interactions.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-22
AI Technical Summary
Elderly people, young people, and technologically unfamiliar individuals face challenges in accessing information due to the complexity of digital devices, leading to information gaps and isolation, and existing voice recognition systems fail to provide personalized and emotionally responsive interactions.
A system that converts voice input into text, processes it using a generative AI model, and provides personalized and emotionally tailored responses through a natural interface, utilizing speech synthesis or visual display.
Enables intuitive information access and service utilization for technologically challenged users, providing personalized and emotionally resonant interactions, improving user experience and independence.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] Elderly people, young people, and people who are not comfortable with machines often have difficulty operating smartphones, and they face problems of information gaps and isolation because they cannot make full use of information technology. General digital devices are complex for these users and thus difficult to operate. This situation has become a significant barrier to accessing information in daily life and to using various services.
Means for Solving the Problems
[0005] This invention provides a system that acquires voice information as a digital signal and converts it into text, thereby enabling users to easily access information and services using natural speech. Furthermore, it transmits the converted text to a remote information processing device and notifies the user of the response information generated by the information processing device either audibly or visually. This allows even users who have difficulty operating smartphones to intuitively operate the device via voice.
[0006] "Voice information" refers to sound signals provided as user utterances, conveying information in the form of natural language.
[0007] A "digital signal" refers to a data format obtained by converting voice information into a format that can be handled by a processing device, and is usually represented as binary data.
[0008] "Text" refers to information in string format converted from digital signals, and is composed of characters in a format that humans can read and write.
[0009] A "remote information processing device" is a computer device that exists in a location separate from the user's device and communicates via a network to perform data analysis and response generation.
[0010] "Response information" refers to output data generated by an information processing device in the form of answers or information provided in response to user inquiries.
[0011] "Means of notifying the user by voice or visual means" refers to a function that plays the converted response information audibly or displays it on the screen to the user, and is an interface for the user to receive information.
Brief Description of the Drawings
[0012]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which combines an emotion engine.
Modes for Carrying Out the Invention
[0013] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.
[0014] First, the terms used in the following description will be explained.
[0015] In the following embodiments, a processor with a reference number (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0016] In the following embodiments, a RAM (Random Access Memory) with a reference number is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0017] In the following embodiments, a storage with a reference number is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0018] In the following embodiments, a communication I/F (Interface) with a reference number is an interface including a communication processor and an antenna, etc. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards including 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark), etc.
[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] This invention provides a system that enables intuitive information acquisition and service utilization using smartphones, targeting elderly and technologically unfamiliar users. This system combines voice recognition technology and generative AI to allow users to perform various operations via voice input.
[0034] First, the user launches the agent app installed on their smartphone and makes a voice inquiry. For example, a possible question might be, "Please tell me about nearby hospitals." This voice input is converted into a digital signal by the smartphone.
[0035] Next, the device converts the audio into text data and sends this text to a remotely located server over the network. The server receives this text and generates a response using a generative AI model. For example, based on the user's current location, it retrieves a list of the nearest hospitals and generates this information in text format.
[0036] The generated response is sent back to the terminal via the network and presented to the user. Notification methods include the terminal using speech synthesis to inform the user verbally or displaying the response on the screen. Furthermore, the server can customize the response based on the user profile so that the information is appropriately tailored to the user's needs.
[0037] For example, if a user needs weather forecast information, they can say "Tell me the weather for tomorrow" into their device. The voice is converted to text, and the server generates a response based on local weather information, such as "Tomorrow will be sunny, and the temperature will be 25 degrees Celsius." This allows the user to quickly and easily obtain the information they need.
[0038] This system aims to provide a user-friendly environment and reduce technical hurdles by utilizing a natural interface such as voice input.
[0039] The following describes the processing flow.
[0040] Step 1:
[0041] The user launches a smartphone app and makes a voice inquiry to request specific information. For example, they might say, "Tell me about events this weekend."
[0042] Step 2:
[0043] The device acquires this audio as a digital signal and converts it into text data using a speech recognition module.
[0044] Step 3:
[0045] The device sends the converted text data to the server over the internet. Here, the appropriate API endpoint is used to make the request.
[0046] Step 4:
[0047] The server analyzes the received text and runs a generative AI model to generate information that matches the user's inquiry. For example, it might refer to an event database to retrieve the desired information.
[0048] Step 5:
[0049] The server sends the generated response back to the terminal in text format. This includes the generated information and, if necessary, related data.
[0050] Step 6:
[0051] The device notifies the user of the received response data through speech synthesis or screen display. This allows the user to obtain information by voice or view details on the screen.
[0052] Step 7:
[0053] The user reviews the information provided and makes additional inquiries as needed. This process is repeated as requested by the user.
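The seven steps above can be illustrated with a minimal sketch. It assumes that speech recognition, the server round trip, and notification are each available as a simple function; the names Terminal, fake_recognize, and fake_server are hypothetical stand-ins and do not appear in this disclosure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Terminal:
    """Hypothetical terminal; the three callables stand in for real engines."""
    recognize: Callable[[bytes], str]      # Step 2: digital signal -> text
    send_to_server: Callable[[str], str]   # Steps 3-5: text -> response text
    notify: Callable[[str], None]          # Step 6: speech synthesis or display

    def handle_inquiry(self, digital_signal: bytes) -> str:
        text = self.recognize(digital_signal)
        response = self.send_to_server(text)
        self.notify(response)
        return response


def fake_recognize(signal: bytes) -> str:
    # Placeholder: pretend the signal already carries UTF-8 text.
    return signal.decode("utf-8")


def fake_server(text: str) -> str:
    # Placeholder for the generative AI model running on the server.
    return f"Here is what I found for: {text}"


shown: list[str] = []  # collects what would be spoken or displayed
terminal = Terminal(fake_recognize, fake_server, shown.append)

# Step 7: the user may repeat the cycle with follow-up inquiries.
for utterance in [b"Tell me about events this weekend.", b"Which one is closest?"]:
    terminal.handle_inquiry(utterance)
```

Keeping the three stages behind plain callables is only one possible design; it makes it easy to swap a placeholder for an actual speech recognition or speech synthesis engine.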
[0054] (Example 1)
[0055] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0056] Traditionally, when elderly or technologically unfamiliar users retrieve information using voice input, challenges have included the complexity of the operation and the fact that the information provided is general and does not meet individual needs. Furthermore, dissatisfaction has arisen regarding the accuracy and speed of information obtained through voice input. Therefore, there is a need for a system that offers simple and intuitive operation and the provision of personalized information tailored to the user's attributes and current location.
[0057] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0058] In this invention, the server includes means for acquiring voice data as an electronic signal, means for converting the electronic signal into text information, and means for transmitting the text information to a remote data processing device. This enables users to intuitively acquire information using voice and to be provided with information tailored to their individual needs.
[0059] "Voice data" refers to the electronic signal obtained by converting the voice input from the user.
[0060] An "electronic signal" is a digital representation of audio information, converted into a format that can be processed.
[0061] "Text information" refers to text data converted from electronic signals, in a format that can be processed by machines and programs.
[0062] A "data processing device" is a device that receives text information and processes it or generates responses using a generative AI model.
[0063] A "generative AI model" is an artificial intelligence model that learns from large amounts of data and generates text that sounds like it was written by a human.
[0064] "Response information" refers to the answer to the user generated by a data processing device based on text information.
[0065] "Location information" refers to geographical information used to determine the user's current location.
[0066] "Attribute information" refers to information related to an individual, including the user's age, preferences, and past behavioral history.
[0067] "Personalized information" refers to information tailored to the specific needs of a user, based on their attribute information and current location.
[0068] The program for this system aims to provide an environment where elderly and technologically unfamiliar users can intuitively acquire information through voice input. Its main components include terminals (smart devices), servers, a network environment, and a generative AI model.
[0069] The user first launches a dedicated app installed on their smart device and performs voice input. The voice input is captured by the smart device's microphone and converted into a digital signal. The speech recognition software used here converts the voice data into text information. For example, commonly available programs are often used as the speech recognition engine.
[0070] The terminal transmits the converted text information to the server via the internet. The server uses a generative AI model to generate appropriate response information based on the input text information. A large-scale language model is used as this generative AI model. The generated response information is automatically customized, taking into account the user's location and attribute information.
[0071] The generated response information is then transmitted back to the terminal via the network. The terminal notifies the user by either playing this response information as audio using speech synthesis technology or by displaying it on the screen. Existing speech synthesis engines are commonly used for speech synthesis. Furthermore, the information is personalized based on the user's age and preferences, and a mechanism is in place to prioritize the notification of important information.
[0072] A concrete example is when a user asks via voice, "Please tell me which cafes are within walking distance from my current location." In this case, the generative AI model selects appropriate cafe candidates based on the user's current location and provides the most suitable answer. This system is designed to allow users to quickly and accurately obtain the information they need without performing complex operations.
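One way the recognized text, location information, and attribute information might be combined before being passed to the generative AI model is sketched below. The UserContext fields, the build_prompt function, and the prompt wording are illustrative assumptions; the disclosure does not specify a prompt format.

```python
from dataclasses import dataclass, field


@dataclass
class UserContext:
    """Hypothetical container for location and attribute information."""
    latitude: float
    longitude: float
    age: int
    preferences: list[str] = field(default_factory=list)


def build_prompt(inquiry: str, ctx: UserContext) -> str:
    """Combine the recognized text with location and attribute information."""
    prefs = ", ".join(ctx.preferences) if ctx.preferences else "none given"
    return (
        "Answer the user's inquiry in short, plain sentences.\n"
        f"Current location: {ctx.latitude:.4f}, {ctx.longitude:.4f}\n"
        f"User age: {ctx.age}\n"
        f"User preferences: {prefs}\n"
        f"Inquiry: {inquiry}"
    )


ctx = UserContext(35.6812, 139.7671, 78, ["quiet places", "green tea"])
prompt = build_prompt(
    "Please tell me which cafes are within walking distance from my current location.",
    ctx,
)
print(prompt)
```

The coordinates and preferences are made-up sample values. In a real deployment the resulting string would be sent to whichever large-scale language model the server uses.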
[0073] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0074] Step 1:
[0075] The user launches a dedicated app on their smart device and provides voice input. This input is a prompt, such as "Tell me the weather nearby." The device acquires this voice as a digital signal through the microphone. Here, the input is voice data, and the output is a digital voice signal. Converting the voice data to a digital format enables subsequent processing.
[0076] Step 2:
[0077] The terminal inputs the acquired digital audio signal into speech recognition software, which converts it into text information. For example, Google® Speech-to-Text API or similar speech recognition technology is used. The input for this step is a digital audio signal, and the output is text information. This conversion makes the audio information machine-readable, preparing it for further processing.
[0078] Step 3:
[0079] The terminal sends the converted text information to the server over the network. The transmitted data includes the text information, user location information, and related attribute information. The input for this step is text information, and the output is the dataset sent to the server. This action ensures that the server has the necessary information to generate response information.
[0080] Step 4:
[0081] The server analyzes the received text information and uses a generative AI model to create appropriate response information. A large-scale language model is used here. Based on the prompt and location information, the model generates personalized responses, such as "The weather nearby is sunny." The input for this step is text information and related information, and the output is the generated response information. The use of the AI model enables responses that are relevant to the user's context.
[0082] Step 5:
[0083] The server sends the generated response information back to the terminal via the network in text format. The input for this step is the generated response information, and the output is the response data sent to the terminal. This prepares the generated information for the user to receive.
[0084] Step 6:
[0085] The terminal either plays the received response information as speech using speech synthesis technology or displays it on the screen. Widely available speech synthesis engines are often used for this purpose. The input here is the response data, and the output is a notification to the user (audio or visual information). This final step allows the user to receive information in a simple way and take the necessary action.
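Steps 3 to 5 above describe a dataset sent to the server and a text-format response sent back. The following sketch assumes JSON as the exchange format, which the disclosure does not specify; the field names and the functions make_request, handle_request, and stub_model are hypothetical.

```python
import json


def make_request(text: str, location: dict, attributes: dict) -> str:
    """Step 3: serialize the dataset the terminal sends to the server."""
    return json.dumps({"text": text, "location": location, "attributes": attributes})


def handle_request(body: str, generate) -> str:
    """Steps 4-5: parse the request, call the model, return a text-format response."""
    data = json.loads(body)
    for key in ("text", "location", "attributes"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    answer = generate(data["text"], data["location"], data["attributes"])
    return json.dumps({"response": answer})


def stub_model(text, location, attributes):
    # Stand-in for the large-scale language model; always gives the same answer.
    return "The weather nearby is sunny."


body = make_request(
    "Tell me the weather nearby.", {"lat": 35.68, "lon": 139.76}, {"age": 78}
)
reply = json.loads(handle_request(body, stub_model))
print(reply["response"])
```

Validating the required fields on the server side is a design choice of this sketch, so that a request lacking location or attribute information fails early instead of producing an unpersonalized answer.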
[0086] (Application Example 1)
[0087] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0088] The challenge is to enable elderly and technologically unfamiliar users to easily obtain life support information using voice commands and to live independently. Conventional technologies have insufficient voice recognition and generative AI capabilities, making it difficult to accurately meet the diverse needs of users. Furthermore, customized information provision is limited, and responses may be inadequate even in situations where rapid response is required.
[0089] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0090] In this invention, the server includes means for acquiring voice as a data signal, means for converting the data signal into character data, and means for providing life support information to the user. This enables intuitive information acquisition using voice and personalized service provision.
[0091] "Voice" refers to the human voice emitted by the user, and is an input method used to acquire information based on this voice.
[0092] A "data signal" is a signal that is converted from audio information into an electronic format and processed in a way that the system can understand.
[0093] "Character data" refers to data signals converted into strings of characters, which serve as the basis for the generative AI to interpret and generate responses.
[0094] A "data processing device" is a device that generates response information using character data, and examples include servers and cloud services.
[0095] "Response information" refers to information generated by a data processing device based on user input, and is the result provided to the user.
[0096] "Presenting aloud or visually" refers to a method of conveying generated response information to the user via speech synthesis technology or a display.
[0097] "Life support information" refers to information provided to improve the quality of life for users, and includes, for example, schedule management and reminder information.
[0098] "User attributes" refer to personal information such as the user's age, lifestyle, and preferences, and are criteria considered when individualizing the provision of information.
[0099] The system of this invention operates via a smartphone to facilitate the process by which users obtain life support information by voice.
[0100] The smartphone, acting as the terminal, uses its built-in microphone to convert the user's voice into a data signal. The smartphone then uses speech recognition software (e.g., Google Cloud Speech-to-Text) to convert this data signal into text data. This text data is then transmitted over the network to a remote data processing device. This data processing device is a server built on the cloud, which uses a generative AI model (e.g., OpenAI® API) to analyze the text data and generate response information.
[0101] The server customizes its response to match the user's current attributes. This allows for the provision of personalized information that takes into account, for example, the user's age and lifestyle. This response information is then sent back to the terminal via the network and presented to the user either as audio using speech synthesis technology (e.g., Amazon Polly) or as text displayed on the screen.
[0102] As a concrete example, consider the case where a user voice-inputs the prompt, "Tell me when to take my medication." This prompt is processed by the system, and a response such as "Please take your medication at 10 o'clock" is generated. By using this example prompt, users can easily obtain information to support their health management.
[0103] Example of a prompt: "What's the weather like tomorrow?"
[0104] This system design allows even elderly people unfamiliar with technology to operate it intuitively, enabling them to manage their daily lives more independently.
[0105] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0106] Step 1:
[0107] The user provides voice input by asking the device a question. The voice is picked up by the device's microphone and is first acquired as analog audio data.
[0108] Step 2:
[0109] The terminal converts the acquired audio data into a data signal. Using speech recognition software (e.g., a speech recognition API), this analog audio data is converted into a digital data signal, preparing the audio input for conversion into text.
[0110] Step 3:
[0111] The digital signal is converted into text data by the speech recognition module in the device. This process, which involves analyzing the digital audio signal and converting it into text, is typically done using tools such as Google Cloud Speech-to-Text.
[0112] Step 4:
[0113] The generated character data is sent to the server. The terminal uses the network to upload this character data to a remote data processing device. The character data is then passed directly to the server and becomes input data for the next processing step.
[0114] Step 5:
[0115] The server uses the received character data to begin analyzing and generating data using a generative AI model (e.g., OpenAI API). This analysis process involves internal calculations to understand the prompt and generate appropriate response information.
[0116] Step 6:
[0117] The generated response information is sent from the server to the terminal. The server packages the response information output by the generative AI model and sends it to the terminal via the network.
[0118] Step 7:
[0119] The device provides the user with the received response information either audibly or visually. Specifically, it plays the information back to the user as audio using Text-to-Speech technology, or displays it as visual information on the device's display.
[0120] This entire process allows users to quickly obtain accurate information and receive support that is helpful in their daily lives.
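Steps 1 and 2 above turn analog audio into a digital data signal. As a rough illustration only, the sketch below quantizes sample values in the range -1.0 to 1.0 into signed 16-bit PCM, a common digital audio representation. The disclosure does not name a sample format, so the 16-bit depth, the 16 kHz rate, and the function name quantize_pcm16 are assumptions.

```python
import math
import struct


def quantize_pcm16(samples: list[float]) -> bytes:
    """Map samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    out = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range input
        out += struct.pack("<h", int(round(clipped * 32767)))
    return bytes(out)


# A 440 Hz tone sampled at 16 kHz for 10 ms stands in for the microphone signal.
rate = 16_000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 100)]
pcm = quantize_pcm16(tone)
print(len(pcm), "bytes")
```

In practice the terminal's audio hardware and operating system perform this conversion; the sketch only shows what kind of data the speech recognition module in Step 3 would receive.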
[0121] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0122] This invention relates to an agent system incorporating an emotion engine that provides information through a natural interface via the user's voice. In particular, it aims to improve the user experience by recognizing the user's emotions from their voice and adjusting the information provided in consideration of those emotions.
[0123] First, the user launches an agent application on their smartphone and makes a voice inquiry. For example, they can ask a question like, "Tell me today's news." This voice is then converted into a digital signal by the smartphone.
[0124] Next, the converted audio data is analyzed by an emotion engine. The emotion engine used by the device analyzes the tone, tempo, and volume of the user's voice, and from this, recognizes the user's emotional state (happy, sad, angry, etc.). This information can also be compared with the user's history to form trends.
[0125] The text data and the analyzed emotion information are then sent to the server. The server uses generative AI to generate the most appropriate response based on the user's emotions and inquiry. For example, if the user is feeling down, it might provide news information accompanied by words of encouragement.
[0126] The generated response is sent back to the terminal and notified to the user using speech synthesis or displayed visually on the screen. This notification uses a friendly tone and phrases that respond to emotions, enabling more user-friendly information delivery.
[0127] For example, if a user asks "What's the weather going to be like next?" in a sad voice, the emotion engine will determine the user's emotions and provide a response that includes something uplifting, such as "The weather will be sunny next. There might even be a rainbow after the rain, how exciting!"
[0128] This invention enables interactions that reflect the user's emotional state, significantly improving the user's comfort level.
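The emotion engine described above works from the tone, tempo, and volume of the user's voice. The following toy sketch shows the idea with hand-written rules. It is not the emotion identification model 59: the VoiceFeatures fields, the thresholds, and the label set are all made-up assumptions for illustration, and a practical engine would more likely use a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical summary of one utterance."""
    pitch_hz: float        # tone: average fundamental frequency
    words_per_min: float   # tempo: speaking rate
    volume_db: float       # volume: average level relative to full scale


def estimate_emotion(f: VoiceFeatures) -> str:
    """Toy rules with made-up thresholds; not the patent's emotion model."""
    if f.volume_db > -10 and f.words_per_min > 170:
        return "angry"
    if f.pitch_hz > 220 and f.words_per_min > 140:
        return "happy"
    if f.volume_db < -30 and f.words_per_min < 110:
        return "sad"
    return "neutral"


quiet_and_slow = VoiceFeatures(pitch_hz=150, words_per_min=90, volume_db=-35)
print(estimate_emotion(quiet_and_slow))
```

The returned label is the kind of emotion information that would accompany the text data sent to the server.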
[0129] The following describes the processing flow.
[0130] Step 1:
[0131] The user launches the app on their smartphone and makes a voice inquiry. For example, they might say, "Tell me the latest movie reviews."
[0132] Step 2:
[0133] The device acquires the user's voice as a digital signal. Using speech recognition technology, the digital signal is converted into text data.
[0134] Step 3:
[0135] The device sends the converted text data to the emotion engine, which analyzes the voice parameters. It recognizes the user's emotions from the tone, tempo, and volume of their voice and records them as data.
[0136] Step 4:
[0137] The device sends the text data and emotion data to the server. The server receives this data and uses generative AI to generate a response that is appropriate to the user's emotions.
[0138] Step 5:
[0139] The server generates a response and sends it back to the terminal. During this process, it takes the emotion data into account and adjusts the tone and content accordingly. For example, if the user is happy, it might include humor.
[0140] Step 6:
[0141] The device notifies the user of the received response, either outputting it as speech using speech synthesis or displaying it visually on the screen.
[0142] Step 7:
[0143] Users will have a more satisfying experience by reviewing the information provided and receiving responses that match their emotions. They may make additional inquiries as needed.
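The processing flow in Steps 1 to 7 can be sketched as a single function that chains the stages together. The function name and the four callables passed in (speech recognition, emotion detection, response generation, and notification) are hypothetical placeholders; the disclosure does not prescribe this interface.

```python
from typing import Callable


def handle_inquiry(
    audio: bytes,
    recognize: Callable[[bytes], str],
    detect_emotion: Callable[[bytes], str],
    generate: Callable[[str, str], str],
    notify: Callable[[str], None],
) -> str:
    """Run one voice inquiry through the flow described in Steps 2 to 6."""
    text = recognize(audio)              # Step 2: speech signal -> text data
    emotion = detect_emotion(audio)      # Step 3: tone, tempo, volume -> emotion
    response = generate(text, emotion)   # Steps 4-5: server generates a response
    notify(response)                     # Step 6: speech synthesis or screen display
    return response
```

Passing the stages in as callables keeps the sketch independent of any particular speech recognition, emotion analysis, or generative AI service.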
[0144] (Example 2)
[0145] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0146] Conventional voice interface systems do not take into account the user's emotional state, resulting in information that does not align with the user's feelings and leading to low satisfaction. Furthermore, the lack of personalized information delivery makes it difficult to improve the user experience.
[0147] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0148] In this invention, the server includes means for acquiring voice data as a digital signal, means for analyzing the user's emotional state and generating information based on this using prompt sentences in an AI model, and means for providing personalized information according to the emotional state. This enables the presentation of optimal information according to the user's emotions, thereby improving the user experience.
[0149] "Audio data" refers to a signal that represents sound in digital format, converted into a form that can be processed by a computer.
[0150] "Text data" refers to audio data converted into written information, and is a format used in natural language processing and analysis.
[0151] "Emotional information" refers to information indicating the emotional state analyzed from the user's voice, and is estimated based on the tone, tempo, and volume of the user's voice.
[0152] An "information processing device" is a remotely installed computer system that analyzes data transmitted by a user and generates appropriate information.
[0153] A "generative AI model" is a type of artificial intelligence that uses machine learning techniques to generate human language and is used to provide the most appropriate response to user input.
[0154] A "prompt statement" is an input statement used to set specific conditions or contexts for a generative AI model, and is used to control the model's output.
[0155] "Personalized information delivery" is a process of providing information tailored to the user's emotional state and preferences, and is an approach aimed at delivering information that is more relevant to the user.
[0156] This invention realizes a system that provides more personalized information through a user voice interface. The user inputs voice data using a terminal such as a smartphone or voice-enabled device. The terminal converts the voice data into a digital signal and generates text data based on that digital signal.
[0157] The device is equipped with software for sentiment analysis, which generates sentiment information by analyzing the user's voice tone, tempo, and volume. This sentiment information, along with the user's speech content, is transmitted to the information processing device using a secure protocol.
[0158] The server-side information processing unit uses a generative AI model to generate the optimal response based on the received text data and emotional information. The generated response is adjusted according to the user's emotions, providing information that resonates with the user. The generative AI model receives specific response requests using prompts. Prompts such as "If the user is in an emotional state, what tone should the response be in regarding the information content?" are used.
[0159] For example, if a user asks "What's the weather like tomorrow?" in a cheerful voice, the device will interpret that emotion as "happy." The server will then generate a response such as "It's going to be sunny tomorrow! It's going to be a great day!" and notify the user. This enables the provision of more user-friendly information tailored to the user's emotions, improving the user experience. Furthermore, the generated information is also displayed on the device's screen, so the system of this invention can provide information using both audio and visual means.
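One way the server might turn the received text data and emotional information into a prompt is sketched here. The mapping from emotion labels to tones and the prompt wording are assumptions for illustration and are not taken from the disclosure.

```python
# Hypothetical mapping from a detected emotion label to a response tone.
TONE_BY_EMOTION = {
    "happy": "cheerful",
    "sad": "gentle and encouraging",
    "angry": "calm and respectful",
}


def build_prompt(user_text: str, emotion: str) -> str:
    """Compose a prompt that asks the generative AI model for an emotion-aware reply."""
    # Fall back to a neutral tone when the emotion label is not in the mapping.
    tone = TONE_BY_EMOTION.get(emotion, "neutral and friendly")
    return (
        f"The user's detected emotion is '{emotion}'.\n"
        f"Respond in a {tone} tone.\n"
        f"User inquiry: {user_text}"
    )
```

For the inquiry "What's the weather like tomorrow?" with the emotion "happy", this sketch would request a cheerful tone, in line with the example response given above.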
[0160] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0161] Step 1:
[0162] The user launches an agent application on a device such as a smartphone and performs voice input. For example, they might ask, "What's the weather like tomorrow?" The input is the user's voice, which the device converts into a digital signal via the microphone. This converted digital signal forms the basis for the next processing step.
[0163] Step 2:
[0164] The terminal converts the digital signal into text data. Speech recognition software is used to convert the content of the speech into text-based data. The input is a digitized speech signal, and the output is text data. This text data is then used for subsequent processing.
[0165] Step 3:
[0166] The device extracts emotional information from both text data and audio. This is done using an emotion analysis module, which identifies the user's emotional state by analyzing the tone, speed, and volume of the voice. The input is a digital audio signal, and the output is an emotion tag (e.g., "happy," "sad"). This emotional information is then included in the next data transmission.
[0167] Step 4:
[0168] The terminal bundles the obtained text data and sentiment information into packets and sends them to the information processing device. A secure communication protocol is used for transmission. The input is text data and sentiment information, and the output is the data packets sent to the server. This enables the next generation process.
[0169] Step 5:
[0170] The server uses a generative AI model to create a response based on the received data. The prompt text is "If the user's emotions are in an emotional state, what tone should the response be to the information content?" and the AI outputs an appropriate text response according to this instruction. Personalized information is generated through this process.
[0171] Step 6:
[0172] A response generated by the server is sent to the terminal. The terminal receives this response and notifies the user using a speech synthesis engine. The input is the generated text response, and the output is either synthesized speech or text displayed on the screen. This allows the user to receive information appropriate to their emotions.
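The bundling of text data and sentiment information in Step 4 can be illustrated as follows. JSON is an assumed wire format and the field names are hypothetical; the secure communication protocol (for example, TLS) is assumed to be handled by the transport layer and is not shown.

```python
import json


def build_packet(text: str, emotion: str) -> bytes:
    """Terminal side: bundle text data and an emotion tag into one payload."""
    payload = {"text": text, "emotion": emotion}
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable in the payload.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def parse_packet(packet: bytes) -> tuple[str, str]:
    """Server side: recover the text data and emotion tag from the payload."""
    payload = json.loads(packet.decode("utf-8"))
    return payload["text"], payload["emotion"]
```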
[0173] (Application Example 2)
[0174] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0175] Conventional speech recognition systems have the problem of not being able to provide information while taking into account the user's emotional state. Furthermore, while there is a need to provide appropriate feedback based on the user's emotions, current technology lacks systems that organically link emotion analysis and information provision. As a result, the user experience is limited and the quality of interaction suffers.
[0176] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0177] In this invention, the server includes means for acquiring voice data as digital information, means for analyzing the user's emotional state from the voice data, and means for adjusting response information based on the emotional state. This makes it possible to provide information that is tailored to the user's emotional state.
[0178] "Audio data" refers to sound signals acquired via a microphone based on the user's pronunciation.
[0179] "Digital information" refers to data obtained by converting analog audio data into numerical data, in a format that can be processed by electronic computers.
[0180] An "electronic computing device" is a computer or similar device used to analyze input data and perform information processing.
[0181] "Answer information" refers to the answer or response content generated by an electronic computer and provided to the user.
[0182] "Screen display" refers to a method of conveying information to users visually, and is performed using a display.
[0183] "User" refers to an individual who operates this system and obtains information.
[0184] "Emotional state" refers to the psychological or emotional condition exhibited by the user, and is recognized through the analysis of voice data.
[0185] "Adjustment" refers to the act of modifying or optimizing information based on specific circumstances or criteria.
[0186] To implement this invention, it is first necessary to equip a home robot with a microphone and speaker. When a user speaks to the robot and asks a question, the device acquires the voice data as digital information. This digital information is then converted into text using speech recognition software. At this stage, a common speech recognition service such as the Google Cloud Speech-to-Text API is used.
[0187] Next, the emotion engine analyzes the user's voice to determine their emotional state. Using the EmotionAPI or an equivalent emotion analysis API, the emotion engine analyzes the voice tone, rhythm, and volume to identify the user's emotions. The identified emotional state is sent to a computer and used as foundational data to generate optimal response information using a generative AI model.
[0188] The server utilizes generative AI such as OpenAI's GPT to generate response information that corresponds to the user's emotional state and the content of their conversation. The generative AI receives a prompt and is asked to respond in a way that suits the context. For example, if the user says, "I'm feeling tired today," the server will generate a friendly response such as, "How about trying to relax today? Shall I recommend some relaxation techniques?"
[0189] The generated response information is sent to the device and converted into speech using a speech synthesis API such as Amazon Polly. The spoken response is either transmitted to the user through the speaker or presented as visual information on the screen.
[0190] Example of a prompt:
[0191] "User comment: 'I'm not really in the mood today.' The emotion engine has detected sadness. Please respond with something friendly to cheer up the user."
[0192] Such systems allow users to enjoy natural, emotionally resonant interactions, increasing the convenience of consumer robots in everyday life.
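The example prompt in paragraph [0191] can be assembled from the transcribed user comment and the detected emotion, as in the following sketch. The function name is hypothetical. The closing instruction is copied from the example as-is, so it suits negative emotions such as sadness; other emotions would presumably call for a different instruction.

```python
def format_robot_prompt(user_comment: str, emotion: str) -> str:
    """Build a prompt in the style of the example in paragraph [0191]."""
    return (
        f"User comment: '{user_comment}' "
        f"The emotion engine has detected {emotion}. "
        "Please respond with something friendly to cheer up the user."
    )
```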
[0193] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0194] Step 1:
[0195] The terminal acquires the voice spoken by the user to the robot via a microphone. First, the voice data, as input, is converted into digital information. This process utilizes voice compression technology to convert analog voice signals into digital data. The output is the voice data in digital format.
[0196] Step 2:
[0197] The device converts the acquired digital speech information into text. This process utilizes the Google Cloud Speech-to-Text API to generate text from speech patterns. The input for this step is digital speech information, and the output is the transcribed user conversation. Specifically, the device uses a language model to convert phonemes into words.
[0198] Step 3:
[0199] The device uses an emotion engine to analyze the user's emotional state from their voice. The input consists of the text information obtained in step 2 and the tone and timing of the voice. Specifically, the EmotionAPI analyzes the voice features and outputs emotion labels such as joy and sadness. The output is information about the user's emotional state.
[0200] Step 4:
[0201] The terminal transmits text information and emotional state information to the computer. The input is the text from step 2 and the emotional state information from step 3. Based on this information, the computer generates an appropriate response through a generative AI model. The output is a list of answer candidates. Specifically, the computer uses OpenAI's GPT model to select the best answer based on the input prompt sentence.
[0202] Step 5:
[0203] The server selects the answer that best matches the user's emotions from the generated list of answer candidates and sends it to the terminal. The input is the list of answer candidates. The output is the adjusted answer. Specifically, it selects an answer that emphasizes emotionally appropriate feedback and suggestions.
[0204] Step 6:
[0205] The terminal converts the received response into speech using a speech synthesis API. The input is a processed response from a computer. The output is the spoken response. Specifically, it is made audible with a natural voice tone using services such as Amazon Polly.
[0206] Step 7:
[0207] The device either outputs the spoken response through its speaker or displays it as text on its screen. The input is the speech data produced in step 6. The output is the response information in a form that the user can perceive. Specifically, the device adjusts the speaker volume to play the response appropriately.
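The selection in Step 5 of the answer that best matches the user's emotion could, for instance, be approximated by scoring each candidate against cue words associated with the emotion. This keyword-based scoring and the cue-word table are assumptions made for this sketch; the disclosure does not state how the best match is determined.

```python
def select_answer(
    candidates: list[str],
    emotion: str,
    cues: dict[str, list[str]],
) -> str:
    """Return the candidate containing the most cue words for the given emotion.

    Ties, including the case where no cue word matches, resolve to the
    earliest candidate. An empty candidate list raises ValueError.
    """
    words = cues.get(emotion, [])

    def score(answer: str) -> int:
        lower = answer.lower()
        return sum(lower.count(w) for w in words)

    # max() returns the first item among those with the highest score.
    return max(candidates, key=score)
```

With the hypothetical cue table `{"tired": ["relax", "rest"]}`, a candidate such as "How about trying to relax today?" would be preferred over a purely factual answer when the detected state is "tired".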
[0208] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0209] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0210] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0211] [Second Embodiment]
[0212] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0213] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0214] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0215] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0216] The microphone 238 receives instructions from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0217] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0218] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0219] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0220] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0221] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0222] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0223] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0224] This invention provides a system that enables intuitive information acquisition and service utilization using smartphones, targeting elderly and technologically unfamiliar users. This system combines voice recognition technology and generative AI to allow users to perform various operations via voice input.
[0225] First, the user launches the agent app installed on their smartphone and makes a voice inquiry. For example, a possible question might be, "Please tell me about nearby hospitals." This voice input is converted into a digital signal by the smartphone.
[0226] Next, the device converts the audio into text data and sends this text to a remotely located server over the network. The server receives this text and generates a response using a generative AI model. For example, based on the user's current location, it retrieves a list of the nearest hospitals and generates this information in text format.
[0227] The generated response is sent back to the terminal via the network and notified to the user. Notification methods include the terminal using speech synthesis to inform the user verbally or displaying the response visually on the screen. Furthermore, the server can customize the response based on the user profile, so that the information is appropriately tailored to the user's needs.
[0228] For example, if a user needs weather forecast information, they can say "Tell me the weather for tomorrow" into their device. The voice is converted to text, and the server generates a response based on local weather information, such as "Tomorrow will be sunny, and the temperature will be 25 degrees Celsius." This allows the user to quickly and easily obtain the information they need.
[0229] This system aims to provide a user-friendly environment and reduce technical hurdles by utilizing a natural interface such as voice input.
[0230] The following describes the processing flow.
[0231] Step 1:
[0232] The user launches a smartphone app and makes a voice inquiry to request specific information. For example, they might say, "Tell me about events this weekend."
[0233] Step 2:
[0234] The device acquires this audio as a digital signal and converts it into text data using a speech recognition module.
[0235] Step 3:
[0236] The device sends the converted text data to the server over the internet. Here, the appropriate API endpoint is used to make the request.
[0237] Step 4:
[0238] The server analyzes the received text and runs a generative AI model to generate information that matches the user's inquiry. For example, it might refer to an event database to retrieve the desired information.
[0239] Step 5:
[0240] The server sends the generated response back to the terminal in text format. This includes the generated information and, if necessary, related data.
[0241] Step 6:
[0242] The device notifies the user of the received response data through speech synthesis or screen display. This allows the user to obtain information by voice or view details on the screen.
[0243] Step 7:
[0244] The user reviews the information provided and makes additional inquiries as needed. This process is repeated as requested by the user.
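The request to the API endpoint in Step 3 might be prepared as in the following sketch, which uses only the Python standard library. The endpoint URL and the JSON field name are hypothetical, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical endpoint; the disclosure does not name a URL.
API_URL = "https://example.com/api/inquiry"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Prepare an HTTPS POST request carrying the converted text data as JSON."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

In an actual terminal, the prepared request could then be sent with `urllib.request.urlopen`, and the text response from the server would be passed on to speech synthesis or screen display as in Step 6.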
[0245] (Example 1)
[0246] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0247] Traditionally, when elderly or technologically unfamiliar users retrieve information using voice input, challenges have included the complexity of the operation and the fact that the information provided is general and does not meet individual needs. Furthermore, dissatisfaction has arisen regarding the accuracy and speed of information obtained through voice input. Therefore, there is a need for a system that offers simple and intuitive operation and the provision of personalized information tailored to the user's attributes and current location.
[0248] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0249] In this invention, the server includes means for acquiring voice data as an electronic signal, means for converting the electronic signal into text information, and means for transmitting the text information to a remote data processing device. This enables users to intuitively acquire information using voice and to be provided with information tailored to their individual needs.
[0250] "Voice data" refers to the electronic signal obtained by converting the voice input from the user.
[0251] An "electronic signal" is a digital representation of audio information, converted into a format that can be processed.
[0252] "Textual information" refers to text data converted from electronic signals, in a format that can be processed by machines and programs.
[0253] A "data processing device" is a device that receives text information and processes it or generates responses using a generative AI model.
[0254] A "generative AI model" is an artificial intelligence model that learns from large amounts of data and generates text that sounds like it was written by a human.
[0255] "Response information" refers to the answer to the user generated by a data processing device based on textual information.
[0256] "Location information" refers to geographical information used to determine the user's current location.
[0257] "Attribute information" refers to information related to an individual, including the user's age, preferences, and past behavioral history.
[0258] "Personalized information" refers to information tailored to the specific needs of a user, based on their attribute information and current location.
[0259] The program for this system aims to provide an environment where elderly and technologically unfamiliar users can intuitively acquire information through voice input. Its main components include terminals (smart devices), servers, a network environment, and a generative AI model.
[0260] The user first launches a dedicated app installed on their smart device and performs voice input. The voice input is captured by the smart device's microphone and converted into a digital signal. The speech recognition software used here converts the voice data into text information. For example, a commonly available program can be used as the speech recognition engine.
[0261] The terminal transmits the converted text information to the server via the internet. The server uses a generative AI model to generate appropriate response information based on the input text information. A large-scale language model is used as this generative AI model. The generated response information is automatically customized, taking into account the user's location and attribute information.
[0262] The generated response information is then transmitted back to the terminal via the network. The terminal notifies the user by either playing this response information as audio using speech synthesis technology or by displaying it on the screen. Existing speech synthesis engines are commonly used for speech synthesis. Furthermore, the information is personalized based on the user's age and preferences, and a mechanism is in place to prioritize the notification of important information.
[0263] A concrete example is when a user asks via voice, "Please tell me which cafes are within walking distance from my current location." In this case, the generative AI model selects appropriate cafe candidates based on the user's current location and provides the most suitable answer. This system is designed to allow users to quickly and accurately obtain the information they need without performing complex operations.
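How candidates "within walking distance" might be narrowed down from the user's current location is sketched here using the haversine great-circle distance. The 1 km threshold, the function names, and the cafe data layout are assumptions for illustration; the disclosure leaves the selection method to the generative AI model.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cafes_within_walking_distance(
    user: tuple[float, float],
    cafes: list[tuple[str, float, float]],
    max_km: float = 1.0,  # assumed walking-distance threshold
) -> list[str]:
    """Return names of cafes within max_km of the user, nearest first."""
    measured = [(haversine_km(user[0], user[1], lat, lon), name) for name, lat, lon in cafes]
    return [name for dist, name in sorted(measured) if dist <= max_km]
```

The resulting short list could then be included in the prompt so that the generative AI model answers with nearby candidates only.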
[0264] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0265] Step 1:
[0266] The user launches a dedicated app on their smart device and provides voice input. This input is a prompt, such as "Tell me the weather nearby." The device acquires this voice as a digital signal through the microphone. Here, the input is voice data, and the output is a digital voice signal. Converting the voice data to a digital format enables subsequent processing.
[0267] Step 2:
[0268] The terminal inputs the acquired digital audio signal into speech recognition software, which converts it into text information. For example, the Google Speech-to-Text API or similar speech recognition technology is used. The input for this step is a digital audio signal, and the output is text information. This conversion makes the audio information machine-readable, preparing it for further processing.
[0269] Step 3:
[0270] The terminal sends the converted text information to the server over the network. The transmitted data includes the text information, user location information, and related attribute information. The input for this step is text information, and the output is the dataset sent to the server. This action ensures that the server has the necessary information to generate response information.
[0271] Step 4:
[0272] The server analyzes the received text information and uses a generative AI model to create appropriate response information. A large-scale language model is used here. Based on the prompt and location information, the model generates personalized responses, such as "The weather nearby is sunny." The input for this step is text information and related information, and the output is the generated response information. The use of the AI model enables responses that are relevant to the user's context.
[0273] Step 5:
[0274] The server sends the generated response information back to the terminal via the network in text format. The input for this step is the generated response information, and the output is the response data sent to the terminal. This prepares the generated information for the user to receive.
[0275] Step 6:
[0276] The terminal either plays the received response information as speech using speech synthesis technology or displays it on the screen. Widely available speech synthesis engines are typically used. The input here is the response data, and the output is a notification to the user (audio or visual information). This final step allows the user to receive information in a simple way and take the necessary action.
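The dataset sent to the server in Step 3, containing text information, location information, and attribute information, could be represented as in the following sketch. The class name, field names, and the use of JSON are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Inquiry:
    text: str                  # text information from speech recognition
    latitude: float            # user location information
    longitude: float
    attributes: dict = field(default_factory=dict)  # e.g. age, preferences


def to_json(inquiry: Inquiry) -> str:
    """Serialize the inquiry so the terminal can transmit it to the server."""
    return json.dumps(asdict(inquiry), ensure_ascii=False)
```

For example, `to_json(Inquiry("Tell me the weather nearby.", 35.68, 139.76, {"age": 72}))` yields a single JSON document that the server can combine with the prompt in Step 4.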
[0277] (Application Example 1)
[0278] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0279] A challenge is to enable elderly people and users unfamiliar with technology to easily obtain life support information by voice and to live their daily lives independently. In conventional technologies, the voice recognition function and generative AI technology are insufficient, making it difficult to accurately meet the diverse needs of users. In addition, customized information provision is limited, and the response may be inadequate in situations where quick responses are required.
[0280] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following respective means.
[0281] In this invention, the server includes means for acquiring voice as a data signal, means for converting the data signal into character data, and means for providing life support information to the user. Thereby, intuitive information acquisition using voice and individualized service provision become possible.
[0282] "Voice" refers to the human voice uttered by the user, and serves as the input means for acquiring information.
[0283] "Data signal" refers to voice information converted into electronic form, that is, a signal processed into a form that the system can understand.
[0284] "Character data" refers to the data signal converted into a character string, and is the data that serves as the basis for the generative AI to interpret the input and generate a response.
[0285] "Data processing device" is a device for generating response information using character data; examples include a server and a cloud service.
[0286] "Response information" is the information generated by the data processing device based on the input from the user, and is the result provided to the user.
[0287] "Presented by voice or visually" refers to conveying the generated response information to the user via voice synthesis technology or a display.
[0288] "Life support information" refers to information provided to improve the quality of life for users, and includes, for example, schedule management and reminder information.
[0289] "User attributes" refer to personal information such as the user's age, lifestyle, and preferences, and are criteria considered when individualizing the provision of information.
[0290] The system of this invention operates via a smartphone to facilitate the process by which users obtain life support information by voice.
[0291] The smartphone, acting as the terminal, uses its built-in microphone to convert the user's voice into a data signal. The smartphone then uses speech recognition software (e.g., Google Cloud Speech-to-Text) to convert this data signal into text data. This text data is then transmitted over the network to a remote data processing unit. This data processing unit is a server built on the cloud, which uses a generative AI model (e.g., OpenAI API) to analyze the text data and generate response information.
[0292] The server customizes its response to match the user's current attributes. This allows for the provision of personalized information that takes into account, for example, the user's age and lifestyle. This response information is then sent back to the terminal via the network and presented to the user either as audio using speech synthesis technology (e.g., Amazon Polly) or as text displayed on the screen.
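The attribute-based customization described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `UserAttributes` class, the `build_personalized_prompt` function, and the age threshold of 65 are assumptions introduced here, and the resulting string would be passed to whichever generative AI model the server uses.

```python
from dataclasses import dataclass, field


@dataclass
class UserAttributes:
    """Hypothetical container for the "user attributes" defined above."""
    age: int
    lifestyle: str
    preferences: list[str] = field(default_factory=list)


def build_personalized_prompt(question: str, attrs: UserAttributes) -> str:
    """Combine the recognized question with user attributes into one prompt."""
    lines = [
        "You are a life-support assistant.",
        f"User age: {attrs.age}",
        f"User lifestyle: {attrs.lifestyle}",
    ]
    if attrs.preferences:
        lines.append("User preferences: " + ", ".join(attrs.preferences))
    # Assumed rule for illustration: ask for simpler wording for older users.
    if attrs.age >= 65:
        lines.append("Answer in short, plain sentences.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_personalized_prompt(
    "Tell me when to take my medication.",
    UserAttributes(age=78, lifestyle="lives alone", preferences=["morning walks"]),
)
print(prompt)
```

Keeping the attributes in a separate structure means the same question can be re-issued for a different user without changing the prompt-building logic.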
[0293] As a concrete example, consider the case where a user voice-inputs the prompt, "Tell me when to take my medication." This prompt is processed by the system, and a response such as "Please take your medication at 10 o'clock" is generated. By using this example prompt, users can easily obtain information to support their health management.
[0294] Example of a prompt: "What's the weather like tomorrow?"
[0295] This system design allows even elderly people unfamiliar with technology to operate it intuitively, enabling them to manage their daily lives more independently.
[0296] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0297] Step 1:
[0298] The user inputs voice by asking the device a question. The input voice is picked up by the device's microphone and is first captured as analog audio data.
[0299] Step 2:
[0300] The terminal converts the acquired audio data into a data signal. Using speech recognition software (e.g., a speech recognition API), this analog audio data is converted into a digital data signal, which prepares the audio input for conversion into text.
[0301] Step 3:
[0302] The digital signal is converted into text data by the speech recognition module in the device. This process, which involves analyzing the digital audio signal and converting it into text, is typically done using tools such as Google Cloud Speech-to-Text.
[0303] Step 4:
[0304] The generated character data is sent to the server. The terminal uses the network to upload this character data to a remote data processing device. The character data is then passed directly to the server and becomes input data for the next processing step.
[0305] Step 5:
[0306] The server uses the received character data to start the analysis and generation of data using a generative AI model (e.g., OpenAI API). In this analysis process, internal operations are performed to understand the prompt text and generate appropriate response information.
[0307] Step 6:
[0308] The generated response information is sent from the server to the terminal. The server packages the response information output by the generative AI model and sends it to the terminal through the network.
[0309] Step 7:
[0310] The terminal provides the received response information to the user audibly or visually. Specifically, it is reproduced for the user as audio information using Text-to-Speech technology, or displayed as visual information on the terminal's display.
[0311] Through this entire process, the user can obtain accurate information quickly and can receive assistance useful in daily life.
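The flow of Steps 1 to 7 can be sketched as a single function with replaceable components. This is a hedged outline only: `run_voice_query`, `fake_stt`, and `fake_llm` are hypothetical names, and the stubs stand in for a real speech recognition service, generative AI model, and speech synthesis or display routine.

```python
from typing import Callable


def run_voice_query(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    generate_response: Callable[[str], str],
    present: Callable[[str], None],
) -> str:
    """Run Steps 2-7 with injected components (all hypothetical stand-ins)."""
    text = speech_to_text(audio)        # Steps 2-3: audio -> character data
    response = generate_response(text)  # Steps 4-6: server-side generation
    present(response)                   # Step 7: speech synthesis or display
    return response


# Stub components so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already words


def fake_llm(text: str) -> str:
    return f"(answer to: {text})"


shown: list[str] = []
result = run_voice_query(
    b"What's the weather like tomorrow?", fake_stt, fake_llm, shown.append
)
```

Passing the components in as callables keeps the terminal-side and server-side parts independent, so either one could be swapped for a real service without touching the flow itself.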
[0312] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion recognition model 59 and perform specific processing using the user's emotion.
[0313] The present invention is an agent system incorporating an emotion engine, and provides information through a natural interface via the user's voice. In particular, it aims to improve the user experience by recognizing the user's emotion from the voice and adjusting the information in consideration of that emotion.
[0314] First, the user launches the agent application on the smartphone and makes an inquiry by voice. For example, a question such as "Tell me today's news" is possible. This voice is converted into a digital signal by the terminal, which is the smartphone.
[0315] Next, the converted audio data is analyzed by an emotion engine. The emotion engine used by the device analyzes the tone, tempo, and volume of the user's voice, and from these recognizes the user's emotional state (happy, sad, angry, etc.). This information can also be compared with the user's history to identify trends.
[0316] The analyzed text data and sentiment information are then sent to the server. The server uses generative AI to generate the most appropriate response based on the user's emotions and inquiry. For example, if the user is feeling down, it might provide news information accompanied by words of encouragement.
[0317] The generated response is sent back to the terminal and notified to the user using speech synthesis or displayed visually on the screen. This notification uses a friendly tone and phrases that respond to emotions, enabling more user-friendly information delivery.
[0318] For example, if a user asks "What's the weather going to be like next?" in a sad voice, the emotion engine will determine the user's emotions and provide a response that includes something uplifting, such as "The weather will be sunny next. There might even be a rainbow after the rain, how exciting!"
[0319] This invention enables interactions that reflect the user's emotional state, significantly improving the user's comfort level.
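As a rough illustration of how tone, tempo, and volume might be mapped to an emotional state, the sketch below uses simple thresholds. The `VoiceFeatures` fields, the `estimate_emotion` function, and every threshold value are assumptions made for this example; the disclosure does not specify them, and a practical emotion engine would more likely use a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical acoustic summary of one utterance."""
    pitch_hz: float       # average pitch, standing in for "tone"
    words_per_min: float  # speaking rate, standing in for "tempo"
    volume_db: float      # average loudness


def estimate_emotion(f: VoiceFeatures) -> str:
    """Map features to a coarse label using made-up illustrative thresholds."""
    if f.volume_db >= 70 and f.words_per_min >= 170:
        return "angry"
    if f.pitch_hz >= 200 and f.words_per_min >= 140:
        return "happy"
    if f.volume_db <= 50 and f.words_per_min <= 110:
        return "sad"
    return "neutral"
```

For instance, under these assumed thresholds a quiet, slow utterance such as `VoiceFeatures(pitch_hz=180, words_per_min=100, volume_db=45)` is labeled "sad".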
[0320] The following describes the processing flow.
[0321] Step 1:
[0322] The user launches the app on their smartphone and makes a voice inquiry. For example, they might say, "Tell me the latest movie reviews."
[0323] Step 2:
[0324] The device acquires the user's voice as a digital signal. Using speech recognition technology, the digital signal is converted into text data.
[0325] Step 3:
[0326] The device sends the converted text data to the emotion engine, which analyzes the voice parameters. It recognizes the user's emotions from the tone, tempo, and volume of their voice and records them as data.
[0327] Step 4:
[0328] The device sends text data and sentiment data to the server. The server receives this data and uses generative AI to generate a response that is appropriate to the user's emotions.
[0329] Step 5:
[0330] The server generates a response and sends it back to the terminal. During this process, it takes emotional data into account and adjusts the tone and content accordingly. For example, if the user is happy, it might include humor.
[0331] Step 6:
[0332] The device notifies the user of the received response information either by outputting it as speech using speech synthesis or by displaying it visually on the screen.
[0333] Step 7:
[0334] Users will have a more satisfying experience by reviewing the information provided and receiving responses that match their emotions. They may make additional inquiries as needed.
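Step 3 records the recognized emotion as data, and the description above notes that this data can be compared with the user's history. One hedged way to keep such a history is a fixed-size window of recent labels; the `EmotionHistory` class and its window size are hypothetical and shown only to make the idea concrete.

```python
from collections import Counter, deque
from typing import Optional


class EmotionHistory:
    """Keeps the most recent emotion labels for one user (hypothetical helper)."""

    def __init__(self, window: int = 5) -> None:
        # deque(maxlen=...) silently drops the oldest label once the window is full.
        self._recent: deque = deque(maxlen=window)

    def record(self, emotion: str) -> None:
        self._recent.append(emotion)

    def trend(self) -> Optional[str]:
        """Return the most frequent recent label, or None if nothing is recorded."""
        if not self._recent:
            return None
        return Counter(self._recent).most_common(1)[0][0]
```

The returned trend could, for example, be sent to the server alongside the current emotion so that a response takes the user's recent mood into account.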
[0335] (Example 2)
[0336] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0337] Conventional voice interface systems do not take into account the user's emotional state, resulting in information that does not align with the user's feelings and leading to low satisfaction. Furthermore, the lack of personalized information delivery makes it difficult to improve the user experience.
[0338] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0339] In this invention, the server includes means for acquiring voice data as a digital signal, means for analyzing the user's emotional state and generating information based on this using prompt sentences in an AI model, and means for providing personalized information according to the emotional state. This enables the presentation of optimal information according to the user's emotions, thereby improving the user experience.
[0340] "Audio data" refers to a signal that represents sound in digital format, converted into a form that can be processed by a computer.
[0341] "Text data" refers to audio data converted into written information, and is a format used in natural language processing and analysis.
[0342] "Emotional information" refers to information indicating the emotional state analyzed from the user's voice, and is estimated based on the tone, tempo, and volume of the user's voice.
[0343] An "information processing device" is a remotely installed computer system that analyzes data transmitted by a user and generates appropriate information.
[0344] A "generative AI model" is a type of artificial intelligence that uses machine learning techniques to generate human language and is used to provide the most appropriate response to user input.
[0345] A "prompt statement" is an input statement used to set specific conditions or contexts for a generative AI model, and is used to control the model's output.
[0346] "Personalized information delivery" is a process of providing information tailored to the user's emotional state and preferences, and is an approach aimed at delivering information that is more relevant to the user.
[0347] This invention realizes a system that provides more personalized information through a user voice interface. The user inputs voice data using a terminal such as a smartphone or voice-enabled device. The terminal converts the voice data into a digital signal and generates text data based on that digital signal.
[0348] The device is equipped with software for sentiment analysis, which generates sentiment information by analyzing the user's voice tone, tempo, and volume. This sentiment information, along with the user's speech content, is transmitted to the information processing device using a secure protocol.
[0349] The server-side information processing unit uses a generative AI model to generate the optimal response based on the received text data and emotional information. The generated response is adjusted according to the user's emotions, providing information that resonates with the user. The generative AI model receives specific response requests in the form of prompts. For example, a prompt such as "If the user is in a given emotional state, what tone should the response to the information content take?" is used.
[0350] For example, if a user asks "What's the weather like tomorrow?" in a cheerful voice, the device will interpret that emotion as "happy." The server will then generate a response such as "It's going to be sunny tomorrow! It's going to be a great day!" and notify the user. This enables the provision of more user-friendly information tailored to the user's emotions, improving the user experience. Furthermore, the generated information is also displayed on the device's screen, so the system of this invention can provide information using both audio and visual means.
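The use of a prompt statement to steer the tone of the response can be sketched as a small template. The `TONE_BY_EMOTION` table and the `build_emotion_prompt` function are assumptions introduced for illustration; the emotion labels follow the examples in the text, but the tone wording is invented here.

```python
# Hypothetical tone guidance per emotion label.
TONE_BY_EMOTION = {
    "happy": "cheerful and upbeat",
    "sad": "gentle and encouraging",
    "angry": "calm and concise",
}


def build_emotion_prompt(user_text: str, emotion: str) -> str:
    """Fill a prompt statement with the utterance and the estimated emotion."""
    # Unknown labels fall back to a neutral tone rather than raising an error.
    tone = TONE_BY_EMOTION.get(emotion, "neutral and polite")
    return (
        f'The user said: "{user_text}"\n'
        f"Estimated emotional state: {emotion}\n"
        f"Respond in a {tone} tone."
    )
```

With the weather question above and the label "happy", the last line of the prompt becomes "Respond in a cheerful and upbeat tone."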
[0351] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0352] Step 1:
[0353] The user launches an agent application on a device such as a smartphone and performs voice input. For example, they might ask, "What's the weather like tomorrow?" The input is the user's voice, which the device converts into a digital signal via the microphone. This converted digital signal forms the basis for the next processing step.
[0354] Step 2:
[0355] The terminal converts the digital signal into text data. Speech recognition software is used to convert the content of the speech into text-based data. The input is a digitized speech signal, and the output is text data. This text data is then used for subsequent processing.
[0356] Step 3:
[0357] The device extracts emotional information from both text data and audio. This is done using an emotion analysis module, which identifies the user's emotional state by analyzing the tone, speed, and volume of the voice. The input is a digital audio signal, and the output is an emotion tag (e.g., "happy," "sad"). This emotional information is then included in the next data transmission.
[0358] Step 4:
[0359] The terminal bundles the obtained text data and sentiment information into packets and sends them to the information processing device. A secure communication protocol is used for transmission. The input is text data and sentiment information, and the output is the data packets sent to the server. This enables the next generation process.
[0360] Step 5:
[0361] The server uses a generative AI model to create a response based on the received data. The prompt text is, for example, "If the user is in a given emotional state, what tone should the response to the information content take?" and the AI outputs an appropriate text response according to this instruction. Personalized information is generated through this process.
[0362] Step 6:
[0363] A response generated by the server is sent to the terminal. The terminal receives this response and notifies the user using a speech synthesis engine. The input is the generated text response, and the output is either synthesized speech or text displayed on the screen. This allows the user to receive information appropriate to their emotions.
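The bundling in Step 4 can be illustrated with a simple JSON payload. This is only a sketch under assumptions: the field names `text` and `emotion` and the functions `bundle_packet` and `unbundle_packet` are hypothetical, and the secure communication protocol itself (for example, TLS at the transport layer) is outside the scope of the example.

```python
import json


def bundle_packet(text: str, emotion: str) -> bytes:
    """Serialize text data and an emotion tag into one UTF-8 JSON payload."""
    payload = {"text": text, "emotion": emotion}
    # ensure_ascii=False keeps non-ASCII utterances (e.g., Japanese) readable.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def unbundle_packet(packet: bytes) -> tuple[str, str]:
    """Server-side counterpart: recover the text and the emotion tag."""
    payload = json.loads(packet.decode("utf-8"))
    return payload["text"], payload["emotion"]
```

Sending both fields in one payload keeps the utterance and its emotion tag together, so the server never has to match them up after the fact.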
[0364] (Application Example 2)
[0365] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0366] Conventional speech recognition systems have the problem of not being able to provide information while taking into account the user's emotional state. Furthermore, while there is a need to provide appropriate feedback based on the user's emotions, current technology lacks systems that organically link emotion analysis and information provision. As a result, the user experience is limited and the quality of interaction suffers.
[0367] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0368] In this invention, the server includes means for acquiring voice data as digital information, means for analyzing the user's emotional state from the voice data, and means for adjusting response information based on the emotional state. This makes it possible to provide information that is tailored to the user's emotional state.
[0369] "Audio data" refers to sound signals acquired via a microphone based on the user's pronunciation.
[0370] "Digital information" refers to data obtained by converting analog audio data into numerical data, in a format that can be processed by electronic computers.
[0371] An "electronic computing device" is a computer or similar device used to analyze input data and perform information processing.
[0372] "Answer information" refers to the answer or response content generated by an electronic computer and provided to the user.
[0373] "Screen display" refers to a method of conveying information to users visually, and is performed using a display.
[0374] "User" refers to an individual who operates this system and obtains information.
[0375] "Emotional state" refers to the psychological or emotional condition exhibited by the user, and is recognized through the analysis of voice data.
[0376] "Adjustment" refers to the act of modifying or optimizing information based on specific circumstances or criteria.
[0377] To implement this invention, it is first necessary to equip a home robot with a microphone and speaker. When a user speaks to the robot and asks a question, the device acquires the voice data as digital information. This digital information is then converted into text using speech recognition software. At this stage, a common speech recognition service such as the Google Cloud Speech-to-Text API is used.
[0378] Next, the emotion engine analyzes the user's voice to determine their emotional state. Using the EmotionAPI or an equivalent emotion analysis API, the emotion engine analyzes the voice tone, rhythm, and volume to identify the user's emotions. The identified emotional state is sent to a computer and used as foundational data to generate optimal response information using a generative AI model.
[0379] The server utilizes generative AI such as OpenAI's GPT to generate response information that corresponds to the user's emotional state and the content of their conversation. The generative AI receives prompts and is expected to respond in a manner appropriate to the context. For example, if the user says "I'm feeling tired today," the server will generate a friendly response such as, "How about trying to relax today? Shall I recommend some relaxation techniques?"
[0380] The generated response information is sent to the device and converted into speech using a speech synthesis API such as Amazon Polly. The spoken response is either transmitted to the user through the speaker or presented as visual information on the screen.
[0381] Example of a prompt:
[0382] "User comment: 'I'm not really in the mood today.' The emotion engine has detected sadness. Please respond with something friendly to cheer up the user."
[0383] Such systems allow users to enjoy natural, emotionally resonant interactions, increasing the convenience of consumer robots in everyday life.
[0384] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0385] Step 1:
[0386] The terminal acquires the voice spoken by the user to the robot via a microphone. First, the voice data, as input, is converted into digital information. This process utilizes voice compression technology to convert analog voice signals into digital data. The output is the voice data in digital format.
[0387] Step 2:
[0388] The device converts the acquired digital speech information into text. This process utilizes the Google Cloud Speech-to-Text API to generate text from speech patterns. The input for this step is digital speech information, and the output is the transcribed user conversation. Specifically, the device uses a language model to convert phonemes into words.
[0389] Step 3:
[0390] The device uses an emotion engine to analyze the user's emotional state from their voice. The input consists of the text information obtained in step 2 and the tone and timing of the voice. Specifically, the EmotionAPI analyzes the voice features and outputs emotion labels such as joy and sadness. The output is information about the user's emotional state.
[0391] Step 4:
[0392] The terminal transmits text information and emotional state information to the computer. The input is the text from step 2 and the emotional state information from step 3. Based on this information, the computer generates an appropriate response through a generative AI model. The output is a list of answer candidates. Specifically, the computer uses OpenAI's GPT model to select the best answer based on the input prompt sentence.
[0393] Step 5:
[0394] The server selects the answer that best matches the user's emotions from the generated list of answer candidates and sends it to the terminal. The input is the list of answer candidates. The output is the adjusted answer. Specifically, it selects an answer that emphasizes emotionally appropriate feedback and suggestions.
[0395] Step 6:
[0396] The terminal converts the received response into speech using a speech synthesis API. The input is a processed response from a computer. The output is the spoken response. Specifically, it is made audible with a natural voice tone using services such as Amazon Polly.
[0397] Step 7:
[0398] The device either outputs the spoken response through its speaker or displays it as text on its screen. The input is the speech data produced in step 6. The output is the response information in a form that the user can perceive. Specifically, the device adjusts the speaker volume to play the response appropriately.
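The selection in Step 5 of the answer that best matches the user's emotion can be sketched as a scoring pass over the candidate list. The cue-word table `CUES` and the function `select_answer` are assumptions made for this example; the disclosure does not state how candidates are ranked, and simple substring matching is used here only to keep the sketch short.

```python
# Hypothetical cue words per emotion label, used only to rank candidates.
CUES = {
    "sad": ("relax", "rest", "take care", "cheer"),
    "happy": ("great", "enjoy", "fun"),
}


def select_answer(candidates: list[str], emotion: str) -> str:
    """Return the candidate containing the most cue words for the emotion."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    cues = CUES.get(emotion, ())

    def score(answer: str) -> int:
        lowered = answer.lower()
        return sum(cue in lowered for cue in cues)

    # max() keeps the first candidate on ties, so generation order breaks ties.
    return max(candidates, key=score)
```

When the emotion label is unknown, every candidate scores zero and the first generated candidate is returned unchanged.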
[0399] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0400] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0401] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0402] [Third Embodiment]
[0403] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0404] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0405] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0406] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0407] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0408] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0409] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0410] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0411] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0412] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0413] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0414] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0415] This invention provides a system that enables intuitive information acquisition and service utilization using smartphones, targeting elderly and technologically unfamiliar users. This system combines voice recognition technology and generative AI to allow users to perform various operations via voice input.
[0416] First, the user launches the agent app installed on their smartphone and makes a voice inquiry. For example, a possible question might be, "Please tell me about nearby hospitals." This voice input is converted into a digital signal by the smartphone.
[0417] Next, the device converts the audio into text data and sends this text to a remotely located server over the network. The server receives this text and generates a response using a generative AI model. For example, based on the user's current location, it retrieves a list of the nearest hospitals and generates this information in text format.
[0418] The generated response is sent back to the terminal via the network and notified to the user. Notification methods include the terminal using speech synthesis to inform the user verbally or displaying the response visually on the screen. Furthermore, the server can customize the response based on the user profile so that the information is appropriately tailored to the user's needs.
[0419] For example, if a user needs weather forecast information, they can say "Tell me the weather for tomorrow" into their device. The voice is converted to text, and the server generates a response based on local weather information, such as "Tomorrow will be sunny, and the temperature will be 25 degrees Celsius." This allows the user to quickly and easily obtain the information they need.
[0420] This system aims to provide a user-friendly environment and reduce technical hurdles by utilizing a natural interface such as voice input.
[0421] The following describes the processing flow.
[0422] Step 1:
[0423] The user launches a smartphone app and makes a voice inquiry to request specific information. For example, they might say, "Tell me about events this weekend."
[0424] Step 2:
[0425] The device acquires this audio as a digital signal and converts it into text data using a speech recognition module.
[0426] Step 3:
[0427] The device sends the converted text data to the server over the internet. Here, the appropriate API endpoint is used to make the request.
[0428] Step 4:
[0429] The server analyzes the received text and runs a generative AI model to generate information that matches the user's inquiry. For example, it might refer to an event database to retrieve the desired information.
[0430] Step 5:
[0431] The server sends the generated response back to the terminal in text format. This includes the generated information and, if necessary, related data.
[0432] Step 6:
[0433] The device notifies the user of the received response data through speech synthesis or screen display. This allows the user to obtain information by voice or view details on the screen.
[0434] Step 7:
[0435] The user reviews the information provided and makes additional inquiries as needed. This process is repeated as requested by the user.
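Step 3 above sends the text data to the server through an API endpoint. The following sketch prepares such a request with Python's standard `urllib.request` module without actually sending it. The endpoint URL is a placeholder, and the JSON field name `text` and the function `build_inquiry_request` are assumptions for illustration.

```python
import json
import urllib.request

# Placeholder endpoint, not a real service.
API_ENDPOINT = "https://example.com/api/v1/inquiry"


def build_inquiry_request(
    text: str, endpoint: str = API_ENDPOINT
) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST carrying the recognized text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_inquiry_request("Tell me about events this weekend.")
# Sending would be: urllib.request.urlopen(request, timeout=10)
```

Using an `https` endpoint means the exchange is encrypted in transit, which is in line with the secure exchange of information between the terminal and the server described for the communication interfaces.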
[0436] (Example 1)
[0437] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0438] Traditionally, when elderly or technologically unfamiliar users retrieve information using voice input, challenges have included the complexity of the operation and the fact that the information provided is general and does not meet individual needs. Furthermore, dissatisfaction has arisen regarding the accuracy and speed of information obtained through voice input. Therefore, there is a need for a system that offers simple and intuitive operation and the provision of personalized information tailored to the user's attributes and current location.
[0439] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0440] In this invention, the server includes means for acquiring voice data as an electronic signal, means for converting the electronic signal into text information, and means for transmitting the text information to a remote data processing device. This enables users to intuitively acquire information using voice and to be provided with information tailored to their individual needs.
[0441] "Voice data" refers to the electronic signal obtained by converting the voice input from the user.
[0442] An "electronic signal" is a digital representation of audio information, converted into a format that can be processed.
[0443] "Textual information" refers to text data converted from electronic signals, in a format that can be processed by machines and programs.
[0444] A "data processing device" is a device that receives text information and processes it or generates responses using a generative AI model.
[0445] A "generative AI model" is an artificial intelligence model that learns from large amounts of data and generates text that sounds like it was written by a human.
[0446] "Response information" refers to the answer to the user generated by a data processing device based on textual information.
[0447] "Location information" refers to geographical information used to determine the user's current location.
[0448] "Attribute information" refers to information related to an individual, including the user's age, preferences, and past behavioral history.
[0449] "Personalized information" refers to information tailored to the specific needs of a user, based on their attribute information and current location.
[0450] The program for this system aims to provide an environment where elderly and technologically unfamiliar users can intuitively acquire information through voice input. Its main components include terminals (smart devices), servers, a network environment, and a generative AI model.
[0451] The user first launches a dedicated app installed on their smart device and performs voice input. The voice input is captured by the smart device's microphone and converted into a digital signal. The speech recognition software used here converts the voice data into text information. For example, commonly available programs are often used as the speech recognition engine.
[0452] The terminal transmits the converted text information to the server via the internet. The server uses a generative AI model to generate appropriate response information based on the input text information. A large-scale language model is used as this generative AI model. The generated response information is automatically customized, taking into account the user's location and attribute information.
[0453] The generated response information is then transmitted back to the terminal via the network. The terminal notifies the user by either playing this response information as audio using speech synthesis technology or by displaying it on the screen. Existing speech synthesis engines are commonly used for speech synthesis. Furthermore, the information is personalized based on the user's age and preferences, and a mechanism is in place to prioritize the notification of important information.
[0454] A concrete example is when a user asks via voice, "Please tell me which cafes are within walking distance from my current location." In this case, the generative AI model selects appropriate cafe candidates based on the user's current location and provides the most suitable answer. This system is designed to allow users to quickly and accurately obtain the information they need without performing complex operations.
[0455] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0456] Step 1:
[0457] The user launches a dedicated app on their smart device and provides voice input. This input is a prompt, such as "Tell me the weather nearby." The device acquires this voice as a digital signal through the microphone. Here, the input is voice data, and the output is a digital voice signal. Converting the voice data to a digital format enables subsequent processing.
[0458] Step 2:
[0459] The terminal inputs the acquired digital audio signal into speech recognition software, which converts it into text information. For example, the Google Speech-to-Text API or similar speech recognition technology is used. The input for this step is a digital audio signal, and the output is text information. This conversion makes the audio information machine-readable, preparing it for further processing.
[0460] Step 3:
[0461] The terminal sends the converted text information to the server over the network. The transmitted data includes the text information, the user's location information, and related attribute information. The input for this step is the text information, and the output is the dataset sent to the server. This ensures that the server has the information needed to generate response information.
[0462] Step 4:
[0463] The server analyzes the received text information and uses a generative AI model to create appropriate response information. A large-scale language model is used here. Based on the prompt and location information, the model generates personalized responses, such as "The weather nearby is sunny." The input for this step is text information and related information, and the output is the generated response information. The use of the AI model enables responses that are relevant to the user's context.
[0464] Step 5:
[0465] The server sends the generated response information back to the terminal via the network in text format. The input for this step is the generated response information, and the output is the response data sent to the terminal. This prepares the generated information for the user to receive.
[0466] Step 6:
[0467] The terminal either plays the received response information as speech using speech synthesis technology or displays it on the screen. Widely available speech synthesis engines are typically used. The input here is the response data, and the output is a notification to the user (audio or visual information). This final step allows the user to receive information in a simple way and take the necessary action.
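The six steps above can be sketched as a small pipeline. This is only an illustrative sketch: the names `Request`, `recognize_speech`, `generate_response`, `notify_user`, and `handle_voice_input` are hypothetical, the speech recognition and generative AI calls are replaced by local placeholders, and the latitude/longitude format of the location information is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Request:
    """Dataset the terminal sends to the server in Step 3 (hypothetical layout)."""
    text: str                                   # text information from Step 2
    location: Tuple[float, float]               # (latitude, longitude); assumed format
    attributes: Dict[str, object] = field(default_factory=dict)  # e.g. age, preferences


def recognize_speech(digital_signal: bytes) -> str:
    """Step 2 placeholder: a real terminal would call a speech recognition engine."""
    return digital_signal.decode("utf-8")  # stand-in: the "signal" already holds text


def generate_response(request: Request) -> str:
    """Step 4 placeholder: a real server would query a large-scale language model."""
    lat, lon = request.location
    return f"Answer to '{request.text}' near ({lat:.2f}, {lon:.2f})"


def notify_user(response: str, use_speech: bool) -> str:
    """Step 6 placeholder: speech synthesis or screen display."""
    channel = "speech" if use_speech else "screen"
    return f"[{channel}] {response}"


def handle_voice_input(digital_signal, location, attributes, use_speech=True):
    text = recognize_speech(digital_signal)        # Step 2: signal -> text information
    request = Request(text, location, attributes)  # Step 3: dataset sent to the server
    response = generate_response(request)          # Steps 4-5: server generates and returns
    return notify_user(response, use_speech)       # Step 6: audio or visual notification
```

For example, `handle_voice_input(b"Tell me the weather nearby.", (35.68, 139.76), {"age": 78})` walks one utterance through all stages; in an actual system each placeholder would be replaced by the corresponding engine or network call.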
[0468] (Application Example 1)
[0469] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0470] The challenge is to enable elderly and technologically unfamiliar users to easily obtain life support information using voice commands and to live independently. Conventional technologies have insufficient voice recognition and generative AI capabilities, making it difficult to accurately meet the diverse needs of users. Furthermore, customized information provision is limited, and responses may be inadequate even in situations where a rapid response is required.
[0471] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0472] In this invention, the server includes means for acquiring voice as a data signal, means for converting the data signal into character data, and means for providing life support information to the user. This enables intuitive information acquisition using voice and personalized service provision.
[0473] "Voice" refers to the human voice emitted by the user, and is an input method used to acquire information based on this voice.
[0474] A "data signal" is a signal that is converted from audio information into an electronic format and processed in a way that the system can understand.
[0475] "Character data" refers to data signals converted into strings of characters, which serve as the basis for the generative AI model to interpret and generate responses.
[0476] A "data processing device" is a device that generates response information using character data, and examples include servers and cloud services.
[0477] "Response information" refers to information generated by a data processing device based on user input, and is the result provided to the user.
[0478] "Presenting aloud or visually" refers to a method of conveying generated response information to the user via speech synthesis technology or a display.
[0479] "Life support information" refers to information provided to improve the quality of life for users, and includes, for example, schedule management and reminder information.
[0480] "User attributes" refer to personal information such as the user's age, lifestyle, and preferences, and are criteria considered when individualizing the provision of information.
[0481] The system of this invention operates via a smartphone to facilitate the process by which users obtain life support information by voice.
[0482] The smartphone, acting as the terminal, uses its built-in microphone to convert the user's voice into a data signal. The smartphone then uses speech recognition software (e.g., Google Cloud Speech-to-Text) to convert this data signal into character data. This character data is then transmitted over the network to a remote data processing device. This data processing device is a server built on the cloud, which uses a generative AI model (e.g., OpenAI API) to analyze the character data and generate response information.
[0483] The server customizes its response to match the user's current attributes. This allows for the provision of personalized information that takes into account, for example, the user's age and lifestyle. This response information is then sent back to the terminal via the network and presented to the user either as audio using speech synthesis technology (e.g., Amazon Polly) or as text displayed on the screen.
[0484] As a concrete example, consider the case where a user voice-inputs the prompt, "Tell me when to take my medication." This prompt is processed by the system, and a response such as "Please take your medication at 10 o'clock" is generated. With prompts like this, users can easily obtain information to support their health management.
[0485] Example of a prompt: "What's the weather like tomorrow?"
[0486] This system design allows even elderly people unfamiliar with technology to operate it intuitively, enabling them to manage their daily lives more independently.
[0487] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0488] Step 1:
[0489] The user provides voice input by asking the device a question. The input voice is picked up by the device's microphone and first captured as analog audio data.
[0490] Step 2:
[0491] The terminal converts the acquired audio data into a data signal. Using speech recognition software (e.g., a speech recognition API), this analog audio data is converted into a digital data signal, preparing the audio input for conversion into text.
[0492] Step 3:
[0493] The digital signal is converted into character data by the speech recognition module in the device. This process, which analyzes the digital audio signal and converts it into text, is typically performed using tools such as Google Cloud Speech-to-Text.
[0494] Step 4:
[0495] The generated character data is sent to the server. The terminal uses the network to upload this character data to a remote data processing device. The character data is then passed directly to the server and becomes input data for the next processing step.
[0496] Step 5:
[0497] The server analyzes the received character data and generates response information using a generative AI model (e.g., OpenAI API). This analysis involves internal calculations to understand the prompt and generate appropriate response information.
[0498] Step 6:
[0499] The generated response information is sent from the server to the terminal. The server packages the response information output by the generative AI model and sends it to the terminal via the network.
[0500] Step 7:
[0501] The device provides the user with the received response information either audibly or visually. Specifically, it plays the information back to the user as audio using Text-to-Speech technology, or displays it as visual information on the device's display.
[0502] This entire process allows users to quickly obtain accurate information and receive support that is helpful in their daily lives.
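The attribute-based customization described for this application example can be illustrated with a short sketch. It assumes, purely for illustration, that the server folds the user attributes into the prompt passed to the generative AI model; the function name `build_personalized_prompt` and the prompt wording are hypothetical and are not taken from this disclosure.

```python
def build_personalized_prompt(character_data: str, user_attributes: dict) -> str:
    """Combine the user's request with attribute hints (hypothetical prompt layout)."""
    # Sort the attributes so the same inputs always produce the same prompt.
    hints = [f"{key}: {value}" for key, value in sorted(user_attributes.items())]
    profile = "; ".join(hints) if hints else "no attributes available"
    return (
        "You are a life support assistant.\n"
        f"User attributes: {profile}\n"
        f"User request: {character_data}\n"
        "Answer briefly and in plain language."
    )


prompt = build_personalized_prompt(
    "Tell me when to take my medication.",
    {"age": 82, "lifestyle": "lives alone"},
)
print(prompt)
```

The resulting string would then be passed to whichever generative AI model the server uses; that call is omitted here because its interface depends on the chosen service.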
[0503] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0504] This invention relates to an agent system incorporating an emotion engine that provides information through a natural interface via the user's voice. In particular, it aims to improve the user experience by recognizing the user's emotions from their voice and adjusting the information provided in consideration of those emotions.
[0505] First, the user launches an agent application on their smartphone and makes a voice inquiry. For example, they can ask a question like, "Tell me today's news." This voice is then converted into a digital signal by the smartphone.
[0506] Next, the converted audio data is analyzed by an emotion engine. The emotion engine used by the device analyzes the tone, tempo, and volume of the user's voice, and from these recognizes the user's emotional state (happy, sad, angry, etc.). This information can also be compared with the user's history to identify trends.
[0507] The text data and the analyzed emotion information are then sent to the server. The server uses generative AI to generate the most appropriate response based on the user's emotions and inquiry. For example, if the user is feeling down, it might provide news information accompanied by words of encouragement.
[0508] The generated response is sent back to the terminal and presented to the user either as audio using speech synthesis or visually on the screen. This notification uses a friendly tone and phrases that respond to the user's emotions, enabling more user-friendly information delivery.
[0509] For example, if a user asks "What's the weather going to be like next?" in a sad voice, the emotion engine will determine the user's emotions and provide a response that includes something uplifting, such as "The weather will be sunny next. There might even be a rainbow after the rain, how exciting!"
[0510] This invention enables interactions that reflect the user's emotional state, significantly improving the user's comfort level.
[0511] The following describes the processing flow.
[0512] Step 1:
[0513] The user launches the app on their smartphone and makes a voice inquiry. For example, they might say, "Tell me the latest movie reviews."
[0514] Step 2:
[0515] The device acquires the user's voice as a digital signal. Using speech recognition technology, the digital signal is converted into text data.
[0516] Step 3:
[0517] The device sends the converted text data to the emotion engine, which analyzes the voice parameters. It recognizes the user's emotions from the tone, tempo, and volume of their voice and records them as data.
[0518] Step 4:
[0519] The device sends the text data and emotion data to the server. The server receives this data and uses generative AI to generate a response that is appropriate to the user's emotions.
[0520] Step 5:
[0521] The server generates a response and sends it back to the terminal. During this process, it takes emotional data into account and adjusts the tone and content accordingly. For example, if the user is happy, it might include humor.
[0522] Step 6:
[0523] The device notifies the user of the received response information either by outputting it as speech using speech synthesis or by displaying it visually on the screen.
[0524] Step 7:
[0525] Users will have a more satisfying experience by reviewing the information provided and receiving responses that match their emotions. They may make additional inquiries as needed.
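As a rough illustration of how an emotion engine might map the tone, tempo, and volume of a voice to an emotional state, the following sketch uses simple threshold rules. This is an assumption made for explanation only: the feature names, units, and thresholds are hypothetical and uncalibrated, and the disclosure does not specify the internal method of the emotion engine, which could equally be a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_hz: float    # average pitch, used here as a proxy for tone
    tempo_wps: float   # speaking rate in words per second
    volume_db: float   # average loudness


def estimate_emotion(f: VoiceFeatures) -> str:
    """Toy rule-based classifier; all thresholds are illustrative, not calibrated."""
    if f.volume_db > 70 and f.tempo_wps > 3.0:
        return "angry"    # loud and fast
    if f.pitch_hz > 200 and f.tempo_wps > 2.5:
        return "happy"    # high-pitched and lively
    if f.pitch_hz < 150 and f.tempo_wps < 2.0 and f.volume_db < 55:
        return "sad"      # low, slow, and quiet
    return "neutral"      # no rule matched


print(estimate_emotion(VoiceFeatures(pitch_hz=120, tempo_wps=1.5, volume_db=50)))
```

The returned label corresponds to the emotion data that the device records in Step 3 and sends to the server in Step 4.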
[0526] (Example 2)
[0527] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0528] Conventional voice interface systems do not take into account the user's emotional state, resulting in information that does not align with the user's feelings and leading to low satisfaction. Furthermore, the lack of personalized information delivery makes it difficult to improve the user experience.
[0529] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0530] In this invention, the server includes means for acquiring voice data as a digital signal, means for analyzing the user's emotional state and generating information based on this using prompt sentences in an AI model, and means for providing personalized information according to the emotional state. This enables the presentation of optimal information according to the user's emotions, thereby improving the user experience.
[0531] "Audio data" refers to a signal that represents sound in digital format, converted into a form that can be processed by a computer.
[0532] "Text data" refers to audio data converted into written information, and is a format used in natural language processing and analysis.
[0533] "Emotional information" refers to information indicating the emotional state analyzed from the user's voice, and is estimated based on the tone, tempo, and volume of the user's voice.
[0534] An "information processing device" is a remotely installed computer system that analyzes data transmitted by a user and generates appropriate information.
[0535] A "generative AI model" is a type of artificial intelligence that uses machine learning techniques to generate human language and is used to provide the most appropriate response to user input.
[0536] A "prompt statement" is an input statement used to set specific conditions or contexts for a generative AI model, and is used to control the model's output.
[0537] "Personalized information delivery" is a process of providing information tailored to the user's emotional state and preferences, and is an approach aimed at delivering information that is more relevant to the user.
[0538] This invention realizes a system that provides more personalized information through a user voice interface. The user inputs voice data using a terminal such as a smartphone or voice-enabled device. The terminal converts the voice data into a digital signal and generates text data based on that digital signal.
[0539] The device is equipped with software for emotion analysis, which generates emotional information by analyzing the tone, tempo, and volume of the user's voice. This emotional information, along with the content of the user's speech, is transmitted to the information processing device using a secure protocol.
[0540] The server-side information processing device uses a generative AI model to generate the optimal response based on the received text data and emotional information. The generated response is adjusted according to the user's emotions, providing information that resonates with the user. The generative AI model receives specific response requests in the form of prompt statements. Prompt statements such as "If the user is in a given emotional state, what tone should the response to the information content take?" are used.
[0541] For example, if a user asks "What's the weather like tomorrow?" in a cheerful voice, the device will interpret that emotion as "happy." The server will then generate a response such as "It's going to be sunny tomorrow! It's going to be a great day!" and notify the user. This enables the provision of more user-friendly information tailored to the user's emotions, improving the user experience. Furthermore, the generated information is also displayed on the device's screen, so the system of this invention can provide information using both audio and visual means.
[0542] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0543] Step 1:
[0544] The user launches an agent application on a device such as a smartphone and performs voice input. For example, they might ask, "What's the weather like tomorrow?" The input is the user's voice, which the device converts into a digital signal via the microphone. This converted digital signal forms the basis for the next processing step.
[0545] Step 2:
[0546] The terminal converts the digital signal into text data. Speech recognition software is used to convert the content of the speech into text-based data. The input is a digitized speech signal, and the output is text data. This text data is then used for subsequent processing.
[0547] Step 3:
[0548] The device extracts emotional information from both text data and audio. This is done using an emotion analysis module, which identifies the user's emotional state by analyzing the tone, speed, and volume of the voice. The input is a digital audio signal, and the output is an emotion tag (e.g., "happy," "sad"). This emotional information is then included in the next data transmission.
[0549] Step 4:
[0550] The terminal bundles the obtained text data and emotional information into packets and sends them to the information processing device. A secure communication protocol is used for transmission. The input is the text data and emotional information, and the output is the data packets sent to the server. This enables the subsequent response generation.
[0551] Step 5:
[0552] The server uses a generative AI model to create a response based on the received data. The prompt statement is "If the user is in a given emotional state, what tone should the response to the information content take?" and the AI outputs an appropriate text response according to this instruction. Personalized information is generated through this process.
[0553] Step 6:
[0554] A response generated by the server is sent to the terminal. The terminal receives this response and notifies the user using a speech synthesis engine. The input is the generated text response, and the output is either synthesized speech or text displayed on the screen. This allows the user to receive information appropriate to their emotions.
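The bundling of text data and emotional information in Step 4 can be sketched as follows. The JSON layout, the field names, the tag set in `ALLOWED_EMOTIONS`, and the functions `build_packet` and `parse_packet` are all assumptions for illustration; the disclosure does not specify a packet format, and the secure communication protocol itself (for example, transport encryption) is outside the scope of this sketch.

```python
import json

# Hypothetical set of emotion tags the emotion analysis module may output.
ALLOWED_EMOTIONS = {"happy", "sad", "angry", "neutral"}


def build_packet(text_data: str, emotion_tag: str) -> bytes:
    """Terminal side: bundle text data and an emotion tag into one payload."""
    if emotion_tag not in ALLOWED_EMOTIONS:
        raise ValueError(f"unknown emotion tag: {emotion_tag}")
    payload = {"text": text_data, "emotion": emotion_tag}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def parse_packet(packet: bytes):
    """Server side: recover the text data and emotion tag from the payload."""
    payload = json.loads(packet.decode("utf-8"))
    return payload["text"], payload["emotion"]


packet = build_packet("What's the weather like tomorrow?", "happy")
text, emotion = parse_packet(packet)
```

Validating the tag on the terminal side keeps malformed emotion labels from reaching the prompt statement that the server builds in Step 5.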
[0555] (Application Example 2)
[0556] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0557] Conventional speech recognition systems have the problem of not being able to provide information while taking into account the user's emotional state. Furthermore, while there is a need to provide appropriate feedback based on the user's emotions, current technology lacks systems that organically link emotion analysis and information provision. As a result, the user experience is limited and the quality of interaction suffers.
[0558] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0559] In this invention, the server includes means for acquiring voice data as digital information, means for analyzing the user's emotional state from the voice data, and means for adjusting response information based on the emotional state. This makes it possible to provide information that is tailored to the user's emotional state.
[0560] "Audio data" refers to sound signals acquired via a microphone from the user's utterances.
[0561] "Digital information" refers to data obtained by converting analog audio data into numerical data, in a format that can be processed by electronic computers.
[0562] An "electronic computing device" is a computer or similar device used to analyze input data and perform information processing.
[0563] "Answer information" refers to the answer or response content generated by an electronic computer and provided to the user.
[0564] "Screen display" refers to a method of conveying information to users visually, and is performed using a display.
[0565] "User" refers to an individual who operates this system and obtains information.
[0566] "Emotional state" refers to the psychological or emotional condition exhibited by the user, and is recognized through the analysis of voice data.
[0567] "Adjustment" refers to the act of modifying or optimizing information based on specific circumstances or criteria.
[0568] To implement this invention, it is first necessary to equip a home robot with a microphone and speaker. When a user speaks to the robot and asks a question, the device acquires the voice data as digital information. This digital information is then converted into text using speech recognition software. At this stage, a common speech recognition service such as the Google Cloud Speech-to-Text API is used.
[0569] Next, the emotion engine analyzes the user's voice to determine their emotional state. Using the EmotionAPI or an equivalent emotion analysis API, the emotion engine analyzes the voice tone, rhythm, and volume to identify the user's emotions. The identified emotional state is sent to a computer and used as foundational data to generate optimal response information using a generative AI model.
[0570] The server utilizes generative AI such as OpenAI's GPT to generate response information that corresponds to the user's emotional state and the content of their conversation. The generative AI receives prompts and is expected to respond in a manner appropriate to the context. For example, if the user says "I'm feeling tired today," the server will generate a friendly response such as, "How about trying to relax today? Shall I recommend some relaxation techniques?"
[0571] The generated response information is sent to the device and converted into speech using a speech synthesis API such as Amazon Polly. The spoken response is either transmitted to the user through the speaker or presented as visual information on the screen.
[0572] Example of a prompt:
[0573] "User comment: 'I'm not really in the mood today.' The emotion engine has detected sadness. Please respond with something friendly to cheer up the user."
[0574] Such systems allow users to enjoy natural, emotionally resonant interactions, increasing the convenience of consumer robots in everyday life.
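A prompt of the kind shown in the example above could be assembled from the transcribed comment and the detected emotion label. The following is a minimal sketch under stated assumptions: the function `build_prompt`, the table `TONE_INSTRUCTIONS`, and every instruction other than the one for sadness are hypothetical and are not taken from this disclosure.

```python
# Tone instruction per detected emotion. Only the "sadness" entry mirrors the
# example prompt in the text; the other entries are illustrative assumptions.
TONE_INSTRUCTIONS = {
    "sadness": "Please respond with something friendly to cheer up the user.",
    "joy": "Please respond in a cheerful tone that shares the user's mood.",
}
DEFAULT_INSTRUCTION = "Please respond in a calm and friendly tone."


def build_prompt(user_comment: str, detected_emotion: str) -> str:
    """Assemble a prompt from the user's comment and the emotion label."""
    instruction = TONE_INSTRUCTIONS.get(detected_emotion, DEFAULT_INSTRUCTION)
    return (
        f"User comment: '{user_comment}' "
        f"The emotion engine has detected {detected_emotion}. "
        f"{instruction}"
    )


print(build_prompt("I'm not really in the mood today.", "sadness"))
```

Keeping the tone instructions in a table makes it straightforward to add further emotion labels without changing the prompt assembly logic.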
[0575] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0576] Step 1:
[0577] The terminal acquires the voice spoken by the user to the robot via a microphone. First, the voice data, as input, is converted into digital information. This process utilizes voice compression technology to convert analog voice signals into digital data. The output is the voice data in digital format.
[0578] Step 2:
[0579] The device converts the acquired digital speech information into text. This process utilizes the Google Cloud Speech-to-Text API to generate text from speech patterns. The input for this step is digital speech information, and the output is the transcribed user conversation. Specifically, the device uses a language model to convert phonemes into words.
[0580] Step 3:
[0581] The device uses an emotion engine to analyze the user's emotional state from their voice. The input consists of the text information obtained in step 2 and the tone and timing of the voice. Specifically, the EmotionAPI analyzes the voice features and outputs emotion labels such as joy and sadness. The output is information about the user's emotional state.
[0582] Step 4:
[0583] The terminal transmits text information and emotional state information to the computer. The input is the text from step 2 and the emotional state information from step 3. Based on this information, the computer generates an appropriate response through a generative AI model. The output is a list of answer candidates. Specifically, the computer uses OpenAI's GPT model to select the best answer based on the input prompt sentence.
[0584] Step 5:
[0585] The server selects the answer that best matches the user's emotions from the generated list of answer candidates and sends it to the terminal. The input is the list of answer candidates. The output is the adjusted answer. Specifically, it selects an answer that emphasizes emotionally appropriate feedback and suggestions.
[0586] Step 6:
[0587] The terminal converts the received response into speech using a speech synthesis API. The input is a processed response from a computer. The output is the spoken response. Specifically, it is made audible with a natural voice tone using services such as Amazon Polly.
[0588] Step 7:
[0589] The device either outputs the spoken response through its speaker or displays it as text on its screen. The input is the speech data generated in step 6. The output is the response information in a form that the user can perceive. Specifically, the device adjusts the speaker volume to play the response appropriately.
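The selection in Step 5 of one answer from the list of answer candidates can be illustrated with a simple scoring sketch. The keyword table `EMOTION_KEYWORDS` and the function `select_answer` are hypothetical; the disclosure does not state how the server ranks candidates, and a practical system might instead ask the generative AI model itself to rank them.

```python
# Hypothetical keywords that suggest an answer suits a given emotion label.
EMOTION_KEYWORDS = {
    "sadness": ("relax", "rest", "take it easy"),
    "joy": ("great", "wonderful", "enjoy"),
}


def select_answer(candidates, emotion):
    """Return the candidate with the most emotion-related keyword matches."""
    if not candidates:
        raise ValueError("no answer candidates to choose from")
    keywords = EMOTION_KEYWORDS.get(emotion, ())

    def score(answer):
        lowered = answer.lower()
        return sum(lowered.count(keyword) for keyword in keywords)

    # max() keeps the first candidate on ties, so an unknown emotion label
    # simply falls back to the first answer in the list.
    return max(candidates, key=score)


candidates = [
    "Here is today's news.",
    "How about trying to relax today? Shall I recommend some relaxation techniques?",
]
print(select_answer(candidates, "sadness"))
```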
[0590] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0591] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0592] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0593] [Fourth Embodiment]
[0594] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0595] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0596] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0597] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0598] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0599] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0600] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0601] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
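The relationship between an emotion and the controlled object 443 can be sketched as a simple lookup. This is only an illustrative sketch: the emotion labels, LED states, and gesture names below are hypothetical and do not appear in the disclosure, which states only that the eye LEDs and the arm, hand, and foot motors are controlled to express emotions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    led_state: str  # illumination state of the eye LEDs
    gesture: str    # gesture realized by the arm, hand, and foot motors


# Hypothetical mapping; labels and values are placeholders for illustration.
EXPRESSIONS = {
    "happy": Expression(led_state="bright", gesture="raise_arms"),
    "sad": Expression(led_state="dim", gesture="lower_arms"),
}
NEUTRAL = Expression(led_state="steady", gesture="stand_still")


def expression_for(emotion: str) -> Expression:
    """Return the LED and motor settings for an emotion, or a neutral default."""
    return EXPRESSIONS.get(emotion, NEUTRAL)
```

A table of this kind keeps the choice of expression separate from the motor and LED drivers, so new emotions can be added without touching the control code.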
[0602] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0603] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0604] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0605] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0606] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0607] This invention provides a system that enables intuitive information acquisition and service utilization using smartphones, targeting elderly and technologically unfamiliar users. This system combines voice recognition technology and generative AI to allow users to perform various operations via voice input.
[0608] First, the user launches the agent app installed on their smartphone and makes a voice inquiry. For example, a possible question might be, "Please tell me about nearby hospitals." This voice input is converted into a digital signal by the smartphone.
[0609] Next, the device converts the audio into text data and sends this text to a remotely located server over the network. The server receives this text and generates a response using a generative AI model. For example, based on the user's current location, it retrieves a list of the nearest hospitals and generates this information in text format.
[0610] The generated response is sent back to the terminal via the network and presented to the user. The terminal either reads the response aloud using speech synthesis or displays it on the screen. Furthermore, the server can customize the response based on the user profile so that the information is tailored to the user's needs.
[0611] For example, if a user needs weather forecast information, they can say "Tell me the weather for tomorrow" into their device. The voice is converted to text, and the server generates a response based on local weather information, such as "Tomorrow will be sunny, and the temperature will be 25 degrees Celsius." This allows the user to quickly and easily obtain the information they need.
[0612] This system aims to provide a user-friendly environment and reduce technical hurdles by utilizing a natural interface such as voice input.
[0613] The following describes the processing flow.
[0614] Step 1:
[0615] The user launches a smartphone app and makes a voice inquiry to request specific information. For example, they might say, "Tell me about events this weekend."
[0616] Step 2:
[0617] The device acquires this audio as a digital signal and converts it into text data using a speech recognition module.
[0618] Step 3:
[0619] The device sends the converted text data to the server over the internet. Here, the appropriate API endpoint is used to make the request.
[0620] Step 4:
[0621] The server analyzes the received text and runs a generative AI model to generate information that matches the user's inquiry. For example, it might refer to an event database to retrieve the desired information.
[0622] Step 5:
[0623] The server sends the generated response back to the terminal in text format. This includes the generated information and, if necessary, related data.
[0624] Step 6:
[0625] The device notifies the user of the received response data through speech synthesis or screen display. This allows the user to obtain information by voice or view details on the screen.
[0626] Step 7:
[0627] The system reviews the information provided by the user and makes additional inquiries as needed. This process is repeated as requested by the user.
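The processing flow in Steps 1 to 6 can be sketched as a single round trip. The following is a minimal sketch under stated assumptions: the function names are hypothetical, speech recognition and the generative AI model are replaced by stand-ins, and the network transfer between the terminal and the server is omitted.

```python
from typing import Callable


def recognize_speech(audio: bytes) -> str:
    """Stand-in for the speech recognition module (Step 2).

    A real module would decode audio samples; for illustration the bytes
    are assumed to already contain UTF-8 text.
    """
    return audio.decode("utf-8")


def server_generate(text: str) -> str:
    """Stand-in for the server-side generative AI model (Step 4)."""
    return f"Here is what I found for: {text}"


def handle_inquiry(audio: bytes, notify: Callable[[str], None]) -> str:
    """Run one inquiry from voice input to user notification."""
    text = recognize_speech(audio)    # Step 2: voice to text
    response = server_generate(text)  # Steps 3 to 5: send, generate, return
    notify(response)                  # Step 6: speech synthesis or display
    return response
```

Passing the notification method as a callable reflects the choice in Step 6 between speech synthesis and screen display; for example, `handle_inquiry(audio, print)` would stand in for the visual display.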
[0628] (Example 1)
[0629] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0630] Traditionally, when elderly or technologically unfamiliar users retrieve information using voice input, challenges have included the complexity of the operation and the fact that the information provided is general and does not meet individual needs. Furthermore, dissatisfaction has arisen regarding the accuracy and speed of information obtained through voice input. Therefore, there is a need for a system that offers simple and intuitive operation and the provision of personalized information tailored to the user's attributes and current location.
[0631] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0632] In this invention, the server includes means for acquiring voice data as an electronic signal, means for converting the electronic signal into text information, and means for transmitting the text information to a remote data processing device. This enables users to intuitively acquire information using voice and to be provided with information tailored to their individual needs.
[0633] "Voice data" refers to the electronic signal obtained by converting the voice input from the user.
[0634] An "electronic signal" is a digital representation of audio information, converted into a format that can be processed.
[0635] "Text information" refers to text data converted from electronic signals, in a format that can be processed by machines and programs.
[0636] A "data processing device" is a device that receives text information and processes it or generates responses using a generative AI model.
[0637] A "generative AI model" is an artificial intelligence model that learns from large amounts of data and generates text that sounds like it was written by a human.
[0638] "Response information" refers to the answer to the user generated by a data processing device based on textual information.
[0639] "Location information" refers to geographical information used to determine the user's current location.
[0640] "Attribute information" refers to information related to an individual, including the user's age, preferences, and past behavioral history.
[0641] "Personalized information" refers to information tailored to the specific needs of a user, based on their attribute information and current location.
[0642] The program for this system aims to provide an environment where elderly and technologically unfamiliar users can intuitively acquire information through voice input. Its main components include terminals (smart devices), servers, a network environment, and a generative AI model.
[0643] The user first launches a dedicated app installed on their smart device and performs voice input. The voice input is captured by the smart device's microphone and converted into a digital signal. The speech recognition software used here converts the voice data into text information. For example, commonly available programs are often used as the speech recognition engine.
[0644] The terminal transmits the converted text information to the server via the internet. The server uses a generative AI model to generate appropriate response information based on the input text information. A large-scale language model is used as this generative AI model. The generated response information is automatically customized, taking into account the user's location and attribute information.
[0645] The generated response information is then transmitted back to the terminal via the network. The terminal notifies the user by either playing this response information as audio using speech synthesis technology or by displaying it on the screen. Existing speech synthesis engines are commonly used for speech synthesis. Furthermore, the information is personalized based on the user's age and preferences, and a mechanism is in place to prioritize the notification of important information.
[0646] A concrete example is when a user asks via voice, "Please tell me which cafes are within walking distance from my current location." In this case, the generative AI model selects appropriate cafe candidates based on the user's current location and provides the most suitable answer. This system is designed to allow users to quickly and accurately obtain the information they need without performing complex operations.
[0647] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0648] Step 1:
[0649] The user launches a dedicated app on their smart device and provides voice input. This input is a prompt, such as "Tell me the weather nearby." The device acquires this voice as a digital signal through the microphone. Here, the input is voice data, and the output is a digital voice signal. Converting the voice data to a digital format enables subsequent processing.
[0650] Step 2:
[0651] The terminal inputs the acquired digital audio signal into speech recognition software, which converts it into text information. For example, the Google Speech-to-Text API or similar speech recognition technology is used. The input for this step is a digital audio signal, and the output is text information. This conversion makes the audio information machine-readable, preparing it for further processing.
[0652] Step 3:
[0653] The terminal sends the converted text information to the server over the network. The transmitted data includes the text information, the user's location information, and related attribute information. The input for this step is text information, and the output is the dataset sent to the server. This step ensures that the server has the information it needs to generate response information.
[0654] Step 4:
[0655] The server analyzes the received text information and uses a generative AI model to create appropriate response information. A large-scale language model is used here. Based on the prompt and location information, the model generates personalized responses, such as "The weather nearby is sunny." The input for this step is text information and related information, and the output is the generated response information. The use of the AI model enables responses that are relevant to the user's context.
[0656] Step 5:
[0657] The server sends the generated response information back to the terminal via the network in text format. The input for this step is the generated response information, and the output is the response data sent to the terminal. This prepares the generated information for the user to receive.
[0658] Step 6:
[0659] The terminal either plays the received response information as speech using speech synthesis technology or displays it on the screen. The speech synthesis engines used are often widely used components. The input here is the response data, and the output is a notification to the user (audio or visual information). This final step allows the user to receive information in a simple way and take the necessary action.
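The dataset transmitted in Step 3 and its use in Step 4 can be sketched as follows. This is an illustrative sketch only: the field names, the JSON encoding, and the prompt layout are assumptions, since the disclosure specifies only that text information, location information, and attribute information are sent and that the model generates a response based on them.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class InquiryRequest:
    text: str               # text information from speech recognition
    latitude: float         # location information (assumed representation)
    longitude: float
    age: int                # attribute information (assumed fields)
    preferences: List[str]


def encode_request(req: InquiryRequest) -> str:
    """Terminal side: serialize the dataset sent to the server (Step 3)."""
    return json.dumps(asdict(req), ensure_ascii=False)


def build_prompt(payload: str) -> str:
    """Server side: turn the received dataset into a model prompt (Step 4)."""
    data = json.loads(payload)
    return (
        f"User inquiry: {data['text']}\n"
        f"Current location: {data['latitude']}, {data['longitude']}\n"
        f"User age: {data['age']}\n"
        f"Preferences: {', '.join(data['preferences'])}"
    )
```

Placing the location and attributes in the prompt is one possible way to obtain a personalized response from a large-scale language model; the disclosure does not state how the customization is performed.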
[0660] (Application Example 1)
[0661] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0662] The challenge is to enable elderly and technologically unfamiliar users to easily obtain life support information using voice commands and to live independently. Conventional technologies have insufficient voice recognition and generative AI capabilities, making it difficult to accurately meet the diverse needs of users. Furthermore, customized information provision is limited, and responses may be inadequate even in situations where a rapid response is required.
[0663] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0664] In this invention, the server includes means for acquiring voice as a data signal, means for converting the data signal into character data, and means for providing life support information to the user. This enables intuitive information acquisition using voice and personalized service provision.
[0665] "Voice" refers to the human voice emitted by the user, and is an input method used to acquire information based on this voice.
[0666] A "data signal" is a signal that is converted from audio information into an electronic format and processed in a way that the system can understand.
[0667] "Character data" refers to data signals converted into strings of characters, which serve as the basis for the generative AI to interpret and generate responses.
[0668] A "data processing device" is a device that generates response information using character data, and examples include servers and cloud services.
[0669] "Response information" refers to information generated by a data processing device based on user input, and is the result provided to the user.
[0670] "Presenting aloud or visually" refers to a method of conveying generated response information to the user via speech synthesis technology or a display.
[0671] "Life support information" refers to information provided to improve the quality of life for users, and includes, for example, schedule management and reminder information.
[0672] "User attributes" refer to personal information such as the user's age, lifestyle, and preferences, and are criteria considered when individualizing the provision of information.
[0673] The system of this invention operates via a smartphone to facilitate the process by which users obtain life support information by voice.
[0674] The smartphone, acting as the terminal, uses its built-in microphone to convert the user's voice into a data signal. The smartphone then uses speech recognition software (e.g., Google Cloud Speech-to-Text) to convert this data signal into text data. This text data is then transmitted over the network to a remote data processing unit. This data processing unit is a server built on the cloud, which uses a generative AI model (e.g., OpenAI API) to analyze the text data and generate response information.
[0675] The server customizes its response to match the user's current attributes. This allows for the provision of personalized information that takes into account, for example, the user's age and lifestyle. This response information is then sent back to the terminal via the network and presented to the user either as audio using speech synthesis technology (e.g., Amazon Polly) or as text displayed on the screen.
[0676] As a concrete example, consider the case where a user voice-inputs the prompt, "Tell me when to take my medication." This prompt is processed by the system, and a response such as "Please take your medication at 10 o'clock" is generated. By using this example prompt, users can easily obtain information to support their health management.
[0677] Example of a prompt: "What's the weather like tomorrow?"
[0678] This system design allows even elderly people unfamiliar with technology to operate it intuitively, enabling them to manage their daily lives more independently.
[0679] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0680] Step 1:
[0681] The user inputs voice by asking the device a question. The device's microphone first captures the input voice as analog audio data.
[0682] Step 2:
[0683] The terminal converts the acquired audio data into a data signal. Using speech recognition software (e.g., a speech recognition API), this analog audio data is converted into a digital data signal, preparing the audio input for text format.
[0684] Step 3:
[0685] The digital signal is converted into text data by the speech recognition module in the device. This process, which involves analyzing the digital audio signal and converting it into text, is typically done using tools such as Google Cloud Speech-to-Text.
[0686] Step 4:
[0687] The generated character data is sent to the server. The terminal uses the network to upload this character data to a remote data processing device. The character data is then passed directly to the server and becomes input data for the next processing step.
[0688] Step 5:
[0689] The server uses the received character data to begin analyzing and generating data using a generative AI model (e.g., OpenAI API). This analysis process involves internal calculations to understand the prompt and generate appropriate response information.
[0690] Step 6:
[0691] The generated response information is sent from the server to the terminal. The server packages the response information output by the generative AI model and sends it to the terminal via the network.
[0692] Step 7:
[0693] The device provides the user with the received response information either audibly or visually. Specifically, it plays the information back to the user as audio using Text-to-Speech technology, or displays it as visual information on the device's display.
[0694] This entire process allows users to quickly obtain accurate information and receive support that is helpful in their daily lives.
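The medication example given for life support information can be illustrated with a small lookup that supplies the time used in the response. This sketch rests on assumptions not found in the disclosure: the schedule store, the user identifier, and the function names are hypothetical, and in the described system the response text itself is produced by the generative AI model.

```python
from datetime import time
from typing import Optional

# Hypothetical per-user schedule; the user ID and times are placeholders.
MEDICATION_SCHEDULE = {"user-001": [time(10, 0), time(20, 0)]}


def next_medication_time(user_id: str, now: time) -> Optional[time]:
    """Return the next scheduled time today, or the first time tomorrow."""
    times = sorted(MEDICATION_SCHEDULE.get(user_id, []))
    if not times:
        return None
    for t in times:
        if t >= now:
            return t
    return times[0]  # all of today's times have passed


def medication_response(user_id: str, now: time) -> str:
    """Build response information for "Tell me when to take my medication."""
    t = next_medication_time(user_id, now)
    if t is None:
        return "No medication schedule is registered."
    return f"Please take your medication at {t.strftime('%H:%M')}."
```

Looking up the time from stored data and leaving only the wording to the model is one way to keep reminder information accurate.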
[0695] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0696] This invention relates to an agent system incorporating an emotion engine that provides information through a natural interface via the user's voice. In particular, it aims to improve the user experience by recognizing the user's emotions from their voice and adjusting the information provided in consideration of those emotions.
[0697] First, the user launches an agent application on their smartphone and makes a voice inquiry. For example, they can ask a question like, "Tell me today's news." This voice is then converted into a digital signal by the smartphone.
[0698] Next, the converted audio data is analyzed by an emotion engine. The emotion engine used by the device analyzes the tone, tempo, and volume of the user's voice, and from this, recognizes the user's emotional state (happy, sad, angry, etc.). This information can also be compared with the user's history to form trends.
[0699] The text data and the analyzed emotion information are then sent to the server. The server uses generative AI to generate the most appropriate response based on the user's emotions and inquiry. For example, if the user is feeling down, it might provide news information accompanied by words of encouragement.
[0700] The generated response is sent back to the terminal and notified to the user using speech synthesis or displayed visually on the screen. This notification uses a friendly tone and phrases that respond to emotions, enabling more user-friendly information delivery.
[0701] For example, if a user asks "What's the weather going to be like next?" in a sad voice, the emotion engine will determine the user's emotions and provide a response that includes something uplifting, such as "The weather will be sunny next. There might even be a rainbow after the rain, how exciting!"
[0702] This invention enables interactions that reflect the user's emotional state, significantly improving the user's comfort level.
[0703] The following describes the processing flow.
[0704] Step 1:
[0705] The user launches the app on their smartphone and makes a voice inquiry. For example, they might say, "Tell me the latest movie reviews."
[0706] Step 2:
[0707] The device acquires the user's voice as a digital signal. Using speech recognition technology, the digital signal is converted into text data.
[0708] Step 3:
[0709] The device passes the acquired voice data to the emotion engine, which analyzes the voice parameters. The emotion engine recognizes the user's emotions from the tone, tempo, and volume of their voice and records them as data.
[0710] Step 4:
[0711] The device sends the text data and the emotion data to the server. The server receives this data and uses generative AI to generate a response that is appropriate to the user's emotions.
[0712] Step 5:
[0713] The server generates a response and sends it back to the terminal. During this process, it takes emotional data into account and adjusts the tone and content accordingly. For example, if the user is happy, it might include humor.
[0714] Step 6:
[0715] The device notifies the user of the received response information, either by outputting it as speech using speech synthesis or by displaying it on the screen.
[0716] Step 7:
[0717] Users will have a more satisfying experience by reviewing the information provided and receiving responses that match their emotions. They may make additional inquiries as needed.
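The emotion recognition in Step 3 can be illustrated with a toy rule-based classifier over tone, tempo, and volume. This is a sketch under loud assumptions: the feature representation and every threshold are arbitrary placeholders chosen for illustration, and the emotion identification model 59 of the disclosure is not described as rule-based.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_hz: float          # tone (assumed to be an average pitch)
    words_per_minute: float  # tempo
    volume_db: float         # volume


def estimate_emotion(f: VoiceFeatures) -> str:
    """Map voice features to an emotion label using placeholder thresholds."""
    if f.volume_db > 70 and f.words_per_minute > 180:
        return "angry"    # loud and fast
    if f.pitch_hz > 220 and f.words_per_minute > 150:
        return "happy"    # high-pitched and lively
    if f.pitch_hz < 150 and f.words_per_minute < 110:
        return "sad"      # low-pitched and slow
    return "neutral"
```

In practice an emotion engine would more likely rely on a trained model; the rules above only show how the three voice parameters named in the text could feed a single label.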
[0718] (Example 2)
[0719] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0720] Conventional voice interface systems do not take into account the user's emotional state, resulting in information that does not align with the user's feelings and leading to low satisfaction. Furthermore, the lack of personalized information delivery makes it difficult to improve the user experience.
[0721] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0722] In this invention, the server includes means for acquiring voice data as a digital signal, means for analyzing the user's emotional state and generating information based on this using prompt sentences in an AI model, and means for providing personalized information according to the emotional state. This enables the presentation of optimal information according to the user's emotions, thereby improving the user experience.
[0723] "Audio data" refers to a signal that represents sound in digital format, converted into a form that can be processed by a computer.
[0724] "Text data" refers to audio data converted into written information, and is a format used in natural language processing and analysis.
[0725] "Emotional information" refers to information indicating the emotional state analyzed from the user's voice, and is estimated based on the tone, tempo, and volume of the user's voice.
[0726] An "information processing device" is a remotely installed computer system that analyzes data transmitted by a user and generates appropriate information.
[0727] A "generative AI model" is a type of artificial intelligence that uses machine learning techniques to generate human language and is used to provide the most appropriate response to user input.
[0728] A "prompt statement" is an input statement used to set specific conditions or contexts for a generative AI model, and is used to control the model's output.
[0729] "Personalized information delivery" is a process of providing information tailored to the user's emotional state and preferences, and is an approach aimed at delivering information that is more relevant to the user.
[0730] This invention realizes a system that provides more personalized information through a user voice interface. The user inputs voice data using a terminal such as a smartphone or voice-enabled device. The terminal converts the voice data into a digital signal and generates text data based on that digital signal.
[0731] The device is equipped with emotion analysis software, which generates emotional information by analyzing the tone, tempo, and volume of the user's voice. This emotional information, along with the content of the user's speech, is transmitted to the information processing device using a secure protocol.
[0732] The server-side information processing device uses a generative AI model to generate the optimal response based on the received text data and emotional information. The generated response is adjusted according to the user's emotions, providing information that resonates with the user. The generative AI model receives specific response requests in the form of prompts. For example, a prompt such as "If the user's emotion is [emotional state], what tone should the response take regarding [information content]?" is used.
[0733] For example, if a user asks "What's the weather like tomorrow?" in a cheerful voice, the device will interpret that emotion as "happy." The server will then generate a response such as "It's going to be sunny tomorrow! It's going to be a great day!" and notify the user. This enables the provision of more user-friendly information tailored to the user's emotions, improving the user experience. Furthermore, the generated information is also displayed on the device's screen, so the system of this invention can provide information using both audio and visual means.
[0734] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0735] Step 1:
[0736] The user launches an agent application on a device such as a smartphone and performs voice input. For example, they might ask, "What's the weather like tomorrow?" The input is the user's voice, which the device converts into a digital signal via the microphone. This converted digital signal forms the basis for the next processing step.
[0737] Step 2:
[0738] The terminal converts the digital signal into text data. Speech recognition software converts the content of the speech into text-based data. The input is a digitized speech signal, and the output is text data. This text data is then used for subsequent processing.
[0739] Step 3:
[0740] The device extracts emotional information from both text data and audio. This is done using an emotion analysis module, which identifies the user's emotional state by analyzing the tone, speed, and volume of the voice. The input is a digital audio signal, and the output is an emotion tag (e.g., "happy," "sad"). This emotional information is then included in the next data transmission.
[0741] Step 4:
[0742] The terminal bundles the obtained text data and emotional information into packets and sends them to the information processing device. A secure communication protocol is used for transmission. The input is text data and emotional information, and the output is the data packets sent to the server. This enables the response generation in the next step.
[0743] Step 5:
[0744] The server uses a generative AI model to create a response based on the received data. The prompt text is "If the user's emotion is [emotional state], what tone should the response take regarding [information content]?" and the AI outputs an appropriate text response according to this instruction. Personalized information is generated through this process.
[0745] Step 6:
[0746] A response generated by the server is sent to the terminal. The terminal receives this response and notifies the user using a speech synthesis engine. The input is the generated text response, and the output is either synthesized speech or text displayed on the screen. This allows the user to receive information appropriate to their emotions.
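The bundling in Step 4 and the corresponding unpacking on the server can be sketched as follows. The JSON layout, the key names, and the set of emotion tags are assumptions made for illustration; the secure communication protocol mentioned in the text is not shown.

```python
import json
from typing import Tuple

# Assumed tag set; the disclosure gives "happy" and "sad" only as examples.
KNOWN_EMOTIONS = {"happy", "sad", "angry", "neutral"}


def bundle_packet(text: str, emotion: str) -> bytes:
    """Terminal side: bundle text data and the emotion tag into one packet."""
    return json.dumps({"text": text, "emotion": emotion}).encode("utf-8")


def unpack_packet(packet: bytes) -> Tuple[str, str]:
    """Server side: recover the text and emotion tag from a packet.

    Unknown or missing tags fall back to "neutral" so that response
    generation can always proceed.
    """
    data = json.loads(packet.decode("utf-8"))
    emotion = data.get("emotion", "neutral")
    if emotion not in KNOWN_EMOTIONS:
        emotion = "neutral"
    return data["text"], emotion
```

Validating the tag on the server side is a design choice of this sketch; it keeps an unexpected label from the terminal from reaching the prompt unchanged.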
[0747] (Application Example 2)
[0748] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0749] Conventional speech recognition systems have the problem of not being able to provide information while taking into account the user's emotional state. Furthermore, while there is a need to provide appropriate feedback based on the user's emotions, current technology lacks systems that organically link emotion analysis and information provision. As a result, the user experience is limited and the quality of interaction suffers.
[0750] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0751] In this invention, the server includes means for acquiring voice data as digital information, means for analyzing the user's emotional state from the voice data, and means for adjusting response information based on the emotional state. This makes it possible to provide information that is tailored to the user's emotional state.
[0752] "Audio data" refers to sound signals acquired via a microphone based on the user's pronunciation.
[0753] "Digital information" refers to data obtained by converting analog audio data into numerical data, in a format that can be processed by electronic computers.
[0754] An "electronic computing device" is a computer or similar device used to analyze input data and perform information processing.
[0755] "Answer information" refers to the answer or response content generated by an electronic computer and provided to the user.
[0756] "Screen display" refers to a method of conveying information to users visually, and is performed using a display.
[0757] "User" refers to an individual who operates this system and obtains information.
[0758] "Emotional state" refers to the psychological or emotional condition exhibited by the user, and is recognized through the analysis of voice data.
[0759] "Adjustment" refers to the act of modifying or optimizing information based on specific circumstances or criteria.
[0760] To implement this invention, it is first necessary to equip a home robot with a microphone and speaker. When a user speaks to the robot and asks a question, the device acquires the voice data as digital information. This digital information is then converted into text using speech recognition software. At this stage, a common speech recognition service such as the Google Cloud Speech-to-Text API is used.
[0761] Next, the emotion engine analyzes the user's voice to determine their emotional state. Using the EmotionAPI or an equivalent emotion analysis API, the emotion engine analyzes the voice tone, rhythm, and volume to identify the user's emotions. The identified emotional state is sent to a computer and used as foundational data to generate optimal response information using a generative AI model.
[0762] The server utilizes generative AI such as OpenAI's GPT to generate response information that corresponds to the user's emotional state and the content of their conversation. Prompts are input to the generative AI so that it responds in a manner appropriate to the context. For example, if the user says "I'm feeling tired today," the server will generate a friendly response such as, "How about trying to relax today? Shall I recommend some relaxation techniques?"
[0763] The generated response information is sent to the device and converted into speech using a speech synthesis API such as Amazon Polly. The spoken response is either transmitted to the user through the speaker or presented as visual information on the screen.
[0764] Example of a prompt:
[0765] "User comment: 'I'm not really in the mood today.' The emotion engine has detected sadness. Please respond with something friendly to cheer up the user."
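As a minimal sketch of how such a prompt could be assembled from the transcribed text and the detected emotion label, consider the following. The function name `build_prompt`, the `INSTRUCTIONS` table, and the default instruction are hypothetical and do not appear in this disclosure; only the wording for "sadness" follows the example prompt above.

```python
# Hypothetical per-emotion instructions; only the "sadness" wording mirrors the example prompt.
INSTRUCTIONS = {
    "sadness": "Please respond with something friendly to cheer up the user.",
}
# Assumed fallback for labels without a dedicated instruction.
DEFAULT_INSTRUCTION = "Please respond in a friendly tone that suits this emotion."


def build_prompt(user_text: str, emotion: str) -> str:
    """Combine the transcript and the emotion label into one prompt string."""
    instruction = INSTRUCTIONS.get(emotion, DEFAULT_INSTRUCTION)
    return (
        f"User comment: '{user_text}' "
        f"The emotion engine has detected {emotion}. "
        f"{instruction}"
    )


print(build_prompt("I'm not really in the mood today.", "sadness"))
```

Keeping the instruction text in a lookup table is one possible design choice; it lets new emotion labels be supported without changing the prompt-building logic.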
[0766] Such systems allow users to enjoy natural, emotionally resonant interactions, increasing the convenience of consumer robots in everyday life.
[0767] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0768] Step 1:
[0769] The terminal acquires the voice spoken by the user to the robot via a microphone. First, the voice data, as input, is converted into digital information. This process utilizes voice compression technology to convert analog voice signals into digital data. The output is the voice data in digital format.
[0770] Step 2:
[0771] The device converts the acquired digital speech information into text. This process utilizes the Google Cloud Speech-to-Text API to generate text from speech patterns. The input for this step is digital speech information, and the output is the transcribed user conversation. Specifically, the device uses a language model to convert phonemes into words.
[0772] Step 3:
[0773] The device uses an emotion engine to analyze the user's emotional state from their voice. The input consists of the text information obtained in step 2 and the tone and timing of the voice. Specifically, the EmotionAPI analyzes the voice features and outputs emotion labels such as joy and sadness. The output is information about the user's emotional state.
[0774] Step 4:
[0775] The terminal transmits text information and emotional state information to the computer. The input is the text from step 2 and the emotional state information from step 3. Based on this information, the computer generates an appropriate response through a generative AI model. The output is a list of answer candidates. Specifically, the computer uses OpenAI's GPT model to select the best answer based on the input prompt sentence.
[0776] Step 5:
[0777] The server selects the answer that best matches the user's emotions from the generated list of answer candidates and sends it to the terminal. The input is the list of answer candidates. The output is the adjusted answer. Specifically, it selects an answer that emphasizes emotionally appropriate feedback and suggestions.
[0778] Step 6:
[0779] The terminal converts the received response into speech using a speech synthesis API. The input is a processed response from a computer. The output is the spoken response. Specifically, it is made audible with a natural voice tone using services such as Amazon Polly.
[0780] Step 7:
[0781] The device either outputs the spoken response through its speaker or displays it as text on its screen. The input is the speech data synthesized in step 6. The output is the response information in a form that the user can perceive. Specifically, the device adjusts the speaker volume to play the response appropriately.
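The flow of steps 1 to 7 can be sketched roughly as below. This is an illustration under stated assumptions, not the implementation of this disclosure: the `Stages` container, `handle_utterance`, `select_by_emotion`, and the `HINTS` keyword table are all hypothetical names, and the external services (speech recognition, emotion analysis, generative AI, speech synthesis) are replaced by stub functions so that the example runs on its own. The keyword-based selection in step 5 is only one conceivable way to pick an emotionally appropriate candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical keyword hints per emotion label (not from the source).
HINTS = {"sadness": ("relax", "rest"), "joy": ("great", "glad")}


def select_by_emotion(candidates: List[str], emotion: str) -> str:
    """Step 5: keep the candidate containing the most emotion-related keywords."""
    hints = HINTS.get(emotion, ())
    return max(candidates, key=lambda c: sum(h in c.lower() for h in hints))


@dataclass
class Stages:
    """Pluggable stages; in practice each would wrap an external service."""
    transcribe: Callable[[bytes], str]           # step 2: speech -> text
    detect_emotion: Callable[[bytes, str], str]  # step 3: audio + text -> label
    generate: Callable[[str], List[str]]         # step 4: prompt -> candidates
    synthesize: Callable[[str], bytes]           # step 6: text -> speech data


def handle_utterance(audio: bytes, stages: Stages) -> Tuple[str, bytes]:
    """Run steps 2 to 6 on digitized audio (step 1); return the text and speech."""
    text = stages.transcribe(audio)
    emotion = stages.detect_emotion(audio, text)
    prompt = f"User comment: '{text}' The emotion engine has detected {emotion}."
    candidates = stages.generate(prompt)
    answer = select_by_emotion(candidates, emotion)
    return answer, stages.synthesize(answer)  # step 7 would play or display this


# Stub stages standing in for the real services.
stub = Stages(
    transcribe=lambda audio: "I'm feeling tired today",
    detect_emotion=lambda audio, text: "sadness",
    generate=lambda prompt: ["Here is today's news.", "How about trying to relax today?"],
    synthesize=lambda text: text.encode("utf-8"),
)
answer, speech = handle_utterance(b"\x00\x01", stub)
print(answer)
```

Passing the stages in as callables keeps the control flow independent of any particular vendor API, so each stub could later be swapped for a real client.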
[0782] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0783] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. It performs inference on the input data according to the instructions in the prompt and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0784] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0785] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0786] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0787] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0788] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
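The layout described above (left and right halves, upper and lower sides, inner and outer rings) could be represented with a simple polar data structure, sketched below. The class `EmotionPoint`, the function `describe`, the clock-hour and ring encoding, and the sample placements are all assumptions made for illustration; the actual positions of individual emotions are defined by Figure 9, not by this code.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EmotionPoint:
    name: str
    clock_hour: float  # 12 = top, 3 = right, 6 = bottom, 9 = left
    ring: int          # 1 = innermost (most primitive); larger = further out

    def xy(self) -> Tuple[float, float]:
        """Convert the clock position and ring to Cartesian coordinates."""
        theta = 2 * math.pi * (self.clock_hour % 12) / 12  # clockwise from the top
        return self.ring * math.sin(theta), self.ring * math.cos(theta)


def describe(point: EmotionPoint, eps: float = 1e-9) -> Tuple[str, str]:
    """Return (side, valence) following the halves described in the text."""
    x, y = point.xy()
    # Right half: situational judgment. Left half: reactions in the brain.
    side = "situation" if x > eps else "reaction" if x < -eps else "both"
    # Upper side: pleasure. Lower side: displeasure.
    valence = "pleasure" if y > eps else "displeasure" if y < -eps else "neutral"
    return side, valence


# Placeholder points; the names and positions are illustrative only.
right_middle = EmotionPoint("example_at_3_oclock", clock_hour=3, ring=2)
upper_left = EmotionPoint("example_upper_left", clock_hour=10.5, ring=1)
print(describe(right_middle), describe(upper_left))
```

In this encoding, a larger `ring` value corresponds to an emotion that is more outwardly expressed in actions, matching the inside-to-outside reading of the map.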
[0789] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
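One way to picture the balance idea for a robot is a score that is positive when the monitored quantities are near their ideal values and negative when they are far from them. The sketch below is only an assumed formulation: the function `balance_valence`, the tolerance scaling, the cap at twice the tolerance, and the sample values for battery level and posture tilt are not taken from this disclosure.

```python
from typing import Dict


def balance_valence(
    current: Dict[str, float],
    ideal: Dict[str, float],
    tolerance: Dict[str, float],
) -> float:
    """Return a score in [-1, 1]: positive near the ideal, negative far from it."""
    deviations = [
        # Deviation in units of tolerance, capped at 2 so the score stays bounded.
        min(abs(current[key] - ideal[key]) / tolerance[key], 2.0)
        for key in ideal
    ]
    return 1.0 - sum(deviations) / len(deviations)


# Assumed quantities and scales, for illustration only.
ideal = {"battery_level": 1.0, "posture_tilt_deg": 0.0}
tolerance = {"battery_level": 0.5, "posture_tilt_deg": 15.0}

near_ideal = balance_valence({"battery_level": 0.9, "posture_tilt_deg": 3.0}, ideal, tolerance)
far_from_ideal = balance_valence({"battery_level": 0.1, "posture_tilt_deg": 30.0}, ideal, tolerance)
print(near_ideal > 0, far_from_ideal < 0)
```

A positive score would correspond to pleasure and a negative score to discomfort; how such a score feeds into the emotion map is left open here.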
[0790] The emotion map defines two emotions that promote learning. One is the negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion located around "desire" on the reaction side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."
[0791] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
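The property that emotions located close together receive similar values could, for example, be built into the training targets by letting the target value fall off with distance on the map. The following sketch assumes a Gaussian falloff and uses made-up two-dimensional positions; `POSITIONS`, `soft_target`, `dominant_emotion`, and the `sigma` parameter are hypothetical, and the disclosure does not state how the neural network is actually trained to achieve this.

```python
import math
from typing import Dict, Tuple

# Hypothetical map coordinates (x: reaction < 0 < situation, y: displeasure < 0 < pleasure).
POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (2.0, 0.2),
    "calm": (2.0, 0.0),
    "confident": (2.1, 0.3),
    "desire": (-2.0, 1.0),
}


def soft_target(label: str, sigma: float = 0.5) -> Dict[str, float]:
    """Build target emotion values that decay with map distance from the label."""
    cx, cy = POSITIONS[label]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in POSITIONS.items()
    }


def dominant_emotion(values: Dict[str, float]) -> str:
    """Pick the emotion with the highest emotion value."""
    return max(values, key=values.get)


target = soft_target("calm")
for name, value in target.items():
    print(f"{name}: {value:.3f}")
print(dominant_emotion(target))
```

With these assumed positions, "reassured" and "confident" receive values close to that of "calm", while the distant "desire" receives a value near zero.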
[0792] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0793] In the above embodiment, an example was given in which the specific processing is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific processing may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.
[0794] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.
[0795] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0796] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0797] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0798] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0799] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0800] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0801] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0802] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0803] The following is further disclosed regarding the embodiments described above.
[0804] (Claim 1)
[0805] means for acquiring audio information as a digital signal,
[0806] means for converting the digital signal into text,
[0807] means for transmitting the text to a remote information processing device,
[0808] means for receiving response information generated by the information processing device, and
[0809] means for notifying the user of the response information audibly or visually;
[0810] A system comprising the above means.
[0811] (Claim 2)
[0812] The system according to claim 1, further comprising means for accessing an external information source based on the user's voice input and returning the acquired information to the user via the information processing device.
[0813] (Claim 3)
[0814] The system according to claim 1, further comprising means for adjusting the information presented based on the user's age and preferences to provide personalized information.
[0815] "Example 1"
[0816] (Claim 1)
[0817] means for acquiring audio data as an electronic signal,
[0818] means for converting the electronic signal into character information,
[0819] means for transmitting the character information to a remote data processing device,
[0820] means for receiving response information generated by the data processing device using a generative AI model,
[0821] means for notifying the user of the response information by voice or image, and
[0822] means for acquiring and using the user's location information;
[0823] A system comprising the above means.
[0824] (Claim 2)
[0825] The system according to claim 1, further comprising means for accessing information sources based on voice instructions from a user and responding to the user with the acquired information via the data processing device.
[0826] (Claim 3)
[0827] The system according to claim 1, further comprising means for adjusting the information provided based on user attribute information and providing personalized information.
[0828] "Application Example 1"
[0829] (Claim 1)
[0830] means for acquiring audio as a data signal,
[0831] means for converting the data signal into character data,
[0832] means for transmitting the character data to a remote data processing device,
[0833] means for receiving response information generated by the data processing device,
[0834] means for presenting the response information to the user audibly or visually, and
[0835] means for providing the user with information on daily living support;
[0836] A system comprising the above means.
[0837] (Claim 2)
[0838] The system according to claim 1, further comprising means for accessing an external information source based on the user's voice input and responding to the user with the acquired information via the data processing device.
[0839] (Claim 3)
[0840] The system according to claim 1, further comprising means for adjusting the information presented based on the user's attributes and providing personalized information.
[0841] "Example 2 of combining an emotion engine"
[0842] (Claim 1)
[0843] means for acquiring audio data as a digital signal,
[0844] means for converting the digital signal into text data,
[0845] means for transmitting the text data and emotional information extracted from the audio to a remote information processing device,
[0846] means for receiving response information generated by the information processing device, and
[0847] means for notifying the user of the response information audibly or visually;
[0848] A system comprising the above means.
[0849] (Claim 2)
[0850] The system according to claim 1, further comprising means for accessing an external information source based on the user's voice input, adjusting the acquired information according to the emotional state, and responding to the user via the information processing device.
[0851] (Claim 3)
[0852] The system according to claim 1, further comprising means for analyzing the user's emotional state and providing personalized information to a generative AI model using prompt sentences.
[0853] "Application example 2 when combining with an emotional engine"
[0854] (Claim 1)
[0855] A device that acquires audio data as digital information,
[0856] A device for converting the aforementioned digital information into a string,
[0857] A device for transmitting the aforementioned string to a remote electronic computer,
[0858] A device for receiving response information generated by the aforementioned electronic computer,
[0859] A device that notifies the user of the aforementioned response information by voice or screen display,
[0860] A device that analyzes the emotional state of users from their voice data,
[0861] A device that adjusts response information based on the aforementioned emotional state,
[0862] A system comprising the above devices.
[0863] (Claim 2)
[0864] The system according to claim 1, further comprising a device that accesses an external information source based on a user's voice input and provides the acquired information to the user via the electronic computer.
[0865] (Claim 3)
[0866] The system according to claim 1, further comprising a device that adjusts the information presented based on the user's characteristics and provides personalized information.
[Explanation of symbols]
[0867]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: means for acquiring audio as a data signal; means for converting the data signal into character data; means for transmitting the character data to a remote data processing device; means for receiving response information generated by the data processing device; means for presenting the response information to the user audibly or visually; and means for providing the user with information on daily living support.
2. The system according to claim 1, further comprising means for accessing an external information source based on the user's voice input and responding to the user with the acquired information via the data processing device.
3. The system according to claim 1, further comprising means for adjusting the information presented based on the user's attributes and providing personalized information.