system

JP2026105495A · Pending · Publication Date: 2026-06-26 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-16
Publication Date
2026-06-26


Abstract

To provide a system. [Solution] The system includes: a speech recognition means that receives audio input and converts the audio into text data; a means for relaying the text data and performing language processing in a processing device; a speech synthesis means that converts the response text generated by the processing device into sound and outputs it; and a physiological information analysis means that recognizes the user's physiological information and adjusts the response based on that physiological information.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In modern society, there is a problem that loneliness is increasing, especially among the elderly. Many elderly people have few opportunities to have daily conversations, and their mental health may be impaired. A communication tool that can be easily operated is required to reduce this loneliness. However, with existing technologies, it is difficult to have natural and continuous conversations, and it is not possible to provide an interlocutor that is friendly to the elderly.

Means for Solving the Problems

[0005] This invention solves this problem by providing a system comprising a speech recognition means for converting voice input into text data, a server for natural language processing, and a speech synthesis means for converting the generated response text into speech. This system enables natural conversation with the user and, through sentiment analysis on the server side and learning from past conversation data, can provide more appropriate and friendly responses.

[0006] "Voice input" refers to the act or function of receiving a user's voice through a microphone or other device and processing its content as analyzable data.

[0007] "Speech recognition means" refers to technologies and devices for converting received speech input into text data.

[0008] "Text data" refers to string data containing linguistic information converted by speech recognition technology.

[0009] A "server" is a central processing unit or system that receives text data and performs natural language processing and response generation.

[0010] "Natural language processing" refers to the technology that enables computers to understand and process human language, and specifically includes the analysis of text data and the generation of responses.

[0011] "Response text" refers to string data generated on the server through natural language processing as a response to user input.

[0012] "Speech synthesis means" refers to technologies and devices that convert response text into speech data and allow the user to listen to it.

[0013] "Conversation data" refers to data that records the history of interactions between the user and the system, and is used to improve future conversations.

Brief Description of the Drawings

[0014]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0015] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0016] First, the terms used in the following description will be explained.

[0017] In the following embodiments, a numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0018] In the following embodiments, a numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0019] In the following embodiments, a numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, etc.

[0020] In the following embodiments, a numbered communication interface (I/F) is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark).

[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0022] [First Embodiment]

[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0031] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0035] This invention provides a dialogue system that allows elderly people and other users to converse naturally. The system involves three main elements that work together: the user, the terminal (Robot-kun), and the server.

[0036] Users can easily use the system by simply talking to the robot. The voice spoken by the user is received by a microphone built into the device. The device uses speech recognition to convert the voice into text data. The converted text data is then sent to the server.

[0037] The server performs natural language processing based on the received text data. This process helps understand the user's intent and generates appropriate response text. The server also performs sentiment analysis and constructs responses that are empathetic to the user's emotions. Furthermore, by referring to past conversation data, it can adjust responses to ensure smoother interactions in the future.

[0038] The response text generated by the server is sent to the terminal. The terminal converts the response text into speech using a speech synthesis system and plays it to the user through its speaker. At this time, the terminal may also incorporate facial expressions and gestures to provide a more natural conversational experience.

[0039] As a concrete example, consider a scenario where a user tells the robot, "I was lonely today because I didn't get to talk to anyone." The device transcribes this audio into text and sends it to the server. The server performs sentiment analysis and understands the emotion of "loneliness." It then generates a response text such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," and sends it to the device. The device then converts this back into speech and conveys it to the user, thus enabling system-based dialogue.

[0040] In this way, the present invention plays an important role in realizing communication that users find comfortable and alleviating feelings of loneliness among the elderly.

[0041] The following describes the processing flow.

[0042] Step 1:

[0043] The user speaks to the robot. The device's built-in microphone captures the voice input during this conversation.

[0044] Step 2:

[0045] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for the next processing step.

[0046] Step 3:

[0047] The terminal sends text data to the server. At this time, data cleansing, such as noise reduction, is performed as a preprocessing step if necessary.

[0048] Step 4:

[0049] The server inputs the received text data into a natural language processing system to analyze the user's intent. It also uses a sentiment analysis module to estimate the user's emotions.

[0050] Step 5:

[0051] The server generates appropriate response text based on the analysis results. During this process, it references past conversation history and adjusts the response to reflect the user's preferences.

[0052] Step 6:

[0053] The server sends the generated response text to the terminal. The transmitted data is provided in a format suitable for the speech synthesis process.

[0054] Step 7:

[0055] The terminal converts the received response text into speech data using a speech synthesis system. The converted speech is then transmitted to the user through the speaker.

[0056] Step 8:

[0057] The user receives a voice response from the device and continues the conversation as needed. This process is repeated, and communication with the user takes place.
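
As an illustration only, Steps 1 to 8 above can be sketched as a single conversational turn. The class names Terminal and Server, the callable fields, and the toy emotion and reply functions are all hypothetical stand-ins introduced here. The specification does not prescribe any particular implementation, and real speech recognition, language processing, and speech synthesis components would replace the toy functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Server:
    """Hypothetical stand-in for the data processing device 12."""
    estimate_emotion: Callable[[str], str]
    generate_reply: Callable[[str, str, List[Tuple[str, str]]], str]
    history: List[Tuple[str, str]] = field(default_factory=list)

    def respond(self, text: str) -> str:
        emotion = self.estimate_emotion(text)                     # Step 4
        reply = self.generate_reply(text, emotion, self.history)  # Step 5
        self.history.append((text, reply))  # stored for later turns
        return reply                                              # Step 6


@dataclass
class Terminal:
    """Hypothetical stand-in for the smart device 14."""
    recognize: Callable[[bytes], str]   # speech recognition
    synthesize: Callable[[str], bytes]  # speech synthesis
    server: Server

    def handle_utterance(self, audio: bytes) -> bytes:
        text = self.recognize(audio).strip()  # Steps 1-2, plus light cleansing
        reply = self.server.respond(text)     # Steps 3-6
        return self.synthesize(reply)         # Step 7


# Toy components (assumptions) so the sketch runs without external services.
def toy_emotion(text: str) -> str:
    return "loneliness" if "lonely" in text.lower() else "neutral"


def toy_reply(text: str, emotion: str, history: List[Tuple[str, str]]) -> str:
    if emotion == "loneliness":
        return "That must have been lonely. I'll be your conversation partner."
    return "I see. Please tell me more."


server = Server(estimate_emotion=toy_emotion, generate_reply=toy_reply)
terminal = Terminal(
    recognize=lambda audio: audio.decode("utf-8"),   # pretend the audio is text
    synthesize=lambda text: text.encode("utf-8"),    # pretend the bytes are speech
    server=server,
)
speech = terminal.handle_utterance(b"I was lonely today.")
```

Passing the components in as callables keeps the terminal-side and server-side roles separate, in line with the division of work described above. Step 8 corresponds to calling handle_utterance again for the next utterance.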

[0058] (Example 1)

[0059] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0060] In modern society, loneliness and lack of communication experienced by the elderly and those living alone are significant issues affecting their psychological health. Furthermore, existing dialogue systems are inadequate at understanding emotions, making it difficult to achieve natural, empathetic communication similar to that between humans. Moreover, optimizing continuous dialogue based on past conversations is challenging, resulting in insufficient improvement of the long-term customer experience.

[0061] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0062] In this invention, the server includes a speech recognition means that receives voice input and converts the voice into text data; a means that performs sentiment analysis based on the received text data and generates a response corresponding to the text data; and a means that learns past dialogue data with the user and optimizes the response in the next dialogue. This enables natural dialogue that is empathetic to the emotions of each individual user, which can alleviate feelings of loneliness among the elderly and support the building of lasting relationships.

[0063] "Voice input" refers to a data format that users provide to a system via voice.

[0064] "Text data" refers to text-formatted data converted from voice input.

[0065] "Speech recognition means" refers to a technology or program for converting speech input into text data.

[0066] An "information processing device" is a computer system that receives data via a network, and then analyzes and processes it.

[0067] "Natural language processing" is a technology that analyzes human language and generates responses in a way that is natural for humans.

[0068] "Speech synthesis means" refers to a technology or program for outputting text data as speech.

[0069] "Sentiment analysis" is a technology that analyzes and detects a user's emotions from text data.

[0070] "Past dialogue data" refers to records of previous communications between the user and the system.

[0071] "Response optimization" is the process of using past dialogue data to adjust responses in future dialogues to be more appropriate.

[0072] "Object movements and expressions" refer to the actions and facial expressions that physical devices or virtual characters perform during interactions with the user.

[0073] This invention aims to realize a dialogue system that allows elderly people and those living alone to converse naturally. The system consists of three elements: user, terminal, and server.

[0074] The user inputs voice to interact with the system. Specifically, the user starts the system by speaking in the form of everyday conversation through the voice input device built into the terminal. The terminal receives this voice and converts it into text data using a commonly used speech recognition API (for example, a standard voice input API).

[0075] The converted text data is sent from the terminal to the server, which performs natural language processing on it using a generative AI model. The server analyzes the user's intent, estimates the user's emotions with a sentiment analysis tool (such as a general sentiment analysis engine), and generates an appropriate response. It then refers to a database of past conversations to optimize the response based on the user's history, which makes the communication more accurate.

[0076] The response text generated by the server is sent back to the terminal. The terminal uses speech synthesis technology (such as a common speech synthesis engine) to convert this text information into speech, which is then transmitted to the user through the speaker. This allows the user to experience a natural conversation with the robot. Furthermore, the terminal can provide a more emotionally rich and interactive conversational experience through facial expressions and gestures.

[0077] For example, if a user tells the robot, "I was lonely today because I didn't get to talk to anyone," the device converts this audio into text data and sends it to the server. The server analyzes the emotion of "loneliness" and generates a response such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," which the device then returns to the user as audio. Through this entire process, the system alleviates the user's feelings of loneliness and facilitates comfortable communication.

[0078] As an example of a prompt, by inputting a message in the format of "Generate a dialogue response that conveys the user is feeling lonely," the generative AI model can design a situation-appropriate response.

[0079] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0080] Step 1:

[0081] The user speaks into the device's voice input device. The device receives the voice signal through the microphone. The input voice data is converted into text data using speech recognition. This conversion process analyzes the voice waveform data, and its contents are output as text data.

[0082] Step 2:

[0083] The terminal sends the text data generated by speech recognition to the server. Communication takes place via a common protocol (e.g., HTTP). Here, the input is the text data, and the output is its transmission to the server.

[0084] Step 3:

[0085] The server processes the received text data and performs natural language processing. A generative AI model is used to analyze the meaning of the text and understand the user's intent. This process involves word semantic analysis and sentence structure analysis, generating new response text. The generated response is further refined using sentiment analysis tools to transform it into emotionally resonant content.

[0086] Step 4:

[0087] The server sends the generated response text to the terminal. Here too, data is transferred via the communication protocol. The input is the adjusted response text, and the output is the transmission to the terminal.

[0088] Step 5:

[0089] The terminal converts the response text received from the server into speech using a speech synthesis system. In this process, a speech waveform is generated based on the text data and output to the user through the speaker. Here, the input is the response text, and the output is synthesized speech data.

[0090] Step 6:

[0091] The device further embodies responses using physical actions and facial expressions. This is to provide users with a more natural and engaging experience by having the robot interact with them using facial expressions and gestures. In this process, responses and action instructions are generated as input, and based on these, physical actions and visual expressions are output.
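
The following is a minimal sketch of how the text transfer in Step 2 and the prompt preparation in Step 3 might look. It assumes a JSON message body and a plain-text prompt layout. The field names user_id and text, the function names build_request and build_prompt, and the prompt wording are all illustrative assumptions and do not appear in the specification. No network call or generative AI model is invoked here.

```python
import json
from typing import Dict, List


def build_request(text: str, user_id: str) -> str:
    """Serialize the recognized text as a JSON body (field names are assumptions)."""
    return json.dumps({"user_id": user_id, "text": text}, ensure_ascii=False)


def build_prompt(text: str, emotion: str,
                 history: List[Dict[str, str]], max_turns: int = 3) -> str:
    """Combine the utterance, the estimated emotion, and recent history."""
    lines = [
        f"The user appears to be feeling: {emotion}.",
        "Generate a short, empathetic dialogue response.",
    ]
    # Use only the most recent turns; guard against max_turns == 0,
    # because history[-0:] would return the whole list.
    recent = history[-max_turns:] if max_turns > 0 else []
    for turn in recent:
        lines.append(f"User: {turn['user']}")
        lines.append(f"System: {turn['system']}")
    lines.append(f"User: {text}")
    lines.append("System:")
    return "\n".join(lines)


# Terminal side: wrap the recognized text for transfer (e.g., as an HTTP body).
body = build_request(
    "I was lonely today because I didn't get to talk to anyone.", "user-001"
)

# Server side: unpack the message and assemble the prompt for the model.
payload = json.loads(body)
past = [{"user": "Good morning.", "system": "Good morning! Did you sleep well?"}]
prompt = build_prompt(payload["text"], "loneliness", past)
```

Limiting the history to a few recent turns is one simple way to reflect past dialogue without letting the prompt grow without bound. The specification leaves the method of response optimization open.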

[0092] (Application Example 1)

[0093] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0094] There is a need to provide a natural conversational environment that allows elderly and other users with communication difficulties to engage in dialogue with peace of mind and alleviate feelings of loneliness. Conventional dialogue systems often lack naturalness and comfort in conversation because they do not adequately consider the physiological and psychological state of the user when responding. A system is needed to improve this and realize a dialogue experience that is tailored to the user's needs.

[0095] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0096] In this invention, the server includes a speech recognition means that receives acoustic input and converts it into text data, a means that relays the text data and performs language processing, a speech synthesis means that converts the generated response text into sound and outputs it, and a physiological information analysis means that recognizes the user's physiological information and adjusts the response. This enables appropriate and natural dialogue that is tailored to the user's physiological and psychological characteristics.

[0097] "Audio input" refers to the voice acquired from the user and is the input format used when a dialogue system receives voice information.

[0098] "Character data" refers to digital text information converted from acoustic input by speech recognition technology.

[0099] A "speech recognition system" is a process that has the function of converting acoustic input into text data, and performs acoustic analysis and generates textual representations.

[0100] "Language processing" is the process of analyzing and understanding text data and generating appropriate responses, utilizing natural language processing technologies.

[0101] "Speech synthesis means" refers to technology that converts text data back into sound and outputs it to the user.

[0102] "Physiological information" refers to information that indicates the user's physical condition, including indicators such as heart rate, complexion, and facial expression.

[0103] A "physiological information analysis means" is a process equipped with the function of adjusting the content and tone of responses based on the user's physiological information.

[0104] This invention is designed as part of a conversation system for use by the elderly and mainly consists of functions for acoustic input, text data conversion, language processing, speech generation, and physiological information analysis.

[0105] First, when the user speaks to the robot, the microphone built into the device receives this audio as sound input. The received audio is then converted into text data by a speech recognition system. In this process, the device utilizes Google's Speech-to-Text API to achieve advanced speech recognition.

[0106] Next, the text data is sent to the server, where language processing is performed using Python and spaCy. The server analyzes and understands the text data, and through natural language processing, generates a response that is tailored to the user. Simultaneously, sentiment analysis is performed using IBM Watson® Natural Language Understanding, and a response adapted to the user's emotions is prepared. In addition, physiological information such as the user's heart rate and facial expressions is acquired and the response is adjusted to achieve truly personalized dialogue.

[0107] The generated response text is sent to the terminal, converted back into sound using a speech synthesis method based on the Google Text-to-Speech API, and output to the user through the speaker. At this time, the terminal provides a more natural conversational experience by incorporating facial expressions and gestures according to the user's physiological state.

[0108] As a concrete example, if a user in a nursing home mutters "I was lonely today" through their glasses, the system immediately transcribes this voice into text and sends it to the server. After appropriately understanding the situation through language processing and physiological information analysis, it can respond with something like, "Thank you for sharing how you're feeling lonely. I'm thinking about what you can do to distract you a little."

[0109] Examples of prompts for the generative AI model include questions such as, "How was your day today?" and "Tell me about something fun you've done recently." This ensures that the conversation is always fresh and engaging.

[0110] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0111] Step 1:

[0112] The user speaks to the robot, and the microphone built into the device captures this audio input. The audio input is captured as raw speech data and passed to the Google Speech-to-Text API.

[0113] Step 2:

[0114] The device uses speech recognition to convert the acquired audio input into text data. Specifically, the Google Speech-to-Text API analyzes the audio data and generates a corresponding text representation. The input is audio, and the output is text data.

[0115] Step 3:

[0116] Text data is sent to the server, which uses Python and spaCy to perform natural language processing. The server receives the text data as input, performs grammatical analysis and contextual understanding, and extracts the user's intent. This process prepares the generative AI model to provide appropriate answers and information based on the user's intent.

[0117] Step 4:

[0118] The server uses IBM Watson Natural Language Understanding to perform sentiment analysis on the text data. It identifies emotions such as joy, anger, sadness, and pleasure from the text data received as input and designs a response accordingly. The results of the sentiment analysis provide data for understanding the user's psychological state.

[0119] Step 5:

[0120] A physiological information analysis system acquires the user's physiological information and provides this information to the server. The server comprehensively analyzes the physiological information and emotion analysis data to generate the optimal response. Specifically, it takes into account the user's heart rate and facial expression patterns to adjust the tone and content of the response.

[0121] Step 6:

[0122] The server sends the generated response text to the device, which then uses the Google Text-to-Speech API to convert the text data into speech. Finally, this audio response is delivered to the user through the device's speaker. The output is the audio that the user hears.

[0123] Step 7:

[0124] The device also displays facial expressions and simple gestures when responding, providing users with a natural and friendly conversational experience. User feedback is recorded on the server so that future responses can be further optimized.
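
As a rough illustration of the adjustment described in Step 5, the sketch below combines an emotion label with heart rate and facial expression to choose a response tone. The Physiology class, the 1.3x heart-rate threshold, the expression labels, the tone names, and the prefix phrases are all assumptions made for this example. The specification does not state how the physiological information is weighted or which thresholds are used.

```python
from dataclasses import dataclass


@dataclass
class Physiology:
    heart_rate_bpm: float  # e.g., from a wearable sensor (assumed source)
    expression: str        # assumed labels: "smile", "neutral", "frown"


def choose_tone(emotion: str, phys: Physiology, resting_bpm: float = 70.0) -> str:
    """Pick a response tone from the text emotion plus physiological cues.

    The 1.3x threshold and the tone names are illustrative assumptions.
    """
    elevated = phys.heart_rate_bpm > 1.3 * resting_bpm
    if elevated or phys.expression == "frown":
        return "calm"        # signs of strain take priority
    if emotion in ("sadness", "loneliness"):
        return "comforting"
    if phys.expression == "smile":
        return "cheerful"
    return "neutral"


def adjust_response(reply: str, tone: str) -> str:
    """Adjust the wording of a generated reply to the chosen tone."""
    prefixes = {
        "calm": "Let's take it slowly. ",
        "comforting": "Thank you for telling me. ",
        "cheerful": "That's wonderful! ",
    }
    return prefixes.get(tone, "") + reply


tone = choose_tone("loneliness", Physiology(heart_rate_bpm=72, expression="neutral"))
adjusted = adjust_response("Shall we talk for a while?", tone)
```

In this sketch the physiological cues are checked before the text-based emotion, so a raised heart rate leads to a calmer reply even when the words sound positive. That ordering is a design choice of the example, not something the specification requires.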

[0125] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0126] This invention relates to a system that includes an emotion engine that analyzes voice input during user interaction to recognize emotions and generates a response based on the results. This system consists of a user, a terminal, a server, and the emotion engine.

[0127] When a user speaks to the robot, their voice is captured by a microphone built into the device. The device converts the voice input into text data using speech recognition and sends this data to the server.

[0128] On the server, the received text data is analyzed by a natural language processing system, and at the same time, an emotion engine is activated to recognize emotions from the audio data. The emotion engine has the ability to estimate the user's emotional state from the intonation, speed, and pauses of the voice. Through this emotion analysis, it is possible to understand what the user is feeling.

[0129] The server takes the results of this emotion recognition into account and generates an appropriate response text. For example, if the user is judged to be sad, it will generate an encouraging response, and if the user is judged to be happy, it will select a more friendly response.

[0130] The generated response text is sent to the terminal and converted into speech data using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. At this time, the terminal responds to the user using gestures and facial expressions, providing a more human-like and pleasant conversational experience.

[0131] As a concrete example, suppose a user says to the robot, "I'm feeling down because I had a bad day." In this case, the device converts this utterance into text and sends it to the server. The emotion engine analyzes the user's tone of voice and content and concludes that the user is "feeling down." The server then generates a comforting response such as, "Some days are like that. I'm sure you'll do better next time."

[0132] In this way, the present invention enables natural dialogue that is attentive to the user's emotions and can fulfill the role of a conversational partner.

[0133] The following describes the processing flow.

[0134] Step 1:

[0135] The user speaks to the robot. During this interaction, the voice is captured by the microphone in the device.

[0136] Step 2:

[0137] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for subsequent processing.

[0138] Step 3:

[0139] The terminal sends text data and voice feature data to the server. The voice feature data includes information on intonation, speed, and pauses.

[0140] Step 4:

[0141] On the server, natural language processing is performed on the text data to analyze user intent. Simultaneously, an emotion engine recognizes the user's emotions from the voice feature data.

[0142] Step 5:

[0143] The server combines the results of natural language processing with the results of emotion recognition by the emotion engine to generate appropriate response text. The generated response will be tailored to the user's emotions.

[0144] Step 6:

[0145] The server sends the generated response text to the terminal. The terminal then prepares to convert this into audio data.

[0146] Step 7:

[0147] The terminal uses speech synthesis to convert the response text into audio data and transmits the response to the user through the speaker.

[0148] Step 8:

[0149] The user receives an audio response from the device, and if they wish to continue the conversation, they speak again, and the process is repeated.
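
Steps 3 to 6 above can be sketched as a message exchange in which the terminal sends the text data together with the voice feature data, and the server estimates the emotion before choosing a response. This is a minimal sketch, not the disclosed implementation: the JSON field names, the `RESPONSES` table (apart from the comforting reply quoted earlier), and the stub `estimate_emotion` rule are assumptions introduced for illustration, and natural language processing of the text is omitted.

```python
import json

# Hypothetical response templates keyed by the estimated emotion.
RESPONSES = {
    "sad": "Some days are like that. I'm sure you'll do better next time.",
    "happy": "That's wonderful! Tell me more.",
    "neutral": "I see. Please go on.",
}


def build_request(text: str, intonation: float, speed: float, pauses: float) -> str:
    """Terminal side (Step 3): bundle text data and voice feature data."""
    return json.dumps({
        "text": text,
        "voice_features": {"intonation": intonation, "speed": speed, "pauses": pauses},
    })


def estimate_emotion(voice_features: dict) -> str:
    """Stub emotion engine (Step 4); the rule below is an illustrative assumption."""
    if voice_features["speed"] < 3.0 and voice_features["pauses"] > 0.3:
        return "sad"
    if voice_features["intonation"] > 0.6:
        return "happy"
    return "neutral"


def handle_request(raw: str) -> str:
    """Server side (Steps 4 to 6): decode, estimate emotion, pick a response."""
    message = json.loads(raw)
    emotion = estimate_emotion(message["voice_features"])
    return json.dumps({"emotion": emotion, "response_text": RESPONSES[emotion]})


request = build_request("I'm feeling down because I had a bad day.", 0.2, 2.4, 0.4)
reply = json.loads(handle_request(request))
print(reply["emotion"], "->", reply["response_text"])
```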

[0150] (Example 2)

[0151] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0152] While modern interactive systems possess technologies that convert voice input into text for text-based conversations, understanding human emotions and providing flexible responses based on them remains challenging. There is a need to develop systems that can overcome this challenge and enable more natural dialogue.

[0153] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0154] In this invention, the server includes means for receiving audio signals and converting them into text information, means for performing sentiment analysis based on the received text information, and means for learning past dialogue data with the user and optimizing responses in subsequent dialogues. This makes it possible to understand the user's emotions and generate natural and appropriate responses accordingly.

[0155] An "audio signal" is a digital or analog representation of sound waveforms, including human voices.

[0156] "Textual information" refers to data expressed as characters, specifically a format in which audio, images, and other data are replaced with text.

[0157] "Speech recognition means" refers to technologies and devices for converting speech signals into text information, and has the function of analyzing speech data and converting it into an appropriate string of characters.

[0158] An "information processing device" is a device used for inputting, processing, storing, and outputting data, and includes computers and servers.

[0159] "Language data processing" refers to the process of analyzing, understanding, and generating data related to language, and may include natural language processing techniques.

[0160] "Speech synthesis means" refers to technologies and devices that convert text information into speech data that can be reproduced as speech.

[0161] "Emotion analysis" is a technology that infers and determines a speaker's emotional state from their voice or text.

[0162] "User" refers to an individual or organization that operates and uses the system or service.

[0163] "Dialogue data" refers to data that records interactions between a user and a system, including the content and duration of the conversation.

[0164] "Response optimization" refers to the process of generating the optimal response based on past information and circumstances, with the aim of providing the most appropriate and effective response for the user.

[0165] This invention is a system that understands a user's emotions through voice interaction and generates a response tailored to that user. The system consists of a user, a terminal, a server, and an emotion analysis engine. Specifically, it is configured as follows:

[0166] When a user speaks into the microphone built into the device, the device captures the audio signal. Here, we can take smartphones and tablets as examples of general-purpose devices. The captured audio signal is converted into text information using speech recognition software, such as a "speech recognition API."

[0167] The converted text information is transmitted to the server via the internet using a secure protocol. The server uses a natural language processing system, such as a "natural language processing engine," to analyze the received text information as linguistic data. Simultaneously, an emotion analysis engine is used to analyze the user's emotions from the audio signal and generate decision data based on that analysis.

[0168] The information obtained through this analysis is used to generate appropriate response text using a generative AI model. For example, if a user says, "I'm very tired today," the system will generate a response such as, "You must be tired. Please get some rest." Such responses are achieved by using the instruction, "Generate appropriate words of comfort based on the user's input," as the prompt for the generative AI model.
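
A minimal sketch of how such a prompt might be assembled is shown below. Only the instruction sentence comes from the description; the `build_prompt` helper, the prompt layout, and the `emotion` field are assumptions, and the call to the generative AI model itself is left out because the disclosure does not name a specific model API.

```python
# Instruction sentence taken from the description above.
INSTRUCTION = "Generate appropriate words of comfort based on the user's input."


def build_prompt(user_text: str, emotion: str) -> str:
    """Combine the instruction, the estimated emotion, and the user's words.

    The three-line layout is a hypothetical choice for illustration.
    """
    return (
        f"{INSTRUCTION}\n"
        f"Estimated emotion: {emotion}\n"
        f"User input: {user_text}"
    )


prompt = build_prompt("I'm very tired today.", "tired")
print(prompt)
```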

[0169] The generated response text information is sent back to the terminal and converted into speech using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. This allows the system to provide the user with a natural and comfortable conversational experience.

[0170] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0171] Step 1:

[0172] The user speaks to the robot. The input is the user's spoken words, for example, "I'm very tired today." A microphone built into the device captures this voice signal and outputs it as digital audio data.

[0173] Step 2:

[0174] The device launches speech recognition software and converts the input speech data into text. This step uses a speech recognition API to output the spoken content as a string. Specifically, it generates the text "I'm very tired today."

[0175] Step 3:

[0176] The converted text information is sent from the terminal to the server. The transmitted data is in text format. The text information reaches the server securely using an internet data transmission protocol.

[0177] Step 4:

[0178] The server analyzes the received text information using a natural language processing engine. The input is text information sent by the user, and the output is the result of language structure analysis. Here, the grammar and meaning of the text information are understood.

[0179] Step 5:

[0180] The server simultaneously uses an emotion analysis engine to analyze the user's emotions from the audio signal. The input is characteristic data of the audio signal (intonation, speed, and pauses), and the output is data indicating the user's emotional state.

[0181] Step 6:

[0182] Based on the results of natural language processing and sentiment analysis, the server uses a generative AI model to generate response text. The model receives the analysis results (text information and language analysis results) as input and outputs an appropriate response, such as "Thank you for your hard work. Please rest well today." Based on the instructions in the prompt, the response will be tailored to the user's emotions.

[0183] Step 7:

[0184] The generated response text information is sent from the server to the terminal. In this step, the response is transmitted as text data.

[0185] Step 8:

[0186] The terminal converts the received response text information into speech data using speech synthesis technology. This process takes text information as input and generates a speech signal as output.

[0187] Step 9:

[0188] The voice data is transmitted to the user through the device's speaker. The user is provided with a natural and comfortable voice response that feels like interacting with a human.
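
Steps 4 to 6 describe language analysis and emotion analysis running simultaneously before response generation. One way to sketch that in Python is with a thread pool. The three functions below are hypothetical stubs standing in for the natural language processing engine, the emotion analysis engine, and the generative AI model, and the feature names and threshold are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_language(text: str) -> dict:
    """Stub for the natural language processing engine (Step 4)."""
    return {"text": text, "words": text.rstrip(".").lower().split()}


def analyze_emotion(features: dict) -> str:
    """Stub for the emotion analysis engine (Step 5); threshold is illustrative."""
    return "tired" if features["speed"] < 3.0 else "neutral"


def generate_response(language_result: dict, emotion: str) -> str:
    """Stub for the generative AI model (Step 6)."""
    if emotion == "tired" or "tired" in language_result["words"]:
        return "Thank you for your hard work. Please rest well today."
    return "I see. Please tell me more."


def server_process(text: str, features: dict) -> str:
    # Run both analyses concurrently, then wait for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        language_future = pool.submit(analyze_language, text)
        emotion_future = pool.submit(analyze_emotion, features)
        language_result = language_future.result()
        emotion = emotion_future.result()
    return generate_response(language_result, emotion)


print(server_process("I'm very tired today.", {"intonation": 0.2, "speed": 2.5, "pauses": 0.4}))
```

Threads are used here on the assumption that both engines are external services where most of the time is spent waiting; a different concurrency model may fit better if the analyses run in-process.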

[0189] (Application Example 2)

[0190] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0191] Conventional speech recognition systems are capable of accurately transcribing speech into text and generating responses, but they have struggled to accurately understand the user's emotions and respond flexibly based on those emotions. Furthermore, when interacting with users in anxious or stressful situations, there is a need to provide appropriate and comforting responses, but only a limited number of systems have been able to deliver satisfactory results in this regard.

[0192] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0193] In this invention, the server includes: speech recognition means for receiving voice input and converting the voice into text data; means for transmitting the text data to a processing unit for natural language processing; speech synthesis means for converting the response text generated by the processing unit into voice and outputting it; emotion recognition means for analyzing the user's emotions and generating an appropriate response based on those emotions; and means for providing visual or physical responses to the user using an optical device or a motion device, while taking into account the generated response. This enables appropriate dialogue and responses in accordance with the user's emotions.

[0194] "Voice input" is a function that receives the user's voice as a digital signal.

[0195] A "speech recognition method that converts speech into text data" is a technology that analyzes input speech and converts its content into text information.

[0196] "Methods for natural language processing" are algorithms that analyze text-based data to understand its meaning and intent.

[0197] "Speech synthesis means" refers to the ability to generate and output speech based on generated text information.

[0198] "Emotion recognition means" refers to processing that identifies the user's emotional state from voice or text.

[0199] "Means of providing visual or physical responses" refers to technologies that use optical devices or motors to provide visual or physical feedback in a manner similar to that of a human.

[0200] This invention provides a system that receives voice input and converts the voice into text data. This system includes voice recognition means, natural language processing means, voice synthesis means, and emotion recognition means.

[0201] The server receives voice input using a microphone and uses a speech recognition library (e.g., Google Speech-to-Text API) to convert it into text data. The converted text data is then parsed by a natural language processing system (e.g., Google Natural Language API). This natural language processing is performed to understand the meaning and intent of the text.

[0202] The server also uses emotion recognition to identify the user's emotional state from their voice or text. Emotion analysis utilizes elements such as intonation, pauses, and speed of speech. Based on this analysis, an appropriate response is generated and output as speech using a speech synthesis system (e.g., Google Text-to-Speech API). The speech is then delivered to the user via a speaker.

[0203] For example, if a user says, "I'm a little tired today," the speech recognition system converts this speech into text, and the server performs sentiment analysis, recognizing that the user is "tired." The server then generates a comforting response such as, "Please rest well today," and returns it to the user as audio.

[0204] By utilizing generative AI models, the server can also provide visual or physical feedback that takes into account the generated response. Optical and motion devices can be used to achieve more human-like responses.

[0205] Example prompt: "When a user says 'I'm a little tired today,' consider what emotion you should recognize and what response you should generate."
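
If the generative AI model is asked to report both the recognized emotion and the response, the server needs a way to separate the two. The sketch below assumes, purely for illustration, that the model is instructed to answer in JSON with `emotion` and `response` keys. This output format, the helper names, and the sample reply are not part of the disclosure.

```python
import json


def build_prompt(utterance: str) -> str:
    """Wrap the example prompt and request a JSON answer (format is an assumption)."""
    return (
        f"When a user says '{utterance}', consider what emotion you should "
        "recognize and what response you should generate. "
        'Answer as JSON with the keys "emotion" and "response".'
    )


def parse_reply(reply: str) -> tuple[str, str]:
    """Extract the emotion label and response text, falling back if parsing fails."""
    try:
        data = json.loads(reply)
        return str(data["emotion"]), str(data["response"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Treat a malformed reply as plain response text with unknown emotion.
        return "unknown", reply


# Hand-written sample reply standing in for real model output.
sample = '{"emotion": "tired", "response": "Please rest well today."}'
print(parse_reply(sample))
```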

[0206] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0207] Step 1:

[0208] The terminal captures the user's voice input using a microphone. The input is the user's voice, which is received by the terminal as a digital signal.

[0209] Step 2:

[0210] The device uses a speech recognition library (e.g., Google Speech-to-Text API) to convert speech into text data. The input is an audio signal, and the output is text data corresponding to its content. The process involves analyzing the speech pattern and converting the spoken content into text.

[0211] Step 3:

[0212] The terminal sends the converted text data to the server. At this stage, the text data is the input, and the transmission of data to the server is the output. Data is transferred accurately and efficiently to the server using data communication means.

[0213] Step 4:

[0214] The server analyzes the received text data using a natural language processing system (e.g., Google Natural Language API). The input is text data, and the output is the analysis results regarding the meaning and context of the text. The intent of the input is understood through lexical analysis and contextual recognition.

[0215] Step 5:

[0216] The server uses emotion recognition means to identify the user's emotions from the input audio and text. The input consists of text data and audio attributes (e.g., intonation, speed), and the output is the identified emotional state. The server analyzes the voice tone and expression to infer the user's emotions.

[0217] Step 6:

[0218] The server generates an appropriate response based on sentiment analysis. Here, the emotional state is the input, and the generated response text is the output. A generative AI model is used to construct friendly responses that correspond to the user's emotions.

[0219] Step 7:

[0220] The server sends the generated response text to the terminal, where it is converted into speech by a speech synthesis system (e.g., Google Text-to-Speech API). The input is the generated response text, and the output is the synthesized speech. The message is conveyed to the user through natural-sounding speech synthesis.

[0221] Step 8:

[0222] The device transmits the generated audio to the user using its speaker. The input in this step is audio data, and the output is audio output from the speaker. Appropriate acoustic output is provided to enable voice communication with the user.

[0223] Step 9:

[0224] The device utilizes optical or motion devices to provide the user with visual or physical responses. Here, the input is feedback configuration information based on the generated emotion, and the output is the visual or motional effect on the user. Automated actions and displays are employed to present feedback corresponding to specific emotions as naturally as possible.
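
The feedback configuration information in Step 9 could be as simple as a lookup from the identified emotion to a display and motion setting. The table below is a hypothetical sketch: the emotion labels, LED colors, and gesture names are invented for illustration and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    led_color: str  # setting for the optical device
    gesture: str    # setting for the motion device


# Hypothetical mapping from identified emotion to feedback configuration.
FEEDBACK_BY_EMOTION = {
    "tired": Feedback(led_color="warm_white", gesture="slow_nod"),
    "sad": Feedback(led_color="soft_blue", gesture="lean_forward"),
    "happy": Feedback(led_color="yellow", gesture="wave"),
}
DEFAULT_FEEDBACK = Feedback(led_color="white", gesture="idle")


def select_feedback(emotion: str) -> Feedback:
    """Return the feedback setting for an emotion, or a neutral default."""
    return FEEDBACK_BY_EMOTION.get(emotion, DEFAULT_FEEDBACK)


print(select_feedback("tired"))
```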

[0225] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0226] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0227] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0228] [Second Embodiment]

[0229] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0230] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0231] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0232] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0233] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0234] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0235] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0236] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0237] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0238] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0239] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0240] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0241] This invention provides a dialogue system that allows elderly and other users to converse naturally. This system consists of three main components that work together: the user, the terminal (Robot-kun), and the server.

[0242] Users can easily use the system by simply talking to the robot. The voice spoken by the user is received by a microphone built into the device. The device uses speech recognition to convert the voice into text data. The converted text data is then sent to the server.

[0243] The server performs natural language processing based on the received text data. This process helps understand the user's intent and generates appropriate response text. The server also performs sentiment analysis and constructs responses that are empathetic to the user's emotions. Furthermore, by referring to past conversation data, it can adjust responses to ensure smoother interactions in the future.

[0244] The response text generated by the server is sent to the terminal. The terminal converts the response text into speech using a speech synthesis system and plays it to the user through its speaker. At this time, the terminal may also incorporate facial expressions and gestures to provide a more natural conversational experience.

[0245] As a concrete example, consider a scenario where a user tells the robot, "I was lonely today because I didn't get to talk to anyone." The device transcribes this audio into text and sends it to the server. The server performs sentiment analysis and recognizes the emotion of "loneliness." It then generates a response text such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," and sends it to the device. The device then converts this text into speech and conveys it to the user, thereby realizing a dialogue through the system.

[0246] In this way, the present invention plays an important role in realizing communication that users find comfortable and alleviating feelings of loneliness among the elderly.

[0247] The following describes the processing flow.

[0248] Step 1:

[0249] The user speaks to the robot. The device's built-in microphone captures the voice input during this conversation.

[0250] Step 2:

[0251] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for the next processing step.

[0252] Step 3:

[0253] The terminal sends text data to the server. At this time, data cleansing, such as noise reduction, is performed as a preprocessing step if necessary.

[0254] Step 4:

[0255] The server inputs the received text data into a natural language processing system to analyze the user's intent. It also uses a sentiment analysis module to estimate the user's emotions.

[0256] Step 5:

[0257] The server generates appropriate response text based on the analysis results. During this process, it references past conversation history and adjusts the response to reflect the user's preferences.

[0258] Step 6:

[0259] The server sends the generated response text to the terminal. The transmitted data is provided in a format suitable for the speech synthesis process.

[0260] Step 7:

[0261] The terminal converts the received response text into speech data using a speech synthesis system. The converted speech is then transmitted to the user through the speaker.

[0262] Step 8:

[0263] The user receives a voice response from the device and continues the conversation as needed. This process is repeated, and communication with the user takes place.
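
Step 5 refers to past conversation history when generating a response. A minimal sketch of such a history store is given below, assuming an in-memory, per-user buffer of recent turns that is prepended to the prompt. The class name, the turn limit, and the prompt layout are assumptions; a real server would presumably persist this data, for example in the database 24.

```python
from collections import defaultdict, deque


class DialogueHistory:
    """Keeps the most recent turns per user (in memory, for illustration only)."""

    def __init__(self, max_turns: int = 3):
        # Each user gets a bounded buffer; the oldest turn is dropped when full.
        self._turns = defaultdict(lambda: deque(maxlen=max_turns))

    def add_turn(self, user_id: str, user_text: str, response_text: str) -> None:
        self._turns[user_id].append((user_text, response_text))

    def build_prompt(self, user_id: str, new_text: str) -> str:
        """Prefix the new utterance with earlier turns so the model can adapt."""
        lines = []
        for user_text, response_text in self._turns[user_id]:
            lines.append(f"User: {user_text}")
            lines.append(f"System: {response_text}")
        lines.append(f"User: {new_text}")
        return "\n".join(lines)


history = DialogueHistory(max_turns=2)
history.add_turn("user-1", "I like gardening.", "What do you grow?")
history.add_turn("user-1", "Tomatoes.", "They must taste great.")
history.add_turn("user-1", "I was lonely today.", "That must have been lonely.")
print(history.build_prompt("user-1", "Let's talk about my garden."))
```

With `max_turns=2`, the first exchange has already been discarded by the time the prompt is built, which keeps the prompt length bounded at the cost of forgetting older context.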

[0264] (Example 1)

[0265] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0266] In modern society, loneliness and lack of communication experienced by the elderly and those living alone are significant issues affecting their psychological health. Furthermore, existing dialogue systems are inadequate at understanding emotions, making it difficult to achieve natural, empathetic communication similar to that between humans. Moreover, optimizing continuous dialogue based on past conversations is challenging, so the long-term user experience does not improve sufficiently.

[0267] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0268] In this invention, the server includes a speech recognition means that receives voice input and converts the voice into text data; a means that performs sentiment analysis based on the received text data and generates a response corresponding to the text data; and a means that learns past dialogue data with the user and optimizes the response in the next dialogue. This enables natural dialogue that is empathetic to the emotions of each individual user, which can alleviate feelings of loneliness among the elderly and support the building of lasting relationships.

[0269] "Voice input" refers to a data format that users provide to a system via voice.

[0270] "Text data" refers to text-formatted data converted from voice input.

[0271] "Speech recognition means" refers to a technology or program for converting speech input into text data.

[0272] An "information processing device" is a computer system that receives data via a network, and then analyzes and processes it.

[0273] "Natural language processing" is a technology that analyzes human language and generates responses in a way that is natural for humans.

[0274] "Speech synthesis means" refers to a technology or program for outputting text data as speech.

[0275] "Sentiment analysis" is a technology that analyzes and detects a user's emotions from text data.

[0276] "Past dialogue data" refers to records of previous communications between the user and the system.

[0277] "Response optimization" is the process of using past dialogue data to adjust responses in future dialogues to be more appropriate.

[0278] "Object movements and expressions" refer to the actions and facial expressions that physical devices or virtual characters perform during interactions with the user.

[0279] This invention aims to realize a dialogue system that allows elderly people and those living alone to converse naturally. The system consists of three elements: user, terminal, and server.

[0280] The user inputs voice to interact with the system. Specifically, the user starts the system by speaking in the form of everyday conversation through the voice input device built into the terminal. The terminal receives this voice and converts it into text data using a commonly used speech recognition API (for example, a standard voice input API).

[0281] The converted text data is sent from the terminal to the server, and the server performs natural language processing using a generative AI model. The server analyzes the user's intent and uses a sentiment analysis tool (such as a general sentiment analysis engine) to generate an appropriate response. It then refers to the past conversation database to optimize the response based on the user's history. This enables more accurate communication that builds on the previous conversation history.

[0282] The response text generated by the server is sent back to the terminal. The terminal uses a speech synthesis means (such as a general speech synthesis engine) to convert this text into speech and conveys it to the user through a speaker. This allows the user to experience a natural conversation with the robot. Furthermore, the terminal can provide a more emotional and interactive conversation experience through facial expressions and gestures.

[0283] As a specific example, when the user tells the robot, "I was lonely today because I couldn't talk to anyone," the terminal converts this speech into text data and sends it to the server. The server recognizes the emotion "lonely" and generates a response such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," and the terminal converts this into speech and returns it to the user. Through this series of processes, the user's sense of loneliness is alleviated and comfortable communication is realized.

[0284] As an example prompt, inputting "Generate a response for a conversation where the user conveys a feeling of loneliness" allows the generative AI model to produce a response suited to the situation.

[0285] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0286] Step 1:

[0287] The user speaks into the device's voice input device. The device receives the voice signal through the microphone. The input voice data is converted into text data using speech recognition. This conversion process analyzes the voice waveform data, and its contents are output as text data.

[0288] Step 2:

[0289] The terminal sends the text data generated by speech recognition to the server. Communication takes place via a common protocol (e.g., HTTP). Here, the input is the text data, and the output is its transmission to the server.

[0290] Step 3:

[0291] The server processes the received text data and performs natural language processing. A generative AI model is used to analyze the meaning of the text and understand the user's intent. This process involves word semantic analysis and sentence structure analysis, generating new response text. The generated response is further refined using sentiment analysis tools to transform it into emotionally resonant content.

[0292] Step 4:

[0293] The server sends the generated response text to the terminal. Here too, data is transferred via the communication protocol. The input is the adjusted response text, and the output is the transmission to the terminal.

[0294] Step 5:

[0295] The terminal converts the response text received from the server into speech using a speech synthesis system. In this process, a speech waveform is generated based on the text data and output to the user through the speaker. Here, the input is the response text, and the output is synthesized speech data.

[0296] Step 6:

[0297] The device further expresses the response through physical actions and facial expressions. By having the robot interact using facial expressions and gestures, the user is provided with a more natural and engaging experience. In this process, the response and the action instructions are the input, and the physical actions and visual expressions based on them are the output.
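The six steps above can be sketched roughly as follows. This is only an illustrative outline, not the implementation of the disclosure: the function names (`build_request`, `handle_request`, `run_turn`), the JSON field names, and the callables passed in for recognition, generation, synthesis, and gesture selection are all hypothetical, and the HTTP transfer of Steps 2 and 4 is replaced by an in-process call so that the sketch stays self-contained.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class RobotOutput:
    speech: bytes  # synthesized speech data (Step 5)
    gesture: str   # action instruction for the robot body (Step 6)


def build_request(text: str) -> bytes:
    """Step 2: serialize the recognized text as a JSON body (e.g., for an HTTP POST)."""
    return json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")


def handle_request(body: bytes, generate: Callable[[str], str]) -> bytes:
    """Steps 3-4: the server decodes the text, generates a reply, and serializes it."""
    text = json.loads(body.decode("utf-8"))["text"]
    reply = generate(text)
    return json.dumps({"reply": reply}, ensure_ascii=False).encode("utf-8")


def run_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    choose_gesture: Callable[[str], str],
) -> RobotOutput:
    """Run one conversational turn through Steps 1 to 6."""
    text = recognize(audio)                       # Step 1: speech -> text
    request = build_request(text)                 # Step 2: terminal -> server
    response = handle_request(request, generate)  # Steps 3-4: stand-in for the HTTP round trip
    reply = json.loads(response.decode("utf-8"))["reply"]
    # Step 5: text -> speech, Step 6: pick a physical action for the reply
    return RobotOutput(speech=synthesize(reply), gesture=choose_gesture(reply))
```

In an actual system, each callable would be backed by the corresponding speech recognition, generative AI, speech synthesis, and motion control components; passing them as parameters keeps the turn logic independent of any particular service.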

[0298] (Application Example 1)

[0299] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0300] There is a need to provide a natural conversational environment that allows elderly and other users with communication difficulties to engage in dialogue with peace of mind and alleviate feelings of loneliness. Conventional dialogue systems often lack naturalness and comfort in conversation because they do not adequately consider the physiological and psychological state of the user when responding. A system is needed to improve this and realize a dialogue experience that is tailored to the user's needs.

[0301] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0302] In this invention, the server includes a speech recognition means that receives acoustic input and converts it into text data, a means that relays the text data and performs language processing, a speech synthesis means that converts the generated response text into sound and outputs it, and a physiological information analysis means that recognizes the user's physiological information and adjusts the response. This enables appropriate and natural dialogue that is tailored to the user's physiological and psychological characteristics.

[0303] "Acoustic input" refers to the voice acquired from the user and is the input format used when a dialogue system receives voice information.

[0304] "Character data" refers to digital text information converted from acoustic input by speech recognition technology.

[0305] The "speech recognition means" has the function of converting acoustic input into character data, and is a process for analyzing acoustics and generating character expressions.

[0306] "Language processing" is a process for analyzing and understanding character data and generating appropriate responses, and utilizes natural language processing technology.

[0307] The "speech synthesis means" is a technology for converting text data back into acoustics and outputting it to the user.

[0308] "Physiological information" is information indicating the physical state of the user, and includes indicators such as heart rate, complexion, and expression.

[0309] The "physiological information analysis means" is a process equipped with the function of adjusting the content and tone of the response based on the physiological information of the user.

[0310] This invention is designed as part of a conversation system for the elderly, and mainly consists of functions such as acoustic input, character data conversion, language processing, voice generation, and physiological information analysis.

[0311] First, when the user talks to the robot, the microphone built into the terminal receives this voice as acoustic input. The received acoustics are converted into character data by the voice recognition means. At this time, the terminal utilizes the Google Speech-to-Text API to achieve advanced voice recognition.

[0312] Next, the character data is sent to the server, and language processing using Python and spaCy is performed. The server analyzes and understands the character data, and through natural language processing, generates a response that empathizes with the user. In parallel, sentiment analysis is performed using IBM Watson Natural Language Understanding, and a response adapted to the user's sentiment is prepared. In addition, physiological information such as the user's heart rate and expression is acquired, and the response is adjusted accordingly, realizing a personalized conversation.

[0313] The generated response text is sent to the terminal, converted back into sound using a speech synthesis method based on the Google Text-to-Speech API, and output to the user through the speaker. At this time, the terminal provides a more natural conversational experience by incorporating facial expressions and gestures according to the user's physiological state.

[0314] As a concrete example, if a user in a nursing home mutters "I was lonely today" while wearing the glasses, the system immediately transcribes this voice into text and sends it to the server. After appropriately understanding the situation through language processing and physiological information analysis, it can respond with something like, "Thank you for telling me that you feel lonely. Let's think of something that might cheer you up a little."

[0315] Examples of prompts for the generative AI model include questions such as, "How was your day today?" and "Tell me about something fun you've done recently." This ensures that the conversation is always fresh and engaging.

[0316] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0317] Step 1:

[0318] The user speaks to the robot, and the microphone built into the device captures this audio input. The audio input is captured as raw speech data and passed to the Google Speech-to-Text API.

[0319] Step 2:

[0320] The device uses speech recognition to convert the acquired audio input into text data. Specifically, the Google Speech-to-Text API analyzes the audio data and generates a corresponding text representation. The input is audio, and the output is text data.

[0321] Step 3:

[0322] Text data is sent to the server, which uses Python and spaCy to perform natural language processing. The server receives the text data as input, performs grammatical analysis and contextual understanding, and extracts the user's intent. This process prepares the generative AI model to provide appropriate answers and information based on the user's intent.

[0323] Step 4:

[0324] The server uses IBM Watson Natural Language Understanding to perform sentiment analysis based on text data. It identifies emotions such as joy, anger, sadness, and happiness from the text data received as input and designs a response accordingly. The results of the sentiment analysis provide data to understand the user's psychological state.

[0325] Step 5:

[0326] A physiological information analysis system acquires the user's physiological information and provides this information to the server. The server comprehensively analyzes the physiological information and emotion analysis data to generate the optimal response. Specifically, it takes into account the user's heart rate and facial expression patterns to adjust the tone and content of the response.

[0327] Step 6:

[0328] The server sends the generated response text to the device, which then uses the Google Text-to-Speech API to convert the text data into speech. Finally, this audio response is delivered to the user through the device's speaker. The output is the audio that the user hears.

[0329] Step 7:

[0330] The device also displays facial expressions and simple gestures when responding, providing users with a natural and friendly conversational experience. User feedback is recorded on the server so that future responses can be further optimized.
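The adjustment in Step 5, in which the tone of the response is chosen from the sentiment analysis result and the physiological information, might look like the following minimal sketch. The class names, the expression labels, the speaking-rate values, and in particular the 100 bpm heart-rate threshold are assumptions made for illustration; the disclosure does not specify concrete thresholds or rules.

```python
from dataclasses import dataclass


@dataclass
class PhysiologicalInfo:
    heart_rate_bpm: float
    expression: str  # hypothetical labels, e.g. "neutral", "smiling", "tense"


@dataclass
class ResponseStyle:
    speaking_rate: float  # 1.0 = normal speed for speech synthesis
    tone: str


def choose_style(sentiment: str, physio: PhysiologicalInfo) -> ResponseStyle:
    """Combine the sentiment label with physiological information (Step 5)."""
    # Assumed rule: a heart rate above 100 bpm or a tense expression
    # takes priority over the text-based sentiment.
    agitated = physio.heart_rate_bpm > 100 or physio.expression == "tense"
    if agitated:
        return ResponseStyle(speaking_rate=0.85, tone="calming")
    if sentiment == "sadness":
        return ResponseStyle(speaking_rate=0.9, tone="gentle")
    if sentiment == "joy":
        return ResponseStyle(speaking_rate=1.0, tone="cheerful")
    return ResponseStyle(speaking_rate=1.0, tone="neutral")
```

Giving the physiological signal priority over the text sentiment is one possible design choice; a weighted combination of both signals would be an equally plausible reading of the step.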

[0331] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing using the user's emotions.

[0332] This invention relates to a system that includes an emotion engine that analyzes voice input during user interaction to recognize emotions and generates a response based on the results. This system consists of a user, a terminal, a server, and the emotion engine.

[0333] When a user speaks to the robot, their voice is captured by a microphone built into the device. The device converts the voice input into text data using speech recognition and sends this data to the server.

[0334] On the server, the received text data is analyzed by a natural language processing system, and at the same time, an emotion engine is activated to recognize emotions from the audio data. The emotion engine has the ability to estimate the user's emotional state from the intonation, speed, and pauses of the voice. Through this emotion analysis, it is possible to understand what the user is feeling.

[0335] The server takes the results of this emotion recognition into account and generates an appropriate response text. For example, if the user is judged to be sad, it will generate an encouraging response, and if the user is judged to be happy, it will select a more friendly response.

[0336] The generated response text is sent to the terminal and converted into speech data using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. At this time, the terminal responds to the user using gestures and facial expressions, providing a more human-like and pleasant conversational experience.

[0337] As a concrete example, suppose a user says to the robot, "I'm feeling down because I had a bad day." In this case, the device converts this utterance into text and sends it to the server. The emotion engine analyzes the user's tone of voice and content and concludes that the user is "feeling down." The server then generates a comforting response such as, "Some days are like that. I'm sure you'll do better next time."

[0338] In this way, the present invention enables natural dialogue that is attentive to the user's emotions and can fulfill the role of a conversational partner.

[0339] The following describes the processing flow.

[0340] Step 1:

[0341] The user speaks to the robot. During this interaction, the voice is captured by the microphone in the device.

[0342] Step 2:

[0343] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for subsequent processing.

[0344] Step 3:

[0345] The terminal sends text data and voice feature data to the server. The voice feature data includes information on intonation, speed, and pauses.

[0346] Step 4:

[0347] On the server, natural language processing is performed on the text data to analyze user intent. Simultaneously, an emotion engine recognizes the user's emotions from the voice feature data.

[0348] Step 5:

[0349] The server combines the results of natural language processing with the results of emotion recognition by the emotion engine to generate appropriate response text. The generated response will be tailored to the user's emotions.

[0350] Step 6:

[0351] The server sends the generated response text to the terminal. The terminal then prepares to convert this into audio data.

[0352] Step 7:

[0353] The terminal uses speech synthesis to convert the response text into audio data and transmits the response to the user through the speaker.

[0354] Step 8:

[0355] The user receives an audio response from the device, and if they wish to continue the conversation, they speak again, and the process is repeated.
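As a rough illustration of Steps 3 to 5 above, the terminal could bundle the text and the voice feature data (intonation, speed, pauses) into one payload, and the server could derive an emotion label from the features. The sketch below uses simple hand-written rules as a stand-in for the emotion engine; the field names, the numeric thresholds, and the function names are hypothetical, and a real emotion engine would more likely be a trained model such as the emotion identification model 59.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VoiceFeatures:
    pitch_variation: float   # intonation proxy: 0.0 (flat) to 1.0 (lively)
    words_per_minute: float  # speaking speed
    pause_ratio: float       # fraction of the utterance spent in silence


def build_payload(text: str, features: VoiceFeatures) -> str:
    """Step 3: the terminal sends text data together with voice feature data."""
    return json.dumps({"text": text, "voice_features": asdict(features)})


def estimate_emotion(features: VoiceFeatures) -> str:
    """Step 4: rule-based stand-in for the emotion engine (thresholds are assumed)."""
    if features.pitch_variation < 0.3 and features.pause_ratio > 0.4:
        return "sad"
    if features.pitch_variation > 0.6 and features.words_per_minute > 150:
        return "happy"
    return "neutral"


def handle_payload(payload: str) -> dict:
    """Step 5: combine the text with the recognized emotion to pick a response style."""
    data = json.loads(payload)
    emotion = estimate_emotion(VoiceFeatures(**data["voice_features"]))
    # Sad users receive an encouraging response, happy users a friendly one.
    style = {"sad": "encouraging", "happy": "friendly"}.get(emotion, "neutral")
    return {"text": data["text"], "emotion": emotion, "response_style": style}
```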

[0356] (Example 2)

[0357] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0358] While modern interactive systems possess technologies that convert voice input into text for text-based conversations, understanding human emotions and providing flexible responses based on them remains challenging. There is a need to develop systems that can overcome this challenge and enable more natural dialogue.

[0359] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0360] In this invention, the server includes means for receiving audio signals and converting them into text information, means for performing sentiment analysis based on the received text information, and means for learning past dialogue data with the user and optimizing responses in subsequent dialogues. This makes it possible to understand the user's emotions and generate natural and appropriate responses accordingly.

[0361] An "audio signal" is a digital or analog representation of sound waveforms, including human voices.

[0362] "Textual information" refers to data expressed as characters, specifically a format in which audio, images, and other data are replaced with text.

[0363] "Speech recognition means" refers to technologies and devices for converting speech signals into text information, and has the function of analyzing speech data and converting it into an appropriate string of characters.

[0364] An "information processing device" is a device used for inputting, processing, storing, and outputting data, and includes computers and servers.

[0365] "Language data processing" refers to the process of analyzing, understanding, and generating data related to language, and may include natural language processing techniques.

[0366] "Speech synthesis means" refers to technologies and devices that convert text information into speech data that can be reproduced as speech.

[0367] "Emotion analysis" is a technology that infers and determines a speaker's emotional state from their voice or text.

[0368] "User" refers to an individual or organization that operates and uses the system or service.

[0369] "Dialogue data" refers to data that records interactions between a user and a system, including the content and duration of the conversation.

[0370] "Response optimization" refers to the process of generating the optimal response based on past information and circumstances, with the aim of providing the most appropriate and effective response for the user.

[0371] This invention is a system that understands a user's emotions through voice interaction and generates a response tailored to that user. The system consists of a user, a terminal, a server, and an emotion analysis engine. Specifically, it is configured as follows:

[0372] When a user speaks into the microphone built into the device, the device captures the audio signal. Here, we can take smartphones and tablets as examples of general-purpose devices. The captured audio signal is converted into text information using speech recognition software, such as a "speech recognition API."

[0373] The converted text information is transmitted to the server via the internet using a secure protocol. The server uses a natural language processing system, such as a "natural language processing engine," to analyze the received text information as linguistic data. Simultaneously, an emotion analysis engine is used to analyze the user's emotions from the audio signal and generate decision data based on that analysis.

[0374] The information obtained through this analysis is used to generate appropriate response text using a generative AI model. For example, if a user says, "I'm very tired today," the system will generate a response such as, "You must be tired. Please get some rest." Such responses are achieved by using the instruction, "Generate appropriate words of comfort based on the user's input," as the prompt for the generative AI model.

[0375] The generated response text information is sent back to the terminal and converted into speech using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. This allows the system to provide the user with a natural and comfortable conversational experience.

[0376] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0377] Step 1:

[0378] The user speaks to the robot using voice. The input is the user's spoken words, for example, "I'm very tired today." A microphone built into the device captures this voice signal and outputs it as digital audio data.

[0379] Step 2:

[0380] The device launches speech recognition software and converts the input speech data into text. This step uses a speech recognition API to output the spoken content as a string. Specifically, it generates the text "I'm very tired today."

[0381] Step 3:

[0382] The converted text information is sent from the terminal to the server. The transmitted data is text information in text format. The text information reaches the server securely using an internet data transmission protocol.

[0383] Step 4:

[0384] The server analyzes the received text information using a natural language processing engine. The input is text information sent by the user, and the output is the result of language structure analysis. Here, the grammar and meaning of the text information are understood.

[0385] Step 5:

[0386] The server simultaneously uses an emotion analysis engine to analyze the user's emotions from the audio signal. The input is characteristic data of the audio signal (intonation, speed, and pauses), and the output is data indicating the user's emotional state.

[0387] Step 6:

[0388] Based on the results of natural language processing and sentiment analysis, the server uses a generative AI model to generate response text. The model receives the analysis results (text information and language analysis results) as input and outputs an appropriate response, such as "Thank you for your hard work. Please rest well today." Based on the instructions in the prompt, the response will be tailored to the user's emotions.

[0389] Step 7:

[0390] The generated response text information is sent from the server to the terminal. In this step, the response text information is transmitted as text data.

[0391] Step 8:

[0392] The terminal converts the received response text information into speech data using speech synthesis technology. This process takes text information as input and generates a speech signal as output.

[0393] Step 9:

[0394] The voice data is transmitted to the user through the device's speaker. The user is provided with a natural and comfortable voice response that feels like interacting with a human.
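Step 6 above, in which the analysis results are handed to the generative AI model together with an instruction, could be sketched as follows. The instruction string is the prompt example given in this description; the prompt layout, the function names, and the `model` callable (a placeholder for whatever generative AI model is actually used) are assumptions for illustration.

```python
from typing import Callable

# Instruction taken from the prompt example in the description.
INSTRUCTION = "Generate appropriate words of comfort based on the user's input."


def build_prompt(user_text: str, emotion: str) -> str:
    """Combine the instruction, the emotion analysis result, and the user's text."""
    return (
        f"{INSTRUCTION}\n"
        f"Estimated emotion: {emotion}\n"
        f"User input: {user_text}"
    )


def generate_response(user_text: str, emotion: str, model: Callable[[str], str]) -> str:
    """Pass the assembled prompt to the model and return its response text."""
    return model(build_prompt(user_text, emotion))
```

For the utterance "I'm very tired today" with an estimated emotion of "tired", the model would receive the instruction followed by both pieces of analysis, and its output would be forwarded to the terminal in Step 7.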

[0395] (Application Example 2)

[0396] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0397] Conventional speech recognition systems are capable of accurately transcribing speech into text and generating responses, but they have struggled to accurately understand the user's emotions and respond flexibly based on those emotions. Furthermore, when interacting with users in anxious or stressful situations, there is a need to provide appropriate and comforting responses, but only a limited number of systems have been able to deliver satisfactory results in this regard.

[0398] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0399] In this invention, the server includes: speech recognition means for receiving voice input and converting the voice into text data; means for transmitting the text data to a processing unit for natural language processing; speech synthesis means for converting the response text generated by the processing unit into voice and outputting it; emotion recognition means for analyzing the user's emotions and generating an appropriate response based on those emotions; and means for providing visual or physical responses to the user using an optical device or a motion device, while taking into account the generated response. This enables appropriate dialogue and responses in accordance with the user's emotions.

[0400] "Voice input" is a function that receives the user's voice as a digital signal.

[0401] A "speech recognition method that converts speech into text data" is a technology that analyzes input speech and converts its content into text information.

[0402] "Methods for natural language processing" are algorithms that analyze text-based data to understand its meaning and intent.

[0403] "Speech synthesis means" refers to the ability to generate and output speech based on generated text information.

[0404] "Emotion recognition means" refers to processing that identifies the user's emotional state from voice or text.

[0405] "Means of providing visual or physical responses" refers to technologies that use optical devices or motors to provide visual or physical feedback in a manner similar to that of a human.

[0406] This invention provides a system that receives voice input and converts the voice into text data. This system includes voice recognition means, natural language processing means, voice synthesis means, and emotion recognition means.

[0407] The server receives voice input using a microphone and uses a speech recognition library (e.g., Google Speech-to-Text API) to convert it into text data. The converted text data is then parsed by a natural language processing system (e.g., Google Natural Language API). This natural language processing is performed to understand the meaning and intent of the text.

[0408] The server also uses emotion recognition to identify the user's emotional state from their voice or text. Emotion analysis utilizes elements such as intonation, pauses, and speed of speech. Based on this analysis, an appropriate response is generated and output as speech using a speech synthesis system (e.g., Google Text-to-Speech API). The speech is then delivered to the user via a speaker.

[0409] For example, if a user says, "I'm a little tired today," the speech recognition system converts this speech into text, and the server performs sentiment analysis, recognizing that the user is "tired." The server then generates a comforting response such as, "Please rest well today," and returns it to the user as audio.

[0410] By utilizing generative AI models, the server can also provide visual or physical feedback that takes into account the generated response. Optical and motion devices can be used to achieve more human-like responses.

[0411] Example prompt: "When a user says 'I'm a little tired today,' consider what emotion you should recognize and what response you should generate."

[0412] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0413] Step 1:

[0414] The terminal captures the user's voice input using a microphone. The input is the user's voice, which is received by the terminal as a digital signal.

[0415] Step 2:

[0416] The device uses a speech recognition library (e.g., Google Speech-to-Text API) to convert speech into text data. The input is an audio signal, and the output is text data corresponding to its content. The process involves analyzing the speech pattern and converting the spoken content into text.

[0417] Step 3:

[0418] The terminal sends the converted text data to the server. At this stage, the text data is the input, and the transmission of data to the server is the output. Data is transferred accurately and efficiently to the server using data communication means.

[0419] Step 4:

[0420] The server analyzes the received text data using a natural language processing system (e.g., Google Natural Language API). The input is text data, and the output is the analysis results regarding the meaning and context of the text. The intent of the input is understood through lexical analysis and contextual recognition.

[0421] Step 5:

[0422] The server uses emotion recognition means to identify the user's emotions from the input audio and text. The input consists of text data and audio attributes (e.g., intonation, speed), and the output is the identified emotional state. The server analyzes the voice tone and expression to infer the user's emotions.

[0423] Step 6:

[0424] The server generates an appropriate response based on sentiment analysis. Here, the emotional state is the input, and the generated response text is the output. A generative AI model is used to construct friendly responses that correspond to the user's emotions.

[0425] Step 7:

[0426] The server sends the generated response text to the terminal, where it is converted into speech by a speech synthesis system (e.g., Google Text-to-Speech API). The input is the generated response text, and the output is the synthesized speech. The message is conveyed to the user through natural-sounding speech synthesis.

[0427] Step 8:

[0428] The device transmits the generated audio to the user using its speaker. The input in this step is audio data, and the output is audio output from the speaker. Appropriate acoustic output is provided to enable voice communication with the user.

[0429] Step 9:

[0430] The device utilizes optical or motion devices to provide the user with visual or physical responses. Here, the input is feedback configuration information based on the generated emotion, and the output is the visual or motional effect on the user. Automated actions and displays are employed to present feedback corresponding to specific emotions as naturally as possible.
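The "feedback configuration information based on the generated emotion" in Step 9 could, for example, be a lookup from the identified emotion to settings for the optical device and the motion device. The table below is purely hypothetical: the emotion labels, LED colors, and motion names are placeholders chosen for illustration and are not specified in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    led_color: str  # setting for the optical device
    motion: str     # setting for the motion device


# Hypothetical mapping from emotional state to feedback configuration.
FEEDBACK_BY_EMOTION = {
    "tired": Feedback(led_color="warm_white", motion="slow_nod"),
    "sad": Feedback(led_color="soft_blue", motion="lean_forward"),
    "happy": Feedback(led_color="yellow", motion="wave"),
}

# Used when the emotion is unknown or not listed above.
DEFAULT_FEEDBACK = Feedback(led_color="white", motion="idle")


def select_feedback(emotion: str) -> Feedback:
    """Return the feedback configuration for the identified emotion."""
    return FEEDBACK_BY_EMOTION.get(emotion, DEFAULT_FEEDBACK)
```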

[0431] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0432] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0433] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0434] [Third Embodiment]

[0435] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0436] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0437] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0438] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0439] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0440] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0441] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0442] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0443] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0444] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0445] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0446] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0447] This invention provides a dialogue system that allows elderly and other users to converse naturally. This system consists of three main components: the user, the terminal (Robot-kun), and the server, which work together in cooperation.

[0448] Users can easily use the system by simply talking to the robot. The voice spoken by the user is received by a microphone built into the device. The device uses speech recognition to convert the voice into text data. The converted text data is then sent to the server.

[0449] The server performs natural language processing based on the received text data. This process helps understand the user's intent and generates appropriate response text. The server also performs sentiment analysis and constructs responses that are empathetic to the user's emotions. Furthermore, by referring to past conversation data, it can adjust responses to ensure smoother interactions in the future.

[0450] The response text generated by the server is sent to the terminal. The terminal converts the response text into speech using a speech synthesis system and plays it to the user through its speaker. At this time, the terminal may also incorporate facial expressions and gestures to provide a more natural conversational experience.

[0451] As a concrete example, consider a scenario where a user tells the robot, "I was lonely today because I didn't get to talk to anyone." The terminal transcribes this audio into text and sends it to the server. The server performs sentiment analysis and recognizes the emotion of "loneliness." It then generates a response text such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," and sends it to the terminal. The terminal converts this text into speech and conveys it to the user, completing one exchange of the dialogue.

[0452] In this way, the present invention plays an important role in realizing communication that users find comfortable and alleviating feelings of loneliness among the elderly.

[0453] The following describes the processing flow.

[0454] Step 1:

[0455] The user speaks to the robot. The terminal's built-in microphone captures the user's voice.

[0456] Step 2:

[0457] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for the next processing step.

[0458] Step 3:

[0459] The terminal sends text data to the server. At this time, data cleansing, such as noise reduction, is performed as a preprocessing step if necessary.

[0460] Step 4:

[0461] The server inputs the received text data into a natural language processing system to analyze the user's intent. It also uses a sentiment analysis module to estimate the user's emotions.

[0462] Step 5:

[0463] The server generates appropriate response text based on the analysis results. During this process, it references past conversation history and adjusts the response to reflect the user's preferences.

[0464] Step 6:

[0465] The server sends the generated response text to the terminal. The transmitted data is provided in a format suitable for the speech synthesis process.

[0466] Step 7:

[0467] The terminal converts the received response text into speech data using a speech synthesis system. The converted speech is then transmitted to the user through the speaker.

[0468] Step 8:

[0469] The user receives the voice response from the terminal and continues the conversation as needed. Repeating Steps 1 through 8 sustains the dialogue with the user.
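
The eight steps above can be sketched as a single conversational turn. The following Python sketch is illustrative only: the names `recognize_speech`, `estimate_emotion`, `generate_response`, `synthesize_speech`, and `run_turn`, the keyword table, and the canned replies are assumptions standing in for the speech recognition, sentiment analysis, response generation, and speech synthesis components, which the disclosure does not specify in detail.

```python
from dataclasses import dataclass, field

# Hypothetical keyword table standing in for a real sentiment analysis module.
EMOTION_KEYWORDS = {
    "lonely": "loneliness",
    "tired": "fatigue",
    "happy": "joy",
}

# Hypothetical canned replies standing in for a generative AI model.
REPLIES = {
    "loneliness": "That must have been lonely. I'll be your conversation partner.",
    "fatigue": "You must be tired. Please get some rest.",
    "joy": "I'm glad to hear that. Tell me more.",
    "neutral": "I see. Please tell me more.",
}


@dataclass
class Server:
    history: list = field(default_factory=list)  # past (user_text, reply) pairs

    def estimate_emotion(self, text: str) -> str:
        """Step 4: estimate the user's emotion from the text (keyword stub)."""
        lowered = text.lower()
        for keyword, emotion in EMOTION_KEYWORDS.items():
            if keyword in lowered:
                return emotion
        return "neutral"

    def generate_response(self, text: str) -> str:
        """Steps 4-5: analyze the text, pick a reply, and record the turn."""
        reply = REPLIES[self.estimate_emotion(text)]
        self.history.append((text, reply))
        return reply


def recognize_speech(audio: bytes) -> str:
    """Step 2 stub: here the 'audio' is simply UTF-8 encoded text."""
    return audio.decode("utf-8")


def synthesize_speech(text: str) -> bytes:
    """Step 7 stub: a real terminal would produce a waveform instead."""
    return text.encode("utf-8")


def run_turn(server: Server, audio: bytes) -> bytes:
    """One pass through Steps 1-7 for a single user utterance."""
    text = recognize_speech(audio)           # Step 2 (terminal)
    reply = server.generate_response(text)   # Steps 3-6 (server)
    return synthesize_speech(reply)          # Step 7 (terminal)


server = Server()
spoken = run_turn(server, b"I was lonely today because I didn't get to talk to anyone.")
print(spoken.decode("utf-8"))
```

Calling `run_turn` once per utterance corresponds to the repetition described in Step 8; the `history` list is where a fuller implementation might keep the past conversation data referenced in Step 5.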

[0470] (Example 1)

[0471] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0472] In modern society, loneliness and lack of communication experienced by the elderly and those living alone are significant issues affecting their psychological health. Furthermore, existing dialogue systems are inadequate at understanding emotions, making it difficult to achieve natural, empathetic communication similar to that between humans. Moreover, optimizing continuous dialogue based on past conversations is challenging, so the long-term user experience does not improve sufficiently.

[0473] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0474] In this invention, the server includes a speech recognition means that receives voice input and converts the voice into text data; a means that performs sentiment analysis based on the received text data and generates a response corresponding to the text data; and a means that learns from past dialogue data with the user and optimizes the response in the next dialogue. This enables natural dialogue that is empathetic to the emotions of each individual user, which can alleviate feelings of loneliness among the elderly and support the building of lasting relationships.

[0475] "Voice input" refers to a data format that users provide to a system via voice.

[0476] "Text data" refers to text-formatted data converted from voice input.

[0477] "Speech recognition means" refers to a technology or program for converting speech input into text data.

[0478] An "information processing device" is a computer system that receives data via a network, and then analyzes and processes it.

[0479] "Natural language processing" is a technology that analyzes human language and generates responses in a way that is natural for humans.

[0480] "Speech synthesis means" refers to a technology or program for outputting text data as speech.

[0481] "Sentiment analysis" is a technology that analyzes and detects a user's emotions from text data.

[0482] "Past dialogue data" refers to records of previous communications between the user and the system.

[0483] "Response optimization" is the process of using past dialogue data to adjust responses in future dialogues to be more appropriate.

[0484] "Object movements and expressions" refer to the actions and facial expressions that physical devices or virtual characters perform during interactions with the user.

[0485] This invention aims to realize a dialogue system that allows elderly people and those living alone to converse naturally. The system consists of three elements: user, terminal, and server.

[0486] The user inputs voice to interact with the system. Specifically, the user starts the system by speaking in the form of everyday conversation through the voice input device built into the terminal. The terminal receives this voice and converts it into text data using a commonly used speech recognition API (for example, a standard voice input API).

[0487] The converted text data is sent from the terminal to the server, which performs natural language processing using a generative AI model. The server analyzes the user's intent and generates an appropriate response using sentiment analysis tools (such as a general sentiment analysis engine). It then refers to a database of past conversations to optimize the response based on the user's history, so that each response reflects what the user has said in earlier conversations.
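
As a rough illustration of the database of past conversations mentioned above, the following Python sketch keeps a bounded list of turns per user in memory. The class name `DialogueHistory`, its methods, and the stored fields are hypothetical; the disclosure does not describe the storage format, and a real system would presumably use a persistent database.

```python
from collections import Counter, defaultdict, deque


class DialogueHistory:
    """In-memory stand-in for the database of past conversations."""

    def __init__(self, max_turns: int = 50):
        # One bounded queue of turns per user; the oldest turns are dropped first.
        self._turns = defaultdict(lambda: deque(maxlen=max_turns))

    def record(self, user_id: str, user_text: str, emotion: str, reply: str) -> None:
        """Store one completed turn for the given user."""
        self._turns[user_id].append(
            {"user_text": user_text, "emotion": emotion, "reply": reply}
        )

    def recent(self, user_id: str, n: int = 3) -> list:
        """Return up to n most recent turns, oldest first."""
        return list(self._turns[user_id])[-n:]

    def dominant_emotion(self, user_id: str, default: str = "neutral") -> str:
        """Return the emotion recorded most often for this user."""
        counts = Counter(turn["emotion"] for turn in self._turns[user_id])
        if not counts:
            return default
        return counts.most_common(1)[0][0]


history = DialogueHistory()
history.record("user-1", "I was lonely today.", "loneliness", "That must have been lonely.")
history.record("user-1", "Nobody called me.", "loneliness", "I'm here to talk.")
history.record("user-1", "My grandchild visited.", "joy", "How wonderful!")
print(history.dominant_emotion("user-1"))
```

The recent turns and the dominant emotion are two simple signals that could be passed to the response generator when optimizing the next response.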

[0488] The response text generated by the server is sent back to the terminal. The terminal uses speech synthesis technology (such as a common speech synthesis engine) to convert this text information into speech, which is then transmitted to the user through the speaker. This allows the user to experience a natural conversation with the robot. Furthermore, the terminal can provide a more emotionally rich and interactive conversational experience through facial expressions and gestures.

[0489] For example, if a user tells the robot, "I was lonely today because I didn't get to talk to anyone," the device converts this audio into text data and sends it to the server. The server analyzes the emotion of "loneliness" and generates a response such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," which the device then returns to the user as audio. Through this entire process, the system alleviates the user's feelings of loneliness and facilitates comfortable communication.

[0490] As an example of a prompt, by inputting a message in the format of "Generate a dialogue response that conveys the user is feeling lonely," the generative AI model can design a situation-appropriate response.
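
One possible way to assemble such a prompt is sketched below in Python. The function `build_prompt`, the exact wording of the instruction lines, and the layout of the conversation history are assumptions made for illustration; the disclosure only gives the single example prompt quoted above.

```python
def build_prompt(user_text: str, emotion: str, recent_turns=()) -> str:
    """Assemble an instruction prompt for a generative AI model.

    recent_turns is an optional sequence of (user_said, system_said) pairs.
    """
    lines = [
        f"Generate a dialogue response for a user who is feeling {emotion}.",
        "Respond warmly and briefly, in a tone suitable for an elderly user.",
    ]
    if recent_turns:
        lines.append("Recent conversation:")
        for user_said, system_said in recent_turns:
            lines.append(f"- User: {user_said}")
            lines.append(f"- System: {system_said}")
    lines.append(f"User: {user_text}")
    return "\n".join(lines)


prompt = build_prompt(
    "I was lonely today because I didn't get to talk to anyone.",
    "lonely",
)
print(prompt)
```

Keeping the estimated emotion as an explicit parameter lets the same template be reused for other emotions identified by the sentiment analysis.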

[0491] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0492] Step 1:

[0493] The user speaks into the device's voice input device. The device receives the voice signal through the microphone. The input voice data is converted into text data using speech recognition. This conversion process analyzes the voice waveform data, and its contents are output as text data.

[0494] Step 2:

[0495] The terminal sends the character data generated by speech recognition to the server. Communication takes place via a common protocol (e.g., HTTP). In this step, the input is the character data, and the output is its transmission to the server.

[0496] Step 3:

[0497] The server processes the received text data and performs natural language processing. A generative AI model is used to analyze the meaning of the text and understand the user's intent. This process involves word semantic analysis and sentence structure analysis, generating new response text. The generated response is further refined using sentiment analysis tools to transform it into emotionally resonant content.

[0498] Step 4:

[0499] The server sends the generated response text to the terminal. Here too, data is transferred via the communication protocol. The input is the adjusted response text, and the output is the transmission to the terminal.

[0500] Step 5:

[0501] The terminal converts the response text received from the server into speech using a speech synthesis system. In this process, a speech waveform is generated based on the text data and output to the user through the speaker. Here, the input is the response text, and the output is synthesized speech data.

[0502] Step 6:

[0503] The terminal further expresses the response through physical actions and facial expressions, so that the robot's expressions and gestures give the user a more natural and engaging experience. In this process, the response and the action instructions are the input, and the physical actions and visual expressions based on them are the output.
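
The data exchanged in Steps 2 and 4, and the action instructions used in Step 6, could be carried as JSON bodies over HTTP. The Python sketch below shows only the encoding and decoding; the field names (`user_id`, `text`, `action`, `expression`, `gesture`) and the function names are hypothetical, since the disclosure names the protocol but not the message format.

```python
import json


def encode_request(user_id: str, text: str) -> bytes:
    """Terminal side (Step 2): serialize the recognized text as a JSON body."""
    return json.dumps({"user_id": user_id, "text": text}).encode("utf-8")


def decode_request(body: bytes) -> dict:
    """Server side (Step 3): parse and validate the JSON body."""
    payload = json.loads(body.decode("utf-8"))
    if (
        not isinstance(payload, dict)
        or not isinstance(payload.get("text"), str)
        or not payload["text"].strip()
    ):
        raise ValueError("request must contain a non-empty 'text' field")
    return payload


def encode_response(text: str, expression: str = "neutral", gesture: str = "none") -> bytes:
    """Server side (Step 4): response text plus action instructions for Step 6."""
    return json.dumps(
        {"text": text, "action": {"expression": expression, "gesture": gesture}}
    ).encode("utf-8")


body = encode_request("user-1", "I was lonely today.")
request = decode_request(body)
reply = json.loads(encode_response("That must have been lonely.", "gentle_smile", "nod"))
print(request["text"], reply["action"])
```

Validating the request on the server side keeps empty or malformed recognition results from reaching the language processing stage.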

[0504] (Application Example 1)

[0505] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0506] There is a need to provide a natural conversational environment that allows elderly and other users with communication difficulties to engage in dialogue with peace of mind and alleviate feelings of loneliness. Conventional dialogue systems often lack naturalness and comfort in conversation because they do not adequately consider the physiological and psychological state of the user when responding. A system is needed to improve this and realize a dialogue experience that is tailored to the user's needs.

[0507] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0508] In this invention, the server includes a speech recognition means that receives acoustic input and converts it into text data, a means that relays the text data and performs language processing, a speech synthesis means that converts the generated response text into sound and outputs it, and a physiological information analysis means that recognizes the user's physiological information and adjusts the response. This enables appropriate and natural dialogue that is tailored to the user's physiological and psychological characteristics.

[0509] "Audio input" refers to the voice acquired from the user and is the input format used when a dialogue system receives voice information.

[0510] "Character data" refers to digital text information converted from acoustic input by speech recognition technology.

[0511] A "speech recognition system" is a process that has the function of converting acoustic input into text data, and performs acoustic analysis and generates textual representations.

[0512] "Language processing" is the process of analyzing and understanding text data and generating appropriate responses, utilizing natural language processing technologies.

[0513] "Speech synthesis means" refers to technology that converts text data back into sound and outputs it to the user.

[0514] "Physiological information" refers to information that indicates the user's physical condition, including indicators such as heart rate, complexion, and facial expression.

[0515] A "physiological information analysis means" is a process equipped with the function of adjusting the content and tone of responses based on the user's physiological information.

[0516] This invention is designed as part of a conversation system for use by the elderly and mainly consists of functions for acoustic input, text data conversion, language processing, speech generation, and physiological information analysis.

[0517] First, when the user speaks to the robot, the microphone built into the device receives this audio as sound input. The received audio is then converted into text data by a speech recognition system. In this process, the device utilizes the Google Speech-to-Text API to achieve advanced speech recognition.

[0518] Next, the text data is sent to the server, where language processing is performed using Python and spaCy. The server analyzes and understands the text data, and through natural language processing, generates a response that is tailored to the user. Simultaneously, sentiment analysis is performed using IBM Watson Natural Language Understanding, and a response adapted to the user's emotions is prepared. In addition, physiological information such as the user's heart rate and facial expressions is acquired and the response is adjusted to achieve truly personalized dialogue.

[0519] The generated response text is sent to the terminal, converted back into sound using a speech synthesis method based on the Google Text-to-Speech API, and output to the user through the speaker. At this time, the terminal provides a more natural conversational experience by incorporating facial expressions and gestures according to the user's physiological state.

[0520] As a concrete example, if a user in a nursing home mutters "I was lonely today" through their glasses, the system immediately transcribes this voice into text and sends it to the server. After appropriately understanding the situation through language processing and physiological information analysis, the server can respond with something like, "Thank you for telling me that you're feeling lonely. Let me think of something that might lift your spirits a little."

[0521] Examples of prompts for the generative AI model include questions such as, "How was your day today?" and "Tell me about something fun you've done recently." Such prompts help keep the conversation fresh and engaging.

[0522] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0523] Step 1:

[0524] The user speaks to the robot, and the microphone built into the device captures this audio input. The audio input is captured as raw speech data and passed to the Google Speech-to-Text API.

[0525] Step 2:

[0526] The device uses speech recognition to convert the acquired audio input into text data. Specifically, the Google Speech-to-Text API analyzes the audio data and generates a corresponding text representation. The input is audio, and the output is text data.

[0527] Step 3:

[0528] Text data is sent to the server, which uses Python and spaCy to perform natural language processing. The server receives the text data as input, performs grammatical analysis and contextual understanding, and extracts the user's intent. This process prepares the generative AI model to provide appropriate answers and information based on the user's intent.

[0529] Step 4:

[0530] The server uses IBM Watson Natural Language Understanding to perform sentiment analysis based on text data. It identifies emotions such as joy, anger, sadness, and happiness from the text data received as input and designs a response accordingly. The results of the sentiment analysis provide data to understand the user's psychological state.

[0531] Step 5:

[0532] A physiological information analysis system acquires the user's physiological information and provides this information to the server. The server comprehensively analyzes the physiological information and emotion analysis data to generate the optimal response. Specifically, it takes into account the user's heart rate and facial expression patterns to adjust the tone and content of the response.

[0533] Step 6:

[0534] The server sends the generated response text to the device, which then uses the Google Text-to-Speech API to convert the text data into speech. Finally, this audio response is delivered to the user through the device's speaker. The output is the audio that the user hears.

[0535] Step 7:

[0536] The terminal also displays facial expressions and simple gestures when responding, providing users with a natural and friendly conversational experience. User feedback is recorded on the server so that future responses can be further optimized.
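
The adjustment of tone and content described in Step 5 can be illustrated with a small rule-based sketch in Python. Everything specific here is an assumption: the `PhysiologicalInfo` fields, the 100 bpm threshold, the tone labels, and the opener phrases are hypothetical, and the disclosure does not state how physiological information and emotion analysis results are combined.

```python
from dataclasses import dataclass


@dataclass
class PhysiologicalInfo:
    heart_rate_bpm: float
    facial_expression: str  # e.g. "smiling", "frowning", "neutral"


def choose_tone(emotion: str, physio: PhysiologicalInfo) -> str:
    """Pick a response tone from the text emotion and physiological signals.

    The 100 bpm threshold and the tone labels are illustrative assumptions.
    """
    elevated = physio.heart_rate_bpm >= 100
    if elevated or physio.facial_expression == "frowning":
        # Signs of stress take priority over the emotion found in the text.
        return "calm_and_slow"
    if emotion in ("sadness", "loneliness"):
        return "warm_and_encouraging"
    if emotion == "joy" or physio.facial_expression == "smiling":
        return "cheerful"
    return "neutral"


def adjust_response(text: str, tone: str) -> str:
    """Prefix the response with a tone-appropriate opener (hypothetical wording)."""
    openers = {
        "calm_and_slow": "Let's take a slow breath together. ",
        "warm_and_encouraging": "Thank you for telling me. ",
        "cheerful": "That's wonderful! ",
        "neutral": "",
    }
    return openers[tone] + text


tone = choose_tone("sadness", PhysiologicalInfo(72, "neutral"))
print(adjust_response("I'm here to talk whenever you like.", tone))
```

In this sketch the physiological signals override the text-based emotion when they suggest stress, which is one plausible reading of analyzing the two sources comprehensively.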

[0537] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0538] This invention relates to a system that includes an emotion engine that analyzes voice input during user interaction to recognize emotions and generates a response based on the results. This system consists of a user, a terminal, a server, and the emotion engine.

[0539] When a user speaks to the robot, their voice is captured by a microphone built into the terminal. The terminal converts the voice input into text data using speech recognition and sends this text data, together with voice feature data, to the server.

[0540] On the server, the received text data is analyzed by a natural language processing system, and at the same time, an emotion engine is activated to recognize emotions from the audio data. The emotion engine has the ability to estimate the user's emotional state from the intonation, speed, and pauses of the voice. Through this emotion analysis, it is possible to understand what the user is feeling.

[0541] The server takes the results of this emotion recognition into account and generates an appropriate response text. For example, if the user is judged to be sad, it will generate an encouraging response, and if the user is judged to be happy, it will select a more friendly response.

[0542] The generated response text is sent to the terminal and converted into speech data using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. At this time, the terminal responds to the user using gestures and facial expressions, providing a more human-like and pleasant conversational experience.

[0543] As a concrete example, suppose a user says to the robot, "I'm feeling down because I had a bad day." In this case, the device converts this utterance into text and sends it to the server. The emotion engine analyzes the user's tone of voice and content and concludes that the user is "feeling down." The server then generates a comforting response such as, "Some days are like that. I'm sure you'll do better next time."

[0544] In this way, the present invention enables natural dialogue that is attentive to the user's emotions and can fulfill the role of a conversational partner.

[0545] The following describes the processing flow.

[0546] Step 1:

[0547] The user speaks to the robot. During this interaction, the voice is captured by the microphone in the device.

[0548] Step 2:

[0549] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for subsequent processing.

[0550] Step 3:

[0551] The terminal sends text data and voice feature data to the server. The voice feature data includes information on intonation, speed, and pauses.

[0552] Step 4:

[0553] On the server, natural language processing is performed on the text data to analyze user intent. Simultaneously, an emotion engine recognizes the user's emotions from the voice feature data.

[0554] Step 5:

[0555] The server combines the results of natural language processing with the results of emotion recognition by the emotion engine to generate appropriate response text. The generated response will be tailored to the user's emotions.

[0556] Step 6:

[0557] The server sends the generated response text to the terminal. The terminal then prepares to convert this into audio data.

[0558] Step 7:

[0559] The terminal uses speech synthesis to convert the response text into audio data and transmits the response to the user through the speaker.

[0560] Step 8:

[0561] The user receives an audio response from the device, and if they wish to continue the conversation, they speak again, and the process is repeated.
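
The voice feature data sent in Step 3 and used by the emotion engine in Step 4 can be illustrated with the Python sketch below, which derives speaking speed and pauses from word timings. This is a simplified assumption: it presumes the speech recognizer returns a start and end time for each word, it omits intonation (which would require pitch analysis of the waveform), and the thresholds in `estimate_emotion` are hypothetical rather than taken from the disclosure.

```python
def voice_features(words):
    """Compute simple prosodic features from word timings.

    `words` is a list of (word, start_sec, end_sec) tuples in spoken order.
    """
    if not words:
        return {"words_per_sec": 0.0, "pause_ratio": 0.0, "longest_pause_sec": 0.0}
    total = words[-1][2] - words[0][1]
    # A pause is the silent gap between the end of one word and the start of the next.
    pauses = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(words, words[1:])]
    return {
        "words_per_sec": len(words) / total if total > 0 else 0.0,
        "pause_ratio": sum(pauses) / total if total > 0 else 0.0,
        "longest_pause_sec": max(pauses, default=0.0),
    }


def estimate_emotion(features):
    """Very rough rule-based stand-in for the emotion engine."""
    if features["words_per_sec"] < 1.5 and features["pause_ratio"] > 0.3:
        return "feeling down"
    if features["words_per_sec"] > 3.0:
        return "excited"
    return "neutral"


timings = [("I'm", 0.0, 0.5), ("feeling", 1.0, 1.5), ("down", 2.5, 3.0), ("today", 3.5, 4.0)]
features = voice_features(timings)
print(features, estimate_emotion(features))
```

Slow speech with a large share of silence is mapped to "feeling down" here purely for illustration; a trained model such as the emotion identification model 59 would replace these rules.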

[0562] (Example 2)

[0563] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0564] While modern interactive systems possess technologies that convert voice input into text for text-based conversations, understanding human emotions and providing flexible responses based on them remains challenging. There is a need to develop systems that can overcome this challenge and enable more natural dialogue.

[0565] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0566] In this invention, the server includes means for receiving audio signals and converting them into text information, means for performing sentiment analysis based on the received text information, and means for learning past dialogue data with the user and optimizing responses in subsequent dialogues. This makes it possible to understand the user's emotions and generate natural and appropriate responses accordingly.

[0567] An "audio signal" is a digital or analog representation of sound waveforms, including human voices.

[0568] "Textual information" refers to data expressed as characters, specifically a format in which audio, images, and other data are replaced with text.

[0569] "Speech recognition means" refers to technologies and devices for converting speech signals into text information, and has the function of analyzing speech data and converting it into an appropriate string of characters.

[0570] An "information processing device" is a device used for inputting, processing, storing, and outputting data, and includes computers and servers.

[0571] "Language data processing" refers to the process of analyzing, understanding, and generating data related to language, and may include natural language processing techniques.

[0572] "Speech synthesis means" refers to technologies and devices that convert text information into speech data that can be reproduced as speech.

[0573] "Emotion analysis" is a technology that infers and determines a speaker's emotional state from their voice or text.

[0574] "User" refers to an individual or organization that operates and uses the system or service.

[0575] "Dialogue data" refers to data that records interactions between a user and a system, including the content and duration of the conversation.

[0576] "Response optimization" refers to the process of generating the optimal response based on past information and circumstances, with the aim of providing the most appropriate and effective response for the user.

[0577] This invention is a system that understands a user's emotions through voice interaction and generates a response tailored to that user. The system consists of a user, a terminal, a server, and an emotion analysis engine. Specifically, it is configured as follows:

[0578] When a user speaks into the microphone built into the device, the device captures the audio signal. Here, we can take smartphones and tablets as examples of general-purpose devices. The captured audio signal is converted into text information using speech recognition software, such as a "speech recognition API."

[0579] The converted text information is transmitted to the server via the internet using a secure protocol. The server uses a natural language processing system, such as a "natural language processing engine," to analyze the received text information as linguistic data. Simultaneously, an emotion analysis engine is used to analyze the user's emotions from the audio signal and generate decision data based on that analysis.

[0580] The information obtained through this analysis is used to generate appropriate response text using a generative AI model. For example, if a user says, "I'm very tired today," the system will generate a response such as, "You must be tired. Please get some rest." Such responses are achieved by using the instruction, "Generate appropriate words of comfort based on the user's input," as the prompt for the generative AI model.

[0581] The generated response text information is sent back to the terminal and converted into speech using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. This allows the system to provide the user with a natural and comfortable conversational experience.

[0582] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0583] Step 1:

[0584] The user speaks to the robot. The input is the user's spoken words, for example, "I'm very tired today." A microphone built into the terminal captures this voice signal and outputs it as digital audio data.

[0585] Step 2:

[0586] The device launches speech recognition software and converts the input speech data into text. This step uses a speech recognition API to output the spoken content as a string. Specifically, it generates the text "I'm very tired today."

[0587] Step 3:

[0588] The converted character information is sent from the terminal to the server. The transmitted data is character information in text format. The character information reaches the server securely using an internet data transmission protocol.

[0589] Step 4:

[0590] The server analyzes the received text information using a natural language processing engine. The input is text information sent by the user, and the output is the result of language structure analysis. Here, the grammar and meaning of the text information are understood.

[0591] Step 5:

[0592] The server simultaneously uses an emotion analysis engine to analyze the user's emotions from the audio signal. The input is characteristic data of the audio signal (intonation, speed, and pauses), and the output is data indicating the user's emotional state.

[0593] Step 6:

[0594] Based on the results of natural language processing and emotion analysis, the server uses a generative AI model to generate response text. The model receives the analysis results (the text information, the language analysis results, and the emotion analysis results) as input and outputs an appropriate response, such as "Thank you for your hard work. Please rest well today." Following the instructions in the prompt, the response is tailored to the user's emotions.

[0595] Step 7:

[0596] The generated response character information is sent from the server to the terminal. In this step, the response character information is transmitted as text data.

[0597] Step 8:

[0598] The terminal converts the received response text information into speech data using speech synthesis technology. This process takes text information as input and generates a speech signal as output.

[0599] Step 9:

[0600] The voice data is transmitted to the user through the device's speaker. The user is provided with a natural and comfortable voice response that feels like interacting with a human.
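
Steps 4 and 5 above are described as running simultaneously. A minimal Python sketch of that arrangement, using a thread pool from the standard library, is shown below. The functions `analyze_language` and `analyze_emotion` are hypothetical stubs for the natural language processing engine and the emotion analysis engine, and the `speed` feature and its 0.8 threshold are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_language(text: str) -> dict:
    """Step 4 stub: a real system would call a natural language processing engine."""
    return {"tokens": text.rstrip(".").split(), "is_question": text.strip().endswith("?")}


def analyze_emotion(features: dict) -> str:
    """Step 5 stub: a real system would call an emotion analysis engine."""
    return "tired" if features.get("speed", 1.0) < 0.8 else "neutral"


def analyze_utterance(text: str, features: dict) -> dict:
    """Run Steps 4 and 5 concurrently and merge their results for Step 6."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        language_future = pool.submit(analyze_language, text)
        emotion_future = pool.submit(analyze_emotion, features)
        return {
            "text": text,
            "language": language_future.result(),
            "emotion": emotion_future.result(),
        }


result = analyze_utterance("I'm very tired today.", {"speed": 0.6, "pauses": 3})
print(result["emotion"])
```

Threads are a reasonable fit when both engines are remote services and most of the time is spent waiting on the network; the merged dictionary corresponds to the input given to the generative AI model in Step 6.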

[0601] (Application Example 2)

[0602] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0603] Conventional speech recognition systems are capable of accurately transcribing speech into text and generating responses, but they have struggled to accurately understand the user's emotions and respond flexibly based on those emotions. Furthermore, when interacting with users in anxious or stressful situations, there is a need to provide appropriate and comforting responses, but only a limited number of systems have been able to deliver satisfactory results in this regard.

[0604] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0605] In this invention, the server includes: speech recognition means for receiving voice input and converting the voice into text data; means for transmitting the text data to a processing unit for natural language processing; speech synthesis means for converting the response text generated by the processing unit into voice and outputting it; emotion recognition means for analyzing the user's emotions and generating an appropriate response based on those emotions; and means for providing visual or physical responses to the user using an optical device or a motion device, while taking into account the generated response. This enables appropriate dialogue and responses in accordance with the user's emotions.

[0606] "Voice input" is a function that receives the user's voice as a digital signal.

[0607] A "speech recognition method that converts speech into text data" is a technology that analyzes input speech and converts its content into text information.

[0608] "Methods for natural language processing" are algorithms that analyze text-based data to understand its meaning and intent.

[0609] "Speech synthesis means" refers to the ability to generate and output speech based on generated text information.

[0610] "Emotion recognition means" refers to processing that identifies the user's emotional state from voice or text.

[0611] "Means of providing visual or physical responses" refers to technologies that use optical devices or motors to provide visual or physical feedback in a manner similar to that of a human.

[0612] This invention provides a system that receives voice input and converts the voice into text data. This system includes speech recognition means, natural language processing means, speech synthesis means, and emotion recognition means.

[0613] The server receives voice input using a microphone and uses a speech recognition library (e.g., Google Speech-to-Text API) to convert it into text data. The converted text data is then parsed by a natural language processing system (e.g., Google Natural Language API). This natural language processing is performed to understand the meaning and intent of the text.

[0614] The server also uses emotion recognition to identify the user's emotional state from their voice or text. Emotion analysis utilizes elements such as intonation, pauses, and speed of speech. Based on this analysis, an appropriate response is generated and output as speech using a speech synthesis system (e.g., Google Text-to-Speech API). The speech is then delivered to the user via a speaker.

[0615] For example, if a user says, "I'm a little tired today," the speech recognition system converts this speech into text, and the server performs sentiment analysis, recognizing that the user is "tired." The server then generates a comforting response such as, "Please rest well today," and returns it to the user as audio.

[0616] By utilizing generative AI models, the server can also provide visual or physical feedback that takes into account the generated response. Optical and motion devices can be used to achieve more human-like responses.
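
As an illustration of how a recognized emotion might be turned into visual or physical feedback, the following Python sketch maps each emotion to one command for an optical device and one for a motion device. The table contents, the command names, and the `Feedback` type are hypothetical; the disclosure does not specify which devices or commands are used.

```python
from typing import NamedTuple


class Feedback(NamedTuple):
    led_color: str   # command for an optical device (hypothetical)
    gesture: str     # command for a motion device (hypothetical)


# Hypothetical mapping from recognized emotion to feedback commands.
FEEDBACK_TABLE = {
    "tired": Feedback("soft_orange", "slow_nod"),
    "sad": Feedback("soft_blue", "lean_forward"),
    "happy": Feedback("bright_yellow", "raise_arms"),
}
DEFAULT_FEEDBACK = Feedback("white", "idle")


def select_feedback(emotion: str) -> Feedback:
    """Return the optical and motion commands for a recognized emotion."""
    return FEEDBACK_TABLE.get(emotion, DEFAULT_FEEDBACK)


feedback = select_feedback("tired")
print(feedback.led_color, feedback.gesture)
```

A lookup table with a default entry keeps the terminal's behavior defined even when the emotion recognition means returns a label that has no dedicated feedback.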

[0617] Example prompt: "When a user says 'I'm a little tired today,' consider what emotion you should recognize and what response you should generate."

[0618] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0619] Step 1:

[0620] The terminal captures the user's voice input using a microphone. The input is the user's voice, which is received by the terminal as a digital signal.

[0621] Step 2:

[0622] The device uses a speech recognition library (e.g., Google Speech-to-Text API) to convert speech into text data. The input is an audio signal, and the output is text data corresponding to its content. The process involves analyzing the speech pattern and converting the spoken content into text.

[0623] Step 3:

[0624] The terminal sends the converted text data to the server. At this stage, the text data is the input, and the transmission of data to the server is the output. Data is transferred accurately and efficiently to the server using data communication means.

[0625] Step 4:

[0626] The server analyzes the received text data using a natural language processing system (e.g., Google Natural Language API). The input is text data, and the output is the analysis results regarding the meaning and context of the text. The intent of the input is understood through lexical analysis and contextual recognition.

[0627] Step 5:

[0628] The server uses emotion recognition means to identify the user's emotions from the input audio and text. The input consists of text data and audio attributes (e.g., intonation, speed), and the output is the identified emotional state. The server analyzes the voice tone and expression to infer the user's emotions.

[0629] Step 6:

[0630] The server generates an appropriate response based on sentiment analysis. Here, the emotional state is the input, and the generated response text is the output. A generative AI model is used to construct friendly responses that correspond to the user's emotions.

[0631] Step 7:

[0632] The server sends the generated response text to the terminal, where it is converted into speech by a speech synthesis system (e.g., Google Text-to-Speech API). The input is the generated response text, and the output is the synthesized speech. The message is conveyed to the user through natural-sounding speech synthesis.

[0633] Step 8:

[0634] The device transmits the generated audio to the user using its speaker. The input in this step is audio data, and the output is audio output from the speaker. Appropriate acoustic output is provided to enable voice communication with the user.

[0635] Step 9:

[0636] The device utilizes optical or motion devices to provide the user with visual or physical responses. Here, the input is feedback configuration information based on the generated emotion, and the output is the visual or motional effect on the user. Automated actions and displays are employed to present feedback corresponding to specific emotions as naturally as possible.
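The nine steps above can be sketched as two cooperating functions, one for the terminal and one for the server. This is only an illustrative outline under stated assumptions: every name (`ServerReply`, `server_handle`, `terminal_turn`) is hypothetical, and the speech recognition, language processing, emotion recognition, speech synthesis, and actuator stages are passed in as callables rather than bound to any particular API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServerReply:
    response_text: str  # output of Step 6
    emotion: str        # output of Step 5, reused for the Step 9 feedback


def server_handle(
    text: str,
    audio_attrs: Dict[str, float],
    analyze: Callable[[str], str],
    recognize_emotion: Callable[[str, Dict[str, float]], str],
    generate: Callable[[str, str], str],
) -> ServerReply:
    intent = analyze(text)                          # Step 4: meaning and intent
    emotion = recognize_emotion(text, audio_attrs)  # Step 5: emotional state
    return ServerReply(generate(intent, emotion), emotion)  # Step 6


def terminal_turn(
    audio: bytes,
    audio_attrs: Dict[str, float],
    recognize: Callable[[bytes], str],
    send_to_server: Callable[[str, Dict[str, float]], ServerReply],
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
    actuate: Callable[[str], None],
) -> ServerReply:
    # Step 1 is assumed done: `audio` is the captured digital signal.
    text = recognize(audio)                    # Step 2: speech to text
    reply = send_to_server(text, audio_attrs)  # Step 3, and the reply of Step 7
    play(synthesize(reply.response_text))      # Steps 7-8: synthesize and play
    actuate(reply.emotion)                     # Step 9: visual/physical feedback
    return reply
```

Injecting each stage as a callable keeps the sketch independent of the concrete services named in the text, so any of them can be swapped without changing the control flow.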

[0637] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0638] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0639] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0640] [Fourth Embodiment]

[0641] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0642] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0643] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0644] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0645] The microphone 238 accepts instructions and the like from the user 20 by receiving the voice uttered by the user 20. The microphone 238 captures the voice uttered by the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0646] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0647] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0648] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
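As a rough illustration of how an emotion could be mapped to the eye LEDs and limb motors of the controlled object 443, the following Python sketch uses a lookup table. The emotion labels, LED colors, blink rates, and gesture names are all hypothetical values chosen for illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    led_color: str       # illumination state of the eye LEDs (assumed values)
    led_blink_hz: float  # 0.0 means steadily lit
    gesture: str         # named motor routine for arms, hands, and feet


# Hypothetical table: one expression per emotion label.
EXPRESSIONS = {
    "joy": Expression("yellow", 0.0, "raise_arms"),
    "sadness": Expression("blue", 0.5, "lower_head"),
    "surprise": Expression("white", 2.0, "step_back"),
}
NEUTRAL = Expression("white", 0.0, "idle")


def expression_for(emotion: str) -> Expression:
    """Return the LED and motor settings for an emotion, or a neutral default."""
    return EXPRESSIONS.get(emotion.strip().lower(), NEUTRAL)
```

A table-driven mapping keeps the hardware commands separate from the emotion logic, and unknown labels fall back to a neutral posture instead of raising an error.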

[0649] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0650] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0651] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0652] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0653] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0654] This invention provides a dialogue system that allows elderly and other users to converse naturally. This system consists of three main components that work together: the user, the terminal (Robot-kun), and the server.

[0655] Users can easily use the system by simply talking to the robot. The voice spoken by the user is received by a microphone built into the device. The device uses speech recognition to convert the voice into text data. The converted text data is then sent to the server.

[0656] The server performs natural language processing based on the received text data. This process helps understand the user's intent and generates appropriate response text. The server also performs sentiment analysis and constructs responses that are empathetic to the user's emotions. Furthermore, by referring to past conversation data, it can adjust responses to ensure smoother interactions in the future.

[0657] The response text generated by the server is sent to the terminal. The terminal converts the response text into speech using a speech synthesis system and plays it to the user through its speaker. At this time, the terminal may also incorporate facial expressions and gestures to provide a more natural conversational experience.

[0658] As a concrete example, consider a scenario where a user tells the robot, "I was lonely today because I didn't get to talk to anyone." The terminal transcribes this audio into text and sends it to the server. The server performs sentiment analysis and recognizes the emotion of "loneliness." It then generates a response text such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," and sends it to the terminal. The terminal then converts this text back into speech and conveys it to the user, completing one exchange of the dialogue.

[0659] In this way, the present invention plays an important role in realizing communication that users find comfortable and alleviating feelings of loneliness among the elderly.

[0660] The following describes the processing flow.

[0661] Step 1:

[0662] The user speaks to the robot. The device's built-in microphone captures the voice input during this conversation.

[0663] Step 2:

[0664] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for the next processing step.

[0665] Step 3:

[0666] The terminal sends text data to the server. At this time, data cleansing, such as noise reduction, is performed as a preprocessing step if necessary.

[0667] Step 4:

[0668] The server inputs the received text data into a natural language processing system to analyze the user's intent. It also uses a sentiment analysis module to estimate the user's emotions.

[0669] Step 5:

[0670] The server generates appropriate response text based on the analysis results. During this process, it references past conversation history and adjusts the response to reflect the user's preferences.

[0671] Step 6:

[0672] The server sends the generated response text to the terminal. The transmitted data is provided in a format suitable for the speech synthesis process.

[0673] Step 7:

[0674] The terminal converts the received response text into speech data using a speech synthesis system. The converted speech is then transmitted to the user through the speaker.

[0675] Step 8:

[0676] The user receives a voice response from the device and continues the conversation as needed. This process is repeated, and communication with the user takes place.
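The repetition described in Step 8, together with the reference to past conversation history in Step 5, can be sketched as a loop that accumulates turns and passes them to the response generator. This is a minimal sketch under assumptions: `run_dialogue`, the `respond` callable, and the `max_history` bound are hypothetical and stand in for the server-side processing.

```python
from typing import Callable, Iterable, List, Tuple

Turn = Tuple[str, str]  # (user utterance, system response)


def run_dialogue(
    utterances: Iterable[str],
    respond: Callable[[str, List[Turn]], str],
    max_history: int = 20,
) -> List[Turn]:
    """Repeat the turn cycle, giving each response access to earlier turns."""
    if max_history < 1:
        raise ValueError("max_history must be at least 1")
    history: List[Turn] = []
    for utterance in utterances:
        # The responder sees the new text plus a copy of the earlier turns.
        reply = respond(utterance, list(history))
        history.append((utterance, reply))
        # Keep only the most recent turns so the context stays bounded.
        del history[:-max_history]
    return history
```

Bounding the retained history is a design choice of this sketch, not of the disclosure; it simply prevents the context passed to the generator from growing without limit.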

[0677] (Example 1)

[0678] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0679] In modern society, loneliness and lack of communication experienced by the elderly and those living alone are significant issues affecting their psychological health. Furthermore, existing dialogue systems are inadequate at understanding emotions, making it difficult to achieve natural, empathetic communication similar to that between humans. Moreover, optimizing continuous dialogue based on past conversations is challenging, so the long-term user experience does not improve sufficiently.

[0680] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0681] In this invention, the server includes a speech recognition means that receives voice input and converts the voice into text data; a means that performs sentiment analysis based on the received text data and generates a response corresponding to the text data; and a means that learns past dialogue data with the user and optimizes the response in the next dialogue. This enables natural dialogue that is empathetic to the emotions of each individual user, which can alleviate feelings of loneliness among the elderly and support the building of lasting relationships.

[0682] "Voice input" refers to a data format that users provide to a system via voice.

[0683] "Text data" refers to text-formatted data converted from voice input.

[0684] "Speech recognition means" refers to a technology or program for converting speech input into text data.

[0685] An "information processing device" is a computer system that receives data via a network, and then analyzes and processes it.

[0686] "Natural language processing" is a technology that analyzes human language and generates responses in a way that is natural for humans.

[0687] "Speech synthesis means" refers to a technology or program for outputting text data as speech.

[0688] "Sentiment analysis" is a technology that analyzes and detects a user's emotions from text data.

[0689] "Past dialogue data" refers to records of previous communications between the user and the system.

[0690] "Response optimization" is the process of using past dialogue data to adjust responses in future dialogues to be more appropriate.

[0691] "Object movements and expressions" refer to the actions and facial expressions that physical devices or virtual characters perform during interactions with the user.

[0692] This invention aims to realize a dialogue system that allows elderly people and those living alone to converse naturally. The system consists of three elements: user, terminal, and server.

[0693] The user inputs voice to interact with the system. Specifically, the user starts the system by speaking in the form of everyday conversation through the voice input device built into the terminal. The terminal receives this voice and converts it into text data using a commonly used speech recognition API (for example, a standard voice input API).

[0694] The converted text data is sent from the terminal to the server, where natural language processing is performed using a generative AI model. The server analyzes the user's intent and, using sentiment analysis tools (such as a general sentiment analysis engine), generates an appropriate response. It then refers to a database of past conversations to optimize the response based on the user's history. This enables more accurate communication that reflects previous conversations.

[0695] The response text generated by the server is sent back to the terminal. The terminal uses speech synthesis technology (such as a common speech synthesis engine) to convert this text information into speech, which is then transmitted to the user through the speaker. This allows the user to experience a natural conversation with the robot. Furthermore, the terminal can provide a more emotionally rich and interactive conversational experience through facial expressions and gestures.

[0696] For example, if a user tells the robot, "I was lonely today because I didn't get to talk to anyone," the device converts this audio into text data and sends it to the server. The server analyzes the emotion of "loneliness" and generates a response such as, "That must have been lonely. I'll be your conversation partner, so please feel free to talk about anything," which the device then returns to the user as audio. Through this entire process, the system alleviates the user's feelings of loneliness and facilitates comfortable communication.

[0697] As an example of a prompt, inputting a message such as "Generate a dialogue response that conveys the user is feeling lonely" allows the generative AI model to produce a situation-appropriate response.

[0698] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0699] Step 1:

[0700] The user speaks into the device's voice input device. The device receives the voice signal through the microphone. The input voice data is converted into text data using speech recognition. This conversion process analyzes the voice waveform data, and its contents are output as text data.

[0701] Step 2:

[0702] The terminal sends the text data generated by speech recognition to the server. Communication takes place via a common protocol (e.g., HTTP). Here, the text data is the input, and its transmission to the server is the output.

[0703] Step 3:

[0704] The server processes the received text data and performs natural language processing. A generative AI model is used to analyze the meaning of the text and understand the user's intent. This process involves word semantic analysis and sentence structure analysis, generating new response text. The generated response is further refined using sentiment analysis tools to transform it into emotionally resonant content.

[0705] Step 4:

[0706] The server sends the generated response text to the terminal. Here too, data is transferred via the communication protocol. The input is the adjusted response text, and the output is the transmission to the terminal.

[0707] Step 5:

[0708] The terminal converts the response text received from the server into speech using a speech synthesis system. In this process, a speech waveform is generated based on the text data and output to the user through the speaker. Here, the input is the response text, and the output is synthesized speech data.

[0709] Step 6:

[0710] The terminal further expresses the response through physical actions and facial expressions, so that the robot interacts with the user using facial expressions and gestures and provides a more natural and engaging experience. Here, the generated response and action instructions are the input, and the physical actions and visual expressions are the output.
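The transfer of text data in Step 2 could, for example, use a JSON body carried over HTTP. The following Python sketch shows only the encoding on the terminal side and the validation on the server side. The field names `user_id` and `text` and both function names are assumptions made for illustration; the disclosure states only that a common protocol such as HTTP is used.

```python
import json
from typing import Any, Dict


def encode_request(user_id: str, text: str) -> bytes:
    """Serialize the recognized text as a UTF-8 JSON body (terminal side)."""
    payload = {"user_id": user_id, "text": text}
    # ensure_ascii=False keeps non-ASCII text such as Japanese readable.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def decode_request(body: bytes) -> Dict[str, Any]:
    """Parse and validate the body before language processing (server side)."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    return payload
```

Validating the body before it reaches the generative AI model lets the server reject empty or malformed requests early.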

[0711] (Application Example 1)

[0712] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0713] There is a need to provide a natural conversational environment that allows elderly and other users with communication difficulties to engage in dialogue with peace of mind and alleviate feelings of loneliness. Conventional dialogue systems often lack naturalness and comfort in conversation because they do not adequately consider the physiological and psychological state of the user when responding. A system is needed to improve this and realize a dialogue experience that is tailored to the user's needs.

[0714] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0715] In this invention, the server includes a speech recognition means that receives acoustic input and converts it into text data, a means that relays the text data and performs language processing, a speech synthesis means that converts the generated response text into sound and outputs it, and a physiological information analysis means that recognizes the user's physiological information and adjusts the response. This enables appropriate and natural dialogue that is tailored to the user's physiological and psychological characteristics.

[0716] "Acoustic input" refers to the voice acquired from the user and is the input format used when a dialogue system receives voice information.

[0717] "Text data" refers to digital text information converted from acoustic input by speech recognition technology.

[0718] "Speech recognition means" refers to a process that converts acoustic input into text data by performing acoustic analysis and generating a textual representation.

[0719] "Language processing" is the process of analyzing and understanding text data and generating appropriate responses, utilizing natural language processing technologies.

[0720] "Speech synthesis means" refers to technology that converts text data back into sound and outputs it to the user.

[0721] "Physiological information" refers to information that indicates the user's physical condition, including indicators such as heart rate, complexion, and facial expression.

[0722] A "physiological information analysis means" is a process equipped with the function of adjusting the content and tone of responses based on the user's physiological information.

[0723] This invention is designed as part of a conversation system for use by the elderly and mainly consists of functions for acoustic input, text data conversion, language processing, speech generation, and physiological information analysis.

[0724] First, when the user speaks to the robot, the microphone built into the device receives this audio as sound input. The received audio is then converted into text data by a speech recognition system. In this process, the device utilizes the Google Speech-to-Text API to achieve advanced speech recognition.

[0725] Next, the text data is sent to the server, where language processing is performed using Python and spaCy. The server analyzes and understands the text data, and through natural language processing, generates a response that is tailored to the user. Simultaneously, sentiment analysis is performed using IBM Watson Natural Language Understanding, and a response adapted to the user's emotions is prepared. In addition, physiological information such as the user's heart rate and facial expressions is acquired and the response is adjusted to achieve truly personalized dialogue.

[0726] The generated response text is sent to the terminal, converted back into sound using a speech synthesis method based on the Google Text-to-Speech API, and output to the user through the speaker. At this time, the terminal provides a more natural conversational experience by incorporating facial expressions and gestures according to the user's physiological state.

[0727] As a concrete example, if a user in a nursing home mutters "I was lonely today" through their glasses, the system immediately transcribes this voice into text and sends it to the server. After appropriately understanding the situation through language processing and physiological information analysis, the system can respond with something like, "Thank you for telling me that you're feeling lonely. Let me think of something that might take your mind off it a little."

[0728] Examples of prompts for the generative AI model include questions such as, "How was your day today?" and "Tell me about something fun you've done recently." This ensures that the conversation is always fresh and engaging.

[0729] The flow of the specific processing in Application Example 1 will be explained using Figure 12.

[0730] Step 1:

[0731] The user speaks to the robot, and the microphone built into the device captures this audio input. The audio input is captured as raw speech data and passed to the Google Speech-to-Text API.

[0732] Step 2:

[0733] The device uses speech recognition to convert the acquired audio input into text data. Specifically, the Google Speech-to-Text API analyzes the audio data and generates a corresponding text representation. The input is audio, and the output is text data.

[0734] Step 3:

[0735] Text data is sent to the server, which uses Python and spaCy to perform natural language processing. The server receives the text data as input, performs grammatical analysis and contextual understanding, and extracts the user's intent. This process prepares the generative AI model to provide appropriate answers and information based on the user's intent.

[0736] Step 4:

[0737] The server uses IBM Watson Natural Language Understanding to perform sentiment analysis based on text data. It identifies emotions such as joy, anger, sadness, and happiness from the text data received as input and designs a response accordingly. The results of the sentiment analysis provide data to understand the user's psychological state.

[0738] Step 5:

[0739] A physiological information analysis system acquires the user's physiological information and provides this information to the server. The server comprehensively analyzes the physiological information and emotion analysis data to generate the optimal response. Specifically, it takes into account the user's heart rate and facial expression patterns to adjust the tone and content of the response.

[0740] Step 6:

[0741] The server sends the generated response text to the device, which then uses the Google Text-to-Speech API to convert the text data into speech. Finally, this audio response is delivered to the user through the device's speaker. The output is the audio that the user hears.

[0742] Step 7:

[0743] The terminal also displays facial expressions and simple gestures when responding, providing users with a natural and friendly conversational experience. User feedback is recorded on the server so that future responses can be further optimized.
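The adjustment of tone from physiological information and sentiment in Step 5 might look like the following rule-based sketch. It is not the method of this disclosure: the `Physiology` fields, the tone labels, the expression labels, and the 100 bpm threshold are all hypothetical, and a real system would obtain the sentiment from the analysis service named in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Physiology:
    heart_rate_bpm: Optional[float] = None   # None when no reading is available
    facial_expression: Optional[str] = None  # e.g. "sad", "tense", "smiling"


def choose_tone(sentiment: str, physio: Physiology,
                high_hr_bpm: float = 100.0) -> str:
    """Pick a response tone from sentiment plus physiological readings."""
    # An elevated heart rate takes priority (threshold is an assumed value).
    if physio.heart_rate_bpm is not None and physio.heart_rate_bpm >= high_hr_bpm:
        return "calming"
    # Negative sentiment or a negative facial expression calls for gentleness.
    if sentiment == "negative" or physio.facial_expression in {"sad", "tense"}:
        return "gentle"
    return "cheerful"
```

The chosen tone could then be included in the instruction given to the generative AI model so that both the content and the tone of the response are adjusted.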

[0744] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0745] This invention relates to a system that includes an emotion engine that analyzes voice input during user interaction to recognize emotions and generates a response based on the results. This system consists of a user, a terminal, a server, and the emotion engine.

[0746] When a user speaks to the robot, their voice is captured by a microphone built into the device. The device converts the voice input into text data using speech recognition and sends this data to the server.

[0747] On the server, the received text data is analyzed by a natural language processing system, and at the same time, an emotion engine is activated to recognize emotions from the audio data. The emotion engine has the ability to estimate the user's emotional state from the intonation, speed, and pauses of the voice. Through this emotion analysis, it is possible to understand what the user is feeling.

[0748] The server takes the results of this emotion recognition into account and generates an appropriate response text. For example, if the user is judged to be sad, it will generate an encouraging response, and if the user is judged to be happy, it will select a more friendly response.
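The selection described above (an encouraging response for a sad user, a friendlier one for a happy user) can be sketched as a small lookup that produces an instruction for the response generator. The names `STYLE_BY_EMOTION`, `DEFAULT_STYLE`, and `response_instruction`, the "neutral" fallback, and the instruction wording are assumptions for illustration only.

```python
# Hypothetical mapping from recognized emotion to response style.
STYLE_BY_EMOTION = {
    "sad": "encouraging",
    "happy": "friendly",
}
DEFAULT_STYLE = "neutral"


def response_instruction(emotion: str, user_text: str) -> str:
    """Build an instruction telling the generator which tone to use."""
    style = STYLE_BY_EMOTION.get(emotion.strip().lower(), DEFAULT_STYLE)
    return f"Reply to the user in this tone: {style}. User message: {user_text}"
```

For instance, `response_instruction("sad", "I had a bad day")` yields an instruction that asks for an encouraging tone, while an unrecognized emotion label falls back to the neutral style.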

[0749] The generated response text is sent to the terminal and converted into speech data using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. At this time, the terminal responds to the user using gestures and facial expressions, providing a more human-like and pleasant conversational experience.

[0750] As a concrete example, suppose a user says to the robot, "I'm feeling down because I had a bad day." In this case, the device converts this utterance into text and sends it to the server. The emotion engine analyzes the user's tone of voice and content and concludes that the user is "feeling down." The server then generates a comforting response such as, "Some days are like that. I'm sure you'll do better next time."

[0751] In this way, the present invention enables natural dialogue that is attentive to the user's emotions and can fulfill the role of a conversational partner.

[0752] The following describes the processing flow.

[0753] Step 1:

[0754] The user speaks to the robot. During this interaction, the voice is captured by the microphone in the device.

[0755] Step 2:

[0756] The terminal converts the captured audio data into text data using speech recognition. The converted text data is then prepared for subsequent processing.

[0757] Step 3:

[0758] The terminal sends text data and voice feature data to the server. The voice feature data includes information on intonation, speed, and pauses.

[0759] Step 4:

[0760] On the server, natural language processing is performed on the text data to analyze user intent. Simultaneously, an emotion engine recognizes the user's emotions from the voice feature data.

[0761] Step 5:

[0762] The server combines the results of natural language processing with the results of emotion recognition by the emotion engine to generate appropriate response text. The generated response will be tailored to the user's emotions.

[0763] Step 6:

[0764] The server sends the generated response text to the terminal. The terminal then prepares to convert this into audio data.

[0765] Step 7:

[0766] The terminal uses speech synthesis to convert the response text into audio data and transmits the response to the user through the speaker.

[0767] Step 8:

[0768] The user receives an audio response from the device, and if they wish to continue the conversation, they speak again, and the process is repeated.
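To make the voice feature data of Step 3 and the emotion recognition of Step 4 more concrete, the following Python sketch defines a container for intonation, speed, and pauses and applies simple threshold rules. The field names, units, thresholds, and emotion labels are all hypothetical; in practice a trained model such as the emotion identification model 59 would take the place of these hand-written rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceFeatures:
    pitch_variation: float  # intonation, on an assumed 0.0-1.0 scale
    speech_rate_wps: float  # speed, in words per second
    pause_ratio: float      # pauses, as the silent fraction of the utterance


def estimate_emotion(features: VoiceFeatures) -> str:
    """Toy rule set standing in for an emotion engine (thresholds are assumed)."""
    slow = features.speech_rate_wps < 2.0
    fast = features.speech_rate_wps > 3.0
    # Slow, flat speech with long pauses is treated as a sign of sadness.
    if slow and features.pause_ratio > 0.3 and features.pitch_variation < 0.3:
        return "sad"
    # Fast, lively speech is treated as a sign of happiness.
    if fast and features.pitch_variation > 0.6:
        return "happy"
    return "neutral"
```

Sending these few numbers alongside the text, rather than the raw audio, is one possible way for the terminal to give the server what it needs for emotion recognition.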

[0769] (Example 2)

[0770] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0771] While modern interactive systems possess technologies that convert voice input into text for text-based conversations, understanding human emotions and providing flexible responses based on them remains challenging. There is a need to develop systems that can overcome this challenge and enable more natural dialogue.

[0772] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0773] In this invention, the server includes means for receiving audio signals and converting them into text information, means for performing sentiment analysis based on the received text information, and means for learning past dialogue data with the user and optimizing responses in subsequent dialogues. This makes it possible to understand the user's emotions and generate natural and appropriate responses accordingly.

[0774] An "audio signal" is a digital or analog representation of sound waveforms, including human voices.

[0775] "Textual information" refers to data expressed as characters, specifically a format in which audio, images, and other data are replaced with text.

[0776] "Speech recognition means" refers to technologies and devices for converting speech signals into text information, and has the function of analyzing speech data and converting it into an appropriate string of characters.

[0777] An "information processing device" is a device used for inputting, processing, storing, and outputting data, and includes computers and servers.

[0778] "Language data processing" refers to the process of analyzing, understanding, and generating data related to language, and may include natural language processing techniques.

[0779] "Speech synthesis means" refers to technologies and devices that convert text information into speech data that can be reproduced as speech.

[0780] "Emotion analysis" is a technology that infers and determines a speaker's emotional state from their voice or text.

[0781] "User" refers to an individual or organization that operates and uses the system or service.

[0782] "Dialogue data" refers to data that records interactions between a user and a system, including the content and duration of the conversation.

[0783] "Response optimization" refers to the process of generating the optimal response based on past information and circumstances, with the aim of providing the most appropriate and effective response for the user.
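
As a minimal sketch of how "dialogue data" and "response optimization" could fit together, the following Python code records past exchanges and formats the most recent ones as context for the next response. The class names, field names, and the choice of three recent turns are illustrative assumptions and do not appear in this disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueTurn:
    """One recorded exchange; the field names are illustrative assumptions."""
    user_text: str
    response_text: str
    emotion: str
    duration_sec: float  # the definition above mentions conversation duration


@dataclass
class DialogueHistory:
    """Keeps past turns so that later responses can take them into account."""
    turns: List[DialogueTurn] = field(default_factory=list)

    def record(self, turn: DialogueTurn) -> None:
        self.turns.append(turn)

    def recent_context(self, max_turns: int = 3) -> str:
        """Format the most recent turns as context text for the next prompt."""
        if max_turns <= 0:
            return ""
        lines = []
        for t in self.turns[-max_turns:]:
            lines.append(f"User ({t.emotion}): {t.user_text}")
            lines.append(f"System: {t.response_text}")
        return "\n".join(lines)
```

The returned context string could be prepended to the prompt given to the generative AI model; how the history is actually used for optimization is left open here.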

[0784] This invention is a system that understands a user's emotions through voice interaction and generates a response tailored to that user. The system consists of a user, a terminal, a server, and an emotion analysis engine. Specifically, it is configured as follows:

[0785] When a user speaks into the microphone built into the device, the device captures the audio signal. Here, we can take smartphones and tablets as examples of general-purpose devices. The captured audio signal is converted into text information using speech recognition software, such as a "speech recognition API."

[0786] The converted text information is transmitted to the server via the internet using a secure protocol. The server uses a natural language processing system, such as a "natural language processing engine," to analyze the received text information as linguistic data. Simultaneously, an emotion analysis engine is used to analyze the user's emotions from the audio signal and generate decision data based on that analysis.

[0787] The information obtained through this analysis is used to generate appropriate response text using a generative AI model. For example, if a user says, "I'm very tired today," the system will generate a response such as, "You must be tired. Please get some rest." Such responses are achieved by using the instruction, "Generate appropriate words of comfort based on the user's input," as the prompt for the generative AI model.
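
The prompt construction described above could look roughly like the following Python sketch. The instruction string is the one quoted in this paragraph; the function name `build_prompt`, the `emotion` argument, and the line layout of the prompt are assumptions made for illustration.

```python
# Instruction text quoted from the description above.
INSTRUCTION = "Generate appropriate words of comfort based on the user's input."


def build_prompt(user_text: str, emotion: str) -> str:
    """Combine the instruction, the estimated emotion, and the user's words.

    The three-line layout is a hypothetical format, not one given in the text.
    """
    return (
        f"{INSTRUCTION}\n"
        f"Estimated emotion: {emotion}\n"
        f"User input: {user_text}"
    )
```

For the utterance "I'm very tired today" with an estimated emotion of "tired", the resulting prompt would be passed to the generative AI model, which is expected to return a response such as the one quoted above.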

[0788] The generated response text information is sent back to the terminal and converted into speech using speech synthesis technology. The converted speech is then transmitted to the user through the speaker. This allows the system to provide the user with a natural and comfortable conversational experience.

[0789] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0790] Step 1:

[0791] The user speaks to the robot using voice. The input is the user's spoken words, for example, "I'm very tired today." A microphone built into the device captures this voice signal and outputs it as digital audio data.

[0792] Step 2:

[0793] The device launches speech recognition software and converts the input speech data into text. This step uses a speech recognition API to output the spoken content as a string. Specifically, it generates the text "I'm very tired today."

[0794] Step 3:

[0795] The converted text information is sent from the terminal to the server. The transmitted data is in text format and reaches the server securely over an Internet data transmission protocol.

[0796] Step 4:

[0797] The server analyzes the received text information using a natural language processing engine. The input is text information sent by the user, and the output is the result of language structure analysis. Here, the grammar and meaning of the text information are understood.

[0798] Step 5:

[0799] The server simultaneously uses an emotion analysis engine to analyze the user's emotions from the audio signal. The input is characteristic data of the audio signal (intonation, speed, and pauses), and the output is data indicating the user's emotional state.

[0800] Step 6:

[0801] Based on the results of natural language processing and emotion analysis, the server uses a generative AI model to generate response text. The model receives the analysis results (the text information, the language analysis results, and the emotion analysis results) as input and outputs an appropriate response, such as "Thank you for your hard work. Please rest well today." Based on the instructions in the prompt, the response is tailored to the user's emotions.

[0802] Step 7:

[0803] The generated response text information is sent from the server to the terminal. In this step, the response is transmitted as text data.

[0804] Step 8:

[0805] The terminal converts the received response text information into speech data using speech synthesis technology. This process takes text information as input and generates a speech signal as output.

[0806] Step 9:

[0807] The voice data is transmitted to the user through the device's speaker. The user is provided with a natural and comfortable voice response that feels like interacting with a human.
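
Steps 1 to 9 above can be sketched as a single conversational turn in Python. Each callable below is a hypothetical stand-in for a component named in the steps, not a real API. The sketch also assumes that the audio (or features derived from it) reaches the server along with the text, since Step 5 analyzes the audio signal on the server.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example2Pipeline:
    # All five callables are hypothetical placeholders for the components above.
    recognize: Callable[[bytes], str]          # Step 2: speech -> text (terminal)
    analyze_language: Callable[[str], dict]    # Step 4: text -> language analysis (server)
    analyze_emotion: Callable[[bytes], str]    # Step 5: audio features -> emotion (server)
    generate: Callable[[str, dict, str], str]  # Step 6: analysis -> response text (server)
    synthesize: Callable[[str], bytes]         # Step 8: response text -> speech (terminal)

    def run_turn(self, audio: bytes) -> bytes:
        """Process one utterance and return the audio to play (Step 9)."""
        text = self.recognize(audio)
        analysis = self.analyze_language(text)
        emotion = self.analyze_emotion(audio)
        response = self.generate(text, analysis, emotion)
        return self.synthesize(response)
```

Passing the components in as callables keeps the sketch independent of any particular speech recognition, language processing, or speech synthesis service.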

[0808] (Application Example 2)

[0809] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0810] Conventional speech recognition systems are capable of accurately transcribing speech into text and generating responses, but they have struggled to accurately understand the user's emotions and respond flexibly based on those emotions. Furthermore, when interacting with users in anxious or stressful situations, there is a need to provide appropriate and comforting responses, but only a limited number of systems have been able to deliver satisfactory results in this regard.

[0811] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0812] In this invention, the server includes: speech recognition means for receiving voice input and converting the voice into text data; means for transmitting the text data to a processing unit for natural language processing; speech synthesis means for converting the response text generated by the processing unit into voice and outputting it; emotion recognition means for analyzing the user's emotions and generating an appropriate response based on those emotions; and means for providing visual or physical responses to the user using an optical device or a motion device, while taking into account the generated response. This enables appropriate dialogue and responses in accordance with the user's emotions.

[0813] "Voice input" is a function that receives the user's voice as a digital signal.

[0814] A "speech recognition method that converts speech into text data" is a technology that analyzes input speech and converts its content into text information.

[0815] "Methods for natural language processing" are algorithms that analyze text-based data to understand its meaning and intent.

[0816] "Speech synthesis means" refers to the ability to generate and output speech based on generated text information.

[0817] "Emotion recognition means" refers to processing that identifies the user's emotional state from voice or text.

[0818] "Means of providing visual or physical responses" refers to technologies that use optical devices or motors to provide visual or physical feedback in a manner similar to that of a human.

[0819] This invention provides a system that receives voice input and converts the voice into text data. This system includes speech recognition means, natural language processing means, speech synthesis means, and emotion recognition means.

[0820] The server receives voice input using a microphone and uses a speech recognition library (e.g., Google Speech-to-Text API) to convert it into text data. The converted text data is then parsed by a natural language processing system (e.g., Google Natural Language API). This natural language processing is performed to understand the meaning and intent of the text.

[0821] The server also uses emotion recognition to identify the user's emotional state from their voice or text. Emotion analysis utilizes elements such as intonation, pauses, and speed of speech. Based on this analysis, an appropriate response is generated and output as speech using a speech synthesis system (e.g., Google Text-to-Speech API). The speech is then delivered to the user via a speaker.
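
As an illustration of emotion analysis based on intonation, pauses, and speed of speech, the following Python sketch classifies an utterance from three pre-extracted features. The feature names, the thresholds, and the emotion labels are made-up placeholders; this disclosure does not specify how these elements are measured or combined.

```python
from dataclasses import dataclass


@dataclass
class ProsodyFeatures:
    pitch_variation_hz: float  # intonation: spread of the pitch contour
    words_per_second: float    # speed of speech
    pause_ratio: float         # fraction of the utterance spent in silence (0..1)


def estimate_emotion(f: ProsodyFeatures) -> str:
    """Toy rule-based estimate; all thresholds are hypothetical."""
    if f.words_per_second < 2.0 and f.pause_ratio > 0.3:
        return "tired"      # slow speech with long pauses
    if f.words_per_second > 4.0 and f.pitch_variation_hz > 50.0:
        return "excited"    # fast speech with lively intonation
    return "neutral"
```

A practical emotion recognition means would more likely use a trained model than fixed rules; the rules here only show how the three elements could enter the decision.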

[0822] For example, if a user says, "I'm a little tired today," the speech recognition system converts this speech into text, and the server performs sentiment analysis, recognizing that the user is "tired." The server then generates a comforting response such as, "Please rest well today," and returns it to the user as audio.

[0823] By utilizing generative AI models, the server can also provide visual or physical feedback that takes into account the generated response. Optical and motion devices can be used to achieve more human-like responses.

[0824] Example prompt: "When a user says 'I'm a little tired today,' consider what emotion you should recognize and what response you should generate."
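
One possible way to use such a prompt is sketched below in Python. The prompt wording follows the example prompt above; the request for a JSON reply with "emotion" and "response" keys, and the fallback behavior when the reply is not valid JSON, are assumptions added for illustration.

```python
import json
from typing import Tuple

PROMPT_TEMPLATE = (
    "When a user says '{utterance}', consider what emotion you should "
    "recognize and what response you should generate. "
    'Reply as JSON with the keys "emotion" and "response".'  # assumed output format
)


def build_emotion_prompt(utterance: str) -> str:
    return PROMPT_TEMPLATE.format(utterance=utterance)


def parse_reply(reply: str) -> Tuple[str, str]:
    """Return (emotion, response); fall back to the raw reply if parsing fails."""
    try:
        data = json.loads(reply)
        return str(data["emotion"]), str(data["response"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return "unknown", reply
```

The recognized emotion can then drive the visual or physical feedback, while the response text is passed to speech synthesis.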

[0825] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0826] Step 1:

[0827] The terminal captures the user's voice input using a microphone. The input is the user's voice, which is received by the terminal as a digital signal.

[0828] Step 2:

[0829] The device uses a speech recognition library (e.g., Google Speech-to-Text API) to convert speech into text data. The input is an audio signal, and the output is text data corresponding to its content. The process involves analyzing the speech pattern and converting the spoken content into text.

[0830] Step 3:

[0831] The terminal sends the converted text data to the server. At this stage, the text data is the input, and the transmission of data to the server is the output. Data is transferred accurately and efficiently to the server using data communication means.

[0832] Step 4:

[0833] The server analyzes the received text data using a natural language processing system (e.g., Google Natural Language API). The input is text data, and the output is the analysis results regarding the meaning and context of the text. The intent of the input is understood through lexical analysis and contextual recognition.

[0834] Step 5:

[0835] The server uses emotion recognition means to identify the user's emotions from the input audio and text. The input consists of text data and audio attributes (e.g., intonation, speed), and the output is the identified emotional state. The server analyzes the voice tone and expression to infer the user's emotions.

[0836] Step 6:

[0837] The server generates an appropriate response based on sentiment analysis. Here, the emotional state is the input, and the generated response text is the output. A generative AI model is used to construct friendly responses that correspond to the user's emotions.

[0838] Step 7:

[0839] The server sends the generated response text to the terminal, where it is converted into speech by a speech synthesis system (e.g., Google Text-to-Speech API). The input is the generated response text, and the output is the synthesized speech. The message is conveyed to the user through natural-sounding speech synthesis.

[0840] Step 8:

[0841] The device transmits the generated audio to the user using its speaker. The input in this step is audio data, and the output is audio output from the speaker. Appropriate acoustic output is provided to enable voice communication with the user.

[0842] Step 9:

[0843] The device utilizes optical or motion devices to provide the user with visual or physical responses. Here, the input is feedback configuration information based on the generated emotion, and the output is the visual or motional effect on the user. Automated actions and displays are employed to present feedback corresponding to specific emotions as naturally as possible.
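
The feedback configuration used in Step 9 could be represented as a lookup from the identified emotion to settings for the optical device and the motion device, as in the following Python sketch. The emotion labels, LED colors, and motion names are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Feedback:
    led_color: str  # setting for the optical device
    motion: str     # setting for the motion device


# Hypothetical mapping; the disclosure does not list concrete feedback settings.
FEEDBACK_BY_EMOTION: Dict[str, Feedback] = {
    "tired": Feedback(led_color="warm_white", motion="slow_nod"),
    "happy": Feedback(led_color="yellow", motion="wave"),
}
DEFAULT_FEEDBACK = Feedback(led_color="white", motion="idle")


def select_feedback(emotion: str) -> Feedback:
    """Pick feedback for the emotion, or a neutral default if it is unmapped."""
    return FEEDBACK_BY_EMOTION.get(emotion, DEFAULT_FEEDBACK)
```

Falling back to a neutral default avoids presenting a misleading reaction when the emotion recognition means returns a label that has no configured feedback.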

[0844] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0845] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions and inference data, such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
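
The input and output relationship described for the data generation model 58 can be sketched as a small Python interface: a prompt plus inference data goes in, and an inference result comes out. The names `DataGenerationModel`, `infer`, and `CannedModel` are assumptions for illustration and do not correspond to the API of any actual generative AI service.

```python
from dataclasses import dataclass
from typing import Protocol, Union

# Text data, or raw bytes standing in for audio or image data.
InferenceData = Union[str, bytes]


class DataGenerationModel(Protocol):
    """Hypothetical interface: prompt + inference data -> inference result."""

    def infer(self, prompt: str, data: InferenceData) -> InferenceData: ...


@dataclass
class CannedModel:
    """Stand-in that always returns the same reply; not a real model."""
    reply: str

    def infer(self, prompt: str, data: InferenceData) -> InferenceData:
        return self.reply


def respond(model: DataGenerationModel, prompt: str, data: InferenceData) -> InferenceData:
    """Run inference on the data according to the instructions in the prompt."""
    return model.infer(prompt, data)
```

Writing the caller against the interface rather than a concrete model is consistent with the statement elsewhere in this description that the model may be provided in an external device.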

[0846] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0847] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0848] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0849] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0850] The inner part of the emotion map 400 represents what is in the mind, while the outer part represents behavior. Therefore, the further outward an emotion lies on the emotion map 400, the more visible it becomes (the more it manifests in behavior).
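
The layout of the emotion map 400 described above can be illustrated with a Python sketch that places an emotion at a 2-D position and reads off its region, valence, and layer. The coordinate convention (x to the right, y upward, center at the origin) and the `inner_radius` threshold are assumptions; the disclosure describes the arrangement only qualitatively.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MapPosition:
    x: float  # negative = left side (reaction), positive = right side (situation)
    y: float  # positive = upper side (pleasure), negative = lower side (displeasure)

    @property
    def radius(self) -> float:
        """Distance from the center of the concentric circles."""
        return math.hypot(self.x, self.y)


def describe(pos: MapPosition, inner_radius: float = 1.0) -> Dict[str, str]:
    """Summarize where a position falls on the map; inner_radius is hypothetical."""
    region = "reaction" if pos.x < 0 else "situation" if pos.x > 0 else "both"
    valence = "pleasure" if pos.y > 0 else "displeasure" if pos.y < 0 else "neutral"
    layer = (
        "inner (state of mind)"
        if pos.radius <= inner_radius
        else "outer (manifests in action)"
    )
    return {"region": region, "valence": valence, "layer": layer}
```

For example, a position at the 3 o'clock direction, such as `MapPosition(3.0, 0.0)`, falls on the situation side with neutral valence in the outer layer.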

[0851] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
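
The idea that moving toward or away from an ideal balance yields pleasure or discomfort can be sketched in Python as follows. The quantity names, the use of a summed absolute deviation, and the assumption that all quantities are normalized to comparable scales are illustrative choices not stated in this disclosure.

```python
from typing import Mapping


def total_deviation(state: Mapping[str, float], ideal: Mapping[str, float]) -> float:
    """Sum of absolute deviations from the ideal (assumes normalized quantities)."""
    return sum(abs(state[name] - ideal[name]) for name in ideal)


def feeling_from_change(
    before: Mapping[str, float],
    after: Mapping[str, float],
    ideal: Mapping[str, float],
) -> str:
    """Approaching the ideal gives pleasure; moving away gives discomfort."""
    d_before = total_deviation(before, ideal)
    d_after = total_deviation(after, ideal)
    if d_after < d_before:
        return "pleasure"
    if d_after > d_before:
        return "discomfort"
    return "neutral"
```

For a robot, `ideal` might hold a full battery level and an upright posture; charging the battery would then move the state toward the ideal and register as pleasure.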

[0852] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0853] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
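
One possible way to obtain training targets in which nearby emotions have similar values is to spread each label over the map with a distance-based kernel, as in the following Python sketch. This is only an assumed technique: the disclosure does not state how the similarity is achieved, and the emotion positions, the Gaussian kernel, and the `sigma` value are hypothetical.

```python
import math
from typing import Dict, Tuple

# Hypothetical 2-D positions on the emotion map; not taken from the figures.
POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (2.0, 0.2),
    "calm": (2.1, 0.0),
    "confident": (2.0, -0.2),
    "angry": (-2.0, -1.0),
}


def soft_targets(
    label: str,
    positions: Dict[str, Tuple[float, float]],
    sigma: float = 0.5,
) -> Dict[str, float]:
    """Return target emotion values that decay with map distance from the label."""
    lx, ly = positions[label]
    weights = {
        name: math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in positions.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

With these placeholder positions, a training example labeled "calm" receives high target values for "reassured" and "confident" as well, and a value near zero for the distant "angry".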

[0854] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0855] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0856] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0857] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0858] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0859] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0860] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0861] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0862] Furthermore, as the hardware structure of these various processors, more specifically, an electric circuit combining circuit elements such as semiconductor elements can be used. The specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps may be deleted, new steps may be added, or the processing order may be rearranged, as long as this does not deviate from the gist.

[0863] The descriptions and illustrations presented above are detailed explanations of the parts related to the technology of this disclosure and are merely examples of that technology. For example, the above descriptions of structure, function, operation, and effect are descriptions of examples of the structure, function, operation, and effect of the parts related to the technology of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as this does not deviate from the gist of the technology of this disclosure. Furthermore, to avoid confusion and to facilitate understanding of the parts related to the technology of this disclosure, explanations of common technical knowledge and the like that need no particular explanation to enable implementation of the technology of this disclosure have been omitted from the descriptions and illustrations presented above.

[0864] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0865] The following is further disclosed regarding the embodiments described above.

[0866] (Claim 1)

[0867] A speech recognition means that receives voice input and converts the voice into text data,

[0868] A means for transmitting the aforementioned text data and performing natural language processing on a server,

[0869] A speech synthesis means that converts the response text generated in the server into speech and outputs it,

[0870] A system that includes this.

[0871] (Claim 2)

[0872] The system according to claim 1, further comprising means for performing sentiment analysis based on text data received by the server and generating a response corresponding to the text data.

[0873] (Claim 3)

[0874] The system according to claim 1, wherein the server has means for learning past conversation data with the user and optimizing responses in the next conversation.

[0875] "Example 1"

[0876] (Claim 1)

[0877] A speech recognition means that receives voice input and converts the voice into text data,

[0878] A means for transmitting the aforementioned character data and performing natural language processing in an information processing device,

[0879] The aforementioned information processing device includes a speech synthesis means that converts the generated response characters into speech and outputs them,

[0880] A means of controlling the movement and facial expressions of objects in the response output to improve the naturalness of the dialogue,

[0881] A system that includes this.

[0882] (Claim 2)

[0883] The system according to claim 1, further comprising means for performing sentiment analysis based on character data received by the information processing device and generating a response corresponding to the character data.

[0884] (Claim 3)

[0885] The system according to claim 1, wherein the information processing device includes means for learning past dialogue data with the user and optimizing the response in the next dialogue.

[0886] "Application Example 1"

[0887] (Claim 1)

[0888] A speech recognition means that receives an audio input and converts the audio into text data,

[0889] A means for relaying the aforementioned character data and performing language processing in a processing device,

[0890] The aforementioned processing device includes a speech synthesis means that converts the generated response characters into sound and outputs them,

[0891] A physiological information analysis means that recognizes the user's physiological information and adjusts the response based on said physiological information,

[0892] A system that includes this.

[0893] (Claim 2)

[0894] The system according to claim 1, further comprising means for performing sentiment analysis based on character data received by the processing device and generating a response corresponding to the character data.

[0895] (Claim 3)

[0896] The system according to claim 1, wherein the processing device includes means for learning past dialogue data with the user and optimizing the response in the next dialogue.

[0897] "Example 2 of combining an emotion engine"

[0898] (Claim 1)

[0899] A speech recognition means that receives an audio signal and converts the audio into text information,

[0900] Means for transmitting the aforementioned character information and performing language data processing in an information processing device,

[0901] The aforementioned information processing device includes a speech synthesis means that converts the generated response character information into speech and outputs it,

[0902] A system that includes this.

[0903] (Claim 2)

[0904] The system according to claim 1, further comprising means for performing sentiment analysis based on character information received by the information processing device and generating a response corresponding to the character information.

[0905] (Claim 3)

[0906] The system according to claim 1, wherein the information processing device includes means for learning past dialogue data with the user and optimizing the response in the next dialogue.

[0907] "Application example 2 when combining with an emotional engine"

[0908] (Claim 1)

[0909] A speech recognition means that receives voice input and converts the voice into text data,

[0910] Means for transmitting the aforementioned text data and performing natural language processing in a processing device,

[0911] The aforementioned processing apparatus includes a speech synthesis means that converts the generated response text into speech and outputs it,

[0912] An emotion recognition means that analyzes the user's emotions and generates an appropriate response based on those emotions,

[0913] Means for providing a visual or physical response to a user using an optical device or motion device, while taking into account the generated response,

[0914] A system that includes this.

[0915] (Claim 2)

[0916] The system according to claim 1, further comprising means for performing sentiment analysis based on text data received by the processing device and generating a response corresponding to the text data.

[0917] (Claim 3)

[0918] The system according to claim 1, wherein the processing device includes means for analyzing past communication data with the user and optimizing responses in future communications.

[Explanation of symbols]

[0919]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: a speech recognition means that receives a voice input and converts the voice into text data; a means for transmitting the text data and performing language processing in a processing device; a speech synthesis means, included in the processing device, that converts generated response text into speech and outputs it; and a physiological information analysis means that recognizes physiological information of the user and adjusts the response based on the physiological information.

2. The system according to claim 1, further comprising means for performing sentiment analysis based on the text data received by the processing device and generating a response corresponding to the text data.

3. The system according to claim 1, wherein the processing device includes means for learning past dialogue data with the user and optimizing the response in the next dialogue.