system

A system that analyzes motion data to generate natural language responses addresses communication barriers for voice-impaired individuals, facilitating smooth and natural interactions.

JP2026096649A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15



Abstract

A system is provided. [Solution] The system includes: means for receiving motion data acquired using user input means; means for analyzing the motion data and recognizing its semantic content; means for generating natural language based on the recognized semantic content; and means for outputting a response sentence from the generated natural language.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] People who have difficulty with voice communication face many barriers in conveying their intentions. In daily conversation in particular, they often feel lonely and alienated because their means of communication are limited. The need for a communication support system that can convey intentions in a natural form is therefore increasing.

Means for Solving the Problems

[0005] This invention provides a system that receives motion data acquired using user input means, analyzes that data, and recognizes its meaning. Furthermore, it includes means for generating natural language based on the recognized content and outputting a response sentence from the generated natural language. This enables natural and smooth communication using gestures and simple inputs made by the user.

[0006] "User input means" refers to devices or interfaces that allow users to communicate their actions and intentions to a system.

[0007] "Motion data" refers to digital data acquired from the user's physical actions and inputs.

[0008] "Analysis" is the process of breaking down and interpreting received data in order to identify its meaning.

[0009] "Means of recognition" refers to the technologies and algorithms used to identify and understand the meaning and content of input data.

[0010] "Natural language generation" is a technology in which machines create human language based on data.

[0011] A "response sentence" is a natural language message generated by the system in response to user input.

[0012] "Means of output" refers to the technologies and devices used to present the generated response sentence to the user.

Brief Description of the Drawings

[0013]

[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.

[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.

[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.

[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.

[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.

[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.

[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.

[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.

[Figure 9] An emotion map to which a plurality of emotions are mapped.

[Figure 10] An emotion map to which a plurality of emotions are mapped.

[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.

[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.

[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.

[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0017] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is memory in which information is temporarily stored and which the processor uses as work memory.

[0018] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same reading applies in this specification when three or more items are linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes it on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs reception output processing. The storage 50 stores a reception output program 60. The data processing system 10 uses the reception output program 60 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes it on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] To implement this invention, a user-operated terminal and a server that performs data analysis and response generation are used. The terminal is equipped with an input device that can acquire user gestures or text input and transmits this input data to the server. The server analyzes the data transmitted from the terminal and recognizes the meaning of the gestures and input content.

[0035] As a concrete example, the terminal's camera captures the user's hand movements and sends the gesture to the server as digital data. The server uses a gesture recognition algorithm to analyze this digital data and determine that it means "hello." The server then uses a natural language generation engine to create an appropriate response, for example, "Hello, how are you?"

[0036] The generated response is sent from the server to the terminal and displayed on the terminal's screen. Furthermore, by combining this with speech synthesis technology, the response can also be played back as audio. This allows users who have difficulty with voice communication to communicate their intentions to others in real time.

[0037] This invention significantly reduces communication barriers, enabling people to enrich their interactions in daily life. In this way, it realizes a system that allows for natural and smooth conversation without the need for voice.

[0038] The following describes the processing flow.

[0039] Step 1:

[0040] The terminal acquires user gestures and text input through the input device. When the user waves their hand, the terminal's camera captures the movement and saves it as image data.

[0041] Step 2:

[0042] The terminal transmits the acquired gesture and input data to the server via the network. At this time, the user's input is formatted as digital data.

[0043] Step 3:

[0044] The server receives data sent from the terminal and begins analysis using a deep learning algorithm. Here, the server applies a gesture recognition model to identify the meaning of the actions from the received data.

[0045] Step 4:

[0046] The server activates a natural language generation engine based on the analysis results and generates an appropriate response. For example, in response to a gesture meaning "hello," it creates the response "Hello, how are you?"

[0047] Step 5:

[0048] The server sends the generated response to the terminal. This response is returned to the terminal in the form of text data.

[0049] Step 6:

[0050] The terminal displays the received response on its screen or plays it back as speech using speech synthesis technology. The user can then communicate their intentions to the other party by confirming this response.
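
Steps 3 to 5 above can be sketched in code. The following is a minimal illustration only, not the implementation described in this disclosure: the lookup tables `GESTURE_MEANINGS` and `RESPONSES`, the `MotionData` class, and all function names are hypothetical stand-ins for the gesture recognition model and the natural language generation engine, which the text does not specify.

```python
from dataclasses import dataclass

# Hypothetical gesture labels mapped to meanings. A real system would use a
# trained gesture recognition model instead of this lookup table.
GESTURE_MEANINGS = {"wave": "hello", "thumbs_up": "yes"}

# Hypothetical canned responses standing in for a natural language
# generation engine.
RESPONSES = {"hello": "Hello, how are you?", "yes": "Understood."}


@dataclass
class MotionData:
    """Digital data sent from the terminal to the server (Steps 1 and 2)."""
    gesture_label: str


def recognize_meaning(data: MotionData) -> str:
    """Step 3: stand-in for the gesture recognition model."""
    return GESTURE_MEANINGS.get(data.gesture_label, "unknown")


def generate_response(meaning: str) -> str:
    """Step 4: stand-in for the natural language generation engine."""
    return RESPONSES.get(meaning, "Sorry, I did not understand that.")


def handle_request(data: MotionData) -> str:
    """Steps 3 to 5 on the server: analyze, generate, and return text."""
    return generate_response(recognize_meaning(data))


print(handle_request(MotionData("wave")))  # prints "Hello, how are you?"
```

The terminal side (Step 6) would then display the returned text or pass it to a speech synthesizer.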

[0051] (Example 1)

[0052] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0053] In modern society, smooth communication remains a challenge in situations where voice-based communication is difficult or in environments where voice input is unsuitable. Furthermore, there is a need for a system that can easily recognize gestures, accurately understand their intent, and respond without using infrared cameras or advanced sensors.

[0054] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0055] In this invention, the server includes means for receiving gesture or text data acquired by a user's terminal, means for analyzing the gesture or text data and identifying its meaning, and means for generating natural language using a generative artificial intelligence model based on the identified meaning. This enables the server to understand the user's intent and respond appropriately without relying on voice, and facilitates natural and smooth communication through audiovisual means.

[0056] A "user" refers to an individual who operates the system using gestures or text input.

[0057] A "terminal" refers to an electronic device that receives user input and transmits data to a server.

[0058] A "gesture" is an action a user takes to express their intentions, and it is acquired as digital data for the system to interpret.

[0059] "Text data" refers to character information entered by a user using a keyboard or touchscreen.

[0060] A "server" refers to a central processing unit that receives data from terminals, analyzes it, and generates responses.

[0061] "Analysis" refers to data processing to recognize the meaning of received gesture or text data.

[0062] A "generative artificial intelligence model" refers to an artificial intelligence algorithm that generates natural language responses based on the semantic content of received data.

[0063] "Natural language generation" refers to the technology of creating language in a form that humans can understand based on analysis results.

[0064] A "response sentence" refers to a natural language sentence presented to the user, created by a generative artificial intelligence model.

[0065] "Speech synthesis means" refers to technology that outputs generated response sentences as speech and presents auditory information to the user.

[0066] This invention is realized by using a terminal used by the user and a server that analyzes data and generates a response. The terminal is equipped with devices for inputting user gestures or text, specifically including a camera, touchscreen, and microphone. The terminal converts the user's actions and inputs acquired through these devices into digital data and transmits it to the server.

[0067] The server uses a computer vision library, such as OpenCV, to analyze the received data and recognize gestures. Based on the analyzed gestures and text, a generative artificial intelligence model (e.g., GPT-4®) is used to generate natural language responses.

[0068] The generated response is sent to the terminal, which displays the text on its screen. Furthermore, the generated response can also be played back as audio using a Text-to-Speech (TTS) engine. As a result, even in situations where audio is difficult to use, the user can confirm the response through both sight and sound.

[0069] As a concrete example, consider a prompt: "The user has made a waving gesture. Generate an appropriate natural language response." By inputting this prompt into the AI model, the server generates an appropriate natural language response that is relevant to the situation.
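
The prompt construction described above might look like the following sketch. The function names `build_prompt` and `respond_to_gesture` are hypothetical, and the generative model is passed in as a plain callable so that no particular vendor API is assumed; `echo_model` is a placeholder used only to make the example runnable.

```python
from typing import Callable


def build_prompt(gesture_description: str) -> str:
    """Wrap a recognized gesture in the prompt wording quoted in the text."""
    return (
        f"The user has made {gesture_description}. "
        "Generate an appropriate natural language response."
    )


def respond_to_gesture(gesture_description: str,
                       model: Callable[[str], str]) -> str:
    """Send the prompt to whatever generative model the server uses."""
    return model(build_prompt(gesture_description))


def echo_model(prompt: str) -> str:
    """Placeholder model: a real deployment would call a generative AI model."""
    return f"[model output for: {prompt}]"


print(build_prompt("a waving gesture"))
print(respond_to_gesture("a waving gesture", echo_model))
```

Keeping the model behind a callable lets the recognition step and the prompt wording be tested without network access.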

[0070] Thus, this invention constructs a system that combines a terminal and a server to accurately recognize the user's intent and provide an appropriate response, thereby enabling smooth and natural communication.

[0071] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0072] Step 1:

[0073] The user inputs gestures or text using the terminal's camera or touchscreen. The terminal captures this input as digital data and encodes it as input data. For example, a hand-waving gesture is captured as video data by the camera.

[0074] Step 2:

[0075] The terminal transmits the acquired digital data to the server. The input data is compressed and encrypted and converted into a format that the server can receive. During this process, the terminal uses a data transfer protocol to communicate securely.

[0076] Step 3:

[0077] The server analyzes the received digital data. The server applies computer vision algorithms to recognize gestures from the input data. This analysis identifies that the "waving hand" gesture means "hello."

[0078] Step 4:

[0079] The server generates a natural language response using a generative AI model based on the analysis results. The prompt "The user made a waving gesture. Please generate an appropriate natural language response." is input to the AI model. Through this data processing, the server generates the response "Hello, how are you?".

[0080] Step 5:

[0081] The server sends the generated response to the terminal. The terminal prepares to display the received response on its screen. If using a TTS engine, the terminal performs the process of converting the response into audio data.

[0082] Step 6:

[0083] The terminal displays the response as text using a display device and also plays it back as speech using a speech synthesis device. The user can receive the system's response through both sight and hearing. This operation enables smooth communication even in non-voice environments.
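
Step 2 above mentions compressing the input data and converting it into a format the server can receive. The sketch below shows one possible way to do this with the Python standard library; the JSON envelope layout and the names `pack_input` and `unpack_input` are assumptions, not part of the disclosure. Encryption is not shown here: the sketch assumes it is handled by an encrypted transport such as TLS.

```python
import base64
import json
import zlib
from typing import Tuple


def pack_input(kind: str, payload: bytes) -> bytes:
    """Terminal side: wrap raw input in a JSON envelope and compress it."""
    envelope = {
        "kind": kind,  # hypothetical field, e.g. "gesture_video" or "text"
        "data": base64.b64encode(payload).decode("ascii"),
    }
    return zlib.compress(json.dumps(envelope).encode("utf-8"))


def unpack_input(blob: bytes) -> Tuple[str, bytes]:
    """Server side: decompress the envelope and recover the raw input."""
    envelope = json.loads(zlib.decompress(blob).decode("utf-8"))
    return envelope["kind"], base64.b64decode(envelope["data"])


blob = pack_input("gesture_video", b"\x00\x01frame-bytes")
kind, payload = unpack_input(blob)
print(kind, payload)
```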

[0084] (Application Example 1)

[0085] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0086] In the field of elderly care, there is a need for means to achieve smooth and intuitive communication between care recipients who have difficulty with verbal communication and support staff. In many cases, care recipients are unable to adequately convey their needs or condition, leading to communication breakdowns and misunderstandings, so methods to resolve this are necessary.

[0087] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0088] In this invention, the server includes means for receiving motion data acquired using user input means, means for analyzing the motion data and recognizing its semantic content, and means for generating natural language based on the recognized semantic content. This enables intuitive communication based on the gestures and movements of the person being cared for.

[0089] "User input means" refers to devices or functions that users can directly operate and that are used to acquire motion data.

[0090] "Motion data" refers to information that is digitally represented based on detected user actions and gestures.

[0091] "Means of recognizing semantic content" refers to algorithms and processes for analyzing acquired motion data and understanding the meaning that the data conveys.

[0092] "Means of natural language generation" refers to technologies and systems that generate text in a human-understandable language based on recognized semantic content.

[0093] "Means of outputting response sentences from generated natural language" refers to a system that displays or converts the generated text into speech in order to convey it to the user.

[0094] "Means of converting into acoustic signals and presenting as information" refers to methods of transmitting information to a user as an audio message using technologies and devices that convert text-based information into audio information.

[0095] This invention is a system for facilitating communication between care recipients who have difficulty with verbal communication and support staff in the field of elderly care. The system includes a user input means, a motion data analysis means, a semantic content recognition means, a natural language generation means, and a response sentence output means.

[0096] The user's device is equipped with input devices such as a camera to capture the gestures and movements of the person being cared for. This movement data is sent from the device to the server. The server analyzes the movement data using gesture recognition APIs such as Google® ML Kit. The analyzed data is then interpreted through a natural language processing engine such as Google Cloud Natural Language. Based on the recognized information, an appropriate response sentence is generated.

[0097] The generated response is sent from the server to the terminal, displayed on the terminal's screen, and also presented as an audio signal using a speech synthesis API such as Google Cloud Text-to-Speech. In this way, the care recipient's wishes can be clearly communicated to the staff.

[0098] For example, if a person receiving care indicates their desire for water through gestures, the action is captured by a camera and recognized as a request for water. A response such as "Shall I bring you a drink?" is then generated and played back as audio. This entire process enables quick and accurate communication. Furthermore, a prompt such as "Please enter a hand gesture indicating 'water'" can be used with the AI generation model. Through this prompt, the system accurately generates the intended response.

[0099] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0100] Step 1:

[0101] The terminal captures the user's gestures and movements through its camera. At this stage, the camera acquires video data in real time as input, and this video data is converted into a digital format and prepared as motion data.

[0102] Step 2:

[0103] The terminal sends the acquired motion data to the server. Here, the motion data is sent to the server via the network. The server receives the motion data as input and prepares it for subsequent analysis.

[0104] Step 3:

[0105] The server analyzes the motion data using a gesture recognition algorithm. Specifically, it processes the data using APIs such as Google ML Kit to detect gesture patterns. This process outputs the meaning of the gesture, such as "water request."

[0106] Step 4:

[0107] The server passes the recognized semantic content to a natural language processing engine. Semantic data is used as input, and an appropriate response is output via Google Cloud Natural Language or similar tools. The data processing in this process involves converting semantic content into natural language sentences.

[0108] Step 5:

[0109] The server sends the generated response message to the terminal. In this step, the response message is sent back to the terminal via the network. The terminal receives it and prepares for screen display or audio playback.

[0110] Step 6:

[0111] The terminal displays the received response on its screen and presents it to the user as an audio signal using speech synthesis technology. APIs such as Google Cloud Text-to-Speech are utilized to convert text data into audio data and output it in a format that is easy for the user to understand. This output allows the user's intentions to be conveyed to the support staff.
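
The three server-side stages in Steps 3 to 6 above (recognition, response generation, speech synthesis) can be sketched as a pipeline of interchangeable components. This is an illustrative sketch only: `run_care_pipeline` and the `stub_*` functions are hypothetical, and the stubs stand in for the external services named in the text rather than calling them.

```python
from typing import Callable, Tuple


def run_care_pipeline(
    motion_data: bytes,
    recognize: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Recognize the gesture, generate a response, and synthesize audio."""
    meaning = recognize(motion_data)   # Step 3: e.g. "water request"
    text = generate(meaning)           # Step 4: response sentence
    audio = synthesize(text)           # Step 6: audio for playback
    return text, audio


def stub_recognize(motion_data: bytes) -> str:
    """Stand-in for a gesture recognition service."""
    return "water request" if motion_data == b"drinking-gesture" else "unknown"


def stub_generate(meaning: str) -> str:
    """Stand-in for a natural language generation step."""
    replies = {"water request": "Shall I bring you a drink?"}
    return replies.get(meaning, "Could you show me again?")


def stub_synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech service; returns fake audio bytes."""
    return text.encode("utf-8")


text, audio = run_care_pipeline(
    b"drinking-gesture", stub_recognize, stub_generate, stub_synthesize
)
print(text)  # prints "Shall I bring you a drink?"
```

Passing each stage as a callable keeps the pipeline independent of any one recognition or speech service.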

[0112] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0113] This invention includes a system that recognizes a user's emotions and generates a response based on those emotions. The basic configuration includes a terminal operated by the user and a server that performs analysis, emotion recognition, and natural language generation.

[0114] First, the terminal acquires user gestures and text input. During this process, sensors built into the terminal, such as the camera and microphone, can capture the user's facial expressions and voice tone. The acquired data is then transmitted to the server via the network.

[0115] The server first analyzes the received data as gestures or text input to determine its meaning. Simultaneously, it activates the emotion engine to recognize emotions from the user's input data. For example, if a user smiles, waves, and types "hello," the emotion engine recognizes the emotion as "friendly."

[0116] The server uses a natural language generation engine to generate a response with an appropriate tone and content based on the emotions it recognizes. For example, a possible response might be, "You seem happy! Hello, how are you?" The generated response is then immediately sent to the terminal.

[0117] The terminal displays the received response on its screen or presents it to the user as audio using speech synthesis technology. This allows the user to experience communication that reflects their own emotions. Emotion-based responses enable users to communicate their intentions without misunderstanding and facilitate deeper interactions.

[0118] This invention aims to improve the quality of everyday interactions by enabling two-way communication that takes user emotions into account. This embodiment allows all users to enjoy more personalized communication.

[0119] The following describes the processing flow.

[0120] Step 1:

[0121] The terminal acquires user gestures, facial expressions, and voice through the input device. The camera and microphone capture the user's movements, facial expressions, and voice tone, and the terminal saves these as initial data.

[0122] Step 2:

[0123] The terminal transmits digital data, including the acquired gestures, facial data, and voice tone, to the server. The data is securely transferred over the network in real time.

[0124] Step 3:

[0125] The server receives the transmitted data and analyzes it using a gesture recognition algorithm. Here, it performs meaning analysis of gestures and voice, as well as user facial expressions, to identify the basic intent.

[0126] Step 4:

[0127] The server activates the emotion engine and recognizes the user's emotional state based on the data. For example, if the user is smiling, the emotion engine identifies "joy."

[0128] Step 5:

[0129] The server uses the analysis results and emotion recognition results to generate a response sentence through a natural language generation engine. If the recognized emotion is "joy," it creates a response suited to that emotion, such as "You seem happy! Hello, how are you?"

[0130] Step 6:

[0131] The server sends the generated response sentence to the terminal. The response sentence is sent in text format so that the terminal can present it in real time.

[0132] Step 7:

[0133] The terminal displays the received response on its screen and plays it back as speech using its speech synthesis function. The user can visually and audibly confirm the response, and the communication is completed.
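
The emotion-aware flow in Steps 3 to 5 above can be illustrated with a small rule-based sketch. The rules, tables, and function names here are hypothetical placeholders: the disclosure refers to an emotion engine and a natural language generation engine without specifying their internals, so simple lookups are used purely to show how the recognized meaning and emotion combine.

```python
from typing import Dict

# Hypothetical lookup tables standing in for the recognition and
# generation models.
GESTURE_MEANINGS = {"wave": "hello"}
BASE_RESPONSES = {"hello": "Hello, how are you?"}
EMOTION_PREFIXES = {"joy": "You seem happy! "}


def estimate_emotion(signals: Dict[str, object]) -> str:
    """Step 4: toy emotion engine; a smile is treated as "joy"."""
    return "joy" if signals.get("smiling") else "neutral"


def recognize_meaning(signals: Dict[str, object]) -> str:
    """Step 3: toy gesture analysis based on a gesture label."""
    return GESTURE_MEANINGS.get(str(signals.get("gesture")), "unknown")


def generate_response(meaning: str, emotion: str) -> str:
    """Step 5: combine the base response with an emotion-dependent prefix."""
    base = BASE_RESPONSES.get(meaning, "I see.")
    return EMOTION_PREFIXES.get(emotion, "") + base


def respond(signals: Dict[str, object]) -> str:
    """Run Steps 3 to 5 on one set of input signals."""
    return generate_response(recognize_meaning(signals), estimate_emotion(signals))


print(respond({"gesture": "wave", "smiling": True}))
# prints "You seem happy! Hello, how are you?"
```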

[0134] (Example 2)

[0135] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0136] In modern dialogue systems, it is difficult to generate responses that accurately reflect the user's emotions and intentions. This can lead to unnatural communication and failure to meet user expectations. Therefore, there is a need for systems that can recognize user actions and emotions and respond in a natural manner.

[0137] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0138] In this invention, the server includes means for receiving action information acquired using a user input device, means for analyzing the action information and recognizing its semantic content and emotion, and means for generating a response sentence using a natural language generation model based on the recognized semantic content and emotion. This makes it possible to generate a response that reflects the user's feelings and intentions.

[0139] "User input devices" refer to hardware used by users to input information into a terminal, and specifically include devices such as cameras and microphones.

[0140] "Action information" refers to data such as gestures, voice tone, and facial expressions acquired from the user through an input device.

[0141] "Means of analysis" refers to software or algorithms that use a computer's processing power to extract specific meanings or emotions from received action information.

[0142] "Emotion" refers to data that indicates the user's emotional state, and is information recognized through the analysis of action information.

[0143] A "natural language generation model" refers to artificial intelligence technology that takes input data and converts it into a language format that humans naturally use to create response sentences.

[0144] A "response sentence" is a natural language sentence generated based on recognized semantic content and emotions, and is output information presented to the user.

[0145] A "display device" is a device used to visually present a generated response to the user, and usually refers to a screen or monitor.

[0146] "Audio signal" refers to electrical or digital acoustic data used to transmit generated response sentences to the user as sound.

[0147] "Speech synthesis technology" is a technology that generates speech based on text data to make it sound like a human is speaking.

[0148] This invention provides a system that recognizes a user's emotions and generates a response based on those emotions. The system includes a terminal operated by the user and a server that performs data analysis, emotion recognition, and natural language generation.

[0149] The terminal uses sensors such as a camera and microphone to capture the user's gestures and voice tone. This allows it to collect action information, such as the user smiling, waving, or speaking in a specific tone toward the terminal. This action information is then transmitted to the server via the network.

[0150] The server uses gesture recognition algorithms and speech recognition software to analyze the received action information. The analysis identifies the specific meaning and emotional cues associated with the action. An emotion engine then recognizes the user's emotion from these results. For example, if a smile and friendly words are detected simultaneously, the emotion "friendly" is recognized.

[0151] Next, the server utilizes a generative AI model to generate natural-sounding responses that match the recognized emotions. Pre-configured prompts can be used to generate these responses. A specific example of a prompt is, "Generate a friendly response in a scenario where the user is smiling and waving." Based on this prompt, the generative model will create a response such as, "You look happy! Hi, how are you?"
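The way a pre-configured prompt can be filled in from the recognition results may be sketched as follows. This is only an illustrative sketch: the function name build_prompt, its parameters, and the fallback wording for the case with no recognized action are assumptions introduced here and do not appear in the disclosure.

```python
from typing import Sequence


def build_prompt(emotion: str, actions: Sequence[str]) -> str:
    """Compose a prompt sentence from a recognized emotion and recognized actions."""
    if not actions:
        # Hypothetical fallback wording for the case where no action was recognized.
        scenario = "the user is facing the terminal"
    elif len(actions) == 1:
        scenario = f"the user is {actions[0]}"
    else:
        # Join all but the last action with commas, then append the last with "and".
        scenario = "the user is " + ", ".join(actions[:-1]) + f" and {actions[-1]}"
    return f"Generate a {emotion} response in a scenario where {scenario}."


print(build_prompt("friendly", ["smiling", "waving"]))
# prints: Generate a friendly response in a scenario where the user is smiling and waving.
```

The resulting string would then be passed to whichever generative AI model the server uses; that call is deliberately left out because the disclosure does not fix a particular model interface.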

[0152] The generated response is sent back from the server to the terminal. The terminal either displays this response on its screen or presents it to the user as speech using speech synthesis technology, enabling natural communication. This system allows users to receive responses that reflect their own emotions, resulting in a more satisfying communication experience.

[0153] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0154] Step 1:

[0155] The terminal captures the user's gestures and voice. The user performs actions such as smiling or waving toward the camera, and these actions and the user's voice are captured by the terminal's camera and microphone. As a result, facial expression data and voice tone information are collected as input data.

[0156] Step 2:

[0157] The terminal transmits the acquired action information to the server. The input data, which consists of video and audio data, is encrypted and transferred securely to the server over the Internet Protocol.

[0158] Step 3:

[0159] The server analyzes the transmitted data. It uses a gesture recognition algorithm to identify specific user actions from the video data. It also uses voice analysis software to extract tones and keywords from the voice data. Through these analyses, the user's actions and speech content are clarified, and semantic content such as "greeting" or "friendliness" is output.

[0160] Step 4:

[0161] The server activates the emotion engine based on the recognized semantic content and behavioral information. This analyzes the user's emotions and identifies feelings such as "friendly" or "happy." The emotion recognition model takes the analysis results as input and outputs appropriate emotional information.

[0162] Step 5:

[0163] The server utilizes a generative AI model to generate response sentences that correspond to the recognized emotions. By using pre-configured prompt sentences, it creates natural language response sentences that are appropriate for the emotion. When the prompt sentence "Generate a friendly response in a scenario where the user is smiling and waving and greeting" is entered, it generates a response such as "You look happy! Hello, how are you?"

[0164] Step 6:

[0165] The server sends the generated response to the terminal. This response is transmitted as text or audio data and delivered to the user in real time.

[0166] Step 7:

[0167] The terminal presents the received response to the user. It can be displayed as text on the screen or played back as speech using speech synthesis technology. This allows the user to receive a response that reflects their own emotions, enabling a smoother communication experience.
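The server-side portion of Steps 3 to 5 above can be sketched as a small pipeline. The sketch below is a simplified illustration under stated assumptions: the ActionInfo structure, the rule-based stubs standing in for gesture recognition and the emotion engine, and the prompt wording are all hypothetical, and the generative AI model is passed in as a plain callable rather than a real model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ActionInfo:
    """Action information from the terminal, already reduced to labels (assumption)."""
    gestures: List[str]   # e.g. ["smiling", "waving"]
    keywords: List[str]   # words extracted from the voice data


def recognize_meaning(info: ActionInfo) -> str:
    # Step 3 (stub): a real server would run gesture and voice analysis here.
    if "waving" in info.gestures or "hello" in info.keywords:
        return "greeting"
    return "unknown"


def recognize_emotion(info: ActionInfo, meaning: str) -> str:
    # Step 4 (stub): stands in for the emotion engine.
    if "smiling" in info.gestures and meaning == "greeting":
        return "friendly"
    return "neutral"


def respond(info: ActionInfo, generate: Callable[[str], str]) -> str:
    # Step 5: build a prompt from the recognition results and call the model.
    meaning = recognize_meaning(info)
    emotion = recognize_emotion(info, meaning)
    prompt = f"Generate a {emotion} response to a user whose input means '{meaning}'."
    return generate(prompt)


info = ActionInfo(gestures=["smiling", "waving"], keywords=["hello"])
# An echo function is used in place of a model so the prompt itself is visible.
print(respond(info, generate=lambda prompt: prompt))
```

Passing the model in as a callable keeps the recognition logic testable without a network connection; in practice the callable would wrap the generative AI model described in Step 5.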

[0168] (Application Example 2)

[0169] Next, we will describe Application Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0170] In modern households, communication has become more diverse, and there is a need for systems that can understand subtle emotional changes and respond appropriately. However, conventional systems have been unable to adequately process user input information in accordance with emotions, making it difficult to achieve natural communication that reflects the user's feelings. As a result, smooth interpersonal relationships in daily life have not been fully achieved.

[0171] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0172] In this invention, the server includes means for receiving motion data and emotion data acquired using user input means, means for analyzing the motion data and emotion data and recognizing semantic content and emotional state, and means for generating natural language based on the recognized semantic content and emotional state. This enables communication adapted to the user's emotions.

[0173] "Motion data" refers to information about the user's physical movements and gestures, which is acquired through sensors.

[0174] "Emotion data" refers to information that indicates a user's emotional state, obtained from sources such as facial expressions and tone of voice.

[0175] "Semantic content" refers to the meaning obtained by analyzing the motion data and emotion data acquired from the user.

[0176] "Emotional state" refers to the state of emotions a user is experiencing at a given time, and is identified through data analysis.

[0177] "Natural language generation" is the process by which a computer creates a response sentence in human language based on the results of its analysis.

[0178] A "response sentence" is a natural language sentence generated by the system in response to user input.

[0179] An "acoustic signal" is sound represented as an electrical signal that an electronic device processes and outputs.

[0180] "Tone" refers to the pitch or nuance adjusted in written or spoken language to convey emotions or intentions.

[0181] "Communication" is the process of exchanging information and emotions between people and systems.

[0182] This invention applies to a home-use interactive system that understands emotions and provides adaptive responses. The system includes a user-operated terminal and a server. The terminal acquires the user's motion data and emotion data using sensors such as a camera and microphone. This data is transmitted to the server via a network.

[0183] The server processes the data using software such as the following. Image processing uses OpenCV and other tools to analyze facial expression data. Audio processing uses CMU Sphinx and other tools to read emotions from voice tone. The Google Cloud Emotion API and other tools serve as emotion recognition engines that recognize the user's emotional state. Natural language generation uses a language model such as GPT-3 (registered trademark).

[0184] As a result of this process, the server generates the most appropriate response for the user and sends it to the terminal. The terminal then presents the generated response to the user as audio or text, enabling communication tailored to the user's emotions.

[0185] For example, when a child returns home from school and starts doing homework, the system uses the terminal's camera to read facial expressions and the microphone to analyze the tone of voice. If it determines that the child is feeling tired, the server generates a response message such as, "You've worked so hard today! Shall we take a little break?" and sends it to the terminal. The terminal then plays this message back as audio, prompting the child to take a rest.

[0186] A concrete example of a prompt for a generative AI model is as follows: "Generate an appropriate response when a user speaks to you with a smile." This prompt is used to generate a contextually appropriate response based on the user's actions and emotions.

[0187] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0188] Step 1:

[0189] The terminal uses a camera and microphone to acquire the user's motion data and emotion data. It captures facial expressions as images and voice tone as audio data. The input consists of video data from the camera and audio data from the microphone, and the output is sensor data.

[0190] Step 2:

[0191] The acquired sensor data is sent from the terminal to the server. It is crucial to pass both video and audio data to the server. The input is sensor data, and the output is data transmission to the server via the network.

[0192] Step 3:

[0193] The server uses the OpenCV image processing library to analyze facial expressions from video data and identifies emotional states using an emotion recognition engine. The input is video data, and the output is the emotional state inferred from the user's facial expressions. Specifically, it extracts facial feature points and determines emotions based on them.

[0194] Step 4:

[0195] The server uses the CMU Sphinx speech analysis engine to analyze the tone of the speech data and refines its estimate of the user's emotional state based on the results. The input is speech data, and the output is emotional information based on the speech tone. It analyzes the pitch and emphasis of the speech to determine the emotion.

[0196] Step 5:

[0197] The server integrates the analysis results of facial expressions and voice tone to determine the user's overall emotional state. The input is the analysis results from the previous steps, and the output is the integrated emotional state. The server combines the multiple pieces of emotional information to estimate the most likely emotion.

[0198] Step 6:

[0199] The server uses a natural language generation engine to generate appropriate response sentences based on integrated emotional states and text input from the user. The input is the integrated emotional state and text input, and the output is the response sentence. A process of creating prompt sentences and constructing responses is carried out using a generative AI model.

[0200] Step 7:

[0201] The server sends the generated response to the terminal, which then presents the response to the user as either audio or text. The input is the response, and the output is the message played through the speaker or displayed on the screen. For audio output, the terminal uses speech synthesis technology to utter the response.
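The integration in Step 5 above is not specified in detail. One simple possibility, shown here purely as an assumption, is a weighted average of per-emotion scores from the facial expression analysis and the voice tone analysis, followed by selection of the highest-scoring emotion. The function name fuse_emotions, the score dictionaries, and the weight parameter are hypothetical.

```python
from typing import Dict


def fuse_emotions(face: Dict[str, float], voice: Dict[str, float],
                  face_weight: float = 0.5) -> str:
    """Return the emotion label with the highest weighted-average score."""
    labels = set(face) | set(voice)
    combined = {
        # A label missing from one analysis contributes a score of 0.0 for it.
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    # Iterate over sorted labels so that ties are broken deterministically.
    return max(sorted(combined), key=combined.get)


face_scores = {"tired": 0.6, "happy": 0.3, "neutral": 0.1}
voice_scores = {"tired": 0.7, "neutral": 0.3}
print(fuse_emotions(face_scores, voice_scores))  # prints: tired
```

The weight lets one modality count for more than the other, for example when the camera image is poorly lit; how such a weight would be chosen is outside what the disclosure describes.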

[0202] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0203] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0204] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0205] [Second Embodiment]

[0206] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0207] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0208] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0209] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0210] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0211] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0212] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0213] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0214] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0215] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0216] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0217] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0218] To implement this invention, a user-operated terminal and a server that performs data analysis and response generation are used. The terminal is equipped with an input device that can acquire user gestures or text input and transmits this input data to the server. The server analyzes the data transmitted from the terminal and recognizes the meaning of the gestures and input content.

[0219] As a concrete example, the terminal's camera captures the user's hand movements and sends the gesture as digital data to the server. The server uses a gesture recognition algorithm to analyze this digital data and determine that it means "hello." The server then uses a natural language generation engine to create an appropriate response. For example, it might generate a message like, "Hello, how are you?"

[0220] The generated response is sent from the server to the terminal and displayed on the terminal's screen. Furthermore, by combining this with speech synthesis technology, the response can also be played back as audio. This allows users who have difficulty with voice communication to communicate their intentions to others in real time.

[0221] This invention significantly reduces communication barriers, enabling people to enrich their interactions in daily life. In this way, it realizes a system that allows for natural and smooth conversation without the need for voice.

[0222] The following describes the processing flow.

[0223] Step 1:

[0224] The terminal acquires user gestures and text input through the input device. When the user waves their hand, the terminal's camera captures the movement and saves it as image data.

[0225] Step 2:

[0226] The terminal transmits the acquired gestures and input data to the server via the network. At this time, the user's input is formatted as digital data.

[0227] Step 3:

[0228] The server receives data sent from the terminal and begins analysis using a deep learning algorithm. Here, the server applies a gesture recognition model to identify the meaning of the actions from the received data.

[0229] Step 4:

[0230] The server activates a natural language generation engine based on the analysis results and generates an appropriate response. For example, in response to a gesture meaning "hello," it creates the response "Hello, how are you?"

[0231] Step 5:

[0232] The server sends the generated response to the terminal. This response is returned to the terminal in the form of text data.

[0233] Step 6:

[0234] The terminal displays the received response on its screen or plays it back as speech using speech synthesis technology. The user can then communicate their intentions to the other party by confirming this response.
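The exchange between the terminal and the server in Steps 2 to 5 above can be sketched as a pair of JSON messages. This is a minimal sketch under several assumptions: the message fields, the function names, the gesture vocabulary, and the fallback response are hypothetical, and gesture recognition is replaced by a table lookup on a label that is assumed to have been extracted already.

```python
import json


def encode_request(kind: str, payload: str) -> str:
    """Terminal side: wrap a gesture label or typed text in a JSON message."""
    if kind not in ("gesture", "text"):
        raise ValueError(f"unsupported input kind: {kind}")
    return json.dumps({"kind": kind, "payload": payload})


# Hypothetical gesture vocabulary; a real server would apply a trained model.
GESTURE_MEANINGS = {"wave": "hello"}


def handle_request(message: str) -> str:
    """Server side: recover the meaning and return a JSON response message."""
    request = json.loads(message)
    if request["kind"] == "gesture":
        meaning = GESTURE_MEANINGS.get(request["payload"], "unknown")
    else:
        meaning = request["payload"]
    # Placeholder for the natural language generation engine of Step 4.
    if meaning == "hello":
        text = "Hello, how are you?"
    else:
        text = "Could you say that again?"  # hypothetical fallback
    return json.dumps({"meaning": meaning, "response": text})


reply = json.loads(handle_request(encode_request("gesture", "wave")))
print(reply["response"])  # prints: Hello, how are you?
```

Returning the response as text data matches Step 5; the terminal can then either display the "response" field or hand it to a speech synthesis function as in Step 6.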

[0235] (Example 1)

[0236] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0237] In modern society, smooth communication remains difficult in situations where voice-based communication is hard or in environments where voice input is unsuitable. Furthermore, there is a need for a system that can easily recognize gestures, accurately understand their intent, and respond without using infrared cameras or advanced sensors.

[0238] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0239] In this invention, the server includes means for receiving gesture or text data acquired by a user's terminal, means for analyzing the gesture or text data and identifying its meaning, and means for generating natural language using a generative artificial intelligence model based on the identified meaning. This enables the server to understand the user's intent and respond appropriately without relying on voice, and facilitates natural and smooth communication through audiovisual means.

[0240] A "user" refers to an individual who operates the system using gestures or text input.

[0241] A "terminal" refers to an electronic device that receives user input and transmits data to a server.

[0242] A "gesture" is an action a user takes to express their intentions, and it is acquired as digital data for the system to interpret.

[0243] "Text data" refers to character information entered by a user using a keyboard or touchscreen.

[0244] A "server" refers to a central processing unit that receives data from terminals, analyzes it, and generates responses.

[0245] "Analysis" refers to data processing to recognize the meaning of received gesture or text data.

[0246] A "generative artificial intelligence model" refers to an artificial intelligence algorithm that generates natural language responses based on the semantic content of received data.

[0247] "Natural language generation" refers to the technology of creating language in a form that humans can understand based on analysis results.

[0248] A "response sentence" refers to a natural language sentence presented to the user, created by a generative artificial intelligence model.

[0249] "Speech synthesis means" refers to technology that outputs generated response sentences as speech and presents auditory information to the user.

[0250] This invention is realized by using a terminal used by the user and a server that analyzes data and generates a response. The terminal is equipped with devices for inputting user gestures or text, specifically including a camera, touchscreen, and microphone. The terminal converts the user's actions and inputs acquired through these devices into digital data and transmits it to the server.

[0251] The server uses a computer vision library, for example OpenCV, to analyze the received data and recognize gestures. Based on the analyzed gestures and text, a generative artificial intelligence model (e.g., GPT-4) is used to generate natural language responses.

[0252] The generated response is sent to the terminal, which displays the text on its screen. Furthermore, the generated response can also be played back as audio using a Text-to-Speech (TTS) engine. As a result, even in situations where audio is difficult to use, the user can confirm the response through both sight and sound.

[0253] As a concrete example, consider a prompt: "The user has made a waving gesture. Generate an appropriate natural language response." By inputting this prompt into the AI model, the server generates an appropriate natural language response that is relevant to the situation.

[0254] Thus, this invention constructs a system that combines a terminal and a server to accurately recognize the user's intent and provide an appropriate response, thereby enabling smooth and natural communication.

[0255] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0256] Step 1:

[0257] The user inputs gestures or text using the terminal's camera or touchscreen. The terminal captures this input and encodes it as digital input data. For example, a hand-waving gesture is captured as video data by the camera.

[0258] Step 2:

[0259] The terminal transmits the acquired digital data to the server. The input data is compressed and encrypted and converted into a format that the server can receive. During this process, the terminal uses a data transfer protocol to communicate securely.

[0260] Step 3:

[0261] The server analyzes the received digital data. The server applies computer vision algorithms to recognize gestures from the input data. This analysis identifies that the "waving hand" gesture means "hello."

[0262] Step 4:

[0263] The server generates a natural language response using a generative AI model based on the analysis results. The prompt "The user made a waving gesture. Please generate an appropriate natural language response." is input to the AI model. Through this data processing, the server generates the response "Hello, how are you?".

[0264] Step 5:

[0265] The server sends the generated response to the terminal. The terminal prepares to display the received response on its screen. If using a TTS engine, the terminal performs the process of converting the response into audio data.

[0266] Step 6:

[0267] The terminal displays the response as text using a display device and also plays it back as speech using a speech synthesis device. The user can receive the system's response through both sight and hearing. This operation enables smooth communication even in non-voice environments.
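The compression and format conversion mentioned in Step 2 above can be sketched with the Python standard library. This sketch covers only compression and framing; it assumes that encryption is provided separately by the transport (for example, TLS), which is an assumption made here and not something the disclosure states. The function names and message fields are hypothetical.

```python
import base64
import json
import zlib
from typing import Tuple


def pack_input(kind: str, data: bytes) -> bytes:
    """Terminal side: compress raw input data and frame it as a JSON message."""
    compressed = zlib.compress(data)
    message = {
        "kind": kind,  # e.g. "gesture" for video data, "text" for typed input
        "data": base64.b64encode(compressed).decode("ascii"),
    }
    return json.dumps(message).encode("utf-8")


def unpack_input(blob: bytes) -> Tuple[str, bytes]:
    """Server side: recover the input kind and the original raw data."""
    message = json.loads(blob.decode("utf-8"))
    return message["kind"], zlib.decompress(base64.b64decode(message["data"]))


frame = b"\x00\x01" * 100  # stand-in for captured video data
kind, restored = unpack_input(pack_input("gesture", frame))
print(kind, restored == frame)  # prints: gesture True
```

Base64 is used so that the binary data fits inside a JSON text message; a binary framing format would avoid that overhead but is not needed for illustration.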

[0268] (Application Example 1)

[0269] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0270] In the field of elderly care, there is a need for means to achieve smooth and intuitive communication between care recipients who have difficulty with verbal communication and support staff. In many cases, care recipients are unable to adequately convey their needs or condition, leading to communication breakdowns and misunderstandings, so methods to resolve this are necessary.

[0271] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0272] In this invention, the server includes means for receiving motion data acquired using user input means, means for analyzing the motion data and recognizing its semantic content, and means for generating natural language based on the recognized semantic content. This enables intuitive communication based on the gestures and movements of the person being cared for.

[0273] "User input means" refers to devices or functions that users can directly operate in order to acquire motion data.

[0274] "Motion data" refers to digital information representing detected user actions and gestures.

[0275] "Means of recognizing semantic content" refers to algorithms and processes for analyzing acquired motion data and understanding the meaning that the data conveys.

[0276] "Means of natural language generation" refers to technologies and systems that generate text in a human-understandable language based on recognized semantic content.

[0277] "Means of outputting response sentences from generated natural language" refers to a system that displays or converts the generated text into speech in order to convey it to the user.

[0278] "Means of converting into acoustic signals and presenting as information" refers to methods of transmitting information to a user as an audio message using technologies and devices that convert text-based information into audio information.

[0279] This invention is a system for facilitating communication between care recipients who have difficulty with verbal communication and support staff in the field of elderly care. The system includes a user input means, a motion data analysis means, a semantic content recognition means, a natural language generation means, and a response sentence output means.

[0280] The user's terminal is equipped with input devices such as a camera to capture the gestures and actions of the care recipient. This motion data is sent from the terminal to the server. The server uses a gesture recognition API such as Google ML Kit to analyze the motion data. A natural language processing engine such as Google Cloud Natural Language then recognizes the semantic content of the analyzed data. Based on the recognized information, an appropriate response text is generated.

[0281] The generated response text is sent from the server to the terminal, displayed on the terminal's display, and presented as an acoustic signal using a text-to-speech API such as Google Cloud Text-to-Speech. In this way, it becomes possible to convey the wishes of the care recipient to the staff in an easy-to-understand manner.

[0282] As a specific example, when the care recipient indicates the intention of "wanting to drink water" through a gesture, the motion is captured by the camera and recognized as the semantic content of "water supply request". Then, a response text such as "May I bring you a drink?" is generated and played as audio. Through this series of processes, rapid and accurate communication can be achieved. Also, for the generative AI model, a prompt text such as "Please input the hand motion that shows 'water' in the gesture." can be used. Through this prompt, the system accurately generates the intended response.

[0283] The flow of the specific processing in Application Example 1 will be explained using Figure 12.

[0284] Step 1:

[0285] The terminal captures the user's gestures and actions through the camera. At this stage, the camera acquires video data in real time as input, converts the video data into a digital format, and prepares it as motion data.

[0286] Step 2:

[0287] The terminal sends the acquired motion data to the server via the network. The server receives the motion data as input and prepares for subsequent analysis.

[0288] Step 3:

[0289] The server analyzes the motion data using a gesture recognition algorithm. Specifically, APIs such as Google ML Kit are used to process the data and detect the gesture pattern. Through this process, semantic content such as "water supply request" is output for the gesture.

[0290] Step 4:

[0291] The server passes the recognized semantic content to the natural language processing engine. Semantic content data is used as input, and an appropriate response sentence is output through Google Cloud Natural Language etc. The data processing in this process is the conversion from semantic content to natural language sentences.

[0292] Step 5:

[0293] The server sends the generated response sentence to the terminal. In this step, the response sentence is sent back to the terminal through the network. The terminal receives this and prepares for screen display or voice playback.

[0294] Step 6:

[0295] The terminal displays the received response sentence on the display and presents it to the user as an acoustic signal using speech synthesis technology. APIs such as Google Cloud Text-to-Speech are utilized to convert the text data into voice data and output it in an easily understandable form for the user. Through this output, the user's intention is conveyed to the support staff.
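Steps 3 to 6 above can be illustrated with a table-driven sketch. The tables below are hypothetical stand-ins for the gesture recognition and natural language processing services named in the text, and the pattern name "drinking_motion" is invented for illustration; only the semantic content "water supply request" and the response sentence come from the example above. The display and speech outputs are modeled as plain callables.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical lookup tables standing in for Steps 3 and 4.
SEMANTIC_CONTENT: Dict[str, str] = {"drinking_motion": "water supply request"}
RESPONSES: Dict[str, str] = {"water supply request": "May I bring you a drink?"}


def respond_to_gesture(pattern: str) -> Optional[str]:
    """Map a detected gesture pattern to a response sentence, if one is known."""
    meaning = SEMANTIC_CONTENT.get(pattern)
    if meaning is None:
        return None  # unrecognized gesture pattern
    return RESPONSES.get(meaning)


def present(sentence: str, outputs: List[Callable[[str], None]]) -> None:
    """Step 6: deliver the same sentence to every output, e.g. display and speech."""
    for output in outputs:
        output(sentence)


shown: List[str] = []   # stands in for the display
spoken: List[str] = []  # stands in for the speech synthesis output
sentence = respond_to_gesture("drinking_motion")
if sentence is not None:
    present(sentence, [shown.append, spoken.append])
print(shown, spoken)
```

In a real terminal, the second callable would wrap a text-to-speech service; keeping both outputs behind the same interface reflects the text's statement that the response is displayed and also presented as an acoustic signal.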

[0296] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0297] This invention includes a system that recognizes a user's emotions and generates a response based on those emotions. The basic configuration includes a terminal operated by the user and a server that performs analysis, emotion recognition, and natural language generation.

[0298] First, the terminal acquires user gestures and text input. During this process, sensors such as the camera and microphone built into the terminal can also capture the user's facial expressions and voice tone. The acquired data is then transmitted to the server via the network.

[0299] The server first analyzes the received data as gestures or text input to determine its meaning. Simultaneously, it activates the emotion engine to recognize emotions from the user's input data. For example, if a user smiles, waves, and types "hello," the emotion engine recognizes the emotion as "friendly."

[0300] The server uses a natural language generation engine to generate a response with an appropriate tone and content based on the emotions it recognizes. For example, a possible response might be, "You seem happy! Hello, how are you?" The generated response is then immediately sent to the terminal.

[0301] The terminal displays the received response on its screen or presents it to the user as audio using speech synthesis technology. This allows the user to experience communication that reflects their own emotions. Emotion-based responses help users communicate their intentions without misunderstanding and facilitate deeper interactions.

[0302] The present invention aims to improve the quality of daily interactions by realizing two-way communication that takes into account the emotions of users. With this embodiment, all users can enjoy more personalized communication.

[0303] The following describes the process flow.

[0304] Step 1:

[0305] The terminal acquires gestures, facial expressions, and voices of the user through input devices. The necessary cameras and microphones capture the user's movements, expressions, and voice tones and save them as initial data.

[0306] Step 2:

[0307] The terminal transmits digital data including the acquired gestures, facial data, voice tones, etc. to the server. The data is securely transferred in real-time via the network.

[0308] Step 3:

[0309] The server receives the transmitted data and analyzes it using a gesture recognition algorithm. Here, it analyzes the meaning of the gestures and voice, together with the user's facial expressions, to identify the basic intention.

[0310] Step 4:

[0311] The server activates the emotion engine and recognizes the user's emotional state based on the data. For example, when the user is smiling, the emotion engine identifies "joy".

[0312] Step 5:

[0313] The server uses the analysis results and emotion recognition results to generate a response sentence through a natural language generation engine. If the recognized emotion is "joy," it creates a response suited to that emotion, such as "You seem happy! Hello, how are you?"

[0314] Step 6:

[0315] The server sends the generated response message to the terminal. The response message is sent in text format so that the terminal can present it in real time.

[0316] Step 7:

[0317] The terminal displays the received response on its screen and plays it back as speech using its speech synthesis function. The user can visually and audibly confirm the response, and the communication is completed.
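The exchange in Steps 2 and 6 above can be pictured as two small messages passed between the terminal and the server. The following Python sketch is only an illustration under stated assumptions: the class names, the field names, and the use of JSON are hypothetical, and short text labels stand in for the video and audio data that the terminal would actually transmit.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TerminalInput:
    """Step 2 payload; field names are hypothetical."""
    gesture: str      # label standing in for captured gesture video
    expression: str   # label standing in for captured facial data
    voice_tone: str   # label standing in for captured voice tone


@dataclass
class ServerResponse:
    """Step 6 payload: the response sentence in text format."""
    text: str


def encode_input(message: TerminalInput) -> str:
    """Serialize the terminal's input for transmission to the server."""
    return json.dumps(asdict(message))


def decode_input(raw: str) -> TerminalInput:
    """Rebuild the terminal's input on the server side."""
    return TerminalInput(**json.loads(raw))


def encode_response(message: ServerResponse) -> str:
    """Serialize the server's response for transmission to the terminal."""
    return json.dumps(asdict(message))


def decode_response(raw: str) -> ServerResponse:
    """Rebuild the server's response on the terminal side."""
    return ServerResponse(**json.loads(raw))
```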

[0318] (Example 2)

[0319] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0320] In modern dialogue systems, it is difficult to generate responses that accurately reflect the user's emotions and intentions. This can lead to unnatural communication and failure to meet user expectations. Therefore, there is a need for systems that can recognize user actions and emotions and respond in a natural manner.

[0321] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0322] In this invention, the server includes means for receiving action information acquired using a user input device, means for analyzing the action information and recognizing its semantic content and emotion, and means for generating a response sentence using a natural language generation model based on the recognized semantic content and emotion. This makes it possible to generate a response that reflects the user's feelings and intentions.

[0323] "User input devices" refer to hardware used by users to input information into a terminal, and specifically include devices such as cameras and microphones.

[0324] "Action information" refers to data such as gestures, voice tone, and facial expressions acquired from the user through an input device.

[0325] "Means for analyzing" refers to software or algorithms that use a computer's processing power to extract specific meanings or emotions from the received action information.

[0326] "Emotion" refers to data that indicates the user's emotional state, and is information recognized through the analysis of the action information.

[0327] A "natural language generation model" refers to artificial intelligence technology that takes input data and converts it into a language format that humans naturally use to create response sentences.

[0328] A "response sentence" is a natural language sentence generated based on recognized semantic content and emotions, and is output information presented to the user.

[0329] A "display device" is a device used to visually present a generated response to the user, and usually refers to a screen or monitor.

[0330] "Audio signal" refers to electrical or digital acoustic data used to transmit generated response sentences to the user as sound.

[0331] "Speech synthesis technology" is a technology that generates speech based on text data to make it sound like a human is speaking.

[0332] This invention provides a system that recognizes a user's emotions and generates a response based on those emotions. The system includes a terminal operated by the user and a server that performs data analysis, emotion recognition, and natural language generation.

[0333] The device uses sensors such as a camera and microphone to capture the user's gestures and voice tone. This allows it to collect information about actions such as the user smiling, waving, or speaking in a specific tone towards the device. This action information is then transmitted to a server via the network.

[0334] The server uses gesture recognition algorithms and speech recognition software to analyze the received action information. The analysis identifies the specific meaning associated with each action. An emotion engine then recognizes the user's emotion based on the identified meaning and the action information. For example, if a smile and friendly words are detected simultaneously, the emotion "friendly" is recognized.

[0335] Next, the server utilizes a generative AI model to generate natural-sounding responses that match the recognized emotions. Pre-configured prompts can be used to generate these responses. A specific example of a prompt is, "Generate a friendly response in a scenario where the user is smiling and waving." Based on this prompt, the generative model will create a response such as, "You look happy! Hi, how are you?"

[0336] The generated response is sent back from the server to the terminal. The terminal either displays this response on its screen or presents it to the user as speech using speech synthesis technology, enabling natural communication. This system allows users to receive responses that reflect their own emotions, resulting in a more satisfying communication experience.

[0337] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0338] Step 1:

[0339] The device captures the user's gestures and voice. The user performs actions such as smiling or waving towards the camera, and these actions and voices are captured by the device's camera and microphone. As a result, facial expression data and voice tone information are collected as input data.

[0340] Step 2:

[0341] The terminal transmits the acquired action information to the server. The input data, which includes video and audio data, is encrypted and transferred securely to the server over an Internet Protocol network.

[0342] Step 3:

[0343] The server analyzes the transmitted data. It uses a gesture recognition algorithm to identify specific user actions from the video data. It also uses voice analysis software to extract tones and keywords from the voice data. Through these analyses, the user's actions and speech content are clarified, and semantic content such as "greeting" or "friendliness" is output.

[0344] Step 4:

[0345] The server activates the emotion engine based on the recognized semantic content and behavioral information. This analyzes the user's emotions and identifies feelings such as "friendly" or "happy." The emotion recognition model takes the analysis results as input and outputs appropriate emotional information.

[0346] Step 5:

[0347] The server utilizes a generative AI model to generate response sentences that correspond to the recognized emotions. By using pre-configured prompt sentences, it creates natural language response sentences that are appropriate for the emotion. When the prompt sentence "Generate a friendly response in a scenario where the user is smiling and waving and greeting" is entered, it generates a response such as "You look happy! Hello, how are you?"

[0348] Step 6:

[0349] The server sends the generated response to the terminal. This response is transmitted as text or audio data and delivered to the user in real time.

[0350] Step 7:

[0351] The device presents the received response to the user. It can be displayed as text on the screen or played back as speech using speech synthesis technology. This allows the user to receive a response that reflects their own emotions, enabling a smoother communication experience.
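The pre-configured prompt sentence used in Step 5 above can be assembled from the recognized emotion and the observed actions. The following Python sketch shows one possible way to do so. The function name and the template wording are assumptions modeled on the example prompt in the text, and the call to the generative AI model itself is deliberately omitted.

```python
from typing import Sequence


def build_prompt(emotion: str, observed_actions: Sequence[str]) -> str:
    """Build a prompt sentence from a recognized emotion and observed actions."""
    if not emotion or not observed_actions:
        raise ValueError("an emotion and at least one action are required")
    # Simplistic article choice; adequate for labels such as "friendly".
    article = "an" if emotion[0].lower() in "aeiou" else "a"
    scenario = " and ".join(observed_actions)
    return (
        f"Generate {article} {emotion} response in a scenario "
        f"where the user is {scenario}."
    )
```

For instance, `build_prompt("friendly", ["smiling", "waving", "greeting"])` yields the prompt sentence quoted in Step 5, followed by a period.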

[0352] (Application Example 2)

[0353] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0354] In modern households, communication has become more diverse, and there is a need for systems that can understand subtle emotional changes and respond appropriately. However, conventional systems have been unable to adequately process user input information in accordance with emotions, making it difficult to achieve natural communication that reflects the user's feelings. As a result, smooth interpersonal relationships in daily life have not been fully achieved.

[0355] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0356] In this invention, the server includes means for receiving motion data and emotion data acquired using user input means, means for analyzing the motion data and emotion data and recognizing semantic content and emotional state, and means for generating natural language based on the recognized semantic content and emotional state. This enables communication adapted to the user's emotions.

[0357] "Motion data" refers to information about the user's physical movements and gestures, which is acquired through sensors.

[0358] "Emotion data" refers to information that indicates a user's emotional state, obtained from sources such as their facial expressions and tone of voice.

[0359] "Semantic content" refers to the meaning obtained by analyzing the motion data and emotion data acquired from the user.

[0360] "Emotional state" refers to the state of emotions a user is experiencing at a given time, and is identified through data analysis.

[0361] "Natural language generation" is the process by which a computer creates a response sentence in human language based on the results of its analysis.

[0362] A "response sentence" is a natural language sentence generated by the system in response to user input.

[0363] An "acoustic signal" is a representation of sound as an electrical signal that is processed and output by an electronic device.

[0364] "Tone" refers to the pitch or nuance adjusted in written or spoken language to convey emotions or intentions.

[0365] "Communication" is the process of exchanging information and emotions between people and systems.

[0366] This invention applies to a home-use interactive system that understands emotions and provides adaptive responses. The system includes a user-operated terminal and a server. The terminal acquires the user's motion data and emotion data using sensors such as a camera and microphone. This data is transmitted to the server via a network.

[0367] The server processes the data using, for example, the following software. Image processing uses OpenCV or similar tools to analyze facial expression data. Audio processing uses CMU Sphinx or similar tools to read emotions from voice tone. An emotion recognition engine such as the Google Cloud Emotion API is used to recognize the user's emotional state. A language model such as GPT-3 is used for natural language generation.

[0368] As a result of this process, the server generates the most appropriate response for the user and sends it to the terminal. The terminal then presents the generated response to the user as audio or text, enabling communication tailored to the user's emotions.

[0369] For example, when a child returns home from school and starts doing homework, the system uses the device's camera to read facial expressions and the microphone to analyze the tone of voice. If it determines that the child is feeling tired, the server generates a response message such as, "You've worked so hard today! Shall we take a little break?" and sends it to the device. The device then plays this message back as audio, prompting the user to take an appropriate rest.

[0370] A concrete example of a prompt for a generative AI model is as follows: "Generate an appropriate response when a user speaks to you with a smile." This prompt is used to generate a contextually appropriate response based on the user's actions and emotions.

[0371] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0372] Step 1:

[0373] The device uses a camera and microphone to acquire user behavior and emotional data. It captures facial expressions as images and voice tone as audio data. Input consists of video data from the camera and audio data from the microphone, which are output as sensor data.

[0374] Step 2:

[0375] The acquired sensor data is sent from the terminal to the server. It is crucial to pass both video and audio data to the server. The input is sensor data, and the output is data transmission to the server via the network.

[0376] Step 3:

[0377] The server uses the OpenCV image processing library to analyze facial expressions from video data and identifies emotional states using an emotion recognition engine. The input is video data, and the output is the emotional state inferred from the user's facial expressions. Specifically, it extracts facial feature points and determines emotions based on them.

[0378] Step 4:

[0379] The server uses the CMU Sphinx speech analysis engine to analyze the tone of the speech data and uses the results to refine its estimate of the user's emotional state. The input is speech data, and the output is emotional information based on the speech tone. It analyzes the pitch and emphasis of the speech to determine the emotion.

[0380] Step 5:

[0381] The server integrates the analysis results of facial expressions and voice tone to determine the user's overall emotional state. The input is the analysis result from the previous step, and the output is the integrated emotional state. It performs the operation of integrating multiple emotional information to estimate the most likely emotion.

[0382] Step 6:

[0383] The server uses a natural language generation engine to generate appropriate response sentences based on integrated emotional states and text input from the user. The input is the integrated emotional state and text input, and the output is the response sentence. A process of creating prompt sentences and constructing responses is carried out using a generative AI model.

[0384] Step 7:

[0385] The server sends the generated response to the terminal, which then presents the response to the user as either audio or text. The input is the response, and the output is the message played back by speech synthesis or displayed on the screen. The terminal uses speech synthesis technology to speak the response aloud.
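Step 5 above states only that the facial and voice results are integrated to estimate the most likely emotion, without fixing a method. The following Python sketch shows one possible integration, assuming each analysis yields a score per emotion label and that the two are combined by a weighted average. The function name, the score format, and the default weight are assumptions made for illustration.

```python
from typing import Dict, Tuple


def fuse_emotions(
    face_scores: Dict[str, float],
    voice_scores: Dict[str, float],
    face_weight: float = 0.5,
) -> Tuple[str, float]:
    """Combine per-emotion scores from face and voice; return the top emotion."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be between 0 and 1")
    labels = set(face_scores) | set(voice_scores)
    if not labels:
        raise ValueError("at least one emotion score is required")
    voice_weight = 1.0 - face_weight
    # A label missing from one modality contributes a score of 0 there.
    combined = {
        label: face_weight * face_scores.get(label, 0.0)
        + voice_weight * voice_scores.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    best = max(sorted(combined), key=combined.__getitem__)
    return best, combined[best]
```

With face scores of 0.6 for "tired" and 0.4 for "joy", voice scores of 0.8 and 0.2, and equal weights, the combined score for "tired" is about 0.7, so "tired" is selected.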

[0386] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0387] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0388] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0389] [Third Embodiment]

[0390] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0391] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0392] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0393] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0394] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0395] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0396] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0397] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0398] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0399] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0400] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0401] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0402] To implement this invention, a user-operated terminal and a server that performs data analysis and response generation are used. The terminal is equipped with an input device that can acquire user gestures or text input and transmits this input data to the server. The server analyzes the data transmitted from the terminal and recognizes the meaning of the gestures and input content.

[0403] As a concrete example, the device's camera captures the user's hand movements and sends the gesture as digital data to the server. The server uses a gesture recognition algorithm to analyze this digital data and determine that it means "hello." The server then uses a natural language generation engine to create an appropriate response. For example, it might generate a message like, "Hello, how are you?"

[0404] The generated response is sent from the server to the terminal and displayed on the terminal's screen. Furthermore, by combining this with speech synthesis technology, the response can also be played back as audio. This allows users who have difficulty with voice communication to communicate their intentions to others in real time.

[0405] This invention significantly reduces communication barriers, enabling people to enrich their interactions in daily life. In this way, it realizes a system that allows for natural and smooth conversation without the need for voice.

[0406] The following describes the processing flow.

[0407] Step 1:

[0408] The device acquires user gestures and text input through the input device. When the user waves their hand, the device's camera captures the movement and saves it as image data.

[0409] Step 2:

[0410] The terminal transmits the acquired gesture and input data to the server via the network. At this time, the user's input is formatted as digital data.

[0411] Step 3:

[0412] The server receives data sent from the terminal and begins analysis using a deep learning algorithm. Here, the server applies a gesture recognition model to identify the meaning of the actions from the received data.

[0413] Step 4:

[0414] The server activates a natural language generation engine based on the analysis results and generates an appropriate response. For example, in response to a gesture meaning "hello," it creates the response "Hello, how are you?"

[0415] Step 5:

[0416] The server sends the generated response to the terminal. This response is returned to the terminal in the form of text data.

[0417] Step 6:

[0418] The terminal displays the received response on its screen or plays it back as speech using speech synthesis technology. The user can then communicate their intentions to the other party by confirming this response.

[0419] (Example 1)

[0420] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0421] In modern society, there are challenges in smooth communication in situations where voice-based communication is difficult or in environments where voice input is unsuitable. Furthermore, there is a need for a system that can easily recognize gestures, accurately understand their intent, and respond without using infrared cameras or advanced sensors.

[0422] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0423] In this invention, the server includes means for receiving gesture or text data acquired by a user's terminal, means for analyzing the gesture or text data and identifying its meaning, and means for generating natural language using a generative artificial intelligence model based on the identified meaning. This enables the server to understand the user's intent and respond appropriately without relying on voice, and facilitates natural and smooth communication through audiovisual means.

[0424] A "user" refers to an individual who operates the system using gestures or text input.

[0425] A "terminal" refers to an electronic device that receives user input and transmits data to a server.

[0426] A "gesture" is an action a user takes to express their intentions, and it is acquired as digital data for the system to interpret.

[0427] "Text data" refers to character information entered by a user using a keyboard or touchscreen.

[0428] A "server" refers to a central processing unit that receives data from terminals, analyzes it, and generates responses.

[0429] "Analysis" refers to data processing to recognize the meaning of received gesture or text data.

[0430] A "generative artificial intelligence model" refers to an artificial intelligence algorithm that generates natural language responses based on the semantic content of received data.

[0431] "Natural language generation" refers to the technology of creating language in a form that humans can understand based on analysis results.

[0432] A "response sentence" refers to a natural language sentence presented to the user, created by a generative artificial intelligence model.

[0433] "Speech synthesis means" refers to technology that outputs generated response sentences as speech and presents auditory information to the user.

[0434] This invention is realized by using a terminal used by the user and a server that analyzes data and generates a response. The terminal is equipped with devices for inputting user gestures or text, specifically including a camera, touchscreen, and microphone. The terminal converts the user's actions and inputs acquired through these devices into digital data and transmits it to the server.

[0435] The server uses a computer vision library to analyze the received data and recognize gestures. OpenCV is a possible example. Based on the analyzed gestures and text, a generative artificial intelligence model (e.g., GPT-4) is used to generate natural language responses.

[0436] The generated response is sent to the terminal, which displays the text on its screen. Furthermore, the generated response can also be played back as audio using a Text-to-Speech (TTS) engine. As a result, even in situations where audio is difficult to use, the user can confirm the response through both sight and sound.

[0437] As a concrete example, consider a prompt: "The user has made a waving gesture. Generate an appropriate natural language response." By inputting this prompt into the AI model, the server generates an appropriate natural language response that is relevant to the situation.

[0438] Thus, this invention constructs a system that combines a terminal and a server to accurately recognize the user's intent and provide an appropriate response, thereby enabling smooth and natural communication.

[0439] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0440] Step 1:

[0441] The user inputs gestures or text using the device's camera or touchscreen. The device captures this input as digital data and encodes it as input data. For example, a hand-waving gesture is captured as video data by the camera.

[0442] Step 2:

[0443] The terminal transmits the acquired digital data to the server. The input data is compressed and encrypted and converted into a format that the server can receive. During this process, the terminal uses a data transfer protocol to communicate securely.

[0444] Step 3:

[0445] The server analyzes the received digital data. The server applies computer vision algorithms to recognize gestures from the input data. This analysis identifies that the "waving hand" gesture means "hello."

[0446] Step 4:

[0447] The server generates a natural language response using a generative AI model based on the analysis results. The prompt "The user made a waving gesture. Please generate an appropriate natural language response." is input to the AI model. Through this data processing, the server generates the response "Hello, how are you?".

[0448] Step 5:

[0449] The server sends the generated response to the terminal. The terminal prepares to display the received response on its screen. If using a TTS engine, the terminal performs the process of converting the response into audio data.

[0450] Step 6:

[0451] The terminal displays the response as text using a display device and also plays it back as speech using a speech synthesis device. The user can receive the system's response through both sight and hearing. This operation enables smooth communication even in non-voice environments.
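Step 2 above describes the input data being compressed and converted into a format that the server can receive. The following Python sketch illustrates that packaging under stated assumptions: the JSON envelope, its field names, and the function names are hypothetical, and encryption is assumed to be provided by the transport layer (for example TLS) rather than shown here.

```python
import base64
import json
import zlib
from typing import Tuple


def pack_input(kind: str, payload: bytes) -> bytes:
    """Compress captured input and wrap it in a JSON envelope for sending."""
    compressed = zlib.compress(payload)
    envelope = {
        "kind": kind,  # hypothetical tag, e.g. "gesture_video" or "text"
        "data": base64.b64encode(compressed).decode("ascii"),
    }
    return json.dumps(envelope).encode("utf-8")


def unpack_input(message: bytes) -> Tuple[str, bytes]:
    """Reverse pack_input on the server side."""
    envelope = json.loads(message.decode("utf-8"))
    compressed = base64.b64decode(envelope["data"])
    return envelope["kind"], zlib.decompress(compressed)
```

Base64 is used here only so that binary video data can travel inside a text envelope. A binary framing format would avoid its size overhead.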

[0452] (Application Example 1)

[0453] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0454] In the field of elderly care, there is a need for means to achieve smooth and intuitive communication between care recipients who have difficulty with verbal communication and support staff. In many cases, care recipients are unable to adequately convey their needs or condition, leading to communication breakdowns and misunderstandings, so methods to resolve this are necessary.

[0455] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0456] In this invention, the server includes means for receiving motion data acquired using user input means, means for analyzing the motion data and recognizing its semantic content, and means for generating natural language based on the recognized semantic content. This enables intuitive communication based on the gestures and movements of the person being cared for.

[0457] "User input means" refers to devices or functions that the user can operate directly in order to acquire motion data.

[0458] "Motion data" refers to information that digitally represents detected user actions and gestures.

[0459] "Means of recognizing semantic content" refers to algorithms and processes for analyzing the acquired motion data and understanding the meaning that the data conveys.

[0460] "Means of natural language generation" refers to technologies and systems that generate text in a human-understandable language based on recognized semantic content.

[0461] "Means of outputting response sentences from generated natural language" refers to a system that displays or converts the generated text into speech in order to convey it to the user.

[0462] "Means of converting into acoustic signals and presenting as information" refers to methods of transmitting information to a user as an audio message using technologies and devices that convert text-based information into audio information.

[0463] This invention is a system for facilitating communication between care recipients who have difficulty with verbal communication and support staff in the field of elderly care. The system includes a user input means, a motion data analysis means, a semantic content recognition means, a natural language generation means, and a response sentence output means.

[0464] The terminal is equipped with input devices such as a camera to capture the gestures and movements of the person being cared for. This motion data is sent from the terminal to the server. The server analyzes the motion data using gesture recognition APIs such as Google ML Kit. The analyzed data is then interpreted through a natural language processing engine such as Google Cloud Natural Language. Based on the recognized information, an appropriate response sentence is generated.

[0465] The generated response is sent from the server to the terminal, displayed on the terminal's screen, and also presented as an audio signal using a speech synthesis API such as Google Cloud Text-to-Speech. In this way, the care recipient's wishes can be clearly communicated to the staff.

[0466] For example, if a person receiving care indicates their desire for water through gestures, the action is captured by a camera and recognized as a request for water. A response such as "Shall I bring you a drink?" is then generated and played back as audio. This entire process enables quick and accurate communication. Furthermore, a prompt such as "Please enter a hand gesture indicating 'water'" can be used with the generative AI model. Through this prompt, the system generates the intended response.

[0467] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0468] Step 1:

[0469] The terminal captures the user's gestures and movements through its camera. At this stage, the camera acquires video data in real time, and this video data is converted into a digital format and prepared as motion data.

[0470] Step 2:

[0471] The terminal sends the acquired motion data to the server. Here, the motion data is sent to the server via the network. The server receives the motion data as input and prepares it for subsequent analysis.

[0472] Step 3:

[0473] The server analyzes the motion data using a gesture recognition algorithm. Specifically, it processes the data using APIs such as Google ML Kit to detect gesture patterns. This process outputs the meaning of the gesture, such as "water request."

[0474] Step 4:

[0475] The server passes the recognized semantic content to a natural language processing engine. Semantic data is used as input, and an appropriate response is output via Google Cloud Natural Language or similar tools. The data processing in this process involves converting semantic content into natural language sentences.

[0476] Step 5:

[0477] The server sends the generated response message to the terminal. In this step, the response message is sent back to the terminal via the network. The terminal receives it and prepares for screen display or audio playback.

[0478] Step 6:

[0479] The terminal displays the received response on its screen and presents it to the user as an audio signal using speech synthesis technology. APIs such as Google Cloud Text-to-Speech are utilized to convert text data into audio data and output it in a format that is easy for the user to understand. This output allows the user's intentions to be conveyed to the support staff.
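As a non-limiting illustration, the hand-off from the recognized semantic content (Step 3) to the response message returned to the terminal (Steps 4 to 6) can be sketched in Python as follows. The ResponseMessage structure, its field names, the fallback sentence, and the lookup table are assumptions made for this sketch. The table merely stands in for the natural language processing engine.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical mapping from recognized semantic content to a response sentence.
RESPONSES = {"water request": "Shall I bring you a drink?"}


@dataclass
class ResponseMessage:
    semantic_content: str  # output of Step 3
    response_text: str     # output of Step 4
    play_audio: bool       # whether the terminal should also speak the text


def make_response(semantic_content: str) -> ResponseMessage:
    """Server side: choose a response sentence for the recognized content."""
    text = RESPONSES.get(semantic_content, "Could you show me that again?")
    return ResponseMessage(semantic_content, text, play_audio=True)


def encode(message: ResponseMessage) -> str:
    """Serialize the message for transfer to the terminal (Step 5)."""
    return json.dumps(asdict(message))


def decode(raw: str) -> ResponseMessage:
    """Terminal side: restore the message before display or playback (Step 6)."""
    return ResponseMessage(**json.loads(raw))
```

Carrying the semantic content alongside the response text is a design choice of this sketch. It would let the terminal log or display what was recognized in addition to the sentence that is spoken.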

[0480] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0481] This invention includes a system that recognizes a user's emotions and generates a response based on those emotions. The basic configuration includes a terminal operated by the user and a server that performs analysis, emotion recognition, and natural language generation.

[0482] First, the device acquires user gestures and text input. During this process, sensors such as the camera and microphone built into the device can capture the user's facial expressions and voice tone. This acquired data is then transmitted to a server via the network.

[0483] The server first analyzes the received data as gestures or text input to determine its meaning. Simultaneously, it activates the emotion engine to recognize emotions from the user's input data. For example, if a user smiles, waves, and types "hello," the emotion engine recognizes the emotion as "friendly."

[0484] The server uses a natural language generation engine to generate a response with an appropriate tone and content based on the emotions it recognizes. For example, a possible response might be, "You seem happy! Hello, how are you?" The generated response is then immediately sent to the terminal.

[0485] The device displays the received response on its screen or presents it to the user as audio using speech synthesis technology. This allows the user to experience communication that reflects their own emotions. Emotion-based responses enable users to communicate their intentions without misunderstanding and facilitate deeper interactions.

[0486] This invention aims to improve the quality of everyday interactions by enabling two-way communication that takes user emotions into account. This embodiment allows all users to enjoy more personalized communication.

[0487] The following describes the processing flow.

[0488] Step 1:

[0489] The terminal acquires the user's gestures, facial expressions, and voice through its input devices. The camera and microphone capture the user's movements, facial expressions, and voice tone, and the terminal saves these as initial data.

[0490] Step 2:

[0491] The device transmits digital data, including acquired gestures, facial data, and voice tone, to a server. The data is securely transferred over the network in real time.

[0492] Step 3:

[0493] The server receives the transmitted data and analyzes it using a gesture recognition algorithm. Here, it analyzes the meaning of the gestures, voice, and facial expressions to identify the user's basic intent.

[0494] Step 4:

[0495] The server activates the emotion engine and recognizes the user's emotional state based on the data. For example, if the user is smiling, the emotion engine identifies "joy."

[0496] Step 5:

[0497] The server uses the analysis results and emotion recognition results to generate a response sentence through a natural language generation engine. If the recognized emotion is "joy," it creates a response suited to that emotion, such as, "You seem happy! Hello, how are you?"

[0498] Step 6:

[0499] The server sends the generated response message to the terminal. The response message is sent in text format so that the terminal can present it in real time.

[0500] Step 7:

[0501] The terminal displays the received response on its screen and plays it back as speech using its speech synthesis function. The user can visually and audibly confirm the response, and the communication is completed.
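As a non-limiting illustration, Step 5 above (combining the recognized meaning with the recognized emotion) can be sketched in Python as follows. The two lookup tables and the fallback sentence are assumptions made for this sketch. They act as a simple template-based stand-in for the natural language generation engine, not as the engine itself.

```python
from typing import Optional

# Hypothetical base responses keyed by the recognized meaning (Step 3).
BASE_RESPONSES = {"hello": "Hello, how are you?"}

# Hypothetical opening phrases keyed by the recognized emotion (Step 4).
EMOTION_OPENERS = {"joy": "You seem happy!"}


def compose_response(meaning: str, emotion: Optional[str]) -> str:
    """Prefix the base response with an opener that reflects the emotion."""
    base = BASE_RESPONSES.get(meaning, "I see.")
    opener = EMOTION_OPENERS.get(emotion) if emotion else None
    return f"{opener} {base}" if opener else base
```

When no emotion is recognized, or the emotion has no registered opener, the sketch falls back to the plain response so that the flow still produces an output.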

[0502] (Example 2)

[0503] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0504] In modern dialogue systems, it is difficult to generate responses that accurately reflect the user's emotions and intentions. This can lead to unnatural communication and failure to meet user expectations. Therefore, there is a need for systems that can recognize user actions and emotions and respond in a natural manner.

[0505] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0506] In this invention, the server includes means for receiving action information acquired using a user input device, means for analyzing the action information and recognizing its semantic content and emotion, and means for generating a response sentence using a natural language generation model based on the recognized semantic content and emotion. This makes it possible to generate a response that reflects the user's feelings and intentions.

[0507] "User input devices" refer to hardware used by users to input information into a terminal, and specifically include devices such as cameras and microphones.

[0508] "Action information" refers to data such as gestures, voice tone, and facial expressions acquired from the user through an input device.

[0509] "Means of analysis" refers to software or algorithms that use a computer's processing power to extract specific meanings or emotions from the received action information.

[0510] "Emotion" refers to data that indicates the user's emotional state, and is information recognized through the analysis of the action information.

[0511] A "natural language generation model" refers to artificial intelligence technology that takes input data and converts it into a language format that humans naturally use to create response sentences.

[0512] A "response sentence" is a natural language sentence generated based on recognized semantic content and emotions, and is output information presented to the user.

[0513] A "display device" is a device used to visually present a generated response to the user, and usually refers to a screen or monitor.

[0514] "Audio signal" refers to electrical or digital acoustic data used to transmit generated response sentences to the user as sound.

[0515] "Speech synthesis technology" is a technology that generates speech based on text data to make it sound like a human is speaking.

[0516] This invention provides a system that recognizes a user's emotions and generates a response based on those emotions. The system includes a terminal operated by the user and a server that performs data analysis, emotion recognition, and natural language generation.

[0517] The device uses sensors such as a camera and microphone to capture the user's gestures and voice tone. This allows it to collect information about actions such as the user smiling, waving, or speaking in a specific tone towards the device. This action information is then transmitted to a server via the network.

[0518] The server uses gesture recognition algorithms and speech recognition software to analyze the received action information. The analysis identifies the specific meaning and emotional cues associated with the action. An emotion engine then recognizes the user's emotion based on these results. For example, if a smile and friendly words are detected simultaneously, the emotion "friendly" is recognized.

[0519] Next, the server utilizes a generative AI model to generate natural-sounding responses that match the recognized emotions. Pre-configured prompts can be used to generate these responses. A specific example of a prompt is, "Generate a friendly response in a scenario where the user is smiling and waving." Based on this prompt, the generative model will create a response such as, "You look happy! Hi, how are you?"

[0520] The generated response is sent back from the server to the terminal. The terminal either displays this response on its screen or presents it to the user as speech using speech synthesis technology, enabling natural communication. This system allows users to receive responses that reflect their own emotions, resulting in a more satisfying communication experience.

[0521] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0522] Step 1:

[0523] The device captures the user's gestures and voice. The user performs actions such as smiling or waving towards the camera, and these actions and voices are captured by the device's camera and microphone. As a result, facial expression data and voice tone information are collected as input data.

[0524] Step 2:

[0525] The terminal transmits the acquired action information to the server. The input data is encrypted and securely transferred to the server over the Internet Protocol. Video and audio data are transmitted during this process.

[0526] Step 3:

[0527] The server analyzes the transmitted data. It uses a gesture recognition algorithm to identify specific user actions from the video data. It also uses voice analysis software to extract tones and keywords from the voice data. Through these analyses, the user's actions and speech content are clarified, and semantic content such as "greeting" or "friendliness" is output.

[0528] Step 4:

[0529] The server activates the emotion engine based on the recognized semantic content and behavioral information. This analyzes the user's emotions and identifies feelings such as "friendly" or "happy." The emotion recognition model takes the analysis results as input and outputs appropriate emotional information.

[0530] Step 5:

[0531] The server utilizes a generative AI model to generate response sentences that correspond to the recognized emotions. By using pre-configured prompt sentences, it creates natural language response sentences that are appropriate for the emotion. When the prompt sentence "Generate a friendly response in a scenario where the user is smiling and waving and greeting" is entered, it generates a response such as "You look happy! Hello, how are you?"

[0532] Step 6:

[0533] The server sends the generated response to the terminal. This response is transmitted as text or audio data and delivered to the user in real time.

[0534] Step 7:

[0535] The device presents the received response to the user. It can be displayed as text on the screen or played back as speech using speech synthesis technology. This allows the user to receive a response that reflects their own emotions, enabling a smoother communication experience.
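As a non-limiting illustration, Steps 4 and 5 above can be sketched in Python as follows. The rule-based recognize_emotion function, the keyword list, and the "neutral" fallback are assumptions made for this sketch and merely stand in for the emotion engine. The prompt wording follows the example prompt given in the text.

```python
from typing import Sequence

# Hypothetical list of friendly keywords extracted from the voice data.
FRIENDLY_WORDS = {"hello", "hi", "thanks"}


def recognize_emotion(actions: Sequence[str], keywords: Sequence[str]) -> str:
    """Rule-based stand-in for the emotion engine of Step 4."""
    smiling = "smiling" in actions
    friendly = any(word.lower() in FRIENDLY_WORDS for word in keywords)
    if smiling and friendly:
        return "friendly"
    if smiling:
        return "happy"
    return "neutral"


def build_emotion_prompt(emotion: str, actions: Sequence[str]) -> str:
    """Build the prompt sentence given to the generative AI model in Step 5."""
    if not actions:
        raise ValueError("at least one action is required")
    scenario = " and ".join(actions)
    return (
        f"Generate a {emotion} response in a scenario "
        f"where the user is {scenario}."
    )
```

For example, passing the actions "smiling" and "waving" together with the keyword "Hello" yields the emotion "friendly" and the prompt "Generate a friendly response in a scenario where the user is smiling and waving."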

[0536] (Application Example 2)

[0537] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0538] In modern households, communication has become more diverse, and there is a need for systems that can understand subtle emotional changes and respond appropriately. However, conventional systems have been unable to adequately process user input information in accordance with emotions, making it difficult to achieve natural communication that reflects the user's feelings. As a result, smooth interpersonal relationships in daily life have not been fully achieved.

[0539] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0540] In this invention, the server includes means for receiving motion data and emotion data acquired using user input means, means for analyzing the motion data and emotion data and recognizing semantic content and emotional state, and means for generating natural language based on the recognized semantic content and emotional state. This enables communication adapted to the user's emotions.

[0541] "Motion data" refers to information about the user's physical movements and gestures, which is acquired through sensors.

[0542] "Emotion data" refers to information that indicates a user's emotional state, obtained from sources such as facial expressions and tone of voice.

[0543] "Semantic content" refers to the meaning obtained by analyzing the motion data and emotion data acquired from the user.

[0544] "Emotional state" refers to the state of emotions a user is experiencing at a given time, and is identified through data analysis.

[0545] "Natural language generation" is the process by which a computer creates a response sentence in human language based on the results of its analysis.

[0546] A "response sentence" is a natural language sentence generated by the system in response to user input.

[0547] An "acoustic signal" is a format in which sound is processed and output as an electrical signal by electronic devices.

[0548] "Tone" refers to the pitch or nuance adjusted in written or spoken language to convey emotions or intentions.

[0549] "Communication" is the process of exchanging information and emotions between people and systems.

[0550] This invention applies to a home-use interactive system that understands emotions and provides adaptive responses. The system includes a user-operated terminal and a server. The terminal acquires user behavior data and emotion data using sensors such as a camera and microphone. This data is transmitted to the server via a network.

[0551] The server processes the data using software such as the following. Image processing uses OpenCV and other tools to analyze facial expression data. Audio processing uses CMU Sphinx and other tools to read emotions from voice tone. An emotion recognition engine such as the Google Cloud Emotion API is used to recognize the user's emotional state. A language model such as GPT-3 is used for natural language generation.

[0552] As a result of this process, the server generates the most appropriate response for the user and sends it to the terminal. The terminal then presents the generated response to the user as audio or text, enabling communication tailored to the user's emotions.

[0553] For example, when a child returns home from school and starts doing homework, the system uses the device's camera to read facial expressions and the microphone to analyze the tone of voice. If it determines that the child is feeling tired, the server generates a response message such as, "You've worked so hard today! Shall we take a little break?" and sends it to the device. The device then plays this message back as audio, prompting the user to take an appropriate rest.

[0554] A concrete example of a prompt for a generative AI model is as follows: "Generate an appropriate response when a user speaks to you with a smile." This prompt is used to generate a contextually appropriate response based on the user's actions and emotions.

[0555] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0556] Step 1:

[0557] The device uses a camera and microphone to acquire user behavior and emotional data. It captures facial expressions as images and voice tone as audio data. Input consists of video data from the camera and audio data from the microphone, which are output as sensor data.

[0558] Step 2:

[0559] The acquired sensor data is sent from the terminal to the server. It is crucial to pass both video and audio data to the server. The input is sensor data, and the output is data transmission to the server via the network.

[0560] Step 3:

[0561] The server uses the OpenCV image processing library to analyze facial expressions from video data and identifies emotional states using an emotion recognition engine. The input is video data, and the output is the emotional state inferred from the user's facial expressions. Specifically, it extracts facial feature points and determines emotions based on them.

[0562] Step 4:

[0563] The server uses the CMU Sphinx speech analysis engine to analyze the tone of the speech data and uses the results to refine the estimate of the user's emotional state. The input is speech data, and the output is emotional information based on the speech tone. It analyzes the pitch and emphasis of the speech to determine the emotion.

[0564] Step 5:

[0565] The server integrates the analysis results of facial expressions and voice tone to determine the user's overall emotional state. The input is the analysis result from the previous step, and the output is the integrated emotional state. It performs the operation of integrating multiple emotional information to estimate the most likely emotion.

[0566] Step 6:

[0567] The server uses a natural language generation engine to generate appropriate response sentences based on integrated emotional states and text input from the user. The input is the integrated emotional state and text input, and the output is the response sentence. A process of creating prompt sentences and constructing responses is carried out using a generative AI model.

[0568] Step 7:

[0569] The server sends the generated response to the terminal, which then presents the response to the user as either audio or text. The input is the response sentence, and the output is the message played through the speaker or shown on the screen. When audio is used, the terminal applies speech synthesis technology to utter the response.
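As a non-limiting illustration, the integration performed in Step 5 above can be sketched in Python as follows. The text does not specify how the facial expression result and the voice tone result are combined. The weighted average used here, the weight value, and the emotion labels are assumptions made for this sketch.

```python
from typing import Dict, Tuple


def fuse_emotions(
    face_scores: Dict[str, float],
    voice_scores: Dict[str, float],
    face_weight: float = 0.6,
) -> Tuple[str, Dict[str, float]]:
    """Combine per-emotion scores from two sources and pick the most likely one.

    Each input maps an emotion label to a score in [0, 1]. A label that is
    missing from one source is treated as a score of 0 for that source.
    """
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be between 0 and 1")
    if not face_scores and not voice_scores:
        raise ValueError("at least one source must provide scores")
    labels = set(face_scores) | set(voice_scores)
    fused = {
        label: face_weight * face_scores.get(label, 0.0)
        + (1.0 - face_weight) * voice_scores.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes ties resolve alphabetically, so results are repeatable.
    best = max(sorted(fused), key=lambda label: fused[label])
    return best, fused


# Example: both sources lean toward "tired", as in the homework scenario.
best, fused = fuse_emotions(
    {"tired": 0.7, "joy": 0.3},
    {"tired": 0.6, "neutral": 0.4},
)
```

A weighted average is only one possible integration rule. It is used here because it keeps the contribution of each source explicit and easy to adjust.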

[0570] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0571] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
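As a non-limiting illustration, the input and output relationship described above (a prompt plus inference data in, an inference result out) can be sketched in Python as follows. The InferenceInput structure, the DataGenerationModel interface, and the EchoModel stub are hypothetical names introduced for this sketch. They do not correspond to the API of any particular generative AI service.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceInput:
    prompt: str                    # instructions for the model
    text: Optional[str] = None     # text data representing text
    audio: Optional[bytes] = None  # audio data representing speech
    image: Optional[bytes] = None  # image data representing images


class DataGenerationModel(Protocol):
    """Hypothetical interface that a model wrapper would implement."""

    def infer(self, request: InferenceInput) -> str:
        ...


class EchoModel:
    """Stub that reports which inputs it received instead of running inference."""

    def infer(self, request: InferenceInput) -> str:
        kinds = [
            name
            for name in ("text", "audio", "image")
            if getattr(request, name) is not None
        ]
        return f"[{request.prompt}] received: {', '.join(kinds) or 'no data'}"
```

Separating the interface from the stub would allow the specific processing to be exercised without access to an actual model.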

[0572] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0573] [Fourth Embodiment]

[0574] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0575] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0576] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0577] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0578] The microphone 238 receives instructions and other input from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0579] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0580] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0581] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
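As a non-limiting illustration, the expression of emotions through the controlled object 443 can be sketched in Python as a mapping from an emotion label to an LED state and a motor sequence. The Expression structure, the emotion labels, and the LED and motion names are all hypothetical and are introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    eye_led: str     # hypothetical illumination pattern for the eye LEDs
    arm_motion: str  # hypothetical named sequence for the arm and hand motors


# Hypothetical table of expressions keyed by emotion label.
EXPRESSIONS = {
    "joy": Expression(eye_led="bright", arm_motion="raise_both_arms"),
    "friendly": Expression(eye_led="blink", arm_motion="wave_right_hand"),
}

# Default used when the emotion has no registered expression.
NEUTRAL = Expression(eye_led="steady", arm_motion="rest")


def expression_for(emotion: str) -> Expression:
    """Look up the LED and motor settings for an emotion."""
    return EXPRESSIONS.get(emotion, NEUTRAL)
```

In such an arrangement, a controller would translate the returned Expression into actual LED and motor commands, which depend on the hardware and are not shown.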

[0582] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0583] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0584] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0585] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0586] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0587] To implement this invention, a user-operated terminal and a server that performs data analysis and response generation are used. The terminal is equipped with an input device that can acquire user gestures or text input and transmits this input data to the server. The server analyzes the data transmitted from the terminal and recognizes the meaning of the gestures and input content.

[0588] As a concrete example, the device's camera captures the user's hand movements and sends the gesture as digital data to the server. The server uses a gesture recognition algorithm to analyze this digital data and determine that it means "hello." The server then uses a natural language generation engine to create an appropriate response. For example, it might generate a message like, "Hello, how are you?"

[0589] The generated response is sent from the server to the terminal and displayed on the terminal's screen. Furthermore, by combining this with speech synthesis technology, the response can also be played back as audio. This allows users who have difficulty with voice communication to communicate their intentions to others in real time.

[0590] This invention significantly reduces communication barriers, enabling people to enrich their interactions in daily life. In this way, it realizes a system that allows for natural and smooth conversation without the need for voice.

[0591] The following describes the processing flow.

[0592] Step 1:

[0593] The device acquires user gestures and text input through the input device. When the user waves their hand, the device's camera captures the movement and saves it as image data.

[0594] Step 2:

[0595] The terminal transmits the acquired gestures and input data to the server via the network. At this time, the user's input is formatted as digital data.

[0596] Step 3:

[0597] The server receives data sent from the terminal and begins analysis using a deep learning algorithm. Here, the server applies a gesture recognition model to identify the meaning of the actions from the received data.

[0598] Step 4:

[0599] The server activates a natural language generation engine based on the analysis results and generates an appropriate response. For example, in response to a gesture meaning "hello," it creates the response "Hello, how are you?"

[0600] Step 5:

[0601] The server sends the generated response to the terminal. This response is returned to the terminal in the form of text data.

[0602] Step 6:

[0603] The terminal displays the received response on its screen or plays it back as speech using speech synthesis technology. The user can then communicate their intentions to the other party by confirming this response.
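As a non-limiting illustration, the terminal-side presentation in Step 6 above can be sketched in Python as follows. The function name, the mode parameter, and the use of callables for the display and the speaker are assumptions made for this sketch. Actual screen output and speech synthesis are replaced by in-memory stand-ins.

```python
from typing import Callable, List


def present_response(
    text: str,
    show: Callable[[str], None],
    speak: Callable[[str], None],
    mode: str = "both",
) -> None:
    """Present the response as text, as speech, or as both."""
    if mode not in ("text", "speech", "both"):
        raise ValueError(f"unknown mode: {mode}")
    if mode in ("text", "both"):
        show(text)
    if mode in ("speech", "both"):
        speak(text)


# Usage with in-memory stand-ins for the display and the speaker.
shown: List[str] = []
spoken: List[str] = []
present_response("Hello, how are you?", shown.append, spoken.append)
```

Passing the output devices as callables keeps the sketch independent of any particular display or speech synthesis library.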

[0604] (Example 1)

[0605] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0606] In modern society, smooth communication remains a challenge in situations where voice-based communication is difficult or in environments unsuited to voice input. Furthermore, there is a need for a system that can easily recognize gestures, accurately understand their intent, and respond without using infrared cameras or advanced sensors.

[0607] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0608] In this invention, the server includes means for receiving gesture or text data acquired by a user's terminal, means for analyzing the gesture or text data and identifying its meaning, and means for generating natural language using a generative artificial intelligence model based on the identified meaning. This enables the server to understand the user's intent and respond appropriately without relying on voice, and facilitates natural and smooth communication through audiovisual means.

[0609] A "user" refers to an individual who operates the system using gestures or text input.

[0610] A "terminal" refers to an electronic device that receives user input and transmits data to a server.

[0611] A "gesture" is an action a user takes to express their intentions, and it is acquired as digital data for the system to interpret.

[0612] "Text data" refers to character information entered by a user using a keyboard or touchscreen.

[0613] A "server" refers to a central processing unit that receives data from terminals, analyzes it, and generates responses.

[0614] "Analysis" refers to data processing to recognize the meaning of received gesture or text data.

[0615] A "generative artificial intelligence model" refers to an artificial intelligence algorithm that generates natural language responses based on the semantic content of received data.

[0616] "Natural language generation" refers to the technology of creating language in a form that humans can understand based on analysis results.

[0617] A "response sentence" refers to a natural language sentence presented to the user, created by a generative artificial intelligence model.

[0618] "Speech synthesis means" refers to technology that outputs generated response sentences as speech and presents auditory information to the user.

[0619] This invention is realized by using a terminal used by the user and a server that analyzes data and generates a response. The terminal is equipped with devices for inputting user gestures or text, specifically including a camera, touchscreen, and microphone. The terminal converts the user's actions and inputs acquired through these devices into digital data and transmits it to the server.

[0620] The server uses a computer vision library (for example, OpenCV) to analyze the received data and recognize gestures. Based on the analyzed gestures and text, a generative artificial intelligence model (e.g., GPT-4) is used to generate natural language responses.
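As an illustration of the gesture recognition step, the following minimal Python sketch flags a waving motion from a sequence of horizontal hand positions. It assumes that a computer vision library has already extracted one normalized x coordinate of the hand per frame. The function names, the `min_step` noise threshold, and the `min_reversals` count are hypothetical and are not taken from this disclosure.

```python
from typing import Sequence


def count_direction_reversals(xs: Sequence[float], min_step: float = 0.02) -> int:
    """Count how often the horizontal movement direction flips.

    Frame-to-frame movements smaller than min_step are treated as noise.
    """
    reversals = 0
    last_sign = 0  # 0 means no significant movement has been seen yet
    for prev, cur in zip(xs, xs[1:]):
        delta = cur - prev
        if abs(delta) < min_step:
            continue
        sign = 1 if delta > 0 else -1
        if last_sign and sign != last_sign:
            reversals += 1
        last_sign = sign
    return reversals


def is_waving(xs: Sequence[float], min_reversals: int = 3) -> bool:
    """Treat repeated left-right reversals as a waving gesture (assumed rule)."""
    return count_direction_reversals(xs) >= min_reversals


# Hypothetical hand positions: one oscillating track and one nearly still track.
wave = [0.2, 0.4, 0.6, 0.4, 0.2, 0.4, 0.6, 0.4, 0.2]
still = [0.5, 0.505, 0.5, 0.495, 0.5]
print(is_waving(wave), is_waving(still))
```

A rule of this kind is only a stand-in for the deep learning gesture recognition model described in the text; it is shown because it makes the input and output of the recognition step concrete.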

[0621] The generated response is sent to the terminal, which displays the text on its screen. Furthermore, the generated response can also be played back as audio using a Text-to-Speech (TTS) engine. As a result, even in situations where audio is difficult to use, the user can confirm the response through both sight and sound.

[0622] As a concrete example, consider a prompt: "The user has made a waving gesture. Generate an appropriate natural language response." By inputting this prompt into the AI model, the server generates an appropriate natural language response that is relevant to the situation.

[0623] Thus, this invention constructs a system that combines a terminal and a server to accurately recognize the user's intent and provide an appropriate response, thereby enabling smooth and natural communication.

[0624] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0625] Step 1:

[0626] The user inputs gestures or text using the terminal's camera or touchscreen. The terminal captures this input as digital data and encodes it as input data. For example, a hand-waving gesture is captured as video data by the camera.

[0627] Step 2:

[0628] The terminal transmits the acquired digital data to the server. The input data is compressed and encrypted and converted into a format that the server can receive. During this process, the terminal uses a data transfer protocol to communicate securely.

[0629] Step 3:

[0630] The server analyzes the received digital data. The server applies computer vision algorithms to recognize gestures from the input data. This analysis identifies that the "waving hand" gesture means "hello."

[0631] Step 4:

[0632] The server generates a natural language response using a generative AI model based on the analysis results. The prompt "The user made a waving gesture. Please generate an appropriate natural language response." is input to the AI model. Through this data processing, the server generates the response "Hello, how are you?".

[0633] Step 5:

[0634] The server sends the generated response to the terminal. The terminal prepares to display the received response on its screen. If using a TTS engine, the terminal performs the process of converting the response into audio data.

[0635] Step 6:

[0636] The terminal displays the response as text using a display device and also plays it back as speech using a speech synthesis device. The user can receive the system's response through both sight and hearing. This operation enables smooth communication even in non-voice environments.
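Steps 2 through 5 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the payload carries an already recognized gesture label as a stand-in for the video data, the `GESTURES` table stands in for the computer vision analysis, and `stub_generate` stands in for the generative AI model. All names are hypothetical. Encryption is assumed to be handled by the transport (for example, TLS) and is not shown.

```python
import json
import zlib
from typing import Callable, Dict

# Hypothetical table: gesture label -> prompt wording and identified meaning.
GESTURES: Dict[str, Dict[str, str]] = {
    "waving_hand": {"description": "waving", "meaning": "hello"},
}


def pack_input(gesture_label: str) -> bytes:
    """Terminal side (Step 2): serialize and compress the input data."""
    return zlib.compress(json.dumps({"gesture": gesture_label}).encode("utf-8"))


def handle_request(payload: bytes, generate: Callable[[str], str]) -> Dict[str, str]:
    """Server side (Steps 3 to 5): decode, identify the meaning, build the response."""
    data = json.loads(zlib.decompress(payload).decode("utf-8"))
    entry = GESTURES.get(data["gesture"])
    if entry is None:
        # Unrecognized gesture: no prompt is sent to the model in this sketch.
        return {"meaning": "unknown", "response": ""}
    prompt = (
        f"The user made a {entry['description']} gesture. "
        "Please generate an appropriate natural language response."
    )
    return {"meaning": entry["meaning"], "response": generate(prompt)}


def stub_generate(prompt: str) -> str:
    """Stand-in for the generative AI model; always returns a canned reply."""
    return "Hello, how are you?"


result = handle_request(pack_input("waving_hand"), stub_generate)
print(result["meaning"], "->", result["response"])
```

Passing the model in as a callable keeps the sketch independent of any particular generative AI service.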

[0637] (Application Example 1)

[0638] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0639] In the field of elderly care, there is a need for means to achieve smooth and intuitive communication between care recipients who have difficulty with verbal communication and support staff. In many cases, care recipients are unable to adequately convey their needs or condition, leading to communication breakdowns and misunderstandings, so methods to resolve this are necessary.

[0640] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0641] In this invention, the server includes means for receiving motion data acquired using user input means, means for analyzing the motion data and recognizing its semantic content, and means for generating natural language based on the recognized semantic content. This enables intuitive communication based on the gestures and movements of the person being cared for.

[0642] "User input means" refers to devices or functions that users can operate directly and that acquire motion data.

[0643] "Motion data" refers to information that is digitally represented based on detected user actions and gestures.

[0644] "Means of recognizing semantic content" refers to algorithms and processes for analyzing acquired motion data and understanding the meaning that the data conveys.

[0645] "Means of natural language generation" refers to technologies and systems that generate text in a human-understandable language based on recognized semantic content.

[0646] "Means of outputting response sentences from generated natural language" refers to a system that displays or converts the generated text into speech in order to convey it to the user.

[0647] "Means of converting into acoustic signals and presenting as information" refers to methods of transmitting information to a user as an audio message using technologies and devices that convert text-based information into audio information.

[0648] This invention is a system for facilitating communication between care recipients who have difficulty with verbal communication and support staff in the field of elderly care. The system includes a user input means, a motion data analysis means, a semantic content recognition means, a natural language generation means, and a response sentence output means.

[0649] The user's terminal is equipped with input devices such as a camera to capture the gestures and movements of the person being cared for. This motion data is sent from the terminal to the server. The server analyzes the motion data using gesture recognition APIs such as Google ML Kit. The analyzed data is then interpreted through a natural language processing engine such as Google Cloud Natural Language. Based on the recognized information, an appropriate response sentence is generated.

[0650] The generated response is sent from the server to the terminal, displayed on the terminal's screen, and also presented as an audio signal using a speech synthesis API such as Google Cloud Text-to-Speech. In this way, the care recipient's wishes can be clearly communicated to the staff.

[0651] For example, if a person receiving care indicates their desire for water through gestures, the action is captured by a camera and recognized as a request for water. A response such as "Shall I bring you a drink?" is then generated and played back as audio. This entire process enables quick and accurate communication. Furthermore, a prompt such as "Please enter a hand gesture indicating 'water'" can be used with the AI generation model. Through this prompt, the system accurately generates the intended response.

[0652] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0653] Step 1:

[0654] The terminal captures the user's gestures and movements through its camera. At this stage, the camera acquires video data in real time as input, and this video data is converted into a digital format and prepared as motion data.

[0655] Step 2:

[0656] The terminal sends the acquired motion data to the server. Here, the motion data is sent to the server via the network. The server receives the motion data as input and prepares it for subsequent analysis.

[0657] Step 3:

[0658] The server analyzes the motion data using a gesture recognition algorithm. Specifically, it processes the data using APIs such as Google ML Kit to detect gesture patterns. This process outputs the meaning of the gesture, such as "water request."

[0659] Step 4:

[0660] The server passes the recognized semantic content to a natural language processing engine. Semantic data is used as input, and an appropriate response is output via Google Cloud Natural Language or similar tools. The data processing in this process involves converting semantic content into natural language sentences.

[0661] Step 5:

[0662] The server sends the generated response message to the terminal. In this step, the response message is sent back to the terminal via the network. The terminal receives it and prepares for screen display or audio playback.

[0663] Step 6:

[0664] The terminal displays the received response on its screen and presents it to the user as an audio signal using speech synthesis technology. APIs such as Google Cloud Text-to-Speech are utilized to convert text data into audio data and output it in a format that is easy for the user to understand. This output allows the user's intentions to be conveyed to the support staff.
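Because each stage in this flow is delegated to an external service, the overall structure can be sketched as a pipeline of interchangeable functions. The following Python sketch is illustrative only: the `CarePipeline` class and the three stub functions are hypothetical and do not call Google ML Kit, Google Cloud Natural Language, or Google Cloud Text-to-Speech.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Union


@dataclass
class CarePipeline:
    """Three pluggable stages; each could wrap an external API in practice."""
    recognize: Callable[[Sequence[bytes]], str]  # motion data -> meaning (Step 3)
    generate: Callable[[str], str]               # meaning -> response sentence (Step 4)
    synthesize: Callable[[str], bytes]           # response sentence -> audio (Step 6)

    def run(self, frames: Sequence[bytes]) -> Dict[str, Union[str, bytes]]:
        meaning = self.recognize(frames)
        text = self.generate(meaning)
        return {"meaning": meaning, "text": text, "audio": self.synthesize(text)}


def stub_recognize(frames: Sequence[bytes]) -> str:
    """Stand-in recognizer: any non-empty input counts as a water request."""
    return "water request" if frames else "no gesture"


def stub_generate(meaning: str) -> str:
    """Stand-in generator: a fixed sentence per meaning."""
    if meaning == "water request":
        return "Shall I bring you a drink?"
    return "I did not catch that."


def stub_synthesize(text: str) -> bytes:
    """Stand-in for speech synthesis: returns the UTF-8 bytes of the text."""
    return text.encode("utf-8")


pipeline = CarePipeline(stub_recognize, stub_generate, stub_synthesize)
output = pipeline.run([b"frame-1", b"frame-2"])
print(output["meaning"], "->", output["text"])
```

Keeping the stages separate means a recognizer, generator, or synthesizer can be replaced without changing the rest of the flow.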

[0665] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0666] This invention includes a system that recognizes a user's emotions and generates a response based on those emotions. The basic configuration includes a terminal operated by the user and a server that performs analysis, emotion recognition, and natural language generation.

[0667] First, the terminal acquires user gestures and text input. During this process, sensors such as the camera and microphone built into the terminal can capture the user's facial expressions and voice tone. This acquired data is then transmitted to the server via the network.

[0668] The server first analyzes the received data as gestures or text input to determine its meaning. Simultaneously, it activates the emotion engine to recognize emotions from the user's input data. For example, if a user smiles, waves, and types "hello," the emotion engine recognizes the emotion as "friendly."

[0669] The server uses a natural language generation engine to generate a response with an appropriate tone and content based on the emotions it recognizes. For example, a possible response might be, "You seem happy! Hello, how are you?" The generated response is then immediately sent to the terminal.

[0670] The terminal displays the received response on its screen or presents it to the user as audio using speech synthesis technology. This allows the user to experience communication that reflects their own emotions. Emotion-based responses enable users to communicate their intentions without misunderstanding and facilitate deeper interactions.

[0671] This invention aims to improve the quality of everyday interactions by enabling two-way communication that takes user emotions into account. This embodiment allows all users to enjoy more personalized communication.

[0672] The following describes the processing flow.

[0673] Step 1:

[0674] The terminal acquires the user's gestures, facial expressions, and voice through its input devices. The camera and microphone capture the user's movements, facial expressions, and voice tone, and this is saved as initial data.

[0675] Step 2:

[0676] The terminal transmits digital data, including the acquired gestures, facial expression data, and voice tone, to the server. The data is securely transferred over the network in real time.

[0677] Step 3:

[0678] The server receives the transmitted data and analyzes it using a gesture recognition algorithm. Here, it performs meaning analysis of gestures and voice, as well as user facial expressions, to identify the basic intent.

[0679] Step 4:

[0680] The server activates the emotion engine and recognizes the user's emotional state based on the data. For example, if the user is smiling, the emotion engine identifies "joy."

[0681] Step 5:

[0682] The server uses the analysis results and emotion recognition results to generate a response sentence through a natural language generation engine. If the recognized emotion is "joy," it creates a response suited to that emotion, such as, "You seem happy! Hello, how are you?"

[0683] Step 6:

[0684] The server sends the generated response message to the terminal. The response message is sent in text format so that the terminal can present it in real time.

[0685] Step 7:

[0686] The terminal displays the received response on its screen and plays it back as speech using its speech synthesis function. The user can visually and audibly confirm the response, and the communication is completed.
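Step 5 of this flow combines the identified intent with the recognized emotion. A minimal Python sketch of that combination is shown below. It uses a fixed lookup table in place of the natural language generation engine, and the `EMOTION_OPENERS` table and `compose_response` function are hypothetical names introduced only for illustration.

```python
from typing import Dict

# Hypothetical openers keyed by recognized emotion.
EMOTION_OPENERS: Dict[str, str] = {
    "joy": "You seem happy!",
}


def compose_response(base_response: str, emotion: str) -> str:
    """Prefix the intent-based response with an opener that matches the emotion.

    If no opener is registered for the emotion, the base response is
    returned unchanged.
    """
    opener = EMOTION_OPENERS.get(emotion)
    if opener is None:
        return base_response
    return f"{opener} {base_response}"


print(compose_response("Hello, how are you?", "joy"))
```

In the described system a generative model would produce the whole sentence; the table here only makes explicit that the emotion changes the wording while the intent stays the same.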

[0687] (Example 2)

[0688] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0689] In modern dialogue systems, it is difficult to generate responses that accurately reflect the user's emotions and intentions. This can lead to unnatural communication and failure to meet user expectations. Therefore, there is a need for systems that can recognize user actions and emotions and respond in a natural manner.

[0690] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0691] In this invention, the server includes means for receiving action information acquired using a user input device, means for analyzing the action information and recognizing its semantic content and emotion, and means for generating a response sentence using a natural language generation model based on the recognized semantic content and emotion. This makes it possible to generate a response that reflects the user's feelings and intentions.

[0692] "User input devices" refer to hardware used by users to input information into a terminal, and specifically include devices such as cameras and microphones.

[0693] "Action information" refers to data such as gestures, voice tone, and facial expressions acquired from the user through an input device.

[0694] "Means of analysis" refers to software or algorithms that use a computer's processing power to extract specific meanings or emotions from received action information.

[0695] "Emotion" refers to data that indicates the user's emotional state, and is information recognized through the analysis of action information.

[0696] A "natural language generation model" refers to artificial intelligence technology that takes input data and converts it into a language format that humans naturally use to create response sentences.

[0697] A "response sentence" is a natural language sentence generated based on recognized semantic content and emotions, and is output information presented to the user.

[0698] A "display device" is a device used to visually present a generated response to the user, and usually refers to a screen or monitor.

[0699] "Audio signal" refers to electrical or digital acoustic data used to transmit generated response sentences to the user as sound.

[0700] "Speech synthesis technology" is a technology that generates speech based on text data to make it sound like a human is speaking.

[0701] This invention provides a system that recognizes a user's emotions and generates a response based on those emotions. The system includes a terminal operated by the user and a server that performs data analysis, emotion recognition, and natural language generation.

[0702] The terminal uses sensors such as a camera and microphone to capture the user's gestures and voice tone. This allows it to collect information about actions such as the user smiling, waving, or speaking in a specific tone towards the terminal. This action information is then transmitted to the server via the network.

[0703] The server uses gesture recognition algorithms and speech recognition software to analyze the received action information. The analysis identifies the specific meaning and emotion associated with the action. Furthermore, an emotion engine is used to recognize the user's emotions based on the identified meaning and emotion. For example, if a smile and friendly words are detected simultaneously, the emotion "friendly" is recognized.

[0704] Next, the server utilizes a generative AI model to generate natural-sounding responses that match the recognized emotions. Pre-configured prompts can be used to generate these responses. A specific example of a prompt is, "Generate a friendly response in a scenario where the user is smiling and waving." Based on this prompt, the generative model will create a response such as, "You look happy! Hi, how are you?"

[0705] The generated response is sent back from the server to the terminal. The terminal either displays this response on its screen or presents it to the user as speech using speech synthesis technology, enabling natural communication. This system allows users to receive responses that reflect their own emotions, resulting in a more satisfying communication experience.

[0706] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0707] Step 1:

[0708] The terminal captures the user's gestures and voice. The user performs actions such as smiling or waving towards the camera, and these actions and the user's voice are captured by the terminal's camera and microphone. As a result, facial expression data and voice tone information are collected as input data.

[0709] Step 2:

[0710] The terminal transmits the acquired action information to the server. The input data is encrypted and securely transferred to the server over an Internet protocol. Video and audio data are transmitted during this process.

[0711] Step 3:

[0712] The server analyzes the transmitted data. It uses a gesture recognition algorithm to identify specific user actions from the video data. It also uses voice analysis software to extract tones and keywords from the voice data. Through these analyses, the user's actions and speech content are clarified, and semantic content such as "greeting" or "friendliness" is output.

[0713] Step 4:

[0714] The server activates the emotion engine based on the recognized semantic content and action information. The emotion engine analyzes the user's emotions and identifies feelings such as "friendly" or "happy." The emotion recognition model takes the analysis results as input and outputs appropriate emotional information.

[0715] Step 5:

[0716] The server utilizes a generative AI model to generate response sentences that correspond to the recognized emotions. By using pre-configured prompt sentences, it creates natural language response sentences that are appropriate for the emotion. When the prompt sentence "Generate a friendly response in a scenario where the user is smiling and waving and greeting" is entered, it generates a response such as "You look happy! Hello, how are you?"

[0717] Step 6:

[0718] The server sends the generated response to the terminal. This response is transmitted as text or audio data and delivered to the user in real time.

[0719] Step 7:

[0720] The terminal presents the received response to the user. It can be displayed as text on the screen or played back as speech using speech synthesis technology. This allows the user to receive a response that reflects their own emotions, enabling a smoother communication experience.
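The pre-configured prompt used in Step 5 can be viewed as a template filled in with the recognized emotion and actions. The following Python sketch shows one way to do this. The function name `build_emotion_prompt` and its parameters are hypothetical, and the template wording simply follows the example prompt given in this description.

```python
from typing import Sequence


def build_emotion_prompt(emotion: str, actions: Sequence[str]) -> str:
    """Fill a prompt template with the recognized emotion and user actions."""
    if not actions:
        raise ValueError("at least one recognized action is required")
    joined = " and ".join(actions)
    return (
        f"Generate a {emotion} response in a scenario "
        f"where the user is {joined}."
    )


prompt = build_emotion_prompt("friendly", ["smiling", "waving"])
print(prompt)
```

The resulting text would then be passed to the generative AI model, which is outside the scope of this sketch.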

[0721] (Application Example 2)

[0722] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0723] In modern households, communication has become more diverse, and there is a need for systems that can understand subtle emotional changes and respond appropriately. However, conventional systems have been unable to adequately process user input information in accordance with emotions, making it difficult to achieve natural communication that reflects the user's feelings. As a result, smooth interpersonal relationships in daily life have not been fully achieved.

[0724] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0725] In this invention, the server includes means for receiving motion data and emotion data acquired using user input means, means for analyzing the motion data and emotion data and recognizing semantic content and emotional state, and means for generating natural language based on the recognized semantic content and emotional state. This enables communication adapted to the user's emotions.

[0726] "Motion data" refers to information about the user's physical movements and gestures, which is acquired through sensors.

[0727] "Emotion data" refers to information that indicates a user's emotional state, obtained from sources such as their facial expressions and tone of voice.

[0728] "Semantic content" refers to the meaning obtained by analyzing the motion data and emotion data acquired from users.

[0729] "Emotional state" refers to the state of emotions a user is experiencing at a given time, and is identified through data analysis.

[0730] "Natural language generation" is the process by which a computer creates a response sentence in human language based on the results of its analysis.

[0731] A "response sentence" is a natural language sentence generated by the system in response to user input.

[0732] An "acoustic signal" is a format in which sound is processed and output as an electrical signal by electronic devices.

[0733] "Tone" refers to the pitch or nuance adjusted in written or spoken language to convey emotions or intentions.

[0734] "Communication" is the process of exchanging information and emotions between people and systems.

[0735] This invention applies to a home-use interactive system that understands emotions and provides adaptive responses. The system includes a user-operated terminal and a server. The terminal acquires the user's motion data and emotion data using sensors such as a camera and microphone. This data is transmitted to the server via a network.

[0736] The server processes the data using the following software. For image processing, OpenCV or a similar tool is used to analyze facial expression data. For audio processing, CMU Sphinx or a similar tool is used to read emotions from voice tone. An emotion recognition engine such as the Google Cloud Emotion API is used to recognize the user's emotional state. A generative language model such as GPT-3 is used for natural language generation.

[0737] As a result of this process, the server generates the most appropriate response for the user and sends it to the terminal. The terminal then presents the generated response to the user as audio or text, enabling communication tailored to the user's emotions.

[0738] For example, when a child returns home from school and starts doing homework, the system uses the terminal's camera to read facial expressions and the microphone to analyze the tone of voice. If it determines that the child is feeling tired, the server generates a response message such as, "You've worked so hard today! Shall we take a little break?" and sends it to the terminal. The terminal then plays this message back as audio, prompting the user to take an appropriate rest.

[0739] A concrete example of a prompt for a generative AI model is as follows: "Generate an appropriate response when a user speaks to you with a smile." This prompt is used to generate a contextually appropriate response based on the user's actions and emotions.

[0740] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0741] Step 1:

[0742] The terminal uses a camera and microphone to acquire the user's motion data and emotion data. It captures facial expressions as images and voice tone as audio data. The input consists of video data from the camera and audio data from the microphone, and the output is sensor data.

[0743] Step 2:

[0744] The acquired sensor data is sent from the terminal to the server. It is crucial to pass both video and audio data to the server. The input is sensor data, and the output is data transmission to the server via the network.

[0745] Step 3:

[0746] The server uses the OpenCV image processing library to analyze facial expressions from video data and identifies emotional states using an emotion recognition engine. The input is video data, and the output is the emotional state inferred from the user's facial expressions. Specifically, it extracts facial feature points and determines emotions based on them.

[0747] Step 4:

[0748] The server uses the CMU Sphinx speech analysis engine to analyze the tone of the speech data and refines the estimate of the user's emotional state based on the results. The input is speech data, and the output is emotional information based on the speech tone. It analyzes the pitch and emphasis of the speech to determine the emotion.

[0749] Step 5:

[0750] The server integrates the analysis results of facial expressions and voice tone to determine the user's overall emotional state. The input is the analysis result from the previous step, and the output is the integrated emotional state. It performs the operation of integrating multiple emotional information to estimate the most likely emotion.

[0751] Step 6:

[0752] The server uses a natural language generation engine to generate an appropriate response sentence based on the integrated emotional state and text input from the user. The input is the integrated emotional state and text input, and the output is the response sentence. The server creates a prompt sentence and constructs the response using a generative AI model.

[0753] Step 7:

[0754] The server sends the generated response to the terminal, which then presents the response to the user as either audio or text. The input is the response, and the output is the message actually played through the speech synthesizer or displayed on the screen. The terminal uses speech synthesis technology to utter the response.
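Step 5 of this flow integrates the facial expression result and the voice tone result into a single emotional state. The description does not specify the integration rule, so the following Python sketch assumes a weighted average of per-emotion scores followed by selection of the highest score. The function name, the `face_weight` parameter, and the example scores are all hypothetical.

```python
from typing import Dict


def integrate_emotions(
    face_scores: Dict[str, float],
    voice_scores: Dict[str, float],
    face_weight: float = 0.5,
) -> str:
    """Return the emotion with the highest weighted-average score.

    An emotion missing from one source is treated as scoring 0.0 there.
    Ties are broken alphabetically so the result is deterministic.
    """
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be between 0 and 1")
    labels = set(face_scores) | set(voice_scores)
    if not labels:
        raise ValueError("no emotion scores were provided")
    combined = {
        label: face_weight * face_scores.get(label, 0.0)
        + (1.0 - face_weight) * voice_scores.get(label, 0.0)
        for label in labels
    }
    return max(sorted(combined), key=combined.get)


# Hypothetical outputs of the facial expression and voice tone analyses.
face = {"tired": 0.6, "joy": 0.3, "calm": 0.1}
voice = {"tired": 0.7, "calm": 0.3}
print(integrate_emotions(face, voice))
```

The weight lets one source dominate when the other is unreliable, for example when the face is partly out of the camera's view.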

[0755] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0756] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0757] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0758] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to an emotion map (see Figure 9), which is a specific mapping. Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0759] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0760] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0761] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).

[0762] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
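As a rough illustration of the balance idea in this paragraph, the following sketch compares how far a set of readings is from its ideal before and after a change. The function name `balance_feeling`, the use of a summed absolute deviation, and the assumption that all readings are normalized to a common 0 to 1 scale are illustrative choices, not details taken from the disclosure.

```python
def balance_feeling(previous, current, ideal):
    """Return 'pleasure' when the balances move toward the ideal,
    'discomfort' when they move away, and 'neutral' otherwise.

    All three arguments are dicts keyed by balance name (for example
    'posture' or 'battery'), with values assumed to be normalized to 0..1.
    """
    def total_deviation(state):
        # Summed absolute distance from the ideal (an assumed metric).
        return sum(abs(state[name] - ideal[name]) for name in ideal)

    change = total_deviation(current) - total_deviation(previous)
    if change < 0:
        return "pleasure"      # approaching the ideal
    if change > 0:
        return "discomfort"    # deviating from the ideal
    return "neutral"


ideal = {"posture": 0.0, "battery": 1.0}
before = {"posture": 0.3, "battery": 0.5}
after = {"posture": 0.1, "battery": 0.6}
feeling = balance_feeling(before, after, ideal)  # balances moved toward the ideal
```

A real system would likely weight each balance differently; equal weighting is used here only to keep the sketch short.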

[0763] The emotion map defines two emotions that promote learning. One is a negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is a positive emotion located around "desire" on the reaction side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0764] The emotion identification model 59 feeds the user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained on multiple training data sets, each of which is a combination of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example in which multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
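One possible way to obtain training targets in which nearby emotions have similar values is to soften a single labeled emotion with a distance-based kernel over map positions. The sketch below is an assumption offered for illustration: the Gaussian kernel, the `width` parameter, the function name `soft_targets`, and the coordinates are all hypothetical and are not specified in the disclosure.

```python
import math


def soft_targets(labeled, positions, width=0.5):
    """Build emotion values for one training example.

    labeled   : name of the emotion annotated for the example
    positions : dict mapping emotion name -> (x, y) on the map (hypothetical)
    width     : kernel width; larger values spread similarity farther
    """
    lx, ly = positions[labeled]
    targets = {}
    for name, (x, y) in positions.items():
        dist = math.hypot(x - lx, y - ly)
        # Gaussian falloff: 1.0 at the labeled emotion, smaller farther away.
        targets[name] = math.exp(-(dist ** 2) / (2 * width ** 2))
    return targets


# Illustrative coordinates only; they are not taken from Figure 9 or Figure 10.
positions = {
    "reassured": (0.8, 0.1),
    "calm": (0.8, 0.0),
    "confident": (0.9, 0.1),
    "repentance": (0.5, -0.8),
}
targets = soft_targets("reassured", positions)
```

With these assumed positions, "calm" and "confident" receive values close to that of "reassured", while the distant "repentance" receives a much lower value. A network trained to regress such targets would tend to produce similar values for neighboring emotions.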

[0765] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0766] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific process may be distributed across multiple computers, including the computer 22. For example, a data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.

[0767] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0768] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0769] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0770] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource for performing specific processing by executing software, that is, a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using that memory.

[0771] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0772] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0773] Furthermore, more specifically, electrical circuits that combine circuit elements such as semiconductor devices can be used as the hardware structure of these various processors. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as doing so does not deviate from the gist of the disclosure.

[0774] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0775] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0776] The following is further disclosed regarding the embodiments described above.

[0777] (Claim 1)

[0778] A means for receiving motion data acquired using user input means,

[0779] A means for analyzing the aforementioned motion data and recognizing its meaning,

[0780] A means for generating natural language based on the recognized semantic content,

[0781] A means of outputting a response sentence from the generated natural language,

[0782] A system that includes this.

[0783] (Claim 2)

[0784] The system according to claim 1, further comprising an algorithm for analyzing gesture input as motion data.

[0785] (Claim 3)

[0786] The system according to claim 1, further comprising means for presenting the response sentence to the user as an acoustic signal.
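The four means recited above (receiving motion data, recognizing its meaning, generating natural language, and outputting a response sentence) can be pictured as a simple pipeline. The following is a minimal sketch under stated assumptions: the gesture labels, the lookup tables, and the names `MotionData`, `recognize_meaning`, `generate_sentence`, and `respond` are hypothetical, and a lookup table stands in for whatever recognition and language generation models an actual implementation would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MotionData:
    """Motion data received from a user input means (hypothetical shape)."""
    gesture: str  # label assumed to come from an upstream gesture recognizer


# Hypothetical tables standing in for the analysis and generation models.
GESTURE_MEANINGS = {
    "wave": "greeting",
    "thumbs_up": "agreement",
    "point_at_cup": "request_drink",
}
SENTENCE_TEMPLATES = {
    "greeting": "Hello, it is nice to see you.",
    "agreement": "Yes, I agree.",
    "request_drink": "Could I have something to drink, please?",
}


def recognize_meaning(data: MotionData) -> Optional[str]:
    """Analyze the motion data and return its meaning, or None if unknown."""
    return GESTURE_MEANINGS.get(data.gesture)


def generate_sentence(meaning: Optional[str]) -> str:
    """Generate natural language from the recognized semantic content."""
    if meaning is None:
        return "The gesture could not be recognized."
    return SENTENCE_TEMPLATES[meaning]


def respond(data: MotionData, output: Callable[[str], None] = print) -> str:
    """Run the full pipeline and pass the response sentence to an output."""
    sentence = generate_sentence(recognize_meaning(data))
    output(sentence)  # could be a display or a speech synthesizer callback
    return sentence
```

Passing the output as a callback keeps the sketch neutral about whether the response sentence is shown on a display or presented as an acoustic signal.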

[0787] "Example 1"

[0788] (Claim 1)

[0789] A means for receiving gesture or text data acquired by the user's device,

[0790] A means for analyzing the aforementioned gesture or text data and identifying its meaning,

[0791] A means for generating natural language using a generative artificial intelligence model based on identified semantic content,

[0792] A means for outputting a response sentence from the generated natural language and presenting it using a terminal display device or speech synthesis means,

[0793] A system that includes this.

[0794] (Claim 2)

[0795] The system according to claim 1, further comprising an algorithm that uses a visual information processing library for analyzing the gesture or text data.

[0796] (Claim 3)

[0797] The system according to claim 1, further comprising means for outputting the response sentence as an audio signal.

[0798] "Application Example 1"

[0799] (Claim 1)

[0800] A means for receiving motion data acquired using user input means,

[0801] A means for analyzing the aforementioned motion data and recognizing its meaning,

[0802] A means for generating natural language based on the recognized semantic content,

[0803] A means of outputting a response sentence from the generated natural language,

[0804] Means for using an input device to recognize actions performed by the user,

[0805] A means for converting the aforementioned response sentence into an acoustic signal and presenting it as information,

[0806] Means to promote support in the field of long-term care,

[0807] A system that includes this.

[0808] (Claim 2)

[0809] The system according to claim 1, further comprising an algorithm for analyzing gesture input as motion data.

[0810] (Claim 3)

[0811] The system according to claim 1, further comprising means for presenting the response sentence to the user as an acoustic signal.

[0812] "Example 2 combined with an emotion engine"

[0813] (Claim 1)

[0814] Means for receiving motion information acquired using a user input device,

[0815] A means for analyzing the aforementioned motion information and recognizing its meaning and emotion,

[0816] A means for generating a response sentence using a natural language generation model based on the recognized semantic content and emotion,

[0817] A means for presenting the generated response sentence on a display device,

[0818] A system that includes this.

[0819] (Claim 2)

[0820] The system according to claim 1, comprising an algorithm for analyzing gesture input as motion information and recognizing emotions.

[0821] (Claim 3)

[0822] The system according to claim 1, further comprising speech synthesis technology for generating the aforementioned response sentence as an audio signal and presenting it to the user.
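When an emotion is recognized together with the meaning, the response sentence can be adjusted before it is displayed or synthesized. The sketch below shows one simple way this might look. The emotion labels, the tone names, the opening phrases, and the function `adapt_response` are all assumptions for illustration; the disclosure does not prescribe how emotion influences the wording.

```python
from typing import Tuple

# Hypothetical mapping from a recognized emotion to a speaking tone.
TONE_BY_EMOTION = {
    "anxiety": "gentle",
    "pleasure": "cheerful",
}

# Hypothetical opening phrases added for each tone.
OPENERS = {
    "gentle": "It's all right. ",
    "cheerful": "Great! ",
    "neutral": "",
}


def adapt_response(sentence: str, emotion: str) -> Tuple[str, str]:
    """Return the emotion-adapted sentence and the tone to use for speech.

    Unknown emotions fall back to a neutral tone and an unchanged sentence.
    """
    tone = TONE_BY_EMOTION.get(emotion, "neutral")
    return OPENERS[tone] + sentence, tone


adapted, tone = adapt_response("I will bring some water.", "anxiety")
```

The returned tone could be handed to a speech synthesizer as a style hint, while the adapted sentence is what would be shown on the display device.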

[0823] "Application Example 2 combined with an emotion engine"

[0824] (Claim 1)

[0825] A means for receiving motion data and emotion data acquired using user input means,

[0826] A means for analyzing the aforementioned motion data and emotion data to recognize semantic content and emotional state,

[0827] A means for generating natural language based on the recognized semantic content and emotional state,

[0828] A means of outputting response sentences from generated natural language and providing communication adapted to the user's emotions,

[0829] A system that includes this.

[0830] (Claim 2)

[0831] The system according to claim 1, further comprising an algorithm for analyzing gesture input as motion data and for performing emotion recognition.

[0832] (Claim 3)

[0833] The system according to claim 1, further comprising means for presenting the response sentence to the user as an acoustic signal and outputting it in an appropriate tone based on the user's emotions.

[Explanation of symbols]

[0834]
10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: means for receiving motion data acquired using user input means; means for analyzing the motion data and recognizing its meaning; means for generating natural language based on the recognized semantic content; and means for outputting a response sentence from the generated natural language.

2. The system according to claim 1, further comprising an algorithm for analyzing gesture input as motion data.

3. The system according to claim 1, further comprising means for presenting the response sentence to the user as an acoustic signal.