system

The system allows users to interact with digital assistants using head movements, addressing the limitations of conventional systems by providing a natural and efficient interaction method that maintains privacy and accessibility.

JP2026103507A · Pending · Publication Date: 2026-06-24 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-12
Publication Date
2026-06-24

AI Technical Summary

Technical Problem

Conventional systems fail to allow users to comfortably and effectively interact with digital assistants in situations where speaking or using hands is difficult, such as in crowded environments or quiet settings, and often pose privacy concerns and are challenging for users with disabilities.

Method used

A system utilizing a voice output device with a motion sensor that detects head movements, a motion analysis device to determine instructions, and a response generation device to provide feedback, enabling interaction through gestures without voice.

Benefits of technology

Enables efficient and natural interaction with digital assistants while maintaining privacy and accessibility in various environments, including quiet spaces, and facilitates hands-free operation.


Abstract

[Problem] To provide a system. [Solution] The system includes: an audio output device equipped with a motion sensor that detects head movements; a motion analysis device that analyzes motion data output from the motion sensor and determines instructions based on motion gestures; a response generation device that processes information based on the determined instructions and provides feedback to the user through the audio output device; and a processing means that acquires information using head movements while the user is moving and supports the updating of route information.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Users today often find themselves in situations where it is difficult to speak or to operate a device by hand, such as on a crowded train or in a quiet environment. In such situations, there is a need for a means of communicating comfortably and effectively with a digital assistant. However, conventional systems have not been able to fully meet this need. Therefore, it is necessary to provide a system that allows users to interact naturally with a digital assistant using gestures based on head movements, without making a sound.

Means for Solving the Problems

[0005] This invention provides a system including an audio output device equipped with a motion sensor that detects head movements, a motion analysis device that analyzes motion data and determines instructions based on motion gestures, and a response generation device that processes information based on those instructions and provides feedback of the results. With this system, users can interact with a digital assistant using only head movements, without using their voice, and obtain the necessary information and responses. This achieves high convenience and efficiency even in crowded environments.

[0006] An "audio output device" is a device provided to deliver audio information to a user, and is capable of transmitting audio signals to the user's ears.

[0007] A "motion sensor" is a sensor device that detects the movement of a user's head or other body parts and generates physical motion data.

[0008] "Motion data" refers to raw data acquired by motion sensors, which includes information about the user's head movements.

[0009] A "motion analysis device" is a device that analyzes motion data to recognize specific motion gestures and determines instructions based on the results.

[0010] "Motion gestures" are patterns of head movements that have specific meanings and are recognized by the system.

[0011] A "response generation device" is a device that performs appropriate information processing based on the judgment results of a motion analysis device and presents the results to the user.

[0012] "Information processing" refers to a series of data operations performed by a system based on the judgment results of a motion analysis device, and includes the process of generating a response to the user.

Brief Description of the Drawings

[0013]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, the numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0017] In the following embodiments, the numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0018] In the following embodiments, the numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, and the like.

[0019] In the following embodiments, the numbered communication I / F (Interface) is an interface including a communication processor and an antenna, etc. The communication I / F controls communication between multiple computers. Examples of communication standards applied to the communication I / F include wireless communication standards including 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark), and the like.

[0020] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] This invention relates to a system that utilizes head movements to effectively communicate with a digital assistant without using voice. Specifically, the user wears an earphone-type audio output device, which has a built-in motion sensor. The motion sensor detects the user's head movements in real time and generates motion data representing those movements.

[0035] The terminal receives this motion data and transmits it to the server wirelessly. The motion analysis device then analyzes the motion data on the server and compares it to pre-registered motion gesture patterns. Based on this comparison, the user's intention is determined as a simple answer such as "Yes" or "No," or as another more complex instruction. The determination result is used as input for information processing by the response generation device, and the information processing is performed accordingly.
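The comparison against pre-registered motion gesture patterns can be illustrated with a minimal sketch. The source does not specify the matching algorithm, so the nearest-template approach below is an assumption, as are the names `REGISTERED_PATTERNS` and `determine_instruction`, the (yaw rate, pitch rate) sample format, and the threshold value.

```python
from math import sqrt

# Hypothetical pre-registered patterns: each is a short sequence of
# (yaw_rate, pitch_rate) gyroscope samples in rad/s. Values are illustrative.
REGISTERED_PATTERNS = {
    "yes": [(0.0, 1.0), (0.0, -1.0), (0.0, 1.0), (0.0, -1.0)],  # nod: pitch oscillation
    "no": [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)],   # shake: yaw oscillation
}


def _distance(a, b):
    """Mean Euclidean distance between two equal-length sample sequences."""
    total = sum(
        sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
        for (x1, y1), (x2, y2) in zip(a, b)
    )
    return total / len(a)


def determine_instruction(motion_data, threshold=0.8):
    """Return the label of the closest registered pattern, or None if none is close enough."""
    best_label, best_dist = None, float("inf")
    for label, pattern in REGISTERED_PATTERNS.items():
        if len(pattern) != len(motion_data):
            continue  # this sketch only compares sequences of equal length
        d = _distance(motion_data, pattern)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


nod = [(0.1, 0.9), (0.0, -1.1), (-0.1, 1.0), (0.0, -0.9)]
shake = [(0.9, 0.0), (-1.0, 0.1), (1.1, 0.0), (-1.0, 0.0)]
still = [(0.0, 0.0)] * 4
```

Here `determine_instruction(nod)` selects "yes" and `determine_instruction(shake)` selects "no", while a motionless head matches nothing. A practical recognizer would also need to handle gestures of varying speed and length, which this fixed-length comparison does not attempt.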

[0036] The server then generates the results of the information processing as audio feedback and sends this response back to the user's terminal. The terminal plays the feedback audio to the user through the audio output device. This allows the user to interact with the digital assistant in a simple and natural way.

[0037] As a concrete example, consider a scenario where a user wants to control their music playlist. Suppose a user wearing earphones wants to increase the volume. The user moves their head from side to side, or performs another specific motion gesture, to convey this intention. The device detects this movement through motion sensors, and the server interprets it as a "volume increase command." The server then increases the volume based on this command and confirms the change to the user via voice feedback. Because the process executes quickly, the user can complete the operation efficiently.

[0038] The following describes the processing flow.

[0039] Step 1:

[0040] The device uses motion sensors built into the earphones to detect the user's head movements. The detected movements are recorded as real-time digital data.

[0041] Step 2:

[0042] The terminal packetizes the motion data and transmits it to the server via wireless communication. This communication is optimized to ensure that the motion data is transmitted accurately and without delay.

[0043] Step 3:

[0044] The server analyzes the received motion data. The analysis algorithm compares this data to known motion gesture patterns and determines the user's intent as a specific instruction such as "Yes" or "No".

[0045] Step 4:

[0046] The server executes the corresponding information processing based on the determination results obtained from the motion analysis. This includes specific actions such as music playback and volume adjustment.

[0047] Step 5:

[0048] The server generates the results of the information processing as audio data and prepares it for feedback to the user.

[0049] Step 6:

[0050] The feedback audio data from the server is sent to the terminal. The terminal plays the received audio data to the user via an audio output device (earphones).

[0051] Step 7:

[0052] The user can confirm whether the action was successful or whether the desired operation has been completed through the played audio feedback.
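Steps 2 through 6 above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the JSON payload layout, the function names, the yaw-rate thresholds, and the in-memory `state` dictionary are all hypothetical, and the feedback is returned as text where the described system would synthesize audio.

```python
import json

# Hypothetical player state held on the server; not part of the source.
state = {"volume": 5}


def terminal_pack(samples):
    """Step 2: serialize motion samples into a payload for wireless transfer."""
    return json.dumps({"type": "motion", "samples": samples}).encode("utf-8")


def analyze(samples):
    """Step 3: toy analysis. Assumes each sample is a yaw rate in rad/s;
    a sustained positive mean is read as 'volume_up', a negative one as 'volume_down'."""
    mean = sum(samples) / len(samples)
    if mean > 0.5:
        return "volume_up"
    if mean < -0.5:
        return "volume_down"
    return None


def execute(instruction):
    """Steps 4-5: perform the action and build the feedback text."""
    if instruction == "volume_up":
        state["volume"] += 1
        return "Volume increased"
    if instruction == "volume_down":
        state["volume"] -= 1
        return "Volume decreased"
    return "Gesture not recognized"


def server_handle(payload):
    """Steps 3-6 on the server: decode, analyze, execute, return feedback."""
    samples = json.loads(payload.decode("utf-8"))["samples"]
    return execute(analyze(samples))


feedback = server_handle(terminal_pack([0.9, 1.1, 0.8]))
```

Keeping the analysis and the execution in separate functions mirrors the split between the motion analysis device and the response generation device described in the text.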

[0053] (Example 1)

[0054] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0055] Traditional methods of interacting with digital assistants require the use of voice commands, which can raise privacy concerns in quiet environments. Furthermore, in environments where voice communication is unavailable, users may find it difficult to operate the assistant. Additionally, existing interfaces are often difficult for users with disabilities to use. There is a need to address these challenges and enable users to interact with digital assistants in a more natural and reliable way.

[0056] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0057] In this invention, the server includes data analysis means for analyzing motion data and determining instructions based on motion gestures; response generation means for performing information processing based on the determined instructions and generating the result as voice feedback; and feedback transmission means for transmitting the generated voice feedback via wireless communication. This allows the user to operate the digital assistant silently using head movements, enabling efficient use of the assistant while maintaining privacy even in quiet environments.

[0058] A "motion detection means" is a device that detects the user's head movements in real time, converts them into digital signals, and generates motion data.

[0059] "Transmission means" refers to a function for transmitting the generated motion data to other devices using wireless communication technology.

[0060] A "data analysis means" is a device that analyzes transmitted motion data and compares it with pre-registered motion patterns to determine the user's intent.

[0061] The "response generation means" is a function that performs information processing based on instructions determined by the data analysis means and generates the result as audio feedback.

[0062] The "feedback transmission means" is a function that transmits the generated audio feedback to the user's terminal using wireless communication.

[0063] A "feedback presentation means" is a function that communicates the results to the user via an audio output device, used to inform the user of the transmitted audio feedback.

[0064] Wireless communication is a technology that uses radio waves to send and receive data, and is a means of data communication without using cables.

[0065] This invention is a system that enables users to interact with a digital assistant naturally and effectively without using their voice. Specifically, the user uses an audio output device with a built-in motion sensor, such as earphones, which can detect head movements.

[0066] When a user puts on earphones, the terminal receives detection data from its motion sensor. The terminal transmits the motion data to the server using wireless communication. The server uses a motion analysis device to compare the received data with known motion patterns and analyze the user's intent. High-performance data analysis software is used for this motion analysis, enabling rapid and accurate analysis.

[0067] The server processes the information using a response generation device based on the analysis results and generates audio feedback. The generated audio feedback is returned to the user's terminal via a feedback transmission means and finally presented to the user through an audio output device. This allows the user to perform operations such as increasing the music volume or skipping songs with simple gestures, such as moving their head.

[0068] A concrete example is when a user wants to increase the volume of a music playlist. The user indicates this intention by moving their head from side to side, and the device detects this action and sends it to the server. The server recognizes the "increase volume" instruction, adjusts the volume, and sends confirmation audio feedback back to the user.

[0069] Examples of prompts used in the generative AI model include "How do you move your head to increase the volume?" and "What is the correct gesture to skip to the next song?", helping users to utilize the system more effectively.

[0070] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0071] Step 1:

[0072] The user wears an earphone-type audio output device. A motion sensor detects head movements and generates detection data in real time. This data shows the acceleration and angular changes when the user moves their head in a specific direction.

[0073] Step 2:

[0074] The terminal transmits motion data received from the motion sensor to the server via wireless communication. The data transmitted includes the type of motion (e.g., left-right movement), intensity, and duration.

[0075] Step 3:

[0076] The server receives motion data and compares it to pre-programmed motion patterns using a motion analysis device. The analysis extracts features from the data to determine the user's intended instruction (e.g., "increase volume"). This process applies a gesture recognition algorithm to interpret the intent of the action.

[0077] Step 4:

[0078] The server processes the information using a response generator based on the analysis results. Specifically, the server executes the corresponding action (for example, increasing the volume) according to the determined instructions. The result after processing is generated as audio feedback.

[0079] Step 5:

[0080] The server sends the generated audio feedback to the terminal. This feedback includes a confirmation message indicating that a specific action has been completed (for example, "Volume increased").

[0081] Step 6:

[0082] The terminal presents the received audio feedback to the user through the earphones. The audio output device plays the audio feedback, and the user confirms that their instructions have been processed correctly.

[0083] This allows users to intuitively operate the digital assistant through the system without using voice commands.
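Step 2 of this flow states that the transmitted data includes the type of motion, its intensity, and its duration. A minimal sketch of how raw samples might be reduced to those three fields is shown below; the `MotionSummary` record, the `summarize` function, the energy-based choice of dominant axis, and the 50 Hz sample rate are assumptions made for illustration and do not come from the source.

```python
from dataclasses import dataclass, asdict


@dataclass
class MotionSummary:
    motion_type: str   # e.g. "left-right" or "up-down"
    intensity: float   # peak absolute angular rate, rad/s
    duration_s: float  # length of the sampled window, seconds


def summarize(yaw_rates, pitch_rates, sample_rate_hz):
    """Reduce raw angular-rate samples to the three transmitted fields."""
    yaw_energy = sum(v * v for v in yaw_rates)
    pitch_energy = sum(v * v for v in pitch_rates)
    # The axis with more signal energy is taken as the dominant direction.
    if yaw_energy >= pitch_energy:
        motion_type, dominant = "left-right", yaw_rates
    else:
        motion_type, dominant = "up-down", pitch_rates
    return MotionSummary(
        motion_type=motion_type,
        intensity=max(abs(v) for v in dominant),
        duration_s=len(dominant) / sample_rate_hz,
    )


summary = summarize([1.2, -1.0, 1.1, -0.9], [0.1, 0.0, -0.1, 0.0], sample_rate_hz=50)
payload = asdict(summary)  # plain dict, ready to serialize for transmission
```

Sending a compact summary instead of every raw sample is one possible design; the source leaves open whether the terminal or the server performs this reduction.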

[0084] (Application Example 1)

[0085] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0086] Currently, it is difficult for users of public transportation to obtain information or change their travel plans without using their hands. Furthermore, using voice instructions can be challenging in noisy environments. Therefore, there is a need for efficient and rapid access to information while on board.

[0087] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0088] In this invention, the server includes means for recognizing the user's gestures using an audio output device equipped with a motion sensor that detects head movements, motion analysis means for determining instructions based on the motion gestures, and information processing means for processing information based on the determined instructions and feeding the results back to the user through the audio output device. This makes it possible for the user to acquire information and update route information using only head movements without using their hands.

[0089] "Head movement" refers to the physical movement of the user's head, which can be used to express the user's intentions.

[0090] A "motion sensor" is a device used to detect physical movement, and specifically refers to a sensor used to accurately capture the movement of a user's head.

[0091] An "audio output device" is a device that transmits processed information to the user as sound, and examples include earphones and speakers.

[0092] "Motion data" refers to data output from motion sensors, which quantifies the user's head movements.

[0093] A "motion analysis device" is a device that analyzes motion data and recognizes specific gesture patterns.

[0094] An "instruction" is a command for an operation or control that the system determines based on the user's gestures.

[0095] "Information processing" is the process of calculation and decision-making that a system performs to generate the optimal response based on the instructions it receives.

[0096] A "response generation device" is a device that provides feedback to the user in the form of voice or other formats based on the results of information processing.

[0097] "Route information" refers to information about the route a user needs in order to travel, and is particularly useful when using public transportation.

[0098] "Moving" refers to a state in which the user is physically changing their location.

[0099] "Processing means" refers to the functional elements of hardware or software necessary for a system to perform a specific function.

[0100] In implementing this system, the user uses an audio output device equipped with a motion sensor. The audio output device takes the form of an earphone and detects the user's head movements in real time. Motion data from this motion sensor is transmitted to a terminal and then transferred to a server via wireless communication.

[0101] The server includes a motion analysis device that analyzes the transmitted motion data. During the analysis, the user's instructions are determined by comparing them with pre-registered gesture patterns. For example, if the user makes a gesture indicating "Tell me the next stop," the server processes the information based on this determination. As a result of the information processing, a response generation device generates voice feedback, and the content of that feedback is sent back to the user through the terminal.

[0102] In this entire process, it is possible to generate voice feedback by utilizing the Google® Cloud Text-to-Speech API. Furthermore, the processing power of servers and smartphones is used to perform calculations for real-time information retrieval and route updates. This allows users to obtain information and adjust their routes hands-free while on public transport, enabling more efficient travel.

[0103] As a concrete example, suppose a user is on a bus and wants to know the route to their destination. In this case, by slightly moving their head from side to side, the instruction "Tell me the next stop" is transmitted to the server, and the next stop is announced via voice feedback.

[0104] For example, prompts for the generative AI model can be formalized as follows: "Based on the user's head movements, analyze predefined gestures silently and simulate a scenario for a smart city traffic information assistant system." This formalization can contribute to improving the system's accuracy.

[0105] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0106] Step 1:

[0107] The user wears an audio output device equipped with a motion sensor. The motion sensor detects the user's head movements in real time and outputs them as motion data. This data is collected by a sensor built into the earphones.

[0108] Step 2:

[0109] The terminal transmits the detected motion data to the server via wireless communication. Here, the terminal's role is to receive motion data from the motion sensor and accurately relay it to the server.

[0110] Step 3:

[0111] The server receives the motion data and analyzes it using a motion analysis device, comparing it to predefined gesture patterns. For example, a gesture of moving the head from side to side is interpreted as "Tell me the next stop." The input is motion data, and the output is the interpreted instruction. This process involves data comparison and pattern recognition calculations.

[0112] Step 4:

[0113] The server processes information based on the determined instruction. Here, it retrieves information about the next stop on the public transport route and generates the content of the voice feedback. The input is the determined instruction, and the output is the feedback information. This process includes searching for information online and generating the feedback data.

[0114] Step 5:

[0115] The server generates audio feedback using a response generator and sends the content as an audio signal to the terminal. For example, it sends audio data generated using the Google Cloud Text-to-Speech API.

[0116] Step 6:

[0117] The device provides the user with generated audio feedback through an audio output device. This allows the user to hear the response to their instructions and obtain necessary information hands-free, even on public transport.
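Steps 3 to 5 of this flow can be sketched as a lookup from gesture to instruction to feedback text. Everything named here is hypothetical: the stop list, the `GESTURE_TO_INSTRUCTION` mapping, and the functions `next_stop` and `build_feedback`. A real system would query a transit information source and pass the resulting text to a speech synthesis service such as the one the source mentions; neither call is shown.

```python
# Hypothetical route data; a real system would query a transit information service.
ROUTE_STOPS = ["Central Station", "City Hall", "Harbor Park", "Airport"]

# Hypothetical mapping from a recognized gesture to an instruction.
GESTURE_TO_INSTRUCTION = {"side_to_side": "tell_next_stop"}


def next_stop(current_stop):
    """Return the stop after current_stop, or None at the end of the route."""
    idx = ROUTE_STOPS.index(current_stop)
    if idx + 1 < len(ROUTE_STOPS):
        return ROUTE_STOPS[idx + 1]
    return None


def build_feedback(gesture, current_stop):
    """Map a gesture to an instruction and produce the text to be spoken."""
    instruction = GESTURE_TO_INSTRUCTION.get(gesture)
    if instruction != "tell_next_stop":
        return "Sorry, that gesture was not recognized."
    stop = next_stop(current_stop)
    if stop is None:
        return "This is the last stop."
    return f"The next stop is {stop}."
```

For example, `build_feedback("side_to_side", "City Hall")` yields the sentence announcing Harbor Park, which the terminal would then play through the earphones.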

[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0119] This invention relates to a system that provides an interface integrating gestures based on head movements with user emotion recognition. The system includes a voice output device equipped with a motion sensor, a motion analysis device, a response generation device, and an emotion engine. The emotion engine is responsible for generating more appropriate responses by analyzing the emotions indicated by the user's voice and movements.

[0120] The device detects the user's head movements via motion sensors through an earphone-type audio output device worn by the user. In addition, a built-in emotion engine analyzes the user's emotional state using the tone, tempo, and voice patterns of their voice. This data is used during information processing to improve the user experience.

[0121] The server analyzes motion data from motion sensors to confirm the user's intent and determine the appropriate instruction. Simultaneously, the emotion engine analyzes the acquired emotion data and evaluates the user's current emotional state. Based on this evaluation, the response generation device formulates an appropriate response corresponding to the determined instruction and emotional state.

[0122] As a concrete example, consider a case where a user wants relaxing music to alleviate stress they feel while listening to music. The user indicates a change of music with a gesture, and the emotion engine detects the user's stress level from their voice. A motion analysis device determines the "change music" instruction from the head movements, and the server processes it in combination with the emotion engine's detection result. A response generation device recommends music that helps reduce stress and plays it to the user through the earphones. In this way, the response is personalized to both the user's actions and their emotions.

[0123] The following describes the processing flow.

[0124] Step 1:

[0125] The device detects the user's head movements via motion sensors built into the earphones. Simultaneously, it captures audio signals to collect the user's voice.

[0126] Step 2:

[0127] The terminal transmits the acquired motion data and audio data to the server. The motion data is transmitted immediately via wireless communication, and the audio data is used for emotion analysis.

[0128] Step 3:

[0129] The server analyzes motion data and determines the user's intent based on motion gestures. Simultaneously, it analyzes voice data using an emotion engine to evaluate the user's emotional state.

[0130] Step 4:

[0131] Based on the results of the motion analysis and the emotion analysis, the server performs information processing suited to the user's state. For example, if the server determines that the user intends to change the music and is also feeling stressed, it selects music that is effective in reducing stress.

[0132] Step 5:

[0133] The server sends the response generated as a result of information processing (e.g., a new music track) to the terminal as audio output data.

[0134] Step 6:

[0135] The terminal plays the audio output data received from the server to the user through an audio output device. The user can confirm that the requested action has been performed by listening to this feedback.

[0136] Step 7:

[0137] Through the provided voice feedback, users can understand that their intentions have been reflected and that appropriate responses have been given, allowing them to continue further operations.
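As a non-limiting illustration, Steps 3 to 5 above (combining the determined instruction with the evaluated emotional state to choose a response) might be sketched as follows. The table contents, the label strings, and the names `RESPONSE_TABLE`, `AnalysisResult`, and `generate_response` are assumptions introduced for illustration and do not appear in the embodiment.

```python
from dataclasses import dataclass

# Hypothetical rule table: (instruction, emotion) -> response identifier.
# The entries are illustrative assumptions, not values from the embodiment.
RESPONSE_TABLE = {
    ("change_music", "stressed"): "play:relaxing_playlist",
    ("change_music", "neutral"): "play:next_track",
}


@dataclass(frozen=True)
class AnalysisResult:
    instruction: str  # output of the motion analysis device
    emotion: str      # output of the emotion engine


def generate_response(result: AnalysisResult) -> str:
    """Pick a response for the instruction, refined by emotion when a rule exists."""
    key = (result.instruction, result.emotion)
    if key in RESPONSE_TABLE:
        return RESPONSE_TABLE[key]
    # No emotion-specific rule: fall back to the neutral rule for the instruction.
    return RESPONSE_TABLE.get((result.instruction, "neutral"), "noop")
```

A lookup table is used here only to keep the sketch small; the embodiment leaves the selection logic open, and a learned model could take its place.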

[0138] (Example 2)

[0139] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0140] Existing voice output systems face the challenge of providing personalized responses based not only on user actions but also on emotions. In particular, conventional technologies fail to adequately understand user intent and provide real-time responses. As a result, the user experience may be limited, potentially leading to decreased satisfaction.

[0141] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0142] In this invention, the server includes a detection device means for detecting the user's head movements, an analysis device means for analyzing data from the detection device means and determining instructions based on the motion gestures, and an analysis device for analyzing the user's voice characteristics and evaluating their emotional state. This enables the integration of the user's actions and emotions to provide appropriate responses in real time, improving the personalized user experience.

[0143] A "detection device means" is a device equipped with the function of accurately sensing the user's head movements and extracting them as data.

[0144] The "analysis device means" is a device that analyzes data supplied by the detection device means and identifies instructions based on the user's gestures.

[0145] An "analysis device" is a device that analyzes the tone and patterns from the user's voice and has the function of evaluating their current emotional state.

[0146] A "generative AI model" is an artificial intelligence model that generates appropriate information processing and responses based on user behavior and emotional data.

[0147] A "response device means" is a device that outputs the results of information processing as sound and provides feedback to the user.

[0148] This invention is a system that analyzes the user's head movements and the emotional state conveyed by their voice to provide personalized responses. The system consists of a motion sensor, an emotion engine, and an analysis device, all integrated into an earphone-type audio output device.

[0149] The device uses motion sensors to detect the user's head movements. This information is sent to an analysis device and analyzed as a motion gesture. For example, a user can instruct the system to change the music by moving their head from side to side.

[0150] Furthermore, the device's built-in emotion engine analyzes the user's voice tone, tempo, and speech patterns to assess their emotional state. Emotion analysis is performed using a generative AI model, enabling a more accurate understanding of the user's emotions. For example, if a user is stressed, this state can be detected from their voice tone.

[0151] The server receives this behavioral and emotional data and processes each piece of data in an integrated manner. A generative AI model is used to process information based on the user's intentions and emotions. As a result of the processing, an appropriate response is generated and fed back to the user through the device. For example, if the server determines that the user is feeling stressed, it selects relaxing music and plays it through the earphones.

[0152] A concrete example is a user prompt such as, "I want to listen to relaxing music." The system can analyze this request and adjust the music it provides. In this way, it is possible to provide a highly personalized experience that takes into account the user's emotions and actions.

[0153] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0154] Step 1:

[0155] The user wears earphone-type audio output devices and enjoys media such as music. Simultaneously, a motion sensor built into the device detects the user's head movements. This motion data functions as input data to capture the user's intentions. The detected data is transmitted to a motion analysis device. Specifically, the user indicates music playback or selection by moving their head up and down or left and right.

[0156] Step 2:

[0157] The terminal's motion analysis device analyzes the transmitted head movement data and maps it to specific motion gestures. This analysis determines what kind of command the user issued. For example, if "up and down movement = skipping a song" is defined, the corresponding instruction is output. Specifically, an algorithm extracts the necessary information from the motion data and converts it into motion gestures.

[0158] Step 3:

[0159] Simultaneously, the device's emotion engine captures the user's voice and analyzes its tone, tempo, and speech patterns. Using a generative AI model, it evaluates the user's emotional state. This audio data becomes input for the emotion analysis, and the emotion engine outputs emotion labels such as "stressed." For example, it might detect a high-pitched, hurried voice pattern and classify it as "stressed."

[0160] Step 4:

[0161] The server integrates the motion gestures and emotion labels received from the terminal. Using a generative AI model, it analyzes this data and processes the information to match the user's intentions and emotions. For example, if the intention to "relax" is associated with an emotion label, the server will select relaxing music. In this process, motion gestures and emotion labels are input, and a specific response plan is generated and output.

[0162] Step 5:

[0163] The server, via a response device, sends the selected, appropriate music to the terminal and plays it through the user's earphones. The final output is carefully selected music or audio feedback. Specifically, the server performs signal conversion to execute the processing results as audio output. The user experiences a newly selected song playing through their earphones.
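As a non-limiting illustration of Step 3, the classification of a high-pitched, hurried voice pattern as "stressed" might be sketched as follows. Example 2 performs this evaluation with a generative AI model; the fixed-threshold rule below is a simplified stand-in, and the feature names, units, and threshold values are assumptions that would need calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceFeatures:
    mean_pitch_hz: float         # average fundamental frequency of the utterance
    syllables_per_second: float  # speaking tempo


# Illustrative thresholds only (assumptions, not values from the embodiment).
PITCH_THRESHOLD_HZ = 220.0
TEMPO_THRESHOLD_SPS = 5.0


def label_emotion(features: VoiceFeatures) -> str:
    """Return "stressed" for a high-pitched AND hurried voice, else "neutral"."""
    high_pitch = features.mean_pitch_hz > PITCH_THRESHOLD_HZ
    hurried = features.syllables_per_second > TEMPO_THRESHOLD_SPS
    return "stressed" if (high_pitch and hurried) else "neutral"
```

The returned emotion label corresponds to the label that the terminal passes to the server in Step 4.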

[0164] (Application Example 2)

[0165] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0166] Current personal robots and sound output devices have limited ability to adaptively respond based on user emotions and real-time actions. In particular, they are inadequate in supporting users in reducing stress and fatigue in their daily lives, and there is a need for more advanced interfaces to optimize the user experience.

[0167] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0168] In this invention, the server includes means for using an acoustic output device equipped with a motion sensor that detects head movements, means for using an analysis device that analyzes motion data and determines instructions based on motion gestures, and emotion analysis means that analyzes the characteristics of the user's voice and evaluates their emotional state. This enables the generation of personalized responses based on the user's actions and emotions.

[0169] A "motion sensor" is a device that detects the movement of a user's head and has the function of providing that motion data.

[0170] An "audio output device" is a device for transmitting sound or music to the user, and it is desirable that it be in a form that the user can wear.

[0171] An "analysis device" is a component that analyzes motion data and determines appropriate instructions based on the detected motion gestures.

[0172] A "response generation device" is a system that processes information based on analyzed instructions and emotional states, and then feeds the results back to the user through an acoustic output device.

[0173] "Emotion analysis means" is technology that analyzes characteristics such as the tone and tempo of a user's voice to evaluate the user's emotional state.

[0174] A "control mechanism" is a device that selects appropriate music or conversation based on the evaluated emotional state and provides adaptive responses to the user.

[0175] In an embodiment for carrying out the present invention, the system mainly consists of a server and a terminal. The terminal includes a motion sensor that detects the user's head movements and an acoustic output device that outputs sound. The motion sensor has the function of detecting the user's motion gestures and transmitting the data to the server. The acoustic output device is a device that provides voice feedback to the user.

[0176] The server is equipped with a motion analysis device and an emotion analysis means. The motion analysis device analyzes data acquired from the motion sensor using a deep learning model built with Keras to determine the user's intentions. The emotion analysis means uses spaCy and TensorFlow (registered trademark) to analyze the user's voice data and evaluate their emotional state. This enables the generation of more personalized responses based on emotions.

[0177] For example, if the user is tired, the emotion analysis means detects this, and the server selects relaxing music via the control mechanism and plays it to the user through the audio output device. At this time, the system takes the user's emotions into consideration and provides more appropriate music and conversation.

[0178] The generative AI model in this system selects the optimal response through prompts. An example of a prompt is, "If a user is feeling stressed, how should they choose relaxing music?" Based on this prompt, the AI generates a response. As a result, not only is the user experience improved, but behavior adapted to the user's emotions and actions can also be realized.
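As a non-limiting illustration, composing a prompt from the analyzed gesture and emotion and passing it to a generative AI model might be sketched as follows. The template wording and the names `build_prompt` and `select_response` are assumptions; the model is represented by an arbitrary callable because the embodiment does not fix a particular model interface.

```python
from typing import Callable

# Hypothetical prompt template; the wording is an assumption for illustration.
PROMPT_TEMPLATE = (
    "The user gave the instruction '{instruction}' by a head gesture, and their "
    "estimated emotional state is '{emotion}'. Suggest one suitable piece of "
    "music or a short spoken reply."
)


def build_prompt(instruction: str, emotion: str) -> str:
    """Fill the template with the analysis results."""
    return PROMPT_TEMPLATE.format(instruction=instruction, emotion=emotion)


def select_response(
    instruction: str, emotion: str, model: Callable[[str], str]
) -> str:
    """Send the prompt to the supplied model callable and return its answer."""
    return model(build_prompt(instruction, emotion))
```

Any text-in, text-out function can be supplied as `model`, which also makes the selection step straightforward to exercise with a stub in place of a real model.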

[0179] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0180] Step 1:

[0181] The device uses motion sensors to detect the user's head movements. The detected motion data is sent to the server. The input of this step is the user's motion gestures, and its output is the motion data transferred to the server.

[0182] Step 2:

[0183] The server analyzes the received motion data using a motion analysis device. The intent of the motion gesture is determined by a model using Keras. Here, gesture data from the motion sensor is used as input, and the analysis result is obtained as output.

[0184] Step 3:

[0185] The terminal records the user's voice using an acoustic output device and sends it to the server. The input here is the user's voice data, and the transmission to the server is the output.

[0186] Step 4:

[0187] The server analyzes the received voice data using emotion analysis tools. It evaluates the emotional state using spaCy and TensorFlow. In this process, the user's voice data is used as input, and the analyzed emotional information is the output.

[0188] Step 5:

[0189] The server combines the analysis results of the gestures and emotional information to generate an appropriate response using a generative AI model. It uses prompts to select the most suitable music or conversation. The input here is the analyzed gesture and emotional data, and the output is the response information to the user.

[0190] Step 6:

[0191] The terminal receives response information from the server and provides feedback to the user through an audio output device. Selected music or conversation is input to the output device, and its playback is the output.

[0192] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0193] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0194] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0195] [Second Embodiment]

[0196] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0197] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0198] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0199] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0200] The microphone 238 accepts instructions from the user 20 by receiving the voice uttered by the user 20. The microphone 238 captures the voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0201] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0202] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0203] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0204] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0205] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0206] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0207] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0208] This invention relates to a system that utilizes head movements to effectively communicate with a digital assistant without using voice. Specifically, the user wears an earphone-type audio output device, which has a built-in motion sensor. The motion sensor detects the user's head movements in real time and generates motion data representing those movements.

[0209] The terminal receives this motion data and transmits it to the server wirelessly. The motion analysis device then analyzes the motion data on the server and compares it to pre-registered motion gesture patterns. Based on this comparison, the user's intention is determined as a simple answer such as "Yes" or "No," or as another more complex instruction. The determination result is used as input for information processing by the response generation device, and the information processing is performed accordingly.

[0210] The server then generates the results of the information processing as audio feedback and sends this response back to the user's terminal. The terminal informs the user of the response through the feedback audio from its voice output device. This allows the user to interact with the digital assistant in a simple and natural way.

[0211] As a concrete example, consider a scenario where a user wants to manipulate their music playlist. Suppose a user wearing earphones wants to increase the volume. The user moves their head from side to side or performs another specific motion gesture to convey this intention. The device detects this movement through motion sensors, and the server interprets it as a "volume increase command." The server then increases the volume based on this command and confirms the change to the user via voice feedback. Because the process executes rapidly, the user can complete the operation efficiently.

[0212] The following describes the processing flow.

[0213] Step 1:

[0214] The device uses motion sensors built into the earphones to detect the user's head movements. The detected movements are recorded as real-time digital data.

[0215] Step 2:

[0216] The terminal packetizes the motion data and transmits it to the server via wireless communication. This communication is optimized to ensure that the motion data is transmitted accurately and without delay.

[0217] Step 3:

[0218] The server analyzes the received motion data. The analysis algorithm compares this data to known motion gesture patterns and determines the user's intent as a specific instruction such as "Yes" or "No".

[0219] Step 4:

[0220] The server executes the corresponding information processing based on the determination results obtained from the motion analysis. This includes specific actions such as music playback and volume adjustment.

[0221] Step 5:

[0222] The server generates the results of the information processing as audio data and prepares it for feedback to the user.

[0223] Step 6:

[0224] The feedback audio data from the server is sent to the terminal. The terminal plays the received audio data to the user via an audio output device (earphones).

[0225] Step 7:

[0226] The user can confirm whether the action was successful or whether the desired operation has been completed through the played audio feedback.
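As a non-limiting illustration of Step 3, comparing motion data with pre-registered gesture patterns to determine "Yes" or "No" might be sketched as follows. The sample layout (pitch and yaw angular rates), the registered pattern values, and both thresholds are assumptions introduced for illustration; the embodiment does not specify the comparison algorithm.

```python
import math
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (pitch_rate, yaw_rate) in deg/s; assumed layout

# Hypothetical pre-registered patterns: share of motion energy on each axis.
REGISTERED_PATTERNS = {
    "Yes": (1.0, 0.0),  # nod: energy concentrated on the pitch axis
    "No": (0.0, 1.0),   # shake: energy concentrated on the yaw axis
}
MIN_ENERGY = 100.0   # below this the window is treated as "no gesture"
MAX_DISTANCE = 0.5   # reject windows that match no pattern closely


def extract_features(samples: Sequence[Sample]) -> Optional[Tuple[float, float]]:
    """Return the normalized per-axis energy, or None if the head barely moved."""
    pitch_energy = sum(p * p for p, _ in samples)
    yaw_energy = sum(y * y for _, y in samples)
    total = pitch_energy + yaw_energy
    if total < MIN_ENERGY:
        return None
    return (pitch_energy / total, yaw_energy / total)


def determine_instruction(samples: Sequence[Sample]) -> Optional[str]:
    """Return the label of the nearest registered pattern, or None if none fits."""
    features = extract_features(samples)
    if features is None:
        return None
    label, distance = min(
        ((name, math.dist(features, pattern))
         for name, pattern in REGISTERED_PATTERNS.items()),
        key=lambda item: item[1],
    )
    return label if distance <= MAX_DISTANCE else None
```

Returning `None` for weak or ambiguous movement lets the server ignore incidental head motion rather than forcing every window into a "Yes" or "No" determination.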

[0227] (Example 1)

[0228] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0229] Traditional methods of interacting with digital assistants require the use of voice commands, which can raise privacy concerns in quiet environments. Furthermore, in environments where voice communication is unavailable, users may find it difficult to operate the assistant. Additionally, existing interfaces are often difficult for users with disabilities to use. There is a need to address these challenges and enable users to interact with digital assistants in a more natural and reliable way.

[0230] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0231] In this invention, the server includes data analysis means for analyzing motion data and determining instructions based on motion gestures; response generation means for performing information processing based on the determined instructions and generating the result as voice feedback; and feedback transmission means for transmitting the generated voice feedback via wireless communication. This allows the user to operate the digital assistant silently using head movements, enabling efficient use of the assistant while maintaining privacy even in quiet environments.

[0232] A "motion detection means" is a device that detects the user's head movements in real time, converts them into digital signals, and generates motion data.

[0233] "Transmission means" refers to a function for transmitting the generated motion data to other devices using wireless communication technology.

[0234] A "data analysis means" is a device that analyzes transmitted motion data and compares it with pre-registered motion patterns to determine the user's intent.

[0235] The "response generation means" is a function that performs information processing based on instructions determined by the data analysis means and generates the result as audio feedback.

[0236] The "feedback transmission means" is a function that transmits the generated audio feedback to the user's terminal using wireless communication.

[0237] A "feedback presentation means" is a function that presents the transmitted audio feedback to the user via an audio output device.

[0238] Wireless communication is a technology that uses radio waves to send and receive data, and is a means of data communication without using cables.

[0239] This invention is a system that enables users to interact with a digital assistant naturally and effectively without using their voice. Specifically, the user uses an audio output device with a built-in motion sensor, such as earphones, which can detect head movements.

[0240] When a user puts on earphones, the terminal receives detection data from its motion sensor. The terminal transmits the motion data to the server using wireless communication. The server uses a motion analysis device to compare the received data with known motion patterns and analyze the user's intent. High-performance data analysis software is used for this motion analysis, enabling rapid and accurate analysis.

[0241] The server processes the information using a response generation device based on the analysis results and generates audio feedback. The generated audio feedback is returned to the user's terminal via a feedback transmission means and finally presented to the user through an audio output device. This allows the user to perform operations such as increasing the music volume or skipping songs with simple gestures, such as moving their head.

[0242] A concrete example is when a user wants to increase the volume of a music playlist. The user indicates this intention by moving their head from side to side, and the device detects this action and sends it to the server. The server recognizes the "increase volume" instruction, adjusts the volume, and sends confirmation audio feedback back to the user.

[0243] Examples of prompts used in the generative AI model include "How do you move your head to increase the volume?" and "What is the correct gesture to skip to the next song?", helping users to utilize the system more effectively.

[0244] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0245] Step 1:

[0246] The user wears an earphone-type audio output device. A motion sensor detects head movements and generates detection data in real time. This data shows the acceleration and angular changes when the user moves their head in a specific direction.

[0247] Step 2:

[0248] The terminal transmits motion data received from the motion sensor to the server via wireless communication. The data transmitted includes the type of motion (e.g., left-right movement), intensity, and duration.

[0249] Step 3:

[0250] The server receives motion data and compares it to pre-programmed motion patterns using a motion analysis device. The analysis extracts features from the data to determine the user's intended instruction (e.g., "increase volume"). This process applies a gesture recognition algorithm to interpret the intent of the action.

[0251] Step 4:

[0252] The server processes the information using a response generator based on the analysis results. Specifically, the server executes the corresponding action (for example, increasing the volume) according to the determined instructions. The result after processing is generated as audio feedback.

[0253] Step 5:

[0254] The server sends the generated audio feedback to the terminal. This feedback includes a confirmation message indicating that a specific action has been completed (for example, "Volume increased").

[0255] Step 6:

[0256] The device presents the received audio feedback to the user through the earphones. The audio output device plays the audio feedback, and the user confirms that their instructions have been processed correctly.

[0257] This allows users to intuitively operate the digital assistant through the system without using voice commands.
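As a non-limiting illustration of Steps 2 to 5, the motion data sent by the terminal (type of motion, intensity, and duration) and the server-side handling that yields the "Volume increased" confirmation might be sketched as follows. The JSON encoding, the field names and units, the instruction table, and the fallback message are assumptions introduced for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MotionData:
    motion_type: str   # e.g. "left_right"
    intensity: float   # peak angular velocity in deg/s (assumed unit)
    duration_ms: int   # length of the gesture in milliseconds


def encode(data: MotionData) -> bytes:
    """Terminal side: serialize motion data for wireless transmission."""
    return json.dumps(asdict(data), separators=(",", ":")).encode("utf-8")


def decode(payload: bytes) -> MotionData:
    """Server side: restore the motion data from the received payload."""
    return MotionData(**json.loads(payload.decode("utf-8")))


# Hypothetical table: motion type -> (instruction, confirmation message).
INSTRUCTIONS = {"left_right": ("increase_volume", "Volume increased")}


def handle(payload: bytes) -> str:
    """Determine the instruction and return the confirmation text to be spoken."""
    data = decode(payload)
    entry = INSTRUCTIONS.get(data.motion_type)
    if entry is None:
        return "Gesture not recognized"  # assumed fallback wording
    _instruction, confirmation = entry
    return confirmation
```

For example, `handle(encode(MotionData("left_right", 120.0, 400)))` yields the confirmation text that the terminal would play back in Step 6.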

[0258] (Application Example 1)

[0259] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0260] Currently, it is difficult for users of public transportation to obtain information or change their travel plans without using their hands. Furthermore, using voice instructions can be challenging in noisy environments. Therefore, there is a need for efficient and rapid access to information while on board.

[0261] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0262] In this invention, the server includes means for recognizing the user's gestures using an audio output device equipped with a motion sensor that detects head movements, motion analysis means for determining instructions based on the motion gestures, and information processing means for processing information based on the determined instructions and feeding the results back to the user through the audio output device. This makes it possible for the user to acquire information and update route information using only head movements without using their hands.

[0263] "Head movement" refers to the physical movement of the user's head, which can be used to express the user's intentions.

[0264] A "motion sensor" is a device used to detect physical movement, and specifically refers to a sensor used to accurately capture the movement of a user's head.

[0265] An "audio output device" is a device that transmits processed information to the user as sound, and examples include earphones and speakers.

[0266] "Motion data" refers to data output from motion sensors, which quantifies the user's head movements.

[0267] A "motion analysis device" is a device that analyzes motion data and recognizes specific gesture patterns.

[0268] An "instruction" is a command for an operation or control that the system determines based on the user's gestures.

[0269] "Information processing" is the process of calculation and decision-making that a system performs to generate the optimal response based on the instructions it receives.

[0270] A "response generation device" is a device that provides feedback to the user in the form of voice or other formats based on the results of information processing.

[0271] "Route information" refers to information about the route needed for a user's travel, and is particularly useful when using public transportation.

[0272] "Moving" refers to a state in which the user is physically changing their location.

[0273] "Processing means" refers to the functional elements of hardware or software necessary for a system to perform a specific function.

[0274] In implementing this system, the user uses an audio output device equipped with a motion sensor. The audio output device takes the form of an earphone and detects the user's head movements in real time. Motion data from this motion sensor is transmitted to a terminal and then transferred to a server via wireless communication.

[0275] The server includes a motion analysis device that analyzes the transmitted motion data. During the analysis, the user's instruction is determined by comparing the motion data with pre-registered gesture patterns. For example, if the user makes a gesture indicating "Tell me the next stop," the server processes the information based on this determination. As a result of the information processing, a response generation device generates voice feedback, and the content of that feedback is sent back to the user through the terminal.

[0276] In this entire process, the Google Cloud Text-to-Speech API can be used to generate voice feedback. Furthermore, the processing power of servers and smartphones is utilized to perform calculations for real-time information retrieval and route updates. This allows users to obtain information and adjust their routes hands-free while on public transport, enabling more efficient travel.

[0277] As a concrete example, suppose a user is on a bus and wants to know the route to their destination. In this case, by slightly moving their head from side to side, the instruction "Tell me the next stop" is transmitted to the server, and the next stop is announced via voice feedback.
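As a non-limiting illustration, the comparison with pre-registered gesture patterns described above might be sketched as simple template matching. The pattern table, the (yaw rate, pitch rate) sample format, the instruction names, and the distance threshold below are all assumptions made for this sketch and are not specified in this disclosure.

```python
from math import sqrt

# Hypothetical pre-registered gesture patterns: each is a short sequence of
# (yaw_rate, pitch_rate) samples in deg/s. The values are illustrative only.
GESTURE_PATTERNS = {
    "TELL_NEXT_STOP": [(40, 0), (-40, 0), (40, 0), (-40, 0)],  # side to side
    "CONFIRM": [(0, 40), (0, -40), (0, 40), (0, -40)],          # up and down
}


def distance(a, b):
    """Mean Euclidean distance between two equal-length sample sequences."""
    total = sum(
        sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
        for (x1, y1), (x2, y2) in zip(a, b)
    )
    return total / len(a)


def determine_instruction(motion_data, max_distance=25.0):
    """Return the closest registered instruction, or None if nothing is close."""
    best_name, best_dist = None, float("inf")
    for name, pattern in GESTURE_PATTERNS.items():
        if len(pattern) != len(motion_data):
            continue  # this sketch only compares sequences of equal length
        d = distance(motion_data, pattern)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= max_distance else None
```

A slightly noisy side-to-side movement such as `[(35, 3), (-42, -2), (38, 1), (-37, 4)]` is matched to the hypothetical "TELL_NEXT_STOP" pattern, while a motionless head yields no instruction. An actual implementation would likely need time alignment and per-user calibration, which this sketch omits.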

[0278] As an example of a prompt for the generative AI model, formalizing it as "Based on the movement of the user's head, analyze the pre-defined gestures without words and simulate the scenario of a traffic information assistant system for smart cities." can contribute to improving the accuracy of the system.

[0279] The flow of the specific process in Application Example 1 will be described using FIG. 12.

[0280] Step 1:

[0281] The user wears a voice output device equipped with a motion sensor. The motion sensor detects the movement of the user's head in real time and outputs this as motion data. This data is collected by the built-in sensor of the earphone.

[0282] Step 2:

[0283] The terminal transmits the detected motion data to the server via wireless communication. Here, the role of the terminal is to receive the motion data from the motion sensor and accurately relay it to the server.

[0284] Step 3:

[0285] The motion analysis device of the server receives the motion data. The server analyzes this data and compares it with the pre-defined gesture patterns. For example, a gesture of moving the head horizontally is determined to mean "Tell me the next stop". The input is the motion data, and the output is the determined instruction. This step performs data comparison and pattern recognition.

[0286] Step 4:

[0287] The server processes information based on the determined instruction. Here, it retrieves information about the next stop from public transport information and generates the content of the voice feedback. The input is the determined instruction, and the output is the feedback information. This process includes searching for online information and generating the feedback data.

[0288] Step 5:

[0289] The server generates audio feedback using a response generator and sends the content as an audio signal to the terminal. For example, it sends audio data generated using the Google Cloud Text-to-Speech API.

[0290] Step 6:

[0291] The device provides the user with generated audio feedback through an audio output device. This allows the user to hear the response to their instructions and obtain necessary information hands-free, even on public transport.
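As a non-limiting illustration, Steps 4 and 5 above might be sketched as follows. The route data, the instruction name, and the function names are hypothetical. The speech synthesizer is passed in as a callable so that a service such as the Google Cloud Text-to-Speech API could be substituted; no call to that API is shown here.

```python
from typing import Callable, Optional

# Hypothetical route data; a real system would query a transit information source.
ROUTE = ["Central Station", "City Hall", "Riverside Park"]


def next_stop(route: list, current_index: int) -> Optional[str]:
    """Return the stop after current_index, or None at the end of the route."""
    if 0 <= current_index < len(route) - 1:
        return route[current_index + 1]
    return None


def process_instruction(instruction: str, current_index: int) -> str:
    """Step 4: turn a determined instruction into feedback text."""
    if instruction == "TELL_NEXT_STOP":
        stop = next_stop(ROUTE, current_index)
        return f"The next stop is {stop}." if stop else "This is the last stop."
    return "Sorry, that gesture was not recognized."


def generate_feedback(text: str, synthesize: Callable[[str], bytes]) -> bytes:
    """Step 5: convert feedback text to audio using an injected synthesizer."""
    return synthesize(text)


def fake_synthesize(text: str) -> bytes:
    """Stand-in synthesizer for this sketch; returns the text as UTF-8 bytes."""
    return text.encode("utf-8")
```

For example, `generate_feedback(process_instruction("TELL_NEXT_STOP", 0), fake_synthesize)` produces the bytes of "The next stop is City Hall.", which the terminal would then play in Step 6.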

[0292] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0293] This invention relates to a system that provides an interface integrating gestures based on head movements with user emotion recognition. The system includes a voice output device equipped with a motion sensor, a motion analysis device, a response generation device, and an emotion engine. The emotion engine is responsible for generating more appropriate responses by analyzing the emotions indicated by the user's voice and movements.

[0294] The device detects the user's head movements via motion sensors through an earphone-type audio output device worn by the user. In addition, a built-in emotion engine analyzes the user's emotional state using the tone, tempo, and voice patterns of their voice. This data is used during information processing to improve the user experience.

[0295] The server analyzes motion data from motion sensors to confirm the user's intent and determine the appropriate instruction. Simultaneously, the emotion engine analyzes the acquired emotion data and evaluates the user's current emotional state. Based on this evaluation, the response generation device formulates an appropriate response corresponding to the determined instruction and emotional state.

[0296] As a concrete example, consider a case where a user chooses relaxing music to alleviate stress they feel while listening to music. The user can indicate a change of music with a gesture, and the emotion engine detects the user's stress level from their voice. A motion analysis device determines the "change music" instruction from head movements and processes it on a server in combination with the emotion engine's detection. A response generation device recommends music that helps reduce stress and plays it to the user through earphones. In this way, combining the user's actions and emotions enables a personalized response.

[0297] The following describes the processing flow.

[0298] Step 1:

[0299] The device detects the user's head movements via motion sensors built into the earphones. Simultaneously, it captures audio signals to collect the user's voice.

[0300] Step 2:

[0301] The terminal transmits the acquired motion data and audio data to the server. In this process, motion data is transmitted immediately via wireless communication, and audio data is used for emotion analysis.

[0302] Step 3:

[0303] The server analyzes the motion data and determines the user's intention based on the motion gesture. At the same time, the server analyzes the voice data with an emotion engine to evaluate the user's emotional state.

[0304] Step 4:

[0305] Based on the results of the motion analysis and the emotion analysis, the server performs optimal information processing according to the user's state. For example, if the user intends to change the music and is determined to be feeling stressed, the server selects music that is effective in reducing stress.

[0306] Step 5:

[0307] The server transmits the response (e.g., a new music track) generated as a result of the information processing to the terminal as voice output data.

[0308] Step 6:

[0309] The terminal plays the voice output data received from the server to the user through a voice output device. The user can confirm that the requested action has been executed by listening to the feedback.

[0310] Step 7:

[0311] The user can understand that their intention has been reflected and appropriate actions have been taken according to the situation through the provided voice feedback, and can continue with further operations.
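As a non-limiting illustration, the combination of the motion analysis result and the emotion analysis result in Step 4 might be sketched as a lookup over a music catalog. The catalog, the mood tags, the emotion labels, and the instruction name "CHANGE_MUSIC" are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Track:
    title: str
    mood: str  # hypothetical mood tag, e.g. "calm" or "upbeat"


# Hypothetical catalog and emotion-to-mood mapping; both are illustrative only.
CATALOG = [
    Track("Morning Run", "upbeat"),
    Track("Quiet Harbor", "calm"),
    Track("Slow Rain", "calm"),
]
PREFERRED_MOOD = {"stressed": "calm", "tired": "calm", "neutral": "upbeat"}


def select_response(instruction: str, emotion: str, current: Track) -> Optional[Track]:
    """Pick a new track when the gesture asks for a change, guided by emotion."""
    if instruction != "CHANGE_MUSIC":
        return None  # other instructions are outside the scope of this sketch
    # Fall back to the current track's mood when the emotion label is unknown.
    mood = PREFERRED_MOOD.get(emotion, current.mood)
    for track in CATALOG:
        if track.mood == mood and track != current:
            return track
    return None
```

With this table, a "change music" gesture from a user judged to be stressed while "Morning Run" is playing selects "Quiet Harbor". A generative AI model could replace the fixed mapping; the table form is used here only to keep the sketch self-contained.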

[0312] (Example 2)

[0313] Next, Example 2 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0314] Existing voice output systems face the challenge of providing personalized responses based not only on user actions but also on emotions. In particular, conventional technologies fail to adequately understand user intent and provide real-time responses. As a result, the user experience may be limited, potentially leading to decreased satisfaction.

[0315] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0316] In this invention, the server includes a detection device means for detecting the user's head movements, an analysis device means for analyzing data from the detection device means and determining instructions based on the motion gestures, and an analysis device for analyzing the user's voice characteristics and evaluating their emotional state. This enables the integration of the user's actions and emotions to provide appropriate responses in real time, improving the personalized user experience.

[0317] A "detection device means" is a device equipped with the function of accurately sensing the user's head movements and extracting them as data.

[0318] The "analysis device means" is a device that analyzes data supplied by the detection device means and identifies instructions based on the user's gestures.

[0319] An "analysis device" is a device that analyzes the tone and patterns from the user's voice and has the function of evaluating their current emotional state.

[0320] A "generative AI model" is an artificial intelligence model that generates appropriate information processing and responses based on user behavior and emotional data.

[0321] A "response device means" is a device that outputs the results of information processing as sound and provides feedback to the user.

[0322] This invention is a system that analyzes the user's head movements and the emotional state conveyed by their voice to provide personalized responses. The system consists of a motion sensor, an emotion engine, and an analysis device, all integrated into an earphone-type audio output device.

[0323] The device uses motion sensors to detect the user's head movements. This information is sent to an analysis device and analyzed as a motion gesture. For example, a user can instruct the system to change the music by moving their head from side to side.

[0324] Furthermore, the device's built-in emotion engine analyzes the user's voice tone, tempo, and speech patterns to assess their emotional state. Emotion analysis is performed using a generative AI model, enabling a more accurate understanding of the user's emotions. For example, if a user is stressed, this state can be detected from their voice tone.

[0325] The server receives this behavioral and emotional data and processes each piece of data in an integrated manner. A generative AI model is used to process information based on the user's intentions and emotions. As a result of the processing, an appropriate response is generated and fed back to the user through the device. For example, if the server determines that the user is feeling stressed, it selects relaxing music and plays it through the earphones.

[0326] A concrete example is a user prompt such as, "I want to listen to relaxing music." The system can analyze this request and adjust the music it provides. In this way, it is possible to provide a highly personalized experience that takes into account the user's emotions and actions.

[0327] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0328] Step 1:

[0329] The user wears earphone-type audio output devices and enjoys media such as music. Simultaneously, a motion sensor built into the device detects the user's head movements. This motion data functions as input data to capture the user's intentions. The detected data is transmitted to a motion analysis device. Specifically, the user indicates music playback or selection by moving their head up and down or left and right.

[0330] Step 2:

[0331] The terminal's motion analysis device analyzes the transmitted head movement data and maps it to specific motion gestures. This analysis determines what kind of command the user issued. For example, if "up and down movement = skipping a song" is defined, the corresponding instruction will be output. Specifically, an algorithm is implemented to convert motion data into motion gestures, and the necessary information is extracted and the data is converted.

[0332] Step 3:

[0333] Simultaneously, the device's emotion engine captures the user's voice and analyzes its tone, tempo, and speech patterns. Using a generative AI model, it evaluates the user's emotional state. This audio data becomes input for the emotion analysis, and the emotion engine outputs emotion labels such as "stressed." For example, it might detect a high-pitched, hurried voice pattern and classify it as "stressed."

[0334] Step 4:

[0335] The server integrates the motion gestures and emotion labels received from the terminal. Using a generative AI model, it analyzes this data and processes the information to match the user's intentions and emotions. For example, if the intention to "relax" is associated with an emotion label, the server will select relaxing music. In this process, motion gestures and emotion labels are input, and a specific response plan is generated and output.

[0336] Step 5:

[0337] The server, via a response device, sends the selected, appropriate music to the terminal and plays it through the user's earphones. The final output is carefully selected music or audio feedback. Specifically, the server performs signal conversion to execute the processing results as audio output. The user experiences a newly selected song playing through their earphones.
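As a non-limiting illustration, Steps 2 to 4 above might be sketched as follows. The emotion engine is described as using a generative AI model; the rule-based function below is only a stand-in so that the sketch is self-contained. The thresholds, the gesture names, and the plan fields are assumptions.

```python
# Hypothetical gesture-to-command table, following the "up and down movement =
# skipping a song" example in Step 2.
GESTURE_COMMANDS = {"up_down": "SKIP_SONG", "left_right": "CHANGE_MUSIC"}


def label_emotion(mean_pitch_hz: float, syllables_per_sec: float) -> str:
    """Rule-based stand-in for the emotion engine; thresholds are assumed."""
    # A high-pitched, hurried voice is labeled "stressed", as in Step 3.
    if mean_pitch_hz > 220.0 and syllables_per_sec > 5.0:
        return "stressed"
    return "neutral"


def make_response_plan(gesture: str, emotion_label: str) -> dict:
    """Step 4: integrate the motion gesture and the emotion label into a plan."""
    command = GESTURE_COMMANDS.get(gesture, "NONE")
    plan = {"command": command, "emotion": emotion_label, "music_style": None}
    if command != "NONE":
        plan["music_style"] = "relaxing" if emotion_label == "stressed" else "any"
    return plan
```

For example, an up-and-down gesture combined with the label "stressed" yields a plan to skip the song and choose relaxing music, which the server would then carry out in Step 5.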

[0338] (Application Example 2)

[0339] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0340] Current personal robots and sound output devices have limited ability to adaptively respond based on user emotions and real-time actions. In particular, they are inadequate in supporting users in reducing stress and fatigue in their daily lives, and there is a need for more advanced interfaces to optimize the user experience.

[0341] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0342] In this invention, the server includes means for using an acoustic output device equipped with a motion sensor that detects head movements, means for using an analysis device that analyzes motion data and determines instructions based on motion gestures, and means for using emotion analysis means that analyze the characteristics of the user's voice and evaluate their emotional state. This enables the generation of personalized responses based on the user's actions and emotions.

[0343] A "motion sensor" is a device that detects the movement of a user's head and has the function of providing that motion data.

[0344] An "audio output device" is a device for transmitting sound or music to the user, and it is desirable that it be in a form that the user can wear.

[0345] An "analysis device" is a component that analyzes motion data and determines appropriate instructions based on the detected motion gestures.

[0346] A "response generation device" is a system that processes information based on analyzed instructions and emotional states, and then feeds the results back to the user through an acoustic output device.

[0347] "Emotional analysis techniques" are technologies that analyze characteristics such as the tone and tempo of a user's voice to evaluate the user's emotional state.

[0348] A "control mechanism" is a device that selects appropriate music or conversation based on the evaluated emotional state and provides adaptive responses to the user.

[0349] In an embodiment for carrying out the present invention, the system mainly consists of a server and a terminal. The terminal includes a motion sensor that detects the user's head movements and an acoustic output device that outputs sound. The motion sensor has the function of detecting the user's motion gestures and transmitting the data to the server. The acoustic output device is a device that provides voice feedback to the user.

[0350] The server is equipped with a motion analysis device and an emotion analysis device, and fulfills these roles. The motion analysis device analyzes data acquired from motion sensors using a deep learning model based on Keras to analyze the user's intentions. The emotion analysis device uses spaCy and TensorFlow to analyze the user's voice data and reveal their emotional state. This enables the generation of more personalized responses based on emotions.

[0351] For example, if the user is tired, the emotion analysis system detects this, and the server selects relaxing music via the control system and plays it to the user through the audio output device. At this time, the system takes the user's emotions into consideration and provides more appropriate music and conversation.

[0352] The generative AI model in this system selects the optimal response through prompts. An example of a prompt is, "If a user is feeling stressed, how should they choose relaxing music?" Based on this prompt, the AI intelligently generates a response. As a result, not only is the user experience improved, but specifications that are adapted to the user's emotions and actions can be realized.

[0353] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0354] Step 1:

[0355] The device uses motion sensors to detect the user's head movements. The detected motion data is sent to the server. This input data includes the user's motion gestures, and its output is data transfer to the server.

[0356] Step 2:

[0357] The server analyzes the received motion data using a motion analysis device. The intent of the motion gesture is determined by a model using Keras. Here, gesture data from the motion sensor is used as input, and the analysis result is obtained as output.

[0358] Step 3:

[0359] The terminal records the user's voice using an acoustic output device and sends it to the server. The input here is the user's voice data, and the transmission to the server is the output.

[0360] Step 4:

[0361] The server analyzes the received voice data using emotion analysis tools. It evaluates the emotional state using spaCy and TensorFlow. In this process, the user's voice data is used as input, and the analyzed emotional information is the output.

[0362] Step 5:

[0363] The server combines the analysis results of the gestures and emotional information to generate an appropriate response using a generative AI model. It uses prompts to select the most suitable music or conversation. The input here is the analyzed gesture and emotional data, and the output is the response information to the user.

[0364] Step 6:

[0365] The terminal receives response information from the server and provides feedback to the user through an audio output device. Selected music or conversation is input to the output device, and its playback is the output.
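As a non-limiting illustration, the prompt-based response generation in Step 5 might be sketched as follows. The prompt wording, the function names, and the stand-in model are assumptions; the generative AI model is represented by an injected callable rather than by any particular service.

```python
from typing import Callable

# Hypothetical prompt template combining the analyzed gesture and emotion.
PROMPT_TEMPLATE = (
    "The user made a gesture meaning '{intent}' and their voice suggests they "
    "are feeling {emotion}. Suggest one piece of music or a short remark that "
    "suits this state."
)


def build_prompt(intent: str, emotion: str) -> str:
    """Fill the template with the results of Steps 2 and 4."""
    return PROMPT_TEMPLATE.format(intent=intent, emotion=emotion)


def generate_response(intent: str, emotion: str, model: Callable[[str], str]) -> dict:
    """Step 5: send the prompt to the model and return the response information."""
    prompt = build_prompt(intent, emotion)
    return {"prompt": prompt, "response": model(prompt)}


def echo_model(prompt: str) -> str:
    """Stand-in model for this sketch; a real model would generate free text."""
    return "relaxing music" if "stressed" in prompt else "no change"
```

Passing the model as a parameter keeps the prompt construction testable on its own and leaves the choice of model open.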

[0366] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0367] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0368] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0369] [Third Embodiment]

[0370] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0371] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0372] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0373] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0374] The microphone 238 receives instructions and other input from the user 20 by capturing the user's voice. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0375] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0376] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0377] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0378] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0379] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0380] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0381] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0382] This invention relates to a system that utilizes head movements to effectively communicate with a digital assistant without using voice. Specifically, the user wears an earphone-type audio output device, which has a built-in motion sensor. The motion sensor detects the user's head movements in real time and generates motion data representing those movements.

[0383] The terminal receives this motion data and transmits it to the server wirelessly. The motion analysis device then analyzes the motion data on the server and compares it to pre-registered motion gesture patterns. Based on this comparison, the user's intention is determined as a simple answer such as "Yes" or "No," or as another more complex instruction. The determination result is used as input for information processing by the response generation device, and the information processing is performed accordingly.

[0384] The server then generates the results of the information processing as audio feedback and sends this response back to the user's terminal. The terminal informs the user of the response through the feedback audio from its voice output device. This allows the user to interact with the digital assistant in a simple and natural way.

[0385] As a concrete example, consider a scenario where a user wants to manipulate their music playlist. Suppose a user wearing earphones indicates they want to increase the volume using a gesture. The user might move their head from side to side or perform a specific motion gesture to convey this intention. The device detects this movement through motion sensors, and the server interprets it as a "volume increase command." The server then increases the volume based on this command and confirms the change via voice feedback. Because the process executes quickly, the user can complete the operation efficiently.

[0386] The following describes the processing flow.

[0387] Step 1:

[0388] The device uses motion sensors built into the earphones to detect the user's head movements. The detected movements are recorded as real-time digital data.

[0389] Step 2:

[0390] The terminal packetizes the motion data and transmits it to the server via wireless communication. This communication is optimized to ensure that the motion data is transmitted accurately and without delay.

[0391] Step 3:

[0392] The server analyzes the received motion data. The analysis algorithm compares this data to known motion gesture patterns and determines the user's intent as a specific instruction such as "Yes" or "No".

[0393] Step 4:

[0394] The server executes the corresponding information processing based on the determination results obtained from the motion analysis. This includes specific actions such as music playback and volume adjustment.

[0395] Step 5:

[0396] The server generates the results of the information processing as audio data and prepares it for feedback to the user.

[0397] Step 6:

[0398] The feedback audio data from the server is sent to the terminal. The terminal plays the received audio data to the user via an audio output device (earphones).

[0399] Step 7:

[0400] The user can confirm whether the action was successful or whether the desired operation has been completed through the played audio feedback.
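As a non-limiting illustration, the determination of "Yes" or "No" in Step 3 might be sketched by comparing the motion energy on two rotation axes. It is assumed here, as a common convention that this disclosure does not specify, that a nod (pitch-dominant motion) means "Yes" and a head shake (yaw-dominant motion) means "No". The sample format and the energy threshold are also assumptions.

```python
from typing import List, Optional, Tuple


def classify_yes_no(
    samples: List[Tuple[float, float]], min_energy: float = 500.0
) -> Optional[str]:
    """Classify (yaw_rate, pitch_rate) samples in deg/s as "Yes", "No", or None.

    The summed squared angular rate on each axis is used as a rough measure of
    how much the head moved about that axis.
    """
    yaw_energy = sum(yaw * yaw for yaw, _ in samples)
    pitch_energy = sum(pitch * pitch for _, pitch in samples)
    if max(yaw_energy, pitch_energy) < min_energy:
        return None  # too little movement to count as a deliberate gesture
    return "Yes" if pitch_energy > yaw_energy else "No"
```

The threshold prevents small incidental movements, such as those caused by walking, from being interpreted as answers. More complex instructions would need a richer comparison against registered gesture patterns than this two-way split.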

[0401] (Example 1)

[0402] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0403] Traditional methods of interacting with digital assistants require the use of voice commands, which can raise privacy concerns in quiet environments. Furthermore, in environments where voice communication is unavailable, users may find it difficult to operate the assistant. Additionally, existing interfaces are often difficult for users with disabilities to use. There is a need to address these challenges and enable users to interact with digital assistants in a more natural and reliable way.

[0404] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0405] In this invention, the server includes data analysis means for analyzing motion data and determining instructions based on motion gestures; response generation means for performing information processing based on the determined instructions and generating the result as voice feedback; and feedback transmission means for transmitting the generated voice feedback via wireless communication. This allows the user to operate the digital assistant silently using head movements, enabling efficient use of the assistant while maintaining privacy even in quiet environments.

[0406] A "motion detection means" is a device that detects the user's head movements in real time, converts them into digital signals, and generates motion data.

[0407] "Transmission means" refers to a function for transmitting generated motion data to other devices using wireless communication technology.

[0408] A "data analysis means" is a device that analyzes transmitted motion data and compares it with pre-registered motion patterns to determine the user's intent.

[0409] The "response generation means" is a function that performs information processing based on instructions determined by the data analysis means and generates the result as audio feedback.

[0410] The "feedback transmission means" is a function that transmits the generated audio feedback to the user's terminal using wireless communication.

[0411] A "feedback presentation means" is a function that presents the transmitted audio feedback to the user via an audio output device.

[0412] "Wireless communication" is a technology that uses radio waves to send and receive data without cables.

[0413] This invention is a system that enables users to interact with a digital assistant naturally and effectively without using their voice. Specifically, the user uses an audio output device with a built-in motion sensor, such as earphones, which can detect head movements.

[0414] When a user puts on earphones, the terminal receives detection data from its motion sensor. The terminal transmits the motion data to the server using wireless communication. The server uses a motion analysis device to compare the received data with known motion patterns and analyze the user's intent. High-performance data analysis software is used for this motion analysis, enabling rapid and accurate analysis.

[0415] The server processes the information using a response generation device based on the analysis results and generates audio feedback. The generated audio feedback is returned to the user's terminal via a feedback transmission means and finally presented to the user through an audio output device. This allows the user to perform operations such as increasing the music volume or skipping songs with simple gestures, such as moving their head.

[0416] A concrete example is when a user wants to increase the volume of a music playlist. The user indicates this intention by moving their head from side to side, and the device detects this action and sends it to the server. The server recognizes the "increase volume" instruction, adjusts the volume, and sends confirmation audio feedback back to the user.

[0417] Examples of prompts used in the generative AI model include "How do you move your head to increase the volume?" and "What is the correct gesture to skip to the next song?", helping users to utilize the system more effectively.

[0418] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0419] Step 1:

[0420] The user wears an earphone-type audio output device. A motion sensor detects head movements and generates detection data in real time. This data shows the acceleration and angular changes when the user moves their head in a specific direction.

[0421] Step 2:

[0422] The terminal transmits motion data received from the motion sensor to the server via wireless communication. The data transmitted includes the type of motion (e.g., left-right movement), intensity, and duration.

[0423] Step 3:

[0424] The server receives motion data and compares it to pre-programmed motion patterns using a motion analysis device. The analysis extracts features from the data to determine the user's intended instruction (e.g., "increase volume"). This process applies a gesture recognition algorithm to interpret the intent of the action.

[0425] Step 4:

[0426] The server processes the information using a response generator based on the analysis results. Specifically, the server executes the corresponding action (for example, increasing the volume) according to the determined instructions. The result after processing is generated as audio feedback.

[0427] Step 5:

[0428] The server sends the generated audio feedback to the terminal. This feedback includes a confirmation message indicating that a specific action has been completed (for example, "Volume increased").

[0429] Step 6:

[0430] The device presents the received audio feedback to the user through the earphones. The audio output device plays the audio feedback, and the user confirms that their instructions have been processed correctly.

[0431] This allows users to intuitively operate the digital assistant through the system without using voice commands.
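The six steps above can be sketched roughly as follows. This is an illustrative outline only: the `MotionEvent` fields mirror the motion type, intensity, and duration mentioned in Step 2, while the `Player` class, the intensity threshold, and the lookup tables are hypothetical helpers introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class MotionEvent:
    kind: str          # type of motion, e.g. "left-right"
    intensity: float   # assumed to be normalized to 0..1
    duration_s: float  # duration in seconds


# Only the left-right -> increase volume pairing comes from the text above.
GESTURE_TO_INSTRUCTION = {"left-right": "increase_volume"}
FEEDBACK = {"increase_volume": "Volume increased"}


class Player:
    """Hypothetical stand-in for the music player controlled by the server."""

    def __init__(self, volume=5, max_volume=10):
        self.volume = volume
        self.max_volume = max_volume

    def apply(self, instruction):
        if instruction == "increase_volume":
            self.volume = min(self.volume + 1, self.max_volume)


def handle_motion(event, player, min_intensity=0.3):
    """Steps 3 to 5: determine the instruction, execute it, return feedback text."""
    if event.intensity < min_intensity:
        return None  # assumed rule: ignore weak, probably unintentional movement
    instruction = GESTURE_TO_INSTRUCTION.get(event.kind)
    if instruction is None:
        return None  # no registered pattern matched
    player.apply(instruction)
    return FEEDBACK[instruction]
```

In a real deployment the returned text would be converted to audio before being sent to the terminal; that step is omitted here.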

[0432] (Application Example 1)

[0433] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0434] Currently, it is difficult for users of public transportation to obtain information or change their travel plans without using their hands. Furthermore, using voice instructions can be challenging in noisy environments. Therefore, there is a need for efficient and rapid access to information while on board.

[0435] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0436] In this invention, the server includes means for recognizing the user's gestures using an audio output device equipped with a motion sensor that detects head movements, motion analysis means for determining instructions based on the motion gestures, and information processing means for processing information based on the determined instructions and feeding the results back to the user through the audio output device. This makes it possible for the user to acquire information and update route information using only head movements without using their hands.

[0437] "Head movement" refers to the physical movement of the user's head, which can be used to express the user's intentions.

[0438] A "motion sensor" is a device used to detect physical movement, and specifically refers to a sensor used to accurately capture the movement of a user's head.

[0439] An "audio output device" is a device that transmits processed information to the user as sound, and examples include earphones and speakers.

[0440] "Motion data" refers to data output from motion sensors, which quantifies the user's head movements.

[0441] A "motion analysis device" is a device that analyzes motion data and recognizes specific gesture patterns.

[0442] An "instruction" is a command for an operation or control that the system determines based on the user's gestures.

[0443] "Information processing" is the process of calculation and decision-making that a system performs to generate the optimal response based on the instructions it receives.

[0444] A "response generation device" is a device that provides feedback to the user in the form of voice or other formats based on the results of information processing.

[0445] "Route information" refers to information about the route a user needs for their travel, and is particularly useful when using public transportation.

[0446] "Moving" refers to a state in which the user is physically changing their location.

[0447] "Processing means" refers to the functional elements of hardware or software necessary for a system to perform a specific function.

[0448] In implementing this system, the user uses an audio output device equipped with a motion sensor. The audio output device takes the form of an earphone and detects the user's head movements in real time. Motion data from this motion sensor is transmitted to a terminal and then transferred to a server via wireless communication.

[0449] The server includes a motion analysis device that analyzes the transmitted motion data. During the analysis, the user's instructions are determined by comparing them with pre-registered gesture patterns. For example, if the user makes a gesture indicating "Tell me the next stop," the server processes the information based on this determination. As a result of the information processing, a response generation device generates voice feedback, and the content of that feedback is sent back to the user through the terminal.

[0450] In this entire process, the Google Cloud Text-to-Speech API can be used to generate voice feedback. Furthermore, the processing power of servers and smartphones is utilized to perform calculations for real-time information retrieval and route updates. This allows users to obtain information and adjust their routes hands-free while on public transport, enabling more efficient travel.

[0451] As a concrete example, suppose a user is on a bus and wants to know the route to their destination. In this case, when the user slightly moves their head from side to side, the instruction "Tell me the next stop" is transmitted to the server, and the next stop is announced via voice feedback.

[0452] For example, prompts for the generative AI model can be formalized as follows: "Based on the user's head movements, analyze predefined gestures silently and simulate a scenario for a smart city traffic information assistant system." This formalization can contribute to improving the system's accuracy.

[0453] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0454] Step 1:

[0455] The user wears an audio output device equipped with a motion sensor. The motion sensor detects the user's head movements in real time and outputs them as motion data. This data is collected by a sensor built into the earphones.

[0456] Step 2:

[0457] The terminal transmits the detected motion data to the server via wireless communication. Here, the terminal's role is to receive motion data from the motion sensor and accurately relay it to the server.

[0458] Step 3:

[0459] The server receives the motion data and analyzes it using a motion analysis device, comparing it to predefined gesture patterns. For example, a gesture of moving the head from side to side is interpreted as "Tell me the next stop." The input is motion data, and the output is the interpreted instruction. This process involves data comparison and pattern recognition calculations.

[0460] Step 4:

[0461] The server processes information based on the determined instruction. Here, it retrieves information about the next stop from public transport information and generates the content of the voice feedback. The input is the determined instruction, and the output is the feedback information. This process includes searching for online information and generating data feedback.

[0462] Step 5:

[0463] The server generates audio feedback using a response generator and sends the content as an audio signal to the terminal. For example, it sends audio data generated using the Google Cloud Text-to-Speech API.

[0464] Step 6:

[0465] The device provides the user with generated audio feedback through an audio output device. This allows the user to hear the response to their instructions and obtain necessary information hands-free, even on public transport.
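Steps 3 and 4 of this flow can be illustrated with a small sketch. It assumes the route is held as an ordered list of stop names and that the vehicle's current position is known as an index into that list. The stop names, the response wording, and the function names are hypothetical, and a real system would query a live transit data source instead of a fixed list.

```python
from typing import List, Optional

# Hypothetical route data; a real system would fetch this from a transit service.
ROUTE = ["Central Station", "City Hall", "Riverside", "Airport"]

# The side-to-side gesture is the one named in the text above.
GESTURE_TO_INSTRUCTION = {"left-right": "next_stop"}


def next_stop(route: List[str], current_index: int) -> Optional[str]:
    """Return the stop after the current one, or None at the end of the route."""
    if current_index + 1 < len(route):
        return route[current_index + 1]
    return None


def respond(gesture: str, route: List[str], current_index: int) -> str:
    """Map a gesture to an instruction and build the text of the voice feedback."""
    instruction = GESTURE_TO_INSTRUCTION.get(gesture)
    if instruction != "next_stop":
        return "Sorry, that gesture was not recognized."
    stop = next_stop(route, current_index)
    if stop is None:
        return "This is the last stop."
    return f"The next stop is {stop}."
```

The returned text would then be handed to a speech synthesis service in Step 5.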

[0466] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0467] This invention relates to a system that provides an interface integrating gestures based on head movements with user emotion recognition. The system includes a voice output device equipped with a motion sensor, a motion analysis device, a response generation device, and an emotion engine. The emotion engine is responsible for generating more appropriate responses by analyzing the emotions indicated by the user's voice and movements.

[0468] The device detects the user's head movements via motion sensors through an earphone-type audio output device worn by the user. In addition, a built-in emotion engine analyzes the user's emotional state using the tone, tempo, and voice patterns of their voice. This data is used during information processing to improve the user experience.

[0469] The server analyzes motion data from motion sensors to confirm the user's intent and determine the appropriate instruction. Simultaneously, the emotion engine analyzes the acquired emotion data and evaluates the user's current emotional state. Based on this evaluation, the response generation device formulates an appropriate response corresponding to the determined instruction and emotional state.

[0470] As a concrete example, consider a case where a user chooses relaxing music to alleviate stress they feel while listening to music. The user can indicate a change of music with a gesture, and the emotion engine detects the user's stress level from their voice. A motion analysis device determines the "change music" instruction from head movements and processes it on a server in combination with the emotion engine's detection. A response generation device recommends music that helps reduce stress and plays it to the user through earphones. In this way, a personalized response based on the user's actions and emotions becomes possible.

[0471] The following describes the processing flow.

[0472] Step 1:

[0473] The device detects the user's head movements via motion sensors built into the earphones. Simultaneously, it captures audio signals to collect the user's voice.

[0474] Step 2:

[0475] The terminal transmits the acquired motion data and audio data to the server. In this process, motion data is transmitted immediately via wireless communication, and audio data is used for sentiment analysis.

[0476] Step 3:

[0477] The server analyzes motion data and determines the user's intent based on motion gestures. Simultaneously, it analyzes voice data using an emotion engine to evaluate the user's emotional state.

[0478] Step 4:

[0479] Based on the results of the motion analysis and the emotion analysis, the server performs optimal information processing according to the user's state. For example, if the server determines that the user intends to change the music and is also feeling stressed, it will select music that is effective in reducing stress.

[0480] Step 5:

[0481] The server sends the response generated as a result of information processing (e.g., a new music track) to the terminal as audio output data.

[0482] Step 6:

[0483] The terminal transmits the audio output data received from the server to the user through an audio output device. The user can confirm that the requested action has been performed by listening to this feedback.

[0484] Step 7:

[0485] Through the provided voice feedback, users can understand that their intentions have been reflected and that appropriate responses have been given, allowing them to continue further operations.
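The combination of the determined instruction and the estimated emotion in Step 4 might look like the following sketch. The catalogue, the mood tags, and the emotion-to-mood mapping are assumptions made for illustration; the disclosure does not specify how tracks are labeled or selected.

```python
# Hypothetical catalogue with mood tags; titles and tags are placeholders.
CATALOGUE = [
    {"title": "Track A", "mood": "calm"},
    {"title": "Track B", "mood": "upbeat"},
    {"title": "Track C", "mood": "calm"},
]

# Assumed mapping from an estimated emotion to a preferred music mood.
EMOTION_TO_MOOD = {"stressed": "calm"}


def choose_track(instruction, emotion, current_title, catalogue=CATALOGUE):
    """Pick the next track for a "change_music" instruction, biased by emotion."""
    if instruction != "change_music":
        return None
    # Never pick the track that is already playing.
    candidates = [t for t in catalogue if t["title"] != current_title]
    preferred_mood = EMOTION_TO_MOOD.get(emotion)
    if preferred_mood:
        matching = [t for t in candidates if t["mood"] == preferred_mood]
        if matching:
            candidates = matching  # fall back to all candidates if none match
    return candidates[0]["title"] if candidates else None
```

Falling back to the unfiltered candidates means a gesture still changes the track even when no track fits the estimated emotion.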

[0486] (Example 2)

[0487] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0488] Existing voice output systems face the challenge of providing personalized responses based not only on user actions but also on emotions. In particular, conventional technologies fail to adequately understand user intent and provide real-time responses. As a result, the user experience may be limited, potentially leading to decreased satisfaction.

[0489] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0490] In this invention, the server includes a detection device means for detecting the user's head movements, an analysis device means for analyzing data from the detection device means and determining instructions based on the motion gestures, and an analysis device means for analyzing the user's voice characteristics and evaluating their emotional state. This enables the integration of the user's actions and emotions to provide appropriate responses in real time, improving the personalized user experience.

[0491] A "detection device means" is a device equipped with the function of accurately sensing the user's head movements and extracting them as data.

[0492] The "analysis device means" is a device that analyzes data supplied by the detection device means and identifies instructions based on the user's gestures.

[0493] An "analysis device" is a device that analyzes the tone and patterns from the user's voice and has the function of evaluating their current emotional state.

[0494] A "generative AI model" is an artificial intelligence model that generates appropriate information processing and responses based on user behavior and emotional data.

[0495] A "response device means" is a device that outputs the results of information processing as sound and provides feedback to the user.

[0496] This invention is a system that analyzes the user's head movements and the emotional state conveyed by their voice to provide personalized responses. The system consists of a motion sensor, an emotion engine, and an analysis device, all integrated into an earphone-type audio output device.

[0497] The device uses motion sensors to detect the user's head movements. This information is sent to an analysis device and analyzed as a motion gesture. For example, a user can instruct the system to change the music by moving their head from side to side.

[0498] Furthermore, the device's built-in emotion engine analyzes the user's voice tone, tempo, and speech patterns to assess their emotional state. Emotion analysis is performed using a generative AI model, enabling a more accurate understanding of the user's emotions. For example, if a user is stressed, this state can be detected from their voice tone.

[0499] The server receives this behavioral and emotional data and processes each piece of data in an integrated manner. A generative AI model is used to process information based on the user's intentions and emotions. As a result of the processing, an appropriate response is generated and fed back to the user through the device. For example, if the server determines that the user is feeling stressed, it selects relaxing music and plays it through the earphones.

[0500] A concrete example is a user prompt such as, "I want to listen to relaxing music." The system can analyze this request and adjust the music it provides. In this way, it is possible to provide a highly personalized experience that takes into account the user's emotions and actions.

[0501] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0502] Step 1:

[0503] The user wears earphone-type audio output devices and enjoys media such as music. Simultaneously, a motion sensor built into the device detects the user's head movements. This motion data functions as input data to capture the user's intentions. The detected data is transmitted to a motion analysis device. Specifically, the user indicates music playback or selection by moving their head up and down or left and right.

[0504] Step 2:

[0505] The terminal's motion analysis device analyzes the transmitted head movement data and maps it to specific motion gestures. This analysis determines what kind of command the user issued. For example, if "up and down movement = skipping a song" is defined, the corresponding instruction will be output. Specifically, an algorithm is implemented to convert motion data into motion gestures, and the necessary information is extracted and the data is converted.

[0506] Step 3:

[0507] Simultaneously, the device's emotion engine captures the user's voice and analyzes its tone, tempo, and speech patterns. Using a generative AI model, it evaluates the user's emotional state. This audio data becomes input for the emotion analysis, and the emotion engine outputs emotion labels such as "stressed." For example, it might detect a high-pitched, hurried voice pattern and classify it as "stressed."

[0508] Step 4:

[0509] The server integrates the motion gestures and emotion labels received from the terminal. Using a generative AI model, it analyzes this data and processes the information to match the user's intentions and emotions. For example, if the intention to "relax" is associated with an emotion label, the server will select relaxing music. In this process, motion gestures and emotion labels are input, and a specific response plan is generated and output.

[0510] Step 5:

[0511] The server, via a response device, sends the selected, appropriate music to the terminal and plays it through the user's earphones. The final output is carefully selected music or audio feedback. Specifically, the server performs signal conversion to execute the processing results as audio output. The user experiences a newly selected song playing through their earphones.
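To make Steps 3 and 4 more concrete, the sketch below replaces the generative AI model with a simple threshold rule, purely for illustration. It labels a voice as "stressed" when it is both high-pitched and hurried, as in the example above, and then combines the label with a gesture into a response plan. The feature names, threshold values, and plan format are assumptions; only the "up and down movement = skipping a song" pairing comes from the text.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    mean_pitch_hz: float          # average fundamental frequency
    syllables_per_second: float   # rough measure of speaking tempo


def emotion_label(features, pitch_threshold_hz=220.0, tempo_threshold=5.0):
    """Threshold rule standing in for the emotion engine (thresholds are assumed)."""
    high_pitch = features.mean_pitch_hz > pitch_threshold_hz
    hurried = features.syllables_per_second > tempo_threshold
    return "stressed" if (high_pitch and hurried) else "neutral"


def response_plan(gesture, label):
    """Step 4: integrate the motion gesture and emotion label into a plan."""
    # "up-down" -> skip a song follows the example definition in Step 2.
    action = "skip_song" if gesture == "up-down" else "none"
    style = "relaxing" if label == "stressed" else "any"
    return {"action": action, "music_style": style}
```

A learned model would normally replace `emotion_label`; the surrounding data flow would stay the same.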

[0512] (Application Example 2)

[0513] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0514] Current personal robots and sound output devices have limited ability to adaptively respond based on user emotions and real-time actions. In particular, they are inadequate in supporting users in reducing stress and fatigue in their daily lives, and there is a need for more advanced interfaces to optimize the user experience.

[0515] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0516] In this invention, the server includes means for using an acoustic output device equipped with a motion sensor that detects head movements, means for using an analysis device that analyzes motion data and determines instructions based on motion gestures, and means for using emotion analysis means that analyze the characteristics of the user's voice and evaluate their emotional state. This enables the generation of personalized responses based on the user's actions and emotions.

[0517] A "motion sensor" is a device that detects the movement of a user's head and has the function of providing that motion data.

[0518] An "audio output device" is a device for transmitting sound or music to the user, and it is desirable that it be in a form that the user can wear.

[0519] An "analysis device" is a component that analyzes motion data and determines appropriate instructions based on the detected motion gestures.

[0520] A "response generation device" is a system that processes information based on analyzed instructions and emotional states, and then feeds the results back to the user through an acoustic output device.

[0521] "Emotional analysis techniques" are technologies that analyze characteristics such as the tone and tempo of a user's voice to evaluate the user's emotional state.

[0522] A "control mechanism" is a device that selects appropriate music or conversation based on the evaluated emotional state and provides adaptive responses to the user.

[0523] In an embodiment for carrying out the present invention, the system mainly consists of a server and a terminal. The terminal includes a motion sensor that detects the user's head movements and an acoustic output device that outputs sound. The motion sensor has the function of detecting the user's motion gestures and transmitting the data to the server. The acoustic output device is a device that provides voice feedback to the user.

[0524] The server is equipped with a motion analysis device and an emotion analysis device, and fulfills these roles. The motion analysis device analyzes data acquired from motion sensors using a deep learning model based on Keras to analyze the user's intentions. The emotion analysis device uses spaCy and TensorFlow to analyze the user's voice data and reveal their emotional state. This enables the generation of more personalized responses based on emotions.

[0525] For example, if the user is tired, the emotion analysis system detects this, and the server selects relaxing music via the control system and plays it to the user through the audio output device. At this time, the system takes the user's emotions into consideration and provides more appropriate music and conversation.

[0526] The generative AI model in this system selects the optimal response through prompts. An example of a prompt is, "If a user is feeling stressed, how should they choose relaxing music?" Based on this prompt, the AI intelligently generates a response. As a result, not only is the user experience improved, but specifications that are adapted to the user's emotions and actions can be realized.

[0527] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0528] Step 1:

[0529] The device uses motion sensors to detect the user's head movements. The detected motion data is sent to the server. This input data includes the user's motion gestures, and its output is data transfer to the server.

[0530] Step 2:

[0531] The server analyzes the received motion data using a motion analysis device. The intent of the motion gesture is determined by a model using Keras. Here, gesture data from the motion sensor is used as input, and the analysis result is obtained as output.

[0532] Step 3:

[0533] The terminal records the user's voice using an acoustic output device and sends it to the server. The input here is the user's voice data, and the transmission to the server is the output.

[0534] Step 4:

[0535] The server analyzes the received voice data using emotion analysis tools. It evaluates the emotional state using spaCy and TensorFlow. In this process, the user's voice data is used as input, and the analyzed emotional information is the output.

[0536] Step 5:

[0537] The server combines the analysis results of the gestures and emotional information to generate an appropriate response using a generative AI model. It uses prompts to select the most suitable music or conversation. The input here is the analyzed gesture and emotional data, and the output is the response information to the user.

[0538] Step 6:

[0539] The terminal receives response information from the server and provides feedback to the user through an audio output device. Selected music or conversation is input to the output device, and its playback is the output.
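Step 5 of this flow combines the gesture analysis result and the emotional information into a prompt. One possible way to assemble such a prompt is sketched below. The wording of the prompt and the function name `build_prompt` are illustrative assumptions, and the call to the generative AI model itself is intentionally left out.

```python
def build_prompt(gesture_intent: str, emotion: str) -> str:
    """Assemble a text prompt from the analyzed gesture intent and emotion."""
    return (
        "You are an assistant that selects music or conversation for a user.\n"
        f"Detected gesture intent: {gesture_intent}\n"
        f"Estimated emotional state: {emotion}\n"
        "Suggest one suitable response and keep it short."
    )
```

For example, `build_prompt("change_music", "tired")` yields a four-line prompt that names both the intent and the emotion, which the server could then pass to whichever generative AI model it uses.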

[0540] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0541] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0542] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0543] [Fourth Embodiment]

[0544] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0545] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0546] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0547] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0548] The microphone 238 receives instructions from the user 20 by capturing the user's voice. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0549] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0550] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0551] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0552] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0553] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0554] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0555] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0556] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0557] This invention relates to a system that utilizes head movements to effectively communicate with a digital assistant without using voice. Specifically, the user wears an earphone-type audio output device, which has a built-in motion sensor. The motion sensor detects the user's head movements in real time and generates motion data representing those movements.

[0558] The terminal receives this motion data and transmits it to the server wirelessly. The motion analysis device then analyzes the motion data on the server and compares it to pre-registered motion gesture patterns. Based on this comparison, the user's intention is determined as a simple answer such as "Yes" or "No," or as another more complex instruction. The determination result is used as input for information processing by the response generation device, and the information processing is performed accordingly.
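A simple way to picture the "Yes" or "No" determination is to compare how much of the head movement occurs around the pitch axis (nodding) versus the yaw axis (shaking). The sketch below assumes that a nod means "Yes" and a shake means "No", which is a common convention but is not stated in this disclosure; the sample format and the energy threshold are likewise assumptions.

```python
def classify_yes_no(samples, min_energy=1000.0):
    """Classify a gesture from (yaw_rate, pitch_rate) samples in deg/s.

    Assumed convention: pitch-dominant movement (nod) -> "Yes",
    yaw-dominant movement (shake) -> "No".
    """
    yaw_energy = sum(yaw * yaw for yaw, _ in samples)
    pitch_energy = sum(pitch * pitch for _, pitch in samples)
    if max(yaw_energy, pitch_energy) < min_energy:
        return None  # movement too small to count as a deliberate gesture
    return "Yes" if pitch_energy > yaw_energy else "No"
```

More complex instructions would need richer patterns than a two-axis energy comparison, for example the sequence or timing of movements.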

[0559] The server then generates the results of the information processing as audio feedback and sends this response back to the user's terminal. The terminal informs the user of the response through the feedback audio from its voice output device. This allows the user to interact with the digital assistant in a simple and natural way.

[0560] As a concrete example, consider a scenario where a user wants to manipulate their music playlist. Suppose a user wearing earphones indicates they want to increase the volume using a gesture. The user might move their head from side to side or perform a specific motion gesture to convey this intention. The device detects this movement through motion sensors, and the server interprets it as a "volume increase command." The server then increases the volume based on this command and confirms the change to the user via voice feedback. This rapid execution of the process allows the user to complete the operation efficiently.

[0561] The following describes the processing flow.

[0562] Step 1:

[0563] The device uses motion sensors built into the earphones to detect the user's head movements. The detected movements are recorded as real-time digital data.

[0564] Step 2:

[0565] The terminal packetizes the motion data and transmits it to the server via wireless communication. This communication is optimized so that the motion data is transmitted accurately and without delay.

[0566] Step 3:

[0567] The server analyzes the received motion data. The analysis algorithm compares this data to known motion gesture patterns and determines the user's intent as a specific instruction such as "Yes" or "No".

[0568] Step 4:

[0569] The server executes the corresponding information processing based on the determination results obtained from the motion analysis. This includes specific actions such as music playback and volume adjustment.

[0570] Step 5:

[0571] The server generates the results of the information processing as audio data and prepares it for feedback to the user.

[0572] Step 6:

[0573] The feedback audio data from the server is sent to the terminal. The terminal plays the received audio data to the user via an audio output device (earphones).

[0574] Step 7:

[0575] The user can confirm whether the action was successful or whether the desired operation has been completed through the played audio feedback.
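The server-side part of the flow above (Steps 3 to 5) can be sketched as follows. This is only an illustrative sketch: the `MotionSample` format, the 10-degree threshold, and the rule that a nod means "Yes" and a shake means "No" are assumptions made for the example and are not specified in this description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MotionSample:
    """One reading of head orientation (hypothetical format)."""
    t: float      # seconds since capture started
    yaw: float    # left-right rotation in degrees
    pitch: float  # up-down rotation in degrees


def analyze_motion(samples: List[MotionSample], min_range: float = 10.0) -> str:
    """Step 3: reduce motion data to "yes", "no", or "none"."""
    if not samples:
        return "none"
    yaw_range = max(s.yaw for s in samples) - min(s.yaw for s in samples)
    pitch_range = max(s.pitch for s in samples) - min(s.pitch for s in samples)
    if max(yaw_range, pitch_range) < min_range:
        return "none"  # movement too small to count as a gesture
    # Assumption: a nod (pitch) means "yes", a shake (yaw) means "no".
    return "yes" if pitch_range >= yaw_range else "no"


def generate_feedback(instruction: str) -> str:
    """Steps 4 and 5: produce the text that would be spoken to the user."""
    replies = {
        "yes": "Confirmed.",
        "no": "Cancelled.",
        "none": "No gesture was recognized.",
    }
    return replies[instruction]


def handle_motion(samples: List[MotionSample]) -> str:
    """Server-side handling for one batch of motion data (Steps 3 to 5)."""
    return generate_feedback(analyze_motion(samples))
```

In a real deployment the returned text would still have to be converted to audio and sent back to the terminal (Steps 5 and 6), which this sketch leaves out.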

[0576] (Example 1)

[0577] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0578] Traditional methods of interacting with digital assistants require the use of voice commands, which can raise privacy concerns in quiet environments. Furthermore, in environments where voice communication is unavailable, users may find it difficult to operate the assistant. Additionally, existing interfaces are often difficult for users with disabilities to use. There is a need to address these challenges and enable users to interact with digital assistants in a more natural and reliable way.

[0579] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0580] In this invention, the server includes data analysis means for analyzing motion data and determining instructions based on motion gestures; response generation means for performing information processing based on the determined instructions and generating the result as voice feedback; and feedback transmission means for transmitting the generated voice feedback via wireless communication. This allows the user to operate the digital assistant silently using head movements, enabling efficient use of the assistant while maintaining privacy even in quiet environments.

[0581] A "motion detection means" is a device that detects the user's head movements in real time, converts them into digital signals, and generates motion data.

[0582] "Transmission means" refers to a function for transmitting the generated motion data to other devices using wireless communication technology.

[0583] A "data analysis means" is a device that analyzes transmitted motion data and compares it with pre-registered motion patterns to determine the user's intent.

[0584] The "response generation means" is a function that performs information processing based on instructions determined by the data analysis means and generates the result as audio feedback.

[0585] The "feedback transmission means" is a function that transmits the generated audio feedback to the user's terminal using wireless communication.

[0586] A "feedback presentation means" is a function that presents the transmitted audio feedback to the user via an audio output device.

[0587] Wireless communication is a technology that uses radio waves to send and receive data, and is a means of data communication without using cables.

[0588] This invention is a system that enables users to interact with a digital assistant naturally and effectively without using their voice. Specifically, the user uses an audio output device with a built-in motion sensor, such as earphones, which can detect head movements.

[0589] When a user puts on earphones, the terminal receives detection data from its motion sensor. The terminal transmits the motion data to the server using wireless communication. The server uses a motion analysis device to compare the received data with known motion patterns and analyze the user's intent. High-performance data analysis software is used for this motion analysis, enabling rapid and accurate analysis.

[0590] The server processes the information using a response generation device based on the analysis results and generates audio feedback. The generated audio feedback is returned to the user's terminal via a feedback transmission means and finally presented to the user through an audio output device. This allows the user to perform operations such as increasing the music volume or skipping songs with simple gestures, such as moving their head.

[0591] A concrete example is when a user wants to increase the volume of a music playlist. The user indicates this intention by moving their head from side to side, and the device detects this action and sends it to the server. The server recognizes the "increase volume" instruction, adjusts the volume, and sends confirmation audio feedback back to the user.

[0592] Examples of prompts used in the generative AI model include "How do you move your head to increase the volume?" and "What is the correct gesture to skip to the next song?", helping users to utilize the system more effectively.

[0593] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0594] Step 1:

[0595] The user wears an earphone-type audio output device. A motion sensor detects head movements and generates detection data in real time. This data shows the acceleration and angular changes when the user moves their head in a specific direction.

[0596] Step 2:

[0597] The terminal transmits motion data received from the motion sensor to the server via wireless communication. The data transmitted includes the type of motion (e.g., left-right movement), intensity, and duration.

[0598] Step 3:

[0599] The server receives motion data and compares it to pre-programmed motion patterns using a motion analysis device. The analysis extracts features from the data to determine the user's intended instruction (e.g., "increase volume"). This process applies a gesture recognition algorithm to interpret the intent of the action.

[0600] Step 4:

[0601] The server processes the information using a response generator based on the analysis results. Specifically, the server executes the corresponding action (for example, increasing the volume) according to the determined instructions. The result after processing is generated as audio feedback.

[0602] Step 5:

[0603] The server sends the generated audio feedback to the terminal. This feedback includes a confirmation message indicating that a specific action has been completed (for example, "Volume increased").

[0604] Step 6:

[0605] The device presents the received audio feedback to the user through the earphones. The audio output device plays the audio feedback, and the user confirms that their instructions have been processed correctly.

[0606] This allows users to intuitively operate the digital assistant through the system without using voice commands.
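The feature extraction and pattern comparison in Step 3 can be illustrated with a small nearest-template matcher. This is a sketch under stated assumptions, not the analysis algorithm of this system: the feature vector (yaw range, pitch range, duration), the template values, and the distance threshold are all hypothetical. The gesture-to-instruction pairs follow the examples given in this description (side-to-side movement for increasing the volume, up-and-down movement for skipping a song).

```python
import math
from typing import Dict, List, Optional, Tuple

# (yaw range in degrees, pitch range in degrees, duration in seconds)
Features = Tuple[float, float, float]

# Hypothetical pre-registered motion patterns.
TEMPLATES: Dict[str, Features] = {
    "increase volume": (40.0, 5.0, 1.0),  # side-to-side movement
    "skip song": (5.0, 30.0, 0.8),        # up-and-down movement
}


def extract_features(samples: List[Tuple[float, float, float]]) -> Features:
    """Summarize (time, yaw, pitch) samples as ranges and total duration."""
    times = [s[0] for s in samples]
    yaws = [s[1] for s in samples]
    pitches = [s[2] for s in samples]
    return (max(yaws) - min(yaws), max(pitches) - min(pitches), times[-1] - times[0])


def match_gesture(features: Features, max_distance: float = 15.0) -> Optional[str]:
    """Return the closest registered instruction, or None if nothing is close."""
    best_name, best_dist = None, float("inf")
    for name, template in TEMPLATES.items():
        dist = math.dist(features, template)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None
```

Rejecting matches beyond `max_distance` keeps small, unintentional head movements from being interpreted as instructions.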

[0607] (Application Example 1)

[0608] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0609] Currently, it is difficult for users of public transportation to obtain information or change their travel plans without using their hands. Furthermore, using voice instructions can be challenging in noisy environments. Therefore, there is a need for efficient and rapid access to information while on board.

[0610] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0611] In this invention, the server includes means for recognizing the user's gestures using an audio output device equipped with a motion sensor that detects head movements, motion analysis means for determining instructions based on the motion gestures, and information processing means for processing information based on the determined instructions and feeding the results back to the user through the audio output device. This makes it possible for the user to acquire information and update route information using only head movements without using their hands.

[0612] "Head movement" refers to the physical movement of the user's head, which can be used to express the user's intentions.

[0613] A "motion sensor" is a device used to detect physical movement, and specifically refers to a sensor used to accurately capture the movement of a user's head.

[0614] An "audio output device" is a device that transmits processed information to the user as sound, and examples include earphones and speakers.

[0615] "Motion data" refers to data output from motion sensors, which quantifies the user's head movements.

[0616] A "motion analysis device" is a device that analyzes motion data and recognizes specific gesture patterns.

[0617] An "instruction" is a command for an operation or control that the system determines based on the user's gestures.

[0618] "Information processing" is the process of calculation and decision-making that a system performs to generate the optimal response based on the instructions it receives.

[0619] A "response generation device" is a device that provides feedback to the user in the form of voice or other formats based on the results of information processing.

[0620] "Route information" refers to information about the routes a user needs for travel, and is particularly useful when using public transportation.

[0621] "Moving" refers to a state in which the user is physically changing their location.

[0622] "Processing means" refers to the functional elements of hardware or software necessary for a system to perform a specific function.

[0623] In implementing this system, the user uses an audio output device equipped with a motion sensor. The audio output device takes the form of an earphone and detects the user's head movements in real time. Motion data from this motion sensor is transmitted to a terminal and then transferred to a server via wireless communication.

[0624] The server includes a motion analysis device that analyzes the transmitted motion data. During the analysis, the user's instructions are determined by comparing them with pre-registered gesture patterns. For example, if the user makes a gesture indicating "Tell me the next stop," the server processes the information based on this determination. As a result of the information processing, a response generation device generates voice feedback, and the content of that feedback is sent back to the user through the terminal.

[0625] In this entire process, the Google Cloud Text-to-Speech API can be used to generate voice feedback. Furthermore, the processing power of servers and smartphones is utilized to perform calculations for real-time information retrieval and route updates. This allows users to obtain information and adjust their routes hands-free while on public transport, enabling more efficient travel.

[0626] As a concrete example, suppose a user is on a bus and wants to know the route to their destination. In this case, by slightly moving their head from side to side, the instruction "Tell me the next stop" is transmitted to the server, and the next stop is announced via voice feedback.

[0627] For example, prompts for the generative AI model can be formalized as follows: "Based on the user's head movements, analyze predefined gestures silently and simulate a scenario for a smart city traffic information assistant system." This formalization can contribute to improving the system's accuracy.

[0628] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0629] Step 1:

[0630] The user wears an audio output device equipped with a motion sensor. The motion sensor detects the user's head movements in real time and outputs them as motion data. This data is collected by a sensor built into the earphones.

[0631] Step 2:

[0632] The terminal transmits the detected motion data to the server via wireless communication. Here, the terminal's role is to receive motion data from the motion sensor and accurately relay it to the server.

[0633] Step 3:

[0634] The server receives the motion data and analyzes it using the motion analysis device, comparing it to predefined gesture patterns. For example, a gesture of moving the head from side to side is interpreted as "Tell me the next stop." The input is motion data, and the output is the interpreted instruction. This process involves data comparison and pattern recognition calculations.

[0635] Step 4:

[0636] The server processes information based on the determined instruction. Here, it retrieves public transport information about the next stop and generates the content of the voice feedback. The input is the determined instruction, and the output is the feedback information. This process includes searching for online information and generating the feedback data.

[0637] Step 5:

[0638] The server generates audio feedback using a response generator and sends the content as an audio signal to the terminal. For example, it sends audio data generated using the Google Cloud Text-to-Speech API.

[0639] Step 6:

[0640] The device provides the user with generated audio feedback through an audio output device. This allows the user to hear the response to their instructions and obtain necessary information hands-free, even on public transport.
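Steps 3 and 4 of this flow, from a recognized gesture to the feedback text, can be sketched as follows. The route list, the gesture name `side_to_side`, and the instruction name `tell_next_stop` are hypothetical placeholders; a real system would query a live public transport data source and then pass the resulting text to a text-to-speech service, neither of which is shown here.

```python
from typing import Dict, List, Optional

# Hypothetical route data; a real system would look this up online.
ROUTE: List[str] = ["Central Station", "City Hall", "Museum", "Harbor"]

# Hypothetical mapping from recognized gestures to instructions.
GESTURE_TO_INSTRUCTION: Dict[str, str] = {"side_to_side": "tell_next_stop"}


def next_stop(route: List[str], current_index: int) -> Optional[str]:
    """Return the stop after the current one, or None at the end of the route."""
    if 0 <= current_index < len(route) - 1:
        return route[current_index + 1]
    return None


def build_feedback(gesture: str, route: List[str], current_index: int) -> str:
    """Turn a recognized gesture into the text of the voice feedback."""
    instruction = GESTURE_TO_INSTRUCTION.get(gesture)
    if instruction != "tell_next_stop":
        return "Sorry, that gesture was not recognized."
    stop = next_stop(route, current_index)
    if stop is None:
        return "This is the last stop."
    return f"The next stop is {stop}."
```

For example, `build_feedback("side_to_side", ROUTE, 1)` produces the text for the stop that follows "City Hall" in the sample route.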

[0641] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0642] This invention relates to a system that provides an interface integrating gestures based on head movements with user emotion recognition. The system includes a voice output device equipped with a motion sensor, a motion analysis device, a response generation device, and an emotion engine. The emotion engine is responsible for generating more appropriate responses by analyzing the emotions indicated by the user's voice and movements.

[0643] The device detects the user's head movements via motion sensors through an earphone-type audio output device worn by the user. In addition, a built-in emotion engine analyzes the user's emotional state using the tone, tempo, and voice patterns of their voice. This data is used during information processing to improve the user experience.

[0644] The server analyzes motion data from motion sensors to confirm the user's intent and determine the appropriate instruction. Simultaneously, the emotion engine analyzes the acquired emotion data and evaluates the user's current emotional state. Based on this evaluation, the response generation device formulates an appropriate response corresponding to the determined instruction and emotional state.

[0645] As a concrete example, consider a case where a user chooses relaxing music to alleviate stress they feel while listening to music. The user can indicate a change of music with a gesture, and the emotion engine detects the user's stress level from their voice. A motion analysis device determines the "change music" instruction from head movements, and the server processes it in combination with the emotion engine's detection result. A response generation device recommends music that helps reduce stress and plays it to the user through earphones. In this way, a personalized response based on the user's actions and emotions becomes possible.

[0646] The following describes the processing flow.

[0647] Step 1:

[0648] The device detects the user's head movements via motion sensors built into the earphones. Simultaneously, it captures audio signals to collect the user's voice.

[0649] Step 2:

[0650] The terminal transmits the acquired motion data and audio data to the server. In this process, motion data is transmitted immediately via wireless communication, and audio data is used for emotion analysis.

[0651] Step 3:

[0652] The server analyzes motion data and determines the user's intent based on motion gestures. Simultaneously, it analyzes voice data using an emotion engine to evaluate the user's emotional state.

[0653] Step 4:

[0654] Based on the results of the motion analysis and emotion analysis, the server performs optimal information processing according to the user's state. For example, if the server determines that the user intends to change the music and is also feeling stressed, it will select music that is effective in reducing stress.

[0655] Step 5:

[0656] The server sends the response generated as a result of information processing (e.g., a new music track) to the terminal as audio output data.

[0657] Step 6:

[0658] The terminal transmits the audio output data received from the server to the user through an audio output device. The user can confirm that the requested action has been performed by listening to this feedback.

[0659] Step 7:

[0660] Through the provided voice feedback, users can understand that their intentions have been reflected and that appropriate responses have been given, allowing them to continue further operations.
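The decision made in Step 4, which combines the determined instruction with the estimated emotion, can be sketched as a simple lookup. The track names, category names, and emotion labels below are hypothetical and serve only to illustrate the idea; this description does not specify how the selection is actually implemented.

```python
from typing import Dict, List, Optional

# Hypothetical music library grouped by mood category.
LIBRARY: Dict[str, List[str]] = {
    "relaxing": ["Quiet Shore", "Slow Morning"],
    "upbeat": ["Bright Run"],
    "neutral": ["Everyday Mix"],
}

# Hypothetical mapping from estimated emotion to music category.
EMOTION_TO_CATEGORY: Dict[str, str] = {
    "stressed": "relaxing",
    "tired": "relaxing",
    "happy": "upbeat",
}


def select_track(instruction: str, emotion: str) -> Optional[str]:
    """Pick a track when the user asked for a change, based on their emotion."""
    if instruction != "change_music":
        return None  # other instructions are outside this sketch
    category = EMOTION_TO_CATEGORY.get(emotion, "neutral")
    return LIBRARY[category][0]
```

Unknown emotion labels fall back to the "neutral" category so that a "change music" gesture always yields a track.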

[0661] (Example 2)

[0662] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0663] Existing voice output systems face the challenge of providing personalized responses based not only on user actions but also on emotions. In particular, conventional technologies fail to adequately understand user intent and provide real-time responses. As a result, the user experience may be limited, potentially leading to decreased satisfaction.

[0664] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0665] In this invention, the server includes a detection device means for detecting the user's head movements, an analysis device means for analyzing data from the detection device means and determining instructions based on the motion gestures, and an analysis device means for analyzing the user's voice characteristics and evaluating their emotional state. This enables the integration of the user's actions and emotions to provide appropriate responses in real time, improving the personalized user experience.

[0666] A "detection device means" is a device equipped with the function of accurately sensing the user's head movements and extracting them as data.

[0667] The "analysis device means" is a device that analyzes data supplied by the detection device means and identifies instructions based on the user's gestures.

[0668] An "analysis device" is a device that analyzes the tone and patterns from the user's voice and has the function of evaluating their current emotional state.

[0669] A "generative AI model" is an artificial intelligence model that generates appropriate information processing and responses based on user behavior and emotional data.

[0670] A "response device means" is a device that outputs the results of information processing as sound and provides feedback to the user.

[0671] This invention is a system that analyzes the user's head movements and the emotional state conveyed by their voice to provide personalized responses. The system consists of a motion sensor, an emotion engine, and an analysis device, all integrated into an earphone-type audio output device.

[0672] The device uses motion sensors to detect the user's head movements. This information is sent to an analysis device and analyzed as a motion gesture. For example, a user can instruct the system to change the music by moving their head from side to side.

[0673] Furthermore, the device's built-in emotion engine analyzes the user's voice tone, tempo, and speech patterns to assess their emotional state. Emotion analysis is performed using a generative AI model, enabling a more accurate understanding of the user's emotions. For example, if a user is stressed, this state can be detected from their voice tone.

[0674] The server receives this motion data and emotion data and processes them in an integrated manner. A generative AI model is used to process information based on the user's intentions and emotions. As a result of the processing, an appropriate response is generated and fed back to the user through the device. For example, if the server determines that the user is feeling stressed, it selects relaxing music and plays it through the earphones.

[0675] A concrete example is a user prompt such as, "I want to listen to relaxing music." The system can analyze this request and adjust the music it provides. In this way, it is possible to provide a highly personalized experience that takes into account the user's emotions and actions.

[0676] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0677] Step 1:

[0678] The user wears earphone-type audio output devices and enjoys media such as music. Simultaneously, a motion sensor built into the device detects the user's head movements. This motion data functions as input data to capture the user's intentions. The detected data is transmitted to a motion analysis device. Specifically, the user indicates music playback or selection by moving their head up and down or left and right.

[0679] Step 2:

[0680] The terminal's motion analysis device analyzes the transmitted head movement data and maps it to specific motion gestures. This analysis determines what kind of command the user issued. For example, if "up and down movement = skipping a song" is defined, the corresponding instruction will be output. Specifically, an algorithm is implemented to convert motion data into motion gestures, and the necessary information is extracted and the data is converted.

[0681] Step 3:

[0682] Simultaneously, the device's emotion engine captures the user's voice and analyzes its tone, tempo, and speech patterns. Using a generative AI model, it evaluates the user's emotional state. This audio data becomes input for the emotion analysis, and the emotion engine outputs emotion labels such as "stressed." For example, it might detect a high-pitched, hurried voice pattern and classify it as "stressed."

[0683] Step 4:

[0684] The server integrates the motion gestures and emotion labels received from the terminal. Using a generative AI model, it analyzes this data and processes the information to match the user's intentions and emotions. For example, if the intention to "relax" is associated with an emotion label, the server will select relaxing music. In this process, motion gestures and emotion labels are input, and a specific response plan is generated and output.

[0685] Step 5:

[0686] The server, via a response device, sends the selected, appropriate music to the terminal and plays it through the user's earphones. The final output is carefully selected music or audio feedback. Specifically, the server performs signal conversion to execute the processing results as audio output. The user experiences a newly selected song playing through their earphones.
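The emotion labeling in Step 3 can be illustrated with a deliberately simple rule-based stand-in. This description states that a generative AI model performs the evaluation; the thresholds, baseline values, and the `VoiceFeatures` fields below are assumptions chosen only to show how a high-pitched, hurried voice could map to the label "stressed".

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical summary of one utterance."""
    mean_pitch_hz: float    # average fundamental frequency
    speech_rate_wps: float  # words per second


def label_emotion(
    voice: VoiceFeatures,
    baseline_pitch_hz: float = 180.0,
    baseline_rate_wps: float = 2.5,
) -> str:
    """Label the voice as "stressed" when it is both high-pitched and hurried."""
    # Assumption: 20% above the speaker's baseline counts as elevated.
    high_pitch = voice.mean_pitch_hz > 1.2 * baseline_pitch_hz
    hurried = voice.speech_rate_wps > 1.2 * baseline_rate_wps
    if high_pitch and hurried:
        return "stressed"
    return "neutral"
```

Comparing against a per-user baseline rather than fixed values is one way to account for natural differences between speakers.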

[0687] (Application Example 2)

[0688] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0689] Current personal robots and sound output devices have limited ability to adaptively respond based on user emotions and real-time actions. In particular, they are inadequate in supporting users in reducing stress and fatigue in their daily lives, and there is a need for more advanced interfaces to optimize the user experience.

[0690] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0691] In this invention, the server includes means for using an acoustic output device equipped with a motion sensor that detects head movements, means for using an analysis device that analyzes motion data and determines instructions based on motion gestures, and emotion analysis means that analyzes the characteristics of the user's voice and evaluates their emotional state. This enables the generation of personalized responses based on the user's actions and emotions.

[0692] A "motion sensor" is a device that detects the movement of a user's head and has the function of providing that motion data.

[0693] An "audio output device" is a device for transmitting sound or music to the user, and it is desirable that it be in a form that the user can wear.

[0694] An "analysis device" is a component that analyzes motion data and determines appropriate instructions based on the detected motion gestures.

[0695] A "response generation device" is a system that processes information based on analyzed instructions and emotional states, and then feeds the results back to the user through an acoustic output device.

[0696] "Emotional analysis techniques" are technologies that analyze characteristics such as the tone and tempo of a user's voice to evaluate the user's emotional state.

[0697] A "control mechanism" is a device that selects appropriate music or conversation based on the evaluated emotional state and provides adaptive responses to the user.

[0698] In an embodiment for carrying out the present invention, the system mainly consists of a server and a terminal. The terminal includes a motion sensor that detects the user's head movements and an acoustic output device that outputs sound. The motion sensor has the function of detecting the user's motion gestures and transmitting the data to the server. The acoustic output device is a device that provides voice feedback to the user.

[0699] The server is equipped with a motion analysis device and an emotion analysis device. The motion analysis device analyzes data acquired from the motion sensor using a Keras-based deep learning model to determine the user's intentions. The emotion analysis device uses spaCy and TensorFlow to analyze the user's voice data and estimate their emotional state. This enables the generation of more personalized responses based on emotions.

[0700] For example, if the user is tired, the emotion analysis system detects this, and the server selects relaxing music via the control system and plays it to the user through the audio output device. At this time, the system takes the user's emotions into consideration and provides more appropriate music and conversation.

[0701] The generative AI model in this system selects the optimal response through prompts. An example of a prompt is, "If a user is feeling stressed, how should they choose relaxing music?" Based on this prompt, the AI generates a response. As a result, the user experience is improved, and the system's behavior can be adapted to the user's emotions and actions.

[0702] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0703] Step 1:

[0704] The device uses motion sensors to detect the user's head movements. The detected motion data is sent to the server. This input data includes the user's motion gestures, and its output is data transfer to the server.

[0705] Step 2:

[0706] The server analyzes the received motion data using a motion analysis device. The intent of the motion gesture is determined by a model using Keras. Here, gesture data from the motion sensor is used as input, and the analysis result is obtained as output.

[0707] Step 3:

[0708] The terminal records the user's voice using an acoustic output device and sends it to the server. The input here is the user's voice data, and the transmission to the server is the output.

[0709] Step 4:

[0710] The server analyzes the received voice data using emotion analysis tools. It evaluates the emotional state using spaCy and TensorFlow. In this process, the user's voice data is used as input, and the analyzed emotional information is the output.

[0711] Step 5:

[0712] The server combines the analysis results of the gestures and emotional information to generate an appropriate response using a generative AI model. It uses prompts to select the most suitable music or conversation. The input here is the analyzed gesture and emotional data, and the output is the response information to the user.

[0713] Step 6:

[0714] The terminal receives response information from the server and provides feedback to the user through an audio output device. Selected music or conversation is input to the output device, and its playback is the output.
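Step 5 combines the gesture analysis result and the emotional information into a prompt for the generative AI model. The following sketch shows one way such a prompt could be assembled. The prompt wording and the `generate` callable are hypothetical; no particular model API is assumed, and the caller supplies whatever function actually sends the prompt to a model.

```python
from typing import Callable


def build_prompt(instruction: str, emotion: str) -> str:
    """Assemble a prompt from the determined instruction and emotional state."""
    return (
        "You are an assistant that selects music or conversation for a user.\n"
        f"Determined instruction: {instruction}\n"
        f"Estimated emotional state: {emotion}\n"
        "Suggest one suitable response."
    )


def generate_response(
    instruction: str,
    emotion: str,
    generate: Callable[[str], str],
) -> str:
    """Pass the prompt to a caller-supplied model function and return its reply."""
    return generate(build_prompt(instruction, emotion))
```

Passing the model as a callable keeps the prompt construction testable without a network connection, for example by supplying a stub function in place of the real model.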

[0715] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0716] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0717] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0718] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0719] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0720] These emotions are distributed around the 3 o'clock position on the emotion map 400, and usually fluctuate between security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal sensations, resulting in a calm impression.

[0721] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
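The geometry described for the emotion map 400 can be pictured with polar coordinates. The following is a minimal sketch under stated assumptions, not the representation used in this disclosure: the class `EmotionPoint`, the helper functions, the normalized radius, and the clock-hour encoding are all hypothetical, and any coordinates passed in are illustrative rather than taken from Figure 9.

```python
import math
from dataclasses import dataclass

EPS = 1e-9  # tolerance for points that lie exactly on an axis


@dataclass(frozen=True)
class EmotionPoint:
    """Hypothetical placement of one emotion on a concentric emotion map."""
    name: str
    radius: float      # 0.0 = center (primitive), 1.0 = outer edge (expressed in actions)
    clock_hour: float  # 12 = top, 3 = right, 6 = bottom, 9 = left

    def xy(self):
        # Clock position -> standard angle: 3 o'clock is 0 rad, 12 o'clock is pi/2.
        angle = math.radians(90.0 - 30.0 * self.clock_hour)
        return self.radius * math.cos(angle), self.radius * math.sin(angle)


def region(point):
    """Left half: reaction-driven. Right half: situation-driven."""
    x, _ = point.xy()
    if x > EPS:
        return "situation"
    if x < -EPS:
        return "reaction"
    return "mixed"  # on the vertical axis (top or bottom of the map)


def valence(point):
    """Upper half: pleasure. Lower half: displeasure."""
    _, y = point.xy()
    if y > EPS:
        return "pleasure"
    if y < -EPS:
        return "displeasure"
    return "neutral"  # on the horizontal axis


def map_distance(a, b):
    """Euclidean distance on the map; small values mean 'likely to co-occur'."""
    ax, ay = a.xy()
    bx, by = b.xy()
    return math.hypot(ax - bx, ay - by)
```

For example, an emotion placed at the 3 o'clock position with a small radius would be classified by this sketch as situation-driven, neither clearly pleasant nor unpleasant, and only weakly expressed in actions.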

[0722] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is displeasure, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is displeasure, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
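The idea that a balance near its ideal yields pleasure and a balance far from its ideal yields displeasure can be illustrated with a small scoring function. This is only a sketch under assumptions: the linear scoring rule, the `tolerance` parameter, the averaging across balances, and the function names are hypothetical and are not specified in this disclosure.

```python
def balance_valence(current, ideal, tolerance):
    """Score one balance in [-1.0, 1.0].

    +1.0 at the ideal value, 0.0 when the deviation equals `tolerance`,
    and -1.0 once the deviation reaches twice the tolerance (assumed rule).
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    deviation = abs(current - ideal)
    score = 1.0 - deviation / tolerance
    return max(-1.0, min(1.0, score))


def overall_valence(readings):
    """Average the scores of several balances.

    `readings` maps a balance name to a (current, ideal, tolerance) tuple,
    for example a robot's battery level and posture tilt.
    """
    if not readings:
        raise ValueError("at least one balance reading is required")
    scores = [balance_valence(*values) for values in readings.values()]
    return sum(scores) / len(scores)
```

A positive result would correspond to the "pleasure" side of the map and a negative result to the "displeasure" side. How the individual balances should be weighted is left open here.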

[0723] The emotion map defines two emotions that promote learning. One is the negative emotion around the middle of "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion around "desire" on the reaction side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0724] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
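The disclosure does not state how the network is trained so that neighboring emotions receive similar values. One possible approach, shown here purely as an assumption, is to build soft training targets with a Gaussian kernel over map distance, so that emotions near the labeled emotion get target values close to 1. The function `soft_targets`, the `sigma` parameter, and the coordinates in the usage example are hypothetical.

```python
import math


def soft_targets(positions, labeled, sigma=0.3):
    """Build target emotion values for one training example.

    positions: dict mapping emotion name -> (x, y) on the emotion map
               (illustrative coordinates, not taken from the figures).
    labeled:   the emotion annotated for this example.
    sigma:     kernel width; larger values spread the target further.

    Returns a dict of values in (0, 1], equal to 1.0 for the labeled
    emotion and decaying with squared distance on the map.
    """
    if labeled not in positions:
        raise KeyError(f"unknown emotion: {labeled}")
    lx, ly = positions[labeled]
    targets = {}
    for name, (x, y) in positions.items():
        d2 = (x - lx) ** 2 + (y - ly) ** 2
        targets[name] = math.exp(-d2 / (2.0 * sigma ** 2))
    return targets


# Usage with hypothetical coordinates: three neighbors and one distant emotion.
example_positions = {
    "reassured": (0.50, 0.05),
    "calm": (0.55, 0.00),
    "confident": (0.50, -0.05),
    "anxious": (0.50, -0.60),
}
example_targets = soft_targets(example_positions, "reassured")
```

With these illustrative coordinates, "calm" and "confident" receive targets close to that of "reassured", while "anxious" receives a much smaller one, which mirrors the behavior described for Figure 10.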

[0725] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0726] In the above embodiment, an example was given in which the specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific process may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.

[0727] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0728] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0729] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0730] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0731] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0732] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0733] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0734] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as the essence of the technical aspects of this disclosure is not departed from. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0735] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0736] The following is further disclosed regarding the embodiments described above.

[0737] (Claim 1)

[0738] An audio output device equipped with a motion sensor that detects head movements,

[0739] A motion analysis device that analyzes motion data output from the motion sensor and determines instructions based on motion gestures,

[0740] A response generation device that processes information based on the determined instructions and provides feedback to the user through the audio output device,

[0741] A system comprising the above.

[0742] (Claim 2)

[0743] The system according to claim 1, wherein a first information processing is performed when the motion gesture indicates "affirmation," and a second information processing is performed when it indicates "negation."

[0744] (Claim 3)

[0745] The system according to claim 1, wherein the audio output device and the motion analysis device are connected by wireless communication.
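The mapping from a head gesture to the first or second information processing can be sketched as follows. This is a minimal illustration under assumptions, not the analysis method of this disclosure: it assumes the motion sensor yields (pitch rate, yaw rate) pairs in degrees per second, treats a pitch-dominant movement as a nod ("affirmation") and a yaw-dominant movement as a head shake ("negation"), and uses hypothetical threshold values and function names.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed sensor sample format: (pitch_rate, yaw_rate) in deg/s.
Sample = Tuple[float, float]


def classify_gesture(
    samples: Sequence[Sample],
    min_energy: float = 500.0,  # hypothetical threshold to ignore small movements
    dominance: float = 2.0,     # hypothetical ratio one axis must exceed the other by
) -> Optional[str]:
    """Return "affirmation", "negation", or None for an ambiguous window."""
    pitch_energy = sum(p * p for p, _ in samples)
    yaw_energy = sum(y * y for _, y in samples)
    if max(pitch_energy, yaw_energy) < min_energy:
        return None  # head is essentially still
    if pitch_energy >= dominance * yaw_energy:
        return "affirmation"  # up-down movement dominates (nod)
    if yaw_energy >= dominance * pitch_energy:
        return "negation"     # left-right movement dominates (shake)
    return None  # mixed movement; no instruction determined


def handle_gesture(
    samples: Sequence[Sample],
    first_processing: Callable[[], str],
    second_processing: Callable[[], str],
) -> Optional[str]:
    """Run the first processing on affirmation, the second on negation."""
    gesture = classify_gesture(samples)
    if gesture == "affirmation":
        return first_processing()
    if gesture == "negation":
        return second_processing()
    return None
```

Returning `None` for ambiguous windows is a design choice of this sketch: it lets the caller keep listening rather than act on an uncertain gesture.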

[0746] "Example 1"

[0747] (Claim 1)

[0748] Motion detection means for detecting head movement,

[0749] A transmission means for transmitting the motion data detected by the motion detection means via wireless communication,

[0750] A data analysis means that analyzes the aforementioned motion data and determines instructions based on motion gestures,

[0751] A response generation means that performs information processing based on the determined instruction and generates the result as audio feedback,

[0752] A feedback transmission means for transmitting the generated audio feedback wirelessly,

[0753] A feedback presentation means for presenting the aforementioned audio feedback to the user,

[0754] A system comprising the above.

[0755] (Claim 2)

[0756] The system according to claim 1, wherein a first information processing is performed when the motion gesture indicates a positive instruction, and a second information processing is performed when it indicates a negative instruction.

[0757] (Claim 3)

[0758] The system according to claim 1, wherein the motion detection means and the data analysis means are connected by wireless communication.

[0759] "Application Example 1"

[0760] (Claim 1)

[0761] An audio output device equipped with a motion sensor that detects head movements,

[0762] A motion analysis device that analyzes motion data output from the motion sensor and determines instructions based on motion gestures,

[0763] A response generation device that processes information based on the determined instructions and provides feedback to the user through the audio output device,

[0764] A processing means equipped with a function to acquire information using head movements while the user is moving and to support the updating of route information,

[0765] A system comprising the above.

[0766] (Claim 2)

[0767] The system according to claim 1, wherein a first information processing is performed when the motion gesture indicates "affirmation," and a second information processing is performed when it indicates "negation."

[0768] (Claim 3)

[0769] The system according to claim 1, wherein the audio output device and the motion analysis device are connected by wireless communication.

[0770] "Example 2 combined with an emotion engine"

[0771] (Claim 1)

[0772] A detection device means for detecting the movement of the user's head,

[0773] An analysis device means for analyzing data from the aforementioned detection device means and determining instructions based on motion gestures,

[0774] An analysis device means for analyzing the user's voice characteristics and evaluating their emotional state,

[0775] A processing device that performs information processing using a generative AI model based on the determined instructions and analyzed emotional state,

[0776] A response device means that provides feedback to the user through audio output,

[0777] ...

[0778] A system comprising the above.

[0779] (Claim 2)

[0780] The system according to claim 1, wherein if the gesture indicates "affirmation" and a positive evaluation is made through sentiment analysis, a first information processing is performed, and if it indicates "negation" and a negative evaluation is made, a second information processing is performed.

[0781] (Claim 3)

[0782] The system according to claim 1, wherein the audio output device and the data analysis device are connected by wireless communication.
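The combined condition in claim 2 of this example (gesture plus evaluated emotional state) can be sketched as a small selection function. The enum names, the `NEUTRAL` value, and the choice to return `None` for combinations the claim does not mention are assumptions made for illustration only.

```python
from enum import Enum
from typing import Optional


class Gesture(Enum):
    AFFIRMATION = "affirmation"
    NEGATION = "negation"


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"  # hypothetical extra state, not named in the claim


def select_processing(gesture: Gesture, sentiment: Sentiment) -> Optional[str]:
    """Pick which information processing to run.

    Affirmation with a positive evaluation selects the first processing;
    negation with a negative evaluation selects the second. Other
    combinations are not specified in the text, so this sketch returns None.
    """
    if gesture is Gesture.AFFIRMATION and sentiment is Sentiment.POSITIVE:
        return "first"
    if gesture is Gesture.NEGATION and sentiment is Sentiment.NEGATIVE:
        return "second"
    return None
```

A real system would need a policy for the unspecified combinations, such as a nod accompanied by a negatively evaluated voice. That policy is outside what the text defines.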

[0783] "Application Example 2 combined with an emotion engine"

[0784] (Claim 1)

[0785] An acoustic output device equipped with a motion sensor that detects head movement,

[0786] An analysis device that analyzes motion data output from the motion sensor and determines instructions based on motion gestures,

[0787] A response generation device that processes information based on the determined instructions and the analyzed emotional state, and feeds the results back to the user through the acoustic output device,

[0788] An emotion analysis means that analyzes the characteristics of the user's voice and evaluates the user's emotional state,

[0789] A control mechanism that selects appropriate music or conversation according to the evaluated emotional state,

[0790] A system comprising the above.

[0791] (Claim 2)

[0792] The system according to claim 1, which performs a first information processing when the motion gesture indicates "affirmation" and a second information processing when it indicates "negation," and in addition performs a third information processing based on the evaluated emotional state.

[0793] (Claim 3)

[0794] The system according to claim 1, wherein the acoustic output device, the analysis device, and the emotion analysis means are connected by wireless communication.

[Explanation of symbols]

[0795]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: an audio output device equipped with a motion sensor that detects head movements; a motion analysis device that analyzes motion data output from the motion sensor and determines instructions based on motion gestures; a response generation device that processes information based on the determined instructions and provides feedback to the user through the audio output device; and a processing means equipped with a function to acquire information using head movements while the user is moving and to support the updating of route information.

2. The system according to claim 1, wherein a first information processing is performed when the motion gesture indicates "affirmation," and a second information processing is performed when it indicates "negation."

3. The system according to claim 1, wherein the audio output device and the motion analysis device are connected by wireless communication.