system

The system addresses the challenge of emotional misinterpretation in communication by converting and analyzing audio and video data to provide real-time emotional insights, enhancing interaction quality.

JP2026100642A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19


IPC: G06F3/01; G06F3/0481; G06Q50/10

AI Tagging

Application Domain

Input/output for user-computer interaction; Data processing applications

Technology Topics

Radiology; Computer vision


AI Technical Summary

Technical Problem

Existing communication systems struggle with accurately understanding and expressing emotions, particularly in telephone conversations and online meetings, leading to misunderstandings and reduced quality of interaction.

Method used

A system that acquires audio and video data, converts it into text, analyzes emotions, and visually displays the results in real-time to facilitate smoother communication.

Benefits of technology

Enables real-time emotional analysis and visualization, improving communication quality by allowing individuals to understand and respond to emotional changes promptly.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To provide a system. [Solution] A system including: acquisition means for acquiring audio data and video data; conversion means for generating text data using the audio data obtained by the acquisition means; analysis means for analyzing the audio data, text data, and video data to estimate emotions; and display means for visually displaying the emotion data obtained by the analysis means.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In modern communication, there is a problem that people often cannot communicate smoothly due to misunderstandings and oversights of emotions that are not expressed in words. In particular, this tendency is prominent in telephone conversations and online meetings where it is difficult to accurately read emotions. An environment where better human relationships can be built by appropriately understanding emotions is required, but there are limitations in current technologies.

Means for Solving the Problems

[0005] The present invention provides a system including acquisition means for acquiring audio data and video data, conversion means for converting audio data into text data, analysis means for analyzing this data to estimate emotions, and display means for visually displaying the obtained emotion data. This enables smoother communication by analyzing and visualizing emotions based on conversation and facial expressions in real time.
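The four means described in [0005] can be pictured as a chain of functions. The following is a minimal sketch under stated assumptions, not the implementation of the disclosure: the names `Capture`, `run_pipeline`, and the `fake_*` stand-ins are hypothetical, and the stand-ins replace real speech recognition and emotion models so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Capture:
    """Output of the acquisition means: raw audio and video payloads."""
    audio: bytes
    video: bytes


# Hypothetical signatures for the conversion, analysis, and display means.
Converter = Callable[[bytes], str]                          # audio -> text
Analyzer = Callable[[bytes, str, bytes], Dict[str, float]]  # -> emotion scores
Display = Callable[[Dict[str, float]], str]                 # -> rendered view


def run_pipeline(capture: Capture, convert: Converter,
                 analyze: Analyzer, display: Display) -> str:
    """Chain conversion, analysis, and display over one capture."""
    text = convert(capture.audio)
    emotions = analyze(capture.audio, text, capture.video)
    return display(emotions)


# Stand-in implementations (assumptions) so the sketch runs without models.
def fake_convert(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are a transcript


def fake_analyze(audio: bytes, text: str, video: bytes) -> Dict[str, float]:
    angry = 1.0 if "angry" in text else 0.0
    return {"anger": angry, "joy": 1.0 - angry}


def fake_display(emotions: Dict[str, float]) -> str:
    top = max(emotions, key=emotions.get)
    return f"dominant emotion: {top}"
```

Passing the means in as callables keeps each stage replaceable, which mirrors the way the disclosure defines each means by its role rather than by a particular device.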

[0006] "Audio data" refers to data obtained by converting the waveform of sound collected using a microphone or similar device into a digital format.

[0007] "Video data" refers to digital data of images or videos acquired using cameras or other visual sensors.

[0008] "Acquisition means" refers to a function consisting of hardware and software for collecting audio and video data.

[0009] "Conversion means" refers to software or algorithms that perform the process of converting audio data into text format.

[0010] "Analysis means" refers to software or hardware configurations that perform processing to estimate emotions based on collected audio data, text data, and video data.

[0011] "Emotional data" refers to data that indicates emotional states such as joy, anxiety, and anger, estimated through analytical methods.

[0012] "Display means" refers to a system configuration that includes a display or interface for visually displaying acquired and analyzed emotion data.

Brief Description of the Drawings

[0013]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

MODE FOR CARRYING OUT THE INVENTION

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, the numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0017] In the following embodiments, the numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0018] In the following embodiments, the numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, etc.

[0019] In the following embodiments, the numbered communication I/F (Interface) is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] The system according to the present invention performs real-time sentiment analysis and visualization to facilitate smoother user communication. This system is mainly composed of a terminal, a server, and the user, and performs a series of processes to acquire, analyze, and visualize audio and video data.

[0035] First, the terminal begins to acquire voice and video input from the user in real time. Using the microphone and camera, it captures the user's speech and facial expressions. The voice data is converted into text data by speech recognition software, enabling analysis based on the user's linguistic expressions.

[0036] The acquired data is transferred to the server, where analysis processing begins. Here, information such as voice tone, speaking speed, text content, and facial expressions is analyzed in detail by an AI analysis algorithm to identify emotions such as joy, anxiety, and anger. In this analysis, multiple data modules each detect emotional states using their own methods.

[0037] The analysis results are integrated on the server side, generating comprehensive emotional data. This emotional data is converted into graphs and charts in a visually easy-to-understand format and sent to the terminal. Here, the user can immediately check the analysis results as visual information, and grasp subtle emotional changes that might be difficult to notice through audio alone.
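The conversion of emotion data into a chart, as described in [0037], can be illustrated with a text-only rendering. This is a minimal sketch assuming that the emotion data is a mapping from emotion name to a score between 0 and 1; the function name `render_bars` and the bar format are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List


def render_bars(scores: Dict[str, float], width: int = 20) -> List[str]:
    """Render emotion scores in [0, 1] as fixed-width text bars, highest first."""
    lines = []
    for label, value in sorted(scores.items(), key=lambda kv: -kv[1]):
        clamped = min(max(value, 0.0), 1.0)   # guard against out-of-range scores
        filled = round(clamped * width)
        bar = "#" * filled + "." * (width - filled)
        lines.append(f"{label:<8}|{bar}| {clamped:.2f}")
    return lines


for line in render_bars({"joy": 0.5, "anger": 1.0}, width=4):
    print(line)
# anger   |####| 1.00
# joy     |##..| 0.50
```

A graphical terminal would draw the same data as the graphs and charts mentioned above; the text form is used here only so that the example stays self-contained.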

[0038] As a concrete example, in a customer service scenario, if a user (customer) expresses dissatisfaction during a call with a support representative, the terminal captures the voice and facial expressions that convey the customer's anxiety and anger. The server then analyzes these emotions and sends the results back to the terminal as a graph, allowing the representative to understand the customer's emotions in real time and adjust their response accordingly. This is expected to improve the quality of communication.

[0039] The following describes the processing flow.

[0040] Step 1:

[0041] The user activates the terminal and begins communication. The terminal's microphone and camera become active, capturing the user's voice and video.

[0042] Step 2:

[0043] The terminal converts the acquired audio data into text data using real-time speech recognition technology. During this process, acoustic characteristics such as intonation and speed are also extracted for analysis.

[0044] Step 3:

[0045] The terminal uses an expression recognition algorithm to analyze the user's facial expressions in the acquired video data. This analysis extracts facial features such as smiles and frown lines as digital data.

[0046] Step 4:

[0047] The terminal packages audio data, text data, and video data and sends them to the server for analysis.

[0048] Step 5:

[0049] The server uses a voice analysis module to analyze the tone and speed of the voice in the received audio data and extract elements that allow for the inference of emotion.

[0050] Step 6:

[0051] The server uses a text analysis module to identify sentiment expressions within text data using natural language processing. In this process, it generates sentiment scores using a word sentiment weighting dictionary.

[0052] Step 7:

[0053] The server uses a facial expression analysis module based on the video data to map each facial feature to an emotion and estimate the emotion.

[0054] Step 8:

[0055] The server integrates these analysis results to generate data that represents the overall emotional state.

[0056] Step 9:

[0057] The server converts the overall sentiment data into visualized data and formats it into graphs and charts.

[0058] Step 10:

[0059] The server sends the visualization data to the terminal.

[0060] Step 11:

[0061] The terminal receives the visualization data and displays the emotional state on the screen in real time, allowing the user to visually understand the current emotional state.
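Step 6 above generates sentiment scores using a word sentiment weighting dictionary. The following is a minimal sketch of that idea, assuming a tiny hand-written English dictionary; the entries in `LEXICON`, their weights, and the normalization to a sum of 1 are illustrative assumptions, since the disclosure does not specify the dictionary or the scoring rule.

```python
import re
from typing import Dict

# Hypothetical word-to-emotion weights; a real dictionary would be far larger.
LEXICON: Dict[str, Dict[str, float]] = {
    "great": {"joy": 1.0},
    "thanks": {"joy": 0.5},
    "worried": {"anxiety": 1.0},
    "unsure": {"anxiety": 0.5},
    "terrible": {"anger": 1.0},
    "unacceptable": {"anger": 1.0},
}


def score_text(text: str) -> Dict[str, float]:
    """Sum lexicon weights per emotion, then normalize so the scores sum to 1."""
    totals = {"joy": 0.0, "anxiety": 0.0, "anger": 0.0}
    for word in re.findall(r"[a-z']+", text.lower()):
        for emotion, weight in LEXICON.get(word, {}).items():
            totals[emotion] += weight
    total = sum(totals.values())
    if total == 0.0:
        return totals  # no dictionary hits: all scores stay at zero
    return {emotion: value / total for emotion, value in totals.items()}


print(score_text("This is terrible, I am worried"))
# {'joy': 0.0, 'anxiety': 0.5, 'anger': 0.5}
```

A dictionary lookup of this kind ignores negation and context, which is one reason the flow also draws on voice and facial expression analysis in Steps 5 and 7.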

[0062] (Example 1)

[0063] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0064] In conventional communication systems, it has been difficult to grasp changes in emotions in real time, which has been a factor in degrading the quality of communication. In this situation, there is a need to improve communication by instantly analyzing and visualizing the user's emotional state.

[0065] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0066] In this invention, the server includes acquisition means for acquiring acoustic data and visual data, conversion means for converting acoustic data into text data, analysis means for estimating emotional states, and display means for visualizing integrated emotional data. This makes it possible to analyze the user's emotional state in real time and present it visually.

[0067] "Acquisition means" refers to devices and technologies for collecting acoustic and visual data, and are responsible for capturing the user's speech and facial expressions.

[0068] "Conversion means" refers to a technology or device that converts acoustic data obtained by acquisition means into text data, thereby changing linguistic information into text format through a speech recognition process.

[0069] "Analysis means" refers to a technology or device that analyzes acoustic data, text data, and visual data to estimate the user's emotional state, and uses AI to identify emotions.

[0070] "Display means" refers to a technology or device for visualizing and presenting emotional data obtained by analysis means, in order to enable users to intuitively understand emotional changes.

[0071] This invention is an emotion analysis system designed to facilitate smoother user communication. The system is comprised of a terminal, a server, and the user, and features the ability to acquire, analyze, and visualize audio and video data in real time.

[0072] The terminal is responsible for acquiring audio and visual data from the user. Specifically, it uses a microphone to capture audio information and a camera to record the user's facial expressions. The audio data is immediately converted into text data by speech recognition software. This converts the user's speech into digital text, preparing it for analysis.

[0073] The collected audio and visual data are transferred to a server via an internet connection. On the server, AI analysis algorithms are executed based on this data. In this analysis process, the tone and speed of speech, text content, and facial expressions based on the video data are analyzed in detail. In this way, emotional states such as joy, anxiety, and anger are identified. Machine learning models and natural language processing techniques are used in this analysis.
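Two of the acoustic cues named in [0073], speaking speed and (as a rough proxy for tone) loudness, can be computed with simple arithmetic. This is a minimal sketch under stated assumptions: the samples are floating-point PCM values in the range -1 to 1, the utterance duration is already known, and the function names `rms_energy` and `speech_rate` are hypothetical. The disclosure does not state which acoustic features are actually used.

```python
import math
from typing import Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a block of PCM samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_rate(transcript: str, duration_s: float) -> float:
    """Words per second over the utterance; 0.0 when the duration is not positive."""
    if duration_s <= 0:
        return 0.0
    return len(transcript.split()) / duration_s


print(rms_energy([0.5, -0.5, 0.5, -0.5]))      # 0.5
print(speech_rate("one two three four", 2.0))  # 2.0
```

Features such as these would be inputs to the emotion estimation, not emotion labels in themselves; mapping them to joy, anxiety, or anger is left to the machine learning models mentioned above.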

[0074] The analyzed emotional data is integrated by a data module and converted into a visually easy-to-understand format. The terminal then presents these visualized results to the user through a user interface. This allows the user to understand their own or others' emotional changes in real time, improving the quality of communication.

[0075] A concrete example of its use is in customer service. If a user (customer) is dissatisfied during a call with a support representative, the system captures their voice and facial expressions, and the server analyzes their dissatisfaction or anger. As a result, the representative has the opportunity to understand the customer's emotions in real time and quickly adjust their response.

[0076] An example of a prompt message might be: "Explain how to use this emotion analysis system to identify emotions such as joy, anxiety, and anger in real time from the user's voice and facial expression data, and then visualize and present them."

[0077] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0078] Step 1:

[0079] The user starts the system, and the terminal uses a microphone and camera to acquire acoustic and visual data in real time. The input is the user's voice and facial expressions; the voice is output as digital audio data, and the facial expressions are output as video data. Specifically, the microphone captures the user's speech, and the camera photographs the user's face.

[0080] Step 2:

[0081] The terminal processes the acquired audio data through speech recognition software, converting it into text data. The input is digital audio data, and the output is the corresponding text string. In this conversion process, the audio waveform is analyzed, and the language recognition engine generates the corresponding string. Specifically, peaks and intervals in the audio waveform are identified and mapped to words.

[0082] Step 3:

[0083] The terminal sends audio, text, and visual data to the server. The input consists of converted text and video data, as well as the original audio data, all of which are transferred to the server. Specifically, the terminal's network module divides the data into packets and transmits the data using a secure communication protocol.

[0084] Step 4:

[0085] The server analyzes emotional states using an AI analysis algorithm based on the received data. The input consists of audio, text, and video data received by the server, and the output is the result of the emotional state determination. Specifically, the tone and speed of the audio data are analyzed, the content of the text is examined using natural language processing techniques, and facial expression analysis is performed on the video data.

[0086] Step 5:

[0087] The server integrates the analysis results to generate final emotion data. In this process, the input consists of partial results from each analysis, and the output is integrated emotion state data. Specifically, the individual analysis results are weighted to obtain a unified emotion evaluation score.

[0088] Step 6:

[0089] The server visualizes the data in a visually easy-to-understand format and sends it to the terminal. The input is integrated sentiment data, and the output is data in the form of graphs and charts. Specifically, the visualization tool analyzes the data and generates pie charts and line graphs.

[0090] Step 7:

[0091] The terminal presents the visualized emotional data to the user. The input consists of the graphs and charts received from the server, and the output is a user-viewable screen display. Specifically, the user interface is updated to show emotional states intuitively.
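Step 5 above weights the individual analysis results to obtain a unified emotion evaluation score. The following is a minimal sketch of one such weighting, a weighted average across modalities. The weight values in `WEIGHTS`, the modality names, and the renormalization when a modality is missing are assumptions made for illustration; the disclosure does not give them.

```python
from typing import Dict

# Hypothetical per-modality weights; the source does not give values.
WEIGHTS = {"voice": 0.3, "text": 0.3, "face": 0.4}


def fuse(results: Dict[str, Dict[str, float]],
         weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Weighted average of per-modality emotion scores.

    Modalities missing from `results` are skipped and the remaining
    weights are renormalized.
    """
    used = {m: w for m, w in weights.items() if m in results}
    total_weight = sum(used.values())
    if total_weight == 0:
        return {}
    emotions = {e for m in used for e in results[m]}
    return {
        e: sum(w * results[m].get(e, 0.0) for m, w in used.items()) / total_weight
        for e in sorted(emotions)
    }


fused = fuse({
    "voice": {"anger": 1.0},
    "text": {"anger": 0.5, "joy": 0.5},
    "face": {"joy": 1.0},
})
# anger is about 0.45 and joy about 0.55 with the weights above
```

Renormalizing over the available modalities lets the same function serve a voice-only call, where no facial data exists.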

[0092] (Application Example 1)

[0093] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0094] In modern communication, misunderstandings and friction often arise due to insufficient emotional transmission. Particularly within families, the inability to understand each other's emotional nuances can significantly reduce the quality of daily life. As a solution to this problem, there is a need for a system that analyzes and displays emotions in real time.

[0095] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0096] In this invention, the server includes an acquisition means for acquiring audio and video information, a conversion means for generating text information using the audio information obtained by the acquisition means, an analysis means for analyzing the audio information, text information, and video information and estimating emotions, a display means for visually displaying the emotion information obtained by the analysis means, and a dialogue means for providing appropriate dialogue and suggestions based on the user's emotional state. This enables real-time understanding of emotional states even within the home, facilitating smoother communication.

[0097] "Acquisition means" refers to the device or method for initially collecting audio and video information.

[0098] "Conversion means" refers to devices or methods used to convert acquired audio information into text information.

[0099] "Analysis means" refers to devices or methods for analyzing audio information, text information, and video information, and estimating emotions based on that analysis.

[0100] "Display means" refers to devices or methods for visually representing and presenting analyzed emotional information to the user.

[0101] "Dialogue means" refers to a device or method for providing users with appropriate conversations and suggestions based on their estimated emotional state.

[0102] To realize this invention, the user's terminal collects audio and video input. The terminal is equipped with a microphone and a camera, which are used to continuously acquire the user's voice and facial expressions. The acquired audio data is converted into text data using speech recognition software. In this case, speech recognition technology such as the Google® Speech-to-Text API is suitable.

[0103] The server receives audio, text, and video data transmitted from the terminal and executes an emotion analysis algorithm. Generative AI models such as the OpenAI® GPT series are effective for this analysis. The analysis algorithm comprehensively analyzes factors such as speech intonation and speed, text content, and facial expressions in the video to generate emotion data.

[0104] The generated emotional data is displayed in a visually easy-to-understand format. The server sends this visual data to the terminal, which displays it on its screen as graphs and icons. Users can use these results to make decisions that facilitate communication within the family.

[0105] For example, if a robot observes a child doing homework and determines that their concentration is waning, it can suggest, "Why don't you take a break? Let's play together!" Another example of a prompt to facilitate interaction that responds to the user's emotions is, "Please tell me how to analyze my family's emotions in real time and make appropriate conversations and suggestions based on that." This prompt is an element that enables the automation of emotion-based communication using generative AI models.

[0106] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0107] Step 1:

[0108] The terminal acquires audio and video data. Specifically, it captures the user's speech and facial expressions in real time using the camera and microphone. The input is the user's audio and video information, and the output is the corresponding digital data. The terminal then prepares this data to send to the server.

[0109] Step 2:

[0110] The terminal converts the audio data into text data. Speech recognition software is used to convert the audio information into text. The input is the audio data acquired in step 1, and the output is the text data of that audio. The converted text data is transferred to a server for further analysis.

[0111] Step 3:

[0112] The server analyzes the audio, text, and video data it receives. The analysis uses a generative AI model to analyze information in detail, including speech intonation and speed, text content, and facial expressions. The input is audio, text, and video data sent from the terminal, and the output is the analysis result representing the emotional state.

[0113] Step 4:

[0114] The server converts the emotional data into a data format for visual display. The emotional analysis results are processed into graphs and icons, making them visually displayable. The input is the emotional analysis results obtained in step 3, and the output is the visually displayed data. This data is sent to the terminal.

[0115] Step 5:

[0116] The terminal receives display data and displays emotional information to the user. Emotional changes are visually represented on the display, allowing the user to intuitively understand the situation. The input is visual display data received from the server, and the output is the information displayed on the screen.

[0117] Step 6:

[0118] The server generates appropriate dialogue and suggestions based on the user's emotional state. A generative AI model is used to devise appropriate responses based on the generated emotional data. The input is the result of the emotional analysis in step 3, and the output is a dialogue message as a prompt. This message is sent back to the terminal.
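Step 6 above passes the emotion analysis result to a generative AI model to obtain a dialogue message. The following is a minimal sketch of how the input prompt for such a model might be assembled. The function name `build_dialogue_prompt` and the prompt wording are hypothetical, and the call to the model itself is deliberately omitted because the disclosure does not specify an interface for it.

```python
from typing import Dict


def build_dialogue_prompt(emotions: Dict[str, float], transcript: str) -> str:
    """Compose a text prompt asking a generative model for a suitable reply."""
    if not emotions:
        raise ValueError("emotions must not be empty")
    dominant = max(emotions, key=emotions.get)
    summary = ", ".join(f"{name}={score:.2f}"
                        for name, score in sorted(emotions.items()))
    return (
        "Estimated emotion scores: " + summary + ".\n"
        f"Dominant emotion: {dominant}.\n"
        f'Latest utterance: "{transcript}"\n'
        "Suggest one short, supportive reply suited to this emotional state."
    )


print(build_dialogue_prompt({"joy": 0.1, "anxiety": 0.7},
                            "I can't finish this homework"))
```

Including the numeric scores as well as the dominant emotion gives the model the chance to respond to mixed states, for example mild joy alongside strong anxiety.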

[0119] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0120] The system according to the present invention aims to acquire audio and video data and accurately analyze the user's emotions, and is equipped with advanced emotion recognition capabilities combined with an emotion engine. The system mainly consists of a terminal, a server, and an emotion engine.

[0121] The terminal is equipped with a microphone and camera to capture the user's voice and facial expressions in real time. The voice data is converted into text data, while characteristics such as intonation and speech speed are analyzed in parallel. All data obtained through this process is sent to the server for further detailed analysis.

[0122] On the server, data analysis is performed through an integrated emotion engine. The functions provided by the server include speech recognition, natural language processing, and facial expression recognition, which are interconnected to perform comprehensive emotion analysis. The emotion engine generates emotion data in real time and has the ability to capture moments of emotional change by visually representing the user's emotional state.
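Capturing moments of emotional change, as mentioned in [0122], can be illustrated on a time series of dominant emotions. This is a minimal sketch assuming that the emotion engine emits a timestamped label per analysis window; the function name `find_changes` and the tuple layout are hypothetical and are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple


def find_changes(
    timeline: Sequence[Tuple[float, str]],
) -> List[Tuple[float, str, str]]:
    """Return (timestamp, previous, current) wherever the dominant emotion changes."""
    changes = []
    for (_, prev), (t, cur) in zip(timeline, timeline[1:]):
        if cur != prev:
            changes.append((t, prev, cur))
    return changes


timeline = [(0.0, "joy"), (1.0, "joy"), (2.0, "anger"),
            (3.0, "anger"), (4.0, "anxiety")]
print(find_changes(timeline))
# [(2.0, 'joy', 'anger'), (4.0, 'anger', 'anxiety')]
```

In practice, labels may flicker between adjacent windows, so a real system would likely smooth the series before detecting changes; that step is left out here.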

[0123] The emotion engine employs machine learning algorithms and continuously optimizes various analysis models through reinforcement learning. Furthermore, the server accumulates past user data, enabling personalized analysis tailored to each user. As a result, even for the same user, past emotional history is reflected in the analysis, leading to more accurate emotion estimation.

[0124] As a concrete example, when used in an educational setting, users (students) can check their own understanding of the learning material using a terminal equipped with an emotion engine. The data collected via the server is transmitted to the instructor in real time, enabling the progress of lessons and countermeasures based on the emotional changes of individual students. Therefore, this invention contributes to creating an environment in which individualized learning is promoted and student motivation is further enhanced, even in educational settings.

[0125] The following describes the processing flow.

[0126] Step 1:

[0127] The user powers on the device, and the built-in microphone and camera automatically turn on. This initiates the acquisition of audio and video data.

[0128] Step 2:

[0129] The speech recognition software uses the audio data acquired by the device to convert it into text data in real time. Simultaneously, data is collected to analyze features such as intonation and speed of the speech.

[0130] Step 3:

[0131] The device uses video data acquired through its camera to apply a facial recognition algorithm to extract facial features. This data includes subtle expressions and moments of change.

[0132] Step 4:

[0133] The device collects data from voice, text, and video and sends it to the server. This transmission is performed using a secure and fast protocol.

[0134] Step 5:

[0135] The server inputs the received data into the emotion engine. The emotion engine comprehensively analyzes the tone of voice, the content of the text, and the facial expressions in the video to estimate the emotional state.

[0136] Step 6:

[0137] The server utilizes machine learning algorithms to update the analysis model through reinforcement learning. This improves the model's accuracy in estimating emotions.

[0138] Step 7:

[0139] The server uses past user data to perform personalized emotion analysis. Drawing on the user's past emotional history allows a more precise estimate of the user's current emotional state.

[0140] Step 8:

[0141] The server visualizes the emotion data obtained through analysis and converts it into pie charts and vertical or horizontal graphs. This visualized data is then sent to the terminal.

[0142] Step 9:

[0143] The device receives visualization data and displays it in real time on the user interface. This allows users to intuitively understand their own emotional state.
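
The visualization in Step 8 can be illustrated with a minimal sketch of preparing pie-chart data. It assumes the emotion engine outputs one non-negative score per emotion label; the function name `to_pie_shares`, the labels, and the sample values are hypothetical and do not appear in the specification, and no charting library is involved.

```python
def to_pie_shares(scores: dict[str, float]) -> dict[str, float]:
    """Convert non-negative emotion scores into percentage shares for a pie chart."""
    total = sum(scores.values())
    if total <= 0:
        # No signal at all: split evenly so a chart can still be drawn.
        return {label: 100.0 / len(scores) for label in scores}
    return {label: 100.0 * value / total for label, value in scores.items()}


# Hypothetical engine output for one analysis window.
shares = to_pie_shares({"joy": 0.6, "anxiety": 0.3, "anger": 0.1})
```

The resulting shares sum to 100 and could be handed to whatever charting component the terminal uses.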

[0144] (Example 2)

[0145] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0146] Existing systems for analyzing emotions lack accuracy in analyzing audio and video data, making it difficult to accurately and in real time understand a user's emotional state. Furthermore, personalized analysis tailored to individual users is not sufficiently implemented, hindering the improvement of emotion analysis accuracy. Additionally, insufficient feedback through visual representations of emotional changes presents challenges to practical application in educational settings and other fields.

[0147] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0148] In this invention, the server includes an acquisition means equipped with a terminal for acquiring audio data and video data; a conversion means for generating text data using the audio data obtained by the acquisition means; an analysis means for analyzing the intonation and speed of the audio data and estimating emotions from the text data and video data; a display means for visually displaying the emotion data; and a personalization means for accumulating past emotion data and optimizing emotion analysis according to the user. This enables accurate and real-time analysis of the user's emotions, and provides personalized, highly accurate emotion analysis and visual feedback.

[0149] "Acquisition means" refers to devices or methods that have the function of acquiring audio data and video data in real time.

[0150] A "conversion means" refers to a device or method that has the function of generating text data based on acquired audio data.

[0151] "Analysis means" refers to devices or methods that have the function of analyzing audio data, text data, and video data to estimate the user's emotions.

[0152] "Display means" refers to devices or methods that have the function of visually displaying analyzed emotional data.

[0153] "Personalization means" refers to devices or methods that accumulate past emotional data and have the function of optimizing emotional analysis according to the user.

[0154] "Speech recognition" is a technology that analyzes audio data to understand what is being said and converts it into text data.

[0155] "Natural language processing" is a technology that analyzes text data to understand emotions, intentions, and other related concepts.

[0156] "Facial expression recognition" is a technology that analyzes video data to read facial expressions and understand emotions based on those expressions.

[0157] A "machine learning algorithm" is a mathematical model and method for learning from large amounts of data and predicting future data.

[0158] Reinforcement learning is a type of machine learning that learns the optimal action through trial and error.

[0159] In this invention, the terminal is equipped with a microphone and a camera as devices for acquiring audio and video data. The audio data is used to capture features of the speaker's voice, including intonation and speed. The video data is used to capture the user's facial expressions and movements in real time. The audio data is converted into text data using speech recognition technology, and its features such as intonation and speed are further analyzed.
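
As a rough illustration of turning intonation and speed into numerical data, the sketch below summarizes one utterance. It assumes that a word count, the utterance duration, and a per-frame pitch contour in Hz are already available from upstream speech processing, and that a pitch value of 0 marks an unvoiced frame; the function name `speech_features` and these input conventions are assumptions, not details from the specification.

```python
from statistics import mean, pstdev


def speech_features(word_count: int, duration_s: float, pitch_hz: list[float]) -> dict[str, float]:
    """Summarize speaking speed and intonation as plain numbers."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Assumption: a pitch of 0 marks an unvoiced frame and is ignored.
    voiced = [p for p in pitch_hz if p > 0]
    return {
        "words_per_second": word_count / duration_s,
        "pitch_mean_hz": mean(voiced) if voiced else 0.0,
        # Spread of the pitch contour as a simple proxy for intonation.
        "pitch_std_hz": pstdev(voiced) if voiced else 0.0,
    }


features = speech_features(word_count=12, duration_s=4.0, pitch_hz=[0.0, 180.0, 200.0, 220.0])
```

Pitch spread is only one possible proxy for intonation; a real system would likely use richer prosodic features.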

[0160] Data acquired by the device is sent to a server where all analysis takes place. The server comprehensively analyzes the data using an emotion engine that includes speech recognition, natural language processing, and facial expression recognition. The emotion engine employs machine learning algorithms and continuously optimizes the analysis model through reinforcement learning. This makes it possible to accurately analyze the user's emotional state in real time.

[0161] Furthermore, the server incorporates previously accumulated emotional data for each user into its analysis, enabling personalized emotional analysis for each user. This allows for highly accurate emotional estimation based on past emotional history, even for the same user. The analyzed emotional data is visually displayed, allowing for a concrete representation of the user's emotional changes.

[0162] One concrete example of its use is in educational settings. Users (students) would use this system to check their understanding of their learning during class. The analyzed emotional data would also be provided to instructors in real time, enabling them to tailor their lessons to the individual student's emotional changes.

[0163] As an example of a prompt when using a generative AI model, you can use the instruction "Analyze and visualize the current emotional state" to specify how the system should analyze and visualize the user's emotions.

[0164] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0165] Step 1:

[0166] The device uses a microphone and camera to acquire user audio and video data. It receives audio and video from the environment as input and converts them into digital data as output. Specifically, it captures the user's voice and continuously records their facial expressions.

[0167] Step 2:

[0168] This process uses speech recognition technology to generate text data from audio data acquired by the device. Simultaneously, it analyzes the intonation and speed of the speech. The input for this step is the acquired audio data, and the output is text data and speech features. Specifically, it analyzes the speech, converts the spoken content into a string, and generates numerical data for intonation and speed.

[0169] Step 3:

[0170] The terminal packages the analyzed audio, text, and video data and sends it to the server. The input for this step is audio features, text data, and video data, and the output is the transmission of data to the server. Specifically, the terminal ensures real-time performance by accumulating a certain amount of data and then rapidly transferring it to the server.

[0171] Step 4:

[0172] The server comprehensively analyzes the received data based on audio, text, and video. It uses an emotion engine to analyze the data and estimate emotions. The input for this step is the dataset transferred to the server, and the output is the emotion data estimated through the analysis. Specific operations include keyword extraction using speech recognition, text analysis using natural language processing, and facial analysis using facial expression recognition.

[0173] Step 5:

[0174] The server visually represents the user's emotional state based on the analysis results. The input for this step is the emotional data obtained through analysis, and the output is the visualized emotional state. Specific actions include converting the emotional data into graphs and animations, making it easy for users and educational institutions to review the emotional state.

[0175] Step 6:

[0176] The server personalizes emotion analysis using accumulated historical user data. The input for this step is historical emotion data, and the output is a current emotion estimate that reflects past analysis. Specifically, the process involves incorporating historical data into the analysis to learn each user's emotion change patterns and improve the accuracy of predictions.
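
The personalization in Step 6 can be sketched in a very simplified form. The specification does not state how past data is incorporated, so the example below assumes an exponential moving average as a per-user baseline and reports the current score relative to it; this is a stand-in for illustration, not the reinforcement learning mentioned in the text, and the names `update_baseline`, `personalized_score`, and `alpha` are hypothetical.

```python
def update_baseline(baseline: float, observation: float, alpha: float = 0.2) -> float:
    """Exponential moving average of a user's past scores for one emotion."""
    return (1 - alpha) * baseline + alpha * observation


def personalized_score(raw: float, baseline: float) -> float:
    """Express the current score relative to the user's usual level."""
    return raw - baseline


# Hypothetical past "anxiety" scores for one user, oldest first.
history = [0.2, 0.3, 0.25]
baseline = history[0]
for past in history[1:]:
    baseline = update_baseline(baseline, past)

# A raw score of 0.6 is well above this user's usual level.
deviation = personalized_score(0.6, baseline)
```

With a baseline of this kind, the same raw score can be read differently for a user who is habitually calm and one who is habitually tense.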

[0177] (Application Example 2)

[0178] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0179] Conventional consumer robots have difficulty accurately understanding user emotions and taking appropriate measures, resulting in insufficient stress reduction and the provision of a comfortable living environment. Therefore, there is a need for technology that can grasp changes in user emotions in real time and respond appropriately based on that information.

[0180] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0181] In this invention, the server includes a device means for acquiring audio information and video information, a conversion device means for generating text information using the audio information obtained by the device means, and an estimation means for analyzing the audio information, text information, and video information and estimating emotions. This makes it possible to analyze the user's emotions in real time and select and play the most suitable music based on those emotions, thereby reducing user stress and providing a comfortable living environment.

[0182] "Audio information" refers to data obtained from speech, including the user's vocal characteristics and intonation.

[0183] "Visual information" refers to visual data, including information that records the user's facial expressions and movements.

[0184] "Device means" refers to means that have hardware or software for acquiring audio information or video information.

[0185] A "conversion device means" is a device that performs a process to convert audio information into text information.

[0186] "Text information" refers to information that represents audio information as character data.

[0187] "Estimation means" refers to means for analyzing the acquired audio, text, and video information and estimating emotions from it.

[0188] "Emotional information" refers to data that represents the user's emotional state and is obtained through the estimation means.

[0189] "Acoustic device means" refers to a device for selecting and playing appropriate music based on emotional information.

[0190] The system implementing this invention mainly consists of terminal devices for acquiring audio and video information, and a server for analyzing and processing that data. This system employs highly integrated technology to grasp the user's emotions in real time.

[0191] The terminal is equipped with a microphone and camera to acquire audio and video information from the user. Audio information is acoustic data obtained from the user's speech, and video information is visual data representing the user's facial expressions and movements. This information is transmitted to the server in real time.

[0192] The server converts the received audio information into text information using the conversion device means, and uses this text to perform emotion analysis. In particular, as the estimation means, an emotion engine that combines speech recognition technology and facial expression analysis algorithms comprehensively analyzes the audio information, text information, and video information to estimate the user's emotion information. This emotion information includes the user's stress level and emotional state, and is represented visually.

[0193] Furthermore, based on emotional information, the server controls the sound system and selects and plays the most suitable music. For this purpose, it also includes an algorithm to smoothly find songs from the music library that match the user's emotions.
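
One simple way to match songs to emotions is a nearest-match lookup, sketched below. It assumes that each track in the music library carries a `calmness` tag between 0 and 1 and that the estimated stress level is on the same scale; the `Track` class, the tag, and the matching rule are all hypothetical and are not described in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    title: str
    calmness: float  # 0.0 (energetic) to 1.0 (very calm); hypothetical tag


def select_track(stress_level: float, library: list[Track]) -> Track:
    """Pick the track whose calmness is closest to the user's stress level."""
    if not library:
        raise ValueError("music library is empty")
    # In this sketch, higher stress calls for calmer music.
    return min(library, key=lambda t: abs(t.calmness - stress_level))


library = [Track("Upbeat Pop", 0.1), Track("Soft Piano", 0.9), Track("Acoustic Folk", 0.5)]
choice = select_track(stress_level=0.85, library=library)
```

Here a stress level of 0.85 selects the calmest track in the hypothetical library.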

[0194] For example, after a busy day at work, music might be played to help a user relax upon returning home. The system detects the user's voice tone and facial expressions that indicate fatigue, and plays relaxing music accordingly to reduce the user's stress.

[0195] An example of a prompt message would be, "Based on the user's voice and video, detect their current emotional state. If the user is stressed, suggest actions to help them relax." This allows the system to provide user-centered interaction and care that was not possible with conventional technologies.

[0196] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0197] Step 1:

[0198] The device activates its microphone and camera to acquire audio and video information from the user in real time. The input is the user's raw voice and facial expression data, and the output is the acquired digital audio and visual data. This acquisition process provides foundational data for analyzing the user's emotional state.

[0199] Step 2:

[0200] The terminal sends the acquired audio information to the server. The server applies a speech recognition algorithm to convert the audio information into text information. The input is audio data, and the output is the corresponding text information. In this process, the intonation and speed of the speech are also analyzed, and characteristic information for emotion estimation is extracted along with the text data.

[0201] Step 3:

[0202] The terminal transmits video information to the server. The server analyzes the video information and applies a facial expression analysis algorithm to detect facial expressions. The input is visual data, and the output is digital information representing facial features. This analysis yields an index of emotion based on changes in facial expressions.

[0203] Step 4:

[0204] The server integrates voice characteristic information, text information, and facial feature information, and the emotion engine estimates the user's emotions. The input is all the information processed in the previous step, and the output is the estimated emotion information. The estimated emotions include emotional states such as stress levels and feelings of happiness. In this process, the emotion engine analyzes the data using machine learning algorithms.

[0205] Step 5:

[0206] The server uses emotional information to select the most suitable music and controls the audio device via the terminal to play the music. The input is estimated emotional information, and the output is the selected music data. This process uses a generative AI model to dynamically select the song that best suits the user's current emotions. Furthermore, based on prompts, the entire system completes an interaction that aligns with the user's emotions.

[0207] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0208] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0209] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may be performed by the smart device 14.

[0210] [Second Embodiment]

[0211] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0212] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0213] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0214] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0215] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0216] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0217] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0218] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0219] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0220] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0221] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0222] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0223] The system according to the present invention performs real-time emotion analysis and visualization to facilitate smoother user communication. This system is mainly composed of a terminal, a server, and the user, and performs a series of processes to acquire, analyze, and visualize audio and video data.

[0224] First, the device begins to acquire voice and video input from the user in real time. Using the microphone and camera, it accurately captures the user's speech and facial expressions. The voice data is converted into text data by speech recognition software, enabling analysis based on the user's linguistic expressions.

[0225] The acquired data is transferred to a server, where analysis processing begins. Here, information such as voice tone, speed, text content, and facial expressions are analyzed in detail by an AI analysis algorithm to identify emotions such as joy, anxiety, and anger. Based on this analysis, multiple data modules detect different emotional states using their own unique methods.

[0226] The analysis results are integrated on the server side, generating comprehensive emotional data. This emotional data is converted into graphs and charts in a visually easy-to-understand format and sent to the terminal. Here, the user can immediately check the analysis results as visual information, and grasp subtle emotional changes that might be difficult to notice through audio alone.

[0227] As a concrete example, in a customer service scenario, if a user (customer) expresses dissatisfaction during a phone call with a support representative, the device captures their voice and facial expressions expressing their anxiety and anger. The server then analyzes these emotions and sends the results back to the device in a graph, allowing the representative to understand the customer's emotions in real time and adjust their response accordingly. This is expected to improve the quality of communication.

[0228] The following describes the processing flow.

[0229] Step 1:

[0230] The user activates the device and begins communication. The device's microphone and camera become active, capturing the user's voice and video.

[0231] Step 2:

[0232] The audio data acquired by the device is converted into text data using real-time speech recognition technology. During this process, acoustic characteristics such as intonation and speed are also extracted for analysis.

[0233] Step 3:

[0234] The device uses an expression recognition algorithm to analyze the user's facial expressions from the video data it acquires. This analysis extracts facial features such as smiles and frown lines as digital data.

[0235] Step 4:

[0236] The terminal packages audio data, text data, and video data and sends them to the server for analysis.

[0237] Step 5:

[0238] The server uses a voice analysis module to analyze the tone and speed of the voice in the received audio data and extract elements that allow for the inference of emotion.

[0239] Step 6:

[0240] The server uses a text analysis module to identify emotional expressions within the text data using natural language processing. In this process, it generates emotion scores using a dictionary that assigns emotion weights to words.

[0241] Step 7:

[0242] The server uses a facial expression analysis module based on the video data to map each facial feature to an emotion and estimate the emotion.

[0243] Step 8:

[0244] The server integrates these analysis results to generate data that represents the overall emotional state.

[0245] Step 9:

[0246] The server converts the overall emotion data into visualized data and formats it into graphs and charts.

[0247] Step 10:

[0248] The server sends the visualization data to the terminal.

[0249] Step 11:

[0250] The device receives visualization data and displays the emotional state on the screen in real time, allowing the user to visually understand their current emotional state.
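
The dictionary-based scoring in Step 6 can be sketched as follows. The dictionary contents, the emotion labels, and the rule of summing weights per emotion are assumptions made for illustration; the specification does not give the dictionary or the scoring formula.

```python
import re

# Hypothetical weighting dictionary: word -> (emotion label, weight).
LEXICON = {
    "great": ("joy", 1.0),
    "thanks": ("joy", 0.5),
    "worried": ("anxiety", 1.0),
    "unacceptable": ("anger", 1.0),
}


def score_text(text: str) -> dict[str, float]:
    """Sum dictionary weights per emotion over the words in the text."""
    scores: dict[str, float] = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in LEXICON:
            label, weight = LEXICON[word]
            scores[label] = scores.get(label, 0.0) + weight
    return scores


scores = score_text("Thanks, but I am worried. This delay is unacceptable!")
```

A production text analysis module would also need to handle negation, inflection, and languages other than English, which this sketch ignores.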

[0251] (Example 1)

[0252] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0253] In conventional communication systems, it has been difficult to grasp changes in emotions in real time, which has been a factor in degrading the quality of communication. In this situation, there is a need to improve communication by instantly analyzing and visualizing the user's emotional state.

[0254] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0255] In this invention, the server includes acquisition means for acquiring acoustic data and visual data, conversion means for converting acoustic data into text data, analysis means for estimating emotional states, and display means for visualizing integrated emotional data. This makes it possible to analyze the user's emotional state in real time and present it visually.

[0256] "Acquisition means" refers to devices and technologies for collecting acoustic and visual data, and are responsible for capturing the user's speech and facial expressions.

[0257] "Conversion means" refers to a technology or device that converts acoustic data obtained by acquisition means into text data, thereby changing linguistic information into text format through a speech recognition process.

[0258] "Analysis means" refers to a technology or device that analyzes acoustic data, text data, and visual data to estimate the user's emotional state, and uses AI to identify emotions.

[0259] "Display means" refers to a technology or device for visualizing and presenting emotional data obtained by analysis means, in order to enable users to intuitively understand emotional changes.

[0260] This invention is an emotion analysis system designed to facilitate smoother user communication. The system is comprised of a terminal, a server, and the user, and features the ability to acquire, analyze, and visualize audio and video data in real time.

[0261] The device is responsible for acquiring audio and visual data from the user. Specifically, it uses a microphone to capture audio information and a camera to record the user's facial expressions. The audio data is immediately converted into text data by speech recognition software. This instantly converts the user's speech into digital text format, preparing it for analysis.

[0262] The collected audio and visual data are transferred to a server via an internet connection. On the server, AI analysis algorithms are executed based on this data. In this analysis process, the tone and speed of speech, text content, and facial expressions based on the video data are analyzed in detail. In this way, emotional states such as joy, anxiety, and anger are identified. Machine learning models and natural language processing techniques are used in this analysis.

[0263] The analyzed emotional data is integrated by a data module and converted into a visually easy-to-understand format. The device then presents these visualized results to the user through a user interface. This allows the user to understand their own or others' emotional changes in real time, improving the quality of communication.

[0264] A concrete example of its use is in customer service. If a user (customer) is dissatisfied during a call with a support representative, the system captures their voice and facial expressions, and the server analyzes their dissatisfaction or anger. As a result, the representative has the opportunity to understand the customer's emotions in real time and quickly adjust their response.

[0265] An example of a prompt message might be: "Explain how to use this emotion analysis system to identify emotions such as joy, anxiety, and anger in real time from the user's voice and facial expression data, and then visualize and present them."

[0266] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0267] Step 1:

[0268] The user starts the system, and the terminal uses a microphone and camera to acquire acoustic and visual data in real time. The input is the user's voice and facial expressions; the voice is output as digital audio data, and the facial expressions are output as video data. Specifically, the microphone captures the user's speech, and the camera photographs the user's face.

[0269] Step 2:

[0270] The device processes the acquired audio data through speech recognition software, converting it into text data. The input is digital audio data, and the output is the corresponding text string. In this conversion process, the audio waveform is analyzed, and the language recognition engine generates the corresponding string. Specifically, peaks and intervals in the audio waveform are identified and mapped to words.

[0271] Step 3:

[0272] The terminal sends audio, text, and visual data to the server. The input consists of converted text and video data, as well as the original audio data, all of which are transferred to the server. Specifically, the terminal's network module divides the data into packets and transmits the data using a secure communication protocol.

[0273] Step 4:

[0274] The server analyzes emotional states using an AI analysis algorithm based on the received data. The input consists of audio, text, and video data received by the server, and the output is the result of the emotional state determination. Specifically, the tone and speed of the audio data are analyzed, the content of the text is examined using natural language processing techniques, and facial expression analysis is performed on the video data.

[0275] Step 5:

[0276] The server integrates the analysis results to generate final emotion data. In this process, the input consists of partial results from each analysis, and the output is integrated emotion state data. Specifically, the individual analysis results are weighted to obtain a unified emotion evaluation score.

[0277] Step 6:

[0278] The server converts the integrated emotion data into a visually understandable format and transmits it to the terminal. The input is the integrated emotion data, and the output is data in the form of graphs and charts. Specifically, a visualization tool processes the data and generates pie charts and line graphs.

[0279] Step 7:

[0280] The terminal presents the visualized emotion data to the user. The input is the graphs and charts received from the server, and the output is a screen display that can be confirmed by the user. As a specific operation, the user interface is updated so that the emotional state is intuitively shown.
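
The weighted integration described in Step 5 could be sketched as follows. This is a minimal illustration, not the method of the specification: the weight values, the modality names, and the function name `fuse_emotion_scores` are assumptions, since the text states only that the individual analysis results are weighted to obtain a unified score.

```python
from typing import Dict

# Hypothetical per-modality weights; the source only says results are "weighted".
MODALITY_WEIGHTS: Dict[str, float] = {"audio": 0.3, "text": 0.3, "video": 0.4}


def fuse_emotion_scores(
    partial: Dict[str, Dict[str, float]],
    weights: Dict[str, float] = MODALITY_WEIGHTS,
) -> Dict[str, float]:
    """Combine per-modality emotion scores into one weighted average per emotion."""
    # Use only modalities that have a weight and actually delivered a result.
    used = {m: w for m, w in weights.items() if m in partial}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no weighted modality results to integrate")

    emotions = sorted({e for m in used for e in partial[m]})
    fused: Dict[str, float] = {}
    for emotion in emotions:
        # An emotion missing from one modality counts as 0.0 for that modality.
        weighted_sum = sum(w * partial[m].get(emotion, 0.0) for m, w in used.items())
        fused[emotion] = weighted_sum / total_weight
    return fused


if __name__ == "__main__":
    partial_results = {
        "audio": {"joy": 0.2, "anger": 0.6},
        "text": {"joy": 0.1, "anger": 0.7},
        "video": {"joy": 0.3, "anger": 0.5},
    }
    print(fuse_emotion_scores(partial_results))
```

Dividing by the sum of the weights actually used keeps the result on the same scale when one modality is unavailable, for example when the camera is covered.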

[0281] (Application Example 1)

[0282] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0283] In modern communication, misunderstandings and frictions often occur due to insufficient transmission of emotions. Especially within a family, the inability to understand the nuances of emotions can sometimes be a factor that degrades the quality of daily life. As a solution to this problem, a system that analyzes and displays emotions in real time is required.

[0284] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0285] In this invention, the server includes an acquisition means for acquiring audio information and video information, a conversion means for generating character information using the audio information obtained by the acquisition means, an analysis means for analyzing the audio information, character information, and video information to estimate emotions, a display means for visually displaying the emotion information obtained by the analysis means, and a dialogue means for performing appropriate dialogues and proposals based on the user's emotional state. Thereby, even within a family, the emotional state can be grasped in real time, and smoother communication becomes possible.

[0286] The "acquisition means" is a device or method for first collecting audio information and video information.

[0287] The "conversion means" is a device or method used to convert the acquired audio information into character information.

[0288] The "analysis means" is a device or method for analyzing audio information, character information, and video information and estimating emotions based on them.

[0289] The "display means" is a device or method for visually expressing the analyzed emotion information and presenting it to the user.

[0290] The "dialogue means" is a device or method for providing conversations and suggestions suitable for the user based on the estimated emotional state.

[0291] To implement this invention, the terminal used by the user collects audio and video inputs. The terminal is equipped with a microphone and a camera, and these are used to continuously acquire the user's voice and facial expressions. The acquired voice data is converted into character data using speech recognition software. For this purpose, a speech recognition service such as the Google Speech-to-Text API is suitable.

[0292] The server receives the voice data, character data, and video data transmitted from the terminal and executes an emotion analysis algorithm. It is effective to use a generative AI model such as the OpenAI GPT series for this analysis. The analysis algorithm integrally analyzes the intonation and speed of the voice, the text content, the expression on the video, etc., and generates emotion data.

[0293] The generated emotion data is displayed in a visually easy-to-understand format. The server sends this visual data to the terminal, which displays it on its screen as graphs and icons. Users can use this information to make decisions that facilitate communication within the family.

[0294] For example, if a robot observes a child doing homework and determines that their concentration is waning, it can suggest, "Why don't you take a break? Let's play together!" Another example of a prompt to facilitate interaction that responds to the user's emotions is, "Please tell me how to analyze my family's emotions in real time and make appropriate conversations and suggestions based on that." This prompt is an element that enables the automation of emotion-based communication using generative AI models.

[0295] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0296] Step 1:

[0297] The device acquires audio and video data. Specifically, it captures the user's speech and facial expressions in real time using the camera and microphone. The input is the user's audio and video information, and the output is their digital data. The device then prepares this data to send to the server.

[0298] Step 2:

[0299] The terminal converts the audio data into text data. Speech recognition software is used to convert the audio information into text. The input is the audio data acquired in step 1, and the output is the text data of that audio. The converted text data is transferred to a server for further analysis.

[0300] Step 3:

[0301] The server analyzes the received voice data, text data, and video data. For the analysis, an AI generation model is used to analyze in detail information including the intonation and speed of the voice, the text content, and expressions. The input is the voice, text, and video data transmitted from the terminal, and the output is the analysis result representing the emotional state.

[0302] Step 4:

[0303] The server converts the emotion data into a data format for visual display. The emotion analysis result is rendered as a graph or icon so that it can be displayed visually. The input is the emotion analysis result obtained in Step 3, and the output is the visual display data. This data is transmitted to the terminal.

[0304] Step 5:

[0305] The terminal receives the display data and displays the emotion information to the user. The change in emotion is visually represented on the display so that the user can intuitively grasp the situation. The input is the visual display data received from the server, and the output is the display information on the display.

[0306] Step 6:

[0307] The server generates appropriate conversations and proposals based on the user's emotional state. Using the generated emotion data, an AI generation model is used to come up with appropriate responses. The input is the emotion analysis result of Step 3, and the output is the conversation message as a prompt. This message is sent back to the terminal.
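
As a rough sketch of Step 6, the emotion analysis result could be turned into a prompt for a generative AI model as shown below. The prompt wording and the function names `dominant_emotion` and `build_dialogue_prompt` are hypothetical, and the call to the model itself is deliberately omitted because the specification does not fix a particular API.

```python
from typing import Dict


def dominant_emotion(scores: Dict[str, float]) -> str:
    """Return the emotion label with the highest score."""
    if not scores:
        raise ValueError("emotion scores are empty")
    return max(scores, key=scores.get)


def build_dialogue_prompt(scores: Dict[str, float], transcript: str) -> str:
    """Assemble a text prompt asking a generative model for a suitable reply."""
    label = dominant_emotion(scores)
    score_lines = "\n".join(
        f"- {name}: {value:.2f}" for name, value in sorted(scores.items())
    )
    # Illustrative wording only; the real prompt is not given in the source.
    return (
        "You are assisting communication within a family.\n"
        f"Estimated emotion scores:\n{score_lines}\n"
        f"Dominant emotion: {label}\n"
        f"Latest utterance: {transcript}\n"
        "Suggest one short, appropriate reply or proposal."
    )


if __name__ == "__main__":
    print(build_dialogue_prompt({"joy": 0.1, "anxiety": 0.7}, "I can't focus on this."))
```

The resulting string would be passed to whichever generative model the server uses, and the model's reply would be sent back to the terminal as the conversation message.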

[0308] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing using the user's emotion.

[0309] The system according to the present invention aims to acquire audio and video data and accurately analyze the user's emotions, and is equipped with advanced emotion recognition capabilities combined with an emotion engine. The system mainly consists of a terminal, a server, and an emotion engine.

[0310] The device is equipped with a microphone and camera to capture the user's voice and facial expressions in real time. The voice data is converted into text data, while characteristics such as intonation and speech speed are analyzed in parallel. All data obtained through this process is sent to a server for further detailed analysis.

[0311] On the server, data analysis is performed through an integrated emotion engine. The functions provided by the server include speech recognition, natural language processing, and facial expression recognition, which are interconnected to perform comprehensive emotion analysis. The emotion engine generates emotion data in real time and has the ability to capture moments of emotional change by visually representing the user's emotional state.

[0312] The emotion engine employs machine learning algorithms and continuously optimizes various analysis models through reinforcement learning. Furthermore, the server accumulates past user data, enabling personalized analysis tailored to each user. As a result, even for the same user, past emotional history is reflected in the analysis, leading to more accurate emotion estimation.
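
One simple way to reflect a user's past emotional history is to compare each new estimate with a running per-user baseline. The sketch below assumes an exponentially weighted moving average with a hypothetical smoothing factor `alpha`. The specification does not say how the accumulated data is used, so this stands in for the idea rather than describing the actual personalization.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UserEmotionHistory:
    """Keeps an exponentially weighted baseline of past scores for one user."""

    alpha: float = 0.2  # hypothetical smoothing factor (0 < alpha <= 1)
    baseline: Dict[str, float] = field(default_factory=dict)

    def personalize(self, scores: Dict[str, float]) -> Dict[str, float]:
        """Return each score's deviation from the user's baseline, then update it."""
        deviation: Dict[str, float] = {}
        for emotion, value in scores.items():
            # The first time an emotion is seen, the baseline equals the value,
            # so the deviation is zero.
            usual = self.baseline.get(emotion, value)
            deviation[emotion] = value - usual
            self.baseline[emotion] = (1 - self.alpha) * usual + self.alpha * value
        return deviation


if __name__ == "__main__":
    history = UserEmotionHistory(alpha=0.5)
    history.personalize({"anger": 0.2})
    print(history.personalize({"anger": 0.8}))
```

A positive deviation means the user shows more of that emotion than is usual for them, which may be more informative than the raw score for a user whose voice is habitually loud or flat.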

[0313] As a concrete example, when used in an educational setting, users (students) can check their own understanding of the learning material using a terminal equipped with an emotion engine. The data collected via the server is transmitted to the instructor in real time, enabling the instructor to adjust the pace of lessons and respond to the emotional changes of individual students. Therefore, this invention contributes to creating an environment in which individualized learning is promoted and student motivation is further enhanced, even in educational settings.

[0314] The following describes the processing flow.

[0315] Step 1:

[0316] The user powers on the device, and the built-in microphone and camera automatically turn on. This initiates the acquisition of audio and video data.

[0317] Step 2:

[0318] The speech recognition software uses the audio data acquired by the device to convert it into text data in real time. Simultaneously, data is collected to analyze features such as speech intonation and speed.

[0319] Step 3:

[0320] The device uses video data acquired through its camera to apply a facial recognition algorithm to extract facial features. This data includes subtle expressions and moments of change.

[0321] Step 4:

[0322] The device collects data from voice, text, and video and sends it to the server. This transmission is performed using a secure and fast protocol.

[0323] Step 5:

[0324] The server inputs the received data into the emotion engine. The emotion engine comprehensively analyzes the tone of voice, the content of the text, and the facial expressions in the video to estimate the emotional state.

[0325] Step 6:

[0326] The server utilizes machine learning algorithms to update the analysis model through reinforcement learning. This improves the model's accuracy in estimating emotions.

[0327] Step 7:

[0328] The server uses past user data to perform personalized emotion analysis. This data, based on the user's past emotional history, allows for a more precise estimation of their current emotional state.

[0329] Step 8:

[0330] The server visualizes the emotion data obtained through analysis by converting it into pie charts and vertical/horizontal graphs. The visualized data is then sent to the terminal.

[0331] Step 9:

[0332] The device receives visualization data and displays it in real time on the user interface. This allows users to intuitively understand their own emotional state.

[0333] (Example 2)

[0334] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0335] Existing systems for analyzing emotions lack accuracy in analyzing audio and video data, making it difficult to understand a user's emotional state accurately and in real time. Furthermore, personalized analysis tailored to individual users is not sufficiently implemented, hindering the improvement of emotion analysis accuracy. Additionally, insufficient feedback through visual representations of emotional changes presents challenges to practical application in educational settings and other fields.

[0336] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0337] In this invention, the server includes an acquisition means equipped with a terminal for acquiring audio data and video data; a conversion means for generating text data using the audio data obtained by the acquisition means; an analysis means for analyzing the intonation and speed of the audio data and estimating emotions from the text data and video data; a display means for visually displaying the emotion data; and a personalization means for accumulating past emotion data and optimizing emotion analysis according to the user. This enables accurate and real-time analysis of the user's emotions, and provides personalized, highly accurate emotion analysis and visual feedback.

[0338] "Acquisition means" refers to devices or methods that have the function of acquiring audio data and video data in real time.

[0339] A "conversion means" is a device or method that has the function of generating text data based on acquired audio data.

[0340] "Analysis means" refers to devices or methods that have the function of analyzing audio data, text data, and video data to estimate the user's emotions.

[0341] "Display means" refers to devices or methods that have the function of visually displaying analyzed emotional data.

[0342] "Personalization means" refers to devices or methods that accumulate past emotional data and have the function of optimizing emotional analysis according to the user.

[0343] "Speech recognition" is a technology that analyzes audio data to understand what is being said and converts it into text data.

[0344] "Natural language processing" is a technology that analyzes text data to understand emotions, intentions, and other related concepts.

[0345] "Facial expression recognition" is a technology that analyzes video data to read facial expressions and understand emotions based on those expressions.

[0346] A "machine learning algorithm" is a mathematical model and method for learning from large amounts of data and predicting future data.

[0347] "Reinforcement learning" is a type of machine learning that learns the optimal action through trial and error.
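
To make "trial and error" concrete, the following is a minimal epsilon-greedy bandit, one of the simplest reinforcement learning schemes. It is only an illustration of the concept: the actions could stand for hypothetical candidate configurations of the analysis model, and the reward for some feedback signal on estimation quality, neither of which is specified in the source.

```python
import random
from typing import List


class EpsilonGreedy:
    """Minimal multi-armed bandit: explore with probability epsilon, else exploit."""

    def __init__(self, n_actions: int, epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.counts: List[int] = [0] * n_actions
        self.values: List[float] = [0.0] * n_actions  # running mean reward per action
        self.rng = random.Random(seed)

    def select(self) -> int:
        """Pick a random action (trial) or the best-known action so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, action: int, reward: float) -> None:
        """Fold the observed reward into the running mean for that action."""
        self.counts[action] += 1
        # Incremental mean: new = old + (reward - old) / n
        self.values[action] += (reward - self.values[action]) / self.counts[action]


if __name__ == "__main__":
    # Hypothetical fixed rewards for three candidate configurations.
    rewards = [0.1, 0.9, 0.5]
    agent = EpsilonGreedy(n_actions=3, epsilon=0.2)
    for _ in range(500):
        a = agent.select()
        agent.update(a, rewards[a])
    print(agent.values)
```

Occasional random choices let the agent discover actions that turn out better than its current favorite, which is the "trial" part, while the running means record what each trial yielded.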

[0348] In this invention, the terminal is equipped with a microphone and a camera as devices for acquiring audio and video data. The audio data is used to capture features of the speaker's voice, including intonation and speed. The video data is used to capture the user's facial expressions and movements in real time. The audio data is converted into text data using speech recognition technology, and its features such as intonation and speed are further analyzed.

[0349] Data acquired by the device is sent to a server where all analysis takes place. The server comprehensively analyzes the data using an emotion engine that includes speech recognition, natural language processing, and facial expression recognition. The emotion engine employs machine learning algorithms and continuously optimizes the analysis model through reinforcement learning. This makes it possible to accurately analyze the user's emotional state in real time.

[0350] Furthermore, the server incorporates previously accumulated emotional data for each user into its analysis, enabling personalized emotional analysis for each user. This allows for highly accurate emotional estimation based on past emotional history, even for the same user. The analyzed emotional data is visually displayed, allowing for a concrete representation of the user's emotional changes.

[0351] One concrete example of its use is in educational settings. Users (students) would use this system to check their understanding of their learning during class. The analyzed emotional data would also be provided to instructors in real time, enabling them to tailor their lessons to the individual student's emotional changes.

[0352] As an example of a prompt when using a generative AI model, the instruction "Analyze and visualize the current emotional state" can be used to specify how the system should analyze and visualize the user's emotions.

[0353] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0354] Step 1:

[0355] The device uses a microphone and camera to acquire user audio and video data. It receives audio and video from the environment as input and converts them into digital data as output. Specifically, it captures the user's voice and continuously records their facial expressions.

[0356] Step 2:

[0357] This process uses speech recognition technology to generate text data from audio data acquired by the device. Simultaneously, it analyzes the intonation and speed of the speech. The input for this step is the acquired audio data, and the output is text data and speech features. Specifically, it analyzes the speech, converts the spoken content into a string, and generates numerical data for intonation and speed.

[0358] Step 3:

[0359] The terminal packages the analyzed audio, text, and video data and sends it to the server. The input for this step is audio features, text data, and video data, and the output is the transmission of data to the server. Specifically, the terminal ensures real-time performance by accumulating a certain amount of data and then rapidly transferring it to the server.

[0360] Step 4:

[0361] The server comprehensively analyzes the received data based on audio, text, and video. It uses an emotion engine to analyze the data and estimate emotions. The input for this step is the dataset transferred to the server, and the output is the emotion data estimated through the analysis. Specific operations include keyword extraction using speech recognition, text analysis using natural language processing, and facial analysis using facial expression recognition.

[0362] Step 5:

[0363] The server visually represents the user's emotional state based on the analysis results. The input for this step is the emotional data obtained through analysis, and the output is the visualized emotional state. Specific actions include converting the emotional data into graphs and animations, making it easy for users and educational institutions to review the emotional state.

[0364] Step 6:

[0365] The server personalizes emotion analysis using accumulated historical user data. The input for this step is historical emotion data, and the output is a current emotion estimate that reflects past analysis. Specifically, the process involves incorporating historical data into the analysis to learn each user's patterns of emotional change and improve the accuracy of predictions.
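
The "numerical data for intonation and speed" produced in Step 2 might look like the following. This sketch assumes that a transcript, the utterance duration, and a per-frame pitch contour are already available from upstream processing. The feature names, and the convention that a pitch of 0 marks an unvoiced frame, are assumptions made for illustration.

```python
import statistics
from typing import Dict, Sequence


def speech_features(
    transcript: str, duration_s: float, pitch_hz: Sequence[float]
) -> Dict[str, float]:
    """Derive simple numeric speed and intonation features for one utterance."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    words = transcript.split()
    # Assumption: frames with pitch 0 are unvoiced and are ignored.
    voiced = [p for p in pitch_hz if p > 0]
    return {
        # Speed: words spoken per second.
        "words_per_second": len(words) / duration_s,
        # Intonation: average pitch and how much it varies.
        "pitch_mean_hz": statistics.fmean(voiced) if voiced else 0.0,
        "pitch_std_hz": statistics.pstdev(voiced) if len(voiced) > 1 else 0.0,
    }


if __name__ == "__main__":
    print(speech_features("I am fine thank you", 2.0, [100.0, 0.0, 120.0, 140.0]))
```

A flat pitch contour (small standard deviation) combined with slow speech could, for instance, be one cue for fatigue, although which cues the emotion engine actually relies on is not stated.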

[0366] (Application Example 2)

[0367] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0368] Conventional consumer robots have difficulty accurately understanding user emotions and taking appropriate measures, and as a result neither reduce stress sufficiently nor provide a comfortable living environment. Therefore, there is a need for technology that can grasp changes in user emotions in real time and respond appropriately based on that information.

[0369] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0370] In this invention, the server includes a device means for acquiring audio information and video information, a conversion device means for generating text information using the audio information obtained by the device means, and an estimation means for analyzing the audio information, text information, and video information and estimating emotions. This makes it possible to analyze the user's emotions in real time and select and play the most suitable music based on those emotions, thereby reducing user stress and providing a comfortable living environment.

[0371] "Audio information" refers to data obtained from speech, including the user's vocal characteristics and intonation.

[0372] "Video information" refers to visual data, including information that records the user's facial expressions and movements.

[0373] "Device means" refers to means that have hardware or software for acquiring audio information or video information.

[0374] A "conversion device means" is a device that performs a process to convert audio information into text information.

[0375] "Text information" refers to information that represents audio information as character data.

[0376] "Estimation means" refers to means for analyzing the acquired audio, text, and video information and estimating emotions from it.

[0377] "Emotional information" refers to data that represents the user's emotional state and is obtained by the estimation means.

[0378] "Acoustic device means" refers to a device for selecting and playing appropriate music based on emotional information.

[0379] The system implementing this invention mainly consists of terminal devices for acquiring audio and video information, and a server for analyzing and processing that data. This system employs highly integrated technology to grasp the user's emotions in real time.

[0380] The terminal is equipped with a microphone and camera to acquire audio and video information from the user. Audio information is acoustic data obtained from the user's speech, and video information is visual data representing the user's facial expressions and movements. This information is transmitted to the server in real time.

[0381] The server converts the received audio information into text information using the conversion device means, and uses this text to perform emotion analysis. In particular, as the estimation means, an emotion engine that combines speech recognition technology and facial expression analysis algorithms comprehensively analyzes the audio information, text information, and video information to estimate the user's emotion information. This emotion information includes the user's stress level and emotional state, and is represented visually.

[0382] Furthermore, based on emotional information, the server controls the sound system and selects and plays the most suitable music. For this purpose, it also includes an algorithm to smoothly find songs from the music library that match the user's emotions.

[0383] For example, after a busy day at work, music might be played to help a user relax upon returning home. The system detects the user's voice tone and facial expressions that indicate fatigue, and plays relaxing music accordingly to reduce the user's stress.

[0384] An example of a prompt message would be, "Based on the user's voice and video, detect their current emotional state. If the user is stressed, suggest actions to help them relax." This allows the system to provide user-centered interaction and care that was not possible with conventional technologies.

[0385] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0386] Step 1:

[0387] The device activates its microphone and camera to acquire audio and video information from the user in real time. The input is the user's raw voice and facial expression data, and the output is the acquired digital audio and visual data. This acquisition process provides foundational data for analyzing the user's emotional state.

[0388] Step 2:

[0389] The terminal sends the acquired audio information to the server. The server applies a speech recognition algorithm to convert the audio information into text information. The input is audio data, and the output is the corresponding text information. In this process, the intonation and speed of the speech are also analyzed, and characteristic information for emotion estimation is extracted along with the text data.

[0390] Step 3:

[0391] The terminal transmits video information to the server. The server analyzes the video information and applies a facial expression analysis algorithm to detect facial expressions. The input is visual data, and the output is digital information representing facial features. This analysis yields an index of emotion based on changes in facial expressions.

[0392] Step 4:

[0393] The server integrates voice characteristic information, text information, and facial feature information, and the emotion engine estimates the user's emotions. The input is all the information processed in the previous step, and the output is the estimated emotion information. The estimated emotions include emotional states such as stress levels and feelings of happiness. In this process, the emotion engine analyzes the data using machine learning algorithms.

[0394] Step 5:

[0395] The server uses emotional information to select the most suitable music and controls the audio device via the terminal to play the music. The input is estimated emotional information, and the output is the selected music data. This process uses a generative AI model to dynamically select the song that best suits the user's current emotions. Furthermore, based on prompts, the entire system completes an interaction that aligns with the user's emotions.
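
The specification has a generative AI model select the song in Step 5. As a simpler stand-in, the sketch below picks the library track whose mood tags are closest to a target mood derived from the estimated emotion. The `Track` fields, the mapping in `target_mood`, and all numeric constants are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Track:
    title: str
    calmness: float    # 0.0 (energetic) .. 1.0 (very calm); hypothetical tag
    positivity: float  # 0.0 (somber) .. 1.0 (bright); hypothetical tag


def target_mood(emotion: Dict[str, float]) -> Dict[str, float]:
    """Map estimated emotion to a desired music mood (illustrative rule only)."""
    stress = emotion.get("stress", 0.0)
    happiness = emotion.get("happiness", 0.0)
    return {
        "calmness": min(1.0, 0.3 + 0.7 * stress),       # more stress: calmer music
        "positivity": min(1.0, 0.5 + 0.5 * happiness),  # happier: brighter music
    }


def select_track(emotion: Dict[str, float], library: List[Track]) -> Track:
    """Return the track with the smallest squared distance to the target mood."""
    if not library:
        raise ValueError("music library is empty")
    mood = target_mood(emotion)

    def distance(track: Track) -> float:
        return (track.calmness - mood["calmness"]) ** 2 + (
            track.positivity - mood["positivity"]
        ) ** 2

    return min(library, key=distance)


if __name__ == "__main__":
    library = [
        Track("Upbeat Pop", 0.1, 0.9),
        Track("Soft Piano", 0.9, 0.6),
        Track("Dark Ambient", 0.8, 0.1),
    ]
    print(select_track({"stress": 0.9, "happiness": 0.1}, library).title)
```

A nearest-match rule like this is deterministic and easy to audit. A generative model, as described in the text, could instead take the emotion information and library metadata in a prompt and return a selection.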

[0396] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0397] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0398] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0399] [Third Embodiment]

[0400] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0401] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0402] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0403] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0404] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0405] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0406] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0407] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0408] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0409] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0410] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0411] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0412] The system according to the present invention performs real-time emotion analysis and visualization to facilitate smoother user communication. This system is mainly composed of a terminal, a server, and the user, and performs a series of processes to acquire, analyze, and visualize audio and video data.

[0413] First, the device begins to acquire voice and video input from the user in real time. Using the microphone and camera, it accurately captures the user's speech and facial expressions. The voice data is converted into text data by speech recognition software, enabling analysis based on the user's linguistic expressions.

[0414] The acquired data is transferred to a server, where analysis processing begins. Here, voice tone and speed, text content, and facial expressions are analyzed in detail by an AI analysis algorithm to identify emotions such as joy, anxiety, and anger. In this analysis, multiple data modules each detect different emotional states using their own methods.

[0415] The analysis results are integrated on the server side, generating comprehensive emotional data. This emotional data is converted into graphs and charts in a visually easy-to-understand format and sent to the terminal. Here, the user can immediately check the analysis results as visual information, and grasp subtle emotional changes that might be difficult to notice through audio alone.

[0416] As a concrete example, in a customer service scenario, if a user (customer) expresses dissatisfaction during a phone call with a support representative, the device captures their voice and facial expressions expressing their anxiety and anger. The server then analyzes these emotions and sends the results back to the device in a graph, allowing the representative to understand the customer's emotions in real time and adjust their response accordingly. This is expected to improve the quality of communication.

[0417] The following describes the processing flow.

[0418] Step 1:

[0419] The user activates the device and begins communication. The device's microphone and camera become active, capturing the user's voice and video.

[0420] Step 2:

[0421] The audio data acquired by the device is converted into text data using real-time speech recognition technology. During this process, acoustic characteristics such as intonation and speed are also extracted for analysis.

[0422] Step 3:

[0423] The device uses an expression recognition algorithm to analyze the user's facial expressions from the video data it acquires. This analysis extracts facial features such as smiles and frown lines as digital data.

[0424] Step 4:

[0425] The terminal packages audio data, text data, and video data and sends them to the server for analysis.

[0426] Step 5:

[0427] The server uses a voice analysis module to analyze the tone and speed of the voice in the received audio data and extract elements that allow for the inference of emotion.

[0428] Step 6:

[0429] The server uses a text analysis module to identify sentiment expressions within text data using natural language processing. In this process, it generates sentiment scores using a word sentiment weighting dictionary.

[0430] Step 7:

[0431] The server uses a facial expression analysis module based on the video data to map each facial feature to an emotion and estimate the emotion.

[0432] Step 8:

[0433] The server integrates these analysis results to generate data that represents the overall emotional state.

[0434] Step 9:

[0435] The server converts the overall sentiment data into visualized data and formats it into graphs and charts.

[0436] Step 10:

[0437] The server sends the visualization data to the terminal.

[0438] Step 11:

[0439] The device receives visualization data and displays the emotional state on the screen in real time, allowing the user to visually understand their current emotional state.
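The text analysis in Step 6 can be illustrated with a minimal sketch. The dictionary entries, their weights, and the function name `text_sentiment_score` are hypothetical; the source states only that a word sentiment weighting dictionary is used to generate sentiment scores, and averaging the matched weights is one simple assumption about how that could be done.

```python
import re

# Hypothetical word-sentiment weighting dictionary; the source does not list its entries.
SENTIMENT_WEIGHTS = {
    "thank": 1.0,
    "great": 0.8,
    "worried": -0.6,
    "terrible": -0.9,
    "angry": -1.0,
}


def text_sentiment_score(text: str) -> float:
    """Average the weights of the dictionary words found in the text.

    Returns 0.0 (neutral) when no dictionary word appears.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hits = [SENTIMENT_WEIGHTS[w] for w in words if w in SENTIMENT_WEIGHTS]
    return sum(hits) / len(hits) if hits else 0.0
```

A negative score would suggest anxiety or anger in the recognized text, a positive score would suggest joy, and the server would combine this value with the voice and facial expression results in Step 8.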

[0440] (Example 1)

[0441] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0442] In conventional communication systems, it has been difficult to grasp changes in emotions in real time, which has been a factor in degrading the quality of communication. In this situation, there is a need to improve communication by instantly analyzing and visualizing the user's emotional state.

[0443] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0444] In this invention, the server includes acquisition means for acquiring acoustic data and visual data, conversion means for converting acoustic data into text data, analysis means for estimating emotional states, and display means for visualizing integrated emotional data. This makes it possible to analyze the user's emotional state in real time and present it visually.

[0445] "Acquisition means" refers to devices and technologies for collecting acoustic and visual data, and are responsible for capturing the user's speech and facial expressions.

[0446] "Conversion means" refers to a technology or device that converts acoustic data obtained by acquisition means into text data, thereby changing linguistic information into text format through a speech recognition process.

[0447] "Analysis means" refers to a technology or device that analyzes acoustic data, text data, and visual data to estimate the user's emotional state, and uses AI to identify emotions.

[0448] "Display means" refers to a technology or device for visualizing and presenting emotional data obtained by analysis means, in order to enable users to intuitively understand emotional changes.

[0449] This invention is an emotion analysis system designed to facilitate smoother user communication. The system is comprised of a terminal, a server, and the user, and features the ability to acquire, analyze, and visualize audio and video data in real time.

[0450] The device is responsible for acquiring audio and visual data from the user. Specifically, it uses a microphone to capture audio information and a camera to record the user's facial expressions. The audio data is immediately converted into text data by speech recognition software. This instantly converts the user's speech into digital text format, preparing it for analysis.

[0451] The collected audio and visual data are transferred to a server via an internet connection. On the server, AI analysis algorithms are executed based on this data. In this analysis process, the tone and speed of speech, text content, and facial expressions based on the video data are analyzed in detail. In this way, emotional states such as joy, anxiety, and anger are identified. Machine learning models and natural language processing techniques are used in this analysis.

[0452] The analyzed emotional data is integrated by a data module and converted into a visually easy-to-understand format. The device then presents these visualized results to the user through a user interface. This allows the user to understand their own or others' emotional changes in real time, improving the quality of communication.

[0453] A concrete example of its use is in customer service. If a user (customer) is dissatisfied during a call with a support representative, the system captures their voice and facial expressions, and the server analyzes their dissatisfaction or anger. As a result, the representative has the opportunity to understand the customer's emotions in real time and quickly adjust their response.

[0454] An example of a prompt message might be: "Explain how to use this emotion analysis system to identify emotions such as joy, anxiety, and anger in real time from the user's voice and facial expression data, and then visualize and present them."

[0455] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0456] Step 1:

[0457] The user starts the system, and the terminal uses a microphone and camera to acquire acoustic and visual data in real time. The input is the user's voice and facial expressions; the voice is output as digital audio data, and the facial expressions are output as video data. Specifically, the microphone captures the user's speech, and the camera photographs the user's face.

[0458] Step 2:

[0459] The device processes the acquired audio data through speech recognition software, converting it into text data. The input is digital audio data, and the output is the corresponding text string. In this conversion process, the audio waveform is analyzed, and the language recognition engine generates the corresponding string. Specifically, peaks and intervals in the audio waveform are identified and mapped to words.

[0460] Step 3:

[0461] The terminal sends audio, text, and visual data to the server. The input consists of converted text and video data, as well as the original audio data, all of which are transferred to the server. Specifically, the terminal's network module divides the data into packets and transmits the data using a secure communication protocol.

[0462] Step 4:

[0463] The server analyzes emotional states using an AI analysis algorithm based on the received data. The input consists of audio, text, and video data received by the server, and the output is the result of the emotional state determination. Specifically, the tone and speed of the audio data are analyzed, the content of the text is examined using natural language processing techniques, and facial expression analysis is performed on the video data.

[0464] Step 5:

[0465] The server integrates the analysis results to generate final emotion data. In this process, the input consists of partial results from each analysis, and the output is integrated emotion state data. Specifically, the individual analysis results are weighted to obtain a unified emotion evaluation score.

[0466] Step 6:

[0467] The server visualizes the data in a visually easy-to-understand format and sends it to the terminal. The input is integrated sentiment data, and the output is data in the form of graphs and charts. Specifically, the visualization tool analyzes the data and generates pie charts and line graphs.

[0468] Step 7:

[0469] The device presents visualized emotional data to the user. Input consists of graphs and charts received from the server, and output is a screen display that the user can view. Specifically, the user interface is updated so that it shows the emotional state intuitively.
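The weighting described in Step 5 can be sketched as follows. The modality weights and the names `MODALITY_WEIGHTS` and `integrate_emotions` are assumptions made for illustration; the source says only that the individual analysis results are weighted to obtain a unified emotion evaluation score.

```python
from typing import Dict

# Hypothetical modality weights; the source does not specify their values.
MODALITY_WEIGHTS = {"voice": 0.3, "text": 0.3, "face": 0.4}


def integrate_emotions(per_modality: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine per-modality emotion scores into one normalized score per emotion."""
    combined: Dict[str, float] = {}
    for modality, scores in per_modality.items():
        weight = MODALITY_WEIGHTS.get(modality, 0.0)  # unknown modalities are ignored
        for emotion, score in scores.items():
            combined[emotion] = combined.get(emotion, 0.0) + weight * score
    total = sum(combined.values())
    # Normalize so the scores sum to 1 whenever any evidence is present.
    if total > 0:
        return {emotion: score / total for emotion, score in combined.items()}
    return combined
```

Normalizing the result makes the integrated scores directly usable as proportions in the pie charts mentioned in Step 6.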

[0470] (Application Example 1)

[0471] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0472] In modern communication, misunderstandings and friction often arise due to insufficient emotional transmission. Particularly within families, the inability to understand each other's emotional nuances can significantly reduce the quality of daily life. As a solution to this problem, there is a need for a system that analyzes and displays emotions in real time.

[0473] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0474] In this invention, the server includes an acquisition means for acquiring audio and video information, a conversion means for generating text information using the audio information obtained by the acquisition means, an analysis means for analyzing the audio information, text information, and video information and estimating emotions, a display means for visually displaying the emotion information obtained by the analysis means, and a dialogue means for providing appropriate dialogue and suggestions based on the user's emotional state. This enables real-time understanding of emotional states even within the home, facilitating smoother communication.

[0475] "Acquisition means" refers to the device or method for initially collecting audio and video information.

[0476] "Conversion means" refers to devices or methods used to convert acquired audio information into text information.

[0477] "Analysis means" refers to devices or methods for analyzing audio information, text information, and video information, and estimating emotions based on that analysis.

[0478] "Display means" refers to devices or methods for visually representing and presenting analyzed emotional information to the user.

[0479] "Dialogue means" refers to devices or methods for providing users with appropriate dialogue and suggestions based on their estimated emotional state.

[0480] To realize this invention, the user's terminal collects audio and video input. The terminal is equipped with a microphone and a camera, which are used to continuously acquire the user's voice and facial expressions. The acquired audio data is converted into text data using speech recognition software. In this case, speech recognition technology such as Google Speech-to-Text API is suitable.

[0481] The server receives audio, text, and video data transmitted from the terminal and executes an emotion analysis algorithm. Generative AI models such as the OpenAI GPT series are effective for this analysis. The analysis algorithm comprehensively analyzes factors such as speech intonation and speed, text content, and facial expressions in the video to generate emotion data.

[0482] The generated emotional data is displayed in a visually easy-to-understand format. The server sends this visual data to the device and displays it on the device's screen as graphs and icons. Users can use this information to make decisions that facilitate communication within the family.

[0483] For example, if a robot observes a child doing homework and determines that their concentration is waning, it can suggest, "Why don't you take a break? Let's play together!" Another example of a prompt to facilitate interaction that responds to the user's emotions is, "Please tell me how to analyze my family's emotions in real time and make appropriate conversations and suggestions based on that." This prompt is an element that enables the automation of emotion-based communication using generative AI models.

[0484] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0485] Step 1:

[0486] The device acquires audio and video data. Specifically, it captures the user's speech and facial expressions in real time using the camera and microphone. The input is the user's audio and video information, and the output is their digital data. The device then prepares this data to send to the server.

[0487] Step 2:

[0488] The terminal converts the audio data into text data. Speech recognition software is used to convert the audio information into text. The input is the audio data acquired in step 1, and the output is the text data of that audio. The converted text data is transferred to a server for further analysis.

[0489] Step 3:

[0490] The server analyzes the audio, text, and video data it receives. The analysis uses a generative AI model to analyze information in detail, including speech intonation and speed, text content, and facial expressions. The input is audio, text, and video data sent from the terminal, and the output is the analysis result representing the emotional state.

[0491] Step 4:

[0492] The server converts the emotional data into a data format for visual display. The emotional analysis results are processed into graphs and icons, making them visually displayable. The input is the emotional analysis results obtained in step 3, and the output is the visually displayed data. This data is sent to the terminal.

[0493] Step 5:

[0494] The terminal receives display data and displays emotional information to the user. Emotional changes are visually represented on the display, allowing the user to intuitively understand the situation. The input is visual display data received from the server, and the output is the information displayed on the screen.

[0495] Step 6:

[0496] The server generates appropriate dialogue and suggestions based on the user's emotional state. A generative AI model is used to determine appropriate responses based on the generated emotional data. The input is the emotional analysis result from step 3, and the output is a dialogue message as a prompt. This message is sent back to the terminal.
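Step 6 can be pictured with a small sketch that turns the emotion analysis result into a prompt for a generative AI model. The function name `build_dialogue_prompt`, the prompt wording, and the score format are hypothetical, and the call to the model itself is deliberately left out because the source does not specify an API.

```python
def build_dialogue_prompt(emotions: dict, situation: str) -> str:
    """Build a text prompt asking a generative model for a suggestion.

    `emotions` maps emotion names to scores; the highest-scoring one is
    treated as the dominant emotion (an assumption for this sketch).
    """
    if not emotions:
        raise ValueError("emotions must not be empty")
    dominant = max(emotions, key=emotions.get)
    return (
        f"The user's estimated dominant emotion is '{dominant}' "
        f"(score {emotions[dominant]:.2f}). Situation: {situation}. "
        "Suggest one short, supportive thing to say or do."
    )
```

For the homework example above, the server might pass a situation such as "child doing homework" together with the emotion scores and send the model's reply back to the terminal as the dialogue message.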

[0497] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0498] The system according to the present invention aims to acquire audio and video data and accurately analyze the user's emotions, and is equipped with advanced emotion recognition capabilities combined with an emotion engine. The system mainly consists of a terminal, a server, and an emotion engine.

[0499] The device is equipped with a microphone and camera to capture the user's voice and facial expressions in real time. The voice data is converted into text data, while characteristics such as intonation and speech speed are analyzed in parallel. All data obtained through this process is sent to a server for further detailed analysis.

[0500] On the server, data analysis is performed through an integrated emotion engine. The functions provided by the server include speech recognition, natural language processing, and facial expression recognition, which are interconnected to perform comprehensive emotion analysis. The emotion engine generates emotion data in real time and has the ability to capture moments of emotional change by visually representing the user's emotional state.

[0501] The emotion engine employs machine learning algorithms and continuously optimizes various analysis models through reinforcement learning. Furthermore, the server accumulates past user data, enabling personalized analysis tailored to each user. As a result, even for the same user, past emotional history is reflected in the analysis, leading to more accurate emotion estimation.

[0502] As a concrete example, when used in an educational setting, users (students) can check their own understanding of the learning material using a terminal equipped with an emotion engine. The data collected via the server is transmitted to the instructor in real time, enabling the instructor to pace lessons and take countermeasures based on the emotional changes of individual students. Therefore, this invention contributes to creating an environment in which individualized learning is promoted and student motivation is further enhanced, even in educational settings.

[0503] The following describes the processing flow.

[0504] Step 1:

[0505] The user powers on the device, and the built-in microphone and camera automatically turn on. This initiates the acquisition of audio and video data.

[0506] Step 2:

[0507] The speech recognition software uses the audio data acquired by the device to convert it into text data in real time. Simultaneously, data is collected to analyze features such as speech intonation and speed.

[0508] Step 3:

[0509] The device applies a facial expression recognition algorithm to the video data acquired through its camera to extract facial features. This data includes subtle expressions and moments of change.

[0510] Step 4:

[0511] The device collects data from voice, text, and video and sends it to the server. This transmission is performed using a secure and fast protocol.

[0512] Step 5:

[0513] The server inputs the received data into the emotion engine. The emotion engine comprehensively analyzes the tone of voice, the content of the text, and the facial expressions in the video to estimate the emotional state.

[0514] Step 6:

[0515] The server utilizes machine learning algorithms to update the analysis model through reinforcement learning. This improves the model's accuracy in estimating emotions.

[0516] Step 7:

[0517] The server uses past user data to perform personalized sentiment analysis. This data, based on the user's past emotional history, allows for a more precise estimation of their current emotional state.

[0518] Step 8:

[0519] The server visualizes the sentiment data obtained through analysis and converts it into pie charts and vertical / horizontal graphs. This visualized data is then sent to the terminal.

[0520] Step 9:

[0521] The device receives visualization data and displays it in real time on the user interface. This allows users to intuitively understand their own emotional state.
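The conversion in Step 8 can be sketched as a function that turns emotion scores into a pie chart payload for the terminal. The JSON layout and the name `to_chart_payload` are assumptions; the source specifies only that the sentiment data is converted into charts and sent to the terminal.

```python
import json


def to_chart_payload(emotions: dict) -> str:
    """Serialize emotion scores as pie chart slices, largest slice first.

    The payload layout ("chart", "slices", "label", "percent") is hypothetical.
    """
    total = sum(emotions.values())
    ordered = sorted(emotions.items(), key=lambda kv: kv[1], reverse=True)
    slices = [
        {"label": name, "percent": round(100 * value / total, 1) if total else 0.0}
        for name, value in ordered
    ]
    return json.dumps({"chart": "pie", "slices": slices})
```

Sending percentages rather than rendered images is one possible design; it keeps the payload small and lets the terminal draw the chart in its own user interface in Step 9.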

[0522] (Example 2)

[0523] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0524] Existing emotion analysis systems analyze audio and video data with insufficient accuracy, making it difficult to understand a user's emotional state accurately and in real time. Furthermore, personalized analysis tailored to individual users is not sufficiently implemented, which hinders improvement in emotion analysis accuracy. In addition, feedback through visual representation of emotional changes is insufficient, which poses challenges for practical use in educational settings and other fields.

[0525] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0526] In this invention, the server includes an acquisition means equipped with a terminal for acquiring audio data and video data; a conversion means for generating text data using the audio data obtained by the acquisition means; an analysis means for analyzing the intonation and speed of the audio data and estimating emotions from the text data and video data; a display means for visually displaying the emotion data; and a personalization means for accumulating past emotion data and optimizing emotion analysis according to the user. This enables accurate and real-time analysis of the user's emotions, and provides personalized, highly accurate emotion analysis and visual feedback.

[0527] "Acquisition means" refers to devices or methods that have the function of acquiring audio data and video data in real time.

[0528] A "conversion means" is a device or method that has the function of generating text data based on acquired audio data.

[0529] "Analysis means" refers to devices or methods that have the function of analyzing audio data, text data, and video data to estimate the user's emotions.

[0530] "Display means" refers to devices or methods that have the function of visually displaying analyzed emotional data.

[0531] "Personalization means" refers to devices or methods that accumulate past emotional data and have the function of optimizing emotional analysis according to the user.

[0532] "Speech recognition" is a technology that analyzes audio data to understand what is being said and converts it into text data.

[0533] "Natural language processing" is a technology that analyzes text data to understand emotions, intentions, and other related concepts.

[0534] "Facial expression recognition" is a technology that analyzes video data to read facial expressions and understand emotions based on those expressions.

[0535] A "machine learning algorithm" is a mathematical model and method for learning from large amounts of data and predicting future data.

[0536] "Reinforcement learning" is a type of machine learning that learns the optimal action through trial and error.

[0537] In this invention, the terminal is equipped with a microphone and a camera as devices for acquiring audio and video data. The audio data is used to capture features of the speaker's voice, including intonation and speed. The video data is used to capture the user's facial expressions and movements in real time. The audio data is converted into text data using speech recognition technology, and features such as intonation and speed are further analyzed.
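As an illustration of turning speech speed into a numerical feature, the following minimal sketch estimates words per minute from the recognized text and the utterance duration. The function name `speech_rate_wpm` is hypothetical, and counting whitespace-separated words is an assumption that suits languages written with spaces; the source does not state how speed is measured.

```python
def speech_rate_wpm(transcript: str, duration_seconds: float) -> float:
    """Estimate speaking speed in words per minute for one utterance."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    word_count = len(transcript.split())  # whitespace tokenization (assumption)
    return word_count * 60.0 / duration_seconds
```

A rate well above or below the speaker's usual pace could then be passed to the server as one of the acoustic features used for emotion estimation.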

[0538] Data acquired by the device is sent to a server where all analysis takes place. The server comprehensively analyzes the data using an emotion engine that includes speech recognition, natural language processing, and facial expression recognition. The emotion engine employs machine learning algorithms and continuously optimizes the analysis model through reinforcement learning. This makes it possible to accurately analyze the user's emotional state in real time.

[0539] Furthermore, the server incorporates previously accumulated emotional data for each user into its analysis, enabling personalized emotional analysis for each user. This allows for highly accurate emotional estimation based on past emotional history, even for the same user. The analyzed emotional data is visually displayed, allowing for a concrete representation of the user's emotional changes.

[0540] One concrete example of its use is in educational settings. Users (students) would use this system to check their understanding of their learning during class. The analyzed emotional data would also be provided to instructors in real time, enabling them to tailor their lessons to the individual student's emotional changes.

[0541] As an example of a prompt when using a generative AI model, you can use the instruction "Analyze and visualize the current emotional state" to specify how the system should analyze and visualize the user's emotions.

[0542] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0543] Step 1:

[0544] The device uses a microphone and camera to acquire user audio and video data. It receives audio and video from the environment as input and converts them into digital data as output. Specifically, it captures the user's voice and continuously records their facial expressions.

[0545] Step 2:

[0546] This process uses speech recognition technology to generate text data from audio data acquired by the device. Simultaneously, it analyzes the intonation and speed of the speech. The input for this step is the acquired audio data, and the output is text data and speech features. Specifically, it analyzes the speech, converts the spoken content into a string, and generates numerical data for intonation and speed.

[0547] Step 3:

[0548] The terminal packages the analyzed audio, text, and video data and sends it to the server. The input for this step is audio features, text data, and video data, and the output is the transmission of data to the server. Specifically, the terminal ensures real-time performance by accumulating a certain amount of data and then rapidly transferring it to the server.

[0549] Step 4:

[0550] The server comprehensively analyzes the received data based on audio, text, and video. It uses an emotion engine to analyze the data and estimate emotions. The input for this step is the dataset transferred to the server, and the output is the emotion data estimated through the analysis. Specific operations include keyword extraction using speech recognition, text analysis using natural language processing, and facial analysis using facial expression recognition.

[0551] Step 5:

[0552] The server visually represents the user's emotional state based on the analysis results. The input for this step is the emotional data obtained through analysis, and the output is the visualized emotional state. Specific actions include converting the emotional data into graphs and animations, making it easy for users and educational institutions to review the emotional state.

[0553] Step 6:

[0554] The server personalizes sentiment analysis using accumulated historical user data. The input for this step is historical sentiment data, and the output is a current sentiment estimate that reflects past analysis. Specifically, the process involves incorporating historical data into the analysis to learn each user's sentiment change patterns and improve the accuracy of predictions.
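The personalization in Step 6 can be sketched by keeping a per-user baseline and reporting how far a new score deviates from it. The exponential moving average, the `alpha` parameter, and the class name `UserBaseline` are assumptions chosen for brevity; the source says only that accumulated historical data is incorporated into the analysis.

```python
class UserBaseline:
    """Track a per-user running baseline of an emotion score."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # weight given to the newest score (hypothetical value)
        self.baseline = {}  # user_id -> running average of past scores

    def update(self, user_id: str, score: float) -> float:
        """Return the score's deviation from the user's baseline, then update it."""
        previous = self.baseline.get(user_id, score)  # first score sets the baseline
        deviation = score - previous
        self.baseline[user_id] = (1 - self.alpha) * previous + self.alpha * score
        return deviation
```

Reporting the deviation rather than the raw score means that a user who habitually speaks in a tense tone is not flagged as stressed on every utterance, which is one way to reflect past emotional history in the current estimate.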

[0555] (Application Example 2)

[0556] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0557] Conventional consumer robots have difficulty accurately understanding user emotions and taking appropriate measures, so they fall short in reducing stress and in providing a comfortable living environment. Therefore, there is a need for technology that can grasp changes in user emotions in real time and respond appropriately based on that information.

[0558] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0559] In this invention, the server includes a device means for acquiring audio information and video information, a conversion device means for generating text information using the audio information obtained by the device means, and an estimation means for analyzing the audio information, text information, and video information and estimating emotions. This makes it possible to analyze the user's emotions in real time and select and play the most suitable music based on those emotions, thereby reducing user stress and providing a comfortable living environment.

[0560] "Audio information" refers to data obtained from speech, including the user's vocal characteristics and intonation.

[0561] "Video information" refers to visual data, including information that records the user's facial expressions and movements.

[0562] "Device means" refers to means that have hardware or software for acquiring audio information or video information.

[0563] A "conversion device means" is a device that performs a process to convert audio information into text information.

[0564] "Text information" refers to information that represents audio information as character data.

[0565] "Estimation means" refers to means for analyzing acquired audio, text, and video information and estimating emotions from it.

[0566] "Emotional information" refers to data that represents the user's emotional state and is obtained through the estimation means.

[0567] "Acoustic device means" refers to a device for selecting and playing appropriate music based on emotional information.

[0568] The system implementing this invention mainly consists of terminal devices for acquiring audio and video information, and a server for analyzing and processing that data. This system employs highly integrated technology to grasp the user's emotions in real time.

[0569] The terminal is equipped with a microphone and camera to acquire audio and video information from the user. Audio information is acoustic data obtained from the user's speech, and video information is visual data representing the user's facial expressions and movements. This information is transmitted to the server in real time.

[0570] The server converts the received audio information into text information using the conversion device means, and uses this text to perform emotion analysis. In particular, an emotion engine that serves as the estimation means and combines speech recognition technology and facial expression analysis algorithms comprehensively analyzes audio information, text information, and video information to estimate the user's emotion information. This emotion information includes the user's stress level and emotional state, and is represented visually.

[0571] Furthermore, based on emotional information, the server controls the sound system and selects and plays the most suitable music. For this purpose, it also includes an algorithm to smoothly find songs from the music library that match the user's emotions.

[0572] For example, after a busy day at work, music might be played to help a user relax upon returning home. The system detects the user's voice tone and facial expressions that indicate fatigue, and plays relaxing music accordingly to reduce the user's stress.

[0573] An example of a prompt message would be, "Based on the user's voice and video, detect their current emotional state. If the user is stressed, suggest actions to help them relax." This allows the system to provide user-centered interaction and care that was not possible with conventional technologies.

[0574] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0575] Step 1:

[0576] The device activates its microphone and camera to acquire audio and video information from the user in real time. The input is the user's raw voice and facial expression data, and the output is the acquired digital audio and visual data. This acquisition process provides foundational data for analyzing the user's emotional state.

[0577] Step 2:

[0578] The terminal sends the acquired audio information to the server. The server applies a speech recognition algorithm to convert the audio information into text information. The input is audio data, and the output is the corresponding text information. In this process, the intonation and speed of the speech are also analyzed, and characteristic information for emotion estimation is extracted along with the text data.

[0579] Step 3:

[0580] The terminal transmits video information to the server. The server analyzes the video information and applies a facial expression analysis algorithm to detect facial expressions. The input is visual data, and the output is digital information representing facial features. This analysis yields an index of emotion based on changes in facial expressions.

[0581] Step 4:

[0582] The server integrates voice characteristic information, text information, and facial feature information, and the emotion engine estimates the user's emotions. The input is all the information processed in the previous step, and the output is the estimated emotion information. The estimated emotions include emotional states such as stress levels and feelings of happiness. In this process, the emotion engine analyzes the data using machine learning algorithms.

[0583] Step 5:

[0584] The server uses emotional information to select the most suitable music and controls the audio device via the terminal to play the music. The input is estimated emotional information, and the output is the selected music data. This process uses a generative AI model to dynamically select the song that best suits the user's current emotions. Furthermore, based on prompts, the entire system completes an interaction that aligns with the user's emotions.
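The selection in Step 5 can be illustrated with a minimal sketch. The source states that a generative AI model selects the song. The code below swaps in a simple nearest-match rule to show the data flow from estimated emotion to selected track. The Track class, the calmness tag, and the mapping from stress level to target calmness are all assumptions made for illustration and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    title: str
    calmness: float  # hypothetical tag: 0.0 (energetic) to 1.0 (very calm)


def target_calmness(stress_level: float) -> float:
    """Map a stress level to a desired calmness (assumed rule: more stress, calmer music)."""
    return min(max(stress_level, 0.0), 1.0)  # clamp to [0, 1]


def select_track(library: list[Track], stress_level: float) -> Track:
    """Return the track whose calmness tag is closest to the target."""
    if not library:
        raise ValueError("music library is empty")
    goal = target_calmness(stress_level)
    return min(library, key=lambda track: abs(track.calmness - goal))


library = [
    Track("Morning Run", 0.1),
    Track("Evening Tea", 0.6),
    Track("Deep Rest", 0.95),
]
chosen = select_track(library, stress_level=0.9)
print(chosen.title)
```

In a fuller system the stress level would come from the emotion engine of Step 4, and the rule could be replaced by a prompt to a generative model as the text describes.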

[0585] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0586] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0587] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0588] [Fourth Embodiment]

[0589] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0590] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0591] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0592] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0593] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0594] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0595] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0596] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0597] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0598] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0599] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0600] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0601] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0602] The system according to the present invention performs real-time sentiment analysis and visualization to facilitate smoother user communication. This system is mainly composed of a terminal, a server, and the user, and performs a series of processes to acquire, analyze, and visualize audio and video data.

[0603] First, the device begins to acquire voice and video input from the user in real time. Using the microphone and camera, it accurately captures the user's speech and facial expressions. The voice data is converted into text data by speech recognition software, enabling analysis based on the user's linguistic expressions.

[0604] The acquired data is transferred to a server, where analysis processing begins. Here, information such as voice tone, speed, text content, and facial expressions is analyzed in detail by an AI analysis algorithm to identify emotions such as joy, anxiety, and anger. In this analysis, multiple data modules each detect different emotional states using their own methods.

[0605] The analysis results are integrated on the server side, generating comprehensive emotional data. This emotional data is converted into graphs and charts in a visually easy-to-understand format and sent to the terminal. Here, the user can immediately check the analysis results as visual information, and grasp subtle emotional changes that might be difficult to notice through audio alone.

[0606] As a concrete example, in a customer service scenario, if a user (customer) expresses dissatisfaction during a phone call with a support representative, the device captures the voice and facial expressions that convey the customer's anxiety and anger. The server then analyzes these emotions and sends the results back to the device as a graph, allowing the representative to understand the customer's emotions in real time and adjust their response accordingly. This is expected to improve the quality of communication.

[0607] The following describes the processing flow.

[0608] Step 1:

[0609] The user activates the device and begins communication. The device's microphone and camera become active, capturing the user's voice and video.

[0610] Step 2:

[0611] The audio data acquired by the device is converted into text data using real-time speech recognition technology. During this process, acoustic characteristics such as intonation and speed are also extracted for analysis.

[0612] Step 3:

[0613] The device uses an expression recognition algorithm to analyze the user's facial expressions from the video data it acquires. This analysis extracts facial features such as smiles and frown lines as digital data.

[0614] Step 4:

[0615] The terminal packages audio data, text data, and video data and sends them to the server for analysis.

[0616] Step 5:

[0617] The server uses a voice analysis module to analyze the tone and speed of the voice in the received audio data and extract elements that allow for the inference of emotion.

[0618] Step 6:

[0619] The server uses a text analysis module to identify sentiment expressions within text data using natural language processing. In this process, it generates sentiment scores using a word sentiment weighting dictionary.

[0620] Step 7:

[0621] The server uses a facial expression analysis module based on the video data to map each facial feature to an emotion and estimate the emotion.

[0622] Step 8:

[0623] The server integrates these analysis results to generate data that represents the overall emotional state.

[0624] Step 9:

[0625] The server converts the overall sentiment data into visualized data and formats it into graphs and charts.

[0626] Step 10:

[0627] The server sends the visualization data to the terminal.

[0628] Step 11:

[0629] The device receives visualization data and displays the emotional state on the screen in real time, allowing the user to visually understand their current emotional state.
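Step 6 of the flow above, in which the text analysis module generates sentiment scores from a word sentiment weighting dictionary, can be sketched as follows. This is a minimal illustration only: the dictionary entries, the weights, and the averaging rule are assumptions, and the source does not specify how the dictionary is built or how the scores are combined.

```python
import re

# Hypothetical word sentiment weighting dictionary (positive > 0, negative < 0).
SENTIMENT_WEIGHTS = {
    "happy": 1.0,
    "great": 0.8,
    "thanks": 0.5,
    "worried": -0.6,
    "terrible": -0.9,
    "angry": -1.0,
}


def sentiment_score(text: str, weights: dict[str, float] = SENTIMENT_WEIGHTS) -> float:
    """Average the weights of all dictionary words found in the text.

    Returns 0.0 (neutral) when no dictionary word is present.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hits = [weights[word] for word in words if word in weights]
    return sum(hits) / len(hits) if hits else 0.0


print(sentiment_score("I am angry and worried about this order"))
print(sentiment_score("Hello, I have a question"))
```

A score of this kind would be one of the per-modality inputs that Step 8 integrates with the voice and facial expression results.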

[0630] (Example 1)

[0631] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0632] In conventional communication systems, it has been difficult to grasp changes in emotions in real time, which has been a factor in degrading the quality of communication. In this situation, there is a need to improve communication by instantly analyzing and visualizing the user's emotional state.

[0633] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0634] In this invention, the server includes acquisition means for acquiring acoustic data and visual data, conversion means for converting acoustic data into text data, analysis means for estimating emotional states, and display means for visualizing integrated emotional data. This makes it possible to analyze the user's emotional state in real time and present it visually.

[0635] "Acquisition means" refers to devices and technologies for collecting acoustic and visual data, and are responsible for capturing the user's speech and facial expressions.

[0636] "Conversion means" refers to a technology or device that converts acoustic data obtained by acquisition means into text data, thereby changing linguistic information into text format through a speech recognition process.

[0637] "Analysis means" refers to a technology or device that analyzes acoustic data, text data, and visual data to estimate the user's emotional state, and uses AI to identify emotions.

[0638] "Display means" refers to a technology or device for visualizing and presenting emotional data obtained by analysis means, in order to enable users to intuitively understand emotional changes.

[0639] This invention is an emotion analysis system designed to facilitate smoother user communication. The system is comprised of a terminal, a server, and the user, and features the ability to acquire, analyze, and visualize audio and video data in real time.

[0640] The device is responsible for acquiring audio and visual data from the user. Specifically, it uses a microphone to capture audio information and a camera to record the user's facial expressions. The audio data is immediately converted into text data by speech recognition software. This instantly converts the user's speech into digital text format, preparing it for analysis.

[0641] The collected audio and visual data are transferred to a server via an internet connection. On the server, AI analysis algorithms are executed based on this data. In this analysis process, the tone and speed of speech, text content, and facial expressions based on the video data are analyzed in detail. In this way, emotional states such as joy, anxiety, and anger are identified. Machine learning models and natural language processing techniques are used in this analysis.

[0642] The analyzed emotional data is integrated by a data module and converted into a visually easy-to-understand format. The device then presents these visualized results to the user through a user interface. This allows the user to understand their own or others' emotional changes in real time, improving the quality of communication.

[0643] A concrete example of its use is in customer service. If a user (customer) is dissatisfied during a call with a support representative, the system captures their voice and facial expressions, and the server analyzes their dissatisfaction or anger. As a result, the representative has the opportunity to understand the customer's emotions in real time and quickly adjust their response.

[0644] An example of a prompt message might be: "Explain how to use this emotion analysis system to identify emotions such as joy, anxiety, and anger in real time from the user's voice and facial expression data, and then visualize and present them."

[0645] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0646] Step 1:

[0647] The user starts the system, and the terminal uses a microphone and camera to acquire acoustic and visual data in real time. The input is the user's voice and facial expressions; the voice is output as digital audio data, and the facial expressions are output as video data. Specifically, the microphone captures the user's speech, and the camera photographs the user's face.

[0648] Step 2:

[0649] The device processes the acquired audio data through speech recognition software, converting it into text data. The input is digital audio data, and the output is the corresponding text string. In this conversion process, the audio waveform is analyzed, and the language recognition engine generates the corresponding string. Specifically, peaks and intervals in the audio waveform are identified and mapped to words.

[0650] Step 3:

[0651] The terminal sends audio, text, and visual data to the server. The input consists of converted text and video data, as well as the original audio data, all of which are transferred to the server. Specifically, the terminal's network module divides the data into packets and transmits the data using a secure communication protocol.

[0652] Step 4:

[0653] The server analyzes emotional states using an AI analysis algorithm based on the received data. The input consists of audio, text, and video data received by the server, and the output is the result of the emotional state determination. Specifically, the tone and speed of the audio data are analyzed, the content of the text is examined using natural language processing techniques, and facial expression analysis is performed on the video data.

[0654] Step 5:

[0655] The server integrates the analysis results to generate final emotion data. In this process, the input consists of partial results from each analysis, and the output is integrated emotion state data. Specifically, the individual analysis results are weighted to obtain a unified emotion evaluation score.

[0656] Step 6:

[0657] The server visualizes the data in a visually easy-to-understand format and sends it to the terminal. The input is integrated sentiment data, and the output is data in the form of graphs and charts. Specifically, the visualization tool analyzes the data and generates pie charts and line graphs.

[0658] Step 7:

[0659] The device presents visualized emotional data to the user. Input consists of graphs and charts received from the server, and output is a screen display that the user can view. Specifically, the user interface is updated to intuitively show the emotional state.
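The weighting described in Step 5 of this flow, where the individual analysis results are weighted to obtain a unified emotion evaluation score, can be sketched as a normalized weighted average. The modality names, the example scores, and the weight values below are assumptions for illustration. The source does not state which weights are used or how they are chosen.

```python
def fuse_emotions(
    results: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> dict[str, float]:
    """Combine per-modality emotion scores into one score per emotion.

    results maps a modality name to its {emotion: score} output.
    weights maps a modality name to its relative weight.
    """
    total_weight = sum(weights[modality] for modality in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    fused: dict[str, float] = {}
    for modality, scores in results.items():
        share = weights[modality] / total_weight  # normalize so shares sum to 1
        for emotion, score in scores.items():
            fused[emotion] = fused.get(emotion, 0.0) + share * score
    return fused


# Hypothetical partial results from the voice, text, and facial analyses.
results = {
    "voice": {"joy": 0.2, "anger": 0.8},
    "text": {"joy": 0.4, "anger": 0.6},
    "face": {"joy": 0.0, "anger": 1.0},
}
weights = {"voice": 0.25, "text": 0.25, "face": 0.5}

fused = fuse_emotions(results, weights)
dominant = max(fused, key=fused.get)
print(fused, dominant)
```

The fused dictionary is the kind of integrated emotion data that Step 6 would then convert into pie charts or line graphs.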

[0660] (Application Example 1)

[0661] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0662] In modern communication, misunderstandings and friction often arise due to insufficient emotional transmission. Particularly within families, the inability to understand each other's emotional nuances can significantly reduce the quality of daily life. As a solution to this problem, there is a need for a system that analyzes and displays emotions in real time.

[0663] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0664] In this invention, the server includes an acquisition means for acquiring audio and video information, a conversion means for generating text information using the audio information obtained by the acquisition means, an analysis means for analyzing the audio information, text information, and video information and estimating emotions, a display means for visually displaying the emotion information obtained by the analysis means, and a dialogue means for providing appropriate dialogue and suggestions based on the user's emotional state. This enables real-time understanding of emotional states even within the home, facilitating smoother communication.

[0665] "Acquisition means" refers to the device or method for initially collecting audio and video information.

[0666] "Conversion means" refers to devices or methods used to convert acquired audio information into text information.

[0667] "Analysis means" refers to devices or methods for analyzing audio information, text information, and video information, and estimating emotions based on that analysis.

[0668] "Display means" refers to devices or methods for visually representing and presenting analyzed emotional information to the user.

[0669] "Dialogue means" refers to a device or method for providing users with appropriate conversations and suggestions based on their estimated emotional state.

[0670] To realize this invention, the user's terminal collects audio and video input. The terminal is equipped with a microphone and a camera, which are used to continuously acquire the user's voice and facial expressions. The acquired audio data is converted into text data using speech recognition software. In this case, speech recognition technology such as Google Speech-to-Text API is suitable.

[0671] The server receives audio, text, and video data transmitted from the terminal and executes an emotion analysis algorithm. Generative AI models such as the OpenAI GPT series are effective for this analysis. The analysis algorithm comprehensively analyzes factors such as speech intonation and speed, text content, and facial expressions in the video to generate emotion data.

[0672] The generated emotional data is displayed in a visually easy-to-understand format. The server sends this visual data to the device and displays it on the device's screen as graphs and icons. Users can use this information to make decisions that facilitate communication within the family.

[0673] For example, if a robot observes a child doing homework and determines that their concentration is waning, it can suggest, "Why don't you take a break? Let's play together!" Another example of a prompt to facilitate interaction that responds to the user's emotions is, "Please tell me how to analyze my family's emotions in real time and make appropriate conversations and suggestions based on that." This prompt is an element that enables the automation of emotion-based communication using generative AI models.

[0674] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0675] Step 1:

[0676] The device acquires audio and video data. Specifically, it captures the user's speech and facial expressions in real time using the camera and microphone. The input is the user's audio and video information, and the output is their digital data. The device then prepares this data to send to the server.

[0677] Step 2:

[0678] The terminal converts the audio data into text data. Speech recognition software is used to convert the audio information into text. The input is the audio data acquired in step 1, and the output is the text data of that audio. The converted text data is transferred to a server for further analysis.

[0679] Step 3:

[0680] The server analyzes the audio, text, and video data it receives. The analysis uses a generative AI model to analyze information in detail, including speech intonation and speed, text content, and facial expressions. The input is audio, text, and video data sent from the terminal, and the output is the analysis result representing the emotional state.

[0681] Step 4:

[0682] The server converts the emotional data into a data format for visual display. The emotional analysis results are processed into graphs and icons, making them visually displayable. The input is the emotional analysis results obtained in step 3, and the output is the visually displayed data. This data is sent to the terminal.

[0683] Step 5:

[0684] The terminal receives display data and displays emotional information to the user. Emotional changes are visually represented on the display, allowing the user to intuitively understand the situation. The input is visual display data received from the server, and the output is the information displayed on the screen.

[0685] Step 6:

[0686] The server generates appropriate dialogue and suggestions based on the user's emotional state. A generative AI model is used to determine appropriate responses based on the generated emotional data. The input is the emotional analysis result from step 3, and the output is a dialogue message as a prompt. This message is sent back to the terminal.
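Step 6 can be sketched as assembling a prompt from the emotion analysis result and passing it to a generative model. The sketch below does not call any real model API. The generate parameter stands in for whatever model interface is used, and the prompt wording, function names, and the fake_model stub are all hypothetical.

```python
from typing import Callable


def build_dialogue_prompt(emotion: str, confidence: float, context: str) -> str:
    """Assemble a prompt from the emotion analysis result (wording is illustrative)."""
    return (
        "You are assisting a family member at home.\n"
        f"Estimated emotion: {emotion} (confidence {confidence:.2f}).\n"
        f"Observed context: {context}\n"
        "Suggest one short, friendly sentence the robot could say next."
    )


def suggest_dialogue(
    generate: Callable[[str], str],
    emotion: str,
    confidence: float,
    context: str,
) -> str:
    """Send the prompt to a text-generation callable and return its trimmed reply."""
    prompt = build_dialogue_prompt(emotion, confidence, context)
    return generate(prompt).strip()


def fake_model(prompt: str) -> str:
    """Stand-in for a generative AI model, used only to make the sketch runnable."""
    return " Why don't you take a break? "


message = suggest_dialogue(fake_model, "fatigue", 0.82, "child doing homework")
print(message)
```

Passing the model as a callable keeps the prompt construction testable without committing to a particular generative AI service.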

[0687] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0688] The system according to the present invention aims to acquire audio and video data and accurately analyze the user's emotions, and is equipped with advanced emotion recognition capabilities combined with an emotion engine. The system mainly consists of a terminal, a server, and an emotion engine.

[0689] The device is equipped with a microphone and camera to capture the user's voice and facial expressions in real time. The voice data is converted into text data, while characteristics such as intonation and speech speed are analyzed in parallel. All data obtained through this process is sent to a server for further detailed analysis.

[0690] On the server, data analysis is performed through an integrated emotion engine. The functions provided by the server include speech recognition, natural language processing, and facial expression recognition, which are interconnected to perform comprehensive emotion analysis. The emotion engine generates emotion data in real time and has the ability to capture moments of emotional change by visually representing the user's emotional state.

[0691] The emotion engine employs machine learning algorithms and continuously optimizes various analysis models through reinforcement learning. Furthermore, the server accumulates past user data, enabling personalized analysis tailored to each user. As a result, even for the same user, past emotional history is reflected in the analysis, leading to more accurate emotion estimation.

[0692] As a concrete example, when used in an educational setting, users (students) can check their own understanding of the learning material using a terminal equipped with an emotion engine. The data collected via the server is transmitted to the instructor in real time, enabling the instructor to pace lessons and take countermeasures based on the emotional changes of individual students. This invention therefore contributes to an environment in which individualized learning is promoted and student motivation is further enhanced in educational settings.

[0693] The following describes the processing flow.

[0694] Step 1:

[0695] The user powers on the device, and the built-in microphone and camera automatically turn on. This initiates the acquisition of audio and video data.

[0696] Step 2:

[0697] The speech recognition software uses the audio data acquired by the device to convert it into text data in real time. Simultaneously, data is collected to analyze features such as speech intonation and speed.

[0698] Step 3:

[0699] The device applies a facial expression recognition algorithm to the video data acquired through its camera to extract facial features. This data includes subtle expressions and moments of change.

[0700] Step 4:

[0701] The device collects data from voice, text, and video and sends it to the server. This transmission is performed using a secure and fast protocol.

[0702] Step 5:

[0703] The server inputs the received data into the emotion engine. The emotion engine comprehensively analyzes the tone of voice, the content of the text, and the facial expressions in the video to estimate the emotional state.

[0704] Step 6:

[0705] The server utilizes machine learning algorithms to update the analysis model through reinforcement learning. This improves the model's accuracy in estimating emotions.

[0706] Step 7:

[0707] The server uses past user data to perform personalized sentiment analysis. This data, based on the user's past emotional history, allows for a more precise estimation of their current emotional state.

[0708] Step 8:

[0709] The server visualizes the sentiment data obtained through analysis and converts it into pie charts and vertical/horizontal graphs. This visualized data is then sent to the terminal.

[0710] Step 9:

[0711] The device receives visualization data and displays it in real time on the user interface. This allows users to intuitively understand their own emotional state.
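The personalization in Step 7, where past user data informs the current estimate, can be illustrated with a per-user running baseline. This is one possible reading, not the method stated in the source: the exponential moving average, the alpha parameter, and the EmotionHistory class are assumptions chosen to keep the sketch short.

```python
class EmotionHistory:
    """Keep a per-user running baseline of a stress score (hypothetical design)."""

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to the newest observation
        self._baseline: dict[str, float] = {}

    def update(self, user_id: str, stress: float) -> float:
        """Record a new estimate and return its deviation from the prior baseline."""
        previous = self._baseline.get(user_id, stress)  # first sample sets the baseline
        deviation = stress - previous
        self._baseline[user_id] = (1 - self.alpha) * previous + self.alpha * stress
        return deviation

    def baseline(self, user_id: str) -> float:
        """Return the stored baseline for a user (raises KeyError if unknown)."""
        return self._baseline[user_id]


history = EmotionHistory(alpha=0.5)
first = history.update("student-1", 0.4)   # no history yet, so deviation is 0.0
second = history.update("student-1", 0.8)  # stress is above this user's baseline
print(first, second, history.baseline("student-1"))
```

Reporting the deviation from a user's own baseline, rather than the raw score, is one way the same reading could be interpreted differently for different users.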

[0712] (Example 2)

[0713] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0714] Existing emotion analysis systems lack accuracy in analyzing audio and video data, making it difficult to understand a user's emotional state accurately and in real time. Furthermore, personalized analysis tailored to individual users is not sufficiently implemented, which hinders improvements in emotion analysis accuracy. Additionally, insufficient feedback through visual representations of emotional changes presents challenges to practical application in educational settings and other fields.

[0715] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0716] In this invention, the server includes an acquisition means equipped with a terminal for acquiring audio data and video data; a conversion means for generating text data using the audio data obtained by the acquisition means; an analysis means for analyzing the intonation and speed of the audio data and estimating emotions from the text data and video data; a display means for visually displaying the emotion data; and a personalization means for accumulating past emotion data and optimizing emotion analysis according to the user. This enables accurate and real-time analysis of the user's emotions, and provides personalized, highly accurate emotion analysis and visual feedback.

[0717] "Acquisition means" refers to devices or methods that have the function of acquiring audio data and video data in real time.

[0718] A "conversion means" is a device or method that has the function of generating text data based on acquired audio data.

[0719] "Analysis means" refers to devices or methods that have the function of analyzing audio data, text data, and video data to estimate the user's emotions.

[0720] "Display means" refers to devices or methods that have the function of visually displaying analyzed emotional data.

[0721] "Personalization means" refers to devices or methods that accumulate past emotional data and have the function of optimizing emotional analysis according to the user.

[0722] "Speech recognition" is a technology that analyzes audio data to understand what is being said and converts it into text data.

[0723] "Natural language processing" is a technology that analyzes text data to understand emotions, intentions, and other related concepts.

[0724] "Facial expression recognition" is a technology that analyzes video data to read facial expressions and understand emotions based on those expressions.

[0725] A "machine learning algorithm" is a mathematical model and method for learning from large amounts of data and predicting future data.

[0726] "Reinforcement learning" is a type of machine learning that learns the optimal action through trial and error.

[0727] In this invention, the terminal is equipped with a microphone and a camera as devices for acquiring audio and video data. The audio data is used to capture features of the speaker's voice, including intonation and speed. The video data is used to capture the user's facial expressions and movements in real time. The audio data is converted into text data using speech recognition technology, and features such as intonation and speed are further analyzed.
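The intonation and speed features mentioned here can be expressed as simple numbers. The sketch below assumes that the speech recognizer returns word-level timestamps and that a separate pitch tracker supplies per-frame pitch values in Hz. Neither assumption is stated in the source, and the choice of words per second and pitch standard deviation as features is illustrative only.

```python
from statistics import pstdev


def speech_features(
    word_times: list[tuple[str, float, float]],
    pitch_hz: list[float],
) -> dict[str, float]:
    """Compute speaking speed and a rough intonation measure.

    word_times holds (word, start_sec, end_sec) tuples in spoken order.
    pitch_hz holds pitch estimates for voiced frames.
    """
    if not word_times:
        return {"words_per_second": 0.0, "pitch_std_hz": 0.0}
    duration = word_times[-1][2] - word_times[0][1]
    rate = len(word_times) / duration if duration > 0 else 0.0
    # Pitch spread as a proxy for intonation: flat speech gives a small value.
    spread = pstdev(pitch_hz) if len(pitch_hz) > 1 else 0.0
    return {"words_per_second": rate, "pitch_std_hz": spread}


words = [("I", 0.0, 0.2), ("am", 0.2, 0.5), ("fine", 0.5, 1.0)]
features = speech_features(words, pitch_hz=[100.0, 120.0])
print(features)
```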

[0728] Data acquired by the device is sent to a server where all analysis takes place. The server comprehensively analyzes the data using an emotion engine that includes speech recognition, natural language processing, and facial expression recognition. The emotion engine employs machine learning algorithms and continuously optimizes the analysis model through reinforcement learning. This makes it possible to accurately analyze the user's emotional state in real time.

[0729] Furthermore, the server incorporates previously accumulated emotional data for each user into its analysis, enabling personalized emotional analysis for each user. This allows highly accurate emotion estimation that reflects each user's own emotional history. The analyzed emotional data is visually displayed, allowing for a concrete representation of the user's emotional changes.

[0730] One concrete example of its use is in educational settings. Users (students) would use this system to check their understanding of their learning during class. The analyzed emotional data would also be provided to instructors in real time, enabling them to tailor their lessons to the individual student's emotional changes.

[0731] As an example of a prompt when using a generative AI model, you can use the instruction "Analyze and visualize the current emotional state" to specify how the system should analyze and visualize the user's emotions.

[0732] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0733] Step 1:

[0734] The device uses a microphone and camera to acquire user audio and video data. It receives audio and video from the environment as input and converts them into digital data as output. Specifically, it captures the user's voice and continuously records their facial expressions.

[0735] Step 2:

[0736] This process uses speech recognition technology to generate text data from audio data acquired by the device. Simultaneously, it analyzes the intonation and speed of the speech. The input for this step is the acquired audio data, and the output is text data and speech features. Specifically, it analyzes the speech, converts the spoken content into a string, and generates numerical data for intonation and speed.
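
As a rough illustration of the "numerical data for intonation and speed" produced in Step 2, the sketch below derives a speaking rate from a transcript and its duration, and uses the spread of a pitch contour as a stand-in for intonation. The function name `speech_features`, the pitch-contour input, and both metrics are assumptions made for illustration; the specification does not define how these features are computed.

```python
from statistics import pstdev


def speech_features(transcript: str, duration_s: float, pitch_hz: list[float]) -> dict:
    """Return illustrative speed and intonation features for one utterance."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    words = transcript.split()
    # Speed: words spoken per second (assumed metric).
    speed = len(words) / duration_s
    # Intonation: spread of voiced pitch samples; 0 Hz marks an unvoiced frame.
    voiced = [p for p in pitch_hz if p > 0]
    intonation = pstdev(voiced) if len(voiced) > 1 else 0.0
    return {"speed_wps": speed, "intonation_hz": intonation}
```

In this sketch the transcript would come from whatever speech recognizer the terminal uses, and the pitch contour from a separate pitch tracker; neither is modeled here.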

[0737] Step 3:

[0738] The terminal packages the analyzed audio, text, and video data and sends it to the server. The input for this step is audio features, text data, and video data, and the output is the transmission of data to the server. Specifically, the terminal ensures real-time performance by accumulating a certain amount of data and then rapidly transferring it to the server.
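
The "accumulate a certain amount of data, then transfer" behavior in Step 3 could be sketched as a small batching buffer. The class name `PacketBuffer`, the JSON encoding, and the `send` callback are hypothetical; the specification does not state the packaging format or transport.

```python
import json
from typing import Callable


class PacketBuffer:
    """Accumulate feature packets and send them in batches (hypothetical helper)."""

    def __init__(self, send: Callable[[bytes], None], batch_size: int = 3):
        self._send = send              # assumed transport callback
        self._batch_size = batch_size  # packets to accumulate before sending
        self._pending: list[dict] = []

    def add(self, packet: dict) -> None:
        """Queue one packet and flush once the batch is full."""
        self._pending.append(packet)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send all queued packets as one JSON payload, if any are queued."""
        if not self._pending:
            return
        payload = json.dumps(self._pending).encode("utf-8")
        self._send(payload)
        self._pending = []
```

A smaller `batch_size` lowers latency at the cost of more transfers; the right trade-off depends on the network and is not specified in the text.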

[0739] Step 4:

[0740] The server comprehensively analyzes the received data based on audio, text, and video. It uses an emotion engine to analyze the data and estimate emotions. The input for this step is the dataset transferred to the server, and the output is the emotion data estimated through the analysis. Specific operations include keyword extraction using speech recognition, text analysis using natural language processing, and facial analysis using facial expression recognition.
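
One simple way to combine the audio, text, and video results in Step 4 is a weighted average of per-modality emotion scores. The sketch below assumes each modality has already produced a score per emotion; the function names, the weights, and the choice of weighted averaging are illustrative assumptions, since the specification does not describe how the emotion engine integrates its inputs.

```python
def fuse_emotions(per_modality: dict[str, dict[str, float]],
                  weights: dict[str, float]) -> dict[str, float]:
    """Weighted average of per-modality emotion scores (assumed fusion rule)."""
    total = sum(weights[m] for m in per_modality)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused: dict[str, float] = {}
    for modality, scores in per_modality.items():
        w = weights[modality] / total  # normalise so the weights sum to 1
        for emotion, score in scores.items():
            fused[emotion] = fused.get(emotion, 0.0) + w * score
    return fused


def top_emotion(fused: dict[str, float]) -> str:
    """Return the emotion with the highest fused score."""
    return max(fused, key=fused.get)
```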

[0741] Step 5:

[0742] The server visually represents the user's emotional state based on the analysis results. The input for this step is the emotional data obtained through analysis, and the output is the visualized emotional state. Specific actions include converting the emotional data into graphs and animations, making it easy for users and educational institutions to review the emotional state.
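
As a minimal stand-in for the graphs mentioned in Step 5, the following sketch renders a time series of emotion scores as text bars. The function name `render_timeline`, the score range of 0 to 1, and the text format are assumptions; an actual display means would presumably use graphical charts or animations.

```python
def render_timeline(samples: list[tuple[float, float]], width: int = 20) -> str:
    """Render (time_s, score) samples as one text bar per sample."""
    lines = []
    for t, score in samples:
        clamped = min(max(score, 0.0), 1.0)  # keep the score within [0, 1]
        bar = "#" * round(clamped * width)
        lines.append(f"{t:6.1f}s |{bar:<{width}}| {clamped:.2f}")
    return "\n".join(lines)
```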

[0743] Step 6:

[0744] The server personalizes emotion analysis using accumulated historical user data. The input for this step is historical emotion data, and the output is a current emotion estimate that reflects past analysis. Specifically, the process involves incorporating historical data into the analysis to learn each user's patterns of emotional change and improve the accuracy of predictions.
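
A very small sketch of the personalization in Step 6: each new score is expressed relative to that user's own historical mean, so the same raw value can mean different things for different users. The class name `UserBaseline` and the mean-offset rule are assumptions for illustration; the specification only says that accumulated data is used to learn each user's patterns.

```python
from collections import defaultdict
from statistics import mean


class UserBaseline:
    """Track per-user score history and report deviations from it (hypothetical)."""

    def __init__(self) -> None:
        self._history: dict[str, list[float]] = defaultdict(list)

    def personalize(self, user_id: str, raw_score: float) -> float:
        """Return raw_score minus the user's historical mean, then record it."""
        history = self._history[user_id]
        # With no history yet, the first score becomes its own baseline.
        baseline = mean(history) if history else raw_score
        history.append(raw_score)
        return raw_score - baseline
```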

[0745] (Application Example 2)

[0746] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0747] Conventional consumer robots have difficulty accurately understanding user emotions and taking appropriate measures, resulting in insufficient stress reduction and the provision of a comfortable living environment. Therefore, there is a need for technology that can grasp changes in user emotions in real time and respond appropriately based on that information.

[0748] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0749] In this invention, the server includes a device means for acquiring audio information and video information, a conversion device means for generating text information using the audio information obtained by the device means, and an estimation means for analyzing the audio information, text information, and video information and estimating emotions. This makes it possible to analyze the user's emotions in real time and select and play the most suitable music based on those emotions, thereby reducing user stress and providing a comfortable living environment.

[0750] "Audio information" refers to data obtained from speech, including the user's vocal characteristics and intonation.

[0751] "Video information" refers to visual data, including information that records the user's facial expressions and movements.

[0752] "Device means" refers to means that have hardware or software for acquiring audio information or video information.

[0753] A "conversion device means" is a device that performs a process to convert audio information into text information.

[0754] "Text information" refers to information that represents audio information as character data.

[0755] "Estimation means" refers to the process of analyzing acquired audio, text, and video information and estimating emotions from it.

[0756] "Emotional information" refers to data that represents the user's emotional state and is obtained through the estimation means.

[0757] "Acoustic device means" refers to a device for selecting and playing appropriate music based on emotional information.

[0758] The system implementing this invention mainly consists of terminal devices for acquiring audio and video information, and a server for analyzing and processing that data. This system employs highly integrated technology to grasp the user's emotions in real time.

[0759] The terminal is equipped with a microphone and camera to acquire audio and video information from the user. Audio information is acoustic data obtained from the user's speech, and video information is visual data representing the user's facial expressions and movements. This information is transmitted to the server in real time.

[0760] The server converts the received audio information into text information using the conversion device means, and uses this text to perform emotion analysis. In particular, an emotion engine that combines speech recognition technology and facial expression analysis algorithms serves as the estimation means and comprehensively analyzes audio information, text information, and video information to estimate the user's emotion information. This emotion information includes the user's stress level and emotional state, and is represented visually.

[0761] Furthermore, based on emotional information, the server controls the sound system and selects and plays the most suitable music. For this purpose, it also includes an algorithm to smoothly find songs from the music library that match the user's emotions.
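
The song-matching algorithm is not described further. As one hedged possibility, the sketch below picks the track whose "calmness" tag is closest to the estimated stress level, so that higher stress leads to calmer music. The `Track` type, the `calmness` tag, and the nearest-match rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    title: str
    calmness: float  # 0 = energetic, 1 = very calm (hypothetical library tag)


def select_track(stress_level: float, library: list[Track]) -> Track:
    """Pick the track whose calmness is closest to the user's stress level."""
    if not library:
        raise ValueError("library is empty")
    target = min(max(stress_level, 0.0), 1.0)  # clamp stress to [0, 1]
    return min(library, key=lambda t: abs(t.calmness - target))
```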

[0762] For example, after a busy day at work, music might be played to help a user relax upon returning home. The system detects the user's voice tone and facial expressions that indicate fatigue, and plays relaxing music accordingly to reduce the user's stress.

[0763] An example of a prompt message would be, "Based on the user's voice and video, detect their current emotional state. If the user is stressed, suggest actions to help them relax." This allows the system to provide user-centered interaction and care that was not possible with conventional technologies.

[0764] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0765] Step 1:

[0766] The device activates its microphone and camera to acquire audio and video information from the user in real time. The input is the user's raw voice and facial expression data, and the output is the acquired digital audio and visual data. This acquisition process provides foundational data for analyzing the user's emotional state.

[0767] Step 2:

[0768] The terminal sends the acquired audio information to the server. The server applies a speech recognition algorithm to convert the audio information into text information. The input is audio data, and the output is the corresponding text information. In this process, the intonation and speed of the speech are also analyzed, and characteristic information for emotion estimation is extracted along with the text data.

[0769] Step 3:

[0770] The terminal transmits video information to the server. The server analyzes the video information and applies a facial expression analysis algorithm to detect facial expressions. The input is visual data, and the output is digital information representing facial features. This analysis yields an index of emotion based on changes in facial expressions.

[0771] Step 4:

[0772] The server integrates voice characteristic information, text information, and facial feature information, and the emotion engine estimates the user's emotions. The input is all the information processed in the previous step, and the output is the estimated emotion information. The estimated emotions include emotional states such as stress levels and feelings of happiness. In this process, the emotion engine analyzes the data using machine learning algorithms.

[0773] Step 5:

[0774] The server uses emotional information to select the most suitable music and controls the audio device via the terminal to play the music. The input is estimated emotional information, and the output is the selected music data. This process uses a generative AI model to dynamically select the song that best suits the user's current emotions. Furthermore, based on prompts, the entire system completes an interaction that aligns with the user's emotions.

[0775] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0776] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0777] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0778] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0779] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0780] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0781] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
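
The geometry described for the emotion map (left versus right, upper versus lower, inner versus outer) can be illustrated by interpreting a point on a plane centred on the map. The coordinate system, the function name `describe_position`, and the half-radius threshold between "inner" and "outer" are assumptions; the actual map places named emotions on concentric circles rather than using continuous coordinates.

```python
import math


def describe_position(x: float, y: float, max_radius: float = 1.0) -> dict:
    """Interpret a point on an emotion-map-like plane centred at the origin."""
    radius = math.hypot(x, y)
    return {
        # Left half: reaction-driven; right half: situation-driven.
        "origin": "reaction" if x < 0 else "situation",
        # Upper side: pleasure; lower side: displeasure.
        "valence": "pleasure" if y > 0 else "displeasure",
        # Inner rings: inner thoughts; outer rings: expressed in actions.
        "layer": "inner" if radius < max_radius / 2 else "outer",
        "radius": radius,
    }
```

Points exactly on an axis are assigned to one side arbitrarily in this sketch.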

[0782] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
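
The idea that deviation from an ideal balance yields discomfort, and approach to it yields pleasure, can be sketched for a robot as a score computed from sensor readings. The function name `comfort`, the tolerance parameters, and the linear scoring are illustrative assumptions, not part of the specification.

```python
def comfort(readings: dict[str, float], ideals: dict[str, float],
            tolerances: dict[str, float]) -> float:
    """Return a score in [-1, 1]: 1 at the ideal, -1 at or beyond 2x tolerance."""
    if not readings:
        raise ValueError("readings is empty")
    deviations = []
    for name, value in readings.items():
        # Deviation from the ideal, measured in units of the tolerance.
        normalised = abs(value - ideals[name]) / tolerances[name]
        deviations.append(min(normalised, 2.0))
    mean_dev = sum(deviations) / len(deviations)
    return 1.0 - mean_dev
```

For example, a nearly full battery and an upright posture give a score close to 1 (pleasure), while an empty battery and a strong tilt give -1 (discomfort).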

[0783] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0784] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
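
The specification does not say how the network is trained so that neighbouring emotions receive similar values. One possible approach, shown here purely as an assumption, is to replace a one-hot training label with soft targets that decay with distance on the map. The function name `soft_targets`, the 2-D positions, and the Gaussian kernel are hypothetical.

```python
import math


def soft_targets(label: str, positions: dict[str, tuple[float, float]],
                 sigma: float = 0.5) -> dict[str, float]:
    """Spread a one-hot label over map neighbours with a Gaussian of distance."""
    lx, ly = positions[label]
    weights = {}
    for name, (x, y) in positions.items():
        d2 = (x - lx) ** 2 + (y - ly) ** 2  # squared distance on the map
        weights[name] = math.exp(-d2 / (2 * sigma ** 2))
    total = sum(weights.values())
    # Normalise so the targets sum to 1.
    return {name: w / total for name, w in weights.items()}
```

With made-up positions in which "reassured" and "calm" sit close together and "anger" sits far away, a "reassured" label yields similar target values for "reassured" and "calm" and a value near zero for "anger".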

[0785] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0786] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0787] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0788] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0789] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0790] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using this memory.

[0791] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0792] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0793] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0794] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0795] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0796] The following is further disclosed regarding the embodiments described above.

[0797] (Claim 1)

[0798] An acquisition means for acquiring audio data and video data,

[0799] A conversion means for generating text data using the audio data obtained by the acquisition means,

[0800] An analysis means for analyzing the aforementioned audio data, text data, and video data to estimate emotions,

[0801] A display means for visually displaying the emotional data obtained by the analysis means,

[0802] A system that includes this.

[0803] (Claim 2)

[0804] The system according to claim 1, wherein the analysis means estimates emotion by analyzing the intonation and speed of the voice data.

[0805] (Claim 3)

[0806] The system according to claim 1, wherein the analysis means analyzes facial expressions based on video data and estimates emotions based on that.

[0807] "Example 1"

[0808] (Claim 1)

[0809] Acquisition means for acquiring acoustic data and visual data,

[0810] A conversion means for generating character data using the acoustic data obtained by the acquisition means,

[0811] An analysis means for analyzing the aforementioned acoustic data, text data, and visual data to estimate the emotional state,

[0812] A display means that integrates and visually displays the emotional data obtained by the analysis means,

[0813] A system that includes this.

[0814] (Claim 2)

[0815] The system according to claim 1, wherein the analysis means estimates an emotional state by analyzing the intonation and speed of acoustic data and integrates inputs from multiple data modules.

[0816] (Claim 3)

[0817] The system according to claim 1, wherein the analysis means analyzes facial expressions based on visual data and visualizes emotional states by generating time-series charts and graphs.

[0818] "Application Example 1"

[0819] (Claim 1)

[0820] An acquisition means for acquiring audio information and video information,

[0821] A conversion means for generating text information using the audio information obtained by the acquisition means,

[0822] An analysis means for analyzing the aforementioned audio information, text information, and video information to estimate emotions,

[0823] A display means for visually displaying the emotional information obtained by the analysis means,

[0824] A dialogue tool that provides appropriate dialogue and suggestions based on the user's emotional state,

[0825] A system that includes this.

[0826] (Claim 2)

[0827] The system according to claim 1, wherein the analysis means estimates emotion by analyzing the intonation and speed of the speech information.

[0828] (Claim 3)

[0829] The system according to claim 1, wherein the analysis means analyzes facial expressions based on video information and estimates emotions based on that.

[0830] "Example 2 of combining an emotion engine"

[0831] (Claim 1)

[0832] Acquisition means equipped with a terminal for acquiring audio data and video data,

[0833] A conversion means for generating text data using the audio data obtained by the acquisition means,

[0834] The server's analysis means analyzes the intonation and speed of the aforementioned audio data and estimates emotions from the text data and video data,

[0835] A display means for visually displaying the emotional data obtained by the analysis means,

[0836] The aforementioned server accumulates past emotional data and provides personalization means to optimize emotional analysis according to the user,

[0837] A system that includes this.

[0838] (Claim 2)

[0839] The system according to claim 1, which integrates speech recognition, natural language processing, and facial expression recognition to analyze emotions.

[0840] (Claim 3)

[0841] The system according to claim 1, which uses a machine learning algorithm to optimize the analysis model by reinforcement learning.

[0842] "Application example 2 when combining with an emotional engine"

[0843] (Claim 1)

[0844] A device means for acquiring audio information and video information,

[0845] A conversion device means that generates text information using the audio information obtained by the aforementioned device means,

[0846] An estimation means for analyzing the aforementioned audio information, text information, and video information to estimate emotions,

[0847] A display device that visually displays the emotional information obtained by the estimation means,

[0848] A sound device means that selects and plays the optimal music based on the aforementioned emotional information,

[0849] A system that includes this.

[0850] (Claim 2)

[0851] The system according to claim 1, wherein the estimation means estimates emotion by analyzing the tone and speed of the voice information and controls the sound device means.

[0852] (Claim 3)

[0853] The system according to claim 1, wherein the estimation means analyzes facial expressions based on video information, estimates emotions based on that, and controls the sound device means.

[Explanation of Symbols]

[0854]
10, 210, 310, 410 Data Processing System
12 Data Processing Device
14 Smart Device
214 Smart Glasses
314 Headset-type Terminal
414 Robot

Claims

1. A system comprising: an acquisition means for acquiring audio data and video data; a conversion means for generating text data using the audio data obtained by the acquisition means; an analysis means for analyzing the audio data, the text data, and the video data to estimate emotions; and a display means for visually displaying the emotion data obtained by the analysis means.

2. The system according to claim 1, wherein the analysis means estimates emotion by analyzing the intonation and speed of the audio data.

3. The system according to claim 1, wherein the analysis means analyzes facial expressions based on the video data and estimates emotions based on those facial expressions.