System

A system that converts voice data to text, analyzes tone and emotion, and provides real-time feedback improves meeting efficiency by balancing participation and addressing cultural and language barriers.

JP2026101148A · Pending · Publication Date: 2026-06-22 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-10
Publication Date
2026-06-22

AI Technical Summary

Technical Problem

Modern meetings often suffer from decreased productivity due to factors like monopolization of discussions, cultural and language barriers leading to misunderstandings, and inefficient communication.

Method used

A system that collects participants' voice data, converts it into text, analyzes tone and emotion, and generates real-time feedback to improve communication quality and balance.

Benefits of technology

Enhances meeting productivity by providing immediate feedback to participants, addressing tone and participation imbalance, and maintaining discussion focus.


Abstract

[Problem] To provide a system. [Solution] A system including: means for collecting participants' voice information; means for converting the voice information into text information; means for analyzing the content of statements from the text information; means for generating feedback based on the analysis; means for providing the feedback; and means for improving the conversational environment inside an autonomous vehicle in real time.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In a modern business environment, meetings are an important means of communication, but they often reduce productivity and waste time. Contributing factors include particular participants monopolizing the discussion, opinions being suppressed, and discussions drifting off topic. In multinational teams, cultural and language differences can also cause misunderstandings and communication difficulties. A system that solves these problems and improves the quality of meetings is therefore required.

Means for Solving the Problems

[0005] This invention provides a collection device for participants' audio data and means for converting the collected audio data into text data. The converted text data is then analyzed to evaluate the tone and emotion of speech and to detect the level of participation and interruptions, and feedback is generated from the results. The generated feedback is delivered to the user in real time, so the quality of communication can be improved while the meeting is still in progress. This supports smooth communication that accounts for the cultural and linguistic differences of multinational teams, which increases meeting productivity and reduces wasted time.

[0006] "Participant audio data" refers to audio information emitted by individual members participating in the meeting.

[0007] "Collection device" refers to the hardware and software configuration for receiving audio data from an external source and appropriately capturing it as digital information.

[0008] "Means for converting audio data to text data" refers to software functions that use speech recognition technology to perform the process of converting audio information into text information.

[0009] "Means for analyzing the content of speech" refers to algorithms that perform natural language processing on text data to understand its context and nuances, and to interpret emotions and intentions.

[0010] "Means of generating feedback" refers to a function that creates information to provide users with appropriate advice and suggestions for improvement based on the results of the analyzed utterances.

[0011] "Means of notifying feedback" refers to communication and display technologies for conveying generated feedback information to the user's terminal in real time, either by displaying it or by voice.

Brief Description of the Drawings

[0012]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0014] First, the terms used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as work memory.

[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The data processing system 10 uses the reception output program 60 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] This invention is a system for improving the quality of meetings using audio data collected from users while they participate in a meeting. The user terminal collects participants' audio in real time through a microphone during the meeting and transmits it to a server. The server converts the received audio data into text data using automatic speech recognition technology. During this conversion, the server removes background noise and identifies each speaker.

[0034] The server analyzes the converted text data using natural language processing techniques. Through this analysis, the server evaluates the tone and emotion of each statement, and also assesses how closely the discussion follows the agenda and how active the discussion is. Furthermore, the server analyzes the frequency of interruptions and the balance of contributions to understand the overall progress of the meeting.
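The balance of contributions and the frequency of interruptions described above can be illustrated with a minimal Python sketch. It assumes that speaker identification has already produced time-stamped utterances; the `Utterance` record, the function names, and the rule that an interruption is an utterance starting while a different speaker is still talking are all assumptions made for illustration and are not specified in this disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record produced after speaker identification."""
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float    # seconds from the start of the meeting


def speaking_share(utterances):
    """Return each speaker's fraction of the total speaking time."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u.speaker] += u.end - u.start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {speaker: t / grand_total for speaker, t in totals.items()}


def count_interruptions(utterances):
    """Count utterances that begin while a different speaker is still talking."""
    counts = defaultdict(int)
    ordered = sorted(utterances, key=lambda u: u.start)
    for i, current in enumerate(ordered):
        for previous in ordered[:i]:
            if previous.speaker != current.speaker and current.start < previous.end:
                counts[current.speaker] += 1
                break  # count at most one interruption per utterance
    return dict(counts)
```

For example, if speaker A talks from 0 s to 10 s and speaker B starts at 8 s, B is credited with one interruption under this assumed rule.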

[0035] Based on the analysis results, the server generates personalized feedback for each participant. This feedback includes suggestions for improving their tone of voice and specific advice for balancing the discussion. User terminals notify participants of the generated feedback during the meeting and present it to them in real time. This allows users to adjust their contributions during the meeting and improve its productivity.

[0036] For example, if a user is dominating a discussion by speaking too frequently, the server will provide feedback to that user encouraging them to consciously listen to others' opinions. Also, if a meeting is straying from a particular topic, the server will advise the facilitator to return to the main agenda. In this way, the system appropriately supports the progress of meetings, enabling efficient and streamlined communication.
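The two cases in this example can be sketched as simple rules in Python. This is only an illustration under stated assumptions: the 50% dominance threshold, the keyword-overlap test for topic drift, the message wording, and both function names are hypothetical, and the disclosure does not prescribe how these decisions are made.

```python
def dominance_feedback(shares, threshold=0.5):
    """Return {speaker: message} for speakers whose share of speaking time
    exceeds the (assumed) threshold."""
    messages = {}
    for speaker, share in shares.items():
        if share > threshold:
            messages[speaker] = (
                f"You have taken {share:.0%} of the speaking time. "
                "Consider inviting others to share their views."
            )
    return messages


def facilitator_feedback(utterance_text, agenda_keywords, min_overlap=1):
    """Return advice for the facilitator when an utterance shares fewer than
    min_overlap words with the agenda keywords, otherwise None."""
    words = {w.strip(".,!?").lower() for w in utterance_text.split()}
    overlap = len(words & {k.lower() for k in agenda_keywords})
    if overlap < min_overlap:
        return "The discussion may be drifting; consider returning to the main agenda."
    return None
```

A production system would likely replace the keyword overlap with a semantic similarity measure; the rule-based form is used here only because it is easy to follow.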

[0037] The following describes the processing flow.

[0038] Step 1:

[0039] The user terminal collects audio from meeting participants in real time via a microphone. Because the audio data is transmitted to the server in digital form, it is sampled and encoded appropriately before transmission.

[0040] Step 2:

[0041] The server receives audio data sent from the user's terminal and converts it into text data using an automatic speech recognition engine. Here, the server performs noise reduction and speaker identification, distinguishing and transcribing each utterance.

[0042] Step 3:

[0043] The server applies natural language processing algorithms to the converted text data to analyze the tone and emotion of the speech. This analysis utilizes sentiment analysis and keyword extraction to make the speaker's intent easier to understand.

[0044] Step 4:

[0045] Based on the analyzed data, the server evaluates the level of participation, the presence or absence of interruptions, and any deviations from the agenda. This allows for an understanding of the dynamic flow of the meeting as a whole and the contribution of each participant.

[0046] Step 5:

[0047] The server generates personalized feedback for each participant based on the results of the speech analysis. This feedback is designed to include specific areas for improvement and suggestions to facilitate the discussion.

[0048] Step 6:

[0049] User terminals notify participants in real time during the meeting of feedback received from the server. The feedback is provided via pop-ups and audio, helping participants immediately attempt to make improvements.

[0050] Step 7:

[0051] The server automatically generates a report after the meeting ends, including a summary and analysis of the entire meeting. Users can refer to this report to help improve communication.
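Steps 2, 3, 5, 6, and 7 above can be sketched as a small Python pipeline whose stages are supplied as functions. This is a structural sketch only: the class name `MeetingPipeline`, the stage signatures, and the report format are assumptions, Step 1 (capture on the terminal) and Step 4 (participation and agenda evaluation) are omitted for brevity, and real speech recognition and language analysis engines would be plugged in where the stub callables are.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


@dataclass
class MeetingPipeline:
    """Hypothetical wiring of the server-side steps; every stage is injected."""
    transcribe: Callable[[bytes], List[Tuple[str, str]]]  # Step 2: audio -> [(speaker, text)]
    analyze: Callable[[str], Scores]                      # Step 3: text -> scores
    make_feedback: Callable[[str, Scores], str]           # Step 5: "" means no feedback
    notify: Callable[[str, str], None]                    # Step 6: (speaker, message)
    history: List[Tuple[str, str, Scores]] = field(default_factory=list)

    def process_chunk(self, audio: bytes) -> None:
        """Run one chunk of received audio through Steps 2, 3, 5, and 6."""
        for speaker, text in self.transcribe(audio):
            scores = self.analyze(text)
            self.history.append((speaker, text, scores))
            message = self.make_feedback(speaker, scores)
            if message:
                self.notify(speaker, message)

    def report(self) -> str:
        """Step 7: summarize the meeting from the recorded history."""
        counts: Dict[str, int] = {}
        for speaker, _text, _scores in self.history:
            counts[speaker] = counts.get(speaker, 0) + 1
        lines = ["Meeting report"]
        lines += [f"{speaker}: {n} utterance(s)" for speaker, n in sorted(counts.items())]
        return "\n".join(lines)
```

Injecting the stages keeps the flow testable without audio hardware or network access, and lets each stage be replaced independently.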

[0052] (Example 1)

[0053] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0054] In meetings, there is a challenge in facilitating effective communication while maintaining a balance of participation and consistency in the discussion. In particular, the inefficiency of meetings due to imbalances in the tone, emotion, and frequency of participation among participants is a significant problem.

[0055] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0056] In this invention, the server includes means for collecting participants' voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to evaluate each participant's tone of voice and emotions, optimize the level of discussion activity, and generate and notify feedback in real time.

[0057] "Participants" refers to people who participate in discussions and contribute to the conversation during a meeting.

[0058] "Audio information" refers to audio data acquired as spoken words by participants during a meeting.

[0059] "Textual information" refers to data in text format that has been converted from audio information.

[0060] "Analysis" refers to the process of evaluating the meaning, tone, and emotion of a statement from converted text information, and extracting specific patterns and characteristics.

[0061] "Feedback" refers to information provided to participants based on the analysis results for improvement and evaluation.

[0062] "Notification" refers to the process of displaying or reporting generated feedback on participants' devices.

[0063] The system based on this invention has the function of collecting and analyzing participants' voice information in real time to generate feedback in order to improve the efficiency of meetings. The user terminal acquires the voice information of participants during the meeting using its built-in microphone. This voice information is transmitted to the server via an internet connection.

[0064] The server converts received audio information into text using speech recognition technology. During this process, the server applies a noise reduction algorithm to extract a clean speech signal. The converted text is then analyzed on the server using a natural language processing library. This analysis allows the server to identify the meaning, tone, and emotion of the spoken content. Based on the analyzed data, the server generates individual feedback for each participant. This feedback includes specific advice to help participants manage emotional shifts and balance speaking time.

[0065] User terminals receive feedback from the server in real time and notify users in visual and audio formats. This allows users to adjust their comments and actions on the spot, improving the productivity of meetings.

[0066] For example, if a participant is speaking too frequently and dominating the discussion, the server will provide feedback to encourage that participant to elicit opinions from other attendees. Also, if the overall discussion is straying from a particular topic, the server will advise the facilitator to bring it back on track.

[0067] An example of a prompt to input into a generative AI model is, "Please tell me how to analyze participants' voices in real time during a meeting and provide personalized feedback to each speaker." A system designed in this way supports efficient and harmonious meeting management.

[0068] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0069] Step 1: Collecting audio information

[0070] The user terminal uses its built-in microphone to collect participants' speech in real time during a meeting. The input is the sound of the meeting room, and this audio signal is temporarily stored as digital data. Specifically, the terminal's recording application samples the audio and stores it in a buffer.

[0071] Step 2: Sending voice information to the server

[0072] The terminal transmits the collected audio digital data to the server via the internet. The input is the audio data collected in step 1, and the output is the audio file transferred to the server. The data is transmitted securely using an encryption protocol.

[0073] Step 3: Converting speech to text

[0074] The server processes the received audio data using speech recognition software and converts it into text data. The input is the audio file sent to the server, and the output is the corresponding text data. The server performs noise reduction and separates the utterances by speaker using speaker identification technology.

[0075] Step 4: Analyzing the text information

[0076] The server analyzes text data using a natural language processing library. The input is text data, and the output is the analysis results, including the content of the statement, its emotion, and tone. Specifically, the server applies a text analysis algorithm and calculates an emotion score for each statement.

[0077] Step 5: Generating feedback

[0078] The server generates feedback for each participant based on the analysis results. The input is the analysis results, and the output is a personalized feedback message. The server uses a generative AI model to create feedback that includes specific areas for improvement.

[0079] Step 6: Notifying feedback

[0080] The user terminal displays feedback sent from the server. The input is the feedback message, and the output is the visual or audible notification the user receives. Specifically, the terminal uses a GUI to display pop-up notifications and, if necessary, uses vibration or sound to draw attention.
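The per-statement emotion score mentioned in Step 4 can be illustrated with a deliberately small lexicon-based sketch in Python. The word lists, the score range of -1 to 1, and the function name `emotion_score` are assumptions chosen for illustration; the disclosure does not state which text analysis algorithm is used, and a practical system would rely on a trained model rather than a fixed word list.

```python
import re

# Toy lexicons, assumed for illustration only.
POSITIVE = {"good", "great", "agree", "thanks", "helpful"}
NEGATIVE = {"bad", "wrong", "disagree", "terrible", "useless"}


def emotion_score(statement):
    """Return a score in [-1, 1]: +1 if only positive lexicon words occur,
    -1 if only negative ones occur, 0.0 if none occur or they balance out."""
    words = re.findall(r"[a-z']+", statement.lower())
    positive_hits = sum(w in POSITIVE for w in words)
    negative_hits = sum(w in NEGATIVE for w in words)
    total_hits = positive_hits + negative_hits
    if total_hits == 0:
        return 0.0
    return (positive_hits - negative_hits) / total_hits
```

The score for each statement could then be passed, together with the transcript, to the feedback generation in Step 5.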

[0081] (Application Example 1)

[0082] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0083] If communication between passengers, or between passengers and the system, is not smooth within an autonomous vehicle, passenger comfort and experience may be compromised. Conventional methods cannot improve the conversational environment in real time, nor can they autonomously guide the tone and content of passengers' speech.

[0084] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0085] In this invention, the server includes means for collecting participant voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to understand the conversational environment in an autonomous vehicle in real time, evaluate the tone of speech, emotions, and the level of engagement, and provide feedback to improve the passenger experience.

[0086] "Participants" refers to all individuals and devices that provide audio information within an autonomous vehicle.

[0087] "Audio information" refers to the collected sound wave data that is converted into text information by speech recognition technology.

[0088] "Textual information" refers to text data converted from audio information and is used for analyzing the content of speech.

[0089] "Statement content" refers to the opinions and information of participants expressed as text information; analyzing this content reveals the intention and meaning of the conversation.

[0090] "Analysis" refers to the process of understanding and evaluating the meaning and intent of a statement using collected textual information.

[0091] "Feedback" refers to information provided to participants based on the analysis results, indicating areas for improvement in the tone and content of their statements.

[0092] "Real-time" refers to processing participants' voice information instantly and notifying them of the results without delay.

[0093] An "autonomous vehicle" refers to a vehicle in which driving operations are performed autonomously, and which is a means of transportation equipped with systems to improve the passenger experience inside the vehicle.

[0094] This invention realizes a support system that facilitates communication within autonomous vehicles. A server collects passenger voice information in real time through microphones installed in the vehicle and uses it to improve passenger conversations. The voice information is converted into text information using speech recognition technology. For example, the speech_recognition library in Python or the Google (registered trademark) speech recognition API can be used.

[0095] This text information is analyzed by the server's natural language processing module, which evaluates the tone and emotion of the speech, the level of engagement, and whether interruptions occurred. The analysis results are then turned into concrete feedback by custom modules such as text_processing_module and feedback_generator. In this process, the server generates optimal feedback for the passenger and notifies them in real time via their terminal. This allows the passenger to adjust their speech and improve the conversational environment.

[0096] For example, if one passenger is speaking for an extended period, the server might offer feedback such as, "Please listen to what other passengers have to say." If a passenger is speaking very loudly, the server might encourage them by saying, "Speaking in a quieter tone would help everyone relax."
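Detecting that a passenger is speaking very loudly requires a loudness measure on the audio itself. One common choice, shown in the Python sketch below, is the RMS level in decibels relative to full scale. The sketch assumes the audio is available as floating-point samples in the range -1.0 to 1.0; the -10 dBFS threshold and both function names are hypothetical, and the disclosure does not specify how loudness is measured.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of samples (floats in [-1.0, 1.0]) in dBFS,
    where an RMS value of 1.0 corresponds to 0 dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)  # equals 20 * log10(rms)


def loudness_feedback(samples, threshold_dbfs=-10.0):
    """Return a feedback message when the level exceeds the assumed threshold."""
    if rms_dbfs(samples) > threshold_dbfs:
        return "Speaking in a quieter tone would help everyone relax."
    return None
```

In practice the threshold would need calibration against the microphone gain and the cabin noise level.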

[0097] An example of a prompt for a generative AI model could be: "Speech recognition result: {text generated from speech}. Analyze the tone and important keywords of this text to create feedback that will facilitate the conversation."
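Filling the placeholder in this prompt can be sketched in Python as follows. The placeholder name `transcript` and the function name `build_prompt` are assumptions; the call to the generative AI model itself is outside the sketch because the disclosure does not name a specific model or API.

```python
PROMPT_TEMPLATE = (
    "Speech recognition result: {transcript}. "
    "Analyze the tone and important keywords of this text "
    "to create feedback that will facilitate the conversation."
)


def build_prompt(transcript):
    """Insert the recognized text into the prompt template.

    Runs of whitespace are collapsed so that line breaks in the transcript
    do not split the prompt.
    """
    cleaned = " ".join(transcript.split())
    return PROMPT_TEMPLATE.format(transcript=cleaned)
```

Because the transcript is passed as a format argument rather than concatenated into the template, any braces spoken or transcribed in the text are inserted literally.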

[0098] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0099] Step 1:

[0100] The server collects passenger voice information through microphones installed inside the vehicle. The input is sound wave data, and the output is a digitized voice signal. Here, the analog data is converted to a digital format to enable subsequent processing.

[0101] Step 2:

[0102] The server uses the collected audio information to perform a speech recognition process and generate text information. The input is a digitized audio signal, and the output is text data of the audio. In this step, the Python speech_recognition library is used to call Google's speech recognition API to convert the audio data to text.

[0103] Step 3:

[0104] The server analyzes the generated text information using a natural language processing module. The input is text data obtained through speech recognition, and the output is an evaluation result of the tone, emotion, liveliness, and presence or absence of interruptions in the speech. Specifically, the text_processing_module is used to perform calculations that evaluate the meaning and emotion of the text.

[0105] Step 4:

[0106] Based on the analysis results, the server generates feedback. The input is the analysis results, and the output is a specific feedback message for the passenger. The feedback_generator is used to create the feedback message, pointing out areas for improvement in tone and content.

[0107] Step 5:

[0108] The terminal notifies passengers of the generated feedback in real time. The input is the feedback message, and the output is visualized notification information. This process uses the terminal's display and audio output capabilities to provide passengers with feedback at the appropriate time.

[0109] Step 6:

[0110] The user adjusts their statements based on feedback received through the device. The input is the feedback message, and the output is the adjusted new statement. This process allows the user to improve the conversational environment and maintain comfortable communication.
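Step 2 above names the Python speech_recognition library and Google's speech recognition API. A minimal sketch of that call is shown below, assuming the audio has been saved as a WAV file and that the third-party SpeechRecognition package is installed and network access is available. The function name `transcribe_wav` and the optional `recognize` parameter, which allows another engine or a test double to be substituted, are assumptions added for illustration.

```python
def transcribe_wav(path, language="en-US", recognize=None):
    """Return the transcript of a WAV file, or "" if no speech was understood.

    recognize is an optional callable (path, language) -> str that replaces
    the default engine; it is a hypothetical hook, not part of the library.
    """
    if recognize is not None:
        return recognize(path, language)

    # Imported lazily so the function can be used with a substitute engine
    # even when the package is not installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""  # the speech was unintelligible
```

Network or quota failures surface as `sr.RequestError` and are left to the caller, since the appropriate retry policy depends on the deployment.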

[0111] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0112] This invention is a system that provides feedback by comprehensively analyzing participants' voices and emotions in meetings in which users participate. The user terminal is equipped with a collection device that collects participants' voices in real time during the meeting and transmits them to a server. The server receives the voice data and converts it into text data using a speech recognition engine. This conversion process is optimized to remove noise and identify each speaker.

[0113] Furthermore, the server is equipped with an emotion engine that comprehensively analyzes text data, voice tone, and facial expression information obtained from video data to evaluate the user's emotional state in detail. The server uses natural language processing algorithms to evaluate the tone and emotional tendencies of each statement, as well as to grasp the level of engagement and the progress of the discussion.
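One simple way to combine text, voice tone, and facial expression signals into a single estimate is a weighted average, sketched below in Python. This is an assumed illustration, not the emotion engine of this disclosure: the per-modality scores in the range -1 to 1, the default weights, and the function name `fuse_emotion` are all hypothetical, and the emotion identification model 59 may work quite differently.

```python
def fuse_emotion(text_score, voice_score, face_score=None, weights=(0.5, 0.3, 0.2)):
    """Combine per-modality scores in [-1, 1] into one weighted average.

    A modality whose score is None (for example, no camera image) is skipped
    and the remaining weights are renormalized. Returns 0.0 if no modality
    is available.
    """
    scores = (text_score, voice_score, face_score)
    available = [(s, w) for s, w in zip(scores, weights) if s is not None]
    total_weight = sum(w for _, w in available)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in available) / total_weight
```

Renormalizing the weights keeps the fused score on the same scale when facial expression data is missing, which matters for terminals without a usable camera view.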

[0114] Based on the analysis results, the server generates personalized feedback for each participant. This feedback is adjusted according to the user's emotional state and is provided during and after the meeting. User terminals display the feedback in real time, visualizing areas for improvement and points to be aware of for the user.

[0115] For example, if the emotion engine detects that a user is emotionally agitated, the server generates feedback for that user, such as advice on breathing techniques to calm down. Also, if a discussion becomes too heated, the server provides feedback to the facilitator, prompting them to return to the topic. This allows the system to efficiently manage the meeting and create an environment where all participants can contribute constructively.

[0116] The following describes the processing flow.

[0117] Step 1:

[0118] The user terminal collects participants' audio data via the microphone during the meeting. The collected audio data is digitized and transmitted to the server in real time.

[0119] Step 2:

[0120] The server converts the received audio data into text data using a speech recognition engine. During the conversion, background noise is removed, the speaker is identified, and each utterance is neatly transcribed into text.

[0121] Step 3:

[0122] The server analyzes the converted text data and speech tone using natural language processing algorithms. Here, it evaluates the tone and emotion of each statement and checks the liveliness and balance of the discussion.

[0123] Step 4:

[0124] The server uses an emotion engine to further analyze the user's emotional state. It integrates voice tone, facial expression data, and text content to provide a detailed assessment of the user's emotional tendencies.

[0125] Step 5:

[0126] Based on the analysis results above, the server generates personalized feedback tailored to each participant, including specific advice and suggestions for improvement that take their emotional state into account.

[0127] Step 6:

[0128] The user terminal notifies the user of the generated feedback in real time. Notifications are provided as screen displays or audio messages, helping the user to correct their actions on the spot.

[0129] Step 7:

[0130] The server generates a detailed report after the meeting, including a summary and analysis of the meeting. Users can use this report to identify areas for improvement in future meetings and prepare accordingly.
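Steps 4 and 5 above can be illustrated with a minimal sketch, assuming that each modality (text, voice tone, facial expression) has already been reduced to an agitation estimate between 0 and 1. The weights, the threshold, and the names `ModalityScores`, `fuse_agitation`, and `feedback_for` are hypothetical and do not appear in the disclosure; a weighted average is only one possible way to integrate the modalities.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical modality weights; the disclosure does not specify any.
WEIGHTS = {"text": 0.4, "voice": 0.3, "face": 0.3}
# Assumed cut-off above which a user is treated as agitated.
AGITATION_THRESHOLD = 0.7


@dataclass
class ModalityScores:
    """Per-utterance agitation estimates in [0, 1], one per modality."""
    text: float
    voice: float
    face: float


def fuse_agitation(scores: ModalityScores) -> float:
    """Combine the three modality scores into one weighted average."""
    return (
        WEIGHTS["text"] * scores.text
        + WEIGHTS["voice"] * scores.voice
        + WEIGHTS["face"] * scores.face
    )


def feedback_for(scores: ModalityScores) -> Optional[str]:
    """Return calming advice when the fused score reaches the threshold."""
    if fuse_agitation(scores) >= AGITATION_THRESHOLD:
        return "Try a slow breathing exercise before continuing."
    return None
```

For instance, `feedback_for(ModalityScores(text=0.9, voice=0.8, face=0.9))` yields the breathing advice, while low scores on all three modalities yield no feedback.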

[0131] (Example 2)

[0132] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0133] In modern meetings, participants' contributions are diverse, and understanding their statements, including changes in emotion and tone, and providing appropriate feedback is essential for productive discussions. However, traditional systems have struggled to accurately assess participants' emotions and the state of the discussion in real time and to provide optimized improvements for each participant.

[0134] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0135] In this invention, the server includes means for collecting acoustic information of participants, means for converting the acoustic information into text information, and means for analyzing the content of speech based on the text information. This enables a detailed understanding of the content of participants' speech and the generation of personalized feedback that corresponds to the emotion and tone of each speech.

[0136] "Participants" refers to the individuals who speak during a meeting, and who are the subjects from whom the system collects audio information and other data.

[0137] "Acoustic information" refers to data that includes the characteristics of voices and sounds emitted by participants, and is collected in real time during the meeting.

[0138] "Textual information" refers to data in text format converted from acoustic information, and is used for analyzing the content of speech.

[0139] "Statements" refer to the opinions and information expressed by participants, and serve as the basis for analysis and feedback generation.

[0140] "Analysis" is the process of evaluating the tone of participants' speech, emotional state, and activity level based on textual information and acoustic tone, and extracting necessary information.

[0141] "Feedback" refers to information provided to participants based on analysis results, indicating areas for improvement and points to pay attention to, with the aim of improving the quality of the meeting and the performance of the participants.

[0142] This invention is a system that comprehensively analyzes acoustic information in user-participatory meetings and provides participants with real-time, personalized feedback. During the meeting, the user terminal collects participant acoustic information using a highly sensitive voice collection device. This collected acoustic information is encoded using a dedicated compression algorithm and transmitted to a server via the network.

[0143] The server first uses a digital signal processing unit to remove noise from the received acoustic information. It then uses the Google Cloud Speech-to-Text API or a similar recognition engine to convert this acoustic information into text. Next, to analyze the converted text, it uses OpenAI® generative AI models to precisely evaluate the content of the speech. The analysis comprehensively considers voice tone and facial expression information from the video to understand the emotional state of each participant.

[0144] Based on the analysis results, the server uses natural language generation technology to generate feedback best suited to each participant. For example, if a user is speaking in an angry tone, the server provides feedback suggesting breathing techniques to soften that tone. The generated feedback is transmitted to the user's terminal via the user interface and displayed visually to the participant in real time.

[0145] For example, if the server detects that the participants' discussion is becoming heated, it will prompt the facilitator to "adjust the meeting to ensure proper moderation." Another example of a prompt for the generative AI model could be, "Please tell me what kind of feedback should be given when participants become emotionally agitated."

[0146] As described above, implementing this system can support more effective and smoother meeting management and create an environment where all participants can contribute constructively.

[0147] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0148] Step 1:

[0149] The user terminal uses a high-sensitivity microphone to collect acoustic information from participants during the meeting. The terminal captures this information in real time, applies a noise reduction filter to clean the signal, and then encodes the result in digital format. The input is the participants' voices in the physical space, and the output is digitally filtered acoustic data.

[0150] Step 2:

[0151] The user terminal compresses the encoded audio data and sends it to the server in accordance with IT security protocols. The input is digital audio data, and the output is securely transmitted compressed audio data. This process applies algorithms to optimize data volume and reduce the load on the network.

[0152] Step 3:

[0153] The server decodes the received compressed audio data, performs noise reduction, and analyzes the speech characteristics. The input is the received compressed audio data, and the output is clean, analyzable audio data. The server then converts this data into text information using a speech recognition engine, such as a commonly available API.

[0154] Step 4:

[0155] The server analyzes the generated text information and associated speech tone data to evaluate the content and emotional state of the utterance. The input consists of text information and speech tone data, while the output is evaluation information such as the emotional state of the analyzed utterance. This analysis is performed using a generative AI model and natural language processing techniques.

[0156] Step 5:

[0157] The server generates personalized feedback based on the analysis results and sends it to the user terminal via the user interface. The input is the analyzed emotion and utterance evaluation information, and the output is user-specific, customized feedback data. This feedback may include, for example, advice to alleviate the feelings of meeting participants.

[0158] Step 6:

[0159] The user terminal displays feedback received from the server in real time and notifies the user using audio and visuals. The input is feedback data, and the output is feedback information displayed on the screen. Based on this information, the terminal supports the user in taking specific actions during the meeting.
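The compression in Step 2 and the decoding in Step 3 can be sketched as follows. This is a minimal illustration, assuming 16-bit PCM samples and using the standard-library `zlib` codec as a stand-in for the unspecified "dedicated compression algorithm"; the function names are hypothetical, and encryption and network transport are omitted.

```python
import struct
import zlib
from typing import List


def encode_pcm16(samples: List[int]) -> bytes:
    """Pack signed 16-bit samples into little-endian bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


def decode_pcm16(data: bytes) -> List[int]:
    """Unpack little-endian bytes back into signed 16-bit samples."""
    count = len(data) // 2
    return list(struct.unpack(f"<{count}h", data))


def compress_for_upload(samples: List[int]) -> bytes:
    """Terminal side: encode and compress audio before sending it."""
    return zlib.compress(encode_pcm16(samples))


def decompress_on_server(payload: bytes) -> List[int]:
    """Server side: restore the original samples from the payload."""
    return decode_pcm16(zlib.decompress(payload))
```

Because `zlib` is lossless, the server recovers exactly the samples the terminal sent. A production system would more likely use a speech-oriented audio codec, which this sketch does not attempt to model.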

[0160] (Application Example 2)

[0161] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0162] In modern home and office environments, smooth communication and psychological safety are crucial elements for improving the quality of daily life. However, emotional outbursts and conversational stagnation can hinder smooth communication. In such situations, there is a need for a system that utilizes robots to understand emotions and conversational states in real time and provide appropriate support.

[0163] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0164] In this invention, the server includes a mechanism for acquiring acoustic information, a method for converting acoustic information into text information, a method for analyzing emotions and facial expressions from the text information and acoustic information, a method for generating instructions to support actions, and a mechanism for presenting the instructions via an operating device. This facilitates communication in home and office environments, enhances psychological safety, and improves the quality of life.

[0165] "Acoustic information" refers to speech and sound signals collected within a space, including conversations and ambient sounds.

[0166] A "mechanism" refers to a group of physical or electronic devices designed to perform a specific function.

[0167] "Textual information" refers to data in string format obtained by analyzing acoustic information, which is essentially a conversion of speech into text.

[0168] "Method" refers to the steps or processes that are taken to achieve a specific objective.

[0169] "Analysis" is the process of evaluating the content and characteristics of collected data and extracting information.

[0170] "Emotions" are internal reactions that indicate a person's psychological state, and they are expressed externally through tone of voice and facial expressions.

[0171] "Facial expression" refers to the outward changes produced by facial muscles and other structures to indicate an individual's emotional state.

[0172] "Instructions to support action" are advice or guidelines provided to encourage appropriate behavior in a particular situation.

[0173] An "operating device" refers to a unit that physically moves to execute instructions from a system.

[0174] As a form of implementing the invention, this system is designed to facilitate communication in home and office environments. The system includes a consumer robot with a built-in microphone for acquiring acoustic information. This robot acquires voice data in real time and transmits it to a server via a Wi-Fi network. The server then uses Google Cloud's Speech-to-Text API to convert the acoustic information into text.

[0175] The converted text and audio information are analyzed for emotions and facial expressions via IBM Watson®'s emotion analysis API. This analysis evaluates the user's emotional state, and the server generates instructions to support the user's actions based on the result. A generative AI model (e.g., GPT-3®) is used to generate these instructions: the server inputs a prompt into the model and provides the resulting advice. The generated advice is displayed on the robot's screen or presented by voice.

[0176] As a concrete example, the robot observes how family conversations develop during dinner. If the system determines that emotions are running high, it will provide voice advice such as, "Let's calm down a bit and take a deep breath." Also, if it determines that the conversation is stalled, it will prompt the conversation with questions such as, "How was your day?"

[0177] A concrete example of a prompt sentence using a generative AI model is, "If a conversation at home becomes emotional, generate advice to help people calm down." This prompt sentence allows the generative AI model to generate appropriate feedback tailored to the specific situation.

[0178] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0179] Step 1:

[0180] The terminal acquires acoustic information through a microphone. The input is ambient sound data, which is converted into a digital signal and sent to the server. The output is the transmission of audio data to the server.

[0181] Step 2:

[0182] The server inputs the received audio data into the speech recognition service. Here, the Google Cloud Speech-to-Text API is used to filter out noise from the audio data and convert it into text. The output is the converted text data.

[0183] Step 3:

[0184] The server inputs the converted text information and the original audio data into the sentiment analysis engine. This engine utilizes IBM Watson's sentiment analysis API to analyze the user's emotional state and facial expressions. As a result of the analysis, data indicating emotional tendencies and tone is output.

[0185] Step 4:

[0186] The server inputs prompts into the generative AI model, which then generates instructions based on the analysis results. The GPT-3 model is used here, and an example prompt is "There is a heightened emotional situation at home; please provide advice on how to calm down." The output is specific action advice for the user.

[0187] Step 5:

[0188] The server sends the generated instructions to the terminal. The terminal receives these instructions and provides advice to the user via voice or display. The output is feedback to the user.
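The decision behind Steps 3 to 5 can be sketched as a simple rule, assuming the emotion analysis yields an arousal value between 0 and 1 and that the terminal tracks how long the room has been silent. Both thresholds and the names `ConversationState` and `choose_instruction` are hypothetical. The fixed strings are the example utterances given in paragraph [0176] and stand in for text that the generative AI model would produce.

```python
from dataclasses import dataclass
from typing import Optional

AROUSAL_THRESHOLD = 0.7  # assumed level at which emotions count as running high
STALL_SECONDS = 20.0     # assumed silence length that counts as a stalled conversation


@dataclass
class ConversationState:
    arousal: float                       # 0..1, from the emotion analysis
    seconds_since_last_utterance: float  # silence duration observed by the robot


def choose_instruction(state: ConversationState) -> Optional[str]:
    """Pick the robot's next utterance, or None when no support is needed."""
    if state.arousal >= AROUSAL_THRESHOLD:
        return "Let's calm down a bit and take a deep breath."
    if state.seconds_since_last_utterance >= STALL_SECONDS:
        return "How was your day?"
    return None
```

Checking arousal before silence means that a tense pause is treated as an emotional situation and not as a stalled conversation; that ordering is a design choice of this sketch only.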

[0189] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0190] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0191] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0192] [Second Embodiment]

[0193] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0194] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0195] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0196] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0197] The microphone 238 receives instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice uttered by the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0198] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0199] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0200] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0201] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0202] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0203] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0204] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0205] This invention is a system for improving the quality of meetings using audio data collected while users participate in a meeting. The user terminal collects participants' audio in real time through a microphone during the meeting and transmits it to a server. The server converts the received audio data into text data using automatic speech recognition technology. At this time, the server removes background noise and identifies each speaker.

[0206] The server analyzes the converted text data using natural language processing techniques. Through this analysis, the server evaluates the tone and emotion of each statement, and also detects the consistency of the agenda and the level of discussion. Furthermore, the server analyzes the frequency of interruptions and the balance of contributions to understand the overall progress of the meeting.

[0207] Based on the analysis results, the server generates personalized feedback for each participant. This feedback includes suggestions for improving their tone of voice and specific advice for balancing the discussion. User terminals notify participants of the generated feedback during the meeting and present it to them in real time. This allows users to adjust their contributions during the meeting and improve its productivity.

[0208] For example, if a user is dominating a discussion by speaking too frequently, the server will provide feedback to that user encouraging them to consciously listen to others' opinions. Also, if a meeting is straying from a particular topic, the server will advise the facilitator to return to the main agenda. In this way, the system appropriately supports the progress of meetings, enabling efficient and streamlined communication.

[0209] The following describes the processing flow.

[0210] Step 1:

[0211] The user terminal collects audio from meeting participants in real time via a microphone. Because the audio data is transmitted to the server in digital format, the terminal samples and encodes it before transmission.

[0212] Step 2:

[0213] The server receives audio data sent from the user's terminal and converts it into text data using an automatic speech recognition engine. Here, the server performs noise reduction and speaker identification, distinguishing and transcribing each utterance.

[0214] Step 3:

[0215] The server applies natural language processing algorithms to the converted text data to analyze the tone and emotion of the speech. This analysis utilizes sentiment analysis and keyword extraction to make the speaker's intent easier to understand.

[0216] Step 4:

[0217] Based on the analyzed data, the server evaluates the level of participation, the presence or absence of interruptions, and any deviations from the agenda. This allows for an understanding of the dynamic flow of the meeting as a whole and the contribution of each participant.

[0218] Step 5:

[0219] The server generates personalized feedback for each participant based on the results of the speech analysis. This feedback is designed to include specific areas for improvement and suggestions to facilitate the discussion.

[0220] Step 6:

[0221] User terminals notify participants in real time during the meeting of feedback received from the server. The feedback is provided via pop-ups and audio, helping participants immediately attempt to make improvements.

[0222] Step 7:

[0223] The server automatically generates a report after the meeting ends, including a summary and analysis of the entire meeting. Users can refer to this report to help improve communication.
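The evaluation in Step 4 (level of participation and presence of interruptions) can be sketched as follows, assuming the speech recognition stage yields speaker-labelled utterances with start and end times in seconds. The `Utterance` type, the function names, and the 50% dominance threshold are hypothetical; treating any overlapping change of speaker as an interruption is a simplification made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float


def speaking_share(utterances: List[Utterance]) -> Dict[str, float]:
    """Return each speaker's fraction of the total speaking time."""
    totals: Dict[str, float] = defaultdict(float)
    for u in utterances:
        totals[u.speaker] += u.end - u.start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {speaker: t / grand_total for speaker, t in totals.items()}


def count_interruptions(utterances: List[Utterance]) -> Dict[str, int]:
    """Count utterances that begin before a different speaker has finished."""
    counts: Dict[str, int] = defaultdict(int)
    ordered = sorted(utterances, key=lambda u: u.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            counts[cur.speaker] += 1
    return dict(counts)


def dominant_speakers(utterances: List[Utterance], threshold: float = 0.5) -> List[str]:
    """List speakers whose share of speaking time exceeds the threshold."""
    return [s for s, share in speaking_share(utterances).items() if share > threshold]
```

A speaker returned by `dominant_speakers` would be a candidate for the feedback described in paragraph [0208], which encourages the user to listen to others' opinions.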

[0224] (Example 1)

[0225] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0226] In meetings, there is a challenge in facilitating effective communication while maintaining a balance of participation and consistency in the discussion. In particular, the inefficiency of meetings due to imbalances in the tone, emotion, and frequency of participation among participants is a significant problem.

[0227] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0228] In this invention, the server includes means for collecting participants' voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to evaluate each participant's tone of voice and emotions, optimize the level of discussion activity, and generate and notify feedback in real time.

[0229] "Participants" refers to people who participate in discussions and contribute to the conversation during a meeting.

[0230] "Audio information" refers to audio data acquired as spoken words by participants during a meeting.

[0231] "Textual information" refers to data in text format that has been converted from audio information.

[0232] "Analysis" refers to the process of evaluating the meaning, tone, and emotion of a statement from converted text information, and extracting specific patterns and characteristics.

[0233] "Feedback" refers to information provided to participants based on the analysis results for improvement and evaluation.

[0234] "Notification" refers to the process of displaying or reporting generated feedback on participants' devices.

[0235] The system based on this invention has the function of collecting and analyzing participants' voice information in real time to generate feedback in order to improve the efficiency of meetings. The user terminal acquires the voice information of participants during the meeting using its built-in microphone. This voice information is transmitted to the server via an internet connection.

[0236] The server converts received audio information into text using speech recognition technology. During this process, the server employs a noise reduction algorithm to extract clear language signals. The converted text is then analyzed on the server using a natural language processing library. This analysis allows the server to identify the meaning, tone, and emotion of the spoken content. Based on the analyzed data, the server generates individual feedback for each participant. This feedback includes specific advice to help optimize emotional shifts and speaking balance.

[0237] User terminals receive feedback from the server in real time and notify users in visual and audio formats. This allows users to adjust their comments and actions on the spot, improving the productivity of meetings.

[0238] For example, if a participant is speaking too frequently and dominating the discussion, the server will provide feedback to encourage that participant to elicit opinions from other attendees. Also, if the overall discussion is straying from a particular topic, the server will advise the facilitator to bring it back on track.

[0239] An example of a prompt to input into a generative AI model is, "Please tell me how to analyze participants' voices in real time during a meeting and provide personalized feedback to each speaker." A system designed in this way supports efficient and harmonious meeting management.

[0240] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0241] Step 1: Collect audio information

[0242] The user terminal uses its built-in microphone to collect participants' speech in real time during a meeting. The input is the sound of the meeting room, and this audio signal is temporarily stored as digital data. Specifically, the terminal's recording application samples the audio and stores it in a buffer.

[0243] Step 2: Send audio information to the server

[0244] The terminal transmits the collected audio digital data to the server via the internet. The input is the audio data collected in step 1, and the output is the audio file transferred to the server. The data is transmitted securely using an encryption protocol.

[0245] Step 3: Convert speech to text

[0246] The server processes the received audio data using speech recognition software and converts it into text data. The input is the audio file sent to the server, and the output is the corresponding text data. The server performs noise reduction and separates each speaker's audio using speaker identification technology.

[0247] Step 4: Analyze the text information

[0248] The server analyzes text data using a natural language processing library. The input is text data, and the output is the analysis results, including the content of the statement, its emotion, and tone. Specifically, the server applies a text analysis algorithm and calculates an emotion score for each statement.

[0249] Step 5: Generate feedback

[0250] The server generates feedback for each participant based on the analysis results. The input is the analysis results, and the output is a personalized feedback message. The server uses a generative AI model to create feedback that includes specific areas for improvement.

[0251] Step 6: Notify the user of feedback

[0252] The user terminal displays feedback sent from the server. The input is the feedback message, and the output is the visual or audible notification the user receives. Specifically, the terminal uses a GUI to display pop-up notifications and, if necessary, uses vibration or sound to draw attention.
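The emotion score in Step 4 can be illustrated with a minimal lexicon-based sketch. The disclosure does not specify the text analysis algorithm, so the tiny word list, the averaging rule, the threshold, and the names `LEXICON`, `emotion_score`, and `tone_feedback` are all assumptions made for illustration; a real system would more likely use a trained model.

```python
import re
from typing import Optional

# Tiny illustrative lexicon (hypothetical); values range from -1 to 1.
LEXICON = {
    "great": 1.0,
    "agree": 0.5,
    "thanks": 0.5,
    "wrong": -0.5,
    "never": -0.5,
    "terrible": -1.0,
}
NEGATIVE_THRESHOLD = -0.5  # assumed score at or below which feedback is issued


def emotion_score(statement: str) -> float:
    """Average the lexicon values of the words found in one statement."""
    words = re.findall(r"[a-z']+", statement.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def tone_feedback(statement: str) -> Optional[str]:
    """Return a tone suggestion for strongly negative statements."""
    if emotion_score(statement) <= NEGATIVE_THRESHOLD:
        return "Consider rephrasing this point in a more neutral tone."
    return None
```

Statements containing no lexicon words score 0.0 and produce no feedback, so neutral procedural remarks are ignored.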

[0253] (Application Example 1)

[0254] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0255] If communication between passengers or between passengers and the system is not smooth within an autonomous vehicle, passenger comfort and experience may be compromised. Conventional methods cannot improve the conversation environment in real time and cannot autonomously prompt appropriate adjustments to the tone and content of passengers' speech.

[0256] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0257] In this invention, the server includes means for collecting participant voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to understand the conversational environment in an autonomous vehicle in real time, evaluate the tone of speech, emotions, and the level of engagement, and provide feedback to improve the passenger experience.

[0258] "Participants" refers to all individuals and devices that provide audio information within an autonomous vehicle.

[0259] "Audio information" refers to collected sound wave data, which is converted into text information by speech recognition technology.

[0260] "Textual information" refers to text data converted from audio information and is used for analyzing the content of speech.

[0261] "Statement content" refers to the opinions and information of participants expressed as text information; the intention and meaning of the conversation are understood by analyzing this content.

[0262] "Analysis" refers to the process of understanding and evaluating the meaning and intent of a statement using collected textual information.

[0263] "Feedback" refers to information provided to participants based on the analysis results, indicating areas for improvement in the tone and content of their statements.

[0264] "Real-time" refers to processing participants' voice information instantly and notifying them of the results without delay.

[0265] An "autonomous vehicle" refers to a vehicle in which driving operations are performed autonomously, and which is a means of transportation equipped with systems to improve the passenger experience inside the vehicle.

[0266] This invention realizes a support system to facilitate communication within autonomous vehicles. A server collects passenger voice information in real time through microphones installed in the vehicle and uses this voice information to improve passenger conversations. The voice information is converted into text information using speech recognition technology. For example, the speech_recognition library in Python or Google's speech recognition API can be used.

[0267] This textual information is analyzed by the server's natural language processing module, which evaluates the tone and emotion of the speech, the level of engagement, and whether or not interruptions occurred. The analysis and the feedback generation are implemented by custom modules such as text_processing_module and feedback_generator. In this process, the server generates optimal feedback for the passenger and notifies them in real time via their terminal. This allows the passenger to adjust their speech and improve the conversational environment.

[0268] For example, if one passenger is speaking for an extended period, the server might offer feedback such as, "Please listen to what other passengers have to say." If a passenger is speaking very loudly, the server might prompt them by saying, "Speaking in a quieter tone would help everyone relax."

[0269] An example of a prompt for a generative AI model could be: "Speech recognition result: {text generated from speech}. Analyze the tone and important keywords of this text to create feedback that will facilitate the conversation."
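
The prompt above contains a placeholder for the recognized text. A minimal sketch of filling that placeholder is shown below. The function name and the whitespace normalization are assumptions added for illustration.

```python
PROMPT_TEMPLATE = (
    "Speech recognition result: {text}. "
    "Analyze the tone and important keywords of this text "
    "to create feedback that will facilitate the conversation."
)


def build_feedback_prompt(transcript: str) -> str:
    """Insert the recognized text into the prompt template."""
    # Collapse runs of whitespace so line breaks in the transcript
    # do not split the prompt.
    cleaned = " ".join(transcript.split())
    return PROMPT_TEMPLATE.format(text=cleaned)
```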

[0270] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0271] Step 1:

[0272] The server collects passenger voice information through microphones installed inside the vehicle. The input is sound wave data, and the output is a digitized voice signal. Here, the analog data is converted to a digital format to enable subsequent processing.

[0273] Step 2:

[0274] The server uses the collected audio information to perform a speech recognition process and generate text information. The input is a digitized audio signal, and the output is text data of the audio. In this step, the Python speech_recognition library is used to call Google's speech recognition API to convert the audio data to text.

[0275] Step 3:

[0276] The server analyzes the generated text information using a natural language processing module. The input is text data obtained through speech recognition, and the output is an evaluation result of the tone, emotion, liveliness, and presence or absence of interruptions in the speech. Specifically, the text_processing_module is used to perform calculations that evaluate the meaning and emotion of the text.

[0277] Step 4:

[0278] Based on the analysis results, the server generates feedback. The input is the analysis results, and the output is a specific feedback message for the passenger. The feedback_generator is used to create the feedback message, pointing out areas for improvement in tone and content.

[0279] Step 5:

[0280] The terminal notifies passengers of the generated feedback in real time. The input is the feedback message, and the output is visualized notification information. This process uses the terminal's display and audio output capabilities to provide passengers with feedback at the appropriate time.

[0281] Step 6:

[0282] The user adjusts their speech based on the feedback received via the terminal. The input is the feedback message, and the output is the adjusted new speech. Through this process, the user can improve the conversation environment and maintain comfortable communication.

[0283] Furthermore, an emotion engine for estimating the user's emotions may be combined. That is, the specific processing unit 290 may estimate the user's emotions using the emotion identification model 59 and perform specific processing using the user's emotions.

[0284] This invention is a system that comprehensively analyzes the voices and emotions of participants in a user-participation type meeting and provides feedback. The user terminal is equipped with a collection device for collecting the voices of participants in real time during the meeting and transmitting them to the server. The server receives the voice data and converts it into text data by means of a voice recognition engine. In this conversion process, noise is removed and the output is organized so that each utterance can be attributed to its speaker.

[0285] Furthermore, the server is equipped with an emotion engine that comprehensively analyzes the text data, voice tone, and furthermore, facial expression information obtained from video data to evaluate the user's emotional state in detail. The server uses natural language processing algorithms to evaluate the tone and emotional tendency of each speech, and also grasps the activity of the speech and the progress of the discussion.
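
One way to integrate text, voice tone, and facial expression information is a weighted average of per-modality scores. The following is a minimal sketch under that assumption. The disclosure does not specify how the emotion engine combines the modalities, so the weights, the score range, and the function name are hypothetical.

```python
from typing import Dict

# Modality weights are illustrative assumptions.
DEFAULT_WEIGHTS: Dict[str, float] = {"text": 0.4, "voice": 0.3, "face": 0.3}


def fuse_emotion_scores(
    scores: Dict[str, float],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Combine per-modality agitation scores (each in [0, 1]) into one value.

    Modalities missing from `scores` (for example, when no video is
    available) are skipped and the remaining weights are renormalized.
    """
    used = {m: w for m, w in weights.items() if m in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no usable modality scores")
    return sum(scores[m] * w for m, w in used.items()) / total
```

Renormalizing the weights keeps the fused score in the same range when a modality is unavailable, so a single threshold can still be applied downstream.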

[0286] Based on the analysis results, the server generates personalized feedback for each participant. This feedback is adjusted according to the user's emotional state and is provided during and after the meeting. The user terminal displays the feedback in real time and visualizes the improvement points and points to be aware of for the user.

[0287] For example, if the emotion engine detects that a user is emotionally agitated, the server generates feedback for that user, such as advice on breathing techniques to calm down. Also, if a discussion becomes too heated, the server provides feedback to the facilitator, prompting them to return to the topic. This allows the system to efficiently manage the meeting and create an environment where all participants can contribute constructively.

[0288] The following describes the processing flow.

[0289] Step 1:

[0290] The user terminal collects participants' audio data via the microphone during the meeting. The collected audio data is digitized and transmitted to the server in real time.

[0291] Step 2:

[0292] The server converts the received audio data into text data using a speech recognition engine. During the conversion, background noise is removed, the speaker is identified, and each utterance is neatly transcribed into text.

[0293] Step 3:

[0294] The server analyzes the converted text data and speech tone using natural language processing algorithms. Here, it evaluates the tone and emotion of each statement and checks the liveliness and balance of the discussion.

[0295] Step 4:

[0296] The server uses an emotion engine to further analyze the user's emotional state. It integrates voice tone, facial expression data, and text content to provide a detailed assessment of the user's emotional tendencies.

[0297] Step 5:

[0298] Based on the analysis results above, the server generates personalized feedback tailored to each participant, including specific advice and suggestions for improvement that take their emotional state into account.

[0299] Step 6:

[0300] The user's device notifies the user of the generated feedback in real time. Notifications are provided as screen displays or audio messages, helping the user to correct their actions on the spot.

[0301] Step 7:

[0302] The server generates a detailed report after the meeting, including a summary and analysis of the meeting. Users can use this report to identify areas for improvement in future meetings and prepare accordingly.
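
The post-meeting report in Step 7 can be built by aggregating the per-utterance results of the earlier steps. The following is a minimal sketch, assuming each utterance already carries a speaker label, a duration, and an emotion label. The Utterance type and the report fields are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    speaker: str
    text: str
    duration_sec: float
    emotion: str  # label from the analysis step, e.g. "calm" or "agitated"


def build_meeting_report(utterances: List[Utterance]) -> Dict[str, dict]:
    """Aggregate per-speaker statistics for a post-meeting report."""
    total = sum(u.duration_sec for u in utterances) or 1.0  # avoid division by zero
    report: Dict[str, dict] = {}
    for u in utterances:
        entry = report.setdefault(
            u.speaker,
            {"utterances": 0, "speaking_sec": 0.0, "emotions": Counter()},
        )
        entry["utterances"] += 1
        entry["speaking_sec"] += u.duration_sec
        entry["emotions"][u.emotion] += 1
    for entry in report.values():
        # Share of total speaking time and most frequent emotion label.
        entry["share"] = round(entry["speaking_sec"] / total, 3)
        entry["dominant_emotion"] = entry["emotions"].most_common(1)[0][0]
    return report
```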

[0303] (Example 2)

[0304] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0305] In modern meetings, participants' contributions are diverse, and understanding their statements, including changes in emotion and tone, and providing appropriate feedback is essential for productive discussions. However, traditional systems have struggled to accurately assess participants' emotions and the state of the discussion in real time and to provide optimized improvements for each participant.

[0306] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0307] In this invention, the server includes means for collecting the acoustic information of participants, means for converting the acoustic information into character information, and means for analyzing the speech content based on the character information. Thereby, it becomes possible to comprehensively understand the speech content of participants and generate personalized feedback according to the emotions and tones of each speech.

[0308] A "participant" refers to an individual who makes a speech in a meeting and is the object for which acoustic information and the like are collected by the system.

[0309] "Acoustic information" is data including the characteristics of voices and sounds emitted by participants and is information collected in real time during a meeting.

[0310] "Character information" is data in text format converted from acoustic information and is used for analyzing the speech content.

[0311] "Speech content" refers to opinions and information expressed by participants and is the basis for analysis and feedback generation.

[0312] "Analysis" is a process of evaluating the tone, emotional state, and liveliness of a participant's speech based on character information and acoustic tones and extracting necessary information.

[0313] "Feedback" is information indicating improvement points and points to be noted provided to participants based on the analysis results and is for the purpose of improving the quality of the meeting and the performance of participants.

[0314] This invention is a system that comprehensively analyzes acoustic information in a user-participation type meeting and provides personalized feedback to participants in real time. The user terminal collects the acoustic information of participants using a highly sensitive voice collection device during the meeting. The collected acoustic information is encoded using a dedicated compression algorithm and transmitted to the server via the network.
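
The encoding and transmission step can be sketched as a framing scheme in which each audio frame is compressed and given a small header. The disclosure does not name the dedicated compression algorithm, so this sketch substitutes zlib, a general-purpose lossless compressor, purely as a stand-in. The header layout and function names are likewise assumptions.

```python
import struct
import zlib
from typing import Tuple

HEADER = struct.Struct("!II")  # sequence number, payload length (network byte order)


def encode_frame(pcm: bytes, seq: int) -> bytes:
    """Compress one PCM frame and prepend a sequence/length header."""
    payload = zlib.compress(pcm, 6)
    return HEADER.pack(seq, len(payload)) + payload


def decode_frame(packet: bytes) -> Tuple[int, bytes]:
    """Inverse of encode_frame: return (sequence number, PCM bytes)."""
    seq, length = HEADER.unpack(packet[:HEADER.size])
    payload = packet[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return seq, zlib.decompress(payload)
```

The sequence number lets the server detect missing or reordered frames before speech recognition. A production system would more likely use a codec designed for speech.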

[0315] The server first uses a digital signal processing unit to remove noise from the received acoustic information. It then uses the Google Cloud Speech-to-Text API or a similar recognition engine to convert this acoustic information into text. Next, to analyze the converted text, it uses OpenAI's generative AI model to precisely evaluate the content of the speech. The analysis comprehensively considers voice tone and facial expressions from the video to understand the emotional state of each participant.

[0316] Based on the analysis results, the server uses natural language generation technology to generate feedback best suited to each participant. For example, if a user is speaking in an angry tone, the server provides feedback suggesting breathing techniques to soften that tone. The generated feedback is transmitted to the user's terminal via the user interface and displayed visually to the participant in real time.

[0317] For example, if the server detects that the participants' discussion is becoming heated, it will prompt the facilitator to "adjust the meeting to ensure proper moderation." Another example of a prompt for the generative AI model could be, "Please tell me what kind of feedback should be given when participants become emotionally agitated."

[0318] As described above, implementing this system can support more effective and smoother meeting management and create an environment where all participants can contribute constructively.

[0319] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0320] Step 1:

[0321] The user terminal uses a high-sensitivity microphone to collect acoustic information from participants during the meeting. The terminal captures this information in real time, applies a noise reduction filter to clean the signal, and then encodes the result in digital format. The input is the participants' voices in the physical space, and the output is digitally filtered acoustic data.

[0322] Step 2:

[0323] The user terminal compresses the encoded audio data and sends it to the server in accordance with IT security protocols. The input is digital audio data, and the output is securely transmitted compressed audio data. This process applies algorithms to optimize data volume and reduce the load on the network.

[0324] Step 3:

[0325] The server decodes the received compressed audio data, performs noise reduction, and analyzes the speech characteristics. The input is the received compressed audio data, and the output is clean, analyzable audio data. This data is converted into text information using a speech recognition engine, such as a commonly available speech recognition API.

[0326] Step 4:

[0327] The server analyzes the generated text information and associated speech tone data to evaluate the content and emotional state of the utterance. The input consists of text information and speech tone data, while the output is evaluation information such as the emotional state of the analyzed utterance. This analysis is performed using a generative AI model and natural language processing techniques.

[0328] Step 5:

[0329] The server generates personalized feedback based on the analysis results and sends it to the user terminal via the user interface. The input is the analyzed emotion and utterance evaluation information, and the output is user-specific, customized feedback data. This feedback may include, for example, advice to alleviate the feelings of meeting participants.

[0330] Step 6:

[0331] The user terminal displays feedback received from the server in real time and notifies the user using audio and visuals. The input is feedback data, and the output is feedback information displayed on the screen. Based on this information, the terminal supports the user in taking specific actions during the meeting.

[0332] (Application Example 2)

[0333] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0334] In modern home and office environments, smooth communication and psychological safety are crucial elements for improving the quality of daily life. However, emotional outbursts and conversational stagnation can hinder smooth communication. In such situations, there is a need for a system that utilizes robots to understand emotions and conversational states in real time and provide appropriate support.

[0335] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0336] In this invention, the server includes a mechanism for acquiring acoustic information, a method for converting acoustic information into text information, a method for analyzing emotions and facial expressions from the text information and acoustic information, a method for generating instructions to support actions, and a mechanism for presenting the instructions via an operating device. This facilitates communication in home and office environments, enhances psychological safety, and improves the quality of life.

[0337] "Acoustic information" refers to speech and sound signals collected within a space, including conversations and ambient sounds.

[0338] A "mechanism" refers to a group of physical or electronic devices designed to perform a specific function.

[0339] "Textual information" refers to data in string format obtained by analyzing acoustic information, which is essentially a conversion of speech into text.

[0340] "Method" refers to the steps or processes that are taken to achieve a specific objective.

[0341] "Analysis" is the process of evaluating the content and characteristics of collected data and extracting information.

[0342] "Emotions" are internal reactions that indicate a person's psychological state, and they are expressed externally through tone of voice and facial expressions.

[0343] "Facial expression" refers to the outward changes produced by facial muscles and other structures to indicate an individual's emotional state.

[0344] "Instructions to support action" are advice or guidelines provided to encourage appropriate behavior in a particular situation.

[0345] An "operating device" refers to a unit that physically moves to execute instructions from a system.

[0346] As a form of implementing the invention, this system is designed to facilitate communication in home and office environments. The system includes a consumer robot with a built-in microphone for acquiring acoustic information. This robot acquires voice data in real time and transmits it to a server via a Wi-Fi network. The server then uses Google Cloud's Speech-to-Text API to convert the acoustic information into text.

[0347] The converted text and audio information are analyzed for emotions and facial expressions via IBM Watson's emotion analysis API. This analysis evaluates the user's emotional state, and instructions to support the user's actions are generated from the result. A generative AI model (e.g., GPT-3) is used to generate these instructions: the server inputs a prompt into the model and provides the resulting advice. The generated advice is displayed on the robot's screen or presented by voice.

[0348] As a concrete example, the robot observes how family conversations develop during dinner. If the system determines that emotions are running high, it will provide voice advice such as, "Let's calm down a bit and take a deep breath." Also, if it determines that the conversation is stalled, it will prompt the conversation with questions such as, "How was your day?"
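
The two interventions in this example can be sketched as a simple selection rule. The following assumes that the analysis yields an arousal score in the range 0 to 1 and the length of the current silence. Both inputs and both thresholds are illustrative assumptions.

```python
from typing import Optional

# Illustrative thresholds (assumptions, not taken from the disclosure).
AROUSAL_THRESHOLD = 0.7
SILENCE_THRESHOLD_SEC = 20.0


def choose_robot_prompt(arousal: float, silence_sec: float) -> Optional[str]:
    """Pick the robot's spoken intervention, or None if none is needed."""
    if arousal >= AROUSAL_THRESHOLD:
        # Emotions are running high: calming advice takes priority.
        return "Let's calm down a bit and take a deep breath."
    if silence_sec >= SILENCE_THRESHOLD_SEC:
        # The conversation has stalled: offer a conversation starter.
        return "How was your day?"
    return None
```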

[0349] A concrete example of a prompt sentence using a generative AI model is, "If a conversation at home becomes emotional, generate advice to help people calm down." This prompt sentence allows the generative AI model to generate appropriate feedback tailored to the specific situation.

[0350] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0351] Step 1:

[0352] The terminal acquires acoustic information through a microphone. The input is ambient sound data, which is converted into a digital signal and sent to the server. The output is the transmission of audio data to the server.

[0353] Step 2:

[0354] The server inputs the received audio data into the speech recognition service. Here, the Google Cloud Speech-to-Text API is used to filter out noise from the audio data and convert it into text. The output is the converted text data.

[0355] Step 3:

[0356] The server inputs the converted text information and the original audio data into the sentiment analysis engine. This engine utilizes IBM Watson's sentiment analysis API to analyze the user's emotional state and facial expressions. As a result of the analysis, data indicating emotional tendencies and tone is output.

[0357] Step 4:

[0358] The server inputs prompts into the generative AI model, which then generates instructions based on the analysis results. The GPT-3 model is used here, and an example prompt is "There is a heightened emotional situation at home; please provide advice on how to calm down." The output is specific action advice for the user.

[0359] Step 5:

[0360] The server sends the generated instructions to the terminal. The terminal receives these instructions and provides advice to the user via voice or display. The output is feedback to the user.

[0361] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0362] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0363] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0364] [Third Embodiment]

[0365] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0366] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0367] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0368] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0369] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0370] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0371] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0372] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0373] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0374] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0375] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0376] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0377] This invention is a system for improving the quality of meetings using audio data collected by users when they participate in a meeting. The user terminal collects participants' audio in real time through a microphone during the meeting and transmits it to a server. The server converts the received audio data into text data using automatic speech recognition technology. At this time, the server removes background noise and performs processing to identify each speaker.

[0378] The server analyzes the converted text data using natural language processing techniques. Through this analysis, the server evaluates the tone and emotion of each statement, and also assesses whether the discussion stays consistent with the agenda and how active the discussion is. Furthermore, the server analyzes the frequency of interruptions and the balance of contributions to understand the overall progress of the meeting.
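
The frequency of interruptions and the balance of contributions can be estimated from speaker-labeled, time-stamped turns. The following is a minimal sketch, assuming the speech recognition step provides such turns. The Turn type and the overlap-based definition of an interruption are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float


def count_interruptions(turns: List[Turn]) -> Dict[str, int]:
    """Count, per speaker, the turns that begin before the previous turn ends.

    Simplification: each turn is compared only with the turn that started
    immediately before it.
    """
    counts: Dict[str, int] = defaultdict(int)
    ordered = sorted(turns, key=lambda t: t.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            counts[cur.speaker] += 1
    return dict(counts)


def speaking_share(turns: List[Turn]) -> Dict[str, float]:
    """Return each speaker's fraction of the total speaking time."""
    totals: Dict[str, float] = defaultdict(float)
    for t in turns:
        totals[t.speaker] += t.end - t.start
    whole = sum(totals.values())
    return {s: d / whole for s, d in totals.items()} if whole else {}
```

A speaker whose share is far above an even split, or whose interruption count is high, would be a candidate for the balancing feedback described in this embodiment.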

[0379] Based on the analysis results, the server generates personalized feedback for each participant. This feedback includes suggestions for improving their tone of voice and specific advice for balancing the discussion. User terminals notify participants of the generated feedback during the meeting and present it to them in real time. This allows users to adjust their contributions during the meeting and improve its productivity.

[0380] For example, if a user is dominating a discussion by speaking too frequently, the server will provide feedback to that user encouraging them to consciously listen to others' opinions. Also, if a meeting is straying from a particular topic, the server will advise the facilitator to return to the main agenda. In this way, the system appropriately supports the progress of meetings, enabling efficient and streamlined communication.

[0381] The following describes the processing flow.

[0382] Step 1:

[0383] The user terminal collects audio from meeting participants in real time via a microphone. The audio is sampled and encoded so that it can be transmitted to the server as digital data.

[0384] Step 2:

[0385] The server receives audio data sent from the user's terminal and converts it into text data using an automatic speech recognition engine. Here, the server performs noise reduction and speaker identification, distinguishing and transcribing each utterance.

[0386] Step 3:

[0387] The server applies natural language processing algorithms to the converted text data to analyze the tone and emotion of the speech. This analysis utilizes sentiment analysis and keyword extraction to make the speaker's intent easier to understand.

[0388] Step 4:

[0389] Based on the analyzed data, the server evaluates the level of participation, the presence or absence of interruptions, and any deviations from the agenda. This allows for an understanding of the dynamic flow of the meeting as a whole and the contribution of each participant.

[0390] Step 5:

[0391] The server generates personalized feedback for each participant based on the results of the speech analysis. This feedback is designed to include specific areas for improvement and suggestions to facilitate the discussion.

[0392] Step 6:

[0393] User terminals notify participants in real time during the meeting of feedback received from the server. The feedback is provided via pop-ups and audio, helping participants immediately attempt to make improvements.

[0394] Step 7:

[0395] The server automatically generates a report after the meeting ends, including a summary and analysis of the entire meeting. Users can refer to this report to help improve communication.

[0396] (Example 1)

[0397] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0398] In meetings, there is a challenge in facilitating effective communication while maintaining a balance of participation and consistency in the discussion. In particular, the inefficiency of meetings due to imbalances in the tone, emotion, and frequency of participation among participants is a significant problem.

[0399] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0400] In this invention, the server includes means for collecting participants' voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to evaluate each participant's tone of voice and emotions, optimize the level of discussion activity, and generate feedback and notify participants of it in real time.

[0401] "Participants" refers to people who participate in discussions and contribute to the conversation during a meeting.

[0402] "Audio information" refers to audio data acquired as spoken words by participants during a meeting.

[0403] "Textual information" refers to data in text format that has been converted from audio information.

[0404] "Analysis" refers to the process of evaluating the meaning, tone, and emotion of a statement from converted text information, and extracting specific patterns and characteristics.

[0405] "Feedback" refers to information provided to participants based on the analysis results for improvement and evaluation.

[0406] "Notification" refers to the process of displaying or reporting generated feedback on participants' devices.

[0407] The system based on this invention has the function of collecting and analyzing participants' voice information in real time to generate feedback in order to improve the efficiency of meetings. The user terminal acquires the voice information of participants during the meeting using its built-in microphone. This voice information is transmitted to the server via an internet connection.

[0408] The server converts received audio information into text using speech recognition technology. During this process, the server employs a noise reduction algorithm to extract clear language signals. The converted text is then analyzed on the server using a natural language processing library. This analysis allows the server to identify the meaning, tone, and emotion of the spoken content. Based on the analyzed data, the server generates individual feedback for each participant. This feedback includes specific advice to help optimize emotional shifts and speaking balance.

[0409] User terminals receive feedback from the server in real time and notify users in visual and audio formats. This allows users to adjust their comments and actions on the spot, improving the productivity of meetings.

[0410] For example, if a participant is speaking too frequently and dominating the discussion, the server will provide feedback to encourage that participant to elicit opinions from other attendees. Also, if the overall discussion is straying from a particular topic, the server will advise the facilitator to bring it back on track.
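
Detecting that a discussion is straying from its topic can be approximated by checking how many recent utterances mention agenda keywords. The following is a minimal sketch of that idea. The disclosure does not describe a detection method, so the keyword matching, the ratio threshold, and the function names are all assumptions.

```python
import re
from typing import Iterable, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase word tokens of an utterance."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_off_topic(
    recent_utterances: Iterable[str],
    agenda_keywords: Iterable[str],
    min_on_topic_ratio: float = 0.5,  # illustrative threshold
) -> bool:
    """Flag drift when too few recent utterances mention any agenda keyword."""
    keywords = {k.lower() for k in agenda_keywords}
    utterances = list(recent_utterances)
    if not utterances:
        return False
    on_topic = sum(bool(_tokens(u) & keywords) for u in utterances)
    return on_topic / len(utterances) < min_on_topic_ratio
```

When the function returns True, the server could send the facilitator the advice to bring the discussion back on track.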

[0411] An example of a prompt to input into a generative AI model is, "Please tell me how to analyze participants' voices in real time during a meeting and provide personalized feedback to each speaker." A system designed in this way supports efficient and harmonious meeting management.

[0412] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0413] Step 1: Collect audio information

[0414] The user terminal uses its built-in microphone to collect participants' speech in real time during a meeting. The input is the sound of the meeting room, and this audio signal is temporarily stored as digital data. Specifically, the terminal's recording application samples the audio and stores it in a buffer.

[0415] Step 2: Sending voice information to the server

[0416] The terminal transmits the collected audio digital data to the server via the internet. The input is the audio data collected in step 1, and the output is the audio file transferred to the server. The data is transmitted securely using an encryption protocol.

[0417] Step 3: Convert speech to text

[0418] The server processes the received audio data using speech recognition software and converts it into text data. The input is the audio file sent to the server, and the output is the corresponding text data. The server performs noise reduction and uses speaker identification technology to separate the audio by speaker.

[0419] Step 4: Analyzing the text information

[0420] The server analyzes text data using a natural language processing library. The input is text data, and the output is the analysis results, including the content of the statement, its emotion, and tone. Specifically, the server applies a text analysis algorithm and calculates an emotion score for each statement.
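
The disclosure does not specify the text analysis algorithm. Purely as an illustration, an emotion score per statement could be computed with a small word lexicon as follows. The word lists and the scoring formula are assumptions made for this sketch, and a practical system would use a trained model instead.

```python
import re

# Toy lexicons, assumed for illustration only.
POSITIVE = {"good", "great", "agree", "thanks", "excellent"}
NEGATIVE = {"bad", "wrong", "terrible", "disagree", "useless"}


def emotion_score(statement):
    """Return a score in [-1.0, 1.0]: (positive words - negative words) / all words."""
    words = re.findall(r"[a-z']+", statement.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


scores = [emotion_score(s) for s in ["Great idea, I agree", "This is wrong"]]
```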

[0421] Step 5: Generating Feedback

[0422] The server generates feedback for each participant based on the analysis results. The input is the analysis results, and the output is a personalized feedback message. The server uses a generative AI model to create feedback that includes specific areas for improvement.

[0423] Step 6: Feedback Notification

[0424] The user terminal displays feedback sent from the server. The input is the feedback message, and the output is the visual or audible notification the user receives. Specifically, the terminal uses a GUI to display pop-up notifications and, if necessary, uses vibration or sound to draw attention.

[0425] (Application Example 1)

[0426] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0427] If communication between passengers, or between passengers and the system, is not smooth within an autonomous vehicle, passenger comfort and experience may be compromised. Conventional methods can neither improve the conversational environment in real time nor autonomously adjust the tone and content of passengers' speech appropriately.

[0428] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0429] In this invention, the server includes means for collecting participant voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to understand the conversational environment in an autonomous vehicle in real time, evaluate the tone of speech, emotions, and the level of engagement, and provide feedback to improve the passenger experience.

[0430] "Participants" refers to all individuals and devices that provide audio information within an autonomous vehicle.

[0431] "Audio information" refers to collected sound wave data, that is, sound data that is to be converted into text information by speech recognition technology.

[0432] "Textual information" refers to text data converted from audio information and is used for analyzing the content of speech.

[0433] "Statement content" refers to the opinions and information of participants expressed as written information, and the intention and meaning of the conversation are understood by analyzing its content.

[0434] "Analysis" refers to the process of understanding and evaluating the meaning and intent of a statement using collected textual information.

[0435] "Feedback" refers to information provided to participants based on the analysis results, indicating areas for improvement in the tone and content of their statements.

[0436] "Real-time" refers to processing participants' voice information instantly and notifying them of the results without delay.

[0437] An "autonomous vehicle" refers to a vehicle in which driving operations are performed autonomously, and which is a means of transportation equipped with systems to improve the passenger experience inside the vehicle.

[0438] This invention realizes a support system to facilitate communication within autonomous vehicles. A server collects passenger voice information in real time through microphones installed in the vehicle and uses this voice information to improve passenger conversations. The voice information is converted into text information using speech recognition technology. For example, the speech_recognition library in Python or Google's speech recognition API can be used.

[0439] This textual information is analyzed by the server's natural language processing module, which evaluates the tone and emotion of the speech, the level of engagement, and whether or not interruptions occurred. The analysis and the feedback generation are implemented by custom modules such as text_processing_module and feedback_generator. In this process, the server generates optimal feedback for the passenger and notifies them in real time via their terminal. This allows the passenger to adjust their speech and improve the conversational environment.

[0440] For example, if one passenger is speaking for an extended period, the server might offer feedback such as, "Please listen to what other passengers have to say." If a passenger is speaking very loudly, the server might encourage them by saying, "Speaking in a quieter tone would help everyone relax."
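
The two rules in this example could be sketched as follows, assuming 16-bit PCM samples and a measured speaking duration for the passenger. The function names and both thresholds (60 seconds, -10 dBFS) are hypothetical values chosen for illustration.

```python
import math


def rms_dbfs(samples, full_scale=32768):
    """Return the RMS level of 16-bit PCM samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20 * math.log10(math.sqrt(mean_square) / full_scale)


def passenger_feedback(samples, speaking_seconds, max_seconds=60.0, loud_dbfs=-10.0):
    """Return the feedback messages that apply to one passenger."""
    messages = []
    if speaking_seconds > max_seconds:
        messages.append("Please listen to what other passengers have to say.")
    if rms_dbfs(samples) > loud_dbfs:
        messages.append("Speaking in a quieter tone would help everyone relax.")
    return messages


# A loud passenger (half of full scale) who has spoken for 90 seconds.
messages = passenger_feedback([16384] * 100, speaking_seconds=90.0)
```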

[0441] An example of a prompt for a generative AI model could be: "Speech recognition result: {text generated from speech}. Analyze the tone and important keywords of this text to create feedback that will facilitate the conversation."
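
Filling the placeholder in this prompt could look like the sketch below. The template text is taken from the example above, while the function name and the whitespace normalization are assumptions.

```python
PROMPT_TEMPLATE = (
    "Speech recognition result: {transcript}. "
    "Analyze the tone and important keywords of this text "
    "to create feedback that will facilitate the conversation."
)


def build_prompt(transcript):
    """Insert the recognized text into the prompt after collapsing whitespace."""
    cleaned = " ".join(transcript.split())
    return PROMPT_TEMPLATE.format(transcript=cleaned)


prompt = build_prompt("  Turn left\n at the station ")
```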

[0442] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0443] Step 1:

[0444] The server collects passenger voice information through microphones installed inside the vehicle. The input is sound wave data, and the output is a digitized voice signal. Here, the analog data is converted to a digital format to enable subsequent processing.

[0445] Step 2:

[0446] The server uses the collected audio information to perform a speech recognition process and generate text information. The input is a digitized audio signal, and the output is text data of the audio. In this step, the Python speech_recognition library is used to call Google's speech recognition API to convert the audio data to text.
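
A minimal sketch of this step with the speech_recognition library is shown below. It assumes the audio has been saved as a WAV file and that network access to the recognition service is available. The function name, the default language code, and the optional `sr` parameter (which allows a substitute module to be passed in for testing) are assumptions of this sketch.

```python
def transcribe_wav(path, language="ja-JP", sr=None):
    """Convert a WAV file to text via Google's speech recognition API.

    Returns an empty string when the speech cannot be understood.
    """
    if sr is None:
        # Third-party package: pip install SpeechRecognition
        import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""
```

In actual use, a call such as `transcribe_wav("cabin.wav")` (a hypothetical file name) would return the recognized text.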

[0447] Step 3:

[0448] The server analyzes the generated text information using a natural language processing module. The input is text data obtained through speech recognition, and the output is an evaluation result of the tone, emotion, liveliness, and presence or absence of interruptions in the speech. Specifically, the text_processing_module is used to perform calculations that evaluate the meaning and emotion of the text.

[0449] Step 4:

[0450] Based on the analysis results, the server generates feedback. The input is the analysis results, and the output is a specific feedback message for the passenger. The feedback_generator is used to create the feedback message, pointing out areas for improvement in tone and content.

[0451] Step 5:

[0452] The terminal notifies passengers of the generated feedback in real time. The input is the feedback message, and the output is visualized notification information. This process uses the terminal's display and audio output capabilities to provide passengers with feedback at the appropriate time.

[0453] Step 6:

[0454] The user adjusts their statements based on feedback received through the device. The input is the feedback message, and the output is the adjusted new statement. This process allows the user to improve the conversational environment and maintain comfortable communication.

[0455] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0456] This invention is a system that provides feedback by comprehensively analyzing participants' voices and emotions in user-participatory meetings. The user terminal is equipped with a collection device for collecting participants' voices in real time during the meeting and transmitting them to a server. The server receives the voice data and converts it into text data using a speech recognition engine. This conversion process is optimized to remove noise and identify each speaker.

[0457] Furthermore, the server is equipped with an emotion engine that comprehensively analyzes text data, voice tone, and facial expression information obtained from video data to evaluate the user's emotional state in detail. The server uses natural language processing algorithms to evaluate the tone and emotional tendencies of each statement, as well as to grasp the level of engagement and the progress of the discussion.

[0458] Based on the analysis results, the server generates personalized feedback for each participant. This feedback is adjusted according to the user's emotional state and is provided during and after the meeting. User terminals display the feedback in real time, visualizing areas for improvement and points to be aware of for the user.

[0459] For example, if the emotion engine detects that a user is emotionally agitated, the server generates feedback for that user, such as advice on breathing techniques to calm down. Also, if a discussion becomes too heated, the server provides feedback to the facilitator, prompting them to return to the topic. This allows the system to efficiently manage the meeting and create an environment where all participants can contribute constructively.

[0460] The following describes the processing flow.

[0461] Step 1:

[0462] The user terminal collects participants' audio data via the microphone during the meeting. The collected audio data is digitized and transmitted to the server in real time.

[0463] Step 2:

[0464] The server converts the received audio data into text data using a speech recognition engine. During the conversion, background noise is removed, the speaker is identified, and each utterance is neatly transcribed into text.

[0465] Step 3:

[0466] The server analyzes the converted text data and speech tone using natural language processing algorithms. Here, it evaluates the tone and emotion of each statement and checks the liveliness and balance of the discussion.

[0467] Step 4:

[0468] The server uses an emotion engine to further analyze the user's emotional state. It integrates voice tone, facial expression data, and text content to provide a detailed assessment of the user's emotional tendencies.
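
The disclosure does not state how the three information sources are integrated. One possible approach, shown here only as an assumption, is a weighted average of per-modality scores in which missing modalities are skipped. The function name, the modality keys, and the weights are hypothetical.

```python
def fuse_emotion_scores(scores, weights=None):
    """Return the weighted average of per-modality scores in [0, 1].

    Modalities whose score is None (for example, no camera image) are skipped,
    and the remaining weights are renormalized. Returns None if nothing is usable.
    """
    weights = weights or {"text": 0.4, "voice": 0.4, "face": 0.2}
    available = {m: s for m, s in scores.items() if s is not None and m in weights}
    total_weight = sum(weights[m] for m in available)
    if total_weight == 0:
        return None
    return sum(weights[m] * s for m, s in available.items()) / total_weight


all_modalities = fuse_emotion_scores({"text": 0.5, "voice": 1.0, "face": 0.0})
without_face = fuse_emotion_scores({"text": 0.5, "voice": 1.0, "face": None})
```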

[0469] Step 5:

[0470] Based on the analysis results above, the server generates personalized feedback tailored to each participant, including specific advice and suggestions for improvement that take their emotional state into account.

[0471] Step 6:

[0472] The user's device notifies the user of the generated feedback in real time. Notifications are provided as screen displays or audio messages, helping the user to correct their actions on the spot.

[0473] Step 7:

[0474] The server generates a detailed report after the meeting, including a summary and analysis of the meeting. Users can use this report to identify areas for improvement in future meetings and prepare accordingly.
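
As an illustrative sketch of the post-meeting report, the following aggregates per-participant statistics from analyzed utterances. The record fields (`speaker`, `emotion_score`) and the report layout are assumptions; the disclosure does not define a report format.

```python
from collections import defaultdict
from statistics import mean


def build_meeting_report(utterances):
    """Summarize utterance counts and mean emotion score per participant."""
    scores_by_speaker = defaultdict(list)
    for utterance in utterances:
        scores_by_speaker[utterance["speaker"]].append(utterance["emotion_score"])
    return {
        "total_utterances": len(utterances),
        "participants": {
            speaker: {"utterances": len(scores), "mean_emotion_score": mean(scores)}
            for speaker, scores in scores_by_speaker.items()
        },
    }


report = build_meeting_report([
    {"speaker": "A", "emotion_score": 0.5},
    {"speaker": "A", "emotion_score": -0.5},
    {"speaker": "B", "emotion_score": 1.0},
])
```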

[0475] (Example 2)

[0476] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0477] In modern meetings, participants' contributions are diverse, and understanding their statements, including changes in emotion and tone, and providing appropriate feedback is essential for productive discussions. However, traditional systems have struggled to accurately assess participants' emotions and the state of the discussion in real time and to provide optimized improvements for each participant.

[0478] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0479] In this invention, the server includes means for collecting acoustic information of participants, means for converting the acoustic information into text information, and means for analyzing the content of speech based on the text information. This enables a detailed understanding of the content of participants' speech and the generation of personalized feedback that corresponds to the emotion and tone of each speech.

[0480] "Participants" refers to the individuals who speak during a meeting, and they are the subjects from whom the system collects audio information and other data.

[0481] "Acoustic information" refers to data that includes the characteristics of voices and sounds emitted by participants, and is collected in real time during the meeting.

[0482] "Textual information" refers to data in text format converted from acoustic information, and is used for analyzing the content of speech.

[0483] "Statements" refer to the opinions and information expressed by participants, and serve as the basis for analysis and feedback generation.

[0484] "Analysis" is the process of evaluating the tone of participants' speech, emotional state, and activity level based on textual information and acoustic tone, and extracting necessary information.

[0485] "Feedback" refers to information provided to participants based on analysis results, indicating areas for improvement and points to pay attention to, with the aim of improving the quality of the meeting and the performance of the participants.

[0486] This invention is a system that comprehensively analyzes acoustic information in user-participatory meetings and provides participants with real-time, personalized feedback. During the meeting, the user terminal collects participant acoustic information using a highly sensitive voice collection device. This collected acoustic information is encoded using a dedicated compression algorithm and transmitted to a server via the network.

[0487] The server first uses a digital signal processing unit to remove noise from the received acoustic information. It then uses the Google Cloud Speech-to-Text API or a similar recognition engine to convert this acoustic information into text. Next, to analyze the converted text, it uses OpenAI's generative AI model to precisely evaluate the content of the speech. The analysis comprehensively considers voice tone and facial expressions from the video to understand the emotional state of each participant.

[0488] Based on the analysis results, the server uses natural language generation technology to generate feedback best suited to each participant. For example, if a user is speaking in an angry tone, the server provides feedback suggesting breathing techniques to soften that tone. The generated feedback is transmitted to the user's terminal and displayed visually to the participant in real time through the terminal's user interface.

[0489] For example, if the server detects that the participants' discussion is becoming heated, it will prompt the facilitator to "adjust the meeting to ensure proper moderation." Another example of a prompt for the generative AI model could be, "Please tell me what kind of feedback should be given when participants become emotionally agitated."

[0490] As described above, implementing this system can support more effective and smoother meeting management and create an environment where all participants can contribute constructively.

[0491] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0492] Step 1:

[0493] The user terminal uses a high-sensitivity microphone to collect acoustic information from participants during the meeting. The terminal captures this information in real time, applies a noise reduction filter to clean the signal, and then encodes the result in digital format. The input is the participants' voices in the physical space, and the output is digitally filtered acoustic data.

[0494] Step 2:

[0495] The user terminal compresses the encoded audio data and sends it to the server in accordance with IT security protocols. The input is digital audio data, and the output is securely transmitted compressed audio data. This process applies algorithms to optimize data volume and reduce the load on the network.
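
The compression algorithm is not named in this disclosure. As a stand-in only, the sketch below uses Python's general-purpose lossless zlib compression to show the compress-then-restore round trip; a practical system would more likely use a dedicated audio codec. The function names are hypothetical.

```python
import zlib


def compress_audio(pcm_bytes, level=6):
    """Losslessly compress raw PCM bytes before transmission (stand-in codec)."""
    return zlib.compress(pcm_bytes, level)


def decompress_audio(payload):
    """Restore the original PCM bytes on the server side."""
    return zlib.decompress(payload)


pcm = bytes(1000)  # one thousand zero bytes, i.e. digital silence
payload = compress_audio(pcm)
restored = decompress_audio(payload)
```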

[0496] Step 3:

[0497] The server decodes the received compressed audio data, performs noise reduction, and analyzes the speech characteristics. The input is the received compressed audio data, and the output is clean, analyzable audio data. This data is converted into text information using a speech recognition engine, such as a common API.

[0498] Step 4:

[0499] The server analyzes the generated text information and associated speech tone data to evaluate the content and emotional state of the utterance. The input consists of text information and speech tone data, while the output is evaluation information such as the emotional state of the analyzed utterance. This analysis is performed using a generative AI model and natural language processing techniques.

[0500] Step 5:

[0501] The server generates personalized feedback based on the analysis results and sends it to the user terminal via the user interface. The input is the analyzed emotion and utterance evaluation information, and the output is user-specific, customized feedback data. This feedback may include, for example, advice to alleviate the feelings of meeting participants.

[0502] Step 6:

[0503] The user terminal displays feedback received from the server in real time and notifies the user using audio and visuals. The input is feedback data, and the output is feedback information displayed on the screen. Based on this information, the terminal supports the user in taking specific actions during the meeting.

[0504] (Application Example 2)

[0505] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0506] In modern home and office environments, smooth communication and psychological safety are crucial elements for improving the quality of daily life. However, emotional outbursts and conversational stagnation can hinder smooth communication. In such situations, there is a need for a system that utilizes robots to understand emotions and conversational states in real time and provide appropriate support.

[0507] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0508] In this invention, the server includes a mechanism for acquiring acoustic information, a method for converting acoustic information into text information, a method for analyzing emotions and facial expressions from the text information and acoustic information, a method for generating instructions to support actions, and a mechanism for presenting the instructions via an operating device. This facilitates communication in home and office environments, enhances psychological safety, and improves the quality of life.

[0509] "Acoustic information" refers to speech and sound signals collected within a space, including conversations and ambient sounds.

[0510] A "mechanism" refers to a group of physical or electronic devices designed to perform a specific function.

[0511] "Textual information" refers to data in string format obtained by analyzing acoustic information, which is essentially a conversion of speech into text.

[0512] "Method" refers to the steps or processes that are taken to achieve a specific objective.

[0513] "Analysis" is the process of evaluating the content and characteristics of collected data and extracting information.

[0514] "Emotions" are internal reactions that indicate a person's psychological state, and they are expressed externally through tone of voice and facial expressions.

[0515] "Facial expression" refers to the outward changes produced by facial muscles and other structures to indicate an individual's emotional state.

[0516] "Instructions to support action" are advice or guidelines provided to encourage appropriate behavior in a particular situation.

[0517] An "operating device" refers to a unit that physically moves to execute instructions from a system.

[0518] As a form of implementing the invention, this system is designed to facilitate communication in home and office environments. The system includes a consumer robot with a built-in microphone for acquiring acoustic information. This robot acquires voice data in real time and transmits it to a server via a Wi-Fi network. The server then uses Google Cloud's Speech-to-Text API to convert the acoustic information into text.

[0519] The converted text and audio information are analyzed for emotions and facial expressions via IBM Watson's emotion analysis API. This analysis evaluates the user's emotional state and generates instructions to support their actions. A generative AI model (e.g., GPT-3) is used to generate these instructions, and the server provides advice based on the generated prompts. The generated advice is displayed on the robot's screen or presented by voice.

[0520] As a concrete example, the robot observes how family conversations develop during dinner. If the system determines that emotions are running high, it will provide voice advice such as, "Let's calm down a bit and take a deep breath." Also, if it determines that the conversation is stalled, it will prompt the conversation with questions such as, "How was your day?"
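
The decision in this example could be sketched as a simple rule, assuming the analysis yields an emotional intensity value between 0 and 1 and the time elapsed since the last utterance. The function name and both thresholds are hypothetical; the two messages are taken from the example above.

```python
def choose_robot_utterance(intensity, seconds_since_last_utterance,
                           intensity_threshold=0.7, silence_threshold=20.0):
    """Return what the robot should say, or None if no intervention is needed."""
    if intensity >= intensity_threshold:
        return "Let's calm down a bit and take a deep breath."
    if seconds_since_last_utterance >= silence_threshold:
        return "How was your day?"
    return None


heated = choose_robot_utterance(0.9, 2.0)
stalled = choose_robot_utterance(0.1, 45.0)
normal = choose_robot_utterance(0.3, 5.0)
```

Heightened emotion is checked first, so calming advice takes priority over a conversation prompt when both conditions hold.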

[0521] A concrete example of a prompt sentence using a generative AI model is, "If a conversation at home becomes emotional, generate advice to help people calm down." This prompt sentence allows the generative AI model to generate appropriate feedback tailored to the specific situation.

[0522] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0523] Step 1:

[0524] The terminal acquires acoustic information through a microphone. The input is ambient sound data, which is converted into a digital signal and sent to the server. The output is the transmission of audio data to the server.

[0525] Step 2:

[0526] The server inputs the received audio data into the speech recognition service. Here, the Google Cloud Speech-to-Text API is used to filter out noise from the audio data and convert it into text. The output is the converted text data.

[0527] Step 3:

[0528] The server inputs the converted text information and the original audio data into the sentiment analysis engine. This engine utilizes IBM Watson's sentiment analysis API to analyze the user's emotional state and facial expressions. As a result of the analysis, data indicating emotional tendencies and tone is output.

[0529] Step 4:

[0530] The server inputs prompts into the generative AI model, which then generates instructions based on the analysis results. The GPT-3 model is used here, and an example prompt is "There is a heightened emotional situation at home; please provide advice on how to calm down." The output is specific action advice for the user.

[0531] Step 5:

[0532] The server sends the generated instructions to the terminal. The terminal receives these instructions and provides advice to the user via voice or display. The output is feedback to the user.

[0533] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0534] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0535] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0536] [Fourth Embodiment]

[0537] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0538] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0539] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0540] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0541] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0542] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0543] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0544] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0545] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0546] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0547] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0548] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0549] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0550] This invention is a system for improving the quality of meetings using audio data collected by users when they participate in a meeting. The user terminal collects participants' audio in real time through a microphone during the meeting and transmits it to a server. The server converts the received audio data into text data using automatic speech recognition technology. At this time, the server removes background noise and performs processing to identify each speaker.

[0551] The server analyzes the converted text data using natural language processing techniques. Through this analysis, the server evaluates the tone and emotion of each statement, and also detects the consistency of the agenda and the level of discussion. Furthermore, the server analyzes the frequency of interruptions and the balance of contributions to understand the overall progress of the meeting.

[0552] Based on the analysis results, the server generates personalized feedback for each participant. This feedback includes suggestions for improving their tone of voice and specific advice for balancing the discussion. User terminals notify participants of the generated feedback during the meeting and present it to them in real time. This allows users to adjust their contributions during the meeting and improve its productivity.

[0553] For example, if a user is dominating a discussion by speaking too frequently, the server will provide feedback to that user encouraging them to consciously listen to others' opinions. Also, if a meeting is straying from a particular topic, the server will advise the facilitator to return to the main agenda. In this way, the system appropriately supports the progress of meetings, enabling efficient and streamlined communication.

[0554] The following describes the processing flow.

[0555] Step 1:

[0556] The user terminal collects audio from meeting participants in real time via a microphone. Because the audio data is transmitted to the server in digital format, the terminal samples and encodes it before transmission.

[0557] Step 2:

[0558] The server receives audio data sent from the user's terminal and converts it into text data using an automatic speech recognition engine. Here, the server performs noise reduction and speaker identification, distinguishing and transcribing each utterance.

[0559] Step 3:

[0560] The server applies natural language processing algorithms to the converted text data to analyze the tone and emotion of the speech. This analysis utilizes sentiment analysis and keyword extraction to make the speaker's intent easier to understand.

[0561] Step 4:

[0562] Based on the analyzed data, the server evaluates the level of participation, the presence or absence of interruptions, and any deviations from the agenda. This allows for an understanding of the dynamic flow of the meeting as a whole and the contribution of each participant.
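
The evaluation in Step 4 can be illustrated with a minimal sketch. It assumes that the speech recognition stage yields speaker-labeled utterances with start and end times. The `Utterance` type, its field names, and the overlap-based definition of an interruption are illustrative assumptions and are not taken from the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float
    text: str


def participation_share(utterances):
    """Return each speaker's share of the total speaking time (0.0 to 1.0)."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u.speaker] += u.end - u.start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {speaker: t / grand_total for speaker, t in totals.items()}


def count_interruptions(utterances):
    """Count, per speaker, utterances that begin before a different
    speaker's preceding utterance has ended (a simple overlap heuristic)."""
    counts = defaultdict(int)
    ordered = sorted(utterances, key=lambda u: u.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            counts[cur.speaker] += 1
    return dict(counts)


meeting = [
    Utterance("A", 0.0, 10.0, "Let us review the budget."),
    Utterance("B", 9.0, 12.0, "Sorry, one question first."),
    Utterance("A", 12.0, 20.0, "Go ahead, then I will continue."),
]
shares = participation_share(meeting)
interruptions = count_interruptions(meeting)
```

In this example, speaker A holds 18 of the 21 seconds of speaking time, and speaker B is counted as interrupting once. A server could compare such shares against a threshold to decide when to issue balance-related feedback.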

[0563] Step 5:

[0564] The server generates personalized feedback for each participant based on the results of the speech analysis. This feedback is designed to include specific areas for improvement and suggestions to facilitate the discussion.

[0565] Step 6:

[0566] User terminals notify participants in real time during the meeting of feedback received from the server. The feedback is provided via pop-ups and audio, helping participants immediately attempt to make improvements.

[0567] Step 7:

[0568] The server automatically generates a report after the meeting ends, including a summary and analysis of the entire meeting. Users can refer to this report to help improve communication.

[0569] (Example 1)

[0570] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0571] In meetings, there is a challenge in facilitating effective communication while maintaining a balance of participation and consistency in the discussion. In particular, the inefficiency of meetings due to imbalances in the tone, emotion, and frequency of participation among participants is a significant problem.

[0572] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0573] In this invention, the server includes means for collecting participants' voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to evaluate each participant's tone of voice and emotions, optimize the level of discussion activity, and generate and notify feedback in real time.

[0574] "Participants" refers to people who participate in discussions and contribute to the conversation during a meeting.

[0575] "Audio information" refers to audio data acquired as spoken words by participants during a meeting.

[0576] "Textual information" refers to data in text format that has been converted from audio information.

[0577] "Analysis" refers to the process of evaluating the meaning, tone, and emotion of a statement from converted text information, and extracting specific patterns and characteristics.

[0578] "Feedback" refers to information provided to participants based on the analysis results for improvement and evaluation.

[0579] "Notification" refers to the process of displaying or reporting generated feedback on participants' devices.

[0580] The system based on this invention has the function of collecting and analyzing participants' voice information in real time to generate feedback in order to improve the efficiency of meetings. The user terminal acquires the voice information of participants during the meeting using its built-in microphone. This voice information is transmitted to the server via an internet connection.

[0581] The server converts received audio information into text using speech recognition technology. During this process, the server employs a noise reduction algorithm to extract clear language signals. The converted text is then analyzed on the server using a natural language processing library. This analysis allows the server to identify the meaning, tone, and emotion of the spoken content. Based on the analyzed data, the server generates individual feedback for each participant. This feedback includes specific advice to help optimize emotional shifts and speaking balance.

[0582] User terminals receive feedback from the server in real time and notify users in visual and audio formats. This allows users to adjust their comments and actions on the spot, improving the productivity of meetings.

[0583] For example, if a participant is speaking too frequently and dominating the discussion, the server will provide feedback to encourage that participant to elicit opinions from other attendees. Also, if the overall discussion is straying from a particular topic, the server will advise the facilitator to bring it back on track.
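
One way to detect that a discussion is straying from the topic is to measure how many agenda keywords appear in recent utterances. The following is a minimal sketch under that assumption. The function names, the keyword list, and the threshold of 0.25 are hypothetical, and the simple tokenizer handles only space-delimited text such as English.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def agenda_overlap(recent_utterances, agenda_keywords):
    """Fraction of agenda keywords that appear in the recent utterances."""
    keywords = {k.lower() for k in agenda_keywords}
    if not keywords:
        return 1.0
    spoken = set()
    for text in recent_utterances:
        spoken |= tokenize(text)
    return len(keywords & spoken) / len(keywords)


def facilitator_advice(recent_utterances, agenda_keywords, threshold=0.25):
    """Return advice for the facilitator when the overlap is below threshold."""
    if agenda_overlap(recent_utterances, agenda_keywords) < threshold:
        return "The discussion may be drifting; consider returning to the main agenda."
    return None


agenda = ["budget", "schedule", "hiring", "release"]
on_topic = ["The budget affects the release schedule."]
off_topic = ["Did anyone watch the game last night?"]
```

Here `facilitator_advice(on_topic, agenda)` returns `None` because three of the four keywords are mentioned, while `facilitator_advice(off_topic, agenda)` returns the advice message.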

[0584] An example of a prompt to input into a generative AI model is, "Please tell me how to analyze participants' voices in real time during a meeting and provide personalized feedback to each speaker." A system designed in this way supports efficient and harmonious meeting management.

[0585] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0586] Step 1: Collect audio information

[0587] The user terminal uses its built-in microphone to collect participants' speech in real time during a meeting. The input is the sound of the meeting room, and this audio signal is temporarily stored as digital data. Specifically, the terminal's recording application samples the audio and stores it in a buffer.
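
The buffering described in this step can be sketched as follows. The sketch assumes 16-bit PCM samples and a fixed-capacity buffer that discards the oldest samples when full. The class name `AudioBuffer` and its parameters are hypothetical, and actual microphone capture is outside the scope of the example.

```python
from collections import deque


class AudioBuffer:
    """Fixed-capacity buffer that keeps only the most recent audio samples."""

    def __init__(self, sample_rate=16000, seconds=5):
        self.sample_rate = sample_rate
        self._samples = deque(maxlen=sample_rate * seconds)

    def write(self, samples):
        """Append 16-bit PCM samples; the oldest are dropped when full."""
        self._samples.extend(samples)

    def read_all(self):
        """Return buffered samples as little-endian 16-bit PCM and clear."""
        data = b"".join(
            int(s).to_bytes(2, "little", signed=True) for s in self._samples
        )
        self._samples.clear()
        return data


buf = AudioBuffer(sample_rate=4, seconds=1)  # tiny capacity for illustration
buf.write([1, 2, 3])
buf.write([4, 5])  # capacity is 4, so sample 1 is discarded
chunk = buf.read_all()
```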

[0588] Step 2: Sending voice information to the server

[0589] The terminal transmits the collected digital audio data to the server via the internet. The input is the audio data collected in step 1, and the output is the audio file transferred to the server. The data is transmitted securely using an encryption protocol.

[0590] Step 3: Convert speech to text

[0591] The server processes the received audio data using speech recognition software and converts it into text data. The input is the audio file sent to the server, and the output is the corresponding text data. The server performs noise reduction and separates the utterances of each speaker using speaker identification technology.

[0592] Step 4: Analyzing the text information

[0593] The server analyzes text data using a natural language processing library. The input is text data, and the output is the analysis results, including the content of the statement, its emotion, and tone. Specifically, the server applies a text analysis algorithm and calculates an emotion score for each statement.
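
As a minimal illustration of calculating an emotion score for each statement, the following sketch uses a small word lexicon. The disclosure does not specify the scoring algorithm, so the lexicon, the function name `emotion_score`, and the score range of -1.0 to 1.0 are assumptions; a practical system would more likely use a trained model.

```python
import re

# Toy lexicon for illustration only; not part of the disclosure.
POSITIVE = {"great", "good", "agree", "thanks", "helpful"}
NEGATIVE = {"bad", "wrong", "terrible", "disagree", "angry"}


def emotion_score(statement):
    """Return a score in [-1.0, 1.0]; negative values suggest negative emotion."""
    words = re.findall(r"[a-z']+", statement.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no lexicon words found, treated as neutral
    return (pos - neg) / (pos + neg)


statements = [
    "Thanks, that is a great idea.",
    "This plan is wrong and the timing is terrible.",
    "The meeting starts at ten.",
]
scores = [emotion_score(s) for s in statements]
```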

[0594] Step 5: Generating Feedback

[0595] The server generates feedback for each participant based on the analysis results. The input is the analysis results, and the output is a personalized feedback message. The server uses a generative AI model to create feedback that includes specific areas for improvement.

[0596] Step 6: Feedback Notification

[0597] The user terminal displays feedback sent from the server. The input is the feedback message, and the output is the visual or audible notification the user receives. Specifically, the terminal uses a GUI to display pop-up notifications and, if necessary, uses vibration or sound to draw attention.

[0598] (Application Example 1)

[0599] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0600] If communication between passengers or between passengers and the system is not smooth within an autonomous vehicle, passenger comfort and experience may be compromised. Conventional methods cannot improve the conversational environment in real time, nor can they autonomously and appropriately adjust the tone and content of passengers' speech.

[0601] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0602] In this invention, the server includes means for collecting participant voice information, means for converting the voice information into text information, and means for analyzing the content of the speech from the text information. This makes it possible to understand the conversational environment in an autonomous vehicle in real time, evaluate the tone of speech, emotions, and the level of engagement, and provide feedback to improve the passenger experience.

[0603] "Participants" refers to all individuals and devices that provide audio information within an autonomous vehicle.

[0604] "Audio information" refers to the collected sound wave data, that is, the sound data that speech recognition technology converts into text information.

[0605] "Textual information" refers to text data converted from audio information and is used for analyzing the content of speech.

[0606] "Statement content" refers to the opinions and information of participants expressed as written information, and the intention and meaning of the conversation are understood by analyzing its content.

[0607] "Analysis" refers to the process of understanding and evaluating the meaning and intent of a statement using collected textual information.

[0608] "Feedback" refers to information provided to participants based on the analysis results, indicating areas for improvement in the tone and content of their statements.

[0609] "Real-time" refers to processing participants' voice information instantly and notifying them of the results without delay.

[0610] An "autonomous vehicle" refers to a vehicle in which driving operations are performed autonomously, and which is a means of transportation equipped with systems to improve the passenger experience inside the vehicle.

[0611] This invention realizes a support system to facilitate communication within autonomous vehicles. A server collects passenger voice information in real time through microphones installed in the vehicle and uses this voice information to improve passenger conversations. The voice information is converted into text information using speech recognition technology. For example, the speech_recognition library in Python or Google's speech recognition API can be used.

[0612] This textual information is analyzed by the server's natural language processing module, which evaluates the tone and emotion of the speech, the level of engagement, and whether or not interruptions occurred. The analysis and the feedback generation are carried out by custom modules such as text_processing_module and feedback_generator. In this process, the server generates optimal feedback for the passenger and notifies them in real time via their terminal. This allows the passenger to adjust their speech and improve the conversational environment.

[0613] For example, if one passenger is speaking for an extended period, the server might offer feedback such as, "Please listen to what other passengers have to say." If a passenger is speaking very loudly, the server might encourage them by saying, "Speaking in a quieter tone would help everyone relax."

[0614] An example of a prompt for a generative AI model could be: "Speech recognition result: {text generated from speech}. Analyze the tone and important keywords of this text to create feedback that will facilitate the conversation."
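
The prompt above contains a placeholder for the recognized text. A minimal sketch of filling that placeholder is shown below. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the whitespace normalization is an illustrative choice rather than a requirement of the disclosure.

```python
PROMPT_TEMPLATE = (
    "Speech recognition result: {text}. "
    "Analyze the tone and important keywords of this text to create "
    "feedback that will facilitate the conversation."
)


def build_prompt(recognized_text):
    """Insert the recognized text into the prompt template."""
    # Collapse whitespace so that line breaks in the transcript do not
    # split the prompt.
    cleaned = " ".join(recognized_text.split())
    return PROMPT_TEMPLATE.format(text=cleaned)


prompt = build_prompt("I think we should\n leave earlier tomorrow")
```

The resulting string can then be passed to whichever generative AI model the server uses.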

[0615] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0616] Step 1:

[0617] The server collects passenger voice information through microphones installed inside the vehicle. The input is sound wave data, and the output is a digitized voice signal. Here, the analog data is converted to a digital format to enable subsequent processing.

[0618] Step 2:

[0619] The server uses the collected audio information to perform a speech recognition process and generate text information. The input is a digitized audio signal, and the output is text data of the audio. In this step, the Python speech_recognition library is used to call Google's speech recognition API to convert the audio data to text.
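
A minimal sketch of this step using the Python speech_recognition package is shown below. It assumes that the digitized audio has been saved as a WAV file and that the server has network access to Google's web speech service. The function name `transcribe_wav` and the default language code `ja-JP` are assumptions made for the example.

```python
def transcribe_wav(path, language="ja-JP"):
    """Transcribe a WAV file with the speech_recognition package.

    Returns the recognized text, or None when the speech is unintelligible.
    """
    # Imported lazily so this module loads even where the package is absent.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file
    try:
        # Sends the audio to Google's web speech API.
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # Raised when no speech could be recognized in the audio.
        return None
```

Network or quota failures raise `speech_recognition.RequestError`, which this sketch deliberately leaves to the caller to handle.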

[0620] Step 3:

[0621] The server analyzes the generated text information using a natural language processing module. The input is text data obtained through speech recognition, and the output is an evaluation result of the tone, emotion, liveliness, and presence or absence of interruptions in the speech. Specifically, the text_processing_module is used to perform calculations that evaluate the meaning and emotion of the text.

[0622] Step 4:

[0623] Based on the analysis results, the server generates feedback. The input is the analysis results, and the output is a specific feedback message for the passenger. The feedback_generator is used to create the feedback message, pointing out areas for improvement in tone and content.

[0624] Step 5:

[0625] The terminal notifies passengers of the generated feedback in real time. The input is the feedback message, and the output is visualized notification information. This process uses the terminal's display and audio output capabilities to provide passengers with feedback at the appropriate time.

[0626] Step 6:

[0627] The user adjusts their statements based on feedback received through the device. The input is the feedback message, and the output is the adjusted new statement. This process allows the user to improve the conversational environment and maintain comfortable communication.

[0628] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0629] This invention is a system that provides feedback by comprehensively analyzing participants' voices and emotions in user-participatory meetings. The user terminal is equipped with a collection device for collecting participants' voices in real time during the meeting and transmitting them to a server. The server receives the voice data and converts it into text data using a speech recognition engine. This conversion process is optimized to remove noise and identify each speaker.

[0630] Furthermore, the server is equipped with an emotion engine that comprehensively analyzes text data, voice tone, and facial expression information obtained from video data to evaluate the user's emotional state in detail. The server uses natural language processing algorithms to evaluate the tone and emotional tendencies of each statement, as well as to grasp the level of engagement and the progress of the discussion.

[0631] Based on the analysis results, the server generates personalized feedback for each participant. This feedback is adjusted according to the user's emotional state and is provided during and after the meeting. User terminals display the feedback in real time, visualizing areas for improvement and points to be aware of for the user.

[0632] For example, if the emotion engine detects that a user is emotionally agitated, the server generates feedback for that user, such as advice on breathing techniques to calm down. Also, if a discussion becomes too heated, the server provides feedback to the facilitator, prompting them to return to the topic. This allows the system to efficiently manage the meeting and create an environment where all participants can contribute constructively.

[0633] The following describes the processing flow.

[0634] Step 1:

[0635] The user terminal collects participants' audio data via the microphone during the meeting. The collected audio data is digitized and transmitted to the server in real time.

[0636] Step 2:

[0637] The server converts the received audio data into text data using a speech recognition engine. During the conversion, background noise is removed, the speaker is identified, and each utterance is neatly transcribed into text.

[0638] Step 3:

[0639] The server analyzes the converted text data and speech tone using natural language processing algorithms. Here, it evaluates the tone and emotion of each statement and checks the liveliness and balance of the discussion.

[0640] Step 4:

[0641] The server uses an emotion engine to further analyze the user's emotional state. It integrates voice tone, facial expression data, and text content to provide a detailed assessment of the user's emotional tendencies.
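
The integration of voice tone, facial expression data, and text content could, for example, be a weighted average of per-modality scores. The following sketch assumes that each modality has already been reduced to a score between -1 and 1. The weights, the threshold, and the labels are hypothetical values chosen for illustration.

```python
def fuse_emotion(text_score, voice_score, face_score, weights=(0.4, 0.3, 0.3)):
    """Combine per-modality scores in [-1, 1] into one weighted average.

    A modality passed as None (for example, when no camera is available)
    is skipped and the remaining weights are renormalized.
    """
    pairs = [
        (s, w)
        for s, w in zip((text_score, voice_score, face_score), weights)
        if s is not None
    ]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in pairs) / total_weight


def label_state(score, threshold=0.5):
    """Map a fused score to a coarse emotional state label."""
    if score <= -threshold:
        return "agitated"
    if score >= threshold:
        return "positive"
    return "neutral"


fused = fuse_emotion(text_score=-0.8, voice_score=-0.6, face_score=None)
state = label_state(fused)
```

In this example the facial expression score is missing, so only the text and voice scores contribute, and the fused result is labeled "agitated".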

[0642] Step 5:

[0643] Based on the analysis results above, the server generates personalized feedback tailored to each participant, including specific advice and suggestions for improvement that take their emotional state into account.

[0644] Step 6:

[0645] The user's device notifies the user of the generated feedback in real time. Notifications are provided as screen displays or audio messages, helping the user to correct their actions on the spot.

[0646] Step 7:

[0647] The server generates a detailed report after the meeting, including a summary and analysis of the meeting. Users can use this report to identify areas for improvement in future meetings and prepare accordingly.
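
The post-meeting report could aggregate the per-utterance analysis results by speaker. The following is a minimal sketch. The record layout of (speaker, emotion score) pairs and the report fields are assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def build_report(records):
    """Summarize analyzed utterances.

    Each record is a (speaker, emotion_score) tuple, where the score lies
    in [-1, 1]; this layout is an assumption made for the sketch.
    """
    counts = Counter(speaker for speaker, _ in records)
    per_speaker = {}
    for speaker in counts:
        scores = [score for s, score in records if s == speaker]
        per_speaker[speaker] = {
            "utterances": counts[speaker],
            "mean_emotion": round(mean(scores), 2),
        }
    return {"total_utterances": len(records), "speakers": per_speaker}


report = build_report([("A", 0.5), ("B", -0.2), ("A", 0.1)])
```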

[0648] (Example 2)

[0649] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0650] In modern meetings, participants' contributions are diverse, and understanding their statements, including changes in emotion and tone, and providing appropriate feedback is essential for productive discussions. However, traditional systems have struggled to accurately assess participants' emotions and the state of the discussion in real time and to provide optimized improvements for each participant.

[0651] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0652] In this invention, the server includes means for collecting acoustic information of participants, means for converting the acoustic information into text information, and means for analyzing the content of speech based on the text information. This enables a detailed understanding of the content of participants' speech and the generation of personalized feedback that corresponds to the emotion and tone of each speech.

[0653] "Participants" refers to the individuals who speak during a meeting and from whom the system collects audio information and other data.

[0654] "Acoustic information" refers to data that includes the characteristics of voices and sounds emitted by participants, and is collected in real time during the meeting.

[0655] "Textual information" refers to data in text format converted from acoustic information, and is used for analyzing the content of speech.

[0656] "Statements" refer to the opinions and information expressed by participants, and serve as the basis for analysis and feedback generation.

[0657] "Analysis" is the process of evaluating the tone of participants' speech, emotional state, and activity level based on textual information and acoustic tone, and extracting necessary information.

[0658] "Feedback" refers to information provided to participants based on analysis results, indicating areas for improvement and points to pay attention to, with the aim of improving the quality of the meeting and the performance of the participants.

[0659] This invention is a system that comprehensively analyzes acoustic information in user-participatory meetings and provides participants with real-time, personalized feedback. During the meeting, the user terminal collects participant acoustic information using a highly sensitive voice collection device. This collected acoustic information is encoded using a dedicated compression algorithm and transmitted to a server via the network.

[0660] The server first uses a digital signal processing unit to remove noise from the received acoustic information. It then uses the Google Cloud Speech-to-Text API or a similar recognition engine to convert this acoustic information into text. Next, to analyze the converted text, it uses OpenAI's generative AI model to precisely evaluate the content of the speech. The analysis comprehensively considers voice tone and facial expressions from the video to understand the emotional state of each participant.

[0661] Based on the analysis results, the server uses natural language generation technology to generate feedback best suited to each participant. For example, if a user is speaking in an angry tone, the server provides feedback suggesting breathing techniques to soften that tone. The generated feedback is transmitted to the user's terminal via the user interface and displayed visually to the participant in real time.

[0662] For example, if the server detects that the participants' discussion is becoming heated, it will prompt the facilitator to "adjust the meeting to ensure proper moderation." Another example of a prompt for the generative AI model could be, "Please tell me what kind of feedback should be given when participants become emotionally agitated."

[0663] As described above, implementing this system can support more effective and smoother meeting management and create an environment where all participants can contribute constructively.

[0664] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0665] Step 1:

[0666] The user terminal uses a high-sensitivity microphone to collect acoustic information from participants during the meeting. The terminal captures this information in real time, applies a noise reduction filter to clean the signal, and then encodes the result in digital format. The input is the participants' voices in the physical space, and the output is digitally filtered acoustic data.

[0667] Step 2:

[0668] The user terminal compresses the encoded audio data and sends it to the server in accordance with IT security protocols. The input is digital audio data, and the output is securely transmitted compressed audio data. This process applies algorithms to optimize data volume and reduce the load on the network.
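
The disclosure does not name the compression algorithm. As an illustrative stand-in, the following sketch uses general-purpose lossless compression from the Python standard library and prefixes each frame with the original length so that the server can verify the decoded data. The function names and the frame layout are assumptions. A practical system would more likely use a dedicated audio codec, and encryption would be handled separately by the transport layer.

```python
import struct
import zlib


def pack_audio(pcm_bytes, level=6):
    """Compress PCM bytes and prefix a 4-byte big-endian original length."""
    compressed = zlib.compress(pcm_bytes, level)
    return struct.pack(">I", len(pcm_bytes)) + compressed


def unpack_audio(frame):
    """Reverse pack_audio and verify the recovered length."""
    (expected_len,) = struct.unpack(">I", frame[:4])
    pcm_bytes = zlib.decompress(frame[4:])
    if len(pcm_bytes) != expected_len:
        raise ValueError("decoded length does not match header")
    return pcm_bytes


silence = bytes(3200)  # 0.1 s of 16 kHz, 16-bit mono silence
frame = pack_audio(silence)
restored = unpack_audio(frame)
```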

[0669] Step 3:

[0670] The server decodes the received compressed audio data, performs noise reduction, and analyzes the speech characteristics. The input is the received compressed audio data, and the output is clean, analyzable audio data. This data is converted into text information using a speech recognition engine, such as a commonly available API.

[0671] Step 4:

[0672] The server analyzes the generated text information and associated speech tone data to evaluate the content and emotional state of the utterance. The input consists of text information and speech tone data, while the output is evaluation information such as the emotional state of the analyzed utterance. This analysis is performed using a generative AI model and natural language processing techniques.

[0673] Step 5:

[0674] The server generates personalized feedback based on the analysis results and sends it to the user terminal via the user interface. The input is the analyzed emotion and utterance evaluation information, and the output is user-specific, customized feedback data. This feedback may include, for example, advice to alleviate the feelings of meeting participants.

[0675] Step 6:

[0676] The user terminal displays feedback received from the server in real time and notifies the user using audio and visuals. The input is feedback data, and the output is feedback information displayed on the screen. Based on this information, the terminal supports the user in taking specific actions during the meeting.

[0677] (Application Example 2)

[0678] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0679] In modern home and office environments, smooth communication and psychological safety are crucial elements for improving the quality of daily life. However, emotional outbursts and conversational stagnation can hinder smooth communication. In such situations, there is a need for a system that utilizes robots to understand emotions and conversational states in real time and provide appropriate support.

[0680] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0681] In this invention, the server includes a mechanism for acquiring acoustic information, a method for converting acoustic information into text information, a method for analyzing emotions and facial expressions from the text information and acoustic information, a method for generating instructions to support actions, and a mechanism for presenting the instructions via an operating device. This facilitates communication in home and office environments, enhances psychological safety, and improves the quality of life.

[0682] "Acoustic information" refers to speech and sound signals collected within a space, including conversations and ambient sounds.

[0683] A "mechanism" refers to a group of physical or electronic devices designed to perform a specific function.

[0684] "Textual information" refers to data in string format obtained by analyzing acoustic information, which is essentially a conversion of speech into text.

[0685] "Method" refers to the steps or processes that are taken to achieve a specific objective.

[0686] "Analysis" is the process of evaluating the content and characteristics of collected data and extracting information.

[0687] "Emotions" are internal reactions that indicate a person's psychological state, and they are expressed externally through tone of voice and facial expressions.

[0688] "Facial expression" refers to the outward changes produced by facial muscles and other structures to indicate an individual's emotional state.

[0689] "Instructions to support action" are advice or guidelines provided to encourage appropriate behavior in a particular situation.

[0690] An "operating device" refers to a unit that physically moves to execute instructions from a system.

[0691] As a form of implementing the invention, this system is designed to facilitate communication in home and office environments. The system includes a consumer robot with a built-in microphone for acquiring acoustic information. This robot acquires voice data in real time and transmits it to a server via a Wi-Fi network. The server then uses Google Cloud's Speech-to-Text API to convert the acoustic information into text.

[0692] The converted text and audio information are analyzed for emotions and facial expressions via IBM Watson's emotion analysis API. This analysis evaluates the user's emotional state and generates instructions to support their actions. A generative AI model (e.g., GPT-3) is used to generate these instructions, and the server provides advice based on the generated prompts. The generated advice is displayed on the robot's screen or presented by voice.

[0693] As a concrete example, the robot observes how family conversations develop during dinner. If the system determines that emotions are running high, it will provide voice advice such as, "Let's calm down a bit and take a deep breath." Also, if it determines that the conversation is stalled, it will prompt the conversation with questions such as, "How was your day?"
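
The two behaviors in this example can be sketched as a simple decision rule. The sketch assumes that the analysis stage supplies an emotion intensity between 0 and 1 and the time elapsed since the last utterance. The function name and both thresholds are hypothetical values, not values taken from the disclosure.

```python
def choose_prompt(emotion_intensity, seconds_since_last_utterance,
                  heated_threshold=0.7, stall_seconds=20.0):
    """Pick the robot's spoken advice, or None when no support is needed.

    emotion_intensity is assumed to lie in [0, 1]; both thresholds are
    illustrative values.
    """
    if emotion_intensity >= heated_threshold:
        return "Let's calm down a bit and take a deep breath."
    if seconds_since_last_utterance >= stall_seconds:
        return "How was your day?"
    return None
```

For instance, `choose_prompt(0.9, 1.0)` returns the calming advice, `choose_prompt(0.2, 30.0)` returns the conversation starter, and `choose_prompt(0.2, 3.0)` returns `None`.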

[0694] A concrete example of a prompt sentence using a generative AI model is, "If a conversation at home becomes emotional, generate advice to help people calm down." This prompt sentence allows the generative AI model to generate appropriate feedback tailored to the specific situation.

[0695] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0696] Step 1:

[0697] The terminal acquires acoustic information through a microphone. The input is ambient sound data, which is converted into a digital signal and sent to the server. The output is the transmission of audio data to the server.

[0698] Step 2:

[0699] The server inputs the received audio data into the speech recognition service. Here, the Google Cloud Speech-to-Text API is used to filter out noise from the audio data and convert it into text. The output is the converted text data.

[0700] Step 3:

[0701] The server inputs the converted text information and the original audio data into the sentiment analysis engine. This engine utilizes IBM Watson's sentiment analysis API to analyze the user's emotional state and facial expressions. As a result of the analysis, data indicating emotional tendencies and tone is output.

[0702] Step 4:

[0703] The server inputs a prompt into the generative AI model, which then generates instructions based on the analysis results. The GPT-3 model is used here, and an example prompt is "There is a heightened emotional situation at home; please provide advice on how to calm down." The output is specific action advice for the user.

[0704] Step 5:

[0705] The server sends the generated instructions to the terminal. The terminal receives these instructions and provides advice to the user via voice or display. The output is feedback to the user.

[0706] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0707] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0708] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0709] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0710] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions located there. Toward the outside of the concentric circles, emotions representing states and actions arising from mental states are located. Here, emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. In the upper and lower parts of the concentric circles, emotions that are generally both generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0711] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0712] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible (expressed in actions) it becomes.
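
The layout described above (left for reaction, right for situation, top for pleasure, bottom for displeasure, inside for inner thoughts, outside for actions) can be illustrated with a minimal sketch. The polar representation, the `MapPosition` type, the `describe` function, and the 0.5 boundary between inner and outer layers are all assumptions made for illustration; the disclosure does not specify coordinates or thresholds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPosition:
    """Hypothetical placement of one emotion on the map, in polar form."""
    radius: float     # 0.0 = center (inner state), 1.0 = outer edge (expressed in action)
    angle_deg: float  # clock-style angle: 0 = 12 o'clock (top), 90 = 3 o'clock (right)


def describe(pos: MapPosition) -> dict:
    """Classify a position by side, valence, and layer of the map."""
    theta = math.radians(pos.angle_deg)
    # Convert to x/y so that +x is right and +y is up; rounding removes
    # floating-point noise exactly on the axes.
    x = round(pos.radius * math.sin(theta), 9)
    y = round(pos.radius * math.cos(theta), 9)
    return {
        "side": "situation" if x > 0 else "reaction" if x < 0 else "both",
        "valence": "pleasure" if y > 0 else "displeasure" if y < 0 else "neutral",
        # Assumed boundary between inner thoughts and outward actions.
        "layer": "inner" if pos.radius < 0.5 else "outer",
    }


if __name__ == "__main__":
    # An emotion near the 3 o'clock position, close to the outer edge.
    print(describe(MapPosition(radius=0.8, angle_deg=90)))
```

Under these assumptions, a position at 3 o'clock falls on the situation side with neutral valence, which is consistent with the description of the right half of the map.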

[0713] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
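
The idea that deviation from an ideal balance produces discomfort and proximity to it produces pleasure can be sketched as a simple score. This is only an illustrative assumption: the function name `balance_valence`, the linear scoring, and the per-balance tolerances are hypothetical and are not taken from the disclosure.

```python
def balance_valence(readings: dict, ideals: dict, tolerances: dict) -> float:
    """Return a score in [-1, 1].

    +1 means every balance sits exactly at its ideal (pleasure);
    -1 means every balance is off by its tolerance or more (discomfort).
    """
    scores = []
    for name, ideal in ideals.items():
        # Deviation normalized by an assumed tolerance, capped at 1.0.
        deviation = min(abs(readings[name] - ideal) / tolerances[name], 1.0)
        scores.append(1.0 - 2.0 * deviation)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical robot balances: battery level (0..1) and tilt in degrees.
    ideals = {"battery": 1.0, "tilt": 0.0}
    tolerances = {"battery": 0.5, "tilt": 30.0}
    print(balance_valence({"battery": 0.9, "tilt": 3.0}, ideals, tolerances))
```

A linear score is the simplest choice; any monotonic function of the deviation would express the same relationship.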

[0714] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0715] The emotion identification model 59 feeds user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained on multiple training data sets, each of which is a combination of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
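
As a rough illustration of the two ideas in this paragraph, the sketch below picks the emotion with the highest value and computes a penalty that is small when neighboring emotions have similar values. The neural network itself is not shown; the emotion values are placeholder numbers, and the function names and the squared-difference penalty are assumptions, since the disclosure does not state how similarity between neighboring emotions is enforced during training.

```python
def dominant_emotion(values: dict) -> str:
    """Return the emotion whose value is highest."""
    return max(values, key=values.get)


def neighbor_penalty(values: dict, neighbor_pairs: list) -> float:
    """Sum of squared differences between emotions adjacent on the map.

    A term like this could, hypothetically, be added to a training loss so
    that emotions located close together end up with similar values.
    """
    return sum((values[a] - values[b]) ** 2 for a, b in neighbor_pairs)


if __name__ == "__main__":
    # Placeholder output of the (not shown) pre-trained network.
    values = {"reassured": 0.8, "calm": 0.7, "confident": 0.75, "anxious": 0.1}
    pairs = [("reassured", "calm"), ("calm", "confident")]
    print(dominant_emotion(values), neighbor_penalty(values, pairs))
```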

[0716] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0717] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0718] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0719] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0720] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0721] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using that memory.

[0722] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0723] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0724] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0725] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as doing so does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0726] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0727] The following is further disclosed regarding the embodiments described above.

[0728] (Claim 1)

[0729] A device for collecting participants' voice data,

[0730] Means for converting the aforementioned audio data into text data,

[0731] A means for analyzing the content of statements from the aforementioned text data,

[0732] Means for generating feedback based on the aforementioned analysis,

[0733] A system including means for notifying the aforementioned feedback.

[0734] (Claim 2)

[0735] The system according to claim 1, wherein the analysis means comprises means for evaluating the tone and emotion of a statement.

[0736] (Claim 3)

[0737] The system according to claim 1, wherein the analysis means is equipped with means for detecting the level of activity in speech and whether or not interruptions occur.
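
One way the level of speech activity and the occurrence of interruptions might be detected is sketched below from time-stamped utterances. The `Utterance` type, the function names, and the overlap rule (a different speaker starting before the previous utterance ends) are assumptions for illustration; the disclosure does not define how these quantities are measured.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float


def speaking_share(utterances: list) -> dict:
    """Fraction of total speaking time taken by each speaker."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u.speaker] += u.end - u.start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {speaker: t / grand_total for speaker, t in totals.items()}


def count_interruptions(utterances: list) -> dict:
    """Count, per speaker, utterances that begin before the previous one ends.

    Simplification: each utterance is compared only with the one that
    started immediately before it.
    """
    counts = defaultdict(int)
    ordered = sorted(utterances, key=lambda u: u.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            counts[cur.speaker] += 1
    return dict(counts)


if __name__ == "__main__":
    log = [Utterance("A", 0, 10), Utterance("B", 8, 12), Utterance("A", 12, 20)]
    print(speaking_share(log), count_interruptions(log))
```

A share well above an even split could serve as a trigger for participation-balance feedback, although any such threshold would be a design decision outside this sketch.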

[0738] "Example 1"

[0739] (Claim 1)

[0740] Means for collecting participants' voice information,

[0741] Means for converting the aforementioned audio information into text information,

[0742] A means for analyzing the content of a statement from the aforementioned textual information,

[0743] Means for generating individual feedback based on the aforementioned analysis,

[0744] A system including means for notifying the aforementioned feedback.

[0745] (Claim 2)

[0746] The system according to claim 1, wherein the analysis means comprises means for evaluating the impression and emotion of a statement.

[0747] (Claim 3)

[0748] The system according to claim 1, wherein the analysis means is equipped with means for detecting the frequency of speech and whether or not there are interruptions.
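
The sequence of means recited above (collect, convert to text, analyze, generate individual feedback, notify) can be sketched as a small pipeline. Every name here is hypothetical, and the speech recognition, analysis, and notification steps are passed in as placeholder callables because the disclosure does not tie them to any particular implementation.

```python
from typing import Callable, Dict


def run_feedback_pipeline(
    audio_by_participant: Dict[str, bytes],
    transcribe: Callable[[bytes], str],
    analyze: Callable[[str], Dict[str, float]],
    make_feedback: Callable[[str, Dict[str, float]], str],
    notify: Callable[[str, str], None],
) -> Dict[str, str]:
    """Run each participant's audio through the five steps in order."""
    feedback = {}
    for participant, audio in audio_by_participant.items():
        text = transcribe(audio)                         # voice -> text
        analysis = analyze(text)                         # analyze the statement
        message = make_feedback(participant, analysis)   # individual feedback
        notify(participant, message)                     # deliver the feedback
        feedback[participant] = message
    return feedback


if __name__ == "__main__":
    # Stand-in components for demonstration only.
    result = run_feedback_pipeline(
        {"alice": b"we should decide"},
        transcribe=lambda audio: audio.decode("utf-8"),
        analyze=lambda text: {"words": float(len(text.split()))},
        make_feedback=lambda who, a: f"{who}: {int(a['words'])} words spoken",
        notify=lambda who, msg: print(f"to {who}: {msg}"),
    )
    print(result)
```

Passing the steps in as callables keeps the sketch independent of whether the processing runs on the data processing device 12 or elsewhere.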

[0749] "Application Example 1"

[0750] (Claim 1)

[0751] Means for collecting participants' voice information,

[0752] Means for converting the aforementioned audio information into text information,

[0753] A means for analyzing the content of a statement from the aforementioned textual information,

[0754] Means for generating feedback based on the aforementioned analysis,

[0755] The means for providing the aforementioned feedback,

[0756] A means to improve the conversational environment inside autonomous vehicles in real time,

[0757] A system that includes this.

[0758] (Claim 2)

[0759] The system according to claim 1, wherein the analysis means is a means for evaluating the tone and emotion of a statement and providing feedback for facilitating communication within an autonomous vehicle.

[0760] (Claim 3)

[0761] The system according to claim 1, wherein the analysis means includes means for detecting the level of speech activity and the presence or absence of interruptions, and for providing feedback to improve the passenger experience in an autonomous vehicle.

[0762] "Example 2 of combining an emotion engine"

[0763] (Claim 1)

[0764] Means for collecting acoustic information from participants,

[0765] means for converting the aforementioned acoustic information into textual information,

[0766] A means for analyzing the content of a statement based on the aforementioned textual information,

[0767] Means for generating personalized feedback based on the aforementioned analysis,

[0768] A means for visualizing the aforementioned feedback,

[0769] A system that includes this.

[0770] (Claim 2)

[0771] The system according to claim 1, wherein the analysis means comprises means for evaluating the tone and emotional state of a statement.

[0772] (Claim 3)

[0773] The system according to claim 1, wherein the analysis means is equipped with means for detecting the activity status of statements and deviations from topics.
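
Detection of deviation from the topic could, for example, compare the words of a statement with a set of agenda keywords. The sketch below uses a plain word-overlap ratio on English text; the function names, the tokenization, and the 0.2 threshold are assumptions for illustration and are not the method of the disclosure, which may instead rely on the data generation model.

```python
import re


def topic_overlap(utterance: str, agenda_keywords: set) -> float:
    """Fraction of distinct words in the utterance that are agenda keywords."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    if not tokens:
        return 0.0
    return len(tokens & agenda_keywords) / len(tokens)


def is_off_topic(utterance: str, agenda_keywords: set, threshold: float = 0.2) -> bool:
    """Flag a statement whose overlap with the agenda falls below the threshold."""
    return topic_overlap(utterance, agenda_keywords) < threshold


if __name__ == "__main__":
    agenda = {"budget", "forecast", "revenue"}
    print(is_off_topic("Let's review the budget forecast", agenda))
    print(is_off_topic("Did you watch the game last night?", agenda))
```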

[0774] "Application example 2 when combining with an emotional engine"

[0775] (Claim 1)

[0776] A mechanism for acquiring acoustic information,

[0777] A method for converting the aforementioned acoustic information into textual information,

[0778] A method for analyzing emotions and facial expressions from the aforementioned textual and acoustic information,

[0779] A method for generating instructions to support behavior based on the aforementioned analysis results,

[0780] A system including a mechanism for presenting the aforementioned instructions via an operating device.

[0781] (Claim 2)

[0782] The system according to claim 1, wherein the analysis method includes a mechanism for detecting heightened emotions or stagnation in conversation and providing behavioral support.

[0783] (Claim 3)

[0784] The system according to claim 1, wherein the aforementioned behavioral support provides instructions using prompts generated by a naturally generated algorithm.

[Explanation of Symbols]

[0785] 10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: means for collecting participants' voice information; means for converting the voice information into text information; means for analyzing the content of a statement from the text information; means for generating feedback based on the analysis; means for providing the feedback; and means for improving the conversational environment inside an autonomous vehicle in real time.

2. The system according to claim 1, wherein the analysis means is a means for evaluating the tone and emotion of a statement and providing feedback for facilitating communication within an autonomous vehicle.

3. The system according to claim 1, wherein the analysis means is equipped with means for detecting the level of activity in speech and the presence or absence of interruptions, and for providing feedback to improve the passenger experience in an autonomous vehicle.