system

The system addresses communication inefficiencies in meetings by providing real-time, culturally adapted feedback based on audio analysis, improving productivity and communication skills in multinational teams.

JP2026096405A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

AI Technical Summary

Technical Problem

Modern enterprises face decreased productivity in meetings due to variations in communication abilities, particularly in multinational teams, where cross-cultural misunderstandings and frequent interruptions lead to inefficient discussions.

Method used

A system that acquires audio data in real-time, converts it to text using speech recognition, analyzes it with natural language processing to evaluate activity levels and interruptions, and provides personalized, culturally adapted feedback to improve communication.

Benefits of technology

Enhances meeting productivity by reducing misunderstandings and promoting effective communication through real-time feedback tailored to individual and cultural contexts.

✦ Generated by Eureka AI based on patent content.

Abstract

A system is provided. [Solution] The system includes: a means for acquiring audio data via a communication network; a speech recognition means for converting the audio data into text; a natural language processing means for analyzing the converted text data and evaluating the level of activity in the exchange of opinions, the frequency of interruptions, and the degree to which the topic deviates; a means for generating feedback based on the analysis results and notifying the user of warnings in real time; a means for providing personalized feedback to each user; and a means for providing culturally adapted feedback to multinational teams.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In modern enterprises, meeting productivity is declining, and participants' communication abilities vary widely. In multinational teams in particular, difficulties and misunderstandings in cross-cultural communication make meetings inefficient. Furthermore, as online meetings become more common, participants frequently interrupt one another and discussions drift off topic, both of which lower the quality of meetings. Solving these problems and improving the quality of meetings is an urgent task for enterprises.

Means for Solving the Problems

[0005] This invention acquires audio data in real time and converts it to text using speech recognition. The converted text data is then analyzed using natural language processing to evaluate the activity level of the meeting, the frequency of interruptions, and other factors. Based on the analysis results, personalized feedback is generated for each participant and delivered to that participant in real time. Furthermore, for multinational teams, culturally adapted feedback is provided to mitigate problems caused by misunderstandings and cultural differences, thereby supporting efficient meeting management. The result is improved meeting productivity and better communication skills among participants.

[0006] A "communication network" is a combination of computers and communication equipment that enables the transmission of digital data.

[0007] "Audio data" refers to a data format that contains information obtained by converting human voices into electrical or digital signals.

[0008] "Speech recognition means" refers to a technology or device used to convert speech into text.

[0009] "Natural language processing means" refers to a series of technologies for understanding, interpreting, and generating natural language using computers.

[0010] "Feedback" is a means of providing participants in a meeting with information for evaluation and improvement.

[0011] A "warning" is a notification that immediately alerts the user when inappropriate behavior or remarks are detected.

[0012] "Personalized" means customized according to the individual user's characteristics and behavior.

[0013] A "multinational team" is a workplace group composed of members with different nationalities or cultural backgrounds.

[0014] "Culturally adapted" refers to something that is designed with content and methods that are appropriate for a specific culture or social context.

Brief Description of the Drawings

[0015] [Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of the data processing device and smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, in which an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, in which an emotion engine is combined.

Embodiments for Implementing the Invention

[0016] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0017] First, the terms used in the following description will be explained.

[0018] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0019] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as working memory.

[0020] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tape.

[0021] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0022] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same reading applies in this specification when three or more items are linked by "and/or."

[0023] [First Embodiment]

[0024] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0025] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0026] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0027] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0028] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0029] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0030] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0031] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0032] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0033] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0034] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0035] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0036] This invention is a system that improves the quality of online meetings by analyzing meeting audio data acquired over a communication network. In one embodiment, a server acquires audio data in real time via the API of an online meeting platform. The audio data is converted into text data using a speech recognition engine. This text data is analyzed by a natural language processing (NLP) model on the terminal to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the discussion deviates from the topic.

[0037] Based on the analysis results, the device generates personalized feedback for each participant. This feedback includes advice on effective conversation styles and points out the importance of listening to others' opinions. For multinational teams, it takes cultural differences into account and provides specific communication improvement suggestions to prevent misunderstandings.

[0038] Users can receive real-time feedback from their devices, review it during the meeting, and use it to make improvements. This increases meeting productivity and encourages more active and collaborative engagement from participants.

[0039] As a concrete example, consider a scenario where a multinational team is exchanging opinions online. Suppose the server acquires audio data and instantly performs analysis, determining that a participant tends to frequently interrupt others. In this case, the device immediately generates a message stating, "It is important to respect the time other participants have to speak," and notifies the user. This allows the participant to reconsider their conversation style, leading to smoother discussion.
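
The scenario in [0039] can be illustrated with a minimal sketch. The description does not say how interruptions are detected, so the following assumes the recognizer returns speaker-labelled segments with start and end times, and counts an interruption whenever a different speaker starts while the previous segment is still running. The `Segment` type, the 0.3-second overlap tolerance, and the threshold of two interruptions are illustrative assumptions; only the warning text is taken from the description.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from meeting start (assumed to be available)
    end: float
    text: str


def count_interruptions(segments: List[Segment], min_overlap: float = 0.3) -> Counter:
    """Count, per speaker, how often they start talking over the previous speaker.

    Simplification: only consecutive segments (ordered by start time) are compared.
    """
    counts: Counter = Counter()
    ordered = sorted(segments, key=lambda s: s.start)
    for prev, cur in zip(ordered, ordered[1:]):
        overlap = prev.end - cur.start
        if cur.speaker != prev.speaker and overlap >= min_overlap:
            counts[cur.speaker] += 1
    return counts


def interruption_warnings(segments: List[Segment], threshold: int = 2) -> Dict[str, str]:
    """Return a warning for each speaker whose interruption count reaches the threshold."""
    message = "It is important to respect the time other participants have to speak."
    return {spk: message for spk, n in count_interruptions(segments).items() if n >= threshold}


segments = [
    Segment("A", 0.0, 5.0, "Let me walk through the schedule."),
    Segment("B", 4.0, 8.0, "Wait, what about the budget?"),   # starts 1 s before A ends
    Segment("A", 8.2, 12.0, "I will get to that shortly."),
    Segment("B", 11.0, 14.0, "But it affects the schedule."),  # overlaps A again
    Segment("C", 14.5, 16.0, "Could we take these one at a time?"),
]
print(interruption_warnings(segments))
```

Comparing only consecutive segments keeps the sketch short; a production system would likely need to handle several simultaneous speakers and backchannel utterances such as brief acknowledgements.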

[0040] Thus, the present invention aims to improve the efficiency of meeting management and the communication skills of participants, thereby contributing to increased productivity for companies.

[0041] The following describes the processing flow.

[0042] Step 1:

[0043] The server acquires audio data in real time via the conferencing platform's API. It establishes a connection to accurately capture the audio stream and stores the acquired audio data in a buffer for processing.

[0044] Step 2:

[0045] The server sends the audio data stored in the buffer to the speech recognition engine, where it is converted into text data. The speech recognition engine uses speaker separation technology to identify each speaker and adds this information to the text as metadata.

[0046] Step 3:

[0047] The terminal receives the converted text data and inputs it into a natural language processing (NLP) model. The NLP model analyzes the text and evaluates the content of the statements, emotions, tone, etc. It detects specific keywords and phrases and measures the level of activity in the exchange of opinions, the frequency of interruptions, and the degree to which the discussion deviates.

[0048] Step 4:

[0049] Based on the NLP analysis results, the server evaluates the overall meeting situation and generates personalized feedback for each participant. This feedback includes suggestions for improving the timing and expression of their contributions.

[0050] Step 5:

[0051] The device notifies the user of feedback generated in real time. These notifications are provided as on-screen pop-ups or audio guides, allowing the user to immediately understand areas for improvement.

[0052] Step 6:

[0053] Based on the feedback, users can review their speaking style and their attitude towards listening to others' opinions. Especially in multinational teams, culturally adapted advice can help them avoid misunderstandings and practice effective communication.
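
Steps 1 through 5 above can be sketched as a small pipeline. This is a sketch under stated assumptions, not the implementation described in the application: the speech recognizer, the NLP analysis, and the notification channel are passed in as plain callables so that no real service API is implied, everything runs in one process even though the description splits the work between server and terminal, and the class name `MeetingPipeline`, the metric key `interruptions`, and the hint text are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Utterance = Dict[str, str]            # {"speaker": ..., "text": ...}
Metrics = Dict[str, Dict[str, int]]   # speaker -> {"interruptions": n}

TIMING_HINT = "Try waiting until the current speaker has finished."  # hypothetical wording


class MeetingPipeline:
    def __init__(
        self,
        recognize: Callable[[bytes], List[Utterance]],
        analyze: Callable[[List[Utterance]], Metrics],
        notify: Callable[[str, str], None],
        max_chunks: int = 256,
    ) -> None:
        # Step 1: incoming audio chunks are held in a bounded buffer.
        self.buffer: Deque[bytes] = deque(maxlen=max_chunks)
        self.recognize = recognize
        self.analyze = analyze
        self.notify = notify

    def on_audio(self, chunk: bytes) -> None:
        self.buffer.append(chunk)

    def run_once(self) -> Dict[str, str]:
        if not self.buffer:
            return {}
        audio = b"".join(self.buffer)
        self.buffer.clear()
        utterances = self.recognize(audio)   # Step 2: speech to speaker-labelled text
        metrics = self.analyze(utterances)   # Step 3: NLP analysis
        feedback = {                         # Step 4: per-participant feedback
            spk: TIMING_HINT
            for spk, m in metrics.items()
            if m.get("interruptions", 0) >= 2
        }
        for spk, msg in feedback.items():    # Step 5: real-time notification
            self.notify(spk, msg)
        return feedback


# Stand-ins for the real recognizer and NLP model.
def fake_recognize(audio: bytes) -> List[Utterance]:
    return [{"speaker": "A", "text": "Let me finish."}, {"speaker": "B", "text": "But"}]


def fake_analyze(utterances: List[Utterance]) -> Metrics:
    return {"A": {"interruptions": 0}, "B": {"interruptions": 3}}


sent = []
pipeline = MeetingPipeline(fake_recognize, fake_analyze, lambda spk, msg: sent.append((spk, msg)))
pipeline.on_audio(b"\x00\x01")
pipeline.on_audio(b"\x02\x03")
pipeline.run_once()
print(sent)
```

Injecting the three stages as callables keeps the flow testable without network access and leaves open which device runs each stage.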

[0054] (Example 1)

[0055] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0056] Improving the quality of communication among participants in online meetings is crucial. This is especially true in multinational teams or other settings with diverse cultural backgrounds, where discussions can easily become disorganized and interruptions can easily lead to tangents. Furthermore, frequent interruptions by participants during meetings often reduce the efficiency of the discussion and ultimately compromise productivity. Therefore, there is a need for systems that can resolve these issues in real time and promote smooth communication.

[0057] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0058] In this invention, the server includes means for acquiring audio data via a communication network, means for converting the audio data into text, and means for analyzing the text data to evaluate the level of activity in discussions and the frequency of interruptions. This enables accurate evaluation of communication and provision of feedback even in meetings involving multinational teams. In particular, by evaluating how often participants interrupt others and suggesting areas for improvement in real time, smooth discussion progress is facilitated.

[0059] A "communication network" is a system in which multiple computers and devices are connected to send and receive data.

[0060] "Audio data" refers to data recorded in digital format, such as audio recordings used in meetings.

[0061] "Speech recognition means" refers to a technology that receives speech as input and converts it into text data.

[0062] "Text data" refers to digital information expressed in characters, and is structured as documents or strings of characters.

[0063] "Natural language processing methods" are technologies for analyzing text data and understanding and evaluating its content.

[0064] "Feedback" refers to evaluation and suggestion information provided by the system to the user, including suggestions for improving the user's behavior.

[0065] "Personalized feedback" refers to feedback that is customized to each individual user, providing advice tailored to their specific situation.

[0066] "Culturally adapted feedback" refers to feedback that takes into account the characteristics of a multinational team and includes suggestions to avoid misunderstandings based on cultural backgrounds.

[0067] "Real-time" means that processing and the provision of information occur almost simultaneously with the input, that is, without noticeable delay.

[0068] "Visualization means" refers to a display device or technology that presents data and analysis results to users in an easily understandable way.

[0069] This invention is a system for improving the quality of communication in online meetings. Specific embodiments for carrying out this invention are described below.

[0070] The server acquires audio data in real time via the communication network using the API of the online meeting platform. Because the server acquires audio streams directly through standard APIs, it can work with general-purpose online meeting platforms without depending on the products or technologies of any specific company.

[0071] The audio data is converted into text data on the server by speech recognition software. Specifically, services such as Google Cloud Speech-to-Text can be used to accurately and quickly convert speech to text.

[0072] The terminal that receives the text data analyzes this data using an advanced natural language processing (NLP) model. For example, one of OpenAI's generative AI models is used to evaluate the level of activity in the meeting, the frequency of interruptions, and the degree to which the topic deviates.

[0073] Based on the analysis results, the device generates personalized feedback for each participant. This feedback is notified to participants in real time, suggesting areas for improvement during online meetings. Because culturally adapted feedback is also provided, taking cultural backgrounds into consideration, it is expected to be particularly effective in multinational teams.

[0074] As a concrete example, consider a project meeting involving a multinational team. The server collects audio data and analyzes each participant's statements using natural language processing. For example, if a tendency for some participants to talk excessively is detected, the terminal generates feedback such as "It is important to listen to the opinions of other members" and notifies them quickly. This system encourages real-time awareness and facilitates collaborative communication among participants.
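
The "tendency to talk excessively" in [0074] can be approximated in many ways; the description does not name one. A minimal sketch, assuming that each participant's share of the total word count is an acceptable proxy and that a 60% share is the cut-off (both assumptions, as are the function names and sample names):

```python
from collections import Counter
from typing import Dict, List, Tuple


def speaking_share(utterances: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return each speaker's fraction of all words spoken (whitespace-tokenized)."""
    words: Counter = Counter()
    for speaker, text in utterances:
        words[speaker] += len(text.split())
    total = sum(words.values())
    return {s: n / total for s, n in words.items()} if total else {}


def dominance_feedback(utterances: List[Tuple[str, str]], max_share: float = 0.6) -> Dict[str, str]:
    """Return feedback for any speaker whose share of the conversation exceeds max_share."""
    message = "It is important to listen to the opinions of other members."
    return {s: message for s, share in speaking_share(utterances).items() if share > max_share}


utterances = [
    ("Kenji", "I think the schedule is fine and we should also revisit the budget and the staffing plan"),
    ("Maria", "Could I add one point"),
    ("Kenji", "Before that let me explain the whole background of the decision once more"),
    ("Lee", "I agree"),
]
print(dominance_feedback(utterances))
```

Whitespace tokenization only suits languages written with spaces; for Japanese transcripts, speaking time or a morphological tokenizer would be a more suitable basis.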

[0075] An example of input to the generative AI model is: "Analyze the following meeting audio data and evaluate the communication styles of the participants. Based on the analysis results, generate improvement suggestions for each participant." This system allows users to improve the quality of meetings and increase productivity.
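
As an illustration of how the instruction in [0075] might be combined with the recognized text, the following sketch builds a single prompt string from speaker-labelled utterances. The instruction is quoted from the description; the `Transcript:` layout, the `build_prompt` name, and the character budget are assumptions, and no call to any model API is shown.

```python
from typing import Dict, List

INSTRUCTION = (
    "Analyze the following meeting audio data and evaluate the communication styles "
    "of the participants. Based on the analysis results, generate improvement "
    "suggestions for each participant."
)


def build_prompt(utterances: List[Dict[str, str]], max_chars: int = 4000) -> str:
    """Join the instruction with a speaker-labelled transcript.

    When the transcript exceeds max_chars, the oldest lines are dropped first so
    that the most recent part of the meeting is kept.
    """
    lines = [f"{u['speaker']}: {u['text']}" for u in utterances]
    while lines and len("\n".join(lines)) > max_chars:
        lines.pop(0)
    transcript = "\n".join(lines)
    return f"{INSTRUCTION}\n\nTranscript:\n{transcript}"


utterances = [
    {"speaker": "A", "text": "I was not finished."},
    {"speaker": "B", "text": "Sorry, go ahead."},
]
print(build_prompt(utterances))
```

Dropping whole lines rather than cutting at a character offset avoids handing the model a sentence that starts mid-word.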

[0076] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0077] Step 1:

[0078] The server retrieves audio data in real time from the API of the online meeting platform via the communication network. The input is an audio stream from the online meeting, and the output is raw audio data collected on the server. Specifically, the server continuously calls the API and receives data in stream format.

[0079] Step 2:

[0080] The server passes the acquired audio data to speech recognition software, which converts it into text data. The input is the audio data obtained in step 1, and the output is the data with its content converted into text format. Specifically, the server calls a speech recognition engine such as Google Cloud Speech-to-Text, which analyzes the audio and generates the corresponding text.

[0081] Step 3:

[0082] The terminal receives text data sent from the server and processes it through a natural language processing model. The input is text data converted from speech, and the output is an analysis result including the level of activity in the meeting, the frequency of interruptions, and the degree of topic digression. Specifically, the terminal launches a generative AI model, inputs the text data, and evaluates the communication style of the meeting.

[0083] Step 4:

[0084] The device generates personalized feedback for each participant based on the analysis results. The input is the analysis results from step 3, and the output is a customized feedback message for each participant. Specifically, the device extracts key points from the generated information and creates feedback that includes individual improvement suggestions and advice.

[0085] Step 5:

[0086] The user receives feedback from the device in real time. The input is the feedback message generated in step 4, and the output is the notification information the user receives. Specifically, the device immediately sends the feedback to the user's device, making it available for viewing via pop-up notifications or dashboard displays.

[0087] This series of steps allows users to receive real-time feedback during meetings and effectively improve their communication style.

[0088] (Application Example 1)

[0089] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0090] In online seminars and meetings involving participants from diverse cultures and backgrounds, a lack of real-time feedback on the level of conversational activity and tangents can hinder productive communication. Furthermore, the failure to provide individualized feedback can hinder the smooth progress of discussions. Therefore, improving the quality of communication by analyzing participants' contributions and providing immediate feedback is crucial.

[0091] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0092] In this invention, the server includes means for acquiring acoustic information via a communication path, a recognition device for converting the acoustic information into text information, and a language processor for analyzing the converted text information and evaluating the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation. This enables real-time feedback to be provided to participants, improving the quality of conversation and facilitating smooth discussion.

[0093] A "communication path" refers to the route or network configuration used to transmit data, enabling the sending and receiving of voice and text information.

[0094] "Acoustic information" refers to sound data in digital or analog format, including human voices and other sounds.

[0095] A "recognition device" is a device that converts acoustic information into text information, and it plays a role in converting speech into text using speech recognition technology.

[0096] "Text information" refers to data expressed as text, and is primarily the result of converting acoustic information into text format.

[0097] A "language processor" is a device that analyzes textual information and evaluates the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation.

[0098] "Information exchange activity level" is an indicator that shows how actively conversations and discussions are taking place.

[0099] "Interruption frequency" is an indicator that shows how often participants interrupt each other during a conversation.

[0100] "Degree of topic deviation" is an indicator of how much a discussion or conversation deviates from its original subject.

[0101] A "response generation device" is a device that generates feedback based on analysis results and notifies the user in real time.

[0102] A "multicultural group" refers to a group of individuals with different cultural backgrounds and nationalities.

[0103] A "culturally adapted response" is feedback provided in a way that is considerate of members of a multicultural group, taking into account their different cultural backgrounds.

[0104] In the system for realizing this invention, a server acquires acoustic information via a communication path. The server converts the acoustic information into text information using a recognition device, and the results are analyzed by a language processor. The language processor evaluates the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation, and based on the analysis results, a response generation device generates feedback for the user. The feedback is notified to the user's terminal in real time.

[0105] The "Google Cloud Speech-to-Text" API can be used as the recognition device here. Natural language processing models such as those provided by "Hugging Face Transformers" are applied to the language processor. "Firebase Cloud Messaging" is used to send responses, providing immediate notifications to the user.

[0106] For example, in a scenario where a multicultural group participates in an international online seminar, the server can analyze each participant's comments in real time and, if deviations from a specific topic or frequent interruptions are detected, generate feedback such as "You're going off-topic. Let's return to the original topic" and notify the user's device, thereby facilitating smooth discussion.
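
The degree of topic deviation is described only as an indicator, so the following is a deliberately simple stand-in rather than the model-based analysis the description has in mind: it scores each utterance by the fraction of its keywords that do not appear in the agenda and emits the feedback quoted in [0106] when the average over recent utterances is high. The stopword list, the 0.8 threshold, and all function names are assumptions.

```python
import re
from typing import List, Optional, Set

STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "are",
    "we", "i", "it", "for", "on", "that", "this",
}


def keywords(text: str) -> Set[str]:
    """Lower-cased word set with common function words removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def topic_deviation(agenda: str, utterance: str) -> float:
    """0.0 when every keyword of the utterance is in the agenda, 1.0 when none is."""
    words = keywords(utterance)
    if not words:
        return 0.0
    return 1.0 - len(words & keywords(agenda)) / len(words)


def off_topic_feedback(agenda: str, recent: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the off-topic message when recent utterances share few keywords with the agenda."""
    scores = [topic_deviation(agenda, u) for u in recent]
    if scores and sum(scores) / len(scores) >= threshold:
        return "You're going off-topic. Let's return to the original topic."
    return None


agenda = "Release schedule and testing plan for the mobile app"
on_topic = ["The testing plan needs two more weeks", "Can we move the release schedule"]
off_topic = ["Did anyone watch football yesterday", "My team lost again"]

print(off_topic_feedback(agenda, on_topic))
print(off_topic_feedback(agenda, off_topic))
```

Keyword overlap misses paraphrases and synonyms; sentence embeddings from a language model would be the more natural fit for the language processor described here, at the cost of an external dependency.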

[0107] An example of a prompt to the generative AI model is, "Summarize the topic of the current conversation and generate advice to keep the discussion focused." In this way, an embodiment that improves the quality of communication is realized.

[0108] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0109] Step 1:

[0110] The server acquires acoustic information from the online meeting via the communication path. In this process, the server receives the audio stream in real time using the meeting platform's API. The input is the meeting's audio data, and the output is the acquired raw acoustic information.

[0111] Step 2:

[0112] The server uses a recognition device to convert the acquired acoustic information into text information. Specifically, it uses the "Google Cloud Speech-to-Text" API to convert this acoustic information into text data. The input is raw acoustic information, and the output is text information converted from the speech.

[0113] Step 3:

[0114] The server analyzes the converted text information using a language processor. This uses natural language processing models such as those provided by "Hugging Face Transformers" to evaluate the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation. The input is text information, and the output is a set of evaluation metrics based on the analysis results.

[0115] Step 4:

[0116] Based on the analysis results, the server uses a response generation device to generate feedback for the user. The content of the generated feedback varies with the evaluation metrics and the individual participant's type, so that it is useful to the user. The input is the evaluation metrics, and the output is the generated feedback data.

[0117] Step 5:

[0118] The server uses Firebase Cloud Messaging to notify the user's device in real time of the feedback it has generated. During this process, appropriate messaging is used to ensure that the feedback reaches the user immediately. The input is the feedback data, and the output is the notification sent to the device.

[0119] Step 6:

[0120] The user receives the feedback provided and uses it to improve communication during the meeting. Specific actions include checking notifications and adjusting the flow of the conversation as needed. Input is the notifications displayed on the device, and output is the user's adjustments to their actions.
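
Steps 4 and 5 of this flow can be sketched as a function that selects feedback from the evaluation metrics and a function that wraps it in a notification body. The metric names and thresholds are illustrative assumptions, and the feedback texts are the examples quoted in this description. The dictionary shape is meant to follow the message format of the Firebase Cloud Messaging HTTP v1 API (a `message` object with `token` and `notification` fields) and should be checked against the current FCM documentation; the sketch only builds the JSON body and does not send anything.

```python
import json
from typing import Dict, Optional


def select_feedback(metrics: Dict[str, float]) -> Optional[str]:
    """Pick a feedback text from evaluation metrics (thresholds are assumptions)."""
    if metrics.get("topic_deviation", 0.0) >= 0.8:
        return "You're going off-topic. Let's return to the original topic."
    if metrics.get("interruptions_per_10min", 0.0) >= 3:
        return "It is important to respect the time other participants have to speak."
    return None


def build_fcm_message(device_token: str, body: str, title: str = "Meeting feedback") -> Dict:
    """Build a request body for one push notification to one device."""
    return {
        "message": {
            "token": device_token,
            "notification": {"title": title, "body": body},
        }
    }


metrics = {"topic_deviation": 0.2, "interruptions_per_10min": 4}
text = select_feedback(metrics)
# "DEVICE_TOKEN_PLACEHOLDER" stands in for a real registration token.
payload = build_fcm_message("DEVICE_TOKEN_PLACEHOLDER", text) if text else None
print(json.dumps(payload, indent=2))
```

Keeping selection and delivery separate means the same feedback text could also be shown in a dashboard or read aloud, as other embodiments in this description suggest.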

[0121] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0122] The present invention is a system equipped with speech recognition means that acquires voice data via a communication network and converts that voice data into text. Furthermore, this system combines natural language processing means that analyzes the converted text data with an emotion engine that recognizes the user's emotions. This makes it possible to evaluate the level of activity in the exchange of opinions among users during a meeting, the frequency of interruptions, the degree to which the topic deviates, and changes in emotions.

[0123] The server acquires audio data in real time and converts it into text data via a speech recognition engine. In addition, an emotion engine analyzes the audio data and, if necessary, video data to determine the user's emotional state. The terminal receives this data and uses a natural language processing model to analyze the content of the speech. By also incorporating the output of the emotion engine, the generated feedback reflects the user's emotional state during the meeting.

[0124] This feedback is communicated to the user in real time, providing advice tailored to the meeting's progress and the user's emotional state. For example, if the emotion engine detects that a user is experiencing tension or stress during a meeting, the device will provide feedback such as, "Relax and prepare to listen to the next opinion." This allows users to communicate more effectively and improve meeting productivity.

[0125] In multinational teams, feedback can be provided that takes into account cultural differences in emotional expression. To prevent misunderstandings among users with different cultural backgrounds, the device suggests appropriate communication styles, enabling smoother discussions.

[0126] The system configured in this way combines speech recognition, natural language processing, and an emotion engine to provide advanced feedback that goes beyond conventional meeting analysis, thereby improving the value of communication within companies.

[0127] The following describes the processing flow.

[0128] Step 1:

[0129] The server retrieves audio data from the conferencing platform in real time. It connects via an API, converts the retrieved audio data into an appropriate format for analysis, and prepares it for streaming.

[0130] Step 2:

[0131] The server uses a speech recognition engine to convert the audio data into text data. Here, speaker separation technology is used to identify each participant's speech and transcribe it along with their metadata.

[0132] Step 3:

[0133] The emotion engine analyzes acquired audio and video data (if available) to classify the user's emotions into multiple categories (joy, tension, excitement, etc.). It evaluates the emotional state at each point in time in real time and records the results.

[0134] Step 4:

[0135] The terminal passes the converted text data and the emotion evaluation data from the emotion engine to a natural language processing model. The model interprets what is being said, analyzes the tone, evaluates the activity level and interruptions in the meeting, and tracks the overall progress of the meeting.

[0136] Step 5:

[0137] The server generates personalized feedback for each user based on the results of natural language processing and emotion analysis. This feedback includes advice tailored to changes in emotions and suggestions for improving communication styles.

[0138] Step 6:

[0139] The device notifies the user of the generated feedback in real time. The notification is displayed on the screen or provided as an audio guide, allowing the user to adjust their actions accordingly.

[0140] Step 7:

[0141] Users receive feedback and review their own statements and emotional states. In multinational teams, feedback that takes cultural backgrounds into account is used to facilitate smoother intercultural communication.
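The feedback generation of Step 5 can be pictured, purely as an illustration, as a small rule table that combines the emotion label from the emotion engine with an interruption count. In this sketch the function name `select_feedback`, the threshold `max_interruptions`, and the second message are hypothetical. Only the first message is taken from the description above.

```python
from typing import Optional


def select_feedback(emotion: str, interruptions: int,
                    max_interruptions: int = 3) -> Optional[str]:
    """Pick one feedback message, or None when no advice is needed."""
    if emotion == "tension":
        # Message quoted from the description of the emotion engine.
        return "Relax and prepare to listen to the next opinion."
    if interruptions > max_interruptions:
        # Hypothetical wording, used here only for illustration.
        return "Please let other participants finish before speaking."
    return None
```

A generative AI model could replace the rule table. The rules are used here only because they make the dependence on the emotional state explicit.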

[0142] (Example 2)

[0143] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0144] In modern business meetings, it is essential to accurately grasp the level of participation and emotional state of attendees and provide real-time feedback. Furthermore, international meetings are prone to misunderstandings and inefficient communication due to cultural differences. Traditional systems have failed to adequately address these challenges.

[0145] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0146] In this invention, the server includes means for acquiring voice information via a communication network, voice recognition means for converting the voice information into text, natural language processing means for analyzing the converted text information and evaluating the level of conversational activity, the frequency of interruptions, and the degree of topic deviation, and emotion recognition means for analyzing the user's emotional state. This makes it possible to accurately grasp the speaking attitudes and emotions of participants during a meeting in real time and provide accurate feedback and advice. Furthermore, by providing culture-appropriate feedback, it is possible to reduce misunderstandings and inefficient communication between cultures.

[0147] A "communication network" is a network system that connects multiple devices and systems, enabling the transmission and reception of voice information and data.

[0148] "Audio information" refers to a data format used to record and transmit human speech and sounds.

[0149] "Speech recognition means" refers to technologies and devices that analyze acquired speech information to understand the speaker's intent and convert it into text information.

[0150] "Text information" refers to information consisting of a series of characters or words that have been converted by speech recognition technology.

[0151] "Natural language processing means" refers to techniques and methods that analyze text information to understand its context and meaning, and to evaluate the structure and characteristics of conversations.

[0152] "Emotion recognition means" refers to methods and devices for detecting a person's emotional state from speech or text and evaluating changes in that state.

[0153] "Feedback" refers to instructions and advice provided to the user based on analyzed information.

[0154] "Intercultural communication styles" refer to the unique ways and methods that people from different cultural backgrounds use to communicate.

[0155] This invention is a system that improves the efficiency and effectiveness of meetings by acquiring audio information during meetings in real time, analyzing it, and providing feedback. The information flow between the server, terminals, and users will be described in detail.

[0156] The server acquires audio information from users in real time via the communication network. In this process, audio recorded by each conference participant's terminal is sent to the server in streaming format. The acquired audio information is immediately converted into text information by a speech recognition system. This speech recognition uses a speech recognition engine (for example, a general-purpose speech recognition API).

[0157] Next, the terminal receives text information sent from the server and performs analysis using natural language processing. During this process, a generative AI model (e.g., a natural language processing engine) is used to identify the content, category, and characteristics of the conversation between users. Subsequently, emotion recognition is used to determine the user's emotional state from voice tone and other nonverbal indicators.

[0158] Based on the analysis results, the device can generate feedback and notify the user in real time. For example, if a participant shows signs of stress during a meeting, the device can provide specific advice such as, "Calm down, gather your thoughts, and then speak next time."

[0159] Furthermore, in multinational teams, culturally sensitive feedback is provided to ensure that participants from different cultural backgrounds can communicate without misunderstanding. For example, feedback can be customized to suit different communication styles across cultures.

[0160] Examples of prompts include specific instructions such as, "Analyze the statements made in this meeting and generate culture-aware feedback as needed."
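As a hedged illustration of how such a prompt might be assembled before it is passed to a generative AI model, the sketch below joins the instruction quoted above with a participant list and a transcript. The function `build_prompt`, the `cultures` mapping, and the layout of the prompt are assumptions made for this example. The disclosure does not specify a prompt format.

```python
INSTRUCTION = (
    "Analyze the statements made in this meeting and "
    "generate culture-aware feedback as needed."
)


def build_prompt(utterances, cultures):
    """Assemble a prompt string.

    utterances: list of (speaker, text) tuples in spoken order.
    cultures:   dict mapping speaker to a cultural-background label
                (hypothetical field, not defined in the disclosure).
    """
    lines = [INSTRUCTION, "", "Participants:"]
    for speaker in sorted(cultures):
        lines.append(f"- {speaker}: {cultures[speaker]}")
    lines += ["", "Transcript:"]
    lines += [f"{speaker}: {text}" for speaker, text in utterances]
    return "\n".join(lines)
```

For example, `build_prompt([("A", "Hello")], {"A": "JP"})` yields a prompt that begins with the instruction and ends with the line `A: Hello`.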

[0161] Thus, the present invention provides an advanced feedback system that combines speech recognition, natural language processing, and emotion recognition, thereby supporting participants in conducting effective meetings.

[0162] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0163] Step 1:

[0164] The server acquires voice information from the user via a communication network. The input is audio streamed in real time from the user's terminal. This voice information is recorded in digital format for future analysis.

[0165] Step 2:

[0166] The server sends the acquired audio information to the speech recognition engine for conversion into text. The speech recognition engine analyzes the input audio data using a language model and converts it into text format. The output is text information, recording the spoken content and corresponding timestamps.

[0167] Step 3:

[0168] The server passes the converted text information to an emotion recognition system to analyze the user's emotional state. The input consists of text information and, if necessary, non-verbal characteristics of speech. This analysis identifies emotional indicators (e.g., joy, anger, tension) and outputs them as the emotional state.

[0169] Step 4:

[0170] The terminal uses the text information and emotional states received from the server to perform natural language processing. Using a natural language processing model, it analyzes the input text and extracts the conversation structure, context, and keywords. The output consists of the analyzed conversation features and categories of user statements.

[0171] Step 5:

[0172] The device generates feedback for the user based on the results of natural language processing and emotion recognition. The input is the results of the preceding analysis, which an AI model takes into account when creating the feedback. The output is a customized feedback message tailored to the individual user.

[0173] Step 6:

[0174] The device notifies the user of the generated feedback in real time. The feedback appears on the screen as a dialog or pop-up, providing the user with immediate advice. This process allows the user to instantly incorporate the feedback into their meeting communication.
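The division of work between the server (Steps 1 to 3) and the terminal (Steps 4 to 6) can be sketched as two functions that take the recognition, emotion, feedback, and notification components as parameters. This is an illustrative sketch only: `Utterance`, `server_stage`, and `terminal_stage` are hypothetical names, and the callables stand in for engines that the disclosure leaves unspecified.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Text with its timestamp and estimated emotional state."""
    timestamp: float
    text: str
    emotion: str = "neutral"


def server_stage(audio_chunks, recognize, classify_emotion):
    """Steps 1-3: transcribe each (timestamp, audio) chunk and attach an emotion."""
    utterances = []
    for timestamp, chunk in audio_chunks:
        text = recognize(chunk)            # speech recognition engine (assumed callable)
        emotion = classify_emotion(text)   # emotion recognition system (assumed callable)
        utterances.append(Utterance(timestamp, text, emotion))
    return utterances


def terminal_stage(utterances, make_feedback, notify):
    """Steps 4-6: generate feedback per utterance and notify when there is any."""
    for utterance in utterances:
        message = make_feedback(utterance)  # NLP / AI model (assumed callable)
        if message:
            notify(message)                 # e.g. a dialog or pop-up on the screen
```

Passing the engines in as callables keeps the sketch independent of any particular speech recognition or emotion recognition product.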

[0175] (Application Example 2)

[0176] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0177] In family communication, individual emotions are often not properly recognized, resulting in a decline in the quality of interaction and creating misunderstandings and stress. This is particularly difficult in intercultural families or those with diverse family structures. Solving this problem is essential.

[0178] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0179] In this invention, the server includes means for acquiring sound data, means for converting sound data into text, means for analyzing the text data to evaluate the level of activity in communication, and means for recognizing emotions and tone in a home environment and making suggestions to promote relaxation. This makes it possible to improve the quality of communication within the home, prevent misunderstandings, and promote smooth interaction.

[0180] A "communication path" refers to a network means used to exchange information such as sound and data between other devices and systems.

[0181] "Audio data" refers to audio information captured by a microphone or recording device converted into a signal.

[0182] "Speech recognition means" refers to technologies and devices that analyze sound data and convert it into corresponding strings of characters.

[0183] "Character data" refers to text-formatted data converted by speech recognition technology.

[0184] "Language processing means" refers to technologies that analyze text data and evaluate the level of conversational activity, the frequency of interruptions, and other factors.

[0185] "Analysis" refers to the process of examining acquired data in detail to detect patterns and specific features.

[0186] "Information" refers to data and knowledge that are generated based on the results of analysis and provided to users in a useful format.

[0187] "Personalized information" refers to feedback that is customized to each user's specific needs and emotional state.

[0188] "Adaptive information" refers to feedback that is adjusted to different cultural backgrounds and environments to facilitate appropriate communication.

[0189] "Home environment" refers to the living conditions and surrounding circumstances of an individual's residence.

[0190] "Relaxation-promoting suggestions" refer to providing instructions and advice to help users reduce tension and stress and reach a comfortable mental state.

[0191] The system for implementing this invention primarily focuses on the function of collecting and analyzing audio data in real time. The system acquires audio data using a microphone and converts this data into text using speech recognition software (e.g., Google Speech-to-Text API). The converted text is then analyzed using language analysis software (e.g., Google Cloud Natural Language API) to evaluate the level of activity and tone of conversations within the home.

[0192] An emotion engine (e.g., IBM Watson® Tone Analyzer) analyzes the user's emotions and tone of voice, and generates relaxation-promoting suggestions as needed. These suggestions and feedback are communicated to the user through the display and speakers. As a result, it helps ensure smooth and comfortable conversations among family members in the living room.

[0193] The server processes the acquired data in the cloud and provides personalized feedback to each user. For multicultural households, it also provides adaptive feedback to reduce misunderstandings that may arise during intercultural communication.

[0194] For example, if family conversations during dinner are awkward and tense, the system might suggest, "Let's find a fun topic to talk about and relax." In this case, the prompt given to the generative AI might be, "Based on the analysis of the family conversation, please provide advice on how to relax." Thus, the invention can be used to improve communication within the family.

[0195] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0196] Step 1:

[0197] The server receives audio data collected through the microphone. This audio data becomes the input. The audio data is acquired in a compressed format, decoded, and passed on to subsequent processing.

[0198] Step 2:

[0199] The server uses speech recognition software to convert speech data into text data. The input for this step is speech data, and the output is the corresponding text data. During this process, phonological analysis is performed, and its results are used to convert the data into text.

[0200] Step 3:

[0201] The server passes the text data to language analysis software to analyze the conversation's activity level and sentiment. The input to this step is the text data generated in step 2, and the output is the analysis results, including activity level and sentiment indicators. A natural language processing algorithm evaluates the text and generates feature vectors.

[0202] Step 4:

[0203] The emotion engine determines the user's emotional state based on the analysis results and generates relaxation suggestions as needed. The input for this step is the analysis results from step 3, and the output is a suggestion message. For example, if it is determined that the user is in a state of tension, a message encouraging relaxation will be generated.

[0204] Step 5:

[0205] The terminal notifies the user of suggestions from the server via its display or speaker. The input in this step is the suggestion message sent from the server, and the output is the notification to the user. Notification methods include on-screen text display and audio announcements.

[0206] Step 6:

[0207] Users modify their behavior based on the suggestions provided, improving the quality of communication within the family. The input for this step is the displayed suggestions, and the output is the change in the user's behavior. By actively incorporating the suggestions, users can create a more harmonious atmosphere within the family.
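One way to decide in Step 4 when a relaxation suggestion is warranted is to watch the share of recent utterances that the emotion engine labels as tense. The sketch below is an assumption for illustration: the class `TensionMonitor`, the window size, and the threshold are hypothetical, and the suggestion text is the example message given in the description.

```python
from collections import deque

SUGGESTION = "Let's find a fun topic to talk about and relax."


class TensionMonitor:
    """Emit a suggestion when tension dominates the most recent utterances."""

    def __init__(self, window=5, threshold=0.6):
        self.recent = deque(maxlen=window)  # True for each tense utterance
        self.threshold = threshold

    def update(self, emotion):
        """Record one emotion label; return a suggestion or None."""
        self.recent.append(emotion == "tension")
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.threshold:
            self.recent.clear()  # avoid repeating the same suggestion immediately
            return SUGGESTION
        return None
```

Clearing the window after a suggestion is a design choice in this sketch, so that the family is not prompted again until new evidence of tension accumulates.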

[0208] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0209] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0210] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0211] [Second Embodiment]

[0212] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0213] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0214] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0215] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0216] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0217] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0218] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0219] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0220] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0221] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0222] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0223] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0224] This invention is a system that improves the quality of online meetings by analyzing audio data from online meetings using a communication network. In one embodiment, a server acquires audio data in real time via the API of an online meeting platform. The audio data is converted into text data using a speech recognition engine. This text data is analyzed by a natural language processing (NLP) model on the terminal to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the topic is derailed.
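The disclosure does not state how the degree of topic deviation is computed. As one simple stand-in, the sketch below uses the Jaccard distance between the word set of an agenda and the word set of an utterance. The functions `tokens` and `topic_deviation` and the choice of Jaccard distance are assumptions made for illustration. An NLP model would normally use richer semantic features.

```python
import re


def tokens(text):
    """Lower-case alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def topic_deviation(agenda, utterance):
    """Jaccard distance in [0, 1]: 0 means identical word sets, 1 means no overlap."""
    a, u = tokens(agenda), tokens(utterance)
    union = a | u
    if not union:
        return 0.0  # two empty texts are treated as not deviating
    return 1.0 - len(a & u) / len(union)
```

With the agenda "budget review q3", the utterance "q3 budget" shares two of three distinct words, while "lunch plans" shares none and receives the maximum deviation.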

[0225] Based on the analysis results, the device generates personalized feedback for each participant. This feedback includes advice on effective conversation styles and points out the importance of listening to others' opinions. For multinational teams, it takes cultural differences into account and provides specific communication improvement suggestions to prevent misunderstandings.

[0226] Users can receive real-time feedback from their devices, review it during the meeting, and use it to make improvements. This increases meeting productivity and encourages more active and collaborative engagement from participants.

[0227] As a concrete example, consider a scenario where a multinational team is exchanging opinions online. Suppose the server acquires audio data and instantly performs analysis, determining that a participant tends to frequently interrupt others. In this case, the device immediately generates a message stating, "It is important to respect the time other participants have to speak," and notifies the user. This allows the participant to reconsider their conversation style, leading to smoother discussion.

[0228] Thus, the present invention aims to improve the efficiency of meeting management and the communication skills of participants, thereby contributing to increased productivity for companies.

[0229] The following describes the processing flow.

[0230] Step 1:

[0231] The server acquires audio data in real time via the conferencing platform's API. It establishes a connection to accurately capture the audio stream and stores the acquired audio data in a buffer for processing.

[0232] Step 2:

[0233] The server sends the audio data stored in the buffer to the speech recognition engine, where it is converted into text data. The speech recognition engine uses speaker separation technology to identify each speaker and adds this information to the text as metadata.

[0234] Step 3:

[0235] The terminal receives the converted text data and inputs it into a natural language processing (NLP) model. The NLP model analyzes the text and evaluates the content of the statements, emotions, tone, etc. It detects specific keywords and phrases and measures the level of activity in the exchange of opinions, the frequency of interruptions, and the degree to which the discussion deviates.

[0236] Step 4:

[0237] Based on the NLP analysis results, the server evaluates the overall meeting situation and generates personalized feedback for each participant. This feedback includes suggestions for improving the timing and expression of their contributions.

[0238] Step 5:

[0239] The device notifies the user of feedback generated in real time. These notifications are provided as on-screen pop-ups or audio guides, allowing the user to immediately understand areas for improvement.

[0240] Step 6:

[0241] Based on the feedback, users can review their speaking style and their attitude towards listening to others' opinions. Especially in multinational teams, culturally adapted advice can help them avoid misunderstandings and practice effective communication.
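The buffering mentioned in Step 1 can be illustrated by a small byte buffer that accumulates incoming stream chunks and releases fixed-size frames for the speech recognition engine. The class `AudioBuffer` and the parameter `frame_bytes` are hypothetical. The actual buffer layout and frame size are not specified in the description.

```python
class AudioBuffer:
    """Accumulate streamed audio bytes and release complete fixed-size frames."""

    def __init__(self, frame_bytes):
        self.frame_bytes = frame_bytes
        self._buf = bytearray()

    def push(self, chunk):
        """Append a chunk and return the list of complete frames now available."""
        self._buf.extend(chunk)
        frames = []
        while len(self._buf) >= self.frame_bytes:
            frames.append(bytes(self._buf[:self.frame_bytes]))
            del self._buf[:self.frame_bytes]
        return frames
```

Leftover bytes that do not fill a frame stay in the buffer until the next chunk arrives, so no audio is dropped between network packets.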

[0242] (Example 1)

[0243] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0244] Improving the quality of communication among participants in online meetings is crucial. This is especially true in multinational teams and other settings with diverse cultural backgrounds, where discussions can easily become disorganized and drift off topic. Furthermore, frequent interruptions by participants often reduce the efficiency of the discussion and ultimately lower productivity. Therefore, there is a need for systems that can resolve these issues in real time and promote smooth communication.

[0245] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0246] In this invention, the server includes means for acquiring audio data via a communication network, means for converting the audio data into text, and means for analyzing the text data to evaluate the level of activity in discussions and the frequency of interruptions. This enables accurate evaluation of communication and provision of feedback even in meetings involving multinational teams. In particular, by evaluating how often participants interrupt others and suggesting areas for improvement in real time, smooth discussion progress is facilitated.

[0247] A "communication network" is a system in which multiple computers and devices are connected to send and receive data.

[0248] "Audio data" refers to data recorded in digital format, such as audio recordings used in meetings.

[0249] "Speech recognition means" refers to a technology that receives speech as input and converts it into text data.

[0250] "Text data" refers to digital information expressed in characters, and is structured as documents or strings of characters.

[0251] "Natural language processing means" refers to technologies for analyzing text data and understanding and evaluating its content.

[0252] "Feedback" refers to evaluation and suggestion information provided by the system to the user, including suggestions for improving the user's behavior.

[0253] "Personalized feedback" refers to feedback that is customized to each individual user, providing advice tailored to their specific situation.

[0254] "Culturally adapted feedback" refers to feedback that takes into account the characteristics of a multinational team and includes suggestions to avoid misunderstandings based on cultural backgrounds.

[0255] "Real-time" refers to a mode of operation in which processing and the provision of information occur almost simultaneously, that is, with virtually no delay.

[0256] "Visualization means" refers to a display device or technology that presents data and analysis results to users in an easily understandable way.

[0257] This invention is a system for improving the quality of communication in online meetings. Specific embodiments for carrying out this invention are described below.

[0258] The server acquires audio data in real time via the communication network using the API of the online meeting platform. This function lets the server acquire audio streams directly through the standard API of a general-purpose online meeting service, without depending on the products or technologies of any specific company.

[0259] The audio data is converted into text data on the server using speech recognition software. Specifically, services like Google Cloud Speech-to-Text can be used to convert speech to text accurately and quickly.

[0260] The terminal that receives the text data analyzes this data using an advanced natural language processing (NLP) model. For example, one of OpenAI's generative AI models is used to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the topic deviates.

[0261] Based on the analysis results, the device generates personalized feedback for each participant. This feedback is notified to participants in real time, suggesting areas for improvement during online meetings. Because culturally adapted feedback is also provided, taking cultural backgrounds into consideration, it is expected to be particularly effective in multinational teams.

[0262] As a concrete example, consider a project meeting involving a multinational team. The server collects audio data and analyzes each participant's statements using natural language processing. For example, if a tendency for some participants to talk excessively is detected, the terminal generates feedback such as "It is important to listen to the opinions of other members" and notifies them quickly. This system encourages real-time awareness and facilitates collaborative communication among participants.

[0263] An example of input to the generative AI model is: "Analyze the following meeting audio data and evaluate the communication styles of the participants. Based on the analysis results, generate improvement suggestions for each participant." This system allows users to improve the quality of meetings and increase productivity.

[0264] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0265] Step 1:

[0266] The server retrieves audio data in real time from the API of the online meeting platform via the communication network. The input is an audio stream from the online meeting, and the output is raw audio data collected on the server. Specifically, the server continuously calls the API and receives data in stream format.

[0267] Step 2:

[0268] The server passes the acquired audio data to speech recognition software, which converts it into text data. The input is the audio data obtained in step 1, and the output is the data with its content converted into text format. Specifically, the server calls a speech recognition engine such as Google Cloud Speech-to-Text, which analyzes the audio and generates the corresponding text.

[0269] Step 3:

[0270] The terminal receives text data sent from the server and processes it through a natural language processing model. The input is text data converted from speech, and the output is an analysis result including the level of activity in the meeting, the frequency of interruptions, and the degree of topic digression. Specifically, the terminal launches a generative AI model, inputs the text data, and evaluates the communication style of the meeting.

[0271] Step 4:

[0272] The device generates personalized feedback for each participant based on the analysis results. The input is the analysis results from step 3, and the output is a customized feedback message for each participant. Specifically, the device extracts key points from the generated information and creates feedback that includes individual improvement suggestions and advice.

[0273] Step 5:

[0274] The user receives feedback from the device in real time. The input is the feedback message generated in step 4, and the output is the notification information the user receives. Specifically, the device immediately sends the feedback to the user's device, making it available for viewing via pop-up notifications or dashboard displays.

[0275] This series of steps allows users to receive real-time feedback during meetings and effectively improve their communication style.
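
The five steps above can be sketched as a simple loop. This is a hedged outline, not the implementation of the invention: the function `run_pipeline` and its callable parameters are hypothetical placeholders for the speech recognition engine, the natural language processing model, the feedback generator, and the notification channel.

```python
from typing import Callable, Iterable


def run_pipeline(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    analyze: Callable[[str], dict],
    make_feedback: Callable[[dict], str],
    notify: Callable[[str], None],
) -> int:
    """Run steps 2 to 5 for each audio chunk from step 1; return notifications sent."""
    sent = 0
    for chunk in audio_chunks:            # step 1: audio arrives as a stream
        text = transcribe(chunk)          # step 2: speech -> text
        if not text:
            continue                      # nothing recognised in this chunk
        result = analyze(text)            # step 3: NLP analysis
        message = make_feedback(result)   # step 4: feedback generation
        if message:
            notify(message)               # step 5: real-time notification
            sent += 1
    return sent


if __name__ == "__main__":
    outbox: list[str] = []
    count = run_pipeline(
        audio_chunks=[b"hello there", b"", b"world"],
        transcribe=lambda c: c.decode(),              # stand-in for a speech recogniser
        analyze=lambda t: {"words": len(t.split())},  # stand-in for the NLP model
        make_feedback=lambda r: f"{r['words']} word(s) heard",
        notify=outbox.append,                         # stand-in for a pop-up or dashboard
    )
    print(count, outbox)
```

Passing each stage in as a callable keeps the sketch independent of any particular service; a real deployment would substitute its own engine for each stand-in.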

[0276] (Application Example 1)

[0277] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0278] In online seminars and meetings involving participants from diverse cultures and backgrounds, a lack of real-time feedback on the level of conversational activity and tangents can hinder productive communication. Furthermore, the failure to provide individualized feedback can hinder the smooth progress of discussions. Therefore, improving the quality of communication by analyzing participants' contributions and providing immediate feedback is crucial.

[0279] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0280] In this invention, the server includes means for acquiring acoustic information via a communication path, a recognition device for converting the acoustic information into text information, and a language processor for analyzing the converted text information and evaluating the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation. This enables real-time feedback to be provided to participants, improving the quality of conversation and facilitating smooth discussion.

[0281] "Communication path" refers to the path or network configuration for transmitting data, enabling the transmission and reception of voice and text information.

[0282] "Acoustic information" refers to digital or analog audio data including human voices and sounds.

[0283] "Recognizer" is a device that converts acoustic information into text information and has the role of converting voice into text using speech recognition technology.

[0284] "Text information" refers to data expressed as text, mainly the result of converting acoustic information into text format.

[0285] "Language processor" is a device that analyzes text information and evaluates the activity level of information exchange, the frequency of interruptions, and the degree of deviation from the topic.

[0286] "Activity level of information exchange" is an indicator showing how active conversations or discussions are.

[0287] "Frequency of interruptions" is an indicator showing how often participants interrupt each other's speech during a conversation.

[0288] "Degree of deviation from the topic" is an indicator showing how much a discussion or conversation has deviated from the original topic.

[0289] "Response generation device" is a device that generates feedback based on the analysis results and notifies the user in real time.

[0290] "Multicultural group" refers to a group formed by individuals with different cultural backgrounds or nationalities.

[0291] "Culture-adapted response" is feedback provided in a form that takes into account the members of a multicultural group and considers different cultural backgrounds.
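
For illustration, the three indicators defined above could be computed from time-stamped, speaker-labelled text as follows. The specification does not give formulas, so every definition here is an assumption: activity level is taken as speaker turns per minute, an interruption as a segment that starts before a different speaker's previous segment has ended, and topic deviation as the fraction of segments containing no agenda keyword. The names `Segment`, `activity_level`, `interruption_count`, and `topic_deviation` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float    # seconds from the start of the meeting
    text: str


def _words(text: str) -> set[str]:
    """Lower-cased word set of a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def activity_level(segments: list[Segment]) -> float:
    """Speaker turns per minute over the covered time span (assumed definition)."""
    if not segments:
        return 0.0
    ordered = sorted(segments, key=lambda s: s.start)
    duration = max(s.end for s in ordered) - ordered[0].start
    if duration <= 0:
        return 0.0
    turns = 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if a.speaker != b.speaker)
    return turns * 60.0 / duration


def interruption_count(segments: list[Segment]) -> int:
    """Count segments that begin while a different speaker is still talking."""
    ordered = sorted(segments, key=lambda s: s.start)
    return sum(
        1
        for prev, cur in zip(ordered, ordered[1:])
        if cur.speaker != prev.speaker and cur.start < prev.end
    )


def topic_deviation(segments: list[Segment], agenda_keywords: set[str]) -> float:
    """Fraction of segments that mention none of the agenda keywords."""
    if not segments:
        return 0.0
    off_topic = sum(1 for s in segments if not (_words(s.text) & agenda_keywords))
    return off_topic / len(segments)
```

A keyword test is a deliberately crude stand-in for the language processor; a learned model could replace `topic_deviation` without changing the other two functions.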

[0292] In the system for realizing this invention, a server acquires acoustic information via a communication path. The server converts the acoustic information into text information using a recognition device, and the results are analyzed by a language processor. The language processor evaluates the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation, and based on the analysis results, a response generation device generates feedback for the user. The feedback is notified to the user's terminal in real time.

[0293] The "Google Cloud Speech-to-Text" API is useful as the recognition device used here. Natural language processing models, such as those available through "Hugging Face Transformers", are applied to the language processor. "Firebase Cloud Messaging" is used to send responses, providing immediate notifications to the user.

[0294] For example, in a scenario where a multicultural group participates in an international online seminar, the server can analyze each participant's comments in real time and, if deviations from a specific topic or frequent interruptions are detected, generate feedback such as "You're going off-topic. Let's return to the original topic" and notify the user's device, thereby facilitating smooth discussion.

[0295] An example of a prompt to the generative AI model is, "Summarize the topic of the current conversation and generate advice to keep the discussion focused." In this way, an embodiment that improves the quality of communication is realized.

[0296] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0297] Step 1:

[0298] The server acquires audio information from the online meeting via the communication path. In this process, the server receives the audio stream in real time using the meeting platform's API. The input is the meeting's audio data, and the output is the acquired raw audio information.

[0299] Step 2:

[0300] The server uses a recognizer to convert the acquired acoustic information into character information. Specifically, it uses the "Google Cloud Speech-to-Text" API to convert this acoustic information into text data. The input is raw acoustic information, and the output is character information converted from the speech.

[0301] Step 3:

[0302] The server analyzes the converted character information with a language processor. Here, natural language processing models such as "Hugging Face Transformers" are used to evaluate the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation. The input is character information, and the output is an evaluation index based on the analysis result.

[0303] Step 4:

[0304] Based on the analysis result, the server uses a response generation device to generate feedback for the user. So that it is useful to each user, the content of the generated feedback varies according to the evaluation index and the type of each participant. The input is the evaluation index, and the output is the generated feedback data.

[0305] Step 5:

[0306] The server uses "Firebase Cloud Messaging" to notify the user's terminal in real time of the generated feedback. In this process, appropriate messaging is performed so that the feedback reaches the user immediately. The input is the feedback data, and the output is the notification sent to the terminal.

[0307] Step 6:

[0308] The user receives the feedback provided and uses it to improve communication during the meeting. Specific actions include checking notifications and adjusting the flow of the conversation as needed. Input is the notifications displayed on the device, and output is the user's adjustments to their actions.
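
As a sketch of the notification in step 5, the feedback could be wrapped in a message body for "Firebase Cloud Messaging". The field layout below follows the commonly documented FCM HTTP v1 message shape (a `message` object with `token`, `notification`, and string-valued `data`), which is an assumption about the deployment rather than something the specification states. The function name `build_fcm_message` and the device token are hypothetical, and the sketch only builds the JSON body without sending it.

```python
import json


def build_fcm_message(device_token: str, feedback: str, metrics: dict[str, float]) -> str:
    """Build a JSON request body carrying the feedback and its evaluation indices."""
    message = {
        "message": {
            "token": device_token,  # registration token of the user's terminal
            "notification": {"title": "Meeting feedback", "body": feedback},
            # Values in "data" are sent as strings, so numbers are converted.
            "data": {name: str(value) for name, value in metrics.items()},
        }
    }
    return json.dumps(message, ensure_ascii=False)


if __name__ == "__main__":
    print(
        build_fcm_message(
            "EXAMPLE_DEVICE_TOKEN",  # placeholder, not a real token
            "You're going off-topic. Let's return to the original topic.",
            {"topic_deviation": 0.5},
        )
    )
```

Sending the body requires an authenticated request to the messaging service, which is omitted here.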

[0309] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0310] The present invention is a system equipped with speech recognition means that acquires voice data via a communication network and converts that voice data into text. Furthermore, this system combines natural language processing means that analyzes the converted text data with an emotion engine that recognizes the user's emotions. This makes it possible to evaluate the level of activity in the exchange of opinions among users during a meeting, the frequency of interruptions, the degree to which the topic deviates, and changes in emotions.

[0311] The server acquires audio data in real time and converts it into text data via a speech recognition engine. In addition, an emotion engine analyzes the audio data and, if necessary, video data to determine the user's emotional state. The terminal receives this data and uses a natural language processing model to analyze the content of the speech. By also taking into account the output of the emotion engine, feedback is generated that takes into account the user's emotional state during the meeting.

[0312] This feedback is communicated to the user in real time, providing advice tailored to the meeting's progress and the user's emotional state. For example, if the emotion engine detects that a user is experiencing tension or stress during a meeting, the device will provide feedback such as, "Relax and prepare to listen to the next opinion." This allows users to communicate more effectively and improve meeting productivity.

[0313] In multinational teams, feedback can be provided that takes into account cultural differences in emotional expression. To prevent misunderstandings among users with different cultural backgrounds, the device suggests appropriate communication styles, enabling smoother discussions.

[0314] The system configured in this way combines speech recognition, natural language processing, and an emotion engine to provide feedback that goes beyond conventional meeting analysis, thereby improving the quality of communication within companies.

[0315] The following describes the processing flow.

[0316] Step 1:

[0317] The server retrieves audio data from the conferencing platform in real time. It connects via an API, converts the retrieved audio data into an appropriate format for analysis, and prepares it for streaming.

[0318] Step 2:

[0319] The server uses a speech recognition engine to convert the audio data into text data. Here, speaker separation technology is used to identify each participant's speech and transcribe it along with their metadata.

[0320] Step 3:

[0321] The emotion engine analyzes acquired audio and video data (if available) to classify the user's emotions into multiple categories (joy, tension, excitement, etc.). It evaluates the emotional state at each point in time in real time and records the results.

[0322] Step 4:

[0323] The terminal passes the converted text data and the emotion evaluation data from the emotion engine to a natural language processing model. The model understands what is being said, analyzes the tone, evaluates the activity level and interruptions in the meeting, and grasps the overall progress of the meeting.

[0324] Step 5:

[0325] The server generates personalized feedback for each user based on the results of natural language processing and sentiment analysis. This feedback includes advice tailored to changes in emotions and suggestions for improving communication styles.

[0326] Step 6:

[0327] The device notifies the user of the generated feedback in real time. The notification is displayed on the screen or provided as an audio guide, allowing the user to adjust their actions accordingly.

[0328] Step 7:

[0329] Users receive feedback and review their own statements and emotional states. In multinational teams, feedback that takes cultural backgrounds into account is used to facilitate smoother intercultural communication.
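
The way the emotion engine's output and the meeting analysis might be combined in step 5 can be sketched with simple rules. This is an illustrative assumption only: the specification leaves the combination to the model, and the thresholds `INTERRUPTION_LIMIT` and `SHARE_LIMIT`, the priority order, and the function name `select_feedback` are hypothetical. The messages are the examples quoted elsewhere in this description.

```python
from typing import Optional

INTERRUPTION_LIMIT = 2.0  # interruptions per minute (hypothetical threshold)
SHARE_LIMIT = 0.5         # fraction of total speaking time (hypothetical threshold)


def select_feedback(
    emotion: str,
    interruptions_per_min: float,
    speaking_share: float,
) -> Optional[str]:
    """Pick one feedback message, giving the emotional state priority."""
    if emotion == "tension":
        return "Relax and prepare to listen to the next opinion."
    if interruptions_per_min >= INTERRUPTION_LIMIT:
        return "It is important to respect the time other participants have to speak."
    if speaking_share >= SHARE_LIMIT:
        return "It is important to listen to the opinions of other members."
    return None  # no notification needed


if __name__ == "__main__":
    print(select_feedback("tension", 0.0, 0.1))
    print(select_feedback("joy", 3.0, 0.1))
    print(select_feedback("joy", 0.0, 0.2))
```

Returning `None` when no rule fires avoids notifying users who need no advice.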

[0330] (Example 2)

[0331] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0332] In modern business meetings, it is essential to accurately grasp the level of participation and emotional state of attendees and provide real-time feedback. Furthermore, international meetings are prone to misunderstandings and inefficient communication due to cultural differences. Traditional systems have failed to adequately address these challenges.

[0333] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0334] In this invention, the server includes means for acquiring voice information via a communication network, voice recognition means for converting the voice information into text, natural language processing means for analyzing the converted text information and evaluating the level of conversational activity, the frequency of interruptions, and the degree of topic deviation, and emotion recognition means for analyzing the user's emotional state. This makes it possible to accurately grasp the speaking attitudes and emotions of participants during a meeting in real time and provide accurate feedback and advice. Furthermore, by providing culture-appropriate feedback, it is possible to reduce misunderstandings and inefficient communication between cultures.

[0335] A "communication network" is a network system that connects multiple devices and systems, enabling the transmission and reception of voice information and data.

[0336] "Audio information" refers to a data format used to record and transmit human speech and sounds.

[0337] "Speech recognition means" refers to technologies and devices that analyze acquired speech information to understand the speaker's intent and convert it into text information.

[0338] "Text information" refers to information consisting of a series of characters or words that have been converted by speech recognition technology.

[0339] "Natural language processing means" refers to techniques and methods that analyze text information to understand its context and meaning, and to evaluate the structure and characteristics of conversations.

[0340] "Emotion recognition means" refers to methods and devices for detecting a person's emotional state from speech or text and evaluating changes in that state.

[0341] "Feedback" refers to instructions and advice provided to the user based on analyzed information.

[0342] "Intercultural communication styles" refer to the unique ways and methods that people from different cultural backgrounds use to communicate.

[0343] This invention is a system that improves the efficiency and effectiveness of meetings by acquiring audio information during meetings in real time, analyzing it, and providing feedback. The information flow between the server, terminals, and users will be described in detail.

[0344] The server acquires audio information from users in real time via the communication network. In this process, audio recorded by each conference participant's terminal is sent to the server in streaming format. The acquired audio information is immediately converted into text information by a speech recognition system. This speech recognition uses a speech recognition engine (for example, a general-purpose speech recognition API).

[0345] Next, the terminal receives text information sent from the server and performs analysis using natural language processing. During this process, a generative AI model (e.g., a natural language processing engine) is used to identify the content, category, and characteristics of the conversation between users. Subsequently, emotion recognition is used to determine the user's emotional state from voice tone and other nonverbal indicators.

[0346] Based on the analysis results, the device can generate feedback and notify the user in real time. For example, if a participant shows signs of stress during a meeting, the device can provide specific advice such as, "Calm down, gather your thoughts, and then speak next time."

[0347] Furthermore, in multinational teams, culturally sensitive feedback is provided to ensure that participants from different cultural backgrounds can communicate without misunderstanding. For example, feedback can be customized to suit different communication styles across cultures.

[0348] Examples of prompts include specific instructions such as, "Analyze the statements made in this meeting and generate culture-aware feedback as needed."
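
One possible way to customize feedback to different communication styles is a lookup keyed by the detected issue and a style preference. This is a sketch under stated assumptions: the specification does not say how styles are represented, so the `style` field (here "direct" or "indirect", imagined as a per-user setting), the `TEMPLATES` table, and the function `adapt_feedback` are all hypothetical. The "indirect" wordings are illustrative and not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    style: str = "direct"  # hypothetical preference: "direct" or "indirect"


TEMPLATES = {
    ("off_topic", "direct"): "You're going off-topic. Let's return to the original topic.",
    ("off_topic", "indirect"): "It may help to revisit the original topic when convenient.",
    ("stress", "direct"): "Calm down, gather your thoughts, and then speak next time.",
    ("stress", "indirect"): "Taking a moment to gather your thoughts before speaking may help.",
}


def adapt_feedback(issue: str, profile: Profile) -> str:
    """Return the wording for an issue in the user's style, else the direct wording."""
    template = TEMPLATES.get((issue, profile.style)) or TEMPLATES.get((issue, "direct"))
    if template is None:
        raise KeyError(f"no feedback template for issue {issue!r}")
    return template


if __name__ == "__main__":
    print(adapt_feedback("off_topic", Profile("A", "indirect")))
    print(adapt_feedback("stress", Profile("B")))
```

A fixed table is only a stand-in; the description contemplates a generative AI model producing the adapted wording from a prompt.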

[0349] Thus, the present invention provides an advanced feedback system that combines speech recognition, natural language processing, and emotion recognition, thereby supporting participants in conducting effective meetings.

[0350] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0351] Step 1:

[0352] The server acquires voice information from the user via a communication network. The input is audio streamed in real time from the user's terminal. This voice information is recorded in digital format for future analysis.

[0353] Step 2:

[0354] The server sends the acquired audio information to the speech recognition engine for conversion into text. The speech recognition engine analyzes the input audio data using a language model and converts it into text format. The output is text information, recording the spoken content and corresponding timestamps.

[0355] Step 3:

[0356] The server passes the converted text information to an emotion recognition system to analyze the user's emotional state. The input consists of text information and, if necessary, non-verbal characteristics of speech. This analysis identifies emotional indicators (e.g., joy, anger, tension) and outputs them as the emotional state.

[0357] Step 4:

[0358] The terminal uses the text information and emotional states received from the server to perform natural language processing. Using a natural language processing model, it analyzes the input text and extracts the conversation structure, context, and keywords. The output consists of the analyzed conversation features and categories of user statements.

[0359] Step 5:

[0360] The device generates feedback for the user based on the results of natural language processing and emotion recognition. The input is the results of the preceding analysis, and an AI model takes these results into account when creating the feedback. The output is a customized feedback message tailored to the individual user.

[0361] Step 6:

[0362] The device notifies the user of the generated feedback in real time. The feedback appears on the screen as a dialog or pop-up, providing the user with immediate advice. This process allows the user to instantly incorporate the feedback into their meeting communication.
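
Because step 2 records timestamps with the text and step 3 outputs emotional states, the two can be joined by time before step 4. The following is a minimal sketch, assuming each emotional state is a time-stamped sample per speaker and that an utterance takes the most recent sample at or before its start time; the names `TimedText`, `EmotionSample`, and `attach_emotions` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class TimedText:
    start: float   # seconds; timestamp recorded in step 2
    speaker: str
    text: str


@dataclass
class EmotionSample:
    time: float    # seconds; when the emotional state was estimated
    speaker: str
    label: str     # e.g. "joy", "anger", "tension"


def attach_emotions(
    texts: list[TimedText], samples: list[EmotionSample]
) -> list[tuple[TimedText, str]]:
    """Pair each utterance with its speaker's latest emotion label, or "unknown"."""
    by_speaker: dict[str, list[EmotionSample]] = {}
    for sample in sorted(samples, key=lambda s: s.time):
        by_speaker.setdefault(sample.speaker, []).append(sample)

    paired = []
    for utterance in texts:
        history = by_speaker.get(utterance.speaker, [])
        times = [s.time for s in history]
        index = bisect_right(times, utterance.start)  # samples at or before start
        label = history[index - 1].label if index else "unknown"
        paired.append((utterance, label))
    return paired
```

The paired list is one possible form of the "text information and emotional states" that the terminal receives in step 4.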

[0363] (Application Example 2)

[0364] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0365] In family communication, individual emotions are often not properly recognized, resulting in a decline in the quality of interaction and creating misunderstandings and stress. This is particularly difficult in intercultural families or those with diverse family structures. Solving this problem is essential.

[0366] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0367] In this invention, the server includes means for acquiring sound data, means for converting sound data into text, means for analyzing the text data to evaluate the level of activity in communication, and means for recognizing emotions and tone in a home environment and making suggestions to promote relaxation. This makes it possible to improve the quality of communication within the home, prevent misunderstandings, and promote smooth interaction.

[0368] A "communication path" refers to a network means used to exchange information such as sound and data between other devices and systems.

[0369] "Audio data" refers to audio information captured by a microphone or recording device converted into a signal.

[0370] "Speech recognition means" refers to technologies and devices that analyze sound data and convert it into corresponding strings of characters.

[0371] "Character data" refers to text-formatted data converted by speech recognition technology.

[0372] "Language processing means" refers to technologies that analyze text data and evaluate the level of conversational activity, intervention frequency, and other factors.

[0373] "Analysis" refers to the process of examining acquired data in detail to detect patterns and specific features.

[0374] "Information" refers to data and knowledge that are generated based on the results of analysis and provided to users in a useful format.

[0375] "Personalized information" refers to feedback that is customized to each user's specific needs and emotional state.

[0376] "Adaptive information" refers to feedback that is adjusted to different cultural backgrounds and environments to facilitate appropriate communication.

[0377] "Home environment" refers to the living conditions and surrounding circumstances of an individual's residence.

[0378] "Relaxation-promoting suggestions" refer to providing instructions and advice to help users reduce tension and stress and reach a comfortable mental state.

[0379] The system for implementing this invention centers on collecting and analyzing audio data in real time. The system acquires audio data using a microphone and converts this data into text using speech recognition software (e.g., the Google Cloud Speech-to-Text API). The converted text is then analyzed using language analysis software (e.g., Google Cloud Natural Language API) to evaluate the level of activity and tone of conversations within the home.

[0380] An emotion engine (e.g., IBM Watson Tone Analyzer) analyzes the user's emotions and tone of voice, and generates relaxation suggestions as needed. These suggestions and feedback are communicated to the user through the display and speakers. As a result, it helps ensure smooth and comfortable conversations among family members in the living room.

[0381] The server processes the acquired data in the cloud and provides personalized feedback to each user. For multicultural households, it also provides adaptive feedback to reduce misunderstandings that may arise during intercultural communication.

[0382] For example, if family conversations during dinner are awkward and tense, the system might suggest, "Let's find a fun topic to talk about and relax." An example of a prompt in this case is, "Based on the analysis of the family conversation, please provide advice on how to relax." Thus, the invention can be used to improve communication within the family.

[0383] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0384] Step 1:

[0385] The server receives audio data collected through the microphone. This audio data becomes the input. The audio data is acquired in a compressed format, decoded, and passed on to subsequent processing.

[0386] Step 2:

[0387] The server uses speech recognition software to convert speech data into text data. The input for this step is speech data, and the output is the corresponding text data. During this process, phonological analysis is performed, and its result is used to convert the speech into text.

[0388] Step 3:

[0389] The server passes the text data to language analysis software to analyze the conversation's activity level and sentiment. The input to this step is the text data generated in step 2, and the output is the analysis results, including activity level and sentiment indicators. A natural language processing algorithm evaluates the text and generates feature vectors.

[0390] Step 4:

[0391] The emotion engine determines the user's emotional state based on the analysis results and generates relaxation suggestions as needed. The input for this step is the analysis results from step 3, and the output is a suggestion message. For example, if it is determined that the user is in a state of tension, a message encouraging relaxation will be generated.

[0392] Step 5:

[0393] The terminal notifies the user of suggestions from the server via its display or speaker. The input in this step is the suggestion message sent from the server, and the output is the notification to the user. Notification methods include on-screen text display and audio announcements.

[0394] Step 6:

[0395] Users modify their behavior based on the suggestions provided, improving the quality of communication within the family. The input for this step is the displayed suggestions, and the output is the change in the user's behavior. By actively incorporating the suggestions, users can create a more harmonious atmosphere within the family.
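
Steps 3 and 4 above mention feature vectors and a relaxation suggestion. A toy version can be sketched as follows, assuming a fixed-vocabulary word-count vector and a simple count threshold; the vocabulary `VOCAB`, the set `TENSION_WORDS`, the threshold, and both function names are hypothetical, and a real system would rely on the language analysis software and emotion engine instead.

```python
import re
from collections import Counter
from typing import Optional

VOCAB = ["fun", "tired", "sorry", "angry", "thanks"]  # hypothetical vocabulary
TENSION_WORDS = {"tired", "angry"}                    # hypothetical tension markers
TENSION_THRESHOLD = 2                                 # hypothetical threshold


def feature_vector(text: str, vocab: list[str] = VOCAB) -> list[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in vocab]


def relaxation_suggestion(text: str) -> Optional[str]:
    """Return the example suggestion when enough tension markers are present."""
    vector = feature_vector(text)
    tension = sum(n for word, n in zip(VOCAB, vector) if word in TENSION_WORDS)
    if tension >= TENSION_THRESHOLD:
        return "Let's find a fun topic to talk about and relax."
    return None


if __name__ == "__main__":
    print(feature_vector("I'm tired, so tired. Sorry!"))
    print(relaxation_suggestion("I'm tired, so tired. Sorry!"))
```

The suggestion text is the example given in paragraph [0382]; the terminal would present it through its display or speaker as in step 5.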

[0396] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0397] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0398] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0399] [Third Embodiment]

[0400] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0401] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0402] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0403] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0404] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0405] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0406] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0407] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0408] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0409] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0410] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0411] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0412] This invention is a system that improves the quality of online meetings by analyzing audio data from online meetings using a communication network. In one embodiment, a server acquires audio data in real time via the API of an online meeting platform. The audio data is converted into text data using a speech recognition engine. This text data is analyzed by a natural language processing (NLP) model on the terminal to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the topic is derailed.

[0413] Based on the analysis results, the device generates personalized feedback for each participant. This feedback includes advice on effective conversation styles and points out the importance of listening to others' opinions. For multinational teams, it takes cultural differences into account and provides specific communication improvement suggestions to prevent misunderstandings.

[0414] Users can receive real-time feedback from their devices, review it during the meeting, and use it to make improvements. This increases meeting productivity and encourages more active and collaborative engagement from participants.

[0415] As a concrete example, consider a scenario where a multinational team is exchanging opinions online. Suppose the server acquires audio data and instantly performs analysis, determining that a participant tends to frequently interrupt others. In this case, the device immediately generates a message stating, "It is important to respect the time other participants have to speak," and notifies the user. This allows the participant to reconsider their conversation style, leading to smoother discussion.

[0416] Thus, the present invention aims to improve the efficiency of meeting management and the communication skills of participants, thereby contributing to increased productivity for companies.

[0417] The following describes the processing flow.

[0418] Step 1:

[0419] The server acquires audio data in real time via the conferencing platform's API. It establishes a connection to accurately capture the audio stream and stores the acquired audio data in a buffer for processing.

[0420] Step 2:

[0421] The server sends the audio data stored in the buffer to the speech recognition engine, where it is converted into text data. The speech recognition engine uses speaker separation technology to identify each speaker and adds this information to the text as metadata.

[0422] Step 3:

[0423] The terminal receives the converted text data and inputs it into a natural language processing (NLP) model. The NLP model analyzes the text and evaluates the content of the statements, emotions, tone, etc. It detects specific keywords and phrases and measures the level of activity in the exchange of opinions, the frequency of interruptions, and the degree to which the discussion deviates.

[0424] Step 4:

[0425] Based on the NLP analysis results, the server evaluates the overall meeting situation and generates personalized feedback for each participant. This feedback includes suggestions for improving the timing and expression of their contributions.

[0426] Step 5:

[0427] The device notifies the user of feedback generated in real time. These notifications are provided as on-screen pop-ups or audio guides, allowing the user to immediately understand areas for improvement.

[0428] Step 6:

[0429] Based on the feedback, users can review their speaking style and their attitude towards listening to others' opinions. Especially in multinational teams, culturally adapted advice can help them avoid misunderstandings and practice effective communication.
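As a minimal sketch of how the interruption frequency in Steps 2 to 4 might be derived from speaker-labeled text, the following Python assumes that each recognized utterance carries a speaker label plus start and end times. The `Utterance` type, the `count_interruptions` and `feedback_for` functions, the overlap rule (a different speaker starting before the previous speaker has finished), and the threshold value are illustrative assumptions, not part of the disclosure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Utterance:
    """One recognized utterance with speaker metadata (hypothetical structure)."""
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float
    text: str


def count_interruptions(utterances):
    """Count, per speaker, how often they start talking before the
    previous speaker has finished (an assumed definition of interruption)."""
    counts = Counter()
    ordered = sorted(utterances, key=lambda u: u.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            counts[cur.speaker] += 1
    return counts


def feedback_for(counts, threshold=2):
    """Return a message for every speaker at or above the assumed threshold."""
    msg = "It is important to respect the time other participants have to speak."
    return {speaker: msg for speaker, n in counts.items() if n >= threshold}
```

Counting only adjacent utterances keeps the sketch simple; a real system would likely also need a minimum overlap duration so that brief acknowledgments are not treated as interruptions.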

[0430] (Example 1)

[0431] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0432] Improving the quality of communication among participants in online meetings is crucial. This is especially true in multinational teams or other settings with diverse cultural backgrounds, where discussions can easily become disorganized and interruptions can easily lead to tangents. Furthermore, frequent interruptions by participants during meetings often reduce the efficiency of the discussion and ultimately compromise productivity. Therefore, there is a need for systems that can resolve these issues in real time and promote smooth communication.

[0433] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0434] In this invention, the server includes means for acquiring audio data via a communication network, means for converting the audio data into text, and means for analyzing the text data to evaluate the level of activity in discussions and the frequency of interruptions. This enables accurate evaluation of communication and provision of feedback even in meetings involving multinational teams. In particular, by evaluating how often participants interrupt others and suggesting areas for improvement in real time, smooth discussion progress is facilitated.

[0435] A "communication network" is a system in which multiple computers and devices are connected to send and receive data.

[0436] "Audio data" refers to data recorded in digital format, such as audio recordings used in meetings.

[0437] "Speech recognition means" refers to a technology that receives speech as input and converts it into text data.

[0438] "Text data" refers to digital information expressed in characters, and is structured as documents or strings of characters.

[0439] "Natural language processing means" refers to technologies for analyzing text data and understanding and evaluating its content.

[0440] "Feedback" refers to evaluation and suggestion information provided by the system to the user, including suggestions for improving the user's behavior.

[0441] "Personalized feedback" refers to feedback that is customized to each individual user, providing advice tailored to their specific situation.

[0442] "Culturally adapted feedback" refers to feedback that takes into account the characteristics of a multinational team and includes suggestions to avoid misunderstandings based on cultural backgrounds.

[0443] "Real-time" means that processing and information provision occur almost simultaneously, that is, with no perceptible delay.

[0444] "Visualization means" refers to a display device or technology that presents data and analysis results to users in an easily understandable way.

[0445] This invention is a system for improving the quality of communication in online meetings. Specific embodiments for carrying out this invention are described below.

[0446] The server acquires audio data in real time via the communication network using the API of the online meeting platform. Because the server acquires the audio stream directly through a standard API of a generally available online meeting platform, it does not depend on the products or technologies of any specific company.

[0447] The audio data is converted into text data on the server using speech recognition software. Specifically, services like Google Cloud Speech-to-Text can be used to convert speech to text accurately and quickly.

[0448] The terminal that receives the text data analyzes this data using an advanced natural language processing (NLP) model. For example, one of OpenAI's generative AI models is used to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the topic deviates.

[0449] Based on the analysis results, the device generates personalized feedback for each participant. This feedback is notified to participants in real time, suggesting areas for improvement during online meetings. Because culturally adapted feedback is also provided, taking cultural backgrounds into consideration, it is expected to be particularly effective in multinational teams.

[0450] As a concrete example, consider a project meeting involving a multinational team. The server collects audio data and analyzes each participant's statements using natural language processing. For example, if a tendency for some participants to talk excessively is detected, the terminal generates feedback such as "It is important to listen to the opinions of other members" and notifies them quickly. This system encourages real-time awareness and facilitates collaborative communication among participants.

[0451] An example of input to the generative AI model is: "Analyze the following meeting audio data and evaluate the communication styles of the participants. Based on the analysis results, generate improvement suggestions for each participant." This system allows users to improve the quality of meetings and increase productivity.
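An input of this kind could be assembled from the speaker-labeled text as sketched below. The `build_prompt` helper and the `speaker: text` line format are assumptions for illustration. The sketch only builds the string and does not call any model API.

```python
INSTRUCTION = (
    "Analyze the following meeting audio data and evaluate the communication "
    "styles of the participants. Based on the analysis results, generate "
    "improvement suggestions for each participant."
)


def build_prompt(turns):
    """Join the instruction with one 'speaker: text' line per turn.

    `turns` is a list of (speaker, text) pairs; this format is hypothetical.
    Turns whose text is empty after trimming are skipped.
    """
    lines = [f"{speaker}: {text.strip()}" for speaker, text in turns if text.strip()]
    return INSTRUCTION + "\n\n" + "\n".join(lines)
```

Keeping the instruction as a constant and the transcript as data makes it straightforward to swap in other instructions without touching the formatting logic.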

[0452] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0453] Step 1:

[0454] The server retrieves audio data in real time from the API of the online meeting platform via the communication network. The input is an audio stream from the online meeting, and the output is raw audio data collected on the server. Specifically, the server continuously calls the API and receives data in stream format.

[0455] Step 2:

[0456] The server passes the acquired audio data to speech recognition software, which converts it into text data. The input is the audio data obtained in step 1, and the output is the data with its content converted into text format. Specifically, the server calls a speech recognition engine such as Google Cloud Speech-to-Text, which analyzes the audio and generates the corresponding text.

[0457] Step 3:

[0458] The terminal receives text data sent from the server and processes it through a natural language processing model. The input is text data converted from speech, and the output is an analysis result including the level of activity in the meeting, the frequency of interruptions, and the degree of topic digression. Specifically, the terminal launches a generative AI model, inputs the text data, and evaluates the communication style of the meeting.

[0459] Step 4:

[0460] The device generates personalized feedback for each participant based on the analysis results. The input is the analysis results from step 3, and the output is a customized feedback message for each participant. Specifically, the device extracts key points from the generated information and creates feedback that includes individual improvement suggestions and advice.

[0461] Step 5:

[0462] The user receives feedback from the device in real time. The input is the feedback message generated in step 4, and the output is the notification information the user receives. Specifically, the device immediately sends the feedback to the user's device, making it available for viewing via pop-up notifications or dashboard displays.

[0463] This series of steps allows users to receive real-time feedback during meetings and effectively improve their communication style.
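The five steps above can be sketched as a small pipeline. This is an illustration under stated assumptions: `AudioBuffer` and `run_pipeline` are hypothetical names, and the recognition, analysis, feedback, and notification stages are passed in as plain callables so that no particular speech recognition or generative AI service is implied.

```python
from collections import deque


class AudioBuffer:
    """Bounded FIFO of audio chunks (Step 1); oldest chunks are dropped when full."""

    def __init__(self, max_chunks=100):
        self._chunks = deque(maxlen=max_chunks)

    def push(self, chunk: bytes):
        self._chunks.append(chunk)

    def drain(self) -> bytes:
        """Return all buffered audio as one byte string and empty the buffer."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


def run_pipeline(buffer, recognize, analyze, make_feedback, notify):
    """Steps 2 to 5: audio -> text -> analysis -> feedback -> notification."""
    audio = buffer.drain()
    if not audio:
        return None  # nothing buffered yet, so nothing to analyze
    text = recognize(audio)
    analysis = analyze(text)
    feedback = make_feedback(analysis)
    notify(feedback)
    return feedback
```

Passing the stages as callables lets each one be replaced independently, for example running recognition on the server and analysis on the terminal.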

[0464] (Application Example 1)

[0465] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0466] In online seminars and meetings involving participants from diverse cultures and backgrounds, a lack of real-time feedback on the level of conversational activity and tangents can hinder productive communication. Furthermore, the failure to provide individualized feedback can hinder the smooth progress of discussions. Therefore, improving the quality of communication by analyzing participants' contributions and providing immediate feedback is crucial.

[0467] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0468] In this invention, the server includes means for acquiring acoustic information via a communication path, a recognition device for converting the acoustic information into text information, and a language processor for analyzing the converted text information and evaluating the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation. This enables real-time feedback to be provided to participants, improving the quality of conversation and facilitating smooth discussion.

[0469] A "communication path" refers to the route or network configuration used to transmit data, enabling the sending and receiving of voice and text information.

[0470] "Acoustic information" refers to sound data in digital or analog format, including human voices and other sounds.

[0471] A "recognition device" is a device that converts acoustic information into text information, and it plays a role in converting speech into text using speech recognition technology.

[0472] "Textual information" refers to data expressed as text, and is primarily the result of converting acoustic information into text format.

[0473] A "language processor" is a device that analyzes textual information and evaluates the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation.

[0474] "Information exchange activity level" is an indicator that shows how actively conversations and discussions are taking place.

[0475] "Interruption frequency" is an indicator that shows how often participants interrupt each other during a conversation.

[0476] "Degree of topic deviation" is an indicator of how much a discussion or conversation deviates from its original subject.

[0477] A "response generation device" is a device that generates feedback based on analysis results and notifies the user in real time.

[0478] A "multicultural group" refers to a group of individuals with different cultural backgrounds and nationalities.

[0479] A "culturally adapted response" is feedback provided in a way that is considerate of members of a multicultural group, taking into account their different cultural backgrounds.
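To make two of the indicators defined above concrete, the following sketch computes a simple activity level and a simple degree of topic deviation. Both formulas are assumptions chosen for illustration (turns per minute, and the share of utterances that mention no agenda keyword); the disclosure does not specify how the indicators are calculated, and `activity_level` and `topic_deviation` are hypothetical names.

```python
import re


def activity_level(turn_starts, duration_minutes):
    """Turns per minute (assumed proxy for information exchange activity level)."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    return len(turn_starts) / duration_minutes


def topic_deviation(utterances, agenda_keywords):
    """Fraction of utterances that mention none of the agenda keywords.

    0.0 means every utterance touches the agenda; 1.0 means none does.
    """
    keywords = {k.lower() for k in agenda_keywords}
    if not utterances:
        return 0.0
    off_topic = 0
    for text in utterances:
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        if not words & keywords:
            off_topic += 1
    return off_topic / len(utterances)
```

Keyword matching is only a stand-in here; a language model could replace it with a semantic similarity score while keeping the same 0-to-1 output range.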

[0480] In the system for realizing this invention, a server acquires acoustic information via a communication path. The server converts the acoustic information into text information using a recognition device, and the results are analyzed by a language processor. The language processor evaluates the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation, and based on the analysis results, a response generation device generates feedback for the user. The feedback is notified to the user's terminal in real time.

[0481] The "Google Cloud Speech-to-Text" API can be used as the recognition device here. Natural language processing models, such as those available through the "Hugging Face Transformers" library, are applied to the language processor. "Firebase Cloud Messaging" is used to send the responses, providing immediate notifications to the user.

[0482] For example, in a scenario where a multicultural group participates in an international online seminar, the server can analyze each participant's comments in real time and, if deviations from a specific topic or frequent interruptions are detected, generate feedback such as "You're going off-topic. Let's return to the original topic" and notify the user's device, thereby facilitating smooth discussion.

[0483] An example of a prompt to the generative AI model is, "Summarize the topic of the current conversation and generate advice to keep the discussion focused." In this way, an embodiment that improves the quality of communication is realized.

[0484] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0485] Step 1:

[0486] The server acquires audio information from the online meeting via the communication path. In this process, the server receives the audio stream in real time using the meeting platform's API. The input is the meeting's audio data, and the output is the acquired raw audio information.

[0487] Step 2:

[0488] The server uses a recognition system to convert the acquired acoustic information into text information. Specifically, it uses the "Google Cloud Speech-to-Text" API to convert this acoustic information into text data. The input is raw acoustic information, and the output is text information converted from the speech.

[0489] Step 3:

[0490] The server analyzes the converted text information using a language processor. This uses natural language processing models, such as those available through the "Hugging Face Transformers" library, to evaluate the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation. The input is text information, and the output is an evaluation index based on the analysis results.

[0491] Step 4:

[0492] The server uses a response generator based on the analysis results to generate feedback for the user. The generated feedback includes different content depending on the evaluation metrics and the individual participant's type, ensuring it is useful to the user. The input is the evaluation metrics, and the output is the generated feedback data.

[0493] Step 5:

[0494] The server uses Firebase Cloud Messaging to notify the user's device in real time of the feedback it has generated. During this process, appropriate messaging is used to ensure that the feedback reaches the user immediately. The input is the feedback data, and the output is the notification sent to the device.

[0495] Step 6:

[0496] The user receives the feedback provided and uses it to improve communication during the meeting. Specific actions include checking notifications and adjusting the flow of the conversation as needed. Input is the notifications displayed on the device, and output is the user's adjustments to their actions.
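Steps 4 and 5 above can be sketched as a mapping from evaluation metrics to messages followed by serialization of a notification. The threshold values, the metric keys, and the JSON payload shape are assumptions for illustration. The sketch stops at building the payload and does not call Firebase Cloud Messaging or any other push service.

```python
import json

OFF_TOPIC_MSG = "You're going off-topic. Let's return to the original topic."
INTERRUPT_MSG = "It is important to respect the time other participants have to speak."


def select_feedback(metrics, max_deviation=0.5, max_interruptions=3):
    """Map evaluation metrics to feedback messages using assumed thresholds."""
    messages = []
    if metrics.get("topic_deviation", 0.0) > max_deviation:
        messages.append(OFF_TOPIC_MSG)
    if metrics.get("interruptions", 0) > max_interruptions:
        messages.append(INTERRUPT_MSG)
    return messages


def build_notification(user_id, messages):
    """Serialize feedback as a JSON payload (hypothetical shape) for a push service."""
    return json.dumps(
        {"to": user_id, "title": "Meeting feedback", "body": " ".join(messages)}
    )
```

Separating message selection from payload construction means the same feedback could be delivered as a pop-up, a dashboard entry, or an audio guide without changing the rules.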

[0497] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0498] The present invention is a system equipped with speech recognition means that acquires voice data via a communication network and converts that voice data into text. Furthermore, this system combines natural language processing means that analyzes the converted text data with an emotion engine that recognizes the user's emotions. This makes it possible to evaluate the level of activity in the exchange of opinions among users during a meeting, the frequency of interruptions, the degree to which the topic deviates, and changes in emotions.

[0499] The server acquires audio data in real time and converts it into text data via a speech recognition engine. In addition, an emotion engine analyzes the audio data and, if necessary, video data to determine the user's emotional state. The terminal receives this data and uses a natural language processing model to analyze the content of the speech. By also taking into account the output of the emotion engine, feedback is generated that takes into account the user's emotional state during the meeting.

[0500] This feedback is communicated to the user in real time, providing advice tailored to the meeting's progress and the user's emotional state. For example, if the emotion engine detects that a user is experiencing tension or stress during a meeting, the device will provide feedback such as, "Relax and prepare to listen to the next opinion." This allows users to communicate more effectively and improve meeting productivity.

[0501] In multinational teams, feedback can be provided that takes into account cultural differences in emotional expression. To prevent misunderstandings among users with different cultural backgrounds, the device suggests appropriate communication styles, enabling smoother discussions.

[0502] The system configured in this way, through the combination of speech recognition, natural language processing, and an emotion engine, provides advanced feedback that goes beyond conventional meeting analysis, thereby improving the communication value of companies.

[0503] The following describes the processing flow.

[0504] Step 1:

[0505] The server retrieves audio data from the conferencing platform in real time. It connects via an API, converts the retrieved audio data into an appropriate format for analysis, and prepares it for streaming.

[0506] Step 2:

[0507] The server uses a speech recognition engine to convert the audio data into text data. Here, speaker separation technology is used to identify each participant's speech and transcribe it along with their metadata.

[0508] Step 3:

[0509] The emotion engine analyzes acquired audio and video data (if available) to classify the user's emotions into multiple categories (joy, tension, excitement, etc.). It evaluates the emotional state at each point in time in real time and records the results.

[0510] Step 4:

[0511] The terminal passes the converted text data and the emotion evaluation data from the emotion engine to a natural language processing model. The model understands what is being said, analyzes the tone, evaluates the activity level and interruptions in the meeting, and grasps the overall progress of the meeting.

[0512] Step 5:

[0513] The server generates personalized feedback for each user based on the results of natural language processing and emotion analysis. This feedback includes advice tailored to changes in emotions and suggestions for improving communication styles.

[0514] Step 6:

[0515] The device notifies the user of the generated feedback in real time. The notification is displayed on the screen or provided as an audio guide, allowing the user to adjust their actions accordingly.

[0516] Step 7:

[0517] Users receive feedback and review their own statements and emotional states. In multinational teams, feedback that takes cultural backgrounds into account is used to facilitate smoother intercultural communication.
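Steps 3 and 4 above imply that the emotion engine's time-stamped results are combined with the speaker-labeled text before analysis. One way this alignment might be done is sketched below. The tuple layouts and the `attach_emotions` function are hypothetical, and the rule of using the latest emotion sample at or before the start of each utterance is an assumption.

```python
from bisect import bisect_right


def attach_emotions(utterances, samples_by_speaker, default="unknown"):
    """Label each utterance with the speaker's latest emotion sample.

    utterances: list of (start_seconds, speaker, text)
    samples_by_speaker: dict mapping speaker -> list of
        (timestamp_seconds, label), sorted by timestamp
    """
    merged = []
    for start, speaker, text in utterances:
        samples = samples_by_speaker.get(speaker, [])
        times = [t for t, _ in samples]
        # Index of the first sample strictly after the utterance start.
        i = bisect_right(times, start)
        label = samples[i - 1][1] if i > 0 else default
        merged.append(
            {"start": start, "speaker": speaker, "text": text, "emotion": label}
        )
    return merged
```

The merged records can then be passed to the natural language processing model together, so that the feedback reflects both what was said and the emotional state at that moment.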

[0518] (Example 2)

[0519] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0520] In modern business meetings, it is essential to accurately grasp the level of participation and emotional state of attendees and provide real-time feedback. Furthermore, international meetings are prone to misunderstandings and inefficient communication due to cultural differences. Traditional systems have failed to adequately address these challenges.

[0521] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0522] In this invention, the server includes means for acquiring voice information via a communication network, voice recognition means for converting the voice information into text, natural language processing means for analyzing the converted text information and evaluating the level of conversational activity, the frequency of interruptions, and the degree of topic deviation, and emotion recognition means for analyzing the user's emotional state. This makes it possible to accurately grasp the speaking attitudes and emotions of participants during a meeting in real time and provide accurate feedback and advice. Furthermore, by providing culture-appropriate feedback, it is possible to reduce misunderstandings and inefficient communication between cultures.

[0523] A "communication network" is a network system that connects multiple devices and systems, enabling the transmission and reception of voice information and data.

[0524] "Audio information" refers to a data format used to record and transmit human speech and sounds.

[0525] "Speech recognition means" refers to technologies and devices that analyze acquired speech information to understand the speaker's intent and convert it into text information.

[0526] "Text information" refers to information consisting of a series of characters or words that have been converted by speech recognition technology.

[0527] "Natural language processing means" refers to techniques and methods that analyze text information to understand its context and meaning, and to evaluate the structure and characteristics of conversations.

[0528] "Emotion recognition means" refers to methods and devices for detecting a person's emotional state from speech or text and evaluating changes in that state.

[0529] "Feedback" refers to instructions and advice provided to the user based on analyzed information.

[0530] "Intercultural communication styles" refer to the unique ways and methods that people from different cultural backgrounds use to communicate.

[0531] This invention is a system that improves the efficiency and effectiveness of meetings by acquiring audio information during meetings in real time, analyzing it, and providing feedback. The information flow between the server, terminals, and users will be described in detail.

[0532] The server acquires audio information from users in real time via the communication network. In this process, audio recorded by each conference participant's terminal is sent to the server in streaming format. The acquired audio information is immediately converted into text information by a speech recognition system. This speech recognition uses a speech recognition engine (for example, a general-purpose speech recognition API).

[0533] Next, the terminal receives text information sent from the server and performs analysis using natural language processing. During this process, a generative AI model (e.g., a natural language processing engine) is used to identify the content, category, and characteristics of the conversation between users. Subsequently, emotion recognition is used to determine the user's emotional state from voice tone and other nonverbal indicators.

[0534] Based on the analysis results, the device can generate feedback and notify the user in real time. For example, if a participant shows signs of stress during a meeting, the device can provide specific advice such as, "Calm down, gather your thoughts, and then speak next time."

[0535] Furthermore, in multinational teams, culturally sensitive feedback is provided to ensure that participants from different cultural backgrounds can communicate without misunderstanding. For example, feedback can be customized to suit different communication styles across cultures.

[0536] Examples of prompts include specific instructions such as, "Analyze the statements made in this meeting and generate culture-aware feedback as needed."

[0537] Thus, the present invention provides an advanced feedback system that combines speech recognition, natural language processing, and emotion recognition, thereby supporting participants in conducting effective meetings.

[0538] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0539] Step 1:

[0540] The server acquires voice information from the user via a communication network. The input is audio streamed in real time from the user's terminal. This voice information is recorded in digital format for future analysis.

[0541] Step 2:

[0542] The server sends the acquired audio information to the speech recognition engine for conversion into text. The speech recognition engine analyzes the input audio data using a language model and converts it into text format. The output is text information, recording the spoken content and corresponding timestamps.

[0543] Step 3:

[0544] The server passes the converted text information to an emotion recognition system to analyze the user's emotional state. The input consists of text information and, if necessary, non-verbal characteristics of speech. This analysis identifies emotional indicators (e.g., joy, anger, tension) and outputs them as the emotional state.

[0545] Step 4:

[0546] The terminal uses the text information and emotional states received from the server to perform natural language processing. Using a natural language processing model, it analyzes the input text and extracts the conversation structure, context, and keywords. The output consists of the analyzed conversation features and categories of user statements.

[0547] Step 5:

[0548] The device generates feedback for the user based on the results of natural language processing and emotion recognition. The input is the results of previous analysis, and the AI model is used to create the feedback, taking this into consideration. The output is a customized feedback message tailored to the individual user.

[0549] Step 6:

[0550] The device notifies the user of the generated feedback in real time. The feedback appears on the screen as a dialog or pop-up, providing the user with immediate advice. This process allows the user to instantly incorporate the feedback into their meeting communication.
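As a hedged illustration of Step 5, the sketch below turns a detected emotional state into a feedback message and adjusts its phrasing to a communication preference. The `STYLE_TEMPLATES` table, the `direct` and `indirect` style names, and the choice of which emotions trigger advice are all assumptions. The disclosure describes an AI model generating the feedback, whereas this sketch substitutes a fixed lookup to keep the example self-contained.

```python
STRESS_ADVICE = "Calm down, gather your thoughts, and then speak next time."

# Hypothetical phrasing templates keyed by a user-selected communication preference.
STYLE_TEMPLATES = {
    "direct": "{advice}",
    "indirect": "When you have a moment, you may find this helpful: {advice}",
}


def generate_feedback(emotion, style="direct"):
    """Return a feedback message for the detected emotion, or None if none applies."""
    if emotion not in {"tension", "anger"}:
        return None
    # Unknown styles fall back to the direct phrasing.
    template = STYLE_TEMPLATES.get(style, STYLE_TEMPLATES["direct"])
    return template.format(advice=STRESS_ADVICE)
```

Keying the templates on a preference the user selects, rather than on nationality, is a design choice of this sketch that avoids hard-coding cultural assumptions.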

[0551] (Application Example 2)

[0552] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0553] In family communication, individual emotions are often not properly recognized, resulting in a decline in the quality of interaction and creating misunderstandings and stress. This is particularly difficult in intercultural families or those with diverse family structures. Solving this problem is essential.

[0554] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0555] In this invention, the server includes means for acquiring sound data, means for converting sound data into text, means for analyzing the text data to evaluate the level of activity in communication, and means for recognizing emotions and tone in a home environment and making suggestions to promote relaxation. This makes it possible to improve the quality of communication within the home, prevent misunderstandings, and promote smooth interaction.

[0556] A "communication path" refers to a network means used to exchange information such as sound and data between other devices and systems.

[0557] "Audio data" refers to audio information that is captured by a microphone or recording device and converted into a signal.

[0558] "Speech recognition means" refers to technologies and devices that analyze sound data and convert it into corresponding strings of characters.

[0559] "Character data" refers to text-formatted data converted by speech recognition technology.

[0560] "Language processing means" refers to technologies that analyze text data and evaluate the level of conversational activity, intervention frequency, and other factors.

[0561] "Analysis" refers to the process of examining acquired data in detail to detect patterns and specific features.

[0562] "Information" refers to data and knowledge that are generated based on the results of analysis and provided to users in a useful format.

[0563] "Personalized information" refers to feedback that is customized to each user's specific needs and emotional state.

[0564] "Adaptive information" refers to feedback that is adjusted to different cultural backgrounds and environments to facilitate appropriate communication.

[0565] "Home environment" refers to the living conditions and surrounding circumstances of an individual's residence.

[0566] "Relaxation-promoting suggestions" refer to providing instructions and advice to help users reduce tension and stress and reach a comfortable mental state.

[0567] The system for implementing this invention primarily focuses on the function of collecting and analyzing audio data in real time. The system acquires audio data using a microphone and converts this data into text using speech recognition software (e.g., Google Speech-to-Text API). The converted text is then analyzed using language analysis software (e.g., Google Cloud Natural Language API) to evaluate the level of activity and tone of conversations within the home.

[0568] An emotion engine (e.g., IBM Watson Tone Analyzer) analyzes the user's emotions and tone of voice, and generates relaxation suggestions as needed. These suggestions and feedback are communicated to the user through the display and speakers. As a result, it helps ensure smooth and comfortable conversations among family members in the living room.

[0569] The server processes the acquired data in the cloud and provides personalized feedback to each user. For multicultural households, it also provides adaptive feedback to reduce misunderstandings that may arise during intercultural communication.

[0570] For example, if family conversations during dinner are awkward and tense, the system might suggest, "Let's find a fun topic to talk about and relax." In this case, an example of a prompt to the generative AI model is, "Based on the analysis of the family conversation, please provide advice on how to relax." Thus, the invention can be used to improve communication within the family.

[0571] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0572] Step 1:

[0573] The server receives audio data collected through the microphone. This audio data becomes the input. The audio data is acquired in a compressed format, decoded, and passed on to subsequent processing.

[0574] Step 2:

[0575] The server uses speech recognition software to convert speech data into text data. The input for this step is speech data, and the output is the corresponding text data. Phonological analysis is performed during this process, which is then used to convert the data into text.

[0576] Step 3:

[0577] The server passes the text data to language analysis software to analyze the conversation's activity level and sentiment. The input to this step is the text data generated in step 2, and the output is the analysis results, including activity level and sentiment indicators. A natural language processing algorithm evaluates the text and generates feature vectors.

[0578] Step 4:

[0579] The emotion engine determines the user's emotional state based on the analysis results and generates relaxation suggestions as needed. The input for this step is the analysis results from step 3, and the output is a suggestion message. For example, if it is determined that the user is in a state of tension, a message encouraging relaxation will be generated.

[0580] Step 5:

[0581] The terminal notifies the user of suggestions from the server via its display or speaker. The input in this step is the suggestion message sent from the server, and the output is the notification to the user. Notification methods include on-screen text display and audio announcements.

[0582] Step 6:

[0583] Users modify their behavior based on the suggestions provided, improving the quality of communication within the family. The input for this step is the displayed suggestions, and the output is the change in the user's behavior. By actively incorporating the suggestions, users can create a more harmonious atmosphere within the family.
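Steps 3 and 4 above can be illustrated with a minimal sketch. It is not the implementation described in this disclosure: the keyword list `TENSION_WORDS`, the `Analysis` fields, and the threshold value are all hypothetical stand-ins for the language analysis software and the emotion engine, chosen only to show how analysis results could be mapped to a relaxation suggestion.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical keyword list; a real system would rely on a trained emotion model.
TENSION_WORDS = {"stop", "never", "always", "fault", "angry"}


@dataclass
class Analysis:
    activity_level: float  # average number of words per utterance
    tension_score: float   # share of utterances containing a tension word


def analyze(utterances: List[str]) -> Analysis:
    """Step 3 (sketch): derive simple activity and tension indicators from text."""
    if not utterances:
        return Analysis(0.0, 0.0)
    word_lists = [u.lower().split() for u in utterances]
    total_words = sum(len(words) for words in word_lists)
    tense = sum(
        1 for words in word_lists
        if TENSION_WORDS & {w.strip(".,!?") for w in words}
    )
    return Analysis(total_words / len(utterances), tense / len(utterances))


def suggest(analysis: Analysis, tension_threshold: float = 0.5) -> Optional[str]:
    """Step 4 (sketch): return a relaxation suggestion only when tension is high."""
    if analysis.tension_score >= tension_threshold:
        return "Let's find a fun topic to talk about and relax."
    return None
```

For instance, `suggest(analyze(["You never listen!", "Stop it.", "Fine."]))` returns the suggestion message, because two of the three utterances contain a word from the assumed list.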

[0584] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0585] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0586] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0587] [Fourth Embodiment]

[0588] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0589] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0590] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0591] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0592] The microphone 238 receives instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0593] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0594] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0595] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0596] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0597] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0598] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0599] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0600] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0601] This invention is a system that improves the quality of online meetings by analyzing audio data from online meetings using a communication network. In one embodiment, a server acquires audio data in real time via the API of an online meeting platform. The audio data is converted into text data using a speech recognition engine. This text data is analyzed by a natural language processing (NLP) model on the terminal to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the topic is derailed.

[0602] Based on the analysis results, the device generates personalized feedback for each participant. This feedback includes advice on effective conversation styles and points out the importance of listening to others' opinions. For multinational teams, it takes cultural differences into account and provides specific communication improvement suggestions to prevent misunderstandings.

[0603] Users can receive real-time feedback from their devices, review it during the meeting, and use it to make improvements. This increases meeting productivity and encourages more active and collaborative engagement from participants.

[0604] As a concrete example, consider a scenario where a multinational team is exchanging opinions online. Suppose the server acquires audio data and instantly performs analysis, determining that a participant tends to frequently interrupt others. In this case, the device immediately generates a message stating, "It is important to respect the time other participants have to speak," and notifies the user. This allows the participant to reconsider their conversation style, leading to smoother discussion.

[0605] Thus, the present invention aims to improve the efficiency of meeting management and the communication skills of participants, thereby contributing to increased productivity for companies.

[0606] The following describes the processing flow.

[0607] Step 1:

[0608] The server acquires audio data in real time via the conferencing platform's API. It establishes a connection to accurately capture the audio stream and stores the acquired audio data in a buffer for processing.

[0609] Step 2:

[0610] The server sends the audio data stored in the buffer to the speech recognition engine, where it is converted into text data. The speech recognition engine uses speaker separation technology to identify each speaker and adds this information to the text as metadata.

[0611] Step 3:

[0612] The terminal receives the converted text data and inputs it into a natural language processing (NLP) model. The NLP model analyzes the text and evaluates the content of the statements, emotions, tone, etc. It detects specific keywords and phrases and measures the level of activity in the exchange of opinions, the frequency of interruptions, and the degree to which the discussion deviates.

[0613] Step 4:

[0614] Based on the NLP analysis results, the server evaluates the overall meeting situation and generates personalized feedback for each participant. This feedback includes suggestions for improving the timing and expression of their contributions.

[0615] Step 5:

[0616] The device notifies the user of feedback generated in real time. These notifications are provided as on-screen pop-ups or audio guides, allowing the user to immediately understand areas for improvement.

[0617] Step 6:

[0618] Based on the feedback, users can review their speaking style and their attitude towards listening to others' opinions. Especially in multinational teams, culturally adapted advice can help them avoid misunderstandings and practice effective communication.
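The interruption frequency mentioned in steps 2 and 3 can be sketched as follows, assuming the speech recognition engine returns speaker-labeled segments with start and end times. The `Segment` type and the overlap rule (a segment counts as an interruption when it starts before the preceding segment from a different speaker has ended) are assumptions made for illustration; the disclosure does not specify how interruptions are detected.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    speaker: str   # speaker label added as metadata by speaker separation
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # recognized text


def count_interruptions(segments: Iterable[Segment]) -> Counter:
    """Count, per speaker, segments that begin while another speaker is still talking.

    Only the immediately preceding segment is compared, which is a simplification.
    """
    ordered = sorted(segments, key=lambda s: s.start)
    counts: Counter = Counter()
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            counts[cur.speaker] += 1
    return counts
```

Dividing each count by the meeting duration gives a per-participant interruption frequency that the feedback step could compare against a threshold.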

[0619] (Example 1)

[0620] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0621] Improving the quality of communication among participants in online meetings is crucial. This is especially true in multinational teams or other settings with diverse cultural backgrounds, where discussions can easily become disorganized and interruptions can easily lead to tangents. Furthermore, frequent interruptions by participants during meetings often reduce the efficiency of the discussion and ultimately compromise productivity. Therefore, there is a need for systems that can resolve these issues in real time and promote smooth communication.

[0622] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0623] In this invention, the server includes means for acquiring audio data via a communication network, means for converting the audio data into text, and means for analyzing the text data to evaluate the level of activity in discussions and the frequency of interruptions. This enables accurate evaluation of communication and provision of feedback even in meetings involving multinational teams. In particular, by evaluating how often participants interrupt others and suggesting areas for improvement in real time, smooth discussion progress is facilitated.

[0624] A "communication network" is a system in which multiple computers and devices are connected to send and receive data.

[0625] "Audio data" refers to data recorded in digital format, such as audio recordings used in meetings.

[0626] "Speech recognition means" refers to a technology that receives speech as input and converts it into text data.

[0627] "Text data" refers to digital information expressed in characters, and is structured as documents or strings of characters.

[0628] "Natural language processing methods" are technologies for analyzing text data and understanding and evaluating its content.

[0629] "Feedback" refers to evaluation and suggestion information provided by the system to the user, including suggestions for improving the user's behavior.

[0630] "Personalized feedback" refers to feedback that is customized to each individual user, providing advice tailored to their specific situation.

[0631] "Culturally adapted feedback" refers to feedback that takes into account the characteristics of a multinational team and includes suggestions to avoid misunderstandings based on cultural backgrounds.

[0632] "Real-time" refers to processing and information provision taking place almost simultaneously with the event being processed, that is, with practically no delay.

[0633] "Visualization means" refers to a display device or technology that presents data and analysis results to users in an easily understandable way.

[0634] This invention is a system for improving the quality of communication in online meetings. Specific embodiments for carrying out this invention are described below.

[0635] The server acquires audio data in real time via the communication network using the API of the online meeting platform. Because the server acquires the audio stream directly through a standard API, it does not depend on the products or technologies of any specific company's online meeting service.

[0636] The audio data is converted into text data on the server using speech recognition software. Specifically, services like Google Cloud Speech-to-Text can be used to convert speech to text accurately and quickly.

[0637] The terminal that receives the text data analyzes this data using an advanced natural language processing (NLP) model. For example, one of OpenAI's generative AI models is used to evaluate the activity level of the meeting, the frequency of interruptions, and the degree to which the topic deviates.

[0638] Based on the analysis results, the device generates personalized feedback for each participant. This feedback is notified to participants in real time, suggesting areas for improvement during online meetings. Because culturally adapted feedback is also provided, taking cultural backgrounds into consideration, it is expected to be particularly effective in multinational teams.

[0639] As a concrete example, consider a project meeting involving a multinational team. The server collects audio data and analyzes each participant's statements using natural language processing. For example, if a tendency for some participants to talk excessively is detected, the terminal generates feedback such as "It is important to listen to the opinions of other members" and notifies them quickly. This system encourages real-time awareness and facilitates collaborative communication among participants.

[0640] An example of input to the generative AI model is: "Analyze the following meeting audio data and evaluate the communication styles of the participants. Based on the analysis results, generate improvement suggestions for each participant." This system allows users to improve the quality of meetings and increase productivity.
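As a rough illustration of how such an input could be assembled, the sketch below joins the instruction quoted above with a speaker-labeled transcript. The function name `build_prompt`, the `[speaker] text` line format, and the choice to pass recognized text rather than raw audio are assumptions; no call to any particular generative AI service is shown.

```python
from typing import Iterable, Tuple

# Instruction text quoted from the example input above.
INSTRUCTION = (
    "Analyze the following meeting audio data and evaluate the communication "
    "styles of the participants. Based on the analysis results, generate "
    "improvement suggestions for each participant."
)


def build_prompt(utterances: Iterable[Tuple[str, str]]) -> str:
    """Combine the instruction with (speaker, text) pairs, one utterance per line."""
    lines = [f"[{speaker}] {text}" for speaker, text in utterances]
    return INSTRUCTION + "\n\n" + "\n".join(lines)
```

Keeping the speaker label on every line lets the model attribute its suggestions to individual participants.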

[0641] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0642] Step 1:

[0643] The server retrieves audio data in real time from the API of the online meeting platform via the communication network. The input is an audio stream from the online meeting, and the output is raw audio data collected on the server. Specifically, the server continuously calls the API and receives data in stream format.

[0644] Step 2:

[0645] The server passes the acquired audio data to speech recognition software, which converts it into text data. The input is the audio data obtained in step 1, and the output is the data with its content converted into text format. Specifically, the server calls a speech recognition engine such as Google Cloud Speech-to-Text, which analyzes the audio and generates the corresponding text.

[0646] Step 3:

[0647] The terminal receives text data sent from the server and processes it through a natural language processing model. The input is text data converted from speech, and the output is an analysis result including the level of activity in the meeting, the frequency of interruptions, and the degree of topic digression. Specifically, the terminal launches a generative AI model, inputs the text data, and evaluates the communication style of the meeting.

[0648] Step 4:

[0649] The device generates personalized feedback for each participant based on the analysis results. The input is the analysis results from step 3, and the output is a customized feedback message for each participant. Specifically, the device extracts key points from the generated information and creates feedback that includes individual improvement suggestions and advice.

[0650] Step 5:

[0651] The user receives feedback from the device in real time. The input is the feedback message generated in step 4, and the output is the notification information the user receives. Specifically, the device immediately sends the feedback to the user's device, making it available for viewing via pop-up notifications or dashboard displays.

[0652] This series of steps allows users to receive real-time feedback during meetings and effectively improve their communication style.
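The five steps can be summarized in a short sketch in which each stage is passed in as a function. The names `run_pipeline`, `transcribe`, `analyze`, and `notify` are hypothetical; in practice they would wrap the speech recognition engine, the generative AI model, and the notification channel, none of which are called here.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_pipeline(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    analyze: Callable[[str], Dict[str, str]],
    notify: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Process streamed audio chunks and return the (participant, message) pairs sent."""
    sent: List[Tuple[str, str]] = []
    for chunk in audio_chunks:                 # step 1: audio received as a stream
        text = transcribe(chunk)               # step 2: speech to text
        if not text:
            continue                           # nothing recognized in this chunk
        feedback = analyze(text)               # steps 3-4: analysis and feedback
        for participant, message in feedback.items():
            notify(participant, message)       # step 5: real-time notification
            sent.append((participant, message))
    return sent
```

Passing the stages as arguments keeps the sketch independent of any specific vendor, and lets the server and the terminal each supply the stages they are responsible for.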

[0653] (Application Example 1)

[0654] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0655] In online seminars and meetings involving participants from diverse cultures and backgrounds, a lack of real-time feedback on the level of conversational activity and tangents can hinder productive communication. Furthermore, the failure to provide individualized feedback can hinder the smooth progress of discussions. Therefore, improving the quality of communication by analyzing participants' contributions and providing immediate feedback is crucial.

[0656] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0657] In this invention, the server includes means for acquiring acoustic information via a communication path, a recognition device for converting the acoustic information into text information, and a language processor for analyzing the converted text information and evaluating the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation. This enables real-time feedback to be provided to participants, improving the quality of conversation and facilitating smooth discussion.

[0658] A "communication path" refers to the route or network configuration used to transmit data, enabling the sending and receiving of voice and text information.

[0659] "Acoustic information" refers to sound data in digital or analog format, including human voices and other sounds.

[0660] A "recognition device" is a device that converts acoustic information into text information, and it plays a role in converting speech into text using speech recognition technology.

[0661] "Textual information" refers to data expressed as text, and is primarily the result of converting acoustic information into text format.

[0662] A "language processor" is a device that analyzes textual information and evaluates the level of activity in information exchange, the frequency of interruptions, and the degree of topic deviation.

[0663] "Information exchange activity level" is an indicator that shows how actively conversations and discussions are taking place.

[0664] "Interruption frequency" is an indicator that shows how often participants interrupt each other during a conversation.

[0665] "Degree of topic deviation" is an indicator of how much a discussion or conversation deviates from its original subject.

[0666] A "response generation device" is a device that generates feedback based on analysis results and notifies the user in real time.

[0667] A "multicultural group" refers to a group of individuals with different cultural backgrounds and nationalities.

[0668] A "culturally adapted response" is feedback provided in a way that is considerate of members of a multicultural group, taking into account their different cultural backgrounds.
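Two of the indicators defined above can be made concrete with a small sketch. The formulas are assumptions chosen for illustration, since the disclosure does not define how the indicators are computed: the activity level is taken as utterances per minute, and the degree of topic deviation as one minus the Jaccard similarity between the words of the agenda and the words of an utterance.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase word set of a text (simple regex tokenization)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def topic_deviation(agenda: str, utterance: str) -> float:
    """Return 0.0 for identical word sets and 1.0 for no shared words."""
    a, u = tokens(agenda), tokens(utterance)
    if not a or not u:
        return 1.0
    return 1.0 - len(a & u) / len(a | u)


def activity_level(num_utterances: int, duration_seconds: float) -> float:
    """Utterances per minute over the observed interval."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return num_utterances * 60.0 / duration_seconds
```

A word-overlap measure is crude; an embedding-based similarity would capture paraphrases better, but the overall structure of the indicator would stay the same.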

[0669] In the system for realizing this invention, a server acquires acoustic information via a communication path. The server converts the acoustic information into text information using a recognition device, and the results are analyzed by a language processor. The language processor evaluates the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation, and based on the analysis results, a response generation device generates feedback for the user. The feedback is notified to the user's terminal in real time.

[0670] The "Google Cloud Speech-to-Text" API can be used as the recognition device here. Natural language processing models, such as those available through the "Hugging Face Transformers" library, are applied to the language processor. "Firebase Cloud Messaging" is used to send responses, providing immediate notifications to the user.

[0671] For example, in a scenario where a multicultural group participates in an international online seminar, the server can analyze each participant's comments in real time and, if deviations from a specific topic or frequent interruptions are detected, generate feedback such as "You're going off-topic. Let's return to the original topic" and notify the user's device, thereby facilitating smooth discussion.

[0672] An example of a prompt to the generative AI model is, "Summarize the topic of the current conversation and generate advice to keep the discussion focused." In this way, an embodiment that improves the quality of communication is realized.

[0673] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0674] Step 1:

[0675] The server acquires acoustic information from the online meeting via the communication path. In this process, the server receives the audio stream in real time using the meeting platform's API. The input is the meeting's audio data, and the output is the acquired raw acoustic information.

[0676] Step 2:

[0677] The server uses the recognition device to convert the acquired acoustic information into text information. Specifically, it uses the "Google Cloud Speech-to-Text" API to convert this acoustic information into text data. The input is raw acoustic information, and the output is text information converted from the speech.

[0678] Step 3:

[0679] The server analyzes the converted text information using the language processor. This uses natural language processing models, such as those available through the "Hugging Face Transformers" library, to evaluate the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation. The input is text information, and the output is an evaluation index based on the analysis results.

[0680] Step 4:

[0681] Based on the analysis results, the server uses the response generation device to generate feedback for the user. The generated feedback differs in content depending on the evaluation metrics and the individual participant's type, so that it is useful to the user. The input is the evaluation metrics, and the output is the generated feedback data.

[0682] Step 5:

[0683] The server uses Firebase Cloud Messaging to notify the user's device in real time of the feedback it has generated. During this process, appropriate messaging is used to ensure that the feedback reaches the user immediately. The input is the feedback data, and the output is the notification sent to the device.

[0684] Step 6:

[0685] The user receives the feedback provided and uses it to improve communication during the meeting. Specific actions include checking notifications and adjusting the flow of the conversation as needed. Input is the notifications displayed on the device, and output is the user's adjustments to their actions.
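Steps 4 and 5 can be sketched as a rule-based mapping from evaluation metrics to feedback messages, followed by delivery through a sender function. The `Metrics` fields, the threshold values, and the `send` callback are hypothetical; the callback merely stands in for a messaging service such as Firebase Cloud Messaging, whose API is not shown. The two message texts are taken from the examples given in this description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Metrics:
    activity_level: float            # e.g. utterances per minute
    interruptions_per_minute: float
    topic_deviation: float           # 0.0 (on topic) to 1.0 (off topic)


def generate_feedback(
    m: Metrics,
    max_interruptions: float = 1.0,  # assumed threshold
    max_deviation: float = 0.7,      # assumed threshold
) -> List[str]:
    """Step 4 (sketch): turn evaluation metrics into feedback messages."""
    messages: List[str] = []
    if m.topic_deviation > max_deviation:
        messages.append("You're going off-topic. Let's return to the original topic.")
    if m.interruptions_per_minute > max_interruptions:
        messages.append(
            "It is important to respect the time other participants have to speak."
        )
    return messages


def notify_user(user_id: str, messages: List[str], send: Callable[[str, str], None]) -> int:
    """Step 5 (sketch): deliver each message via the supplied sender; return the count."""
    for message in messages:
        send(user_id, message)
    return len(messages)
```

A generative AI model could replace the fixed rules in `generate_feedback`; the thresholds are shown only to make the data flow from metrics to notifications explicit.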

[0686] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0687] The present invention is a system equipped with speech recognition means that acquires voice data via a communication network and converts that voice data into text. Furthermore, this system combines natural language processing means that analyzes the converted text data with an emotion engine that recognizes the user's emotions. This makes it possible to evaluate the level of activity in the exchange of opinions among users during a meeting, the frequency of interruptions, the degree to which the topic deviates, and changes in emotions.

[0688] The server acquires audio data in real time and converts it into text data via a speech recognition engine. In addition, an emotion engine analyzes the audio data and, if necessary, video data to determine the user's emotional state. The terminal receives this data and uses a natural language processing model to analyze the content of the speech. By also taking into account the output of the emotion engine, feedback is generated that takes into account the user's emotional state during the meeting.

[0689] This feedback is communicated to the user in real time, providing advice tailored to the meeting's progress and the user's emotional state. For example, if the emotion engine detects that a user is experiencing tension or stress during a meeting, the device will provide feedback such as, "Relax and prepare to listen to the next opinion." This allows users to communicate more effectively and improve meeting productivity.

[0690] In multinational teams, feedback can be provided that takes into account cultural differences in emotional expression. To prevent misunderstandings among users with different cultural backgrounds, the device suggests appropriate communication styles, enabling smoother discussions.

[0691] The system configured in this way, through the combination of speech recognition, natural language processing, and an emotion engine, provides advanced feedback that goes beyond conventional meeting analysis, thereby improving the value of communication within companies.

[0692] The following describes the processing flow.

[0693] Step 1:

[0694] The server retrieves audio data from the conferencing platform in real time. It connects via an API, converts the retrieved audio data into an appropriate format for analysis, and prepares it for streaming.

[0695] Step 2:

[0696] The server uses a speech recognition engine to convert the audio data into text data. Here, speaker separation technology is used to identify each participant's speech and transcribe it along with their metadata.

[0697] Step 3:

[0698] The emotion engine analyzes acquired audio and video data (if available) to classify the user's emotions into multiple categories (joy, tension, excitement, etc.). It evaluates the emotional state at each point in time in real time and records the results.

[0699] Step 4:

[0700] The terminal passes the converted text data and the emotion evaluation data from the emotion engine to a natural language processing model. The model understands what is being said, analyzes the tone, evaluates the activity level and interruptions in the meeting, and grasps the overall progress of the meeting.

[0701] Step 5:

[0702] The server generates personalized feedback for each user based on the results of natural language processing and emotion analysis. This feedback includes advice tailored to changes in emotions and suggestions for improving communication styles.

[0703] Step 6:

[0704] The device notifies the user of the generated feedback in real time. The notification is displayed on the screen or provided as an audio guide, allowing the user to adjust their actions accordingly.

[0705] Step 7:

[0706] Users receive feedback and review their own statements and emotional states. In multinational teams, feedback that takes cultural backgrounds into account is used to facilitate smoother intercultural communication.
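One possible way to combine the emotion engine's output with culturally adapted wording is a lookup keyed by the detected emotion and a communication-style preference, sketched below. The style labels `"direct"` and `"indirect"`, the table `TEMPLATES`, and the second message are hypothetical illustrations; only the first message comes from this description, and the disclosure does not state how cultural adaptation is implemented.

```python
from typing import Dict, Optional, Tuple

DEFAULT_STYLE = "direct"

# (emotion category, communication style) -> feedback text
TEMPLATES: Dict[Tuple[str, str], str] = {
    ("tension", "direct"): "Relax and prepare to listen to the next opinion.",
    # Hypothetical softer wording for users who prefer indirect phrasing.
    ("tension", "indirect"): "It may help to take a short breath before the next opinion is shared.",
}


def adapt_feedback(emotion: str, style: str) -> Optional[str]:
    """Return feedback for the emotion in the requested style.

    Falls back to the default style when the requested style has no template,
    and returns None when the emotion calls for no feedback at all.
    """
    return TEMPLATES.get((emotion, style)) or TEMPLATES.get((emotion, DEFAULT_STYLE))
```

Keying the table by a per-user style preference rather than by nationality is a design choice made here to avoid assuming that all members of one culture communicate the same way.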

[0707] (Example 2)

[0708] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0709] In modern business meetings, it is essential to accurately grasp the level of participation and emotional state of attendees and provide real-time feedback. Furthermore, international meetings are prone to misunderstandings and inefficient communication due to cultural differences. Traditional systems have failed to adequately address these challenges.

[0710] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0711] In this invention, the server includes means for acquiring voice information via a communication network, voice recognition means for converting the voice information into text, natural language processing means for analyzing the converted text information and evaluating the level of conversational activity, the frequency of interruptions, and the degree of topic deviation, and emotion recognition means for analyzing the user's emotional state. This makes it possible to accurately grasp the speaking attitudes and emotions of participants during a meeting in real time and provide accurate feedback and advice. Furthermore, by providing culture-appropriate feedback, it is possible to reduce misunderstandings and inefficient communication between cultures.

[0712] A "communication network" is a network system that connects multiple devices and systems, enabling the transmission and reception of voice information and data.

[0713] "Audio information" refers to a data format used to record and transmit human speech and sounds.

[0714] "Speech recognition means" refers to technologies and devices that analyze acquired speech information to understand the speaker's intent and convert it into text information.

[0715] "Text information" refers to information consisting of a series of characters or words that have been converted by speech recognition technology.

[0716] "Natural language processing means" refers to techniques and methods that analyze text information to understand its context and meaning, and to evaluate the structure and characteristics of conversations.

[0717] "Emotion recognition means" refers to methods and devices for detecting a person's emotional state from speech or text and evaluating changes in that state.

[0718] "Feedback" refers to instructions and advice provided to the user based on analyzed information.

[0719] "Intercultural communication styles" refer to the unique ways and methods that people from different cultural backgrounds use to communicate.
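As a non-limiting illustration, the evaluation of conversational activity and interruption frequency described above may be sketched as follows. The `Utterance` schema, the rule that overlapping speech by a different speaker counts as an interruption, and the use of speaker changes per minute as an activity measure are all assumptions made for this sketch; the disclosure does not specify how these values are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One recognized utterance with timestamps in seconds (hypothetical schema)."""
    speaker: str
    start: float
    end: float
    text: str


def count_interruptions(utterances: list[Utterance]) -> dict[str, int]:
    """Count, per speaker, how often another speaker started before they finished."""
    ordered = sorted(utterances, key=lambda u: u.start)
    interrupted: dict[str, int] = {u.speaker: 0 for u in ordered}
    for prev, cur in zip(ordered, ordered[1:]):
        # Assumption: overlap between different speakers counts as an interruption.
        if cur.speaker != prev.speaker and cur.start < prev.end:
            interrupted[prev.speaker] += 1
    return interrupted


def activity_level(utterances: list[Utterance], duration: float) -> float:
    """Speaker changes per minute, used here as a crude proxy for activity."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    ordered = sorted(utterances, key=lambda u: u.start)
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a.speaker != b.speaker)
    return changes / (duration / 60.0)


meeting = [
    Utterance("A", 0.0, 5.0, "I think we should ship on Friday."),
    Utterance("B", 4.0, 8.0, "Friday is too early."),
    Utterance("A", 9.0, 12.0, "Then let us review the risks."),
]
print(count_interruptions(meeting))
print(activity_level(meeting, duration=60.0))
```

In this sample, speaker B begins at 4.0 seconds while speaker A is still talking, so A is recorded as having been interrupted once.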

[0720] This invention is a system that improves the efficiency and effectiveness of meetings by acquiring audio information during meetings in real time, analyzing it, and providing feedback. The information flow between the server, terminals, and users will be described in detail.

[0721] The server acquires audio information from users in real time via the communication network. In this process, audio recorded by each conference participant's terminal is sent to the server in streaming format. The acquired audio information is immediately converted into text information by a speech recognition system. This speech recognition uses a speech recognition engine (for example, a general-purpose speech recognition API).

[0722] Next, the terminal receives text information sent from the server and performs analysis using natural language processing. During this process, a generative AI model (e.g., a natural language processing engine) is used to identify the content, category, and characteristics of the conversation between users. Subsequently, emotion recognition is used to determine the user's emotional state from voice tone and other nonverbal indicators.

[0723] Based on the analysis results, the device can generate feedback and notify the user in real time. For example, if a participant shows signs of stress during a meeting, the device can provide specific advice such as, "Calm down, gather your thoughts, and then speak next time."

[0724] Furthermore, in multinational teams, culturally sensitive feedback is provided to ensure that participants from different cultural backgrounds can communicate without misunderstanding. For example, feedback can be customized to suit different communication styles across cultures.

[0725] Examples of prompts include specific instructions such as, "Analyze the statements made in this meeting and generate culture-aware feedback as needed."
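As a non-limiting illustration, a prompt of this kind may be assembled from the analysis results as sketched below. The function name `build_feedback_prompt`, its parameters, and the layout of the prompt fields are hypothetical; only the base instruction sentence is taken from the description above.

```python
from typing import Optional

BASE_INSTRUCTION = (
    "Analyze the statements made in this meeting and "
    "generate culture-aware feedback as needed."
)


def build_feedback_prompt(
    transcript: str,
    emotional_state: Optional[str] = None,
    communication_style: Optional[str] = None,
) -> str:
    """Assemble a prompt for a generative AI model (field layout is hypothetical)."""
    parts = [BASE_INSTRUCTION]
    if emotional_state:
        parts.append(f"Detected emotional state: {emotional_state}")
    if communication_style:
        parts.append(f"Preferred communication style: {communication_style}")
    parts.append("Transcript:\n" + transcript.strip())
    return "\n\n".join(parts)


prompt = build_feedback_prompt(
    "A: Let's move on.\nB: I was not finished.",
    emotional_state="tension",
    communication_style="indirect",
)
print(prompt)
```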

[0726] Thus, the present invention provides an advanced feedback system that combines speech recognition, natural language processing, and emotion recognition, thereby supporting participants in conducting effective meetings.

[0727] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0728] Step 1:

[0729] The server acquires voice information from the user via a communication network. The input is audio streamed in real time from the user's terminal. This voice information is recorded in digital format for future analysis.

[0730] Step 2:

[0731] The server sends the acquired audio information to the speech recognition engine for conversion into text. The speech recognition engine analyzes the input audio data using a language model and converts it into text format. The output is text information, recording the spoken content and corresponding timestamps.

[0732] Step 3:

[0733] The server passes the converted text information to an emotion recognition system to analyze the user's emotional state. The input consists of text information and, if necessary, non-verbal characteristics of speech. This analysis identifies emotional indicators (e.g., joy, anger, tension) and outputs them as the emotional state.

[0734] Step 4:

[0735] The terminal performs natural language processing using the text information and emotional state received from the server. Using a natural language processing model, it analyzes the input text and extracts the conversation structure, context, and keywords. The output consists of the analyzed conversation features and the categories of the user's statements.

[0736] Step 5:

[0737] The device generates feedback for the user based on the results of natural language processing and emotion recognition. The input is the results of the preceding analyses, which the AI model takes into account when creating the feedback. The output is a customized feedback message tailored to the individual user.

[0738] Step 6:

[0739] The device notifies the user of the generated feedback in real time. The feedback appears on the screen as a dialog or pop-up, providing the user with immediate advice. This process allows the user to instantly incorporate the feedback into their meeting communication.
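As a non-limiting illustration, steps 2 to 6 above may be chained as sketched below. Each stage is passed in as a plain callable so that the sketch runs without any external service; the names `run_pipeline` and `Analysis` and all the stand-in lambdas are hypothetical. The sketch also runs every step in one process, whereas the description divides the steps between the server and the terminal.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Analysis:
    text: str
    emotion: str
    keywords: list[str]


def run_pipeline(
    audio: bytes,
    recognize: Callable[[bytes], str],
    detect_emotion: Callable[[str], str],
    extract_keywords: Callable[[str], list[str]],
    generate_feedback: Callable[[Analysis], str],
    notify: Callable[[str], None],
) -> Analysis:
    """Run steps 2 to 6 in order; step 1 (acquisition) supplies `audio`."""
    text = recognize(audio)              # Step 2: speech to text
    emotion = detect_emotion(text)       # Step 3: emotion recognition
    keywords = extract_keywords(text)    # Step 4: natural language processing
    analysis = Analysis(text, emotion, keywords)
    notify(generate_feedback(analysis))  # Steps 5 and 6: generate and notify
    return analysis


# Stand-in components (assumptions, not real recognition or emotion engines).
shown: list[str] = []
result = run_pipeline(
    audio=b"I am worried about the deadline",
    recognize=lambda a: a.decode("utf-8"),
    detect_emotion=lambda t: "tension" if "worried" in t else "neutral",
    extract_keywords=lambda t: [w for w in t.lower().split() if len(w) > 6],
    generate_feedback=lambda a: (
        "Calm down, gather your thoughts, and then speak next time."
        if a.emotion == "tension" else "Keep going."
    ),
    notify=shown.append,
)
print(result, shown)
```

Passing the stages as callables keeps the ordering of the steps separate from the choice of engine, so any speech recognition or emotion recognition component can be substituted.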

[0740] (Application Example 2)

[0741] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0742] In family communication, individual emotions are often not properly recognized, resulting in a decline in the quality of interaction and creating misunderstandings and stress. This is particularly difficult in intercultural families or those with diverse family structures. Solving this problem is essential.

[0743] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0744] In this invention, the server includes means for acquiring sound data, means for converting sound data into text, means for analyzing the text data to evaluate the level of activity in communication, and means for recognizing emotions and tone in a home environment and making suggestions to promote relaxation. This makes it possible to improve the quality of communication within the home, prevent misunderstandings, and promote smooth interaction.

[0745] A "communication path" refers to a network means used to exchange information such as sound and data with other devices and systems.

[0746] "Audio data" refers to sound captured by a microphone or recording device and converted into a signal.

[0747] "Speech recognition means" refers to technologies and devices that analyze sound data and convert it into corresponding strings of characters.

[0748] "Character data" refers to text-formatted data converted by speech recognition technology.

[0749] "Language processing means" refers to technologies that analyze text data and evaluate the level of conversational activity, intervention frequency, and other factors.

[0750] "Analysis" refers to the process of examining acquired data in detail to detect patterns and specific features.

[0751] "Information" refers to data and knowledge that are generated based on the results of analysis and provided to users in a useful format.

[0752] "Personalized information" refers to feedback that is customized to each user's specific needs and emotional state.

[0753] "Adaptive information" refers to feedback that is adjusted to different cultural backgrounds and environments to facilitate appropriate communication.

[0754] "Home environment" refers to the living conditions and surrounding circumstances of an individual's residence.

[0755] "Relaxation-promoting suggestions" refer to providing instructions and advice to help users reduce tension and stress and reach a comfortable mental state.

[0756] The system for implementing this invention primarily focuses on the function of collecting and analyzing audio data in real time. The system acquires audio data using a microphone and converts this data into text using speech recognition software (e.g., Google Speech-to-Text API). The converted text is then analyzed using language analysis software (e.g., Google Cloud Natural Language API) to evaluate the level of activity and tone of conversations within the home.

[0757] An emotion engine (e.g., IBM Watson Tone Analyzer) analyzes the user's emotions and tone of voice, and generates relaxation suggestions as needed. These suggestions and feedback are communicated to the user through the display and speakers. As a result, the system helps ensure smooth and comfortable conversations among family members in the living room.

[0758] The server processes the acquired data in the cloud and provides personalized feedback to each user. For multicultural households, it also provides adaptive feedback to reduce misunderstandings that may arise during intercultural communication.

[0759] For example, if family conversations during dinner are awkward and tense, the system might suggest, "Let's find a fun topic to talk about and relax." An example of a prompt used in this case is, "Based on the analysis of the family conversation, please provide advice on how to relax." Thus, the invention can be used to improve communication within the family.

[0760] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0761] Step 1:

[0762] The server receives audio data collected through the microphone. This audio data becomes the input. The audio data is acquired in a compressed format, decoded, and passed on to subsequent processing.

[0763] Step 2:

[0764] The server uses speech recognition software to convert speech data into text data. The input for this step is speech data, and the output is the corresponding text data. During this process, phonological analysis is performed, and its result is used to convert the data into text.

[0765] Step 3:

[0766] The server passes the text data to language analysis software to analyze the conversation's activity level and sentiment. The input to this step is the text data generated in step 2, and the output is the analysis results, including activity level and sentiment indicators. A natural language processing algorithm evaluates the text and generates feature vectors.

[0767] Step 4:

[0768] The emotion engine determines the user's emotional state based on the analysis results and generates relaxation suggestions as needed. The input for this step is the analysis results from step 3, and the output is a suggestion message. For example, if it is determined that the user is in a state of tension, a message encouraging relaxation will be generated.

[0769] Step 5:

[0770] The terminal notifies the user of suggestions from the server via its display or speaker. The input in this step is the suggestion message sent from the server, and the output is the notification to the user. Notification methods include on-screen text display and audio announcements.

[0771] Step 6:

[0772] Users modify their behavior based on the suggestions provided, improving the quality of communication within the family. The input for this step is the displayed suggestions, and the output is the change in the user's behavior. By actively incorporating the suggestions, users can create a more harmonious atmosphere within the family.
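As a non-limiting illustration, steps 4 and 5 above may be sketched as follows. The score range, the threshold value of 0.7, and the names `AnalysisResult`, `relaxation_suggestion`, and `notify` are assumptions made for this sketch; the description states only that a relaxation message is generated when the user is determined to be in a state of tension.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalysisResult:
    """Output of step 3; both scores are assumed to lie in [0, 1]."""
    activity: float
    tension: float


def relaxation_suggestion(
    result: AnalysisResult, tension_threshold: float = 0.7
) -> Optional[str]:
    """Step 4: return a suggestion message only when tension is judged high."""
    if result.tension >= tension_threshold:
        return "Let's find a fun topic to talk about and relax."
    return None


def notify(message: Optional[str], display: list[str]) -> None:
    """Step 5: deliver the message; a list stands in for the display or speaker."""
    if message is not None:
        display.append(message)


display: list[str] = []
notify(relaxation_suggestion(AnalysisResult(activity=0.2, tension=0.9)), display)
notify(relaxation_suggestion(AnalysisResult(activity=0.8, tension=0.1)), display)
print(display)
```

Only the first, high-tension result produces a notification; the relaxed conversation is left uninterrupted.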

[0773] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0774] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions and inference data, such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0775] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0776] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0777] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0778] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0779] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the closer an emotion lies to the outside of the emotion map 400, the more visible it becomes (that is, the more it is expressed in actions).
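As a non-limiting illustration, the layout of the emotion map 400 described above may be expressed as a classification of a point relative to the center of the concentric circles. The coordinate system, the function name `describe_position`, and the single radius separating inner feelings from expressed actions are assumptions made for this sketch. Points lying exactly on an axis are assigned to one side for simplicity, although the description places mixed emotions there.

```python
import math


def describe_position(x: float, y: float, inner_radius: float = 0.5) -> dict[str, str]:
    """Classify a point on the emotion map; the centre of the circles is (0, 0)."""
    origin = "reaction" if x < 0 else "situation"      # left half vs right half
    valence = "pleasure" if y > 0 else "displeasure"   # upper side vs lower side
    # Assumption: one radius separates inner feelings from expressed actions.
    layer = "inner" if math.hypot(x, y) < inner_radius else "expressed"
    return {"origin": origin, "valence": valence, "layer": layer}


print(describe_position(0.9, 0.1))
print(describe_position(-0.2, -0.2))
```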

[0780] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0781] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0782] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
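As a non-limiting illustration, one possible way to obtain training targets in which emotions located close together have similar values is to spread a single labeled emotion over its neighbors with a Gaussian kernel. This is a technique chosen for the sketch, not one stated in the description. The coordinates in `POSITIONS`, the `bandwidth` value, and the function name `smoothed_targets` are hypothetical.

```python
import math

# Hypothetical map coordinates; the real positions come from the emotion map itself.
POSITIONS = {
    "reassured": (0.60, 0.10),
    "calm": (0.65, 0.05),
    "confident": (0.70, 0.15),
    "anxious": (0.60, -0.60),
}


def smoothed_targets(label: str, bandwidth: float = 0.2) -> dict[str, float]:
    """Spread a one-hot label over nearby emotions with a Gaussian kernel."""
    cx, cy = POSITIONS[label]
    weights = {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * bandwidth ** 2))
        for name, (x, y) in POSITIONS.items()
    }
    total = sum(weights.values())
    # Normalize so that the target values sum to 1.
    return {name: w / total for name, w in weights.items()}


targets = smoothed_targets("calm")
print({name: round(value, 3) for name, value in targets.items()})
```

With these illustrative coordinates, "reassured" and "confident" receive values close to that of "calm", while the distant "anxious" receives a value near zero.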

[0783] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0784] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0785] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0786] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0787] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0788] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0789] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0790] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0791] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0792] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0793] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0794] The following is further disclosed regarding the embodiments described above.

[0795] (Claim 1)

[0796] A means of acquiring voice data via a communication network,

[0797] A speech recognition means for converting the audio data into text,

[0798] A natural language processing means for analyzing the converted text data and evaluating the level of activity in the exchange of opinions, the frequency of interruptions, and the degree of topic deviation,

[0799] A means for generating feedback based on the results of the analysis and notifying the user of warnings in real time,

[0800] A means of providing personalized feedback to each user,

[0801] A means of providing culturally adapted feedback to multinational teams,

[0802] A system that includes this.

[0803] (Claim 2)

[0804] The system according to claim 1, wherein the natural language processing means includes means for analyzing emotions and tone.

[0805] (Claim 3)

[0806] The system according to claim 1, wherein the feedback generation means includes means for suggesting improvements to prevent misunderstandings based on multicultural communication styles.

[0807] "Example 1"

[0808] (Claim 1)

[0809] A means of acquiring voice data via a communication network,

[0810] A speech recognition means for converting the audio data into text,

[0811] A natural language processing means for analyzing the converted text data and evaluating the level of activity in the exchange of opinions, the frequency of interruptions, and the degree of topic deviation,

[0812] A means for generating feedback based on the results of the analysis and notifying the user in real time,

[0813] A means of providing personalized feedback to each user,

[0814] A means of providing culturally adapted feedback to multinational teams,

[0815] A means for continuously analyzing audio data and evaluating how often a participant's speaking time is interrupted by others,

[0816] A means for generating specific advice on improving communication based on the above evaluation,

[0817] A visualization means for immediately notifying the user of the generated feedback,

[0818] A system that includes this.

[0819] (Claim 2)

[0820] The system according to claim 1, wherein the natural language processing means includes a function for analyzing emotions and tone.

[0821] (Claim 3)

[0822] The system according to claim 1, wherein the feedback generation means includes a function of suggesting improvements to prevent misunderstandings based on multicultural communication styles.

[0823] "Application Example 1"

[0824] (Claim 1)

[0825] A means of acquiring acoustic information via a communication path,

[0826] A recognition device that converts the acoustic information into textual information,

[0827] A language processor that analyzes the converted character information and evaluates the activity level of information exchange, the frequency of interruptions, and the degree of topic deviation,

[0828] A device that generates a response based on the analysis results and immediately notifies the user of a warning,

[0829] A device that provides individualized responses for each user,

[0830] Means of providing culturally adapted responses to multicultural organizations,

[0831] A means of immediately providing notifications regarding deviations from the topic or monopolistic statements in information exchange activities,

[0832] A system that includes this.

[0833] (Claim 2)

[0834] The system according to claim 1, wherein the language processor includes means for analyzing emotions and tone of voice.

[0835] (Claim 3)

[0836] The system according to claim 1, wherein the response generating device includes means for presenting suggestions to prevent misunderstandings based on multicultural communication styles.

[0837] "Example 2 of combining an emotion engine"

[0838] (Claim 1)

[0839] A means of acquiring voice information via a communication network,

[0840] A speech recognition means for converting the audio information into text,

[0841] A natural language processing means for analyzing the converted text information and evaluating the level of conversational activity, the frequency of interruptions, and the degree of topic deviation,

[0842] An emotion recognition means for analyzing the user's emotional state,

[0843] A means for generating feedback based on the results of the analysis and notifying the user of the advice in real time,

[0844] A means of providing feedback tailored to individual users,

[0845] A means of providing culture-aware feedback to teams with diverse cultural backgrounds,

[0846] A system that includes this.

[0847] (Claim 2)

[0848] The system according to claim 1, wherein the natural language processing means includes means for analyzing emotional states and tone of voice.

[0849] (Claim 3)

[0850] The system according to claim 1, wherein the feedback generation means includes means for presenting suggestions for avoiding misunderstandings based on intercultural communication styles.

[0851] "Application example 2 when combining with an emotional engine"

[0852] (Claim 1)

[0853] A means of acquiring sound data via a communication path,

[0854] A speech recognition means that converts the sound data into text,

[0855] A language processing means that analyzes the converted character data and evaluates the degree of activity in communication, the frequency of intervention, and the degree of topic deviation,

[0856] A means for generating information based on the results of the analysis and immediately notifying the user of a warning,

[0857] A means of providing personalized information to each user,

[0858] Means of providing adaptive information to groups with different cultural backgrounds,

[0859] A means of recognizing emotions and tones in the home environment and offering suggestions to promote relaxation,

[0860] A system that includes this.

[0861] (Claim 2)

[0862] The system according to claim 1, wherein the language processing means includes a function for analyzing emotions and intonation.

[0863] (Claim 3)

[0864] The system according to claim 1, wherein the information generation means includes means for presenting suggestions to prevent misunderstandings based on intercultural communication styles and for facilitating communication within the family.

[Explanation of symbols]

[0865] 10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: a means for acquiring voice data via a communication network; a speech recognition means for converting the voice data into text; a natural language processing means for analyzing the converted text data and evaluating the level of activity in the exchange of opinions, the frequency of interruptions, and the degree of topic deviation; a means for generating feedback based on the results of the analysis and notifying the user of warnings in real time; a means for providing personalized feedback to each user; and a means for providing culturally adapted feedback to multinational teams.

2. The system according to claim 1, wherein the natural language processing means includes means for analyzing emotions and tone.

3. The system according to claim 1, wherein the feedback generation means includes means for suggesting improvements to prevent misunderstandings based on multicultural communication styles.