system

The system addresses the challenge of real-time meeting progress understanding by converting audio to text, extracting key information, and displaying summaries, enhancing meeting efficiency.

JP2026101264A · Pending · Publication Date: 2026-06-22 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-10
Publication Date
2026-06-22

AI Technical Summary

Technical Problem

Participants in meetings face challenges in grasping key points in real time, especially for latecomers or those joining from other meetings, as existing systems lack real-time means to understand the meeting progress.

Method used

A system that acquires audio data during a meeting, converts it into text data in real time, extracts important information, and displays summaries on visual devices at regular intervals, allowing participants to grasp the meeting progress efficiently.

Benefits of technology

Enables participants to understand the meeting's progress in real time, improving meeting efficiency and facilitating effective participation.

✦ Generated by Eureka AI based on patent content.

[Figure 2026101264000001_ABST]

Abstract

[Problem] To provide a system. [Solution] A system including: means for acquiring audio data during a meeting; means for converting the acquired audio data into text data in real time; means for extracting important information from the converted text data; means for generating a summary based on the extracted important information; means for displaying the generated summary on the display device of a device at regular intervals; and means for acquiring voice data within the home and providing a real-time summary to family members.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] The problem to be solved by the present invention is that during the progress of a meeting, there is a lack of means for participants to grasp the key points of the discussion in real time. In particular, there is a problem that it is difficult for latecomers or those who join from other meetings to quickly understand the current situation of the meeting. Since the existing minutes creation function is used after the meeting ends, it cannot provide an effective means to follow the progress of the meeting in real time.

Means for Solving the Problems

[0005] To solve this problem, the present invention provides means for acquiring audio data during a meeting and converting it into text data in real time. It also provides means for extracting important information from the converted text data and generating a summary based on the extracted information. Furthermore, it constructs a system that includes means for displaying the generated summary on the visual devices of meeting participants at regular intervals. This system allows participants to grasp the progress of the meeting in real time, enabling efficient participation.

[0006] "Audio data" refers to digital or analog data used to record or transmit audio.

[0007] "Real-time" means that information processing and data transmission are performed immediately, with waiting time kept to a minimum.

[0008] "Text data" refers to a data format represented as a string of characters, consisting of character information converted from audio or other formats.

[0009] "Important information" refers to information or keywords that should be given particular consideration or focus when conducting a meeting or making decisions.

[0010] A "summary" is a concise form of information that gives a broad overview and the key points needed to understand the overall picture.

[0011] "Visual devices" refer to equipment or display devices that visually display information so that users can view it.

Brief Description of the Drawings

[0012]

[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.

[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.

[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.

[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.

[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.

[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.

[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.

[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.

[Figure 9] An emotion map to which a plurality of emotions are mapped.

[Figure 10] An emotion map to which a plurality of emotions are mapped.

[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.

[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.

[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.

[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described according to the accompanying drawings.

[0014] First, the language used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] The system of this invention is configured to improve the efficiency of information transmission in meetings. This system functions through the interaction of its server, terminal, and user elements.

[0034] The server acquires the conference audio data in real time. Specifically, the server receives the audio stream via the conference system's communication interface. Upon receiving the audio data, the server sends it to a speech recognition API, where it is converted into text data in real time. This speech recognition uses a highly accurate recognition algorithm that can handle both everyday conversations and specialized terminology.

[0035] Next, the server analyzes the text data and utilizes a natural language processing model to extract important information. This allows it to determine the meaning and context of the conversation and identify important keywords and phrases. For example, if the discussion concerns the progress of a project, the system might extract phrases such as "progress," "challenges," and "next steps."
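As a rough illustration of the extraction step in [0035], the sketch below ranks words by frequency after dropping common stop words. It is a deliberately simple stand-in and not the natural language processing model the description refers to; the function name `extract_keywords` and the stop-word list are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for the example; a real system would rely on
# a trained natural language processing model rather than a fixed list.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "are",
    "we", "for", "with", "that", "this", "it", "be", "has", "have", "was",
}


def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Return the top_n most frequent non-stop-words, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]
```

Frequency counting ignores context, so it cannot tell which mention of a word matters; it only shows where a model-based extractor would plug in.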

[0036] Based on the extracted information, the server generates a summary of about one line. This summary is a concise overview designed to allow participants to quickly grasp the progress of the meeting. The server periodically sends the generated summary to the terminals.

[0037] The terminal receives summaries sent from the server. The received summaries are displayed as tickers on the visual devices used by the meeting participants. The display appears for a set period of time in the corner or bottom of the screen and is updated each time a new summary arrives.
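The ticker behavior in [0037] could be modeled roughly as follows: the latest summary is shown for a set period and replaced whenever a new one arrives. The class name `Ticker` and the 10-second default are assumptions for illustration; the description only says "a set period of time."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticker:
    """Holds the latest summary and hides it once display_seconds have passed."""

    display_seconds: float = 10.0  # assumed value, not given in the description
    _text: Optional[str] = None
    _shown_at: float = 0.0

    def update(self, summary: str, now: float) -> None:
        """Replace the displayed summary and restart the display timer."""
        self._text = summary
        self._shown_at = now

    def visible_text(self, now: float) -> Optional[str]:
        """Return the summary if it is still within its display period."""
        if self._text is None:
            return None
        if now - self._shown_at >= self.display_seconds:
            return None
        return self._text
```

Passing the current time in as `now` keeps the class independent of any particular clock or UI toolkit.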

[0038] Users can view a summary displayed on their device and stay informed about the meeting's progress in real time. This feature makes it easy for latecomers and those joining from other meetings to understand the flow of the meeting and participate effectively.

[0039] Thus, the system of the present invention helps to improve the efficiency of meetings and reduce wasted time by making it easier to instantly grasp meeting information.

[0040] The following describes the processing flow.

[0041] Step 1:

[0042] The server receives audio streaming data from the conference in real time using the conference system's communication interface. This data is sequentially stored in a buffer for subsequent processing.

[0043] Step 2:

[0044] The server sends the received audio streaming data to a speech recognition API, which converts it into text data in real time. This API recognizes the speech, generates the corresponding string, and returns it to the server.

[0045] Step 3:

[0046] The server inputs the converted text data into a natural language processing model, which analyzes the context of the text. This model identifies keywords and phrases that can be used to extract important information from the conversation.

[0047] Step 4:

[0048] The server generates a summary based on the extracted key information. This summary is designed to highlight the core points of the conversation in a single-line format.

[0049] Step 5:

[0050] The server sends the generated summaries to the terminal at regular intervals. This data is encoded as text and delivered to the terminal in real time.

[0051] Step 6:

[0052] The terminal receives a summary sent from the server and displays it on the user's screen in a scrolling text format. This display appears as a pop-up at the bottom or corner of the screen and is continuously updated.

[0053] Step 7:

[0054] Users can understand the progress of the meeting by checking the summary displayed on their device screen. This allows users to efficiently understand the meeting content and obtain important information for participating in the discussion.
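Steps 1 to 5 above can be sketched as a small loop that buffers transcribed text and emits one summary per interval. The callables `transcribe`, `extract`, and `summarize` are hypothetical placeholders for the speech recognition API, the natural language processing model, and the summary generator. The 30-second default is an assumed value, since the description only says "regular intervals."

```python
from typing import Callable, Iterable, Iterator, Optional


def run_pipeline(
    audio_chunks: Iterable[tuple[float, bytes]],
    transcribe: Callable[[bytes], str],
    extract: Callable[[str], list[str]],
    summarize: Callable[[list[str]], str],
    interval_seconds: float = 30.0,
) -> Iterator[tuple[float, str]]:
    """Yield (timestamp, summary) each time an interval of audio has been buffered."""
    buffer: list[str] = []
    window_start: Optional[float] = None
    for timestamp, chunk in audio_chunks:          # Step 1: receive audio
        if window_start is None:
            window_start = timestamp               # first chunk opens a window
        buffer.append(transcribe(chunk))           # Step 2: speech to text
        if timestamp - window_start >= interval_seconds:
            key_info = extract(" ".join(buffer))   # Step 3: important information
            yield timestamp, summarize(key_info)   # Steps 4-5: summarize and send
            buffer.clear()
            window_start = None                    # next chunk opens a new window
```

Taking the three stages as parameters keeps the loop testable with simple stand-ins and leaves the choice of services open.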

[0055] (Example 1)

[0056] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0057] In modern meetings, it is crucial for participants to grasp information appropriately in real time and participate effectively in discussions. However, the frequent use of complex conversations and technical jargon can sometimes lead to inefficient information transmission. In such situations, there is a need for a system that allows participants to efficiently understand the progress of the meeting and for those joining late to easily catch up.

[0058] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0059] In this invention, the server includes means for acquiring audio data, means for converting the acquired audio data into text data, and means for analyzing the converted text data to extract important information. This allows participants to grasp important information during the meeting in real time and participate in the meeting efficiently through concise summaries that take into account the conversation and context.

[0060] "Audio data" refers to information used to record or transmit sounds and speech generated in a meeting or similar setting in digital format.

[0061] "Text data" refers to digital data that represents audio data as textual information.

[0062] "Important information" refers to items or keywords that are particularly noteworthy among the topics discussed during a meeting, and is information necessary to understand the main points of the meeting.

[0063] A "summary" is a short, concise summary of the meeting's content and key information, designed to allow participants to quickly grasp the information.

[0064] "Visual devices" refer to devices or displays used by meeting participants to visually confirm information.

[0065] A "communication network" is a system that provides infrastructure for sending and receiving data, enabling the mutual transfer of digital information over a wide or limited area.

[0066] "Language analysis technology" refers to techniques for processing natural language and analyzing its content, meaning, and context.

[0067] The system of this invention is an information processing device for efficiently transmitting information in meetings. The system performs a series of processes: acquiring audio data, converting it into text data, extracting important information, generating a summary, and displaying it on a visual device. This allows meeting participants to grasp information in real time and participate in the meeting smoothly.

[0068] The server acquires audio data in real time via the conference system's communication network. The server then uses a speech recognition API (e.g., a common speech recognition service) to convert the acquired audio data into text data. The speech recognition API is equipped with advanced recognition algorithms and can accurately transcribe everyday conversations and technical terms into text.

[0069] Next, the server analyzes the text data using a natural language processing model (e.g., BERT or GPT-3®). The server extracts important keywords and phrases from the analyzed text data and generates a concise summary based on them. For example, if a discussion takes place regarding project progress, the server might generate a summary such as, "New feature implementation complete. Bugs reported in user testing. Bug fixes are the priority."

[0070] The terminal receives a summary sent from the server and displays it on a visual device in a scrolling text format. Users can check the summary displayed on the terminal and efficiently grasp the progress of the meeting. This helps users who join late to quickly understand the content of the meeting.

[0071] Example prompt: "Extract the key information from this text and generate a concise one-line summary." This allows the server to use a generative AI model to create an appropriate summary.
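A minimal sketch of how the example prompt might be combined with a transcript follows. The `generate` callable is a hypothetical stand-in for whatever generative AI model the server uses, and the "Text:" layout of the prompt is an assumption; the description gives only the instruction sentence.

```python
from typing import Callable

# The instruction sentence is taken from the example prompt in [0071];
# the surrounding layout is an assumption for this sketch.
PROMPT_TEMPLATE = (
    "Extract the key information from this text and generate a concise "
    "one-line summary.\n\nText:\n{transcript}"
)


def build_prompt(transcript: str) -> str:
    """Insert the transcript into the prompt template."""
    return PROMPT_TEMPLATE.format(transcript=transcript.strip())


def one_line_summary(transcript: str, generate: Callable[[str], str]) -> str:
    """Call the (hypothetical) model and collapse its reply onto one line."""
    reply = generate(build_prompt(transcript))
    # Models may return line breaks or extra spaces; normalize the whitespace.
    return " ".join(reply.split())
```

Collapsing whitespace after the call guards against a reply that spans several lines, which would not fit a one-line ticker.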

[0072] This invention makes it possible to quickly organize information during meetings and provide an environment in which participants can effectively engage in discussions.

[0073] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0074] Step 1:

[0075] The server acquires audio data in real time from the conference system via the communication network. Specifically, the audio signal acquired from the microphone is transferred to the server as digital data, and the server receives it as input. The server's role is to prepare this audio data for analysis.

[0076] Step 2:

[0077] The server sends the acquired audio data to a speech recognition API, where it is converted into text data. This data processing involves converting the audio signal into text information. The resulting text data is returned to the server in string format and becomes the input for the next analysis step.

[0078] Step 3:

[0079] The server analyzes the converted text data using a natural language processing model. It processes the text data as input and performs data operations to identify important keywords and phrases. The output of this step is a list of extracted important information, which forms the basis for summary generation.

[0080] Step 4:

[0081] The server generates a summary based on the extracted key information. This process uses a generative AI model to create a concise, one-line summary that is relevant to the context of the conversation. The server outputs this summary and prepares it for transmission to the terminal.

[0082] Step 5:

[0083] The terminal receives a summary sent from the server and displays it on the user's visual device. The summary, as input, is displayed on the screen in a ticker-style format via display software on the terminal. As output, it provides an environment where the user can visually review the summary.

[0084] Step 6:

[0085] Users review the summary displayed on their devices to understand the progress of the meeting. Their specific actions include searching for relevant information on their devices as needed and preparing to participate in the discussion based on the meeting content. The output of this step is user understanding and effective participation in the meeting.
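The hand-off between Step 4 and Step 5 requires some message format between server and terminal. The description does not specify one, so the JSON layout below is purely an assumption, as are the `SummaryMessage` fields `meeting_id` and `sequence`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SummaryMessage:
    meeting_id: str  # hypothetical identifier for the meeting
    sequence: int    # hypothetical counter, increasing with each summary
    summary: str     # the one-line summary text


def encode(message: SummaryMessage) -> str:
    """Server side: serialize the message as JSON text."""
    return json.dumps(asdict(message), ensure_ascii=False)


def decode(payload: str) -> SummaryMessage:
    """Terminal side: parse the JSON text back into a message."""
    data = json.loads(payload)
    return SummaryMessage(
        meeting_id=str(data["meeting_id"]),
        sequence=int(data["sequence"]),
        summary=str(data["summary"]),
    )
```

A sequence number would let the terminal ignore a summary that arrives out of order, although the description does not state whether this is needed.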

[0086] (Application Example 1)

[0087] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0088] In modern households, many family members have different schedules and roles, often making information sharing and decision-making among family members complex. This invention aims to provide a system that enables smooth communication and schedule management through the efficient use of voice information within the home.

[0089] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0090] In this invention, the server includes means for acquiring audio data during a meeting, means for converting the acquired audio data into text data in real time, means for extracting important information from the converted text data, means for generating a summary based on the extracted important information, means for displaying the generated summary on the device's display device at regular intervals, and means for acquiring audio data within the home and providing summaries to family members in real time. This makes it possible to grasp the content of conversations within the home in real time and provide necessary information in a concise summary.

[0091] "Audio data" refers to recorded or transmitted data of spoken language that occurs in meetings or within a household.

[0092] "Means of converting to text data in real time" refers to a process or device that receives audio data and instantly converts it into text information.

[0093] "Means for extracting important information" refers to methods or devices for identifying highly relevant keywords or phrases from text data.

[0094] "Means for generating a summary" refers to a process or apparatus for creating a concise summary based on extracted key information.

[0095] A "display device" is a piece of hardware or digital device used to visually present generated information.

[0096] "Means for acquiring audio data within the home" refers to a system or device for effectively recording or recognizing spoken language in a home environment.

[0097] "A means of providing real-time summaries to family members" refers to a method or device that instantly analyzes conversations within the family, summarizes important information, and immediately communicates it to family members.

[0098] This invention relates to a system that efficiently analyzes audio information within a home, extracts important information, and automatically presents a summary using a visual device. Specific embodiments are described below.

[0099] The server is connected to devices (e.g., smart speaker microphones) that acquire audio in real time within a home environment such as a living room or kitchen. This server uses speech recognition software (e.g., Google's Speech Recognition API) to convert the acquired audio data into text data in real time. This process enables smooth conversational processing.

[0100] Furthermore, the server utilizes natural language processing models (e.g., BERT and GPT) to extract important keywords and phrases from text data. This allows it to understand the flow of the conversation and select appropriate information.

[0101] The terminal receives the summary sent from the server and displays it on the screen, for example on a home smart display or television. If a family discusses a picnic planned for the next weekend, for instance, the system presents a summary of relevant information such as "next steps" and "items to prepare."

[0102] This allows users to stay informed about family conversations and important schedules in real time, enabling them to act more effectively. This feature is particularly useful in preventing misunderstandings when not all family members are together or during busy daily lives.

[0103] Example of a prompt:

[0104] "Summarize the content of the family meeting from the audio recording and extract key keywords. Examples: date, location, division of roles."
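If the model were asked to answer with one labeled line per item, its reply could be turned into fields as sketched below. The "label: value" reply format and the function name `parse_labeled_reply` are assumptions for illustration; the description does not say how the model's output is structured.

```python
def parse_labeled_reply(
    reply: str,
    labels: tuple[str, ...] = ("date", "location", "roles"),
) -> dict[str, str]:
    """Collect 'label: value' lines whose label is one of the expected labels."""
    result: dict[str, str] = {}
    for line in reply.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in labels:
            result[key] = value.strip()
    return result
```

Lines with unknown labels are dropped rather than guessed at, so the display only shows fields the terminal knows how to present.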

[0105] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0106] Step 1:

[0107] The server receives audio data in real time as input from audio acquisition devices installed in the home. This data is sent to the server in audio file format.

[0108] Step 2:

[0109] The server converts the received audio data into text data using speech recognition software. This speech recognition process involves analyzing the audio waveform data and mapping it to corresponding character information. The output is text data that directly transcribes the spoken content.

[0110] Step 3:

[0111] The server processes the generated text data as input into a natural language processing model to extract important keywords and phrases. This process involves data calculations that analyze the text structure and detect specific words and patterns. The output is a list of key phrases that reflect the context of the conversation.

[0112] Step 4:

[0113] The server performs further data processing to generate a concise summary from the extracted key phrases. Specifically, it reorganizes relevant information and creates shortened text. The output is a simplified summary text provided to family members.

[0114] Step 5:

[0115] The terminal receives summary text sent from the server and displays it on a home display or smart screen. Displaying it requires converting it into a user-friendly format. The output is a visually presented summary text.

[0116] Step 6:

[0117] By visually reviewing the summary displayed on the screen, users can understand the conversations and decisions made within the household. The input in this step is the visual information of the summary text, which leads to output that facilitates user understanding and action.

[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0119] This invention provides a system that can efficiently grasp information during a meeting and recognize the emotional state of the users, thereby enabling a real-time understanding of the meeting's atmosphere. This system operates through the coordinated efforts of a server, terminals, and users.

[0120] The server acquires audio data from the conference system in real time. Specifically, the server receives the audio stream via the conference system's communication interface and processes the data. The server then sends this audio data to a speech recognition API to convert it into text data. This process transforms the audio into text, laying the foundation for recording and analyzing meeting minutes.

[0121] Next, the server analyzes the text data. Using a natural language processing model, it understands the context of the conversation and extracts key information. Based on this, a summary capturing the essence of the meeting is generated. This summary efficiently extracts the main points from the lengthy conversation and presents them clearly to the participants.

[0122] Furthermore, the server uses an emotion engine to analyze the user's emotional state from the audio data. The emotion engine identifies emotions such as joy, anger, and sadness based on characteristics such as the speaker's tone, intonation, and speed. This emotional information is incorporated into the summary and used to provide participants with an overview of the meeting's atmosphere and emotional progression.

[0123] The terminal receives the summary and sentiment information transmitted from the server and displays them on the user's visual device. The display shows the progress of the meeting together with visual feedback on the emotional state, in the form of a ticker on a portion of the screen.

[0124] Users can view summaries and sentiment information displayed on their devices, allowing them to grasp the meeting content in real time. This enables users to gain a deeper understanding of the meeting's context and participate more effectively based on comprehensive information, including the emotional states of other participants.

[0125] Thus, the present invention provides a new means for improving the quality of meetings by comprehensively capturing the content of the discussions and the emotions of the participants.

[0126] The following describes the processing flow.

[0127] Step 1:

[0128] The server acquires audio data from the conference system in real time. In this process, the server receives the conference audio stream via a communication interface and continuously collects data.

[0129] Step 2:

[0130] The server sends the acquired audio data to a speech recognition API, which converts it into text data in real time. This conversion process utilizes high-precision speech recognition technology to generate accurate text strings from the audio.

[0131] Step 3:

[0132] The server analyzes the converted text data using a natural language processing model. This involves understanding the context of the text and extracting important keywords and phrases.

[0133] Step 4:

[0134] The server generates a summary based on the extracted key information. The generated summary condenses the meeting's main points into a single line, making it easy for participants to quickly understand.

[0135] Step 5:

[0136] The server uses an emotion engine to analyze the user's emotional state based on the audio data. By evaluating the tone, speed, and intonation of the voice, the emotion engine identifies the speaker's emotions.

[0137] Step 6:

[0138] The server incorporates the emotional information analyzed by the emotion engine into a summary, further compiling it into information that reveals the overall atmosphere of the meeting.

[0139] Step 7:

[0140] The server sends the generated summary and sentiment information to the terminal. The data is encoded in a format suitable for visual presentation on the terminal.

[0141] Step 8:

[0142] The terminal receives data transmitted from the server and displays it in ticker format on the meeting participants' visual devices. The displayed information includes the progress of the meeting and the emotional state of the participants, and is visually indicated in the appropriate part of the screen.

[0143] Step 9:

[0144] Users can understand the key points of a meeting and the emotional state of participants in real time by viewing summaries and sentiment information displayed on their devices. This allows users to gain a deeper understanding of the meeting content and use it as material for effective decision-making.
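By way of a non-limiting illustration, Steps 2 to 6 above could be sketched as follows. The names `TickerItem`, `process_chunk`, and the three `fake_*` functions are hypothetical and do not appear in this disclosure. The stand-in functions only imitate the speech recognition API, the natural language processing model, and the emotion engine so that the sketch runs on its own, and the "neutral" label is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TickerItem:
    """One entry shown on the terminal: a one-line summary plus an emotion label."""
    summary: str
    emotion: str


def process_chunk(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    summarize: Callable[[str], str],
    detect_emotion: Callable[[bytes], str],
) -> TickerItem:
    """Run Steps 2 to 6 for one chunk of meeting audio."""
    text = transcribe(audio)           # Step 2: speech recognition API
    summary = summarize(text)          # Steps 3-4: NLP model and summary
    emotion = detect_emotion(audio)    # Step 5: emotion engine
    # Step 6: keep the summary on a single line and attach the emotion
    return TickerItem(summary=" ".join(summary.split()), emotion=emotion)


# Stand-ins for the external services (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_summarize(text: str) -> str:
    return text.split(".")[0].strip() + "."


def fake_emotion(audio: bytes) -> str:
    return "joy" if b"increasing" in audio else "neutral"


item = process_chunk(
    b"Sales are increasing. Details follow next week.",
    fake_transcribe, fake_summarize, fake_emotion,
)
print(f"[{item.emotion}] {item.summary}")
```

Passing the three services in as callables keeps the flow independent of any particular vendor, which matches the way the description refers to them only by role.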

[0145] (Example 2)

[0146] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0147] In meetings, participants must understand a vast amount of information in a short time, and it is particularly difficult to grasp spoken conversations in real time as text. Furthermore, while understanding participants' emotional states during a meeting is valuable, capturing emotional changes through audio alone presents a challenge. This can hinder the efficient progress of meetings and decision-making.

[0148] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0149] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, and means for analyzing emotional states from the audio information to acquire emotional information. This enables efficient information gathering during a meeting, captures changes in participants' emotions, and facilitates more effective meeting management.

[0150] "Audio information" refers to data that digitally represents the characteristics of the voices spoken during a meeting.

[0151] "Textual information" refers to data in text format generated based on audio information.

[0152] "Important information" refers to essential content that is relevant to the purpose of the meeting and can be extracted from the conversation during the meeting.

[0153] A "summary" is a text document that condenses the main points from the vast amount of information gathered during a meeting.

[0154] A "visual device" is a hardware device that displays information in a way that allows users to visually confirm it.

[0155] "Emotional state" refers to data that describes the mental state of meeting participants, based on factors such as tone, intonation, and speed of speech.

[0156] "Emotional information" refers to digital data obtained by analyzing audio information to indicate the emotional state of participants.

[0157] A "natural language processing model" is a collection of algorithms and systems that understand the meaning and context from text data and perform data extraction and generation.

[0158] This invention is a system for efficiently grasping information during meetings and recognizing the emotional state of participants in real time. This system consists of three elements: a server, a terminal, and a user, each playing a specific role.

[0159] The server acquires audio information in real time using the conference system's communication method. This process involves receiving audio in streaming format using protocols such as WebRTC or SIP. Then, to convert the audio information into text, it utilizes a speech recognition API (e.g., a general speech recognition service). This speech recognition API instantly transcribes the acquired audio information into text, enabling accurate recording of the conversation.
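As a rough sketch of how a received audio stream might be cut into pieces before being passed to a speech recognition API, the following splits raw PCM bytes into fixed-duration frames. The sample rate, the sample width, the 20 ms frame length, and the function name `frames` are all assumptions made for illustration. The disclosure does not specify an audio format, and the WebRTC or SIP transport itself is not modeled here.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM


def frames(pcm: bytes, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration frames; a trailing partial frame is yielded as-is."""
    size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # one second of silence
chunks = list(frames(one_second))
print(len(chunks), len(chunks[0]))
```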

[0160] Next, the server analyzes the text data using a natural language processing model (e.g., language models such as BERT or GPT). This allows it to understand the context of the conversation and extract important information. This information extraction efficiently summarizes key points such as "When is the next meeting?" or "Client feedback on a specific product."

[0161] Furthermore, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to analyze the user's emotional state from the audio information. It determines the participant's emotions from the tone, intonation, and speed of their voice, identifying emotional information such as joy, anger, and sadness. This emotional information is integrated with the summary and used to provide participants with a sense of the meeting's atmosphere.
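To make the idea of judging emotion from tone, intonation, and speed more concrete, a toy rule-based classifier is sketched below. It is not the emotion analysis engine of this disclosure. The feature names, the thresholds, and the fallback label "neutral" are invented for illustration, and a practical engine would more likely rely on a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    mean_pitch_hz: float      # tone
    pitch_range_hz: float     # intonation
    words_per_minute: float   # speed


def classify_emotion(f: VoiceFeatures) -> str:
    """Toy rules with made-up thresholds; a real engine would use a trained model."""
    if f.words_per_minute > 170 and f.pitch_range_hz > 80:
        # fast, lively speech: higher pitch is read as joy, lower pitch as anger
        return "joy" if f.mean_pitch_hz >= 200 else "anger"
    if f.words_per_minute < 110 and f.pitch_range_hz < 40:
        # slow, flat speech is read as sadness
        return "sadness"
    return "neutral"


print(classify_emotion(VoiceFeatures(mean_pitch_hz=230, pitch_range_hz=100, words_per_minute=180)))
```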

[0162] The terminal receives summaries and sentiment information sent from the server and displays them on the user's visual device. Specifically, the information is presented on the display device in the form of pop-up windows or tickers, allowing the user to understand the progress of the meeting and the emotional state of the participants.

[0163] Users can view the information provided by this device and understand the meeting content in real time. This allows users to gain a deeper understanding of the meeting context and engage in more effective discussions that take into account the emotional states of other participants.

[0164] For example, in a company meeting, information such as "sales are increasing" is summarized, and the speaker's emotion of "joy" is identified and displayed on the screen. An example of a prompt given to the generative AI model in this case is: "Summarize the meeting content in real time and analyze the emotional state of the participants."

[0165] This system makes it possible to improve the quality of meetings by integrating meeting information with participants' emotions.

[0166] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0167] Step 1:

[0168] The server acquires audio information in real time using the conference system's communication methods. It receives an audio stream from the conference as input. For processing, it collects the audio data in digital format using the WebRTC or SIP protocol. As output, the server holds the audio information in digital format.

[0169] Step 2:

[0170] The server sends the acquired audio information to a speech recognition API, which converts it into text information. Digital audio information is used as input. The processing involves the speech recognition service analyzing the audio data and converting its contents into text format. The server then obtains the text information as output.

[0171] Step 3:

[0172] The server feeds the textual information obtained through speech recognition into a natural language processing model to extract important information. Textual information is used as input. During processing, a language model (e.g., BERT or GPT) understands the context of the text and extracts the key points of the meeting. The output is summarized text information.

[0173] Step 4:

[0174] The server uses an emotion engine to recognize the emotional state of the speaker. Both the audio information and the obtained text information are used as input. The processing involves analyzing the speaker's tone, intonation, and speaking speed to identify emotions such as joy, anger, and sadness. The output is the identified emotion information.

[0175] Step 5:

[0176] The server integrates extracted key information and sentiment information to generate a summary. It uses summarized text information and sentiment information as input. The processing combines them to create information that reflects the content and atmosphere of the meeting. The output is a visually presentable summary.

[0177] Step 6:

[0178] The terminal receives summary and sentiment information transmitted from the server and displays it to the user. Integrated summary information is used as input. Processing includes outputting information to a display device, and presentation in the form of a pop-up window or ticker. The output is information that the user can visually confirm.

[0179] Step 7:

[0180] Users review summaries and sentiment information displayed on their devices to understand the meeting content in real time. Input is the information provided by the device. Processing involves users making decisions based on this information and guiding the discussion. The output is effective meeting participation and decision-making.
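Steps 5 and 6 above, in which the server integrates the summary with the sentiment information and the terminal presents it, might look like the following sketch. The JSON message layout, the field names `summary` and `emotions`, and the functions `encode_payload` and `render_ticker` are assumptions made for illustration. The disclosure does not define a wire format.

```python
import json


def encode_payload(summary: str, emotions: dict[str, float]) -> bytes:
    """Server side: merge summary and emotion scores into one UTF-8 JSON message."""
    message = {"summary": summary, "emotions": emotions}
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def render_ticker(payload: bytes) -> str:
    """Terminal side: decode the message and format one ticker line."""
    message = json.loads(payload.decode("utf-8"))
    # show the emotion with the highest score next to the summary
    top = max(message["emotions"], key=message["emotions"].get)
    return f"{message['summary']} ({top})"


payload = encode_payload(
    "Sales are increasing", {"joy": 0.8, "anger": 0.1, "sadness": 0.1}
)
print(render_ticker(payload))
```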

[0181] (Application Example 2)

[0182] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0183] During meetings, a large amount of information is exchanged, making it difficult to efficiently grasp and understand it. Furthermore, understanding the emotional state of participants is crucial for improving meeting outcomes, but in typical meetings, this information cannot be shared in real time.

[0184] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0185] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, means for extracting important information from the converted text information and generating summary information, and means for analyzing emotional states from the audio information and generating emotional information. This makes it possible to provide real-time visual feedback of meeting summary information and participants' emotional information.

[0186] "Audio information" refers to the sound signals emitted by participants during a meeting, which convey the content of their communication.

[0187] "Textual information" refers to text data obtained by analyzing and converting audio information, and is a format designed to facilitate recording and analysis.

[0188] "Important information" refers to key points and keywords extracted from conversations using natural language processing technology.

[0189] "Summary information" refers to abbreviated information generated based on extracted key information, which helps in understanding the content of the meeting.

[0190] "Visual devices" are devices that enable meeting participants to visually perceive information, and include display screens and projectors.

[0191] "Emotional state" refers to the type of emotion identified by analyzing the emotional characteristics of the speaker contained in the audio information.

[0192] "Emotional information" refers to data indicating emotional states, analyzed by the emotion engine from audio information, and is used to understand the atmosphere of a meeting and the reactions of the participants.

[0193] The system for implementing the present invention mainly consists of elements such as a server, a terminal, and a user.

[0194] During a meeting, the server acquires the speech of meeting participants as audio information. This involves a process of receiving audio signals via communication means to stream the meeting audio to the server in real time. The server converts the received audio information into text information using speech recognition technology, and then utilizes natural language processing models to extract important information from the text information. Specifically, speech recognition APIs such as Google Cloud Speech-to-Text and natural language processing libraries such as spaCy can be used.

[0195] Furthermore, the server uses an emotion analysis library to analyze the emotional state contained in the audio information. For example, it uses IBM Watson® Tone Analyzer to identify the speaker's emotional state from the audio information and generate emotion information.

[0196] The generated summary and sentiment information are periodically displayed on the meeting participants' visual devices via their terminals. This allows users to understand the participants' emotional responses in real time, along with the important agenda items, as the meeting progresses.

[0197] As a concrete example, consider a project progress report at a weekly meeting. The server receives the project leader's statements as audio information, extracts the main progress and problems, and converts them into text information. It also generates emotional information, such as the leader's expectations and other members' anxieties, detected in the statements, visualizes it, and conveys it to the participants.

[0198] An example of a prompt for a generative AI model is: "Analyze the speaker's emotions in real time during the meeting and create the following text summary: {Audio data}. Analyze the emotional state and identify the following key emotional tones: Joy, Anger, Sadness. Incorporate the overall atmosphere of the session into the summary."
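One possible way to fill the {Audio data} placeholder of such a prompt is sketched below. It assumes that the placeholder is replaced with already transcribed speech for a text-only model, which the disclosure does not state. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical.

```python
PROMPT_TEMPLATE = (
    "Analyze the speaker's emotions in real time during the meeting and "
    "create the following text summary:\n"
    "{Audio data}\n"
    "Analyze the emotional state and identify the following key emotional "
    "tones: Joy, Anger, Sadness.\n"
    "Incorporate the overall atmosphere of the session into the summary."
)


def build_prompt(transcript: str) -> str:
    """Replace the {Audio data} placeholder with the transcribed speech."""
    # collapse whitespace so the inserted text stays on one line
    cleaned = " ".join(transcript.split())
    # str.replace is used because the placeholder name contains a space
    return PROMPT_TEMPLATE.replace("{Audio data}", cleaned)


print(build_prompt("Sales  are\nincreasing this quarter."))
```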

[0199] In this way, the present invention enables meeting participants to deeply understand the meeting content and receive detailed feedback, including the emotional aspects of the dialogue, in real time.

[0200] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0201] Step 1:

[0202] The server acquires audio information during a meeting. As input, it receives the meeting's audio stream in real time via a communication method. The server temporarily stores this audio data as an audio file.

[0203] Step 2:

[0204] The server converts the acquired audio information into text information. It sends the audio data as input to a speech recognition API such as Google Cloud Speech-to-Text, obtaining text data as output. This process converts the audio into understandable text information.

[0205] Step 3:

[0206] The server extracts important information from text data. It uses spaCy, a natural language processing library, to analyze text data as input. As output, it generates summary information containing keywords and important phrases from the conversation.

[0207] Step 4:

[0208] The server generates emotional information from audio data. It processes the audio data as input using an emotional analysis library such as IBM Watson Tone Analyzer. The output is data indicating the speaker's emotional state. This information includes emotional indicators such as joy and anger.

[0209] Step 5:

[0210] The terminal displays the generated summary and sentiment information on a visual device. It receives summary and sentiment information from a server as input and displays it on a screen or projector. The output is provided to participants as visual feedback.

[0211] Step 6:

[0212] Users understand the meeting content through information displayed on visual devices. Based on information provided by the terminal as input, users grasp the progress of the meeting and the emotional state of participants in real time. This allows for deeper participation in the meeting.
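The keyword extraction of Step 3 above can be illustrated with a simple word-frequency count. This is a deliberately simplified stand-in that uses only the Python standard library. It does not call spaCy, and the stop-word list and the function name `top_keywords` are assumptions introduced for this sketch.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not from the disclosure).
STOP_WORDS = {
    "the", "a", "an", "is", "are", "and", "of", "to", "in", "on",
    "we", "for", "this", "that", "will", "be",
}


def top_keywords(text: str, n: int = 3) -> list[str]:
    """Return the n most frequent non-stop-words, ties broken by first appearance."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]


speech = (
    "The schedule is delayed. We will review the schedule and the budget. "
    "The budget is tight."
)
print(top_keywords(speech))
```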

[0213] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0214] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0215] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0216] [Second Embodiment]

[0217] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0218] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0219] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0220] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0221] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0222] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0223] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0224] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0225] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0226] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0227] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0228] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0229] The system of this invention is configured to improve the efficiency of information transmission in meetings. This system functions through the interaction of its server, terminal, and user elements.

[0230] The server acquires the conference audio data in real time. Specifically, the server receives the audio stream via the conference system's communication interface. Upon receiving the audio data, the server sends it to a speech recognition API, where it is converted into text data in real time. This speech recognition uses a highly accurate recognition algorithm that can handle both everyday conversations and specialized terminology.

[0231] Next, the server analyzes the text data and utilizes a natural language processing model to extract important information. This allows it to determine the meaning and context of the conversation and identify key keywords and phrases. For example, if the discussion concerns the progress of a project, the system might extract phrases such as "progress," "challenges," and "next steps."

[0232] Based on the extracted information, the server generates a summary of about one line. This summary is a concise overview designed to allow participants to quickly grasp the progress of the meeting. The server periodically sends the generated summary to the terminals.

[0233] The terminal receives summaries sent from the server. The received summaries are displayed as tickers on the visual devices used by the meeting participants. The display appears for a set period of time in the corner or bottom of the screen and is updated each time a new summary arrives.
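The display behavior described here, in which a summary is shown for a set period and replaced whenever a new one arrives, could be modeled as in the sketch below. The class name `Ticker`, the default display period of 10 seconds, and the use of caller-supplied timestamps are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticker:
    display_seconds: float = 10.0   # assumed display period
    _text: Optional[str] = None
    _shown_at: float = 0.0

    def push(self, summary: str, now: float) -> None:
        """A newly arrived summary replaces whatever is on screen."""
        self._text = summary
        self._shown_at = now

    def visible(self, now: float) -> Optional[str]:
        """Return the text to draw at time `now`, or None once it has expired."""
        if self._text is not None and now - self._shown_at < self.display_seconds:
            return self._text
        return None


ticker = Ticker(display_seconds=5)
ticker.push("Budget approved.", now=0.0)
print(ticker.visible(now=3.0))
```

Taking the current time as an argument, rather than reading a clock inside the class, keeps the behavior easy to check without waiting for real time to pass.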

[0234] Users can view a summary displayed on their device and stay informed about the meeting's progress in real time. This feature makes it easy for latecomers and those joining from other meetings to understand the flow of the meeting and participate effectively.

[0235] Thus, the system of the present invention helps to improve the efficiency of meetings and reduce wasted time by making it easier to instantly grasp meeting information.

[0236] The following describes the processing flow.

[0237] Step 1:

[0238] The server receives audio streaming data from the conference in real time using the conference system's communication interface. This data is sequentially stored in a buffer for subsequent processing.

[0239] Step 2:

[0240] The server sends the received audio streaming data to a speech recognition API, which converts it into text data in real time. This API recognizes the speech, generates the corresponding string, and returns it to the server.

[0241] Step 3:

[0242] The server inputs the converted text data into a natural language processing model, which analyzes the context of the text. This model identifies keywords and phrases that can be used to extract important information from the conversation.

[0243] Step 4:

[0244] The server generates a summary based on the extracted key information. This summary is designed to highlight the core points of the conversation in a single-line format.

[0245] Step 5:

[0246] The server sends summaries generated at regular intervals to the terminal. This data is encoded as text and delivered to the terminal in real time.

[0247] Step 6:

[0248] The terminal receives a summary sent from the server and displays it on the user's screen in a scrolling text format. This display appears as a pop-up at the bottom or corner of the screen and is continuously updated.

[0249] Step 7:

[0250] Users can understand the progress of the meeting by checking the summary displayed on their device screen. This allows users to efficiently understand the meeting content and obtain important information for participating in the discussion.
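The buffering of Step 1 and the regular-interval transmission of Step 5 above might be combined as in the following sketch. The class `PeriodicSummarizer`, the 30-second default interval, and the injected `summarize` callable are hypothetical. The disclosure states only that summaries are sent at regular intervals and does not give a value.

```python
from typing import Callable, List, Optional


class PeriodicSummarizer:
    """Buffers recognized text and emits one summary per interval."""

    def __init__(self, summarize: Callable[[str], str], interval_s: float = 30.0):
        self.summarize = summarize
        self.interval_s = interval_s      # assumed interval; not given in the text
        self.buffer: List[str] = []
        self.last_emit: Optional[float] = None

    def add_text(self, text: str, now: float) -> Optional[str]:
        """Add recognized text; return a summary once the interval has elapsed."""
        if self.last_emit is None:
            self.last_emit = now          # the first text starts the first interval
        self.buffer.append(text)
        if now - self.last_emit < self.interval_s:
            return None
        summary = self.summarize(" ".join(self.buffer))
        self.buffer.clear()
        self.last_emit = now
        return summary


s = PeriodicSummarizer(lambda t: t.upper(), interval_s=10)
for second, words in [(0, "design review"), (5, "is done"), (10, "next is testing")]:
    result = s.add_text(words, now=second)
    if result is not None:
        print(result)
```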

[0251] (Example 1)

[0252] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0253] In modern meetings, it is crucial for participants to grasp information appropriately in real time and participate effectively in discussions. However, the frequent use of complex conversations and technical jargon can sometimes lead to inefficient information transmission. In such situations, there is a need for a system that allows participants to efficiently understand the progress of the meeting and for those joining late to easily catch up.

[0254] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0255] In this invention, the server includes means for acquiring audio data, means for converting the acquired audio data into text data, and means for analyzing the converted text data to extract important information. This allows participants to grasp important information during the meeting in real time and participate in the meeting efficiently through concise summaries that take into account the conversation and context.

[0256] "Audio data" refers to information used to record or transmit sounds and speech generated in a meeting or similar setting in digital format.

[0257] "Text data" refers to digital data that represents audio data as textual information.

[0258] "Important information" refers to items or keywords that are particularly noteworthy among the topics discussed during a meeting, and is information necessary to understand the main points of the meeting.

[0259] A "summary" is a short, concise digest of the meeting's content and key information, designed to allow participants to quickly grasp the information.

[0260] "Visual devices" refer to devices or displays used by meeting participants to visually confirm information.

[0261] A "communication network" is a system that provides infrastructure for sending and receiving data, enabling the mutual transfer of digital information over a wide or limited area.

[0262] "Language analysis technology" refers to techniques for processing natural language and analyzing its content, meaning, and context.

[0263] The system of this invention is an information processing device for efficiently transmitting information in meetings. The system performs a series of processes: acquiring audio data, converting it into text data, extracting important information, generating a summary, and displaying it on a visual device. This allows meeting participants to grasp information in real time and participate in the meeting smoothly.

[0264] The server acquires audio data in real time via the conference system's communication network. The server then uses a speech recognition API (e.g., a common speech recognition service) to convert the acquired audio data into text data. The speech recognition API is equipped with advanced recognition algorithms and can accurately transcribe everyday conversations and technical terms into text.

[0265] Next, the server analyzes the text data using a natural language processing model (e.g., BERT or GPT-3). The server extracts important keywords and phrases from the analyzed text data and generates a concise summary based on them. For example, if a discussion took place regarding project progress, the server might generate a summary such as, "New feature implementation complete. Bugs reported in user testing. Bug fixing is the priority."

[0266] The terminal receives a summary sent from the server and displays it on a visual device in a scrolling text format. Users can check the summary displayed on the terminal and efficiently grasp the progress of the meeting. This helps users who join late to quickly understand the content of the meeting.

[0267] Example prompt: "Extract the key information from this text and generate a concise one-line summary." This allows the server to use a generative AI model to create an appropriate summary.
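Because a generative AI model may return more than one line even when asked for a one-line summary, the server could normalize the output before sending it, for example as sketched below. The functions `build_request` and `to_one_line` and the 80-character limit are assumptions for illustration and are not taken from this disclosure.

```python
ONE_LINE_PROMPT = (
    "Extract the key information from this text and generate a concise "
    "one-line summary."
)


def build_request(transcript: str) -> str:
    """Place the example prompt in front of the text to be summarized."""
    return f"{ONE_LINE_PROMPT}\n\n{transcript}"


def to_one_line(model_output: str, max_chars: int = 80) -> str:
    """Collapse whitespace and truncate so the result fits a one-line ticker."""
    line = " ".join(model_output.split())
    if len(line) <= max_chars:
        return line
    return line[: max_chars - 3].rstrip() + "..."


print(to_one_line("New feature implementation complete.\nBugs reported in user testing."))
```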

[0268] This invention makes it possible to quickly organize information during meetings and provide an environment in which participants can effectively engage in discussions.

[0269] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0270] Step 1:

[0271] The server acquires audio data in real time from the conference system via the communication network. Specifically, the audio signal captured by the microphone is transferred to the server as digital data, and the server receives it as input. The server's role is to prepare this audio data for analysis.

[0272] Step 2:

[0273] The server sends the acquired audio data to a speech recognition API, where it is converted into text data. This data processing involves converting the audio signal into text information. The resulting text data is returned to the server in string format and becomes the input for the next analysis step.

[0274] Step 3:

[0275] The server analyzes the converted text data using a natural language processing model. It processes the text data as input and performs data operations to identify important keywords and phrases. The output of this step is a list of extracted important information, which forms the basis for summary generation.

[0276] Step 4:

[0277] The server generates a summary based on the extracted key information. This process uses a generative AI model to create a concise, one-line summary that is relevant to the context of the conversation. The server outputs this summary and prepares it for transmission to the terminal.

[0278] Step 5:

[0279] The terminal receives a summary sent from the server and displays it on the user's visual device. The summary, as input, is displayed on the screen in a ticker-style format via display software on the terminal. As output, it provides an environment where the user can visually review the summary.

[0280] Step 6:

[0281] Users review the summary displayed on their devices to understand the progress of the meeting. Their specific actions include searching for relevant information on their devices as needed and preparing to participate in the discussion based on the meeting content. The output of this step is user understanding and effective participation in the meeting.
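The ticker-style display of Step 5 above can be pictured as a fixed-width window sliding over the summary text. The sketch below generates the successive frames that display software on the terminal might draw. The function name `scroll_frames` and the 10-character width are assumptions, and actual rendering and timing are left out.

```python
from typing import Iterator


def scroll_frames(text: str, width: int) -> Iterator[str]:
    """Yield successive fixed-width views of `text` scrolling right to left."""
    # pad both sides so the text enters from the right and leaves on the left
    padded = " " * width + text + " " * width
    for start in range(len(padded) - width + 1):
        yield padded[start:start + width]


summary = "Bug fixing is the priority."
views = list(scroll_frames(summary, width=10))
print(repr(views[10]))
```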

[0282] (Application Example 1)

[0283] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0284] In modern families, many family members have different schedules and roles, and information sharing and decision-making among family members often become complicated. The purpose of the present invention is to provide a system that enables smooth communication and schedule management through efficient utilization of voice information within a household.

[0285] The specific processing by the specific processing unit 290 of the data processing apparatus 12 in Application Example 1 is realized by the following means.

[0286] In this invention, the server includes means for acquiring voice data during a meeting, means for converting the acquired voice data into text data in real time, means for extracting important information from the converted text data, means for generating a summary based on the extracted important information, means for displaying the generated summary on the display device of the device at regular intervals, and means for acquiring voice data within a household and providing a summary to family members in real time. This makes it possible to grasp the conversation content within a household in real time and provide necessary information in a concise summary.

[0287] "Voice data" is data obtained by recording or transmitting spoken words generated during a meeting or within a household in a recordable or transmissible form.

[0288] "Means for converting into text data in real time" is a process or device that receives voice data and instantaneously replaces it with character information.

[0289] "Means for extracting important information" is a method or device for identifying highly relevant keywords and phrases from text data.

[0290] "Means for generating a summary" is a process or device that creates a summary that concisely summarizes the content based on the extracted important information.

[0291] "Display device" is hardware or a digital device for visually presenting the generated information.

[0292] "Means for acquiring voice data within a household" refers to a system or device for recording or recognizing spoken language in a home environment.

[0293] "Means for providing a summary to family members in real time" refers to a method or device that analyzes conversations within the family as they occur, summarizes the important information, and immediately communicates it to family members.

[0294] This invention relates to a system that efficiently analyzes audio information within a home, extracts important information, and automatically presents a summary using a visual device. Specific embodiments are described below.

[0295] The server is connected to devices (e.g., built-in microphones in smart speakers) in a home environment such as a living room or kitchen to acquire audio in real time. This server uses speech recognition software (e.g., Google's speech recognition API) to convert the acquired audio data into text data in real time, so that the subsequent extraction and summarization can keep pace with the conversation.

[0296] Furthermore, the server utilizes natural language processing models (e.g., BERT and GPT) to extract important keywords and phrases from text data. This allows it to understand the flow of the conversation and select appropriate information.

[0297] On the device, a summary sent from the server is received and displayed on the screen. This can be done using a home smart display or television. For example, if a family discusses a picnic for the next weekend, the system will present a summary of relevant information such as "next steps" and "items to prepare."

[0298] This allows users to stay informed about family conversations and important schedules in real time, enabling them to act more effectively. This feature is particularly useful in preventing misunderstandings when not all family members are together or during busy daily lives.

[0299] Example of prompt sentence:

[0300] "Please summarize the content of the family meeting from the voice data and extract important keywords. Example: schedule, location, role assignment"

[0301] The flow of the specific processing in Application Example 1 will be explained using Figure 12.

[0302] Step 1:

[0303] The server receives voice data in real time as input from a voice acquisition device installed in the home. This data is sent to the server in the form of a voice file.

[0304] Step 2:

[0305] The server converts the received voice data into text data using voice recognition software. In this voice recognition process, the voice waveform data is analyzed and mapped to the corresponding character information. The output is text data that transcribes the spoken content as-is.

[0306] Step 3:

[0307] The server passes the generated text data through a natural language processing model as input to extract important keywords and phrases. In this process, data operations are performed to detect specific words and patterns while analyzing the structure of the text. The output is a list of key phrases that reflects the context of the conversation.

[0308] Step 4:

[0309] The server performs further data processing to generate a concise summary from the extracted key phrases. Specifically, it creates a text that reorganizes and shortens highly relevant information. The output is a simple summary text provided to the family members.

[0310] Step 5:

[0311] The terminal receives summary text sent from the server and displays it on a home display or smart screen. Displaying it requires converting it into a user-friendly format. The output is a visually presented summary text.

[0312] Step 6:

[0313] By visually reviewing the summary displayed on the screen, users can understand the conversations and decisions made within the household. The input in this step is the visual information of the summary text, which leads to output that facilitates user understanding and action.
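As one possible illustration of Steps 3 and 4 above, the detection of "specific words and patterns" can be sketched with regular expressions. The three categories follow the example prompt (schedule, location, role assignment); the patterns themselves, the function names, and the `max_chars` limit are hypothetical and would be replaced by a natural language processing model in practice.

```python
import re
from typing import Dict, List

# Hypothetical patterns, one per category named in the example prompt.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "schedule": re.compile(r"\b(?:next|this)\s+(?:weekend|week|month)\b", re.I),
    "location": re.compile(r"\b(?:at|in|to)\s+the\s+\w+", re.I),
    "role": re.compile(r"\b\w+\s+will\s+(?:bring|prepare|buy|drive)\s+(?:the\s+)?\w+", re.I),
}


def extract_key_phrases(text: str) -> Dict[str, List[str]]:
    """Step 3: collect every phrase that matches a category pattern."""
    return {
        category: [m.group(0) for m in pattern.finditer(text)]
        for category, pattern in PATTERNS.items()
    }


def build_summary(phrases: Dict[str, List[str]], max_chars: int = 100) -> str:
    """Step 4: reorganize the key phrases into one short line."""
    parts = [f"{cat}: {', '.join(items)}" for cat, items in phrases.items() if items]
    summary = " | ".join(parts)
    if len(summary) > max_chars:
        summary = summary[: max_chars - 3] + "..."
    return summary


conversation = "Let's have the picnic next weekend at the park. Dad will bring the drinks."
phrases = extract_key_phrases(conversation)
summary = build_summary(phrases)
```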

[0314] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0315] This invention provides a system that can efficiently grasp information during a meeting and recognize the emotional state of the users, thereby enabling a real-time understanding of the meeting's atmosphere. This system operates through the coordinated efforts of a server, terminals, and users.

[0316] The server acquires audio data from the conference system in real time. Specifically, the server receives the audio stream via the conference system's communication interface and processes the data. The server then sends this audio data to a speech recognition API to convert it into text data. This process transforms the audio into text, laying the foundation for recording and analyzing meeting minutes.

[0317] Next, the server analyzes the text data. Using a natural language processing model, it understands the context of the conversation and extracts key information. Based on this, a summary capturing the essence of the meeting is generated. This summary efficiently extracts the main points from the lengthy conversation and presents them clearly to the participants.

[0318] Furthermore, the server uses an emotion engine to analyze the user's emotional state from the audio data. The emotion engine identifies emotions such as joy, anger, and sadness based on characteristics such as the speaker's tone, intonation, and speed. This emotional information is incorporated into the summary and used to provide participants with an overview of the meeting's atmosphere and emotional progression.

[0319] The terminal receives summary and sentiment information transmitted from the server and displays it on the user's visual device. The progress of the meeting and visual feedback on the emotional state are shown together in the form of a ticker on a portion of the screen.

[0320] Users can view summaries and sentiment information displayed on their devices, allowing them to grasp the meeting content in real time. This enables users to gain a deeper understanding of the meeting's context and participate more effectively based on comprehensive information, including the emotional states of other participants.

[0321] Thus, the present invention provides a new means for improving the quality of meetings by comprehensively capturing the content of the discussions and the emotions of the participants.

[0322] The following describes the processing flow.

[0323] Step 1:

[0324] The server acquires audio data from the conference system in real time. In this process, the server receives the conference audio stream via a communication interface and continuously collects data.

[0325] Step 2:

[0326] The server sends the acquired audio data to a speech recognition API, which converts it into text data in real time. This conversion process utilizes high-precision speech recognition technology to generate accurate text strings from the audio.

[0327] Step 3:

[0328] The server analyzes the converted text data using a natural language processing model. This involves understanding the context of the text and extracting important keywords and phrases.

[0329] Step 4:

[0330] The server generates a summary based on the extracted key information. The generated summary condenses the meeting's main points into a single line, making it easy for participants to quickly understand.

[0331] Step 5:

[0332] The server uses an emotion engine to analyze the user's emotional state based on the audio data. By evaluating the tone, speed, and intonation of the voice, the emotion engine identifies the speaker's emotions.

[0333] Step 6:

[0334] The server incorporates the emotional information analyzed by the emotion engine into a summary, further compiling it into information that reveals the overall atmosphere of the meeting.

[0335] Step 7:

[0336] The server sends the generated summary and sentiment information to the terminal. This data is encoded in a format suitable for visual presentation on the terminal.

[0337] Step 8:

[0338] The terminal receives data transmitted from the server and displays it in ticker format on the meeting participants' visual devices. The displayed information includes the progress of the meeting and the emotional state of the participants, and is visually indicated in the appropriate part of the screen.

[0339] Step 9:

[0340] Users can understand the key points of a meeting and the emotional state of participants in real time by viewing summaries and sentiment information displayed on their devices. This allows users to gain a deeper understanding of the meeting content and use it as material for effective decision-making.
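For illustration only, Steps 6 and 7 above (combining the summary with emotion information and encoding it for transmission) might be sketched as follows. The JSON encoding, the `TickerPayload` field names, and the "neutral" fallback label are assumptions introduced here; the disclosure does not specify a wire format.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TickerPayload:
    summary: str                    # one-line summary from Step 4
    dominant_emotion: str           # most frequent label from the emotion engine
    emotion_counts: Dict[str, int]  # how often each label was observed


def build_payload(summary: str, emotions: List[str]) -> TickerPayload:
    """Step 6: merge the summary with the emotion labels for this interval."""
    counts = Counter(emotions)
    # "neutral" is a hypothetical fallback when no label is available.
    dominant = counts.most_common(1)[0][0] if counts else "neutral"
    return TickerPayload(summary, dominant, dict(counts))


def encode(payload: TickerPayload) -> bytes:
    """Step 7: serialize for transmission to the terminal."""
    return json.dumps(asdict(payload), ensure_ascii=False).encode("utf-8")


def decode(raw: bytes) -> TickerPayload:
    """Terminal side: restore the payload before display (Step 8)."""
    return TickerPayload(**json.loads(raw.decode("utf-8")))


payload = build_payload("Sales are increasing", ["joy", "joy", "anger"])
restored = decode(encode(payload))
```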

[0341] (Example 2)

[0342] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0343] In meetings, participants must understand a vast amount of information in a short time, and it is particularly difficult to grasp spoken conversations in real time as text. Furthermore, while understanding participants' emotional states during a meeting is valuable, capturing emotional changes through audio alone presents a challenge. This can hinder the efficient progress of meetings and decision-making.

[0344] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0345] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, and means for analyzing emotional states from the audio information to acquire emotional information. This enables efficient information gathering during a meeting, captures changes in participants' emotions, and facilitates more effective meeting management.

[0346] "Audio information" refers to data that digitally represents the characteristics of the voices spoken during a meeting.

[0347] "Textual information" refers to data in text format generated based on audio information.

[0348] "Important information" refers to essential content that is relevant to the purpose of the meeting and can be extracted from the conversation during the meeting.

[0349] A "summary" is a text document that condenses the main points from the vast amount of information gathered during a meeting.

[0350] A "visual device" is a hardware device that displays information in a way that allows users to visually confirm it.

[0351] "Emotional state" refers to data that describes the mental state of meeting participants, based on factors such as tone, intonation, and speed of speech.

[0352] "Emotional information" refers to digital data obtained by analyzing audio information to indicate the emotional state of participants.

[0353] A "natural language processing model" is a collection of algorithms and systems that understand the meaning and context from text data and perform data extraction and generation.

[0354] This invention is a system for efficiently grasping information during meetings and recognizing the emotional state of participants in real time. This system consists of three elements: a server, a terminal, and a user, each playing a specific role.

[0355] The server acquires audio information in real time using the conference system's communication method. This process involves receiving audio in streaming format using protocols such as WebRTC or SIP. Then, to convert the audio information into text, it utilizes a speech recognition API (e.g., a general speech recognition service). This speech recognition API instantly transcribes the acquired audio information into text, enabling accurate recording of the conversation.
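As a minimal sketch of how a streamed audio signal could be prepared for a speech recognition API, incoming bytes can be buffered and cut into fixed-duration chunks. The class `AudioChunker`, the assumption of mono PCM audio, and the default sample rate, sample width, and chunk length are all hypothetical; the handling of WebRTC or SIP itself is outside this sketch.

```python
from typing import List


class AudioChunker:
    """Accumulates raw mono PCM bytes and emits fixed-duration chunks."""

    def __init__(self, sample_rate: int = 16000, sample_width: int = 2,
                 chunk_ms: int = 1000) -> None:
        # Number of bytes that make up one chunk of chunk_ms milliseconds.
        self.chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add newly received bytes and return every complete chunk."""
        self._buffer.extend(data)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left, for example when the meeting ends."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest


chunker = AudioChunker(sample_rate=8000, sample_width=2, chunk_ms=100)
first = chunker.feed(b"\x00" * 1000)    # not yet a full chunk
second = chunker.feed(b"\x00" * 2500)   # completes two chunks
remainder = chunker.flush()
```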

[0356] Next, the server analyzes the text data using a natural language processing model (e.g., a language model such as BERT or GPT). This allows it to understand the context of the conversation and extract important information. This information extraction efficiently summarizes key points such as "When is the next meeting?" or "Client feedback on a specific product."

[0357] Furthermore, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to analyze the user's emotional state from the audio information. It determines the participant's emotions from the tone, intonation, and speed of their voice, identifying emotional information such as joy, anger, and sadness. This emotional information is integrated with the summary and used to provide participants with a sense of the meeting's atmosphere.
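Purely as an illustrative placeholder for an emotion analysis engine, the mapping from tone, intonation, and speed to an emotion label can be sketched with simple rules. The feature names, every threshold, and the "neutral" label are arbitrary assumptions, not empirically derived values, and a practical engine would rely on a trained model such as the emotion identification model 59.

```python
from dataclasses import dataclass


@dataclass
class ProsodyFeatures:
    mean_pitch_hz: float     # rough proxy for tone
    pitch_range_hz: float    # rough proxy for intonation
    words_per_minute: float  # speaking speed


def classify_emotion(f: ProsodyFeatures) -> str:
    """Toy rule set; all thresholds are illustrative placeholders."""
    fast_and_lively = f.words_per_minute >= 170 and f.pitch_range_hz >= 80
    slow_and_flat = f.words_per_minute <= 110 and f.pitch_range_hz <= 40
    if fast_and_lively:
        # Higher average pitch is treated as joy, lower as anger (assumption).
        return "joy" if f.mean_pitch_hz >= 200 else "anger"
    if slow_and_flat:
        return "sadness"
    return "neutral"


label = classify_emotion(ProsodyFeatures(mean_pitch_hz=220, pitch_range_hz=100,
                                         words_per_minute=180))
```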

[0358] The terminal receives summaries and sentiment information sent from the server and displays them on the user's visual device. Specifically, the information is presented on the display device in the form of pop-up windows or tickers, allowing the user to understand the progress of the meeting and the emotional state of the participants.

[0359] Users can view the information provided by this device and understand the meeting content in real time. This allows users to gain a deeper understanding of the meeting context and engage in more effective discussions that take into account the emotional states of other participants.

[0360] For example, in a company meeting, information such as "sales are increasing" is summarized, and the speaker's emotion of "joy" is identified and displayed on the screen. An example of such a prompt would be, "Summarize the meeting content in real time and analyze the emotional state of the participants."

[0361] This system makes it possible to improve the quality of meetings by integrating meeting information with participants' emotions.

[0362] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0363] Step 1:

[0364] The server acquires audio information in real time using the conference system's communication methods. It receives an audio stream from the conference as input. For processing, it collects the audio data in digital format using the WebRTC or SIP protocol. As output, the server holds the audio information in digital format.

[0365] Step 2:

[0366] The server sends the acquired audio information to a speech recognition API, which converts it into text information. Digital audio information is used as input. The processing involves the speech recognition service analyzing the audio data and converting its contents into text format. The server then obtains the text information as output.

[0367] Step 3:

[0368] The server feeds the textual information obtained through speech recognition into a natural language processing model to extract important information. Textual information is used as input. During processing, the model (e.g., BERT or GPT) understands the context of the text and extracts the key points of the meeting. The output is summarized text information.

[0369] Step 4:

[0370] The server uses an emotion engine to recognize the emotional state from the audio information, taking the obtained text information into account. The inputs used are audio information and text information. The processing involves analyzing the speaker's tone, intonation, and speaking speed to identify emotions such as joy, anger, and sadness. The output is the identified emotion information.

[0371] Step 5:

[0372] The server integrates extracted key information and sentiment information to generate a summary. It uses summarized text information and sentiment information as input. The processing combines them to create information that reflects the content and atmosphere of the meeting. The output is a visually presentable summary.

[0373] Step 6:

[0374] The terminal receives summary and sentiment information transmitted from the server and displays it to the user. Integrated summary information is used as input. Processing includes outputting information to a display device, and presentation in the form of a pop-up window or ticker. The output is information that the user can visually confirm.

[0375] Step 7:

[0376] Users review summaries and sentiment information displayed on their devices to understand the meeting content in real time. Input is the information provided by the device. Processing involves users making decisions based on this information and guiding the discussion. The output is effective meeting participation and decision-making.

[0377] (Application Example 2)

[0378] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0379] During meetings, a large amount of information is exchanged, making it difficult to efficiently grasp and understand it. Furthermore, understanding the emotional state of participants is crucial for improving meeting outcomes, but in typical meetings, this information cannot be shared in real time.

[0380] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0381] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, means for extracting important information from the converted text information and generating summary information, and means for analyzing emotional states from the audio information and generating emotional information. This makes it possible to provide real-time visual feedback of meeting summary information and participants' emotional information.

[0382] "Audio information" refers to the sound signals emitted by participants during a meeting, which convey the content of their communication.

[0383] "Textual information" refers to text data obtained by analyzing and converting audio information, and is a format designed to facilitate recording and analysis.

[0384] "Important information" refers to key points and keywords extracted from conversations using natural language processing technology.

[0385] "Summary information" refers to abbreviated information generated based on extracted key information, which helps in understanding the content of the meeting.

[0386] "Visual devices" are devices that enable meeting participants to visually perceive information, and include display screens and projectors.

[0387] "Emotional state" refers to the type of emotion identified by analyzing the emotional characteristics of the speaker contained in the audio information.

[0388] "Emotional information" refers to data indicating emotional states, analyzed by the emotion engine from audio information, and is used to understand the atmosphere of a meeting and the reactions of the participants.

[0389] The system for implementing the present invention mainly consists of elements such as a server, a terminal, and a user.

[0390] During a meeting, the server acquires the speech of meeting participants as audio information. This involves a process of receiving audio signals via communication means to stream the meeting audio to the server in real time. The server converts the received audio information into text information using speech recognition technology, and then utilizes natural language processing models to extract important information from the text information. Specifically, speech recognition APIs such as Google Cloud Speech-to-Text and natural language processing libraries such as spaCy can be used.

[0391] Furthermore, the server uses sentiment analysis libraries to analyze the emotional state contained in the audio information. For example, it uses IBM Watson Tone Analyzer to identify the speaker's emotional state from the audio information and generate sentiment data.

[0392] The generated summary and sentiment information are periodically displayed on the meeting participants' visual devices via their terminals. This allows users to understand the participants' emotional responses in real time, along with the important agenda items, as the meeting progresses.

[0393] As a concrete example, consider a project progress report at a weekly meeting. The server receives the project leader's statements as audio information, extracts the main progress and problems, and converts them into text information. It also generates emotional information, such as the leader's expectations and other members' anxieties, detected in the statements, visualizes it, and conveys it to the participants.

[0394] An example of a prompt for a generative AI model is: "Analyze the speaker's emotions in real time during the meeting and create the following text summary: {Audio data}. Analyze the emotional state and identify the following key emotional tones: Joy, Anger, Sadness. Incorporate the overall atmosphere of the session into the summary."
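The example prompt above can be treated as a template whose placeholder is filled in for each interval. The following is a sketch under the assumption that the placeholder is filled with transcribed text rather than raw audio; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names.

```python
from typing import Sequence

# Template derived from the example prompt; {audio_data} marks the placeholder.
PROMPT_TEMPLATE = (
    "Analyze the speaker's emotions in real time during the meeting and "
    "create the following text summary:\n"
    "{audio_data}\n"
    "Analyze the emotional state and identify the following key emotional "
    "tones: {tones}.\n"
    "Incorporate the overall atmosphere of the session into the summary."
)


def build_prompt(transcript: str,
                 tones: Sequence[str] = ("Joy", "Anger", "Sadness")) -> str:
    """Fill the template with the latest transcript and the tone list."""
    return PROMPT_TEMPLATE.format(audio_data=transcript.strip(),
                                  tones=", ".join(tones))


prompt = build_prompt("  Sales are increasing this quarter.  ")
```

Because `str.format` parses only the template, braces that happen to appear in the transcript are inserted literally.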

[0395] In this way, the present invention enables meeting participants to deeply understand the meeting content and receive detailed feedback, including the emotional aspects of the dialogue, in real time.

[0396] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0397] Step 1:

[0398] The server acquires audio information during a meeting. As input, it receives the meeting's audio stream in real time via a communication method. The server temporarily stores this audio data as an audio file.

[0399] Step 2:

[0400] The server converts the acquired audio information into text information. It sends the audio data as input to a speech recognition API such as Google Cloud Speech-to-Text, obtaining text data as output. This process converts the audio into understandable text information.

[0401] Step 3:

[0402] The server extracts important information from text data. It uses spaCy, a natural language processing library, to analyze text data as input. As output, it generates summary information containing keywords and important phrases from the conversation.

[0403] Step 4:

[0404] The server generates emotional information from audio data. It processes the audio data as input using an emotional analysis library such as IBM Watson Tone Analyzer. The output is data indicating the speaker's emotional state. This information includes emotional indicators such as joy and anger.

[0405] Step 5:

[0406] The terminal displays the generated summary and sentiment information on a visual device. It receives summary and sentiment information from a server as input and displays it on a screen or projector. The output is provided to participants as visual feedback.

[0407] Step 6:

[0408] Users understand the meeting content through information displayed on visual devices. Based on information provided by the terminal as input, users grasp the progress of the meeting and the emotional state of participants in real time. This allows for deeper participation in the meeting.

[0409] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0410] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0411] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0412] [Third Embodiment]

[0413] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0414] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0415] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0416] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0417] The microphone 238 receives instructions and the like from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0418] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0419] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0420] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0421] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0422] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0423] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0424] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0425] The system of this invention is configured to improve the efficiency of information transmission in meetings. This system functions through the interaction of its server, terminal, and user elements.

[0426] The server acquires the conference audio data in real time. Specifically, the server receives the audio stream via the conference system's communication interface. Upon receiving the audio data, the server sends it to a speech recognition API, where it is converted into text data in real time. This speech recognition uses a highly accurate recognition algorithm that can handle both everyday conversations and specialized terminology.

[0427] Next, the server analyzes the text data and utilizes a natural language processing model to extract important information. This allows it to determine the meaning and context of the conversation and identify important keywords and phrases. For example, if the discussion concerns the progress of a project, the system might extract phrases such as "progress," "challenges," and "next steps."

[0428] Based on the extracted information, the server generates a summary of about one line. This summary is a concise overview designed to allow participants to quickly grasp the progress of the meeting. The server periodically sends the generated summary to the terminals.
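One possible way to realize the periodic generation and sending of summaries is sketched below. The 30-second interval, the class name, and the placeholder summarizer are assumptions for illustration; the source states only that summaries are sent periodically. Timestamps are passed in explicitly so that the behavior does not depend on a real clock.

```python
class IntervalSummarizer:
    """Collects transcript fragments and emits one summary per interval."""

    def __init__(self, summarize, interval_seconds=30.0):
        self.summarize = summarize          # callable: list of str -> str
        self.interval = interval_seconds    # assumed value, not from the source
        self.pending = []
        self.window_start = None

    def add(self, fragment, now):
        """Add a fragment at time `now` (seconds); return a summary when due."""
        if self.window_start is None:
            self.window_start = now
        self.pending.append(fragment)
        if now - self.window_start >= self.interval:
            summary = self.summarize(self.pending)
            self.pending = []
            self.window_start = now
            return summary
        return None


def latest_fragment_summary(fragments):
    """Placeholder summarizer: keeps only the most recent fragment."""
    return fragments[-1].strip()


summarizer = IntervalSummarizer(latest_fragment_summary, interval_seconds=30.0)
results = [
    summarizer.add("Kickoff.", 0.0),
    summarizer.add("Budget approved.", 12.0),
    summarizer.add("Launch moved to May.", 31.0),
]
print(results)
```

Only the third call returns a summary, because it is the first to arrive at least 30 seconds after the window opened.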

[0429] The terminal receives summaries sent from the server. The received summaries are displayed as tickers on the visual devices used by the meeting participants. The display appears for a set period of time in a corner or at the bottom of the screen and is updated each time a new summary arrives.
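The display behavior on the terminal side could be modeled as follows. This is a sketch under assumptions: the 10-second display period and the class and method names are hypothetical, and actual drawing on the screen is omitted.

```python
class Ticker:
    """Holds the summary currently shown and hides it after a fixed time."""

    def __init__(self, display_seconds=10.0):
        self.display_seconds = display_seconds  # assumed duration
        self.text = None
        self.shown_at = None

    def update(self, summary, now):
        """Replace the displayed text when a new summary arrives."""
        self.text = summary
        self.shown_at = now

    def visible_text(self, now):
        """Return the text to draw at time `now`, or None once it has expired."""
        if self.text is None:
            return None
        if now - self.shown_at > self.display_seconds:
            return None
        return self.text


ticker = Ticker(display_seconds=10.0)
ticker.update("Budget approved.", now=0.0)
print(ticker.visible_text(now=5.0))
print(ticker.visible_text(now=11.0))
```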

[0430] Users can view a summary displayed on their device and stay informed about the meeting's progress in real time. This feature makes it easy for latecomers and those joining from other meetings to understand the flow of the meeting and participate effectively.

[0431] Thus, the system of the present invention helps to improve the efficiency of meetings and reduce wasted time by making it easier to instantly grasp meeting information.

[0432] The following describes the processing flow.

[0433] Step 1:

[0434] The server receives audio streaming data from the conference in real time using the conference system's communication interface. This data is sequentially stored in a buffer for subsequent processing.

[0435] Step 2:

[0436] The server sends the received audio streaming data to a speech recognition API, which converts it into text data in real time. This API recognizes the speech, generates the corresponding string, and returns it to the server.

[0437] Step 3:

[0438] The server inputs the converted text data into a natural language processing model, which analyzes the context of the text. This model identifies keywords and phrases that can be used to extract important information from the conversation.

[0439] Step 4:

[0440] The server generates a summary based on the extracted key information. This summary is designed to highlight the core points of the conversation in a single-line format.

[0441] Step 5:

[0442] The server sends the generated summaries to the terminal at regular intervals. This data is encoded as text and delivered to the terminal in real time.

[0443] Step 6:

[0444] The terminal receives a summary sent from the server and displays it on the user's screen in a scrolling text format. This display appears as a pop-up at the bottom or corner of the screen and is continuously updated.

[0445] Step 7:

[0446] Users can understand the progress of the meeting by checking the summary displayed on their device screen. This allows users to efficiently understand the meeting content and obtain important information for participating in the discussion.
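Steps 1 to 6 above can be sketched as a single loop with each external service injected as a function. All four stand-in functions are assumptions for illustration and do not represent any particular speech recognition API or language model. For simplicity the sketch produces one summary per audio chunk rather than at regular intervals.

```python
from collections import deque


def run_pipeline(audio_chunks, transcribe, extract, summarize, display):
    """Pass each buffered audio chunk through Steps 2 to 6 in order."""
    buffer = deque(audio_chunks)           # Step 1: buffered audio stream
    shown = []
    while buffer:
        chunk = buffer.popleft()
        text = transcribe(chunk)           # Step 2: speech recognition
        key_info = extract(text)           # Step 3: key information
        summary = summarize(key_info)      # Step 4: one-line summary
        shown.append(display(summary))     # Steps 5 and 6: send and display
    return shown


# Stand-ins for the external services; all are hypothetical.
def fake_transcribe(chunk):
    return chunk.decode("utf-8")


def fake_extract(text):
    return [word for word in text.split() if word.istitle()]


def fake_summarize(key_info):
    return " / ".join(key_info)


def fake_display(summary):
    return "[ticker] " + summary


print(run_pipeline(
    [b"the Budget was approved by Finance"],
    fake_transcribe, fake_extract, fake_summarize, fake_display,
))
```

Passing the stages as arguments keeps the flow independent of which recognition service or model is actually used.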

[0447] (Example 1)

[0448] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0449] In modern meetings, it is crucial for participants to grasp information appropriately in real time and participate effectively in discussions. However, complex conversations and the frequent use of technical jargon can sometimes make information transmission inefficient. In such situations, there is a need for a system that allows participants to efficiently understand the progress of the meeting and for those joining late to easily catch up.

[0450] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0451] In this invention, the server includes means for acquiring audio data, means for converting the acquired audio data into text data, and means for analyzing the converted text data to extract important information. This allows participants to grasp important information during the meeting in real time and participate in the meeting efficiently through concise summaries that take into account the conversation and context.

[0452] "Audio data" refers to information used to record or transmit sounds and speech generated in a meeting or similar setting in digital format.

[0453] "Text data" refers to digital data that represents audio data as textual information.

[0454] "Important information" refers to items or keywords that are particularly noteworthy among the topics discussed during a meeting, and is information necessary to understand the main points of the meeting.

[0455] A "summary" is a short, concise overview of the meeting's content and key information, designed to allow participants to quickly grasp the information.

[0456] "Visual devices" refer to devices or displays used by meeting participants to visually confirm information.

[0457] A "communication network" is a system that provides infrastructure for sending and receiving data, enabling the mutual transfer of digital information over a wide or limited area.

[0458] "Language analysis technology" refers to techniques for processing natural language and analyzing its content, meaning, and context.

[0459] The system of this invention is an information processing device for efficiently transmitting information in meetings. The system performs a series of processes: acquiring audio data, converting it into text data, extracting important information, generating a summary, and displaying it on a visual device. This allows meeting participants to grasp information in real time and participate in the meeting smoothly.

[0460] The server acquires audio data in real time via the conference system's communication network. The server then uses a speech recognition API (e.g., a common speech recognition service) to convert the acquired audio data into text data. The speech recognition API is equipped with advanced recognition algorithms and can accurately transcribe everyday conversations and technical terms into text.

[0461] Next, the server analyzes the text data using a natural language processing model (e.g., BERT or GPT-3). The server extracts important keywords and phrases from the analyzed text data and generates a concise summary based on them. For example, if a discussion took place regarding project progress, the server might generate a summary such as, "New feature implementation complete. Bugs reported in user testing. Bug fixing is the priority."

[0462] The terminal receives a summary sent from the server and displays it on a visual device in a scrolling text format. Users can check the summary displayed on the terminal and efficiently grasp the progress of the meeting. This helps users who join late to quickly understand the content of the meeting.

[0463] Example prompt: "Extract the key information from this text and generate a concise one-line summary." This allows the server to use a generative AI model to create an appropriate summary.
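As a sketch of how the example prompt might be used, the code below attaches the transcript to the quoted instruction and collapses the model's reply to a single line. The `generate` parameter and the `canned_model` function are hypothetical stand-ins for a generative AI model; the source does not name a specific model interface.

```python
PROMPT = (
    "Extract the key information from this text and generate "
    "a concise one-line summary."
)


def build_request(transcript):
    """Combine the instruction quoted above with the transcript text."""
    return PROMPT + "\n\nText:\n" + transcript.strip()


def one_line_summary(transcript, generate):
    """Call a model via `generate` and collapse its reply to a single line."""
    reply = generate(build_request(transcript))
    return " ".join(reply.split())


def canned_model(request):
    """Hypothetical stand-in for a generative AI model; returns fixed text."""
    return "New feature implementation complete.\nBug fixing is the priority.\n"


print(one_line_summary("...meeting transcript...", canned_model))
```

Collapsing whitespace guards against a reply that spans several lines, which would not fit a one-line ticker.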

[0464] This invention makes it possible to quickly organize information during meetings and provide an environment in which participants can effectively engage in discussions.

[0465] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0466] Step 1:

[0467] The server acquires audio data in real time from the conference system via the communication network. Specifically, the audio signal captured by the microphone is transferred to the server as digital data, and the server receives it as input. The server's role is to prepare this audio data for analysis.

[0468] Step 2:

[0469] The server sends the acquired audio data to a speech recognition API, where it is converted into text data. This data processing involves converting the audio signal into text information. The resulting text data is returned to the server in string format and becomes the input for the next analysis step.

[0470] Step 3:

[0471] The server analyzes the converted text data using a natural language processing model. It processes the text data as input and performs data operations to identify important keywords and phrases. The output of this step is a list of extracted important information, which forms the basis for summary generation.

[0472] Step 4:

[0473] The server generates a summary based on the extracted key information. This process uses a generative AI model to create a concise, one-line summary that is relevant to the context of the conversation. The server outputs this summary and prepares it for transmission to the terminal.

[0474] Step 5:

[0475] The terminal receives a summary sent from the server and displays it on the user's visual device. Taking the summary as input, display software on the terminal shows it on the screen in a ticker-style format. The output is a display on which the user can visually review the summary.

[0476] Step 6:

[0477] Users review the summary displayed on their devices to understand the progress of the meeting. Their specific actions include searching for relevant information on their devices as needed and preparing to participate in the discussion based on the meeting content. The output of this step is user understanding and effective participation in the meeting.

[0478] (Application Example 1)

[0479] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0480] In modern households, many family members have different schedules and roles, often making information sharing and decision-making among family members complex. This invention aims to provide a system that enables smooth communication and schedule management through the efficient use of voice information within the home.

[0481] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0482] In this invention, the server includes means for acquiring audio data during a meeting, means for converting the acquired audio data into text data in real time, means for extracting important information from the converted text data, means for generating a summary based on the extracted important information, means for displaying the generated summary on the device's display device at regular intervals, and means for acquiring audio data within the home and providing summaries to family members in real time. This makes it possible to grasp the content of conversations within the home in real time and provide necessary information in a concise summary.

[0483] "Audio data" refers to recorded or transmitted data of spoken language that occurs in meetings or within a household.

[0484] "Means of converting to text data in real time" refers to a process or device that receives audio data and instantly converts it into text information.

[0485] "Means for extracting important information" refers to methods or devices for identifying highly relevant keywords or phrases from text data.

[0486] "Means for generating a summary" refers to a process or apparatus for creating a concise summary based on extracted key information.

[0487] A "display device" is a piece of hardware or digital device used to visually present generated information.

[0488] "Means for acquiring audio data within the home" refers to a system or device for effectively recording or recognizing spoken language in a home environment.

[0489] "A means of providing real-time summaries to family members" refers to a method or device that instantly analyzes conversations within the family, summarizes important information, and immediately communicates it to family members.

[0490] This invention relates to a system that efficiently analyzes audio information within a home, extracts important information, and automatically presents a summary using a visual device. Specific embodiments are described below.

[0491] The server is connected to devices (e.g., built-in microphones in smart speakers) in a home environment such as a living room or kitchen to acquire audio in real time. This server uses speech recognition software (e.g., Google's speech recognition API) to convert the acquired audio data into text data in real time. This process enables smooth conversational processing.

[0492] Furthermore, the server utilizes natural language processing models (e.g., BERT and GPT) to extract important keywords and phrases from text data. This allows it to understand the flow of the conversation and select appropriate information.

[0493] The terminal receives the summary sent from the server and displays it on the screen. A home smart display or television can be used for this purpose. For example, if a family discusses a picnic for the next weekend, the system will present a summary of relevant information such as "next steps" and "items to prepare."

[0494] This allows users to stay informed about family conversations and important schedules in real time, enabling them to act more effectively. This feature is particularly useful in preventing misunderstandings when not all family members are together or during busy daily lives.

[0495] Example of a prompt:

[0496] "Summarize the content of the family meeting from the audio recording and extract keywords. Examples: date, location, division of roles."

[0497] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0498] Step 1:

[0499] The server receives audio data in real time as input from audio acquisition devices installed in the home. This data is sent to the server in audio file format.

[0500] Step 2:

[0501] The server converts the received audio data into text data using speech recognition software. This speech recognition process involves analyzing the audio waveform data and mapping it to corresponding character information. The output is text data that directly transcribes the spoken content.

[0502] Step 3:

[0503] The server feeds the generated text data into a natural language processing model to extract important keywords and phrases. This process involves data calculations that analyze the text structure and detect specific words and patterns. The output is a list of key phrases that reflect the context of the conversation.

[0504] Step 4:

[0505] The server performs further data processing to generate a concise summary from the extracted key phrases. Specifically, it reorganizes relevant information and creates shortened text. The output is a simplified summary text provided to family members.

[0506] Step 5:

[0507] The terminal receives summary text sent from the server and displays it on a home display or smart screen. Displaying it requires converting it into a user-friendly format. The output is a visually presented summary text.

[0508] Step 6:

[0509] By visually reviewing the summary displayed on the screen, users can understand the conversations and decisions made within the household. The input in this step is the visual information of the summary text, which leads to output that facilitates user understanding and action.
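Steps 3 and 4 above (detecting specific words and patterns, then reorganizing them into shortened text) can be illustrated with a minimal sketch. The regular expressions below are hypothetical stand-ins for the natural language processing model and cover only the three example items mentioned in the prompt: date, location, and division of roles.

```python
import re

# Hypothetical patterns; a trained model would replace these in practice.
DATE = re.compile(r"\b(?:this|next) (?:weekend|week|\w+day)\b")
PLACE = re.compile(r"\bat the (\w+)")
ROLE = re.compile(r"\b([A-Z]\w*) will (\w+) ([^.]+)\.")


def summarize_household(text):
    """Detect date, location, and roles, then join them into one short line."""
    parts = []
    date = DATE.search(text)
    if date:
        parts.append("Date: " + date.group(0))
    place = PLACE.search(text)
    if place:
        parts.append("Location: " + place.group(1))
    roles = [f"{who}: {verb} {what}" for who, verb, what in ROLE.findall(text)]
    if roles:
        parts.append("Roles: " + "; ".join(roles))
    return " | ".join(parts)


conversation = (
    "Let's have the picnic next weekend at the park. "
    "Dad will bring the drinks. Mom will prepare sandwiches."
)
print(summarize_household(conversation))
```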

[0510] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0511] This invention provides a system that can efficiently grasp information during a meeting and recognize the emotional state of the users, thereby enabling a real-time understanding of the meeting's atmosphere. This system operates through the coordinated efforts of a server, terminals, and users.

[0512] The server acquires audio data from the conference system in real time. Specifically, the server receives the audio stream via the conference system's communication interface and processes the data. The server then sends this audio data to a speech recognition API to convert it into text data. This process transforms the audio into text, laying the foundation for recording and analyzing meeting minutes.

[0513] Next, the server analyzes the text data. Using a natural language processing model, it understands the context of the conversation and extracts key information. Based on this, a summary capturing the essence of the meeting is generated. This summary efficiently extracts the main points from the lengthy conversation and presents them clearly to the participants.

[0514] Furthermore, the server uses an emotion engine to analyze the user's emotional state from the audio data. The emotion engine identifies emotions such as joy, anger, and sadness based on characteristics such as the speaker's tone, intonation, and speed. This emotional information is incorporated into the summary and used to provide participants with an overview of the meeting's atmosphere and emotional progression.
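To make the role of the emotion engine concrete, the following toy sketch maps three voice characteristics to an emotion label. It is not the emotion identification model 59. The feature definitions, the thresholds, and the extra "neutral" label are all assumptions for illustration, and a practical engine would typically be a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical per-utterance values, relative to the speaker's baseline."""
    pitch_ratio: float   # mean pitch / baseline pitch (tone)
    pitch_range: float   # pitch variation / baseline variation (intonation)
    rate_ratio: float    # speaking rate / baseline rate (speed)


def classify_emotion(f):
    """Toy rules only; thresholds are illustrative, not from the source."""
    if f.pitch_ratio > 1.1 and f.rate_ratio > 1.1:
        # Raised, fast speech: wide intonation is read as joy, flat as anger.
        return "joy" if f.pitch_range > 1.1 else "anger"
    if f.pitch_ratio < 0.9 and f.rate_ratio < 0.9:
        return "sadness"
    return "neutral"


print(classify_emotion(VoiceFeatures(1.2, 1.3, 1.2)))
```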

[0515] The terminal receives summary and sentiment information transmitted from the server and displays it on the user's visual device. The display presents the progress of the meeting together with visual feedback on the emotional state, in the form of a ticker on a portion of the screen.

[0516] Users can view summaries and sentiment information displayed on their devices, allowing them to grasp the meeting content in real time. This enables users to gain a deeper understanding of the meeting's context and participate more effectively based on comprehensive information, including the emotional states of other participants.

[0517] Thus, the present invention provides a new means for improving the quality of meetings by comprehensively capturing the content of the discussions and the emotions of the participants.

[0518] The following describes the processing flow.

[0519] Step 1:

[0520] The server acquires audio data from the conference system in real time. In this process, the server receives the conference audio stream via a communication interface and continuously collects data.

[0521] Step 2:

[0522] The server sends the acquired audio data to a speech recognition API, which converts it into text data in real time. This conversion process utilizes high-precision speech recognition technology to generate accurate text strings from the audio.

[0523] Step 3:

[0524] The server analyzes the converted text data using a natural language processing model. This involves understanding the context of the text and extracting important keywords and phrases.

[0525] Step 4:

[0526] The server generates a summary based on the extracted key information. The generated summary condenses the meeting's main points into a single line, making it easy for participants to quickly understand.

[0527] Step 5:

[0528] The server uses an emotion engine to analyze the user's emotional state based on the audio data. By evaluating the tone, speed, and intonation of the voice, the emotion engine identifies the speaker's emotions.

[0529] Step 6:

[0530] The server incorporates the emotional information analyzed by the emotion engine into a summary, further compiling it into information that reveals the overall atmosphere of the meeting.

[0531] Step 7:

[0532] The server sends the generated summary and sentiment information to the terminal. This data is encoded in preparation for providing information visually.

[0533] Step 8:

[0534] The terminal receives data transmitted from the server and displays it in ticker format on the meeting participants' visual devices. The displayed information includes the progress of the meeting and the emotional state of the participants, and is visually indicated in the appropriate part of the screen.

[0535] Step 9:

[0536] Users can understand the key points of a meeting and the emotional state of participants in real time by viewing summaries and sentiment information displayed on their devices. This allows users to gain a deeper understanding of the meeting content and use it as material for effective decision-making.
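Steps 7 and 8 above can be sketched as an encode and decode pair. JSON over UTF-8 is an assumed encoding; the source says only that the data is encoded. The field names and function names are hypothetical.

```python
import json


def encode_update(summary, emotion):
    """Server side (Step 7): pack summary and emotion into UTF-8 JSON bytes."""
    payload = {"summary": summary, "emotion": emotion}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def render_ticker(message):
    """Terminal side (Step 8): decode the bytes and build the ticker text."""
    payload = json.loads(message.decode("utf-8"))
    return f"{payload['summary']} [{payload['emotion']}]"


message = encode_update("Launch date agreed.", "joy")
print(render_ticker(message))
```

Sending the emotion as a separate field, rather than embedding it in the summary string, leaves the terminal free to show it as text, color, or an icon.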

[0537] (Example 2)

[0538] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0539] In meetings, participants must understand a vast amount of information in a short time, and it is particularly difficult to grasp spoken conversations in real time as text. Furthermore, while understanding participants' emotional states during a meeting is valuable, capturing emotional changes through audio alone presents a challenge. This can hinder the efficient progress of meetings and decision-making.

[0540] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0541] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, and means for analyzing emotional states from the audio information to acquire emotional information. This enables efficient information gathering during a meeting, captures changes in participants' emotions, and facilitates more effective meeting management.

[0542] "Audio information" refers to data that digitally represents the characteristics of the voices spoken during a meeting.

[0543] "Textual information" refers to data in text format generated based on audio information.

[0544] "Important information" refers to essential content that is relevant to the purpose of the meeting and can be extracted from the conversation during the meeting.

[0545] A "summary" is a text document that condenses the main points from the vast amount of information gathered during a meeting.

[0546] A "visual device" is a hardware device that displays information in a way that allows users to visually confirm it.

[0547] "Emotional state" refers to data that describes the mental state of meeting participants, based on factors such as tone, intonation, and speed of speech.

[0548] "Emotional information" refers to digital data obtained by analyzing audio information to indicate the emotional state of participants.

[0549] A "natural language processing model" is a collection of algorithms and systems that understand the meaning and context from text data and perform data extraction and generation.

[0550] This invention is a system for efficiently grasping information during meetings and recognizing the emotional state of participants in real time. This system consists of three elements: a server, a terminal, and a user, each playing a specific role.

[0551] The server acquires audio information in real time using the conference system's communication method. This process involves receiving audio in streaming format using protocols such as WebRTC or SIP. Then, to convert the audio information into text, it utilizes a speech recognition API (e.g., a general speech recognition service). This speech recognition API instantly transcribes the acquired audio information into text, enabling accurate recording of the conversation.
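The sketch below does not implement WebRTC or SIP. It assumes the received stream has already been decoded to raw mono PCM and shows one way the server might cut that data into short fixed-duration frames before passing it to a speech recognition API. The 16 kHz sample rate, 16-bit sample width, and 20 ms frame length are assumed values.

```python
def split_into_frames(pcm, sample_rate=16000, sample_width=2, frame_ms=20):
    """Split mono PCM bytes into fixed-duration frames.

    A trailing partial frame is dropped. All defaults are assumptions.
    """
    frame_bytes = sample_rate * sample_width * frame_ms // 1000
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]


frames = split_into_frames(bytes(1500))
print(len(frames), len(frames[0]))
```

With the default values each frame holds 640 bytes, so 1500 bytes of input yield two full frames and the remaining 220 bytes are left for the next call.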

[0552] Next, the server analyzes the text data using a natural language processing model (e.g., a language model such as BERT or GPT). This allows it to understand the context of the conversation and extract important information. This information extraction efficiently summarizes key points such as "When is the next meeting?" or "Client feedback on a specific product."

[0553] Furthermore, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to analyze the user's emotional state from the audio information. It determines the participant's emotions from the tone, intonation, and speed of their voice, identifying emotional information such as joy, anger, and sadness. This emotional information is integrated with the summary and used to provide participants with a sense of the meeting's atmosphere.

[0554] The terminal receives summaries and sentiment information sent from the server and displays them on the user's visual device. Specifically, the information is presented on the display device in the form of pop-up windows or tickers, allowing the user to understand the progress of the meeting and the emotional state of the participants.

[0555] Users can view the information provided by this device and understand the meeting content in real time. This allows users to gain a deeper understanding of the meeting context and engage in more effective discussions that take into account the emotional states of other participants.

[0556] For example, in a company meeting, information such as "sales are increasing" is summarized, and the speaker's emotion of "joy" is identified and displayed on the screen. An example of such a prompt would be, "Summarize the meeting content in real time and analyze the emotional state of the participants."

[0557] This system makes it possible to improve the quality of meetings by integrating meeting information with participants' emotions.

[0558] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0559] Step 1:

[0560] The server acquires audio information in real time using the conference system's communication methods. It receives an audio stream from the conference as input. For processing, it collects the audio data in digital format using the WebRTC or SIP protocol. As output, the server holds the audio information in digital format.

[0561] Step 2:

[0562] The server sends the acquired audio information to a speech recognition API, which converts it into text information. Digital audio information is used as input. The processing involves the speech recognition service analyzing the audio data and converting its contents into text format. The server then obtains the text information as output.

[0563] Step 3:

[0564] The server feeds the textual information obtained through speech recognition into a natural language processing model to extract important information. Textual information is used as input. During processing, a language model (e.g., BERT or GPT) understands the context of the text and extracts the key points of the meeting. The output is summarized text information.

[0565] Step 4:

[0566] The server uses an emotion engine to recognize the emotional state in the audio information, performing emotion analysis with reference to the obtained text information. The inputs used are audio and text information. The processing involves analyzing the speaker's tone, intonation, and speaking speed to identify emotions such as joy, anger, and sadness. The output is the identified emotion information.

[0567] Step 5:

[0568] The server integrates extracted key information and sentiment information to generate a summary. It uses summarized text information and sentiment information as input. The processing combines them to create information that reflects the content and atmosphere of the meeting. The output is a visually presentable summary.

[0569] Step 6:

[0570] The terminal receives summary and sentiment information transmitted from the server and displays it to the user. Integrated summary information is used as input. Processing includes outputting information to a display device, and presentation in the form of a pop-up window or ticker. The output is information that the user can visually confirm.

[0571] Step 7:

[0572] Users review summaries and sentiment information displayed on their devices to understand the meeting content in real time. Input is the information provided by the device. Processing involves users making decisions based on this information and guiding the discussion. The output is effective meeting participation and decision-making.
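The integration in Step 5 can be illustrated as follows. As an assumption, the overall atmosphere is taken to be the most frequent emotion label among the utterances in the summarized interval; the source does not specify how the emotion information is aggregated. Function names are hypothetical.

```python
from collections import Counter


def meeting_atmosphere(emotions):
    """Return the most frequent emotion label, or None for an empty list."""
    if not emotions:
        return None
    return Counter(emotions).most_common(1)[0][0]


def integrate(summary_text, emotions):
    """Step 5: combine summarized text and emotion labels into one line."""
    mood = meeting_atmosphere(emotions)
    if mood is None:
        return summary_text
    return f"{summary_text} (atmosphere: {mood})"


print(integrate("Sales are increasing.", ["joy", "joy", "anger"]))
```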

[0573] (Application Example 2)

[0574] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0575] During meetings, a large amount of information is exchanged, making it difficult to efficiently grasp and understand it. Furthermore, understanding the emotional state of participants is crucial for improving meeting outcomes, but in typical meetings, this information cannot be shared in real time.

[0576] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0577] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, means for extracting important information from the converted text information and generating summary information, and means for analyzing emotional states from the audio information and generating emotional information. This makes it possible to provide real-time visual feedback of meeting summary information and participants' emotional information.

[0578] "Audio information" refers to the sound signals emitted by participants during a meeting, which convey the content of their communication.

[0579] "Textual information" refers to text data obtained by analyzing and converting audio information, and is a format designed to facilitate recording and analysis.

[0580] "Important information" refers to key points and keywords extracted from conversations using natural language processing technology.

[0581] "Summary information" refers to abbreviated information generated based on extracted key information, which helps in understanding the content of the meeting.

[0582] "Visual devices" are devices that enable meeting participants to visually perceive information, and include display screens and projectors.

[0583] "Emotional state" refers to the type of emotion identified by analyzing the emotional characteristics of the speaker contained in the audio information.

[0584] "Emotional information" refers to data indicating emotional states, analyzed by the emotion engine from audio information, and is used to understand the atmosphere of a meeting and the reactions of the participants.

[0585] The system for implementing the present invention mainly consists of elements such as a server, a terminal, and a user.

[0586] During a meeting, the server acquires the speech of meeting participants as audio information. To do this, the meeting audio is streamed to the server in real time via a communication means, and the server receives the audio signals. The server converts the received audio information into text information using speech recognition technology, and then uses natural language processing models to extract important information from the text information. Specifically, speech recognition APIs such as Google Cloud Speech-to-Text and natural language processing libraries such as spaCy can be used.

[0587] Furthermore, the server uses sentiment analysis libraries to analyze the emotional state contained in the audio information. For example, it uses IBM Watson Tone Analyzer to identify the speaker's emotional state from the audio information and generate sentiment data.

[0588] The generated summary and sentiment information are periodically displayed on the meeting participants' visual devices via their terminals. This allows users to understand the participants' emotional responses in real time, along with the important agenda items, as the meeting progresses.

[0589] As a concrete example, consider a project progress report at a weekly meeting. The server receives the project leader's statements as audio information, extracts the main progress and problems, and converts them into text information. It also generates emotional information, such as the leader's expectations and other members' anxieties, detected in the statements, visualizes it, and conveys it to the participants.

[0590] An example of a prompt for a generative AI model is: "Analyze the speaker's emotions in real time during the meeting and create the following text summary: {Audio data} Analyze the emotional state and identify the following key emotional tones: Joy, Anger, Sadness. Incorporate the overall atmosphere of the session into the summary."
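A minimal sketch of filling this prompt is shown below. It assumes the {Audio data} placeholder is replaced with the transcribed text of the audio rather than with raw audio, which the source does not state. The function name is hypothetical.

```python
PROMPT_TEMPLATE = (
    "Analyze the speaker's emotions in real time during the meeting and create "
    "the following text summary: {Audio data} "
    "Analyze the emotional state and identify the following key emotional "
    "tones: Joy, Anger, Sadness. "
    "Incorporate the overall atmosphere of the session into the summary."
)


def fill_prompt(transcript):
    """Substitute the transcript for the {Audio data} placeholder."""
    return PROMPT_TEMPLATE.replace("{Audio data}", transcript.strip())


print(fill_prompt("The release is on schedule, but testing worries me."))
```

Plain string replacement is used instead of `str.format` so that braces appearing in the transcript itself cannot be misread as placeholders.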

[0591] In this way, the present invention enables meeting participants to deeply understand the meeting content and receive detailed feedback, including the emotional aspects of the dialogue, in real time.

[0592] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0593] Step 1:

[0594] The server acquires audio information during a meeting. As input, it receives the meeting's audio stream in real time via a communication method. The server temporarily stores this audio data as an audio file.

[0595] Step 2:

[0596] The server converts the acquired audio information into text information. It sends the audio data as input to a speech recognition API such as Google Cloud Speech-to-Text, obtaining text data as output. This process converts the audio into understandable text information.

[0597] Step 3:

[0598] The server extracts important information from text data. It uses spaCy, a natural language processing library, to analyze text data as input. As output, it generates summary information containing keywords and important phrases from the conversation.

[0599] Step 4:

[0600] The server generates emotional information from audio data. It processes the audio data as input using an emotional analysis library such as IBM Watson Tone Analyzer. The output is data indicating the speaker's emotional state. This information includes emotional indicators such as joy and anger.

[0601] Step 5:

[0602] The terminal displays the generated summary and sentiment information on a visual device. It receives summary and sentiment information from a server as input and displays it on a screen or projector. The output is provided to participants as visual feedback.

[0603] Step 6:

[0604] Users understand the meeting content through information displayed on visual devices. Based on information provided by the terminal as input, users grasp the progress of the meeting and the emotional state of participants in real time. This allows for deeper participation in the meeting.
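The server-side part of Steps 2 to 4 above can be sketched as a single function whose stages are supplied from outside. This is a minimal sketch under stated assumptions: the names `MeetingUpdate` and `process_audio_chunk` are hypothetical, and the three `fake_*` functions are stand-ins for a speech recognition API, a natural language processing library, and an emotion analysis library, none of which is actually called here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MeetingUpdate:
    """What the server would send to the terminal (hypothetical structure)."""
    summary: str
    emotions: Dict[str, float]


def process_audio_chunk(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    extract_key_phrases: Callable[[str], List[str]],
    analyze_emotion: Callable[[bytes], Dict[str, float]],
) -> MeetingUpdate:
    text = transcribe(audio)             # Step 2: audio -> text
    phrases = extract_key_phrases(text)  # Step 3: text -> important information
    emotions = analyze_emotion(audio)    # Step 4: audio -> emotional information
    return MeetingUpdate(summary="; ".join(phrases), emotions=emotions)


# Stand-in stages so that the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_extract(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def fake_emotion(audio: bytes) -> Dict[str, float]:
    return {"joy": 0.7, "anger": 0.1}


update = process_audio_chunk(
    b"Release is on track. Testing is delayed.",
    fake_transcribe, fake_extract, fake_emotion,
)
```

Passing the stages as callables keeps the flow independent of any particular recognition or analysis service, so each stage can be replaced without touching the others.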

[0605] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0606] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0607] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0608] [Fourth Embodiment]

[0609] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0610] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0611] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0612] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0613] The microphone 238 receives instructions and other input from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0614] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0615] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0616] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
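As a non-limiting illustration of how an emotion could be mapped onto the controlled object 443, the following minimal Python sketch looks up an LED color and an arm angle for each emotion. The names `Actuation`, `EXPRESSIONS`, `NEUTRAL`, and `express`, as well as the specific colors and angles, are assumptions made for this sketch and are not values given in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actuation:
    """Target state for the eye LEDs and an arm motor (hypothetical)."""
    led_color: str
    arm_angle_deg: int


# Illustrative mapping only; the colors and angles are assumptions.
EXPRESSIONS = {
    "joy": Actuation(led_color="yellow", arm_angle_deg=60),
    "anger": Actuation(led_color="red", arm_angle_deg=20),
    "sadness": Actuation(led_color="blue", arm_angle_deg=-30),
}
NEUTRAL = Actuation(led_color="white", arm_angle_deg=0)


def express(emotion: str) -> Actuation:
    """Return the actuation for an emotion, falling back to a neutral pose."""
    return EXPRESSIONS.get(emotion.lower(), NEUTRAL)


pose = express("Joy")
```

A table-driven mapping of this kind keeps the expression rules separate from the motor and LED drivers, so the expressions can be tuned without changing control code.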

[0617] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0618] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0619] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0620] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0621] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0622] The system of this invention is configured to improve the efficiency of information transmission in meetings. This system functions through the interaction of its server, terminal, and user elements.

[0623] The server acquires the conference audio data in real time. Specifically, the server receives the audio stream via the conference system's communication interface. Upon receiving the audio data, the server sends it to a speech recognition API, where it is converted into text data in real time. This speech recognition uses a highly accurate recognition algorithm that can handle both everyday conversations and specialized terminology.

[0624] Next, the server analyzes the text data and utilizes a natural language processing model to extract important information. This allows it to determine the meaning and context of the conversation and identify important keywords and phrases. For example, if the discussion concerns the progress of a project, the system might extract phrases such as "progress," "challenges," and "next steps."

[0625] Based on the extracted information, the server generates a summary of about one line. This summary is a concise overview designed to allow participants to quickly grasp the progress of the meeting. The server periodically sends the generated summary to the terminals.

[0626] The terminal receives summaries sent from the server. The received summaries are displayed as tickers on the visual devices used by the meeting participants. The display appears for a set period of time in the corner or bottom of the screen and is updated each time a new summary arrives.
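The ticker behavior described above (shown for a set period and replaced whenever a new summary arrives) can be sketched as follows. This is a minimal sketch, assuming that the caller supplies the current time in seconds; the class name `Ticker` and its methods are hypothetical, and actual rendering on a screen is outside the scope of the sketch.

```python
from typing import Optional


class Ticker:
    """Holds the latest summary and hides it after a set display period."""

    def __init__(self, display_seconds: float) -> None:
        self.display_seconds = display_seconds
        self._text: Optional[str] = None
        self._shown_at = 0.0

    def update(self, summary: str, now: float) -> None:
        """Replace the displayed text when a new summary arrives."""
        self._text = summary
        self._shown_at = now

    def visible_text(self, now: float) -> Optional[str]:
        """Return the text to draw, or None once the period has elapsed."""
        if self._text is None:
            return None
        if now - self._shown_at >= self.display_seconds:
            return None
        return self._text


ticker = Ticker(display_seconds=10.0)
ticker.update("Budget approved", now=5.0)
```

Taking the time as an argument rather than reading a clock inside the class makes the display period easy to verify in isolation.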

[0627] Users can view a summary displayed on their device and stay informed about the meeting's progress in real time. This feature makes it easy for latecomers and those joining from other meetings to understand the flow of the meeting and participate effectively.

[0628] Thus, the system of the present invention helps to improve the efficiency of meetings and reduce wasted time by making it easier to instantly grasp meeting information.

[0629] The following describes the processing flow.

[0630] Step 1:

[0631] The server receives audio streaming data from the conference in real time using the conference system's communication interface. This data is sequentially stored in a buffer for subsequent processing.

[0632] Step 2:

[0633] The server sends the received audio streaming data to a speech recognition API, which converts it into text data in real time. This API recognizes the speech, generates the corresponding string, and returns it to the server.

[0634] Step 3:

[0635] The server inputs the converted text data into a natural language processing model, which analyzes the context of the text. This model identifies keywords and phrases that can be used to extract important information from the conversation.

[0636] Step 4:

[0637] The server generates a summary based on the extracted key information. This summary is designed to highlight the core points of the conversation in a single-line format.

[0638] Step 5:

[0639] The server sends the generated summaries to the terminal at regular intervals. This data is encoded as text and delivered to the terminal in real time.

[0640] Step 6:

[0641] The terminal receives a summary sent from the server and displays it on the user's screen in a scrolling text format. This display appears as a pop-up at the bottom or corner of the screen and is continuously updated.

[0642] Step 7:

[0643] Users can understand the progress of the meeting by checking the summary displayed on their device screen. This allows users to efficiently understand the meeting content and obtain important information for participating in the discussion.
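The buffering in Step 1 and the regular-interval delivery in Step 5 can be combined as in the following minimal Python sketch. The class name `IntervalSummarizer`, the `summarize` callable, and the interval value are assumptions made for illustration; the sketch buffers recognized text and produces a summary only when the interval has elapsed.

```python
from typing import Callable, List, Optional


class IntervalSummarizer:
    """Buffers recognized text and emits a summary once per interval."""

    def __init__(self, interval_seconds: float,
                 summarize: Callable[[str], str], start: float = 0.0) -> None:
        self.interval_seconds = interval_seconds
        self.summarize = summarize  # stand-in for Steps 3 and 4
        self._buffer: List[str] = []
        self._last_emit = start

    def add_text(self, text: str, now: float) -> Optional[str]:
        """Add recognized text; return a summary if the interval has elapsed."""
        self._buffer.append(text)
        if now - self._last_emit < self.interval_seconds:
            return None
        summary = self.summarize(" ".join(self._buffer))
        self._buffer.clear()
        self._last_emit = now
        return summary


# Upper-casing stands in for a real summarization model.
summarizer = IntervalSummarizer(30.0, summarize=lambda t: t.upper())
first = summarizer.add_text("design review done", now=10.0)
second = summarizer.add_text("tests pending", now=31.0)
```

In this sketch the buffer is cleared after each summary, so every summary covers only the text received since the previous one; a sliding window would be an alternative design.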

[0644] (Example 1)

[0645] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0646] In modern meetings, it is crucial for participants to grasp information appropriately in real time and participate effectively in discussions. However, the frequent use of complex conversations and technical jargon can sometimes lead to inefficient information transmission. In such situations, there is a need for a system that allows participants to efficiently understand the progress of the meeting and for those joining late to easily catch up.

[0647] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0648] In this invention, the server includes means for acquiring audio data, means for converting the acquired audio data into text data, and means for analyzing the converted text data to extract important information. This allows participants to grasp important information during the meeting in real time and participate in the meeting efficiently through concise summaries that take into account the conversation and context.

[0649] "Audio data" refers to information used to record or transmit sounds and speech generated in a meeting or similar setting in digital format.

[0650] "Text data" refers to digital data that represents audio data as textual information.

[0651] "Important information" refers to items or keywords that are particularly noteworthy among the topics discussed during a meeting, and is information necessary to understand the main points of the meeting.

[0652] A "summary" is a short, concise overview of the meeting's content and key information, designed to allow participants to quickly grasp the information.

[0653] "Visual devices" refer to devices or displays used by meeting participants to visually confirm information.

[0654] A "communication network" is a system that provides infrastructure for sending and receiving data, enabling the mutual transfer of digital information over a wide or limited area.

[0655] "Language analysis technology" refers to techniques for processing natural language and analyzing its content, meaning, and context.

[0656] The system of this invention is an information processing device for efficiently transmitting information in meetings. The system performs a series of processes: acquiring audio data, converting it into text data, extracting important information, generating a summary, and displaying it on a visual device. This allows meeting participants to grasp information in real time and participate in the meeting smoothly.

[0657] The server acquires audio data in real time via the conference system's communication network. The server then uses a speech recognition API (e.g., a common speech recognition service) to convert the acquired audio data into text data. The speech recognition API is equipped with advanced recognition algorithms and can accurately transcribe everyday conversations and technical terms into text.

[0658] Next, the server analyzes the text data using a natural language processing model (e.g., BERT or GPT-3). The server extracts important keywords and phrases from the analyzed text data and generates a concise summary based on them. For example, if a discussion took place regarding project progress, the server might generate a summary such as, "New feature implementation complete. Bugs reported in user testing. Bug fixing is the priority."
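To illustrate the keyword extraction step in isolation, the following minimal Python sketch ranks words by frequency after removing common stop words. This is a deliberately simple stand-in and is not how a model such as BERT or GPT-3 operates; the function name `extract_keywords` and the stop word list are assumptions made for this sketch.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop word list (an assumption, not exhaustive).
STOP_WORDS = {
    "the", "a", "an", "is", "are", "was", "in", "of", "to", "and",
    "we", "for", "on", "that", "this", "it",
}


def extract_keywords(text: str, top_n: int = 3) -> List[str]:
    """Return the most frequent non-stop words as candidate keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


transcript = (
    "The bug in the login feature was fixed. "
    "The bug report for testing is open. Testing continues."
)
keywords = extract_keywords(transcript, top_n=2)
```

A frequency count ignores meaning and context, which is precisely what the natural language processing model described above is intended to supply.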

[0659] The terminal receives a summary sent from the server and displays it on a visual device in a scrolling text format. Users can check the summary displayed on the terminal and efficiently grasp the progress of the meeting. This helps users who join late to quickly understand the content of the meeting.

[0660] Example prompt: "Extract the key information from this text and generate a concise one-line summary." This allows the server to use a generative AI model to create an appropriate summary.
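As a minimal sketch of how the example prompt might be used, the following Python function sends the prompt and the text to a generative model and then normalizes the response to a single line. The `generate` callable is a stand-in for whichever generative AI model is used, and the function name `one_line_summary` and the 80-character limit are assumptions made for this sketch.

```python
from typing import Callable

PROMPT = ("Extract the key information from this text and generate "
          "a concise one-line summary.")


def one_line_summary(text: str, generate: Callable[[str], str],
                     max_chars: int = 80) -> str:
    """Ask the model for a summary and force the result onto one line."""
    raw = generate(f"{PROMPT}\n\n{text}")
    line = " ".join(raw.split())  # collapse newlines and repeated spaces
    if len(line) > max_chars:
        line = line[: max_chars - 3].rstrip() + "..."
    return line


# Stand-in model that ignores its input and returns a multi-line answer.
def fake_model(prompt: str) -> str:
    return "New feature done.\nBugs found.  Fix first."


summary = one_line_summary("(meeting transcript)", fake_model)
```

The normalization step is included because a model is not guaranteed to return exactly one line even when the prompt asks for it.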

[0661] This invention makes it possible to quickly organize information during meetings and provide an environment in which participants can effectively engage in discussions.

[0662] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0663] Step 1:

[0664] The server acquires audio data in real time from the conference system via the communication network. Specifically, the audio signal acquired by the microphone is transferred to the server as digital data, and the server receives it as input. The server's role is to prepare this audio data for analysis.

[0665] Step 2:

[0666] The server sends the acquired audio data to a speech recognition API, where it is converted into text data. This data processing involves converting the audio signal into text information. The resulting text data is returned to the server in string format and becomes the input for the next analysis step.

[0667] Step 3:

[0668] The server analyzes the converted text data using a natural language processing model. It processes the text data as input and performs data operations to identify important keywords and phrases. The output of this step is a list of extracted important information, which forms the basis for summary generation.

[0669] Step 4:

[0670] The server generates a summary based on the extracted key information. This process uses a generative AI model to create a concise, one-line summary that is relevant to the context of the conversation. The server outputs this summary and prepares it for transmission to the terminal.

[0671] Step 5:

[0672] The terminal receives a summary sent from the server and displays it on the user's visual device. The summary, as input, is displayed on the screen in a ticker-style format via display software on the terminal. As output, it provides an environment where the user can visually review the summary.

[0673] Step 6:

[0674] Users review the summary displayed on their devices to understand the progress of the meeting. Their specific actions include searching for relevant information on their devices as needed and preparing to participate in the discussion based on the meeting content. The output of this step is user understanding and effective participation in the meeting.
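The hand-off between Step 4 and Step 5 (the server prepares the summary for transmission and the terminal receives it) can be sketched as a simple message format. This is a minimal sketch assuming JSON over UTF-8; the structure `SummaryMessage` and its fields `meeting_id` and `sequence` are hypothetical and are not specified in this disclosure.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SummaryMessage:
    """One summary sent from the server to the terminal (hypothetical)."""
    meeting_id: str
    sequence: int  # lets the terminal discard out-of-order summaries
    summary: str


def encode(message: SummaryMessage) -> bytes:
    """Server side: serialize the message for transmission."""
    return json.dumps(asdict(message), ensure_ascii=False).encode("utf-8")


def decode(payload: bytes) -> SummaryMessage:
    """Terminal side: restore the message before display."""
    return SummaryMessage(**json.loads(payload.decode("utf-8")))


sent = SummaryMessage("weekly-sync", 3, "New feature implementation complete.")
received = decode(encode(sent))
```

A sequence number is one possible way for the terminal to keep the ticker consistent when summaries arrive late; other orderings, such as timestamps, would serve the same purpose.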

[0675] (Application Example 1)

[0676] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0677] In modern households, many family members have different schedules and roles, often making information sharing and decision-making among family members complex. This invention aims to provide a system that enables smooth communication and schedule management through the efficient use of voice information within the home.

[0678] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0679] In this invention, the server includes means for acquiring audio data during a meeting, means for converting the acquired audio data into text data in real time, means for extracting important information from the converted text data, means for generating a summary based on the extracted important information, means for displaying the generated summary on the device's display device at regular intervals, and means for acquiring audio data within the home and providing summaries to family members in real time. This makes it possible to grasp the content of conversations within the home in real time and provide necessary information in a concise summary.

[0680] "Audio data" refers to recorded or transmitted data of spoken language that occurs in meetings or within a household.

[0681] "Means of converting to text data in real time" refers to a process or device that receives audio data and instantly converts it into text information.

[0682] "Means for extracting important information" refers to methods or devices for identifying highly relevant keywords or phrases from text data.

[0683] "Means for generating a summary" refers to a process or apparatus for creating a concise summary based on extracted key information.

[0684] A "display device" is a piece of hardware or digital device used to visually present generated information.

[0685] "Means for acquiring audio data within the home" refers to a system or device for effectively recording or recognizing spoken language in a home environment.

[0686] "A means of providing real-time summaries to family members" refers to a method or device that instantly analyzes conversations within the family, summarizes important information, and immediately communicates it to family members.

[0687] This invention relates to a system that efficiently analyzes audio information within a home, extracts important information, and automatically presents a summary using a visual device. Specific embodiments are described below.

[0688] The server is connected to devices (e.g., built-in microphones in smart speakers) in a home environment such as a living room or kitchen to acquire audio in real time. This server uses speech recognition software (e.g., Google's speech recognition API) to convert the acquired audio data into text data in real time. This process enables smooth conversational processing.

[0689] Furthermore, the server utilizes natural language processing models (e.g., BERT and GPT) to extract important keywords and phrases from text data. This allows it to understand the flow of the conversation and select appropriate information.

[0690] The terminal receives the summary sent from the server and displays it on the screen. A home smart display or television can be used for this purpose. For example, if a family discusses a picnic for the next weekend, the system will present a summary of relevant information such as "next steps" and "items to prepare."

[0691] This allows users to stay informed about family conversations and important schedules in real time, enabling them to act more effectively. This feature is particularly useful in preventing misunderstandings when not all family members are together or during busy daily lives.

[0692] Example of a prompt:

[0693] "Summarize the content of the family meeting from the audio recording and extract key keywords. Examples: date, location, division of roles."
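If the generative model is additionally instructed to answer in "key: value" lines, the response to the prompt above can be turned into structured fields. The following minimal Python sketch rests on that assumption, which is not stated in this disclosure; the function name `parse_family_summary` and the field names in `FIELDS` are hypothetical.

```python
from typing import Dict

# Field names mirror the examples in the prompt; the exact keys are assumed.
FIELDS = ("date", "location", "roles")


def parse_family_summary(response: str) -> Dict[str, str]:
    """Pick out known 'key: value' lines and ignore everything else."""
    result: Dict[str, str] = {}
    for line in response.splitlines():
        key, separator, value = line.partition(":")
        key = key.strip().lower()
        if separator and key in FIELDS:
            result[key] = value.strip()
    return result


model_response = (
    "Date: Saturday\n"
    "Location: Riverside park\n"
    "Roles: Dad drives, Mom packs lunch\n"
    "Note: bring a blanket"
)
fields = parse_family_summary(model_response)
```

Unknown lines such as the "Note" line are dropped rather than treated as errors, since a model's output format cannot be fully relied upon.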

[0694] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0695] Step 1:

[0696] The server receives audio data in real time as input from audio acquisition devices installed in the home. This data is sent to the server in audio file format.

[0697] Step 2:

[0698] The server converts the received audio data into text data using speech recognition software. This speech recognition process involves analyzing the audio waveform data and mapping it to corresponding character information. The output is text data that directly transcribes the spoken content.

[0699] Step 3:

[0700] The server processes the generated text data as input into a natural language processing model to extract important keywords and phrases. This process involves data calculations that analyze the text structure and detect specific words and patterns. The output is a list of key phrases that reflect the context of the conversation.

[0701] Step 4:

[0702] The server performs further data processing to generate a concise summary from the extracted key phrases. Specifically, it reorganizes relevant information and creates shortened text. The output is a simplified summary text provided to family members.

[0703] Step 5:

[0704] The terminal receives summary text sent from the server and displays it on a home display or smart screen. Displaying it requires converting it into a user-friendly format. The output is a visually presented summary text.

[0705] Step 6:

[0706] By visually reviewing the summary displayed on the screen, users can understand the conversations and decisions made within the household. The input in this step is the visual information of the summary text, which leads to output that facilitates user understanding and action.
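Step 4 above (reorganizing the key phrases into shortened text) can be illustrated with the following minimal Python sketch, which removes duplicate phrases and stops adding phrases once a character budget is reached. The function name `compose_summary` and the budget value are assumptions made for this sketch, and a generative model could be used for this step instead.

```python
from typing import List


def compose_summary(key_phrases: List[str], max_chars: int = 60) -> str:
    """Join distinct key phrases in order until the budget is used up."""
    parts: List[str] = []
    seen = set()
    length = 0
    for phrase in key_phrases:
        phrase = phrase.strip()
        if not phrase or phrase.lower() in seen:
            continue  # skip blanks and case-insensitive duplicates
        extra = len(phrase) + (2 if parts else 0)  # 2 for the ", " separator
        if length + extra > max_chars:
            break
        parts.append(phrase)
        seen.add(phrase.lower())
        length += extra
    return ", ".join(parts)


summary_text = compose_summary(
    ["picnic Saturday", "Picnic Saturday", "bring drinks",
     "check weather forecast"],
    max_chars=30,
)
```

Because the phrases are taken in order, the extraction step should list the most important phrases first so that they survive the length limit.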

[0707] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0708] This invention provides a system that can efficiently grasp information during a meeting and recognize the emotional state of the users, thereby enabling a real-time understanding of the meeting's atmosphere. This system operates through the coordinated efforts of a server, terminals, and users.

[0709] The server acquires audio data from the conference system in real time. Specifically, the server receives the audio stream via the conference system's communication interface and processes the data. The server then sends this audio data to a speech recognition API to convert it into text data. This process transforms the audio into text, laying the foundation for recording and analyzing meeting minutes.

[0710] Next, the server analyzes the text data. Using a natural language processing model, it understands the context of the conversation and extracts key information. Based on this, a summary capturing the essence of the meeting is generated. This summary efficiently extracts the main points from the lengthy conversation and presents them clearly to the participants.

[0711] Furthermore, the server uses an emotion engine to analyze the user's emotional state from the audio data. The emotion engine identifies emotions such as joy, anger, and sadness based on characteristics such as the speaker's tone, intonation, and speed. This emotional information is incorporated into the summary and used to provide participants with an overview of the meeting's atmosphere and emotional progression.
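To make the idea of identifying emotions from tone, intonation, and speed concrete, the following Python sketch applies simple threshold rules to three voice features. It is a toy illustration only: the structure `VoiceFeatures`, the function `estimate_emotion`, and every threshold value are assumptions made for this sketch, are not taken from this disclosure, and are not validated against real speech. An actual emotion engine would typically use a trained model such as the emotion identification model 59.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    mean_pitch_hz: float     # tone
    pitch_range_hz: float    # intonation (spread between low and high pitch)
    words_per_minute: float  # speed


def estimate_emotion(features: VoiceFeatures) -> str:
    """Classify with illustrative thresholds (assumed, not validated)."""
    fast = features.words_per_minute > 170
    slow = features.words_per_minute < 110
    if fast and features.pitch_range_hz > 80:
        return "joy" if features.mean_pitch_hz > 200 else "anger"
    if slow and features.pitch_range_hz < 40:
        return "sadness"
    return "neutral"


label = estimate_emotion(VoiceFeatures(230.0, 100.0, 180.0))
```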

[0712] The terminal receives the summary and sentiment information transmitted from the server and displays them on the user's visual device. The display shows the progress of the meeting together with visual feedback on the emotional state, in the form of a ticker on a portion of the screen.

[0713] Users can view summaries and sentiment information displayed on their devices, allowing them to grasp the meeting content in real time. This enables users to gain a deeper understanding of the meeting's context and participate more effectively based on comprehensive information, including the emotional states of other participants.

[0714] Thus, the present invention provides a new means for improving the quality of meetings by comprehensively capturing the content of the discussions and the emotions of the participants.

[0715] The following describes the processing flow.

[0716] Step 1:

[0717] The server acquires audio data from the conference system in real time. In this process, the server receives the conference audio stream via a communication interface and continuously collects data.

[0718] Step 2:

[0719] The server sends the acquired audio data to a speech recognition API, which converts it into text data in real time. This conversion process utilizes high-precision speech recognition technology to generate accurate text strings from the audio.

[0720] Step 3:

[0721] The server analyzes the converted text data using a natural language processing model. This involves understanding the context of the text and extracting important keywords and phrases.

[0722] Step 4:

[0723] The server generates a summary based on the extracted key information. The generated summary condenses the meeting's main points into a single line, making it easy for participants to quickly understand.

[0724] Step 5:

[0725] The server uses an emotion engine to analyze the user's emotional state based on the audio data. By evaluating the tone, speed, and intonation of the voice, the emotion engine identifies the speaker's emotions.

[0726] Step 6:

[0727] The server incorporates the emotional information analyzed by the emotion engine into a summary, further compiling it into information that reveals the overall atmosphere of the meeting.

[0728] Step 7:

[0729] The server sends the generated summary and sentiment information to the terminal. This data is encoded in preparation for providing information visually.

[0730] Step 8:

[0731] The terminal receives data transmitted from the server and displays it in ticker format on the meeting participants' visual devices. The displayed information includes the progress of the meeting and the emotional state of the participants, and is visually indicated in the appropriate part of the screen.

[0732] Step 9:

[0733] Users can understand the key points of a meeting and the emotional state of participants in real time by viewing summaries and sentiment information displayed on their devices. This allows users to gain a deeper understanding of the meeting content and use it as material for effective decision-making.
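Step 6 above (incorporating the emotional information into the summary so that the overall atmosphere is visible) can be sketched as follows. This minimal Python sketch assumes that the emotion engine yields one label per utterance and takes the most frequent label as the overall atmosphere; the function names `overall_atmosphere` and `ticker_line` and the output format are hypothetical.

```python
from collections import Counter
from typing import List


def overall_atmosphere(emotions: List[str]) -> str:
    """Return the most frequent emotion label, or 'neutral' if none exist."""
    if not emotions:
        return "neutral"
    return Counter(emotions).most_common(1)[0][0]


def ticker_line(summary: str, emotions: List[str]) -> str:
    """Combine the one-line summary with the overall atmosphere."""
    return f"{summary} [atmosphere: {overall_atmosphere(emotions)}]"


line = ticker_line("Schedule agreed for next sprint", ["joy", "joy", "anger"])
```

A majority vote discards minority emotions; if those matter to the participants, the per-label counts could be displayed instead.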

[0734] (Example 2)

[0735] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0736] In meetings, participants must understand a vast amount of information in a short time, and it is particularly difficult to grasp spoken conversations in real time as text. Furthermore, while understanding participants' emotional states during a meeting is valuable, capturing emotional changes through audio alone presents a challenge. This can hinder the efficient progress of meetings and decision-making.

[0737] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0738] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, and means for analyzing emotional states from the audio information to acquire emotional information. This enables efficient information gathering during a meeting, captures changes in participants' emotions, and facilitates more effective meeting management.

[0739] "Audio information" refers to data that digitally represents the characteristics of the voices spoken during a meeting.

[0740] "Text information" refers to data in text format generated based on audio information.

[0741] "Important information" refers to essential content that is relevant to the purpose of the meeting and can be extracted from the conversation during the meeting.

[0742] A "summary" is a text document that condenses the main points from the vast amount of information gathered during a meeting.

[0743] A "visual device" is a hardware device that displays information in a way that allows users to visually confirm it.

[0744] "Emotional state" refers to data that describes the mental state of meeting participants, based on factors such as tone, intonation, and speed of speech.

[0745] "Emotional information" refers to digital data obtained by analyzing audio information to indicate the emotional state of participants.

[0746] A "natural language processing model" is a collection of algorithms and systems that understand the meaning and context from text data and perform data extraction and generation.

[0747] This invention is a system for efficiently grasping information during meetings and recognizing the emotional state of participants in real time. This system consists of three elements: a server, a terminal, and a user, each playing a specific role.

[0748] The server acquires audio information in real time using the conference system's communication method. This process involves receiving audio in streaming format using protocols such as WebRTC or SIP. Then, to convert the audio information into text, it utilizes a speech recognition API (e.g., a general speech recognition service). This speech recognition API instantly transcribes the acquired audio information into text, enabling accurate recording of the conversation.
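Once the audio stream has been received, it is commonly divided into short fixed-length frames before being passed to speech recognition. The following minimal Python sketch shows this framing for raw PCM audio. The sample rate, sample width, and frame length are assumptions made for illustration, the names `frame_size_bytes` and `split_frames` are hypothetical, and the handling of WebRTC or SIP itself is not shown.

```python
from typing import Iterator

# Assumed audio format: 16 kHz, 16-bit mono PCM, 20 ms frames.
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2
FRAME_MS = 20


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     frame_ms: int = FRAME_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of bytes in one frame of mono PCM audio."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample


def split_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield complete frames; a trailing partial frame is dropped."""
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


size = frame_size_bytes()
frames = list(split_frames(bytes(1500), size))
```

In a streaming setting the dropped remainder would normally be kept and prepended to the next chunk of received audio rather than discarded.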

[0749] Next, the server analyzes the text data using a natural language processing model (e.g., a pretrained language model such as BERT or GPT). This allows it to understand the context of the conversation and extract important information. The extraction captures key points such as "When is the next meeting?" or "Client feedback on a specific product."
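The extraction of important information can be illustrated with a deliberately simple stand-in. The sketch below does not use a language model; it swaps in a rule-based filter that keeps sentences containing cue phrases. The cue list and the function names are assumptions made for illustration only.

```python
import re
from typing import List, Sequence

# Hypothetical cue phrases; a real system would rely on a language model instead.
CUES = ("next meeting", "deadline", "feedback", "decided", "action item")


def split_sentences(text: str) -> List[str]:
    """Split a transcript into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_key_sentences(transcript: str, cues: Sequence[str] = CUES) -> List[str]:
    """Keep only the sentences that mention at least one cue phrase."""
    return [
        s for s in split_sentences(transcript)
        if any(cue in s.lower() for cue in cues)
    ]
```

For a transcript that mixes small talk with scheduling and client feedback, only the scheduling and feedback sentences would be kept.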

[0750] Furthermore, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to analyze the participants' emotional states from the audio information. It determines each participant's emotion from the tone, intonation, and speed of their voice, identifying emotional information such as joy, anger, and sadness. This emotional information is integrated with the summary to give participants a sense of the meeting's atmosphere.
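To make the role of tone, intonation, and speaking speed concrete, the following sketch derives three simple features and maps them to an emotion label. The feature names, the thresholds, and the extra "neutral" label are all illustrative assumptions and do not come from the disclosure; an actual emotion engine would typically be a trained model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class ProsodyFeatures:
    mean_pitch_hz: float     # rough proxy for tone
    pitch_range_hz: float    # rough proxy for intonation
    words_per_second: float  # speaking speed


def prosody_features(
    pitch_hz: Sequence[float], word_count: int, duration_s: float
) -> ProsodyFeatures:
    """Compute features from pitch samples and word timing for one utterance."""
    if not pitch_hz or duration_s <= 0:
        raise ValueError("need at least one pitch sample and a positive duration")
    return ProsodyFeatures(
        mean_pitch_hz=mean(pitch_hz),
        pitch_range_hz=max(pitch_hz) - min(pitch_hz),
        words_per_second=word_count / duration_s,
    )


def classify_emotion(f: ProsodyFeatures) -> str:
    """Map features to a label using placeholder thresholds (assumptions only)."""
    if f.words_per_second > 3.0 and f.pitch_range_hz > 80.0:
        return "joy" if f.mean_pitch_hz > 200.0 else "anger"
    if f.words_per_second < 1.5 and f.pitch_range_hz < 30.0:
        return "sadness"
    return "neutral"
```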

[0751] The terminal receives the summary and emotional information sent from the server and displays them on the user's visual device. Specifically, the information is presented on the display device as pop-up windows or tickers, allowing the user to follow the progress of the meeting and the emotional state of the participants.

[0752] Users can view the information provided by this device and understand the meeting content in real time. This allows users to gain a deeper understanding of the meeting context and engage in more effective discussions that take into account the emotional states of other participants.

[0753] For example, in a company meeting, information such as "sales are increasing" is summarized, and the speaker's emotion of "joy" is identified and displayed on the screen. An example of a prompt given to the generative AI model for this purpose would be, "Summarize the meeting content in real time and analyze the emotional state of the participants."

[0754] This system makes it possible to improve the quality of meetings by integrating meeting information with participants' emotions.

[0755] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0756] Step 1:

[0757] The server acquires audio information in real time using the conference system's communication methods. It receives an audio stream from the conference as input. For processing, it collects the audio data in digital format using the WebRTC or SIP protocol. As output, the server holds the audio information in digital format.

[0758] Step 2:

[0759] The server sends the acquired audio information to a speech recognition API, which converts it into text information. Digital audio information is used as input. The processing involves the speech recognition service analyzing the audio data and converting its contents into text format. The server then obtains the text information as output.

[0760] Step 3:

[0761] The server feeds the textual information obtained through speech recognition into a natural language processing model to extract important information. Textual information is used as input. During processing, a language model (e.g., BERT or GPT) interprets the context of the text and extracts the key points of the meeting. The output is summarized text information.

[0762] Step 4:

[0763] The server uses an emotion engine to recognize the emotional state expressed in the audio information, complementing the analysis of the obtained text information. The inputs are the audio and text information. The processing analyzes the speaker's tone, intonation, and speaking speed to identify emotions such as joy, anger, and sadness. The output is the identified emotional information.

[0764] Step 5:

[0765] The server integrates the extracted key information and the emotional information to generate a summary. It uses the summarized text information and the emotional information as input. The processing combines them to create information that reflects both the content and the atmosphere of the meeting. The output is a visually presentable summary.

[0766] Step 6:

[0767] The terminal receives the summary and emotional information transmitted from the server and displays it to the user. The integrated summary information is used as input. Processing includes outputting the information to a display device and presenting it in the form of a pop-up window or ticker. The output is information that the user can visually confirm.

[0768] Step 7:

[0769] Users review the summary and emotional information displayed on their devices to understand the meeting content in real time. The input is the information provided by the device. Processing involves users making decisions based on this information and guiding the discussion. The output is effective meeting participation and decision-making.
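Steps 2 through 6 above can be sketched as a small pipeline in which each stage is a pluggable function. This is only an outline under stated assumptions: `MeetingUpdate`, `process_chunk`, and `render` are hypothetical names, and the three callables stand in for the speech recognition API, the natural language processing model, and the emotion engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MeetingUpdate:
    summary: str
    emotion: str


def process_chunk(
    audio: bytes,
    recognize: Callable[[bytes], str],             # Step 2: audio -> text
    summarize: Callable[[str], str],               # Step 3: text -> key points
    analyze_emotion: Callable[[bytes, str], str],  # Step 4: audio and text -> emotion
) -> MeetingUpdate:
    """Run one chunk of meeting audio through Steps 2 to 5."""
    text = recognize(audio)
    summary = summarize(text)
    emotion = analyze_emotion(audio, text)
    # Step 5: integrate key information and emotional information.
    return MeetingUpdate(summary=summary, emotion=emotion)


def render(update: MeetingUpdate) -> str:
    """Step 6: build the line a terminal could show in a pop-up or ticker."""
    return f"{update.summary} (mood: {update.emotion})"
```

Keeping the stages as injected callables means any one of them can be replaced, for example by a different recognizer, without touching the rest of the flow.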

[0770] (Application Example 2)

[0771] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0772] During meetings, a large amount of information is exchanged, making it difficult to efficiently grasp and understand it. Furthermore, understanding the emotional state of participants is crucial for improving meeting outcomes, but in typical meetings, this information cannot be shared in real time.

[0773] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0774] In this invention, the server includes means for acquiring audio information during a meeting, means for converting the acquired audio information into text information in real time, means for extracting important information from the converted text information and generating summary information, and means for analyzing emotional states from the audio information and generating emotional information. This makes it possible to provide real-time visual feedback of meeting summary information and participants' emotional information.

[0775] "Audio information" refers to the sound signals emitted by participants during a meeting, which convey the content of their communication.

[0776] "Textual information" refers to text data obtained by analyzing and converting audio information, and is a format designed to facilitate recording and analysis.

[0777] "Important information" refers to key points and keywords extracted from conversations using natural language processing technology.

[0778] "Summary information" refers to abbreviated information generated based on extracted key information, which helps in understanding the content of the meeting.

[0779] "Visual devices" are devices that enable meeting participants to visually perceive information, and include display screens and projectors.

[0780] "Emotional state" refers to the type of emotion identified by analyzing the emotional characteristics of the speaker contained in the audio information.

[0781] "Emotional information" refers to data indicating emotional states, analyzed by the emotion engine from audio information, and is used to understand the atmosphere of a meeting and the reactions of the participants.

[0782] The system for implementing the present invention mainly consists of elements such as a server, a terminal, and a user.

[0783] During a meeting, the server acquires the speech of meeting participants as audio information. This involves a process of receiving audio signals via communication means to stream the meeting audio to the server in real time. The server converts the received audio information into text information using speech recognition technology, and then utilizes natural language processing models to extract important information from the text information. Specifically, speech recognition APIs such as Google Cloud Speech-to-Text and natural language processing libraries such as spaCy can be used.

[0784] Furthermore, the server uses emotion analysis libraries to analyze the emotional state contained in the audio information. For example, it uses IBM Watson Tone Analyzer to identify the speaker's emotional state from the audio information and generate emotional information.

[0785] The generated summary information and emotional information are periodically displayed on the meeting participants' visual devices via their terminals. This allows users to understand the participants' emotional responses in real time, along with the important agenda items, as the meeting progresses.
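Periodic display can be sketched as a small publisher that forwards only the most recent summary once a fixed interval has elapsed. The class name, the interval, and the choice to drop intermediate summaries are assumptions for illustration; the disclosure states only that display happens at regular intervals.

```python
from typing import Callable, Optional


class PeriodicPublisher:
    """Forward the latest summary to a display callback at most once per interval."""

    def __init__(self, interval_s: float, display: Callable[[str], None]) -> None:
        self.interval_s = interval_s
        self.display = display
        self._last_shown: Optional[float] = None
        self._pending: Optional[str] = None

    def update(self, summary: str, now: float) -> bool:
        """Record a new summary; show it if the interval has elapsed.

        `now` is a timestamp in seconds supplied by the caller, which keeps
        the class independent of any particular clock. Returns True when the
        summary was displayed.
        """
        self._pending = summary
        if self._last_shown is None or now - self._last_shown >= self.interval_s:
            self.display(self._pending)
            self._last_shown = now
            self._pending = None
            return True
        return False
```

With a 30-second interval, a summary arriving 10 seconds after the previous display is held back, and if a newer one arrives at the 30-second mark the newer one is shown instead.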

[0786] As a concrete example, consider a project progress report at a weekly meeting. The server receives the project leader's statements as audio information, extracts the main progress and problems, and converts them into text information. It also generates emotional information, such as the leader's expectations and other members' anxieties, detected in the statements, visualizes it, and conveys it to the participants.

[0787] An example of a prompt for a generative AI model is: "Analyze the speaker's emotions in real time during the meeting and create the following text summary: {Audio data} Analyze the emotional state and identify the following key emotional tones: Joy, Anger, Sadness Incorporate the overall atmosphere of the session into the summary."
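The prompt above contains a placeholder for the meeting data, so it can be treated as a template. The sketch below fills that template; the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and making the list of emotional tones a parameter is an assumption added for illustration.

```python
from typing import Sequence

# Template text follows the example prompt; the placeholder names are assumptions.
PROMPT_TEMPLATE = (
    "Analyze the speaker's emotions in real time during the meeting "
    "and create the following text summary:\n"
    "{audio_data}\n"
    "Analyze the emotional state and identify the following key emotional tones: "
    "{tones}\n"
    "Incorporate the overall atmosphere of the session into the summary."
)


def build_prompt(
    audio_data: str, tones: Sequence[str] = ("Joy", "Anger", "Sadness")
) -> str:
    """Fill the template with the meeting data and the tones to identify."""
    return PROMPT_TEMPLATE.format(audio_data=audio_data, tones=", ".join(tones))
```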

[0788] In this way, the present invention enables meeting participants to deeply understand the meeting content and receive detailed feedback, including the emotional aspects of the dialogue, in real time.

[0789] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0790] Step 1:

[0791] The server acquires audio information during a meeting. As input, it receives the meeting's audio stream in real time via a communication method. The server temporarily stores this audio data as an audio file.

[0792] Step 2:

[0793] The server converts the acquired audio information into text information. It sends the audio data as input to a speech recognition API such as Google Cloud Speech-to-Text, obtaining text data as output. This process converts the audio into understandable text information.

[0794] Step 3:

[0795] The server extracts important information from text data. It uses spaCy, a natural language processing library, to analyze text data as input. As output, it generates summary information containing keywords and important phrases from the conversation.

[0796] Step 4:

[0797] The server generates emotional information from audio data. It processes the audio data as input using an emotional analysis library such as IBM Watson Tone Analyzer. The output is data indicating the speaker's emotional state. This information includes emotional indicators such as joy and anger.

[0798] Step 5:

[0799] The terminal displays the generated summary information and emotional information on a visual device. It receives the summary information and emotional information from the server as input and displays them on a screen or projector. The output is provided to participants as visual feedback.

[0800] Step 6:

[0801] Users understand the meeting content through information displayed on visual devices. Based on information provided by the terminal as input, users grasp the progress of the meeting and the emotional state of participants in real time. This allows for deeper participation in the meeting.
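The keyword extraction in Step 3 above can be approximated, for illustration, by a frequency count over non-trivial words. This sketch uses only the Python standard library as a stand-in and does not call spaCy; the stopword list and the function name `top_keywords` are assumptions.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list; a real library ships a far larger one.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "are",
    "was", "we", "it", "this", "that", "for", "with", "be", "will",
})


def top_keywords(text: str, k: int = 3) -> List[str]:
    """Return the k most frequent words that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(k)]
```

Frequency alone ignores phrasing and context, so this is best read as the simplest possible baseline for the "keywords and important phrases" output.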

[0802] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0803] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0804] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0805] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0806] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
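The geometry described above can be modeled, as one possible sketch, with a clock position and a ring index for each emotion. The `MapPoint` type, the ring numbering, and the helper functions are hypothetical, and no actual coordinates from Figure 9 are assumed.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MapPoint:
    clock_hours: float  # clock position, 12 (or 0) at the top, 3 on the right
    ring: int           # 1 = innermost circle; larger values lie further out

    def xy(self) -> Tuple[float, float]:
        """Convert to Cartesian coordinates with 'pleasure' up and 'situation' right."""
        theta = math.radians(self.clock_hours * 30.0)  # clockwise from the top
        return (self.ring * math.sin(theta), self.ring * math.cos(theta))


def side(p: MapPoint) -> str:
    """Classify a point as lying on the reaction (left) or situation (right) side."""
    x, _ = p.xy()
    if abs(x) < 1e-9:
        return "center line"
    return "situation" if x > 0 else "reaction"


def distance(a: MapPoint, b: MapPoint) -> float:
    """Euclidean distance on the map; nearby emotions are likely to co-occur."""
    return math.dist(a.xy(), b.xy())
```

Under this representation, a point at the 3 o'clock position falls on the situation side, a point at 9 o'clock on the reaction side, and closeness on the map becomes an ordinary distance.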

[0807] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0808] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible (expressed in actions) it becomes.

[0809] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0810] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0811] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
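Once the network has produced an emotion value for each emotion, determining the user's emotion can be as simple as taking the largest value and noting which other emotions score almost as high. The sketch below assumes the values arrive as a dictionary; the function name, the margin, and the numbers in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple


def determine_emotion(
    values: Dict[str, float], margin: float = 0.1
) -> Tuple[str, List[str]]:
    """Return the highest-valued emotion and the others within `margin` of it."""
    if not values:
        raise ValueError("no emotion values given")
    top = max(values, key=values.get)
    similar = sorted(
        name for name, v in values.items()
        if name != top and values[top] - v <= margin
    )
    return top, similar
```

For instance, with assumed values of 0.82 for "reassured", 0.80 for "calm", 0.78 for "confident", and 0.05 for "anger", the function would select "reassured" and report "calm" and "confident" as similar, which mirrors the pattern described for Figure 10.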

[0812] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0813] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0814] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0815] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0816] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0817] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0818] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0819] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0820] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0821] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0822] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0823] The following is further disclosed regarding the embodiments described above.

[0824] (Claim 1)

[0825] A means of acquiring audio data during a meeting,

[0826] A means of converting acquired audio data into text data in real time,

[0827] A means of extracting important information from converted text data,

[0828] A means for generating a summary based on extracted important information,

[0829] A means for displaying the generated summary on the visual devices of meeting participants at regular intervals,

[0830] A system that includes this.

[0831] (Claim 2)

[0832] The system according to claim 1, wherein the means for acquiring the audio data is for receiving audio streaming via the communication interface of the conference system.

[0833] (Claim 3)

[0834] The system according to claim 1, wherein the means for generating the summary is for obtaining a shortened text based on the context of the conversation using a natural language processing model.

[0835] "Example 1"

[0836] (Claim 1)

[0837] Means for acquiring audio data,

[0838] A means of converting acquired audio data into text data,

[0839] A means of analyzing the converted text data to extract important information,

[0840] A means for generating a concise summary based on extracted key information,

[0841] A means for periodically displaying the generated summary on a visual device,

[0842] Information processing device including

[0843] (Claim 2)

[0844] The information processing apparatus according to claim 1, wherein the means for acquiring the voice data is for receiving the voice data via a communication network.

[0845] (Claim 3)

[0846] The information processing apparatus according to claim 1, wherein the means for generating the summary is for obtaining a context-based summary using language analysis techniques.

[0847] "Application Example 1"

[0848] (Claim 1)

[0849] A means of acquiring audio data during a meeting,

[0850] A means of converting acquired audio data into text data in real time,

[0851] A means of extracting important information from converted text data,

[0852] A means for generating a summary based on extracted important information,

[0853] A means for displaying the generated summary on the device's display device at regular intervals,

[0854] A means of acquiring voice data within the home and providing a real-time summary to family members,

[0855] A system that includes this.

[0856] (Claim 2)

[0857] The system according to claim 1, wherein the means for acquiring the audio data is for receiving audio streaming via a communication network.

[0858] (Claim 3)

[0859] The system according to claim 1, wherein the means for generating the summary is for obtaining a shortened text based on the context of the conversation using a natural language processing model.

[0860] "Example 2 of combining an emotion engine"

[0861] (Claim 1)

[0862] Means of acquiring audio information during a meeting,

[0863] A means of converting acquired audio information into text information in real time,

[0864] A means of extracting important information from converted text information,

[0865] Means for generating a summary based on extracted important information,

[0866] A means for displaying the generated summary on the visual devices of meeting participants at regular intervals,

[0867] A means of analyzing emotional states from audio information to obtain emotional information,

[0868] A means of integrating and displaying acquired emotional information in a summary,

[0869] A system that includes this.

[0870] (Claim 2)

[0871] The system according to claim 1, wherein the means for acquiring the audio information is for receiving audio streaming via the communication means of the conference system.

[0872] (Claim 3)

[0873] The system according to claim 1, wherein the means for generating the summary is for obtaining shortened textual information based on the context of the conversation using a natural language processing model.

[0874] "Application example 2 when combining with an emotional engine"

[0875] (Claim 1)

[0876] Means of acquiring audio information during a meeting,

[0877] A means of converting acquired audio information into text information in real time,

[0878] A means of extracting important information from converted text information,

[0879] A means for generating summary information based on extracted important information,

[0880] A means for displaying the generated summary information on a visual device at regular intervals,

[0881] A means of analyzing emotional states from audio information and generating emotional information,

[0882] A means for displaying the generated emotional information as visual feedback on a visual device,

[0883] A system that includes this.

[0884] (Claim 2)

[0885] The system according to claim 1, wherein the means for acquiring the audio information is for receiving audio streaming via the communication means of the information exchange system.

[0886] (Claim 3)

[0887] The system according to claim 1, wherein the means for generating the summary information is for obtaining shortened textual information based on the context of a conversation using a natural language processing model, and further for integrating sentiment information.

[Explanation of Symbols]

[0888] 10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A means of acquiring audio data during a meeting, A means of converting acquired audio data into text data in real time, A means of extracting important information from converted text data, A means for generating a summary based on extracted important information, A means for displaying the generated summary on the device's display device at regular intervals, A means of acquiring voice data within the home and providing a real-time summary to family members, A system that includes this.

2. The system according to claim 1, wherein the means for acquiring the audio data is for receiving audio streaming via a communication network.

3. The system according to claim 1, wherein the means for generating the summary is for obtaining a shortened text based on the context of the conversation using a natural language processing model.