system

The system addresses inefficiencies in meeting minute creation by converting audio to text, summarizing, and translating in multiple languages, ensuring timely and accurate information sharing and task management.

JP2026104560A · Pending · Publication Date: 2026-06-25 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-13
Publication Date: 2026-06-25

Application Information

Patent Timeline

13 Dec 2024: Application
25 Jun 2026: Publication (JP2026104560A)

IPC: H04L65/403; G06Q10/10; G10L15/10; G10L15/00; G06Q10/00; G10L15/30

AI Tagging

Application Domain

Office automation; Speech recognition

Technology Topics

Engineering; Human–computer interaction


AI Technical Summary

Technical Problem

Creating meeting minutes in diverse business environments is time-consuming and labor-intensive, and conventional methods face challenges with input errors, delays, and inefficient multilingual communication, especially in international settings.

Method used

A system that converts audio data to text using speech recognition, analyzes the text to generate summaries, extracts tasks and actions, and provides multilingual translation, enabling real-time access and management of meeting minutes.

Benefits of technology

Facilitates efficient information sharing and task management across languages, reducing errors and delays, and allowing immediate access to meeting summaries and action items.

✦ Generated by Eureka AI based on patent content.

Figure 2026104560000001_ABST

Abstract

A system is provided. [Solution] A system including: a means for acquiring audio information and converting it into text information by speech recognition; a means for analyzing the text information and generating a summary; a means for automatically extracting work items and action items from the summary and registering them in a task management system; a means for translating the summary and the work items into multiple specified natural languages; a means for saving the generated data and making it accessible in real time; and a means for generating natural-language meeting minutes in real time and making them immediately available to meeting participants.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In today's increasingly diverse business environments, creating meeting minutes takes considerable time and labor, and the demand for information sharing in multiple languages makes it difficult to respond quickly and accurately in international cases. In addition, conventional methods that rely on manual work suffer from input errors and delays in information transmission, which reduces communication efficiency. Solving these problems requires a technology that performs automatic generation, summarization, issue management, and multilingual translation of meeting minutes in a single pass.

Means for Solving the Problems

[0005] This invention is based on a technology that acquires audio data and converts it into text data using speech recognition. It includes a function to analyze the text data, generate a summary containing important points of discussion, automatically extract tasks and actions based on that summary, and register them in a task management system. Furthermore, the summary and tasks are made available in multiple languages through multilingual translation. In addition, the generated data is saved and made accessible in real time, thereby enabling efficient information sharing in multiple languages. Therefore, by making full use of each means described in the claims, information sharing will be accelerated and work efficiency will be improved.

[0006] "Audio data" refers to digital or analog data on which audio is recorded, including audio information collected in meetings, interviews, and other similar settings.

[0007] "Speech recognition" refers to the technology that analyzes audio data and converts that audio into text data, which is written information.

[0008] "Text data" refers to character information generated by speech recognition, and is data that possesses meaning and context as a sentence.

[0009] "Analysis" refers to the process of processing text data using various algorithms to extract important information.

[0010] A "summary" is a collection of information that concisely summarizes the main points of text data, presenting the overall content in a shortened form.

[0011] A "task" refers to a specific piece of work or activity that needs to be accomplished in a meeting or project.

[0012] An "action" refers to a concrete step that must be taken to complete a specific task.

[0013] "Task management" refers to the process of planning, tracking, and managing the progress of tasks and actions.

[0014] "Multilingual translation" refers to the process of translating text written in one language into another language.

[0015] "Saving" refers to the process of recording the generated data in a state where it can be referenced later.

[0016] "Real-time" refers to a state where data processing and communication are performed almost immediately, and the results can be obtained immediately.

Brief Description of Drawings

[0017]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which multiple emotions are mapped.
[Figure 10] An emotion map to which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Embodiment 2 when the emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when the emotion engine is combined.

Mode for Carrying Out the Invention

[0018] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0019] First, the terms used in the following description will be explained.

[0020] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0021] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0022] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs and various parameters. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0023] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0024] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0025] [First Embodiment]

[0026] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0027] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0028] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0029] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0030] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0031] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0032] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0033] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0034] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0035] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0036] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0037] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0038] This invention is an information processing system that enables the rapid and accurate creation of meeting minutes and supports multiple languages. The embodiments for carrying out this invention are described below.

[0039] Users launch a dedicated application at the start of a meeting to begin collecting audio data. The audio data is transferred to the server in real time via the terminal. The server receives this audio data and converts it into text data using a speech recognition engine. This speech recognition utilizes a pre-trained model to achieve highly accurate text conversion.

[0040] The server analyzes the converted text data to extract the main discussion points and decisions of the meeting and create a summary. The generated summary is designed to present the key points in a concise manner, making it easy for users to understand the content.

[0041] Furthermore, the server automatically extracts tasks and action points from the summarized information and registers them in the issue tracking function. This function allows an assignee and a deadline to be set for each task and its progress to be managed. In particular, it allows tasks to be managed according to their priority and dependencies.
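As a non-limiting illustration, the management of tasks by priority and dependency described above could be sketched as follows. The `Task` fields, the priority scale, and the `execution_order` helper are assumptions introduced for this sketch and do not appear in the disclosure; the ordering uses Python's standard `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass, field
from datetime import date
from graphlib import TopologicalSorter


@dataclass
class Task:
    """One extracted task. Field names are illustrative assumptions."""
    task_id: str
    title: str
    assignee: str
    deadline: date
    priority: int = 3                      # assumed scale: 1 = highest
    depends_on: set[str] = field(default_factory=set)


def execution_order(tasks: list[Task]) -> list[str]:
    """Return task ids with every dependency placed before its dependents.

    Among tasks that become ready at the same time, the one with the
    higher priority (then the earlier deadline) is listed first.
    Every id in depends_on is assumed to belong to a task in `tasks`.
    """
    by_id = {t.task_id: t for t in tasks}
    sorter = TopologicalSorter({t.task_id: t.depends_on for t in tasks})
    sorter.prepare()                       # raises CycleError on circular dependencies
    order: list[str] = []
    while sorter.is_active():
        ready = sorted(
            sorter.get_ready(),
            key=lambda i: (by_id[i].priority, by_id[i].deadline),
        )
        for task_id in ready:
            order.append(task_id)
            sorter.done(task_id)
    return order
```

A circular dependency (task A waiting on B while B waits on A) surfaces as a `graphlib.CycleError`, which a registration step could report back to the meeting participants instead of silently storing an unschedulable task.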

[0042] For multilingual translation, the server passes the generated summary and action list to a multilingual translation engine. For example, it can translate from English to Japanese or Chinese in real time, allowing participants to view the meeting minutes in their native language.
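The processing flow of this embodiment (Step 6) mentions a dictionary function for keeping terminology consistent during translation. One possible way to do this, shown here only as a sketch, is to shield glossary terms with placeholders before translation and substitute the approved target-language terms afterwards. The function name, the placeholder format, and the injected `translate` callable are assumptions; the disclosure does not specify how the dictionary is applied.

```python
from typing import Callable


def translate_with_glossary(
    text: str,
    glossary: dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Translate `text` while forcing glossary terms to fixed target terms.

    `translate` is a stand-in for the translation engine (hypothetical).
    It is assumed to pass placeholder tokens such as __TERM0__ through
    unchanged, which a real engine may not guarantee.
    """
    restored: dict[str, str] = {}
    protected = text
    # Longest terms first so that multi-word terms win over their substrings.
    for index, term in enumerate(sorted(glossary, key=len, reverse=True)):
        if term in protected:
            token = f"__TERM{index}__"
            protected = protected.replace(term, token)
            restored[token] = glossary[term]
    translated = translate(protected)
    for token, target_term in restored.items():
        translated = translated.replace(token, target_term)
    return translated
```

For example, with the glossary `{"action item": "アクションアイテム"}` every occurrence of "action item" in a summary would be rendered with the same Japanese term regardless of how the engine would otherwise translate it.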

[0043] Finally, the server stores all generated data in secure cloud storage, allowing meeting participants to access it in real time via their devices. This system enables users to review meeting minutes and action lists immediately after the meeting ends, allowing them to efficiently move forward with their tasks.

[0044] For example, when conducting online meetings with overseas project teams, using this system allows for efficient sharing of meeting content in multiple languages and immediate management of decisions and tasks, thus enabling the rapid progress of international projects.

[0045] The following describes the processing flow.

[0046] Step 1:

[0047] The user launches an audio recording application at the start of the meeting. This application collects audio data and prepares to send it to the server in real time.

[0048] Step 2:

[0049] The terminal immediately streams the collected audio data to the server. The data is compressed to reduce network bandwidth usage.

[0050] Step 3:

[0051] The server inputs the received audio data into the speech recognition engine and converts it into text data. The engine uses a model trained in advance to improve recognition accuracy.

[0052] Step 4:

[0053] The server processes the converted text data using a natural language processing algorithm to extract key discussion points and decisions. This generates a shortened version that summarizes the entire meeting.

[0054] Step 5:

[0055] The server automatically extracts tasks and actions from the summary and registers them in the issue management system. Assignees and deadlines are automatically assigned based on pre-configured settings.

[0056] Step 6:

[0057] The server translates the summary and action list into the specified language using a multilingual translation engine. This process utilizes dictionary functionality to maintain terminology consistency.

[0058] Step 7:

[0059] The server saves the final meeting minutes data—text, summary, and action list—to cloud storage. This allows users and participants to access this data in real time through their devices.
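Step 2 of the flow above sends audio as a compressed stream. The disclosure does not name a compression scheme, so the following sketch uses `zlib` from the Python standard library purely as a stand-in; a practical system would more likely use a dedicated audio codec. The function names are assumptions. Each chunk is flushed with `Z_SYNC_FLUSH` so that the receiver can decode a packet as soon as it arrives instead of waiting for the end of the meeting.

```python
import zlib
from typing import Iterable, Iterator


def compress_stream(chunks: Iterable[bytes], level: int = 6) -> Iterator[bytes]:
    """Terminal side: yield one compressed packet per captured audio chunk."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        # Z_SYNC_FLUSH pushes out everything buffered so far, so the
        # packet is decodable on its own arrival.
        yield compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    # Final packet closes the stream and carries the integrity checksum.
    yield compressor.flush()


def decompress_stream(packets: Iterable[bytes]) -> Iterator[bytes]:
    """Server side: yield raw audio bytes as packets arrive."""
    decompressor = zlib.decompressobj()
    for packet in packets:
        data = decompressor.decompress(packet)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail
```

The sync flush costs a few bytes per packet, which is the price paid for low latency; larger chunks reduce that overhead at the cost of a longer delay before the server can start recognition.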

[0060] (Example 1)

[0061] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0062] In meetings, it is necessary to improve the efficiency of information sharing and management by quickly and accurately converting audio data into text and summarizing its content. Furthermore, a multilingual system is required to provide consistent information even when participants speak different languages. It is also crucial to provide an environment where the generated information is accessible in real time, enabling rapid decision-making and task progress management.

[0063] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0064] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for automatically extracting tasks and action points from the summary and registering them in task management. This makes it possible to efficiently organize the contents of meetings and share information in a consistent manner in multiple languages. Furthermore, since the generated information can be used immediately, it is expected to speed up decision-making and improve the efficiency of task management.

[0065] "Audio information" refers to audio data acquired in specific situations or environments, specifically the content of speech in settings such as meetings.

[0066] "Speech recognition" refers to the technology that analyzes acquired audio information and converts it into corresponding text information.

[0067] "Text information" refers to text data generated from audio information through speech recognition.

[0068] A "summary" refers to information that concisely summarizes the main points of discussion and decisions extracted from textual information.

[0069] "Tasks" refers to specific tasks and project-related action items extracted from the summary.

[0070] "Action points" refer to action items agreed upon during the meeting, or specific actions to be taken as the next steps.

[0071] "Task management" refers to the process of registering tasks and action points, and managing their progress, priorities, and dependencies.

[0072] "Multilingual support" refers to a function that translates summaries and task information into different languages, making it easier for users who speak various languages to understand the information.

[0073] "Instant access" refers to a state where generated information is saved and made available to users in real time.

[0074] This invention is realized through a system that streamlines information processing in meetings. When a user starts a meeting, they launch a dedicated application on their terminal and begin collecting audio information. The terminal captures this audio using a high-quality microphone and transmits it to a server in real time.

[0075] The server converts received audio information into text using a speech recognition engine. This engine is enhanced by a generative AI model, enabling highly accurate recognition of different utterances and specialized terminology.

[0076] Next, the server analyzes the text information and uses natural language processing algorithms to generate a text summary. This summarization process concisely organizes the main discussion points and decisions of the meeting, making it easy for users to understand the content.

[0077] Furthermore, the server extracts tasks and action points from the summarized information and automatically registers them in the task management system. The task management system has the functionality to set assignees and deadlines for each task, and also has the ability to manage progress and task dependencies.

[0078] To support multiple languages, the server uses a translation engine to translate summaries and tasks in real time into multiple specified languages, enabling information to be shared in various languages. This allows participants to review meeting minutes in their native language.

[0079] Ultimately, the server securely stores the generated information in cloud storage, allowing users to access it instantly via their devices. This system enables users to review meeting minutes and action points immediately after the meeting ends, allowing them to efficiently plan their next actions.

[0080] For example, when conducting online meetings with multiple overseas locations, this system can be used to quickly share information about the situation in each location and update decisions in a timely manner. An example of a prompt message would be, "Please check the preparation status for the next project meeting, translate any important decisions, and notify relevant parties."

[0081] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0082] Step 1:

[0083] The user launches a dedicated application on their device and begins collecting audio information. The device uses a microphone to capture the meeting audio in real time and transfers the audio data to a server. The input is the raw audio of the meeting, and the output is a digital audio file.

[0084] Step 2:

[0085] The server processes the audio data received from the terminal and inputs it into the speech recognition engine. The speech recognition engine utilizes a generative AI model to analyze the audio data and convert it into text information. The input is a digital audio file, and the output is text information (text data).

[0086] Step 3:

[0087] The server analyzes the textual information and generates a summary using natural language processing algorithms. This process extracts the main discussion points and decisions of the meeting and summarizes the content concisely. The input is textual information, and the output is summarized text data.

[0088] Step 4:

[0089] The server automatically extracts tasks and action points from the generated summary and registers them in the task management system. This allows for the assignment of assignees and deadlines, and enables task progress management. The input is summarized text data, and the output is a task list and action points.

[0090] Step 5:

[0091] The server uses a translation engine to translate summaries and task data into multiple languages. This allows participants using different languages to access the information in their native language. The input is summaries and task lists, and the output is equivalent information translated into each language.

[0092] Step 6:

[0093] The server stores the final data in secure cloud storage, making it accessible to users in real time via their devices. Users can use this to review information immediately after the meeting ends. The input is a translated dataset, and the output is accessible data in the cloud.
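The server-side steps above (Steps 2 to 6) form a chain in which the output of each step is the input of the next. A minimal sketch of that chain follows. Every stage is passed in as a plain callable, because the disclosure does not fix a concrete engine for any of them; the class names, field names, and signatures are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Translation:
    summary: str
    tasks: list[str]


@dataclass
class MinutesRecord:
    """Everything produced for one meeting (illustrative structure)."""
    transcript: str
    summary: str
    tasks: list[str]
    translations: dict[str, Translation] = field(default_factory=dict)


@dataclass
class MinutesPipeline:
    """Chains the stages; each callable is a hypothetical stand-in."""
    transcribe: Callable[[bytes], str]         # Step 2: audio -> text
    summarize: Callable[[str], str]            # Step 3: text -> summary
    extract_tasks: Callable[[str], list[str]]  # Step 4: summary -> task list
    translate: Callable[[str, str], str]       # Step 5: (text, language) -> text
    store: Callable[[MinutesRecord], None]     # Step 6: persist the result

    def run(self, audio: bytes, languages: list[str]) -> MinutesRecord:
        transcript = self.transcribe(audio)
        summary = self.summarize(transcript)
        tasks = self.extract_tasks(summary)
        record = MinutesRecord(transcript, summary, tasks)
        for language in languages:
            record.translations[language] = Translation(
                summary=self.translate(summary, language),
                tasks=[self.translate(task, language) for task in tasks],
            )
        self.store(record)
        return record
```

Keeping the stages injectable means a speech recognition engine, summarizer, or translation engine can be swapped without touching the orchestration, and the whole flow can be exercised in tests with trivial stand-ins such as `lambda text, lang: f"[{lang}] {text}"`.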

[0094] (Application Example 1)

[0095] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0096] In today's world, with the increase in international conferences and webinars, there is a growing need for efficient meeting minute creation and information sharing that transcends language barriers. However, traditional methods struggle with real-time multilingual support, making it difficult to ensure the accuracy and timeliness of information. Furthermore, the process of understanding meeting details in different languages is cumbersome, hindering efficient communication.

[0097] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0098] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for generating natural language meeting minutes in real time and making them immediately available to meeting participants. This enables the creation of multilingual meeting minutes in real time, reduces communication barriers in international meetings, and facilitates efficient information sharing.

[0099] "Audio information" refers to data input as audio, which forms the basis for converting it into text information.

[0100] "Textual information" refers to text data converted from audio information by speech recognition, and is used for further analysis and summary generation.

[0101] "Speech recognition" is a technology that converts spoken information into text information, automatically recording the content of conversations during meetings as text.

[0102] A "summary" is a brief record that extracts key points of discussion and decisions from analyzed textual information, allowing for efficient comprehension of the information.

[0103] A "work item" refers to the specific details of the actions decided at the meeting, and each item is assigned a person in charge and a deadline for its implementation.

[0104] "Action items" are specific actions that arise as a result of discussions during the meeting, and they indicate the measures that participants should take.

[0105] "Task management" is a method for systematically recording work items and action items, and for tracking and managing their progress.

[0106] "Natural language" refers to the language that humans use on a daily basis, and this system is capable of processing multiple languages.

[0107] "Real-time" means that data processing and information provision occur immediately and are available concurrently with the progress of the meeting.

[0108] "Generation" refers to the process of creating new information or records based on data, and specifically refers to the creation of summaries and translated meeting minutes.

[0109] "Meeting minutes" are documents that record the content and decisions made at a meeting, and are used for future reference and information sharing.

[0110] "Meeting participants" refers to individuals who participate in meetings or webinars and are the entities that utilize the generated information.

[0111] This system generates meeting minutes in real time in multiple languages, allowing participants to access the information immediately.

[0112] First, users collect audio information from meetings using devices such as smartphones or computers. These devices have a dedicated application installed and send the audio data to a server. The server then uses speech recognition software to convert the received audio information into text in real time for analysis. Because a pre-trained model is used for speech recognition, highly accurate conversion is possible. Specifically, the speech_recognition library is utilized to perform the conversion from speech to text.

[0113] Next, based on the converted text information, the server generates a summary using a generative AI model. This summary extracts the key points and presents them concisely. The transformers library's models are used to analyze the text data and extract useful information.

[0114] The server then passes the generated summary to a multilingual translation engine, which translates it into the natural language requested by the participants. This enables smooth communication between participants who speak different languages. The langchain library is used for translation, providing fast and accurate translations.

[0115] This information is securely stored in cloud storage, and participants can access and view it at any time. This makes it possible to easily check necessary information immediately after the meeting ends, and to smoothly proceed with the execution of work items and action items. For example, when an international research team holds an online meeting, using this system can overcome language barriers and allow them to grasp important discussions and decisions in real time.

[0116] An example of a prompt used when inputting data into a generative AI model is: "Summarize the meeting minutes and translate them into Japanese. Please shorten the content while extracting the key points." This prompt allows the system to process the information in the specified format, enabling rapid and efficient information sharing.
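
As a rough illustration of how the example prompt above could be parameterized by target language, the following sketch assembles the instruction string. The helper name `build_summary_prompt` and its parameters are assumptions made for illustration and do not appear in the original disclosure.

```python
def build_summary_prompt(minutes_text: str, target_language: str = "Japanese") -> str:
    """Build a prompt asking a generative AI model to summarize and translate.

    The wording follows the example prompt in the description; the function
    itself is a hypothetical helper.
    """
    instruction = (
        f"Summarize the meeting minutes and translate them into {target_language}. "
        "Please shorten the content while extracting the key points."
    )
    # The transcribed minutes are appended after the instruction.
    return f"{instruction}\n\n{minutes_text.strip()}"


prompt = build_summary_prompt("Alice: ship v2 on Friday.\nBob: agreed.", "Japanese")
print(prompt)
```

Keeping the target language as a parameter would let the same template serve each language requested by the participants.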

[0117] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0118] Step 1:

[0119] The user launches a dedicated application and begins collecting audio information. At this time, the terminal captures the conference audio via its microphone and sends the acquired audio data to the server. The input is the live audio, and the output is the audio data sent to the server.

[0120] Step 2:

[0121] The server converts the received audio data into text information using the speech_recognition library. This library is a speech recognition engine that converts speech to text and uses a trained model to achieve high accuracy. The input is audio data, and the output is text data.

[0122] Step 3:

[0123] The server generates a summary from the acquired text information using the transformers library. This library utilizes a generative AI model to efficiently extract key discussion points and decisions from the text information. The input is text data, and the output is a summarized text.

[0124] Step 4:

[0125] The server translates the generated summary into the specified natural language using the langchain library. This enables multilingual support and facilitates smooth information exchange among participants using different languages. The input is the summary text, and the output is the translated text.

[0126] Step 5:

[0127] The server saves the summary and translation results to cloud storage. This data is accessible in real time, allowing participants to continue referencing the information even after the meeting ends. The input is the translated text, and the output is the data stored in cloud storage.

[0128] Step 6:

[0129] Users can access cloud storage through their devices and view the necessary information. This process allows them to instantly review the agenda and tasks discussed in a meeting and quickly move on to the next action. The input is data in the cloud, and the output is information displayed on the user's screen.
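
The server-side portion of the flow above (steps 2 to 5) can be sketched as a chain of stages. This is a minimal sketch under stated assumptions: the class `MinutesPipeline`, its field names, and the in-memory dictionary standing in for cloud storage are all hypothetical, and the stub stages below replace the speech recognition, summarization, and translation engines so that the example runs without external services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class MinutesPipeline:
    """Chains server-side steps 2 to 5; every stage is injected by the caller."""
    transcribe: Callable[[bytes], str]     # step 2: audio data -> text data
    summarize: Callable[[str], str]        # step 3: text data -> summary
    translate: Callable[[str, str], str]   # step 4: (summary, language) -> translation
    # step 5: a plain dict used here as a stand-in for cloud storage
    storage: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def run(self, meeting_id: str, audio: bytes, language: str) -> Dict[str, str]:
        text = self.transcribe(audio)
        summary = self.summarize(text)
        translated = self.translate(summary, language)
        record = {"summary": summary, "translation": translated, "language": language}
        self.storage[meeting_id] = record
        return record


# Stub stages (placeholders for real engines) so the sketch runs as-is.
pipeline = MinutesPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    summarize=lambda text: text.split(".")[0] + ".",
    translate=lambda summary, lang: f"[{lang}] {summary}",
)
record = pipeline.run("m-001", b"Release is set for Friday. Other notes follow.", "ja")
print(record)
```

Injecting each stage as a callable keeps the flow independent of any particular library, so a real recognizer, summarizer, or translator could be substituted without changing `run`.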

[0130] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0131] This invention is an information processing system that combines a system for automatically generating meeting minutes from audio data with an emotion engine that analyzes the user's emotions. The following describes specific embodiments for carrying out this invention.

[0132] First, the user launches a dedicated application at the start of the meeting to begin collecting audio data. The collected audio data is streamed in real time from the terminal to the server. The server receives this audio data and converts it into text data using a speech recognition engine. This process integrates an emotion engine that recognizes emotions from the user's voice tone and intonation.

[0133] The server analyzes the recognized text data, extracts the main points of discussion from the meeting, and generates a summary. This summary includes the emotion information recognized by the emotion engine, and emotion labels are assigned to express the emotional nuances behind each discussion.

[0134] Next, the server extracts tasks and action points from the summarized information and registers them in the issue management system. This registration process also takes into account the user's emotional state, allowing for adjustment of task priorities. For example, if a discussion is perceived as emotionally important, tasks related to that discussion will be given a higher priority.
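
One way the priority adjustment described above might work is sketched below. The description does not specify a scoring rule, so the `Task` fields, the intensity scale from 0.0 to 1.0, the threshold, and the fixed boost are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    title: str
    base_priority: int        # larger means more urgent (assumed convention)
    emotion_intensity: float  # hypothetical score, 0.0 to 1.0, for the related discussion


def adjust_priorities(tasks: List[Task], threshold: float = 0.7, boost: int = 2) -> List[Task]:
    """Return tasks sorted by priority after boosting emotionally important ones."""
    def effective(task: Task) -> int:
        # Tasks tied to emotionally important discussions receive a fixed boost.
        return task.base_priority + (boost if task.emotion_intensity >= threshold else 0)

    return sorted(tasks, key=effective, reverse=True)


tasks = [
    Task("Update slides", base_priority=2, emotion_intensity=0.1),
    Task("Fix billing outage", base_priority=1, emotion_intensity=0.9),
]
ordered = adjust_priorities(tasks)
print([t.title for t in ordered])
```

In this example the second task overtakes the first because its discussion exceeds the assumed intensity threshold.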

[0135] Furthermore, the server translates the generated summaries and tasks, together with the associated emotion data, into multiple specified languages. This translation process uses a dictionary lookup function to maintain terminology consistency. This enables multilingual communication that includes emotion information.

[0136] Ultimately, the server stores all data in real time in secure cloud storage, making it instantly accessible to global meeting participants via their devices. This system allows users to review meeting minutes and tasks after the meeting, as well as gain emotional insights from the process.

[0137] As a concrete example, in situations where intercultural communication is crucial, such as at international conferences, this system can be used to instantly share meeting minutes in multiple languages, taking emotional nuances into account, thereby facilitating smoother decision-making and task execution.

[0138] The following describes the processing flow.

[0139] Step 1:

[0140] The user launches a dedicated application at the start of the meeting to begin collecting audio data. This collection is managed by the device, and the audio data is sent to the server in real time.

[0141] Step 2:

[0142] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This conversion process uses a high-precision language model to faithfully record the content of the conversation as text.

[0143] Step 3:

[0144] Simultaneously, the server uses an emotion engine to extract user emotion information from the audio data in real time. It analyzes the tone, speed, and pitch of the voice and assigns an emotion label to each utterance.

[0145] Step 4:

[0146] The server analyzes the generated text data and extracts key discussion points and decisions from the meeting. When creating a summary based on this analysis, the extracted sentiment information is also reflected in the summary, indicating the atmosphere and tone of the meeting.

[0147] Step 5:

[0148] The server automatically extracts tasks and action points from the summary and registers them in the issue management system. Emotion information is taken into consideration, and tasks associated with emotionally important discussions are given higher priority.

[0149] Step 6:

[0150] The server translates summaries and tasks into the specified language using a multilingual translation engine. During this process, a dictionary lookup function is used to maintain terminology consistency and preserve sentiment.

[0151] Step 7:

[0152] The server securely stores the generated meeting minutes data and related information in cloud storage. All participants are given real-time access via their devices to review the meeting content and sentiment analysis results.
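
Steps 2 and 3 of this flow run on the same audio at the same time. A minimal sketch of that concurrency is shown below, assuming that both engines can be called as ordinary functions; `process_chunk`, the placeholder engines, and the label values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def process_chunk(
    audio: bytes,
    recognize: Callable[[bytes], str],
    estimate_emotion: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Run speech recognition and emotion estimation on the same audio in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, audio)
        emotion_future = pool.submit(estimate_emotion, audio)
        # result() blocks until each engine has finished.
        return text_future.result(), emotion_future.result()


# Placeholder engines; real ones would analyze the waveform.
text, emotion = process_chunk(
    b"we must fix this today",
    recognize=lambda a: a.decode("utf-8"),
    estimate_emotion=lambda a: "urgent" if b"must" in a else "neutral",
)
print(text, emotion)
```

Threads suit this sketch because real engines would typically spend their time waiting on network or accelerator calls.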

[0153] (Example 2)

[0154] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0155] While meeting minute generation systems based on audio data exist, they lack effective content summarization and task management that take emotional information into account. Furthermore, they struggle to maintain smooth communication while considering emotional nuances across multiple languages.

[0156] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0157] In this invention, the server includes means for collecting audio data, converting it into text data using a speech recognition device, and analyzing emotions from the audio; means for analyzing the text data and generating a summary with added emotional information; and means for automatically extracting tasks and actions from the summary, setting priorities in consideration of the emotional information, and registering them in task management. This enables effective summary generation that reflects emotional information and communication that preserves emotional nuances across multiple languages.

[0158] "Audio data" refers to information obtained by converting sound waveforms into digital signals, and is used to record the content of meetings and conversations.

[0159] A "speech recognition device" is a software or hardware technology that has the function of analyzing speech data and converting it into text data.

[0160] "Text data" refers to string information converted by speech recognition, and is used for content analysis and summarization.

[0161] "Emotional analysis" is a technology that analyzes a speaker's tone of voice and word choice to determine their emotional state.

[0162] A "summary" is a concise compilation of key information extracted from text data.

[0163] A "task" refers to specific work items or action plans in a meeting or discussion.

[0165] In meeting minutes and task management, "action" refers to specific actions or decisions that should be taken to achieve a particular objective.

[0166] "Task management" is a system for organizing, prioritizing, and tracking tasks and problems in order to achieve a specific goal.

[0167] "Priority" is an indicator used to determine the order and importance of tasks and actions.

[0168] "Registering tasks in task management" is the process of entering extracted tasks and actions into the system, enabling tracking and management.

[0169] "Translation" is the process of converting content written in one language into another language while preserving its meaning and context.

[0170] The "dictionary reference function" is a feature that allows for the referencing of a predefined glossary to ensure consistency in terminology during translation and to produce appropriate translations.

[0171] "Multilingualism" refers to dealing with multiple different languages, and is particularly used in translation and international communication.

[0172] "Real-time" refers to operations and information updates that are processed and reflected immediately without delay.

[0173] A "secure storage device" is a device or technology for protecting and storing digital data, preventing unauthorized access and data leaks.

[0174] "Emotional information" refers to data that represents a person's emotional state, extracted from audio or text.

[0175] To implement this invention, the user first launches a dedicated application on their terminal and collects audio data during the meeting. The terminal acquires the audio data with its microphone and streams it to the server in real time. The server converts the audio data into text data using a speech recognition device, specifically speech recognition software.

[0176] The server performs speech recognition on the audio data and simultaneously analyzes the user's emotions using an emotion analysis engine. Emotion analysis assigns emotion labels such as positive, negative, or neutral based on the tone of voice and the content of the speech.
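
The emotion analysis engine itself is not specified in the description. As a toy illustration of assigning the three labels named above, the sketch below counts hits against small word lists. The word lists and the function `label_emotion` are hypothetical, and the sketch looks only at the content of the speech; it does not analyze tone of voice.

```python
import re
from typing import Set

POSITIVE: Set[str] = {"great", "agree", "good", "thanks"}      # hypothetical word list
NEGATIVE: Set[str] = {"problem", "delay", "risk", "disagree"}  # hypothetical word list


def label_emotion(utterance: str) -> str:
    """Assign 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = re.findall(r"[a-z']+", utterance.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [label_emotion(u) for u in (
    "Great, I agree with the plan.",
    "The delay is a serious risk.",
    "The next meeting is on Monday.",
)]
print(labels)
```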

[0177] Once text data is generated, the server uses natural language processing tools to analyze the data and generate a summary that takes sentiment into account. This summary extracts key points of discussion using a specific algorithm and includes emotional nuances. The generated summary is then translated into multiple specified languages according to the user's instructions. The translation process utilizes dictionary lookup to maintain consistency in terminology.
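
The dictionary lookup mentioned above could, for example, protect glossary terms with placeholder tokens before translation and restore the fixed target-language terms afterwards. This is one possible technique, not necessarily the one used in the disclosure; `translate_with_glossary`, the token format, and the stand-in translator are assumptions.

```python
from typing import Callable, Dict


def translate_with_glossary(
    text: str,
    glossary: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Translate text while forcing glossary terms to their fixed target form."""
    placeholders: Dict[str, str] = {}
    # Longer terms first so that a long term wins over a shorter one it contains.
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        token = f"__TERM{i}__"
        if term in text:
            text = text.replace(term, token)
            placeholders[token] = glossary[term]
    translated = translate(text)
    # Restore each protected term using the glossary's target-language entry.
    for token, target in placeholders.items():
        translated = translated.replace(token, target)
    return translated


glossary = {"action point": "アクションポイント", "minutes": "議事録"}
# Stand-in translator that only upper-cases the text so the sketch runs offline.
result = translate_with_glossary("Add one action point to the minutes.", glossary, str.upper)
print(result)
```

A real translation engine may alter or drop placeholder tokens, so a production system would need to verify that every token survives translation.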

[0178] The server extracts tasks and actions from the generated summary and registers them in the issue management system. Tasks are prioritized based on the sentiment of the discussion.

[0179] All generated data is stored in real time on a secure storage device by the server, and users can access it at any time via their terminals. This makes it easy to review meeting minutes and related tasks even after the meeting has ended.

[0180] As a concrete example, using sentiment analysis in international conferences can facilitate smoother intercultural communication. This is expected to lead to faster decision-making and improved project outcomes.

[0181] An example of a prompt to input into a generative AI model is, "Analyze the participants' emotional information from the audio data of this meeting, create a summary, and provide it."

[0182] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0183] Step 1:

[0184] At the start of a meeting, the user launches a dedicated application on their device to begin collecting audio data. The input is real-time audio emitted by the meeting participants. The device uses its microphone to capture this audio and streams it to the server as audio data. The output is a real-time stream of audio data to the server.

[0185] Step 2:

[0186] When the server receives the streamed audio data, it converts it into text data using a speech recognition device. The input for this step is the audio data sent from the terminal. The server analyzes the audio waveform, recognizes phonemes, and converts them into text information. The output is the recognized text data.

[0187] Step 3:

[0188] Following speech recognition, the server uses an emotion analysis engine to analyze emotional information from the text data and speech tone. The input here is the text data and speech tone information generated in step 2. The server analyzes these and assigns emotion labels such as positive, negative, and neutral. The output is the text data with the emotion labels assigned.

[0189] Step 4:

[0190] The server analyzes text data with sentiment information attached, uses natural language processing tools to extract key points, and generates a summary. The input is text data with sentiment labels attached in step 3. The server analyzes the meaning of the text and summarizes the important information. The output is summarized data with sentiment labels.

[0191] Step 5:

[0192] The server extracts tasks and actions from the summary data and registers them in the issue management system. The input is the summary data generated in step 4. The server identifies key action points and sets task priorities based on sentiment information. The output is the tasks and their priorities registered in the issue management system.

[0193] Step 6:

[0194] The server translates the generated summary and task data into the multiple specified languages. This translation process uses dictionary lookup to ensure terminology consistency. The input is the summary generated in step 4 and the task data generated in step 5. The server uses a translation API to convert these into multiple languages. The output is multilingual data that includes sentiment information.

[0195] Step 7:

[0196] All processed data is saved in real time by the server to secure storage. The input is the multilingualized summary and task data from step 6. The server formats this appropriately and saves it to cloud storage. The output is data that is securely stored and accessible at any time.
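
Step 7 states that the server formats the data before saving it. One plausible format, shown here as a sketch only, is a single JSON document per meeting that holds a summary for each language along with the prioritized tasks. The field names and the function `build_minutes_record` are assumptions, and the actual upload to cloud storage is left out.

```python
import json
from typing import Dict, List


def build_minutes_record(
    meeting_id: str,
    summaries: Dict[str, str],
    tasks: List[Dict[str, object]],
) -> str:
    """Serialize multilingual summaries and tasks into one JSON document."""
    record = {
        "meeting_id": meeting_id,
        "summaries": summaries,  # language code -> summary text
        "tasks": tasks,          # each task carries its priority and emotion label
    }
    # ensure_ascii=False keeps Japanese and other non-ASCII text readable.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


payload = build_minutes_record(
    "m-001",
    {"en": "Release moved to Friday.", "ja": "リリースは金曜日に変更。"},
    [{"title": "Notify customers", "priority": 1, "emotion": "negative"}],
)
restored = json.loads(payload)
print(payload)
```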

[0197] (Application Example 2)

[0198] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0199] In meetings and communication settings, information processing is required that goes beyond simply transcribing speech into text, taking into account the speaker's emotions and the overall atmosphere. Furthermore, it is desirable that this information be multilingual and accessible in real time. This is necessary to reduce misunderstandings between cultures and streamline the decision-making process.

[0200] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0201] In this invention, the server includes means for acquiring audio data and converting it into text data using speech recognition, means for analyzing emotions from voice tone and intonation and adding emotional information to the summary, and means for adjusting task priorities based on the emotional information. This enables real-time information processing that takes emotions into account and smooth communication through multilingual support.

[0202] "Audio data" refers to acoustic signals obtained from human speech.

[0203] "Speech recognition" is a technology that analyzes speech data and converts it into corresponding text data.

[0204] "Text data" refers to string information obtained through speech recognition technology.

[0205] A "summary" is information that extracts the main points from text data and expresses them concisely.

[0206] A "task" is a specific work item extracted through summarization.

[0207] "Action points" are the next actions that should be taken, as clarified as a result of the discussion.

[0208] "Issue management" is the process of systematically organizing and monitoring tasks and action points.

[0209] "Translation" is the process of converting text written in one language into another specified language.

[0210] "Real-time" refers to a situation where processing or reactions occur almost simultaneously.

[0211] "Emotional analysis" is a technology that recognizes and analyzes the emotions of a speaker from audio data.

[0212] "Emotional information" refers to data about emotions obtained through emotion analysis.

[0213] "Priority adjustment" refers to changing the order in which tasks are processed according to their importance and urgency.

[0214] An "operation management system" is a system for managing processes and information related to the operation of vehicles and other related activities.

[0215] This invention is a system for efficiently processing information using audio data in meetings and other communication situations. The server first acquires audio data and converts it into text data using a speech recognition engine. For speech recognition, it utilizes platforms such as Google® Cloud Speech-to-Text. The server then analyzes the text data using Watson® Tone Analyzer and recognizes emotions from the speaker's voice tone and intonation.

[0216] After performing sentiment analysis, the server adds sentiment information to the summary, incorporating emotional nuances into the meeting content. The server then extracts tasks and action points from the summary, adjusting task priorities based on the sentiment information. This ensures that tasks related to emotionally important discussions receive higher priority.

[0217] The information obtained is translated into multiple specified languages using a multilingual translation function. The translation includes a dictionary lookup function to maintain consistency in terminology. Finally, the generated data is stored in cloud storage and made accessible globally in real time via devices.

[0218] As a concrete example, using this system at an international autonomous vehicle development conference would allow for the instant sharing of multilingual information that takes into account emotional nuances in cross-cultural communication between countries, enabling smoother decision-making.

[0219] An example of a prompt for a generative AI model is, "How should you optimize driver assistance when the driver is experiencing increased stress due to traffic congestion?"

[0220] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0221] Step 1:

[0222] The user launches an audio collection application on their device at the start of the meeting and streams audio data to the server in real time using the microphone. The input is audio data, and the output is transmission to the server.

[0223] Step 2:

[0224] The server converts the received audio data into text data using a speech recognition engine (e.g., Google Cloud Speech-to-Text). The input is audio data, and the output is text data.

[0225] Step 3:

[0226] The server sends text data to an emotion analysis engine (e.g., Watson Tone Analyzer) to extract emotional information from speech tone and intonation. The input is text data, and the output is emotional information.

[0227] Step 4:

[0228] The server analyzes text data containing sentiment information to generate a summary. The input consists of sentiment information and text data, and the output is a summary with sentiment labels.

[0229] Step 5:

[0230] The server automatically extracts tasks and action points from the summary and adjusts task priorities based on sentiment information. The input is sentiment information and the summary, and the output is a task list with adjusted priorities.

[0231] Step 6:

[0232] The server translates tasks and summaries into multiple specified languages. The inputs are summaries and tasks, and translation software is used to generate multilingual data as output.

[0233] Step 7:

[0234] The server stores all generated data in cloud storage, allowing participants worldwide to access it instantly in real time via their devices. Input is translated multilingual data, and output is securely accessible data.

[0235] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0236] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0237] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0238] [Second Embodiment]

[0239] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0240] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0241] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0242] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0243] The microphone 238 receives instructions and the like from the user 20 by picking up speech uttered by the user 20. The microphone 238 captures the speech uttered by the user 20, converts the captured speech into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0244] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0245] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0246] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0247] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0248] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0249] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0250] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0251] This invention is an information processing system that enables the rapid and accurate creation of meeting minutes and supports multiple languages. The embodiments for carrying out this invention are described below.

[0252] Users launch a dedicated application at the start of a meeting to begin collecting audio data. The audio data is transferred to the server in real time via the terminal. The server receives this audio data and converts it into text data using a speech recognition engine. This speech recognition utilizes a pre-trained model to achieve highly accurate text conversion.

[0253] The server analyzes the converted text data to extract the main discussion points and decisions of the meeting and create a summary. The generated summary is designed to present the key points in a concise manner, making it easy for users to understand the content.

[0254] Further, the server automatically extracts tasks and action points from the summarized information and registers them in the issue tracking function. This function makes it possible to set an assignee and a deadline for each task and to manage its progress. In particular, it allows management based on task priority and dependencies.
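
A registered task with an assignee, a deadline, a priority, and dependencies could be modeled as below. This is a sketch under stated assumptions: the `Issue` class, its field names, and the sample keys and assignees are hypothetical, and Python's standard `graphlib` module (Python 3.9 or later) is used to derive an order that respects dependencies.

```python
from dataclasses import dataclass, field
from datetime import date
from graphlib import TopologicalSorter
from typing import Dict, List


@dataclass
class Issue:
    key: str
    assignee: str
    deadline: date
    priority: int = 0
    depends_on: List[str] = field(default_factory=list)


def execution_order(issues: Dict[str, Issue]) -> List[str]:
    """Order issue keys so that every issue comes after the issues it depends on."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    graph = {key: set(issue.depends_on) for key, issue in issues.items()}
    return list(TopologicalSorter(graph).static_order())


issues = {
    "T-2": Issue("T-2", "Sato", date(2025, 1, 20), priority=1, depends_on=["T-1"]),
    "T-1": Issue("T-1", "Kim", date(2025, 1, 15), priority=2),
}
order = execution_order(issues)
print(order)
```

`TopologicalSorter` raises `CycleError` when dependencies form a loop, which gives a natural point for rejecting inconsistent task registrations.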

[0255] For multilingual translation, the server passes the generated summary and action list to a multilingual translation engine. For example, it can translate from English to Japanese or Chinese in real time, allowing participants to view the meeting minutes in their native language.
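
When several participants share a native language, the summary only needs to be translated once per language. The following sketch illustrates that fan-out, assuming a translation function with the signature `(text, language)`; the function names, participant names, and language codes are hypothetical.

```python
from typing import Callable, Dict, List


def minutes_per_participant(
    summary: str,
    participant_languages: Dict[str, str],
    translate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return each participant's minutes, translating once per distinct language."""
    by_language: Dict[str, str] = {}
    for language in set(participant_languages.values()):
        by_language[language] = translate(summary, language)
    return {name: by_language[lang] for name, lang in participant_languages.items()}


calls: List[str] = []


def fake_translate(text: str, language: str) -> str:
    # Placeholder for a real translation engine; records each call.
    calls.append(language)
    return f"[{language}] {text}"


views = minutes_per_participant(
    "Budget approved.",
    {"Aoki": "ja", "Chen": "zh", "Ito": "ja"},
    fake_translate,
)
print(views)
```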

[0256] Ultimately, the server stores all generated data in secure cloud storage, allowing meeting participants to access it in real time via their devices. This system enables users to review meeting minutes and action lists immediately after the meeting ends, allowing them to efficiently move forward with their tasks.

[0257] For example, when conducting online meetings with overseas project teams, using this system allows for efficient sharing of meeting content in multiple languages and immediate management of decisions and tasks, thus enabling the rapid progress of international projects.

[0258] The following describes the processing flow.

[0259] Step 1:

[0260] The user launches an audio recording application at the start of the meeting. This application collects audio data and prepares to send it to the server in real time.

[0261] Step 2:

[0262] The device immediately transmits the collected audio data to the server in streaming format. The data is compressed to optimize network bandwidth.

[0263] Step 3:

[0264] The server inputs the received audio data into the speech recognition engine and converts it into text data. During this process, the model's training data is used to improve recognition accuracy.

[0265] Step 4:

[0266] The server processes the converted text data using a natural language processing algorithm to extract key discussion points and decisions. This generates a shortened version that summarizes the entire meeting.

[0267] Step 5:

[0268] The server automatically extracts tasks and actions from the summary and registers them in the issue management system. Assignees and deadlines are automatically assigned based on pre-configured settings.

[0269] Step 6:

[0270] The server translates the summary and action list into the specified language using a multilingual translation engine. This process utilizes dictionary functionality to maintain terminology consistency.

[0271] Step 7:

[0272] The server saves the final meeting minutes data (text, summary, and action list) to cloud storage. This allows users and participants to access this data in real time through their devices.
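
Step 2 of this flow mentions compressing the audio stream to save bandwidth. The description does not name a compression scheme, so the sketch below assumes lossless zlib compression purely for illustration (a real system would more likely use an audio codec). The function names and chunk size are hypothetical.

```python
import zlib
from typing import Iterable, Iterator


def stream_compressed(audio: bytes, chunk_size: int = 4096) -> Iterator[bytes]:
    """Terminal side: yield compressed packets of one continuous zlib stream."""
    compressor = zlib.compressobj()
    for start in range(0, len(audio), chunk_size):
        packet = compressor.compress(audio[start:start + chunk_size])
        if packet:  # the compressor may buffer small inputs and emit nothing yet
            yield packet
    yield compressor.flush()


def receive(packets: Iterable[bytes]) -> bytes:
    """Server side: decompress packets in arrival order and reassemble the audio."""
    decompressor = zlib.decompressobj()
    audio = b"".join(decompressor.decompress(p) for p in packets)
    return audio + decompressor.flush()


raw = bytes(range(256)) * 64  # stand-in for captured audio samples
roundtrip_ok = receive(stream_compressed(raw, chunk_size=1024)) == raw
sent_bytes = sum(len(p) for p in stream_compressed(raw, chunk_size=1024))
print(roundtrip_ok, sent_bytes < len(raw))
```

Using a single compression stream across chunks lets later chunks reuse patterns seen in earlier ones, at the cost of requiring in-order delivery.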

[0273] (Example 1)

[0274] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0275] In meetings, it is necessary to improve the efficiency of information sharing and management by quickly and accurately converting audio data into text and summarizing its content. Furthermore, a multilingual system is required to provide consistent information even when participants speak different languages. It is also crucial to provide an environment where the generated information is accessible in real time, enabling rapid decision-making and task progress management.

[0276] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0277] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for automatically extracting tasks and action points from the summary and registering them in task management. This makes it possible to efficiently organize the contents of meetings and share information in a consistent manner in multiple languages. Furthermore, since the generated information can be used immediately, it is expected to speed up decision-making and improve the efficiency of task management.

[0278] "Audio information" refers to audio data acquired in specific situations or environments, specifically the content of speech in settings such as meetings.

[0279] "Speech recognition" refers to the technology that analyzes acquired speech information and converts it into corresponding text information.

[0280] "Textual information" refers to text data generated from speech information through speech recognition.

[0281] A "summary" refers to information that concisely summarizes the main points of discussion and decisions extracted from textual information.

[0282] "Task" refers to an action item, extracted from the summary, that relates to a specific piece of work or project.

[0283] "Action point" refers to an action item agreed upon in the meeting or a specific matter to be carried out as a next step.

[0284] "Task management" refers to the process of registering tasks and action points and managing their progress, priorities, and dependencies.

[0285] "Multilingual support" refers to the function of translating summaries and tasks into different languages so that users who speak various languages can easily understand the information.

[0286] "Immediate access" refers to the state where the generated information is saved and made available to users in real-time.

[0287] This invention is realized by a system that improves the efficiency of information processing in meetings. When a user starts a meeting, a dedicated application is launched on the terminal to start collecting voice information. The terminal captures this voice using a high-quality microphone and transfers it to the server in real-time.

[0288] The server converts the received voice information into text information using a voice recognition engine. This engine is enhanced by a generative AI model and can achieve highly accurate recognition for different speech contents and technical terms.

[0289] Next, the server analyzes the text information and generates a summary of the text by utilizing natural language processing algorithms. In this summarization process, the main discussion points and decisions of the meeting are concisely organized so that the user can easily grasp the content.

[0290] The server also extracts tasks and action points from the summarized information and automatically registers them in the task management system. The task management system sets a person in charge and a deadline for each task, and also manages the progress status of tasks and the dependencies between them.
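As a non-limiting illustration, the task management described above (person in charge, deadline, progress status, and dependencies) could be represented as in the following Python sketch. The names Status, Task, and ready_tasks, the sample assignees, and the dates are hypothetical and are not part of this disclosure.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List


class Status(Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    DONE = "done"


@dataclass
class Task:
    """One registered task: person in charge, deadline, progress, dependencies."""
    task_id: str
    title: str
    assignee: str
    deadline: date
    status: Status = Status.TODO
    depends_on: List[str] = field(default_factory=list)


def ready_tasks(tasks: List[Task]) -> List[Task]:
    """Return unfinished tasks whose dependencies are all done, earliest deadline first."""
    done = {t.task_id for t in tasks if t.status is Status.DONE}
    ready = [
        t for t in tasks
        if t.status is not Status.DONE and all(d in done for d in t.depends_on)
    ]
    return sorted(ready, key=lambda t: t.deadline)


# Hypothetical sample data for illustration only.
tasks = [
    Task("T1", "Draft budget", "Sato", date(2025, 1, 20), Status.DONE),
    Task("T2", "Review budget", "Kim", date(2025, 1, 27), depends_on=["T1"]),
    Task("T3", "Publish budget", "Lee", date(2025, 2, 3), depends_on=["T2"]),
]
print([t.task_id for t in ready_tasks(tasks)])  # ['T2']
```

In this sketch, T3 is withheld until T2 is completed, which is one simple way to reflect dependencies in progress management.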

[0291] To support multiple languages, the server uses a translation engine to translate summaries and tasks in real time into multiple specified languages, enabling information to be shared in various languages. This allows participants to review meeting minutes in their native language.

[0292] Ultimately, the server securely stores the generated information in cloud storage, allowing users to access it instantly via their devices. This system enables users to review meeting minutes and action points immediately after the meeting ends, allowing them to efficiently plan their next actions.

[0293] For example, when conducting online meetings with multiple overseas locations, this system can be used to quickly share information about the situation in each location and update decisions in a timely manner. An example of a prompt message would be, "Please check the preparation status for the next project meeting, translate any important decisions, and notify relevant parties."

[0294] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0295] Step 1:

[0296] The user launches a dedicated application on their device and begins collecting audio information. The device uses a microphone to capture the meeting audio in real time and transfers the audio data to a server. The input is the raw audio of the meeting, and the output is a digital audio file.

[0297] Step 2:

[0298] The server processes the audio data received from the terminal and inputs it into the speech recognition engine. The speech recognition engine utilizes a generative AI model to analyze the audio data and convert it into text information. The input is a digital audio file, and the output is text information (text data).

[0299] Step 3:

[0300] The server analyzes the text information and generates a summary using natural language processing algorithms. In this process, the main discussion points and decisions of the meeting are extracted and the content is concisely summarized. The input is text information, and the output is summarized text data.

[0301] Step 4:

[0302] The server automatically extracts tasks and action points from the generated summary and registers them in the issue management system. A responsible person and a deadline are set for each task, which makes it possible to manage task progress. The input is summarized text data, and the output is a task list and action points.

[0303] Step 5:

[0304] The server uses a translation engine to translate the summary and task list into multiple languages. As a result, participants who use different languages can check the information in their respective native languages. The input is the summary and task list, and the output is the equivalent information translated into each language.

[0305] Step 6:

[0306] The server stores the final data in secure cloud storage and enables users to access it in real time through the terminal, so that users can check the information immediately after the meeting. The input is the translated dataset, and the output is the accessible data on the cloud.
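As a non-limiting illustration, Steps 2 to 6 above could be chained as in the following Python sketch. Each stage is passed in as a function so that any speech recognition engine, summarizer, translation engine, or storage backend can be substituted. The names Minutes, run_pipeline, and the fake_* stand-ins are hypothetical; the stand-ins only simulate the engines and are not the engines of this disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Minutes:
    transcript: str
    summary: str
    action_points: List[str]
    translations: Dict[str, str] = field(default_factory=dict)


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    summarize: Callable[[str], str],
    extract_actions: Callable[[str], List[str]],
    translate: Callable[[str, str], str],
    languages: List[str],
    store: Callable[[Minutes], None],
) -> Minutes:
    transcript = transcribe(audio)            # Step 2: audio -> text
    summary = summarize(transcript)           # Step 3: text -> summary
    actions = extract_actions(summary)        # Step 4: summary -> action points
    minutes = Minutes(transcript, summary, actions)
    for lang in languages:                    # Step 5: summary -> each language
        minutes.translations[lang] = translate(summary, lang)
    store(minutes)                            # Step 6: persist the result
    return minutes


# Toy stand-ins (assumptions) so that the sketch runs without external services.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_summarize(text: str) -> str:
    return text.strip()


def fake_extract_actions(summary: str) -> List[str]:
    return [
        line.split("ACTION:", 1)[1].strip()
        for line in summary.split("\n")
        if "ACTION:" in line
    ]


def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


saved: List[Minutes] = []
result = run_pipeline(
    b"Decision: ship in March.\nACTION: Sato prepares the release notes.",
    fake_transcribe, fake_summarize, fake_extract_actions, fake_translate,
    ["ja", "en"], saved.append,
)
print(result.action_points)
```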

[0307] (Application Example 1)

[0308] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0309] In today's world, with the increase in international conferences and webinars, there is a growing need for efficient meeting minute creation and information sharing that transcends language barriers. However, traditional methods struggle with real-time multilingual support, making it difficult to ensure the accuracy and timeliness of information. Furthermore, the process of understanding meeting details in different languages is cumbersome, hindering efficient communication.

[0310] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0311] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for generating natural language meeting minutes in real time and making them immediately available to meeting participants. This enables the creation of multilingual meeting minutes in real time, reduces communication barriers in international meetings, and facilitates efficient information sharing.

[0312] "Audio information" refers to data input as audio, which forms the basis for converting it into text information.

[0313] "Textual information" refers to text data converted from audio information by speech recognition, and is used for further analysis and summary generation.

[0314] "Speech recognition" is a technology that converts spoken information into text information, automatically recording the content of conversations during meetings as text.

[0315] A "summary" is a brief record that extracts key points of discussion and decisions from analyzed textual information, allowing for efficient comprehension of the information.

[0316] A "work item" refers to the specific details of the actions decided at the meeting, and each item is assigned a person in charge and a deadline for its implementation.

[0317] "Action items" are specific actions that arise as a result of discussions during the meeting, and they indicate the measures that participants should take.

[0318] "Task management" is a method for systematically recording work items and action items, and for tracking and managing their progress.

[0319] "Natural language" refers to the language that humans use on a daily basis, and this system is capable of processing multiple languages.

[0320] "Real-time" means that data processing and information provision occur immediately and are available concurrently with the progress of the meeting.

[0321] "Generation" refers to the process of creating new information or records based on data, and specifically refers to the creation of summaries and translated meeting minutes.

[0322] "Meeting minutes" are documents that record the content and decisions made at a meeting, and are used for future reference and information sharing.

[0323] "Meeting participants" refers to individuals who participate in meetings or webinars and are the entities that utilize the generated information.

[0324] This system generates meeting minutes in real time in multiple languages, allowing participants to access the information immediately.

[0325] First, users collect audio information from meetings using devices such as smartphones or computers. These devices have a dedicated application installed and send the audio data to a server. The server then uses speech recognition software to convert the received audio information into text in real time for analysis. Because a pre-trained model is used for speech recognition, highly accurate conversion is possible. Specifically, the speech_recognition library is utilized to perform the conversion from speech to text.
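As a non-limiting illustration, a conversion from a recorded audio file to text with the speech_recognition library (distributed as the SpeechRecognition package) might look like the following Python sketch. The function name transcribe_wav is hypothetical. The disclosure does not state which recognition backend is used; the sketch assumes the library's recognize_google backend, which requires network access.

```python
def transcribe_wav(path: str, language: str = "en-US") -> str:
    """Transcribe a WAV, AIFF, or FLAC file; returns "" if nothing is recognized."""
    # Imported lazily so that the module loads even where the package is absent.
    import speech_recognition as sr  # third-party: pip install SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file into memory
    try:
        # Assumption: Google Web Speech backend; other recognize_* backends exist.
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""  # speech was unintelligible
```

For example, transcribe_wav("meeting.wav", language="ja-JP") would return the recognized Japanese text for a hypothetical file meeting.wav. For live meetings, the audio would instead be processed in short chunks as it arrives at the server.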

[0326] Next, based on the transformed text information, the server generates a summary using a generative AI model. This summary extracts the key points and presents them concisely. The transformers library's models are used to analyze the text data and extract useful information.

[0327] The server then passes the generated summary to a multilingual translation engine, which translates it into the natural language requested by the participants. This enables smooth communication between participants who speak different languages. The langchain library is used for translation, providing fast and accurate translations.

[0328] This information is securely stored in cloud storage, and participants can access and view it at any time. This makes it possible to easily check necessary information immediately after the meeting ends, and to smoothly proceed with the execution of work items and action items. For example, when an international research team holds an online meeting, using this system can overcome language barriers and allow them to grasp important discussions and decisions in real time.

[0329] An example of a prompt used when inputting data into a generative AI model is: "Summarize the meeting minutes and translate them into Japanese. Please shorten the content while extracting the key points." This prompt allows the system to process the information in the specified format, enabling rapid and efficient information sharing.
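As a non-limiting illustration, the prompt above could be assembled programmatically so that the target language can be changed for each participant. The function name build_minutes_prompt and the separator layout are hypothetical; only the instruction wording is taken from the example prompt.

```python
def build_minutes_prompt(transcript: str, target_language: str = "Japanese") -> str:
    """Combine the summarize-and-translate instruction with the meeting text."""
    instruction = (
        f"Summarize the meeting minutes and translate them into {target_language}. "
        "Please shorten the content while extracting the key points."
    )
    # The "---" separator between instruction and transcript is an assumption.
    return f"{instruction}\n\n---\n{transcript.strip()}"


prompt = build_minutes_prompt("We agreed to move the launch to March.\n")
print(prompt)
```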

[0330] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0331] Step 1:

[0332] The user launches a dedicated application and begins collecting audio information. At this time, the terminal captures the conference audio via its microphone and sends the acquired audio data to the server. The input is the live audio, and the output is the audio data sent to the server.

[0333] Step 2:

[0334] The server converts the received audio data into text information using the speech_recognition library. This library is a speech recognition engine that converts speech to text and uses a trained model to achieve high accuracy. The input is audio data, and the output is text data.

[0335] Step 3:

[0336] The server generates a summary from the acquired text information using the transformers library. This library utilizes a generative AI model to efficiently extract key discussion points and decisions from the text information. The input is text data, and the output is a summarized text.

[0337] Step 4:

[0338] The server translates the generated summary into the specified natural language using the langchain library. This enables multilingual support and facilitates smooth information exchange among participants using different languages. The input is the summary text, and the output is the translated text.

[0339] Step 5:

[0340] The server saves the summary and translation results to cloud storage. This data is accessible in real time, allowing participants to continue referencing the information even after the meeting ends. The input is the translated text, and the output is the data stored in cloud storage.

[0341] Step 6:

[0342] Users can access cloud storage through their devices and view the necessary information. This process allows them to instantly review the agenda and tasks discussed in a meeting and quickly move on to the next action. The input is data in the cloud, and the output is information displayed on the user's screen.

[0343] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0344] This invention is an information processing system that combines a system for automatically generating meeting minutes from audio data with an emotion engine that analyzes the user's emotions. The following describes specific embodiments for carrying out this invention.

[0345] First, the user launches a dedicated application at the start of the meeting to begin collecting audio data. The collected audio data is streamed in real time from the terminal to the server. The server receives this audio data and converts it into text data using a speech recognition engine. This process integrates an emotion engine that recognizes emotions from the user's voice tone and intonation.

[0346] The server analyzes the recognized text data, extracts the main points of discussion from the meeting, and generates a summary. This summary includes the sentiment information recognized by the emotion engine, and sentiment labels are assigned to express the emotional nuances behind each discussion.

[0347] Next, the server extracts tasks and action points from the summarized information and registers them in the issue management system. This registration process also takes into account the user's emotional state, allowing for adjustment of task priorities. For example, if a discussion is perceived as emotionally important, tasks related to that discussion will be given a higher priority.

[0348] Furthermore, the server performs multilingual translation into multiple specified languages, incorporating sentiment data into the generated summaries and tasks. This translation process utilizes dictionary lookup functionality to maintain terminology consistency. This enables multilingual communication that includes sentiment information.

[0349] Ultimately, the server stores all data in real time in secure cloud storage, making it instantly accessible to global meeting participants via their devices. This system allows users to review meeting minutes and tasks after the meeting, as well as gain emotional insights from the process.

[0350] As a concrete example, in situations where intercultural communication is crucial, such as at international conferences, this system can be used to instantly share meeting minutes in multiple languages, taking emotional nuances into account, thereby facilitating smoother decision-making and task execution.

[0351] The following describes the processing flow.

[0352] Step 1:

[0353] The user launches a dedicated application at the start of the meeting to begin collecting audio data. This collection is managed by the device, and the audio data is sent to the server in real time.

[0354] Step 2:

[0355] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This conversion process uses a high-precision language model to faithfully record the content of the conversation as text.

[0356] Step 3:

[0357] Simultaneously, the server uses an emotion engine to extract user emotion information from the audio data in real time. It analyzes the tone, speed, and pitch of the voice and assigns an emotion label to each utterance.

[0358] Step 4:

[0359] The server analyzes the generated text data and extracts key discussion points and decisions from the meeting. When creating a summary based on this analysis, the extracted sentiment information is also reflected in the summary, indicating the atmosphere and tone of the meeting.

[0360] Step 5:

[0361] The server automatically extracts tasks and action points from the summary and registers them in the issue management system. Sentiment information is taken into consideration, and tasks related to emotionally important discussions are given higher priority.

[0362] Step 6:

[0363] The server translates summaries and tasks into the specified language using a multilingual translation engine. During this process, a dictionary lookup function is used to maintain terminology consistency and preserve sentiment.

[0364] Step 7:

[0365] The server securely stores the generated meeting minutes data and related information in cloud storage. All participants are given real-time access via their devices to review the meeting content and sentiment analysis results.

[0366] (Example 2)

[0367] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0368] While meeting minute generation systems based on audio data exist, they lack effective content summarization and task management that take emotional information into account. Furthermore, they struggle to maintain smooth communication while considering emotional nuances across multiple languages.

[0369] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0370] In this invention, the server includes means for collecting voice data, converting it into text data using a voice recognition device and analyzing emotions from the voice; means for analyzing the text data and generating a summary with added emotional information; and means for automatically extracting tasks and actions from the summary, taking emotional information into consideration, setting priorities, and registering them in task management. This enables effective summary generation that reflects emotional information and communication that preserves emotional nuances across multiple languages.

[0371] "Audio data" refers to information obtained by converting sound waveforms into digital signals, and is used to record the content of meetings and conversations.

[0372] A "speech recognition device" is a software or hardware technology that has the function of analyzing speech data and converting it into text data.

[0373] "Text data" refers to string information converted by speech recognition, and is used for content analysis and summarization.

[0374] "Emotional analysis" is a technology that analyzes a speaker's tone of voice and word choice to determine their emotional state.

[0375] A "summary" is a concise compilation of key information extracted from text data.

[0376] A "task" refers to specific work items or action plans in a meeting or discussion.

[0378] In meeting minutes and task management, "action" refers to specific actions or decisions that should be taken to achieve a particular objective.

[0379] "Task management" is a system for organizing, prioritizing, and tracking tasks and problems in order to achieve a specific goal.

[0380] "Priority" is an indicator used to determine the order and importance of tasks and actions.

[0381] "Registering tasks in task management" is the process of entering extracted tasks and actions into the system, enabling tracking and management.

[0382] "Translation" is the process of converting content written in one language into another language while preserving its meaning and context.

[0383] The "dictionary reference function" is a feature that allows for the referencing of a predefined glossary to ensure consistency in terminology during translation and to produce appropriate translations.

[0384] "Multilingualism" refers to dealing with multiple different languages, and is particularly used in translation and international communication.

[0385] "Real-time" refers to operations and information updates that are processed and reflected immediately without delay.

[0386] A "secure storage device" is a device or technology for protecting and storing digital data, preventing unauthorized access and data leaks.

[0387] "Emotional information" refers to data that represents a person's emotional state, extracted from audio or text.

[0388] To implement this invention, the user first launches a dedicated application on their terminal and collects audio data during the meeting. The terminal uses its microphone to acquire the audio data and streams it to a server in real time. On the server, a speech recognition device, specifically speech recognition software, converts the audio data into text data.

[0389] The server performs speech recognition on the audio data and simultaneously analyzes the user's emotions using an emotion analysis engine. Emotion analysis assigns emotion labels such as positive, negative, or neutral based on the tone of voice and the content of the speech.
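As a non-limiting illustration, assigning a positive, negative, or neutral label from both the content of the speech and the tone of voice could be sketched in Python as follows. A small keyword lexicon and a precomputed tone score stand in for the emotion analysis engine; the word lists, the tone_score input (assumed to range from -1 to 1 and to come from a separate acoustic analysis), and the threshold are all assumptions.

```python
POSITIVE_WORDS = {"great", "agree", "good", "thanks"}      # assumed lexicon
NEGATIVE_WORDS = {"problem", "delay", "risk", "disagree"}  # assumed lexicon


def label_emotion(text: str, tone_score: float = 0.0, threshold: float = 0.2) -> str:
    """Average a text score and a tone score, then map the result to a label."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    text_score = max(-1.0, min(1.0, hits / 2))  # clamp to [-1, 1]
    score = (text_score + tone_score) / 2
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(label_emotion("Great, I agree with the plan.", tone_score=0.3))  # positive
print(label_emotion("The delay is a serious risk.", tone_score=-0.4))  # negative
print(label_emotion("The meeting starts at ten."))                     # neutral
```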

[0390] Once text data is generated, the server uses natural language processing tools to analyze the data and generate a summary that takes sentiment into account. This summary extracts key points of discussion using a specific algorithm and includes emotional nuances. The generated summary is then translated into multiple specified languages according to the user's instructions. The translation process utilizes dictionary lookup to maintain consistency in terminology.
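As a non-limiting illustration, one way to keep terminology consistent with a dictionary lookup is to replace glossary terms with placeholders before translation and to restore the approved target-language terms afterwards. The following Python sketch shows this approach. The function name translate_with_glossary, the placeholder format, the glossary entries, and the stand-in translation function are hypothetical, and the disclosure does not state that the dictionary function works this way.

```python
from typing import Callable, Dict


def translate_with_glossary(
    text: str,
    glossary: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Protect glossary terms with placeholders, translate, then restore them."""
    placeholders: Dict[str, str] = {}
    protected = text
    # Longest terms first, so a long term is not split by a shorter one inside it.
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        if term in protected:
            token = f"__TERM{i}__"
            protected = protected.replace(term, token)
            placeholders[token] = glossary[term]
    translated = translate(protected)
    for token, target_term in placeholders.items():
        translated = translated.replace(token, target_term)
    return translated


# Illustrative glossary and a stand-in "translation engine" that only wraps the text.
glossary = {
    "action point": "アクションポイント",
    "issue management system": "課題管理システム",
}
result = translate_with_glossary(
    "Register each action point in the issue management system.",
    glossary,
    lambda s: f"<ja>{s}</ja>",
)
print(result)
```

This sketch assumes that the translation engine passes the placeholder tokens through unchanged; an engine that alters them would need a different protection scheme.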

[0391] The server extracts tasks and actions from the generated summary and registers them in the issue management system. Tasks are prioritized based on the sentiment of the discussion.
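As a non-limiting illustration, prioritizing tasks based on the sentiment of the discussion could be sketched in Python as follows. The names ExtractedTask and prioritize, the 1 to 3 base priority scale, and the weights given to each emotion label are assumptions; the disclosure does not specify how sentiment is mapped to priority.

```python
from dataclasses import dataclass
from typing import List

# Assumed weights: negatively charged discussions are treated as most urgent.
EMOTION_WEIGHT = {"negative": 2, "positive": 1, "neutral": 0}


@dataclass
class ExtractedTask:
    title: str
    base_priority: int  # 1 (low) to 3 (high), before sentiment is considered
    emotion: str        # label of the discussion the task came from


def prioritize(tasks: List[ExtractedTask]) -> List[ExtractedTask]:
    """Sort tasks by base priority plus emotion weight, highest first."""
    return sorted(
        tasks,
        key=lambda t: t.base_priority + EMOTION_WEIGHT.get(t.emotion, 0),
        reverse=True,
    )


tasks = [
    ExtractedTask("Update slides", 2, "neutral"),
    ExtractedTask("Fix supplier delay", 2, "negative"),
    ExtractedTask("Thank the vendor", 1, "positive"),
]
print([t.title for t in prioritize(tasks)])
```

Because the sort is stable, tasks with equal scores keep the order in which they were extracted.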

[0392] All generated data is stored in real time on a secure storage device by the server, and users can access it at any time via their terminals. This makes it easy to review meeting minutes and related tasks even after the meeting has ended.

[0393] As a concrete example, using sentiment analysis in international conferences can facilitate smoother intercultural communication. This is expected to lead to faster decision-making and improved project outcomes.

[0394] An example of a prompt to input into a generative AI model is, "Analyze the participants' emotional information from the audio data of this meeting, create a summary, and provide it."

[0395] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0396] Step 1:

[0397] At the start of a meeting, the user launches a dedicated application on their device to begin collecting audio data. The input is real-time audio emitted by the meeting participants. The device uses its microphone to capture this audio and streams it to the server as audio data. The output is a real-time stream of audio data to the server.

[0398] Step 2:

[0399] When the server receives the streamed audio data, it converts it into text data using a speech recognition device. The input for this step is the audio data sent from the terminal. The server analyzes the audio waveform, recognizes it as phonemes, and converts the phonemes into text information. The output is the recognized text data.

[0400] Step 3:

[0401] Following speech recognition, the server uses an emotion analysis engine to analyze emotional information from the text data and speech tone. The input here is the text data and speech tone information generated in step 2. The server analyzes these and assigns emotion labels such as positive, negative, and neutral. The output is the text data with the emotion labels assigned.

[0402] Step 4:

[0403] The server analyzes text data with sentiment information attached, uses natural language processing tools to extract key points, and generates a summary. The input is text data with sentiment labels attached in step 3. The server analyzes the meaning of the text and summarizes the important information. The output is summarized data with sentiment labels.

[0404] Step 5:

[0405] The server extracts tasks and actions from the summary data and registers them in the issue management system. The input is the summary data generated in step 4. The server identifies key action points and sets task priorities based on sentiment information. The output is the tasks and their priorities registered in the issue management system.

[0406] Step 6:

[0407] The server translates the generated summary and task data into multiple specified languages. This translation process utilizes dictionary lookup to ensure terminology consistency. The input is the summary and task data generated in step 5. The server uses a translation API to convert this into multiple languages. The output is multilingual data, including sentiment information.

[0408] Step 7:

[0409] All processed data is saved in real time by the server to secure storage. The input is the multilingualized summary and task data from step 6. The server formats this appropriately and saves it to cloud storage. The output is data that is securely stored and accessible at any time.
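As a non-limiting illustration, the formatting and saving in Step 7 could be sketched in Python as follows. A local directory stands in for the secure cloud storage, and the record layout (per-language summaries with emotion labels, plus a task list), the function names, and the sample values are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_minutes(root: Path, meeting_id: str, record: Dict[str, Any]) -> Path:
    """Write one meeting record as UTF-8 JSON, keeping non-ASCII text readable."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{meeting_id}.json"
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def load_minutes(root: Path, meeting_id: str) -> Dict[str, Any]:
    """Read a stored meeting record back."""
    return json.loads((root / f"{meeting_id}.json").read_text(encoding="utf-8"))


# Hypothetical record for illustration only.
record = {
    "meeting_id": "mtg-001",
    "summaries": {
        "en": {"text": "Launch moved to March.", "emotion": "negative"},
        "ja": {"text": "発売は3月に延期。", "emotion": "negative"},
    },
    "tasks": [{"title": "Notify partners", "priority": 3}],
}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_minutes(Path(tmp), record["meeting_id"], record)
    loaded = load_minutes(Path(tmp), record["meeting_id"])

print(saved_path.name)
print(loaded == record)
```

In an actual deployment, the two functions would be replaced by calls to the chosen cloud storage service, with access control and encryption applied there.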

[0410] (Application Example 2)

[0411] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0412] In meetings and communication settings, information processing is required that goes beyond simply transcribing speech into text, taking into account the speaker's emotions and the overall atmosphere. Furthermore, it is desirable that this information be multilingual and accessible in real time. This is necessary to reduce misunderstandings between cultures and streamline the decision-making process.

[0413] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0414] In this invention, the server includes means for acquiring voice data and converting it into text data using speech recognition, means for analyzing emotions from voice tone and intonation and adding emotional information to the summary, and means for adjusting task priorities based on the emotional information. This enables real-time information processing that takes emotions into account and smooth communication through multilingual support.

[0415] "Audio data" refers to acoustic signals obtained from human speech.

[0416] "Speech recognition" is a technology that analyzes speech data and converts it into corresponding text data.

[0417] "Text data" refers to string information obtained through speech recognition technology.

[0418] A "summary" is information that extracts the main points from text data and expresses them concisely.

[0419] A "task" is a specific work item extracted through summarization.

[0420] "Action points" are the next actions that should be taken, as clarified as a result of the discussion.

[0421] "Issue management" is the process of systematically organizing and monitoring tasks and action points.

[0422] "Translation" is the process of converting text written in one language into another specified language.

[0423] "Real-time" refers to a situation where processing or reactions occur almost simultaneously.

[0424] "Emotional analysis" is a technology that recognizes and analyzes the emotions of a speaker from audio data.

[0425] "Emotional information" refers to data about emotions obtained through emotion analysis.

[0426] "Priority adjustment" refers to changing the order in which tasks are processed according to their importance and urgency.

[0427] An "operation management system" is a system for managing processes and information related to the operation of vehicles and other related activities.

[0428] This invention is a system for efficiently processing information using audio data in meetings and other communication situations. The server first acquires audio data and converts it into text data using a speech recognition engine. For speech recognition, it utilizes platforms such as Google Cloud Speech-to-Text. The server then analyzes the text data using tools such as Watson Tone Analyzer to recognize emotions from the speaker's voice tone and intonation.

[0429] After performing sentiment analysis, the server adds sentiment information to the summary, incorporating emotional nuances into the meeting content. The server then extracts tasks and action points from the summary, adjusting task priorities based on the sentiment information. This ensures that tasks related to emotionally important discussions receive higher priority.

[0430] The information obtained is translated into multiple specified languages using a multilingual translation function. The translation includes a dictionary lookup function to maintain consistency in terminology. Finally, the generated data is stored in cloud storage and made accessible globally in real time via devices.

[0431] As a concrete example, using this system at an international autonomous vehicle development conference would allow for the instant sharing of multilingual information that takes into account emotional nuances in cross-cultural communication between countries, enabling smoother decision-making.

[0432] An example of a prompt for a generative AI model is, "How should you optimize driver assistance when the driver is experiencing increased stress due to traffic congestion?"

[0433] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0434] Step 1:

[0435] The user launches an audio collection application on their device at the start of the meeting and streams audio data to the server in real time using the microphone. The input is audio data, and the output is transmission to the server.

[0436] Step 2:

[0437] The server converts the received audio data into text data using a speech recognition engine (e.g., Google Cloud Speech-to-Text). The input is audio data, and the output is text data.

[0438] Step 3:

[0439] The server sends text data to an emotion analysis engine (e.g., Watson Tone Analyzer) to extract emotional information from speech tone and intonation. The input is text data, and the output is emotional information.

[0440] Step 4:

[0441] The server analyzes text data containing sentiment information to generate a summary. The input consists of sentiment information and text data, and the output is a summary with sentiment labels.

[0442] Step 5:

[0443] The server automatically extracts tasks and action points from the summary and adjusts task priorities based on sentiment information. The input is sentiment information and the summary, and the output is a task list with adjusted priorities.

[0444] Step 6:

[0445] The server translates tasks and summaries into multiple specified languages. The inputs are summaries and tasks, and translation software is used to generate multilingual data as output.

[0446] Step 7:

[0447] The server stores all generated data in cloud storage, allowing participants worldwide to access it in real time via their devices. The input is the translated multilingual data, and the output is securely accessible data.

[0448] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0449] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0450] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0451] [Third Embodiment]

[0452] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0453] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0454] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0455] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0456] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0457] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0458] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0459] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0460] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0461] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0462] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0463] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0464] This invention is an information processing system that enables the rapid and accurate creation of meeting minutes and supports multiple languages. The embodiments for carrying out this invention are described below.

[0465] Users launch a dedicated application at the start of a meeting to begin collecting audio data. The audio data is transferred to the server in real time via the terminal. The server receives this audio data and converts it into text data using a speech recognition engine. This speech recognition utilizes a pre-trained model to achieve highly accurate text conversion.

[0466] The server analyzes the converted text data to extract the main discussion points and decisions of the meeting and create a summary. The generated summary is designed to present the key points in a concise manner, making it easy for users to understand the content.

[0467] The server then automatically extracts tasks and action points from the summary and registers them in the issue tracking function. This function makes it possible to set an assignee and a deadline for each task and to manage its progress. In particular, it allows tasks to be managed according to their priority and dependencies.

[0468] For multilingual translation, the server passes the generated summary and action list to a multilingual translation engine. For example, it can translate from English to Japanese or Chinese in real time, allowing participants to view the meeting minutes in their native language.

[0469] Ultimately, the server stores all generated data in secure cloud storage, allowing meeting participants to access it in real time via their devices. This system enables users to review meeting minutes and action lists immediately after the meeting ends, allowing them to efficiently move forward with their tasks.

[0470] For example, when conducting online meetings with overseas project teams, using this system allows for efficient sharing of meeting content in multiple languages and immediate management of decisions and tasks, thus enabling the rapid progress of international projects.

[0471] The following describes the processing flow.

[0472] Step 1:

[0473] The user launches an audio recording application at the start of the meeting. This application collects audio data and prepares to send it to the server in real time.

[0474] Step 2:

[0475] The device immediately transmits the collected audio data to the server in streaming format. The data is compressed to optimize network bandwidth.

[0476] Step 3:

[0477] The server inputs the received audio data into the speech recognition engine and converts it into text data. During this process, the model's training data is used to improve recognition accuracy.

[0478] Step 4:

[0479] The server processes the converted text data using a natural language processing algorithm to extract key discussion points and decisions. This generates a shortened version that summarizes the entire meeting.

[0480] Step 5:

[0481] The server automatically extracts tasks and actions from the summary and registers them in the issue management system. Assignees and deadlines are automatically assigned based on pre-configured settings.

[0482] Step 6:

[0483] The server translates the summary and action list into the specified language using a multilingual translation engine. This process utilizes dictionary functionality to maintain terminology consistency.

[0484] Step 7:

[0485] The server saves the final meeting minutes data (text, summary, and action list) to cloud storage. This allows users and participants to access this data in real time through their devices.
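
The server-side part of the flow above (Steps 3 to 6) can be sketched as a chain of replaceable stages. This is an illustrative outline under stated assumptions: the names Minutes and build_minutes are hypothetical, and the speech recognition, summarization, task extraction, and translation engines are passed in as plain functions because the embodiment does not fix particular engines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Minutes:
    transcript: str
    summary: str
    actions: List[str]
    translations: Dict[str, str] = field(default_factory=dict)


def build_minutes(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    summarize: Callable[[str], str],
    extract_actions: Callable[[str], List[str]],
    translate: Callable[[str, str], str],
    languages: List[str],
) -> Minutes:
    """Run the server-side stages in the order given in the processing flow."""
    transcript = transcribe(audio)        # Step 3: audio -> text
    summary = summarize(transcript)       # Step 4: text -> summary
    actions = extract_actions(summary)    # Step 5: summary -> action list
    minutes = Minutes(transcript, summary, actions)
    for lang in languages:                # Step 6: summary -> each language
        minutes.translations[lang] = translate(summary, lang)
    return minutes


# Stand-in engines so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_summarize(text: str) -> str:
    return text.split(". ")[0] + "."


def fake_extract_actions(summary: str) -> List[str]:
    return [summary]


def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


audio = b"Ship the beta on Friday. Alice will write the release notes."
result = build_minutes(audio, fake_transcribe, fake_summarize,
                       fake_extract_actions, fake_translate, ["ja", "zh"])
print(result.summary)
print(result.translations["ja"])
```

Passing the engines in as arguments keeps each stage independently replaceable, so a different recognition or translation engine can be substituted without touching the flow itself.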

[0486] (Example 1)

[0487] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0488] In meetings, it is necessary to improve the efficiency of information sharing and management by quickly and accurately converting audio data into text and summarizing its content. Furthermore, a multilingual system is required to provide consistent information even when participants speak different languages. It is also crucial to provide an environment where the generated information is accessible in real time, enabling rapid decision-making and task progress management.

[0489] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0490] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for automatically extracting tasks and action points from the summary and registering them in task management. This makes it possible to efficiently organize the contents of meetings and share information in a consistent manner in multiple languages. Furthermore, since the generated information can be used immediately, it is expected to speed up decision-making and improve the efficiency of task management.

[0491] "Audio information" refers to audio data acquired in specific situations or environments, specifically the content of speech in settings such as meetings.

[0492] "Speech recognition" refers to the technology that analyzes acquired speech information and converts it into corresponding text information.

[0493] "Textual information" refers to text data generated from speech information through speech recognition.

[0494] A "summary" refers to information that concisely summarizes the main points of discussion and decisions extracted from textual information.

[0495] "Tasks" refers to specific tasks and project-related action items extracted from the summary.

[0496] "Action points" refer to action items agreed upon during the meeting, or specific actions to be taken as the next steps.

[0497] "Task management" refers to the process of registering tasks and action points, and managing their progress, priorities, and dependencies.

[0498] "Multilingual support" refers to a function that translates summaries and task information into different languages, making it easier for users who speak various languages to understand the information.

[0499] "Instant access" refers to a state where generated information is saved and made available to users in real time.

[0500] This invention is realized through a system that streamlines information processing in meetings. When a user starts a meeting, they launch a dedicated application on their terminal and begin collecting audio information. The terminal captures this audio using a high-quality microphone and transmits it to a server in real time.

[0501] The server converts received audio information into text using a speech recognition engine. This engine is enhanced by a generative AI model, enabling highly accurate recognition of different utterances and specialized terminology.

[0502] Next, the server analyzes the text information and uses natural language processing algorithms to generate a text summary. This summarization process concisely organizes the main discussion points and decisions of the meeting, making it easy for users to understand the content.

[0503] Furthermore, the server extracts tasks and action points from the summarized information and automatically registers them in the task management system. The task management system has the functionality to set assignees and deadlines for each task, and also has the ability to manage progress and task dependencies.

[0504] To support multiple languages, the server uses a translation engine to translate summaries and tasks in real time into multiple specified languages, enabling information to be shared in various languages. This allows participants to review meeting minutes in their native language.

[0505] Ultimately, the server securely stores the generated information in cloud storage, allowing users to access it instantly via their devices. This system enables users to review meeting minutes and action points immediately after the meeting ends, allowing them to efficiently plan their next actions.

[0506] For example, when conducting online meetings with multiple overseas locations, this system can be used to quickly share information about the situation in each location and update decisions in a timely manner. An example of a prompt message would be, "Please check the preparation status for the next project meeting, translate any important decisions, and notify relevant parties."

[0507] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0508] Step 1:

[0509] The user launches a dedicated application on their device and begins collecting audio information. The device uses a microphone to capture the meeting audio in real time and transfers the audio data to a server. The input is the raw audio of the meeting, and the output is a digital audio file.

[0510] Step 2:

[0511] The server processes the audio data received from the terminal and inputs it into the speech recognition engine. The speech recognition engine utilizes a generative AI model to analyze the audio data and convert it into text information. The input is a digital audio file, and the output is text information (text data).

[0512] Step 3:

[0513] The server analyzes the textual information and generates a summary using natural language processing algorithms. This process extracts the main discussion points and decisions of the meeting and summarizes the content concisely. The input is textual information, and the output is summarized text data.

[0514] Step 4:

[0515] The server automatically extracts tasks and action points from the generated summary and registers them in the task management system. This allows for the assignment of assignees and deadlines, and enables task progress management. The input is summarized text data, and the output is a task list and action points.

[0516] Step 5:

[0517] The server uses a translation engine to translate summaries and task data into multiple languages. This allows participants using different languages to access the information in their native language. The input is summaries and task lists, and the output is equivalent information translated into each language.

[0518] Step 6:

[0519] The server stores the final data in secure cloud storage, making it accessible to users in real time via their devices. Users can use this to review information immediately after the meeting ends. The input is a translated dataset, and the output is accessible data in the cloud.
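
The task registration in Step 4, with assignees, deadlines, and dependencies, can be sketched as follows. The record layout (TrackedTask) and the helper execution_order are hypothetical names chosen for illustration. Dependencies are resolved with the standard-library graphlib module (Python 3.9 or later), which is one possible approach and not something the example specifies.

```python
from dataclasses import dataclass, field
from datetime import date
from graphlib import TopologicalSorter
from typing import List


@dataclass
class TrackedTask:
    name: str
    assignee: str
    deadline: date
    depends_on: List[str] = field(default_factory=list)


def execution_order(tasks: List[TrackedTask]) -> List[str]:
    """Return task names so that every task comes after its dependencies."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {task.name: set(task.depends_on) for task in tasks}
    return list(TopologicalSorter(graph).static_order())


tasks = [
    TrackedTask("publish", "Chen", date(2025, 1, 31), depends_on=["review"]),
    TrackedTask("review", "Sato", date(2025, 1, 24), depends_on=["draft"]),
    TrackedTask("draft", "Alice", date(2025, 1, 17)),
]
print(execution_order(tasks))
```

A circular dependency makes static_order() raise graphlib.CycleError, which gives the task management system a natural point at which to reject inconsistent registrations.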

[0520] (Application Example 1)

[0521] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0522] In today's world, with the increase in international conferences and webinars, there is a growing need for efficient meeting minute creation and information sharing that transcends language barriers. However, traditional methods struggle with real-time multilingual support, making it difficult to ensure the accuracy and timeliness of information. Furthermore, the process of understanding meeting details in different languages is cumbersome, hindering efficient communication.

[0523] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0524] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for generating natural language meeting minutes in real time and making them immediately available to meeting participants. This enables the creation of multilingual meeting minutes in real time, reduces communication barriers in international meetings, and facilitates efficient information sharing.

[0525] "Audio information" refers to data input as audio, which forms the basis for converting it into text information.

[0526] "Textual information" refers to text data converted from audio information by speech recognition, and is used for further analysis and summary generation.

[0527] "Speech recognition" is a technology that converts spoken information into text information, automatically recording the content of conversations during meetings as text.

[0528] A "summary" is a brief record that extracts key points of discussion and decisions from analyzed textual information, allowing for efficient comprehension of the information.

[0529] A "work item" refers to the specific details of the actions decided at the meeting, and each item is assigned a person in charge and a deadline for its implementation.

[0530] "Action items" are specific actions that arise as a result of discussions during the meeting, and they indicate the measures that participants should take.

[0531] "Task management" is a method for systematically recording work items and action items, and for tracking and managing their progress.

[0532] "Natural language" refers to the language that humans use on a daily basis, and this system is capable of processing multiple languages.

[0533] "Real-time" means that data processing and information provision occur immediately and are available concurrently with the progress of the meeting.

[0534] "Generation" refers to the process of creating new information or records based on data, and specifically refers to the creation of summaries and translated meeting minutes.

[0535] "Meeting minutes" are documents that record the content and decisions made at a meeting, and are used for future reference and information sharing.

[0536] "Meeting participants" refers to individuals who participate in meetings or webinars and are the entities that utilize the generated information.

[0537] This system generates meeting minutes in real time in multiple languages, allowing participants to access the information immediately.

[0538] First, users collect audio information from meetings using devices such as smartphones or computers. These devices have a dedicated application installed and send the audio data to a server. The server then uses speech recognition software to convert the received audio information into text in real time for analysis. Because a pre-trained model is used for speech recognition, highly accurate conversion is possible. Specifically, the speech_recognition library is utilized to perform the conversion from speech to text.

[0539] Next, based on the converted text information, the server generates a summary using a generative AI model. This summary extracts the key points and presents them concisely. Models from the transformers library are used to analyze the text data and extract useful information.

[0540] The server then passes the generated summary to a multilingual translation engine, which translates it into the natural language requested by the participants. This enables smooth communication between participants who speak different languages. The langchain library is used for translation, providing fast and accurate translations.

[0541] This information is securely stored in cloud storage, and participants can access and view it at any time. This makes it possible to easily check necessary information immediately after the meeting ends, and to smoothly proceed with the execution of work items and action items. For example, when an international research team holds an online meeting, using this system can overcome language barriers and allow them to grasp important discussions and decisions in real time.

[0542] An example of a prompt used when inputting data into a generative AI model is: "Summarize the meeting minutes and translate them into Japanese. Please shorten the content while extracting the key points." This prompt allows the system to process the information in the specified format, enabling rapid and efficient information sharing.
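
A prompt of this kind can be assembled from the transcript and the target language before it is passed to the generative AI model. The following is a minimal sketch: the function name build_summary_prompt and the "Transcript:" delimiter are assumptions made for illustration, and only the wording of the instruction is taken from the example above.

```python
def build_summary_prompt(transcript: str, target_language: str) -> str:
    """Compose the instruction and the meeting text into a single prompt."""
    instruction = (
        f"Summarize the meeting minutes and translate them into {target_language}. "
        "Please shorten the content while extracting the key points."
    )
    # Keep the instruction and the data clearly separated.
    return f"{instruction}\n\nTranscript:\n{transcript}"


prompt = build_summary_prompt("Alice: We ship the beta on Friday.", "Japanese")
print(prompt)
```

Keeping the target language as a parameter lets the same template serve each participant's requested language.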

[0543] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0544] Step 1:

[0545] The user launches a dedicated application and begins collecting audio information. At this time, the terminal captures the conference audio via its microphone and sends the acquired audio data to the server. The input is the live audio, and the output is the audio data sent to the server.

[0546] Step 2:

[0547] The server converts the received audio data into text information using the speech_recognition library. This library is a speech recognition engine that converts speech to text and uses a trained model to achieve high accuracy. The input is audio data, and the output is text data.

[0548] Step 3:

[0549] The server generates a summary from the acquired text information using the transformers library. This library utilizes a generative AI model to efficiently extract key discussion points and decisions from the text information. The input is text data, and the output is a summarized text.

[0550] Step 4:

[0551] The server translates the generated summary into the specified natural language using the langchain library. This enables multilingual support and facilitates smooth information exchange among participants using different languages. The input is the summary text, and the output is the translated text.

[0552] Step 5:

[0553] The server saves the summary and translation results to cloud storage. This data is accessible in real time, allowing participants to continue referencing the information even after the meeting ends. The input is the translated text, and the output is the data stored in cloud storage.

[0554] Step 6:

[0555] Users can access cloud storage through their devices and view the necessary information. This process allows them to instantly review the agenda and tasks discussed in a meeting and quickly move on to the next action. The input is data in the cloud, and the output is information displayed on the user's screen.
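
Steps 5 and 6 can be sketched as a store keyed by meeting and language. This is an in-memory stand-in for the cloud storage, not an actual storage API. The class name MinutesStore and the rule of falling back to a default language when a translation is missing are assumptions added for illustration.

```python
class MinutesStore:
    """In-memory stand-in for the cloud storage that holds translated minutes."""

    def __init__(self, default_language: str = "en"):
        self._data = {}  # meeting_id -> {language: text}
        self.default_language = default_language

    def save(self, meeting_id: str, language: str, text: str) -> None:
        """Step 5: store one language version of the minutes."""
        self._data.setdefault(meeting_id, {})[language] = text

    def load(self, meeting_id: str, language: str) -> str:
        """Step 6: return the requested version, or the default-language one."""
        versions = self._data[meeting_id]
        return versions.get(language, versions[self.default_language])


store = MinutesStore()
store.save("weekly-sync", "en", "Ship the beta on Friday.")
store.save("weekly-sync", "ja", "[ja] Ship the beta on Friday.")
print(store.load("weekly-sync", "ja"))
print(store.load("weekly-sync", "zh"))  # no Chinese version yet, falls back
```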

[0556] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0557] This invention is an information processing system that combines a system for automatically generating meeting minutes from audio data with an emotion engine that analyzes the user's emotions. The following describes specific embodiments for carrying out this invention.

[0558] First, the user launches a dedicated application at the start of the meeting to begin collecting audio data. The collected audio data is streamed in real time from the terminal to the server. The server receives this audio data and converts it into text data using a speech recognition engine. This process integrates an emotion engine that recognizes emotions from the user's voice tone and intonation.

[0559] The server analyzes the recognized text data, extracts the main points of discussion from the meeting, and generates a summary. This summary includes the emotion information recognized by the emotion engine, and emotion labels are assigned to express the emotional nuances behind each discussion.

[0560] Next, the server extracts tasks and action points from the summarized information and registers them in the issue management system. This registration process also takes into account the user's emotional state, allowing for adjustment of task priorities. For example, if a discussion is perceived as emotionally important, tasks related to that discussion will be given a higher priority.

[0561] Furthermore, the server performs multilingual translation into multiple specified languages, incorporating sentiment data into the generated summaries and tasks. This translation process utilizes dictionary lookup functionality to maintain terminology consistency. This enables multilingual communication that includes sentiment information.

[0562] Ultimately, the server stores all data in real time in secure cloud storage, making it instantly accessible to global meeting participants via their devices. This system allows users to review meeting minutes and tasks after the meeting, as well as gain emotional insights from the process.

[0563] As a concrete example, in situations where intercultural communication is crucial, such as at international conferences, this system can be used to instantly share meeting minutes in multiple languages, taking emotional nuances into account, thereby facilitating smoother decision-making and task execution.

[0564] The following describes the processing flow.

[0565] Step 1:

[0566] The user launches a dedicated application at the start of the meeting to begin collecting audio data. This collection is managed by the device, and the audio data is sent to the server in real time.

[0567] Step 2:

[0568] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This conversion process uses a high-precision language model to faithfully record the content of the conversation as text.

[0569] Step 3:

[0570] Simultaneously, the server uses an emotion engine to extract user emotion information from the audio data in real time. It analyzes the tone, speed, and pitch of the voice and assigns an emotion label to each utterance.

[0571] Step 4:

[0572] The server analyzes the generated text data and extracts key discussion points and decisions from the meeting. When creating a summary based on this analysis, the extracted sentiment information is also reflected in the summary, indicating the atmosphere and tone of the meeting.

[0573] Step 5:

[0574] The server automatically extracts tasks and action points from the summary and registers them in the issue management system. Emotion information is taken into consideration, and tasks associated with discussions that carry strong emotions are given higher priority.

[0575] Step 6:

[0576] The server translates summaries and tasks into the specified language using a multilingual translation engine. During this process, a dictionary lookup function is used to maintain terminology consistency and preserve sentiment.

[0577] Step 7:

[0578] The server securely stores the generated meeting minutes data and related information in cloud storage. All participants are given real-time access via their devices to review the meeting content and sentiment analysis results.
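
The per-utterance labeling in Step 3 can be sketched with a simple rule-based stand-in for the emotion engine. Everything specific here is an assumption: the features (mean pitch and speaking rate), the baselines, the thresholds, and the label names are hypothetical, and a practical emotion engine would more likely use a trained model such as the emotion identification model 59.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    pitch_hz: float       # mean pitch of the utterance
    words_per_min: float  # speaking rate


def label_emotion(utterance: Utterance,
                  baseline_pitch: float = 150.0,
                  baseline_rate: float = 140.0) -> str:
    """Assign a coarse emotion label from pitch and speed relative to a baseline."""
    pitch_ratio = utterance.pitch_hz / baseline_pitch
    rate_ratio = utterance.words_per_min / baseline_rate
    if pitch_ratio > 1.2 and rate_ratio > 1.2:
        return "agitated"   # noticeably higher and faster than usual
    if pitch_ratio < 0.85 and rate_ratio < 0.85:
        return "subdued"    # noticeably lower and slower than usual
    return "neutral"


utterances = [
    Utterance("We must fix this now.", 200.0, 190.0),
    Utterance("Next agenda item.", 150.0, 140.0),
]
for u in utterances:
    print(label_emotion(u), "-", u.text)
```

The labels produced this way can be attached to the transcript so that the summary in Step 4 and the prioritization in Step 5 can refer to them.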

[0579] (Example 2)

[0580] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0581] While meeting minute generation systems based on audio data exist, they lack effective content summarization and task management that take emotional information into account. Furthermore, they struggle to maintain smooth communication while considering emotional nuances across multiple languages.

[0582] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0583] In this invention, the server includes means for collecting voice data, converting it into text data using a voice recognition device and analyzing emotions from the voice; means for analyzing the text data and generating a summary with added emotional information; and means for automatically extracting tasks and actions from the summary, taking emotional information into consideration, setting priorities, and registering them in task management. This enables effective summary generation that reflects emotional information and communication that preserves emotional nuances across multiple languages.

[0584] "Audio data" refers to information obtained by converting sound waveforms into digital signals, and is used to record the content of meetings and conversations.

[0585] A "speech recognition device" is a software or hardware technology that has the function of analyzing speech data and converting it into text data.

[0586] "Text data" refers to string information converted by speech recognition, and is used for content analysis and summarization.

[0587] "Emotional analysis" is a technology that analyzes a speaker's tone of voice and word choice to determine their emotional state.

[0588] A "summary" is a concise compilation of key information extracted from text data.

[0589] A "task" refers to specific work items or action plans in a meeting or discussion.

[0591] In meeting minutes and task management, "action" refers to specific actions or decisions that should be taken to achieve a particular objective.

[0592] "Task management" is a system for organizing, prioritizing, and tracking tasks and problems in order to achieve a specific goal.

[0593] "Priority" is an indicator used to determine the order and importance of tasks and actions.

[0594] "Registering tasks in task management" is the process of entering extracted tasks and actions into the system, enabling tracking and management.

[0595] "Translation" is the process of converting content written in one language into another language while preserving its meaning and context.

[0596] The "dictionary reference function" is a feature that allows for the referencing of a predefined glossary to ensure consistency in terminology during translation and to produce appropriate translations.
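
One way to realize such a dictionary reference function is to protect glossary terms with placeholders before translation and to restore the approved target-language terms afterwards. This is a sketch of that one technique, not the method defined in this example. The function name translate_with_glossary, the placeholder format, and the stand-in translator are all hypothetical, and a real translation engine may alter placeholders, so the approach would need to be verified against the engine actually used.

```python
import re
from typing import Callable, Dict


def translate_with_glossary(text: str,
                            glossary: Dict[str, str],
                            translate: Callable[[str], str]) -> str:
    """Translate text while forcing glossary terms to their approved translations."""
    placeholders = {}
    for index, (source_term, target_term) in enumerate(glossary.items()):
        token = f"__TERM{index}__"
        pattern = re.compile(re.escape(source_term), re.IGNORECASE)
        if pattern.search(text):
            # Hide the term from the translator behind a neutral token.
            text = pattern.sub(token, text)
            placeholders[token] = target_term
    translated = translate(text)
    for token, target_term in placeholders.items():
        translated = translated.replace(token, target_term)
    return translated


# Stand-in translator (English to German) that only knows two phrases.
def fake_translate(text: str) -> str:
    return text.replace("Register the", "Registriere den").replace("today", "heute")


glossary = {"action item": "Aktionspunkt"}
print(translate_with_glossary("Register the action item today", glossary, fake_translate))
```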

[0597] "Multilingualism" refers to dealing with multiple different languages, and is particularly used in translation and international communication.

[0598] "Real-time" refers to operations and information updates that are processed and reflected immediately without delay.

[0599] A "secure storage device" is a device or technology for protecting and storing digital data, preventing unauthorized access and data leaks.

[0600] "Emotional information" refers to data that represents a person's emotional state, extracted from audio or text.

[0601] To implement this invention, the user first launches a dedicated application on their terminal and collects audio data during the meeting. The terminal uses its microphone to acquire the audio data and streams it to a server in real time. The audio data is converted into text data by a speech recognition device; specifically, speech recognition software performs the conversion.

[0602] The server performs speech recognition on the audio data and simultaneously analyzes the user's emotions using an emotion analysis engine. Emotion analysis assigns emotion labels such as positive, negative, or neutral based on the tone of voice and the content of the speech.
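The labeling described here can be illustrated with a minimal sketch. The specification does not state how the emotion analysis engine works internally, so everything below is an assumption made for illustration: the word lists, the `tone_score` input (an acoustic score assumed to lie in [-1, 1]), the equal weighting, and the threshold are hypothetical stand-ins for a trained engine.

```python
from dataclasses import dataclass

# Hypothetical word lists; a real emotion analysis engine would use a trained model.
POSITIVE_WORDS = {"agree", "great", "good", "thanks"}
NEGATIVE_WORDS = {"problem", "delay", "risk", "concern"}


@dataclass
class LabeledUtterance:
    text: str
    label: str  # "positive", "negative", or "neutral"
    score: float


def label_utterance(text: str, tone_score: float = 0.0,
                    threshold: float = 0.2) -> LabeledUtterance:
    """Combine a lexical score with an acoustic tone score (assumed in [-1, 1])."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    lexical = (pos - neg) / max(pos + neg, 1)
    score = 0.5 * lexical + 0.5 * tone_score  # equal weighting is an assumption
    if score > threshold:
        label = "positive"
    elif score < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return LabeledUtterance(text, label, score)
```

Combining both signals in one score mirrors the statement that labels depend on the tone of voice and the content of the speech; any other fusion rule would fit the description equally well.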

[0603] Once text data is generated, the server uses natural language processing tools to analyze the data and generate a summary that takes sentiment into account. This summary extracts key points of discussion using a specific algorithm and includes emotional nuances. The generated summary is then translated into multiple specified languages according to the user's instructions. The translation process utilizes dictionary lookup to maintain consistency in terminology.
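One possible way to keep terminology consistent through dictionary lookup is to protect glossary terms with placeholders before translation and restore the predefined target terms afterwards. The sketch below is an assumption, not the method of the specification: `translate` is a hypothetical callable standing in for any translation engine, and the sketch assumes that engine leaves the `[[n]]` placeholders untouched.

```python
import re
from typing import Callable, Dict


def translate_with_glossary(text: str, glossary: Dict[str, str],
                            translate: Callable[[str], str]) -> str:
    """Translate `text` while forcing glossary terms to fixed target-language terms."""
    placeholders: Dict[str, str] = {}
    protected = text
    # Replace longer terms first so that a short term never splits a longer one.
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        token = f"[[{i}]]"  # letter-free token, so later terms cannot match inside it
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        if pattern.search(protected):
            protected = pattern.sub(token, protected)
            placeholders[token] = glossary[term]
    translated = translate(protected)
    # Put the predefined target-language term back in place of each placeholder.
    for token, target_term in placeholders.items():
        translated = translated.replace(token, target_term)
    return translated
```

Because the glossary term never reaches the translation engine, every occurrence is rendered with the same target term regardless of context.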

[0604] The server extracts tasks and actions from the generated summary and registers them in the issue management system. Tasks are prioritized based on the sentiment of the discussion.
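Sentiment-based prioritization could be sketched as follows. The specification does not say which sentiment maps to which priority, so the mapping below (tasks raised in negative discussions are handled first) and the `Task` structure are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumed mapping: a lower number means a higher priority.
PRIORITY_BY_LABEL = {"negative": 1, "neutral": 2, "positive": 3}


@dataclass
class Task:
    title: str
    emotion_label: str
    priority: int = 2


def prioritize(tasks: List[Task]) -> List[Task]:
    """Set each task's priority from its emotion label and return tasks in that order."""
    for task in tasks:
        task.priority = PRIORITY_BY_LABEL.get(task.emotion_label, 2)
    # sorted() is stable, so tasks with equal priority keep their original order.
    return sorted(tasks, key=lambda t: t.priority)
```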

[0605] All generated data is stored in real time on a secure storage device by the server, and users can access it at any time via their terminals. This makes it easy to review meeting minutes and related tasks even after the meeting has ended.

[0606] As a concrete example, using sentiment analysis in international conferences can facilitate smoother intercultural communication. This is expected to lead to faster decision-making and improved project outcomes.

[0607] An example of a prompt to input into a generative AI model is, "Analyze the participants' emotional information from the audio data of this meeting, create a summary, and provide it."

[0608] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0609] Step 1:

[0610] At the start of a meeting, the user launches a dedicated application on their device to begin collecting audio data. The input is real-time audio emitted by the meeting participants. The device uses its microphone to capture this audio and streams it to the server as audio data. The output is a real-time stream of audio data to the server.

[0611] Step 2:

[0612] When the server receives the streamed audio data, it converts it into text data using a speech recognition device. The input for this step is the audio data sent from the terminal. The server analyzes the audio waveform, recognizes the phonemes it contains, and converts them into text information. The output is the recognized text data.

[0613] Step 3:

[0614] Following speech recognition, the server uses an emotion analysis engine to analyze emotional information from the text data and speech tone. The input here is the text data and speech tone information generated in step 2. The server analyzes these and assigns emotion labels such as positive, negative, and neutral. The output is the text data with the emotion labels assigned.

[0615] Step 4:

[0616] The server analyzes text data with sentiment information attached, uses natural language processing tools to extract key points, and generates a summary. The input is text data with sentiment labels attached in step 3. The server analyzes the meaning of the text and summarizes the important information. The output is summarized data with sentiment labels.

[0617] Step 5:

[0618] The server extracts tasks and actions from the summary data and registers them in the issue management system. The input is the summary data generated in step 4. The server identifies key action points and sets task priorities based on sentiment information. The output is the tasks and their priorities registered in the issue management system.

[0619] Step 6:

[0620] The server translates the generated summary and task data into multiple specified languages. This translation process utilizes dictionary lookup to ensure terminology consistency. The input is the summary and task data generated in step 5. The server uses a translation API to convert this into multiple languages. The output is multilingual data, including sentiment information.

[0621] Step 7:

[0622] All processed data is saved in real time by the server to secure storage. The input is the multilingualized summary and task data from step 6. The server formats this appropriately and saves it to cloud storage. The output is data that is securely stored and accessible at any time.

[0623] (Application Example 2)

[0624] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0625] In meetings and communication settings, information processing is required that goes beyond simply transcribing speech into text, taking into account the speaker's emotions and the overall atmosphere. Furthermore, it is desirable that this information be multilingual and accessible in real time. This is necessary to reduce misunderstandings between cultures and streamline the decision-making process.

[0626] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0627] In this invention, the server includes means for acquiring voice data and converting it into text data using speech recognition, means for analyzing emotions from voice tone and intonation and adding emotional information to the summary, and means for adjusting task priorities based on the emotional information. This enables real-time information processing that takes emotions into account and smooth communication through multilingual support.

[0628] "Audio data" refers to acoustic signals obtained from human speech.

[0629] "Speech recognition" is a technology that analyzes speech data and converts it into corresponding text data.

[0630] "Text data" refers to string information obtained through speech recognition technology.

[0631] A "summary" is information that extracts the main points from text data and expresses them concisely.

[0632] A "task" is a specific work item extracted through summarization.

[0633] "Action points" are the next actions that should be taken, as clarified as a result of the discussion.

[0634] "Issue management" is the process of systematically organizing and monitoring tasks and action points.

[0635] "Translation" is the process of converting text written in one language into another specified language.

[0636] "Real-time" refers to a situation where processing or reactions occur almost simultaneously.

[0637] "Emotional analysis" is a technology that recognizes and analyzes the emotions of a speaker from audio data.

[0638] "Emotional information" refers to data about emotions obtained through emotion analysis.

[0639] "Priority adjustment" refers to changing the order in which tasks are processed according to their importance and urgency.

[0640] An "operation management system" is a system for managing processes and information related to the operation of vehicles and other related activities.

[0641] This invention is a system for efficiently processing information using audio data in meetings and other communication situations. The server first acquires audio data and converts it into text data using a speech recognition engine. For speech recognition, it utilizes platforms such as Google Cloud Speech-to-Text. The server then analyzes the text data using tools such as Watson Tone Analyzer to recognize emotions from the speaker's voice tone and intonation.

[0642] After performing sentiment analysis, the server adds sentiment information to the summary, incorporating emotional nuances into the meeting content. The server then extracts tasks and action points from the summary, adjusting task priorities based on the sentiment information. This ensures that tasks related to emotionally important discussions receive higher priority.

[0643] The information obtained is translated into multiple specified languages using a multilingual translation function. The translation includes a dictionary lookup function to maintain consistency in terminology. Finally, the generated data is stored in cloud storage and made accessible globally in real time via devices.

[0644] As a concrete example, using this system at an international autonomous vehicle development conference would allow for the instant sharing of multilingual information that takes into account emotional nuances in cross-cultural communication between countries, enabling smoother decision-making.

[0645] An example of a prompt for a generative AI model is, "How should you optimize driver assistance when the driver is experiencing increased stress due to traffic congestion?"

[0646] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0647] Step 1:

[0648] The user launches an audio collection application on their device at the start of the meeting and streams audio data to the server in real time using the microphone. The input is audio data, and the output is transmission to the server.

[0649] Step 2:

[0650] The server converts the received audio data into text data using a speech recognition engine (e.g., Google Cloud Speech-to-Text). The input is audio data, and the output is text data.

[0651] Step 3:

[0652] The server sends text data to an emotion analysis engine (e.g., Watson Tone Analyzer) to extract emotional information from speech tone and intonation. The input is text data, and the output is emotional information.

[0653] Step 4:

[0654] The server analyzes text data containing sentiment information to generate a summary. The input consists of sentiment information and text data, and the output is a summary with sentiment labels.

[0655] Step 5:

[0656] The server automatically extracts tasks and action points from the summary and adjusts task priorities based on sentiment information. The input is sentiment information and the summary, and the output is a task list with adjusted priorities.

[0657] Step 6:

[0658] The server translates tasks and summaries into multiple specified languages. The inputs are summaries and tasks, and translation software is used to generate multilingual data as output.

[0659] Step 7:

[0660] The server stores all generated data in cloud storage, allowing participants worldwide to access it in real time via their devices. The input is the translated multilingual data, and the output is securely accessible data.
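The server-side flow of steps 2 through 7 can be sketched as a chain of stages. This is only an illustration under stated assumptions: `MeetingRecord` and `run_pipeline` are hypothetical names, each stage is passed in as a callable so that any engine can be substituted, and the priority adjustment of step 5 is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MeetingRecord:
    transcript: str = ""
    emotion: str = "neutral"
    summary: str = ""
    tasks: List[str] = field(default_factory=list)
    translations: Dict[str, str] = field(default_factory=dict)


def run_pipeline(audio: bytes,
                 recognize: Callable[[bytes], str],
                 analyze_emotion: Callable[[str], str],
                 summarize: Callable[[str, str], str],
                 extract_tasks: Callable[[str], List[str]],
                 translate: Callable[[str, str], str],
                 languages: List[str],
                 store: Callable[[MeetingRecord], None]) -> MeetingRecord:
    """Run the server-side stages in order, passing each output to the next stage."""
    record = MeetingRecord()
    record.transcript = recognize(audio)                            # step 2
    record.emotion = analyze_emotion(record.transcript)             # step 3
    record.summary = summarize(record.transcript, record.emotion)   # step 4
    record.tasks = extract_tasks(record.summary)                    # step 5
    for lang in languages:                                          # step 6
        record.translations[lang] = translate(record.summary, lang)
    store(record)                                                   # step 7
    return record
```

Passing the stages as parameters keeps the sketch independent of any particular speech recognition, emotion analysis, or translation service.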

[0661] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0662] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompts and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0663] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0664] [Fourth Embodiment]

[0665] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0666] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0667] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0668] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0669] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0670] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0671] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0672] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0673] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0674] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0675] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0676] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0677] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0678] This invention is an information processing system that enables the rapid and accurate creation of meeting minutes and supports multiple languages. The embodiments for carrying out this invention are described below.

[0679] Users launch a dedicated application at the start of a meeting to begin collecting audio data. The audio data is transferred to the server in real time via the terminal. The server receives this audio data and converts it into text data using a speech recognition engine. This speech recognition utilizes a pre-trained model to achieve highly accurate text conversion.

[0680] The server analyzes the converted text data to extract the main discussion points and decisions of the meeting and create a summary. The generated summary is designed to present the key points in a concise manner, making it easy for users to understand the content.

[0681] Furthermore, the server automatically extracts tasks and action points from the summarized information and registers them in the issue tracking function. This function provides the ability to set assignees and deadlines for each task and manage their progress. In particular, it allows for management based on task priority and dependencies.
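Management based on priority and dependencies can be illustrated with a topological ordering. The sketch below is an assumption about how such a function might be built: `TrackedTask` and `execution_order` are hypothetical names, a lower `priority` number is taken to mean more urgent, and every id in `depends_on` is assumed to refer to a registered task. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from dataclasses import dataclass, field
from datetime import date
from graphlib import TopologicalSorter
from typing import Dict, List, Set


@dataclass
class TrackedTask:
    task_id: str
    assignee: str
    deadline: date
    priority: int  # lower number = more urgent (assumed convention)
    depends_on: Set[str] = field(default_factory=set)


def execution_order(tasks: List[TrackedTask]) -> List[str]:
    """Order tasks so prerequisites come first, then by priority and deadline."""
    by_id: Dict[str, TrackedTask] = {t.task_id: t for t in tasks}
    sorter = TopologicalSorter({t.task_id: t.depends_on for t in tasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order: List[str] = []
    while sorter.is_active():
        # Among tasks whose prerequisites are done, take urgent ones first.
        ready = sorted(sorter.get_ready(),
                       key=lambda i: (by_id[i].priority, by_id[i].deadline))
        for task_id in ready:
            order.append(task_id)
            sorter.done(task_id)
    return order
```

A task never appears before the tasks it depends on, even when its own priority is higher.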

[0682] For multilingual translation, the server passes the generated summary and action list to a multilingual translation engine. For example, it can translate from English to Japanese or Chinese in real time, allowing participants to view the meeting minutes in their native language.

[0683] Ultimately, the server stores all generated data in secure cloud storage, allowing meeting participants to access it in real time via their devices. This system enables users to review meeting minutes and action lists immediately after the meeting ends, allowing them to efficiently move forward with their tasks.
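Storing and retrieving the generated data might look like the following sketch. A local directory stands in for the cloud storage, and the function names and JSON layout are hypothetical. The access control and encryption that a secure storage service would provide are outside the scope of this sketch.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_minutes(storage_dir: Path, meeting_id: str, transcript: str,
                 summary: str, action_list: List[str]) -> Path:
    """Write the minutes for one meeting as a UTF-8 JSON document."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    path = storage_dir / f"{meeting_id}.json"
    payload = {"meeting_id": meeting_id, "transcript": transcript,
               "summary": summary, "action_list": action_list}
    # ensure_ascii=False keeps Japanese or Chinese text readable in the file.
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return path


def load_minutes(storage_dir: Path, meeting_id: str) -> Dict[str, object]:
    """Read back the minutes saved by save_minutes."""
    path = storage_dir / f"{meeting_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```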

[0684] For example, when conducting online meetings with overseas project teams, using this system allows for efficient sharing of meeting content in multiple languages and immediate management of decisions and tasks, thus enabling the rapid progress of international projects.

[0685] The following describes the processing flow.

[0686] Step 1:

[0687] The user launches an audio recording application at the start of the meeting. This application collects audio data and prepares to send it to the server in real time.

[0688] Step 2:

[0689] The device immediately transmits the collected audio data to the server in streaming format. The data is compressed to optimize network bandwidth.
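Chunk-by-chunk compression of a stream can be sketched as below. The specification does not name a codec; a production system would more likely use a dedicated audio codec, so the lossless `zlib` module from the Python standard library is used here purely as a stand-in, and the function names are hypothetical.

```python
import zlib
from typing import Iterable, Iterator


def compress_stream(chunks: Iterable[bytes], level: int = 6) -> Iterator[bytes]:
    """Compress audio chunks one by one so each can be sent as soon as it is captured."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        # Z_SYNC_FLUSH emits all pending output, so nothing waits for later chunks.
        data = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if data:
            yield data
    tail = compressor.flush()  # default mode Z_FINISH terminates the stream
    if tail:
        yield tail


def decompress_stream(packets: Iterable[bytes]) -> bytes:
    """Server-side counterpart: rebuild the original bytes from received packets."""
    decompressor = zlib.decompressobj()
    body = b"".join(decompressor.decompress(p) for p in packets)
    return body + decompressor.flush()
```

Flushing after every chunk costs a few bytes per packet but avoids the latency of waiting for the compressor's internal buffer to fill, which matters for real-time transfer.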

[0690] Step 3:

[0691] The server inputs the received audio data into the speech recognition engine and converts it into text data. During this process, the model's training data is used to improve recognition accuracy.

[0692] Step 4:

[0693] The server processes the converted text data using a natural language processing algorithm to extract key discussion points and decisions. This generates a shortened version that summarizes the entire meeting.

[0694] Step 5:

[0695] The server automatically extracts tasks and actions from the summary and registers them in the issue management system. Assignees and deadlines are automatically assigned based on pre-configured settings.
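Automatic assignment from pre-configured settings could be as simple as a keyword lookup. The rule table, the default rule, and the `Issue` structure below are hypothetical; the specification does not describe the form that the settings take.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional, Tuple

# Hypothetical pre-configured settings: keyword -> (assignee, days until deadline).
ASSIGNMENT_RULES: Dict[str, Tuple[str, int]] = {
    "budget": ("finance-team", 7),
    "release": ("engineering-team", 14),
}
DEFAULT_RULE: Tuple[str, int] = ("meeting-organizer", 5)


@dataclass
class Issue:
    title: str
    assignee: str
    deadline: date


def register_issue(title: str, meeting_date: date,
                   rules: Optional[Dict[str, Tuple[str, int]]] = None) -> Issue:
    """Pick the assignee and deadline from the first rule whose keyword matches."""
    rules = ASSIGNMENT_RULES if rules is None else rules
    lowered = title.lower()
    for keyword, (assignee, days) in rules.items():
        if keyword in lowered:
            return Issue(title, assignee, meeting_date + timedelta(days=days))
    assignee, days = DEFAULT_RULE  # no keyword matched
    return Issue(title, assignee, meeting_date + timedelta(days=days))
```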

[0696] Step 6:

[0697] The server translates the summary and action list into the specified language using a multilingual translation engine. This process utilizes dictionary functionality to maintain terminology consistency.

[0698] Step 7:

[0699] The server saves the final meeting minutes data—text, summary, and action list—to cloud storage. This allows users and participants to access this data in real time through their devices.

[0700] (Example 1)

[0701] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0702] In meetings, it is necessary to improve the efficiency of information sharing and management by quickly and accurately converting audio data into text and summarizing its content. Furthermore, a multilingual system is required to provide consistent information even when participants speak different languages. It is also crucial to provide an environment where the generated information is accessible in real time, enabling rapid decision-making and task progress management.

[0703] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0704] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for automatically extracting tasks and action points from the summary and registering them in task management. This makes it possible to efficiently organize the contents of meetings and share information in a consistent manner in multiple languages. Furthermore, since the generated information can be used immediately, it is expected to speed up decision-making and improve the efficiency of task management.

[0705] "Audio information" refers to audio data acquired in specific situations or environments, specifically the content of speech in settings such as meetings.

[0706] "Speech recognition" refers to the technology that analyzes acquired speech information and converts it into corresponding text information.

[0707] "Textual information" refers to text data generated from speech information through speech recognition.

[0708] A "summary" refers to information that concisely summarizes the main points of discussion and decisions extracted from textual information.

[0709] "Tasks" refers to specific tasks and project-related action items extracted from the summary.

[0710] "Action points" refer to action items agreed upon during the meeting, or specific actions to be taken as the next steps.

[0711] "Task management" refers to the process of registering tasks and action points, and managing their progress, priorities, and dependencies.

[0712] "Multilingual support" refers to a function that translates summaries and business information into different languages, making it easier for users who speak various languages to understand the information.

[0713] "Instant access" refers to a state where generated information is saved and made available to users in real time.

[0714] This invention is realized through a system that streamlines information processing in meetings. When a user starts a meeting, they launch a dedicated application on their terminal and begin collecting audio information. The terminal captures this audio using a high-quality microphone and transmits it to a server in real time.

[0715] The server converts received audio information into text using a speech recognition engine. This engine is enhanced by a generative AI model, enabling highly accurate recognition of different utterances and specialized terminology.

[0716] Next, the server analyzes the text information and uses natural language processing algorithms to generate a text summary. This summarization process concisely organizes the main discussion points and decisions of the meeting, making it easy for users to understand the content.

[0717] Furthermore, the server extracts tasks and action points from the summarized information and automatically registers them in the task management system. The task management system has the functionality to set assignees and deadlines for each task, and also has the ability to manage progress and task dependencies.
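Extraction of tasks and action points from a summary is sketched below with a simple rule-based approach. This is a swapped-in technique chosen for brevity, not the method of the specification, which elsewhere relies on natural language processing and generative AI. The cue phrases are hypothetical and apply to English text only.

```python
import re
from typing import List

# Assumed cue phrases that mark a sentence as an action item.
ACTION_CUES = re.compile(r"\b(will|must|needs? to|should|action:)", re.IGNORECASE)


def extract_action_points(summary: str) -> List[str]:
    """Return the sentences of `summary` that contain an action cue."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [s for s in sentences if s and ACTION_CUES.search(s)]
```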

[0718] To support multiple languages, the server uses a translation engine to translate summaries and business processes in real time into multiple specified languages, enabling information to be shared in various languages. This allows participants to review meeting minutes in their native language.

[0719] Ultimately, the server securely stores the generated information in cloud storage, allowing users to access it instantly via their devices. This system enables users to review meeting minutes and action points immediately after the meeting ends, allowing them to efficiently plan their next actions.

[0720] For example, when conducting online meetings with multiple overseas locations, this system can be used to quickly share information about the situation in each location and update decisions in a timely manner. An example of a prompt message would be, "Please check the preparation status for the next project meeting, translate any important decisions, and notify relevant parties."

[0721] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0722] Step 1:

[0723] The user launches a dedicated application on their device and begins collecting audio information. The device uses a microphone to capture the meeting audio in real time and transfers the audio data to a server. The input is the raw audio of the meeting, and the output is a digital audio file.

[0724] Step 2:

[0725] The server processes the audio data received from the terminal and inputs it into the speech recognition engine. The speech recognition engine utilizes a generative AI model to analyze the audio data and convert it into text information. The input is a digital audio file, and the output is text information (text data).

[0726] Step 3:

[0727] The server analyzes the textual information and generates a summary using natural language processing algorithms. This process extracts the main discussion points and decisions of the meeting and summarizes the content concisely. The input is textual information, and the output is summarized text data.

[0728] Step 4:

[0729] The server automatically extracts tasks and action points from the generated summary and registers them in the task management system. This allows for the assignment of assignees and deadlines, and enables task progress management. The input is summarized text data, and the output is a task list and action points.

[0730] Step 5:

[0731] The server uses a translation engine to translate summaries and task data into multiple languages. This allows participants using different languages to access the information in their native language. The input is summaries and task lists, and the output is equivalent information translated into each language.

[0732] Step 6:

[0733] The server stores the final data in secure cloud storage, making it accessible to users in real time via their devices. Users can use this to review information immediately after the meeting ends. The input is a translated dataset, and the output is accessible data in the cloud.

[0734] (Application Example 1)

[0735] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0736] In today's world, with the increase in international conferences and webinars, there is a growing need for efficient meeting minute creation and information sharing that transcends language barriers. However, traditional methods struggle with real-time multilingual support, making it difficult to ensure the accuracy and timeliness of information. Furthermore, the process of understanding meeting details in different languages is cumbersome, hindering efficient communication.

[0737] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0738] In this invention, the server includes means for acquiring audio information and converting it into text information by speech recognition, means for analyzing the text information and generating a summary, and means for generating natural language meeting minutes in real time and making them immediately available to meeting participants. This enables the creation of multilingual meeting minutes in real time, reduces communication barriers in international meetings, and facilitates efficient information sharing.

[0739] "Audio information" refers to data input as audio, which forms the basis for converting it into text information.

[0740] "Textual information" refers to text data converted from audio information by speech recognition, and is used for further analysis and summary generation.

[0741] "Speech recognition" is a technology that converts spoken information into text information, automatically recording the content of conversations during meetings as text.

[0742] A "summary" is a brief record that extracts key points of discussion and decisions from analyzed textual information, allowing for efficient comprehension of the information.

[0743] A "work item" refers to the specific details of the actions decided at the meeting, and each item is assigned a person in charge and a deadline for its implementation.

[0744] "Action items" are specific actions that arise as a result of discussions during the meeting, and they indicate the measures that participants should take.

[0745] "Task management" is a method for systematically recording work items and action items, and for tracking and managing their progress.

[0746] "Natural language" refers to the language that humans use on a daily basis, and this system is capable of processing multiple languages.

[0747] "Real-time" means that data processing and information provision occur immediately and are available concurrently with the progress of the meeting.

[0748] "Generation" refers to the process of creating new information or records based on data, and specifically refers to the creation of summaries and translated meeting minutes.

[0749] "Meeting minutes" are documents that record the content and decisions made at a meeting, and are used for future reference and information sharing.

[0750] "Meeting participants" refers to individuals who participate in meetings or webinars and are the entities that utilize the generated information.

[0751] This system generates meeting minutes in real time in multiple languages, allowing participants to access the information immediately.

[0752] First, users collect audio information from meetings using devices such as smartphones or computers. These devices have a dedicated application installed and send the audio data to a server. The server then uses speech recognition software to convert the received audio information into text in real time for analysis. Because a pre-trained model is used for speech recognition, highly accurate conversion is possible. Specifically, the speech_recognition library is utilized to perform the conversion from speech to text.
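A minimal sketch of speech-to-text conversion with the speech_recognition library (the SpeechRecognition package) follows. The specification does not say which recognizer backend is used. The choice of `recognize_google`, the file-based input instead of a live stream, and the function name `transcribe_wav` are assumptions made for illustration.

```python
def transcribe_wav(path: str, language: str = "en-US") -> str:
    """Transcribe a WAV, AIFF, or FLAC file with the SpeechRecognition package."""
    # Imported lazily so this module loads even where the package is not installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    # recognize_google sends the audio to Google's Web Speech API (network required).
    return recognizer.recognize_google(audio, language=language)
```

A call such as `transcribe_wav("meeting.wav", language="ja-JP")` would return the recognized text for a Japanese recording; the file name is a placeholder.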

[0753] Next, based on the converted text information, the server generates a summary using a generative AI model. This summary extracts the key points and presents them concisely. Models from the transformers library are used to analyze the text data and extract useful information.
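Summary generation with the transformers library might be sketched as follows. The specification does not name a model or task, so the use of `pipeline("summarization")` with its default model (as available in transformers 4.x), the length limits, and the word-based chunking are assumptions. The helper names `chunk_words` and `summarize_transcript` are hypothetical. Chunking is included because summarization models accept only a limited input length.

```python
from typing import Callable, List, Optional


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into pieces of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_transcript(text: str, summarizer: Optional[Callable] = None,
                         max_words: int = 400) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    if summarizer is None:
        # Lazy import; the default model is downloaded on first use.
        from transformers import pipeline
        summarizer = pipeline("summarization")
    parts = []
    for chunk in chunk_words(text, max_words):
        result = summarizer(chunk, max_length=60, min_length=10, do_sample=False)
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

Accepting the summarizer as a parameter allows a different model, or a stub during testing, to be supplied without changing the function.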

[0754] The server then passes the generated summary to a multilingual translation engine, which translates it into the natural language requested by the participants. This enables smooth communication between participants who speak different languages. The langchain library is used for translation, providing fast and accurate translations.

[0755] This information is securely stored in cloud storage, and participants can access and view it at any time. This makes it possible to easily check necessary information immediately after the meeting ends, and to smoothly proceed with the execution of work items and action items. For example, when an international research team holds an online meeting, using this system can overcome language barriers and allow them to grasp important discussions and decisions in real time.

[0756] An example of a prompt used when inputting data into a generative AI model is: "Summarize the meeting minutes and translate them into Japanese. Please shorten the content while extracting the key points." This prompt allows the system to process the information in the specified format, enabling rapid and efficient information sharing.
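The prompt above can be assembled programmatically so that the target language is a parameter. The following is a minimal sketch only: the function name `build_minutes_prompt` and the `---` separator are assumptions made for illustration and do not appear in this disclosure.

```python
def build_minutes_prompt(transcript: str, target_language: str) -> str:
    """Compose the summarize-and-translate instruction followed by the transcript."""
    # The instruction wording follows the example prompt in the text;
    # only the target language is made variable here.
    instruction = (
        f"Summarize the meeting minutes and translate them into {target_language}. "
        "Please shorten the content while extracting the key points."
    )
    # Hypothetical layout: instruction, a separator line, then the transcript.
    return f"{instruction}\n\n---\n{transcript.strip()}"


prompt = build_minutes_prompt("Alice: Ship the beta on Friday.\n", "Japanese")
print(prompt)
```

Keeping the instruction separate from the transcript makes it straightforward to request a different language without changing the rest of the prompt.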

[0757] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0758] Step 1:

[0759] The user launches a dedicated application and begins collecting audio information. At this time, the terminal captures the conference audio via its microphone and sends the acquired audio data to the server. The input is the live audio, and the output is the audio data sent to the server.

[0760] Step 2:

[0761] The server converts the received audio data into text information using the speech_recognition library. This library provides a speech recognition interface that converts speech to text, relying on a trained model to achieve high accuracy. The input is audio data, and the output is text data.

[0762] Step 3:

[0763] The server generates a summary from the acquired text information using the transformers library. This library utilizes a generative AI model to efficiently extract key discussion points and decisions from the text information. The input is text data, and the output is a summarized text.

[0764] Step 4:

[0765] The server translates the generated summary into the specified natural language using the langchain library. This enables multilingual support and facilitates smooth information exchange among participants using different languages. The input is the summary text, and the output is the translated text.

[0766] Step 5:

[0767] The server saves the summary and translation results to cloud storage. This data is accessible in real time, allowing participants to continue referencing the information even after the meeting ends. The input is the translated text, and the output is the data stored in cloud storage.

[0768] Step 6:

[0769] Users can access cloud storage through their devices and view the necessary information. This process allows them to instantly review the agenda and tasks discussed in a meeting and quickly move on to the next action. The input is data in the cloud, and the output is information displayed on the user's screen.
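Steps 2 to 5 above can be sketched as a chain of pluggable stages. This is an illustrative sketch under stated assumptions, not the implementation of this disclosure: the names `MinutesRecord` and `run_minutes_pipeline` are hypothetical, the recognition, summarization, and translation stages are passed in as plain callables (in practice they might wrap the libraries named in the text), and an in-memory dictionary stands in for cloud storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, MutableMapping


@dataclass
class MinutesRecord:
    transcript: str
    summary: str
    translations: Dict[str, str]


def run_minutes_pipeline(
    audio: bytes,
    languages: List[str],
    transcribe: Callable[[bytes], str],
    summarize: Callable[[str], str],
    translate: Callable[[str, str], str],
    storage: MutableMapping[str, MinutesRecord],
    meeting_id: str,
) -> MinutesRecord:
    transcript = transcribe(audio)                      # Step 2: audio -> text
    summary = summarize(transcript)                     # Step 3: text -> summary
    translations = {lang: translate(summary, lang)      # Step 4: summary -> each language
                    for lang in languages}
    record = MinutesRecord(transcript, summary, translations)
    storage[meeting_id] = record                        # Step 5: persist the result
    return record


# Stand-in stages for demonstration only; they do no real recognition or translation.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_summarize(text: str) -> str:
    return text.split(".")[0].strip() + "."


def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


store: Dict[str, MinutesRecord] = {}
record = run_minutes_pipeline(
    b"Ship the beta on Friday. Bob will test it.",
    ["ja", "en"],
    fake_transcribe, fake_summarize, fake_translate,
    store, "m-001",
)
print(record.translations["ja"])
```

Passing the stages as callables keeps the flow independent of any particular recognition or translation engine.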

[0770] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0771] This invention is an information processing system that combines a system for automatically generating meeting minutes from audio data with an emotion engine that analyzes the user's emotions. The following describes specific embodiments for carrying out this invention.

[0772] First, the user launches a dedicated application at the start of the meeting to begin collecting audio data. The collected audio data is streamed in real time from the terminal to the server. The server receives this audio data and converts it into text data using a speech recognition engine. This process integrates an emotion engine that recognizes emotions from the user's voice tone and intonation.

[0773] The server analyzes the recognized text data, extracts the main points of discussion from the meeting, and generates a summary. This summary includes sentiment information recognized by the emotion engine, and sentiment labels are assigned to express the emotional nuances behind each discussion.

[0774] Next, the server extracts tasks and action points from the summarized information and registers them in the issue management system. This registration process also takes into account the user's emotional state, allowing for adjustment of task priorities. For example, if a discussion is perceived as emotionally important, tasks related to that discussion will be given a higher priority.

[0775] Furthermore, the server performs multilingual translation into multiple specified languages, incorporating sentiment data into the generated summaries and tasks. This translation process utilizes dictionary lookup functionality to maintain terminology consistency. This enables multilingual communication that includes sentiment information.

[0776] Ultimately, the server stores all data in real time in secure cloud storage, making it instantly accessible to global meeting participants via their devices. This system allows users to review meeting minutes and tasks after the meeting, as well as gain emotional insights from the process.

[0777] As a concrete example, in situations where intercultural communication is crucial, such as at international conferences, this system can be used to instantly share meeting minutes in multiple languages, taking emotional nuances into account, thereby facilitating smoother decision-making and task execution.

[0778] The following describes the processing flow.

[0779] Step 1:

[0780] The user launches a dedicated application at the start of the meeting to begin collecting audio data. This collection is managed by the device, and the audio data is sent to the server in real time.

[0781] Step 2:

[0782] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This conversion process uses a high-precision language model to faithfully record the content of the conversation as text.

[0783] Step 3:

[0784] Simultaneously, the server uses an emotion engine to extract user emotion information from the audio data in real time. It analyzes the tone, speed, and pitch of the voice and assigns an emotion label to each utterance.

[0785] Step 4:

[0786] The server analyzes the generated text data and extracts key discussion points and decisions from the meeting. When creating a summary based on this analysis, the extracted sentiment information is also reflected in the summary, indicating the atmosphere and tone of the meeting.

[0787] Step 5:

[0788] The server automatically extracts tasks and action points from the summary and registers them in the issue management system. Sentiment information is taken into consideration, and tasks associated with high-priority emotions are given higher priority.

[0789] Step 6:

[0790] The server translates summaries and tasks into the specified language using a multilingual translation engine. During this process, a dictionary lookup function is used to maintain terminology consistency and preserve sentiment.

[0791] Step 7:

[0792] The server securely stores the generated meeting minutes data and related information in cloud storage. All participants are given real-time access via their devices to review the meeting content and sentiment analysis results.
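The priority adjustment in Step 5 can be illustrated with a small sketch. The text does not specify which emotions raise a task's priority or by how much, so the `EMOTION_WEIGHT` table, the `Task` fields, and the additive scoring below are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical weights: how much each emotion label raises a task's priority.
EMOTION_WEIGHT = {"negative": 2, "positive": 1, "neutral": 0}


@dataclass
class Task:
    title: str
    base_priority: int   # larger means more urgent
    emotion_label: str   # label of the discussion the task came from


def effective_priority(task: Task) -> int:
    # Unknown labels contribute nothing rather than raising an error.
    return task.base_priority + EMOTION_WEIGHT.get(task.emotion_label, 0)


def prioritize(tasks: List[Task]) -> List[Task]:
    # Highest effective priority first; ties keep their original order.
    return sorted(tasks, key=effective_priority, reverse=True)


tasks = [
    Task("Update slides", 1, "neutral"),
    Task("Fix login outage", 1, "negative"),
    Task("Thank the design team", 1, "positive"),
]
ordered = [t.title for t in prioritize(tasks)]
print(ordered)
```

With equal base priorities, the task that came from the negatively labeled discussion is listed first under these assumed weights.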

[0793] (Example 2)

[0794] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0795] While meeting minute generation systems based on audio data exist, they lack effective content summarization and task management that take emotional information into account. Furthermore, they struggle to maintain smooth communication while considering emotional nuances across multiple languages.

[0796] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0797] In this invention, the server includes means for collecting voice data, converting it into text data using a voice recognition device and analyzing emotions from the voice; means for analyzing the text data and generating a summary with added emotional information; and means for automatically extracting tasks and actions from the summary, taking emotional information into consideration, setting priorities, and registering them in task management. This enables effective summary generation that reflects emotional information and communication that preserves emotional nuances across multiple languages.

[0798] "Audio data" refers to information obtained by converting sound waveforms into digital signals, and is used to record the content of meetings and conversations.

[0799] A "speech recognition device" is a software or hardware technology that has the function of analyzing speech data and converting it into text data.

[0800] "Text data" refers to string information converted by speech recognition, and is used for content analysis and summarization.

[0801] "Emotional analysis" is a technology that analyzes a speaker's tone of voice and word choice to determine their emotional state.

[0802] A "summary" is a concise compilation of key information extracted from text data.

[0803] A "task" refers to specific work items or action plans in a meeting or discussion.

[0805] In meeting minutes and task management, "action" refers to specific actions or decisions that should be taken to achieve a particular objective.

[0806] "Task management" is a system for organizing, prioritizing, and tracking tasks and problems in order to achieve a specific goal.

[0807] "Priority" is an indicator used to determine the order and importance of tasks and actions.

[0808] "Registering tasks in task management" is the process of entering extracted tasks and actions into the system, enabling tracking and management.

[0809] "Translation" is the process of converting content written in one language into another language while preserving its meaning and context.

[0810] The "dictionary reference function" is a feature that allows for the referencing of a predefined glossary to ensure consistency in terminology during translation and to produce appropriate translations.

[0811] "Multilingualism" refers to dealing with multiple different languages, and is particularly used in translation and international communication.

[0812] "Real-time" refers to operations and information updates that are processed and reflected immediately without delay.

[0813] A "secure storage device" is a device or technology for protecting and storing digital data, preventing unauthorized access and data leaks.

[0814] "Emotional information" refers to data that represents a person's emotional state, extracted from audio or text.

[0815] To implement this invention, the user must first launch a dedicated application on their terminal and collect audio data during the meeting. The terminal uses its microphone to acquire the audio data and streams it to a server in real time. The audio data is converted into text data by a speech recognition device. Specifically, speech recognition software is used for speech recognition.

[0816] The server performs speech recognition on the audio data and simultaneously analyzes the user's emotions using an emotion analysis engine. Emotion analysis assigns emotion labels such as positive, negative, or neutral based on the tone of voice and the content of the speech.

[0817] Once text data is generated, the server uses natural language processing tools to analyze the data and generate a summary that takes sentiment into account. This summary extracts key points of discussion using a specific algorithm and includes emotional nuances. The generated summary is then translated into multiple specified languages according to the user's instructions. The translation process utilizes dictionary lookup to maintain consistency in terminology.
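One way to keep terminology consistent during translation is to protect glossary terms with placeholders before calling the translation engine and to substitute the approved target-language terms afterwards. The sketch below assumes this placeholder approach, which the text does not prescribe; `translate_with_glossary`, the `__TERMn__` token format, and the stand-in translator are hypothetical.

```python
import re
from typing import Callable, Dict


def translate_with_glossary(
    text: str,
    glossary: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    placeholders: Dict[str, str] = {}
    protected = text
    # Replace longer terms first so multi-word terms win over their substrings.
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        token = f"__TERM{i}__"
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        if pattern.search(protected):
            protected = pattern.sub(token, protected)
            placeholders[token] = glossary[term]
    # The translation engine is assumed to leave the placeholder tokens untouched.
    translated = translate(protected)
    for token, target_term in placeholders.items():
        translated = translated.replace(token, target_term)
    return translated


# Stand-in translator for demonstration; it only wraps the text in a marker.
def fake_translate(s: str) -> str:
    return f"<ja>{s}</ja>"


glossary = {"action item": "アクションアイテム", "minutes": "議事録"}
result = translate_with_glossary(
    "Review the minutes and each action item.", glossary, fake_translate
)
print(result)
```

Whether a real translation engine preserves such tokens unchanged has to be verified for the engine actually used.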

[0818] The server extracts tasks and actions from the generated summary and registers them in the issue management system. Tasks are prioritized based on the sentiment of the discussion.

[0819] All generated data is stored in real time on a secure storage device by the server, and users can access it at any time via their terminals. This makes it easy to review meeting minutes and related tasks even after the meeting has ended.

[0820] As a concrete example, using sentiment analysis in international conferences can facilitate smoother intercultural communication. This is expected to lead to faster decision-making and improved project outcomes.

[0821] An example of a prompt to input into a generative AI model is, "Analyze the participants' emotional information from the audio data of this meeting, create a summary, and provide it."

[0822] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0823] Step 1:

[0824] At the start of a meeting, the user launches a dedicated application on their device to begin collecting audio data. The input is real-time audio emitted by the meeting participants. The device uses its microphone to capture this audio and streams it to the server as audio data. The output is a real-time stream of audio data to the server.

[0825] Step 2:

[0826] When the server receives the streamed audio data, it converts it into text data using a speech recognition device. The input for this step is the audio data sent from the terminal. The server analyzes the audio waveform, recognizes it as phonemes, and converts these into text information. The output is the recognized text data.

[0827] Step 3:

[0828] Following speech recognition, the server uses an emotion analysis engine to analyze emotional information from the text data and speech tone. The input here is the text data and speech tone information generated in step 2. The server analyzes these and assigns emotion labels such as positive, negative, and neutral. The output is the text data with the emotion labels assigned.

[0829] Step 4:

[0830] The server analyzes text data with sentiment information attached, uses natural language processing tools to extract key points, and generates a summary. The input is text data with sentiment labels attached in step 3. The server analyzes the meaning of the text and summarizes the important information. The output is summarized data with sentiment labels.

[0831] Step 5:

[0832] The server extracts tasks and actions from the summary data and registers them in the issue management system. The input is the summary data generated in step 4. The server identifies key action points and sets task priorities based on sentiment information. The output is the tasks and their priorities registered in the issue management system.

[0833] Step 6:

[0834] The server translates the generated summary and task data into multiple specified languages. This translation process utilizes dictionary lookup to ensure terminology consistency. The input is the summary and task data generated in step 5. The server uses a translation API to convert this data into the specified languages. The output is multilingual data, including sentiment information.

[0835] Step 7:

[0836] All processed data is saved in real time by the server to secure storage. The input is the multilingualized summary and task data from step 6. The server formats this appropriately and saves it to cloud storage. The output is data that is securely stored and accessible at any time.
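The hand-off from Step 3 (utterances with emotion labels) to Step 4 (a summary carrying a sentiment label) can be sketched as follows. The text does not say how per-utterance labels are combined, so the majority vote, the tie-break order, and the names `LabeledUtterance`, `dominant_emotion`, and `label_summary` are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# Hypothetical tie-break order when two labels occur equally often.
TIE_BREAK_ORDER = ["negative", "positive", "neutral"]


@dataclass
class LabeledUtterance:
    speaker: str
    text: str
    emotion: str  # "positive", "negative", or "neutral"


def dominant_emotion(utterances: List[LabeledUtterance]) -> str:
    if not utterances:
        return "neutral"
    counts = Counter(u.emotion for u in utterances)
    # max() returns the first label with the highest count, which applies the tie-break order.
    return max(TIE_BREAK_ORDER, key=lambda label: counts[label])


def label_summary(summary: str, utterances: List[LabeledUtterance]) -> str:
    return f"[{dominant_emotion(utterances)}] {summary}"


utterances = [
    LabeledUtterance("Aki", "The release slipped again.", "negative"),
    LabeledUtterance("Ben", "QA needs two more days.", "negative"),
    LabeledUtterance("Aki", "Thanks for the quick triage.", "positive"),
]
labeled = label_summary("Release slipped; QA needs two more days.", utterances)
print(labeled)
```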

[0837] (Application Example 2)

[0838] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0839] In meetings and communication settings, information processing is required that goes beyond simply transcribing speech into text, taking into account the speaker's emotions and the overall atmosphere. Furthermore, it is desirable that this information be multilingual and accessible in real time. This is necessary to reduce misunderstandings between cultures and streamline the decision-making process.

[0840] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0841] In this invention, the server includes means for acquiring voice data and converting it into text data using speech recognition, means for analyzing emotions from voice tone and intonation and adding emotional information to the summary, and means for adjusting task priorities based on the emotional information. This enables real-time information processing that takes emotions into account and smooth communication through multilingual support.

[0842] "Audio data" refers to acoustic signals obtained from human speech.

[0843] "Speech recognition" is a technology that analyzes speech data and converts it into corresponding text data.

[0844] "Text data" refers to string information obtained through speech recognition technology.

[0845] A "summary" is information that extracts the main points from text data and expresses them concisely.

[0846] A "task" is a specific work item extracted through summarization.

[0847] "Action points" are the next actions that should be taken, as clarified as a result of the discussion.

[0848] "Issue management" is the process of systematically organizing and monitoring tasks and action points.

[0849] "Translation" is the process of converting text written in one language into another specified language.

[0850] "Real-time" refers to a situation where processing or reactions occur almost simultaneously.

[0851] "Emotional analysis" is a technology that recognizes and analyzes the emotions of a speaker from audio data.

[0852] "Emotional information" refers to data about emotions obtained through emotion analysis.

[0853] "Priority adjustment" refers to changing the order in which tasks are processed according to their importance and urgency.

[0854] An "operation management system" is a system for managing processes and information related to the operation of vehicles and other related activities.

[0855] This invention is a system for efficiently processing information using audio data in meetings and other communication situations. The server first acquires audio data and converts it into text data using a speech recognition engine. For speech recognition, it utilizes platforms such as Google Cloud Speech-to-Text. The server then analyzes the text data using tools such as Watson Tone Analyzer to recognize emotions from the speaker's voice tone and intonation.

[0856] After performing sentiment analysis, the server adds sentiment information to the summary, incorporating emotional nuances into the meeting content. The server then extracts tasks and action points from the summary, adjusting task priorities based on the sentiment information. This ensures that tasks related to emotionally important discussions receive higher priority.

[0857] The information obtained is translated into multiple specified languages using a multilingual translation function. The translation includes a dictionary lookup function to maintain consistency in terminology. Finally, the generated data is stored in cloud storage and made accessible globally in real time via devices.

[0858] As a concrete example, using this system at an international autonomous vehicle development conference would allow for the instant sharing of multilingual information that takes into account emotional nuances in cross-cultural communication between countries, enabling smoother decision-making.

[0859] An example of a prompt for a generative AI model is, "How should you optimize driver assistance when the driver is experiencing increased stress due to traffic congestion?"

[0860] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0861] Step 1:

[0862] The user launches an audio collection application on their device at the start of the meeting and streams audio data to the server in real time using the microphone. The input is audio data, and the output is transmission to the server.

[0863] Step 2:

[0864] The server converts the received audio data into text data using a speech recognition engine (e.g., Google Cloud Speech-to-Text). The input is audio data, and the output is text data.

[0865] Step 3:

[0866] The server sends text data to an emotion analysis engine (e.g., Watson Tone Analyzer) to extract emotional information from speech tone and intonation. The input is text data, and the output is emotional information.

[0867] Step 4:

[0868] The server analyzes text data containing sentiment information to generate a summary. The input consists of sentiment information and text data, and the output is a summary with sentiment labels.

[0869] Step 5:

[0870] The server automatically extracts tasks and action points from the summary and adjusts task priorities based on sentiment information. The input is sentiment information and the summary, and the output is a task list with adjusted priorities.

[0871] Step 6:

[0872] The server translates tasks and summaries into multiple specified languages. The inputs are summaries and tasks, and translation software is used to generate multilingual data as output.

[0873] Step 7:

[0874] The server stores all generated data in cloud storage, allowing participants worldwide to access it instantly in real time via their devices. Input is translated multilingual data, and output is securely accessible data.

[0875] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0876] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0877] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0878] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, this mapping may be an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0879] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. More primitive emotions are located closer to the center of the concentric circles. Farther out, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. At the top and bottom of the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0880] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0881] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the farther outward an emotion lies on the emotion map 400, the more visible it becomes (that is, the more it is expressed in actions).
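The layout described for the emotion map 400 can be modeled with polar coordinates: the ring index grows from the center outward, the top is "pleasure", the bottom is "displeasure", the left is the reaction side, and the right is the situation side. The sketch below is illustrative only. The ring and angle values assigned to each emotion are invented for the example and are not taken from Figure 9, and `MapPoint`, `distance`, `nearest`, and `side` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MapPoint:
    ring: int         # 0 = center (most primitive); larger = farther out (more expressed in action)
    angle_deg: float  # 0 = right (situation), 90 = top (pleasure), 180 = left (reaction), 270 = bottom

    def xy(self) -> Tuple[float, float]:
        theta = math.radians(self.angle_deg)
        return (self.ring * math.cos(theta), self.ring * math.sin(theta))


# Hypothetical placements, chosen only to respect the directions described in the text.
EMOTION_MAP = {
    "calm": MapPoint(2, 0),
    "reassured": MapPoint(2, 10),
    "anxious": MapPoint(2, 340),
    "desire": MapPoint(3, 150),      # positive, reaction side
    "repentance": MapPoint(3, 330),  # negative, situation side
}


def distance(a: str, b: str) -> float:
    ax, ay = EMOTION_MAP[a].xy()
    bx, by = EMOTION_MAP[b].xy()
    return math.hypot(ax - bx, ay - by)


def nearest(emotion: str) -> str:
    # Emotions likely to occur together are mapped close together, so the
    # nearest neighbor is a candidate co-occurring emotion.
    others = (e for e in EMOTION_MAP if e != emotion)
    return min(others, key=lambda e: distance(emotion, e))


def side(emotion: str) -> str:
    x, _ = EMOTION_MAP[emotion].xy()
    return "situation" if x > 0 else "reaction"


print(nearest("calm"), side("desire"), side("repentance"))
```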

[0882] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
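The balance-based idea above, where deviation from an ideal produces discomfort and closeness to the ideal produces pleasure, can be expressed numerically. The following sketch is an assumption-laden illustration: the linear scoring, the tolerance parameter, the averaging across balances, and the names `balance_score` and `overall_feeling` are not part of this disclosure.

```python
from statistics import mean
from typing import Dict, Tuple


def balance_score(current: float, ideal: float, tolerance: float) -> float:
    """Return 1.0 at the ideal, 0.0 at one tolerance away, and at least -1.0."""
    deviation = abs(current - ideal) / tolerance
    return max(-1.0, 1.0 - deviation)


def overall_feeling(balances: Dict[str, Tuple[float, float, float]]) -> Tuple[str, float]:
    # Each entry is (current value, ideal value, tolerance); scores are averaged.
    score = mean(balance_score(*b) for b in balances.values())
    return ("pleasure" if score >= 0 else "discomfort"), score


# Hypothetical robot readings: battery level as a fraction, posture tilt in degrees.
good = overall_feeling({"battery": (0.75, 1.0, 0.5), "tilt": (10.0, 0.0, 20.0)})
bad = overall_feeling({"battery": (0.25, 1.0, 0.5), "tilt": (30.0, 0.0, 20.0)})
print(good, bad)
```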

[0883] The emotion map defines two emotions that promote learning. One is the negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion located around "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0884] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.

[0885] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0886] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0887] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0888] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0889] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0890] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0891] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0892] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0893] More specifically, the hardware structure of these various processors can be an electrical circuit that combines circuit elements such as semiconductor devices. The specific processing described above is merely an example. Therefore, unnecessary steps may be deleted, new steps added, or the processing order rearranged, as long as this does not deviate from the main purpose.

[0894] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, to avoid confusion and facilitate understanding of the technical aspects of this disclosure, the descriptions and illustrations presented above omit explanations of common technical knowledge that requires no special explanation to enable implementation of those aspects.

[0895] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0896] The following is further disclosed regarding the embodiments described above.

[0897] (Claim 1)

[0898] A means of acquiring audio data and converting it into text data using speech recognition,

[0899] A means of analyzing text data and generating a summary,

[0900] A method for automatically extracting tasks and actions from summaries and registering them in issue management,

[0901] Means for translating summaries and tasks into multiple specified languages,

[0902] A means of saving the generated data and making it accessible in real time,

[0903] A system comprising the above means.

[0904] (Claim 2)

[0905] The system according to claim 1, further comprising means for comparing with past data in order to improve the accuracy of speech recognition.

[0906] (Claim 3)

[0907] The system according to claim 1, comprising a dictionary lookup function for maintaining terminology consistency during translation.
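
As a non-limiting illustration, the flow of the means listed above (speech recognition, summarization, task extraction and issue registration, translation, and storage) might be sketched as follows. The names `Minutes`, `build_minutes`, `recognize`, `translate`, `register_issue`, and `store` are hypothetical and do not appear in this disclosure. The summarizer and task extractor are deliberately naive placeholders (first sentences, keyword match) standing in for whatever models an implementation would actually use, and real-time access is not modeled.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, MutableMapping


@dataclass
class Minutes:
    transcript: str
    summary: str
    tasks: List[str]
    # language code -> {"summary": str, "tasks": [str, ...]}
    translations: Dict[str, dict] = field(default_factory=dict)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Placeholder summary (assumption): keep the first few sentences."""
    return " ".join(split_sentences(text)[:max_sentences])


def extract_tasks(summary: str) -> List[str]:
    """Placeholder extraction (assumption): sentences with an obligation keyword."""
    marker = re.compile(r"\b(will|must|needs to)\b", re.IGNORECASE)
    return [s for s in split_sentences(summary) if marker.search(s)]


def build_minutes(
    meeting_id: str,
    audio: bytes,
    recognize: Callable[[bytes], str],       # hypothetical speech recognizer
    translate: Callable[[str, str], str],    # hypothetical translator (text, lang)
    register_issue: Callable[[str], None],   # hypothetical issue-management hook
    store: MutableMapping[str, Minutes],     # hypothetical storage backend
    languages: List[str],
) -> Minutes:
    transcript = recognize(audio)            # audio data -> text data
    summary = summarize(transcript)          # text data -> summary
    tasks = extract_tasks(summary)           # summary -> tasks
    for task in tasks:
        register_issue(task)                 # register each task as an issue
    minutes = Minutes(transcript, summary, tasks)
    for lang in languages:                   # translate summary and tasks
        minutes.translations[lang] = {
            "summary": translate(summary, lang),
            "tasks": [translate(t, lang) for t in tasks],
        }
    store[meeting_id] = minutes              # save so it can be read back later
    return minutes
```

Passing the recognizer, translator, issue hook, and store as arguments keeps the sketch independent of any particular speech or translation service.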

[0908] "Example 1"

[0909] (Claim 1)

[0910] A means of acquiring audio information and converting it into text information through speech recognition,

[0911] A means of analyzing textual information and generating a summary,

[0912] A method for automatically extracting tasks and action points from summaries and registering them in issue management,

[0913] Means for translating summaries and business statements into multiple specified languages,

[0914] A means of saving the generated information and making it immediately accessible,

[0915] A means to organize meeting content and enable management based on prioritization and dependencies,

[0916] A system comprising the above means.

[0917] (Claim 2)

[0918] The system according to claim 1, further comprising means for utilizing a pre-trained generative AI model to improve the accuracy of speech recognition.

[0919] (Claim 3)

[0920] The system according to claim 1, comprising a dictionary lookup function for maintaining consistency among multiple languages for information to be translated.
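
As a non-limiting illustration of management based on prioritization and dependencies, one possible approach is a dependency-respecting ordering in which, among the tasks whose prerequisites are complete, the highest-priority task comes first. The function `order_tasks`, the convention that a lower number means a higher priority, and the sample task names are assumptions made for this sketch only. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) and assumes every task named as a dependency also has a priority.

```python
import heapq
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def order_tasks(
    priorities: Dict[str, int],
    depends_on: Dict[str, Iterable[str]],
) -> List[str]:
    """Return tasks in an order that respects dependencies.

    priorities: task -> priority, lower number = more urgent (assumed scale).
    depends_on: task -> tasks that must be finished first.
    """
    graph = {task: set(depends_on.get(task, ())) for task in priorities}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    # Heap of (priority, task) for tasks whose prerequisites are all done.
    heap = [(priorities[t], t) for t in sorter.get_ready()]
    heapq.heapify(heap)

    ordered: List[str] = []
    while heap:
        _, task = heapq.heappop(heap)
        ordered.append(task)
        sorter.done(task)                  # unlocks tasks that depended on it
        for unlocked in sorter.get_ready():
            heapq.heappush(heap, (priorities[unlocked], unlocked))
    return ordered


if __name__ == "__main__":
    print(order_tasks(
        {"draft": 2, "review": 1, "publish": 1, "book room": 3},
        {"review": ["draft"], "publish": ["review"]},
    ))
```

In this sample, "review" has the highest priority but cannot be scheduled until "draft" is done, so the dependency takes precedence over the raw priority value.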

[0921] "Application Example 1"

[0922] (Claim 1)

[0923] A means of acquiring audio information and converting it into text information using speech recognition,

[0924] A means of analyzing textual information and generating a summary,

[0925] A method for automatically extracting work items and action items from a summary and registering them in the task management system,

[0926] A means for translating summaries and work items into multiple specified natural languages,

[0927] A means of saving the generated data and making it accessible in real time,

[0928] A method for generating natural language meeting minutes in real time and making them immediately available to meeting participants,

[0929] A system comprising the above means.

[0930] (Claim 2)

[0931] The system according to claim 1, further comprising means for comparing with past information in order to improve the accuracy of speech recognition.

[0932] (Claim 3)

[0933] The system according to claim 1, comprising a dictionary lookup function for maintaining terminology consistency during translation.
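
As a non-limiting illustration of a dictionary lookup function for terminology consistency, one common technique is to replace glossary terms with placeholder tokens before translation and to substitute the approved target-language terms afterwards. The function `translate_with_glossary`, the glossary layout, and the `translate` callable are hypothetical. The sketch also assumes that the translator passes the placeholder tokens through unchanged, which an actual translation engine may not guarantee.

```python
import re
from typing import Callable, Dict, List


def translate_with_glossary(
    text: str,
    target_lang: str,
    glossary: Dict[str, Dict[str, str]],     # source term -> {lang: approved term}
    translate: Callable[[str, str], str],    # hypothetical translator (text, lang)
) -> str:
    # Keep only glossary entries that define a term for the target language.
    terms = {
        source.lower(): by_lang[target_lang]
        for source, by_lang in glossary.items()
        if target_lang in by_lang
    }
    if not terms:
        return translate(text, target_lang)

    # One alternation, longest terms first, so "issue tracker" wins over "issue".
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)),
        re.IGNORECASE,
    )

    slots: List[str] = []

    def protect(match: "re.Match[str]") -> str:
        # Remember the approved term; fall back to the matched text if lookup fails.
        slots.append(terms.get(match.group(0).lower(), match.group(0)))
        return f"__TERM{len(slots) - 1}__"

    protected = pattern.sub(protect, text)
    translated = translate(protected, target_lang)

    # Put the approved target-language terms back in place of the tokens.
    for index, approved in enumerate(slots):
        translated = translated.replace(f"__TERM{index}__", approved)
    return translated
```

Because every occurrence of a glossary term is resolved from the same dictionary entry, the same source term always yields the same target term, regardless of how the translator would have rendered it in context.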

[0934] "Example 2 of combining an emotion engine"

[0935] (Claim 1)

[0936] A means for collecting audio data, converting it into text data using a speech recognition device, and analyzing emotions from the audio,

[0937] A means for analyzing text data and generating a summary with added sentiment information,

[0938] A method for automatically extracting tasks and actions from summaries, taking emotional information into consideration, setting priorities, and registering them in issue management,

[0939] A means of translating summaries and tasks into multiple specified languages and maintaining terminology consistency using a dictionary lookup function,

[0940] A means of saving generated data in real time and storing it in a secure storage device for access,

[0941] A system comprising the above means.

[0942] (Claim 2)

[0943] The system according to claim 1, further comprising means for adjusting the priority of tasks according to their importance based on sentiment analysis of voice data.

[0944] (Claim 3)

[0945] The system according to claim 1, comprising an algorithm for preserving the nuances of emotional information during multilingual translation.
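
As a non-limiting illustration of adjusting task priority based on emotional information, the sketch below raises a task's priority by one level when the associated sentiment is strongly negative. The `Task` fields, the priority scale of 1 (highest) to 5 (lowest), the sentiment range of -1.0 to 1.0, and the threshold of -0.5 are all assumptions chosen for illustration and are not specified in this disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    title: str
    base_priority: int   # assumed scale: 1 = highest, 5 = lowest
    sentiment: float     # assumed range: -1.0 (very negative) to 1.0 (very positive)


def adjusted_priority(task: Task, urgency_threshold: float = -0.5) -> int:
    """Raise priority by one level when sentiment is at or below the threshold."""
    if task.sentiment <= urgency_threshold:
        return max(1, task.base_priority - 1)   # never go above the top level
    return task.base_priority


def rank_tasks(tasks: List[Task]) -> List[Task]:
    """Order by adjusted priority, breaking ties with the more negative sentiment."""
    return sorted(tasks, key=lambda t: (adjusted_priority(t), t.sentiment))


if __name__ == "__main__":
    for task in rank_tasks([
        Task("Order lunch", 4, 0.1),
        Task("Update slides", 2, 0.4),
        Task("Fix login outage", 3, -0.8),
    ]):
        print(adjusted_priority(task), task.title)
```

A single-level adjustment keeps the emotional signal from overriding the base priority entirely; an implementation could equally use a weighted score.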

[0946] "Application example 2 when combining with an emotional engine"

[0947] (Claim 1)

[0948] A means of acquiring audio data and converting it into text data using speech recognition,

[0949] A means of analyzing text data and generating a summary,

[0950] A method for automatically extracting tasks and action points from summaries and registering them in issue management,

[0951] Means for translating summaries and tasks into multiple specified languages,

[0952] A means of saving the generated data and making it accessible in real time,

[0953] A method for analyzing emotions from voice tone and intonation, and adding emotional information to a summary,

[0954] A method for adjusting task priorities based on emotional information,

[0955] A means of providing information in real time by feeding it back to the operational management system,

[0956] A system comprising the above means.

[0957] (Claim 2)

[0958] The system according to claim 1, further comprising means for comparing with past data in order to improve the accuracy of speech recognition.

[0959] (Claim 3)

[0960] The system according to claim 1, comprising a language reference function for maintaining terminology consistency during translation.

[Explanation of Symbols]

[0961]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A means of acquiring audio information and converting it into text information using speech recognition,
A means of analyzing textual information and generating a summary,
A method for automatically extracting work items and action items from a summary and registering them in the task management system,
A means for translating summaries and work items into multiple specified natural languages,
A means of saving the generated data and making it accessible in real time,
A method for generating natural language meeting minutes in real time and making them immediately available to meeting participants,
A system comprising the above means.

2. The system according to claim 1, further comprising means for comparing with past information in order to improve the accuracy of speech recognition.

3. The system according to claim 1, comprising a dictionary lookup function for maintaining consistency of terminology during translation.

Citation Information

Patent Citations

Persona chatbot control method and system
JP2022180282A
