system

The system addresses recognition accuracy issues by integrating multiple speech recognition models to transcribe and summarize meeting audio, enhancing the efficiency and accuracy of meeting minute creation for organizational decision-making.

JP2026101287A · Pending · Publication Date: 2026-06-22 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-10
Publication Date
2026-06-22

AI Technical Summary

Technical Problem

Conventional speech recognition and transcription technologies rely on single models, leading to recognition accuracy issues and difficulties in quickly and accurately extracting key points from meetings, making it challenging to utilize meeting information for organizational decision-making.

Method used

A system that utilizes multiple speech recognition models to transcribe meeting audio, integrates and summarizes the results, and generates answers based on user questions using meeting minutes and related materials.

Benefits of technology

Improves the accuracy and efficiency of creating meeting minutes, enabling efficient acquisition of meeting insights and supporting organizational decision-making.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure 2026101287000001_ABST

Abstract

A system is provided. [Solution] The system includes: means for receiving acoustic data from a communication device; means for transmitting the acoustic data to multiple acoustic recognition models and receiving a string conversion result from each model; means for comparing and integrating the multiple string conversion results to generate a single record document; means for summarizing the record document and extracting important information elements; means for receiving inquiries from users and responding to those inquiries based on the record document and related information; and means for recording and analyzing conversations in stores and automatically generating and summarizing customer feedback.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] An object of the present invention is to improve the accuracy and efficiency of creating minutes of meetings or the like. Conventional speech recognition and transcription technologies rely on a single model, so recognition errors and omissions are likely. It is also difficult to quickly and accurately extract the key points of a meeting and to combine them with related information in order to discover organizational issues and make decisions.

Means for Solving the Problems

[0005] To address this challenge, we provide a system that incorporates steps such as receiving conference audio from a communication terminal, transcribing it using multiple speech recognition models, and comparing and integrating the results. Furthermore, it has a function to summarize the integrated meeting minutes and extract important information, and generates answers based on user questions using the meeting minutes and related materials. This system improves the accuracy of meeting minute creation and enables efficient acquisition of meeting insights.

[0006] A "communication terminal" is an electronic device that has the function of transmitting and receiving voice data and related materials.

[0007] "Audio data" refers to data that stores human voice information recorded during meetings or other similar events in digital format.

[0008] A "speech recognition model" is a set of algorithms and techniques used to convert speech data into text format.

[0009] "Transcription results" refer to text data converted from audio data by a speech recognition model.

[0010] "Meeting minutes" are documents that record what was said and what decisions were made during a meeting.

[0011] A "summary" is a format in which the most important parts of an entire document are extracted and summarized in a short format.

[0012] "Related materials" refer to additional information and documents related to the content of the meeting, and are included in the analysis along with the meeting minutes.

[0013] A "user" refers to an individual or organization that uses the system to provide voice data and ask questions.

[0014] A "means of answering questions" refers to a component that has the function of automatically providing information based on meeting minutes and related documents in response to inquiries from users.

Brief Description of the Drawings

[0015]
[Figure 1] It is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] It is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Mode for Carrying Out the Invention

[0016] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0017] First, the terms used in the following description will be explained.

[0018] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0019] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0020] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0021] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0022] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0023] [First Embodiment]

[0024] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0025] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0026] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0027] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0028] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0029] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0030] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0031] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0032] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0033] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0034] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0035] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0036] This invention relates to a system that analyzes audio data from meetings and other similar events from multiple perspectives and automatically generates highly accurate meeting minutes. This system is implemented using three main components: a communication terminal, a server, and a user.

[0037] Users record audio during meetings using communication devices. This recorded audio data is uploaded to a server after the meeting ends. The server uses multiple speech recognition models to analyze this audio data and perform transcription. The transcription results from the different speech recognition models are aggregated, compared, and integrated on the server to create a single, unified meeting transcript.

[0038] To improve the accuracy of the meeting minutes, the server evaluates the reliability of each transcript and selects the optimal combination. Next, the server summarizes the generated minutes, extracting and organizing key points. Users can provide relevant materials to the server as needed; the server analyzes them, compares them with the minutes, and evaluates their relevance to gain further insights.

[0039] In this system, users can ask questions through a server interface, and the server returns answers to those questions. This allows users to obtain information efficiently and supports their decision-making.

[0040] As a concrete example, consider a company's weekly meeting. Users install a dedicated application on their smartphones, record the meeting discussions, and transfer the recordings to a server. The server immediately analyzes the audio and generates a set of text data using speech recognition models A, B, and C. The server then selects the most accurate text data and summarizes it as meeting minutes. Users can use the server's question-and-answer function to review the conversation after the meeting or to inquire about specific points of discussion. In this way, the recording and analysis of meetings are centralized, with the aim of contributing to organizational information management and decision-making.

[0041] The following describes the processing flow.

[0042] Step 1:

[0043] Users record audio on their devices during meetings and upload the audio data to the server after the meeting ends.

[0044] Step 2:

[0045] The server sends the received audio data to multiple speech recognition models, each of which analyzes the audio data and generates a transcription result.

[0046] Step 3:

[0047] The server receives transcription results from each speech recognition model, compares and analyzes them to evaluate their reliability, and then integrates the most accurate text to create a single meeting transcript.

[0048] Step 4:

[0049] The server inputs the generated meeting minutes into a summarization model, creates a summary, and extracts the key points.

[0050] Step 5:

[0051] Users upload relevant materials used in meetings from their devices to the server, which then compares the meeting minutes with the materials, assesses their relevance, and generates additional insights.

[0052] Step 6:

[0053] The user enters a question through the server interface, and the server generates an answer based on meeting minutes and related documents and sends it back to the user.

[0054] Step 7:

[0055] Users provide feedback to the server, which then uses this feedback to improve the speech recognition and meeting minute creation processes.
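
As an illustration only, the fan-out in Step 2 and the collection of results at the start of Step 3 can be sketched as follows. The sketch assumes that each speech recognition model is wrapped in a callable that takes audio bytes and returns text. The names `Recognizer`, `transcribe_with_all`, `model_a`, `model_b`, and `model_c` are hypothetical and do not appear in the disclosure, and the stand-in models return fixed strings instead of performing recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical recognizer interface: raw audio bytes in, transcript out.
Recognizer = Callable[[bytes], str]


def transcribe_with_all(audio: bytes, recognizers: Dict[str, Recognizer]) -> Dict[str, str]:
    """Send the same audio to every recognizer and collect the results by name."""
    if not recognizers:
        return {}
    # Run the recognizers concurrently; each one receives the same audio.
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        futures = {name: pool.submit(fn, audio) for name, fn in recognizers.items()}
        return {name: future.result() for name, future in futures.items()}


# Stand-in models for illustration only; real models would call a speech engine.
def model_a(audio: bytes) -> str:
    return "the budget was approved"


def model_b(audio: bytes) -> str:
    return "the budget was approved"


def model_c(audio: bytes) -> str:
    return "the budget was improved"


results = transcribe_with_all(
    b"\x00\x01", {"A": model_a, "B": model_b, "C": model_c}
)
```

The returned dictionary keeps one transcription result per model, which is the input that the comparison and integration in Step 3 would need.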

[0056] (Example 1)

[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0058] In recent years, the importance of recording meetings and discussions has increased, but traditional manual recording methods often lack accuracy and speed. Furthermore, relying on a single speech recognition technology when converting audio data into text increases the risk of misrecognition. As a result, information may be lost or inaccurate. In addition, there is a lack of efficient methods for grasping the important points discussed in meetings, and there is a challenge in extracting important information from information overload. This invention aims to solve these problems and improve the accuracy and efficiency of meeting records.

[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0060] In this invention, the server includes means for acquiring audio information from a communication device, means for supplying the audio information to multiple speech recognition algorithms and acquiring text output from each algorithm, and means for comparing and integrating the multiple text outputs to create a single document. This makes it possible to analyze the audio information from multiple perspectives by processing it with multiple recognition algorithms and efficiently generate a single integrated record. Furthermore, by using a generative AI model to scrutinize the information according to the importance of the document, high-quality summaries and extraction of related information can be achieved.

[0061] "Communication equipment" is a general term for devices that have the function of acquiring voice information and transmitting it to a server.

[0062] "Audio information" refers to audio data acquired during meetings and discussions, which is then used to create text.

[0063] A "speech recognition algorithm" refers to various analytical methods used to convert speech information into text.

[0064] "Text output" refers to text data generated by analyzing audio information using a speech recognition algorithm.

[0065] A "document" is a record generated from comparative and integrated text output, and is a consistent form of representation of information, including meeting content.

[0066] "Digital processing" refers to a series of data adjustment operations performed on audio information, such as noise reduction and sound quality correction.

[0067] A "security protocol" is a set of rules that includes communication procedures to ensure security during the transmission of voice information.

[0068] A "generative AI model" is a type of artificial intelligence technology used for purposes such as summarizing text data and extracting important information.

[0069] One embodiment of the present invention is a system that analyzes conference audio from multiple angles and automatically generates high-quality meeting minutes. This system mainly consists of three elements: a communication device, a server, and a user.

[0070] Users record audio information during meetings using a communication device. This device is a smartphone or tablet with a dedicated application installed, which records the audio in digital format and sends it to the server after the meeting ends. The files are encrypted using a secure protocol for information protection, ensuring security during transmission.

[0071] The server is equipped with multiple speech recognition algorithms and analyzes the received audio information. Specifically, it performs digital processing such as noise reduction and sound quality adjustment before feeding the information to the speech recognition algorithms. In this step, different algorithms such as Model A, Model B, and Model C are used. Each algorithm generates a different text output, and the results are integrated to create a single document. This document is further summarized by the server, and important information is extracted using a generative AI model.
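
The disclosure does not specify how noise reduction or sound quality adjustment is carried out. As a minimal sketch under stated assumptions, the following treats the audio as a list of floating-point samples in the range -1.0 to 1.0, uses a simple noise gate as a stand-in for noise reduction, and uses peak normalization as a stand-in for sound quality adjustment. The function names and the threshold values are hypothetical.

```python
from typing import List


def noise_gate(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Zero out samples whose magnitude is below the threshold (assumed value)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the signal so that its largest magnitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples: List[float]) -> List[float]:
    """Apply the gate first, then normalize the remaining signal."""
    return peak_normalize(noise_gate(samples))


processed = preprocess([0.01, -0.3, 0.45, -0.005])
```

Gating before normalizing keeps the gain from amplifying low-level noise; a production system would more likely use spectral methods, which are outside the scope of this sketch.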

[0072] As a concrete example, consider a company's weekly meeting, where the user records the meeting using a dedicated application. Once the recording is complete, the audio file is sent to the server, and analysis begins immediately. The server integrates the text generated from the audio and compiles it into meeting minutes. If the user wants to confirm a specific point of discussion after the meeting, they can enter a prompt such as, "Please tell me about the budget plan discussed in this meeting," and the server's AI model will provide an appropriate answer.
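
The way the server grounds its answer in the meeting minutes is not detailed in the disclosure. One possible approach, shown here purely as a sketch, is to pick the passages of the minutes that share the most keywords with the question and place them in the prompt given to the generative AI model. The function names, the stop-word list, and the prompt wording are all assumptions, and the call to the model itself is omitted.

```python
import re
from typing import List, Set

# Small illustrative stop-word list (assumption, not exhaustive).
STOPWORDS = {
    "the", "a", "an", "in", "on", "is", "to", "this", "me",
    "about", "please", "tell", "was", "for",
}


def keywords(text: str) -> Set[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def top_passages(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Return up to k passages ranked by keyword overlap with the question."""
    q = keywords(question)
    scored = [(len(q & keywords(p)), i, p) for i, p in enumerate(passages)]
    scored = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep the original order of the minutes.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [p for _, _, p in scored[:k]]


def build_prompt(question: str, passages: List[str], k: int = 2) -> str:
    """Assemble a prompt that restricts the answer to the selected passages."""
    context = "\n".join(f"- {p}" for p in top_passages(question, passages, k))
    return f"Answer using only the meeting minutes below.\n{context}\nQuestion: {question}"


minutes = [
    "The budget plan for Q3 was approved.",
    "The next review is on Friday.",
    "Marketing asked to revise the budget.",
]
question = "Please tell me about the budget plan discussed in this meeting"
selected = top_passages(question, minutes)
prompt = build_prompt(question, minutes)
```

Keyword overlap is a deliberately simple ranking; an embedding-based search could replace `top_passages` without changing the rest of the flow.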

[0073] This makes it easier for users to understand meeting content and quickly obtain the information necessary for decision-making. The entire system enables centralized information management, supporting the efficient operation of the organization.

[0074] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0075] Step 1:

[0076] The user records audio information during a meeting using a communication device. Once recording is complete, the device sends the audio file to the server. The input is audio information in digital format, and the output is the audio file transferred to the server. The specific action taken in this case is tapping the "Recording Complete" button within the application.

[0077] Step 2:

[0078] The server performs digital processing on the received audio files. Specifically, it removes noise and adjusts the sound quality to make them suitable for analysis. The input is the audio file sent to the server, and the output is the digitally processed audio data. The server aims to improve the accuracy of the analysis through this process.

[0079] Step 3:

[0080] The server supplies processed audio data to multiple speech recognition algorithms. Specifically, it inputs the audio data into each algorithm (e.g., Model A, Model B, Model C) and generates text output. The input is digitally processed audio data, and the output is a set of text outputs from each algorithm.

[0081] Step 4:

[0082] The server compares and integrates multiple text outputs obtained from speech recognition algorithms to create a single document. Specifically, it selects the most suitable text output based on its reliability and combines them into a consistent document. The input is a set of text outputs, and the output is the integrated document.

[0083] Step 5:

[0084] The server summarizes the integrated document and extracts key information. This specifically involves using a generative AI model to analyze the text and concisely summarize the essential points. The input is the integrated document, and the output is the summarized key points.

[0085] Step 6:

[0086] The user inputs a specific question using the server interface, and the server generates an answer based on that. Specifically, the AI model analyzes the input prompt and provides an answer. The input is the prompt from the user, and the output is the generated answer.
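
Step 4 above selects text based on reliability, but the disclosure does not define the reliability measure. The following sketch assumes that each algorithm returns the same number of time-aligned segments and reports a confidence score between 0.0 and 1.0 for each segment; both assumptions, along with the names `Segment` and `integrate`, are hypothetical. For each segment position, the text with the highest confidence is kept.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    text: str
    confidence: float  # assumed 0.0-1.0 score reported by the recognizer


def integrate(outputs: Dict[str, List[Segment]]) -> str:
    """For each segment position, keep the text with the highest confidence."""
    if not outputs:
        return ""
    lengths = {len(segments) for segments in outputs.values()}
    if len(lengths) != 1:
        # The sketch only handles outputs that are already aligned.
        raise ValueError("all models must return the same number of segments")
    chosen = []
    for candidates in zip(*outputs.values()):
        best = max(candidates, key=lambda seg: seg.confidence)
        chosen.append(best.text)
    return " ".join(chosen)


document = integrate({
    "Model A": [Segment("We reviewed the budget.", 0.92), Segment("Sails rose.", 0.40)],
    "Model B": [Segment("We reviewed the bucket.", 0.55), Segment("Sales rose.", 0.88)],
})
```

In practice the segments from different algorithms rarely line up exactly, so an alignment step would be needed before this kind of selection.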

[0087] (Application Example 1)

[0088] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0089] There is a need to efficiently record audio data generated during team meetings and customer interactions in stores, organize key points, and utilize it as information useful for decision-making. However, manually processing the large volume of audio data generated in these operations is time-consuming and labor-intensive, and presents challenges in terms of accuracy and comprehensiveness. Therefore, there is a need to develop a system that can automatically analyze audio data with high accuracy and provide useful information for store operations quickly and effectively.

[0090] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0091] In this invention, the server includes means for receiving acoustic data from a communication device, means for transmitting the acoustic data to multiple acoustic recognition models and receiving string conversion results from each model, and means for comparing and integrating the multiple string conversion results to generate a single record document. This enables automatic recording of audio and extraction of important information in stores.

[0092] A "communication device" is a device used to record audio data and transmit it to other systems.

[0093] "Audio data" refers to data that records sound in digital format.

[0094] An "acoustic recognition model" is a set of algorithms that analyze acoustic data and convert it into a string format.

[0095] "String conversion result" refers to text data generated by the acoustic recognition model.

[0096] A "record document" is a document created by integrating the results of multiple string conversions.

[0097] "Related information" refers to external data that is cross-referenced with record documents to gain additional insights.

[0098] A "user" is a person or organization that uses the system to process audio data or acquire information.

[0099] "Feedback" refers to the opinions and reactions received from customers or users.

[0100] To realize this invention, it is necessary to build a system that includes a server, a communication terminal, and a user. In this system, the user uses the communication terminal to record meetings and conversations with customers within the store. The recorded audio data is transmitted from the communication terminal to the server.

[0101] The server passes the received audio data to multiple acoustic recognition models and obtains the string conversion result from each model. Examples of usable acoustic recognition models include Google® Speech-to-Text API and Amazon Transcribe. This enables highly accurate transcription from audio data.

[0102] Next, the server compares these string conversion results, selects the most suitable parts, and integrates them to generate a single record document. Then, natural language processing (NLP) tools are used to summarize the record document and extract important information elements. Examples of NLP tools include NLTK and spaCy.
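
The summarization step is described only at the level of the tools that may be used. As a hedged illustration that avoids depending on any particular NLP library, the following sketch performs a simple frequency-based extractive summary with the Python standard library: sentences are scored by the average frequency of their non-stop-words, and the highest-scoring sentences are returned in their original order. The function names and the stop-word list are assumptions; a real system could substitute NLTK or spaCy for tokenization.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (assumption, not exhaustive).
STOPWORDS = {"the", "a", "an", "we", "was", "will", "to", "of", "and", "in", "is", "it"}


def content_words(text: str) -> List[str]:
    """Lowercase the text and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text: str) -> List[str]:
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text: str, max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


record = (
    "We discussed the product layout. "
    "The layout near the entrance will change. "
    "Lunch was late."
)
summary = summarize(record, max_sentences=2)
```

Averaging by sentence length prevents long sentences from winning only because they contain more words.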

[0103] The generated summary is provided to the user, who can then use it to make decisions. For example, if an apparel shop discusses changing the product layout, the key points from the recording of the discussion could be summarized and provided to the staff as supplementary material.

[0104] An example of a prompt for a generative AI model is, "Analyze the audio data from the store meeting and transcribe the product lineup discussion appropriately." This allows for efficient analysis of audio data without human intervention, providing information useful for store operations.

[0105] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0106] Step 1:

[0107] The user uses a device to record conversations within the store. The device records the conversation as audio data. The input is audio data, and the output is the audio data stored on the device. The device then prepares to transmit this audio data to the next step.

[0108] Step 2:

[0109] The terminal sends the recorded audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data received by the server. The server prepares this data to pass to the acoustic recognition model.

[0110] Step 3:

[0111] The server inputs acoustic data into multiple acoustic recognition models and retrieves string conversion results from each model. The input is the acoustic data stored on the server, and the output is the string conversion results generated by the acoustic recognition models. The server aggregates the string conversion results from the different models.

[0112] Step 4:

[0113] The server compares and integrates the acquired string conversion results to generate a single record document. During this process, the accuracy and reliability of the results from each model are evaluated. The input consists of multiple string conversion results, and the output is an integrated record document. The server constructs the record document based on the most reliable information.

[0114] Step 5:

[0115] The server summarizes the record document using natural language processing (NLP) tools and extracts key information elements. The input is the integrated record document, and the output is the summarized information elements. The server uses NLP tools to organize the key points of the document.

[0116] Step 6:

[0117] The server provides the user with a generated summary. The input is the summarized information element, and the output is the summary information received by the user. The user can use this information to make decisions.
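
Step 4 above evaluates the accuracy and reliability of each model's result without stating how. When no confidence scores are available, one possible proxy, shown here as a sketch only, is mutual agreement: a string conversion result that closely matches the results of the other models is treated as more reliable. The sketch uses `difflib.SequenceMatcher` from the Python standard library on word lists; the function names and the use of agreement as a reliability measure are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict


def agreement_scores(results: Dict[str, str]) -> Dict[str, float]:
    """Mean word-level similarity of each result to the results of the other models."""
    names = list(results)
    scores = {}
    for name in names:
        others = [n for n in names if n != name]
        if not others:
            # A single result has nothing to disagree with.
            scores[name] = 1.0
            continue
        words = results[name].lower().split()
        total = sum(
            SequenceMatcher(None, words, results[o].lower().split()).ratio()
            for o in others
        )
        scores[name] = total / len(others)
    return scores


def most_reliable(results: Dict[str, str]) -> str:
    """Name of the model whose result agrees most with the others."""
    scores = agreement_scores(results)
    return max(scores, key=scores.get)


results = {
    "A": "move the jackets to the front",
    "B": "move the jackets to the front",
    "C": "move the rackets to the font",
}
scores = agreement_scores(results)
best = most_reliable(results)
```

Agreement can be misleading when several models share the same error, so it is best combined with other signals where they exist.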

[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0119] This invention relates to a system that incorporates sentiment analysis into the process of creating meeting minutes. The system aims to analyze the user's audio data during meetings and capture not only the content of the utterances but also the emotional information contained within them. As a result, meeting minutes will contain richer information and will become a useful source of information for decision-making and problem identification.

[0120] On the terminal, the user records the audio during the meeting and uploads the data to the server. On the server, this audio data is transcribed by multiple speech recognition models, and at the same time, an emotion engine analyzes the emotions. The emotion engine analyzes the intonation, speed, and tempo of the speech to identify emotions such as joy, anger, and anxiety.

[0121] On the server, sentiment information is tagged onto the transcribed meeting minutes. This allows for later review of the speaker's emotional state. When summarizing the minutes, this sentiment information can be taken into account to adjust the importance and focus of the summary.

[0122] Furthermore, the server, upon receiving relevant materials from users, compares and analyzes them against the meeting minutes and emotional information. This allows for an understanding of how emotions expressed during the meeting influenced awareness and understanding of the agenda items.

[0123] As a concrete example, in a project progress meeting, the emotion engine detects that a speaker is feeling anxious about a particular issue in the audio recording made by the user. The server adds this emotion information to the meeting minutes, performs a more detailed analysis of the issue within the summary, and helps the project manager determine the priority of addressing that issue.

[0124] The implementation of this system will improve the quality of meeting minutes and provide valuable insights for improvement activities and strategic planning across the organization by allowing us to understand the true reactions of participants.

[0125] The following describes the processing flow.

[0126] Step 1:

[0127] Users record audio during meetings using their devices and upload the audio data to the server after the meeting ends.

[0128] Step 2:

[0129] The server sends the uploaded audio data to multiple speech recognition models, and each model performs the transcription.

[0130] Step 3:

[0131] The server simultaneously activates the emotion engine and extracts emotional information from the audio data. During this process, it analyzes emotions using factors such as intonation, volume, and speed.

[0132] Step 4:

[0133] The server aggregates the transcription results from each speech recognition model and generates meeting minutes with added emotional information from the emotion engine.

[0134] Step 5:

[0135] The server uses sentiment information contained in the generated meeting minutes to summarize and adjust the importance and focus of each statement.

[0136] Step 6:

[0137] Users provide relevant materials from their terminals to the server, which then integrates and analyzes them along with meeting minutes and sentiment information.

[0138] Step 7:

[0139] The server receives questions from users, generates answers considering meeting minutes, relevant documents, and sentiment information, and provides them to the users.

[0140] Step 8:

[0141] Based on the information provided by the server, users make decisions necessary for the progress of projects and meetings, and provide feedback as needed. The server uses this feedback to improve the overall accuracy and functionality of the system.
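
The adjustment of importance in Step 5 can be sketched as follows. This is only an illustration: the `Statement` fields, the `base_score` value (assumed to come from some earlier ranking step), and the weights in `EMOTION_WEIGHT` are hypothetical and do not appear in this disclosure.

```python
from dataclasses import dataclass

# Hypothetical weights: statements tagged with stronger negative emotions
# receive more room in the summary. The values are illustrative only.
EMOTION_WEIGHT = {"anxiety": 1.5, "anger": 1.4, "joy": 1.1, "neutral": 1.0}


@dataclass
class Statement:
    speaker: str
    text: str
    emotion: str
    base_score: float  # importance before emotion is considered (assumed given)


def rank_for_summary(statements: list[Statement], top_n: int) -> list[Statement]:
    """Return the top_n statements by emotion-adjusted importance."""
    def adjusted(s: Statement) -> float:
        # Unknown emotion labels fall back to a neutral weight of 1.0.
        return s.base_score * EMOTION_WEIGHT.get(s.emotion, 1.0)

    return sorted(statements, key=adjusted, reverse=True)[:top_n]
```

With these weights, a statement tagged "anxiety" with a base score of 0.6 outranks a neutral statement with a base score of 0.8, which mirrors the idea of giving more attention to issues that participants feel uneasy about.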

[0142] (Example 2)

[0143] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0144] In meetings, there is a growing need to create meeting minutes that not only transcribe participants' statements but also capture the emotions associated with those statements. However, conventional systems lacked the integration of audio transcription and sentiment analysis, making it difficult to grasp both the content of the statements and the emotions behind them.

[0145] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0146] In this invention, the server includes means for receiving audio information from a communication device, means for transmitting the audio information to a plurality of speech recognition devices and receiving text information from each device, and means for analyzing the audio information corresponding to the text information and identifying emotional characteristics. This makes it possible to generate meeting records that capture not only the content of what was said in a meeting, but also the emotions embedded in those statements.

[0147] A "communication device" refers to a device used to receive and transmit audio information.

[0148] "Audio information" refers to audio data that records human speech.

[0149] A "speech recognition device" is a device or program that converts speech information into text format.

[0150] "Textual information" refers to text data converted by a speech recognition device.

[0151] "Emotional characteristics" refer to the psychological state inferred from the intonation, speed, tempo, and other elements of the audio information.

[0152] A "meeting record" is a document that includes the content of what was said during a meeting and the corresponding emotional characteristics.

[0153] "Related information" refers to documents and data related to the meeting content and is used for cross-referencing with the meeting minutes.

[0154] This invention relates to a system for efficiently processing audio information in a meeting and generating a meeting record that includes the content of the speeches and their emotional characteristics. This system mainly consists of the following components.

[0155] The terminal provides a function for users to record audio during meetings. Users can start the recording themselves and upload the audio to the server once it is complete. The terminal also has an audio format conversion function that converts the audio information into a format that the server can process.

[0156] The server receives audio information transmitted from the terminal and converts it into text using multiple speech recognition devices. Specifically, the server utilizes a cloud-based speech recognition service (e.g., a common speech recognition API) to segment the audio data and convert it into text. In addition, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to identify emotional characteristics by analyzing the intonation, speed, tempo, etc., of the voice.
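
The segmentation of the audio data mentioned above could, for instance, use fixed-length windows. The following sketch only computes the window boundaries; the function name and the 30-second default are assumptions, and this disclosure does not state how segments are actually chosen (silence-based splitting would be another option).

```python
def segment_bounds(total_seconds: float, segment_seconds: float = 30.0) -> list[tuple[float, float]]:
    """Split a recording into consecutive (start, end) windows, in seconds."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        # The final window is shortened so that it ends with the recording.
        end = min(start + segment_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds
```

Each (start, end) pair can then be used to cut the corresponding slice from the audio file before it is sent to a speech recognition device.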

[0157] The collected textual information and emotional characteristics are integrated by the server, and meeting records are generated as emotionally tagged text. These records are organized so that important elements are highlighted and useful information is readily apparent when users review them later.

[0158] As a concrete example, in a company project progress meeting, users record discussions about specific issues. Later, a server can detect emotions such as impatience or anxiety, tag them in the text, and reflect them in the meeting record. Such records can serve as a guide for project managers to take swift action.

[0159] An example of a prompt to input into a generative AI model is: "Analyze the meeting audio data and list the emotions detected from all speakers. Analyze particularly emphasized emotions and related statements, and indicate how they might affect project progress." This prompt allows the AI to provide detailed analysis to support user decision-making.
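
One way such a prompt might be assembled in practice is to append the emotion-tagged record to the instruction text. In the sketch below, the instruction is the example quoted above, while the `[speaker][emotion] text` line format and the function name are hypothetical.

```python
INSTRUCTION = (
    "Analyze the meeting audio data and list the emotions detected from all speakers. "
    "Analyze particularly emphasized emotions and related statements, "
    "and indicate how they might affect project progress."
)


def build_prompt(tagged_lines: list[tuple[str, str, str]]) -> str:
    """Combine the instruction with (speaker, emotion, text) lines."""
    body = "\n".join(
        f"[{speaker}][{emotion}] {text}" for speaker, emotion, text in tagged_lines
    )
    return f"{INSTRUCTION}\n\nMeeting record:\n{body}"
```

The resulting string would be passed to whichever generative AI model the server uses; the call to the model itself is omitted because this disclosure does not specify an interface.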

[0160] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0161] Step 1:

[0162] The terminal supports the user in recording audio during a meeting. It receives real-time audio data collected from the meeting as input and saves it as a recorded audio file (e.g., WAV format) as output. Once recording is complete, the terminal prepares to send the audio file to the server.

[0163] Step 2:

[0164] The server receives audio files sent from the terminal. It uses a speech recognition device (e.g., a common speech recognition API) to take the audio file as input and convert it into text information as output. The server segments the audio file and sends each segment to the speech recognition device for sequential transcription.

[0165] Step 3:

[0166] The server identifies emotional characteristics based on audio and text data. It receives audio and text data as input and analyzes them using an emotion analysis engine (e.g., a common emotion analysis tool). As output, it produces the emotional characteristics (e.g., joy, anxiety, anger) of each segment. This process takes into account factors such as the intonation, speed, and tempo of the speech.

[0167] Step 4:

[0168] The server generates meeting records by adding emotional characteristics to textual information. It receives textual information and emotional characteristics as input and creates meeting records with emotional tags as output. The server uses an integration algorithm to consolidate the recorded content and its emotional characteristics.

[0169] Step 5:

[0170] Users view the created meeting records and use them for decision-making and information analysis. The input is the emotion-tagged meeting records provided by the server, and the output is the project management and strategic planning decisions that users base on them. By reviewing the meeting records, users can understand the emotional responses of the participants.
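
The integration in Step 4 can be sketched as attaching to each text segment the emotion whose time span overlaps it the most. The disclosure refers only to "an integration algorithm", so the overlap rule, the data classes, and the "neutral" fallback below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


@dataclass
class EmotionSegment:
    start: float
    end: float
    emotion: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def tag_segments(texts, emotions, default="neutral"):
    """Pair each text segment with the emotion that overlaps it longest."""
    tagged = []
    for t in texts:
        best = max(
            emotions,
            key=lambda e: overlap(t.start, t.end, e.start, e.end),
            default=None,
        )
        if best is None or overlap(t.start, t.end, best.start, best.end) == 0.0:
            label = default  # no emotion result covers this segment
        else:
            label = best.emotion
        tagged.append((label, t.text))
    return tagged
```

Segments that no emotion result covers fall back to the default label rather than inheriting an unrelated tag.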

[0171] (Application Example 2)

[0172] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0173] In meetings and staff gatherings, simply transcribing audio is insufficient to reflect the true intentions and emotions of the speakers. This can lead to a lack of information necessary for post-meeting decision-making and improving dialogue. Therefore, there is a need to create rich meeting minutes that also take into account the emotions of the speakers.

[0174] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0175] In this invention, the server includes means for receiving audio information from a communication device, means for analyzing emotions from the audio information and generating emotional information, and means for presenting the emotional information as feedback on the emotional state of participants after the meeting. This makes it possible to create meeting minutes and interpret intentions while taking into account the emotional state of participants after the meeting.

[0176] A "communication device" is an electronic device that has the function of transmitting and receiving voice information.

[0177] "Audio information" refers to sound data that records what was said during a meeting or conference.

[0178] "Speech recognition technology" refers to algorithms and programs used to transcribe audio data into text.

[0179] "Emotional information" refers to data extracted from audio data that represents the emotional state of the speaker.

[0180] "Meeting minutes" are documents that record the statements and content of a meeting, making them available for later reference.

[0181] A "summary" is information that is compiled by extracting the important points from meeting minutes and presenting them concisely.

[0182] "Information sources" refer to any data that is referenced, including the content of meetings and related materials.

[0183] "Feedback" refers to information provided to participants after a meeting, including reflections and advice regarding their comments and emotional state.

[0184] To implement this invention, a communication terminal is required for recording audio during a meeting or conference. The user uses this communication terminal to collect audio information in real time and transmit it to a server. The server receives the audio information and not only transcribes it using speech recognition technology, but also performs sentiment analysis to extract emotional information.

[0185] Speech recognition technology can utilize widely used libraries such as speech_recognition and Google's speech recognition API. Furthermore, specific algorithms are used to analyze the intonation and speed of speech in order to extract emotional information. These algorithms can not only identify emotions but also evaluate their intensity.

[0186] The generated transcripts and sentiment information are integrated and compared on the server and saved as meeting minutes. This allows users to easily review the minutes after the meeting, and analyze the sentiment information in particular as feedback to help with future decision-making.

[0187] As a concrete example, when discussing project progress in a meeting, not only the speaker's intentions but also their emotions, such as impatience, can be analyzed, allowing for a review of project priorities based on this analysis. An example of a prompt to input into the generative AI model is, "Perform an emotion analysis of this meeting and add emotion tags to the meeting minutes." This prompt enables the server to perform the necessary data processing and analysis.

[0188] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0189] Step 1:

[0190] The terminal records the audio of meetings and conferences in real time and sends it to the server as audio information. The input is the recorded audio data, and the output is the audio information sent to the server. At this stage, acoustic processing such as noise cancellation is important to maintain the quality of the audio.

[0191] Step 2:

[0192] The server transcribes the received audio information using speech recognition technology. The input is the audio information, and the output is the transcribed text. To improve accuracy, the server uses multiple speech recognition models and receives a result from each model.

[0193] Step 3:

[0194] The server analyzes the audio information to extract emotional information. The input is the audio information, and the output is data representing the speaker's emotional state. The emotion engine identifies emotions by analyzing the intonation and tempo of the voice, and generates the emotional information as emotion tags.

[0195] Step 4:

[0196] The server integrates transcript data and sentiment information to generate meeting minutes. The input is the transcript text and sentiment information, and the output is meeting minutes with sentiment tags. Based on the sentiment information, the server considers the importance of each statement and adjusts each section within the meeting minutes accordingly.

[0197] Step 5:

[0198] The server summarizes the generated meeting minutes and extracts key points. The input is the detailed meeting minutes, and the output is the summarized meeting minutes. Sentiment information is used to highlight particularly noteworthy sections.

[0199] Step 6:

[0200] The user inputs prompt text into the server, and a generative AI model is used to obtain feedback that includes emotional information. The input is the prompt text from the user, and the output is the analyzed feedback. Based on this, the server presents the emotional trends during the meeting to the participants.
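
The emotional trends presented in Step 6 could be derived by counting the emotion tags per participant. The sketch below assumes that the emotion-tagged minutes can be reduced to (speaker, emotion) pairs; that shape and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def emotion_trends(tagged):
    """Count emotion tags per speaker and report each speaker's most frequent one.

    tagged is an iterable of (speaker, emotion) pairs, a hypothetical
    reduction of the emotion-tagged meeting minutes.
    """
    counts = defaultdict(Counter)
    for speaker, emotion in tagged:
        counts[speaker][emotion] += 1
    return {
        speaker: {"dominant": c.most_common(1)[0][0], "counts": dict(c)}
        for speaker, c in counts.items()
    }
```

The per-speaker counts can be shown to participants directly or included in the prompt text so that the generative AI model can phrase the feedback.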

[0201] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0202] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0203] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0204] [Second Embodiment]

[0205] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0206] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0207] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0208] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0209] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. It converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0210] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0211] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0212] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0213] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0214] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0215] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0216] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0217] This invention relates to a system that analyzes audio data from meetings and other similar events from multiple perspectives and automatically generates highly accurate meeting minutes. This system is implemented using three main components: a communication terminal, a server, and a user.

[0218] Users record audio during meetings using communication devices. This recorded audio data is uploaded to a server after the meeting ends. The server uses multiple speech recognition models to analyze this audio data and perform transcription. The transcription results from the different speech recognition models are aggregated, compared, and integrated on the server to create a single, unified meeting transcript.

[0219] To improve the accuracy of the meeting minutes, the server evaluates the reliability of each transcript and selects the optimal combination. Next, the server summarizes the generated minutes, extracting and organizing key points. Users can provide relevant materials to the server as needed, which analyzes them, compares them with the minutes, and evaluates their relevance to gain further insights.
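
The relevance evaluation between the minutes and the provided materials is not specified further. One simple possibility, shown here purely as an assumption, is the cosine similarity of word-count vectors. The tokenizer below handles only ASCII letters and digits, so Japanese text would need a proper tokenizer instead, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts (ASCII letters and digits only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors; 0.0 when either text is empty."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rank_materials(minutes: str, materials: dict[str, str]) -> list[tuple[str, float]]:
    """Order the materials from most to least similar to the minutes."""
    scored = [(name, cosine_similarity(minutes, text)) for name, text in materials.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Materials with a score of 0.0 share no words with the minutes and could be excluded from the comparison.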

[0220] In this system, users can ask questions through a server interface, and the server returns answers to those questions. This allows users to obtain information efficiently and supports their decision-making.

[0221] As a concrete example, consider a company's weekly meeting. Users install a dedicated application on their smartphones, record the meeting discussions, and transfer the recordings to a server. The server immediately analyzes the audio and generates a set of text data using speech recognition models A, B, and C. The server then selects the most accurate text data and summarizes it as meeting minutes. Users can use the server's question-and-answer function to review the conversation after the meeting or to inquire about specific points of discussion. In this way, the recording and analysis of meetings are centralized, which contributes to organizational information management and decision-making.

[0222] The following describes the processing flow.

[0223] Step 1:

[0224] Users record audio on their devices during meetings and upload the audio data to the server after the meeting ends.

[0225] Step 2:

[0226] The server sends the received audio data to multiple speech recognition models, each of which analyzes the audio data and generates a transcription result.

[0227] Step 3:

[0228] The server receives transcription results from each speech recognition model, compares and analyzes them to evaluate their reliability, and then integrates the most accurate text to create a single meeting transcript.

[0229] Step 4:

[0230] The server inputs the generated meeting minutes into a summarization model, creates a summary, and extracts the key points.

[0231] Step 5:

[0232] Users upload relevant materials used in meetings from their devices to the server, which then compares the meeting minutes with the materials, assesses their relevance, and generates additional insights.

[0233] Step 6:

[0234] The user enters a question through the server interface, and the server generates an answer based on meeting minutes and related documents and sends it back to the user.

[0235] Step 7:

[0236] Users provide feedback to the server, which then uses this feedback to improve the speech recognition and meeting minute creation processes.
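
Step 3 evaluates the reliability of each transcription result without stating how. One possible stand-in, used here as an assumption rather than as the method of this disclosure, is to treat the transcript that agrees most with the others as the most reliable. The sketch uses `difflib.SequenceMatcher` from the Python standard library on word lists; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def agreement_scores(transcripts: dict[str, str]) -> dict[str, float]:
    """Score each model's transcript by its mean word-level similarity to the others."""
    scores = {}
    for name, text in transcripts.items():
        others = [t for n, t in transcripts.items() if n != name]
        if not others:
            scores[name] = 1.0  # a single transcript trivially agrees with itself
            continue
        words = text.split()
        sims = [SequenceMatcher(None, words, o.split()).ratio() for o in others]
        scores[name] = sum(sims) / len(sims)
    return scores


def most_consistent(transcripts: dict[str, str]) -> str:
    """Return the name of the model whose transcript agrees most with the rest."""
    scores = agreement_scores(transcripts)
    return max(scores, key=scores.get)
```

This approach needs no confidence values from the models, but it cannot detect an error that all models make in the same way.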

[0237] (Example 1)

[0238] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0239] In recent years, the importance of recording meetings and discussions has increased, but traditional manual recording methods often lack accuracy and speed. Furthermore, relying on a single speech recognition technology when converting audio data into text increases the risk of misrecognition. As a result, information may be lost or inaccurate. In addition, there is a lack of efficient methods for grasping the important points discussed in meetings, and there is a challenge in extracting important information from information overload. This invention aims to solve these problems and improve the accuracy and efficiency of meeting records.

[0240] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0241] In this invention, the server includes means for acquiring audio information from a communication device, means for supplying the audio information to multiple speech recognition algorithms and acquiring text output from each algorithm, and means for comparing and integrating the multiple text outputs to create a single document. This makes it possible to analyze the audio information from multiple perspectives by processing it with multiple recognition algorithms and efficiently generate a single integrated record. Furthermore, by using a generative AI model to examine the information in the document according to its importance, high-quality summaries and extraction of related information can be achieved.

[0242] A "communication device" is a general term for devices that have the function of acquiring audio information and transmitting it to a server.

[0243] "Audio information" refers to audio data acquired during meetings and discussions, which is then used to create text.

[0244] A "speech recognition algorithm" refers to various analytical methods used to convert speech information into text.

[0245] "Text output" refers to text data generated by analyzing audio information using a speech recognition algorithm.

[0246] A "document" is a record generated from comparative and integrated text output, and is a consistent form of representation of information, including meeting content.

[0247] "Digital processing" refers to a series of data adjustment operations performed on audio information, such as noise reduction and sound quality correction.

[0248] A "security protocol" is a set of rules that includes communication procedures to ensure security during the transmission of voice information.

[0249] A "generative AI model" is a type of artificial intelligence technology used for purposes such as summarizing text data and extracting important information.

[0250] One embodiment of the present invention is a system that analyzes conference audio from multiple angles and automatically generates high-quality meeting minutes. This system mainly consists of three elements: a communication device, a server, and a user.

[0251] Users record audio information during meetings using a communication device. This device is a smartphone or tablet with a dedicated application installed, which records the audio in digital format and sends it to the server after the meeting ends. The files are encrypted using a secure protocol for information protection, ensuring security during transmission.

[0252] The server is equipped with multiple speech recognition algorithms and analyzes the received audio information. Specifically, it performs digital processing, noise reduction, and sound quality adjustment before feeding the information to the speech recognition algorithms. In this step, different algorithms such as Model A, Model B, and Model C are used. Each algorithm generates a different text output, and the results are integrated to create a single document. This document is further summarized by the server, and important information is extracted using a generative AI model.
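
The digital processing before recognition is described only as noise reduction and sound quality adjustment. As a rough illustration, the sketch below substitutes two simple techniques: a noise gate that silences very quiet samples, followed by peak normalization. The function name, the sample representation (floats in the range -1.0 to 1.0), and the default thresholds are assumptions.

```python
def preprocess(samples: list[float], gate: float = 0.02, target_peak: float = 0.9) -> list[float]:
    """Zero out samples below the gate, then scale so the loudest sample reaches target_peak."""
    # Noise gate: treat anything quieter than `gate` as background noise.
    gated = [s if abs(s) >= gate else 0.0 for s in samples]
    peak = max((abs(s) for s in gated), default=0.0)
    if peak == 0.0:
        return gated  # silence (or empty input): nothing to normalize
    scale = target_peak / peak
    return [s * scale for s in gated]
```

A production system would more likely apply spectral noise suppression, but the order shown here (clean first, then adjust level) matches the description above.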

[0253] As a concrete example, consider a company's weekly meeting, where the user records the meeting using a dedicated application. Once the recording is complete, the audio file is sent to the server, and analysis begins immediately. The server integrates the text generated from the audio and compiles it into meeting minutes. If the user wants to confirm a specific point of discussion after the meeting, they can enter a prompt such as, "Please tell me about the budget plan discussed in this meeting," and the server's AI model will provide an appropriate answer.
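
Answering such a question typically requires choosing which parts of the minutes to pass to the AI model. The following sketch selects passages by simple word overlap with the question and builds a prompt from them. The selection rule, the prompt wording, and the function names are hypothetical; the disclosure does not describe how the server narrows down the minutes.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-cased set of words (ASCII letters and digits only)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_passages(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k passages sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p)), i, p) for i, p in enumerate(passages)]
    hits = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep the original order of the minutes.
    hits.sort(key=lambda item: (-item[0], item[1]))
    return [p for _, _, p in hits[:top_k]]


def build_qa_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt containing only the selected passages and the question."""
    context = "\n".join(f"- {p}" for p in select_passages(question, passages))
    return f"Answer using only the meeting minutes below.\n{context}\n\nQuestion: {question}"
```

Because common words such as "the" also count toward the overlap, a real system would filter stop words or use a stronger retrieval method.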

[0254] This makes it easier for users to understand meeting content and quickly obtain the information necessary for decision-making. The entire system enables centralized information management, supporting the efficient operation of the organization.

[0255] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0256] Step 1:

[0257] The user records audio information during a meeting using a communication device. Once recording is complete, the device sends the audio file to the server. The input is audio information in digital format, and the output is the audio file transferred to the server. The specific action taken in this case is tapping the "Recording Complete" button within the application.

[0258] Step 2:

[0259] The server performs digital processing on the received audio files. Specifically, it removes noise and adjusts the sound quality to make them suitable for analysis. The input is the audio file sent to the server, and the output is the digitally processed audio data. The server aims to improve the accuracy of the analysis through this process.

[0260] Step 3:

[0261] The server supplies processed audio data to multiple speech recognition algorithms. Specifically, it inputs the audio data into each algorithm (e.g., Model A, Model B, Model C) and generates text output. The input is digitally processed audio data, and the output is a set of text outputs from each algorithm.

[0262] Step 4:

[0263] The server compares and integrates multiple text outputs obtained from speech recognition algorithms to create a single document. Specifically, it selects the most suitable text output based on its reliability and combines them into a consistent document. The input is a set of text outputs, and the output is the integrated document.

[0264] Step 5:

[0265] The server summarizes the integrated document and extracts key information. This specifically involves using a generative AI model to analyze the text and concisely summarize the essential points. The input is the integrated document, and the output is the summarized key points.

[0266] Step 6:

[0267] The user inputs a specific question using the server interface, and the server generates an answer based on that. Specifically, the AI model analyzes the input prompt and provides an answer. The input is the prompt from the user, and the output is the generated answer.
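
The selection and combination in Step 4 can be sketched as follows, under the assumption that each algorithm returns a confidence value with the text of every audio segment. The `Hypothesis` fields and the use of a per-segment maximum are illustrative; the disclosure states only that the most suitable text output is selected based on its reliability.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    model: str         # e.g. "A", "B", or "C"
    text: str
    confidence: float  # assumed to be reported by each model, in [0, 1]


def integrate(segments: list[list[Hypothesis]]) -> str:
    """For each audio segment keep the highest-confidence text, then join in order."""
    chosen = []
    for candidates in segments:
        if not candidates:
            continue  # no model returned text for this segment
        best = max(candidates, key=lambda h: h.confidence)
        chosen.append(best.text)
    return " ".join(chosen)
```

Selecting per segment rather than per recording lets the integrated document draw on different algorithms for different parts of the meeting.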

[0268] (Application Example 1)

[0269] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0270] There is a need to efficiently record audio data generated during team meetings and customer interactions in stores, organize key points, and utilize it as information useful for decision-making. However, manually processing the large volume of audio data generated in these operations is time-consuming and labor-intensive, and presents challenges in terms of accuracy and comprehensiveness. Therefore, there is a need to develop a system that can automatically analyze audio data with high accuracy and provide useful information for store operations quickly and effectively.

[0271] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0272] In this invention, the server includes means for receiving acoustic data from a communication device, means for transmitting the acoustic data to multiple acoustic recognition models and receiving string conversion results from each model, and means for comparing and integrating the multiple string conversion results to generate a single record document. This enables automatic recording of audio and extraction of important information in stores.

[0273] A "communication device" is a device used to record audio data and transmit it to other systems.

[0274] "Audio data" refers to data that records sound in digital format.

[0275] An "acoustic recognition model" is a set of algorithms that analyze acoustic data and convert it into a string format.

[0276] "String conversion result" refers to text data generated by the acoustic recognition model.

[0277] A "record document" is a document created by integrating the results of multiple string conversions.

[0278] "Related information" refers to external data that is cross-referenced with the record document to gain additional insights.

[0279] "User" refers to a person or organization that uses the system to process audio data and obtain information.

[0280] "Feedback" refers to opinions and reactions obtained from customers and users.

[0281] To implement this invention, it is necessary to construct a system including a server, communication terminals, and users. In this system, the user uses the communication terminal to record meetings and conversations with customers in the store. The recorded audio data is transmitted from the communication terminal to the server.

[0282] The server passes the received audio data to multiple speech recognition models and obtains a string conversion result from each model. Services such as the Google Speech-to-Text API and Amazon Transcribe can be used as the speech recognition models, for example. This achieves highly accurate speech-to-text conversion of the audio data.

[0283] Next, the server compares these string conversion results, selects and integrates the most suitable parts, and generates a single record document. The server then summarizes the record document using natural language processing (NLP) tools and extracts the important information elements. NLTK and spaCy are examples of NLP tools that could be used here.

[0284] The generated summary is provided to the user, and the user can make a decision based on it. As a specific example, when discussing changes in product layouts in a clothing store, important points are summarized from the recording of the discussion and provided to the staff as additional materials.

[0285] An example of a prompt for the generative AI model is: "Analyze the audio data of the store meeting and appropriately transcribe the discussion on the product lineup." This allows the audio data to be analyzed efficiently without human intervention and provides information useful for store operation.

[0286] The flow of the specific processing in Application Example 1 will be described using Figure 12.

[0287] Step 1:

[0288] The user uses the terminal to record a conversation in the store. At this time, the terminal records the conversation as audio data. The input is audio data, and the output is the audio data stored in the terminal. The terminal prepares to send this audio data to the next step.

[0289] Step 2:

[0290] The terminal sends the recorded audio data to the server. The input is the audio data stored in the terminal, and the output is the audio data received by the server. The server prepares to pass this data to the acoustic recognition model.

[0291] Step 3:

[0292] The server inputs the audio data into a plurality of acoustic recognition models and obtains the string conversion results from each model. The input is the audio data stored in the server, and the output is the string conversion result generated by the acoustic recognition model. The server aggregates the string conversion results from different models.

[0293] Step 4:

[0294] The server compares and integrates the obtained string conversion results to generate a single recorded document. In this process, the accuracy and reliability of the results of each model are evaluated. The input is a plurality of string conversion results, and the output is the integrated recorded document. The server constructs the recorded document based on the most reliable information.

[0295] Step 5:

[0296] The server summarizes the recorded document using natural language processing (NLP) tools and extracts key information elements. The input is the integrated recorded document, and the output is the summarized information elements. The server uses NLP tools to organize the key points of the document.

[0297] Step 6:

[0298] The server provides the user with a generated summary. The input is the summarized information element, and the output is the summary information received by the user. The user can use this information to make decisions.
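
Steps 3 and 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this disclosure: it assumes that every acoustic recognition model returns one hypothesis per audio segment together with a confidence score, and it keeps the highest-scoring hypothesis for each segment. The `Hypothesis` type, the model names, and the confidence values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One model's text for one audio segment (hypothetical structure)."""
    model: str
    text: str
    confidence: float  # assumed to lie in [0, 1]


def integrate(results: dict[str, list[Hypothesis]]) -> list[Hypothesis]:
    """Pick, for each segment, the hypothesis with the highest confidence.

    Assumes every model produced the same number of segments in the same order.
    """
    per_model = list(results.values())
    if not per_model:
        return []
    if len({len(h) for h in per_model}) != 1:
        raise ValueError("all models must return the same number of segments")
    return [max(candidates, key=lambda h: h.confidence)
            for candidates in zip(*per_model)]


def to_record_document(segments: list[Hypothesis]) -> str:
    """Join the selected segment texts into a single record document."""
    return " ".join(h.text for h in segments)


# Hypothetical outputs of two models for the same two segments.
results = {
    "model_a": [Hypothesis("model_a", "Move the coats to the entrance.", 0.91),
                Hypothesis("model_a", "Sails rose last week.", 0.55)],
    "model_b": [Hypothesis("model_b", "Move the codes to the entrance.", 0.62),
                Hypothesis("model_b", "Sales rose last week.", 0.88)],
}
record = to_record_document(integrate(results))
```

Selecting per segment rather than per transcript lets the record document combine the strongest parts of each model's output, which is one way to read "selects and integrates the most suitable parts."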

[0299] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0300] This invention relates to a system that incorporates sentiment analysis into the process of creating meeting minutes. The system aims to analyze the user's audio data during meetings and capture not only the content of the utterances but also the emotional information contained within them. As a result, meeting minutes will contain richer information and will become a useful source of information for decision-making and problem identification.

[0301] On the terminal, the user records the audio during the meeting and uploads the data to the server. On the server, this audio data is transcribed by multiple speech recognition models, and at the same time, an emotion engine analyzes the emotions. The emotion engine analyzes the intonation, speed, and tempo of the speech to identify emotions such as joy, anger, and anxiety.

[0302] On the server, sentiment information is tagged onto the transcribed meeting minutes. This allows for later review of the speaker's emotional state. When summarizing the minutes, this sentiment information can be taken into account to adjust the importance and focus of the summary.

[0303] In addition, when the server receives relevant materials from the user, it matches them against the content of the meeting minutes and the sentiment information and analyzes the result. This makes it possible to grasp how the sentiment expressed during the meeting affects awareness and understanding of the topic.

[0304] As a specific example, in a project progress meeting, the emotion engine detects from the audio recorded by the user that a certain speaker feels impatient about a particular problem. The server adds this sentiment information to the meeting minutes, analyzes that problem in more detail in the summary, and helps the project manager decide how to prioritize the response to it.

[0305] By introducing this system, the quality of the minutes of the meeting can be improved, and by grasping the true reactions of the participants, valuable insights for improvement activities and strategic planning across the organization can be provided.

[0306] The following describes the processing flow.

[0307] Step 1:

[0308] The user uses the terminal to record the voice during the meeting and uploads the voice data to the server after the meeting ends.

[0309] Step 2:

[0310] The server sends the uploaded voice data to multiple speech recognition models, and each model performs speech-to-text conversion.

[0311] Step 3:

[0312] The server simultaneously activates the sentiment engine and extracts sentiment information from the voice data. At this time, sentiment is analyzed using intonation, strength, speed, etc. of the voice.

[0313] Step 4:

[0314] The server aggregates the transcription results from each speech recognition model and generates meeting minutes with added emotional information from the emotion engine.

[0315] Step 5:

[0316] The server uses sentiment information contained in the generated meeting minutes to summarize and adjust the importance and focus of each statement.

[0317] Step 6:

[0318] Users provide relevant materials from their terminals to the server, which then integrates and analyzes them along with meeting minutes and sentiment information.

[0319] Step 7:

[0320] The server receives questions from users, generates answers considering meeting minutes, relevant documents, and sentiment information, and provides them to the users.

[0321] Step 8:

[0322] Based on the information provided by the server, users make decisions necessary for the progress of projects and meetings, and provide feedback as needed. The server uses this feedback to improve the overall accuracy and functionality of the system.
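
Steps 4 and 5 of this flow can be illustrated with the sketch below. It assumes the emotion engine has already attached an emotion label and an intensity to each statement, and it uses those tags to decide which statements the summary should focus on. The `Statement` type, the weight table, and the scoring formula are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical weights: emotions that may signal a problem count more.
EMOTION_WEIGHT = {"anger": 1.5, "anxiety": 1.4, "impatience": 1.4,
                  "joy": 1.1, "neutral": 1.0}


@dataclass
class Statement:
    speaker: str
    text: str
    emotion: str      # label assumed to come from the emotion engine
    intensity: float  # assumed to lie in [0, 1]


def importance(s: Statement) -> float:
    """Score a statement; unknown emotion labels fall back to weight 1.0."""
    return EMOTION_WEIGHT.get(s.emotion, 1.0) * (1.0 + s.intensity)


def select_for_summary(minutes: list[Statement], top_n: int) -> list[Statement]:
    """Keep the top_n most important statements, in their original order."""
    ranked = sorted(range(len(minutes)),
                    key=lambda i: importance(minutes[i]), reverse=True)
    keep = sorted(ranked[:top_n])
    return [minutes[i] for i in keep]


minutes = [
    Statement("A", "The schedule is on track.", "neutral", 0.1),
    Statement("B", "The vendor delay worries me.", "anxiety", 0.8),
    Statement("C", "We must fix the login bug now.", "impatience", 0.9),
]
focus = select_for_summary(minutes, top_n=2)
```

Keeping the selected statements in their original order preserves the flow of the discussion in the summary.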

[0323] (Example 2)

[0324] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0325] In meetings, there is a growing need to create meeting minutes that not only transcribe participants' statements but also capture the emotions associated with those statements. However, conventional systems lacked the integration of audio transcription and sentiment analysis, making it difficult to grasp both the content of the statements and the emotions behind them.

[0326] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0327] In this invention, the server includes means for receiving audio information from a communication device, means for transmitting the audio information to a plurality of speech recognition devices and receiving textual information from each device, and means for analyzing the audio information corresponding to the textual information and identifying emotional characteristics. This makes it possible to generate meeting minutes that capture not only the content of what was said in a meeting, but also the emotions embedded in those statements.

[0328] A "communication device" refers to a device used to receive and transmit audio information.

[0329] "Audio information" refers to audio data that records human speech.

[0330] A "speech recognition device" is a device or program that converts speech information into text format.

[0331] "Textual information" refers to text data converted by a speech recognition device.

[0332] "Emotional characteristics" refer to the psychological state inferred from the intonation, speed, tempo, and other elements contained in audio information.

[0333] A "meeting record" is a document that includes the content of what was said during a meeting and the corresponding emotional characteristics.

[0334] "Related information" refers to documents and data related to the meeting content and is used for cross-referencing with the meeting minutes.

[0335] This invention relates to a system for efficiently processing audio information in a meeting and generating a meeting record that includes the content of the speeches and their emotional characteristics. This system mainly consists of the following components.

[0336] The terminal provides a function for users to record audio during meetings. Users can start recording themselves and upload the audio to the server after completion. The terminal also has an audio format conversion function that converts the audio information into a format that the server can process.

[0337] The server receives audio information transmitted from the terminal and converts it into text using multiple speech recognition devices. Specifically, the server utilizes a cloud-based speech recognition service (e.g., a common speech recognition API) to segment the audio data and convert it into text. In addition, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to identify emotional characteristics by analyzing the intonation, speed, tempo, etc., of the voice.

[0338] The collected textual information and emotional characteristics are integrated by the server, and meeting records are generated as emotionally tagged text. These records are organized so that important elements are highlighted and useful information is readily apparent when users review them later.

[0339] As a concrete example, in a company project progress meeting, users record discussions about specific issues. Later, a server can detect emotions such as impatience or anxiety, tag them in the text, and reflect them in the meeting record. Such records can serve as a guide for project managers to take swift action.

[0340] Another example of a prompt to input into a generative AI model is: "Analyze the meeting audio data and list the emotions detected from all speakers. Analyze particularly emphasized emotions and related statements, and indicate how they might affect project progress." This prompt allows the AI to provide detailed analysis to support user decision-making.

[0341] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0342] Step 1:

[0343] The terminal supports the user in recording audio during a meeting. It receives real-time audio data collected from the meeting as input and saves it as a recorded audio file (e.g., WAV format) as output. Once recording is complete, the terminal prepares to send the audio file to the server.

[0344] Step 2:

[0345] The server receives audio files sent from the terminal. It uses a speech recognition device (e.g., a common speech recognition API) to take the audio file as input and convert it into text information as output. The server segments the audio file and sends each segment to the speech recognition device for sequential transcription.

[0346] Step 3:

[0347] The server identifies emotional characteristics based on audio and text data. It receives audio and text data as input and analyzes the data using an emotion analysis engine (e.g., a common emotion analysis tool). The output is the emotional characteristics (e.g., joy, anxiety, anger) extracted for each segment. This process takes into account factors such as intonation, speed, and tempo of the speech.

[0348] Step 4:

[0349] The server generates meeting records by adding emotional characteristics to textual information. It receives textual information and emotional characteristics as input and creates meeting records with emotional tags as output. The server uses an integration algorithm to consolidate the recorded content and its emotional characteristics.

[0350] Step 5:

[0351] Users view the created meeting records and use them for decision-making and information analysis. As input, they receive sentiment-tagged meeting records provided by the server, and as output, they use this as a basis for project management and strategic planning decisions. Through reviewing the meeting records, users can understand the emotional responses of the participants.
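
The segmentation in Step 2 and the emotion-tagged record in Step 4 can be sketched as follows. This is an illustrative outline only: the 16 kHz frame rate, the 30-second segment length, the record line format, and the sample texts and emotion labels are assumptions, and the transcription and emotion analysis themselves are treated as already done.

```python
def segment_bounds(total_frames: int, frame_rate: int,
                   seconds: float) -> list[tuple[int, int]]:
    """Split a recording into consecutive (start, end) frame ranges."""
    step = int(frame_rate * seconds)
    if step <= 0:
        raise ValueError("segment length must be positive")
    return [(start, min(start + step, total_frames))
            for start in range(0, total_frames, step)]


def timestamp(frame: int, frame_rate: int) -> str:
    """Format a frame offset as MM:SS."""
    minutes, secs = divmod(frame // frame_rate, 60)
    return f"{minutes:02d}:{secs:02d}"


def tagged_record(bounds, texts, emotions, frame_rate: int) -> list[str]:
    """Combine per-segment text and emotion labels into record lines."""
    return [f"[{timestamp(start, frame_rate)}] ({emotion}) {text}"
            for (start, _), text, emotion in zip(bounds, texts, emotions)]


# A hypothetical 70-second recording split into 30-second segments.
bounds = segment_bounds(total_frames=70 * 16000, frame_rate=16000, seconds=30)
record = tagged_record(
    bounds,
    texts=["Testing is behind schedule.", "We need two more engineers.",
           "Agreed, let us decide today."],
    emotions=["anxiety", "impatience", "joy"],
    frame_rate=16000,
)
```

The last segment is shorter than the others because the recording length is not a multiple of the segment length.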

[0352] (Application Example 2)

[0353] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".

[0354] In meetings and staff gatherings, simply transcribing audio is insufficient to reflect the true intentions and emotions of the speakers. This can lead to a lack of information necessary for post-meeting decision-making and improving dialogue. Therefore, there is a need to create rich meeting minutes that also take into account the emotions of the speakers.

[0355] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0356] In this invention, the server includes means for receiving audio information from a communication device, means for analyzing emotions from the audio information and generating emotional information, and means for presenting the emotional information as feedback on the emotional state of participants after the meeting. This makes it possible to create meeting minutes and interpret intentions while taking into account the emotional state of participants after the meeting.

[0357] A "communication device" is an electronic device that has the function of transmitting and receiving voice information.

[0358] "Audio information" refers to sound data that records what was said during a meeting or conference.

[0359] "Speech recognition technology" refers to algorithms and programs used to transcribe audio data into text.

[0360] "Emotional information" refers to data extracted from audio data that represents the emotional state of the speaker.

[0361] "Meeting minutes" are documents that record the statements and content of a meeting, making them available for later reference.

[0362] A "summary" is information that is compiled by extracting the important points from meeting minutes and presenting them concisely.

[0363] "Information sources" refer to any data that is referenced, including the content of meetings and related materials.

[0364] "Feedback" refers to information provided to participants after a meeting, including reflections and advice regarding their comments and emotional state.

[0365] To implement this invention, a communication terminal is required for recording audio during a meeting or conference. The user uses this communication terminal to collect audio information in real time and transmit it to a server. The server receives the audio information and not only transcribes it using speech recognition technology, but also performs sentiment analysis to extract emotional information.

[0366] Speech recognition technology can utilize widely used libraries such as speech_recognition and Google's speech recognition API. Furthermore, specific algorithms are used to analyze the intonation and speed of speech in order to extract emotional information. These algorithms can not only identify emotions but also evaluate their intensity.
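
As a rough illustration of how speech speed and loudness might be turned into an intensity value, the sketch below computes two simple features and compares them with a per-speaker baseline. It is not the algorithm of this disclosure: the feature choice, the baseline values, and the equal weighting are assumptions, and intonation (pitch) analysis is omitted entirely.

```python
import math


def rms_energy(samples: list[float]) -> float:
    """Root-mean-square amplitude of a segment (a rough loudness measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def speaking_rate(text: str, duration_s: float) -> float:
    """Words per second, using whitespace tokenization (assumes spaced text)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(text.split()) / duration_s


def intensity(rms: float, rate: float,
              baseline_rms: float, baseline_rate: float) -> float:
    """Map deviation above a speaker's baseline to a score in [0, 1].

    The equal weighting of loudness and speed is an arbitrary assumption.
    """
    louder = max(0.0, rms / baseline_rms - 1.0)
    faster = max(0.0, rate / baseline_rate - 1.0)
    return min(1.0, 0.5 * louder + 0.5 * faster)


# Hypothetical segment: louder and faster than the speaker's usual speech.
rms = rms_energy([0.5, -0.5, 0.5, -0.5])
rate = speaking_rate("we are running out of time", duration_s=2.0)
score = intensity(rms, rate, baseline_rms=0.25, baseline_rate=2.0)
```

Comparing against a per-speaker baseline rather than a fixed threshold avoids labeling naturally loud or fast speakers as emotional.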

[0367] The generated transcripts and sentiment information are integrated and compared on the server and saved as meeting minutes. This allows users to easily review the minutes after the meeting, and analyze the sentiment information in particular as feedback to help with future decision-making.

[0368] As a concrete example, when discussing project progress in a meeting, not only the speaker's intentions but also their emotions, such as impatience, can be analyzed, allowing for a review of project priorities based on this analysis. An example of a prompt to input into the generative AI model is, "Perform an emotion analysis of this meeting and add emotion tags to the meeting minutes." This prompt enables the server to perform the necessary data processing and analysis.

[0369] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0370] Step 1:

[0371] The terminal records the audio of meetings and conferences in real time and sends it to the server as audio information. The input is the recorded audio data, and the output is the audio information sent to the server. At this stage, acoustic processing such as noise cancellation is important to maintain the quality of the audio.

[0372] Step 2:

[0373] The server inputs the received audio information into speech recognition technology and performs transcription. The input is audio information, and the output is the transcribed text. The server uses multiple speech recognition models and receives results from each model to improve accuracy.

[0374] Step 3:

[0375] The server analyzes audio information to extract emotional information. The input is audio information, and it identifies emotions by analyzing the intonation and tempo of the voice. The output is data representing the speaker's emotional state. The emotion engine analyzes the audio patterns and generates emotional information as emotion tags.

[0376] Step 4:

[0377] The server integrates transcript data and sentiment information to generate meeting minutes. The input is the transcript text and sentiment information, and the output is meeting minutes with sentiment tags. Based on the sentiment information, the server considers the importance of each statement and adjusts each section within the meeting minutes accordingly.

[0378] Step 5:

[0379] The server summarizes the generated meeting minutes and extracts key points. The input is the detailed meeting minutes, and the output is the summarized meeting minutes. Sentiment information is used to highlight particularly noteworthy sections.

[0380] Step 6:

[0381] The user inputs prompt text into the server, and a generative AI model is used to obtain feedback that includes emotional information. The input is the prompt text from the user, and the output is the analyzed feedback. Based on this, the server presents the emotional trends during the meeting to the participants.
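
The "emotional trends" presented in Step 6 could be derived from the emotion-tagged minutes in many ways. One simple possibility, shown below as a sketch, is to count the emotion tags per speaker. The tuple layout, speaker names, and sample statements are hypothetical.

```python
from collections import Counter, defaultdict


def emotion_trends(tagged_minutes):
    """Count emotion tags per speaker.

    tagged_minutes: iterable of (speaker, emotion, text) tuples.
    """
    trends = defaultdict(Counter)
    for speaker, emotion, _text in tagged_minutes:
        trends[speaker][emotion] += 1
    return trends


def dominant_emotions(trends):
    """Return each speaker's most frequent emotion tag."""
    return {speaker: counts.most_common(1)[0][0]
            for speaker, counts in trends.items()}


tagged_minutes = [
    ("Sato", "anxiety", "The test environment is still unstable."),
    ("Tanaka", "impatience", "We cannot wait another week."),
    ("Sato", "anxiety", "I am not sure the fix will hold."),
    ("Sato", "joy", "The new build passed the smoke test."),
]
trends = emotion_trends(tagged_minutes)
summary = dominant_emotions(trends)
```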

[0382] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0383] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0384] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0385] [Third Embodiment]

[0386] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0387] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0388] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0389] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0390] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0391] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0392] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0393] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0394] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0395] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0396] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0397] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0398] This invention relates to a system that analyzes audio data from meetings and other similar events from multiple perspectives and automatically generates highly accurate meeting minutes. This system is implemented using three main components: a communication terminal, a server, and a user.

[0399] Users record audio during meetings using communication devices. This recorded audio data is uploaded to a server after the meeting ends. The server uses multiple speech recognition models to analyze this audio data and perform transcription. The transcription results from the different speech recognition models are aggregated, compared, and integrated on the server to create a single, unified meeting transcript.

[0400] To improve the accuracy of the meeting minutes, the server evaluates the reliability of each transcript and selects the optimal combination. Next, the server summarizes the generated minutes, extracting and organizing key points. Users can provide relevant materials to the server as needed, which analyzes them, compares them with the minutes, and evaluates their relevance to gain further insights.

[0401] In this system, users can ask questions through a server interface, and the server returns answers to those questions. This allows users to obtain information efficiently and supports their decision-making.

[0402] As a concrete example, consider a company's weekly meeting. Users install a dedicated application on their smartphones, record the meeting discussions, and transfer the recordings to a server. The server immediately analyzes the audio and generates a set of text data using speech recognition models A, B, and C. The server then selects the most accurate text data and summarizes it as meeting minutes. Users can use the server's question-and-answer function to review the conversation after the meeting or to inquire about specific points of discussion. In this way, the recording and analysis of meetings are centralized, aiming to contribute to organizational information management and decision-making.
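
One way the question-answering function might prepare its input is sketched below: passages from the meeting minutes and related materials are ranked by word overlap with the question, and the best matches are placed into a prompt for a generative AI model. The stop-word list, the overlap scoring, and the prompt wording are all assumptions made for illustration; the call to the model itself is left out.

```python
import re

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "when", "what", "will"}


def tokenize(text: str) -> set[str]:
    """Lowercase word set without stop words (adequate for English only)."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by the number of words shared with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p)), i) for i, p in enumerate(passages)]
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    return [passages[i] for score, i in best if score > 0]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a prompt for a generative AI model (wording is illustrative)."""
    joined = "\n".join(f"- {c}" for c in context)
    return ("Answer the question using only the meeting material below.\n"
            f"Material:\n{joined}\n"
            f"Question: {question}")


passages = [
    "The release date was moved to 15 May.",
    "Marketing will prepare the launch campaign.",
    "Lunch will be provided at the next meeting.",
]
context = retrieve("When is the release date?", passages)
prompt = build_prompt("When is the release date?", context)
```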

[0403] The following describes the processing flow.

[0404] Step 1:

[0405] Users record audio on their devices during meetings and upload the audio data to the server after the meeting ends.

[0406] Step 2:

[0407] The server sends the received audio data to multiple speech recognition models, each of which analyzes the audio data and generates a transcription result.

[0408] Step 3:

[0409] The server receives transcription results from each speech recognition model, compares and analyzes them to evaluate their reliability, and then integrates the most accurate text to create a single meeting transcript.

[0410] Step 4:

[0411] The server inputs the generated meeting minutes into a summarization model, creates a summary, and extracts the key points.

[0412] Step 5:

[0413] Users upload relevant materials used in meetings from their devices to the server, which then compares the meeting minutes with the materials, assesses their relevance, and generates additional insights.

[0414] Step 6:

[0415] The user enters a question through the server interface, and the server generates an answer based on meeting minutes and related documents and sends it back to the user.

[0416] Step 7:

[0417] Users provide feedback to the server, which then uses this feedback to improve the speech recognition and meeting minute creation processes.
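
The reliability evaluation in Step 3 is not specified in detail. As one possible reading, the sketch below treats agreement between models as a proxy for reliability: each transcript is scored by its average similarity to the others, and the best-agreeing transcript is selected. The use of `difflib.SequenceMatcher` and the sample transcripts are assumptions for illustration.

```python
from difflib import SequenceMatcher


def agreement_scores(transcripts: list[str]) -> list[float]:
    """Mean character-level similarity of each transcript to all the others."""
    scores = []
    for i, current in enumerate(transcripts):
        others = [SequenceMatcher(None, current, other).ratio()
                  for j, other in enumerate(transcripts) if j != i]
        scores.append(sum(others) / len(others) if others else 1.0)
    return scores


def most_reliable(transcripts: list[str]) -> int:
    """Index of the transcript that agrees best with the rest."""
    if not transcripts:
        raise ValueError("at least one transcript is required")
    scores = agreement_scores(transcripts)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical outputs of speech recognition models A, B, and C.
transcripts = [
    "we will launch the product in april",
    "we will launch the products in april",
    "he wheel lunch the project in a pill",
]
best = most_reliable(transcripts)
```

Agreement is only a heuristic: if most models make the same mistake, the mistaken reading is the one that scores highest.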

[0418] (Example 1)

[0419] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0420] In recent years, the importance of recording meetings and discussions has increased, but traditional manual recording methods often lack accuracy and speed. Furthermore, relying on a single speech recognition technology when converting audio data into text increases the risk of misrecognition. As a result, information may be lost or inaccurate. In addition, there is a lack of efficient methods for grasping the important points discussed in meetings, and there is a challenge in extracting important information from information overload. This invention aims to solve these problems and improve the accuracy and efficiency of meeting records.

[0421] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0422] In this invention, the server includes means for acquiring audio information from a communication device, means for supplying the audio information to multiple speech recognition algorithms and acquiring text output from each algorithm, and means for comparing and integrating the multiple text outputs to create a single document. This makes it possible to analyze the audio information from multiple perspectives by processing it with multiple recognition algorithms and efficiently generate a single integrated record. Furthermore, by using a generative AI model to scrutinize the information according to the importance of the document, high-quality summaries and extraction of related information can be achieved.

[0423] "Communication equipment" is a general term for devices that have the function of acquiring voice information and transmitting it to a server.

[0424] "Audio information" refers to audio data acquired during meetings and discussions, which is then used to create text.

[0425] A "speech recognition algorithm" refers to various analytical methods used to convert speech information into text.

[0426] "Text output" refers to text data generated by analyzing audio information using a speech recognition algorithm.

[0427] A "document" is a record generated from comparative and integrated text output, and is a consistent form of representation of information, including meeting content.

[0428] "Digital processing" refers to a series of data adjustment operations performed on audio information, such as noise reduction and sound quality correction.

[0429] A "security protocol" is a set of rules that includes communication procedures to ensure security during the transmission of voice information.

[0430] A "generative AI model" is a type of artificial intelligence technology used for purposes such as summarizing text data and extracting important information.

[0431] One embodiment of the present invention is a system that analyzes conference audio from multiple angles and automatically generates high-quality meeting minutes. This system mainly consists of three elements: a communication device, a server, and a user.

[0432] Users record audio information during meetings using a communication device. This device is a smartphone or tablet with a dedicated application installed, which records the audio in digital format and sends it to the server after the meeting ends. The files are encrypted using a security protocol so that they are protected during transmission.

[0433] The server is equipped with multiple speech recognition algorithms and analyzes the received audio information. Specifically, it performs digital processing, such as noise reduction and sound quality adjustment, before feeding the information to the speech recognition algorithms. In this step, different algorithms such as Model A, Model B, and Model C are used. Each algorithm generates its own text output, and the results are integrated to create a single document. The server then summarizes this document and extracts important information using a generative AI model.
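The digital processing mentioned above can be illustrated with a minimal sketch. The source does not specify which noise reduction or sound quality adjustment is used, so the peak normalization and per-sample noise gate below, the function name `preprocess`, and the parameter `gate_threshold` are all assumptions chosen only for illustration.

```python
from typing import List


def preprocess(samples: List[float], gate_threshold: float = 0.02) -> List[float]:
    """Hypothetical stand-in for the server's digital processing step.

    1. Peak-normalize the samples so the loudest one has magnitude 1.0.
    2. Apply a crude noise gate: samples quieter than gate_threshold become 0.0.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to normalize.
        return [0.0 for _ in samples]
    normalized = [s / peak for s in samples]
    return [s if abs(s) >= gate_threshold else 0.0 for s in normalized]
```

A production system would more likely use spectral noise suppression on framed audio; the point here is only that the recognizers receive cleaned data rather than the raw upload.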

[0434] As a concrete example, consider a company's weekly meeting, where the user records the meeting using a dedicated application. Once the recording is complete, the audio file is sent to the server, and analysis begins immediately. The server integrates the text generated from the audio and compiles it into meeting minutes. If the user wants to confirm a specific point of discussion after the meeting, they can enter a prompt such as, "Please tell me about the budget plan discussed in this meeting," and the server's AI model will provide an appropriate answer.
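One way the question-answering step could be prepared is sketched below. The source only states that the AI model answers from the minutes; the keyword-overlap retrieval, the stopword list, and the names `select_passages` and `build_prompt` are hypothetical, and the call to the generative AI model itself is deliberately left out.

```python
import re
from typing import List, Set

# Assumed, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "me", "this", "about",
             "please", "tell", "was", "at", "for", "will"}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its distinct non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def select_passages(question: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Return up to top_k passages sharing the most tokens with the question."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(p)), i, p) for i, p in enumerate(passages)]
    relevant = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep the original order of the minutes.
    relevant.sort(key=lambda item: (-item[0], item[1]))
    return [p for _, _, p in relevant[:top_k]]


def build_prompt(question: str, passages: List[str], top_k: int = 2) -> str:
    """Assemble a prompt that grounds the answer in selected minutes excerpts."""
    lines = ["Answer the question using only the meeting minutes excerpts below.", ""]
    lines += [f"- {p}" for p in select_passages(question, passages, top_k)]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Restricting the prompt to the most relevant excerpts keeps it short when the minutes are long; whether the described system does this is not stated in the source.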

[0435] This makes it easier for users to understand meeting content and quickly obtain the information necessary for decision-making. The entire system enables centralized information management, supporting the efficient operation of the organization.

[0436] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0437] Step 1:

[0438] The user records audio information during a meeting using a communication device. Once recording is complete, the device sends the audio file to the server. The input is audio information in digital format, and the output is the audio file transferred to the server. The specific action taken in this case is tapping the "Recording Complete" button within the application.

[0439] Step 2:

[0440] The server performs digital processing on the received audio files. Specifically, it removes noise and adjusts the sound quality to make them suitable for analysis. The input is the audio file sent to the server, and the output is the digitally processed audio data. The server aims to improve the accuracy of the analysis through this process.

[0441] Step 3:

[0442] The server supplies processed audio data to multiple speech recognition algorithms. Specifically, it inputs the audio data into each algorithm (e.g., Model A, Model B, Model C) and generates text output. The input is digitally processed audio data, and the output is a set of text outputs from each algorithm.

[0443] Step 4:

[0444] The server compares and integrates multiple text outputs obtained from speech recognition algorithms to create a single document. Specifically, it selects the most suitable text output based on its reliability and combines them into a consistent document. The input is a set of text outputs, and the output is the integrated document.

[0445] Step 5:

[0446] The server summarizes the integrated document and extracts key information. This specifically involves using a generative AI model to analyze the text and concisely summarize the essential points. The input is the integrated document, and the output is the summarized key points.

[0447] Step 6:

[0448] The user inputs a specific question through the server interface, and the server generates an answer to it. Specifically, the AI model analyzes the input prompt and provides an answer. The input is the prompt from the user, and the output is the generated answer.
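Steps 3 and 4 above can be sketched as follows, assuming that each speech recognition algorithm reports a confidence score for every audio segment and that "reliability" means that score. The `Hypothesis` type, the `integrate` function, and the segment-level granularity are illustrative assumptions, not details given in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    model: str         # e.g. "Model A"; labels are placeholders
    text: str          # text output for one audio segment
    confidence: float  # assumed to be reported by the model, in [0, 1]


def integrate(segments: List[List[Hypothesis]]) -> str:
    """For each segment, keep the most confident hypothesis and join the results."""
    chosen = [max(hyps, key=lambda h: h.confidence).text
              for hyps in segments if hyps]
    return " ".join(chosen)
```

Selecting per segment rather than per file lets the integrated document combine the strongest passages of each model, which matches the description of combining the most suitable outputs into one consistent document.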

[0449] (Application Example 1)

[0450] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0451] There is a need to efficiently record audio data generated during team meetings and customer interactions in stores, organize key points, and utilize it as information useful for decision-making. However, manually processing the large volume of audio data generated in these operations is time-consuming and labor-intensive, and presents challenges in terms of accuracy and comprehensiveness. Therefore, there is a need to develop a system that can automatically analyze audio data with high accuracy and provide useful information for store operations quickly and effectively.

[0452] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0453] In this invention, the server includes means for receiving acoustic data from a communication device, means for transmitting the acoustic data to multiple acoustic recognition models and receiving string conversion results from each model, and means for comparing and integrating the multiple string conversion results to generate a single record document. This enables automatic recording of audio and extraction of important information in stores.

[0454] A "communication device" is a device used to record audio data and transmit it to other systems.

[0455] "Acoustic data" refers to data that records sound in digital format; it is also referred to below as audio data.

[0456] An "acoustic recognition model" is a set of algorithms that analyze acoustic data and convert it into a string format.

[0457] "String conversion result" refers to text data generated by the acoustic recognition model.

[0458] A "record document" is a document created by integrating the results of multiple string conversions.

[0459] "Related information" refers to external data that is cross-referenced with the record document to gain additional insights.

[0460] A "user" is a person or organization that uses the system to process audio data or acquire information.

[0461] "Feedback" refers to the opinions and reactions received from customers or users.

[0462] To realize this invention, it is necessary to build a system that includes a server, a communication terminal, and a user. In this system, the user uses the communication terminal to record meetings and conversations with customers within the store. The recorded audio data is transmitted from the communication terminal to the server.

[0463] The server passes the received audio data to multiple acoustic recognition models and obtains the string conversion result from each model. Examples of usable acoustic recognition models include the Google Speech-to-Text API and Amazon Transcribe. This enables highly accurate transcription from audio data.

[0464] Next, the server compares these string conversion results, selects the most suitable parts, and integrates them to generate a single record document. Then, natural language processing (NLP) tools are used to summarize the record document and extract important information elements. Examples of NLP tools include NLTK and spaCy.
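When the recognition services do not expose comparable confidence scores, one possible way to "select the most suitable parts" is to keep the candidate that agrees most with the others. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on word lists; the function names and the consensus criterion itself are assumptions, not the method stated in the source.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two transcriptions (0.0 to 1.0)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def pick_consensus(candidates: List[str]) -> str:
    """Return the candidate with the highest total similarity to all others."""
    if not candidates:
        raise ValueError("at least one candidate transcription is required")
    if len(candidates) == 1:
        return candidates[0]

    def agreement(i: int) -> float:
        return sum(similarity(candidates[i], other)
                   for j, other in enumerate(candidates) if j != i)

    # max() returns the first index among ties, so earlier models win ties.
    best = max(range(len(candidates)), key=agreement)
    return candidates[best]
```

Applied to each segment of the store recording, this tends to discard a misrecognition made by only one model, since that candidate disagrees with the rest.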

[0465] The generated summary is provided to the user, who can then use it to make decisions. For example, if an apparel shop discusses changing the product layout, the key points from the recording of the discussion could be summarized and provided to the staff as supplementary material.

[0466] An example of a prompt for a generative AI model is, "Analyze the audio data from the store meeting and transcribe the product lineup discussion appropriately." This allows for efficient analysis of audio data without human intervention, providing information useful for store operations.

[0467] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0468] Step 1:

[0469] The user uses the terminal to record conversations within the store. The terminal records the conversation as audio data. The input is the conversation audio, and the output is the audio data stored on the terminal. The terminal then prepares to transmit this audio data in the next step.

[0470] Step 2:

[0471] The terminal sends the recorded audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data received by the server. The server prepares this data to pass to the acoustic recognition model.

[0472] Step 3:

[0473] The server inputs acoustic data into multiple acoustic recognition models and retrieves string conversion results from each model. The input is the acoustic data stored on the server, and the output is the string conversion results generated by the acoustic recognition models. The server aggregates the string conversion results from the different models.

[0474] Step 4:

[0475] The server compares and integrates the acquired string conversion results to generate a single record document. During this process, the accuracy and reliability of the results from each model are evaluated. The input consists of multiple string conversion results, and the output is an integrated record document. The server constructs the record document based on the most reliable information.

[0476] Step 5:

[0477] The server summarizes the record document using natural language processing (NLP) tools and extracts key information elements. The input is the integrated record document, and the output is the summarized information elements. The server uses NLP tools to organize the key points of the document.

[0478] Step 6:

[0479] The server provides the user with a generated summary. The input is the summarized information element, and the output is the summary information received by the user. The user can use this information to make decisions.
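The summarization in Step 5 could take many forms. As a dependency-free illustration, the sketch below scores sentences by the average frequency of their non-stopword terms and keeps the top ones in their original order. This frequency-based extractive approach, the stopword list, and the function names are assumptions; the source names NLTK and spaCy only as example tools and does not describe the algorithm.

```python
import re
from collections import Counter
from typing import List

# Assumed minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "to", "for", "is", "we", "should", "may",
             "of", "and", "in"}


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence: str) -> List[str]:
    """Lowercased words of the sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence: str) -> float:
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))[:max_sentences]
    return [sentences[i] for i in sorted(ranked)]
```

Sentences about topics that recur across the record document, such as a product discussed repeatedly, score higher than one-off remarks.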

[0480] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0481] This invention relates to a system that incorporates sentiment analysis into the process of creating meeting minutes. The system aims to analyze the user's audio data during meetings and capture not only the content of the utterances but also the emotional information contained within them. As a result, meeting minutes will contain richer information and will become a useful source of information for decision-making and problem identification.

[0482] On the terminal, the user records the audio during the meeting and uploads the data to the server. On the server, this audio data is transcribed by multiple speech recognition models, and at the same time, an emotion engine analyzes the emotions. The emotion engine analyzes the intonation, speed, and tempo of the speech to identify emotions such as joy, anger, and anxiety.

[0483] On the server, sentiment information is tagged onto the transcribed meeting minutes. This allows for later review of the speaker's emotional state. When summarizing the minutes, this sentiment information can be taken into account to adjust the importance and focus of the summary.

[0484] Furthermore, the server, upon receiving relevant materials from users, compares and analyzes them against the meeting minutes and emotional information. This allows for an understanding of how emotions expressed during the meeting influenced awareness and understanding of the agenda items.

[0485] As a concrete example, in a project progress meeting, the emotion engine detects that a speaker is feeling anxious about a particular issue in the audio recording made by the user. The server adds this emotion information to the meeting minutes, performs a more detailed analysis of the issue within the summary, and helps the project manager determine the priority of addressing that issue.

[0486] The implementation of this system will improve the quality of meeting minutes and provide valuable insights for improvement activities and strategic planning across the organization by allowing us to understand the true reactions of participants.

[0487] The following describes the processing flow.

[0488] Step 1:

[0489] Users record audio during meetings using their devices and upload the audio data to the server after the meeting ends.

[0490] Step 2:

[0491] The server sends the uploaded audio data to multiple speech recognition models, and each model performs the transcription.

[0492] Step 3:

[0493] The server simultaneously activates the emotion engine and extracts emotional information from the audio data. During this process, it analyzes emotions using factors such as intonation, volume, and speed.

[0494] Step 4:

[0495] The server aggregates the transcription results from each speech recognition model and generates meeting minutes with added emotional information from the emotion engine.

[0496] Step 5:

[0497] The server uses sentiment information contained in the generated meeting minutes to summarize and adjust the importance and focus of each statement.

[0498] Step 6:

[0499] Users provide relevant materials from their terminals to the server, which then integrates and analyzes them along with meeting minutes and sentiment information.

[0500] Step 7:

[0501] The server receives questions from users, generates answers considering meeting minutes, relevant documents, and sentiment information, and provides them to the users.

[0502] Step 8:

[0503] Based on the information provided by the server, users make decisions necessary for the progress of projects and meetings, and provide feedback as needed. The server uses this feedback to improve the overall accuracy and functionality of the system.
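Step 5 of this flow, in which sentiment information adjusts the importance of each statement, can be sketched as a weighted ranking. The weights in `EMOTION_WEIGHTS`, the `intensity` field, and the scoring formula are hypothetical values chosen for illustration; the source does not define how importance is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str
    text: str
    emotion: str      # label assumed to come from the emotion engine
    intensity: float  # assumed strength of the emotion, in [0, 1]


# Assumed weights: emotions that suggest a problem raise importance the most.
EMOTION_WEIGHTS = {"anxiety": 1.5, "anger": 1.4, "joy": 1.1, "neutral": 1.0}


def importance(u: Utterance) -> float:
    """Importance grows with the emotion weight and with its intensity."""
    return EMOTION_WEIGHTS.get(u.emotion, 1.0) * (1.0 + u.intensity)


def rank_by_importance(utterances: List[Utterance]) -> List[Utterance]:
    """Order utterances so emotionally salient ones reach the summary first."""
    return sorted(utterances, key=importance, reverse=True)
```

A summarizer could then give more space to the top-ranked statements, which corresponds to the example of an anxious remark prompting a more detailed analysis of that issue.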

[0504] (Example 2)

[0505] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0506] In meetings, there is a growing need to create meeting minutes that not only transcribe participants' statements but also capture the emotions associated with those statements. However, conventional systems lacked the integration of audio transcription and sentiment analysis, making it difficult to grasp both the content of the statements and the emotions behind them.

[0507] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0508] In this invention, the server includes means for receiving audio information from a communication device, means for transmitting the audio information to a plurality of speech recognition devices and receiving text information from each device, and means for analyzing the audio information corresponding to the text information and identifying emotional characteristics. This makes it possible to generate meeting records that capture not only the content of what was said in a meeting, but also the emotions embedded in those statements.

[0509] A "communication device" refers to a device or piece of equipment used to receive and transmit audio information.

[0510] "Audio information" refers to audio data that records human speech.

[0511] A "speech recognition device" is a device or program that converts audio information into text format.

[0512] "Text information" refers to text data produced by a speech recognition device.

[0513] "Emotional characteristics" refer to the psychological state inferred from the intonation, speed, tempo, and other elements contained in audio information.

[0514] A "meeting record" is a document that includes the content of what was said during a meeting and the corresponding emotional characteristics.

[0515] "Related information" refers to documents and data related to the meeting content and is used for cross-referencing with the meeting minutes.

[0516] This invention relates to a system for efficiently processing audio information in a meeting and generating a meeting record that includes the content of the speeches and their emotional characteristics. This system mainly consists of the following components.

[0517] The terminal provides a function for users to record audio during meetings. Users can start recording themselves and upload the audio to the server after completion. The terminal also has an audio format conversion function that converts the audio information into a format that the server can process.

[0518] The server receives audio information transmitted from the terminal and converts it into text using multiple speech recognition devices. Specifically, the server utilizes a cloud-based speech recognition service (e.g., a common speech recognition API) to segment the audio data and convert it into text. In addition, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to identify emotional characteristics by analyzing the intonation, speed, tempo, etc., of the voice.

[0519] The collected textual information and emotional characteristics are integrated by the server, and meeting records are generated as emotionally tagged text. These records are organized so that important elements are highlighted and useful information is readily apparent when users review them later.

[0520] As a concrete example, in a company project progress meeting, users record discussions about specific issues. Later, a server can detect emotions such as impatience or anxiety, tag them in the text, and reflect them in the meeting record. Such records can serve as a guide for project managers to take swift action.

[0521] Another example of a prompt to input into a generative AI model is: "Analyze the meeting audio data and list the emotions detected from all speakers. Analyze particularly emphasized emotions and related statements, and indicate how they might affect project progress." This prompt allows the AI to provide detailed analysis to support user decision-making.

[0522] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0523] Step 1:

[0524] The terminal supports the user in recording audio during a meeting. It receives real-time audio data collected from the meeting as input and saves it as a recorded audio file (e.g., WAV format) as output. Once recording is complete, the terminal prepares to send the audio file to the server.

[0525] Step 2:

[0526] The server receives audio files sent from the terminal. It uses a speech recognition device (e.g., a common speech recognition API) to take the audio file as input and convert it into text information as output. The server segments the audio file and sends each segment to the speech recognition device for sequential transcription.

[0527] Step 3:

[0528] The server identifies emotional characteristics based on audio and text data. It receives audio and text data as input and analyzes the data using an emotion analysis engine (e.g., a common emotion analysis tool). The output extracts emotional characteristics (e.g., joy, anxiety, anger) for each segment. This process takes into account factors such as intonation, speed, and tempo of the speech.

[0529] Step 4:

[0530] The server generates meeting records by adding emotional characteristics to textual information. It receives textual information and emotional characteristics as input and creates meeting records with emotional tags as output. The server uses an integration algorithm to consolidate the recorded content and its emotional characteristics.

[0531] Step 5:

[0532] Users view the created meeting records and use them for decision-making and information analysis. As input, they receive sentiment-tagged meeting records provided by the server, and as output, they use this as a basis for project management and strategic planning decisions. Through reviewing the meeting records, users can understand the emotional responses of the participants.
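The integration in Step 4 can be illustrated by aligning text segments and emotion segments on the recording's timeline. The sketch assumes that both the speech recognition device and the emotion analysis engine return start and end times in seconds; the types, the overlap rule, and the output line format are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextSegment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


@dataclass
class EmotionSegment:
    start: float
    end: float
    label: str    # e.g. "joy", "anxiety", "anger"


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def tag_record(texts: List[TextSegment], emotions: List[EmotionSegment]) -> List[str]:
    """Attach to each text segment the emotion label that overlaps it the longest."""
    lines = []
    for t in texts:
        best = max(emotions,
                   key=lambda e: overlap(t.start, t.end, e.start, e.end),
                   default=None)
        if best is not None and overlap(t.start, t.end, best.start, best.end) > 0:
            label = best.label
        else:
            label = "unknown"  # no emotion segment covers this text
        lines.append(f"[{t.start:.1f}-{t.end:.1f}s] ({label}) {t.text}")
    return lines
```

Time-based alignment is needed because the two analyses may segment the audio differently, so a one-to-one pairing by index cannot be assumed.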

[0533] (Application Example 2)

[0534] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0535] In meetings and staff gatherings, simply transcribing audio is insufficient to reflect the true intentions and emotions of the speakers. This can lead to a lack of information necessary for post-meeting decision-making and improving dialogue. Therefore, there is a need to create rich meeting minutes that also take into account the emotions of the speakers.

[0536] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0537] In this invention, the server includes means for receiving audio information from a communication device, means for analyzing emotions from the audio information and generating emotional information, and means for presenting the emotional information as feedback on the emotional state of participants after the meeting. This makes it possible to create meeting minutes and interpret intentions while taking into account the emotional state of participants after the meeting.

[0538] A "communication device" is an electronic device that has the function of transmitting and receiving voice information.

[0539] "Audio information" refers to sound data that records what was said during a meeting or conference.

[0540] "Speech recognition technology" refers to algorithms and programs used to transcribe audio data into text.

[0541] "Emotional information" refers to data extracted from audio data that represents the emotional state of the speaker.

[0542] "Meeting minutes" are documents that record the statements and content of a meeting, making them available for later reference.

[0543] A "summary" is information that is compiled by extracting the important points from meeting minutes and presenting them concisely.

[0544] "Information sources" refer to any data that is referenced, including the content of meetings and related materials.

[0545] "Feedback" refers to information provided to participants after a meeting, including reflections and advice regarding their comments and emotional state.

[0546] To implement this invention, a communication terminal is required for recording audio during a meeting or conference. The user uses this communication terminal to collect audio information in real time and transmit it to a server. The server receives the audio information and not only transcribes it using speech recognition technology, but also performs sentiment analysis to extract emotional information.

[0547] Speech recognition technology can utilize widely used libraries such as speech_recognition and Google's speech recognition API. Furthermore, specific algorithms are used to analyze the intonation and speed of speech in order to extract emotional information. These algorithms can not only identify emotions but also evaluate their intensity.

[0548] The generated transcripts and sentiment information are integrated and compared on the server and saved as meeting minutes. This allows users to easily review the minutes after the meeting, and analyze the sentiment information in particular as feedback to help with future decision-making.

[0549] As a concrete example, when discussing project progress in a meeting, not only the speaker's intentions but also their emotions, such as impatience, can be analyzed, allowing for a review of project priorities based on this analysis. An example of a prompt to input into the generative AI model is, "Perform an emotion analysis of this meeting and add emotion tags to the meeting minutes." This prompt enables the server to perform the necessary data processing and analysis.

[0550] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0551] Step 1:

[0552] The terminal records the audio of meetings and conferences in real time and sends it to the server as audio information. The input is the recorded audio data, and the output is the audio information sent to the server. At this stage, acoustic processing such as noise cancellation is important to maintain the quality of the audio.

[0553] Step 2:

[0554] The server inputs the received audio information into speech recognition technology and performs transcription. The input is audio information, and the output is the transcribed text. The server uses multiple speech recognition models and receives results from each model to improve accuracy.

[0555] Step 3:

[0556] The server analyzes audio information to extract emotional information. The input is audio information, and it identifies emotions by analyzing the intonation and tempo of the voice. The output is data representing the speaker's emotional state. The emotion engine analyzes the audio patterns and generates emotional information as emotion tags.

[0557] Step 4:

[0558] The server integrates transcript data and sentiment information to generate meeting minutes. The input is the transcript text and sentiment information, and the output is meeting minutes with sentiment tags. Based on the sentiment information, the server considers the importance of each statement and adjusts each section within the meeting minutes accordingly.

[0559] Step 5:

[0560] The server summarizes the generated meeting minutes and extracts key points. The input is the detailed meeting minutes, and the output is the summarized meeting minutes. Sentiment information is used to highlight particularly noteworthy sections.

[0561] Step 6:

[0562] The user inputs prompt text into the server, and a generative AI model is used to obtain feedback that includes emotional information. The input is the prompt text from the user, and the output is the analyzed feedback. Based on this, the server presents the emotional trends during the meeting to the participants.
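The emotional trends presented in Step 6 could be derived from the emotion tags produced in Step 3. The sketch below averages emotion intensity per speaker and picks the dominant emotion. The tuple layout `(speaker, emotion, intensity)` and both function names are assumptions; the source says only that the algorithms can evaluate emotion intensity, not how trends are aggregated.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

EmotionTag = Tuple[str, str, float]  # (speaker, emotion label, intensity in [0, 1])


def emotion_trends(tags: List[EmotionTag]) -> Dict[str, Dict[str, float]]:
    """Return, for each speaker, the mean intensity of each detected emotion."""
    sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for speaker, emotion, intensity in tags:
        sums[speaker][emotion] += intensity
        counts[speaker][emotion] += 1
    return {speaker: {emotion: sums[speaker][emotion] / counts[speaker][emotion]
                      for emotion in sums[speaker]}
            for speaker in sums}


def dominant_emotion(trend: Dict[str, float]) -> str:
    """Return the emotion with the highest mean intensity for one speaker."""
    return max(trend, key=lambda emotion: trend[emotion])
```

The resulting per-speaker figures could be passed to the generative AI model together with the user's prompt so that the feedback text is grounded in measured values.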

[0563] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0564] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0565] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the headset-type terminal 314.

[0566] [Fourth Embodiment]

[0567] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0568] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0569] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. Examples of the network 54 include a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0570] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0571] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0572] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0573] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0574] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0575] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0576] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0577] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0578] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0579] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0580] This invention relates to a system that analyzes audio data from meetings and other similar events from multiple perspectives and automatically generates highly accurate meeting minutes. This system is implemented using three main components: a communication terminal, a server, and a user.

[0581] Users record audio during meetings using communication devices. This recorded audio data is uploaded to a server after the meeting ends. The server uses multiple speech recognition models to analyze this audio data and perform transcription. The transcription results from the different speech recognition models are aggregated, compared, and integrated on the server to create a single, unified meeting transcript.

[0582] To improve the accuracy of the meeting minutes, the server evaluates the reliability of each transcript and selects the optimal combination. Next, the server summarizes the generated minutes, extracting and organizing key points. Users can also provide relevant materials to the server as needed; the server analyzes these materials, compares them with the minutes, and evaluates their relevance to gain further insights.
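
As one non-limiting illustration, the evaluation of relevance between the minutes and a supplied material could be sketched as a bag-of-words cosine similarity. The disclosure does not specify the measure used; the function names `tokenize` and `relevance` are hypothetical and serve only this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(minutes: str, material: str) -> float:
    """Cosine similarity of word-count vectors (0.0 = no shared words)."""
    a, b = tokenize(minutes), tokenize(material)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as unrelated (assumption)
    return dot / (norm_a * norm_b)
```

A material whose score falls below some chosen threshold could be treated as unrelated to the meeting; any such threshold would have to be tuned and is not given in the disclosure.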

[0583] In this system, users can ask questions through a server interface, and the server returns answers to those questions. This allows users to obtain information efficiently and supports their decision-making process.

[0584] As a concrete example, consider a company's weekly meeting. Users install a dedicated application on their smartphones, record the meeting discussions, and transfer the recordings to a server. The server immediately analyzes the audio and generates a set of text data using speech recognition models A, B, and C. The server then selects the most accurate text data and summarizes it as meeting minutes. Users can use the server's question-and-answer function to review the conversation after the meeting or to inquire about specific points of discussion. In this way, the recording and analysis of meetings are centralized, aiming to contribute to organizational information management and decision-making.

[0585] The following describes the processing flow.

[0586] Step 1:

[0587] Users record audio on their devices during meetings and upload the audio data to the server after the meeting ends.

[0588] Step 2:

[0589] The server sends the received audio data to multiple speech recognition models, each of which analyzes the audio data and generates a transcription result.

[0590] Step 3:

[0591] The server receives transcription results from each speech recognition model, compares and analyzes them to evaluate their reliability, and then integrates the most accurate text to create a single meeting transcript.

[0592] Step 4:

[0593] The server inputs the generated meeting minutes into a summarization model, creates a summary, and extracts the key points.

[0594] Step 5:

[0595] Users upload relevant materials used in meetings from their devices to the server, which then compares the meeting minutes with the materials, assesses their relevance, and generates additional insights.

[0596] Step 6:

[0597] The user enters a question through the server interface, and the server generates an answer based on meeting minutes and related documents and sends it back to the user.

[0598] Step 7:

[0599] Users provide feedback to the server, which then uses this feedback to improve the speech recognition and meeting minute creation processes.
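
Steps 2 and 3 above can be sketched, under stated assumptions, as follows. Each speech recognition model is represented by a hypothetical callable that takes audio bytes and returns text, and "reliability" is approximated by how closely a transcript agrees with the others. This agreement measure is an assumption made for the sketch and is not stated in the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from typing import Callable, Dict

# Hypothetical interface: a recognizer maps raw audio bytes to a transcript.
Recognizer = Callable[[bytes], str]


def transcribe_all(audio: bytes, recognizers: Dict[str, Recognizer]) -> Dict[str, str]:
    """Send the same audio to every recognizer in parallel and collect the texts."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, audio) for name, fn in recognizers.items()}
        return {name: fut.result() for name, fut in futures.items()}


def agreement_scores(results: Dict[str, str]) -> Dict[str, float]:
    """Score each transcript by its mean similarity to the other transcripts."""
    scores = {}
    for name, text in results.items():
        others = [t for n, t in results.items() if n != name]
        sims = [SequenceMatcher(None, text, other).ratio() for other in others]
        scores[name] = sum(sims) / len(sims) if sims else 1.0
    return scores


def select_transcript(results: Dict[str, str]) -> str:
    """Return the transcript that agrees most with the rest."""
    scores = agreement_scores(results)
    return results[max(scores, key=scores.get)]
```

In practice the callables would wrap real speech recognition services, and a confidence score reported by each service could replace or supplement the agreement measure.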

[0600] (Example 1)

[0601] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0602] In recent years, the importance of recording meetings and discussions has increased, but traditional manual recording methods often lack accuracy and speed. Furthermore, relying on a single speech recognition technology when converting audio data into text increases the risk of misrecognition. As a result, information may be lost or inaccurate. In addition, there is a lack of efficient methods for grasping the important points discussed in meetings, and there is a challenge in extracting important information from information overload. This invention aims to solve these problems and improve the accuracy and efficiency of meeting records.

[0603] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0604] In this invention, the server includes means for acquiring audio information from a communication device, means for supplying the audio information to multiple speech recognition algorithms and acquiring text output from each algorithm, and means for comparing and integrating the multiple text outputs to create a single document. This makes it possible to analyze the audio information from multiple perspectives by processing it with multiple recognition algorithms and efficiently generate a single integrated record. Furthermore, by using a generative AI model to scrutinize the information according to the importance of the document, high-quality summaries and extraction of related information can be achieved.

[0605] "Communication device" is a general term for devices that have the function of acquiring audio information and transmitting it to a server.

[0606] "Audio information" refers to audio data acquired during meetings and discussions, which is then used to create text.

[0607] A "speech recognition algorithm" refers to various analytical methods used to convert speech information into text.

[0608] "Text output" refers to text data generated by analyzing audio information using a speech recognition algorithm.

[0609] A "document" is a record generated from comparative and integrated text output, and is a consistent form of representation of information, including meeting content.

[0610] "Digital processing" refers to a series of data adjustment operations performed on audio information, such as noise reduction and sound quality correction.

[0611] A "security protocol" is a set of rules that includes communication procedures to ensure security during the transmission of voice information.

[0612] A "generative AI model" is a type of artificial intelligence technology used for purposes such as summarizing text data and extracting important information.

[0613] One embodiment of the present invention is a system that analyzes conference audio from multiple angles and automatically generates high-quality meeting minutes. This system mainly consists of three elements: a communication device, a server, and a user.

[0614] Users record audio information during meetings using a communication device. This device is a smartphone or tablet with a dedicated application installed, which records the audio in digital format and sends it to the server after the meeting ends. The files are encrypted using a secure protocol for information protection, ensuring security during transmission.

[0615] The server is equipped with multiple speech recognition algorithms and analyzes the received audio information. Specifically, it performs digital processing such as noise reduction and sound quality adjustment before feeding the information to the speech recognition algorithms. In this step, different algorithms such as Model A, Model B, and Model C are used. Each algorithm generates a different text output, and the results are integrated to create a single document. This document is further summarized by the server, and important information is extracted using a generative AI model.
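
The digital processing mentioned above could, for example, take the following simplified form: a noise gate followed by peak normalization on mono samples in the range -1.0 to 1.0. The threshold and target level are placeholder values, and the disclosure does not name the actual noise reduction or sound quality adjustment methods.

```python
from typing import List


def noise_gate(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Zero out samples whose magnitude is below the threshold (crude noise reduction)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale samples so the loudest one reaches target_peak; silence is left unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples: List[float]) -> List[float]:
    """Apply the gate first, then bring the level up to the target peak."""
    return peak_normalize(noise_gate(samples))
```

A production system would more likely rely on spectral noise suppression; the gate is used here only because it is short enough to read at a glance.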

[0616] As a concrete example, consider a company's weekly meeting, where the user records the meeting using a dedicated application. Once the recording is complete, the audio file is sent to the server, and analysis begins immediately. The server integrates the text generated from the audio and compiles it into meeting minutes. If the user wants to confirm a specific point of discussion after the meeting, they can enter a prompt such as, "Please tell me about the budget plan discussed in this meeting," and the server's AI model will provide an appropriate answer.
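
One possible way to ground such an answer in the minutes is to pick the passages that share the most keywords with the question and place them in the prompt passed to the generative AI model. The sketch below assumes this keyword-overlap retrieval; the stop-word list, the function names, and the prompt layout are all hypothetical, and the model call itself is omitted.

```python
import re
from typing import List, Set

# Small illustrative stop-word list (an assumption, not taken from the disclosure).
STOPWORDS = {"a", "about", "an", "by", "for", "in", "is", "me", "on",
             "please", "tell", "the", "this", "was"}


def keywords(text: str) -> Set[str]:
    """Lowercased word set with stop words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by shared keywords with the question; drop passages with none."""
    q = keywords(question)
    scored = [(len(q & keywords(p)), i, p) for i, p in enumerate(passages)]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties keep document order
    return [p for _, _, p in scored[:top_k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble the text that would be handed to a generative model."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return f"Meeting minutes excerpts:\n{context}\n\nQuestion: {question}"
```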

[0617] This makes it easier for users to understand meeting content and quickly obtain the information necessary for decision-making. The entire system enables centralized information management, supporting the efficient operation of the organization.

[0618] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0619] Step 1:

[0620] The user records audio information during a meeting using a communication device. Once recording is complete, the device sends the audio file to the server. The input is audio information in digital format, and the output is the audio file transferred to the server. The specific action taken in this case is tapping the "Recording Complete" button within the application.

[0621] Step 2:

[0622] The server performs digital processing on the received audio files. Specifically, it removes noise and adjusts the sound quality to make them suitable for analysis. The input is the audio file sent to the server, and the output is the digitally processed audio data. The server aims to improve the accuracy of the analysis through this process.

[0623] Step 3:

[0624] The server supplies processed audio data to multiple speech recognition algorithms. Specifically, it inputs the audio data into each algorithm (e.g., Model A, Model B, Model C) and generates text output. The input is digitally processed audio data, and the output is a set of text outputs from each algorithm.

[0625] Step 4:

[0626] The server compares and integrates the multiple text outputs obtained from the speech recognition algorithms to create a single document. Specifically, it selects the most suitable portions of the text outputs based on their reliability and combines them into a consistent document. The input is a set of text outputs, and the output is the integrated document.

[0627] Step 5:

[0628] The server summarizes the integrated document and extracts key information. This specifically involves using a generative AI model to analyze the text and concisely summarize the essential points. The input is the integrated document, and the output is the summarized key points.

[0629] Step 6:

[0630] The user inputs a specific question using the server interface, and the server generates an answer based on that. Specifically, the AI model analyzes the input prompt and provides an answer. The input is the prompt from the user, and the output is the generated answer.
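
The integration in Step 4 can be illustrated, as a sketch only, by a segment-by-segment selection. It assumes that every model returns its output split at the same segment boundaries and attaches a confidence value between 0 and 1 to each segment; neither assumption is stated in the disclosure, and the names `Hypothesis` and `integrate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def integrate(outputs: Dict[str, List[Hypothesis]]) -> str:
    """For each segment index, keep the highest-confidence hypothesis across models."""
    if not outputs:
        return ""
    # Only segments present in every model's output are compared.
    n_segments = min(len(hyps) for hyps in outputs.values())
    chosen = []
    for i in range(n_segments):
        best = max((hyps[i] for hyps in outputs.values()), key=lambda h: h.confidence)
        chosen.append(best.text)
    return " ".join(chosen)
```

For example, if Model A is more confident on the first segment and Model B on the second, the integrated document takes the first segment from A and the second from B.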

[0631] (Application Example 1)

[0632] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0633] There is a need to efficiently record audio data generated during team meetings and customer interactions in stores, organize key points, and utilize it as information useful for decision-making. However, manually processing the large volume of audio data generated in these operations is time-consuming and labor-intensive, and presents challenges in terms of accuracy and comprehensiveness. Therefore, there is a need to develop a system that can automatically analyze audio data with high accuracy and provide useful information for store operations quickly and effectively.

[0634] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0635] In this invention, the server includes means for receiving acoustic data from a communication device, means for transmitting the acoustic data to multiple acoustic recognition models and receiving string conversion results from each model, and means for comparing and integrating the multiple string conversion results to generate a single record document. This enables automatic recording of audio and extraction of important information in stores.

[0636] A "communication device" is a device used to record audio data and transmit it to other systems.

[0637] "Audio data" refers to data that records sound in digital format.

[0638] An "acoustic recognition model" is a set of algorithms that analyze acoustic data and convert it into a string format.

[0639] "String conversion result" refers to text data generated by the acoustic recognition model.

[0640] A "record document" is a document created by integrating the results of multiple string conversions.

[0641] "Related information" refers to external data used to cross-reference with record documents and gain additional insights.

[0642] A "user" is a person or organization that uses the system to process audio data or acquire information.

[0643] "Feedback" refers to the opinions and reactions received from customers or users.

[0644] To realize this invention, it is necessary to build a system that includes a server, a communication terminal, and a user. In this system, the user uses the communication terminal to record meetings and conversations with customers within the store. The recorded audio data is transmitted from the communication terminal to the server.

[0645] The server passes the received audio data to multiple acoustic recognition models and obtains the string conversion result from each model. Examples of usable acoustic recognition models include the Google Speech-to-Text API and Amazon Transcribe. This enables highly accurate transcription from audio data.

[0646] Next, the server compares these string conversion results, selects the most suitable parts, and integrates them to generate a single record document. Then, natural language processing (NLP) tools are used to summarize the record document and extract important information elements. Examples of NLP tools include NLTK and spaCy.

[0647] The generated summary is provided to the user, who can then use it to make decisions. For example, if an apparel shop discusses changing the product layout, the key points from the recording of the discussion could be summarized and provided to the staff as supplementary material.

[0648] An example of a prompt for a generative AI model is, "Analyze the audio data from the store meeting and transcribe the product lineup discussion appropriately." This allows for efficient analysis of audio data without human intervention, providing information useful for store operations.

[0649] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0650] Step 1:

[0651] The user uses a device to record conversations within the store. The device records the conversation as audio data. The input is audio data, and the output is the audio data stored on the device. The device then prepares to transmit this audio data to the next step.

[0652] Step 2:

[0653] The terminal sends the recorded audio data to the server. The input is the audio data stored on the terminal, and the output is the audio data received by the server. The server prepares this data to pass to the acoustic recognition model.

[0654] Step 3:

[0655] The server inputs acoustic data into multiple acoustic recognition models and retrieves string conversion results from each model. The input is the acoustic data stored on the server, and the output is the string conversion results generated by the acoustic recognition models. The server aggregates the string conversion results from the different models.

[0656] Step 4:

[0657] The server compares and integrates the acquired string conversion results to generate a single record document. During this process, the accuracy and reliability of the results from each model are evaluated. The input consists of multiple string conversion results, and the output is an integrated record document. The server constructs the record document based on the most reliable information.

[0658] Step 5:

[0659] The server summarizes the record document using natural language processing (NLP) tools and extracts key information elements. The input is the integrated record document, and the output is the summarized information elements. The server uses NLP tools to organize the key points of the document.

[0660] Step 6:

[0661] The server provides the user with a generated summary. The input is the summarized information element, and the output is the summary information received by the user. The user can use this information to make decisions.
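
As a non-limiting illustration of Step 5, a frequency-based extractive summary can be written with the Python standard library alone. This is a simplified stand-in for the NLP tools named above, not their API; the stop-word list and function names are assumptions made for the sketch.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (an assumption).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was", "will"}


def content_words(text: str) -> List[str]:
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text: str, max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        # Average document frequency of the sentence's content words.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```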

[0662] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0663] This invention relates to a system that incorporates sentiment analysis into the process of creating meeting minutes. The system aims to analyze the user's audio data during meetings and capture not only the content of the utterances but also the emotional information contained within them. As a result, meeting minutes will contain richer information and will become a useful source of information for decision-making and problem identification.

[0664] On the terminal, the user records the audio during the meeting and uploads the data to the server. On the server, this audio data is transcribed by multiple speech recognition models, and at the same time, an emotion engine analyzes the emotions. The emotion engine analyzes the intonation, speed, and tempo of the speech to identify emotions such as joy, anger, and anxiety.

[0665] On the server, sentiment information is tagged onto the transcribed meeting minutes. This allows for later review of the speaker's emotional state. When summarizing the minutes, this sentiment information can be taken into account to adjust the importance and focus of the summary.
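
The tagging described above can be sketched by matching time ranges: each transcribed utterance receives the label of the emotion span that overlaps it the most. The data classes, the time-stamped representation, and the "untagged" fallback are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    start: float  # seconds from the start of the recording
    end: float
    speaker: str
    text: str
    emotion: Optional[str] = None


@dataclass
class EmotionSpan:
    start: float
    end: float
    label: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def tag_emotions(utterances: List[Utterance], spans: List[EmotionSpan]) -> List[Utterance]:
    """Attach to each utterance the label of the span overlapping it the most."""
    for u in utterances:
        best = max(spans, key=lambda s: overlap(u.start, u.end, s.start, s.end), default=None)
        if best is not None and overlap(u.start, u.end, best.start, best.end) > 0:
            u.emotion = best.label
    return utterances


def render(utterances: List[Utterance]) -> str:
    """Format the minutes with one emotion tag per line."""
    return "\n".join(f"[{u.emotion or 'untagged'}] {u.speaker}: {u.text}" for u in utterances)
```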

[0666] Furthermore, the server, upon receiving relevant materials from users, compares and analyzes them against the meeting minutes and emotional information. This allows for an understanding of how emotions expressed during the meeting influenced awareness and understanding of the agenda items.

[0667] As a concrete example, in a project progress meeting, the emotion engine detects that a speaker is feeling anxious about a particular issue in the audio recording made by the user. The server adds this emotion information to the meeting minutes, performs a more detailed analysis of the issue within the summary, and helps the project manager determine the priority of addressing that issue.

[0668] The implementation of this system will improve the quality of meeting minutes and provide valuable insights for improvement activities and strategic planning across the organization by allowing us to understand the true reactions of participants.

[0669] The following describes the processing flow.

[0670] Step 1:

[0671] Users record audio during meetings using their devices and upload the audio data to the server after the meeting ends.

[0672] Step 2:

[0673] The server sends the uploaded audio data to multiple speech recognition models, and each model performs the transcription.

[0674] Step 3:

[0675] The server simultaneously activates the emotion engine and extracts emotional information from the audio data. During this process, it analyzes emotions using factors such as intonation, volume, and speed.

[0676] Step 4:

[0677] The server aggregates the transcription results from each speech recognition model and generates meeting minutes with added emotional information from the emotion engine.

[0678] Step 5:

[0679] The server uses sentiment information contained in the generated meeting minutes to summarize and adjust the importance and focus of each statement.

[0680] Step 6:

[0681] Users provide relevant materials from their terminals to the server, which then integrates and analyzes them along with meeting minutes and sentiment information.

[0682] Step 7:

[0683] The server receives questions from users, generates answers considering meeting minutes, relevant documents, and sentiment information, and provides them to the users.

[0684] Step 8:

[0685] Based on the information provided by the server, users make decisions necessary for the progress of projects and meetings, and provide feedback as needed. The server uses this feedback to improve the overall accuracy and functionality of the system.
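
The adjustment of importance in Step 5 could, for instance, multiply a base importance score for each statement by a weight that depends on its emotion tag. The weights below are placeholders chosen for the sketch and do not come from the disclosure; how the base importance is obtained is also left open.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: statements tagged with these emotions are emphasized.
EMOTION_WEIGHT: Dict[str, float] = {"anxiety": 1.5, "anger": 1.5, "joy": 1.2}

# (text, emotion tag, base importance in [0, 1])
Statement = Tuple[str, str, float]


def prioritize(statements: List[Statement], top_k: int = 3) -> List[str]:
    """Return the top_k statement texts ranked by emotion-weighted importance."""
    ranked = sorted(
        statements,
        key=lambda s: s[2] * EMOTION_WEIGHT.get(s[1], 1.0),
        reverse=True,
    )
    return [text for text, _, _ in ranked[:top_k]]
```

With these weights, a statement tagged "anxiety" with base importance 0.6 outranks a neutral statement with base importance 0.7, so the summary would lead with the item that worried a participant.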

[0686] (Example 2)

[0687] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0688] In meetings, there is a growing need to create meeting minutes that not only transcribe participants' statements but also capture the emotions associated with those statements. However, conventional systems lacked the integration of audio transcription and sentiment analysis, making it difficult to grasp both the content of the statements and the emotions behind them.

[0689] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0690] In this invention, the server includes means for receiving audio information from a communication device, means for transmitting the audio information to a plurality of speech recognition devices and receiving textual information from each device, and means for analyzing the audio information corresponding to the textual information and identifying emotional characteristics. This makes it possible to generate meeting minutes that capture not only the content of what was said in a meeting, but also the emotions embedded in those statements.

[0691] "Communication device" refers to a device used to receive and transmit audio information.

[0692] "Audio information" refers to audio data that records human speech.

[0693] A "speech recognition device" is a device or program that converts speech information into text format.

[0694] "Textual information" refers to text data converted by a speech recognition device.

[0695] "Emotional characteristics" refer to the psychological state analyzed from the intonation, speed, tempo, and other elements contained in audio information.

[0696] A "meeting record" is a document that includes the content of what was said during a meeting and the corresponding emotional characteristics.

[0697] "Related information" refers to documents and data related to the meeting content and is used for cross-referencing with the meeting minutes.

[0698] This invention relates to a system for efficiently processing audio information in a meeting and generating a meeting record that includes the content of the speeches and their emotional characteristics. This system mainly consists of the following components.

[0699] The terminal provides a function for users to record audio during meetings. Users can start the recording on their own and upload the audio to the server after the recording is complete. The terminal also has an audio format conversion function that converts the audio information into a format that the server can process.

[0700] The server receives audio information transmitted from the terminal and converts it into text using multiple speech recognition devices. Specifically, the server utilizes a cloud-based speech recognition service (e.g., a common speech recognition API) to segment the audio data and convert it into text. In addition, the server uses an emotion analysis engine (e.g., a common emotion analysis tool) to identify emotional characteristics by analyzing the intonation, speed, tempo, etc., of the voice.

[0701] The collected textual information and emotional characteristics are integrated by the server, and meeting records are generated as emotionally tagged text. These records are organized so that important elements are highlighted and useful information is readily apparent when users review them later.

[0702] As a concrete example, in a company project progress meeting, users record discussions about specific issues. Later, a server can detect emotions such as impatience or anxiety, tag them in the text, and reflect them in the meeting record. Such records can serve as a guide for project managers to take swift action.

[0703] Another example of a prompt to input into a generative AI model is: "Analyze the meeting audio data and list the emotions detected from all speakers. Analyze particularly emphasized emotions and related statements, and indicate how they might affect project progress." This prompt allows the AI to provide detailed analysis to support user decision-making.

[0704] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0705] Step 1:

[0706] The terminal supports the user in recording audio during a meeting. It receives real-time audio data collected from the meeting as input and saves it as a recorded audio file (e.g., WAV format) as output. Once recording is complete, the terminal prepares to send the audio file to the server.

[0707] Step 2:

[0708] The server receives audio files sent from the terminal. It uses a speech recognition device (e.g., a common speech recognition API) to take the audio file as input and convert it into text information as output. The server segments the audio file and sends each segment to the speech recognition device for sequential transcription.

[0709] Step 3:

[0710] The server identifies emotional characteristics based on audio and text data. It receives audio and text data as input and analyzes the data using an emotion analysis engine (e.g., a common emotion analysis tool). As output, it extracts emotional characteristics (e.g., joy, anxiety, anger) for each segment. This process takes into account factors such as intonation, speed, and tempo of the speech.

[0711] Step 4:

[0712] The server generates meeting records by adding emotional characteristics to textual information. It receives textual information and emotional characteristics as input and creates meeting records with emotional tags as output. The server uses an integration algorithm to consolidate the recorded content and its emotional characteristics.

[0713] Step 5:

[0714] Users view the created meeting records and use them for decision-making and information analysis. As input, they receive sentiment-tagged meeting records provided by the server, and as output, they use this as a basis for project management and strategic planning decisions. Through reviewing the meeting records, users can understand the emotional responses of the participants.
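
To make Step 3 more concrete, the following sketch derives two simple features per segment, loudness (RMS level) and speaking rate (words per second), and maps them to a label with a toy rule. The thresholds and the mapping from features to emotions are placeholders invented for this illustration. An actual emotion analysis engine would typically be a trained model and would also use intonation, which is not computed here.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    samples: List[float]  # mono samples in [-1.0, 1.0]
    sample_rate: int      # samples per second
    text: str             # transcript of this segment


def rms(samples: List[float]) -> float:
    """Root-mean-square level, used here as a loudness proxy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def speaking_rate(seg: Segment) -> float:
    """Words per second over the duration of the segment."""
    duration = len(seg.samples) / seg.sample_rate
    return len(seg.text.split()) / duration if duration else 0.0


def classify(seg: Segment, loud: float = 0.3, fast: float = 3.0) -> str:
    """Toy rule with placeholder thresholds; not a validated emotion model."""
    is_loud = rms(seg.samples) >= loud
    is_fast = speaking_rate(seg) >= fast
    if is_loud and is_fast:
        return "anger"
    if is_fast:
        return "anxiety"
    if is_loud:
        return "joy"
    return "neutral"
```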

[0715] (Application Example 2)

[0716] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0717] In meetings and staff gatherings, simply transcribing audio is insufficient to reflect the true intentions and emotions of the speakers. This can lead to a lack of information necessary for post-meeting decision-making and improving dialogue. Therefore, there is a need to create rich meeting minutes that also take into account the emotions of the speakers.

[0718] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0719] In this invention, the server includes means for receiving audio information from a communication device, means for analyzing emotions from the audio information and generating emotional information, and means for presenting the emotional information as feedback on the emotional state of participants after the meeting. This makes it possible to create meeting minutes and interpret intentions while taking into account the emotional state of participants after the meeting.

[0720] A "communication device" is an electronic device that has the function of transmitting and receiving voice information.

[0721] "Audio information" refers to sound data that records what was said during a meeting or conference.

[0722] "Speech recognition technology" refers to algorithms and programs used to transcribe audio data into text.

[0723] "Emotional information" refers to data extracted from audio data that represents the emotional state of the speaker.

[0724] "Meeting minutes" are documents that record the statements and content of a meeting, making them available for later reference.

[0725] A "summary" is information that is compiled by extracting the important points from meeting minutes and presenting them concisely.

[0726] "Information sources" refer to any data that is referenced, including the content of meetings and related materials.

[0727] "Feedback" refers to information provided to participants after a meeting, including reflections and advice regarding their comments and emotional state.

[0728] To implement this invention, a communication terminal is required for recording audio during a meeting or conference. The user uses this communication terminal to collect audio information in real time and transmit it to a server. The server receives the audio information and not only transcribes it using speech recognition technology, but also performs sentiment analysis to extract emotional information.

[0729] Speech recognition technology can utilize widely used libraries such as speech_recognition and Google's speech recognition API. Furthermore, specific algorithms are used to analyze the intonation and speed of speech in order to extract emotional information. These algorithms can not only identify emotions but also evaluate their intensity.

[0730] The generated transcripts and sentiment information are integrated and compared on the server and saved as meeting minutes. This allows users to easily review the minutes after the meeting, and analyze the sentiment information in particular as feedback to help with future decision-making.

[0731] As a concrete example, when discussing project progress in a meeting, not only the speaker's intentions but also their emotions, such as impatience, can be analyzed, allowing for a review of project priorities based on this analysis. An example of a prompt to input into the generative AI model is, "Perform an emotion analysis of this meeting and add emotion tags to the meeting minutes." This prompt enables the server to perform the necessary data processing and analysis.

[0732] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0733] Step 1:

[0734] The terminal records the audio of meetings and conferences in real time and sends it to the server as audio information. The input is the recorded audio data, and the output is the audio information sent to the server. At this stage, acoustic processing such as noise cancellation is important to maintain the quality of the audio.

[0735] Step 2:

[0736] The server inputs the received audio information into speech recognition technology and performs transcription. The input is audio information, and the output is the transcribed text. The server uses multiple speech recognition models and receives results from each model to improve accuracy.
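
The multi-model step above can be sketched as follows. This is only an illustration under stated assumptions, not the integration method of this disclosure: it assumes each model returns a plain-text transcript of the same audio segment, and it selects the transcript that agrees most closely with the others, using word-level similarity from Python's standard `difflib` module. The function name `pick_consensus`, the model names, and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def pick_consensus(transcripts: dict[str, str]) -> tuple[str, str]:
    """Return (model_name, text) of the transcript most similar to the others.

    Assumption: every model transcribed the same audio segment.
    """
    if not transcripts:
        raise ValueError("at least one transcript is required")

    def agreement(name: str) -> float:
        words = transcripts[name].lower().split()
        others = [t.lower().split() for n, t in transcripts.items() if n != name]
        if not others:
            return 1.0
        # Average word-level similarity against every other model's output.
        return sum(SequenceMatcher(None, words, o).ratio() for o in others) / len(others)

    best = max(transcripts, key=agreement)
    return best, transcripts[best]


# Hypothetical outputs from three recognition models for one segment.
results = {
    "model_a": "the release is scheduled for next friday",
    "model_b": "the release is scheduled for next friday",
    "model_c": "the lease is scheduled for next fried day",
}
name, text = pick_consensus(results)
```

A consensus pick of this kind keeps one whole transcript per segment. A finer-grained variant could align the outputs and vote word by word, at the cost of a more involved alignment step.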

[0737] Step 3:

[0738] The server analyzes audio information to extract emotional information. The input is audio information, and it identifies emotions by analyzing the intonation and tempo of the voice. The output is data representing the speaker's emotional state. The emotion engine analyzes the audio patterns and generates emotional information as emotion tags.
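
As a rough sketch of how intonation and tempo could be turned into an emotion tag with an intensity, the example below assumes that per-frame pitch values (in Hz), a word count, and a duration have already been extracted from the audio by some upstream analysis. The labels, the thresholds, and the names `EmotionTag` and `tag_emotion` are hypothetical placeholders and are not taken from this disclosure.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class EmotionTag:
    label: str
    intensity: float  # 0.0 (weak) to 1.0 (strong)


def tag_emotion(pitch_hz: list[float], word_count: int, duration_s: float) -> EmotionTag:
    """Map intonation (pitch variability) and tempo (words/s) to a coarse tag.

    All thresholds are hypothetical placeholders, not values from the disclosure.
    """
    if duration_s <= 0 or not pitch_hz:
        raise ValueError("need a positive duration and at least one pitch sample")

    tempo = word_count / duration_s                  # words per second
    variability = pstdev(pitch_hz) / mean(pitch_hz)  # relative pitch spread

    # Normalise each cue to 0..1 against assumed upper bounds.
    tempo_score = min(tempo / 4.0, 1.0)
    pitch_score = min(variability / 0.3, 1.0)
    intensity = round((tempo_score + pitch_score) / 2, 2)

    if tempo > 3.0 and variability > 0.15:
        label = "impatient"
    elif tempo < 1.5 and variability < 0.05:
        label = "calm"
    else:
        label = "neutral"
    return EmotionTag(label, intensity)


# Fast speech with wide pitch swings versus slow, flat speech.
fast = tag_emotion([180, 240, 150, 260, 170], word_count=35, duration_s=10.0)
slow = tag_emotion([120, 121, 119, 120, 120], word_count=12, duration_s=10.0)
```

In practice a trained model, such as the emotion identification model 59, would replace the fixed rules. The rule-based form is used here only to make the input and output of the step concrete.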

[0739] Step 4:

[0740] The server integrates transcript data and sentiment information to generate meeting minutes. The input is the transcript text and sentiment information, and the output is meeting minutes with sentiment tags. Based on the sentiment information, the server considers the importance of each statement and adjusts each section within the meeting minutes accordingly.
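
The integration of transcript text and emotion information can be illustrated with the following sketch. It assumes that utterances carry a start time and that emotion information is available as time spans, and it attaches to each utterance the span that covers its start time. The data classes, the field names, and the output line format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start_s: float
    speaker: str
    text: str


@dataclass
class EmotionSpan:
    start_s: float
    end_s: float
    label: str


def build_minutes(utterances: list[Utterance], emotions: list[EmotionSpan]) -> list[str]:
    """Attach to each utterance the emotion span covering its start time."""
    lines = []
    for u in sorted(utterances, key=lambda u: u.start_s):
        label = next(
            (e.label for e in emotions if e.start_s <= u.start_s < e.end_s),
            "untagged",  # no span covers this utterance
        )
        lines.append(f"[{u.start_s:06.1f}s] {u.speaker} <{label}>: {u.text}")
    return lines


minutes = build_minutes(
    [Utterance(12.0, "B", "We are two weeks behind."),
     Utterance(3.5, "A", "Let's review the schedule.")],
    [EmotionSpan(0.0, 10.0, "calm"), EmotionSpan(10.0, 20.0, "impatient")],
)
```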

[0741] Step 5:

[0742] The server summarizes the generated meeting minutes and extracts key points. The input is the detailed meeting minutes, and the output is the summarized meeting minutes. Sentiment information is used to highlight particularly noteworthy sections.
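
The following sketch shows one way emotion information could influence which statements appear in a summary and which are highlighted. It substitutes a simple keyword-based extractive heuristic for the generative AI summarization, purely to keep the example self-contained. The keyword list, the scoring weights, the 0.7 highlight threshold, and the `(!)` marker are all hypothetical.

```python
def summarize(entries: list[tuple[str, str, float]], top_n: int = 2) -> list[str]:
    """Pick the top_n entries by a simple score and mark strongly emotional ones.

    Each entry is (text, emotion_label, intensity in 0..1).
    The keyword list, weights, and 0.7 highlight threshold are hypothetical.
    """
    keywords = {"decide", "decided", "deadline", "risk", "action"}

    def score(entry):
        text, _label, intensity = entry
        hits = sum(1 for w in text.lower().replace(".", "").split() if w in keywords)
        return hits + intensity  # emotion intensity boosts importance

    chosen = sorted(entries, key=score, reverse=True)[:top_n]
    summary = []
    for text, label, intensity in entries:  # keep original meeting order
        if (text, label, intensity) in chosen:
            mark = "(!) " if intensity >= 0.7 else ""
            summary.append(f"{mark}{text} [{label}]")
    return summary


points = summarize([
    ("We decided to move the deadline.", "impatient", 0.8),
    ("Lunch will be at noon.", "calm", 0.1),
    ("There is a risk in the vendor contract.", "anxious", 0.5),
])
```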

[0743] Step 6:

[0744] The user inputs prompt text into the server, and a generative AI model is used to obtain feedback that includes emotional information. The input is the prompt text from the user, and the output is the analyzed feedback. Based on this, the server presents the emotional trends during the meeting to the participants.
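
A minimal sketch of how the prompt text and the tagged minutes might be combined before being passed to a generative AI model is shown below. The `generate` argument is a hypothetical stand-in for whichever model client is actually used, and the prompt layout is an assumption made for illustration.

```python
from typing import Callable


def build_feedback_prompt(instruction: str, minutes: list[str]) -> str:
    """Combine the user's instruction with the tagged minutes into one prompt."""
    body = "\n".join(f"- {line}" for line in minutes)
    return f"{instruction}\n\nMeeting minutes:\n{body}"


def request_feedback(instruction: str, minutes: list[str],
                     generate: Callable[[str], str]) -> str:
    # `generate` is a hypothetical stand-in for the generative AI client in use.
    return generate(build_feedback_prompt(instruction, minutes))


prompt = build_feedback_prompt(
    "Perform an emotion analysis of this meeting and add emotion tags to the meeting minutes.",
    ["A <calm>: Let's review the schedule.", "B <impatient>: We are two weeks behind."],
)

# A dummy generator is used here so the example runs without any external service.
reply = request_feedback(
    "Summarize emotional trends.",
    ["A <calm>: ok"],
    lambda p: f"{len(p)} chars received",
)
```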

[0745] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0746] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0747] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0748] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to an emotion map (see Figure 9), which serves as the specific mapping. Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0749] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. In the upper and lower directions of the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. Also, the upper side of the concentric circles is where "pleasant" emotions are located, and the lower side is where "unpleasant" emotions are located. In this way, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0750] These emotions are distributed at the 3 o'clock position on the emotion map 400 and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0751] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).

[0752] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
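
The idea that discomfort grows as balances deviate from the ideal, and that pleasure grows as they approach it, can be sketched numerically as below. This is a hypothetical scoring scheme, not one described in this disclosure: each quantity's deviation from its ideal is expressed in units of an assumed tolerance, and the average is mapped onto a scale from -1 (unpleasant) to +1 (pleasant). The quantities, ideals, and tolerances are invented for illustration.

```python
def comfort_score(readings: dict[str, float],
                  ideals: dict[str, float],
                  tolerances: dict[str, float]) -> float:
    """Return a score in [-1, 1]; positive means pleasant, negative unpleasant.

    Each deviation is measured in units of its tolerance and capped at 2.
    """
    deviations = [
        min(abs(readings[key] - ideals[key]) / tolerances[key], 2.0)
        for key in ideals
    ]
    # No deviation gives +1; a deviation of two tolerances or more gives -1.
    return 1.0 - sum(deviations) / len(deviations)


# Hypothetical robot state: battery level (0..1) and body tilt in degrees.
ideals = {"battery": 1.0, "tilt_deg": 0.0}
tolerances = {"battery": 0.5, "tilt_deg": 10.0}

near_ideal = comfort_score({"battery": 0.9, "tilt_deg": 1.0}, ideals, tolerances)
far_from_ideal = comfort_score({"battery": 0.1, "tilt_deg": 25.0}, ideals, tolerances)
```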

[0753] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0754] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
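
To make the relationship between map position and emotion values concrete, the sketch below places a few emotions on a polar layout and selects the dominant emotion from a set of emotion values. The coordinates and the emotion values are hypothetical and are not read from Figure 9 or Figure 10. The values stand in for the output of the pre-trained neural network, which is not modeled here.

```python
import math

# Hypothetical placements: (ring, angle_deg). Ring 1 is the innermost circle;
# the angle is measured counter-clockwise from the 3 o'clock direction.
EMOTION_MAP = {
    "reassured": (2, 10.0),
    "calm": (2, 20.0),
    "confident": (2, 30.0),
    "desire": (2, 160.0),
}


def to_xy(ring: int, angle_deg: float) -> tuple[float, float]:
    """Convert a (ring, angle) placement to Cartesian coordinates."""
    rad = math.radians(angle_deg)
    return ring * math.cos(rad), ring * math.sin(rad)


def map_distance(a: str, b: str) -> float:
    """Euclidean distance between two emotions on the assumed map layout."""
    return math.dist(to_xy(*EMOTION_MAP[a]), to_xy(*EMOTION_MAP[b]))


def dominant_emotion(values: dict[str, float]) -> str:
    """Return the emotion with the highest emotion value."""
    return max(values, key=values.get)


# Hypothetical network output: neighbouring emotions carry similar values.
values = {"reassured": 0.72, "calm": 0.80, "confident": 0.69, "desire": 0.10}
user_emotion = dominant_emotion(values)
```

With a layout like this, the training constraint described above could be checked by comparing `map_distance` between two emotions with the difference between their emotion values.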

[0755] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0756] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0757] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-temporary storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-temporary storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0758] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0759] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0760] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0761] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0762] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0763] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0764] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0765] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0766] The following is further disclosed regarding the embodiments described above.

[0767] (Claim 1)

[0768] A means of receiving voice data from a communication terminal,

[0769] A means for transmitting the aforementioned audio data to multiple speech recognition models and receiving transcription results from each model,

[0770] A means for comparing and integrating the aforementioned multiple transcription results to generate a single meeting minutes,

[0771] A means for summarizing the aforementioned meeting minutes and extracting key points,

[0772] A means for receiving questions from users and answering those questions based on the aforementioned meeting minutes and related materials,

[0773] A system that includes this.

[0774] (Claim 2)

[0775] The system according to claim 1, further comprising means for evaluating the reliability of the aforementioned speech recognition models and selecting the optimal combination.

[0776] (Claim 3)

[0777] The system according to claim 1, comprising means for receiving relevant materials from a communication terminal and analyzing them in relation to the minutes of the meeting.

[0778] "Example 1"

[0779] (Claim 1)

[0780] A means of acquiring voice information from a communication device,

[0781] A means for supplying the aforementioned audio information to multiple speech recognition algorithms and obtaining text output from each algorithm,

[0782] A means for comparing and integrating the aforementioned multiple text outputs to create a single document,

[0783] A means for summarizing the aforementioned document and extracting important information,

[0784] A means for obtaining questions from users and answering those questions based on the aforementioned documents and related information sources,

[0785] Means for adjusting the aforementioned audio information by digital processing,

[0786] Means for protecting the transfer of voice information by an initialized security protocol,

[0787] A means for scrutinizing information using a generative AI model according to the importance of the aforementioned document,

[0788] A system that includes this.

[0789] (Claim 2)

[0790] The system according to claim 1, comprising means for evaluating the reliability of the aforementioned speech recognition algorithms and extracting the most suitable combination.

[0791] (Claim 3)

[0792] The system according to claim 1, comprising means for obtaining relevant information sources from a communication device and evaluating them in association with the document.

[0793] "Application Example 1"

[0794] (Claim 1)

[0795] A means of receiving audio data from a communication device,

[0796] A means for transmitting the aforementioned audio data to multiple acoustic recognition models and receiving string conversion results from each model,

[0797] A means for comparing and integrating the aforementioned multiple string conversion results to generate a single record document,

[0798] A means for summarizing the aforementioned record document and extracting important information elements,

[0799] A means for receiving inquiries from users and responding to those inquiries based on the aforementioned record documents and related information,

[0800] A means of recording and analyzing conversations in stores and automatically generating and summarizing customer feedback,

[0801] A system that includes this.

[0802] (Claim 2)

[0803] The system according to claim 1, further comprising means for evaluating the reliability of the aforementioned acoustic recognition model and selecting the optimal combination.

[0804] (Claim 3)

[0805] The system according to claim 1, comprising means for receiving relevant information from a communication device and analyzing it in correlation with the aforementioned record document.

[0806] "Example 2 of combining an emotion engine"

[0807] (Claim 1)

[0808] A means for receiving voice information from a communication device,

[0809] A means for transmitting the aforementioned voice information to multiple voice recognition devices and receiving text information from each device,

[0810] A means for analyzing the aforementioned textual information and corresponding audio information to identify emotional characteristics,

[0811] A means for generating meeting records by adding the aforementioned emotional characteristics to textual information,

[0812] A means for summarizing the aforementioned meeting minutes and extracting important elements,

[0813] A means for receiving a question from a person and answering the question based on the meeting record and related information,

[0814] A system that includes this.

[0815] (Claim 2)

[0816] The system according to claim 1, further comprising means for evaluating the reliability of the speech recognition device and selecting the optimal combination.

[0817] (Claim 3)

[0818] The system according to claim 1, further comprising means for receiving relevant information from a communication device and analyzing it in relation to the meeting records.

[0819] "Application example 2 when combining with an emotional engine"

[0820] (Claim 1)

[0821] A means for receiving voice information from a communication device,

[0822] A means for transmitting the aforementioned audio information to multiple speech recognition technologies and receiving transcription results from each technology,

[0823] A means for analyzing emotions from the aforementioned audio information and generating emotional information,

[0824] A means for comparing and integrating the aforementioned multiple transcription results and sentiment information to generate a single meeting record,

[0825] A means for summarizing the aforementioned meeting minutes, extracting key points, and adjusting the summary based on sentiment information,

[0826] A means for receiving inquiries from users and responding to those inquiries based on the minutes and related information sources,

[0827] A means for presenting the aforementioned emotional information as feedback on the emotional state of participants after the meeting,

[0828] A system that includes this.

[0829] (Claim 2)

[0830] The system according to claim 1, further comprising means for evaluating the reliability of the aforementioned speech recognition technology and determining the optimal combination.

[0831] (Claim 3)

[0832] The system according to claim 1, comprising means for receiving relevant information sources from a communication device and analyzing them in conjunction with the minutes and emotional states during the meeting.

[Explanation of Symbols]

[0833]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: a means for receiving audio data from a communication device; a means for transmitting the audio data to multiple acoustic recognition models and receiving string conversion results from each model; a means for comparing and integrating the multiple string conversion results to generate a single record document; a means for summarizing the record document and extracting important information elements; a means for receiving inquiries from users and responding to those inquiries based on the record document and related information; and a means for recording and analyzing conversations in stores and automatically generating and summarizing customer feedback.

2. The system according to claim 1, further comprising means for evaluating the reliability of the aforementioned acoustic recognition model and selecting the optimal combination.

3. The system according to claim 1, comprising means for receiving relevant information from a communication device and analyzing it in correlation with the aforementioned record document.