system
The system addresses the inefficiency of manual meeting minute creation by automatically converting audio to text, extracting key information, and generating structured summaries, enhancing meeting productivity.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-19
AI Technical Summary
Creating meeting minutes requires participants to focus on recording words, hindering active contribution to discussions, and the process of summarizing important information is time-consuming and laborious, leading to inefficiency.
A system that automatically converts meeting audio into text data, extracts important information using natural language processing, and generates meeting minutes in a predefined format, allowing participants to concentrate on discussions.
Significantly reduces the time and effort spent on creating meeting minutes, enabling efficient information sharing and improving meeting productivity by providing accurate and structured meeting summaries.
Figure 2026100733000001_ABST
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] The creation of meeting minutes in a meeting requires participants to focus on recording the words at that moment, which is a factor hindering their active contribution to the discussion. Furthermore, the work of appropriately grasping the content of the meeting, extracting and summarizing important information is time-consuming and laborious, so it is not efficient. Therefore, there is a need to provide a method for quickly and accurately creating meeting minutes while providing an environment in which meeting participants can concentrate on the original discussion.
Means for Solving the Problems
[0005] This invention is a system that automatically acquires meeting audio and records information by converting the audio data into text data. Furthermore, it extracts important information from this text data and automatically generates meeting minutes according to a pre-specified format. The generated meeting minutes are quickly sent to meeting participants, freeing them from the burden of creating minutes and allowing them to concentrate on the discussion. This system effectively solves the problem of improving the efficiency of meeting minute creation and increasing meeting productivity.
[0006] "Audio data" refers to recorded sound information collected during a meeting, including acoustic signals such as spoken language.
[0007] "Text data" refers to information in text format converted from audio data, specifically the transcribed content of the speech.
[0008] "Important information" refers to key content related to the progress and decision-making of a meeting, and is an element that should be reflected in the meeting minutes.
[0009] A "format" is a predefined structural pattern used in creating meeting minutes, and it is a framework for organizing and displaying information.
[0010] Meeting minutes are documents that summarize the main points and decisions made during a meeting, and are managed and shared as a record after the meeting.
[0011] "Natural language processing technology" refers to the technology used by computers to understand and generate human natural language, enabling the analysis and processing of text data.
Brief Description of the Drawings
[0012]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
Embodiments for Carrying Out the Invention
[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0014] First, the language used in the following description will be explained.
[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as a work memory.
[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] This invention is a system that automatically converts meeting audio data into meeting minutes, and efficiently creates meeting minutes using speech recognition technology and natural language processing technology. The following describes specific embodiments of this system.
[0034] First, the user configures the microphone connected to their device at the start of the meeting and launches a dedicated application to begin collecting audio data. When the user presses the "Start Recording" button, the device is configured to send an audio stream to the server. The server receives this audio stream and records it in a database in real time.
[0035] Next, the server periodically acquires the audio data and converts it into text data using a speech recognition API. This text data is stored in the server's text storage and managed for subsequent processing. For example, if a user says "Let's move on to the next agenda item" during a meeting, speech recognition generates the text data "Let's move on to the next agenda item".
[0036] Next, the server uses natural language processing technology to extract important information from the text data. Specifically, it analyzes key elements of the meeting, such as agenda items, decisions made, and assigned personnel, and extracts the necessary information. This process yields information that captures the core of the meeting.
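The extraction described above could be approximated, in its simplest form, by matching cue phrases in each sentence of the transcript. The sketch below is only an illustration: the cue phrases, the category names, and the function `extract_important_info` are assumptions made for this example and do not appear in the disclosure, which leaves the natural language processing method open.

```python
import re
from collections import defaultdict

# Hypothetical cue phrases per category; a production system would more
# likely rely on a trained model than on fixed keywords.
CUES = {
    "agenda": re.compile(r"\b(agenda|discuss|next item)\b", re.I),
    "decision": re.compile(r"\b(decided|agreed|approved)\b", re.I),
    "assignee": re.compile(r"\b(will handle|is responsible for|is assigned)\b", re.I),
}


def extract_important_info(transcript: str) -> dict[str, list[str]]:
    """Group transcript sentences by the first cue category they match."""
    found = defaultdict(list)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        for label, pattern in CUES.items():
            if pattern.search(sentence):
                found[label].append(sentence)
                break  # one category per sentence keeps the sketch simple
    return dict(found)


transcript = (
    "Let's move on to the next agenda item. "
    "We agreed to ship the beta in March. "
    "Sato will handle the release notes. "
    "The weather is nice today."
)
info = extract_important_info(transcript)
for label, sentences in info.items():
    print(label, sentences)
```

Sentences that match no cue, such as the remark about the weather, are simply left out of the result.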
[0037] Based on this important information, the server organizes the information according to the specified format and automatically generates meeting minutes. The format is pre-configured by the user and may consist of sections such as "Agenda: Content," "Decisions: Content," and "Next Steps: Content."
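One way to picture a pre-configured format is as an ordered list of section headings, each paired with the key under which the extracted items are stored. This is a minimal sketch under that assumption; `MINUTES_FORMAT`, the key names, and `render_minutes` are hypothetical and are not taken from the disclosure.

```python
# Hypothetical user-configured format: ordered (heading, key) pairs.
MINUTES_FORMAT = [
    ("Agenda", "agenda"),
    ("Decisions", "decisions"),
    ("Next Steps", "next_steps"),
]


def render_minutes(info: dict[str, list[str]], fmt=MINUTES_FORMAT) -> str:
    """Render extracted items as 'Heading: Content' lines in format order."""
    lines = []
    for heading, key in fmt:
        # Sections with nothing extracted are kept so the layout stays stable.
        items = info.get(key) or ["(none)"]
        lines.append(f"{heading}: {'; '.join(items)}")
    return "\n".join(lines)


minutes = render_minutes({
    "agenda": ["Goals for the next project"],
    "decisions": ["Ship the beta in March"],
})
print(minutes)
```

Keeping the format as data rather than code means a user could reorder or rename sections without touching the rendering function.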
[0038] Finally, users review the meeting minutes generated on their devices and make minor revisions as needed. After revisions, the server sends the completed minutes to all stakeholders, enabling efficient information sharing and confirmation. This system significantly reduces the time and effort spent by meeting participants on creating minutes, providing an environment where they can focus on the discussion.
[0039] The following describes the processing flow.
[0040] Step 1:
[0041] The user launches a dedicated application on their device, configures the microphone, and presses the "Start Recording" button. The device continuously captures the audio of the meeting and streams the audio data to the server in real time.
[0042] Step 2:
[0043] The server receives the audio stream sent from the terminal and records it in the database. The audio data is scheduled to be retrieved at regular intervals for subsequent text conversion.
[0044] Step 3:
[0045] The server sends the recorded audio data to a speech recognition API, which converts it into corresponding text data. This text data is then stored on the server as a transcript of the meeting.
[0046] Step 4:
[0047] The server applies natural language processing techniques to the obtained text data to extract important information. Specifically, it uses keyword extraction and semantic analysis to detect key elements such as agenda items, decisions, and responsible parties.
[0048] Step 5:
[0049] The server organizes the extracted key information according to a predefined format and generates the initial meeting minutes. This format is a standard meeting minutes format and can be customized to meet user needs.
[0050] Step 6:
[0051] Users review the meeting minutes generated on their devices and make necessary corrections based on the displayed information. These corrections typically involve minor adjustments to wording or the addition of crucial information.
[0052] Step 7:
[0053] After revisions, the server saves the finalized meeting minutes as the final version and automatically distributes them to meeting participants and designated stakeholders via email or intranet. The generated minutes can also be accessed for later viewing.
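Steps 2 and 3 above imply that the server keeps track of which recorded audio has already been sent for text conversion. The following sketch shows one possible bookkeeping scheme using an in-memory stand-in for the database; the class `AudioStore` and its methods are assumptions for illustration, and the scheduling of the regular retrieval itself is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class AudioStore:
    """In-memory stand-in for the server-side database of audio chunks."""

    chunks: list[bytes] = field(default_factory=list)
    cursor: int = 0  # index of the first chunk not yet sent for transcription

    def record(self, chunk: bytes) -> None:
        """Step 2: append a chunk as it arrives from the terminal."""
        self.chunks.append(chunk)

    def fetch_pending(self) -> bytes:
        """Return all audio recorded since the previous fetch, then advance."""
        pending = b"".join(self.chunks[self.cursor:])
        self.cursor = len(self.chunks)
        return pending


store = AudioStore()
store.record(b"\x01\x02")
store.record(b"\x03")
first = store.fetch_pending()   # both chunks recorded so far
store.record(b"\x04")
second = store.fetch_pending()  # only the chunk recorded after the first fetch
```

A periodic job would call `fetch_pending()` at each interval and pass the returned bytes to the speech recognition step.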
[0054] (Example 1)
[0055] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0056] In modern workplaces, accurately and quickly recording meeting content and sharing information among stakeholders is crucial. However, manually creating meeting minutes is time-consuming, labor-intensive, and prone to human error. Furthermore, effectively extracting and organizing important information is challenging. Therefore, to address these issues, there is a need for a system that automatically converts audio data into text data, quickly extracts key elements, and generates well-organized reports.
[0057] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0058] In this invention, the server includes means for setting up an audio input device and collecting an audio signal at the start of a meeting, means for transmitting the audio signal to a central processing unit, and means for converting the audio signal into text information using speech recognition technology in the central processing unit. This makes it possible to automatically record the contents of a meeting and to quickly and accurately extract important information.
[0059] An "audio input device" is a device used to accurately acquire audio signals and plays the role of converting speech during a meeting into electrical signals.
[0060] A "central processing unit" refers to a core computing mechanism for processing received audio signals, and is a computer system that performs audio data analysis and text conversion.
[0061] "Speech recognition technology" is a technology that generates text data from speech signals, and includes algorithms that automatically convert human speech into text information.
[0062] "Text information" refers to string data that represents the content of speech, generated using speech recognition technology, and serves as a visual record of the meeting's proceedings.
[0063] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text information, and is a language processing system used to identify important elements and patterns within documents.
[0064] A "report" is a document that is automatically formatted and organized based on text information generated using an audio input device and a central processing unit, and functions as a formal record containing the contents and decisions of a meeting.
[0065] This invention is a system that automatically converts meeting audio data into meeting minutes, utilizing speech recognition technology and natural language processing technology. The user first connects and configures an audio input device, such as a microphone, to the terminal. A dedicated application runs on the terminal, allowing for real-time collection of audio signals during the meeting.
[0066] Once an audio signal is collected, the terminal sends the audio data to a central processing unit, specifically a server. The server receives this data and uses speech recognition technology to convert the audio signal into text information. A common example of a speech recognition API used in this process is a cloud-based service (such as Google® Cloud Speech-to-Text API).
[0067] Next, the server applies natural language processing techniques to analyze this text information and extract important elements, such as agenda items and decisions. Suitable software for this purpose includes natural language processing libraries such as spaCy and NLTK.
[0068] Once key information is extracted, the server automatically generates a report according to the user's specified format. The report organizes the information in a format such as "Agenda: Content" and "Decisions: Content." The user can then review this report and, if necessary, use the generative AI model to make further revisions to the wording.
[0069] The generated report is distributed to all relevant parties via a server. This distribution is done via email or cloud storage service (e.g., Google Drive or Dropbox). This significantly reduces the time meeting participants spend creating meeting minutes, allowing them to focus on the discussion itself.
[0070] For example, if a user says "We will discuss setting goals for the next project" during a meeting, the server can incorporate this into the report as "Agenda: Setting goals for the next project." Another example of a prompt for the generative AI model is "Create a report from the meeting audio data, including the agenda and decisions made."
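The utterance-to-agenda example can be illustrated with a small pattern-based rewrite. This is a sketch only: the lead-in phrases in `AGENDA_LEAD_IN` and the function `to_agenda_line` are hypothetical, and the disclosure does not state that the server works this way.

```python
import re
from typing import Optional

# Hypothetical lead-in phrases that introduce an agenda topic.
AGENDA_LEAD_IN = re.compile(
    r"^(?:we will|we'll|let's)\s+discuss\s+(?P<topic>.+?)[.!]?$", re.I
)


def to_agenda_line(utterance: str) -> Optional[str]:
    """Return 'Agenda: <Topic>' if the utterance introduces a topic, else None."""
    match = AGENDA_LEAD_IN.match(utterance.strip())
    if match is None:
        return None
    topic = match.group("topic")
    # Capitalize only the first character so the rest of the topic is untouched.
    return f"Agenda: {topic[0].upper()}{topic[1:]}"


line = to_agenda_line("We will discuss setting goals for the next project")
print(line)
```

Utterances that do not start with one of the assumed lead-in phrases return `None` and would be handled by other extraction rules.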
[0071] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0072] Step 1:
[0073] The user connects an audio input device to their terminal and launches a dedicated application. This application provides an interface for collecting audio signals at the start of a meeting. The input is the audio from the meeting, and the output is recorded on the terminal as an audio signal. Specifically, the collection of audio data begins when the user presses the "Start Recording" button.
[0074] Step 2:
[0075] The terminal sends the collected audio signal to the server. Here, the audio signal recorded within the terminal is used as input, and the audio data is sent to the server in real time as output. Transmission is performed using a streaming protocol, and the data is received by the server.
[0076] Step 3:
[0077] The server converts the received audio signal into text information by inputting it into a speech recognition API. The input is the audio data received on the server, and the output is the parsed text data. Specifically, for example, the Google Cloud Speech-to-Text API is used to convert the audio content into text.
[0078] Step 4:
[0079] The server uses natural language processing techniques to extract important elements from the text information. The input is text data obtained through speech recognition, and the output is important information such as agenda items and decisions. This process utilizes natural language processing libraries (e.g., spaCy and NLTK).
[0080] Step 5:
[0081] The server generates a report based on the extracted key information according to the user's specified format. The input here is the extracted key information, and the output is a formatted report. The server organizes the text information according to the specified format and compiles it into a report.
[0082] Step 6:
[0083] The user reviews the generated report on their device and, as needed, corrects the wording using the generative AI model. The input is the text data of the generated report, and the output is the corrected report. The user can use a correction prompt to refine the report into more natural-sounding wording.
[0084] Step 7:
[0085] The server distributes the final report to all stakeholders. The input is the revised meeting minutes, and the output is the final report delivered to the recipients. Email and cloud storage services are used for distribution, ensuring efficient information sharing.
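Because each step above names its input and output, Steps 3 to 7 can be read as a chain of functions in which every output feeds the next stage. The sketch below expresses that chain with injected callables. The function `run_pipeline` and all stage names are assumptions for illustration, and the stub stages stand in for the speech recognition API, the natural language processing library, the generative AI model, and the distribution service without calling any of them.

```python
from typing import Callable


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    extract: Callable[[str], dict],
    render: Callable[[dict], str],
    revise: Callable[[str], str],
    distribute: Callable[[str], None],
) -> str:
    """Chain Steps 3 to 7; each stage's output is the next stage's input."""
    text = transcribe(audio)   # Step 3: audio data -> text information
    info = extract(text)       # Step 4: text information -> key information
    draft = render(info)       # Step 5: key information -> formatted report
    final = revise(draft)      # Step 6: report -> corrected report
    distribute(final)          # Step 7: corrected report -> recipients
    return final


# Stub stages for illustration only; none of them calls a real service.
sent = []
report = run_pipeline(
    b"fake-audio",
    transcribe=lambda audio: "We will discuss the budget.",
    extract=lambda text: {"agenda": text},
    render=lambda info: f"Agenda: {info['agenda']}",
    revise=lambda draft: draft.replace("We will discuss the", "The"),
    distribute=sent.append,
)
print(report)
```

Injecting the stages keeps the flow testable: any one stage can be swapped for a real service while the others remain stubs.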
[0086] (Application Example 1)
[0087] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0088] Traditional meetings required significant time and effort to manually record and transcribe audio data and create meeting minutes. Furthermore, real-time information sharing was often impossible, leading to considerable time spent on post-meeting review. To address these issues and enable more efficient and accurate information processing, an autonomous system is needed.
[0089] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0090] In this invention, the server includes means for receiving audio data acquired during a meeting, means for converting the audio data into text data, means for extracting core information from the text data, means for organizing the core information according to a specified format and generating a report, and means for autonomously taking in audio data and providing information immediately. This makes it possible to process information in real time during the meeting and to provide a report immediately after the meeting ends.
[0091] "Audio data" refers to digital or analog signals generated by the human voice.
[0092] "Text data" refers to digital text converted from speech, in a readable and writable format.
[0093] "Core information" refers to information within the data that is semantically important and relevant to the topic or decision of a meeting or discussion.
[0094] "Format" refers to the rules and styles that define the structure, arrangement, and presentation of data and information.
[0095] A "report" is a document that organizes relevant information about a specific event or activity, and includes conclusions and recommendations.
[0096] "Autonomously" means acting or making decisions independently, without the need for human or other external control.
[0097] The system for carrying out this invention acquires audio data, converts it to text data in real time, extracts important information, and generates a report. To achieve this, the following hardware and software are used.
[0098] First, the terminal uses a high-sensitivity microphone to capture the audio of the meeting. This microphone is designed to clearly capture the speaker's voice in the meeting environment. The audio data is then transmitted from the terminal to the server via the internet.
[0099] The server uses the Google Speech-to-Text API to quickly convert received audio data into text data. This digital conversion process allows for the generation of text information from audio information. Subsequently, the server analyzes the generated text data using natural language processing libraries such as spaCy and NLTK. This analysis extracts the core information.
[0100] The server organizes the extracted core information based on a user-defined format and generates a report. This report is structured to clearly present important information, such as agenda items and decisions, according to the specified format. Finally, the generated report is automatically sent to relevant parties. This makes it easy for meeting participants to review the information afterward and ensures that important decisions made during the meeting are properly managed.
[0101] As a concrete example, imagine a family meeting held at home where travel plans, shopping lists, and assigned household chores are discussed. In this scenario, reports related to these topics are generated in real time from the audio data of the meeting. Once the reports are ready, the system shares them with the relevant parties and provides guidance for the next steps.
[0102] An example of a prompt message would be, "Based on the content of the previous meeting, please create meeting minutes that include the following keywords: family trip, shopping list, assigned household chores." This would allow the system to efficiently extract and organize information that meets the user's needs.
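The example prompt can be produced from a keyword list with a small template function. This is a minimal sketch; the function name `build_minutes_prompt` is hypothetical, and how the prompt is passed to a generative model is outside what the sketch shows.

```python
def build_minutes_prompt(keywords: list[str]) -> str:
    """Fill the example prompt with a caller-supplied keyword list."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    return (
        "Based on the content of the previous meeting, please create meeting "
        "minutes that include the following keywords: "
        + ", ".join(keywords)
        + "."
    )


prompt = build_minutes_prompt(
    ["family trip", "shopping list", "assigned household chores"]
)
print(prompt)
```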
[0103] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0104] Step 1:
[0105] The user acquires audio data using a microphone connected to the device. Here, the input is what is said during the meeting. The device converts this into digital audio data and transmits it to the server via the internet.
[0106] Step 2:
[0107] The server converts the received audio data into text data using the Google Speech-to-Text API. The input is audio data, which is analyzed and decoded via the API, and the output is text data.
[0108] Step 3:
[0109] The server analyzes the text data using natural language processing libraries such as spaCy and NLTK. The input here is text data. The server extracts keywords and structural information from the text and generates important core information as output.
[0110] Step 4:
[0111] The server organizes the extracted core information based on the format set by the user and generates a report. The input is core information, which is structured according to formatting rules, and the output is a formatted report.
[0112] Step 5:
[0113] The server sends the generated report to the relevant parties. The input is the completed report, which is sent to the relevant parties as output via email or cloud service.
[0114] In this way, the server, terminal, and user elements work together to efficiently process meeting content and provide valuable information to participants.
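For the email path of Step 5, the report can be wrapped in a message object before it is handed to a mail server. The sketch below uses Python's standard `email.message.EmailMessage`; the function `build_report_email`, the subject line, and the example addresses are assumptions, and actual transmission (for example via `smtplib.SMTP.send_message`) is deliberately left out.

```python
from email.message import EmailMessage


def build_report_email(
    report: str, sender: str, recipients: list[str]
) -> EmailMessage:
    """Wrap a finished report in an email addressed to all relevant parties."""
    msg = EmailMessage()
    msg["Subject"] = "Meeting minutes"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(report)  # plain-text body
    return msg


msg = build_report_email(
    "Agenda: Family trip\nDecisions: Leave on Saturday",
    "minutes@example.com",
    ["a@example.com", "b@example.com"],
)
print(msg["To"])
```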
[0115] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0116] This invention relates to a system that automatically converts meeting audio data into meeting minutes and further incorporates participant emotions into the minutes. This system improves the quality and efficiency of meeting minute creation by combining speech recognition technology, natural language processing technology, and emotion recognition technology. A specific embodiment of this system is described below.
[0117] First, the user configures the device's built-in microphone to collect audio data at the start of the meeting and begins recording. The device captures audio throughout the meeting and sends it to the server.
[0118] The server receives audio data in real time and records it in a database. This audio data is converted into text data by a speech recognition API and stored in the database as text. The converted text is not merely a transcription, but is used as material for in-depth analysis.
[0119] Next, the server uses natural language processing technology to extract key information from the text data. This efficiently organizes the main agenda items, decisions, and next steps from the meeting. This information is automatically arranged according to a format pre-configured by the user, and the initial meeting minutes are generated.
[0120] Furthermore, the server uses an emotion recognition engine to analyze participants' emotions from the audio data. This analysis detects and appropriately records each speaker's emotional state (e.g., excitement, calmness, dissatisfaction). For example, if the emotion analysis detects that a user is excited during a discussion, this information is added to the meeting minutes, making it clear under what emotional circumstances important statements were made.
[0121] Ultimately, users review the meeting minutes generated on their devices, including sentiment data, and make adjustments as needed. The server then distributes the reviewed and revised minutes as the final version to each participant. This streamlines the creation of meeting minutes while simultaneously adding sentiment-based supplementary information, enabling the provision of more immersive information.
[0122] The following describes the processing flow.
[0123] Step 1:
[0124] The user launches a dedicated application on their device and sets up the microphone to record the meeting. Pressing the "Start Recording" button initiates audio collection. The device continuously transmits this audio data to the server in real time.
[0125] Step 2:
[0126] The server receives the audio stream sent from the terminal and records the audio data in the database. This recorded audio data is managed for subsequent speech recognition processing.
[0127] Step 3:
[0128] The server sends the audio data to a speech recognition API, which converts it into text data. This text data is stored in a database as a record of the meeting content. For example, if a user says, "Let's discuss the new project," that is saved as text data.
[0129] Step 4:
[0130] The server applies natural language processing techniques to extract important information (e.g., agenda items, opinions, decisions) from the text data. This information becomes the main content of the meeting and is used later to generate meeting minutes.
[0131] Step 5:
[0132] The server uses an emotion recognition engine to analyze the speaker's emotions from the audio data. The analysis results are recorded in a database as the emotional state for each statement (e.g., joy, surprise, criticism).
[0133] Step 6:
[0134] The server combines extracted key information with sentiment analysis results to generate meeting minutes in a format pre-configured by the user. This format includes elements such as "agenda," "speaker's sentiment," and "decisions made."
[0135] Step 7:
[0136] The user reviews the meeting minutes generated on their device, examining the overall content, including sentiment data. They correct or add information as needed. For example, they might correct misinterpretations of sentiment.
[0137] Step 8:
[0138] The server finalizes the revised meeting minutes and automatically sends them to meeting participants and relevant parties. Since the minutes include emotional information about important statements, participants can reconfirm the atmosphere of the meeting.
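As a minimal sketch of Step 6, the combination of extracted key information and per-statement emotion labels could look like the following. The `Statement` class, its `kind` field, and the sample statements are hypothetical and are not taken from the disclosure; only the three output elements ("agenda," "speaker's sentiment," and "decisions made") come from the text.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    speaker: str
    text: str
    emotion: str  # label from the emotion recognition engine (Step 5)
    kind: str     # hypothetical Step 4 label: "agenda", "decision", or "other"


def build_minutes(statements):
    """Combine key information with per-statement emotion labels (Step 6)."""
    minutes = {"agenda": [], "speaker's sentiment": [], "decisions made": []}
    for s in statements:
        if s.kind == "agenda":
            minutes["agenda"].append(s.text)
        elif s.kind == "decision":
            minutes["decisions made"].append(s.text)
        else:
            continue  # statements not flagged as key information are left out
        minutes["speaker's sentiment"].append(f"{s.speaker}: {s.emotion}")
    return minutes


# Made-up sample data for illustration only.
statements = [
    Statement("A", "Let's discuss the new project", "excitement", "agenda"),
    Statement("B", "It is raining today", "calmness", "other"),
    Statement("A", "We will start in April", "calmness", "decision"),
]
minutes = build_minutes(statements)
```

Keeping the emotion label next to each retained statement is one way to let the reviewer in Step 7 correct a misread emotion without touching the statement itself.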
[0139] (Example 2)
[0140] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0141] In recording and analyzing audio data from meetings, it is essential to efficiently and accurately extract and record important information. Furthermore, understanding the participants' mental states and reflecting them in the recording is necessary to gain a more comprehensive understanding of the meeting's content. Traditional methods often limit themselves to simple recording and conversion of audio, resulting in insufficient accuracy and efficiency in extracting important information and analyzing emotions.
[0142] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0143] In this invention, the server includes means for receiving audio information, means for converting audio information into text information, means for extracting important information from the text information, means for analyzing mental states and reflecting them in the record, and means for transmitting the generated record to the relevant parties. This enables efficient processing of audio data as well as the generation of detailed records that take into account the mental states of the participants.
[0144] "Audio information" refers to sound data acquired during meetings and conversations, and is typically analyzed using speech recognition technology.
[0145] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is a format designed to make information easier to visualize.
[0146] "Important information" refers to data from textual information that is deemed particularly significant in meetings and discussions, and is primarily extracted using natural language processing technology.
[0147] "Mental state" refers to the emotions and psychological responses exhibited by participants during meetings or discussions, and is analyzed using emotion recognition technology.
[0148] "Records" refer to documents or data that organize the information collected during a meeting, including the participants' mental states.
[0149] "Stakeholders" (also referred to as "relevant parties") refers to individuals or organizations with whom the contents of the meeting should be shared, and typically includes meeting participants and those responsible for them.
[0150] A "server" refers to a computer system that processes collected audio information and generates and distributes text information and other important information.
[0151] This invention relates to a system for efficiently processing audio information from meetings, extracting important information, and analyzing the mental state of participants. Specific embodiments are described below.
[0152] At the start of a meeting, the user configures the device's built-in microphone and begins collecting audio information. The device captures audio information using noise cancellation and transmits the data to the server in real time. The communication is protected by the SSL/TLS protocol.
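The disclosure states only that the communication is protected by SSL/TLS. As one possible illustration on the terminal side, assuming a Python client and the standard `ssl` module (both assumptions, as is the helper name `make_client_tls_context`), a context with certificate and host name verification could be prepared as follows.

```python
import ssl


def make_client_tls_context():
    """Return a TLS context with certificate and host name verification enabled."""
    context = ssl.create_default_context()  # defaults to verifying the server
    # Assumption: protocol versions older than TLS 1.2 are refused.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_tls_context()
```

The context would then be passed to whichever streaming client the terminal uses; that client is outside the scope of this sketch.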
[0153] The server converts the received audio information into text using a speech recognition API (for example, a general-purpose speech recognition service). This converted text information is stored in a database. Subsequently, the server extracts important information from the text information using natural language processing techniques (for example, a general-purpose natural language processing library). This process organizes the main agenda items and decisions made at the meeting.
[0154] Furthermore, the server uses an emotion recognition engine (for example, a widely used emotion analysis service) to analyze the participants' mental state from the audio information. The results of this analysis are reflected in the record along with the text information, and the final meeting minutes are generated.
[0155] Users can review these meeting minutes via their devices and correct errors or add information. After this review and correction is complete, the server distributes the final version of the record to the relevant parties.
[0156] For example, if a user says "I'd like to discuss the project's progress" during a monthly meeting, and the mental state analysis detects "tension," this information will be recorded in the meeting minutes as "Discussion on progress (tension)." This kind of information visualization allows for a deeper understanding of the meeting's content.
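The annotation in this example can be sketched as a small helper. The function name `annotate_with_state` is hypothetical; the strings are the ones quoted in the paragraph above.

```python
def annotate_with_state(summary, mental_state=None):
    """Append the detected mental state in parentheses to a minutes entry."""
    if not mental_state:
        return summary  # no state detected: record the summary unchanged
    return f"{summary} ({mental_state})"


entry = annotate_with_state("Discussion on progress", "tension")
print(entry)  # Discussion on progress (tension)
```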
[0157] An example of a prompt message used with a generative AI model would be: "Analyze the audio and text data and generate meeting minutes highlighting key topics and emotions."
[0158] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0159] Step 1:
[0160] The user uses the device to configure the built-in microphone and begin collecting audio information. The device enables noise cancellation to properly record the collected audio information. At this stage, the input is audio data of the meeting participants' speech, and the output is clear audio data with the noise removed.
[0161] Step 2:
[0162] The terminal transmits collected audio data to the server in real time. The input is clear audio data, which is encrypted and securely transferred to the server. The output is secure audio data stored on the server. SSL/TLS is used to protect the communication during this process.
[0163] Step 3:
[0164] The server converts received audio data into text using a speech recognition API. The input is audio data stored on the server, which is analyzed using speech recognition technology and converted into text. The output is clearly recorded text information, which is stored in a database.
[0165] Step 4:
[0166] The server uses natural language processing techniques to extract important information from textual data. At this stage, textual data is used as input. An information extraction algorithm is executed to identify the meeting's topics and decisions. The output is added to a database in an organized form. Specifically, noun phrase extraction and topic modeling are performed.
[0167] Step 5:
[0168] The server uses an emotion recognition engine to analyze mental states from audio data. The input is the original audio data, and emotion analysis techniques are used to identify the emotional state of each speaker. The output is textual information with the emotional state added. For example, characteristics such as whether a statement is "calm" or "nervous" are evaluated.
[0169] Step 6:
[0170] The user reviews the generated meeting minutes on their terminal and corrects any errors as needed. At this stage, the input is the meeting minutes data sent from the server. The user checks for errors and adds notes to supplement the record. The output is the final version of the meeting minutes after corrections have been made.
[0171] Step 7:
[0172] The server distributes the final, revised meeting minutes to the relevant parties. The input is the final version of the meeting minutes after user review, and it is distributed to relevant parties via email or other means. The output is the meeting minutes data securely delivered to all participants. As a result, the meeting content is shared with high transparency.
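Step 4 names noun phrase extraction and topic modeling without specifying an algorithm. The sketch below is a much simpler stand-in, plain word-frequency counting, shown only to illustrate how topic candidates might be pulled from a transcript. The stop-word list, the function name `top_keywords`, and the sample transcript are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "we", "will",
              "is", "for", "on", "in", "by", "be"}


def top_keywords(transcript, n=3):
    """Return the n most frequent non-stop-words as rough topic candidates."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


# Made-up transcript for illustration only.
transcript = (
    "We will review the budget. The budget for the project is fixed. "
    "The project schedule depends on the budget."
)
keywords = top_keywords(transcript, n=2)
print(keywords)  # ['budget', 'project']
```

Frequency counting ignores phrase boundaries and word meaning, so it would not replace the noun phrase extraction or topic modeling described in Step 4.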
[0173] (Application Example 2)
[0174] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0175] In brick-and-mortar stores, accurately understanding customer needs and emotions is difficult, which can result in a decline in service quality. Furthermore, it is challenging for staff to instantly grasp customer emotions and adjust their responses accordingly, making improving customer satisfaction a significant challenge.
[0176] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0177] In this invention, the server includes means for receiving audio information acquired during a conversation and converting it into text information, means for extracting important content from the text information, and means for analyzing the emotional state from the audio information during the conversation. This makes it possible to understand the customer's needs and provide the most appropriate response based on their emotions in real time during interactions with customers.
[0178] "Conversation" is the act of multiple audio information senders and receivers exchanging audio information with each other.
[0179] "Audio information" refers to data that represents the words and sounds spoken by a person in digital or analog format.
[0180] "Textual information" refers to data expressed using the alphabet, kanji, kana, etc., based on audio information.
[0181] "Important content" refers to specific matters or topics that have particular significance and are extracted from the conversation.
[0182] "Format" refers to a framework for organizing information based on certain standards and formats.
[0183] A "record" is a collection of data that is saved so that information can be reused or referenced later.
[0184] "Emotional state" refers to psychological characteristics related to the emotions of the sender or receiver inferred from audio information.
[0185] A "physical store" is a facility that provides goods and services in a physical location and conducts face-to-face transactions with customers.
[0186] This invention is a system aimed at improving customer service in physical stores. Based on the cooperative relationship between the server, terminal, and user, this system is realized through the following functions performed by each element.
[0187] First, conversations within the physical store are captured as audio information through the microphone built into the device. Smartphones and tablets are suitable devices for this purpose. This audio information is sent to a server in real time, and the server uses a speech recognition API (for example, Google Cloud Speech-to-Text) to convert the audio information into text. The text information is then stored in a database.
[0188] Next, the server uses natural language processing techniques to extract important information from the text. Natural language processing libraries such as spaCy and NLTK are useful in this process. The server also uses emotion recognition APIs (e.g., IBM Watson® Tone Analyzer) to analyze the emotional state of the participants in the conversation and incorporates the results into the recording.
[0189] Ultimately, the user (in this case, a store employee) can respond immediately to customer needs and emotions based on the analysis results displayed on the terminal. This improves the quality of customer service and leads to increased customer satisfaction.
[0190] As a concrete scenario, for example, if a customer is very interested in a new product, their level of excitement can be detected through sentiment analysis. Based on this information, the salesperson can immediately take appropriate measures, such as providing more detailed information about the new product.
[0191] An example of a prompt message used with the generative AI model might be: "Please explain in detail the product the customer is interested in. Also, identify any concerns the customer may have and suggest solutions." This prompt message gives store staff guidance for serving the customer appropriately and smoothly.
[0192] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0193] Step 1:
[0194] The device acquires conversation audio. Using the microphone built into the device, it captures conversation audio between customers and store employees in a physical store in real time. The input is audio information, and this information is stored in digital format.
[0195] Step 2:
[0196] The server receives audio information and converts it into text information using a speech recognition API. The input data is digital audio information sent from the terminal, and the output data is text information obtained by converting the audio information into text format. Speech recognition technologies such as Google Cloud Speech-to-Text are used.
[0197] Step 3:
[0198] The server uses natural language processing techniques to extract important information from textual data. The input data is textual data obtained from speech recognition, and the output data is information about key topics and needs extracted from the conversation. Natural language processing libraries (e.g., spaCy or NLTK) are used for this process.
[0199] Step 4:
[0200] The server analyzes emotional states using an emotion recognition API. Input data consists of textual information, including conversation content, while output data represents the emotional states of both the customer and the staff. This analysis utilizes emotion recognition technologies such as IBM Watson Tone Analyzer.
[0201] Step 5:
[0202] The server generates a record based on important content and emotional state, and delivers it to the terminal. The input data is the analysis results, including important content and emotional state, and the output data is a record that organizes them. The terminal uses this to display the analysis results to the user.
[0203] Step 6:
[0204] The user reviews the analysis data displayed on the terminal and uses it to improve customer service. Based on the analysis results, the user adjusts the customer service approach and improves service quality. In this process, store employees refer to prompts for the generative AI model to provide the best possible response to customer needs.
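One way the analysis results from Steps 3 and 4 might be attached to the prompt quoted in paragraph [0191] is sketched below. The function `build_staff_prompt` and the layout of the context line are hypothetical; only the prompt text itself comes from the disclosure.

```python
BASE_PROMPT = (
    "Please explain in detail the product the customer is interested in. "
    "Also, identify any concerns the customer may have and suggest solutions."
)


def build_staff_prompt(topics, customer_emotion):
    """Prefix the quoted prompt with the analysis results (hypothetical layout)."""
    context = (
        f"Topics: {', '.join(topics)}. "
        f"Customer emotional state: {customer_emotion}."
    )
    return f"{context}\n{BASE_PROMPT}"


prompt = build_staff_prompt(["new product"], "excitement")
```

How the resulting prompt is sent to a generative AI model depends on the service used and is not shown here.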
[0205] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0206] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0207] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0208] [Second Embodiment]
[0209] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0210] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0211] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0212] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0213] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0214] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0215] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0216] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0217] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0218] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0219] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0220] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0221] This invention is a system that automatically converts meeting audio data into meeting minutes, and efficiently creates meeting minutes using speech recognition technology and natural language processing technology. The following describes specific embodiments of this system.
[0222] First, the user configures the microphone connected to their device at the start of the meeting and launches a dedicated application to begin collecting audio data. When the user presses the "Start Recording" button, the device is configured to send an audio stream to the server. The server receives this audio stream and records it in a database in real time.
[0223] Next, the server periodically acquires the audio data and converts it into text data using a speech recognition API. This text data is stored in the server's text storage and managed for subsequent processing. For example, if a user says "Let's move on to the next agenda item" during a meeting, speech recognition generates the text data "Let's move on to the next agenda item".
[0224] Next, the server uses natural language processing technology to extract important information from the text data. Specifically, it analyzes key elements of the meeting, such as agenda items, decisions made, and assigned personnel, and extracts the necessary information. This process yields information that captures the core of the meeting.
[0225] Based on this important information, the server organizes the information according to the specified format and automatically generates meeting minutes. The format is pre-configured by the user and may consist of sections such as "Agenda: Content," "Decisions: Content," and "Next Steps: Content."
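A minimal sketch of arranging extracted information according to a user-configured format is shown below. The three section names come from the paragraph above; the function `render_minutes`, the dictionary layout, and the sample items are assumptions made for illustration.

```python
# Section order as configured by the user; the three names come from the text.
DEFAULT_FORMAT = ("Agenda", "Decisions", "Next Steps")


def render_minutes(extracted, sections=DEFAULT_FORMAT):
    """Render extracted items as 'Section: content' lines in the configured order."""
    lines = []
    for section in sections:
        for item in extracted.get(section, []):
            lines.append(f"{section}: {item}")
    return "\n".join(lines)


# Made-up extraction result for illustration only.
extracted = {
    "Decisions": ["Adopt plan B"],
    "Agenda": ["Budget review"],
    "Next Steps": ["Send the estimate by Friday"],
}
text = render_minutes(extracted)
```

Because the section order is taken from the configured format rather than from the extraction result, a user who changes the format does not need any change on the extraction side.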
[0226] Finally, users review the meeting minutes generated on their devices and make minor revisions as needed. After revisions, the server sends the completed minutes to all stakeholders, enabling efficient information sharing and confirmation. This system significantly reduces the time and effort spent by meeting participants on creating minutes, providing an environment where they can focus on the discussion.
[0227] The following describes the processing flow.
[0228] Step 1:
[0229] The user launches a dedicated application on their device, configures the microphone, and presses the "Start Recording" button. The device continuously captures the audio of the meeting and streams the audio data to the server in real time.
[0230] Step 2:
[0231] The server receives the audio stream sent from the terminal and records it in the database. The audio data is scheduled to be retrieved at regular intervals for subsequent text conversion.
[0232] Step 3:
[0233] The server sends the recorded audio data to a speech recognition API, which converts it into corresponding text data. This text data is then stored on the server as a transcript of the meeting.
[0234] Step 4:
[0235] The server applies natural language processing techniques to the obtained text data to extract important information. Specifically, it uses keyword extraction and semantic analysis to detect key elements such as agenda items, decisions, and responsible parties.
[0236] Step 5:
[0237] The server organizes the extracted key information according to a predefined format and generates the initial meeting minutes. This format is a standard meeting minutes format and can be customized to meet user needs.
[0238] Step 6:
[0239] Users review the meeting minutes generated on their devices and make necessary corrections based on the displayed information. These corrections typically involve minor adjustments to wording or the addition of crucial information.
[0240] Step 7:
[0241] After revisions, the server saves the finalized meeting minutes as the final version and automatically distributes them to meeting participants and designated stakeholders via email or intranet. The generated minutes can also be accessed for later viewing.
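Step 4 mentions keyword extraction and semantic analysis without naming a method. As a deliberately simple stand-in, the sketch below sorts transcript sentences into agenda items, decisions, and responsible parties by matching cue phrases. The cue lists, the function `classify_sentences`, and the sample transcript are all hypothetical; the disclosure does not specify such rules.

```python
import re

# Hypothetical cue phrases; a real system would use a trained model or richer rules.
CUES = {
    "Agenda": ("agenda", "discuss"),
    "Decisions": ("decided", "agreed"),
    "Responsible parties": ("in charge", "responsible for"),
}


def classify_sentences(transcript):
    """Assign each sentence to the first category whose cue phrase it contains."""
    result = {name: [] for name in CUES}
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        lowered = sentence.lower()
        for name, cues in CUES.items():
            if any(cue in lowered for cue in cues):
                result[name].append(sentence)
                break  # one category per sentence
    return result


# Made-up transcript for illustration only.
transcript = (
    "Let's move on to the next agenda item. We decided to extend the deadline. "
    "Tanaka is in charge of the schedule. Thank you all."
)
key_info = classify_sentences(transcript)
```

Sentences that match no cue, such as the closing remark in the sample, are simply left out of the result.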
[0242] (Example 1)
[0243] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0244] In modern workplaces, accurately and quickly recording meeting content and sharing information among stakeholders is crucial. However, manually creating meeting minutes is time-consuming, labor-intensive, and prone to human error. Furthermore, effectively extracting and organizing important information is challenging. Therefore, to address these issues, there is a need for a system that automatically converts audio data into text data, quickly extracts key elements, and generates well-organized reports.
[0245] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0246] In this invention, the server includes means for setting up an audio input device and collecting an audio signal at the start of a meeting, means for transmitting the audio signal to a central processing unit, and means for converting the audio signal into text information using speech recognition technology in the central processing unit. This makes it possible to automatically record the contents of a meeting and to quickly and accurately extract important information.
[0247] An "audio input device" is a device used to accurately acquire audio signals; it converts speech during a meeting into electrical signals.
[0248] A "central processing unit" refers to a core computing mechanism for processing received audio signals, and is a computer system that performs audio data analysis and text conversion.
[0249] "Speech recognition technology" is a technology that generates text data from speech signals, and includes algorithms that automatically convert human speech into text information.
[0250] "Text information" refers to string data that represents the content of speech, generated using speech recognition technology, and serves as a visual record of the meeting's proceedings.
[0251] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text information, and is a language processing system used to identify important elements and patterns within documents.
[0252] A "report" is a document that is automatically formatted and organized based on text information generated using the audio input device and the central processing unit, and functions as a formal record containing the contents and decisions of a meeting.
[0253] This invention is a system that automatically converts meeting audio data into meeting minutes, utilizing speech recognition technology and natural language processing technology. The user first connects and configures an audio input device, such as a microphone, to the terminal. A dedicated application runs on the terminal, allowing for real-time collection of audio signals during the meeting.
[0254] Once an audio signal is collected, the terminal sends the audio data to a central processing unit, specifically a server. The server receives this data and uses speech recognition technology to convert the audio signal into text information. Common cloud-based services (such as Google Cloud Speech-to-Text API) are used as speech recognition APIs in this process.
[0255] Next, the server applies natural language processing techniques to analyze this text information and extract important elements, such as agenda items and decisions. Suitable software for this purpose includes natural language processing libraries such as spaCy and NLTK.
[0256] Once key information is extracted, the server automatically generates a report according to the user's specified format. The report organizes the information in a format such as "Agenda: Content" and "Decisions: Content." The user can then review this report and, if necessary, use the generative AI model to make further revisions to the wording.
[0257] The generated report is distributed to all relevant parties via a server. This distribution is done via email or cloud storage service (e.g., Google Drive or Dropbox). This significantly reduces the time meeting participants spend creating meeting minutes, allowing them to focus on the discussion itself.
[0258] For example, if a user says "We will discuss setting goals for the next project" during a meeting, the server can incorporate this into the report as "Agenda: Setting goals for the next project." An example of a prompt for the generative AI model is "Create a report from the meeting audio data, including the agenda and decisions made."
[0259] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0260] Step 1:
[0261] The user connects an audio input device to their terminal and launches a dedicated application. This application provides an interface for collecting audio signals at the start of a meeting. The input is the audio from the meeting, and the output is recorded on the terminal as an audio signal. Specifically, the collection of audio data begins when the user presses the "Start Recording" button.
[0262] Step 2:
[0263] The terminal sends the collected audio signal to the server. Here, the audio signal recorded within the terminal is used as input, and the audio data is sent to the server in real time as output. Transmission is performed using a streaming protocol, and the data is received by the server.
[0264] Step 3:
[0265] The server converts the received audio signal into text information by inputting it into a speech recognition API. The input is the audio data received on the server, and the output is the parsed text data. Specifically, for example, the Google Cloud Speech-to-Text API is used to convert the audio content into text.
[0266] Step 4:
[0267] The server uses natural language processing techniques to extract important elements from text information. The input is text data obtained through speech recognition, and the output is important information such as agenda items and decisions. This process utilizes natural language processing libraries (e.g., spaCy and NLTK).
[0268] Step 5:
[0269] The server generates a report based on the extracted key information according to the user's specified format. The input here is the extracted key information, and the output is a formatted report. The server organizes the text information according to the specified format and compiles it into a report.
[0270] Step 6:
[0271] The user reviews the generated report on their device and, as needed, corrects the wording using the generative AI model. The input is the text data of the generated report, and the output is the corrected report. The user can use a correction prompt to refine the report into more natural-sounding wording.
[0272] Step 7:
[0273] The server distributes the final report to all stakeholders. The input is the revised meeting minutes, and the output is the final report delivered to the recipients. Email and cloud storage services are used for distribution, ensuring efficient information sharing.
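As an illustration of Step 2 above, the following Python sketch splits recorded audio into small fixed-duration chunks of the kind a terminal might hand to a streaming transport. The function name, the 16 kHz 16-bit mono PCM format, and the 100 ms chunk size are assumptions made for illustration; the disclosure does not specify them, and the transport itself is omitted.

```python
from typing import Iterator


def iter_audio_chunks(pcm: bytes, sample_rate: int = 16000,
                      sample_width: int = 2, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw mono PCM audio for streaming."""
    # Bytes per chunk = samples per second * bytes per sample * chunk duration.
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for the given sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence stands in for microphone input (assumed format).
one_second = bytes(16000 * 2)
chunks = list(iter_audio_chunks(one_second))
```

Each yielded chunk could then be passed to whatever streaming protocol the terminal uses. The last chunk may be shorter than the others when the recording length is not a multiple of the chunk size.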
[0274] (Application Example 1)
[0275] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0276] Traditional meetings required significant time and effort to manually record and transcribe audio data and create meeting minutes. Furthermore, real-time information sharing was often impossible, leading to considerable time spent on post-meeting review. To address these issues and enable more efficient and accurate information processing, an autonomous system is needed.
[0277] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0278] In this invention, the server includes means for receiving audio data acquired during a meeting, means for converting the audio data into text data, means for extracting core information from the text data, means for organizing the core information according to a specified format and generating a report, and means for autonomously taking in audio data and providing information immediately. This makes it possible to process information in real time during the meeting and to provide a report immediately after the meeting ends.
[0279] "Audio data" refers to digital or analog signals generated by the human voice.
[0280] "Character data" refers to digital characters converted from speech and in a readable and writable format.
[0281] "Core information" refers to information that is semantically important among data and is related to the themes and decisions of meetings and conversations.
[0282] "Format" refers to the rules and styles that define the structure, arrangement, and expression method of data and information.
[0283] "Report" refers to a document that organizes relevant information regarding specific events or activities and is written with conclusions and proposals.
[0284] "Autonomously" means to act or make judgments independently without the need for human or external control.
[0285] In the system for implementing this invention, voice data is acquired, converted into character data in real time, important information is extracted, and a report is generated. To achieve this, the following hardware and software are used.
[0286] First, the terminal uses a high-sensitivity microphone to acquire the voice of the meeting. This microphone is for clearly capturing the voice of the speaker in the meeting environment. The voice data is transmitted from the terminal to the server through the Internet.
[0287] The server converts the received voice data into character data using the Google Cloud Speech-to-Text API. Through this conversion, text information is generated from the voice information. Subsequently, the server analyzes the generated character data using natural language processing libraries such as spaCy and NLTK, and extracts the core information.
[0288] The server organizes the extracted core information based on a user-defined format and generates a report. This report is structured to clearly present important information, such as agenda items and decisions, according to the specified format. Finally, the generated report is automatically sent to relevant parties. This makes it easy for meeting participants to review the information afterward and ensures that important decisions made during the meeting are properly managed.
[0289] As a concrete example, imagine a family meeting held at home where travel plans, shopping lists, and assigned household chores are discussed. In this scenario, reports related to these topics are generated in real time from the audio data of the meeting. Once the reports are ready, the system shares them with the relevant parties and provides guidance for the next steps.
[0290] An example of a prompt message would be, "Based on the content of the previous meeting, please create meeting minutes that include the following keywords: family trip, shopping list, assigned household chores." This would allow the system to efficiently extract and organize information that meets the user's needs.
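The prompt above can be assembled from a user-supplied keyword list. The following is a minimal Python sketch under that assumption; the function name `build_minutes_prompt` is hypothetical and the wording simply mirrors the example prompt.

```python
def build_minutes_prompt(keywords: list[str]) -> str:
    """Build the example minutes prompt from a list of keywords."""
    # Drop empty or whitespace-only entries before joining.
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("at least one keyword is required")
    return (
        "Based on the content of the previous meeting, please create meeting "
        "minutes that include the following keywords: " + ", ".join(cleaned) + "."
    )


prompt = build_minutes_prompt(
    ["family trip", "shopping list", "assigned household chores"]
)
```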
[0291] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0292] Step 1:
[0293] The user acquires audio data using a microphone connected to the device. Here, the input is what is said during the meeting. The device converts this into digital audio data and transmits it to the server via the internet.
[0294] Step 2:
[0295] The server converts the received audio data into text data using the Google Cloud Speech-to-Text API. The input is audio data, which is analyzed and decoded via the API, and the output is text data.
[0296] Step 3:
[0297] The server analyzes text data using natural language processing libraries such as spaCy and NLTK. The input here is character data. The server extracts keywords and structural information from the text and generates important core information as output.
[0298] Step 4:
[0299] The server organizes the extracted core information based on the format set by the user and generates a report. The input is core information, which is structured according to formatting rules, and the output is a formatted report.
[0300] Step 5:
[0301] The server sends the generated report to the relevant parties. The input is the completed report, which is sent to the relevant parties as output via email or cloud service.
[0302] In this way, the server, terminal, and user elements work together to efficiently process meeting content and provide valuable information to participants.
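The five steps above can be sketched as a pipeline whose stages are injected as functions. This is only an illustrative Python skeleton: the class `MinutesPipeline` and the stub stages are hypothetical, and the stubs stand in for the speech recognition API, the natural language processing libraries, and the delivery channel, none of which are called here.

```python
from dataclasses import dataclass
from typing import Callable

CoreInfo = dict[str, list[str]]


@dataclass
class MinutesPipeline:
    transcribe: Callable[[bytes], str]   # Step 2: audio data -> text data
    extract: Callable[[str], CoreInfo]   # Step 3: text data -> core information
    render: Callable[[CoreInfo], str]    # Step 4: core information -> report
    deliver: Callable[[str], None]       # Step 5: report -> relevant parties

    def run(self, audio: bytes) -> str:
        text = self.transcribe(audio)
        core = self.extract(text)
        report = self.render(core)
        self.deliver(report)
        return report


def fake_transcribe(audio: bytes) -> str:
    # Stub: a real system would call a speech recognition service here.
    return "Agenda: summer trip. Decision: book the hotel by Friday."


def fake_extract(text: str) -> CoreInfo:
    # Stub: sorts sentences by an explicit "Agenda:" / "Decision:" prefix.
    core: CoreInfo = {"agenda": [], "decisions": []}
    for sentence in text.split("."):
        sentence = sentence.strip()
        if sentence.lower().startswith("agenda:"):
            core["agenda"].append(sentence.split(":", 1)[1].strip())
        elif sentence.lower().startswith("decision:"):
            core["decisions"].append(sentence.split(":", 1)[1].strip())
    return core


def fake_render(core: CoreInfo) -> str:
    lines: list[str] = []
    for section, items in core.items():
        lines.append(section.capitalize())
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


sent: list[str] = []  # stands in for email or cloud delivery
pipeline = MinutesPipeline(fake_transcribe, fake_extract, fake_render, sent.append)
report = pipeline.run(b"\x00\x00")
```

Keeping each stage behind a plain callable makes it straightforward to swap a stub for a real service without changing the flow.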
[0303] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0304] This invention relates to a system that automatically converts meeting audio data into meeting minutes and further incorporates participant emotions into the minutes. This system improves the quality and efficiency of meeting minute creation by combining speech recognition technology, natural language processing technology, and emotion recognition technology. A specific embodiment of this system is described below.
[0305] First, at the start of the meeting, the user configures voice data collection using the microphone built into the terminal and starts recording. The terminal captures the voice throughout the meeting and sends it to the server.
[0306] The server receives the voice data in real time and records it in the database. This voice data is converted into character data by the speech recognition API and saved in the database as text. The converted text is not just a transcription but serves as material for in-depth analysis.
[0307] Next, the server uses natural language processing technology to extract important information from the text data. As a result, the main topics, decisions, next steps, etc. in the meeting are efficiently organized. This information is automatically arranged based on the format pre-set by the user, and the first meeting minutes are generated.
[0308] Furthermore, the server uses an emotion recognition engine to analyze the emotions of the participants from the voice data. Through this analysis, the emotional state of each speaker (e.g., excitement, calmness, dissatisfaction) is detected and recorded. As a specific example, if the emotion analysis detects that the user is excited during the discussion, this information is appended to the meeting minutes, making it possible to see the emotional state in which important statements were made.
[0309] Finally, the user checks the meeting minutes generated on the terminal, reviews the content including the emotion data, and makes adjustments as necessary. The server distributes the confirmed and corrected meeting minutes as the final version to each participant. As a result, the creation of meeting minutes is streamlined, and the added emotion-based supplementary information lets the minutes convey the atmosphere of the meeting more vividly.
[0310] The following explains the processing flow.
[0311] Step 1:
[0312] The user launches a dedicated application on their device and sets up the microphone to record the meeting. Pressing the "Start Recording" button initiates audio collection. The device continuously transmits this audio data to the server in real time.
[0313] Step 2:
[0314] The server receives the audio stream sent from the terminal and records the audio data in the database. This recorded audio data is managed for subsequent speech recognition processing.
[0315] Step 3:
[0316] The server sends the audio data to a speech recognition API, which converts it into text data. This text data is stored in a database as a record of the meeting content. For example, if a user says, "Let's discuss the new project," that is saved as text data.
[0317] Step 4:
[0318] The server applies natural language processing techniques to extract important information (e.g., agenda items, opinions, decisions) from the text data. This information becomes the main content of the meeting and is used later to generate meeting minutes.
[0319] Step 5:
[0320] The server uses an emotion recognition engine to analyze the speaker's emotions from the audio data. The analysis results are recorded in a database as the emotional state for each statement (e.g., joy, surprise, criticism).
[0321] Step 6:
[0322] The server combines extracted key information with sentiment analysis results to generate meeting minutes in a format pre-configured by the user. This format includes elements such as "agenda," "speaker's sentiment," and "decisions made."
[0323] Step 7:
[0324] The user reviews the meeting minutes generated on their device, examining the overall content, including sentiment data. They correct or add information as needed. For example, they might correct misinterpretations of sentiment.
[0325] Step 8:
[0326] The server finalizes the revised meeting minutes and automatically sends them to meeting participants and relevant parties. Since the minutes include emotional information about important statements, participants can reconfirm the atmosphere of the meeting.
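Steps 5 and 6 above, in which per-statement emotion results are combined with the extracted information, might look like the following Python sketch. The `Statement` structure, the `kind` labels, and the index-based lookup of emotions are assumptions for illustration; the disclosure does not describe how statements and emotion results are keyed.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    speaker: str
    text: str
    kind: str          # "agenda", "decision", or "other" (hypothetical labels)
    emotion: str = ""  # filled in from the emotion analysis step


def attach_emotions(statements: list[Statement],
                    emotions: dict[int, str]) -> list[Statement]:
    """Set the emotion of each statement by index; missing entries stay blank."""
    for index, statement in enumerate(statements):
        statement.emotion = emotions.get(index, "")
    return statements


def render_minutes(statements: list[Statement]) -> str:
    sections = {"agenda": "Agenda", "decision": "Decisions made"}
    lines: list[str] = []
    for kind, title in sections.items():
        lines.append(title)
        for s in statements:
            if s.kind == kind:
                mood = f" (speaker's sentiment: {s.emotion})" if s.emotion else ""
                lines.append(f"- {s.speaker}: {s.text}{mood}")
    return "\n".join(lines)


statements = [
    Statement("A", "Let's discuss the new project", "agenda"),
    Statement("B", "We will start in April", "decision"),
]
minutes = render_minutes(attach_emotions(statements, {0: "joy"}))
```

A correction made by the user in Step 7 would then amount to editing the `emotion` field of a statement and rendering the minutes again.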
[0327] (Example 2)
[0328] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0329] In recording and analyzing audio data from meetings, it is essential to efficiently and accurately extract and record important information. Furthermore, understanding the participants' mental states and reflecting them in the recording is necessary to gain a more comprehensive understanding of the meeting's content. Traditional methods often limit themselves to simple recording and conversion of audio, resulting in insufficient accuracy and efficiency in extracting important information and analyzing emotions.
[0330] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0331] In this invention, the server includes means for receiving audio information, means for converting audio information into text information, means for extracting important information from the text information, means for analyzing mental states and reflecting them in the record, and means for transmitting the generated record to the relevant parties. This enables efficient processing of audio data as well as the generation of detailed records that take into account the mental states of the participants.
[0332] "Audio information" refers to sound data acquired during meetings and conversations, and is typically analyzed using speech recognition technology.
[0333] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is a format designed to make information easier to visualize.
[0334] "Important information" refers to data from textual information that is deemed particularly significant in meetings and discussions, and is primarily extracted using natural language processing technology.
[0335] "Mental state" refers to the emotions and psychological responses exhibited by participants during meetings or discussions, and is analyzed using emotion recognition technology.
[0336] "Records" refer to documents or data that organize the information collected during a meeting, including the participants' mental states.
[0337] "Stakeholders" refers to individuals or organizations with whom the contents of the meeting should be shared, and typically includes meeting participants and those responsible for them.
[0338] A "server" refers to a computer system that processes collected audio information and generates and distributes text information and other important information.
[0339] This invention relates to a system for efficiently processing audio information from meetings, extracting important information, and analyzing the mental state of participants. Specific embodiments are described below.
[0340] At the start of a meeting, the user configures the built-in microphone using their device and begins collecting audio information. The device captures audio information using noise cancellation and transmits the data to the server in real time. The communication is protected by the SSL/TLS protocol.
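As a rough illustration of how a terminal-side client could protect this communication, the Python sketch below prepares a TLS context with the standard `ssl` module. The helper name `make_client_tls_context` and the choice of TLS 1.2 as the minimum version are assumptions; the disclosure only states that SSL/TLS is used.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Create a client-side TLS context with certificate verification enabled."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumed policy: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_tls_context()
```

The context would then be used to wrap the socket or passed to the HTTP or WebSocket client that carries the audio stream.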
[0341] The server converts the received audio information into text using a speech recognition API (for example, a general-purpose speech recognition service). This converted text information is stored in a database. Subsequently, the server extracts important information from the text information using natural language processing techniques (for example, a general-purpose natural language processing library). This process organizes the main agenda items and decisions made at the meeting.
[0342] Furthermore, the server uses an emotion recognition engine (for example, a widely used emotion analysis service) to analyze the participants' mental state from the audio information. The results of this analysis are reflected in the record along with the text information, and the final meeting minutes are generated.
[0343] Users can review these meeting minutes via their devices and correct errors or add information. After this review and correction is complete, the server distributes the final version of the record to the relevant parties.
[0344] For example, if a user says "I'd like to discuss the project's progress" during a monthly meeting, and the mental state analysis detects "tension," this information will be recorded in the meeting minutes as "Discussion on progress (tension)." This kind of information visualization allows for a deeper understanding of the meeting's content.
[0345] An example of a prompt message used with a generative AI model would be: "Analyze the audio and text data and generate meeting minutes highlighting key topics and emotions."
[0346] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0347] Step 1:
[0348] The user uses the device to configure the built-in microphone and begin collecting audio information. The device enables noise cancellation to properly record the collected audio information. At this stage, the input is audio data of the meeting participants' speech, and the output is clear audio data with the noise removed.
[0349] Step 2:
[0350] The terminal transmits the collected audio data to the server in real time. The input is clear audio data, which is encrypted and securely transferred to the server. The output is secure audio data stored on the server. SSL/TLS is used to protect the communication during this process.
[0351] Step 3:
[0352] The server converts received audio data into text using a speech recognition API. The input is audio data stored on the server, which is analyzed using speech recognition technology and converted into text. The output is clearly recorded text information, which is stored in a database.
[0353] Step 4:
[0354] The server uses natural language processing techniques to extract important information from textual data. At this stage, textual data is used as input. An information extraction algorithm is executed to identify the meeting's topics and decisions. The output is added to a database in an organized form. Specifically, noun phrase extraction and topic modeling are performed.
[0355] Step 5:
[0356] The server uses an emotion recognition engine to analyze mental states from audio data. The input is the original audio data, and emotion analysis techniques are used to identify the emotional state of each speaker. The output is textual information with the emotional state added. For example, characteristics such as whether a statement is "calm" or "nervous" are evaluated.
[0357] Step 6:
[0358] The user reviews the generated meeting minutes on their terminal and corrects any errors as needed. At this stage, the input is the meeting minutes data sent from the server. The user checks for errors and adds notes to supplement the record. The output is the final version of the meeting minutes after corrections have been made.
[0359] Step 7:
[0360] The server distributes the final, revised meeting minutes to the relevant parties. The input is the final version of the meeting minutes after user review, and it is distributed to relevant parties via email or other means. The output is the meeting minutes data securely delivered to all participants. As a result, the meeting content is shared with high transparency.
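Step 4 above names noun phrase extraction and topic modeling. The Python sketch below does not implement either; it substitutes a much simpler stand-in, counting frequent non-stopword terms with the standard library, only to show the shape of the input and output. The stopword list and the function name `top_terms` are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (assumption, not from the disclosure).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "we", "on",
             "for", "in", "will", "be", "this", "that"}


def top_terms(text: str, limit: int = 3) -> list[str]:
    """Return the most frequent non-stopword terms as candidate topics."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


transcript = ("The budget review is on Friday. We will finalize the budget "
              "and the schedule. The schedule depends on the budget.")
terms = top_terms(transcript, limit=2)
```

In a fuller implementation, a natural language processing library would replace the frequency count with proper noun phrase chunking and a topic model.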
[0361] (Application Example 2)
[0362] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0363] In brick-and-mortar stores, accurately understanding customer needs and emotions is difficult, which can result in a decline in service quality. Furthermore, it is challenging for staff to instantly grasp customer emotions and adjust their responses accordingly, making improving customer satisfaction a significant challenge.
[0364] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0365] In this invention, the server includes means for receiving audio information acquired during a conversation and converting it into text information, means for extracting important content from the text information, and means for analyzing the emotional state from the audio information during the conversation. This makes it possible to understand the customer's needs and provide the most appropriate response based on their emotions in real time during interactions with customers.
[0366] "Conversation" is the act of multiple audio information senders and receivers exchanging audio information with each other.
[0367] "Audio information" refers to data that represents the words and sounds spoken by a person in digital or analog format.
[0368] "Textual information" refers to data expressed using the alphabet, kanji, kana, etc., based on audio information.
[0369] "Important content" refers to specific matters or topics that have particular significance and are extracted from the conversation.
[0370] "Format" refers to a framework for organizing information based on certain standards and formats.
[0371] A "record" is a collection of data that is saved so that information can be reused or referenced later.
[0372] "Emotional state" refers to psychological characteristics related to the emotions of the sender or receiver inferred from audio information.
[0373] A "physical store" is a facility that provides goods and services in a physical location and conducts face-to-face transactions with customers.
[0374] This invention is a system aimed at improving customer service in physical stores. Based on the cooperative relationship between the server, terminal, and user, this system is realized through the following functions performed by each element.
[0375] First, conversations within the physical store are captured as audio information through the microphone built into the device. Smartphones and tablets are suitable devices for this purpose. This audio information is sent to a server in real time, and the server uses a speech recognition API (for example, Google Cloud Speech-to-Text) to convert the audio information into text. The text information is then stored in a database.
[0376] Next, the server uses natural language processing techniques to extract important information from the text. Natural language processing libraries such as spaCy and NLTK are useful in this process. The server also uses emotion recognition APIs (e.g., IBM Watson Tone Analyzer) to analyze the emotional state of the participants during the conversation and incorporates the results into the recording.
[0377] Ultimately, the user (in this case, a store employee) can respond immediately to customer needs and emotions based on the analysis results displayed on the terminal. This improves the quality of customer service and leads to increased customer satisfaction.
[0378] As a concrete scenario, for example, if a customer is very interested in a new product, their level of excitement can be detected through sentiment analysis. Based on this information, the salesperson can immediately take appropriate measures, such as providing more detailed information about the new product.
[0379] An example of a prompt message used with a generative AI model might be: "Please explain in detail the product the customer is interested in. Also, identify any concerns the customer may have and suggest solutions." This prompt message gives store staff guidance for providing appropriate service smoothly.
[0380] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0381] Step 1:
[0382] The device acquires conversation audio. Using the microphone built into the device, it captures conversation audio between customers and store employees in a physical store in real time. The input is audio information, and this information is stored in digital format.
[0383] Step 2:
[0384] The server receives audio information and converts it into text information using a speech recognition API. The input data is digital audio information sent from the terminal, and the output data is text information obtained by converting the audio information into text format. Speech recognition technologies such as Google Cloud Speech-to-Text are used.
[0385] Step 3:
[0386] The server uses natural language processing techniques to extract important information from textual data. The input data is textual data obtained from speech recognition, and the output data is information about key topics and needs extracted from the conversation. Natural language processing libraries (e.g., spaCy or NLTK) are used for this process.
[0387] Step 4:
[0388] The server analyzes emotional states using an emotion recognition API. Input data consists of textual information, including conversation content, while output data represents the emotional states of both the customer and the staff. This analysis utilizes emotion recognition technologies such as IBM Watson Tone Analyzer.
[0389] Step 5:
[0390] The server generates a record based on important content and emotional state, and delivers it to the terminal. The input data is the analysis results, including important content and emotional state, and the output data is a record that organizes them. The terminal uses this to display the analysis results to the user.
[0391] Step 6:
[0392] The user reviews the analysis data presented on the terminal and uses it to improve customer service. Based on the analysis results, the user adjusts the customer service approach to improve service quality. In this process, store employees refer to the prompts for the generative AI model to respond as well as possible to customer needs.
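The record that the server delivers to the terminal in Step 5 could be serialized as JSON. The Python sketch below is one possible shape under that assumption; the field names and the `ConversationRecord` class are hypothetical and do not come from the disclosure.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ConversationRecord:
    important_content: list[str]
    # Emotional state keyed by role, e.g. "customer" or "staff" (assumed keys).
    emotional_state: dict[str, str] = field(default_factory=dict)


def to_terminal_payload(record: ConversationRecord) -> str:
    """Serialize the record as JSON text for display on the terminal."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


record = ConversationRecord(
    important_content=["interested in the new product"],
    emotional_state={"customer": "excited", "staff": "calm"},
)
payload = to_terminal_payload(record)
```

Setting `ensure_ascii=False` keeps kana and kanji readable in the payload, which matters if the conversation text is Japanese.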
[0393] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0394] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0395] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0396] [Third Embodiment]
[0397] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0398] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0399] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0400] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0401] The microphone 238 receives instructions and other input from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0402] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0403] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0404] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0405] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0406] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0407] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0408] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0409] This invention is a system that automatically converts meeting audio data into meeting minutes, and efficiently creates meeting minutes using speech recognition technology and natural language processing technology. The following describes specific embodiments of this system.
[0410] First, the user configures the microphone connected to their device at the start of the meeting and launches a dedicated application to begin collecting audio data. When the user presses the "Start Recording" button, the device is configured to send an audio stream to the server. The server receives this audio stream and records it in a database in real time.
[0411] Next, the server periodically acquires the audio data and converts it into text data using a speech recognition API. This text data is stored in the server's text storage and managed for subsequent processing. For example, if a user says "Let's move on to the next agenda item" during a meeting, speech recognition generates the text data "Let's move on to the next agenda item".
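The periodic acquisition described above can be pictured as a buffer that accumulates streamed chunks and is emptied at each interval. The following Python sketch assumes an in-memory buffer; the class `AudioBuffer` is hypothetical, and the scheduler that would call `drain()` at regular intervals is omitted.

```python
class AudioBuffer:
    """Accumulates streamed audio chunks and hands them out in batches."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []

    def append(self, chunk: bytes) -> None:
        """Store one chunk received from the audio stream."""
        self._chunks.append(chunk)

    def drain(self) -> bytes:
        """Return everything received since the last drain and clear the buffer."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


buffer = AudioBuffer()
buffer.append(b"ab")
buffer.append(b"cd")
first_batch = buffer.drain()   # would be sent to speech recognition
buffer.append(b"ef")
second_batch = buffer.drain()
```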
[0412] Next, the server uses natural language processing technology to extract important information from the text data. Specifically, it analyzes key elements of the meeting, such as agenda items, decisions made, and assigned personnel, and extracts the necessary information. This process yields information that captures the core of the meeting.
[0413] Based on this important information, the server organizes the information according to the specified format and automatically generates meeting minutes. The format is pre-configured by the user and may consist of sections such as "Agenda: Content," "Decisions: Content," and "Next Steps: Content."
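A format of the kind described above can be represented as an ordered list of section names. The Python sketch below renders extracted information into "Section: Content" lines; the function name `format_minutes`, the "(none)" placeholder for empty sections, and the semicolon separator are assumptions for illustration.

```python
# Section names taken from the example format in the text.
DEFAULT_FORMAT: tuple[str, ...] = ("Agenda", "Decisions", "Next Steps")


def format_minutes(info: dict[str, list[str]],
                   sections: tuple[str, ...] = DEFAULT_FORMAT) -> str:
    """Render extracted information as one 'Section: Content' line per section."""
    lines = []
    for section in sections:
        # Sections without extracted items get a placeholder (assumed behavior).
        content = "; ".join(info.get(section, [])) or "(none)"
        lines.append(f"{section}: {content}")
    return "\n".join(lines)


info = {"Agenda": ["Q3 roadmap"], "Decisions": ["Ship in October"]}
minutes = format_minutes(info)
```

A user-defined format would then simply be a different tuple of section names passed as `sections`.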
[0414] Finally, users review the generated meeting minutes on their devices and make minor revisions as needed. After the revisions, the server sends the completed minutes to all stakeholders, enabling efficient information sharing and confirmation. This system significantly reduces the time and effort meeting participants spend on creating minutes, providing an environment where they can focus on the discussion.
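As a minimal sketch of the format-driven generation described above, the following Python fragment renders extracted items into user-configured sections. The function name `render_minutes`, the dictionary layout, the sample decision text, and the "(none)" placeholder are assumptions made for illustration; the disclosure does not specify a data structure.

```python
from typing import Dict, List

# Section order as pre-configured by the user (labels taken from the text above).
DEFAULT_FORMAT: List[str] = ["Agenda", "Decisions", "Next Steps"]


def render_minutes(extracted: Dict[str, List[str]],
                   sections: List[str] = DEFAULT_FORMAT) -> str:
    """Render extracted items as 'Section: Content' lines in the configured order."""
    lines = []
    for section in sections:
        items = extracted.get(section, [])
        # Hypothetical choice: mark empty sections explicitly instead of dropping them.
        content = "; ".join(items) if items else "(none)"
        lines.append(f"{section}: {content}")
    return "\n".join(lines)


# Illustrative input only; the items would come from the extraction step.
minutes = render_minutes({
    "Agenda": ["Setting goals for the next project"],
    "Decisions": ["Adopt quarterly milestones"],
})
print(minutes)
```

Keeping the section list separate from the rendering logic is one way to let the user change the format without touching the extraction step.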
[0415] The following describes the processing flow.
[0416] Step 1:
[0417] The user launches a dedicated application on their device, configures the microphone, and presses the "Start Recording" button. The device continuously captures the audio of the meeting and streams the audio data to the server in real time.
[0418] Step 2:
[0419] The server receives the audio stream sent from the terminal and records it in the database. The audio data is scheduled to be retrieved at regular intervals for subsequent text conversion.
[0420] Step 3:
[0421] The server sends the recorded audio data to a speech recognition API, which converts it into corresponding text data. This text data is then stored on the server as a transcript of the meeting.
[0422] Step 4:
[0423] The server applies natural language processing techniques to the obtained text data to extract important information. Specifically, it uses keyword extraction and semantic analysis to detect key elements such as agenda items, decisions, and responsible parties.
[0424] Step 5:
[0425] The server organizes the extracted key information according to a predefined format and generates the initial meeting minutes. This format is a standard meeting minutes format and can be customized to meet user needs.
[0426] Step 6:
[0427] Users review the meeting minutes generated on their devices and make necessary corrections based on the displayed information. These corrections typically involve minor adjustments to wording or the addition of crucial information.
[0428] Step 7:
[0429] After the revisions, the server saves the revised meeting minutes as the final version and automatically distributes them to meeting participants and designated stakeholders via email or intranet. The minutes also remain accessible for later viewing.
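Steps 2 and 3 above (recording the stream, then retrieving it at regular intervals for conversion) can be sketched as follows. This is only an illustration under stated assumptions: `AudioBuffer`, `transcribe_pending`, and `fake_recognize` are hypothetical names, an in-memory list stands in for the database, and the recognizer is a stub where a real system would call a speech recognition API.

```python
from typing import Callable, List


class AudioBuffer:
    """Accumulates streamed audio bytes until the next scheduled retrieval."""

    def __init__(self) -> None:
        self._chunks: List[bytes] = []

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def drain(self) -> bytes:
        """Return everything received since the last retrieval and clear the buffer."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


def transcribe_pending(buffer: AudioBuffer,
                       recognize: Callable[[bytes], str],
                       transcript: List[str]) -> None:
    """One scheduled pass: fetch the recorded audio, convert it, store the text."""
    audio = buffer.drain()
    if audio:  # skip the pass if nothing arrived during the interval
        transcript.append(recognize(audio))


# Stand-in recognizer (assumption): replace with a call to a speech recognition API.
def fake_recognize(audio: bytes) -> str:
    return f"<{len(audio)} bytes of speech>"


buffer = AudioBuffer()
transcript: List[str] = []
buffer.append(b"\x00\x01")
buffer.append(b"\x02")
transcribe_pending(buffer, fake_recognize, transcript)
transcribe_pending(buffer, fake_recognize, transcript)  # nothing new, so no entry is added
print(transcript)
```

In a deployment, `transcribe_pending` would be invoked by a timer or scheduler; the interval itself is not specified in the text.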
[0430] (Example 1)
[0431] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0432] In modern workplaces, accurately and quickly recording meeting content and sharing information among stakeholders is crucial. However, manually creating meeting minutes is time-consuming, labor-intensive, and prone to human error. Furthermore, effectively extracting and organizing important information is challenging. Therefore, to address these issues, there is a need for a system that automatically converts audio data into text data, quickly extracts key elements, and generates well-organized reports.
[0433] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0434] In this invention, the server includes means for setting up an audio input device and collecting an audio signal at the start of a meeting, means for transmitting the audio signal to a central processing unit, and means for converting the audio signal into text information using speech recognition technology in the central processing unit. This makes it possible to automatically record the contents of a meeting and to quickly and accurately extract important information.
[0435] An "audio input device" is a device used to accurately acquire audio signals; it converts speech during a meeting into electrical signals.
[0436] A "central processing unit" refers to a core computing mechanism for processing received audio signals, and is a computer system that performs audio data analysis and text conversion.
[0437] "Speech recognition technology" is a technology that generates text data from speech signals, and includes algorithms that automatically convert human speech into text information.
[0438] "Text information" refers to string data that represents the content of speech, generated using speech recognition technology, and serves as a visual record of the meeting's proceedings.
[0439] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text information, and is a language processing system used to identify important elements and patterns within documents.
[0440] A "report" is a document that is automatically formatted and organized based on text information generated using an audio input device and a central processing unit, and functions as a formal record containing the contents and decisions of a meeting.
[0441] This invention is a system that automatically converts meeting audio data into meeting minutes, utilizing speech recognition technology and natural language processing technology. The user first connects and configures an audio input device, such as a microphone, to the terminal. A dedicated application runs on the terminal, allowing for real-time collection of audio signals during the meeting.
[0442] Once an audio signal is collected, the terminal sends the audio data to a central processing unit, specifically a server. The server receives this data and uses speech recognition technology to convert the audio signal into text information. Common cloud-based services (such as the Google Cloud Speech-to-Text API) are used as speech recognition APIs in this process.
[0443] Next, the server applies natural language processing techniques to analyze this text information and extract important elements, such as agenda items and decisions. Suitable software for this purpose includes natural language processing libraries such as spaCy and NLTK.
[0444] Once key information is extracted, the server automatically generates a report according to the user's specified format. The report organizes the information in a format such as "Agenda: Content" and "Decisions: Content." The user can then review this report and, if necessary, use the generative AI model to make further revisions to the wording.
[0445] The generated report is distributed to all relevant parties by the server. This distribution is done via email or a cloud storage service (e.g., Google Drive or Dropbox). This significantly reduces the time meeting participants spend creating meeting minutes, allowing them to focus on the discussion itself.
[0446] For example, if a user says "We will discuss setting goals for the next project" during a meeting, the server can incorporate this into the report as "Agenda: Setting goals for the next project." An example of a prompt for the generative AI model is "Create a report from the meeting audio data, including the agenda and decisions made."
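The agenda example above can be reproduced with a deliberately simple rule. The sketch below uses a regular-expression cue phrase instead of the natural language processing libraries named in the text, so it is a simplified stand-in for illustration only; the cue phrase "we will discuss" and the function name `to_agenda_line` are assumptions.

```python
import re
from typing import Optional

# Hypothetical cue phrase that introduces an agenda item.
AGENDA_CUE = re.compile(r"^we will discuss\s+(?P<topic>.+?)[.!?]?$", re.IGNORECASE)


def to_agenda_line(utterance: str) -> Optional[str]:
    """Return 'Agenda: <Topic>' if the utterance matches the cue, else None."""
    match = AGENDA_CUE.match(utterance.strip())
    if match is None:
        return None
    topic = match.group("topic")
    # Capitalize only the first character so the rest of the topic is left as spoken.
    return f"Agenda: {topic[0].upper()}{topic[1:]}"


print(to_agenda_line("We will discuss setting goals for the next project"))
```

A single cue phrase will miss most real utterances; the point is only to show the mapping from a recognized sentence to a formatted report line.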
[0447] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0448] Step 1:
[0449] The user connects an audio input device to their terminal and launches a dedicated application. This application provides an interface for collecting audio signals at the start of a meeting. The input is the audio from the meeting, and the output is recorded on the terminal as an audio signal. Specifically, the collection of audio data begins when the user presses the "Start Recording" button.
[0450] Step 2:
[0451] The terminal sends the collected audio signal to the server. Here, the audio signal recorded within the terminal is used as input, and the audio data is sent to the server in real time as output. Transmission is performed using a streaming protocol, and the data is received by the server.
[0452] Step 3:
[0453] The server converts the received audio signal into text information by inputting it into a speech recognition API. The input is the audio data received on the server, and the output is the parsed text data. Specifically, for example, the Google Cloud Speech-to-Text API is used to convert the audio content into text.
[0454] Step 4:
[0455] The server uses natural language processing techniques to extract important elements from the text information. The input is text data obtained through speech recognition, and the output is important information such as agenda items and decisions. This process utilizes natural language processing libraries (e.g., spaCy and NLTK).
[0456] Step 5:
[0457] The server generates a report based on the extracted key information according to the user's specified format. The input here is the extracted key information, and the output is a formatted report. The server organizes the text information according to the specified format and compiles it into a report.
[0458] Step 6:
[0459] The user reviews the generated report on their device and corrects the wording as needed using the generative AI model. The input is the text data of the generated report, and the output is the corrected report. The user can use a correction prompt to refine the report into more natural-sounding wording.
[0460] Step 7:
[0461] The server distributes the final report to all stakeholders. The input is the revised meeting minutes, and the output is the final report delivered to the recipients. Email and cloud storage services are used for distribution, ensuring efficient information sharing.
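For the email path in Step 7, a minimal sketch using Python's standard `email` package is shown below. The function name `build_report_email`, the default subject, and the example addresses are hypothetical, and the actual sending step is intentionally left out.

```python
from email.message import EmailMessage
from typing import List


def build_report_email(report_text: str, sender: str, recipients: List[str],
                       subject: str = "Meeting report") -> EmailMessage:
    """Package the final report as an email message addressed to all stakeholders."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = subject
    message.set_content(report_text)
    return message


# Placeholder addresses for illustration only.
report_email = build_report_email(
    "Agenda: Setting goals for the next project",
    sender="minutes@example.com",
    recipients=["alice@example.com", "bob@example.com"],
)
print(report_email["To"])
```

The resulting message could then be handed to an SMTP client such as the standard library's `smtplib`; distribution through a cloud storage service would need that service's own client and is not sketched here.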
[0462] (Application Example 1)
[0463] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0464] Traditional meetings required significant time and effort to manually record and transcribe audio data and create meeting minutes. Furthermore, real-time information sharing was often impossible, leading to considerable time spent on post-meeting review. To address these issues and enable more efficient and accurate information processing, an autonomous system is needed.
[0465] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0466] In this invention, the server includes means for receiving audio data acquired during a meeting, means for converting the audio data into text data, means for extracting core information from the text data, means for organizing the core information according to a specified format and generating a report, and means for autonomously taking in audio data and providing information immediately. This makes it possible to process information in real time during the meeting and to provide a report immediately after the meeting ends.
[0467] "Audio data" refers to digital or analog signals generated by the human voice.
[0468] "Text data" refers to digital character data converted from speech into a readable and writable format.
[0469] "Core information" refers to information within the data that is semantically important and relevant to the topic or decision of a meeting or discussion.
[0470] "Format" refers to the rules and styles that define the structure, arrangement, and presentation of data and information.
[0471] A "report" is a document that organizes relevant information about a specific event or activity, and includes conclusions and recommendations.
[0472] "Autonomously" refers to acting or making decisions independently, without the need for human or other external control.
[0473] The system for carrying out this invention acquires audio data, converts it to text data in real time, extracts important information, and generates a report. To achieve this, the following hardware and software are used.
[0474] First, the terminal uses a high-sensitivity microphone to capture the audio of the meeting. This microphone is designed to clearly capture the speaker's voice in the meeting environment. The audio data is then transmitted from the terminal to the server via the internet.
[0475] The server uses the Google Speech-to-Text API to quickly convert received audio data into text data. This digital conversion process allows for the generation of text information from audio information. Subsequently, the server analyzes the generated text data using natural language processing libraries such as spaCy and NLTK. This analysis extracts the core information.
[0476] The server organizes the extracted core information based on a user-defined format and generates a report. This report is structured to clearly present important information, such as agenda items and decisions, according to the specified format. Finally, the generated report is automatically sent to relevant parties. This makes it easy for meeting participants to review the information afterward and ensures that important decisions made during the meeting are properly managed.
[0477] As a concrete example, imagine a family meeting held at home where travel plans, shopping lists, and assigned household chores are discussed. In this scenario, reports related to these topics are generated in real time from the audio data of the meeting. Once the reports are ready, the system shares them with the relevant parties and provides guidance for the next steps.
[0478] An example of a prompt message would be, "Based on the content of the previous meeting, please create meeting minutes that include the following keywords: family trip, shopping list, assigned household chores." This would allow the system to efficiently extract and organize information that meets the user's needs.
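The prompt above can be assembled from a keyword list. The helper below is a small sketch; the function name `build_minutes_prompt` and the decision to reject an empty keyword list are assumptions, while the prompt wording follows the example in the text.

```python
from typing import List


def build_minutes_prompt(keywords: List[str]) -> str:
    """Compose a minutes-generation prompt that names the keywords to cover."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    return ("Based on the content of the previous meeting, please create meeting "
            "minutes that include the following keywords: "
            + ", ".join(keywords) + ".")


prompt = build_minutes_prompt(["family trip", "shopping list", "assigned household chores"])
print(prompt)
```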
[0479] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0480] Step 1:
[0481] The user acquires audio data using a microphone connected to the device. Here, the input is what is said during the meeting. The device converts this into digital audio data and transmits it to the server via the internet.
[0482] Step 2:
[0483] The server converts the received audio data into text data using the Google Speech-to-Text API. The input is audio data, which is analyzed and decoded via the API, and the output is text data.
[0484] Step 3:
[0485] The server analyzes the text data using natural language processing libraries such as spaCy and NLTK. The input here is the text data. The server extracts keywords and structural information from the text and generates the core information as output.
[0486] Step 4:
[0487] The server organizes the extracted core information based on the format set by the user and generates a report. The input is core information, which is structured according to formatting rules, and the output is a formatted report.
[0488] Step 5:
[0489] The server sends the generated report to the relevant parties. The input is the completed report, which is sent to the relevant parties as output via email or cloud service.
[0490] In this way, the server, terminal, and user elements work together to efficiently process meeting content and provide valuable information to participants.
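The keyword extraction in Step 3 can be approximated, for illustration, by simple frequency counting with the standard library. This is a stand-in for the spaCy or NLTK analysis named in the text, not a reproduction of it; the stop-word list, the function name `top_keywords`, and the sample sentence are assumptions.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (assumption); real libraries ship much larger ones.
STOP_WORDS = {"the", "a", "an", "we", "to", "and", "of", "for", "is", "will", "on", "in"}


def top_keywords(text: str, limit: int = 3) -> List[str]:
    """Return up to `limit` of the most frequent words that are not stop words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(limit)]


sample = ("We will plan the trip. The trip budget is set, "
          "and the shopping list for the trip is ready.")
keywords = top_keywords(sample)
print(keywords)
```

Frequency counting ignores phrases such as "shopping list"; phrase-level extraction is where a full natural language processing library would be needed.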
[0491] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0492] This invention relates to a system that automatically converts meeting audio data into meeting minutes and further incorporates participant emotions into the minutes. This system improves the quality and efficiency of meeting minute creation by combining speech recognition technology, natural language processing technology, and emotion recognition technology. A specific embodiment of this system is described below.
[0493] First, the user configures the device's built-in microphone to collect audio data at the start of the meeting and begins recording. The device captures audio throughout the meeting and sends it to the server.
[0494] The server receives audio data in real time and records it in a database. This audio data is converted into text data by a speech recognition API and stored in the database as text. The converted text is not merely a transcription, but is used as material for in-depth analysis.
[0495] Next, the server uses natural language processing technology to extract key information from the text data. This efficiently organizes the main agenda items, decisions, and next steps from the meeting. This information is automatically arranged according to a format pre-configured by the user, and the initial meeting minutes are generated.
[0496] Furthermore, the server uses an emotion recognition engine to analyze participants' emotions from the audio data. This analysis detects and appropriately records each speaker's emotional state (e.g., excitement, calmness, dissatisfaction). For example, if the emotion analysis detects that a user is excited during a discussion, this information is added to the meeting minutes, making it clear under what emotional circumstances important statements were made.
[0497] Finally, users review the generated meeting minutes, including the emotion data, on their devices and make adjustments as needed. The server then distributes the reviewed and revised minutes as the final version to each participant. This streamlines the creation of meeting minutes while adding emotion-based supplementary information that conveys more of the context in which statements were made.
[0498] The following describes the processing flow.
[0499] Step 1:
[0500] The user launches a dedicated application on their device and sets up the microphone to record the meeting. Pressing the "Start Recording" button initiates audio collection. The device continuously transmits this audio data to the server in real time.
[0501] Step 2:
[0502] The server receives the audio stream sent from the terminal and records the audio data in the database. This recorded audio data is managed for subsequent speech recognition processing.
[0503] Step 3:
[0504] The server sends the audio data to a speech recognition API, which converts it into text data. This text data is stored in a database as a record of the meeting content. For example, if a user says, "Let's discuss the new project," that is saved as text data.
[0505] Step 4:
[0506] The server applies natural language processing techniques to extract important information (e.g., agenda items, opinions, decisions) from the text data. This information becomes the main content of the meeting and is used later to generate meeting minutes.
[0507] Step 5:
[0508] The server uses an emotion recognition engine to analyze the speaker's emotions from the audio data. The analysis results are recorded in a database as the emotional state for each statement (e.g., joy, surprise, criticism).
[0509] Step 6:
[0510] The server combines extracted key information with sentiment analysis results to generate meeting minutes in a format pre-configured by the user. This format includes elements such as "agenda," "speaker's sentiment," and "decisions made."
[0511] Step 7:
[0512] The user reviews the meeting minutes generated on their device, examining the overall content, including sentiment data. They correct or add information as needed. For example, they might correct misinterpretations of sentiment.
[0513] Step 8:
[0514] The server finalizes the revised meeting minutes and automatically sends them to meeting participants and relevant parties. Since the minutes include emotional information about important statements, participants can reconfirm the atmosphere of the meeting.
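Steps 5 to 7 above (per-statement emotion results, combination with the extracted information, and user correction of misinterpreted emotions) can be sketched as follows. The `Statement` class, the index-based lookup, the "kind" labels, and the sample decision text are assumptions for illustration; the text does not specify how statements and emotion results are keyed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Statement:
    speaker: str
    text: str
    kind: str                      # e.g. "agenda" or "decision" (hypothetical labels)
    emotion: Optional[str] = None  # filled in from the emotion analysis results


def attach_emotions(statements: List[Statement],
                    emotions: Dict[int, str],
                    corrections: Optional[Dict[int, str]] = None) -> None:
    """Attach per-statement emotion labels; user corrections take precedence."""
    merged = {**emotions, **(corrections or {})}
    for index, statement in enumerate(statements):
        statement.emotion = merged.get(index)


def render(statements: List[Statement]) -> str:
    """Render each statement with its emotion label, if one is available."""
    lines = []
    for s in statements:
        mood = f" (speaker's sentiment: {s.emotion})" if s.emotion else ""
        lines.append(f"{s.kind.capitalize()}: {s.text}{mood}")
    return "\n".join(lines)


statements = [
    Statement("A", "Let's discuss the new project", "agenda"),
    Statement("B", "Start the pilot in May", "decision"),  # illustrative text
]
# The engine reported "surprise" for statement 1; the user corrects it to "calm".
attach_emotions(statements, {0: "joy", 1: "surprise"}, corrections={1: "calm"})
minutes_text = render(statements)
print(minutes_text)
```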
[0515] (Example 2)
[0516] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0517] In recording and analyzing audio data from meetings, it is essential to efficiently and accurately extract and record important information. Furthermore, understanding the participants' mental states and reflecting them in the recording is necessary to gain a more comprehensive understanding of the meeting's content. Traditional methods often limit themselves to simple recording and conversion of audio, resulting in insufficient accuracy and efficiency in extracting important information and analyzing emotions.
[0518] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0519] In this invention, the server includes means for receiving audio information, means for converting audio information into text information, means for extracting important information from the text information, means for analyzing mental states and reflecting them in the record, and means for transmitting the generated record to the relevant parties. This enables efficient processing of audio data as well as the generation of detailed records that take into account the mental states of the participants.
[0520] "Audio information" refers to sound data acquired during meetings and conversations, and is typically analyzed using speech recognition technology.
[0521] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is a format designed to make information easier to visualize.
[0522] "Important information" refers to data from textual information that is deemed particularly significant in meetings and discussions, and is primarily extracted using natural language processing technology.
[0523] "Mental state" refers to the emotions and psychological responses exhibited by participants during meetings or discussions, and is analyzed using emotion recognition technology.
[0524] "Records" refer to documents or data that organize the information collected during a meeting, including the participants' mental states.
[0525] "Stakeholders" refers to the individuals or organizations with whom the contents of the meeting should be shared, and typically includes meeting participants and those responsible for them.
[0526] A "server" refers to a computer system that processes collected audio information and generates and distributes text information and other important information.
[0527] This invention relates to a system for efficiently processing audio information from meetings, extracting important information, and analyzing the mental state of participants. Specific embodiments are described below.
[0528] At the start of a meeting, the user configures the built-in microphone using their device and begins collecting audio information. The device captures audio information using noise cancellation and transmits the data to the server in real time. The communication is protected by the SSL/TLS protocol.
[0529] The server converts the received audio information into text using a speech recognition API (for example, a general-purpose speech recognition service). This converted text information is stored in a database. Subsequently, the server extracts important information from the text information using natural language processing techniques (for example, a general-purpose natural language processing library). This process organizes the main agenda items and decisions made at the meeting.
[0530] Furthermore, the server uses an emotion recognition engine (for example, a widely used emotion analysis service) to analyze the participants' mental state from the audio information. The results of this analysis are reflected in the record along with the text information, and the final meeting minutes are generated.
[0531] Users can review these meeting minutes via their devices and correct errors or add information. After this review and correction is complete, the server distributes the final version of the record to the relevant parties.
[0532] For example, if a user says "I'd like to discuss the project's progress" during a monthly meeting, and the mental state analysis detects "tension," this information will be recorded in the meeting minutes as "Discussion on progress (tension)." This kind of information visualization allows for a deeper understanding of the meeting's content.
[0533] An example of a prompt message used with a generative AI model would be: "Analyze the audio and text data and generate meeting minutes highlighting key topics and emotions."
[0534] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0535] Step 1:
[0536] The user uses the device to configure the built-in microphone and begin collecting audio information. The device enables noise cancellation to properly record the collected audio information. At this stage, the input is audio data of the meeting participants' speech, and the output is clear audio data with the noise removed.
[0537] Step 2:
[0538] The terminal transmits the collected audio data to the server in real time. The input is clear audio data, which is encrypted and securely transferred to the server. The output is secure audio data stored on the server. SSL/TLS is used to protect the communication during this process.
[0539] Step 3:
[0540] The server converts received audio data into text using a speech recognition API. The input is audio data stored on the server, which is analyzed using speech recognition technology and converted into text. The output is clearly recorded text information, which is stored in a database.
[0541] Step 4:
[0542] The server uses natural language processing techniques to extract important information from textual data. At this stage, textual data is used as input. An information extraction algorithm is executed to identify the meeting's topics and decisions. The output is added to a database in an organized form. Specifically, noun phrase extraction and topic modeling are performed.
[0543] Step 5:
[0544] The server uses an emotion recognition engine to analyze mental states from audio data. The input is the original audio data, and emotion analysis techniques are used to identify the emotional state of each speaker. The output is textual information with the emotional state added. For example, characteristics such as whether a statement is "calm" or "nervous" are evaluated.
[0545] Step 6:
[0546] The user reviews the generated meeting minutes on their terminal and corrects any errors as needed. At this stage, the input is the meeting minutes data sent from the server. The user checks for errors and adds notes to supplement the record. The output is the final version of the meeting minutes after corrections have been made.
[0547] Step 7:
[0548] The server distributes the final, revised meeting minutes to the relevant parties. The input is the final version of the meeting minutes after user review, and it is distributed to relevant parties via email or other means. The output is the meeting minutes data securely delivered to all participants. As a result, the meeting content is shared with high transparency.
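The protected transmission in Step 2 can be illustrated on the terminal side with Python's standard `ssl` module. This is a sketch under stated assumptions: the function name `make_client_tls_context` is hypothetical, and the TLS 1.2 minimum is an illustrative policy choice that the text does not specify.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Create a client-side TLS context with certificate and hostname checks enabled."""
    # create_default_context() enables certificate verification and hostname checking.
    context = ssl.create_default_context()
    # Assumption: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


tls_context = make_client_tls_context()
print(tls_context.verify_mode == ssl.CERT_REQUIRED, tls_context.check_hostname)
```

The terminal would then wrap its connection to the server with `tls_context.wrap_socket(sock, server_hostname=...)` before streaming audio; the server name and port are deployment details not given in the text.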
[0549] (Application Example 2)
[0550] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0551] In brick-and-mortar stores, accurately understanding customer needs and emotions is difficult, which can result in a decline in service quality. Furthermore, it is challenging for staff to instantly grasp customer emotions and adjust their responses accordingly, making improving customer satisfaction a significant challenge.
[0552] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0553] In this invention, the server includes means for receiving audio information acquired during a conversation and converting it into text information, means for extracting important content from the text information, and means for analyzing the emotional state from the audio information during the conversation. This makes it possible to understand the customer's needs and provide the most appropriate response based on their emotions in real time during interactions with customers.
[0554] "Conversation" is the act of multiple audio information senders and receivers exchanging audio information with each other.
[0555] "Audio information" refers to data that represents the words and sounds spoken by a person in digital or analog format.
[0556] "Textual information" refers to data expressed using the alphabet, kanji, kana, etc., based on audio information.
[0557] "Important content" refers to specific matters or topics that have particular significance and are extracted from the conversation.
[0558] "Format" refers to a framework for organizing information based on certain standards and formats.
[0559] A "record" is a collection of data that is saved so that information can be reused or referenced later.
[0560] "Emotional state" refers to psychological characteristics related to the emotions of the sender or receiver inferred from audio information.
[0561] A "physical store" is a facility that provides goods and services in a physical location and conducts face-to-face transactions with customers.
[0562] This invention is a system aimed at improving customer service in physical stores. Based on the cooperative relationship between the server, terminal, and user, this system is realized through the following functions performed by each element.
[0563] First, conversations within the physical store are captured as audio information through the microphone built into the device. Smartphones and tablets are suitable devices for this purpose. This audio information is sent to a server in real time, and the server uses a speech recognition API (for example, Google Cloud Speech-to-Text) to convert the audio information into text. The text information is then stored in a database.
[0564] Next, the server uses natural language processing techniques to extract important information from the text. Natural language processing libraries such as spaCy and NLTK are useful in this process. The server also uses emotion recognition APIs (e.g., IBM Watson Tone Analyzer) to analyze the emotional state of the participants during the conversation and incorporates the results into the record.
[0565] Ultimately, the user (in this case, a store employee) can respond immediately to customer needs and emotions based on the analysis results displayed on the terminal. This improves the quality of customer service and leads to increased customer satisfaction.
[0566] As a concrete scenario, for example, if a customer is very interested in a new product, their level of excitement can be detected through sentiment analysis. Based on this information, the salesperson can immediately take appropriate measures, such as providing more detailed information about the new product.
[0567] An example of a prompt message for the generative AI model might be: "Please explain in detail the product the customer is interested in. Also, identify any concerns the customer may have and suggest solutions." This prompt message gives store staff guidance for providing appropriate service smoothly.
[0568] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0569] Step 1:
[0570] The device acquires conversation audio. Using the microphone built into the device, it captures conversation audio between customers and store employees in a physical store in real time. The input is audio information, and this information is stored in digital format.
[0571] Step 2:
[0572] The server receives audio information and converts it into text information using a speech recognition API. The input data is digital audio information sent from the terminal, and the output data is text information obtained by converting the audio information into text format. Speech recognition technologies such as Google Cloud Speech-to-Text are used.
[0573] Step 3:
[0574] The server uses natural language processing techniques to extract important information from textual data. The input data is textual data obtained from speech recognition, and the output data is information about key topics and needs extracted from the conversation. Natural language processing libraries (e.g., spaCy or NLTK) are used for this process.
[0575] Step 4:
[0576] The server analyzes emotional states using an emotion recognition API. Input data consists of textual information, including conversation content, while output data represents the emotional states of both the customer and the staff. This analysis utilizes emotion recognition technologies such as IBM Watson Tone Analyzer.
[0577] Step 5:
[0578] The server generates a record based on important content and emotional state, and delivers it to the terminal. The input data is the analysis results, including important content and emotional state, and the output data is a record that organizes them. The terminal uses this to display the analysis results to the user.
[0579] Step 6:
[0580] The user reviews the analysis data presented on the terminal and uses it to improve customer service. Based on the analysis results, the user adjusts the customer service approach and improves service quality. In this process, store employees refer to the prompt messages generated using the AI model to respond as well as possible to customer needs.
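As a non-limiting illustration, the way a record combining topics and emotional states (Step 5) might be turned into guidance for store staff (Step 6) can be sketched as follows. The class AnalysisRecord, the GUIDANCE table, and the function suggest_actions are hypothetical names introduced only for this sketch, and the two rules are assumptions modeled on the scenario and prompt message described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnalysisRecord:
    """Record delivered to the terminal: key topics plus emotional states."""
    topics: List[str]
    customer_emotion: str
    staff_emotion: str


# Hypothetical guidance templates keyed by the customer's detected emotion.
GUIDANCE: Dict[str, str] = {
    "excitement": "Provide more detailed information about {topic}.",
    "dissatisfaction": "Identify the customer's concerns about {topic} and suggest solutions.",
}


def suggest_actions(record: AnalysisRecord) -> List[str]:
    """Return one guidance line per topic, or nothing if no rule applies."""
    template = GUIDANCE.get(record.customer_emotion)
    if template is None:
        return []
    return [template.format(topic=topic) for topic in record.topics]
```

For example, a record with the topic "the new product" and the customer emotion "excitement" yields a single suggestion to provide more detailed information about the new product.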
[0581] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0582] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0583] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0584] [Fourth Embodiment]
[0585] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0586] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0587] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0588] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0589] The microphone 238 receives instructions and other input from the user 20 by capturing the user's voice. It converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0590] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0591] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0592] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0593] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0594] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0595] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0596] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0597] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0598] This invention is a system that automatically converts meeting audio data into meeting minutes, and efficiently creates meeting minutes using speech recognition technology and natural language processing technology. The following describes specific embodiments of this system.
[0599] First, the user configures the microphone connected to their device at the start of the meeting and launches a dedicated application to begin collecting audio data. When the user presses the "Start Recording" button, the device is configured to send an audio stream to the server. The server receives this audio stream and records it in a database in real time.
[0600] Next, the server periodically acquires the audio data and converts it into text data using a speech recognition API. This text data is stored in the server's text storage and managed for subsequent processing. For example, if a user says "Let's move on to the next agenda item" during a meeting, speech recognition generates the text data "Let's move on to the next agenda item".
[0601] Next, the server uses natural language processing technology to extract important information from the text data. Specifically, it analyzes key elements of the meeting, such as agenda items, decisions made, and assigned personnel, and extracts the necessary information. This process yields information that captures the core of the meeting.
[0602] Based on this important information, the server organizes the information according to the specified format and automatically generates meeting minutes. The format is pre-configured by the user and may consist of sections such as "Agenda: Content," "Decisions: Content," and "Next Steps: Content."
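As a non-limiting illustration, arranging extracted information according to a pre-configured format might be sketched as follows. The function name render_minutes, the DEFAULT_SECTIONS tuple, and the "(none)" placeholder for empty sections are assumptions made for this sketch and are not specified in the disclosure.

```python
from typing import Dict, List, Sequence

# Hypothetical user-configured format: section labels in display order.
DEFAULT_SECTIONS: Sequence[str] = ("Agenda", "Decisions", "Next Steps")


def render_minutes(
    extracted: Dict[str, List[str]],
    sections: Sequence[str] = DEFAULT_SECTIONS,
) -> str:
    """Render extracted items as 'Section: content' lines in the configured order."""
    lines = []
    for section in sections:
        items = extracted.get(section, [])
        # Join multiple items for a section; mark sections with no items.
        content = "; ".join(items) if items else "(none)"
        lines.append(f"{section}: {content}")
    return "\n".join(lines)
```

Keeping the section list separate from the rendering logic reflects the description that the format is configured in advance by the user; a different tuple of labels changes the layout without changing the code.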
[0603] Finally, users review the meeting minutes generated on their devices and make minor revisions as needed. After revisions, the server sends the completed minutes to all stakeholders, enabling efficient information sharing and confirmation. This system significantly reduces the time and effort spent by meeting participants on creating minutes, providing an environment where they can focus on the discussion.
[0604] The following describes the processing flow.
[0605] Step 1:
[0606] The user launches a dedicated application on their device, configures the microphone, and presses the "Start Recording" button. The device continuously captures the audio of the meeting and streams the audio data to the server in real time.
[0607] Step 2:
[0608] The server receives the audio stream sent from the terminal and records it in the database. The audio data is scheduled to be retrieved at regular intervals for subsequent text conversion.
[0609] Step 3:
[0610] The server sends the recorded audio data to a speech recognition API, which converts it into corresponding text data. This text data is then stored on the server as a transcript of the meeting.
[0611] Step 4:
[0612] The server applies natural language processing techniques to the obtained text data to extract important information. Specifically, it uses keyword extraction and semantic analysis to detect key elements such as agenda items, decisions, and responsible parties.
[0613] Step 5:
[0614] The server organizes the extracted key information according to a predefined format and generates the initial meeting minutes. This format is a standard meeting minutes format and can be customized to meet user needs.
[0615] Step 6:
[0616] Users review the meeting minutes generated on their devices and make necessary corrections based on the displayed information. These corrections typically involve minor adjustments to wording or the addition of crucial information.
[0617] Step 7:
[0618] After revisions, the server saves the finalized meeting minutes as the final version and automatically distributes them to meeting participants and designated stakeholders via email or intranet. The generated minutes can also be accessed for later viewing.
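As a non-limiting illustration, Steps 2 and 3 above, in which recorded audio is retrieved at regular intervals and passed to speech recognition, might be sketched roughly as follows. The class AudioBuffer, the batch_size parameter, and the function transcribe_pending are hypothetical; the transcribe callable is a stand-in for whatever speech recognition API the server actually calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AudioBuffer:
    """Accumulates streamed audio chunks and releases them in fixed-size batches."""
    batch_size: int  # assumed number of chunks per retrieval interval
    _chunks: List[bytes] = field(default_factory=list)

    def append(self, chunk: bytes) -> None:
        """Record one chunk received from the terminal's audio stream."""
        self._chunks.append(chunk)

    def take_batch(self) -> Optional[bytes]:
        """Return the next full batch of audio, or None if too little has arrived."""
        if len(self._chunks) < self.batch_size:
            return None
        batch = self._chunks[: self.batch_size]
        self._chunks = self._chunks[self.batch_size:]
        return b"".join(batch)


def transcribe_pending(
    buffer: AudioBuffer, transcribe: Callable[[bytes], str]
) -> List[str]:
    """Convert every full batch currently in the buffer into a text segment."""
    segments: List[str] = []
    while (batch := buffer.take_batch()) is not None:
        segments.append(transcribe(batch))
    return segments
```

Chunks that do not yet fill a batch stay in the buffer until the next retrieval, so no audio is dropped between intervals.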
[0619] (Example 1)
[0620] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0621] In modern workplaces, accurately and quickly recording meeting content and sharing information among stakeholders is crucial. However, manually creating meeting minutes is time-consuming, labor-intensive, and prone to human error. Furthermore, effectively extracting and organizing important information is challenging. Therefore, to address these issues, there is a need for a system that automatically converts audio data into text data, quickly extracts key elements, and generates well-organized reports.
[0622] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0623] In this invention, the server includes means for setting up an audio input device and collecting an audio signal at the start of a meeting, means for transmitting the audio signal to a central processing unit, and means for converting the audio signal into text information using speech recognition technology in the central processing unit. This makes it possible to automatically record the contents of a meeting and to quickly and accurately extract important information.
[0624] An "audio input device" is a device used to accurately acquire audio signals; it converts speech during a meeting into electrical signals.
[0625] A "central processing unit" refers to a core computing mechanism for processing received audio signals, and is a computer system that performs audio data analysis and text conversion.
[0626] "Speech recognition technology" is a technology that generates text data from speech signals, and includes algorithms that automatically convert human speech into text information.
[0627] "Text information" refers to string data that represents the content of speech, generated using speech recognition technology, and serves as a visual record of the meeting's proceedings.
[0628] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text information, and is a language processing system used to identify important elements and patterns within documents.
[0629] A "report" is a document that is automatically formatted and organized based on text information generated using the audio input device and the central processing unit, and functions as a formal record containing the contents and decisions of a meeting.
[0630] This invention is a system that automatically converts meeting audio data into meeting minutes, utilizing speech recognition technology and natural language processing technology. The user first connects and configures an audio input device, such as a microphone, to the terminal. A dedicated application runs on the terminal, allowing for real-time collection of audio signals during the meeting.
[0631] Once an audio signal is collected, the terminal sends the audio data to a central processing unit, specifically a server. The server receives this data and uses speech recognition technology to convert the audio signal into text information. Common cloud-based services (such as Google Cloud Speech-to-Text API) are used as speech recognition APIs in this process.
[0632] Next, the server applies natural language processing techniques to analyze this text information and extract important elements, such as agenda items and decisions. Suitable software for this purpose includes natural language processing libraries such as spaCy and NLTK.
[0633] Once key information is extracted, the server automatically generates a report according to the user's specified format. The report organizes the information in a format such as "Agenda: Content" and "Decisions: Content." The user can then review this report and, if necessary, use the generative AI model to make further revisions to the wording.
[0634] The generated report is distributed to all relevant parties via a server. This distribution is done via email or cloud storage service (e.g., Google Drive or Dropbox). This significantly reduces the time meeting participants spend creating meeting minutes, allowing them to focus on the discussion itself.
[0635] For example, if a user says "We will discuss setting goals for the next project" during a meeting, the server can incorporate this into the report as "Agenda: Setting goals for the next project." An example of a prompt for the generative AI model is "Create a report from the meeting audio data, including the agenda and decisions made."
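As a non-limiting illustration, the mapping from an utterance to a report line in the example above might be sketched with simple cue phrases. The cue patterns and the function to_report_line are hypothetical and serve only as a minimal stand-in; an actual system would more likely rely on a natural language processing library or the generative AI model rather than fixed patterns.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical cue phrases; each maps the remainder of an utterance to a section.
CUE_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("Agenda", re.compile(r"^(?:we will|let's|i'd like to) discuss (.+?)[.!]?$", re.IGNORECASE)),
    ("Decisions", re.compile(r"^we (?:have )?(?:decided|agreed) to (.+?)[.!]?$", re.IGNORECASE)),
]


def to_report_line(utterance: str) -> Optional[str]:
    """Return 'Section: Content' for a recognized utterance, else None."""
    text = utterance.strip()
    for label, pattern in CUE_PATTERNS:
        match = pattern.match(text)
        if match:
            content = match.group(1)
            # Capitalize only the first character so the rest is left unchanged.
            return f"{label}: {content[0].upper()}{content[1:]}"
    return None


print(to_report_line("We will discuss setting goals for the next project"))
# prints: Agenda: Setting goals for the next project
```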
[0636] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0637] Step 1:
[0638] The user connects an audio input device to their terminal and launches a dedicated application. This application provides an interface for collecting audio signals at the start of a meeting. The input is the audio from the meeting, and the output is recorded on the terminal as an audio signal. Specifically, the collection of audio data begins when the user presses the "Start Recording" button.
[0639] Step 2:
[0640] The terminal sends the collected audio signal to the server. Here, the audio signal recorded within the terminal is used as input, and the audio data is sent to the server in real time as output. Transmission is performed using a streaming protocol, and the data is received by the server.
[0641] Step 3:
[0642] The server converts the received audio signal into text information by inputting it into a speech recognition API. The input is the audio data received on the server, and the output is the parsed text data. Specifically, for example, the Google Cloud Speech-to-Text API is used to convert the audio content into text.
[0643] Step 4:
[0644] The server uses natural language processing techniques to extract important elements from text information. The input is text data obtained through speech recognition, and the output is important information such as agenda items and decisions. This process utilizes natural language processing libraries (e.g., spaCy and NLTK).
[0645] Step 5:
[0646] The server generates a report based on the extracted key information according to the user's specified format. The input here is the extracted key information, and the output is a formatted report. The server organizes the text information according to the specified format and compiles it into a report.
[0647] Step 6:
[0648] The user reviews the generated report on their device and, as needed, corrects the wording using the generative AI model. The input is the text data of the generated report, and the output is the corrected report. The user can use a correction prompt to refine the report into more natural-sounding wording.
[0649] Step 7:
[0650] The server distributes the final report to all stakeholders. The input is the revised meeting minutes, and the output is the final report delivered to the recipients. Email and cloud storage services are used for distribution, ensuring efficient information sharing.
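As a non-limiting illustration, preparing the final report for distribution by email (Step 7) might be sketched as follows using Python's standard email package. The function name build_report_email, the default subject, and the addresses in the usage note are assumptions made for this sketch; the disclosure does not specify a mail library or message layout.

```python
from email.message import EmailMessage
from typing import Sequence


def build_report_email(
    report: str,
    sender: str,
    recipients: Sequence[str],
    subject: str = "Meeting minutes",
) -> EmailMessage:
    """Wrap the final report text in a plain-text email addressed to all stakeholders."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(report)  # plain-text body containing the report
    return msg
```

A message built this way, for example with the sender "minutes@example.com", could then be handed to smtplib.SMTP.send_message for delivery; the mail server and credentials are deployment-specific and are omitted here.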
[0651] (Application Example 1)
[0652] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0653] Traditional meetings required significant time and effort to manually record and transcribe audio data and create meeting minutes. Furthermore, real-time information sharing was often impossible, leading to considerable time spent on post-meeting review. To address these issues and enable more efficient and accurate information processing, an autonomous system is needed.
[0654] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0655] In this invention, the server includes means for receiving audio data acquired during a meeting, means for converting the audio data into text data, means for extracting core information from the text data, means for organizing the core information according to a specified format and generating a report, and means for autonomously taking in audio data and providing information immediately. This makes it possible to process information in real time during the meeting and to provide a report immediately after the meeting ends.
[0656] "Audio data" refers to digital or analog signals generated by the human voice.
[0657] "Text data" refers to digital characters converted from speech, in a readable and writable format.
[0658] "Core information" refers to information within the data that is semantically important and relevant to the topic or decision of a meeting or discussion.
[0659] "Format" refers to the rules and styles that define the structure, arrangement, and presentation of data and information.
[0660] A "report" is a document that organizes relevant information about a specific event or activity, and includes conclusions and recommendations.
[0661] "Autonomously" means acting or making decisions independently, without the need for human or other external control.
[0662] The system for carrying out this invention acquires audio data, converts it to text data in real time, extracts important information, and generates a report. To achieve this, the following hardware and software are used.
[0663] First, the terminal uses a high-sensitivity microphone to capture the audio of the meeting. This microphone is designed to clearly capture the speaker's voice in the meeting environment. The audio data is then transmitted from the terminal to the server via the internet.
[0664] The server uses the Google Cloud Speech-to-Text API to quickly convert received audio data into text data. This digital conversion process allows for the generation of text information from audio information. Subsequently, the server analyzes the generated text data using natural language processing libraries such as spaCy and NLTK. This analysis extracts the core information.
[0665] The server organizes the extracted core information based on a user-defined format and generates a report. This report is structured to clearly present important information, such as agenda items and decisions, according to the specified format. Finally, the generated report is automatically sent to relevant parties. This makes it easy for meeting participants to review the information afterward and ensures that important decisions made during the meeting are properly managed.
[0666] As a concrete example, imagine a family meeting held at home where travel plans, shopping lists, and assigned household chores are discussed. In this scenario, reports related to these topics are generated in real time from the audio data of the meeting. Once the reports are ready, the system shares them with the relevant parties and provides guidance for the next steps.
[0667] An example of a prompt message would be, "Based on the content of the previous meeting, please create meeting minutes that include the following keywords: family trip, shopping list, assigned household chores." This would allow the system to efficiently extract and organize information that meets the user's needs.
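As a non-limiting illustration, assembling a prompt message of the kind shown above from a list of user-supplied keywords might be sketched as follows. The function name build_minutes_prompt is hypothetical, and the fixed wording simply mirrors the example prompt.

```python
from typing import Sequence


def build_minutes_prompt(keywords: Sequence[str]) -> str:
    """Build a prompt asking for minutes that cover the given keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    return (
        "Based on the content of the previous meeting, please create meeting "
        "minutes that include the following keywords: "
        + ", ".join(keywords)
        + "."
    )
```

Calling the function with "family trip", "shopping list", and "assigned household chores" reproduces the example prompt message.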
[0668] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0669] Step 1:
[0670] The user acquires audio data using a microphone connected to the device. Here, the input is what is said during the meeting. The device converts this into digital audio data and transmits it to the server via the internet.
[0671] Step 2:
[0672] The server converts the received audio data into text data using the Google Cloud Speech-to-Text API. The input is audio data, which is analyzed and decoded via the API, and the output is text data.
[0673] Step 3:
[0674] The server analyzes text data using natural language processing libraries such as spaCy and NLTK. The input here is text data. The server extracts keywords and structural information from the text and generates important core information as output.
[0675] Step 4:
[0676] The server organizes the extracted core information based on the format set by the user and generates a report. The input is core information, which is structured according to formatting rules, and the output is a formatted report.
[0677] Step 5:
[0678] The server sends the generated report to the relevant parties. The input is the completed report, which is sent to the relevant parties as output via email or cloud service.
[0679] In this way, the server, terminal, and user elements work together to efficiently process meeting content and provide valuable information to participants.
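As a non-limiting illustration, the input and output relationships of Steps 2 to 5 above might be sketched as a chain of interchangeable stages. The class MinutesPipeline and its field names are hypothetical; each callable is a placeholder for the corresponding service (speech recognition API, natural language processing library, report formatter, and email or cloud delivery) and none of them reproduces a real vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

CoreInfo = Dict[str, List[str]]  # e.g. {"Agenda": [...], "Decisions": [...]}


@dataclass
class MinutesPipeline:
    """Chains the server-side stages; each stage is supplied by the caller."""
    transcribe: Callable[[bytes], str]        # Step 2: audio data -> text data
    extract: Callable[[str], CoreInfo]        # Step 3: text data -> core information
    format_report: Callable[[CoreInfo], str]  # Step 4: core information -> report
    send: Callable[[str], None]               # Step 5: deliver the report

    def run(self, audio: bytes) -> str:
        """Process one block of audio data and return the report that was sent."""
        text = self.transcribe(audio)
        core = self.extract(text)
        report = self.format_report(core)
        self.send(report)
        return report
```

Passing the stages in as callables keeps the flow independent of any particular speech recognition or delivery service, so each stage can be replaced or tested in isolation.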
[0680] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0681] This invention relates to a system that automatically converts meeting audio data into meeting minutes and further incorporates participant emotions into the minutes. This system improves the quality and efficiency of meeting minute creation by combining speech recognition technology, natural language processing technology, and emotion recognition technology. A specific embodiment of this system is described below.
[0682] First, the user configures the device's built-in microphone to collect audio data at the start of the meeting and begins recording. The device captures audio throughout the meeting and sends it to the server.
[0683] The server receives audio data in real time and records it in a database. This audio data is converted into text data by a speech recognition API and stored in the database as text. The converted text is not merely a transcription, but is used as material for in-depth analysis.
[0684] Next, the server uses natural language processing technology to extract key information from the text data. This efficiently organizes the main agenda items, decisions, and next steps from the meeting. This information is automatically arranged according to a format pre-configured by the user, and the initial meeting minutes are generated.
[0685] Furthermore, the server uses an emotion recognition engine to analyze participants' emotions from the audio data. This analysis detects and appropriately records each speaker's emotional state (e.g., excitement, calmness, dissatisfaction). For example, if the emotion analysis detects that a user is excited during a discussion, this information is added to the meeting minutes, making it clear under what emotional circumstances important statements were made.
[0686] Finally, users review the meeting minutes generated on their devices, including the emotion data, and make adjustments as needed. The server then distributes the reviewed and revised minutes as the final version to each participant. This streamlines the creation of meeting minutes while adding emotion-based supplementary information, so that the minutes convey richer context.
[0687] The following describes the processing flow.
[0688] Step 1:
[0689] The user launches a dedicated application on their device and sets up the microphone to record the meeting. Pressing the "Start Recording" button initiates audio collection. The device continuously transmits this audio data to the server in real time.
[0690] Step 2:
[0691] The server receives the audio stream sent from the terminal and records the audio data in the database. This recorded audio data is managed for subsequent speech recognition processing.
[0692] Step 3:
[0693] The server sends the audio data to a speech recognition API, which converts it into text data. This text data is stored in a database as a record of the meeting content. For example, if a user says, "Let's discuss the new project," that is saved as text data.
[0694] Step 4:
[0695] The server applies natural language processing techniques to extract important information (e.g., agenda items, opinions, decisions) from the text data. This information becomes the main content of the meeting and is used later to generate meeting minutes.
[0696] Step 5:
[0697] The server uses an emotion recognition engine to analyze the speaker's emotions from the audio data. The analysis results are recorded in a database as the emotional state for each statement (e.g., joy, surprise, criticism).
[0698] Step 6:
[0699] The server combines extracted key information with sentiment analysis results to generate meeting minutes in a format pre-configured by the user. This format includes elements such as "agenda," "speaker's sentiment," and "decisions made."
[0700] Step 7:
[0701] The user reviews the meeting minutes generated on their device, examining the overall content, including sentiment data. They correct or add information as needed. For example, they might correct misinterpretations of sentiment.
[0702] Step 8:
[0703] The server finalizes the revised meeting minutes and automatically sends them to meeting participants and relevant parties. Since the minutes include emotional information about important statements, participants can reconfirm the atmosphere of the meeting.
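As a non-limiting illustration, recording an emotional state for each statement (Steps 5 and 6) and applying the user's corrections to misinterpreted emotions (Step 7) might be sketched as follows. The class MinuteEntry, the functions apply_corrections and render_entries, and the bracketed output layout are assumptions made for this sketch.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class MinuteEntry:
    """One statement in the minutes together with its detected emotion."""
    speaker: str
    text: str
    emotion: str  # label produced by the emotion recognition engine


def apply_corrections(
    entries: Sequence[MinuteEntry], corrections: Dict[int, str]
) -> List[MinuteEntry]:
    """Return new entries in which the user's emotion corrections replace the originals.

    `corrections` maps an entry's index to the corrected emotion label.
    """
    return [
        replace(entry, emotion=corrections.get(index, entry.emotion))
        for index, entry in enumerate(entries)
    ]


def render_entries(entries: Sequence[MinuteEntry]) -> str:
    """Render each entry as 'speaker [emotion]: text', one per line."""
    return "\n".join(f"{e.speaker} [{e.emotion}]: {e.text}" for e in entries)
```

Because the entries are immutable, the automatically generated version is preserved alongside the user-corrected version, which may be useful when the corrections are later reviewed.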
[0704] (Example 2)
[0705] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0706] In recording and analyzing audio data from meetings, it is essential to efficiently and accurately extract and record important information. Furthermore, understanding the participants' mental states and reflecting them in the recording is necessary to gain a more comprehensive understanding of the meeting's content. Traditional methods often limit themselves to simple recording and conversion of audio, resulting in insufficient accuracy and efficiency in extracting important information and analyzing emotions.
[0707] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0708] In this invention, the server includes means for receiving audio information, means for converting audio information into text information, means for extracting important information from the text information, means for analyzing mental states and reflecting them in the record, and means for transmitting the generated record to the relevant parties. This enables efficient processing of audio data as well as the generation of detailed records that take into account the mental states of the participants.
[0709] "Audio information" refers to sound data acquired during meetings and conversations, and is typically analyzed using speech recognition technology.
[0710] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is a format designed to make information easier to visualize.
[0711] "Important information" refers to data from textual information that is deemed particularly significant in meetings and discussions, and is primarily extracted using natural language processing technology.
[0712] "Mental state" refers to the emotions and psychological responses exhibited by participants during meetings or discussions, and is analyzed using emotion recognition technology.
[0713] "Records" refer to documents or data that organize the information collected during a meeting, including the participants' mental states.
[0714] "Stakeholders" refers to individuals or organizations with whom the contents of the meeting should be shared, and typically includes meeting participants and those responsible for them.
[0715] A "server" refers to a computer system that processes collected audio information and generates and distributes text information and other important information.
[0716] This invention relates to a system for efficiently processing audio information from meetings, extracting important information, and analyzing the mental state of participants. Specific embodiments are described below.
[0717] At the start of a meeting, the user configures the built-in microphone using their device and begins collecting audio information. The device captures audio information using noise cancellation and transmits the data to the server in real time. The communication is protected by the SSL / TLS protocol.
[0718] The server converts the received audio information into text using a speech recognition API (for example, a general-purpose speech recognition service). This converted text information is stored in a database. Subsequently, the server extracts important information from the text information using natural language processing techniques (for example, a general-purpose natural language processing library). This process organizes the main agenda items and decisions made at the meeting.
[0719] Furthermore, the server uses an emotion recognition engine (for example, a widely used emotion analysis service) to analyze the participants' mental state from the audio information. The results of this analysis are reflected in the record along with the text information, and the final meeting minutes are generated.
[0720] Users can review these meeting minutes via their devices and correct errors or add information. After this review and correction is complete, the server distributes the final version of the record to the relevant parties.
[0721] For example, if a user says "I'd like to discuss the project's progress" during a monthly meeting, and the mental state analysis detects "tension," this information will be recorded in the meeting minutes as "Discussion on progress (tension)." This kind of information visualization allows for a deeper understanding of the meeting's content.
[0722] An example of a prompt for the generative AI model would be: "Analyze the audio and text data and generate meeting minutes highlighting key topics and emotions."
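One way such a prompt could be assembled is sketched below: the instruction sentence quoted above is followed by the transcript, with each statement annotated with the mental state detected for it. The `Utterance` type, the `build_prompt` function, and the line layout are hypothetical; the disclosure gives only the instruction sentence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str
    text: str
    emotion: str  # label from the emotion recognition step, e.g. "tension"


INSTRUCTION = (
    "Analyze the audio and text data and generate meeting minutes "
    "highlighting key topics and emotions."
)


def build_prompt(utterances: List[Utterance]) -> str:
    """Return the instruction followed by one annotated line per utterance."""
    lines = [f"{u.speaker}: {u.text} (emotion: {u.emotion})" for u in utterances]
    return INSTRUCTION + "\n\nTranscript:\n" + "\n".join(lines)
```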
[0723] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0724] Step 1:
[0725] The user uses the device to configure the built-in microphone and begin collecting audio information. The device enables noise cancellation to properly record the collected audio information. At this stage, the input is audio data of the meeting participants' speech, and the output is clear audio data with the noise removed.
[0726] Step 2:
[0727] The terminal transmits collected audio data to the server in real time. The input is clear audio data, which is encrypted and securely transferred to the server. The output is secure audio data stored on the server. SSL / TLS is used to protect the communication during this process.
[0728] Step 3:
[0729] The server converts received audio data into text using a speech recognition API. The input is audio data stored on the server, which is analyzed using speech recognition technology and converted into text. The output is clearly recorded text information, which is stored in a database.
[0730] Step 4:
[0731] The server uses natural language processing techniques to extract important information from textual data. At this stage, textual data is used as input. An information extraction algorithm is executed to identify the meeting's topics and decisions. The output is added to a database in an organized form. Specifically, noun phrase extraction and topic modeling are performed.
[0732] Step 5:
[0733] The server uses an emotion recognition engine to analyze mental states from audio data. The input is the original audio data, and emotion analysis techniques are used to identify the emotional state of each speaker. The output is textual information with the emotional state added. For example, characteristics such as whether a statement is "calm" or "nervous" are evaluated.
[0734] Step 6:
[0735] The user reviews the generated meeting minutes on their terminal and corrects any errors as needed. At this stage, the input is the meeting minutes data sent from the server. The user checks for errors and adds notes to supplement the record. The output is the final version of the meeting minutes after corrections have been made.
[0736] Step 7:
[0737] The server distributes the final, revised meeting minutes to the relevant parties. The input is the final version of the meeting minutes after user review, and it is distributed to relevant parties via email or other means. The output is the meeting minutes data securely delivered to all participants. As a result, the meeting content is shared with high transparency.
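Steps 3 to 7 above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the implementation of the disclosure: the speech recognition, information extraction, emotion analysis, and delivery components are passed in as plain callables (`transcribe`, `is_important`, `detect_state`, `send`), all of which are hypothetical stand-ins for the external services the text mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    speaker: str
    audio: bytes


@dataclass
class MinutesEntry:
    speaker: str
    text: str
    mental_state: str

    def render(self) -> str:
        return f"{self.speaker}: {self.text} ({self.mental_state})"


def generate_minutes(
    segments: List[Segment],
    transcribe: Callable[[bytes], str],
    is_important: Callable[[str], bool],
    detect_state: Callable[[bytes], str],
) -> List[MinutesEntry]:
    """Steps 3 to 5: transcribe, keep important statements, attach mental state."""
    entries = []
    for seg in segments:
        text = transcribe(seg.audio)        # Step 3: speech recognition
        if not is_important(text):          # Step 4: information extraction
            continue
        state = detect_state(seg.audio)     # Step 5: analysis of the original audio
        entries.append(MinutesEntry(seg.speaker, text, state))
    return entries


def apply_corrections(
    entries: List[MinutesEntry], corrections: Dict[int, str]
) -> List[MinutesEntry]:
    """Step 6: return a copy with the user's text corrections applied by index."""
    return [
        MinutesEntry(e.speaker, corrections.get(i, e.text), e.mental_state)
        for i, e in enumerate(entries)
    ]


def distribute(
    entries: List[MinutesEntry],
    recipients: List[str],
    send: Callable[[str, str], None],
) -> int:
    """Step 7: hand the rendered minutes to `send` once per recipient."""
    body = "\n".join(e.render() for e in entries)
    for recipient in recipients:
        send(recipient, body)
    return len(recipients)
```

Passing the components as callables keeps the sketch independent of any particular speech recognition or emotion analysis service, and lets the user review in Step 6 sit between generation and distribution.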
[0738] (Application Example 2)
[0739] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0740] In brick-and-mortar stores, accurately understanding customer needs and emotions is difficult, which can result in a decline in service quality. Furthermore, it is challenging for staff to instantly grasp customer emotions and adjust their responses accordingly, making improving customer satisfaction a significant challenge.
[0741] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0742] In this invention, the server includes means for receiving audio information acquired during a conversation and converting it into text information, means for extracting important content from the text information, and means for analyzing the emotional state from the audio information during the conversation. This makes it possible to understand the customer's needs and provide the most appropriate response based on their emotions in real time during interactions with customers.
[0743] "Conversation" is the act of multiple audio information senders and receivers exchanging audio information with each other.
[0744] "Audio information" refers to data that represents the words and sounds spoken by a person in digital or analog format.
[0745] "Textual information" refers to data expressed using the alphabet, kanji, kana, etc., based on audio information.
[0746] "Important content" refers to specific matters or topics that have particular significance and are extracted from the conversation.
[0747] "Format" refers to a framework for organizing information based on certain standards and formats.
[0748] A "record" is a collection of data that is saved so that information can be reused or referenced later.
[0749] "Emotional state" refers to psychological characteristics related to the emotions of the sender or receiver inferred from audio information.
[0750] A "physical store" is a facility that provides goods and services in a physical location and conducts face-to-face transactions with customers.
[0751] This invention is a system aimed at improving customer service in physical stores. Based on the cooperative relationship between the server, terminal, and user, this system is realized through the following functions performed by each element.
[0752] First, conversations within the physical store are captured as audio information through the microphone built into the device. Smartphones and tablets are suitable devices for this purpose. This audio information is sent to a server in real time, and the server uses a speech recognition API (for example, Google Cloud Speech-to-Text) to convert the audio information into text. The text information is then stored in a database.
[0753] Next, the server uses natural language processing techniques to extract important information from the text. Natural language processing libraries such as spaCy and NLTK are useful in this process. The server also uses emotion recognition APIs (e.g., IBM Watson Tone Analyzer) to analyze the emotional state of the participants during the conversation and incorporates the results into the record.
[0754] Ultimately, the user (in this case, a store employee) can respond immediately to customer needs and emotions based on the analysis results displayed on the terminal. This improves the quality of customer service and leads to increased customer satisfaction.
[0755] As a concrete scenario, for example, if a customer is very interested in a new product, their level of excitement can be detected through sentiment analysis. Based on this information, the salesperson can immediately take appropriate measures, such as providing more detailed information about the new product.
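The scenario above can be illustrated with a simple threshold rule. This is only a sketch: the `emotion_scores` dictionary, the `"excitement"` key, the 0.7 threshold, and the `suggest_action` function are assumptions, since the disclosure does not state how the analysis result is represented or when a suggestion is triggered.

```python
from typing import Dict, Optional


def suggest_action(
    emotion_scores: Dict[str, float],
    product: str,
    threshold: float = 0.7,  # hypothetical cut-off for "very interested"
) -> Optional[str]:
    """Return a suggestion for staff when the excitement score reaches the threshold."""
    if emotion_scores.get("excitement", 0.0) >= threshold:
        return f"Provide more detailed information about {product}."
    return None
```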
[0756] An example of a prompt for the generative AI model might be: "Please explain in detail the product the customer is interested in. Also, identify any concerns the customer may have and suggest solutions." The response to this prompt gives store staff guidance for providing appropriate service smoothly.
[0757] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0758] Step 1:
[0759] The device acquires conversation audio. Using the microphone built into the device, it captures conversation audio between customers and store employees in a physical store in real time. The input is audio information, and this information is stored in digital format.
[0760] Step 2:
[0761] The server receives audio information and converts it into text information using a speech recognition API. The input data is digital audio information sent from the terminal, and the output data is text information obtained by converting the audio information into text format. Speech recognition technologies such as Google Cloud Speech-to-Text are used.
[0762] Step 3:
[0763] The server uses natural language processing techniques to extract important information from textual data. The input data is textual data obtained from speech recognition, and the output data is information about key topics and needs extracted from the conversation. Natural language processing libraries (e.g., spaCy or NLTK) are used for this process.
[0764] Step 4:
[0765] The server analyzes emotional states using an emotion recognition API. Input data consists of textual information, including conversation content, while output data represents the emotional states of both the customer and the staff. This analysis utilizes emotion recognition technologies such as IBM Watson Tone Analyzer.
[0766] Step 5:
[0767] The server generates a record based on important content and emotional state, and delivers it to the terminal. The input data is the analysis results, including important content and emotional state, and the output data is a record that organizes them. The terminal uses this to display the analysis results to the user.
[0768] Step 6:
[0769] The user reviews the analysis data presented on the terminal and uses it to improve customer service. Based on the analysis results, the user adjusts how customers are served and improves service quality. In this process, store employees refer to the prompts for the generative AI model to provide the best possible response to customer needs.
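The record delivered to the terminal in Step 5 could, for example, be serialized as JSON. The sketch below assumes a hypothetical `ConversationRecord` structure with two fields; the disclosure states only that the record organizes the important content and the emotional state.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class ConversationRecord:
    important_content: List[str]
    emotional_state: Dict[str, str]  # role -> label, e.g. {"customer": "excited"}


def to_terminal_payload(record: ConversationRecord) -> str:
    """Serialize the record so the terminal can display the analysis results."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
```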
[0770] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0771] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
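The input and output relationship described above can be sketched as a small interface. The names `InferenceInput`, `DataGenerationModel`, and `EchoSummarizer` are hypothetical, and the stand-in class performs no real inference; it only shows the shape of a call in which a prompt and inference data go in and an inference result comes out.

```python
from dataclasses import dataclass
from typing import Protocol, Union


@dataclass
class InferenceInput:
    prompt: str               # instructions for the model
    data: Union[str, bytes]   # text, or raw audio/image bytes


class DataGenerationModel(Protocol):
    def infer(self, request: InferenceInput) -> Union[str, bytes]:
        ...


class EchoSummarizer:
    """Stand-in model: returns the first sentence of text input as a 'summary'."""

    def infer(self, request: InferenceInput) -> str:
        if not isinstance(request.data, str):
            raise TypeError("this stand-in only accepts text input")
        first = request.data.split(".")[0].strip()
        return f"Summary: {first}."
```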
[0772] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0773] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0774] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0775] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0776] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible (expressed in actions) it becomes.
[0777] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
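The layout of the emotion map can be illustrated with a simple coordinate model. This is a sketch under assumptions: the Cartesian representation and the `MapPoint` type are hypothetical, and the disclosure does not assign numeric coordinates to emotions. The sketch encodes only the stated layout: left is the "response" side, right is the "situation" side, up is pleasure, down is displeasure, and distance from the center indicates how outwardly an emotion is expressed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPoint:
    x: float  # negative = left ("response" side), positive = right ("situation" side)
    y: float  # positive = upper ("pleasure"), negative = lower ("displeasure")

    @property
    def radius(self) -> float:
        """Distance from the center; larger means more expressed in actions."""
        return math.hypot(self.x, self.y)

    def region(self) -> str:
        # Assumption: points exactly on the vertical axis count as "situation".
        return "response" if self.x < 0 else "situation"

    def valence(self) -> str:
        if self.y > 0:
            return "pleasure"
        if self.y < 0:
            return "displeasure"
        return "neutral"
```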
[0778] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0779] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
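Once emotion values have been obtained, determining the user's emotion can be as simple as selecting the emotion with the highest value. The sketch below assumes the values arrive as a dictionary keyed by emotion name and that the highest value wins; the disclosure does not state the selection rule, and the neural network itself is out of scope here.

```python
from typing import Dict


def determine_emotion(emotion_values: Dict[str, float]) -> str:
    """Return the emotion label with the highest emotion value."""
    if not emotion_values:
        raise ValueError("no emotion values supplied")
    return max(emotion_values, key=emotion_values.get)
```

Because emotions located close together on the map are trained to have similar values, the top few candidates will often be neighbors, so a caller may prefer to inspect several high-valued emotions rather than only the maximum.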
[0780] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0781] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0782] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0783] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0784] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0785] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0786] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0787] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0788] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0789] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0790] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0791] The following is further disclosed regarding the embodiments described above.
[0792] (Claim 1)
[0793] A means of receiving audio data acquired during a meeting,
[0794] A means for converting the audio data into text data,
[0795] A means for extracting important information from the text data,
[0796] A means for organizing the important information and generating meeting minutes based on a specified format,
[0797] A means of sending the generated meeting minutes to the relevant parties,
[0798] A system that includes this.
[0799] (Claim 2)
[0800] The system according to claim 1, which includes means for analyzing the content of a statement using natural language processing technology in the extraction of the aforementioned important information.
[0801] (Claim 3)
[0802] The system according to claim 1, characterized in that the aforementioned format can be set in advance by the user.
[0803] "Example 1"
[0804] (Claim 1)
[0805] A means of setting up an audio input device and collecting audio signals at the start of a meeting,
[0806] Means for transmitting the audio signal to a central processing unit,
[0807] The central processing unit includes means for converting speech signals into text information using speech recognition technology,
[0808] A means for analyzing and extracting important elements from the text information using natural language processing technology,
[0809] A means for automatically generating a report by structuring the key elements based on a user-configurable format,
[0810] A means of distributing the generated report to the relevant parties,
[0811] A system that includes this.
[0812] (Claim 2)
[0813] The system according to claim 1, which includes a correction function using a generative AI model to format the generated report into natural-sounding text.
[0814] (Claim 3)
[0815] The system according to claim 1, which processes data in real time upon receiving an audio signal and records the information immediately.
[0816] "Application Example 1"
[0817] (Claim 1)
[0818] A means of receiving audio data acquired during a meeting,
[0819] A means for converting the audio data into text data,
[0820] A means for extracting core information from the text data,
[0821] A means for organizing the core information and generating a report based on a specified format,
[0822] A means of sending the aforementioned report to the relevant parties,
[0823] A means of autonomously acquiring audio data and providing information immediately,
[0824] A system that includes this.
[0825] (Claim 2)
[0826] The system according to claim 1, which includes means for analyzing the content of a statement using language processing technology in the extraction of the core information.
[0827] (Claim 3)
[0828] The system according to claim 1, characterized in that the aforementioned format can be set in advance by the user.
[0829] "Example 2 of combining an emotion engine"
[0830] (Claim 1)
[0831] A means for receiving audio information,
[0832] Means for converting the audio information into text information,
[0833] A means for extracting important information from the text information,
[0834] A means for organizing the important information and generating a record based on a specified format,
[0835] A means for analyzing the mental state from the audio information and reflecting it in the record,
[0836] A means of transmitting the generated records to the relevant parties,
[0837] A system that includes this.
[0838] (Claim 2)
[0839] The system according to claim 1, which includes means for analyzing the content of a statement using natural language processing technology in the extraction of the aforementioned important information.
[0840] (Claim 3)
[0841] The system according to claim 1, characterized in that the aforementioned format can be set in advance by the user.
[0842] "Application example 2 when combining with an emotional engine"
[0843] (Claim 1)
[0844] A means for receiving audio information acquired during a conversation,
[0845] Means for converting the audio information into text information,
[0846] A means for extracting important content from the text information,
[0847] A means for organizing the important content and generating a record based on a specified format,
[0848] A means of transmitting the generated records to the relevant parties,
[0849] A means for analyzing emotional states from audio information during conversations,
[0850] A means of reflecting the analyzed emotional state in the record,
[0851] A system that includes this.
[0852] (Claim 2)
[0853] The system according to claim 1, which includes means for analyzing the content using natural language processing technology in the extraction of the important content.
[0854] (Claim 3)
[0855] The system according to claim 1, characterized in that the aforementioned format can be set in advance by the user.
[Explanation of symbols]
[0856]
10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: means for receiving audio data acquired during a meeting; means for converting the audio data into text data; means for extracting important information from the text data; means for organizing the important information and generating meeting minutes based on a specified format; and means for sending the generated meeting minutes to the relevant parties.
2. The system according to claim 1, which includes means for analyzing the content of a statement using natural language processing technology in the extraction of the aforementioned important information.
3. The system according to claim 1, characterized in that the aforementioned format can be set in advance by the user.