system

The system addresses the inefficiencies in meeting minute creation and task management by using speech recognition and natural language processing to automate the process, enhancing productivity through streamlined information recording and task organization.

JP2026096688A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

IPC: G06Q10/10; G06Q10/06

AI Tagging

Application Domain

Office automation; Resources

AI Technical Summary

Technical Problem

In business meetings, creating meeting minutes and managing tasks is time-consuming, leading to incomplete records and missed tasks, which hinders productivity.

Method used

A system utilizing speech recognition to convert audio data into text, natural language processing to extract important information, and automatic meeting minutes creation, along with task management to streamline these processes.

Benefits of technology

Enables efficient information recording and task management during and after meetings, improving productivity by automating the creation of meeting records and task organization.

Abstract

A system is provided. [Solution] The system includes: speech recognition means that receives audio data and converts the audio data into text data; natural language processing means that extracts important information from the generated text data; meeting minutes creation means that automatically generates meeting records based on the extracted important information; and task management means that automatically detects tasks based on the meeting records and manages those tasks.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In business meetings, it is necessary to create meeting minutes, organize information, and manage tasks. However, these tasks are time-consuming and prevent participants from concentrating on core discussions. Also, in follow-up work after the meeting, incomplete records and missed tasks cause productivity to decline. Therefore, there is a need for means to smoothly record information during the meeting and reliably and efficiently manage tasks after the meeting.

Means for Solving the Problems

[0005] This invention provides a system that includes speech recognition means for receiving audio data and converting it into text data in real time, as well as natural language processing means for extracting important meeting information from the generated text data. Furthermore, it includes meeting minutes creation means for automatically generating meeting records based on the extracted information, and a task management means for automatically detecting and managing tasks from the meeting records. This system enables effective information recording during meetings, efficient task management after meetings, and improved productivity.

[0006] "Audio data" refers to information recorded in digital format as audio signals, including the statements of participants in meetings or discussions.

[0007] "Speech recognition means" refers to the technology and apparatus for analyzing received audio data and converting it into text data.

[0008] "Text data" refers to text-based information produced from audio data by speech recognition technology.

[0009] "Natural language processing means" refers to technologies and devices for analyzing text data and extracting important information and intentions.

[0010] "Meeting minutes creation means" refers to a function and device for automatically generating meeting records based on important information extracted through natural language processing.

[0011] "Task management means" refers to functions and devices for organizing tasks detected from the meeting records generated by the meeting minutes creation means, and for monitoring and managing their progress.

Brief Description of the Drawings

[0012]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of the data processing device and smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, in which an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, in which an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0014] First, the language used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as work memory.

[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same reading applies in this specification when three or more items are linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] The system according to the present invention is configured to automate both the creation of meeting minutes from meeting recordings and the management of tasks. This system mainly consists of a server, terminals, and users.

[0034] First, the terminal performs voice input at the start of the meeting. This voice input is collected using the terminal's built-in microphone and sent to the server in real time. The voice data sent to the server is immediately converted into text data through speech recognition. This converted text data is stored in temporary storage on the server.

[0035] Next, the server analyzes the stored text data using natural language processing. This extracts important information and keywords from the meeting content. In particular, it is designed to automatically identify decisions and key topics. This extracted information is then organized by a meeting minutes creation system and automatically generated as formatted meeting minutes.

[0036] Furthermore, the server automatically detects tasks from the generated meeting minutes. The task management system analyzes the deadlines and responsible persons associated with these tasks and records their progress. The user is notified of the task list and can then take further action based on it.

[0037] As a concrete example, if a participant says "We need to prepare the materials before the next meeting" during a meeting, the terminal sends this statement as audio data to the server. The server uses speech recognition to convert "We need to prepare the materials before the next meeting" into text data, and then uses natural language processing to extract the task "Prepare materials" from this statement. The task management system lists this task and notifies the user along with related information. In this way, tasks are managed efficiently even after the meeting, enabling smooth information sharing and follow-up.
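The extraction step in this example can be sketched with a few hand-written rules. This is only an illustration under stated assumptions: the patent does not disclose how the natural language processing is implemented, and the cue phrases, the regular expressions, and the function name `extract_task` are all hypothetical.

```python
import re

# Hypothetical cue phrases that signal an action item; the patent does not list any.
TASK_CUES = re.compile(
    r"^(?:we|i|you)\s+(?:need to|have to|must|should)\s+(?P<action>.+)$",
    re.IGNORECASE,
)
# A trailing time expression is split off as a rough deadline hint.
DEADLINE_HINT = re.compile(r"\s+(?P<when>(?:before|by|until)\s+.+)$", re.IGNORECASE)


def extract_task(utterance: str):
    """Return (task, deadline_hint) for an utterance, or None if no cue matches."""
    text = utterance.strip().rstrip(".!")
    match = TASK_CUES.match(text)
    if match is None:
        return None
    action = match.group("action")
    deadline = None
    hint = DEADLINE_HINT.search(action)
    if hint:
        deadline = hint.group("when")
        action = action[: hint.start()]
    # Drop an article right after the verb: "prepare the materials" -> "prepare materials".
    action = re.sub(r"^(\w+)\s+(?:the|a|an)\s+", r"\1 ", action)
    return action[0].upper() + action[1:], deadline


print(extract_task("We need to prepare the materials before the next meeting."))
```

A production system would more likely rely on a trained language model than on fixed patterns; the rules here only make the input and output of the step concrete.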

[0038] In this way, the present invention aims to streamline administrative tasks in meetings and provide an environment where participants can focus on more important discussions.

[0039] The following describes the processing flow.

[0040] Step 1:

[0041] The terminal starts recording audio at the beginning of the meeting. It uses its built-in microphone to capture participants' speech in real time and stores it as audio data.

[0042] Step 2:

[0043] After the terminal buffers a certain amount of audio data, it sends this audio data to the server in packet format. Transmission is performed periodically in real time to minimize data delay.
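The buffering described in this step might look like the following minimal sketch. The packet layout, the `AudioPacket` and `packetize` names, and the byte sizes are assumptions made for illustration; the patent does not specify a packet format or buffer size.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class AudioPacket:
    sequence: int   # position of the packet in the stream
    payload: bytes  # raw audio bytes


def packetize(frames: Iterable[bytes], packet_size: int) -> Iterator[AudioPacket]:
    """Buffer incoming frames and yield a packet whenever packet_size bytes are available."""
    buffer = bytearray()
    sequence = 0
    for frame in frames:
        buffer.extend(frame)
        while len(buffer) >= packet_size:
            yield AudioPacket(sequence, bytes(buffer[:packet_size]))
            del buffer[:packet_size]
            sequence += 1
    if buffer:  # flush whatever is left when the meeting ends
        yield AudioPacket(sequence, bytes(buffer))


frames = [b"\x00" * 300, b"\x01" * 300, b"\x02" * 300]
packets = list(packetize(frames, packet_size=400))
print([(p.sequence, len(p.payload)) for p in packets])  # [(0, 400), (1, 400), (2, 100)]
```

A smaller `packet_size` lowers latency at the cost of more transmissions, which is the trade-off implied by sending periodically to minimize delay.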

[0044] Step 3:

[0045] The server passes the received audio data to the speech recognition engine. The speech recognition engine analyzes the audio data and converts it into text data. This process is continuous, generating text data in real time during the meeting.

[0046] Step 4:

[0047] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts context and keywords from the text data, identifying important information from the meeting. This information is then organized into key points to focus on.

[0048] Step 5:

[0049] The server automatically generates meeting minutes based on important information. The meeting minutes creation module organizes the information in an appropriate format and saves it as a meeting record.

[0050] Step 6:

[0051] The server automatically extracts tasks using the generated meeting minutes and registers them in the task management database. Tasks can be automatically assigned deadlines and responsible parties, and this information is stored in a structured format.
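One possible shape for the structured task record is sketched below. The field names, the `Task` and `TaskRegistry` classes, and the in-memory dictionary standing in for the task management database are all hypothetical; the patent does not define a schema.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class Task:
    task_id: int
    title: str
    assignee: Optional[str] = None   # responsible party, if one was detected
    deadline: Optional[date] = None  # due date, if one was detected
    status: str = "open"


class TaskRegistry:
    """In-memory stand-in for the task management database."""

    def __init__(self) -> None:
        self._tasks: dict[int, Task] = {}
        self._next_id = 1

    def register(self, title: str, assignee: Optional[str] = None,
                 deadline: Optional[date] = None) -> Task:
        task = Task(self._next_id, title, assignee, deadline)
        self._tasks[task.task_id] = task
        self._next_id += 1
        return task

    def as_records(self) -> list[dict]:
        """Export every task as a plain dict with ISO-formatted dates."""
        records = []
        for task in self._tasks.values():
            record = asdict(task)
            if task.deadline is not None:
                record["deadline"] = task.deadline.isoformat()
            records.append(record)
        return records


registry = TaskRegistry()
registry.register("Prepare materials", assignee="Participant A", deadline=date(2025, 1, 10))
print(registry.as_records())
```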

[0052] Step 7:

[0053] Users receive meeting minutes and task lists from the server via their terminals. They use this information to manage post-meeting tasks and track task progress.

[0054] Step 8:

[0055] Users can access the task management system through an interface to update task progress. Users can add progress information and notify other stakeholders.
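Progress updates with stakeholder notification can be illustrated with a simple subscriber list. This is a sketch under assumptions: the `ProgressTracker` class, the percentage scale, and the callback signature are hypothetical and are not described in the patent.

```python
from typing import Callable, Dict, List

Listener = Callable[[str, int], None]


class ProgressTracker:
    """Track per-task completion percentage and notify subscribers on change."""

    def __init__(self) -> None:
        self._progress: Dict[str, int] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def update(self, task: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        if self._progress.get(task) == percent:
            return  # nothing changed, so nobody is notified
        self._progress[task] = percent
        for listener in self._listeners:
            listener(task, percent)


inbox = []
tracker = ProgressTracker()
tracker.subscribe(lambda task, percent: inbox.append(f"{task}: {percent}%"))
tracker.update("Prepare materials", 50)
tracker.update("Prepare materials", 50)   # duplicate update, ignored
tracker.update("Prepare materials", 100)
print(inbox)  # ['Prepare materials: 50%', 'Prepare materials: 100%']
```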

[0056] (Example 1)

[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0058] Traditional meeting record-keeping and task management processes lack automation, resulting in cumbersome and time-consuming post-meeting processing. In particular, extracting key information from discussions, quickly creating meeting minutes, and efficiently managing related tasks is challenging.

[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0060] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important information from the generated text information, and record creation means for automatically generating meeting minutes based on the extracted important information. This makes it possible to streamline post-meeting task management and provide an environment in which participants can concentrate on important discussions.

[0061] "Audio information" refers to audio data, including spoken words during a meeting.

[0062] "Textual information" refers to text data converted from audio information.

[0063] "Speech recognition means" refers to technical means that convert speech information into text information, and includes the process of analyzing acoustic signals and generating corresponding text.

[0064] "Natural language processing means" refers to technologies for extracting important information from textual data, and includes the process of understanding and structuring the meaning and context of the text.

[0065] "Record creation means" refers to a method for automatically generating meeting minutes based on extracted important information, organizing the information and outputting it in a predetermined format.

[0066] A "work item" refers to a specific task or action decided upon in a meeting, accompanied by relevant information such as the person who will perform it and the deadline.

[0067] "Work management means" refers to methods for handling detected work items and managing their processing and progress.

[0068] "Analysis of relevant deadlines and responsible persons" involves analyzing the deadlines set for work items and the responsible persons involved through information processing, and utilizing this information in the management process.

[0069] This invention is a system that efficiently creates meeting minutes by recording audio information from meetings and converting it into text information, and also automatically detects and manages work items. The system mainly consists of a server, terminals, and users.

[0070] The terminal inputs audio information during a meeting using its built-in microphone and transmits it to the server via its communication function. The basic hardware used in the terminal includes a standard microphone and a computer with network connectivity.

[0071] The server receives audio information transmitted from the terminal and converts it into text using speech recognition software. This process utilizes, for example, a "speech analysis API" as speech recognition software. After the audio information is converted to text, the server analyzes the text using natural language processing techniques. Here, software such as a "natural language processing library" is used to extract important information.

[0072] Next, the server uses the recording mechanism to format the meeting minutes based on the extracted information. The minutes are stored in a database and made available to users for access as needed.

[0073] Furthermore, the server detects work items from meeting records and manages these items using work management tools. Specifically, it analyzes deadlines and responsible parties and tracks progress using task management tool APIs. For example, the "Project Management Platform API" can be used.

[0074] Users can receive real-time notifications for generated meeting minutes and the progress of work items. Based on this, users can plan their next actions and perform their work efficiently.

[0075] For example, if a participant says "Please prepare for the next presentation" during a meeting, the terminal sends this audio to the server. The server converts it into text information, "Please prepare for the next presentation," and uses natural language processing to detect the task item "Prepare for the presentation."

[0076] An example of a prompt message is, "Convert the meeting audio log to text and extract specific work items." In this way, the system can automate meeting information management and dramatically improve work efficiency.

[0077] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0078] Step 1:

[0079] The terminal uses its built-in microphone to input voice information as soon as the meeting starts. The input voice is converted into digital data and temporarily stored in the terminal's memory. The converted digital voice data is then sent sequentially to the server as data chunks.

[0080] Step 2:

[0081] The server receives the digital audio data transmitted from the terminal and converts it into text information using a speech analysis API. In the speech-to-text conversion, the acoustic signal undergoes phoneme analysis, and text is generated by a continuous speech recognition model. The resulting raw text data is stored in the server's temporary storage.

[0082] Step 3:

[0083] The server analyzes the stored raw text data using a natural language processing library. During this analysis process, named entities and important information are extracted from the text, and key points and decisions are identified using predefined rules and machine learning models. The extracted information is then output in a structured format as the analysis result.
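The "predefined rules" part of this step can be sketched as a small classifier that sorts sentences into buckets and emits a structured result. The cue patterns, the bucket labels, and the `analyze` function are hypothetical; the sketch also leaves out named entity extraction and the machine learning models mentioned above.

```python
import json
import re

# Hypothetical cue patterns; the patent only says "predefined rules".
RULES = {
    "decision": re.compile(r"\b(?:we (?:have )?decided|it was agreed|we will go with)\b", re.I),
    "action_item": re.compile(r"\b(?:please|need to|must)\b", re.I),
}


def analyze(transcript: str) -> dict:
    """Split a transcript into sentences and bucket each one by the first matching rule."""
    result = {label: [] for label in RULES}
    result["other"] = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    for sentence in sentences:
        for label, pattern in RULES.items():
            if pattern.search(sentence):
                result[label].append(sentence)
                break
        else:  # no rule matched this sentence
            result["other"].append(sentence)
    return result


transcript = (
    "We decided to launch in April. "
    "Please prepare for the next presentation. "
    "The weather was nice."
)
print(json.dumps(analyze(transcript), indent=2))
```

The dictionary returned by `analyze` is the kind of structured output that a later minutes-formatting step could consume directly.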

[0084] Step 4:

[0085] The server generates meeting minutes using recording mechanisms based on structured information. In this process, information is organized according to a template and output as formatted meeting minutes. The generated meeting minutes are stored in a database for later access by users.
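Template-based formatting of this kind can be illustrated with Python's standard `string.Template`. The template layout, the section names, and the helper functions are assumptions for illustration; the patent does not specify the minutes format.

```python
from string import Template

# Hypothetical plain-text template; the patent does not specify a format.
MINUTES_TEMPLATE = Template(
    "MEETING MINUTES\n"
    "Title: $title\n"
    "Date: $date\n"
    "\n"
    "Decisions:\n$decisions\n"
    "\n"
    "Key points:\n$key_points\n"
)


def bullet_list(items):
    """Render items as indented bullets, or a placeholder when the list is empty."""
    return "\n".join(f"  - {item}" for item in items) or "  (none)"


def render_minutes(title, date, decisions, key_points):
    """Fill the template with the structured analysis result."""
    return MINUTES_TEMPLATE.substitute(
        title=title,
        date=date,
        decisions=bullet_list(decisions),
        key_points=bullet_list(key_points),
    )


print(render_minutes("Weekly sync", "2025-01-10", ["Launch in April"], []))
```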

[0086] Step 5:

[0087] The server applies work management tools to detect new work items from meeting records. Information about the detected work items is sent to the task management system via API, and a work item list is generated. At this time, deadlines and responsible person information are automatically added to the items.
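The request body sent to a task management system might be assembled as in the sketch below. The field names and the `build_work_item_payload` function are hypothetical, no real task management API is modeled, and the default deadline of a fixed number of days after the meeting is an assumed policy, since the patent does not say how deadlines are derived.

```python
import json
from datetime import date, timedelta


def build_work_item_payload(title, meeting_date, assignee=None, due_in_days=7):
    """Serialize one detected work item as a JSON request body."""
    payload = {
        "title": title,
        # Fall back to a placeholder when no responsible person was detected.
        "assignee": assignee or "unassigned",
        # Assumed policy: the deadline defaults to a fixed offset from the meeting date.
        "due": (meeting_date + timedelta(days=due_in_days)).isoformat(),
    }
    return json.dumps(payload, sort_keys=True)


print(build_work_item_payload("Prepare for the presentation", date(2025, 1, 10)))
```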

[0088] Step 6:

[0089] After a meeting ends, users receive a generated task list via real-time notification. Upon receiving the notification, users can review the details through their task management tool and take action to manage the progress of their tasks. This ensures that individual work items are efficiently tracked and completed.

[0090] (Application Example 1)

[0091] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0092] In modern brick-and-mortar stores, efficiently recording the content of meetings and discussions and managing tasks based on that information is essential. However, doing this manually is typically time-consuming, labor-intensive, and prone to errors. Furthermore, the lack of mechanisms for sharing information immediately on-site and for visualizing task progress makes it difficult to optimize operations. A system is needed that solves these problems and allows the store to run efficiently.

[0093] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0094] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important data from the generated text information, and record creation means for automatically generating meeting documents based on the extracted important data. This enables the content of meetings to be recorded in real time, facilitating efficient task management and information sharing.

[0095] "Audio information" refers to information used to record and process audio in a digital format.

[0096] "Textual information" refers to text data expressed in digital format, specifically text converted from audio information.

[0097] "Speech recognition means" refers to technology that analyzes speech information and converts it into text information.

[0098] "Natural language processing means" refers to language processing techniques used to analyze textual information and extract important data from it.

[0099] "Record creation means" refers to the means for generating meeting documents based on extracted data, and for formalizing the documents.

[0100] "Work items" refer to specific actions or tasks that are decided or presented in meetings or discussions.

[0101] "Controlling tasks" refers to managing work items, monitoring their progress, and making appropriate adjustments.

[0102] "Business support" refers to back-office support activities aimed at improving the operational efficiency of stores.

[0103] "Business documents" refer to documents that organize and record information related to business operations.

[0104] "Business documents derived from audio" are recorded documents generated based on audio information, and reflect the content of the meeting.

[0105] "Methods for optimizing business progress" refer to techniques for organizing information and clarifying tasks in order to operate business efficiently.

[0106] The system that realizes this invention mainly consists of a server, a terminal, and a user. First, the terminal collects audio information at the start of a meeting or discussion. The terminal acquires audio information using its built-in microphone, and this audio information is transmitted to the server in real time.

[0107] Next, the server processes the received audio information. The server uses speech recognition technology to convert the audio information into text. For this process, for example, a speech recognition service provided by AWS® can be used. The converted text information is temporarily stored in a database on the server.

[0108] Subsequently, the server analyzes the textual information using natural language processing technology and extracts important data. Natural language processing tools such as Amazon Comprehend can be utilized for this process. The extracted data is organized by a recording system and automatically formatted as a meeting document.

[0109] Next, the server automatically detects and manages work items based on the meeting documents. Work items are organized with their associated deadlines and assignee information, and their progress is tracked. The user is notified of managed tasks in real time and can take appropriate action. Task management can be implemented using tools such as the Trello API.

[0110] For example, if someone says during a meeting at a store, "We need to prepare the product display for the next sale," the terminal sends this statement as audio information to the server. The server converts it into text using speech recognition and extracts the task item "Prepare product display" using natural language processing. The task item is then assigned an appropriate person and deadline, and the user is notified.

[0111] An example of a prompt message would be, "Please upload the audio file, summarize the key points of the meeting, and create a list of tasks to complete before the next meeting."

[0112] In this way, this invention achieves improved operational efficiency and optimized information sharing in physical stores.

[0113] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0114] Step 1:

[0115] The terminal collects audio information as soon as a meeting or discussion begins. The terminal's built-in microphone captures ambient sound, and the recorded audio data is transmitted to the server in real time. The input is raw audio information, while the output is digital audio data transmitted to the server.

[0116] Step 2:

[0117] The server converts received audio data into text information using speech recognition technology. Specifically, the server analyzes the audio data using a speech recognition API and generates text data. The input is digital audio data, and the output is the corresponding text information.

[0118] Step 3:

[0119] The server analyzes the generated text information using natural language processing techniques and extracts important data. Here, the natural language processing engine identifies key points and decisions from the text information. The input is text information, and the output is the extracted important data.

[0120] Step 4:

[0121] The server automatically generates meeting documents using a record-keeping system based on the extracted key data. Here, the server applies formatting rules and outputs well-formed documents. The input is key data, and the output is a formatted meeting document.

[0122] Step 5:

[0123] The server automatically detects and manages work items based on the generated meeting documents. The server interprets the context of the meeting documents and registers the detected work items in the task management system. The input is the meeting documents, and the output is a list of detected work items.

[0124] Step 6:

[0125] The user checks the progress of tasks by referring to a list of work items registered by the server in the task management system. The user sets actions for each work item as needed. The input is a list of work items, and the output is the updated status after user confirmation.

[0126] Step 7:

[0127] The server manages the progress of updated work items and notifies the user in real time. The input is the updated data, and the output is the information notified to the user.

[0128] This series of processes enables efficient recording and management of information from meetings and discussions at physical stores.
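As an illustration only, Steps 2 to 5 above can be sketched as a chain of functions in which each stage consumes the output of the previous one. All function names, the cue words, and the "Meeting record" layout are assumptions made for this sketch and are not taken from the embodiment; in particular, recognize_speech is a stand-in that merely decodes bytes and does not perform real speech recognition.

```python
def recognize_speech(audio: bytes) -> str:
    # Stand-in for Step 2: a real system would call a speech recognition API.
    # Here the "audio" is assumed to be UTF-8 text so the sketch stays runnable.
    return audio.decode("utf-8")


def extract_key_data(text: str) -> list[str]:
    # Step 3 (simplified): keep sentences containing a hypothetical cue word.
    cues = ("decided", "agreed", "must", "need to")
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(c in s.lower() for c in cues)]


def format_document(key_data: list[str]) -> str:
    # Step 4 (simplified): apply one fixed formatting rule, a title plus bullets.
    return "\n".join(["Meeting record"] + [f"- {item}" for item in key_data])


def detect_work_items(document: str) -> list[str]:
    # Step 5 (simplified): treat bullet lines containing "need to" as work items.
    return [
        line[2:]
        for line in document.splitlines()
        if line.startswith("- ") and "need to" in line.lower()
    ]


def run_pipeline(audio: bytes) -> tuple[str, list[str]]:
    # Chain the stages: audio -> text -> key data -> document -> work items.
    text = recognize_speech(audio)
    document = format_document(extract_key_data(text))
    return document, detect_work_items(document)
```

Calling run_pipeline on a short transcript returns both the formatted document and the list of detected work items, which mirrors the input and output stated for each step.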

[0129] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0130] This invention relates to a system that processes audio and emotions in a meeting in real time and uses this information to create meeting records and manage tasks. The system consists of a server, terminals, and users, and also supports emotion recognition.

[0131] First, the device captures the audio and video of the meeting. This is done using the built-in microphone and camera, recording participants' statements and facial expressions in real time. This data is sent to a server, where the audio data is converted into text data, and the video data is analyzed by an emotion engine.

[0132] The server converts the received audio data into text data using a speech recognition engine. This text data is then subjected to natural language processing to extract important information. Based on this information, meeting minutes are automatically generated.

[0133] Here, the emotion engine installed on the server analyzes the user's emotional state from the video data. This emotional information is recorded along with the meeting content and used to analyze changes in emotion and tone during important discussions.

[0134] As a concrete example, when someone says "This proposal involves risks" during a meeting, the device sends the audio to the server, where a speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and an emotion engine detects emotions such as "anxiety" or "concern." The server combines this information to create meeting minutes that include the participants' emotions in relation to the context of the statement.
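One possible way to combine a transcribed statement with the emotion detected at the same moment is to align the two by timestamp. The following is a minimal sketch under that assumption; the Utterance and EmotionSample types, the single emotion stream, and the output layout are hypothetical and are not described in the embodiment.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float  # seconds from the start of the meeting
    speaker: str
    text: str


@dataclass
class EmotionSample:
    time: float  # seconds from the start of the meeting
    label: str   # e.g. "anxiety", as output by an emotion engine


def emotion_at(samples: list[EmotionSample], t: float) -> str:
    # Return the label of the latest sample at or before time t.
    # `samples` must already be sorted by time.
    times = [s.time for s in samples]
    i = bisect_right(times, t)
    return samples[i - 1].label if i else "unknown"


def annotate(utterances: list[Utterance], samples: list[EmotionSample]) -> list[str]:
    # Produce one minutes line per utterance, tagged with the aligned emotion.
    ordered = sorted(samples, key=lambda s: s.time)
    return [
        f"[{u.start:.1f}s] {u.speaker}: {u.text} (emotion: {emotion_at(ordered, u.start)})"
        for u in utterances
    ]
```

A real system would likely track emotions per participant; a single stream is used here only to keep the alignment logic visible.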

[0135] Furthermore, the server automatically extracts tasks from the meeting minutes, sets deadlines, and assigns responsible persons using task management tools. Emotion information may also be used to evaluate the priority and urgency of tasks. Based on this information, users plan post-meeting follow-ups and manage projects.

[0136] This system enables comprehensive meeting recording and task management, including emotional shifts, supporting effective decision-making that takes into account the psychological nuances of participants.

[0137] The following describes the processing flow.

[0138] Step 1:

[0139] The device begins capturing audio and video at the start of the meeting. It uses the built-in microphone to record participants' speech and the camera to continuously record their facial expressions. This data is transmitted to the server in real time.

[0140] Step 2:

[0141] The server receives the audio data and passes it to the speech recognition engine, which converts it into text data. This conversion is performed sequentially, and the process of generating text data in real time continues throughout the meeting.

[0142] Step 3:

[0143] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts important information and keywords from the text, identifying key parts of the meeting.

[0144] Step 4:

[0145] The server inputs video data collected during the meeting into the emotion engine. The emotion engine analyzes the video, identifies the emotional state of the participants from their facial expressions and tone of voice, and records the results in a database.

[0146] Step 5:

[0147] The server automatically generates meeting minutes by combining extracted key information and sentiment data. The minutes include not only the content of the statements but also the sentiment information of the participants at that time.

[0148] Step 6:

[0149] The server detects tasks from the meeting minutes and registers them in the task management system. Emotion information may be considered as a factor in determining the importance and priority of tasks.

[0150] Step 7:

[0151] Users receive meeting minutes and task lists sent from the server via their terminals. Based on the information provided, users can manage post-meeting follow-ups and track the progress of each task.

[0152] Step 8:

[0153] Users update task progress within the system and add comments and feedback as needed. This ensures that progress is always kept up-to-date and that stakeholders are automatically notified.

[0154] (Example 2)

[0155] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0156] Traditional meeting recording systems simply convert audio data into text data to create minutes, failing to consider the emotions of participants or the atmosphere of the discussion during the meeting. Furthermore, it was difficult to properly prioritize tasks generated during the meeting and to manage them efficiently.

[0157] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0158] In this invention, the server includes a speech conversion means for receiving audio information and converting the audio information into text information, a natural language processing means for extracting important elements from the generated text information, and an emotion recognition means for analyzing emotional states from video information. This makes it possible to create detailed meeting minutes that take into account the emotions of participants during a meeting, and to prioritize tasks based on those emotional states.

[0159] "Speech conversion means" refers to a device or function that receives speech information and converts that speech information into text information.

[0160] "Natural language processing means" refers to techniques or algorithms that extract important elements from generated textual information and analyze their meaning and context.

[0161] A "meeting record creation method" refers to a system or method that automatically creates meeting minutes based on extracted key elements.

[0162] "Emotion recognition means" refers to a technology or system that identifies an emotional state by analyzing a participant's facial expressions and movements based on video information.

[0163] "Work management means" refers to a method or system that automatically detects work items based on generated meeting records and manages and organizes them.

[0164] This invention is a system designed to streamline information processing in meetings. Specific embodiments are described below.

[0165] First, the device captures the audio and video of the meeting in real time. This is done using the built-in microphone and camera, acquiring audio data in WAV format and video data in MP4 format. This data is then sent to the server in streaming format.
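For reference, audio samples can be packaged as WAV data with the Python standard library as follows. This is a sketch only: the 16 kHz sample rate, mono channel, 16-bit sample width, and the function name pcm16_wav are assumptions, since the embodiment specifies only that the audio is in WAV format.

```python
import io
import struct
import wave


def pcm16_wav(samples: list[float], sample_rate: int = 16000) -> bytes:
    # Clamp each sample to [-1.0, 1.0] and quantize to signed 16-bit little-endian PCM.
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono (assumed)
        w.setsampwidth(2)           # 2 bytes = 16 bits per sample (assumed)
        w.setframerate(sample_rate)
        w.writeframes(frames)
    # The wave module does not close a file object it did not open itself,
    # so the buffer can still be read here.
    return buf.getvalue()
```

The returned bytes form a complete WAV file (RIFF header plus PCM frames) that could then be sent to the server.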

[0166] The server uses a speech converter to process the received audio data. Specifically, it utilizes a speech recognition engine to convert audio into text data. This engine is based on APIs commonly used in speech recognition software. Next, the generated text information is analyzed using natural language processing techniques to extract important keywords and utterances. This process uses a natural language processing engine and organizes the information using a language model.

[0167] Meanwhile, the server receives the video data and uses an emotion recognition system to analyze the emotional state of the participants. This system utilizes emotion recognition technologies such as facial recognition APIs. The analyzed emotion data is then used for creating meeting minutes and task management.

[0168] The meeting minutes are generated automatically from the processed data and reflect the participants' emotional states, resulting in a summary that makes it easy to understand the atmosphere of the meeting and the tone of the participants. Furthermore, the server automatically extracts tasks from the meeting minutes and sets schedules and priorities using a task management system.

[0169] Users can view meeting minutes and task lists generated on their devices. They can also plan post-meeting follow-ups and manage projects as needed.

[0170] As a concrete example, when someone says "This proposal involves risks" during a meeting, the device sends the audio to the server, and the speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and the emotion recognition engine can analyze emotions such as "anxiety" or "concern." This information is then used to generate meeting minutes that take into account the emotions of the participants.

[0171] As an example of a prompt, the user can enter the instruction, "Based on the meeting audio and sentiment data, propose an action plan for the next meeting using automatically generated meeting minutes." This triggers an automated proposal generated by the generative AI model.
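A prompt of this kind might be assembled by concatenating the user's instruction with the generated minutes and the emotion data. The sketch below assumes a plain-text layout with bracketed section labels; the function name build_prompt and the layout are hypothetical, and the call to the generative AI model itself is outside the scope of the sketch.

```python
def build_prompt(instruction: str, minutes: str, emotions: dict[str, str]) -> str:
    # List one "speaker: emotion" line per participant, in a stable (sorted) order.
    emotion_lines = "\n".join(
        f"- {speaker}: {label}" for speaker, label in sorted(emotions.items())
    )
    # The instruction comes first, followed by the labelled context sections.
    return (
        f"{instruction}\n\n"
        f"[Meeting minutes]\n{minutes}\n\n"
        f"[Emotion data]\n{emotion_lines}"
    )
```

Keeping the instruction separate from the context sections makes it straightforward to reuse the same minutes with different instructions.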

[0172] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0173] Step 1:

[0174] The terminal activates its built-in microphone and camera at the start of the meeting. Inputs include audio and video from the meeting room. This data is captured as WAV audio files and MP4 video files. The terminal collects this data in real time and sends it to the server via streaming.

[0175] Step 2:

[0176] The server uses a speech conversion system with audio data received from the terminal as input. Specifically, it activates a speech recognition engine and converts the audio data into text data. This process outputs the spoken content as text, which is then used for subsequent analysis.

[0177] Step 3:

[0178] The server uses a natural language processing system with text data as input. It leverages a natural language processing engine to extract important keywords and content from the text data. This extracted information becomes the output, forming the basis for creating meeting minutes. Furthermore, a language model is applied to understand the context of the text.

[0179] Step 4:

[0180] The server feeds the video data received from the terminal into an emotion recognition system. The emotion recognition engine analyzes the emotional state of participants based on their facial expressions. The results of this analysis are output as emotion data, which is used for meeting minutes and task management.

[0181] Step 5:

[0182] The server launches the meeting minutes creation system with the extracted text information and emotion data as input. The meeting minutes creation engine automatically generates the minutes. The output is a comprehensive set of minutes that reflects both the content of the meeting and the emotional tone of the participants.

[0183] Step 6:

[0184] The server uses the generated meeting minutes as input and utilizes a task management system. The task management engine automatically extracts tasks from the meeting minutes and sets their priorities and schedules. The output is compiled into a task list and provided to the user.

[0185] Step 7:

[0186] Users review the generated meeting minutes and task lists using their devices. Based on the outputted information, they plan meeting follow-ups and manage projects. Specifically, they adjust task deadlines and assign responsibilities, and monitor progress. This enables efficient work execution.

[0187] (Application Example 2)

[0188] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0189] In meetings, it is necessary to effectively record and manage participants' statements and the resulting emotional changes in real time to improve the quality of task management and decision-making. However, conventional meeting recording systems cannot take emotional changes into account, making it difficult to implement effective project management that takes into account the psychological nuances of participants.

[0190] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0191] In this invention, the server includes speech recognition means for receiving audio data and converting it into text data, natural language processing means for extracting important information, and emotion recognition means for analyzing the emotional state of participants from video data. This makes it possible to record and manage tasks while taking into account changes in participants' emotions during the progress of the meeting.

[0192] "Speech recognition means" refers to technology that receives speech data and converts that speech into text data.

[0193] "Natural language processing means" refers to technologies that extract important information from generated text data and perform information analysis and understanding.

[0194] "Meeting minutes creation method" refers to technology that automatically generates documents recording the content of a meeting based on important information.

[0195] "Task management method" refers to a technology that organizes and manages tasks automatically detected based on meeting minutes.

[0196] "Emotion recognition means" refers to technology that analyzes the emotional state of participants based on video data and understands their situation.

[0197] "Methods for adjusting meeting records based on emotion analysis results" refers to techniques that utilize emotion analysis results to adjust the recorded meeting content and its interpretation.

[0198] The system implementing this invention mainly consists of a server, a terminal, and a user.

[0199] The server converts speech data into text data using a speech recognition engine. Specifically, services such as Google® Speech-to-Text or Amazon Transcribe can be used. Important information is extracted from this text data using natural language processing tools, and the extracted information becomes the basic data for creating meeting minutes. Libraries such as spaCy and NLTK are used for natural language processing.
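To give a rough idea of what extracting important information can mean in its simplest form, the following sketch ranks words by frequency after removing common stopwords. It deliberately uses only the Python standard library as a stand-in for the libraries named above; the stopword list and the function name top_keywords are assumptions, and a production system would rely on a proper NLP library rather than this heuristic.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "on", "a", "an", "and", "of", "to", "in", "for", "we"}


def top_keywords(text: str, k: int = 3) -> list[str]:
    # Lowercase the text and split it into alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Count tokens that are neither stopwords nor very short.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Return the k most frequent tokens.
    return [w for w, _ in counts.most_common(k)]
```

Frequency alone ignores context, which is why the embodiment additionally mentions analysis of meaning and context.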

[0200] Furthermore, the server receives video data collected by the built-in camera and performs emotion recognition using Microsoft® Azure® facial recognition APIs, etc. This allows for real-time analysis of participants' emotional states, and the meeting record is adjusted based on the results.

[0201] The terminal uses its built-in microphone and camera to capture audio and video of meetings and transmits this data to the server. Through this system, users can receive meeting transcripts and sentiment analysis results generated in real time, which can be used for task management and project decision-making.

[0202] As a concrete example of this system, during a residents' meeting for a local shopping district revitalization project, a comment was made: "I'm worried about holding a new event." The system analyzes the speaker's facial expression from the captured video footage along with this comment, interpreting it as "anxiety," and adjusts the priority of related event preparation tasks accordingly.
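The priority adjustment in this example could, for instance, be realized as a lookup from the detected emotion to a priority boost. The sketch below is one such assumption: the boost table, the 1-to-5 priority scale, and the names Task and adjust_priority are all hypothetical and are not specified in the embodiment.

```python
from dataclasses import dataclass

# Hypothetical mapping from detected emotion to priority boost.
EMOTION_BOOST = {"anxiety": 2, "concern": 1, "anger": 2}


@dataclass
class Task:
    title: str
    priority: int = 1  # larger value = more urgent (assumed scale of 1 to 5)


def adjust_priority(task: Task, emotion: str, max_priority: int = 5) -> Task:
    # Emotions not listed in the table leave the priority unchanged.
    boost = EMOTION_BOOST.get(emotion, 0)
    # Return a new Task so the original record is preserved; cap at max_priority.
    return Task(task.title, min(max_priority, task.priority + boost))
```

With this table, a statement interpreted as "anxiety" raises the priority of the related event preparation task by two levels, up to the cap.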

[0203] An example of a prompt given to the generative AI model is "a method for analyzing the sentiment behind statements made at a residents' meeting and using that information to help with decision-making at the meeting."

[0204] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0205] Step 1:

[0206] The device captures the audio and video data of the meeting. Using the built-in microphone and camera, it records participants' speech and facial expressions in real time. To optimize resource usage, the data is compressed into an appropriate format before being transferred to the server. The input is the meeting's audio and video, and the output is the compressed audio and video data transmitted to the server.

[0207] Step 2:

[0208] The server converts the received audio data into text data using a speech recognition engine. For example, using Google Speech-to-Text, this data is converted into text format. The input is audio data, and the output is the corresponding text data.

[0209] Step 3:

[0210] The server processes the text data using natural language processing to extract important information. It uses natural language processing libraries such as spaCy and NLTK to extract key points from the spoken content for creating meeting minutes. The input is text data, and the output is the important information necessary for meeting minutes.

[0211] Step 4:

[0212] The server receives video data from the device's camera and analyzes it using an emotion recognition API. Microsoft Azure's facial recognition API is used to identify emotional states from participants' facial expressions. The input is video data, and the output is information about the detected emotions.

[0213] Step 5:

[0214] The server incorporates the sentiment analysis results into the meeting minutes, adjusting the content to reflect the tone and emotional impact of the statements. Based on the sentiment recognition results, the system processes the minutes to emphasize key points and adjust their tone. The input consists of key information and sentiment information from the meeting minutes, and the output is the sentiment-adjusted meeting minutes.

[0215] Step 6:

[0216] The server automatically detects tasks based on meeting minutes, sets their priorities, and adds them to the task management system. The system sends information to a project management interface to help visualize tasks and manage their progress. The input is the edited meeting minutes, and the output is a set of prioritized tasks.

[0217] Step 7:

[0218] Users receive meeting minutes and sentiment data generated in real time from the server, and then use this data for subsequent meeting follow-up and task management. Participants can view and edit information through the user interface. Inputs are meeting minutes and sentiment information from the server, while outputs are the displayed content and task progress information provided to the user.

[0219] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0220] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0221] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0222] [Second Embodiment]

[0223] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0224] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0225] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0226] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0227] The microphone 238 accepts instructions and the like from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0228] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0229] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0230] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0231] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0232] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0233] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0234] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0235] The system according to the present invention is configured to automate the creation of meeting minutes from meeting recordings and task management. This system mainly consists of a server, terminals, and users.

[0236] First, the terminal performs voice input at the start of the meeting. This voice input is collected using the terminal's built-in microphone and sent to the server in real time. The voice data sent to the server is immediately converted into text data through speech recognition. This converted text data is stored in temporary storage on the server.

[0237] Next, the server analyzes the stored text data using natural language processing. This extracts important information and keywords from the meeting content. In particular, it is designed to automatically identify decisions and key topics. This extracted information is then organized by a meeting minutes creation system and automatically generated as formatted meeting minutes.

[0238] Furthermore, the server automatically detects tasks from the generated meeting minutes. The task management system analyzes deadlines and responsible persons associated with these tasks and records their progress. The user is notified of the task list and can then take further action based on it.
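As one possible realization of recording tasks together with their deadlines and responsible persons, the sketch below stores them in an in-memory SQLite table. The table name, column names, default status value, and function names are assumptions for illustration; the embodiment does not specify a schema or a database product.

```python
import sqlite3
from typing import Optional


def open_task_db() -> sqlite3.Connection:
    # An in-memory database keeps the sketch self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE tasks (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               assignee TEXT,
               deadline TEXT,
               status TEXT NOT NULL DEFAULT 'open'
           )"""
    )
    return conn


def register_task(
    conn: sqlite3.Connection,
    title: str,
    assignee: Optional[str] = None,
    deadline: Optional[str] = None,
) -> int:
    # Parameterized query: values are bound, not interpolated into the SQL text.
    cur = conn.execute(
        "INSERT INTO tasks (title, assignee, deadline) VALUES (?, ?, ?)",
        (title, assignee, deadline),
    )
    conn.commit()
    return cur.lastrowid
```

Progress can then be recorded by updating the status column of the corresponding row.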

[0239] As a concrete example, if a participant says "We need to prepare the materials before the next meeting" during a meeting, the terminal sends this statement as audio data to the server. The server uses speech recognition to convert "We need to prepare the materials before the next meeting" into text data, and then uses natural language processing to extract the task "Prepare materials" from this statement. The task management system lists this task and notifies the user along with related information. In this way, tasks are managed efficiently even after the meeting, enabling smooth information sharing and follow-up.
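The extraction in this example can be approximated, for illustration, by a simple rule-based pattern. The trigger phrases, the regular expression, and the function name extract_task below are assumptions; actual natural language processing would be considerably more robust. Note that this sketch keeps the article, yielding "Prepare the materials" rather than the shortened "Prepare materials".

```python
import re
from typing import Optional

# Hypothetical trigger phrases, followed by an action and an optional deadline.
TASK_PATTERN = re.compile(
    r"(?:we need to|we must|please)\s+(?P<action>.+?)"
    r"(?:\s+(?:before|by)\s+(?P<deadline>.+?))?[.!]?$",
    re.IGNORECASE,
)


def extract_task(utterance: str) -> Optional[dict]:
    # Return the detected task and deadline, or None if no trigger phrase is found.
    m = TASK_PATTERN.search(utterance.strip())
    if m is None:
        return None
    action = m.group("action")
    # Capitalize the first letter so the task reads as a title.
    return {"task": action[0].upper() + action[1:], "deadline": m.group("deadline")}
```

For the statement in the example, the pattern separates the action from the deadline phrase "the next meeting", which the task management system could then resolve to a concrete date.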

[0240] In this way, the present invention aims to streamline administrative tasks in meetings and provide an environment where participants can focus on more important discussions.

[0241] The following describes the processing flow.

[0242] Step 1:

[0243] The device starts recording audio at the beginning of the meeting. The device uses its built-in microphone to capture participants' speech in real time and stores it as audio data.

[0244] Step 2:

[0245] After the terminal buffers a certain amount of audio data, it sends this audio data to the server in packet format. Transmission is performed periodically in real time to minimize data delay.

[0246] Step 3:

[0247] The server passes the received audio data to the speech recognition engine. The speech recognition engine analyzes the audio data and converts it into text data. This process is continuous, generating text data in real time during the meeting.

[0248] Step 4:

[0249] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts context and keywords from the text data, identifying important information from the meeting. This information is then organized into key points to focus on.

[0250] Step 5:

[0251] The server automatically generates meeting minutes based on important information. The meeting minutes creation module organizes the information in an appropriate format and saves it as a meeting record.

[0252] Step 6:

[0253] The server automatically extracts tasks using the generated meeting minutes and registers them in the task management database. Tasks can be automatically assigned deadlines and responsible parties, and this information is stored in a structured format.

[0254] Step 7:

[0255] Users receive meeting minutes and task lists from the server via their terminals. They use this information to manage post-meeting tasks and track task progress.

[0256] Step 8:

[0257] Users can access the task management system through an interface to update task progress. Users can add progress information and notify other stakeholders.
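The buffering and packet transmission described in Step 2 of the above flow can be sketched as a generator that accumulates incoming audio and emits fixed-size packets. The function name packetize and the choice of a fixed packet size in bytes are assumptions; the embodiment states only that a certain amount of audio data is buffered and then sent in packet format.

```python
from typing import Iterable, Iterator


def packetize(chunks: Iterable[bytes], packet_size: int) -> Iterator[bytes]:
    # Accumulate incoming audio chunks in a buffer.
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit as many full packets as the buffer currently holds.
        while len(buffer) >= packet_size:
            yield bytes(buffer[:packet_size])
            del buffer[:packet_size]
    # Flush whatever remains when the input ends (e.g. at the end of the meeting).
    if buffer:
        yield bytes(buffer)
```

A smaller packet size reduces delay at the cost of more transmissions, which is the trade-off behind sending periodically in real time.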

[0258] (Example 1)

[0259] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0260] Traditional meeting record-keeping and task management processes lack automation, resulting in cumbersome and time-consuming post-meeting processing. In particular, extracting key information from discussions, quickly creating meeting minutes, and efficiently managing related tasks is challenging.

[0261] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0262] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important information from the generated text information, and record creation means for automatically generating meeting minutes based on the extracted important information. This makes it possible to streamline post-meeting task management and provide an environment in which participants can concentrate on important discussions.

[0263] "Audio information" refers to audio data, including spoken words during a meeting.

[0264] "Textual information" refers to text data converted from audio information.

[0265] "Speech recognition means" refers to technical means that convert speech information into text information, and includes the process of analyzing acoustic signals and generating corresponding text.

[0266] "Natural language processing means" refers to technologies for extracting important information from textual data, and includes the process of understanding and structuring the meaning and context of the text.

[0267] "Record creation means" refers to a method for automatically generating meeting minutes based on extracted important information, organizing the information and outputting it in a predetermined format.

[0268] A "work item" refers to a specific task or action decided upon in a meeting, accompanied by relevant information such as the person who will perform it and the deadline.

[0269] "Work management means" refers to methods for handling detected work items and managing their processing and progress.

[0270] "Analysis of relevant deadlines and responsible persons" involves analyzing the deadlines set for work items and the responsible persons involved through information processing, and utilizing this information in the management process.
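The definitions of a work item and of deadline analysis above could be represented, for example, by a small data structure. The field names and the is_overdue method are assumptions for illustration and do not appear in the embodiment.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkItem:
    description: str                # the task or action decided in the meeting
    assignee: Optional[str] = None  # the person who will perform it, if known
    deadline: Optional[date] = None # the deadline, if one was stated
    done: bool = False

    def is_overdue(self, today: date) -> bool:
        # An item is overdue only if it is unfinished and its deadline has passed.
        return (not self.done) and self.deadline is not None and today > self.deadline
```

Leaving the assignee and deadline optional reflects that a statement in a meeting does not always name either one.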

[0271] This invention is a system that efficiently creates meeting minutes by recording audio information from meetings and converting it into text information, and also automatically detects and manages work items. The system mainly consists of a server, terminals, and users.

[0272] The terminal inputs audio information during a meeting using its built-in microphone and transmits it to the server via its communication function. The basic hardware used in the terminal includes a standard microphone and a computer with network connectivity.

[0273] The server receives audio information transmitted from the terminal and converts it into text using speech recognition software. This process utilizes, for example, a "speech analysis API" as speech recognition software. After the audio information is converted to text, the server analyzes the text using natural language processing techniques. Here, software such as a "natural language processing library" is used to extract important information.

[0274] Next, the server uses the record creation means to format the meeting minutes based on the extracted information. The minutes are stored in a database, where users can access them as needed.

[0275] Furthermore, the server detects work items from meeting records and manages these items using work management tools. Specifically, it analyzes deadlines and responsible parties and tracks progress using task management tool APIs. For example, the "Project Management Platform API" can be used.

[0276] Users can receive real-time notifications for generated meeting minutes and the progress of work items. Based on this, users can plan their next actions and perform their work efficiently.

[0277] For example, if a participant says "Please prepare for the next presentation" during a meeting, the terminal sends this audio to the server. The server converts it into text information, "Please prepare for the next presentation," and uses natural language processing to detect the task item "Prepare for the presentation."
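The detection of a task item from an utterance, as in the example above, may be sketched with simple pattern rules. This is an assumption made for illustration: the disclosure does not specify the extraction rules, the patterns and the function name `extract_task` are hypothetical, and an actual system would more likely rely on a natural language processing model. Unlike the example above, this sketch keeps the full request phrase rather than shortening it.

```python
import re
from typing import Optional

# Hypothetical request patterns; a real system would rely on an NLP model.
_REQUEST_PATTERNS = [
    re.compile(r"^\s*please\s+(?P<task>.+?)[.!]?\s*$", re.IGNORECASE),
    re.compile(r"^\s*we\s+need\s+to\s+(?P<task>.+?)[.!]?\s*$", re.IGNORECASE),
]


def extract_task(utterance: str) -> Optional[str]:
    """Return a task phrase if the utterance looks like a request, else None."""
    for pattern in _REQUEST_PATTERNS:
        match = pattern.match(utterance)
        if match:
            task = match.group("task")
            # Capitalize the first letter so the phrase reads as a task title.
            return task[0].upper() + task[1:]
    return None


print(extract_task("Please prepare for the next presentation"))
# prints: Prepare for the next presentation
```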

[0278] An example of a prompt message is, "Convert the meeting audio log to text and extract specific work items." In this way, the system can automate meeting information management and dramatically improve work efficiency.

[0279] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0280] Step 1:

[0281] The terminal uses its built-in microphone to input voice information as soon as the meeting starts. The input voice is converted into digital data and temporarily stored in the terminal's memory. The converted digital voice data is then sent sequentially to the server as data chunks.

[0282] Step 2:

[0283] The server receives the digital audio data transmitted from the terminal and converts it into text information using the speech analysis API. In this conversion, the acoustic signal undergoes phoneme analysis, and a continuous speech recognition model produces the text. The resulting raw text data is stored in the server's temporary storage.

[0284] Step 3:

[0285] The server analyzes the stored raw text data using a natural language processing library. In this analysis process, named entities and important matters are extracted from the text, and key points and decision matters are identified by pre-defined rules and machine learning models. As an analysis result, the extracted information is output in a structured format.

[0286] Step 4:

[0287] The server generates meeting minutes using the record creation means based on the structured information. In this process, the information is organized according to a template and output as formatted meeting minutes. The generated meeting minutes are stored in a database and can be accessed by the user later.

[0288] Step 5:

[0289] The server applies a work management means to detect new work items from the meeting minutes. Information about the detected work items is transmitted to a task management system using an API, and a work item list is generated. At this time, information such as deadlines and responsible persons is automatically added to the items.

[0290] Step 6:

[0291] After a meeting ends, users receive a generated task list via real-time notification. Upon receiving the notification, users can review the details through their task management tool and take action to manage the progress of their tasks. This ensures that individual work items are efficiently tracked and completed.
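The sequential transmission of audio data as chunks in Step 1, and its reception in Step 2, may be sketched as follows. This is a minimal illustration under stated assumptions: the sequence numbering, the chunk size, and the function names `iter_chunks` and `reassemble` are hypothetical, and no network transport is shown.

```python
from typing import Iterator


def iter_chunks(audio: bytes, chunk_size: int) -> Iterator[tuple[int, bytes]]:
    """Terminal side: yield (sequence_number, chunk) pairs covering the buffer in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for seq, start in enumerate(range(0, len(audio), chunk_size)):
        yield seq, audio[start:start + chunk_size]


def reassemble(chunks: list[tuple[int, bytes]]) -> bytes:
    """Server side: restore the byte stream, tolerating out-of-order arrival."""
    return b"".join(chunk for _, chunk in sorted(chunks, key=lambda pair: pair[0]))
```

Attaching a sequence number to each chunk lets the server rebuild the stream even when chunks arrive out of order, which is one possible way to support the sequential transmission described in Step 1.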

[0292] (Application Example 1)

[0293] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0294] In modern brick-and-mortar stores, efficiently recording the content of meetings and discussions and managing tasks based on that information is essential. However, doing this manually is typically time-consuming, labor-intensive, and prone to errors. Furthermore, the lack of mechanisms for immediate on-site information sharing and for visualizing task progress makes it difficult to optimize operations. A system is needed that solves these problems and lets stores run their operations efficiently.

[0295] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0296] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important data from the generated text information, and record creation means for automatically generating meeting documents based on the extracted important data. This enables the content of meetings to be recorded in real time, facilitating efficient task management and information sharing.

[0297] "Audio information" refers to information used to record and process audio in a digital format.

[0298] "Textual information" refers to text data expressed in digital format, specifically text converted from audio information.

[0299] "Speech recognition means" refers to the technology that analyzes audio information and converts it into text information.

[0300] "Natural language processing means" is a language processing technology that analyzes text information and extracts important data from it.

[0301] "Record creation means" is a means for generating a meeting document based on the extracted data, and formalizes the document.

[0302] "Work item" refers to specific actions or tasks determined and presented in meetings or consultations.

[0303] "Work management" refers to managing work items, grasping their progress, and making appropriate adjustments.

[0304] "Business support" refers to the support activities of the back office aimed at improving the business efficiency of stores.

[0305] "Business document" refers to a document that organizes and records information related to business.

[0306] "Business document derived from voice" is a recorded document generated based on voice information and reflects the content of a meeting.

[0307] "Means for optimizing business progress" is a method for efficiently operating business by organizing information and clarifying tasks.

[0308] The system that realizes this invention mainly consists of a server, a terminal, and a user. First, the terminal collects voice information at the start of a meeting or consultation. The terminal acquires voice information using a built-in microphone, and this voice information is transmitted to the server in real time.

[0309] Next, the server processes the received audio information. The server uses speech recognition technology to convert the audio information into text. For this process, for example, a speech recognition service provided by AWS can be used. The converted text information is temporarily stored in a database on the server.

[0310] Subsequently, the server analyzes the textual information using natural language processing technology and extracts important data. Natural language processing tools such as Amazon Comprehend can be utilized for this process. The extracted data is organized by a recording system and automatically formatted as a meeting document.

[0311] Next, the server automatically detects and manages work items based on the meeting documents. Work items are organized with their associated deadlines and assignee information, and their progress is tracked. The user is notified of managed tasks in real time and can take appropriate action. Tasks can be managed using tools such as the Trello API.
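As one possible illustration of registering a detected work item with an external task management tool, the following sketch builds the request parameters for creating a card. The parameter names (`idList`, `name`, `due`, `idMembers`) and the endpoint URL follow Trello's public REST API for card creation as we understand it and should be checked against the current API reference. The function name `build_card_request` is hypothetical, and no request is actually sent. Credentials (the API key and token) are deliberately omitted and would have to be added by the caller.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed endpoint for creating a card; verify against the Trello API reference.
TRELLO_CARDS_URL = "https://api.trello.com/1/cards"


def build_card_request(list_id: str, task: str, due: Optional[datetime] = None,
                       member_ids: Optional[list[str]] = None) -> dict[str, str]:
    """Build query parameters for a card-creation request (nothing is sent here)."""
    params = {"idList": list_id, "name": task}
    if due is not None:
        # Deadline as an ISO 8601 timestamp in UTC; pass a timezone-aware datetime.
        params["due"] = due.astimezone(timezone.utc).isoformat()
    if member_ids:
        # Responsible persons as a comma-separated list of member IDs.
        params["idMembers"] = ",".join(member_ids)
    return params
```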

[0312] For example, if someone says during a meeting at a store, "We need to prepare the product display for the next sale," the terminal sends this statement as audio information to the server. The server converts it into text using speech recognition and extracts the task item "Prepare product display" using natural language processing. The task item is then assigned an appropriate person and deadline, and the user is notified.

[0313] An example of a prompt message would be, "Please upload the audio file, summarize the key points of the meeting, and create a list of tasks to complete before the next meeting."

[0314] In this way, this invention achieves improved operational efficiency and optimized information sharing in physical stores.

[0315] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0316] Step 1:

[0317] The terminal collects audio information as soon as a meeting or discussion begins. The terminal's built-in microphone captures ambient sound, and the recorded audio data is transmitted to the server in real time. The input is raw audio information, while the output is digital audio data transmitted to the server.

[0318] Step 2:

[0319] The server converts received audio data into text information using speech recognition technology. Specifically, the server analyzes the audio data using a speech recognition API and generates text data. The input is digital audio data, and the output is the corresponding text information.

[0320] Step 3:

[0321] The server analyzes the generated text information using natural language processing techniques and extracts important data. Here, the natural language processing engine identifies key points and decisions from the text information. The input is text information, and the output is the extracted important data.

[0322] Step 4:

[0323] The server automatically generates meeting documents using a record-keeping system based on the extracted key data. Here, the server applies formatting rules and outputs well-formed documents. The input is key data, and the output is a formatted meeting document.

[0324] Step 5:

[0325] The server automatically detects and manages work items based on the generated meeting documents. The server interprets the context of the meeting documents and registers the detected work items in the task management system. The input is the meeting documents, and the output is a list of detected work items.

[0326] Step 6:

[0327] The user checks the progress of tasks by referring to a list of work items registered by the server in the task management system. The user sets actions for each work item as needed. The input is a list of work items, and the output is the updated status after user confirmation.

[0328] Step 7:

[0329] The server manages the progress of updated work items and notifies the user in real time. The input is the updated data, and the output is the information notified to the user.

[0330] This series of processes enables efficient recording and management of information from meetings and discussions at physical stores.
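The application of formatting rules in Step 4 may be sketched as a function that renders extracted key data into a plain-text meeting document. The layout, the section headings, and the function name `format_minutes` are assumptions made for illustration; the disclosure does not define a specific document format.

```python
from datetime import date


def format_minutes(title: str, held_on: date, key_points: list[str],
                   decisions: list[str]) -> str:
    """Render extracted key data as a plain-text meeting document."""
    def section(heading: str, entries: list[str]) -> list[str]:
        # An empty section is shown explicitly rather than omitted.
        body = [f"  - {entry}" for entry in entries] or ["  (none)"]
        return [heading, *body]

    lines = [f"Meeting: {title}", f"Date: {held_on.isoformat()}", ""]
    lines += section("Key points:", key_points)
    lines.append("")
    lines += section("Decisions:", decisions)
    return "\n".join(lines)
```

Keeping the formatting rules in one function separates the record creation means from the extraction step, so the document layout can change without affecting the analysis.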

[0331] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0332] This invention relates to a system that processes audio and emotions in a meeting in real time and uses this information to create meeting records and manage tasks. The system consists of a server, terminals, and users, and also supports emotion recognition.

[0333] First, the device captures the audio and video of the meeting. This is done using the built-in microphone and camera, recording participants' statements and facial expressions in real time. This data is sent to a server, where the audio data is converted into text data, and the video data is analyzed by an emotion engine.

[0334] The server converts the received audio data into text data using a speech recognition engine. This text data is then subjected to natural language processing to extract important information. Based on this information, meeting minutes are automatically generated.

[0335] Here, the emotion engine installed on the server analyzes the user's emotional state from the video data. This emotional information is recorded along with the meeting content and used to analyze changes in emotion and tone during important discussions.

[0336] As a concrete example, when someone says "This proposal involves risks" during a meeting, the device sends the audio to the server, where a speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and an emotion engine detects emotions such as "anxiety" or "concern." The server combines this information to create meeting minutes that include the participants' emotions in relation to the context of the statement.
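Combining a transcribed statement with the emotion detected at that moment, as in the example above, may be sketched as a timestamp-based join. This is an illustration under assumptions: the disclosure does not specify how statements and emotions are aligned, and the class names, the label strings, and the rule "use the most recent emotion sample at or before the start of the statement" are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float  # seconds from the start of the meeting
    text: str


@dataclass
class EmotionSample:
    time: float   # seconds from the start of the meeting
    label: str    # e.g. "anxiety"; the label set is an assumption


def annotate(utterances: list[Utterance], samples: list[EmotionSample]) -> list[str]:
    """Attach to each utterance the latest emotion sample at or before its start."""
    ordered = sorted(samples, key=lambda s: s.time)
    times = [s.time for s in ordered]
    lines = []
    for u in utterances:
        idx = bisect_right(times, u.start) - 1
        label = ordered[idx].label if idx >= 0 else "unknown"
        lines.append(f"[{u.start:.1f}s] {u.text} (emotion: {label})")
    return lines
```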

[0337] Furthermore, the server automatically extracts tasks from the meeting minutes, sets deadlines, and assigns responsible persons using task management tools. Emotion information may also be used to evaluate the priority and urgency of tasks. Based on this information, users plan post-meeting follow-ups and manage projects.

[0338] This system enables comprehensive meeting recording and task management, including emotional shifts, supporting effective decision-making that takes into account the psychological nuances of participants.

[0339] The following describes the processing flow.

[0340] Step 1:

[0341] The device begins capturing audio and video at the start of the meeting. It uses the built-in microphone to record participants' speech and the camera to continuously record their facial expressions. This data is transmitted to the server in real time.

[0342] Step 2:

[0343] The server receives the audio data and passes it to the speech recognition engine, which converts it into text data. This conversion is performed sequentially, and the process of generating text data in real time continues throughout the meeting.

[0344] Step 3:

[0345] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts important information and keywords from the text, identifying key parts of the meeting.

[0346] Step 4:

[0347] The server inputs video data collected during the meeting into the emotion engine. The emotion engine analyzes the video, identifies the emotional state of the participants from their facial expressions and tone of voice, and records the results in a database.

[0348] Step 5:

[0349] The server automatically generates meeting minutes by combining extracted key information and sentiment data. The minutes include not only the content of the statements but also the sentiment information of the participants at that time.

[0350] Step 6:

[0351] The server detects tasks from the meeting minutes and registers them in the task management system. Emotion information may be considered as a factor in determining the importance and priority of tasks.

[0352] Step 7:

[0353] Users receive meeting minutes and task lists sent from the server via their terminals. Based on the information provided, users can manage post-meeting follow-ups and track the progress of each task.

[0354] Step 8:

[0355] Users update task progress within the system and add comments and feedback as needed. This ensures that progress is always kept up-to-date and that stakeholders are automatically notified.
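The use of emotion information as one factor in task priority, as mentioned in Step 6, may be sketched as a simple additive score. The weights, the deadline thresholds, and the names `EMOTION_WEIGHT`, `priority_score`, and `rank_tasks` are hypothetical values chosen for illustration; the disclosure does not define a scoring rule.

```python
# Hypothetical weights: how strongly each detected emotion raises a task's priority.
EMOTION_WEIGHT = {"anxiety": 2, "concern": 1, "neutral": 0}


def priority_score(days_until_deadline: int, emotion: str) -> int:
    """Higher score means more urgent. Combines deadline proximity with emotion."""
    # Deadline component: 3 if due within 2 days, 2 if within a week, otherwise 1.
    if days_until_deadline <= 2:
        urgency = 3
    elif days_until_deadline <= 7:
        urgency = 2
    else:
        urgency = 1
    # Unknown emotion labels contribute nothing.
    return urgency + EMOTION_WEIGHT.get(emotion, 0)


def rank_tasks(tasks: list[tuple[str, int, str]]) -> list[str]:
    """Sort (name, days_until_deadline, emotion) tuples from most to least urgent."""
    ranked = sorted(tasks, key=lambda t: priority_score(t[1], t[2]), reverse=True)
    return [name for name, _, _ in ranked]
```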

[0356] (Example 2)

[0357] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0358] Traditional meeting recording systems simply convert audio data into text data to create minutes, failing to consider the emotions of participants or the atmosphere of the discussion during the meeting. Furthermore, it was difficult to properly prioritize tasks generated during the meeting and to manage them efficiently.

[0359] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0360] In this invention, the server includes a speech conversion means for receiving audio information and converting the audio information into text information, a natural language processing means for extracting important elements from the generated text information, and an emotion recognition means for analyzing emotional states from video information. This makes it possible to create detailed meeting minutes that take into account the emotions of participants during a meeting, and to prioritize tasks based on those emotional states.

[0361] "Speech conversion means" refers to a device or function that receives speech information and converts that speech information into text information.

[0362] "Natural language processing means" refers to techniques or algorithms that extract important elements from generated textual information and analyze their meaning and context.

[0363] A "meeting record creation method" refers to a system or method that automatically creates meeting minutes based on extracted key elements.

[0364] "Emotion recognition means" refers to a technology or system that identifies an emotional state by analyzing a participant's facial expressions and movements based on video information.

[0365] "Work management means" refers to a method or system that automatically detects work items based on generated meeting records and manages and organizes them.

[0366] This invention is a system designed to streamline information processing in meetings. Specific embodiments are described below.

[0367] First, the device captures the audio and video of the meeting in real time. This is done using the built-in microphone and camera, acquiring audio data in WAV format and video data in MP4 format. This data is then sent to the server in streaming format.
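Handling WAV audio and sending it in a streaming manner may be sketched with Python's standard `wave` module. This is an illustration only: the sample rate, the mono 16-bit format, and the function names `make_silent_wav` and `stream_frames` are assumptions, and silence stands in for real microphone input.

```python
import io
import wave
from typing import Iterator


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Create an in-memory mono 16-bit WAV of silence, standing in for microphone input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample (16-bit)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def stream_frames(wav_bytes: bytes, frames_per_chunk: int) -> Iterator[bytes]:
    """Yield raw PCM frames in fixed-size chunks, as a streaming sender might."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        while True:
            frames = wav.readframes(frames_per_chunk)
            if not frames:
                break
            yield frames
```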

[0368] The server uses a speech converter to process the received audio data. Specifically, it utilizes a speech recognition engine to convert audio into text data. This engine is based on APIs commonly used in speech recognition software. Next, the generated text information is analyzed using natural language processing techniques to extract important keywords and utterances. This process uses a natural language processing engine and organizes the information using a language model.

[0369] Meanwhile, the server receives the video data and uses an emotion recognition system to analyze the emotional state of the participants. This system utilizes emotion recognition technologies such as facial recognition APIs. The analyzed emotion data is then used for creating meeting minutes and task management.

[0370] The meeting minutes are generated automatically from the processed data and reflect the participants' emotional states, so the resulting summary conveys the atmosphere of the meeting and the tone of the participants. Furthermore, the server automatically extracts tasks from the meeting minutes and sets schedules and priorities using a task management system.

[0371] Users can view meeting minutes and task lists generated on their devices. They can also plan post-meeting follow-ups and manage projects as needed.

[0372] As a concrete example, when someone says "This proposal involves risks" during a meeting, the device sends the audio to the server, and the speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and the emotion recognition engine can analyze emotions such as "anxiety" or "concern." This information is then used to generate meeting minutes that take into account the emotions of the participants.

[0373] As an example of a prompt, the user can enter the instruction, "Based on the meeting audio and sentiment data, propose an action plan for the next meeting using automatically generated meeting minutes." This will trigger an automated proposal based on the generative AI model.
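Assembling such an instruction together with the meeting minutes and emotion data into a single prompt may be sketched as follows. The prompt layout, the per-speaker emotion mapping, and the function name `build_prompt` are assumptions made for illustration; the disclosure does not prescribe a prompt format, and no generative AI model is called here.

```python
def build_prompt(instruction: str, minutes: str, emotions: dict[str, str]) -> str:
    """Combine an instruction with minutes and per-speaker emotion labels."""
    # Sort speakers so the prompt text is deterministic.
    emotion_lines = "\n".join(
        f"- {speaker}: {label}" for speaker, label in sorted(emotions.items())
    )
    return (
        f"{instruction}\n\n"
        f"Meeting minutes:\n{minutes}\n\n"
        f"Emotion data:\n{emotion_lines or '- (none)'}"
    )
```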

[0374] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0375] Step 1:

[0376] The terminal activates its built-in microphone and camera at the start of the meeting. Inputs include audio and video from the meeting room. This data is captured as WAV audio files and MP4 video files. The terminal collects this data in real time and sends it to the server via streaming.

[0377] Step 2:

[0378] The server feeds the audio data received from the terminal into the speech conversion system. Specifically, it runs a speech recognition engine that converts the audio data into text data. The output is the spoken content as text, which is used in the subsequent analysis.

[0379] Step 3:

[0380] The server uses a natural language processing system with text data as input. It leverages a natural language processing engine to extract important keywords and content from the text data. This extracted information becomes the output, forming the basis for creating meeting minutes. Furthermore, a language model is applied to understand the context of the text.

[0381] Step 4:

[0382] The server feeds the video data received from the terminal into the emotion recognition system. The emotion recognition engine analyzes the emotional state of participants based on their facial expressions. The results of this analysis are output as emotion data, which is used for meeting minutes and task management.

[0383] Step 5:

[0384] The server feeds the extracted text information and emotion data into the meeting minutes creation system, whose engine automatically generates the meeting minutes. The output is a comprehensive meeting record that reflects both the content of the meeting and the emotional tone of the participants.

[0385] Step 6:

[0386] The server feeds the generated meeting minutes into the task management system. The task management engine automatically extracts tasks from the meeting minutes and sets their priorities and schedules. The output is compiled into a task list and provided to the user.

[0387] Step 7:

[0388] Users review the generated meeting minutes and task lists using their devices. Based on the outputted information, they plan meeting follow-ups and manage projects. Specifically, they adjust task deadlines and assign responsibilities, and monitor progress. This enables efficient work execution.
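Because the audio branch (Steps 2 and 3) and the video branch (Step 4) do not depend on each other, they may be processed concurrently before being combined in Step 5. The following sketch shows one way to do so with placeholder engines. The functions `transcribe` and `recognize_emotion` are hypothetical stand-ins that return canned values, and the use of a thread pool is an assumption rather than something stated in this disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe(audio: bytes) -> str:
    """Placeholder for the speech recognition engine (returns canned text)."""
    return "This proposal involves risks"


def recognize_emotion(video: bytes) -> str:
    """Placeholder for the emotion recognition engine (returns a canned label)."""
    return "anxiety"


def process_meeting(audio: bytes, video: bytes) -> dict[str, str]:
    """Run the audio and video branches concurrently, then combine the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(transcribe, audio)
        emotion_future = pool.submit(recognize_emotion, video)
        # result() blocks until each branch has finished.
        return {"text": text_future.result(), "emotion": emotion_future.result()}
```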

[0389] (Application Example 2)

[0390] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".

[0391] In meetings, it is necessary to effectively record and manage participants' statements and the resulting emotional changes in real time to improve the quality of task management and decision-making. However, conventional meeting recording systems cannot take emotional changes into account, making it difficult to implement effective project management that takes into account the psychological nuances of participants.

[0392] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0393] In this invention, the server includes speech recognition means for receiving audio data and converting it into text data, natural language processing means for extracting important information, and emotion recognition means for analyzing the emotional state of participants from video data. This makes it possible to record and manage tasks while taking into account changes in participants' emotions during the progress of the meeting.

[0394] "Speech recognition means" refers to technology that receives speech data and converts that speech into text data.

[0395] "Natural language processing means" refers to technologies that extract important information from generated text data and perform information analysis and understanding.

[0396] "Meeting minutes creation method" refers to technology that automatically generates documents recording the content of a meeting based on important information.

[0397] "Task management method" refers to a technology that organizes and manages tasks automatically detected based on meeting minutes.

[0398] "Emotion recognition means" refers to technology that analyzes the emotional state of participants based on video data and understands their situation.

[0399] "Methods for adjusting meeting records based on emotion analysis results" refers to techniques that utilize emotion analysis results to adjust the recorded meeting content and its interpretation.

[0400] The system implementing this invention mainly consists of a server, a terminal, and a user.

[0401] The server converts speech data into text data using a speech recognition engine. Specifically, services such as Google Speech-to-Text or Amazon Transcribe can be used. Natural language processing tools extract important information from this text data, which becomes the basic data for creating meeting minutes. Libraries such as spaCy and NLTK are used for natural language processing.

[0402] Furthermore, the server receives video data collected by the built-in camera and performs emotion recognition using Microsoft Azure's facial recognition API, among other things. This allows for real-time analysis of participants' emotional states, and the meeting record is adjusted based on the results.

[0403] The terminal uses its built-in microphone and camera to capture audio and video of meetings and transmits this data to the server. Through this system, users can receive meeting transcripts and sentiment analysis results generated in real time, which can be used for task management and project decision-making.

[0404] As a concrete example of this system, during a residents' meeting for a local shopping district revitalization project, a comment was made: "I'm worried about holding a new event." The system analyzes the speaker's facial expression from the captured video footage along with this comment, interpreting it as "anxiety," and adjusts the priority of related event preparation tasks accordingly.

[0405] An example of a prompt sentence using a generative AI model is "a method for analyzing the sentiment behind statements made at a residents' meeting and using that information to help with decision-making at the meeting."

[0406] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0407] Step 1:

[0408] The device captures the audio and video data of the meeting. Using the built-in microphone and camera, it records participants' speech and facial expressions in real time. To optimize resource usage, the data is compressed into an appropriate format before being transferred to the server. The input is the meeting's audio and video, and the output is the captured, not yet analyzed data sent to the server.

[0409] Step 2:

[0410] The server converts the received audio data into text data using a speech recognition engine. For example, using Google Speech-to-Text, this data is converted into text format. The input is audio data, and the output is the corresponding text data.

[0411] Step 3:

[0412] The server processes text data using natural language processing to extract important information. It uses natural language processing libraries such as spaCy and NLTK to extract key points from spoken content for meeting minutes creation. The input is text data, and the output is the important information necessary for meeting minutes.

[0413] Step 4:

[0414] The server receives video data from the device's camera and analyzes it using an emotion recognition API. Microsoft Azure's facial recognition API is used to identify emotional states from participants' facial expressions. The input is video data, and the output is information about the detected emotions.

[0415] Step 5:

[0416] The server incorporates the sentiment analysis results into the meeting minutes, adjusting the content to reflect the tone and emotional impact of the statements. Based on the sentiment recognition results, the system processes the minutes to emphasize key points and adjust their tone. The input consists of key information and sentiment information from the meeting minutes, and the output is the sentiment-adjusted meeting minutes.

[0417] Step 6:

[0418] The server automatically detects tasks based on meeting minutes, sets their priorities, and adds them to the task management system. The system sends information to a project management interface to help visualize tasks and manage their progress. The input is the edited meeting minutes, and the output is a set of prioritized tasks.

[0419] Step 7:

[0420] Users receive meeting minutes and sentiment data generated in real time from the server, and then use this data for subsequent meeting follow-up and task management. Participants can view and edit information through the user interface. Inputs are meeting minutes and sentiment information from the server, while outputs are the displayed content and task progress information provided to the user.
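The adjustment of meeting minutes based on emotion analysis results in Step 5 may be sketched as marking and promoting entries associated with certain emotions. The set of emphasized emotions, the `[ATTENTION: ...]` marker, and the function name `adjust_minutes` are hypothetical; the disclosure does not specify how emphasis is expressed.

```python
# Hypothetical set of emotions that should draw a reader's attention.
EMPHASIZED = {"anxiety", "concern", "anger"}


def adjust_minutes(entries: list[tuple[str, str]]) -> list[str]:
    """Mark (text, emotion) entries whose emotion is emphasized and list them first."""
    flagged = [
        f"[ATTENTION: {emotion}] {text}"
        for text, emotion in entries
        if emotion in EMPHASIZED
    ]
    normal = [text for text, emotion in entries if emotion not in EMPHASIZED]
    # Original order is preserved within each group.
    return flagged + normal
```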

[0421] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0422] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0423] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0424] [Third Embodiment]

[0425] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0426] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0427] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0428] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0429] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0430] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0431] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0432] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0433] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0434] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0435] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0436] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0437] The system according to the present invention is configured to automate both the creation of meeting minutes from meeting recordings and the management of tasks. This system mainly consists of a server, terminals, and users.

[0438] First, the terminal performs voice input at the start of the meeting. This voice input is collected using the terminal's built-in microphone and sent to the server in real time. The voice data sent to the server is immediately converted into text data through speech recognition. This converted text data is stored in temporary storage on the server.

[0439] Next, the server analyzes the stored text data using natural language processing. This extracts important information and keywords from the meeting content. In particular, it is designed to automatically identify decisions and key topics. This extracted information is then organized by a meeting minutes creation system and automatically generated as formatted meeting minutes.

[0440] Furthermore, the server automatically detects tasks from the generated meeting minutes. The task management system analyzes the deadlines and responsible persons associated with these tasks and records their progress. The user is notified of the task list and can then take further action based on it.

[0441] As a concrete example, if a participant says "We need to prepare the materials before the next meeting" during a meeting, the terminal sends this statement as audio data to the server. The server uses speech recognition to convert "We need to prepare the materials before the next meeting" into text data, and then uses natural language processing to extract the task "Prepare materials" from this statement. The task management system lists this task and notifies the user along with related information. In this way, tasks are managed efficiently even after the meeting, enabling smooth information sharing and follow-up.
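
The extraction in this example can be illustrated with a minimal rule-based sketch. This is not the natural language processing of the disclosure, which is not specified at this level of detail; the regular expression, the `Task` type, and the function `extract_task` are hypothetical names introduced only for illustration, and a practical system would more likely rely on a trained model.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern (an assumption, not from the disclosure):
# "... need(s)/have/has to <action> [before|by <deadline>]".
_TASK_PATTERN = re.compile(
    r"\b(?:need|needs|have|has)\s+to\s+(?P<action>.+?)"
    r"(?:\s+(?:before|by)\s+(?P<deadline>.+?))?[.!]?$",
    re.IGNORECASE,
)


@dataclass
class Task:
    action: str
    deadline: Optional[str] = None


def extract_task(utterance: str) -> Optional[Task]:
    """Return a Task if the utterance looks like an obligation, else None."""
    match = _TASK_PATTERN.search(utterance.strip())
    if match is None:
        return None
    action = match.group("action").strip()
    if not action:
        return None
    # Capitalize the first letter so the action reads like a task title.
    action = action[0].upper() + action[1:]
    return Task(action=action, deadline=match.group("deadline"))


task = extract_task("We need to prepare the materials before the next meeting")
```

Here `task.action` is "Prepare the materials" and `task.deadline` is "the next meeting"; an utterance with no obligation phrase yields `None`.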

[0442] In this way, the present invention aims to streamline administrative tasks in meetings and provide an environment where participants can focus on more important discussions.

[0443] The following describes the processing flow.

[0444] Step 1:

[0445] The terminal starts recording audio at the beginning of the meeting. The terminal uses its built-in microphone to capture participants' speech in real time and stores it as audio data.

[0446] Step 2:

[0447] After the terminal buffers a certain amount of audio data, it sends this audio data to the server in packet format. Transmission is performed periodically in real time to minimize data delay.

[0448] Step 3:

[0449] The server passes the received audio data to the speech recognition engine. The speech recognition engine analyzes the audio data and converts it into text data. This process is continuous, generating text data in real time during the meeting.

[0450] Step 4:

[0451] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts context and keywords from the text data, identifying important information from the meeting. This information is then organized into key points to focus on.

[0452] Step 5:

[0453] The server automatically generates meeting minutes based on the extracted important information. The meeting minutes creation module organizes the information in an appropriate format and saves it as a meeting record.

[0454] Step 6:

[0455] The server automatically extracts tasks using the generated meeting minutes and registers them in the task management database. Tasks can be automatically assigned deadlines and responsible parties, and this information is stored in a structured format.

[0456] Step 7:

[0457] Users receive meeting minutes and task lists from the server via their terminals. They use this information to manage post-meeting tasks and track task progress.

[0458] Step 8:

[0459] Users can access the task management system through an interface to update task progress. Users can add progress information and notify other stakeholders.
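
The buffering and packet transmission of Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `packetize` and the fixed packet size are hypothetical, and the actual transport (network protocol, codec, and so on) is not specified in this disclosure.

```python
from typing import Iterable, Iterator


def packetize(frames: Iterable[bytes], packet_size: int) -> Iterator[bytes]:
    """Buffer incoming audio frames and yield fixed-size packets.

    The final packet may be shorter than packet_size so no audio is dropped.
    """
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    buffer = bytearray()
    for frame in frames:
        buffer.extend(frame)
        # Emit as many full packets as the buffer currently holds.
        while len(buffer) >= packet_size:
            yield bytes(buffer[:packet_size])
            del buffer[:packet_size]
    if buffer:
        yield bytes(buffer)  # flush the remainder when the meeting ends


# Each yielded packet would be handed to the terminal's sender (not shown).
packets = list(packetize([b"abc", b"defg", b"h"], packet_size=4))
```

A smaller packet size lowers latency at the cost of more transmissions, which is the trade-off implied by sending periodically to minimize data delay.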

[0460] (Example 1)

[0461] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0462] Traditional meeting record-keeping and task management processes lack automation, resulting in cumbersome and time-consuming post-meeting processing. In particular, extracting key information from discussions, quickly creating meeting minutes, and efficiently managing related tasks is challenging.

[0463] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0464] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important information from the generated text information, and record creation means for automatically generating meeting minutes based on the extracted important information. This makes it possible to streamline post-meeting task management and provide an environment in which participants can concentrate on important discussions.

[0465] "Audio information" refers to audio data, including spoken words during a meeting.

[0466] "Text information" refers to text data converted from audio information.

[0467] "Speech recognition means" refers to technical means that convert speech information into text information, and includes the process of analyzing acoustic signals and generating corresponding text.

[0468] "Natural language processing means" refers to technologies for extracting important information from textual data, and includes the process of understanding and structuring the meaning and context of the text.

[0469] "Record creation means" refers to a method for automatically generating meeting minutes based on extracted important information, organizing the information and outputting it in a predetermined format.

[0470] A "work item" refers to a specific task or action decided upon in a meeting, accompanied by relevant information such as the person who will perform it and the deadline.

[0471] "Work management means" refers to methods for handling detected work items and managing their processing and progress.

[0472] "Analysis of relevant deadlines and responsible persons" involves analyzing the deadlines set for work items and the responsible persons involved through information processing, and utilizing this information in the management process.

[0473] This invention is a system that efficiently creates meeting minutes by recording audio information from meetings and converting it into text information, and also automatically detects and manages work items. The system mainly consists of a server, terminals, and users.

[0474] The terminal inputs audio information during a meeting using its built-in microphone and transmits it to the server via its communication function. The basic hardware used in the terminal includes a standard microphone and a computer with network connectivity.

[0475] The server receives audio information transmitted from the terminal and converts it into text using speech recognition software. This process utilizes, for example, a "speech analysis API" as speech recognition software. After the audio information is converted to text, the server analyzes the text using natural language processing techniques. Here, software such as a "natural language processing library" is used to extract important information.

[0476] Next, the server uses the recording mechanism to format the meeting minutes based on the extracted information. The minutes are stored in a database and made available to users for access as needed.

[0477] Furthermore, the server detects work items from meeting records and manages these items using work management tools. Specifically, it analyzes deadlines and responsible parties and tracks progress using task management tool APIs. For example, the "Project Management Platform API" can be used.

[0478] Users can receive real-time notifications for generated meeting minutes and the progress of work items. Based on this, users can plan their next actions and perform their work efficiently.

[0479] For example, if a participant says "Please prepare for the next presentation" during a meeting, the terminal sends this audio to the server. The server converts it into text information, "Please prepare for the next presentation," and uses natural language processing to detect the task item "Prepare for the presentation."

[0480] An example of a prompt message is, "Convert the meeting audio log to text and extract specific work items." In this way, the system can automate meeting information management and dramatically improve work efficiency.
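
How such a prompt might be combined with inference data before being passed to the data generation model 58 can be sketched as below. The function `build_prompt` and the "Transcript:" layout are assumptions made for illustration; the sketch also assumes the audio has already been converted to text, whereas the model described in this disclosure may accept audio data directly.

```python
from typing import List


def build_prompt(instruction: str, transcript_lines: List[str]) -> str:
    """Combine an instruction with transcript text into a single prompt string."""
    # Drop empty lines and surrounding whitespace so the prompt stays compact.
    transcript = "\n".join(
        line.strip() for line in transcript_lines if line.strip()
    )
    return f"{instruction.strip()}\n\nTranscript:\n{transcript}"


prompt = build_prompt(
    "Convert the meeting audio log to text and extract specific work items.",
    ["Please prepare for the next presentation"],
)
```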

[0481] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0482] Step 1:

[0483] The terminal uses its built-in microphone to input voice information as soon as the meeting starts. The input voice is converted into digital data and temporarily stored in the terminal's memory. The converted digital voice data is then sent sequentially to the server as data chunks.

[0484] Step 2:

[0485] The server receives the digital audio data transmitted from the terminal. The received digital audio data is converted into text information using a speech analysis API. In the speech-to-text conversion process, the acoustic signal undergoes phoneme analysis, and text is generated by a continuous speech recognition model. The resulting raw text data is stored in the server's temporary storage.

[0486] Step 3:

[0487] The server analyzes the stored raw text data using a natural language processing library. During this analysis process, named entities and important information are extracted from the text, and key points and decisions are identified using predefined rules and machine learning models. The extracted information is then output in a structured format as the analysis result.

[0488] Step 4:

[0489] The server generates meeting minutes using recording mechanisms based on structured information. In this process, information is organized according to a template and output as formatted meeting minutes. The generated meeting minutes are stored in a database for later access by users.

[0490] Step 5:

[0491] The server applies work management tools to detect new work items from meeting records. Information about the detected work items is sent to the task management system via API, and a work item list is generated. At this time, deadlines and responsible person information are automatically added to the items.

[0492] Step 6:

[0493] After a meeting ends, users receive a generated task list via real-time notification. Upon receiving the notification, users can review the details through their task management tool and take action to manage the progress of their tasks. This ensures that individual work items are efficiently tracked and completed.
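
The template-based formatting of Step 4 can be illustrated with a short sketch. The template text, the field names, and the function `render_minutes` are hypothetical; the disclosure states only that information is organized according to a template.

```python
from string import Template
from typing import List

# Hypothetical minutes layout; the actual template is not given in the text.
MINUTES_TEMPLATE = Template(
    "Meeting minutes: $title\n"
    "Key points:\n$key_points\n"
    "Decisions:\n$decisions\n"
)


def _bullets(items: List[str]) -> str:
    """Render items as a bullet list, with a placeholder when empty."""
    return "\n".join(f"- {item}" for item in items) or "- (none)"


def render_minutes(title: str, key_points: List[str], decisions: List[str]) -> str:
    """Fill the template with the structured analysis result."""
    return MINUTES_TEMPLATE.substitute(
        title=title,
        key_points=_bullets(key_points),
        decisions=_bullets(decisions),
    )


minutes = render_minutes("Weekly sync", ["Budget reviewed"], [])
```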

[0494] (Application Example 1)

[0495] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0496] In modern brick-and-mortar stores, efficiently recording meeting and discussion content and managing tasks based on that information is essential. However, doing this manually is typically time-consuming, labor-intensive, and prone to errors. Furthermore, the lack of mechanisms for immediate on-site information sharing and for visualizing task progress makes it difficult to optimize operations. A system is needed that solves these problems and allows operations to run efficiently.

[0497] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0498] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important data from the generated text information, and record creation means for automatically generating meeting documents based on the extracted important data. This enables the content of meetings to be recorded in real time, facilitating efficient task management and information sharing.

[0499] "Audio information" refers to information used to record and process audio in a digital format.

[0500] "Text information" refers to text data expressed in digital format, specifically text converted from audio information.

[0501] "Speech recognition means" refers to technology that analyzes speech information and converts it into text information.

[0502] "Natural language processing means" refers to language processing techniques used to analyze textual information and extract important data from it.

[0503] "Record creation means" refers to the means for generating meeting documents based on extracted data, and for formalizing the documents.

[0504] "Work items" refer to specific actions or tasks that are decided or presented in meetings or discussions.

[0505] "Controlling tasks" refers to managing work items, monitoring their progress, and making appropriate adjustments.

[0506] "Business support" refers to back-office support activities aimed at improving the operational efficiency of stores.

[0507] "Business documents" refer to documents that organize and record information related to business operations.

[0508] "Business documents derived from audio" are recorded documents generated based on audio information, and reflect the content of the meeting.

[0509] "Methods for optimizing business progress" refer to techniques for organizing information and clarifying tasks in order to operate business efficiently.

[0510] The system that realizes this invention mainly consists of a server, a terminal, and a user. First, the terminal collects audio information at the start of a meeting or discussion. The terminal acquires audio information using its built-in microphone, and this audio information is transmitted to the server in real time.

[0511] Next, the server processes the received audio information. The server uses speech recognition technology to convert the audio information into text. For this process, for example, a speech recognition service provided by AWS can be used. The converted text information is temporarily stored in a database on the server.

[0512] Subsequently, the server analyzes the textual information using natural language processing technology and extracts important data. Natural language processing tools such as Amazon Comprehend can be utilized for this process. The extracted data is organized by a recording system and automatically formatted as a meeting document.

[0513] Next, the server automatically detects and manages work items based on the meeting documents. Work items are organized with their associated deadlines and assignee information, and their progress is tracked. The user is notified of managed tasks in real time and can take appropriate action. Task management can be implemented using tools such as the Trello API.

[0514] For example, if someone says during a meeting at a store, "We need to prepare the product display for the next sale," the terminal sends this statement as audio information to the server. The server converts it into text using speech recognition and extracts the task item "Prepare product display" using natural language processing. The task item is then assigned an appropriate person and deadline, and the user is notified.

[0515] An example of a prompt message would be, "Please upload the audio file, summarize the key points of the meeting, and create a list of tasks to complete before the next meeting."

[0516] In this way, this invention achieves improved operational efficiency and optimized information sharing in physical stores.

[0517] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0518] Step 1:

[0519] The terminal collects audio information as soon as a meeting or discussion begins. The terminal's built-in microphone captures ambient sound, and the recorded audio data is transmitted to the server in real time. The input is raw audio information, while the output is digital audio data transmitted to the server.

[0520] Step 2:

[0521] The server converts received audio data into text information using speech recognition technology. Specifically, the server analyzes the audio data using a speech recognition API and generates text data. The input is digital audio data, and the output is the corresponding text information.

[0522] Step 3:

[0523] The server analyzes the generated text information using natural language processing techniques and extracts important data. Here, the natural language processing engine identifies key points and decisions from the text information. The input is text information, and the output is the extracted important data.

[0524] Step 4:

[0525] The server automatically generates meeting documents using a record-keeping system based on the extracted key data. Here, the server applies formatting rules and outputs well-formed documents. The input is key data, and the output is a formatted meeting document.

[0526] Step 5:

[0527] The server automatically detects and manages work items based on the generated meeting documents. The server interprets the context of the meeting documents, detects work items, and registers the detected work items in the task management system. The input is the meeting documents, and the output is a list of detected work items.

[0528] Step 6:

[0529] The user checks the progress of tasks by referring to a list of work items registered by the server in the task management system. The user sets actions for each work item as needed. The input is a list of work items, and the output is the updated status after user confirmation.

[0530] Step 7:

[0531] The server manages the progress of updated work items and notifies the user in real time. The input is the updated data, and the output is the information notified to the user.

[0532] This series of processes enables efficient recording and management of information from meetings and discussions at physical stores.
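
Steps 5 to 7 (registering work items, updating their status, and notifying the user) can be sketched with a small in-memory store. This is an assumption-laden illustration: `WorkItem`, `TaskBoard`, and the notification callback are hypothetical names, and a real deployment would call an external task management service rather than keep items in memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class WorkItem:
    item_id: int
    title: str
    assignee: Optional[str] = None
    deadline: Optional[str] = None  # e.g. an ISO date string
    status: str = "open"


class TaskBoard:
    """In-memory stand-in for a task management system (hypothetical)."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._items: Dict[int, WorkItem] = {}
        self._notify = notify
        self._next_id = 1

    def register(self, title: str, assignee: Optional[str] = None,
                 deadline: Optional[str] = None) -> WorkItem:
        item = WorkItem(self._next_id, title, assignee, deadline)
        self._items[item.item_id] = item
        self._next_id += 1
        self._notify(f"New work item #{item.item_id}: {item.title}")
        return item

    def update_status(self, item_id: int, status: str) -> WorkItem:
        item = self._items[item_id]  # raises KeyError for unknown ids
        item.status = status
        self._notify(f"Work item #{item_id} is now {status}")
        return item


notifications = []
board = TaskBoard(notifications.append)
item = board.register("Prepare product display")
board.update_status(item.item_id, "in progress")
```

Passing the notifier as a callable keeps the store independent of how users are actually notified (push message, e-mail, or otherwise).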

[0533] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0534] This invention relates to a system that processes audio and emotions in a meeting in real time and uses this information to create meeting records and manage tasks. The system consists of a server, terminals, and users, and also supports emotion recognition.

[0535] First, the terminal captures the audio and video of the meeting. This is done using the built-in microphone and camera, which record participants' statements and facial expressions in real time. This data is sent to the server, where the audio data is converted into text data and the video data is analyzed by the emotion engine.

[0536] The server converts the received audio data into text data using a speech recognition engine. This text data is then subjected to natural language processing to extract important information. Based on this information, meeting minutes are automatically generated.

[0537] Here, the emotion engine installed on the server analyzes the user's emotional state from the video data. This emotional information is recorded along with the meeting content and used to analyze changes in emotion and tone during important discussions.

[0538] As a concrete example, when someone says "This proposal involves risks" during a meeting, the terminal sends the audio to the server, where the speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and the emotion engine detects emotions such as "anxiety" or "concern." The server combines this information to create meeting minutes that include the participants' emotions in relation to the context of the statement.
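
One way the server might combine transcribed statements with emotion labels is to align them by timestamp, as in the following sketch. The timestamps, the `Utterance` type, and the function `annotate` are assumptions introduced for illustration; the disclosure does not specify how the two streams are associated.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Utterance:
    start: float  # seconds from the start of the meeting (assumed available)
    text: str


def annotate(
    utterances: List[Utterance],
    emotions: List[Tuple[float, str]],
) -> List[Tuple[str, Optional[str]]]:
    """Attach to each utterance the latest emotion label observed at or
    before its start time, or None if no label precedes it."""
    ordered = sorted(emotions)
    times = [t for t, _ in ordered]
    annotated = []
    for utterance in utterances:
        idx = bisect_right(times, utterance.start) - 1
        label = ordered[idx][1] if idx >= 0 else None
        annotated.append((utterance.text, label))
    return annotated


result = annotate(
    [Utterance(12.0, "This proposal involves risks")],
    [(5.0, "neutral"), (11.5, "concern"), (20.0, "neutral")],
)
```

In this usage the statement is paired with "concern", the most recent label before it was spoken.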

[0539] Furthermore, the server automatically extracts tasks from the meeting minutes, sets deadlines, and assigns responsibilities using task management tools. Emotion information may also be used to evaluate the priority and urgency of tasks. Based on this information, users plan post-meeting follow-ups and manage projects.

[0540] This system enables comprehensive meeting recording and task management, including emotional shifts, supporting effective decision-making that takes into account the psychological nuances of participants.

[0541] The following describes the processing flow.

[0542] Step 1:

[0543] The terminal begins capturing audio and video at the start of the meeting. It uses the built-in microphone to record participants' speech and the camera to continuously record their facial expressions. This data is transmitted to the server in real time.

[0544] Step 2:

[0545] The server receives the audio data and passes it to the speech recognition engine, which converts it into text data. This conversion is performed sequentially, and the process of generating text data in real time continues throughout the meeting.

[0546] Step 3:

[0547] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts important information and keywords from the text, identifying key parts of the meeting.

[0548] Step 4:

[0549] The server inputs video data collected during the meeting into the emotion engine. The emotion engine analyzes the video, identifies the emotional state of the participants from their facial expressions and tone of voice, and records the results in a database.

[0550] Step 5:

[0551] The server automatically generates meeting minutes by combining the extracted key information and the emotion data. The minutes include not only the content of the statements but also the emotion information of the participants at that time.

[0552] Step 6:

[0553] The server detects tasks from the meeting minutes and registers them in the task management system. Emotion information may be considered as a factor in determining the importance and priority of tasks.

[0554] Step 7:

[0555] Users receive meeting minutes and task lists sent from the server via their terminals. Based on the information provided, users can manage post-meeting follow-ups and track the progress of each task.

[0556] Step 8:

[0557] Users update task progress within the system and add comments and feedback as needed. This ensures that progress is always kept up-to-date and that stakeholders are automatically notified.
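
The use of emotion information as a priority factor in Step 6 can be sketched as a simple weighted sort. The weights and the function `prioritize` are hypothetical; the disclosure states only that emotion information may be considered, not how it is scored.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: emotions suggesting unease raise a task's urgency.
EMOTION_WEIGHT: Dict[str, int] = {"anxiety": 2, "concern": 1}


def prioritize(tasks: List[Tuple[str, str]]) -> List[str]:
    """Order (task title, emotion label) pairs so that tasks associated with
    higher-weight emotions come first. Unknown labels get weight 0, and
    ties keep their original order because the sort is stable."""
    ranked = sorted(
        tasks,
        key=lambda pair: EMOTION_WEIGHT.get(pair[1], 0),
        reverse=True,
    )
    return [title for title, _ in ranked]


ordered_titles = prioritize([
    ("Update slides", "neutral"),
    ("Review risk plan", "anxiety"),
    ("Check budget", "concern"),
])
```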

[0558] (Example 2)

[0559] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0560] Traditional meeting recording systems simply convert audio data into text data to create minutes, failing to consider the emotions of participants or the atmosphere of the discussion during the meeting. Furthermore, it was difficult to properly prioritize tasks generated during the meeting and to manage them efficiently.

[0561] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0562] In this invention, the server includes a speech conversion means for receiving audio information and converting the audio information into text information, a natural language processing means for extracting important elements from the generated text information, and an emotion recognition means for analyzing emotional states from video information. This makes it possible to create detailed meeting minutes that take into account the emotions of participants during a meeting, and to prioritize tasks based on those emotional states.

[0563] "Speech conversion means" refers to a device or function that receives speech information and converts that speech information into text information.

[0564] "Natural language processing means" refers to techniques or algorithms that extract important elements from generated textual information and analyze their meaning and context.

[0565] A "meeting record creation method" refers to a system or method that automatically creates meeting minutes based on extracted key elements.

[0566] "Emotion recognition means" refers to a technology or system that identifies an emotional state by analyzing a participant's facial expressions and movements based on video information.

[0567] "Work management means" refers to a method or system that automatically detects work items based on generated meeting records and manages and organizes them.

[0568] This invention is a system designed to streamline information processing in meetings. Specific embodiments are described below.

[0569] First, the terminal captures the audio and video of the meeting in real time. This is done using the built-in microphone and camera, acquiring audio data in WAV format and video data in MP4 format. This data is then sent to the server in streaming format.

[0570] The server uses a speech converter to process the received audio data. Specifically, it utilizes a speech recognition engine to convert audio into text data. This engine is based on APIs commonly used in speech recognition software. Next, the generated text information is analyzed using natural language processing techniques to extract important keywords and utterances. This process uses a natural language processing engine and organizes the information using a language model.

[0571] Meanwhile, the server receives the video data and uses an emotion recognition system to analyze the emotional state of the participants. This system utilizes emotion recognition technologies such as facial recognition APIs. The analyzed emotion data is then used for creating meeting minutes and task management.
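
If the speech branch and the emotion branch are independent, they could be run side by side on the server. The sketch below assumes this independence, which the disclosure does not state explicitly; `process_streams` is a hypothetical name, and the two callables stand in for whatever speech recognition and emotion recognition services are actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def process_streams(
    audio: bytes,
    video: bytes,
    transcribe: Callable[[bytes], str],
    recognize_emotion: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Run speech-to-text and emotion recognition concurrently and
    return (text, emotion label)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(transcribe, audio)
        emotion_future = pool.submit(recognize_emotion, video)
        # result() blocks until each branch finishes and re-raises its errors.
        return text_future.result(), emotion_future.result()


# Stub callables used only to show the call shape.
text, emotion = process_streams(
    b"This proposal involves risks",
    b"",
    transcribe=lambda data: data.decode("utf-8"),
    recognize_emotion=lambda data: "concern",
)
```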

[0572] The meeting minutes are generated automatically and reflect the analyzed emotional states, resulting in a summary that makes the atmosphere of the meeting and the tone of the participants easy to understand. Furthermore, the server automatically extracts tasks from the meeting minutes and sets schedules and priorities using a task management system.

[0573] Users can view the generated meeting minutes and task lists on their terminals. They can also plan post-meeting follow-ups and manage projects as needed.

[0574] As a concrete example, when someone says "This proposal involves risks" during a meeting, the terminal sends the audio to the server, and the speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and the emotion recognition engine detects emotions such as "anxiety" or "concern." This information is then used to generate meeting minutes that take into account the emotions of the participants.

[0575] As an example of a prompt, the user can enter the instruction, "Based on the meeting audio and sentiment data, propose an action plan for the next meeting using automatically generated meeting minutes." This will trigger an automated proposal based on the generative AI model.

[0576] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0577] Step 1:

[0578] The terminal activates its built-in microphone and camera at the start of the meeting. Inputs include audio and video from the meeting room. This data is captured as WAV audio files and MP4 video files. The terminal collects this data in real time and sends it to the server via streaming.

[0579] Step 2:

[0580] The server uses a speech conversion system with audio data received from the terminal as input. Specifically, it activates a speech recognition engine and converts the audio data into text data. This process outputs the spoken content as text, which is then used for subsequent analysis.

[0581] Step 3:

[0582] The server uses a natural language processing system with text data as input. It leverages a natural language processing engine to extract important keywords and content from the text data. This extracted information becomes the output, forming the basis for creating meeting minutes. Furthermore, a language model is applied to understand the context of the text.

[0583] Step 4:

[0584] The server inputs the video data received from the terminal into the emotion recognition system. The emotion recognition engine analyzes the emotional state of participants based on their facial expressions. The results of this analysis are output as emotion data, which is used for meeting minutes and task management.

[0585] Step 5:

[0586] The server inputs the extracted text information and the emotion data into the meeting minutes creation system. Using the meeting minutes creation engine, it automatically generates meeting minutes. The output is a comprehensive set of meeting minutes that reflects both the content of the meeting and the emotional tone of the participants.

[0587] Step 6:

[0588] The server uses the generated meeting minutes as input and utilizes a task management system. The task management engine automatically extracts tasks from the meeting minutes and sets their priorities and schedules. The output is compiled into a task list and provided to the user.

[0589] Step 7:

[0590] Users review the generated meeting minutes and task lists on their devices. Based on the output information, they plan meeting follow-ups and manage projects. Specifically, they adjust task deadlines, assign responsibilities, and monitor progress. This enables efficient work execution.
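Steps 2 through 6 above can be read as a chain of stages in which each stage consumes the output of the previous one. The following Python sketch illustrates only that data flow. All names (`MeetingResult`, `run_pipeline`, and the stage callables) are hypothetical and do not appear in the specification, and the stages used in the demonstration are stand-ins rather than real recognition or emotion engines.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MeetingResult:
    """Outputs of Steps 2 to 6 (hypothetical structure)."""
    transcript: str = ""
    key_points: List[str] = field(default_factory=list)
    emotions: List[str] = field(default_factory=list)
    minutes: str = ""
    tasks: List[str] = field(default_factory=list)


def run_pipeline(
    audio: bytes,
    video: bytes,
    recognize: Callable[[bytes], str],
    extract: Callable[[str], List[str]],
    detect_emotions: Callable[[bytes], List[str]],
    extract_tasks: Callable[[str], List[str]],
) -> MeetingResult:
    """Chain the stages so that each consumes the previous stage's output."""
    result = MeetingResult()
    result.transcript = recognize(audio)              # Step 2: speech to text
    result.key_points = extract(result.transcript)    # Step 3: key information
    result.emotions = detect_emotions(video)          # Step 4: emotion data
    # Step 5: the minutes combine the key points with the emotional tone.
    tone = ", ".join(result.emotions) or "none detected"
    lines = [f"- {point}" for point in result.key_points]
    result.minutes = "\n".join(lines + [f"Emotional tone: {tone}"])
    result.tasks = extract_tasks(result.minutes)      # Step 6: task list
    return result


# Stand-in stages for illustration only; a real system would call actual engines.
demo = run_pipeline(
    audio=b"\x00",
    video=b"\x00",
    recognize=lambda _audio: "We decided to launch in May. Alice will draft the plan.",
    extract=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    detect_emotions=lambda _video: ["neutral"],
    extract_tasks=lambda minutes: [
        line[2:] for line in minutes.splitlines() if " will " in line
    ],
)
```

Passing the stages in as callables keeps the sketch independent of any particular speech recognition or emotion recognition service.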

[0591] (Application Example 2)

[0592] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0593] In meetings, it is necessary to effectively record and manage participants' statements and the resulting emotional changes in real time to improve the quality of task management and decision-making. However, conventional meeting recording systems cannot take emotional changes into account, making it difficult to implement effective project management that takes into account the psychological nuances of participants.

[0594] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0595] In this invention, the server includes speech recognition means for receiving audio data and converting it into text data, natural language processing means for extracting important information, and emotion recognition means for analyzing the emotional state of participants from video data. This makes it possible to record and manage tasks while taking into account changes in participants' emotions during the progress of the meeting.

[0596] "Speech recognition means" refers to technology that receives speech data and converts that speech into text data.

[0597] "Natural language processing means" refers to technologies that extract important information from generated text data and perform information analysis and understanding.

[0598] "Meeting minutes creation method" refers to technology that automatically generates documents recording the content of a meeting based on important information.

[0599] "Task management method" refers to a technology that organizes and manages tasks automatically detected based on meeting minutes.

[0600] "Emotion recognition means" refers to technology that analyzes the emotional state of participants based on video data and understands their situation.

[0601] "Methods for adjusting meeting records based on emotion analysis results" refers to techniques that utilize emotion analysis results to adjust the recorded meeting content and its interpretation.

[0602] The system implementing this invention mainly consists of a server, a terminal, and a user.

[0603] The server converts speech data into text data using a speech recognition engine. Specifically, services such as Google Speech-to-Text or Amazon Transcribe can be used. Natural language processing tools then extract important information from this text data, which becomes the basic data for creating meeting minutes. Libraries such as spaCy and NLTK are used for natural language processing.

[0604] Furthermore, the server receives video data collected by the terminal's built-in camera and performs emotion recognition using, for example, Microsoft Azure's facial recognition API. This allows for real-time analysis of participants' emotional states, and the meeting record is adjusted based on the results.

[0605] The terminal uses its built-in microphone and camera to capture audio and video of meetings and transmits this data to the server. Through this system, users can receive meeting transcripts and sentiment analysis results generated in real time, which can be used for task management and project decision-making.

[0606] As a concrete example of this system, during a residents' meeting for a local shopping district revitalization project, a comment was made: "I'm worried about holding a new event." The system analyzes the speaker's facial expression from the captured video footage along with this comment, interpreting it as "anxiety," and adjusts the priority of related event preparation tasks accordingly.
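The priority adjustment in this example can be sketched as a lookup from the detected emotion to a priority change. The specification does not state the direction or size of the adjustment, so the mapping below (anxiety raises priority by two levels) is purely an assumption, as are the names `Task`, `EMOTION_PRIORITY_BOOST`, and `adjust_priority` and the 1 to 5 priority scale.

```python
from dataclasses import dataclass

# Assumed mapping from a detected emotion to a priority boost (not from the source).
EMOTION_PRIORITY_BOOST = {"anxiety": 2, "concern": 1, "neutral": 0}


@dataclass(frozen=True)
class Task:
    title: str
    priority: int  # assumed scale: 1 (low) to 5 (high)


def adjust_priority(task: Task, emotion: str, max_priority: int = 5) -> Task:
    """Return a copy of the task with its priority raised for the given emotion."""
    boost = EMOTION_PRIORITY_BOOST.get(emotion, 0)  # unknown emotions change nothing
    return Task(task.title, min(task.priority + boost, max_priority))


event_task = adjust_priority(Task("Prepare new event", 2), "anxiety")
```

Capping the result at `max_priority` keeps repeated adjustments from pushing a task outside the assumed scale.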

[0607] An example of a prompt sentence using a generative AI model is "a method for analyzing the sentiment behind statements made at a residents' meeting and using that information to help with decision-making at the meeting."

[0608] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0609] Step 1:

[0610] The terminal captures the audio and video data of the meeting. Using the built-in microphone and camera, it records participants' speech and facial expressions in real time. To optimize resource usage, the data is compressed into an appropriate format before being transferred to the server. The input is the meeting's audio and video, and the output is raw data.

[0611] Step 2:

[0612] The server converts the received audio data into text data using a speech recognition engine. For example, using Google Speech-to-Text, this data is converted into text format. The input is audio data, and the output is the corresponding text data.

[0613] Step 3:

[0614] The server processes the text data using natural language processing to extract important information. It uses natural language processing libraries such as spaCy and NLTK to extract key points from the spoken content for creating the meeting minutes. The input is text data, and the output is the important information necessary for the meeting minutes.

[0615] Step 4:

[0616] The server receives video data from the terminal's camera and analyzes it using an emotion recognition API. Microsoft Azure's facial recognition API is used to identify emotional states from participants' facial expressions. The input is video data, and the output is information about the detected emotions.

[0617] Step 5:

[0618] The server incorporates the sentiment analysis results into the meeting minutes, adjusting the content to reflect the tone and emotional impact of the statements. Based on the sentiment recognition results, the system processes the minutes to emphasize key points and adjust their tone. The input is the important information extracted for the meeting minutes together with the sentiment information, and the output is the sentiment-adjusted meeting minutes.

[0619] Step 6:

[0620] The server automatically detects tasks based on meeting minutes, sets their priorities, and adds them to the task management system. The system sends information to a project management interface to help visualize tasks and manage their progress. The input is the edited meeting minutes, and the output is a set of prioritized tasks.

[0621] Step 7:

[0622] Users receive meeting minutes and sentiment data generated in real time from the server, and then use this data for subsequent meeting follow-up and task management. Participants can view and edit information through the user interface. Inputs are meeting minutes and sentiment information from the server, while outputs are the displayed content and task progress information provided to the user.
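One way to picture Step 5 above is to attach to each transcribed statement the emotion detected while it was being spoken. The sketch below assumes that both the transcript and the emotion results carry timestamps, which the specification does not state. The names `Utterance`, `EmotionSample`, and `annotate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: float  # seconds from the start of the meeting (assumed)
    end: float
    text: str


@dataclass
class EmotionSample:
    time: float  # seconds from the start of the meeting (assumed)
    label: str


def annotate(utterances: List[Utterance], samples: List[EmotionSample]) -> List[str]:
    """Tag each utterance with the most frequent emotion observed during it."""
    lines = []
    for u in utterances:
        labels = [s.label for s in samples if u.start <= s.time <= u.end]
        if labels:
            # Most frequent label; ties go to the label seen first.
            dominant = max(dict.fromkeys(labels), key=labels.count)
            lines.append(f"{u.text} [emotion: {dominant}]")
        else:
            lines.append(u.text)  # no emotion data for this statement
    return lines


annotated = annotate(
    [
        Utterance(0.0, 5.0, "I'm worried about holding a new event."),
        Utterance(5.5, 8.0, "Let's review the budget."),
    ],
    [EmotionSample(1.0, "anxiety"), EmotionSample(3.0, "anxiety"), EmotionSample(4.0, "neutral")],
)
```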

[0623] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0624] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0625] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0626] [Fourth Embodiment]

[0627] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0628] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0629] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0630] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0631] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0632] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0633] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0634] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0635] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0636] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0637] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0638] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0639] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0640] The system according to the present invention is configured to automate the creation of meeting minutes from meeting recordings and task management. This system mainly consists of a server, terminals, and users.

[0641] First, the terminal performs voice input at the start of the meeting. This voice input is collected using the terminal's built-in microphone and sent to the server in real time. The voice data sent to the server is immediately converted into text data through speech recognition. This converted text data is stored in temporary storage on the server.

[0642] Next, the server analyzes the stored text data using natural language processing. This extracts important information and keywords from the meeting content. In particular, it is designed to automatically identify decisions and key topics. This extracted information is then organized by a meeting minutes creation system and automatically generated as formatted meeting minutes.

[0643] Furthermore, the server automatically detects tasks from the generated meeting minutes. The task management system analyzes the deadlines and responsible persons associated with these tasks and records their progress. The user is notified of the task list and can then take further action based on it.

[0644] As a concrete example, if a participant says "We need to prepare the materials before the next meeting" during a meeting, the terminal sends this statement as audio data to the server. The server uses speech recognition to convert "We need to prepare the materials before the next meeting" into text data, and then uses natural language processing to extract the task "Prepare materials" from this statement. The task management system lists this task and notifies the user along with related information. In this way, tasks are managed efficiently even after the meeting, enabling smooth information sharing and follow-up.

[0645] In this way, the present invention aims to streamline administrative tasks in meetings and provide an environment where participants can focus on more important discussions.
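The concrete example above, in which "We need to prepare the materials before the next meeting" yields the task "Prepare materials", can be approximated with a small rule-based extractor. This is only a sketch under the assumption that tasks are phrased as "we need to ..." or "we have to ...". The specification leaves the extraction method open, and the names `extract_task`, `_TRIGGER`, and `_ARTICLES` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed trigger phrasing: "<we|you|I> need/have to <action> [before|by <deadline>]".
_TRIGGER = re.compile(
    r"^(?:we|you|i)\s+(?:need|have)\s+to\s+(?P<action>.+?)"
    r"(?:\s+(?:before|by)\s+(?P<deadline>.+?))?[.!]?$",
    re.IGNORECASE,
)
_ARTICLES = re.compile(r"\b(?:the|a|an)\s+", re.IGNORECASE)


def extract_task(sentence: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (task title, deadline phrase) or None if no task is detected."""
    match = _TRIGGER.match(sentence.strip())
    if match is None:
        return None
    # Drop articles and capitalize to get a short task title.
    action = _ARTICLES.sub("", match.group("action"))
    title = action[0].upper() + action[1:]
    return title, match.group("deadline")


task = extract_task("We need to prepare the materials before the next meeting")
# task == ("Prepare materials", "the next meeting")
```

A production system would more plausibly rely on a trained language model or a dependency parser than on a fixed pattern. The regular expression merely makes the input and output of the extraction step concrete.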

[0646] The following describes the processing flow.

[0647] Step 1:

[0648] The terminal starts recording audio at the beginning of the meeting. The terminal uses its built-in microphone to capture participants' speech in real time and stores it as audio data.

[0649] Step 2:

[0650] After the terminal buffers a certain amount of audio data, it sends this audio data to the server in packet format. Transmission is performed periodically in real time to minimize data delay.

[0651] Step 3:

[0652] The server passes the received audio data to the speech recognition engine. The speech recognition engine analyzes the audio data and converts it into text data. This process is continuous, generating text data in real time during the meeting.

[0653] Step 4:

[0654] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts context and keywords from the text data, identifying important information from the meeting. This information is then organized into key points to focus on.

[0655] Step 5:

[0656] The server automatically generates meeting minutes based on important information. The meeting minutes creation module organizes the information in an appropriate format and saves it as a meeting record.

[0657] Step 6:

[0658] The server automatically extracts tasks using the generated meeting minutes and registers them in the task management database. Tasks can be automatically assigned deadlines and responsible parties, and this information is stored in a structured format.

[0659] Step 7:

[0660] Users receive meeting minutes and task lists from the server via their terminals. They use this information to manage post-meeting tasks and track task progress.

[0661] Step 8:

[0662] Users can access the task management system through an interface to update task progress. Users can add progress information and notify other stakeholders.
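Steps 6 and 8 above, registering extracted tasks in a task management database and later updating their progress, can be sketched with Python's standard `sqlite3` module. The table layout, the column names, and the functions `open_task_db`, `register_task`, and `update_status` are assumptions made for illustration. The specification does not describe a schema.

```python
import sqlite3
from typing import Optional


def open_task_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (assumed) tasks table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               title TEXT NOT NULL,
               assignee TEXT,
               deadline TEXT,
               status TEXT NOT NULL DEFAULT 'open'
           )"""
    )
    return conn


def register_task(
    conn: sqlite3.Connection,
    title: str,
    assignee: Optional[str] = None,
    deadline: Optional[str] = None,
) -> int:
    """Step 6: store a task in structured form and return its id."""
    cur = conn.execute(
        "INSERT INTO tasks (title, assignee, deadline) VALUES (?, ?, ?)",
        (title, assignee, deadline),
    )
    conn.commit()
    return cur.lastrowid


def update_status(conn: sqlite3.Connection, task_id: int, status: str) -> None:
    """Step 8: record progress reported by a user."""
    conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))
    conn.commit()
```

Parameterized queries (`?` placeholders) are used so that task titles taken from free-form speech cannot alter the SQL statement.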

[0663] (Example 1)

[0664] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0665] Traditional meeting record-keeping and task management processes lack automation, resulting in cumbersome and time-consuming post-meeting processing. In particular, extracting key information from discussions, quickly creating meeting minutes, and efficiently managing related tasks is challenging.

[0666] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0667] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important information from the generated text information, and record creation means for automatically generating meeting minutes based on the extracted important information. This makes it possible to streamline post-meeting task management and provide an environment in which participants can concentrate on important discussions.

[0668] "Audio information" refers to audio data, including spoken words during a meeting.

[0669] "Textual information" refers to text data converted from audio information.

[0670] "Speech recognition means" refers to technical means that convert speech information into text information, and includes the process of analyzing acoustic signals and generating corresponding text.

[0671] "Natural language processing means" refers to technologies for extracting important information from textual data, and includes the process of understanding and structuring the meaning and context of the text.

[0672] "Record creation means" refers to a method for automatically generating meeting minutes based on extracted important information, organizing the information and outputting it in a predetermined format.

[0673] A "work item" refers to a specific task or action decided upon in a meeting, accompanied by relevant information such as the person who will perform it and the deadline.

[0674] "Work management means" refers to methods for handling detected work items and managing their processing and progress.

[0675] "Analysis of relevant deadlines and responsible persons" involves analyzing the deadlines set for work items and the responsible persons involved through information processing, and utilizing this information in the management process.

[0676] This invention is a system that efficiently creates meeting minutes by recording audio information from meetings and converting it into text information, and also automatically detects and manages work items. The system mainly consists of a server, terminals, and users.

[0677] The terminal inputs audio information during a meeting using its built-in microphone and transmits it to the server via its communication function. The basic hardware used in the terminal includes a standard microphone and a computer with network connectivity.

[0678] The server receives audio information transmitted from the terminal and converts it into text using speech recognition software. This process utilizes, for example, a "speech analysis API" as speech recognition software. After the audio information is converted to text, the server analyzes the text using natural language processing techniques. Here, software such as a "natural language processing library" is used to extract important information.

[0679] Next, the server uses the recording mechanism to format the meeting minutes based on the extracted information. The minutes are stored in a database and made available to users for access as needed.
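The formatting of minutes from extracted information could, for instance, be done by filling a fixed template. The sketch below uses Python's standard `string.Template`. The template layout, the dictionary keys, and the names `MINUTES_TEMPLATE` and `format_minutes` are assumptions and are not taken from the specification.

```python
from string import Template
from typing import Dict, List

# Assumed layout of the minutes; the source does not define a format.
MINUTES_TEMPLATE = Template(
    "Meeting minutes: $title\n"
    "Date: $date\n"
    "\n"
    "Key points:\n$key_points\n"
    "\n"
    "Decisions:\n$decisions\n"
)


def _bullets(items: List[str]) -> str:
    """Render a list as bullet lines, with a placeholder when it is empty."""
    return "\n".join(f"- {item}" for item in items) if items else "- (none)"


def format_minutes(structured: Dict) -> str:
    """Fill the template from structured extraction results."""
    return MINUTES_TEMPLATE.substitute(
        title=structured["title"],
        date=structured["date"],
        key_points=_bullets(structured.get("key_points", [])),
        decisions=_bullets(structured.get("decisions", [])),
    )


minutes_text = format_minutes(
    {
        "title": "Weekly sync",
        "date": "2025-01-10",
        "key_points": ["Prepare for the presentation"],
    }
)
```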

[0680] Furthermore, the server detects work items from meeting records and manages these items using work management tools. Specifically, it analyzes deadlines and responsible parties and tracks progress using task management tool APIs. For example, the "Project Management Platform API" can be used.
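Sending a detected work item to a task management tool over an API might look like the following sketch, which only builds the HTTP request and does not send it. The endpoint URL, the JSON field names, and the function `build_work_item_request` are hypothetical. An actual project management platform defines its own request format and authentication.

```python
import json
from typing import Optional
from urllib.request import Request


def build_work_item_request(
    endpoint: str,
    title: str,
    assignee: Optional[str] = None,
    deadline: Optional[str] = None,
) -> Request:
    """Build a JSON POST request describing one work item (assumed payload shape)."""
    payload = {"title": title, "assignee": assignee, "deadline": deadline}
    return Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint used for illustration only.
request = build_work_item_request(
    "https://tasks.example.com/api/items",
    "Prepare for the presentation",
    deadline="2025-01-20",
)
```

The request could then be passed to `urllib.request.urlopen` once a real endpoint and credentials are available.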

[0681] Users can receive real-time notifications for generated meeting minutes and the progress of work items. Based on this, users can plan their next actions and perform their work efficiently.

[0682] For example, if a participant says "Please prepare for the next presentation" during a meeting, the terminal sends this audio to the server. The server converts it into text information, "Please prepare for the next presentation," and uses natural language processing to detect the task item "Prepare for the presentation."

[0683] An example of a prompt message is, "Convert the meeting audio log to text and extract specific work items." In this way, the system can automate meeting information management and dramatically improve work efficiency.

[0684] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0685] Step 1:

[0686] The terminal uses its built-in microphone to input voice information as soon as the meeting starts. The input voice is converted into digital data and temporarily stored in the terminal's memory. The converted digital voice data is then sent sequentially to the server as data chunks.

[0687] Step 2:

[0688] The server receives the digital audio data transmitted from the terminal and converts it into text information using a speech analysis API. In the speech-to-text conversion, the acoustic signal undergoes phoneme analysis, and a continuous speech recognition model generates the text. The resulting raw text data is stored in the server's temporary storage.

[0689] Step 3:

[0690] The server analyzes the stored raw text data using a natural language processing library. During this analysis process, named entities and important information are extracted from the text, and key points and decisions are identified using predefined rules and machine learning models. The extracted information is then output in a structured format as the analysis result.

[0691] Step 4:

[0692] The server generates meeting minutes using recording mechanisms based on structured information. In this process, information is organized according to a template and output as formatted meeting minutes. The generated meeting minutes are stored in a database for later access by users.

[0693] Step 5:

[0694] The server applies work management tools to detect new work items from meeting records. Information about the detected work items is sent to the task management system via API, and a work item list is generated. At this time, deadlines and responsible person information are automatically added to the items.

[0695] Step 6:

[0696] After a meeting ends, users receive a generated task list via real-time notification. Upon receiving the notification, users can review the details through their task management tool and take action to manage the progress of their tasks. This ensures that individual work items are efficiently tracked and completed.
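Step 1 above has the terminal hold digitized audio in memory and send it to the server sequentially as data chunks. A minimal sketch of that buffering is shown below, assuming fixed-size chunks, which the specification does not specify. The name `packetize` and the `packet_size` parameter are hypothetical.

```python
from typing import Iterable, Iterator


def packetize(frames: Iterable[bytes], packet_size: int) -> Iterator[bytes]:
    """Buffer incoming audio frames and yield chunks of packet_size bytes."""
    buffer = bytearray()
    for frame in frames:
        buffer.extend(frame)
        # Emit every full chunk that has accumulated so far.
        while len(buffer) >= packet_size:
            yield bytes(buffer[:packet_size])
            del buffer[:packet_size]
    if buffer:
        yield bytes(buffer)  # flush the remainder when the input ends


chunks = list(packetize([b"abc", b"de", b"fghij"], packet_size=4))
# chunks == [b"abcd", b"efgh", b"ij"]
```

Because `packetize` is a generator, each chunk can be handed to the network layer as soon as it is complete rather than after the whole meeting has been recorded.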

[0697] (Application Example 1)

[0698] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0699] In modern brick-and-mortar stores, efficiently recording the content of meetings and discussions and managing tasks based on that information is essential. However, doing this manually is typically time-consuming, labor-intensive, and prone to errors. Furthermore, the lack of mechanisms for immediate on-site information sharing and for visualizing task progress makes it difficult to optimize operations. A system is needed that solves these problems and allows stores to run their operations efficiently.

[0700] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0701] In this invention, the server includes speech recognition means for receiving audio information and converting it into text information, natural language processing means for extracting important data from the generated text information, and record creation means for automatically generating meeting documents based on the extracted important data. This enables the content of meetings to be recorded in real time, facilitating efficient task management and information sharing.

[0702] "Audio information" refers to information used to record and process audio in a digital format.

[0703] "Textual information" refers to text data expressed in digital format, specifically text converted from audio information.

[0704] "Speech recognition means" refers to technology that analyzes speech information and converts it into text information.

[0705] "Natural language processing means" refers to language processing techniques used to analyze textual information and extract important data from it.

[0706] "Record creation means" refers to the means for generating meeting documents based on extracted data, and for formalizing the documents.

[0707] "Work items" refer to specific actions or tasks that are decided or presented in meetings or discussions.

[0708] "Controlling tasks" refers to managing work items, monitoring their progress, and making appropriate adjustments.

[0709] "Business support" refers to back-office support activities aimed at improving the operational efficiency of stores.

[0710] "Business documents" refer to documents that organize and record information related to business operations.

[0711] "Business documents derived from audio" are recorded documents generated based on audio information, and reflect the content of the meeting.

[0712] "Methods for optimizing business progress" refer to techniques for organizing information and clarifying tasks in order to operate business efficiently.

[0713] The system that realizes this invention mainly consists of a server, a terminal, and a user. First, the terminal collects audio information at the start of a meeting or discussion. The terminal acquires audio information using its built-in microphone, and this audio information is transmitted to the server in real time.

[0714] Next, the server processes the received audio information. The server uses speech recognition technology to convert the audio information into text. For this process, for example, a speech recognition service provided by AWS can be used. The converted text information is temporarily stored in a database on the server.

[0715] Subsequently, the server analyzes the textual information using natural language processing technology and extracts important data. Natural language processing tools such as Amazon Comprehend can be utilized for this process. The extracted data is organized by a recording system and automatically formatted as a meeting document.

[0716] Next, the server automatically detects and manages work items based on the meeting documents. Work items are organized with their associated deadlines and assignee information, and their progress is tracked. The user is notified of managed tasks in real time and can take appropriate action. Tools such as the Trello API can be used for task management.

[0717] For example, if someone says during a meeting at a store, "We need to prepare the product display for the next sale," the terminal sends this statement as audio information to the server. The server converts it into text using speech recognition and extracts the task item "Prepare product display" using natural language processing. The task item is then assigned an appropriate person and deadline, and the user is notified.

[0718] An example of a prompt message would be, "Please upload the audio file, summarize the key points of the meeting, and create a list of tasks to complete before the next meeting."

[0719] In this way, this invention achieves improved operational efficiency and optimized information sharing in physical stores.

[0720] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0721] Step 1:

[0722] The terminal collects audio information as soon as a meeting or discussion begins. The terminal's built-in microphone captures ambient sound, and the recorded audio data is transmitted to the server in real time. The input is raw audio information, while the output is digital audio data transmitted to the server.

[0723] Step 2:

[0724] The server converts received audio data into text information using speech recognition technology. Specifically, the server analyzes the audio data using a speech recognition API and generates text data. The input is digital audio data, and the output is the corresponding text information.

[0725] Step 3:

[0726] The server analyzes the generated text information using natural language processing techniques and extracts important data. Here, the natural language processing engine identifies key points and decisions from the text information. The input is text information, and the output is the extracted important data.

[0727] Step 4:

[0728] The server automatically generates meeting documents using a record-keeping system based on the extracted key data. Here, the server applies formatting rules and outputs well-formed documents. The input is key data, and the output is a formatted meeting document.

[0729] Step 5:

[0730] The server automatically detects and manages work items based on the generated meeting documents. The server interprets the context of the meeting documents and registers the detected work items in the task management system. The input is the meeting documents, and the output is a list of detected work items.

[0731] Step 6:

[0732] The user checks the progress of tasks by referring to a list of work items registered by the server in the task management system. The user sets actions for each work item as needed. The input is a list of work items, and the output is the updated status after user confirmation.

[0733] Step 7:

[0734] The server manages the progress of updated work items and notifies the user in real time. The input is the updated data, and the output is the information notified to the user.

[0735] This series of processes enables efficient recording and management of information from meetings and discussions at physical stores.

[0736] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0737] This invention relates to a system that processes audio and emotions in a meeting in real time and uses this information to create meeting records and manage tasks. The system consists of a server, terminals, and users, and also supports emotion recognition.

[0738] First, the device captures the audio and video of the meeting. This is done using the built-in microphone and camera, recording participants' statements and facial expressions in real time. This data is sent to a server, where the audio data is converted into text data, and the video data is analyzed by an emotion engine.

[0739] The server converts the received audio data into text data using a speech recognition engine. This text data is then subjected to natural language processing to extract important information. Based on this information, meeting minutes are automatically generated.

[0740] Here, the emotion engine installed on the server analyzes the user's emotional state from the video data. This emotional information is recorded along with the meeting content and used to analyze changes in emotion and tone during important discussions.

[0741] As a concrete example, when someone says "This proposal involves risks" during a meeting, the device sends the audio to the server, where a speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and an emotion engine detects emotions such as "anxiety" or "concern." The server combines this information to create meeting minutes that include the participants' emotions in relation to the context of the statement.
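The combination of a transcribed statement with the emotions detected at the same moment could be sketched as follows. The timestamp fields, the class names `Utterance` and `EmotionSample`, and the rule of matching by time span are assumptions made for illustration. The disclosure states only that the server combines the two kinds of information.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    start: float  # seconds from the start of the meeting (assumed field)
    end: float
    text: str
    emotions: list[str] = field(default_factory=list)


@dataclass
class EmotionSample:
    time: float  # seconds from the start of the meeting (assumed field)
    label: str


def attach_emotions(utterances: list[Utterance], samples: list[EmotionSample]) -> None:
    """Attach each emotion label to the utterance whose time span contains it."""
    for sample in samples:
        for utterance in utterances:
            if utterance.start <= sample.time <= utterance.end:
                if sample.label not in utterance.emotions:
                    utterance.emotions.append(sample.label)
                break  # a sample belongs to at most one utterance


def render_minutes(utterances: list[Utterance]) -> str:
    """Render one line per utterance, with detected emotions in brackets."""
    lines = []
    for utterance in utterances:
        note = f" [emotion: {', '.join(utterance.emotions)}]" if utterance.emotions else ""
        lines.append(f"{utterance.text}{note}")
    return "\n".join(lines)


utterances = [
    Utterance(0.0, 2.5, "This proposal involves risks."),
    Utterance(3.0, 5.0, "Let us review the budget."),
]
samples = [
    EmotionSample(1.0, "anxiety"),
    EmotionSample(2.0, "concern"),
    EmotionSample(2.2, "anxiety"),  # duplicate label, recorded once
    EmotionSample(9.0, "joy"),      # outside every utterance, ignored
]
attach_emotions(utterances, samples)
minutes = render_minutes(utterances)
print(minutes)
```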

[0742] Furthermore, the server automatically extracts tasks from the meeting minutes, sets deadlines, and assigns responsible persons using task management tools. Emotion information may also be used to evaluate the priority and urgency of tasks. Based on this information, users plan post-meeting follow-ups and manage projects.
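One possible way to let detected emotions influence task priority is sketched below. The weights in `EMOTION_WEIGHT`, the `Task` fields, and the scoring rule are hypothetical. The disclosure says only that emotion information may be used to evaluate priority and urgency, without giving a formula.

```python
from dataclasses import dataclass

# Hypothetical weights for how strongly each emotion raises urgency.
EMOTION_WEIGHT = {"anxiety": 2, "concern": 1}


@dataclass
class Task:
    title: str
    assignee: str
    deadline: str  # ISO 8601 date, so string order equals date order
    base_priority: int = 1


def priority_score(task: Task, emotions: list[str]) -> int:
    """Base priority plus the weight of every emotion tied to the task."""
    return task.base_priority + sum(EMOTION_WEIGHT.get(e, 0) for e in emotions)


def rank_tasks(pairs: list[tuple[Task, list[str]]]) -> list[Task]:
    """Order tasks by score (highest first), breaking ties by earlier deadline."""
    ordered = sorted(pairs, key=lambda p: (-priority_score(p[0], p[1]), p[0].deadline))
    return [task for task, _ in ordered]


pairs = [
    (Task("Update schedule", "Sato", "2025-02-10"), []),
    (Task("Review proposal risks", "Suzuki", "2025-02-20"), ["anxiety", "concern"]),
]
ranked = rank_tasks(pairs)
print([t.title for t in ranked])
```

An additive score keeps the effect of each emotion easy to inspect. Other schemes, such as multiplying by an intensity value, would fit the same description equally well.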

[0743] This system enables comprehensive meeting recording and task management, including emotional shifts, supporting effective decision-making that takes into account the psychological nuances of participants.

[0744] The following describes the processing flow.

[0745] Step 1:

[0746] The device begins capturing audio and video at the start of the meeting. It uses the built-in microphone to record participants' speech and the camera to continuously record their facial expressions. This data is transmitted to the server in real time.

[0747] Step 2:

[0748] The server receives the audio data and passes it to the speech recognition engine, which converts it into text data. This conversion is performed sequentially, and the process of generating text data in real time continues throughout the meeting.

[0749] Step 3:

[0750] The server analyzes the generated text data using a natural language processing algorithm. The algorithm extracts important information and keywords from the text, identifying key parts of the meeting.

[0751] Step 4:

[0752] The server inputs video data collected during the meeting into the emotion engine. The emotion engine analyzes the video, identifies the emotional state of the participants from their facial expressions and tone of voice, and records the results in a database.

[0753] Step 5:

[0754] The server automatically generates meeting minutes by combining extracted key information and sentiment data. The minutes include not only the content of the statements but also the sentiment information of the participants at that time.

[0755] Step 6:

[0756] The server detects tasks from the meeting minutes and registers them in the task management system. Emotion information may be considered as a factor in determining the importance and priority of tasks.

[0757] Step 7:

[0758] Users receive meeting minutes and task lists sent from the server via their terminals. Based on the information provided, users can manage post-meeting follow-ups and track the progress of each task.

[0759] Step 8:

[0760] Users update task progress within the system and add comments and feedback as needed. This ensures that progress is always kept up-to-date and that stakeholders are automatically notified.
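The automatic notification of stakeholders in Step 8 could be realized, for example, with a simple subscriber list. The `TaskBoard` class and its method names below are assumptions for illustration and do not correspond to any particular task management system named in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TrackedTask:
    title: str
    status: str = "open"
    comments: list[str] = field(default_factory=list)


class TaskBoard:
    """Keeps task state and tells every subscriber about each update."""

    def __init__(self) -> None:
        self._tasks: dict[str, TrackedTask] = {}
        self._listeners: list[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def add(self, title: str) -> None:
        self._tasks[title] = TrackedTask(title)

    def update(self, title: str, status: str, comment: Optional[str] = None) -> None:
        task = self._tasks[title]
        task.status = status
        if comment:
            task.comments.append(comment)
        message = f"'{title}' is now {status}"
        for listener in self._listeners:
            listener(message)  # notify each stakeholder of the change


inbox: list[str] = []
board = TaskBoard()
board.subscribe(inbox.append)  # a stakeholder modeled as a message list
board.add("Review proposal risks")
board.update("Review proposal risks", "in progress", "Started the review")
print(inbox)
```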

[0761] (Example 2)

[0762] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0763] Traditional meeting recording systems simply convert audio data into text data to create minutes, failing to consider the emotions of participants or the atmosphere of the discussion during the meeting. Furthermore, it was difficult to properly prioritize tasks generated during the meeting and to manage them efficiently.

[0764] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0765] In this invention, the server includes a speech conversion means for receiving audio information and converting the audio information into text information, a natural language processing means for extracting important elements from the generated text information, and an emotion recognition means for analyzing emotional states from video information. This makes it possible to create detailed meeting minutes that take into account the emotions of participants during a meeting, and to prioritize tasks based on those emotional states.

[0766] "Speech conversion means" refers to a device or function that receives speech information and converts that speech information into text information.

[0767] "Natural language processing means" refers to techniques or algorithms that extract important elements from generated textual information and analyze their meaning and context.

[0768] "Meeting record creation means" refers to a system or method that automatically creates meeting minutes based on the extracted key elements.

[0769] "Emotion recognition means" refers to a technology or system that identifies an emotional state by analyzing a participant's facial expressions and movements based on video information.

[0770] "Work management means" refers to a method or system that automatically detects work items based on generated meeting records and manages and organizes them.

[0771] This invention is a system designed to streamline information processing in meetings. Specific embodiments are described below.

[0772] First, the device captures the audio and video of the meeting in real time. This is done using the built-in microphone and camera, acquiring audio data in WAV format and video data in MP4 format. This data is then sent to the server in streaming format.
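Sending WAV audio to the server as a stream implies splitting it into small chunks. A minimal sketch using Python's standard `wave` module follows. The 100 ms chunk length, the function names, and the silent test signal are assumptions, and the actual network transport is omitted because the disclosure does not specify one.

```python
import io
import wave
from typing import Iterator


def iter_wav_chunks(wav_source, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield raw PCM chunks of about chunk_ms milliseconds from a WAV source."""
    with wave.open(wav_source, "rb") as wav:
        frames_per_chunk = max(1, wav.getframerate() * chunk_ms // 1000)
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:  # empty bytes means the end of the audio data
                break
            yield chunk


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


# 0.25 s at 16 kHz is 4000 frames: two full 1600-frame chunks and one of 800.
chunks = list(iter_wav_chunks(make_silent_wav(0.25)))
print(len(chunks), [len(c) for c in chunks])
```

Each yielded chunk would then be handed to whatever streaming channel connects the terminal and the server.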

[0773] The server uses a speech converter to process the received audio data. Specifically, it utilizes a speech recognition engine to convert audio into text data. This engine is based on APIs commonly used in speech recognition software. Next, the generated text information is analyzed using natural language processing techniques to extract important keywords and utterances. This process uses a natural language processing engine and organizes the information using a language model.

[0774] Meanwhile, the server receives the video data and uses an emotion recognition system to analyze the emotional state of the participants. This system utilizes emotion recognition technologies such as facial recognition APIs. The analyzed emotion data is then used for creating meeting minutes and task management.

[0775] The meeting minutes are generated automatically from the processed data and reflect the detected emotional states, resulting in a summary that makes it easy to understand the atmosphere of the meeting and the tone of the participants. Furthermore, the server automatically extracts tasks from the meeting minutes and sets schedules and priorities using a task management system.

[0776] Users can view meeting minutes and task lists generated on their devices. They can also plan post-meeting follow-ups and manage projects as needed.

[0777] As a concrete example, when someone says "This proposal involves risks" during a meeting, the device sends the audio to the server, and the speech recognition engine transcribes it as "This proposal involves risks." Simultaneously, the camera captures the speaker's facial expressions, and the emotion recognition engine can analyze emotions such as "anxiety" or "concern." This information is then used to generate meeting minutes that take into account the emotions of the participants.

[0778] As an example of a prompt, the user can enter the instruction, "Based on the meeting audio and sentiment data, propose an action plan for the next meeting using automatically generated meeting minutes." This will trigger an automated proposal based on the generative AI model.
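How the user's instruction might be combined with the generated minutes and emotion data before being passed to a generative AI model is sketched below. The function `build_prompt` and the layout of the prompt are assumptions. The call to the model itself is omitted because the disclosure does not specify an interface.

```python
INSTRUCTION = (
    "Based on the meeting audio and sentiment data, propose an action plan "
    "for the next meeting using automatically generated meeting minutes."
)


def build_prompt(minutes: str, emotion_counts: dict[str, int]) -> str:
    """Join the instruction, the minutes, and an emotion summary into one prompt."""
    emotion_lines = "\n".join(
        f"- {label}: {count}" for label, count in sorted(emotion_counts.items())
    )
    return (
        f"{INSTRUCTION}\n\n"
        f"Meeting minutes:\n{minutes}\n\n"
        f"Detected emotions:\n{emotion_lines}"
    )


prompt = build_prompt(
    "This proposal involves risks.",
    {"concern": 1, "anxiety": 2},
)
print(prompt)
```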

[0779] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0780] Step 1:

[0781] The terminal activates its built-in microphone and camera at the start of the meeting. Inputs include audio and video from the meeting room. This data is captured as WAV audio files and MP4 video files. The terminal collects this data in real time and sends it to the server via streaming.

[0782] Step 2:

[0783] The server uses a speech conversion system with audio data received from the terminal as input. Specifically, it activates a speech recognition engine and converts the audio data into text data. This process outputs the spoken content as text, which is then used for subsequent analysis.

[0784] Step 3:

[0785] The server uses a natural language processing system with text data as input. It leverages a natural language processing engine to extract important keywords and content from the text data. This extracted information becomes the output, forming the basis for creating meeting minutes. Furthermore, a language model is applied to understand the context of the text.

[0786] Step 4:

[0787] The server uses an emotion recognition system with the video data received from the terminal as input. The emotion recognition engine analyzes the emotional state of participants based on their facial expressions. The results of this analysis are output as emotion data, which is used for meeting minutes and task management.

[0788] Step 5:

[0789] The server launches the meeting minutes creation system with the extracted text information and emotion data as input, and the meeting minutes creation engine automatically generates the meeting minutes. The output is a comprehensive set of meeting minutes that reflects both the content of the meeting and the emotional tone of the participants.

[0790] Step 6:

[0791] The server uses the generated meeting minutes as input and utilizes a task management system. The task management engine automatically extracts tasks from the meeting minutes and sets their priorities and schedules. The output is compiled into a task list and provided to the user.

[0792] Step 7:

[0793] Users review the generated meeting minutes and task lists using their devices. Based on the output information, they plan meeting follow-ups and manage projects. Specifically, they adjust task deadlines, assign responsibilities, and monitor progress. This enables efficient work execution.

[0794] (Application Example 2)

[0795] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0796] In meetings, it is necessary to effectively record and manage participants' statements and the resulting emotional changes in real time to improve the quality of task management and decision-making. However, conventional meeting recording systems cannot take emotional changes into account, making it difficult to implement effective project management that takes into account the psychological nuances of participants.

[0797] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0798] In this invention, the server includes speech recognition means for receiving audio data and converting it into text data, natural language processing means for extracting important information, and emotion recognition means for analyzing the emotional state of participants from video data. This makes it possible to record and manage tasks while taking into account changes in participants' emotions during the progress of the meeting.

[0799] "Speech recognition means" refers to technology that receives speech data and converts that speech into text data.

[0800] "Natural language processing means" refers to technologies that extract important information from generated text data and perform information analysis and understanding.

[0801] "Meeting minutes creation means" refers to technology that automatically generates documents recording the content of a meeting based on important information.

[0802] "Task management means" refers to technology that organizes and manages tasks automatically detected based on meeting minutes.

[0803] "Emotion recognition means" refers to technology that analyzes the emotional state of participants based on video data and understands their situation.

[0804] "Means for adjusting meeting records based on emotion analysis results" refers to technology that uses emotion analysis results to adjust the recorded meeting content and its interpretation.

[0805] The system implementing this invention mainly consists of a server, a terminal, and a user.

[0806] The server converts speech data into text data using a speech recognition engine. Specifically, services such as Google Speech-to-Text or AWS Transcribe can be used. Important information is extracted from this text data using natural language processing tools, and it becomes the basic data for creating meeting minutes. Libraries such as SpaCy and NLTK are used for natural language processing.
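The extraction of important information from the transcript can be illustrated with a frequency-based sentence ranking. The sketch below deliberately uses only the Python standard library as a stand-in. It does not call SpaCy, NLTK, or any speech recognition service, and the stopword list, the scoring rule, and the function names are assumptions.

```python
import re
from collections import Counter

# A tiny hypothetical stopword list; NLP libraries ship far larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "we", "it", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def key_sentences(transcript: str, top_n: int = 2) -> list[str]:
    """Return the top_n sentences by average word frequency, in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", transcript.strip()) if s]
    frequency = Counter(tokenize(transcript))

    def score(sentence: str) -> float:
        tokens = tokenize(sentence)
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    best = sorted(sentences, key=score, reverse=True)[:top_n]
    return [s for s in sentences if s in best]


transcript = (
    "The event budget needs approval. "
    "The budget covers the spring event. "
    "Lunch was late."
)
important = key_sentences(transcript)
print(important)
```

Sentences that repeat the most frequent content words ("event", "budget") rank highest, so the unrelated remark about lunch is dropped.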

[0807] Furthermore, the server receives video data collected by the built-in camera and performs emotion recognition using, for example, Microsoft Azure's facial recognition API. This allows for real-time analysis of participants' emotional states, and the meeting record is adjusted based on the results.

[0808] The terminal uses its built-in microphone and camera to capture audio and video of meetings and transmits this data to the server. Through this system, users can receive meeting transcripts and sentiment analysis results generated in real time, which can be used for task management and project decision-making.

[0809] As a concrete example of this system, during a residents' meeting for a local shopping district revitalization project, a comment was made: "I'm worried about holding a new event." The system analyzes the speaker's facial expression from the captured video footage along with this comment, interpreting it as "anxiety," and adjusts the priority of related event preparation tasks accordingly.
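The adjustment described in this example, where an anxious comment raises the priority of related tasks, could look like the following. Relating a task to a comment by shared words, the set `RAISING_EMOTIONS`, and the increment of one are all assumptions for illustration, as are the task titles.

```python
import re
from dataclasses import dataclass

# Hypothetical set of emotions that raise the priority of related tasks.
RAISING_EMOTIONS = {"anxiety", "concern"}


@dataclass
class PrepTask:
    title: str
    priority: int  # a larger value means more urgent


def content_words(text: str) -> set[str]:
    """Lowercased words longer than three letters, a crude relatedness signal."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def adjust_priorities(tasks: list[PrepTask], comment: str, emotion: str) -> None:
    """Raise by one the priority of tasks sharing a content word with the comment."""
    if emotion not in RAISING_EMOTIONS:
        return
    topic = content_words(comment)
    for task in tasks:
        if content_words(task.title) & topic:
            task.priority += 1


tasks = [
    PrepTask("Book event venue", 1),
    PrepTask("Print event flyers", 1),
    PrepTask("Repair street lights", 1),
]
adjust_priorities(tasks, "I'm worried about holding a new event.", "anxiety")
print([(t.title, t.priority) for t in tasks])
```

Only the two tasks whose titles mention "event" are raised. The unrelated repair task keeps its original priority.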

[0810] An example of a prompt sentence using a generative AI model is "a method for analyzing the sentiment behind statements made at a residents' meeting and using that information to help with decision-making at the meeting."

[0811] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0812] Step 1:

[0813] The device captures the audio and video data of the meeting. Using the built-in microphone and camera, it records participants' speech and facial expressions in real time. To optimize resource usage, the data is compressed into an appropriate format before being transferred to the server. The input is the meeting's audio and video, and the output is raw data.

[0814] Step 2:

[0815] The server converts the received audio data into text data using a speech recognition engine. For example, using Google Speech-to-Text, this data is converted into text format. The input is audio data, and the output is the corresponding text data.

[0816] Step 3:

[0817] The server processes text data using natural language processing to extract important information. It uses natural language processing libraries such as SpaCy and NLTK to extract key points from spoken content for meeting minutes creation. The input is text data, and the output is the important information necessary for meeting minutes.

[0818] Step 4:

[0819] The server receives video data from the device's camera and analyzes it using an emotion recognition API. Microsoft Azure's facial recognition API is used to identify emotional states from participants' facial expressions. The input is video data, and the output is information about the detected emotions.

[0820] Step 5:

[0821] The server incorporates the sentiment analysis results into the meeting minutes, adjusting the content to reflect the tone and emotional impact of the statements. Based on the sentiment recognition results, the system processes the minutes to emphasize key points and adjust their tone. The input consists of key information and sentiment information from the meeting minutes, and the output is the sentiment-adjusted meeting minutes.

[0822] Step 6:

[0823] The server automatically detects tasks based on meeting minutes, sets their priorities, and adds them to the task management system. The system sends information to a project management interface to help visualize tasks and manage their progress. The input is the edited meeting minutes, and the output is a set of prioritized tasks.

[0824] Step 7:

[0825] Users receive meeting minutes and sentiment data generated in real time from the server, and then use this data for subsequent meeting follow-up and task management. Participants can view and edit information through the user interface. Inputs are meeting minutes and sentiment information from the server, while outputs are the displayed content and task progress information provided to the user.

[0826] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0827] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0828] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0829] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to an emotion map (see Figure 9), which is the specific mapping. Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0830] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0831] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0832] The inside of the emotion map 400 represents inner feelings, while the outside represents actions. Therefore, the closer an emotion is to the outside of the emotion map 400, the more visible it becomes (that is, the more it is expressed in actions).

[0833] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is displeasure, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is displeasure, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
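The idea that pleasure and displeasure follow from how far a balance is from its ideal can be expressed numerically. The following sketch is one possible formulation: the linear mapping to the range from -1 to 1, the `tolerance` parameter, and the example readings are assumptions not taken from the disclosure.

```python
def balance_affect(value: float, ideal: float, tolerance: float) -> float:
    """Map deviation from an ideal value to a score between -1.0 and 1.0.

    1.0 means the value is at the ideal (pleasure); -1.0 means it deviates
    by at least `tolerance` (displeasure).
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    deviation = min(abs(value - ideal) / tolerance, 1.0)
    return 1.0 - 2.0 * deviation


def overall_affect(readings: dict[str, tuple[float, float, float]]) -> float:
    """Average the scores of several balances given as (value, ideal, tolerance)."""
    scores = [balance_affect(v, i, t) for v, i, t in readings.values()]
    return sum(scores) / len(scores)


# Hypothetical robot readings: battery level in percent, posture tilt in degrees.
readings = {
    "battery": (75.0, 100.0, 50.0),  # halfway to the tolerance limit
    "posture": (0.0, 0.0, 30.0),     # exactly at the ideal
}
print(overall_affect(readings))
```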

[0834] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0835] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
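One way to obtain training targets in which emotions located close together have similar values is to spread each label over its neighbors with a distance-based kernel. The sketch below is only a possible construction: the two-dimensional coordinates are invented placeholders rather than the layout of Figure 9, and the Gaussian kernel and the `sigma` value are assumptions. The disclosure does not describe how the training data are built.

```python
import math

# Hypothetical coordinates; the real layout of the emotion map is in Figure 9.
EMOTION_POSITIONS = {
    "reassured": (1.0, 0.2),
    "calm": (1.0, 0.0),
    "confident": (1.1, 0.3),
    "anxiety": (1.0, -1.0),
    "desire": (-1.0, 0.8),
}


def soft_targets(label: str, sigma: float = 0.5) -> dict[str, float]:
    """Return emotion values that decay with map distance from `label`.

    The values are normalized to sum to 1, so they can serve as a soft
    target in place of a one-hot label.
    """
    cx, cy = EMOTION_POSITIONS[label]
    weights = {}
    for name, (x, y) in EMOTION_POSITIONS.items():
        dist_sq = (x - cx) ** 2 + (y - cy) ** 2
        weights[name] = math.exp(-dist_sq / (2 * sigma ** 2))
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


targets = soft_targets("calm")
for name, value in sorted(targets.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

With these placeholder coordinates, "reassured" and "confident" receive values close to that of "calm", while distant emotions such as "anxiety" and "desire" receive much smaller ones.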

[0836] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0837] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0838] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0839] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0840] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0841] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource for performing the specific processing by executing software, that is, a program. Other examples include dedicated electrical circuits, which are processors with circuit configurations designed specifically to perform the specific processing, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), and ASICs (Application Specific Integrated Circuits). Each of these processors has built-in or connected memory and performs the specific processing by using that memory.

[0842] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0843] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0844] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0845] The descriptions and illustrations presented above are detailed explanations of the parts relating to the technology of this disclosure and are merely examples of that technology. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the parts relating to the technology of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, provided that doing so does not depart from the gist of the technology of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the parts relating to the technology of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable implementation of the technology have been omitted from the descriptions and illustrations presented above.

[0846] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0847] The following is further disclosed regarding the embodiments described above.

[0848] (Claim 1)

[0849] Speech recognition means that receives audio data and converts the audio data into text data,

[0850] natural language processing means that extracts important information from the generated text data,

[0851] meeting minutes creation means that automatically generates meeting records based on the extracted important information, and

[0852] task management means that automatically detects tasks based on the meeting records and manages those tasks;

[0853] a system comprising the above means.

[0854] (Claim 2)

[0855] The system according to claim 1, further comprising means for notifying participants of meeting records generated in real time during a meeting.

[0856] (Claim 3)

[0857] The system according to claim 1, comprising means for referencing the progress of managed tasks and generating progress notifications based on the referencing results.

[0858] "Example 1"

[0859] (Claim 1)

[0860] Speech recognition means that receives audio information and converts said audio information into text information,

[0861] natural language processing means that extracts important information from the generated text information,

[0862] record creation means that automatically generates meeting minutes based on the extracted important information,

[0863] work management means that automatically detects work items based on the meeting minutes and processes those work items, and

[0864] means for aggregating the detected work items and analyzing the relevant deadlines and responsible-person information;

[0865] a system comprising the above means.

[0866] (Claim 2)

[0867] The system according to claim 1, further comprising means for notifying participants of meeting records generated in real time during a meeting.

[0868] (Claim 3)

[0869] The system according to claim 1, comprising means for monitoring the progress of managed work items and generating progress notifications based on the monitoring results.

[0870] "Application Example 1"

[0871] (Claim 1)

[0872] A speech recognition means that receives audio information and converts said audio information into text information,

[0873] A natural language processing means that extracts important data from the generated text information,

[0874] A record-keeping means that automatically generates meeting documents based on the extracted important data,

[0875] A work management means that automatically detects work items based on the meeting documents and controls those work items,

[0876] A means that, applied to store operations support, optimizes business processes by using business documents derived from recorded audio,

[0877] A system comprising the above means.

[0878] (Claim 2)

[0879] The system according to claim 1, comprising means for notifying participants of meeting documents generated in real time during a meeting.

[0880] (Claim 3)

[0881] The system according to claim 1, comprising means for referencing the progress of controlled work items and generating progress notifications based on the referencing results.

[0882] "Example 2: combination with an emotion engine"

[0883] (Claim 1)

[0884] A voice conversion means that receives voice information and converts said voice information into text information,

[0885] A natural language processing means that extracts important elements from the generated text information,

[0886] A meeting record creation means that automatically generates meeting minutes based on the extracted important elements,

[0887] An emotion recognition means that receives video information and analyzes an emotional state from said video information,

[0888] A work management means that automatically detects work items based on the meeting minutes generated while taking the emotional state into consideration, and manages said work items,

[0889] An information processing system comprising the above means.

[0890] (Claim 2)

[0891] The information processing system according to claim 1, comprising means for notifying participants of meeting minutes generated in real time during a meeting.

[0892] (Claim 3)

[0893] The information processing system according to claim 1, comprising means for referencing the progress status of managed work items and generating progress notifications based on the referencing results.

[0894] "Application Example 2: combination with an emotion engine"

[0895] (Claim 1)

[0896] A speech recognition means that receives audio data and converts the audio data into text data,

[0897] A natural language processing means that extracts important information from the generated text data,

[0898] A meeting minutes creation means that automatically generates meeting records based on the extracted important information,

[0899] A task management means that automatically detects tasks based on the meeting records and manages those tasks,

[0900] An emotion recognition means that receives video data of participants and analyzes their emotional states,

[0901] A means for adjusting the content of the meeting records based on the results of the emotion analysis,

[0902] A system comprising the above means.
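One way to picture the adjustment of meeting records based on emotion analysis is to annotate entries that coincided with a strongly negative emotional state. The following is a sketch under stated assumptions only: the emotion recognition step is assumed to have already produced one result per entry, and the label set, the 0.7 confidence threshold, and the annotation text are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmotionResult:
    label: str    # assumed label set: "positive", "neutral", "negative"
    score: float  # confidence in [0, 1]


def adjust_minutes(
    entries: List[str], emotions: List[EmotionResult], threshold: float = 0.7
) -> List[str]:
    """Annotate entries recorded while a strongly negative emotion was detected.

    entries and emotions are assumed to be aligned one to one, for example one
    emotion result per agenda item. Other entries are returned unchanged.
    """
    if len(entries) != len(emotions):
        raise ValueError("entries and emotions must have the same length")
    adjusted = []
    for entry, emotion in zip(entries, emotions):
        if emotion.label == "negative" and emotion.score >= threshold:
            adjusted.append(f"{entry} [follow-up suggested]")
        else:
            adjusted.append(entry)
    return adjusted
```

Annotating rather than rewriting the entry keeps the original record intact, so the emotion analysis only adds context to the minutes.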

[0903] (Claim 2)

[0904] The system according to claim 1, further comprising means for notifying participants of meeting records and emotion analysis results generated in real time during a meeting.

[0905] (Claim 3)

[0906] The system according to claim 1, comprising means for referencing the progress of managed tasks and the emotion analysis results, and generating progress notifications based on the referencing results.

[Explanation of Symbols]

[0907]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: a speech recognition means that receives audio data and converts the audio data into text data; a natural language processing means that extracts important information from the generated text data; a meeting minutes creation means that automatically generates meeting records based on the extracted important information; and a task management means that automatically detects tasks based on the meeting records and manages those tasks.

2. The system according to claim 1, further comprising means for notifying participants of meeting records generated in real time during a meeting.

3. The system according to claim 1, comprising means for referencing the progress of managed tasks and generating progress notifications based on the referencing results.