system

A system for real-time audio capture and AI-driven analysis enhances data input accuracy and efficiency in sales negotiations by converting audio to text, extracting key points, and automatically managing meeting minutes.

JP2026101363A · Pending · Publication Date: 2026-06-22 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-10
Publication Date: 2026-06-22

Application Information


IPC: G06F40/56; G06F40/279; G10L15/00; G06Q10/00; G10L15/22; G06Q10/10; G10L15/10; G10L15/26; G10L15/30

AI Tagging

Application Domain

Natural language translation, Office automation

Technology Topics

Information processing, Data ingestion


AI Technical Summary

Technical Problem

The manual recording and data entry of negotiation content in sales departments result in low data input rate, insufficient accuracy, and significant time and labor consumption, interfering with the primary tasks of sales members.

Method used

A system that uses voice input for real-time audio capture, speech recognition to convert the audio to text, generative AI to extract key points, and AI-driven memo analysis to automatically generate meeting minutes and input them into a management system.

Benefits of technology

Improves data input rate and accuracy, reducing the burden on sales members by allowing them to focus on negotiations while ensuring rapid and accurate recording and management of negotiation content.

✦ Generated by Eureka AI based on patent content.


Abstract

A system is provided. [Solution] The system includes: an audio acquisition means that acquires audio during a business negotiation and generates audio data; a language recognition means that analyzes the audio data and converts it into text data; a generation mechanism that extracts key points based on the text data and generates meeting minutes; a document input means that inputs the generated meeting minutes into an information processing system; a descriptive analysis means that analyzes memo data entered after the business negotiation, converts it into detailed information, and records it in the information processing system; and a presentation means that processes data during the business negotiation in real time and displays the results on a visual device it is equipped with.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of the chatbot's character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In sales departments, the recording of negotiation content has conventionally depended on manual memo creation and data entry. As a result, the data input rate is low and input accuracy is insufficient. Furthermore, because these operations take considerable time and labor, they may interfere with negotiation activities, which are the primary task of sales members. A system for quickly and accurately recording and managing negotiation content is therefore needed.

Means for Solving the Problems

[0005] This invention provides a system that includes a voice input means for acquiring audio during business negotiations in real time and converting it into text data using speech recognition technology. The system also includes a generation means that uses a generative AI to extract important points from the acquired text data and automatically generate meeting minutes. In addition, the system includes a memo analysis means: after the business negotiation, the user inputs simple notes, which are analyzed by AI and automatically entered as detailed data into the management system. This system improves the data input rate and accuracy, reduces the data input burden on sales members, and allows them to concentrate on business negotiation activities.

[0006] A "voice input means" refers to a device or system that has the function of acquiring voices generated during business negotiations in real time and converting them into digital data.

[0007] "Speech recognition means" refers to a technology or system that analyzes acquired speech data and converts it into natural language text data.

[0008] "Generation means" refers to a system or device that has the function of automatically extracting important information from text data generated by speech recognition and compiling it into meeting minutes.

[0009] "Input means" refers to a system or device that has the function of automatically inputting generated meeting minutes and analysis data into a management system.

[0010] A "memo analysis means" is a device or system that uses AI technology to analyze memos entered by users after business negotiations and convert them into detailed data.

[0011] A "management system" is a platform or software that centrally manages data related to business negotiations and makes the information available for extraction and use as needed.

Brief Description of the Drawings

[0012]

[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0013] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0014] First, the terms used in the following description will be explained.

[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as work memory.

[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. In this specification, the same interpretation applies when three or more items are linked by "and/or."

[0020] [First Embodiment]

[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0033] In order to implement this invention, it is essential to construct a system that accurately and efficiently processes data during and after business negotiations. Below, we will show an example of implementing the invention based on the processing of the program in this system.

[0034] First, during a business meeting, the terminal uses its microphone to capture the audio of the conversation. The audio data is sent to the server in streaming format. The server feeds the received audio data into a speech recognition engine, which converts it into text data. During this process, noise reduction and filtering techniques are used to extract only the voices of specific speakers.

[0035] Once the text data is generated, the server uses a generative AI to automatically extract key points based on the context of the business negotiation, creating meeting minutes in real time. The minutes are then displayed on the user's terminal, where the user can review them and make adjustments.
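The speaker-filtering step described above can be illustrated with a minimal sketch. It assumes that the speech recognition engine returns segments that already carry a speaker label; the `Segment` type, the label values, and the `keep_speakers` parameter are hypothetical names that do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One recognized utterance (hypothetical structure, not from the specification)."""
    speaker: str    # label assumed to be assigned by the recognition engine
    start_s: float  # start time in seconds
    text: str


def transcript_for(segments, keep_speakers):
    """Join, in time order, the text of segments spoken by the selected speakers."""
    kept = [s for s in segments if s.speaker in keep_speakers]
    kept.sort(key=lambda s: s.start_s)
    return "\n".join(f"{s.speaker}: {s.text}" for s in kept)


segments = [
    Segment("customer", 0.0, "Can you deliver by the end of next month?"),
    Segment("background", 2.5, "(announcement over loudspeaker)"),
    Segment("sales", 4.0, "Yes, that schedule is feasible."),
]
print(transcript_for(segments, {"customer", "sales"}))
```

Only the two negotiation participants remain in the transcript; how the labels themselves are produced is left to the recognition engine.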

[0036] Once a business meeting concludes, the user enters a brief memo into their terminal. This memo is often a short message containing additional information or noteworthy points from the meeting. The terminal sends the entered memo data to a server. The server uses AI to analyze this memo in detail and convert it into specific items. These include the purpose of the meeting, the requested content, and the next action points.
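As a sketch of how the analyzed memo might be represented, the following assumes that the AI returns its analysis as JSON with the three items named above. The field names, the `MemoDetails` type, and the JSON reply format are illustrative assumptions; the specification does not define a data format, and no model is called here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MemoDetails:
    """Items named in the text: purpose, requested content, next action points."""
    purpose: str
    requested_content: str
    next_actions: list = field(default_factory=list)


def parse_memo_analysis(model_output: str) -> MemoDetails:
    """Validate the (assumed) JSON reply of the analysis model and build a record."""
    data = json.loads(model_output)
    if not isinstance(data, dict):
        raise ValueError("analysis reply must be a JSON object")
    missing = {"purpose", "requested_content", "next_actions"} - set(data)
    if missing:
        raise ValueError(f"analysis reply lacks fields: {sorted(missing)}")
    if not isinstance(data["next_actions"], list):
        raise ValueError("next_actions must be a list")
    return MemoDetails(
        purpose=str(data["purpose"]),
        requested_content=str(data["requested_content"]),
        next_actions=[str(a) for a in data["next_actions"]],
    )


# Hand-written stand-in for a model reply.
reply = json.dumps({
    "purpose": "Discuss delivery schedule",
    "requested_content": "Customer wants to expedite the delivery date",
    "next_actions": ["Check feasibility of earlier delivery with logistics"],
})
details = parse_memo_analysis(reply)
print(details.next_actions)
```

Validating the reply before it reaches the management system is one way to keep malformed model output from being recorded.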

[0037] The analyzed detailed data is automatically and sequentially entered by the server into the company's management system, such as a customer management platform. This process is conducted in accordance with security protocols, ensuring rapid and accurate data entry while maintaining data integrity.

[0038] These features allow users to quickly record key points from business negotiations and seamlessly integrate important information into the management system. This system is expected to allow sales team members to focus on their core responsibilities and improve overall work efficiency.

[0039] As a concrete example of implementing the invention, consider a scenario where a customer is discussing product delivery dates. A terminal acquires the audio, and a server transcribes it into text, extracts key points, and automatically generates meeting minutes such as "The product is scheduled for delivery at the end of next month." After the meeting, the user enters a note stating, "The customer wants to expedite the delivery date," and this information is converted into a detailed action plan through AI analysis and reflected in the management system.

[0040] The following describes the processing flow.

[0041] Step 1:

[0042] The terminal enables voice input at the start of a business meeting and captures the conversation of the meeting participants in real time. The captured voice data is sent to the server using a streaming protocol.

[0043] Step 2:

[0044] The server passes the received audio data to the speech recognition engine, which converts the audio into text. This conversion process includes noise reduction and speaker identification, ensuring that only the necessary parts are transcribed.

[0045] Step 3:

[0046] The server analyzes the text data generated by speech recognition and uses a generation AI to extract important keywords and phrases. Based on this, it automatically generates meeting minutes summarizing the business negotiation content.

[0047] Step 4:

[0048] The server sends the generated meeting minutes to the terminal, allowing the user to review them in real time during the business negotiation. It also provides an interface for the user to review important information.

[0049] Step 5:

[0050] After the business meeting concludes, the user enters a brief memo about the meeting into their terminal. This memo may include additional comments or information about the next steps.

[0051] Step 6:

[0052] The terminal sends the entered memo to the server.

[0053] (Example 1)

[0054] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0055] For many companies, accurately and efficiently recording information during business negotiations in real time, and quickly following up after those negotiations, is a critical challenge. With current technology, it's common to manually organize and record the vast amount of information generated during negotiations, which is time-consuming and labor-intensive. Furthermore, there are limited methods for efficiently incorporating supplementary information after negotiations and reflecting it in the company's management system. To address these challenges, there is a need to automate information processing and generate summaries and follow-up information quickly and accurately.

[0056] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0057] In this invention, the server includes an audio acquisition means for acquiring audio during business negotiations and generating audio information, an audio conversion means for analyzing the audio information and converting it into text information, an information generation means for extracting important information based on the text information and generating summary information, and an information analysis means for analyzing supplementary information entered after the business negotiation and converting it into detailed information. This makes it possible to record important information during business negotiations in real time and to quickly and efficiently reflect the information necessary for follow-up after the business negotiations in the management system.

[0058] "Audio acquisition means" refers to a device or function for collecting audio information from conversations such as business negotiations.

[0059] An "audio conversion means" is a function that analyzes acquired audio information and converts it into text information.

[0060] "Information generation means" refers to a device or system that has the function of automatically extracting important information from the converted text information and generating summary information.

[0061] "Information input means" refers to a function for registering or inputting generated summary information into a company's management system.

[0062] An "information analysis means" is a tool that has the function of analyzing supplementary information obtained after a business negotiation and converting it into detailed content.

[0063] A "data processing device" is a device or mechanism that instantly analyzes voice and text information and links it with a management system.

[0064] This invention provides a system for efficiently processing information during and after business negotiations. The following describes a specific implementation of this system.

[0065] During business negotiations, the terminal acquires audio of the conversation through a microphone device. Since the audio information is processed in real time, built-in microphones or high-quality external microphones are often used. The acquired audio information is immediately transmitted to the server using a streaming protocol. Technologies such as WebSocket are used for this purpose.
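To make the streaming step concrete, the sketch below splits raw audio into small fixed-duration frames, each of which could be sent as one message over a streaming connection such as WebSocket. The sample rate, sample width, and frame duration are assumptions chosen for illustration; the specification does not state them, and the network transmission itself is omitted.

```python
SAMPLE_RATE_HZ = 16_000   # assumed; not specified in the source
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
FRAME_MS = 20             # assumed frame duration


def iter_frames(pcm: bytes, frame_ms: int = FRAME_MS):
    """Yield fixed-size chunks of raw PCM; the last chunk may be shorter."""
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


# One second of silence stands in for microphone input.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = list(iter_frames(one_second))
print(len(frames), len(frames[0]))  # prints "50 640"
```

Short frames keep latency low, which matters because the server is expected to transcribe while the conversation is still in progress.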

[0066] The server converts the received audio information into text information using the audio conversion means. Here, APIs that provide speech recognition technology, such as the speech recognition services offered by various vendors, are used. The audio data is subjected to noise cancellation and speaker identification filtering so that highly accurate text information is obtained.

[0067] Next, the server analyzes the text information using a generative AI model. The generative AI model extracts important information based on the context and automatically generates a summary of the business negotiation. This involves commonly used generative AI technologies and text summarization algorithms. Specifically, the information is organized into meeting minutes.

[0068] The generated summary information is stored in the management system via the information input means. This management system can be integrated with enterprise systems such as customer information management platforms and project management systems. Data is imported under security protocols that keep it protected.

[0069] After a business meeting, the user inputs supplementary information via a terminal. This is recorded concisely as additional notes. The terminal sends this supplementary information to the server, which uses the information analysis means to convert the content into detailed information. Natural language processing technology is used for the analysis, and the supplementary information is turned into specific steps and actions.
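The specification does not say which security protocols are used. As one possible illustration of protecting a record in transit, the sketch below attaches an HMAC-SHA256 tag to a serialized record so that the receiving system can detect tampering. The envelope layout, the field names, and the shared secret are all assumptions made for this example.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, secret: bytes) -> dict:
    """Serialize a record deterministically and attach an HMAC-SHA256 tag."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": tag}


def verify_record(envelope: dict, secret: bytes) -> bool:
    """Recompute the tag on the receiving side and compare in constant time."""
    expected = hmac.new(
        secret, envelope["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])


secret = b"demo-secret"  # placeholder; a real deployment would load a managed key
envelope = sign_record({"kind": "summary", "content": "Delivery at end of next month"}, secret)
print(verify_record(envelope, secret))
```

An integrity tag of this kind complements, rather than replaces, transport encryption and access control on the management system's API.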

[0070] An example of a prompt message is, "Extract key points from the sales meeting audio data and generate meeting minutes." This allows for efficient management of sales meeting content and enables quick follow-up.
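A prompt of this kind still has to be combined with the transcript before it is sent to a generative AI model. The following sketch shows one way to do that; the `build_minutes_prompt` function, the "Transcript:" layout, and the truncation policy are hypothetical and are not taken from the specification.

```python
PROMPT_INSTRUCTION = (
    "Extract key points from the sales meeting audio data "
    "and generate meeting minutes."
)


def build_minutes_prompt(transcript: str, max_chars: int = 8000) -> str:
    """Combine the instruction with the most recent part of the transcript."""
    text = transcript.strip()
    if not text:
        raise ValueError("transcript is empty")
    # Keep the tail so the latest remarks survive truncation (assumed policy).
    text = text[-max_chars:]
    return f"{PROMPT_INSTRUCTION}\n\nTranscript:\n{text}"


prompt = build_minutes_prompt("customer: Can you deliver by the end of next month?")
print(prompt)
```

The resulting string would be passed to whichever model the server uses; the call to the model is outside the scope of this sketch.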

[0071] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0072] Step 1:

[0073] The terminal acquires audio information during a business negotiation via a microphone device. At this stage, the speaker's utterances are captured by the microphone, and audio information is generated as a digital signal. The input is raw audio, which is then prepared for transmission in real time via a streaming protocol, such as WebSocket. The output is the streaming audio data transmitted from the terminal.

[0074] Step 2:

[0075] The server receives streaming audio data transmitted from the terminal. Using speech conversion technology, it converts this audio data into text information. During this process, the received audio data undergoes noise reduction and speaker separation filtering. The input is streaming audio data, and the output is filtered text information.

[0076] Step 3:

[0077] The server uses a generative AI model to analyze the text information. The input text is passed to the model together with a prompt, and the model extracts key points and generates summary information. The generative AI automatically extracts key points from the context and organizes them into concise meeting minutes. The input is text information, and the output is a summary with the key points extracted.

[0078] Step 4:

[0079] The server inputs the generated summary information into the management system. This input method ensures that data is systematically stored on the company's information management platform. To maintain data integrity and security, information is securely transmitted to the management system via an API. The input is summary information, and the output is organized data stored on the company's system.

[0080] Step 5:

[0081] Users enter supplementary information using a terminal after a business meeting. These additional notes are entered as concise, key-point supplementary information. The input is text data manually entered by the user.

[0082] Step 6:

[0083] The terminal sends the input supplementary information to the server. The server analyzes the received supplementary information in detail using the information analysis means and converts it into specific items. In this process, natural language processing technology is used to identify specific action items that will lead to the next step in the business negotiation. The input is the text data of the supplementary information, and the output is the analyzed detailed information.

[0084] Step 7:

[0085] The server inputs detailed information based on the analysis results into the company's management system. This prepares the system for follow-up based on the content of the business negotiations. The input is detailed analysis information, and the output is a data update in the management system that reflects this information.
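Steps 1 to 7 above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the implementation described by the specification: the class names are hypothetical, the speech recognizer, summarizer, and memo analyzer are injected as plain callables so that trivial stand-ins can replace real engines, and the management system is an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ManagementSystem:
    """In-memory stand-in for the company's management system."""
    records: List[dict] = field(default_factory=list)

    def store(self, kind: str, content: str) -> None:
        self.records.append({"kind": kind, "content": content})


@dataclass
class NegotiationPipeline:
    recognize: Callable[[bytes], str]    # Step 2: audio chunk -> text
    summarize: Callable[[str], str]      # Step 3: text -> summary
    analyze_memo: Callable[[str], str]   # Step 6: memo -> detailed information
    system: ManagementSystem

    def during_meeting(self, audio_chunks) -> str:
        """Steps 1 to 4: transcribe the streamed chunks, summarize, store."""
        text = " ".join(self.recognize(chunk) for chunk in audio_chunks)
        summary = self.summarize(text)
        self.system.store("summary", summary)
        return summary

    def after_meeting(self, memo: str) -> str:
        """Steps 5 to 7: analyze the user's memo and store the details."""
        details = self.analyze_memo(memo)
        self.system.store("details", details)
        return details


# Trivial stand-ins; real engines would replace these lambdas.
pipeline = NegotiationPipeline(
    recognize=lambda chunk: chunk.decode("utf-8"),
    summarize=lambda text: "Minutes: " + text,
    analyze_memo=lambda memo: "Action: " + memo,
    system=ManagementSystem(),
)
summary = pipeline.during_meeting([b"Delivery is due", b"at the end of next month."])
details = pipeline.after_meeting("Customer wants to expedite the delivery date")
print(pipeline.system.records)
```

Injecting the three processing stages keeps the flow testable without access to any external speech recognition or generative AI service.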

[0086] (Application Example 1)

[0087] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0088] In business negotiations and sales promotions, it is necessary to improve operational efficiency by immediately recording conversations with customers and extracting key points. However, conventional methods require some manual recording, which limits accuracy and speed. Furthermore, organizing and inputting information for follow-up after negotiations is cumbersome and prone to human error. There is also a need for new methods to put this information to practical use and improve the quality of sales and services.

[0089] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0090] In this invention, the server includes an audio acquisition means, a language recognition means, and a generation mechanism. This makes it possible to instantly record the audio of a business negotiation, transcribe it into text, extract important points, and present them on a visual device in real time. The aim is to improve the quality of customer service by enabling efficient recording of business negotiations and immediate use of the information.

[0091] "Audio acquisition means" refers to a part of a system that accurately acquires audio during business negotiations and records and generates it as audio data.

[0092] A "language recognition means" is a processing device that analyzes acquired audio data and converts it into text data.

[0093] The "generation mechanism" is a component that extracts key points from converted text data and automatically generates meeting minutes.

[0094] A "visual device" is a device that visually displays the results generated from the audio data and provides them to the user during a business negotiation.

[0095] An "information processing system" refers to the entire database and its surrounding systems used to record and manage generated meeting minutes and analyzed detailed information.

[0096] The "descriptive analysis means" is a processing unit that uses natural language processing technology to expand notes entered after a business negotiation into detailed information.

[0097] A "data input device" is an automated component for accurately inputting generated meeting minutes into an information processing system.

[0098] A "remote information processing system" is a server and related equipment that operates via a network to instantly process audio data acquired during business negotiations and generate necessary information.

[0099] To implement this invention, multiple system components must work in coordination. A terminal equipped with the audio acquisition means acquires audio generated during business negotiations in real time. The acquired audio data is immediately transmitted to a server. The server converts the audio data into text data using the language recognition means. This process uses a speech recognition engine (e.g., Google® Speech-to-Text API) to enable accurate and rapid text generation.

[0100] Subsequently, the generation mechanism is used to automatically extract key points from the text data of the business negotiation. This process applies a generative AI model (e.g., OpenAI® GPT) to automatically generate summaries of the key points and meeting minutes. The generated meeting minutes are immediately displayed on the terminal via the visual device, allowing the user to review them in real time. This information is recorded in the information processing system via the document input means.

[0101] Furthermore, after a business meeting, users input any additional notes they have taken into their terminals. The server then analyzes this information in detail using the descriptive analysis means and uses it in the information processing system for customer management and for creating next action plans. This allows sales members to quickly record the key points of business meetings and easily reflect important information in the management system.
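One way to keep the displayed minutes current during a negotiation is to regenerate them as new transcript segments arrive. The sketch below assumes a refresh after every few segments; the `LiveMinutes` class, the refresh policy, and the stand-in summarizer are illustrative assumptions, and the specification does not describe how often the display is updated.

```python
from typing import Callable, List


class LiveMinutes:
    """Regenerate minutes every `refresh_every` segments (assumed policy)."""

    def __init__(self, summarize: Callable[[str], str], refresh_every: int = 3):
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self._summarize = summarize
        self._refresh_every = refresh_every
        self._segments: List[str] = []
        self.current = ""  # text most recently sent to the visual device

    def add_segment(self, text: str) -> bool:
        """Store a segment; return True when the minutes were regenerated."""
        self._segments.append(text)
        if len(self._segments) % self._refresh_every != 0:
            return False
        self.current = self._summarize(" ".join(self._segments))
        return True


# Stand-in summarizer that only counts words; a real model would replace it.
live = LiveMinutes(lambda t: f"{len(t.split())} words so far", refresh_every=3)
updates = [live.add_segment(s) for s in ["a b", "c", "d e f"]]
print(updates, live.current)
```

Batching segments before each refresh trades a little display latency for fewer calls to the generative model.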

[0102] As a concrete example, a sales staff member uses smart glasses when explaining a new product to a customer. Voice recognition instantly summarizes important information, such as "The new model is scheduled for release at the end of next month and is available for pre-order," as meeting minutes, which are then displayed on the visual device's screen. This process is based on prompts such as, "Please extract and summarize key sales points in real time during the negotiation."

[0103] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0104] Step 1:

[0105] The terminal uses the audio acquisition means to capture audio during business negotiations in real time. The input is ambient sound, which is recorded as digital data. The output is digital audio data.

[0106] Step 2:

[0107] The terminal transmits the acquired audio data to a server, which is a remote information processing device. The input is digital audio data, which is transferred to the server via the network. The output is the audio data stored on the server.

[0108] Step 3:

[0109] The server analyzes speech data using language recognition tools and converts it into text data. The input is the received speech data, which is processed using a speech recognition engine (e.g., Google Speech-to-Text API). The output is text data.

[0110] Step 4:

[0111] The server uses a generation mechanism to extract key points from the text data of a business negotiation. This process employs a generative AI model (e.g., OpenAI GPT). The input is the text data generated in step 3, and the output is a summarized meeting memo.

[0112] Step 5:

[0113] The server sends the generated meeting minutes to the terminal's visual device for immediate display. The input is the meeting minutes, which are presented to the user via the visual device. The output is the summary information displayed on the terminal's display.

[0114] Step 6:

[0115] After the business meeting concludes, the user enters additional notes into the terminal. These entered notes will be processed in the next step. The input is a natural language note entered by the user.

[0116] Step 7:

[0117] The server analyzes the input memo data using descriptive analysis tools and generates detailed information. The input is the memo collected in step 6, which is analyzed using natural language processing techniques. The output is a summary and detailed information necessary for the next step.

[0118] Step 8:

[0119] The server records the analyzed details in the information processing system and uses them to create the next action plan as needed. The input is the detailed information generated in step 7, which is stored in the company's information system. The output is data integrated into the company's management platform.
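The eight steps above can be sketched as a small server-side pipeline. This is a minimal illustration, not the disclosed implementation: every class and function name is hypothetical, and the speech recognition engine, generative AI model, and memo analysis are injected as plain callables so that any concrete service could be substituted.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type aliases for the engines used in steps 3, 4 and 7.
Transcriber = Callable[[bytes], str]    # audio data -> text data
Summarizer = Callable[[str], str]       # text data -> meeting minutes
MemoAnalyzer = Callable[[str], dict]    # memo -> detailed information


@dataclass
class ManagementSystem:
    """Stand-in for the information processing system of step 8."""
    records: List[dict] = field(default_factory=list)

    def store(self, record: dict) -> None:
        self.records.append(record)


@dataclass
class NegotiationServer:
    transcribe: Transcriber
    summarize: Summarizer
    analyze_memo: MemoAnalyzer
    system: ManagementSystem

    def handle_audio(self, audio: bytes) -> str:
        """Steps 2 to 5: receive audio and return minutes for display."""
        text = self.transcribe(audio)
        minutes = self.summarize(text)
        self.system.store({"type": "minutes", "body": minutes})
        return minutes

    def handle_memo(self, memo: str) -> dict:
        """Steps 6 to 8: analyze the post-meeting memo and record it."""
        details = self.analyze_memo(memo)
        self.system.store({"type": "memo_details", **details})
        return details


# Stub engines, used only to make the sketch runnable.
server = NegotiationServer(
    transcribe=lambda audio: audio.decode("utf-8"),
    summarize=lambda text: text.split(".")[0].strip() + ".",
    analyze_memo=lambda memo: {"next_action": memo},
    system=ManagementSystem(),
)
minutes = server.handle_audio(b"The new model ships next month. Pre-orders are open.")
details = server.handle_memo("Send a quotation")
```

Injecting the engines keeps the flow independent of any particular vendor, which matches the way the description names specific services only as examples.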

[0120] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0121] The present invention aims to provide a system that improves the quality of meeting minutes and sales records by integrally handling audio data during sales negotiations and memo data afterward, and further analyzing the user's emotions. The program processing in this system will be described in detail below.

[0122] During a business negotiation, the terminal uses multi-directional microphones to capture conversational audio in real time. The acquired audio data is then transmitted to a server using real-time streaming technology. The server inputs this audio data into a speech recognition engine and an emotion engine, which simultaneously convert the audio into text and analyze the speaker's emotions.

[0123] The speech recognition engine analyzes the received audio data and converts it into text data. The converted text data is then summarized by a generative AI, and the key points of the business negotiation are extracted and compiled into meeting minutes.

[0124] The emotion engine analyzes the characteristics of the speech signal, such as intonation, speed, and volume, and infers the speaker's emotions based on this analysis. The analyzed emotion data is added to the meeting minutes as emotion labels and indicators, providing users with information that allows them to understand the atmosphere and emotional flow of the business negotiation.
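As a rough illustration of how intonation, speed, and volume could be mapped to an emotion label, the following sketch assumes that the three features have already been extracted from the speech signal. The feature names, the thresholds, and the label vocabulary are all hypothetical and are not taken from this disclosure; a practical emotion engine would more likely use a trained model.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VoiceFeatures:
    pitch_range_hz: float    # intonation: spread between lowest and highest pitch
    words_per_minute: float  # speed
    rms_level: float         # volume, on a 0.0 to 1.0 scale


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of samples normalized to -1.0..1.0."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def label_emotion(features: VoiceFeatures) -> str:
    """Map features to a coarse label. Thresholds are illustrative only."""
    arousal = 0
    arousal += features.pitch_range_hz > 120.0
    arousal += features.words_per_minute > 170.0
    arousal += features.rms_level > 0.5
    if arousal >= 2:
        return "excited"
    if arousal == 1:
        return "engaged"
    return "calm"
```

Acoustic features of this kind indicate how animated a speaker is rather than whether the reaction is favorable, so a label such as "positive reaction" would plausibly also require the converted text.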

[0125] After a business meeting, the user enters a brief memo on their device as supplementary information. This memo is sent to a server and transformed into detailed data through AI analysis. Finally, the text data, sentiment data, and detailed data are centralized and automatically entered into management systems such as customer relationship management (CRM) systems.

[0126] As a concrete implementation example, let's consider a business meeting where a product demonstration is conducted. In this scenario, the server generates meeting minutes based on the audio data acquired in real time by the terminal, such as "Introducing two new features as strengths of the product." Simultaneously, the server uses sentiment analysis to assign a sentiment label indicating that "the customer showed a positive reaction to the new features."

[0127] In this way, the present invention combines information extraction from voice with emotion analysis to provide emotional insights along with important data from business negotiations, thereby enabling a deeper understanding of business negotiation activities and effective follow-up.

[0128] The following describes the processing flow.

[0129] Step 1:

[0130] The terminal enables voice input at the start of a business meeting and uses its built-in microphone to capture the conversation of multiple meeting participants in real time. The captured audio data is compressed and sent to a server over the network.

[0131] Step 2:

[0132] The server sends the received audio data to the speech recognition engine, which converts the audio into text data according to the context. This conversion process includes filtering of background noise and identification of each speaker.

[0133] Step 3:

[0134] Simultaneously, the server sends the audio data to the emotion engine, which evaluates and determines the speaker's emotions by analyzing the tone, pitch, speed, and other acoustic characteristics of the voice.

[0135] Step 4:

[0136] The server uses a generative AI to extract important keywords and phrases from the text data and automatically generate summarized meeting minutes. These minutes include the specific details of the business negotiation, along with emotion labels obtained from the emotion engine.

[0137] Step 5:

[0138] The generated meeting minutes and sentiment labels are sent to the terminal after the meeting concludes, allowing the user to review them and add annotations or additional comments as needed.

[0139] Step 6:

[0140] After a business meeting, notes entered by the user are sent to the server via the terminal. The server uses AI to analyze these notes, convert them into a specific and detailed data form, and re-enter them into the management system.

[0141] Step 7:

[0142] Ultimately, all text data, sentiment data, and additional memo data are integrated and automatically recorded and managed in the company's management system. This process creates a comprehensive and multifaceted information database related to business negotiations.
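Steps 2 and 3 above run on the same audio data at the same time. One possible way to arrange this, assuming both engines are reachable as ordinary blocking functions, is a two-worker thread pool. The function and parameter names below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def analyze_in_parallel(
    audio: bytes,
    recognize: Callable[[bytes], str],
    estimate_emotion: Callable[[bytes], str],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Run speech recognition and emotion estimation concurrently,
    then summarize the recognized text into minutes (step 4)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, audio)
        emotion_future = pool.submit(estimate_emotion, audio)
        text = text_future.result()
        emotion = emotion_future.result()
    return {
        "text": text,
        "minutes": summarize(text),
        "emotion_label": emotion,
    }


# Stub engines, used only to make the sketch runnable.
result = analyze_in_parallel(
    b"two new features were introduced",
    recognize=lambda a: a.decode("utf-8"),
    estimate_emotion=lambda a: "positive",
    summarize=lambda t: t.capitalize() + ".",
)
```

Threads are a reasonable fit here on the assumption that both engines are remote services and the calls spend most of their time waiting on the network.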

[0143] (Example 2)

[0144] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0145] In business negotiations, it is necessary to record participants' statements as acoustic information and understand their content in detail. However, conventional systems have difficulty accurately capturing human emotions and understanding the atmosphere and flow of emotions during negotiations, and it is also difficult to efficiently supplement information after negotiations. Therefore, a system is needed that integrates everything from acquiring acoustic information to emotion analysis, generating meeting minutes, and inputting them into a management system.

[0146] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0147] In this invention, the server includes an acquisition means for acquiring acoustic information and generating acoustic data, a conversion means for analyzing the acoustic data and converting it into text data, and an emotion analysis means for analyzing emotion information based on the text data and acoustic data and generating emotion labels. This enables accurate acquisition and analysis of acoustic information, and a deep understanding of the content of business negotiations through emotion analysis.

[0148] "Acoustic information" refers to data that includes the properties of sound, acquired during business negotiations, conversations, and other interactions.

[0149] "Acoustic data" refers to acoustic information represented in digital format and is used for speech recognition and analysis.

[0150] "Acquisition means" refers to a mechanism for collecting acoustic information and generating acoustic data.

[0151] A "conversion means" is a device that has the function of analyzing acoustic data and converting it into text data.

[0152] "Text data" refers to data in text format converted from speech, in a format that humans can read and understand.

[0153] "Emotion information" refers to data about the speaker's emotional state, analyzed from acoustic data.

[0154] An "emotion label" is a label assigned to audio or dialogue as an indicator of emotion information.

[0155] "Emotion analysis means" refers to a device that analyzes the speaker's emotion information from acoustic data and text data and generates emotion labels.

[0156] A "generation method" is a mechanism for extracting important items from text data and generating meeting minutes.

[0157] "Meeting minutes" are records that summarize important statements and key points of discussion during business negotiations or meetings.

[0158] A "management platform" is a system for integrating, managing, and storing generated information.

[0159] "Implementation method" refers to a mechanism for inputting generated meeting minutes and sentiment labels into the management platform.

[0160] "Memo information" refers to additional information entered after a business negotiation, and serves a supplementary role.

[0161] A "memo analysis tool" is a mechanism for analyzing memo information, converting it into detailed information, and inputting it into a management platform.

[0162] The following describes the configuration for implementing this system.

[0163] The present invention aims to provide more detailed and valuable information by acquiring and analyzing acoustic information during business negotiations and meetings. First, a terminal acquires acoustic information during business negotiations using a multi-directional microphone. This terminal is equipped with communication means for acquiring acoustic information in real time and transmitting the data to a computing device. Here, acoustic data can be transmitted using the Internet Protocol.

[0164] The server acts as a computing device, analyzing the received acoustic data. This analysis utilizes speech recognition software and sentiment analysis software. Specifically, a commercial speech recognition API is used for speech recognition, and acoustic characteristic analysis software is used for sentiment analysis, converting the acoustic data into text data and generating sentiment labels from the acoustic and text data. Furthermore, a generative AI is used to extract key points from the text data and record them as meeting minutes.

[0165] After a business meeting concludes, users can manually enter memo information using their terminals. This information is further analyzed and centrally recorded as detailed information. This recorded information is then integrated into the management platform and shared among stakeholders.

[0166] As a concrete example, let's consider a sales meeting where a product is being introduced. In this case, the terminal picks up everyone's comments, and the server generates a summary such as, "Two innovative features were introduced as characteristics of the new product." Additionally, by adding sentiment labels, information such as, "The customer showed positive surprise about the new features" is also included. This process makes it possible to grasp both the important content of the sales meeting and the atmosphere of the situation.

[0167] An example of a prompt using a generative AI model would be something like, "Please indicate the themes that elicited the most responses during the business negotiation and the associated sentiment labels."

[0168] In this way, this system enhances the information gathering and analysis during business negotiations, thereby maximizing the effectiveness of negotiations and supporting subsequent operations.

[0169] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0170] Step 1:

[0171] The terminal acquires acoustic information in real time during business negotiations using multi-directional microphones. The input is acoustic signals from the surroundings, which are converted into digital acoustic data. Specifically, analog sound is collected from the microphone and converted into a digital signal using an A/D converter. This data is temporarily stored in a buffer within the terminal.

[0172] Step 2:

[0173] The terminal sends the digital audio data stored in the buffer to the server via real-time streaming technology. The input is the audio data in the buffer, and the output is the audio data sent to the server over the network. Specifically, this transmission uses the Internet Protocol, and the data is divided into small packets and transmitted using a transport suited to streaming, such as UDP.

[0174] Step 3:

[0175] The server analyzes the received audio data. The input is digital audio data sent from the terminal, and the output is text data and emotion information. Specifically, speech recognition software is used to convert the audio data into text data, and emotion analysis software is used to analyze the intonation and speed of the speech. Based on this analysis, an emotion label is generated that quantifies the speaker's emotions.

[0176] Step 4:

[0177] The server uses a generative AI model to summarize key points based on the analyzed text data and generate meeting minutes. The input is text data, and the output is a summary of the meeting minutes. Specifically, the AI model extracts key phrases from the text and logically organizes them. This summary also includes speaker sentiment labels.

[0178] Step 5:

[0179] After the business meeting concludes, the user enters supplementary memo information using a terminal. The input is text information entered by the user, and the output is data converted into detailed memo information. Specifically, the user manually enters text, which is then used with natural language processing technology for subsequent analysis.

[0180] Step 6:

[0181] The server analyzes supplementary information from users and integrates it with existing meeting minutes and sentiment labels. Input is detailed memo information entered by the user, and output is an integrated sales opportunity record. Specifically, the analysis engine stores the memo information in a database, combines it with existing meeting minutes to build a comprehensive sales opportunity database, and automatically deploys it to the management platform.
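Steps 1 and 2 of this flow (A/D conversion, buffering, and division into small packets) can be sketched as follows. The 16-bit sample width, the buffer class, and the 4-byte sequence-number header are assumptions made for illustration and are not specified in this disclosure.

```python
import struct
from collections import deque
from typing import Deque, Dict, Iterable, List

HEADER = struct.Struct("!I")  # assumed header: 4-byte big-endian sequence number


def quantize_16bit(analog: Iterable[float]) -> List[int]:
    """Clip each sample to -1.0..1.0 and map it to a signed 16-bit value."""
    out = []
    for x in analog:
        clipped = max(-1.0, min(1.0, x))
        out.append(int(round(clipped * 32767)))
    return out


class AudioBuffer:
    """Bounded buffer on the terminal; the oldest samples are dropped when full."""

    def __init__(self, capacity: int) -> None:
        self._samples: Deque[int] = deque(maxlen=capacity)

    def write(self, samples: Iterable[int]) -> None:
        self._samples.extend(samples)

    def drain(self) -> List[int]:
        data = list(self._samples)
        self._samples.clear()
        return data


def packetize(audio: bytes, payload_size: int = 1024) -> List[bytes]:
    """Split audio into packets, each prefixed with its sequence number."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [
        HEADER.pack(seq) + audio[off:off + payload_size]
        for seq, off in enumerate(range(0, len(audio), payload_size))
    ]


def reassemble(packets: Iterable[bytes]) -> bytes:
    """Server side: reorder by sequence number; lost packets are skipped."""
    by_seq: Dict[int, bytes] = {}
    for packet in packets:
        (seq,) = HEADER.unpack(packet[:HEADER.size])
        by_seq[seq] = packet[HEADER.size:]
    return b"".join(by_seq[s] for s in sorted(by_seq))
```

Because UDP does not guarantee ordering or delivery, a sequence number of this kind lets the server restore the order and tolerate gaps. Each packet could be handed to a datagram socket's `sendto` call on the terminal side.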

[0182] (Application Example 2)

[0183] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0184] Modern security services require accurate understanding of surrounding audio information and early detection of suspicious activity and conversations. However, current technology is insufficient for extracting key points from audio and understanding emotional nuances, making it difficult to achieve highly accurate surveillance. There is also a need for efficient systems that perform real-time analysis of audio data.

[0185] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0186] In this invention, the server includes voice acquisition means, voice recognition means, information recording generation means, emotion analysis means, and anomaly detection means. This makes it possible to analyze the flow of voice information and emotions within the environment and to quickly detect suspicious movements and conversations.

[0187] "Sound acquisition means" refers to a device or configuration that effectively collects ambient sounds in an environment and supplies those sounds to a subsequent analysis process.

[0188] "Speech recognition means" refers to a technology or device that analyzes acquired speech information and converts it into corresponding text information.

[0189] "Information record generation means" refers to a process or system for extracting specific points based on textual information and creating appropriate information records.

[0190] "Emotion analysis means" refers to a technology or device for analyzing characteristics such as intonation, speed, and volume in speech, and for inferring and labeling the emotional state of the speaker.

[0191] An "anomaly detection means" is a function or method for detecting suspicious behavior or conversation based on analyzed voice and emotion data.

[0192] A "management mechanism" is a system or device that centrally manages collected and analyzed data and appropriately stores or displays necessary information.

[0193] A "computer" is a computer or data processing device used to analyze audio data in real time and to perform text conversion and emotion analysis.

[0194] To realize an application example of this invention, first, a security robot system is used to acquire sound from the environment. Specifically, a multi-directional microphone is used as the device. This allows the robot to collect ambient sound data in real time while patrolling.

[0195] The server analyzes the acquired audio data using speech recognition and converts it into text. This process can utilize speech recognition APIs such as Google Cloud Speech-to-Text. The recognized text is then further processed by an information recording generation system to extract specific key points and generate appropriate records.

[0196] Next, the server uses emotion analysis tools, such as IBM Watson® Tone Analyzer, to infer emotions from the intonation, speed, and volume characteristics of the audio data. This allows for the detection of potential disturbances in the patrolled area.

[0197] Furthermore, anomaly detection mechanisms are implemented to detect suspicious behavior and conversations based on analyzed voice and emotion data. In this process, the server utilizes AI technology to provide a warning system that enables a rapid response in emergencies.

[0198] As a concrete example, if a robot patrolling a shopping mall detects a conversation expressing dissatisfaction with a particular product and determines that anger is escalating, this information is reported to the management organization in real time, and appropriate action is taken.

[0199] An example of a prompt for a generative AI model would be: "Translate the current audio data into text, analyze its intonation, speed, and volume, assign emotion labels, and create a report that detects suspicious movements and conversations."

[0200] In this way, the present invention provides a practical system that contributes to improving safety in the environment while maintaining high accuracy in the collection and analysis of voice data.

[0201] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0202] Step 1:

[0203] The terminal uses multi-directional microphones to acquire ambient sound in real time. The acquired audio data is generated through the audio acquisition mechanism and sent to the server for subsequent processing. The input is ambient sound, and the output is raw audio data.

[0204] Step 2:

[0205] The server converts the received audio data into text data using speech recognition. A speech recognition API such as Google Cloud Speech-to-Text is used here. The input is audio data, and the output is the corresponding text information.

[0206] Step 3:

[0207] The server processes the obtained text data through an information record generation mechanism to generate an information record containing specific key points. During this process, a generative AI model is used for summarization. The input is text data, and the output is a summarized information record.

[0208] Step 4:

[0209] The server processes audio data using sentiment analysis tools to perform sentiment analysis. It uses sentiment analysis APIs such as IBM Watson Tone Analyzer to analyze the intonation, speed, and volume of the audio data. The input is audio data, and the output is sentiment labels and sentiment scores.

[0210] Step 5:

[0211] The server uses informational records and sentiment labels to activate anomaly detection mechanisms and detect suspicious behavior or conversations. This is especially triggered when the input information exceeds pre-configured criteria. Inputs are informational records and sentiment labels, and outputs are anomaly detection alarms.
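The threshold-based trigger in step 5 can be illustrated with a simple rule check. This is only a sketch: the score scale, the threshold value, and the watch-word list are hypothetical, and the disclosure does not state how the pre-configured criteria are defined.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Alarm:
    reason: str
    score: float


def detect_anomaly(
    record_text: str,
    emotion_scores: Dict[str, float],
    watch_words: Iterable[str] = ("help", "stop"),
    anger_threshold: float = 0.8,
) -> Optional[Alarm]:
    """Return an Alarm when a pre-configured criterion is exceeded, else None."""
    anger = emotion_scores.get("anger", 0.0)
    if anger > anger_threshold:
        return Alarm("anger score above threshold", anger)
    lowered = record_text.lower()
    for word in watch_words:
        if word in lowered:  # plain substring match, kept simple on purpose
            return Alarm(f"watch word detected: {word}", anger)
    return None
```

In the shopping mall scenario described earlier, an escalating anger score would cross the threshold and produce an alarm that the server could forward to the management organization.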

[0212] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0213] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0214] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0215] [Second Embodiment]

[0216] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0217] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0218] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0219] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0220] The microphone 238 receives instructions from the user 20 in the form of voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0221] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0222] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0223] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0224] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0225] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0226] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0227] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0228] In order to implement this invention, it is essential to construct a system that accurately and efficiently processes data during and after business negotiations. Below, we will show an example of implementing the invention based on the processing of the program in this system.

[0229] First, during a business meeting, the device uses its microphone to capture the audio of the conversation. The audio data is sent to the server in streaming format. The server feeds the received audio data into a speech recognition engine, which converts it into text data. During this process, noise reduction and filtering techniques are used to extract only the voice of specific speakers.
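The speaker filtering described above can be pictured as a post-processing step over recognized segments. The sketch below assumes that the speech recognition engine returns segments already tagged with a speaker name and a confidence value. That output shape, the `Segment` type, and the confidence threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Segment:
    speaker: str
    text: str
    confidence: float  # recognition confidence, 0.0 to 1.0


def keep_target_speakers(
    segments: Iterable[Segment],
    targets: Set[str],
    min_confidence: float = 0.6,
) -> List[Segment]:
    """Keep only confidently recognized segments from the target speakers."""
    return [
        s for s in segments
        if s.speaker in targets and s.confidence >= min_confidence
    ]


def to_transcript(segments: Iterable[Segment]) -> str:
    """Render the kept segments as one 'speaker: text' line each."""
    return "\n".join(f"{s.speaker}: {s.text}" for s in segments)


segments = [
    Segment("sales", "Delivery is planned for the end of next month.", 0.93),
    Segment("passerby", "Where is the elevator?", 0.88),
    Segment("customer", "Can it be earlier?", 0.81),
    Segment("customer", "(inaudible)", 0.20),
]
transcript = to_transcript(keep_target_speakers(segments, {"sales", "customer"}))
```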

[0230] Once the text data is generated, the server uses a generative AI to automatically extract key points based on the context of the business negotiation, creating meeting minutes in real time. After any necessary adjustments, the minutes are displayed on the user's terminal for review.

[0231] Once a business meeting concludes, the user enters a brief memo into their terminal. This memo is often a short message containing additional information or noteworthy points from the meeting. The terminal sends the entered memo data to a server. The server uses AI to analyze this memo in detail and convert it into specific items. These include the purpose of the meeting, the requested content, and the next action points.
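One way to turn a free-form memo into the specific items listed above is to ask the generative AI for a structured answer and validate it before it is entered into the management system. In the sketch below, the prompt wording, the JSON key names, and the `generate` callable are assumptions made for illustration. The stub stands in for a real model call.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt; the key names are chosen for this sketch only.
PROMPT_TEMPLATE = (
    "Read the sales memo below and answer with a JSON object that has exactly "
    'the keys "purpose", "request" and "next_action".\n\nMemo: {memo}'
)
REQUIRED_KEYS = {"purpose", "request", "next_action"}


@dataclass(frozen=True)
class MemoDetails:
    purpose: str      # purpose of the meeting
    request: str      # requested content
    next_action: str  # next action point


def analyze_memo(memo: str, generate: Callable[[str], str]) -> MemoDetails:
    """Send the memo to the model and validate the structured reply."""
    raw = generate(PROMPT_TEMPLATE.format(memo=memo))
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("model response is not a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"model response lacks keys: {sorted(missing)}")
    return MemoDetails(
        purpose=str(data["purpose"]),
        request=str(data["request"]),
        next_action=str(data["next_action"]),
    )


def fake_generate(prompt: str) -> str:
    """Stub standing in for a generative AI call."""
    return json.dumps({
        "purpose": "Discuss delivery schedule",
        "request": "Earlier delivery date",
        "next_action": "Check stock and reply with a revised date",
    })


details = analyze_memo("The customer wants to expedite the delivery date.", fake_generate)
```

Validating the reply before recording it guards against a model response that omits a field, which helps preserve the data integrity mentioned in the description.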

[0232] The analyzed detailed data is automatically and sequentially entered by the server into the company's management system, such as a customer management platform. This process is conducted in accordance with security protocols, ensuring rapid and accurate data entry while maintaining data integrity.

[0233] These features allow users to quickly record key points from business negotiations and seamlessly integrate important information into the management system. This system is expected to allow sales team members to focus on their core responsibilities and improve overall work efficiency.

[0234] As a concrete example of implementing the invention, consider a scenario where a customer is discussing product delivery dates. A terminal acquires the audio, and a server transcribes it into text, extracts key points, and automatically generates meeting minutes such as "The product is scheduled for delivery at the end of next month." After the meeting, the user enters a note stating, "The customer wants to expedite the delivery date," and this information is converted into a detailed action plan through AI analysis and reflected in the management system.

[0235] The following describes the processing flow.

[0236] Step 1:

[0237] The terminal enables voice input at the start of a business meeting and captures the conversation of the meeting participants in real time. The captured voice data is sent to the server using a streaming protocol.

[0238] Step 2:

[0239] The server passes the received audio data to the speech recognition engine, which converts the audio into text. This conversion process includes noise reduction and speaker identification, ensuring that only the necessary parts are transcribed.

[0240] Step 3:

[0241] The server analyzes the text data generated by speech recognition and uses a generative AI to extract important keywords and phrases. Based on this, it automatically generates meeting minutes summarizing the business negotiation content.

[0242] Step 4:

[0243] The server sends the generated meeting minutes to the terminal, allowing the user to review them in real time during the business negotiation. It also provides an interface for the user to review important information.

[0244] Step 5:

[0245] After the business meeting concludes, the user enters a brief memo about the meeting into their device. This memo may include additional comments or information about the next steps.

[0246] Step 6:

[0247] The terminal sends the entered memo to the server.

[0248] (Example 1)

[0249] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0250] For many companies, accurately and efficiently recording information during business negotiations in real time, and quickly following up after those negotiations, is a critical challenge. With current technology, it's common to manually organize and record the vast amount of information generated during negotiations, which is time-consuming and labor-intensive. Furthermore, there are limited methods for efficiently incorporating supplementary information after negotiations and reflecting it in the company's management system. To address these challenges, there is a need to automate information processing and generate summaries and follow-up information quickly and accurately.

[0251] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0252] In this invention, the server includes an audio acquisition means for acquiring audio during business negotiations and generating audio information, an audio conversion means for analyzing the audio information and converting it into text information, an information generation means for extracting important information based on the text information and generating summary information, and an information analysis means for analyzing supplementary information entered after the business negotiation and converting it into detailed information. This makes it possible to record important information during business negotiations in real time and to quickly and efficiently reflect the information necessary for follow-up after the business negotiations in the management system.

[0253] "Audio acquisition means" refers to a device or function for collecting audio information from conversations such as business negotiations.

[0254] "Audio conversion means" refers to a function that analyzes the acquired audio information and converts it into text information.

[0255] "Information generation means" refers to a device or system that automatically extracts important information from the converted text information and generates summary information.

[0256] "Information input means" refers to a function for registering or inputting generated summary information into a company's management system.

[0257] "Information analysis means" refers to a function that analyzes supplementary information obtained after a business negotiation and converts it into detailed content.

[0258] A "data processing device" is a device or mechanism that instantly analyzes voice and text information and links it with a management system.

[0259] This invention provides a system for efficiently processing information during and after business negotiations. The following describes a specific implementation of this system.

[0260] During business negotiations, the terminal acquires audio of the conversation through a microphone device. Since the audio information is processed in real time, built-in microphones or high-quality external microphones are often used. The acquired audio information is immediately transmitted to the server using a streaming protocol. Technologies such as WebSocket are used for this purpose.
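The capture-and-stream step can be pictured as slicing the microphone's PCM byte stream into short frames that are sent as soon as they are available. The following is a minimal sketch under stated assumptions, not the implementation of this system: the 16 kHz sample rate, 16-bit mono samples, 20 ms frame length, and the names `frame_size_bytes` and `iter_frames` are all chosen for illustration, and the WebSocket transmission itself is omitted.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
FRAME_MS = 20             # assumed frame duration


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     frame_ms: int = FRAME_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of bytes in one frame of mono PCM audio."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample


def iter_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last frame may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1 s of silence
    frames = list(iter_frames(one_second, frame_size_bytes()))
    print(len(frames), len(frames[0]))
```

Each yielded frame would then be handed to whatever streaming transport the terminal uses.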

[0261] The server converts the received audio information into text information using the audio conversion means. Speech recognition APIs, such as the speech recognition services offered by various vendors, are used here. Noise cancellation and speaker identification are applied to the audio data so that highly accurate text information is obtained.

[0262] Next, the server analyzes the text information using a generative AI model. The generative AI model extracts important information based on the context and automatically generates a summary of the business negotiation. This involves commonly used generative AI technologies and text summarization algorithms. Specifically, the information is organized into meeting minutes.

[0263] The generated summary information is stored in the management system via the information input means. This management system can be integrated with enterprise systems such as customer information management platforms and project management systems. Data is imported under security protocols that keep it protected.

[0264] After a business meeting, the user inputs supplementary information via the terminal. This is recorded concisely as additional notes. The terminal sends this supplementary information to the server, which uses the information analysis means to convert the content into detailed information. Natural language processing technology is used for the analysis, and the supplementary information is turned into specific steps and actions.

[0265] An example of a prompt message is, "Extract key points from the sales meeting audio data and generate meeting minutes." This allows for efficient management of sales meeting content and enables quick follow-up.
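As an illustration of how such a prompt might be combined with the transcribed text before it is passed to a generative AI model, consider the following sketch. The function name `build_prompt` and the "Transcript:" layout are assumptions made for this example; the text above specifies only the instruction sentence itself.

```python
# Instruction sentence quoted from the description above.
PROMPT_INSTRUCTION = (
    "Extract key points from the sales meeting audio data "
    "and generate meeting minutes."
)


def build_prompt(transcript_lines: list[str],
                 instruction: str = PROMPT_INSTRUCTION) -> str:
    """Join the instruction and the non-empty transcript lines into one prompt."""
    transcript = "\n".join(
        line.strip() for line in transcript_lines if line.strip()
    )
    return f"{instruction}\n\nTranscript:\n{transcript}"


if __name__ == "__main__":
    print(build_prompt(["Sales: The new model ships next month.",
                        "Customer: Can we pre-order?"]))
```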

[0266] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0267] Step 1:

[0268] The terminal acquires audio information during a business negotiation via a microphone device. At this stage, the speaker's utterances are captured by the microphone, and audio information is generated as a digital signal. The input is raw audio, which is then prepared for transmission in real time via a streaming protocol, such as WebSocket. The output is the streaming audio data transmitted from the terminal.

[0269] Step 2:

[0270] The server receives streaming audio data transmitted from the terminal. Using speech conversion technology, it converts this audio data into text information. During this process, the received audio data undergoes noise reduction and speaker separation filtering. The input is streaming audio data, and the output is filtered text information.

[0271] Step 3:

[0272] The server uses a generative AI model to analyze the text information. The input text information is analyzed according to a prompt to extract key points and generate summary information. The generative AI automatically extracts key points from the context and organizes them into concise meeting minutes. The input is text information, and the output is a summary with the key points extracted.

[0273] Step 4:

[0274] The server inputs the generated summary information into the management system. This input method ensures that data is systematically stored on the company's information management platform. To maintain data integrity and security, information is securely transmitted to the management system via an API. The input is summary information, and the output is organized data stored on the company's system.

[0275] Step 5:

[0276] Users enter supplementary information using a terminal after a business meeting. These additional notes are entered as concise, key-point supplementary information. The input is text data manually entered by the user.

[0277] Step 6:

[0278] The terminal sends the input supplementary information to the server. The server analyzes the received supplementary information in detail using the information analysis means and converts it into specific items. In this process, natural language processing technology is used to identify specific action items that will lead to the next step in the business negotiation. The input is the text data of the supplementary information, and the output is the analyzed detailed information.

[0279] Step 7:

[0280] The server inputs detailed information based on the analysis results into the company's management system. This prepares the system for follow-up based on the content of the business negotiations. The input is detailed analysis information, and the output is a data update in the management system that reflects this information.
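The server-side portion of Steps 2 to 7 can be sketched as a small pipeline in which each processing stage is injected as a function. This is a hedged illustration only: `ManagementSystem`, `NegotiationServer`, and the stub lambdas are hypothetical names standing in for the speech recognition service, the generative AI model, the memo analysis, and the company's management system, none of which are specified at this level of detail in the text.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ManagementSystem:
    """Hypothetical in-memory stand-in for the company's management system."""
    records: list[dict] = field(default_factory=list)

    def store(self, kind: str, content: str) -> None:
        self.records.append({"kind": kind, "content": content})


@dataclass
class NegotiationServer:
    transcribe: Callable[[bytes], str]    # Step 2: audio -> text
    summarize: Callable[[str], str]       # Step 3: text -> summary
    analyze_memo: Callable[[str], str]    # Step 6: memo -> detailed information
    system: ManagementSystem

    def handle_audio(self, audio: bytes) -> str:
        text = self.transcribe(audio)
        summary = self.summarize(text)
        self.system.store("summary", summary)   # Step 4
        return summary

    def handle_memo(self, memo: str) -> str:
        detail = self.analyze_memo(memo)
        self.system.store("detail", detail)     # Step 7
        return detail


if __name__ == "__main__":
    server = NegotiationServer(
        transcribe=lambda audio: audio.decode("utf-8"),   # stub: bytes are text
        summarize=lambda text: text.split(".")[0] + ".",  # stub: first sentence
        analyze_memo=lambda memo: f"Action item: {memo}", # stub
        system=ManagementSystem(),
    )
    server.handle_audio(b"Customer wants a quote. Small talk followed.")
    server.handle_memo("send quote by Friday")
    print(server.system.records)
```

Injecting the stages as callables keeps the flow testable without committing to any particular vendor API.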

[0281] (Application Example 1)

[0282] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0283] In a business negotiation or sales promotion setting, it is necessary to immediately record the conversation with customers and extract key points to improve business efficiency. However, with conventional methods, some manual recording is required, and there are limitations in accuracy and speed. Also, the collation and input of information for follow-up after business negotiations are cumbersome and prone to human error. Furthermore, new means for practically utilizing this information to improve the quality of sales and services have been demanded.

[0284] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0285] In this invention, the server includes an acoustic acquisition means, a language recognition means, and a generation mechanism. As a result, it becomes possible to immediately record the voice of a business negotiation, convert it into text, extract important points, and present them to a visual device in real time. This aims to enable efficient recording of business negotiations and immediate utilization of information, thereby improving the quality of customer response.

[0286] The "acoustic acquisition means" is a part of a system that accurately acquires the voice during a business negotiation and records or generates it as voice data.

[0287] The "language recognition means" is a processing device that analyzes the acquired voice data and converts it into character data.

[0288] The "generation mechanism" is a component that extracts important points of a business negotiation from the converted text data and automatically generates a meeting memo.

[0289] A "visual device" is a device that visually displays the results generated from the audio data and presents them to the user during a business negotiation.

[0290] An "information processing system" refers to the entire database and its surrounding systems used to record and manage generated meeting minutes and analyzed detailed information.

[0291] The "descriptive analysis means" is a processing unit that uses natural language processing technology to expand notes entered after a business negotiation into detailed information.

[0292] A "data input device" is an automated component for accurately inputting generated meeting minutes into an information processing system.

[0293] A "remote information processing system" is a server and related equipment that operates via a network to instantly process audio data acquired during business negotiations and generate necessary information.

[0294] To implement this invention, multiple system components must work in coordination. A terminal equipped with sound acquisition means acquires audio generated during business negotiations in real time. The acquired audio data is immediately transmitted to a server. The server converts the audio data into text data using language recognition means. By utilizing a speech recognition engine (e.g., Google Speech-to-Text API) in this process, accurate and rapid text generation is possible.

[0295] Subsequently, the generation mechanism automatically extracts key points from the text data of the business negotiation. This process applies a generative AI model (e.g., OpenAI GPT) to automatically generate a summary of the key points as meeting minutes. The generated meeting minutes are immediately displayed on the terminal via the visual device, allowing the user to review them in real time. This information is recorded in the information processing system via the data input device.

[0296] Furthermore, after a business meeting, users input any additional notes they have taken into their terminals. The server then analyzes these notes in detail using the descriptive analysis means and uses the result for customer management and for creating next action plans in the information processing system. This allows sales members to quickly record the key points of business meetings and easily reflect important information in the management system.

[0297] As a concrete example, a sales staff member uses smart glasses when explaining a new product to a customer. Voice recognition instantly summarizes important information, such as "The new model is scheduled for release at the end of next month and is available for pre-order," as meeting minutes, which are then displayed on the visual device's screen. This process is based on prompts such as, "Please extract and summarize key sales points in real time during the negotiation."

[0298] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0299] Step 1:

[0300] The terminal uses an acoustic acquisition device to capture audio during business negotiations in real time. The input is ambient sound, which is recorded as digital data. The output is digital audio data in audio format.

[0301] Step 2:

[0302] The terminal transmits the acquired audio data to the server, which is the remote information processing system. The input is digital audio data, which is transferred to the server via the network. The output is the audio data stored on the server.

[0303] Step 3:

[0304] The server analyzes the audio data using the language recognition means and converts it into text data. The input is the received audio data, which is processed using a speech recognition engine (e.g., Google Speech-to-Text API). The output is text data.

[0305] Step 4:

[0306] The server extracts the key points of the negotiation from the text data using a generation mechanism. A generative AI model (e.g., OpenAI GPT) is used in this process. The input is the text data generated in Step 3, and the output is a summarized meeting memo.

[0307] Step 5:

[0308] The server sends the generated meeting memo to the visual device of the terminal and displays it immediately. The input is the meeting memo, which is presented to the user by the visual device. The output is the summarized information displayed on the terminal's display.

[0309] Step 6:

[0310] After the negotiation ends, the user inputs an additional memo into the terminal. This input memo will be the target for processing in the next step. The input is a memo in natural language entered by the user.

[0311] Step 7:

[0312] The server analyzes the input memo data using the descriptive analysis means and generates detailed information. The input is the memo collected in Step 6, which is analyzed using natural language processing technology. The output is a summary and detailed information required for the next step.

[0313] Step 8:

[0314] The server records the analyzed detailed information in the information processing system and uses it for creating the next action plan as needed. The input is the detailed information generated in Step 7, which is stored in the enterprise's information system. The output is data integrated into the enterprise's management platform.
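Steps 6 to 8 expand a brief memo into structured detail. The text attributes this to natural language processing; the sketch below swaps in a deliberately simple rule-based stand-in so that the data flow is visible. The `ActionItem` type, the `;` separator, and the "by <due phrase>" pattern are assumptions for illustration and are not taken from the description.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    text: str
    due: Optional[str]   # free-text due phrase, if one was found


# Hypothetical rule: a trailing "by <something>" is treated as a due phrase.
_DUE = re.compile(r"\s+by\s+(.+)$", re.IGNORECASE)


def expand_memo(memo: str) -> list[ActionItem]:
    """Split a brief memo on ';' or newlines and pull out due phrases."""
    items: list[ActionItem] = []
    for part in re.split(r"[;\n]+", memo):
        part = part.strip()
        if not part:
            continue
        match = _DUE.search(part)
        if match:
            items.append(ActionItem(text=part[:match.start()].strip(),
                                    due=match.group(1).strip()))
        else:
            items.append(ActionItem(text=part, due=None))
    return items


if __name__ == "__main__":
    for item in expand_memo("send quote by Friday; confirm stock"):
        print(item)
```

A production system would replace `expand_memo` with a call to the actual analysis model; only the shape of the output is meant to carry over.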

[0315] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0316] The present invention aims to provide a system that improves the quality of meeting minutes and sales records by integrally handling audio data during sales negotiations and memo data afterward, and further analyzing the user's emotions. The program processing in this system will be described in detail below.

[0317] During a business negotiation, the terminal uses multi-directional microphones to capture conversational audio in real time. The acquired audio data is then transmitted to a server using real-time streaming technology. The server inputs this audio data into a speech recognition engine and an emotion engine, which simultaneously convert the audio into text and analyze the speaker's emotions.

[0318] The speech recognition engine analyzes the received audio data and converts it into text data. The converted text data is then summarized by a generative AI, and the key points of the business negotiation are extracted and compiled into meeting minutes.

[0319] The emotion engine analyzes the characteristics of the speech signal, such as intonation, speed, and volume, and infers the speaker's emotions based on this analysis. The analyzed emotion data is added to the meeting minutes as emotion labels and indicators, providing users with information that allows them to understand the atmosphere and emotional flow of the business negotiation.
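To make the idea of inferring emotion from acoustic characteristics concrete, the following sketch computes two simple features, volume (as RMS amplitude) and speaking speed (as words per second), and maps them to a coarse label. It is a toy illustration under assumptions: the thresholds, the label names, and the function names are invented here, intonation analysis is omitted, and a real emotion engine would rely on a trained model rather than fixed rules.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, a simple proxy for volume."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speaking_rate(word_count: int, duration_s: float) -> float:
    """Words per second, a simple proxy for speed."""
    return word_count / duration_s if duration_s > 0 else 0.0


def emotion_label(volume: float, rate: float,
                  loud: float = 0.5, fast: float = 3.0) -> str:
    """Map two features to a coarse label using assumed thresholds."""
    if volume >= loud and rate >= fast:
        return "agitated"
    if volume >= loud:
        return "emphatic"
    if rate >= fast:
        return "hurried"
    return "calm"


if __name__ == "__main__":
    samples = [0.6, -0.7, 0.65, -0.6]   # normalized amplitudes in [-1, 1]
    print(emotion_label(rms(samples), speaking_rate(word_count=8, duration_s=2.0)))
```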

[0320] After a business meeting, the user enters a brief memo on their device as supplementary information. This memo is sent to a server and transformed into detailed data through AI analysis. Finally, the text data, sentiment data, and detailed data are centralized and automatically entered into management systems such as customer relationship management (CRM) systems.

[0321] As a concrete implementation example, let's consider a business meeting where a product demonstration is conducted. In this scenario, the server generates meeting minutes based on the audio data acquired in real time by the terminal, such as "Introducing two new features as strengths of the product." Simultaneously, the server uses sentiment analysis to assign a sentiment label indicating that "the customer showed a positive reaction to the new features."

[0322] In this way, the present invention combines information extraction from voice with emotion analysis to provide emotional insights along with important data from business negotiations, thereby enabling a deeper understanding of business negotiation activities and effective follow-up.

[0323] The following describes the processing flow.

[0324] Step 1:

[0325] The terminal enables voice input at the start of a business meeting and uses its built-in microphone to capture the conversation of multiple meeting participants in real time. The captured audio data is compressed and sent to a server over the network.

[0326] Step 2:

[0327] The server sends the received audio data to the speech recognition engine, which converts the audio into text data according to the context. This conversion process includes filtering of background noise and identification of each speaker.

[0328] Step 3:

[0329] Simultaneously, the server sends the audio data to the emotion engine, which evaluates and determines the speaker's emotions by analyzing the tone, pitch, speed, and other acoustic characteristics of the voice.

[0330] Step 4:

[0331] The server uses a generative AI to extract important keywords and phrases from the text data and automatically generate summarized meeting minutes. These minutes include the specific details of the business negotiation, along with the emotion labels obtained from the emotion engine.

[0332] Step 5:

[0333] The generated meeting minutes and sentiment labels are sent to the terminal after the meeting concludes, allowing the user to review them and add annotations or additional comments as needed.

[0334] Step 6:

[0335] After a business meeting, notes entered by the user are sent to the server via the terminal. The server uses AI to analyze these notes, convert them into a specific and detailed data form, and re-enter them into the management system.

[0336] Step 7:

[0337] Ultimately, all text data, sentiment data, and additional memo data are integrated and automatically recorded and managed in the company's management system. This process creates a comprehensive and multifaceted information database related to business negotiations.
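Steps 2 and 3 of this flow run speech recognition and emotion estimation on the same audio at the same time. One way to sketch that concurrency is with a thread pool, as below. The function name `analyze_in_parallel` and the two injected callables are hypothetical; the text does not say how the simultaneous processing is scheduled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def analyze_in_parallel(audio: bytes,
                        recognize: Callable[[bytes], str],
                        estimate_emotion: Callable[[bytes], str]) -> dict:
    """Run both engines on the same audio and combine their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, audio)
        emotion_future = pool.submit(estimate_emotion, audio)
        # result() blocks until each engine has finished.
        return {"text": text_future.result(),
                "emotion": emotion_future.result()}


if __name__ == "__main__":
    result = analyze_in_parallel(
        b"The new features look great",
        recognize=lambda audio: audio.decode("utf-8"),   # stub recognizer
        estimate_emotion=lambda audio: "positive",       # stub emotion engine
    )
    print(result)
```

Threads suit this case when both engines are remote services and the work is mostly waiting on network responses.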

[0338] (Example 2)

[0339] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0340] In business negotiations, it is necessary to record participants' statements as acoustic information and understand their content in detail. However, conventional systems have difficulty accurately capturing human emotions and understanding the atmosphere and flow of emotions during negotiations, and it is also difficult to efficiently supplement information after negotiations. Therefore, a system is needed that integrates everything from acquiring acoustic information to emotion analysis, generating meeting minutes, and inputting them into a management system.

[0341] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0342] In this invention, the server includes an acquisition means for acquiring acoustic information and generating acoustic data, a conversion means for analyzing the acoustic data and converting it into text data, and an emotion analysis means for analyzing emotion information based on the text data and acoustic data and generating emotion labels. This enables accurate acquisition and analysis of acoustic information, and a deep understanding of the content of business negotiations through emotion analysis.

[0343] "Acoustic information" refers to data that includes the properties of sound, acquired during business negotiations, conversations, and other interactions.

[0344] "Acoustic data" refers to acoustic information represented in digital format and is used for speech recognition and analysis.

[0345] "Acquisition means" refers to a mechanism for collecting acoustic information and generating acoustic data.

[0346] A "conversion means" is a device that has the function of analyzing acoustic data and converting it into text data.

[0347] "Text data" refers to data in text format converted from speech, in a format that humans can read and understand.

[0348] "Emotion information" refers to data about the speaker's emotional state, analyzed from acoustic data.

[0349] An "emotion label" is a label assigned to audio or dialogue as an indicator of emotion information.

[0350] "Emotion analysis means" refers to a device that analyzes the speaker's emotion information from acoustic data and text data and generates emotion labels.

[0351] "Generation means" refers to a mechanism for extracting important items from text data and generating meeting minutes.

[0352] "Meeting minutes" are records that summarize important statements and key points of discussion during business negotiations or meetings.

[0353] A "management platform" is a system for integrating, managing, and storing generated information.

[0354] "Implementation method" refers to a mechanism for inputting generated meeting minutes and sentiment labels into the management platform.

[0355] "Memo information" refers to additional information entered after a business negotiation, and serves a supplementary role.

[0356] A "memo analysis tool" is a mechanism for analyzing memo information, converting it into detailed information, and inputting it into a management platform.

[0357] The following describes the configuration for implementing this system.

[0358] The present invention aims to provide more detailed and valuable information by acquiring and analyzing acoustic information during business negotiations and meetings. First, a terminal acquires acoustic information during business negotiations using a multi-directional microphone. This terminal is equipped with communication means for acquiring acoustic information in real time and transmitting the data to a computing device. Here, acoustic data can be transmitted using the Internet Protocol.

[0359] The server acts as the computing device and analyzes the received acoustic data using speech recognition software and emotion analysis software. Specifically, a commercial speech recognition API converts the acoustic data into text data, and acoustic characteristic analysis software generates emotion labels from the acoustic data and the text data. Furthermore, a generative AI extracts key points from the text data and records them as meeting minutes.

[0360] After a business meeting concludes, users can manually enter memo information using their terminals. This information is further analyzed and centrally recorded as detailed information. This recorded information is then integrated into the management platform and shared among stakeholders.

[0361] As a concrete example, let's consider a sales meeting where a product is being introduced. In this case, the terminal picks up everyone's comments, and the server generates a summary such as, "Two innovative features were introduced as characteristics of the new product." Additionally, by adding sentiment labels, information such as, "The customer showed positive surprise about the new features" is also included. This process makes it possible to grasp both the important content of the sales meeting and the atmosphere of the situation.

[0362] An example of a prompt using a generative AI model would be something like, "Please indicate the themes that elicited the most responses during the business negotiation and the associated sentiment labels."

[0363] In this way, this system enhances the information gathering and analysis during business negotiations, thereby maximizing the effectiveness of negotiations and supporting subsequent operations.

[0364] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0365] Step 1:

[0366] The terminal acquires acoustic information in real time during business negotiations using multi-directional microphones. The input is acoustic signals from the surroundings, which are converted into digital acoustic data. Specifically, analog sound is collected from the microphone and converted into a digital signal using an A/D converter. This data is temporarily stored in a buffer within the terminal.

[0367] Step 2:

[0368] The terminal sends the digital audio data stored in the buffer to the server using real-time streaming technology. The input is the audio data in the buffer, and the output is the audio data sent to the server over the network. Specifically, this transmission uses the Internet Protocol: the data is divided into small packets and sent over a transport protocol such as UDP.

[0369] Step 3:

[0370] The server analyzes the received audio data. The input is digital audio data sent from the terminal, and the output is text data and emotion information. Specifically, speech recognition software is used to convert the audio data into text data, and emotion analysis software is used to analyze the intonation and speed of the speech. Based on this analysis, an emotion label is generated that quantifies the speaker's emotions.

[0371] Step 4:

[0372] The server uses a generative AI model to summarize key points based on the analyzed text data and generate meeting minutes. The input is text data, and the output is a summary of the meeting minutes. Specifically, the AI model extracts key phrases from the text and logically organizes them. This summary also includes speaker sentiment labels.

[0373] Step 5:

[0374] After the business meeting concludes, the user enters supplementary memo information using a terminal. The input is text information entered by the user, and the output is data converted into detailed memo information. Specifically, the user manually enters text, which is then used with natural language processing technology for subsequent analysis.

[0375] Step 6:

[0376] The server analyzes supplementary information from users and integrates it with existing meeting minutes and sentiment labels. Input is detailed memo information entered by the user, and output is an integrated sales opportunity record. Specifically, the analysis engine stores the memo information in a database, combines it with existing meeting minutes to build a comprehensive sales opportunity database, and automatically deploys it to the management platform.
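Step 2 of this flow divides buffered audio into small packets for transmission. Because a datagram transport such as UDP can reorder or drop packets, a receiver typically needs some way to put them back in order. The sketch below shows one common approach, a sequence number prefixed to each payload. The 4-byte header layout and the names `packetize` and `reassemble` are assumptions for illustration; the text does not describe a packet format, and the socket calls are omitted.

```python
import struct

HEADER = struct.Struct("!I")   # assumed header: 4-byte big-endian sequence number


def packetize(data: bytes, payload_size: int) -> list[bytes]:
    """Split data into packets, each prefixed with its sequence number."""
    return [
        HEADER.pack(seq) + data[start:start + payload_size]
        for seq, start in enumerate(range(0, len(data), payload_size))
    ]


def reassemble(packets: list[bytes]) -> tuple[bytes, list[int]]:
    """Reorder packets by sequence number and report any missing numbers."""
    by_seq = {HEADER.unpack(p[:HEADER.size])[0]: p[HEADER.size:] for p in packets}
    if not by_seq:
        return b"", []
    expected = range(max(by_seq) + 1)
    missing = [seq for seq in expected if seq not in by_seq]
    payload = b"".join(by_seq[seq] for seq in expected if seq in by_seq)
    return payload, missing


if __name__ == "__main__":
    packets = packetize(b"abcdefgh", payload_size=3)
    print(reassemble(list(reversed(packets))))   # out-of-order arrival
    print(reassemble([packets[0], packets[2]]))  # one packet lost
```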

[0377] (Application Example 2)

[0378] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".

[0379] Modern security services require accurate understanding of surrounding audio information and early detection of suspicious activity and conversations. However, current technology is insufficient for extracting key points from audio and understanding emotional nuances, making it difficult to achieve highly accurate surveillance. There is also a need for efficient systems that perform real-time analysis of audio data.

[0380] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0381] In this invention, the server includes sound acquisition means, speech recognition means, information record generation means, emotion analysis means, and anomaly detection means. This makes it possible to analyze the flow of voice information and emotions within the environment and to quickly detect suspicious movements and conversations.

[0382] "Sound acquisition means" refers to a device or configuration that effectively collects ambient sounds in an environment and supplies those sounds to a subsequent analysis process.

[0383] "Speech recognition means" refers to a technology or device that analyzes acquired speech information and converts it into corresponding text information.

[0384] "Information record generation means" refers to a process or system for extracting specific points based on textual information and creating appropriate information records.

[0385] "Emotion analysis means" refers to a technology or device for analyzing characteristics such as intonation, speed, and volume in speech, and for inferring and labeling the emotional state of the speaker.

[0386] An "anomaly detection means" is a function or method for detecting suspicious behavior or conversation based on analyzed voice and emotion data.

[0387] A "management mechanism" is a system or device that centrally manages collected and analyzed data and appropriately stores or displays necessary information.

[0388] A "computer" is a data processing device used to analyze audio data in real time, convert it into text information, and perform emotion analysis.

[0389] To realize an application example of this invention, first, a security robot system is used to acquire sound from the environment. Specifically, a multi-directional microphone is used as the device. This allows the robot to collect ambient sound data in real time while patrolling.

[0390] The server analyzes the acquired audio data using the speech recognition means and converts it into text. This process can utilize speech recognition APIs such as Google Cloud Speech-to-Text. The recognized text is then further processed by the information record generation means to extract specific key points and generate appropriate records.

[0391] Next, the server uses emotion analysis tools, such as an emotion analysis API like IBM Watson Tone Analyzer, to infer emotions from the intonation, speed, and volume characteristics of the voice data. This allows for the detection of potential disturbances in the patrolled area.

[0392] Furthermore, anomaly detection mechanisms are implemented to detect suspicious behavior and conversations based on analyzed voice and emotion data. In this process, the server utilizes AI technology to provide a warning system that enables a rapid response in emergencies.

[0393] As a concrete example, if a robot patrolling a shopping mall detects a conversation expressing dissatisfaction with a particular product and determines that anger is escalating, this information is reported to the management organization in real time, and appropriate action is taken.

[0394] An example of a prompt for a generative AI model would be: "Translate the current audio data into text, analyze its intonation, speed, and volume, assign emotion labels, and create a report that detects suspicious movements and conversations."

[0395] In this way, the present invention provides a practical system that contributes to improving safety in the environment while maintaining high accuracy in the collection and analysis of voice data.

[0396] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0397] Step 1:

[0398] The terminal uses multi-directional microphones to acquire ambient sound in real time. The acquired audio data is generated through the audio acquisition mechanism and sent to the server for subsequent processing. The input is ambient sound, and the output is raw audio data.

[0399] Step 2:

[0400] The server converts the received audio data into text data using speech recognition. A speech recognition API such as Google Cloud Speech-to-Text is used here. The input is audio data, and the output is the corresponding text information.

[0401] Step 3:

[0402] The server processes the obtained text data through the information record generation means to generate an information record containing specific key points. During this process, a generative AI model is used for summarization. The input is text data, and the output is a summarized information record.

[0403] Step 4:

[0404] The server performs emotion analysis on the audio data using the emotion analysis means. It uses emotion analysis APIs such as IBM Watson Tone Analyzer to analyze the intonation, speed, and volume of the audio data. The input is audio data, and the output is emotion labels and emotion scores.

[0405] Step 5:

[0406] The server uses informational records and sentiment labels to activate anomaly detection mechanisms and detect suspicious behavior or conversations. This is especially triggered when the input information exceeds pre-configured criteria. Inputs are informational records and sentiment labels, and outputs are anomaly detection alarms.
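As a non-limiting illustration, the threshold-based trigger of Step 5 can be sketched as follows. The emotion labels, the score threshold, and the list of watched terms are hypothetical values chosen for the sketch; the specification only states that detection is triggered when the input exceeds pre-configured criteria.

```python
from dataclasses import dataclass


@dataclass
class EmotionResult:
    label: str    # e.g. "anger" (label set is an assumption)
    score: float  # assumed to lie between 0.0 and 1.0


# Hypothetical pre-configured criteria; not specified in the source text.
ALERT_LABELS = {"anger", "fear"}
SCORE_THRESHOLD = 0.8
WATCHED_TERMS = ("refund", "complaint")


def detect_anomaly(record: str, emotion: EmotionResult) -> list[str]:
    """Return the reasons for an alarm; an empty list means no alarm."""
    reasons = []
    # Criterion 1: a watched emotion whose score reaches the threshold.
    if emotion.label in ALERT_LABELS and emotion.score >= SCORE_THRESHOLD:
        reasons.append(f"emotion:{emotion.label}")
    # Criterion 2: watched terms appearing in the information record.
    lowered = record.lower()
    for term in WATCHED_TERMS:
        if term in lowered:
            reasons.append(f"term:{term}")
    return reasons
```

Returning the list of reasons rather than a bare flag lets the report sent to the management organization state why the alarm was raised.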

[0407] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0408] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0409] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0410] [Third Embodiment]

[0411] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0412] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0413] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0414] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0415] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0416] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0417] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0418] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0419] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0420] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0421] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0422] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0423] In order to implement this invention, it is essential to construct a system that accurately and efficiently processes data during and after business negotiations. Below, we will show an example of implementing the invention based on the processing of the program in this system.

[0424] First, during a business meeting, the device uses its microphone to capture the audio of the conversation. The audio data is sent to the server in streaming format. The server feeds the received audio data into a speech recognition engine, which converts it into text data. During this process, noise reduction and filtering techniques are used to extract only the voice of specific speakers.
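As a non-limiting illustration, extracting only the voice of specific speakers can be sketched as a filter over transcribed segments. The sketch assumes that an upstream step has already attached a speaker ID and a confidence value to each segment; the `Segment` type, its field names, and the default confidence threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str       # speaker ID assigned by an upstream step (assumed)
    text: str          # recognized text for this segment
    confidence: float  # recognition confidence, assumed 0.0 to 1.0


def keep_target_speakers(segments, targets, min_confidence=0.5):
    """Keep segments spoken by the target speakers with enough confidence."""
    return [
        s for s in segments
        if s.speaker in targets and s.confidence >= min_confidence
    ]


def to_transcript(segments):
    """Render the kept segments as one 'speaker: text' line each."""
    return "\n".join(f"{s.speaker}: {s.text}" for s in segments)
```

Filtering after recognition is only one possible placement; the same selection could instead be applied to the audio itself before it reaches the speech recognition engine.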

[0425] Once the text data is generated, the server uses a generative AI to automatically extract key points based on the context of the business negotiation, creating meeting minutes in real time. After any adjustments, these minutes are displayed to the user on their device for review.

[0426] Once a business meeting concludes, the user enters a brief memo into their terminal. This memo is often a short message containing additional information or noteworthy points from the meeting. The terminal sends the entered memo data to a server. The server uses AI to analyze this memo in detail and convert it into specific items. These include the purpose of the meeting, the requested content, and the next action points.

[0427] The analyzed detailed data is automatically and sequentially entered by the server into the company's management system, such as a customer management platform. This process is conducted in accordance with security protocols, ensuring rapid and accurate data entry while maintaining data integrity.

[0428] These features allow users to quickly record key points from business negotiations and seamlessly integrate important information into the management system. This system is expected to allow sales team members to focus on their core responsibilities and improve overall work efficiency.

[0429] As a concrete example of implementing the invention, consider a scenario where a customer is discussing product delivery dates. A terminal acquires the audio, and a server transcribes it into text, extracts key points, and automatically generates meeting minutes such as "The product is scheduled for delivery at the end of next month." After the meeting, the user enters a note stating, "The customer wants to expedite the delivery date," and this information is converted into a detailed action plan through AI analysis and reflected in the management system.
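As a non-limiting illustration, the conversion of a post-meeting memo into specific items (purpose, requested content, next action points) can be sketched as follows. The prompt wording, the JSON reply format, and the `generate` callable are assumptions made for the sketch; `fake_generate` is a stand-in that returns a canned reply and does not call any real model.

```python
import json
from typing import Callable

# Hypothetical prompt; the specification does not give one for memo analysis.
MEMO_PROMPT = (
    "Analyze the following post-meeting memo and reply with a JSON object "
    'that has the keys "purpose", "requested_content", and "next_actions" '
    "(a list of strings).\nMemo: {memo}"
)


def analyze_memo(memo: str, generate: Callable[[str], str]) -> dict:
    """Ask a text-generation callable to expand a memo into structured items."""
    reply = generate(MEMO_PROMPT.format(memo=memo))
    data = json.loads(reply)
    # Normalize the reply so that missing keys become empty values.
    return {
        "purpose": str(data.get("purpose", "")),
        "requested_content": str(data.get("requested_content", "")),
        "next_actions": [str(a) for a in data.get("next_actions", [])],
    }


def fake_generate(prompt: str) -> str:
    """Stand-in for a generative AI call; the reply below is invented."""
    return json.dumps({
        "purpose": "Discuss product delivery dates",
        "requested_content": "Expedite the delivery date",
        "next_actions": [
            "Check whether an earlier delivery date is feasible",
            "Reply to the customer",
        ],
    })
```

For example, `analyze_memo("The customer wants to expedite the delivery date.", fake_generate)` returns a dictionary whose three entries can be entered into the management system as separate fields.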

[0430] The following describes the processing flow.

[0431] Step 1:

[0432] The terminal enables voice input at the start of a business meeting and captures the conversation of the meeting participants in real time. The captured voice data is sent to the server using a streaming protocol.

[0433] Step 2:

[0434] The server passes the received audio data to the speech recognition engine, which converts the audio into text. This conversion process includes noise reduction and speaker identification, ensuring that only the necessary parts are transcribed.

[0435] Step 3:

[0436] The server analyzes the text data generated by speech recognition and uses a generative AI to extract important keywords and phrases. Based on these, it automatically generates meeting minutes summarizing the content of the business negotiation.

[0437] Step 4:

[0438] The server sends the generated meeting minutes to the terminal, allowing the user to review them in real time during the business negotiation. It also provides an interface for the user to review important information.

[0439] Step 5:

[0440] After the business meeting concludes, the user enters a brief memo about the meeting into their device. This memo may include additional comments or information about the next steps.

[0441] Step 6:

[0442] The terminal sends the entered memo to the server.
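As a non-limiting illustration, the preparation of captured audio for streaming in Step 1 can be sketched as splitting the audio into sequenced chunks that the server reassembles. The `AudioChunk` type and the default chunk size are assumptions: 3200 bytes corresponds to 100 ms of 16 kHz, 16-bit, mono PCM, a format the specification does not prescribe. The network transport itself is omitted.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class AudioChunk:
    seq: int        # sequence number so the server can restore the order
    payload: bytes  # raw PCM bytes for this chunk
    final: bool     # True only for the last chunk of the recording


def chunk_audio(pcm: bytes, chunk_size: int = 3200) -> Iterator[AudioChunk]:
    """Split a PCM byte string into fixed-size chunks for streaming."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(pcm)
    for seq, start in enumerate(range(0, total, chunk_size)):
        end = min(start + chunk_size, total)
        yield AudioChunk(seq, pcm[start:end], final=(end == total))


def reassemble(chunks) -> bytes:
    """Server side: restore the byte string from possibly reordered chunks."""
    ordered = sorted(chunks, key=lambda c: c.seq)
    return b"".join(c.payload for c in ordered)
```

Small chunks keep latency low for real-time minutes, at the cost of more messages per second; the sequence number lets the server tolerate out-of-order arrival.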

[0443] (Example 1)

[0444] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0445] For many companies, accurately and efficiently recording information during business negotiations in real time, and quickly following up after those negotiations, is a critical challenge. With current technology, it is common to manually organize and record the vast amount of information generated during negotiations, which is time-consuming and labor-intensive. Furthermore, there are limited methods for efficiently incorporating supplementary information after negotiations and reflecting it in the company's management system. To address these challenges, there is a need to automate information processing and generate summaries and follow-up information quickly and accurately.

[0446] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0447] In this invention, the server includes an audio acquisition means for acquiring audio during business negotiations and generating audio information, an audio conversion means for analyzing the audio information and converting it into text information, an information generation means for extracting important information based on the text information and generating summary information, and an information analysis means for analyzing supplementary information entered after the business negotiation and converting it into detailed information. This makes it possible to record important information during business negotiations in real time and to quickly and efficiently reflect the information necessary for follow-up after the business negotiations in the management system.

[0448] "Audio acquisition means" refers to a device or function for collecting audio information from conversations such as business negotiations.

[0449] "Audio conversion means" refers to a function that analyzes the acquired audio information and converts it into text information.

[0450] "Information generation means" refers to a device or system that has the function of automatically extracting important information from the converted text information and generating summary information.

[0451] "Information input means" refers to a function for registering or inputting generated summary information into a company's management system.

[0452] "Information analysis means" refers to a function that analyzes supplementary information obtained after a business negotiation and converts it into detailed content.

[0453] A "data processing device" is a device or mechanism that instantly analyzes voice and text information and links it with a management system.

[0454] This invention provides a system for efficiently processing information during and after business negotiations. The following describes a specific implementation of this system.

[0455] During business negotiations, the terminal acquires audio of the conversation through a microphone device. Since the audio information is processed in real time, built-in microphones or high-quality external microphones are often used. The acquired audio information is immediately transmitted to the server using a streaming protocol. Technologies such as WebSocket are used for this purpose.

[0456] The server converts the received audio information into text information using the audio conversion means. Here, an API providing speech recognition technology, such as a speech recognition service from one of various vendors, is used. The audio data is subjected to noise cancellation and speaker identification filtering so that highly accurate text information is obtained.

[0457] Next, the server analyzes the text information using a generative AI model. The generative AI model extracts important information based on the context and automatically generates a summary of the business negotiation. This involves commonly used generative AI technologies and text summarization algorithms. Specifically, the information is organized into meeting minutes.

[0458] The generated summary information is stored in the management system via the information input means. This management system can be integrated with enterprise systems such as customer information management platforms and project management systems. Data is imported under security protocols that keep it protected.

[0459] After a business meeting, the user inputs supplementary information via the terminal. This is recorded concisely as additional notes. The terminal sends this supplementary information to the server, which uses the information analysis means to convert the content into detailed information. Natural language processing technology is used for the analysis, and the supplementary information is converted into specific steps and actions.

[0460] An example of a prompt message is, "Extract key points from the sales meeting audio data and generate meeting minutes." This allows for efficient management of sales meeting content and enables quick follow-up.

[0461] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0462] Step 1:

[0463] The terminal acquires audio information during a business negotiation via a microphone device. At this stage, the speaker's utterances are captured by the microphone, and audio information is generated as a digital signal. The input is raw audio, which is then prepared for transmission in real time via a streaming protocol, such as WebSocket. The output is the streaming audio data transmitted from the terminal.

[0464] Step 2:

[0465] The server receives streaming audio data transmitted from the terminal. Using speech conversion technology, it converts this audio data into text information. During this process, the received audio data undergoes noise reduction and speaker separation filtering. The input is streaming audio data, and the output is filtered text information.

[0466] Step 3:

[0467] The server uses a generative AI model to analyze the text information. It submits the input text information to the model together with a prompt in order to extract key points and generate summary information. The generative AI automatically extracts key points from the context and organizes them into concise meeting minutes. The input is text information, and the output is a summary with the key points extracted.

[0468] Step 4:

[0469] The server inputs the generated summary information into the management system. This input method ensures that data is systematically stored on the company's information management platform. To maintain data integrity and security, information is securely transmitted to the management system via an API. The input is summary information, and the output is organized data stored on the company's system.

[0470] Step 5:

[0471] Users enter supplementary information using a terminal after a business meeting. These additional notes are entered as concise, key-point supplementary information. The input is text data manually entered by the user.

[0472] Step 6:

[0473] The terminal sends the input supplementary information to the server. The server analyzes the received supplementary information in detail using information analysis tools and converts it into specific items. In this process, natural language processing technology is used to identify specific action items that will lead to the next step in the business negotiation. The input is the text data of the supplementary information, and the output is the analyzed detailed information.

[0474] Step 7:

[0475] The server inputs detailed information based on the analysis results into the company's management system. This prepares the system for follow-up based on the content of the business negotiations. The input is detailed analysis information, and the output is a data update in the management system that reflects this information.
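As a non-limiting illustration, the server-side flow of Steps 2 to 4 and Steps 6 to 7 can be sketched as a pipeline whose stages are supplied as callables. The class names, the `upsert` method, and the in-memory `ManagementSystem` are hypothetical stand-ins; a real deployment would replace the callables with a speech recognition service and a generative AI model, and the store with the company's management system accessed via its API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ManagementSystem:
    """In-memory stand-in for the company's management system."""
    records: dict = field(default_factory=dict)

    def upsert(self, meeting_id: str, **fields) -> None:
        # Create the meeting record if needed, then merge the new fields.
        self.records.setdefault(meeting_id, {}).update(fields)


@dataclass
class NegotiationPipeline:
    transcribe: Callable[[bytes], str]   # Step 2: audio -> text
    summarize: Callable[[str], str]      # Step 3: text -> summary
    analyze_memo: Callable[[str], dict]  # Step 6: memo -> detailed items
    system: ManagementSystem

    def on_meeting_audio(self, meeting_id: str, audio: bytes) -> str:
        """Steps 2 to 4: transcribe, summarize, store the summary."""
        text = self.transcribe(audio)
        minutes = self.summarize(text)
        self.system.upsert(meeting_id, minutes=minutes)
        return minutes

    def on_memo(self, meeting_id: str, memo: str) -> dict:
        """Steps 6 and 7: analyze the memo, store the detailed items."""
        details = self.analyze_memo(memo)
        self.system.upsert(meeting_id, details=details)
        return details
```

Keying both updates on the same meeting ID means the minutes and the memo-derived details end up in one record, so the follow-up information sits next to the summary it supplements.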

[0476] (Application Example 1)

[0477] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0478] In business negotiations and sales promotions, it is necessary to improve operational efficiency by immediately recording conversations with customers and extracting key points. However, conventional methods require some manual recording, limiting accuracy and speed. Furthermore, organizing and inputting information for follow-up after negotiations is cumbersome and prone to human error. Moreover, there is a need for new methods to make practical use of this information and improve the quality of sales and services.

[0479] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0480] In this invention, the server includes an acoustic acquisition means, a language recognition means, and a generation mechanism. This makes it possible to instantly record the audio of a business negotiation, transcribe it into text, extract important points, and present them on a visual device in real time. The aim is to improve the quality of customer service by enabling efficient recording of business negotiations and immediate use of the information.

[0481] "Acoustic acquisition means" refers to a part of a system that accurately acquires audio during business negotiations and records and generates it as audio data.

[0482] A "language recognition means" is a processing device that analyzes acquired audio data and converts it into text data.

[0483] The "generation mechanism" is a component that extracts key points from converted text data and automatically generates meeting minutes.

[0484] A "visual device" is a device that visually displays the generated results, such as meeting minutes, and provides them to the user during a business negotiation.

[0485] An "information processing system" refers to the entire database and its surrounding systems used to record and manage generated meeting minutes and analyzed detailed information.

[0486] The "descriptive analysis means" is a processing unit that uses natural language processing technology to expand notes entered after a business negotiation into detailed information.

[0487] A "data input device" is an automated component for accurately inputting generated meeting minutes into an information processing system.

[0488] A "remote information processing system" is a server and related equipment that operates via a network to instantly process audio data acquired during business negotiations and generate necessary information.

[0489] To implement this invention, multiple system components must work in coordination. A terminal equipped with the acoustic acquisition means acquires audio generated during business negotiations in real time. The acquired audio data is immediately transmitted to a server. The server converts the audio data into text data using the language recognition means. By utilizing a speech recognition engine (e.g., Google Speech-to-Text API) in this process, accurate and rapid text generation is possible.

[0490] Subsequently, a generation mechanism is used to automatically extract key points from the text data of the business negotiation. This process applies a generative AI model (e.g., OpenAI GPT) to automatically generate summaries of the points and meeting minutes. The generated meeting minutes are immediately displayed on the terminal via a visual device, allowing the user to review them in real time. This information is recorded in the information processing system via a document input mechanism.

[0491] Furthermore, after a business meeting, users input any additional notes they have taken into their terminals. The server then analyzes this information in detail using descriptive analysis tools and utilizes it for customer management and creating next action plans in the information processing system. This allows sales members to quickly record the key points of business meetings and easily reflect important information in the management system.

[0492] As a concrete example, a sales staff member uses smart glasses when explaining a new product to a customer. Voice recognition instantly summarizes important information, such as "The new model is scheduled for release at the end of next month and is available for pre-order," as meeting minutes, which are then displayed on the visual device's screen. This process is based on prompts such as, "Please extract and summarize key sales points in real time during the negotiation."

[0493] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0494] Step 1:

[0495] The terminal uses an acoustic acquisition device to capture audio during business negotiations in real time. The input is ambient sound, which is recorded as digital data. The output is digital audio data in audio format.

[0496] Step 2:

[0497] The terminal transmits the acquired audio data to a server, which is a remote information processing device. The input is digital audio data, which is transferred to the server via the network. The output is the audio data stored on the server.

[0498] Step 3:

[0499] The server analyzes speech data using language recognition tools and converts it into text data. The input is the received speech data, which is processed using a speech recognition engine (e.g., Google Speech-to-Text API). The output is text data.

[0500] Step 4:

[0501] The server uses a generation mechanism to extract key points from the text data of a business negotiation. This process employs a generative AI model (e.g., OpenAI GPT). The input is the text data generated in step 3, and the output is summarized meeting minutes.

[0502] Step 5:

[0503] The server sends the generated meeting minutes to the terminal's visual device for immediate display. The input is the meeting minutes, which are presented to the user via the visual device. The output is the summary information displayed on the terminal's display.

[0504] Step 6:

[0505] After the business meeting concludes, the user enters additional notes into the terminal. These entered notes will be processed in the next step. The input is a natural language note entered by the user.

[0506] Step 7:

[0507] The server analyzes the input memo data using descriptive analysis tools and generates detailed information. The input is the memo collected in step 6, which is analyzed using natural language processing techniques. The output is a summary and detailed information necessary for the next step.

[0508] Step 8:

[0509] The server records the analyzed details in the information processing system and uses them to create the next action plan as needed. The input is the detailed information generated in step 7, which is stored in the company's information system. The output is data integrated into the company's management platform.
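As a non-limiting illustration, the real-time presentation of Steps 3 to 5 can be sketched as an accumulator that regenerates the minutes each time a fixed number of new transcript segments has arrived. The `LiveMinutes` class, the `refresh_every` parameter, and the refresh policy itself are assumptions; the specification only states that the minutes are displayed immediately.

```python
from typing import Callable


class LiveMinutes:
    """Accumulates transcript segments and refreshes the minutes periodically."""

    def __init__(self, summarize: Callable[[str], str], refresh_every: int = 3):
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self._summarize = summarize        # stand-in for the generative AI model
        self._refresh_every = refresh_every
        self._segments: list[str] = []
        self.minutes = ""                  # latest minutes to show on the display

    def add_segment(self, text: str) -> bool:
        """Store a segment; return True when the minutes were refreshed."""
        self._segments.append(text)
        if len(self._segments) % self._refresh_every == 0:
            # Summarize the whole transcript so far, not only the new part.
            self.minutes = self._summarize(" ".join(self._segments))
            return True
        return False
```

The return value tells the caller when to push updated minutes to the visual device, so the display is redrawn only when its content has actually changed.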

[0510] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0511] The present invention aims to provide a system that improves the quality of meeting minutes and sales records by integrally handling audio data during sales negotiations and memo data afterward, and further analyzing the user's emotions. The program processing in this system will be described in detail below.

[0512] During a business negotiation, the terminal uses multi-directional microphones to capture conversational audio in real time. The acquired audio data is then transmitted to a server using real-time streaming technology. The server inputs this audio data into a speech recognition engine and an emotion engine, which simultaneously convert the audio into text and analyze the speaker's emotions.

[0513] The speech recognition engine analyzes the received audio data and converts it into text data. The converted text data is then summarized by a generative AI, and the key points of the business negotiation are extracted and compiled into meeting minutes.

[0514] The emotion engine analyzes the characteristics of the speech signal, such as intonation, speed, and volume, and infers the speaker's emotions based on this analysis. The analyzed emotion data is added to the meeting minutes as emotion labels and indicators, providing users with information that allows them to understand the atmosphere and emotional flow of the business negotiation.

[0515] After a business meeting, the user enters a brief memo on their device as supplementary information. This memo is sent to a server and transformed into detailed data through AI analysis. Finally, the text data, sentiment data, and detailed data are centralized and automatically entered into management systems such as customer relationship management (CRM) systems.

[0516] As a concrete implementation example, let's consider a business meeting where a product demonstration is conducted. In this scenario, the server generates meeting minutes based on the audio data acquired in real time by the terminal, such as "Introducing two new features as strengths of the product." Simultaneously, the server uses sentiment analysis to assign a sentiment label indicating that "the customer showed a positive reaction to the new features."

[0517] In this way, the present invention combines information extraction from voice with emotion analysis to provide emotional insights along with important data from business negotiations, thereby enabling a deeper understanding of business negotiation activities and effective follow-up.

[0518] The following describes the processing flow.

[0519] Step 1:

[0520] The terminal enables voice input at the start of a business meeting and uses its built-in microphone to capture the conversation of multiple meeting participants in real time. The captured audio data is compressed and sent to a server over the network.

[0521] Step 2:

[0522] The server sends the received audio data to the speech recognition engine, which converts the audio into text data according to the context. This conversion process includes filtering of background noise and identification of each speaker.

[0523] Step 3:

[0524] Simultaneously, the server sends the audio data to the emotion engine, which evaluates and determines the speaker's emotions by analyzing the tone, pitch, speed, and other acoustic characteristics of the voice.

[0525] Step 4:

[0526] The server uses a generative AI to extract important keywords and phrases from the text data and automatically generate summarized meeting minutes. These minutes include the specific details of the business negotiation, along with the emotion labels obtained from the emotion engine.

[0527] Step 5:

[0528] The generated meeting minutes and sentiment labels are sent to the terminal after the meeting concludes, allowing the user to review them and add annotations or additional comments as needed.

[0529] Step 6:

[0530] After a business meeting, notes entered by the user are sent to the server via the terminal. The server uses AI to analyze these notes, convert them into a specific and detailed data form, and re-enter them into the management system.

[0531] Step 7:

[0532] Ultimately, all text data, sentiment data, and additional memo data are integrated and automatically recorded and managed in the company's management system. This process creates a comprehensive and multifaceted information database related to business negotiations.
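As a non-limiting illustration, an emotion engine that works from acoustic characteristics can be sketched with two simple features computed from audio samples. The root-mean-square amplitude serves as a rough proxy for volume, and the zero-crossing rate as a rough proxy for how sharp the voice sounds; these features, the thresholds, and the three labels are all assumptions made for the sketch and are far simpler than a trained emotion identification model.

```python
import math


def rms(samples):
    """Root-mean-square amplitude, a rough proxy for volume."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)


# Hypothetical thresholds for samples normalized to the range -1.0 to 1.0.
LOUD_RMS = 0.5
HIGH_ZCR = 0.3


def emotion_label(samples):
    """Map the two features to one of three illustrative labels."""
    loud = rms(samples) >= LOUD_RMS
    sharp = zero_crossing_rate(samples) >= HIGH_ZCR
    if loud and sharp:
        return "agitated"
    if loud:
        return "emphatic"
    return "calm"


def annotate_minutes(minutes, samples):
    """Append the emotion label to a line of meeting minutes."""
    return f"{minutes} [emotion: {emotion_label(samples)}]"
```

Attaching the label to the corresponding line of the minutes keeps the text data and the emotion data together when they are recorded in the management system.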

[0533] (Example 2)

[0534] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0535] In business negotiations, it is necessary to record participants' statements as acoustic information and understand their content in detail. However, conventional systems have difficulty accurately capturing human emotions and understanding the atmosphere and flow of emotions during negotiations, and it is also difficult to efficiently supplement information after negotiations. Therefore, a system is needed that integrates everything from acquiring acoustic information to emotion analysis, generating meeting minutes, and inputting them into a management system.

[0536] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0537] In this invention, the server includes an acquisition means for acquiring acoustic information and generating acoustic data, a conversion means for analyzing the acoustic data and converting it into text data, and an emotion analysis means for analyzing emotion information based on the text data and acoustic data and generating emotion labels. This enables accurate acquisition and analysis of acoustic information, and a deep understanding of the content of business negotiations through emotion analysis.

[0538] "Acoustic information" refers to data that includes the properties of sound, acquired during business negotiations, conversations, and other interactions.

[0539] "Acoustic data" refers to acoustic information represented in digital format and is used for speech recognition and analysis.

[0540] "Acquisition means" refers to a mechanism for collecting acoustic information and generating acoustic data.

[0541] A "conversion means" is a device that has the function of analyzing acoustic data and converting it into text data.

[0542] "Text data" refers to data in text format converted from speech, in a format that humans can read and understand.

[0543] "Emotion information" refers to data about the speaker's emotional state, analyzed from acoustic data.

[0544] An "emotion label" is a label assigned to audio or dialogue as an indicator of emotion information.

[0545] An "emotion analysis means" is a device that analyzes the speaker's emotion information from acoustic data and text data and generates emotion labels.

[0546] A "generation method" is a mechanism for extracting important items from text data and generating meeting minutes.

[0547] "Meeting minutes" are records that summarize important statements and key points of discussion during business negotiations or meetings.

[0548] A "management platform" is a system for integrating, managing, and storing generated information.

[0549] "Implementation method" refers to a mechanism for inputting generated meeting minutes and sentiment labels into the management platform.

[0550] "Memo information" refers to additional information entered after a business negotiation, and serves a supplementary role.

[0551] A "memo analysis tool" is a mechanism for analyzing memo information, converting it into detailed information, and inputting it into a management platform.

[0552] The following describes the configuration for implementing this system.

[0553] The present invention aims to provide more detailed and valuable information by acquiring and analyzing acoustic information during business negotiations and meetings. First, a terminal acquires acoustic information during business negotiations using a multi-directional microphone. This terminal is equipped with communication means for acquiring acoustic information in real time and transmitting the data to a computing device. Here, acoustic data can be transmitted using the Internet Protocol.

[0554] The server acts as a computing device, analyzing the received acoustic data. This analysis utilizes speech recognition software and sentiment analysis software. Specifically, a commercial speech recognition API is used for speech recognition, and acoustic characteristic analysis software is used for sentiment analysis, converting the acoustic data into text data and generating sentiment labels from the acoustic and text data. Furthermore, a generation AI is used to extract key points from the text data and record them as meeting minutes.

[0555] After a business meeting concludes, users can manually enter memo information using their terminals. This information is further analyzed and centrally recorded as detailed information. This recorded information is then integrated into the management platform and shared among stakeholders.

[0556] As a concrete example, let's consider a sales meeting where a product is being introduced. In this case, the terminal picks up everyone's comments, and the server generates a summary such as, "Two innovative features were introduced as characteristics of the new product." Additionally, by adding sentiment labels, information such as, "The customer showed positive surprise about the new features" is also included. This process makes it possible to grasp both the important content of the sales meeting and the atmosphere of the situation.

[0557] An example of a prompt using a generative AI model would be something like, "Please indicate the themes that elicited the most responses during the business negotiation and the associated sentiment labels."

[0558] In this way, this system enhances the information gathering and analysis during business negotiations, thereby maximizing the effectiveness of negotiations and supporting subsequent operations.

[0559] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0560] Step 1:

[0561] The terminal acquires acoustic information in real time during business negotiations using multi-directional microphones. The input is acoustic signals from the surroundings, which are converted into digital acoustic data. Specifically, analog sound is collected from the microphone and converted into a digital signal using an A / D converter. This data is temporarily stored in a buffer within the terminal.
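
As a rough illustration of the digitization and buffering in Step 1, the sketch below quantizes samples to 16-bit PCM and holds the resulting frames in a bounded buffer. The 16-bit sample format, the names `quantize` and `AudioBuffer`, and the buffer capacity are assumptions made for illustration only; the specification does not fix any of them, and an actual terminal would perform the A/D conversion in hardware.

```python
import struct
from collections import deque

SAMPLE_MAX = 32767  # largest value of a signed 16-bit sample (assumed format)


def quantize(samples):
    """Map samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # clip out-of-range input
        ints.append(int(round(clipped * SAMPLE_MAX)))
    return struct.pack(f"<{len(ints)}h", *ints)


class AudioBuffer:
    """Bounded FIFO of PCM frames; the oldest frame is dropped when full."""

    def __init__(self, max_frames=50):
        self._frames = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def drain(self):
        """Return all buffered frames in arrival order and empty the buffer."""
        frames = list(self._frames)
        self._frames.clear()
        return frames
```

Dropping the oldest frame when the buffer is full is one possible policy; a terminal could equally block or signal an overflow.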

[0562] Step 2:

[0563] The terminal sends the digital acoustic data stored in the buffer to the server using real-time streaming. The input is the acoustic data in the buffer, and the output is the acoustic data sent to the server over the network. Specifically, the transmission uses the Internet Protocol: the data is divided into small packets and sent over a transport suited to streaming, such as UDP.
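
One way the packet division in Step 2 could look is sketched below. The header layout (a 32-bit sequence number and a 16-bit payload length), the 960-byte payload size, and the function names are hypothetical and not part of the disclosure. A sequence number is included because UDP does not guarantee delivery order.

```python
import struct

# Hypothetical header: sequence number (uint32) + payload length (uint16),
# packed in network byte order.
HEADER = struct.Struct("!IH")


def packetize(audio: bytes, payload_size: int = 960):
    """Split buffered audio into datagram-sized packets with a small header."""
    packets = []
    for seq, start in enumerate(range(0, len(audio), payload_size)):
        payload = audio[start:start + payload_size]
        packets.append(HEADER.pack(seq, len(payload)) + payload)
    return packets


def reassemble(packets):
    """Reorder received packets by sequence number and join their payloads."""
    decoded = []
    for pkt in packets:
        seq, length = HEADER.unpack_from(pkt)
        decoded.append((seq, pkt[HEADER.size:HEADER.size + length]))
    return b"".join(payload for _, payload in sorted(decoded))
```

Each packet returned by `packetize` could then be passed to `socket.sendto` on a datagram socket, with `reassemble` running on the server side. Handling of lost packets is omitted from this sketch.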

[0564] Step 3:

[0565] The server analyzes the received audio data. The input is digital audio data sent from the terminal, and the output is text data and emotion information. Specifically, speech recognition software is used to convert the audio data into text data, and emotion analysis software is used to analyze the intonation and speed of the speech. Based on this analysis, an emotion label is generated that quantifies the speaker's emotions.
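
The following sketch illustrates how an emotion label might be quantified from speech speed and intonation. It is a simple heuristic under stated assumptions, not the emotion analysis software referred to in the specification: the normalization constants, the thresholds, the label names, and the use of pitch spread as a proxy for intonation are all hypothetical.

```python
from dataclasses import dataclass
from statistics import pstdev

# Assumed normalization constants (not taken from the specification).
FAST_RATE_WPS = 4.0   # words per second treated as maximally fast
WIDE_PITCH_HZ = 60.0  # pitch spread treated as maximally varied


@dataclass
class EmotionLabel:
    label: str
    score: float  # 0.0 (calm) to 1.0 (highly aroused)


def emotion_label(text: str, duration_s: float, pitch_hz: list) -> EmotionLabel:
    """Score one utterance from its speaking rate and pitch variation."""
    rate = len(text.split()) / duration_s if duration_s > 0 else 0.0
    spread = pstdev(pitch_hz) if len(pitch_hz) > 1 else 0.0
    score = (0.5 * min(rate / FAST_RATE_WPS, 1.0)
             + 0.5 * min(spread / WIDE_PITCH_HZ, 1.0))
    if score >= 0.6:
        label = "excited"
    elif score >= 0.3:
        label = "neutral"
    else:
        label = "calm"
    return EmotionLabel(label, round(score, 2))
```

Here `text` is assumed to be the recognized text of one utterance and `pitch_hz` a list of per-frame pitch estimates produced by some upstream analysis.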

[0566] Step 4:

[0567] The server uses a generative AI model to summarize key points based on the analyzed text data and generate meeting minutes. The input is text data, and the output is a summary of the meeting minutes. Specifically, the AI model extracts key phrases from the text and logically organizes them. This summary also includes speaker sentiment labels.
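
To make the shape of the Step 4 output concrete, the sketch below selects key utterances and attaches the speaker's emotion label to each line. A plain word-frequency ranking is used here as a stand-in for the generative AI model, which this sketch does not attempt to reproduce; the stopword list, the tuple layout, and the function names are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "to", "of",
             "and", "in", "for", "we", "it", "on"}


def _content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]


def build_minutes(utterances, top_n=2):
    """Pick the top_n utterances by summed word frequency, in spoken order.

    utterances: list of (speaker, text, emotion_label) tuples.
    """
    freq = Counter(w for _, text, _ in utterances
                   for w in _content_words(text))
    scores = [sum(freq[w] for w in set(_content_words(text)))
              for _, text, _ in utterances]
    ranked = sorted(range(len(utterances)),
                    key=lambda i: scores[i], reverse=True)
    lines = []
    for i in sorted(ranked[:top_n]):  # restore spoken order
        speaker, text, label = utterances[i]
        lines.append(f"{speaker} [{label}]: {text}")
    return lines
```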

[0568] Step 5:

[0569] After the business meeting concludes, the user enters supplementary memo information using the terminal. The input is text information entered by the user, and the output is data converted into detailed memo information. Specifically, the user manually enters text, which is then analyzed using natural language processing technology.

[0570] Step 6:

[0571] The server analyzes supplementary information from users and integrates it with existing meeting minutes and sentiment labels. Input is detailed memo information entered by the user, and output is an integrated sales opportunity record. Specifically, the analysis engine stores the memo information in a database, combines it with existing meeting minutes to build a comprehensive sales opportunity database, and automatically deploys it to the management platform.
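
The integration in Step 6 can be pictured as merging three inputs into a single record before it is passed to the management platform. The sketch below is one possible data layout; the class name `OpportunityRecord`, its fields, the semicolon/newline memo splitting, and the JSON serialization are assumptions made for illustration and are not specified in the disclosure.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class OpportunityRecord:
    meeting_id: str
    minutes: list
    emotion_labels: list
    memo_items: list = field(default_factory=list)


def integrate(record: OpportunityRecord, memo_text: str) -> OpportunityRecord:
    """Split a free-text memo into items and attach them to the record."""
    parts = memo_text.replace("\n", ";").split(";")
    record.memo_items.extend(p.strip() for p in parts if p.strip())
    return record


def to_payload(record: OpportunityRecord) -> str:
    """Serialize the integrated record for hand-off to a management platform."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
```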

[0572] (Application Example 2)

[0573] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0574] Modern security services require accurate understanding of surrounding audio information and early detection of suspicious activity and conversations. However, current technology is insufficient for extracting key points from audio and understanding emotional nuances, making it difficult to achieve highly accurate surveillance. There is also a need for efficient systems that perform real-time analysis of audio data.

[0575] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0576] In this invention, the server includes voice acquisition means, voice recognition means, information recording generation means, emotion analysis means, and anomaly detection means. This makes it possible to analyze the flow of voice information and emotions within the environment and to quickly detect suspicious movements and conversations.

[0577] "Sound acquisition means" refers to a device or configuration that effectively collects ambient sounds in an environment and supplies those sounds to a subsequent analysis process.

[0578] "Speech recognition means" refers to a technology or device that analyzes acquired speech information and converts it into corresponding text information.

[0579] "Information record generation means" refers to a process or system for extracting specific points based on textual information and creating appropriate information records.

[0580] "Emotional analysis means" refers to a technology or device for analyzing characteristics such as intonation, speed, and volume in speech, and for inferring and labeling the emotional state of the speaker.

[0581] An "anomaly detection means" is a function or method for detecting suspicious behavior or conversation based on analyzed voice and emotion data.

[0582] A "management mechanism" is a system or device that centrally manages collected and analyzed data and appropriately stores or displays necessary information.

[0583] A "computer" is a computer or data processing device used to analyze audio data in real time and perform textual information and sentiment analysis.

[0584] To realize an application example of this invention, first, a security robot system is used to acquire sound from the environment. Specifically, a multi-directional microphone is used as the device. This allows the robot to collect ambient sound data in real time while patrolling.

[0585] The server analyzes the acquired audio data using speech recognition and converts it into text. This process can utilize speech recognition APIs such as Google Cloud Speech-to-Text. The recognized text is then further processed by an information recording generation system to extract specific key points and generate appropriate records.

[0586] Next, the server uses emotion analysis tools, such as an emotion analysis API like IBM Watson Tone Analyzer, to infer emotions from the intonation, speed, and volume characteristics of the voice data. This allows for the detection of potential disturbances in the patrolled area.

[0587] Furthermore, anomaly detection mechanisms are implemented to detect suspicious behavior and conversations based on analyzed voice and emotion data. In this process, the server utilizes AI technology to provide a warning system that enables a rapid response in emergencies.

[0588] As a concrete example, if a robot patrolling a shopping mall detects a conversation expressing dissatisfaction with a particular product and determines that anger is escalating, this information is reported to the management organization in real time, and appropriate action is taken.

[0589] An example of a prompt for a generative AI model would be: "Translate the current audio data into text, analyze its intonation, speed, and volume, assign emotion labels, and create a report that detects suspicious movements and conversations."

[0590] In this way, the present invention provides a practical system that contributes to improving safety in the environment while maintaining high accuracy in the collection and analysis of voice data.

[0591] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0592] Step 1:

[0593] The terminal uses multi-directional microphones to acquire ambient sound in real time. The acquired audio data is generated through the audio acquisition mechanism and sent to the server for subsequent processing. The input is ambient sound, and the output is raw audio data.

[0594] Step 2:

[0595] The server converts the received audio data into text data using speech recognition. A speech recognition API such as Google Cloud Speech-to-Text is used here. The input is audio data, and the output is the corresponding text information.

[0596] Step 3:

[0597] The server processes the obtained text data through an information record generation mechanism to generate an information record containing specific key points. During this process, a generation AI model is used for summarization. The input is text data, and the output is a summarized information record.

[0598] Step 4:

[0599] The server processes audio data using sentiment analysis tools to perform sentiment analysis. It uses sentiment analysis APIs such as IBM Watson Tone Analyzer to analyze the intonation, speed, and volume of the audio data. The input is audio data, and the output is sentiment labels and sentiment scores.

[0600] Step 5:

[0601] The server uses informational records and sentiment labels to activate anomaly detection mechanisms and detect suspicious behavior or conversations. This is especially triggered when the input information exceeds pre-configured criteria. Inputs are informational records and sentiment labels, and outputs are anomaly detection alarms.
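
A minimal sketch of a criterion-based trigger for Step 5 is shown below. It raises an alarm when the most recent observations all carry the same emotion label with a score at or above a threshold. The `Observation` fields, the default threshold of 0.7, and the requirement of two consecutive observations are hypothetical values chosen for illustration; the specification only states that pre-configured criteria are used.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    summary: str   # information record for one stretch of conversation
    emotion: str   # emotion label
    score: float   # emotion score in [0.0, 1.0]


def detect_anomaly(history, emotion="anger", threshold=0.7, min_consecutive=2):
    """Return an alarm message if recent observations meet the criteria, else None."""
    recent = history[-min_consecutive:]
    if len(recent) < min_consecutive:
        return None
    if all(o.emotion == emotion and o.score >= threshold for o in recent):
        last = recent[-1]
        return f"ALERT: sustained {emotion} ({last.score:.2f}): {last.summary}"
    return None
```

Requiring several consecutive observations is a design choice intended to reduce alarms from a single raised voice; a deployment could tune or replace this rule.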

[0602] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0603] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0604] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset-type terminal 314.

[0605] [Fourth Embodiment]

[0606] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0607] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0608] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0609] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0610] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0611] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0612] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0613] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0614] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0615] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0616] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0617] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0618] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0619] In order to implement this invention, it is essential to construct a system that accurately and efficiently processes data during and after business negotiations. Below, we will show an example of implementing the invention based on the processing of the program in this system.

[0620] First, during a business meeting, the terminal uses its microphone to capture the audio of the conversation. The audio data is sent to the server in streaming format. The server feeds the received audio data into a speech recognition engine, which converts it into text data. During this process, noise reduction and filtering techniques are used to extract only the voice of specific speakers.

[0621] Once the text data is generated, the server uses a generative AI to automatically extract key points based on the context of the business negotiation, creating meeting minutes in real time. After any adjustments, the minutes are displayed on the terminal for the user to review.

[0622] Once a business meeting concludes, the user enters a brief memo into their terminal. This memo is often a short message containing additional information or noteworthy points from the meeting. The terminal sends the entered memo data to a server. The server uses AI to analyze this memo in detail and convert it into specific items. These include the purpose of the meeting, the requested content, and the next action points.

[0623] The analyzed detailed data is automatically and sequentially entered by the server into the company's management system, such as a customer management platform. This process is conducted in accordance with security protocols, ensuring rapid and accurate data entry while maintaining data integrity.

[0624] These features allow users to quickly record key points from business negotiations and seamlessly integrate important information into the management system. This system is expected to allow sales team members to focus on their core responsibilities and improve overall work efficiency.

[0625] As a concrete example of implementing the invention, consider a scenario where a customer is discussing product delivery dates. A terminal acquires the audio, and a server transcribes it into text, extracts key points, and automatically generates meeting minutes such as "The product is scheduled for delivery at the end of next month." After the meeting, the user enters a note stating, "The customer wants to expedite the delivery date," and this information is converted into a detailed action plan through AI analysis and reflected in the management system.

[0626] The following describes the processing flow.

[0627] Step 1:

[0628] The terminal enables voice input at the start of a business meeting and captures the conversation of the meeting participants in real time. The captured voice data is sent to the server using a streaming protocol.

[0629] Step 2:

[0630] The server passes the received audio data to the speech recognition engine, which converts the audio into text. This conversion process includes noise reduction and speaker identification, ensuring that only the necessary parts are transcribed.

[0631] Step 3:

[0632] The server analyzes the text data generated by speech recognition and uses a generation AI to extract important keywords and phrases. Based on this, it automatically generates meeting minutes summarizing the business negotiation content.

[0633] Step 4:

[0634] The server sends the generated meeting minutes to the terminal, allowing the user to review them in real time during the business negotiation. It also provides an interface for the user to review important information.

[0635] Step 5:

[0636] After the business meeting concludes, the user enters a brief memo about the meeting into the terminal. This memo may include additional comments or information about the next steps.

[0637] Step 6:

[0638] The terminal sends the entered memo to the server.

[0639] (Example 1)

[0640] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0641] For many companies, accurately and efficiently recording information during business negotiations in real time, and quickly following up after those negotiations, is a critical challenge. With current technology, it's common to manually organize and record the vast amount of information generated during negotiations, which is time-consuming and labor-intensive. Furthermore, there are limited methods for efficiently incorporating supplementary information after negotiations and reflecting it in the company's management system. To address these challenges, there is a need to automate information processing and generate summaries and follow-up information quickly and accurately.

[0642] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0643] In this invention, the server includes an audio acquisition means for acquiring audio during business negotiations and generating audio information, an audio conversion means for analyzing the audio information and converting it into text information, an information generation means for extracting important information based on the text information and generating summary information, and an information analysis means for analyzing supplementary information entered after the business negotiation and converting it into detailed information. This makes it possible to record important information during business negotiations in real time and to quickly and efficiently reflect the information necessary for follow-up after the business negotiations in the management system.

[0644] "Voice acquisition means" refers to a device or function for collecting voice information from conversations such as business negotiations.

[0645] A "speech conversion means" is a function that analyzes acquired speech information and converts it into text information.

[0646] "Information generation means" refers to a device or system that has the function of automatically extracting important information from converted character information and generating summary information.

[0647] "Information input means" refers to a function for registering or inputting generated summary information into a company's management system.

[0648] An "information analysis tool" is a tool that has the function of analyzing supplementary information obtained after a business negotiation and converting it into detailed content.

[0649] A "data processing device" is a device or mechanism that instantly analyzes voice and text information and links it with a management system.

[0650] This invention provides a system for efficiently processing information during and after business negotiations. The following describes a specific implementation of this system.

[0651] During business negotiations, the terminal acquires audio of the conversation through a microphone device. Since the audio information is processed in real time, built-in microphones or high-quality external microphones are often used. The acquired audio information is immediately transmitted to the server using a streaming protocol. Technologies such as WebSocket are used for this purpose.

[0652] The server converts the received audio information into text information using a speech conversion method. Here, APIs providing speech recognition technology, such as speech recognition services from various vendors, are used. The audio data is subjected to noise cancellation and speaker identification filtering to ensure highly accurate text information is obtained.

[0653] Next, the server analyzes the text information using a generative AI model. The generative AI model extracts important information based on the context and automatically generates a summary of the business negotiation. This involves commonly used generative AI technologies and text summarization algorithms. Specifically, the information is organized into meeting minutes.

[0654] The generated summary information is stored in the management system via an information input method. This management system can be integrated with enterprise systems such as customer information management platforms and project management systems. Data is imported while being accurately protected based on security protocols.

[0655] After a business meeting, the user inputs supplementary information via the terminal. This is recorded concisely as additional notes. The terminal sends this supplementary information to the server, which uses information analysis tools to convert the content into detailed information. Natural language processing technology is used for the analysis, which turns the supplementary information into specific steps and actions.

[0656] An example of a prompt message is, "Extract key points from the sales meeting audio data and generate meeting minutes." This allows for efficient management of sales meeting content and enables quick follow-up.

[0657] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0658] Step 1:

[0659] The terminal acquires audio information during a business negotiation via a microphone device. At this stage, the speaker's utterances are captured by the microphone, and audio information is generated as a digital signal. The input is raw audio, which is then prepared for transmission in real time via a streaming protocol, such as WebSocket. The output is the streaming audio data transmitted from the terminal.

[0660] Step 2:

[0661] The server receives streaming audio data transmitted from the terminal. Using speech conversion technology, it converts this audio data into text information. During this process, the received audio data undergoes noise reduction and speaker separation filtering. The input is streaming audio data, and the output is filtered text information.
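
The effect of speaker-based filtering on the transcript can be sketched as follows. As a simplification, the filter is applied to recognized segments rather than to the audio itself, and it assumes the recognizer returns a speaker tag and a confidence value for each segment; the `Segment` fields and the 0.6 confidence cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str       # speaker tag assumed to come from the recognizer
    text: str
    confidence: float  # recognition confidence in [0.0, 1.0]


def filter_segments(segments, speakers, min_confidence=0.6):
    """Keep segments from the wanted speakers that were recognized confidently."""
    wanted = set(speakers)
    return [s for s in segments
            if s.speaker in wanted and s.confidence >= min_confidence]


def to_text(segments):
    """Render the kept segments as filtered text information."""
    return "\n".join(f"{s.speaker}: {s.text}" for s in segments)
```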

[0662] Step 3:

[0663] The server uses a generative AI model to analyze the text information. It passes the input text information to the model together with a prompt, so that key points are extracted and summary information is generated. The generative AI automatically extracts key points from the context and organizes them into concise meeting minutes. The input is text information, and the output is a summary with the key points extracted.
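
The prompt-based step could be wired up roughly as shown below. The instruction string is the example prompt given in paragraph [0656]; everything else is an assumption for illustration, including the prompt layout and the `model` parameter, which stands for any callable that sends a prompt to a generative AI service and returns its text reply. No particular vendor API is implied.

```python
from typing import Callable

# Example instruction taken from paragraph [0656] of the specification.
PROMPT = ("Extract key points from the sales meeting audio data "
          "and generate meeting minutes.")


def build_prompt(transcript_lines, instruction: str = PROMPT) -> str:
    """Combine the instruction with the transcript as a bulleted list."""
    body = "\n".join(f"- {line}" for line in transcript_lines)
    return f"{instruction}\n\nTranscript:\n{body}"


def summarize(transcript_lines, model: Callable[[str], str]) -> str:
    """Send the assembled prompt to the injected model and tidy the reply."""
    return model(build_prompt(transcript_lines)).strip()
```

Passing the model in as a callable keeps the sketch independent of any specific service and makes it straightforward to substitute a stub during testing.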

[0664] Step 4:

[0665] The server inputs the generated summary information into the management system. This input method ensures that data is systematically stored on the company's information management platform. To maintain data integrity and security, information is securely transmitted to the management system via an API. The input is summary information, and the output is organized data stored on the company's system.

[0666] Step 5:

[0667] Users enter supplementary information using a terminal after a business meeting. These additional notes are entered as concise, key-point supplementary information. The input is text data manually entered by the user.

[0668] Step 6:

[0669] The terminal sends the input supplementary information to the server. The server analyzes the received supplementary information in detail using information analysis tools and converts it into specific items. In this process, natural language processing technology is used to identify specific action items that will lead to the next step in the business negotiation. The input is the text data of the supplementary information, and the output is the analyzed detailed information.
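
As an illustration of converting supplementary information into specific items, the sketch below sorts memo sentences into requests, next actions, and other notes. Keyword matching is used as a deliberately simple stand-in for the natural language processing described above; the cue lists, category names, and the function `analyze_memo` are hypothetical.

```python
import re

# Hypothetical cue words; a production system would use a trained model.
REQUEST_CUES = ("wants", "requests", "asked", "needs")
ACTION_CUES = ("send", "schedule", "confirm", "follow up", "prepare")


def analyze_memo(memo: str) -> dict:
    """Classify each memo sentence as a request, a next action, or a note."""
    items = {"requests": [], "next_actions": [], "notes": []}
    for sentence in re.split(r"(?<=[.!?])\s+", memo.strip()):
        if not sentence:
            continue
        lowered = sentence.lower()
        if any(cue in lowered for cue in REQUEST_CUES):
            items["requests"].append(sentence)
        elif any(cue in lowered for cue in ACTION_CUES):
            items["next_actions"].append(sentence)
        else:
            items["notes"].append(sentence)
    return items
```

For a memo such as "The customer wants to expedite the delivery date.", the sentence would land in the `requests` list, from which a follow-up action could then be derived.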

[0670] Step 7:

[0671] The server inputs detailed information based on the analysis results into the company's management system. This prepares the system for follow-up based on the content of the business negotiations. The input is detailed analysis information, and the output is a data update in the management system that reflects this information.

[0672] (Application Example 1)

[0673] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0674] In business negotiations and sales promotions, it is necessary to improve operational efficiency by immediately recording conversations with customers and extracting key points. However, conventional methods require some manual recording, limiting accuracy and speed. Furthermore, organizing and inputting information for follow-up after negotiations is cumbersome and prone to human error. Moreover, there is a need for new methods to put this information to practical use and improve the quality of sales and services.

[0675] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0676] In this invention, the server includes an audio acquisition means, a language recognition means, and a generation mechanism. This makes it possible to instantly record the audio of a business negotiation, transcribe it into text, extract important points, and present them on a visual device in real time. The aim is to improve the quality of customer service by enabling efficient recording of business negotiations and immediate use of the information.

[0677] "Audio acquisition means" refers to a part of a system that accurately acquires audio during business negotiations and records and generates it as audio data.

[0678] A "language recognition means" is a processing device that analyzes acquired audio data and converts it into text data.

[0679] The "generation mechanism" is a component that extracts key points from converted text data and automatically generates meeting minutes.

[0680] A "visual device" is a device that visually displays the generated results and presents them to the user during a business negotiation.

[0681] An "information processing system" refers to the entire database and its surrounding systems used to record and manage generated meeting minutes and analyzed detailed information.

[0682] The "descriptive analysis means" is a processing unit that uses natural language processing technology to expand notes entered after a business negotiation into detailed information.

[0683] A "data input device" is an automated component for accurately inputting generated meeting minutes into an information processing system.

[0684] A "remote information processing system" is a server and related equipment that operates via a network to instantly process audio data acquired during business negotiations and generate necessary information.

[0685] To implement this invention, multiple system components must work in coordination. A terminal equipped with the audio acquisition means acquires audio generated during business negotiations in real time. The acquired audio data is immediately transmitted to a server. The server converts the audio data into text data using the language recognition means. Using a speech recognition engine (e.g., Google Speech-to-Text API) in this process enables accurate and rapid text generation.

[0686] Subsequently, the generation mechanism automatically extracts key points from the text data of the business negotiation. This process applies a generative AI model (e.g., OpenAI GPT) to automatically generate summaries of the points and meeting minutes. The generated meeting minutes are immediately displayed on the terminal via the visual device, allowing the user to review them in real time. This information is recorded in the information processing system via the data input device.

[0687] Furthermore, after a business meeting, users enter any additional notes they have taken into their terminals. The server then analyzes these notes in detail using the descriptive analysis means, and the results are used in the information processing system for customer management and for creating next action plans. This allows sales members to quickly record the key points of business meetings and easily reflect important information in the management system.

[0688] As a concrete example, a sales staff member uses smart glasses when explaining a new product to a customer. The speech is recognized, and important information, such as "The new model is scheduled for release at the end of next month and is available for pre-order," is instantly summarized as meeting minutes and displayed on the screen of the visual device. This process is based on prompts such as, "Please extract and summarize key sales points in real time during the negotiation."

[0689] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0690] Step 1:

[0691] The terminal uses an acoustic acquisition device to capture audio during business negotiations in real time. The input is ambient sound, which is recorded as digital data. The output is digital audio data in audio format.

[0692] Step 2:

[0693] The terminal transmits the acquired audio data to a server, which is a remote information processing device. The input is digital audio data, which is transferred to the server via the network. The output is the audio data stored on the server.

[0694] Step 3:

[0695] The server analyzes speech data using language recognition tools and converts it into text data. The input is the received speech data, which is processed using a speech recognition engine (e.g., Google Speech-to-Text API). The output is text data.

[0696] Step 4:

[0697] The server uses a generation mechanism to extract key points from the text data of a business negotiation. This process employs a generative AI model (e.g., OpenAI GPT). The input is the text data generated in step 3, and the output is a summarized meeting memo.

[0698] Step 5:

[0699] The server sends the generated meeting minutes to the terminal's visual device for immediate display. The input is the meeting minutes, which are presented to the user via the visual device. The output is the summary information displayed on the terminal's display.

[0700] Step 6:

[0701] After the business meeting concludes, the user enters additional notes into the terminal. These entered notes will be processed in the next step. The input is a natural language note entered by the user.

[0702] Step 7:

[0703] The server analyzes the input memo data using descriptive analysis tools and generates detailed information. The input is the memo collected in step 6, which is analyzed using natural language processing techniques. The output is a summary and detailed information necessary for the next step.

[0704] Step 8:

[0705] The server records the analyzed details in the information processing system and uses them to create the next action plan as needed. The input is the detailed information generated in step 7, which is stored in the company's information system. The output is data integrated into the company's management platform.
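Steps 3 to 8 above can be sketched as a small pipeline. This is a minimal sketch, assuming the speech recognition engine, the generative AI model, and the memo analysis are supplied as plain callables; no real vendor API is called. `NegotiationRecord`, `MinutesPipeline`, and the `store` list that stands in for the information processing system are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NegotiationRecord:
    transcript: str = ""  # output of Step 3
    minutes: str = ""     # output of Step 4
    details: str = ""     # output of Step 7


@dataclass
class MinutesPipeline:
    transcribe: Callable[[bytes], str]   # stand-in for the speech recognition engine
    summarize: Callable[[str], str]      # stand-in for the generative AI model
    analyze_memo: Callable[[str], str]   # stand-in for the descriptive analysis
    store: list = field(default_factory=list)  # stand-in for the management system

    def during_meeting(self, audio: bytes) -> NegotiationRecord:
        """Steps 3 and 4: transcribe the audio, then summarize the transcript."""
        record = NegotiationRecord()
        record.transcript = self.transcribe(audio)
        record.minutes = self.summarize(record.transcript)
        return record  # Step 5: the caller displays record.minutes on the terminal

    def after_meeting(self, record: NegotiationRecord, memo: str) -> NegotiationRecord:
        """Steps 7 and 8: analyze the user's memo and store the finished record."""
        record.details = self.analyze_memo(memo)
        self.store.append(record)
        return record
```

Passing the engines in as callables keeps the sketch independent of any particular service; a deployment would replace the stubs with calls to whichever speech recognition and generative AI services it uses.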

[0706] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0707] The present invention aims to provide a system that improves the quality of meeting minutes and sales records by integrally handling audio data during sales negotiations and memo data afterward, and further analyzing the user's emotions. The program processing in this system will be described in detail below.

[0708] During a business negotiation, the terminal uses multi-directional microphones to capture conversational audio in real time. The acquired audio data is then transmitted to a server using real-time streaming technology. The server inputs this audio data into a speech recognition engine and an emotion engine, which simultaneously convert the audio into text and analyze the speaker's emotions.

[0709] The speech recognition engine analyzes the received audio data and converts it into text data. The converted text data is then summarized by a generative AI, and the key points of the business negotiation are extracted and compiled into meeting minutes.

[0710] The emotion engine analyzes the characteristics of the speech signal, such as intonation, speed, and volume, and infers the speaker's emotions based on this analysis. The analyzed emotion data is added to the meeting minutes as emotion labels and indicators, providing users with information that allows them to understand the atmosphere and emotional flow of the business negotiation.
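The kind of acoustic analysis described above can be illustrated with a rough sketch. It assumes audio arrives as a list of float samples in the range -1 to 1, uses RMS amplitude for volume and the zero-crossing rate as a crude stand-in for pitch, and takes the speaking rate as an input because it is normally derived from the transcript timing. The thresholds and the labels in `label_emotion` are hypothetical; the source does not disclose how the emotion engine maps features to emotions.

```python
import math


def rms_volume(samples: list) -> float:
    """Root-mean-square amplitude, used here as a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples: list, sample_rate: int) -> float:
    """Rough pitch estimate in Hz: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings * sample_rate / (2 * len(samples))


def label_emotion(volume: float, pitch_hz: float, words_per_second: float) -> str:
    """Map features to a coarse label using illustrative (hypothetical) thresholds."""
    if volume > 0.6 and words_per_second > 3.5:
        return "agitated"
    if volume < 0.2 and words_per_second < 2.0:
        return "calm"
    if pitch_hz > 250 and words_per_second > 3.0:
        return "excited"
    return "neutral"
```

A production emotion engine would more likely use a trained model over richer features; the rule table only shows where intonation, speed, and volume enter the decision.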

[0711] After a business meeting, the user enters a brief memo on their device as supplementary information. This memo is sent to a server and transformed into detailed data through AI analysis. Finally, the text data, sentiment data, and detailed data are centralized and automatically entered into management systems such as customer relationship management (CRM) systems.

[0712] As a concrete implementation example, let's consider a business meeting where a product demonstration is conducted. In this scenario, the server generates meeting minutes based on the audio data acquired in real time by the terminal, such as "Introducing two new features as strengths of the product." Simultaneously, the server uses sentiment analysis to assign a sentiment label indicating that "the customer showed a positive reaction to the new features."

[0713] In this way, the present invention combines information extraction from voice with emotion analysis to provide emotional insights along with important data from business negotiations, thereby enabling a deeper understanding of business negotiation activities and effective follow-up.

[0714] The following describes the processing flow.

[0715] Step 1:

[0716] The terminal enables voice input at the start of a business meeting and uses its built-in microphone to capture the conversation of multiple meeting participants in real time. The captured audio data is compressed and sent to a server over the network.

[0717] Step 2:

[0718] The server sends the received audio data to the speech recognition engine, which converts the audio into text data according to the context. This conversion process includes filtering of background noise and identification of each speaker.

[0719] Step 3:

[0720] Simultaneously, the server sends the audio data to the emotion engine, which evaluates and determines the speaker's emotions by analyzing the tone, pitch, speed, and other acoustic characteristics of the voice.

[0721] Step 4:

[0722] The server uses a generative AI to extract important keywords and phrases from the text data and automatically generate summarized meeting minutes. These minutes include the specific details of the business negotiation, along with the sentiment labels obtained from the emotion engine.

[0723] Step 5:

[0724] The generated meeting minutes and sentiment labels are sent to the terminal after the meeting concludes, allowing the user to review them and add annotations or additional comments as needed.

[0725] Step 6:

[0726] After a business meeting, notes entered by the user are sent to the server via the terminal. The server uses AI to analyze these notes, convert them into a specific and detailed data form, and re-enter them into the management system.

[0727] Step 7:

[0728] Ultimately, all text data, sentiment data, and additional memo data are integrated and automatically recorded and managed in the company's management system. This process creates a comprehensive and multifaceted information database related to business negotiations.
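The integration of text data, sentiment data, and memo data in the steps above can be sketched as follows. This sketch assumes the speech recognition engine returns time-stamped segments per speaker and the emotion engine returns time-stamped label spans, which the source does not state; each segment receives the label of the span it overlaps most. `Segment`, `attach_emotions`, and `build_record` are hypothetical names, and the JSON string stands in for whatever format the management system accepts.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Segment:
    start: float                   # seconds from the start of the meeting
    end: float
    speaker: str
    text: str
    emotion: Optional[str] = None  # filled in by attach_emotions


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the time interval shared by two spans (0 if they do not overlap)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attach_emotions(segments: list, emotion_spans: list) -> list:
    """Give each segment the label of the (start, end, label) span it overlaps most."""
    for seg in segments:
        best_label, best_overlap = None, 0.0
        for start, end, label in emotion_spans:
            shared = overlap(seg.start, seg.end, start, end)
            if shared > best_overlap:
                best_label, best_overlap = label, shared
        seg.emotion = best_label
    return segments


def build_record(segments: list, memo_details: str) -> str:
    """Serialize labeled segments and the analyzed memo into one JSON record."""
    payload = {
        "segments": [asdict(seg) for seg in segments],
        "memo_details": memo_details,
    }
    return json.dumps(payload, ensure_ascii=False)
```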

[0729] (Example 2)

[0730] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0731] In business negotiations, it is necessary to record participants' statements as acoustic information and understand their content in detail. However, conventional systems have difficulty accurately capturing human emotions and understanding the atmosphere and flow of emotions during negotiations, and it is also difficult to efficiently supplement information after negotiations. Therefore, a system is needed that integrates everything from acquiring acoustic information to emotion analysis, generating meeting minutes, and inputting them into a management system.

[0732] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0733] In this invention, the server includes an acquisition means for acquiring acoustic information and generating acoustic data, a conversion means for analyzing the acoustic data and converting it into text data, and an emotion analysis means for analyzing emotion information based on the text data and acoustic data and generating emotion labels. This enables accurate acquisition and analysis of acoustic information, and a deep understanding of the content of business negotiations through emotion analysis.

[0734] "Acoustic information" refers to data that includes the properties of sound, acquired during business negotiations, conversations, and other interactions.

[0735] "Acoustic data" refers to acoustic information represented in digital format and is used for speech recognition and analysis.

[0736] "Acquisition means" refers to a mechanism for collecting acoustic information and generating acoustic data.

[0737] A "conversion means" is a device that has the function of analyzing acoustic data and converting it into text data.

[0738] "Text data" refers to data in text format converted from speech, in a format that humans can read and understand.

[0739] "Emotional information" refers to data about the speaker's emotional state, analyzed from acoustic data.

[0740] An "emotional label" is a label assigned to audio or dialogue as an indicator of emotional information.

[0741] An "emotion analysis means" is a device that analyzes the speaker's emotional information from acoustic data and text data and generates emotion labels.

[0742] A "generation method" is a mechanism for extracting important items from text data and generating meeting minutes.

[0743] "Meeting minutes" are records that summarize important statements and key points of discussion during business negotiations or meetings.

[0744] A "management platform" is a system for integrating, managing, and storing generated information.

[0745] "Implementation method" refers to a mechanism for inputting generated meeting minutes and sentiment labels into the management platform.

[0746] "Memo information" refers to additional information entered after a business negotiation, and serves a supplementary role.

[0747] A "memo analysis tool" is a mechanism for analyzing memo information, converting it into detailed information, and inputting it into a management platform.

[0748] The following describes the configuration for implementing this system.

[0749] The present invention aims to provide more detailed and valuable information by acquiring and analyzing acoustic information during business negotiations and meetings. First, a terminal acquires acoustic information during business negotiations using a multi-directional microphone. This terminal is equipped with communication means for acquiring acoustic information in real time and transmitting the data to a computing device. Here, acoustic data can be transmitted using the Internet Protocol.

[0750] The server acts as the computing device and analyzes the received acoustic data. This analysis utilizes speech recognition software and sentiment analysis software. Specifically, a commercial speech recognition API converts the acoustic data into text data, and acoustic characteristic analysis software generates sentiment labels from the acoustic data and the text data. Furthermore, a generative AI is used to extract key points from the text data and record them as meeting minutes.

[0751] After a business meeting concludes, users can manually enter memo information using their terminals. This information is further analyzed and centrally recorded as detailed information. This recorded information is then integrated into the management platform and shared among stakeholders.

[0752] As a concrete example, let's consider a sales meeting where a product is being introduced. In this case, the terminal picks up everyone's comments, and the server generates a summary such as, "Two innovative features were introduced as characteristics of the new product." Additionally, by adding sentiment labels, information such as, "The customer showed positive surprise about the new features" is also included. This process makes it possible to grasp both the important content of the sales meeting and the atmosphere of the situation.

[0753] An example of a prompt using a generative AI model would be something like, "Please indicate the themes that elicited the most responses during the business negotiation and the associated sentiment labels."
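A prompt of this kind could be assembled from the transcript and its sentiment labels roughly as follows. This is a minimal sketch: `build_prompt` and the line layout it produces are hypothetical, and the resulting string would be passed to whichever generative AI model is used.

```python
def build_prompt(instruction: str, segments: list) -> str:
    """Combine an instruction with (speaker, text, sentiment_label) tuples."""
    lines = [instruction, "", "Transcript:"]
    for speaker, text, label in segments:
        # One line per utterance, with the sentiment label in parentheses.
        lines.append(f"[{speaker}] ({label}) {text}")
    return "\n".join(lines)
```

For example, calling `build_prompt` with the instruction quoted above and a list such as `[("customer", "That feature is great.", "positive")]` yields the instruction, a blank line, a `Transcript:` heading, and one labeled line per utterance.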

[0754] In this way, this system enhances the information gathering and analysis during business negotiations, thereby maximizing the effectiveness of negotiations and supporting subsequent operations.

[0755] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0756] Step 1:

[0757] The terminal acquires acoustic information in real time during business negotiations using multi-directional microphones. The input is acoustic signals from the surroundings, which are converted into digital acoustic data. Specifically, analog sound is collected from the microphone and converted into a digital signal using an A / D converter. This data is temporarily stored in a buffer within the terminal.
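The digitization and buffering in this step can be sketched in software. The sketch assumes the microphone signal is already available as float samples in the range -1 to 1 and models the A/D converter as 16-bit quantization; `quantize_16bit` and `AudioBuffer` are hypothetical names, and a real terminal would perform the conversion in hardware.

```python
import struct
from collections import deque


def quantize_16bit(samples: list) -> list:
    """Clamp each sample to [-1, 1] and scale it to a signed 16-bit integer."""
    quantized = []
    for sample in samples:
        clamped = max(-1.0, min(1.0, sample))
        quantized.append(int(round(clamped * 32767)))
    return quantized


class AudioBuffer:
    """Bounded buffer of PCM frames; the oldest frame is dropped when full."""

    def __init__(self, max_frames: int) -> None:
        self._frames = deque(maxlen=max_frames)

    def push(self, pcm: list) -> None:
        # Store each frame as little-endian 16-bit PCM bytes.
        self._frames.append(struct.pack(f"<{len(pcm)}h", *pcm))

    def drain(self) -> bytes:
        """Return all buffered bytes in arrival order and empty the buffer."""
        data = b"".join(self._frames)
        self._frames.clear()
        return data
```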

[0758] Step 2:

[0759] The terminal sends the digital audio data stored in the buffer to the server using real-time streaming technology. The input is the audio data in the buffer, and the output is the audio data sent to the server over the network. Specifically, this transmission uses the Internet Protocol: the data is divided into small packets and streamed over a transport protocol such as UDP.
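The packet division described in this step can be sketched as follows. The source gives no packet format, so the 4-byte sequence-number header, the payload size, and the function names are assumptions made for illustration. Because UDP does not guarantee delivery or ordering, the receiver sorts by sequence number; lost packets are not recovered in this sketch.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # hypothetical header: 4-byte big-endian sequence number


def packetize(data: bytes, payload_size: int):
    """Yield packets consisting of a sequence number followed by a payload slice."""
    for seq, offset in enumerate(range(0, len(data), payload_size)):
        yield HEADER.pack(seq) + data[offset:offset + payload_size]


def reassemble(packets: list) -> bytes:
    """Sort received packets by sequence number and join their payloads."""
    ordered = sorted(packets, key=lambda p: HEADER.unpack(p[:HEADER.size])[0])
    return b"".join(p[HEADER.size:] for p in ordered)


def send_udp(data: bytes, address: tuple, payload_size: int = 1024) -> int:
    """Send the data as UDP datagrams to (host, port); return the packet count."""
    count = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for packet in packetize(data, payload_size):
            sock.sendto(packet, address)
            count += 1
    return count
```

Small payloads keep each datagram below typical network MTU limits, which is one reason to divide the buffered audio before sending.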

[0760] Step 3:

[0761] The server analyzes the received audio data. The input is digital audio data sent from the terminal, and the output is text data and emotion information. Specifically, speech recognition software is used to convert the audio data into text data, and emotion analysis software is used to analyze the intonation and speed of the speech. Based on this analysis, an emotion label is generated that quantifies the speaker's emotions.

[0762] Step 4:

[0763] The server uses a generative AI model to summarize key points based on the analyzed text data and generate meeting minutes. The input is text data, and the output is a summary of the meeting minutes. Specifically, the AI model extracts key phrases from the text and logically organizes them. This summary also includes speaker sentiment labels.

[0764] Step 5:

[0765] After the business meeting concludes, the user enters supplementary memo information using a terminal. The input is text information entered by the user, and the output is data converted into detailed memo information. Specifically, the user manually enters text, which is then used with natural language processing technology for subsequent analysis.

[0766] Step 6:

[0767] The server analyzes supplementary information from users and integrates it with existing meeting minutes and sentiment labels. Input is detailed memo information entered by the user, and output is an integrated sales opportunity record. Specifically, the analysis engine stores the memo information in a database, combines it with existing meeting minutes to build a comprehensive sales opportunity database, and automatically deploys it to the management platform.

[0768] (Application Example 2)

[0769] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0770] Modern security services require accurate understanding of surrounding audio information and early detection of suspicious activity and conversations. However, current technology is insufficient for extracting key points from audio and understanding emotional nuances, making it difficult to achieve highly accurate surveillance. There is also a need for efficient systems that perform real-time analysis of audio data.

[0771] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0772] In this invention, the server includes voice acquisition means, voice recognition means, information recording generation means, emotion analysis means, and anomaly detection means. This makes it possible to analyze the flow of voice information and emotions within the environment and to quickly detect suspicious movements and conversations.

[0773] "Sound acquisition means" refers to a device or configuration that effectively collects ambient sounds in an environment and supplies those sounds to a subsequent analysis process.

[0774] "Speech recognition means" refers to a technology or device that analyzes acquired speech information and converts it into corresponding text information.

[0775] "Information record generation means" refers to a process or system for extracting specific points based on textual information and creating appropriate information records.

[0776] "Emotional analysis means" refers to a technology or device for analyzing characteristics such as intonation, speed, and volume in speech, and for inferring and labeling the emotional state of the speaker.

[0777] An "anomaly detection means" is a function or method for detecting suspicious behavior or conversation based on analyzed voice and emotion data.

[0778] A "management mechanism" is a system or device that centrally manages collected and analyzed data and appropriately stores or displays necessary information.

[0779] A "computer" is a computing device or data processing device used to analyze audio data in real time and to perform text conversion and emotion analysis.

[0780] To realize an application example of this invention, first, a security robot system is used to acquire sound from the environment. Specifically, a multi-directional microphone is used as the device. This allows the robot to collect ambient sound data in real time while patrolling.

[0781] The server analyzes the acquired audio data using speech recognition and converts it into text. This process can utilize speech recognition APIs such as Google Cloud Speech-to-Text. The recognized text is then further processed by an information recording generation system to extract specific key points and generate appropriate records.

[0782] Next, the server uses emotion analysis tools, such as an emotion analysis API like IBM Watson Tone Analyzer, to infer emotions from the intonation, speed, and volume characteristics of the voice data. This allows for the detection of potential disturbances in the patrolled area.

[0783] Furthermore, anomaly detection mechanisms are implemented to detect suspicious behavior and conversations based on analyzed voice and emotion data. In this process, the server utilizes AI technology to provide a warning system that enables a rapid response in emergencies.

[0784] As a concrete example, if a robot patrolling a shopping mall detects a conversation expressing dissatisfaction with a particular product and determines that anger is escalating, this information is reported to the management organization in real time, and appropriate action is taken.

[0785] An example of a prompt for a generative AI model would be: "Transcribe the current audio data into text, analyze its intonation, speed, and volume, assign emotion labels, and create a report that detects suspicious movements and conversations."

[0786] In this way, the present invention provides a practical system that contributes to improving safety in the environment while maintaining high accuracy in the collection and analysis of voice data.

[0787] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0788] Step 1:

[0789] The terminal uses multi-directional microphones to acquire ambient sound in real time. The acquired audio data is generated through the audio acquisition mechanism and sent to the server for subsequent processing. The input is ambient sound, and the output is raw audio data.

[0790] Step 2:

[0791] The server converts the received audio data into text data using speech recognition. A speech recognition API such as Google Cloud Speech-to-Text is used here. The input is audio data, and the output is the corresponding text information.

[0792] Step 3:

[0793] The server processes the obtained text data through the information record generation mechanism to generate an information record containing specific key points. During this process, a generative AI model is used for summarization. The input is text data, and the output is a summarized information record.

[0794] Step 4:

[0795] The server processes audio data using sentiment analysis tools to perform sentiment analysis. It uses sentiment analysis APIs such as IBM Watson Tone Analyzer to analyze the intonation, speed, and volume of the audio data. The input is audio data, and the output is sentiment labels and sentiment scores.

[0796] Step 5:

[0797] The server uses the information records and sentiment labels to run the anomaly detection mechanism and detect suspicious behavior or conversations. In particular, detection is triggered when the input information exceeds pre-configured criteria. The inputs are the information records and sentiment labels, and the output is an anomaly detection alarm.
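The threshold-based triggering in this step can be sketched as follows. The source does not define the criteria, so the watched labels, the score threshold, and the list of suspicious terms are hypothetical, as are the names `Observation` and `detect_anomaly`. An empty result means no alarm is raised.

```python
from dataclasses import dataclass

# Hypothetical watch list; real criteria would be configured per deployment.
SUSPICIOUS_TERMS = ("break in", "steal")


@dataclass
class Observation:
    record_text: str      # summarized information record from Step 3
    emotion_label: str    # sentiment label from Step 4
    emotion_score: float  # sentiment score from Step 4, assumed to lie in [0, 1]


def detect_anomaly(
    obs: Observation,
    watched_labels: tuple = ("anger",),
    score_threshold: float = 0.8,
) -> list:
    """Return the reasons for an alarm; an empty list means no anomaly."""
    reasons = []
    # Criterion 1: a watched emotion whose score exceeds the configured threshold.
    if obs.emotion_label in watched_labels and obs.emotion_score > score_threshold:
        reasons.append(
            f"{obs.emotion_label} score {obs.emotion_score:.2f} "
            f"exceeds {score_threshold:.2f}"
        )
    # Criterion 2: the information record mentions a watched term.
    lowered = obs.record_text.lower()
    for term in SUSPICIOUS_TERMS:
        if term in lowered:
            reasons.append(f"suspicious term: {term}")
    return reasons
```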

[0798] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0799] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0800] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0801] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to an emotion map (see Figure 9), which is the specific mapping. Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0802] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. In the upper and lower directions of the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. Also, the upper side of the concentric circles is where "pleasant" emotions are located, and the lower side is where "unpleasant" emotions are located. In this way, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0803] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0804] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further outward an emotion lies on the emotion map 400, the more visible (expressed in actions) it becomes.

[0805] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0806] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0807] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
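One way to obtain training targets in which nearby emotions have similar values is to let the target value decay with distance on the map. The sketch below is an assumption about how such targets could be built, not the method disclosed in the source. The polar coordinates in `EMOTION_POSITIONS` are illustrative placeholders and are not taken from Figure 9 or Figure 10; `soft_targets` and `sigma` are likewise hypothetical.

```python
import math

# Illustrative (radius, angle in degrees) positions; NOT taken from the figures.
EMOTION_POSITIONS = {
    "calm": (1.0, 0.0),
    "reassured": (1.0, 10.0),
    "confident": (1.2, 20.0),
    "anxious": (1.0, -30.0),
    "angry": (2.0, 200.0),
}


def to_xy(radius: float, angle_deg: float) -> tuple:
    """Convert polar map coordinates to Cartesian coordinates."""
    angle = math.radians(angle_deg)
    return (radius * math.cos(angle), radius * math.sin(angle))


def soft_targets(true_emotion: str, positions: dict = EMOTION_POSITIONS,
                 sigma: float = 0.5) -> dict:
    """Target value per emotion: 1.0 at the true emotion, decaying with distance."""
    center = to_xy(*positions[true_emotion])
    targets = {}
    for name, polar in positions.items():
        distance = math.dist(center, to_xy(*polar))
        # Gaussian falloff: emotions close together on the map get similar values.
        targets[name] = math.exp(-(distance ** 2) / (2 * sigma ** 2))
    return targets
```

With these placeholder positions, a sample labeled "calm" yields a high target for "reassured" and a near-zero target for "angry", which mirrors the behavior described for emotions that lie close together on the map.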

[0808] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0809] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0810] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0811] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0812] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0813] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0814] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0815] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0816] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0817] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0818] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0819] The following is further disclosed regarding the embodiments described above.

[0820] (Claim 1)

[0821] A voice input means that acquires audio during a business negotiation and generates audio data,

[0822] A speech recognition means that analyzes the aforementioned audio data and converts it into text data,

[0823] A generation means for extracting key points based on the aforementioned text data and generating meeting minutes,

[0824] An input means for inputting the generated meeting minutes into a management system,

[0825] A memo analysis means that analyzes memo data entered after a business negotiation, converts it into detailed data, and inputs it into the management system,

[0826] A system that includes this.
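
The means listed in this claim can be read as a simple processing pipeline. The Python sketch below is one hypothetical arrangement, not the implementation of this disclosure: the class names, the in-memory `ManagementSystem`, and the three `toy_*` functions (which treat the audio data as UTF-8 text, select sentences by keyword, and parse `key: value` memos) are assumptions introduced only for illustration. The voice input means is assumed to have already produced `audio_data`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ManagementSystem:
    """Hypothetical in-memory stand-in for the management system."""
    records: List[Dict[str, object]] = field(default_factory=list)

    def add(self, kind: str, payload: object) -> None:
        self.records.append({"kind": kind, "payload": payload})


@dataclass
class NegotiationPipeline:
    recognize: Callable[[bytes], str]              # speech recognition means
    make_minutes: Callable[[str], List[str]]       # generation means (key points)
    analyze_memo: Callable[[str], Dict[str, str]]  # memo analysis means
    system: ManagementSystem                       # target of the input means

    def on_audio(self, audio_data: bytes) -> List[str]:
        """Audio data -> text -> minutes -> management system."""
        text = self.recognize(audio_data)
        minutes = self.make_minutes(text)
        self.system.add("minutes", minutes)
        return minutes

    def on_memo(self, memo: str) -> Dict[str, str]:
        """Memo entered after the negotiation -> detailed data -> management system."""
        details = self.analyze_memo(memo)
        self.system.add("memo_details", details)
        return details


def toy_recognize(audio_data: bytes) -> str:
    # Assumption: pretend the "audio" is already UTF-8 text.
    return audio_data.decode("utf-8")


def toy_minutes(text: str) -> List[str]:
    # Assumption: a sentence is a key point if it contains one of these keywords.
    keywords = ("price", "deadline", "contract")
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]


def toy_memo(memo: str) -> Dict[str, str]:
    # Assumption: memos look like "key: value; key: value".
    details: Dict[str, str] = {}
    for part in memo.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            details[key.strip()] = value.strip()
    return details
```

In practice, the `toy_*` callables would be replaced by a speech recognizer, a generative AI model, and a memo analyzer; the pipeline class only fixes the order in which the means are applied.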

[0827] (Claim 2)

[0828] The system according to claim 1, further comprising an analysis means for analyzing the content of user-input notes after a business negotiation using natural language processing.

[0829] (Claim 3)

[0830] The system according to claim 1, comprising the steps of transmitting audio data acquired during a business negotiation to a server in real time, and analyzing the audio data on the server to generate text data.
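
The real-time transmission in this claim can be pictured as sending the audio in small chunks as it is captured, with the server accumulating the chunks and producing text. The following Python sketch only simulates this in memory: `stream_chunks`, `ToyServer`, the chunk size, and the treatment of audio as UTF-8 bytes are all assumptions for illustration, and no network protocol is implied by this disclosure.

```python
from typing import Iterator


def stream_chunks(audio_data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield the audio data in fixed-size chunks, as a capture device might."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio_data), chunk_size):
        yield audio_data[start:start + chunk_size]


class ToyServer:
    """Hypothetical server side: buffers chunks and 'recognizes' them as UTF-8 text."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def receive(self, chunk: bytes) -> None:
        # Chunks are appended in arrival order.
        self._buffer.extend(chunk)

    def text(self) -> str:
        # Stand-in for speech recognition; incomplete byte sequences are skipped.
        return self._buffer.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    server = ToyServer()
    for chunk in stream_chunks(b"hello world", chunk_size=4):
        server.receive(chunk)
    print(server.text())
```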

[0831] "Example 1"

[0832] (Claim 1)

[0833] A voice acquisition means that acquires audio during a business negotiation and generates audio information,

[0834] A speech conversion means that analyzes the aforementioned speech information and converts it into text information,

[0835] Information generation means for extracting important information based on the aforementioned textual information and generating summary information,

[0836] Information input means for inputting the generated summary information into a management system,

[0837] Information analysis means that analyzes supplementary information entered after a business negotiation, converts it into detailed information, and inputs it into the management system,

[0838] A system that includes this.

[0839] (Claim 2)

[0840] The system according to claim 1, further comprising an analysis means for analyzing the content of supplementary information entered by the user after a business negotiation using language processing.

[0841] (Claim 3)

[0842] The system according to claim 1, comprising the steps of immediately transmitting voice information acquired during a business negotiation to a data processing device, and analyzing the voice information in the data processing device to generate text information.

[0843] "Application Example 1"

[0844] (Claim 1)

[0845] An audio acquisition means that acquires audio during a business negotiation and generates audio data,

[0846] Language recognition means that analyzes the aforementioned audio data and converts it into text data,

[0847] A generation mechanism that extracts key points based on the aforementioned text data and generates meeting minutes,

[0848] A document input means for inputting the generated meeting minutes into an information processing system,

[0849] A descriptive analysis means that analyzes memo data entered after a business negotiation, converts it into detailed information, and records it in the information processing system,

[0850] A presentation means that processes data during a business negotiation in real time and displays the results on a visual device that possesses it,

[0851] A system that includes this.

[0852] (Claim 2)

[0853] The system according to claim 1, further comprising an analysis means for analyzing the content of user-generated notes after a business negotiation using natural language processing.

[0854] (Claim 3)

[0855] The system according to claim 1, which includes a process of immediately transmitting audio data acquired during a business negotiation to a remote information processing device, and analyzing the audio data at the remote information processing device to generate text information.

[0856] "Example 2 of combining an emotion engine"

[0857] (Claim 1)

[0858] Acquisition means for acquiring acoustic information and generating acoustic data,

[0859] A conversion means for analyzing the aforementioned acoustic data and converting it into text data,

[0860] An emotion analysis means that analyzes emotion information based on the aforementioned text data and sound data and generates emotion labels,

[0861] A generation means for extracting important items based on the aforementioned text data and generating meeting minutes,

[0862] An implementation means for inputting the generated meeting minutes and sentiment labels into a management platform,

[0863] A memo analysis means that analyzes memo information entered after a business negotiation, converts it into detailed information, and inputs it into the management platform,

[0864] A system that includes this.

[0865] (Claim 2)

[0866] The system according to claim 1, further comprising an analysis function that analyzes the content of user-entered memo information after a business negotiation using automated language processing.

[0867] (Claim 3)

[0868] The system according to claim 1, comprising the steps of transmitting acoustic data acquired during a business negotiation to a computing device in real time, analyzing the acoustic data in the computing device, and generating character data.

[0869] "Application example 2 when combining with an emotional engine"

[0870] (Claim 1)

[0871] A means for acquiring sounds from the environment and generating audio information,

[0872] A speech recognition means that analyzes the aforementioned speech information and converts it into text information,

[0873] A generation means for extracting specific points based on the aforementioned textual information and generating an information record,

[0874] An input means for inputting the generated information record into the management mechanism,

[0875] Information analysis means that analyzes the information acquired after the environment, converts it into detailed information, and inputs it into the management mechanism,

[0876] An emotion analysis means that analyzes emotional information within the environment and assigns emotion labels,

[0877] An anomaly detection means for detecting anomalies,

[0878] A system that includes this.

[0879] (Claim 2)

[0880] The system according to claim 1, further comprising an analysis means for analyzing the content of user input information after the environment has been set up using natural language processing.

[0881] (Claim 3)

[0882] The system according to claim 1, comprising the steps of transmitting audio information acquired in the environment to a computer in real time, analyzing the audio information on the computer, and generating text information.

[Explanation of Symbols]

[0883]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. An audio acquisition means that acquires audio during a business negotiation and generates audio data,
Language recognition means that analyzes the aforementioned audio data and converts it into text data,
A generation mechanism that extracts key points based on the aforementioned text data and generates meeting minutes,
A document input means for inputting the generated meeting minutes into an information processing system,
A descriptive analysis means that analyzes memo data entered after a business negotiation, converts it into detailed information, and records it in the information processing system,
A presentation means that processes data during a business negotiation in real time and displays the results on a visual device that possesses it,
A system that includes this.

2. The system according to claim 1, further comprising an analysis means for analyzing the content of user-generated notes after a business negotiation using natural language processing.

3. The system according to claim 1, which includes a process of immediately transmitting audio data acquired during a business negotiation to a remote information processing device, and analyzing the audio data at the remote information processing device to generate text information.

Citation Information

Patent Citations

Persona chatbot control method and system
JP2022180282A
