system

The system addresses language barriers in conferences by converting voice data to text, translating, and generating real-time meeting minutes, enhancing communication and accuracy in multilingual settings.

JP2026100739A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19


IPC: G06F3/16; H04N21/4408; G10L15/00; H04N21/222; H04N21/214; H04N21/442; H04N21/482; H04N21/238; H04N21/6408; G06Q10/00; H04N7/173; H04N21/8405; H04N7/14; G10L15/10; H04N21/4415; H04N21/443; H04N21/4227; H04N21/233; H04N21/6334; H04N21/4425; H04N21/422; H04N21/431; H04N21/2368; G06F40/58; G10L15/30; H04N21/4147; H04N21/436; H04N21/63; H04N21/6379; H04N21/4223; G06F40/40; H04N21/633; H04N21/6402; H04N7/10; G06Q10/10; H04N21/4786; G10L15/18; H04N21/6377; H04N7/15

AI Tagging

Application Domain

Natural language translation Television conference systems




AI Technical Summary

Technical Problem

In conventional conference systems, participants speaking different languages face challenges with accurate real-time communication, and creating and reviewing meeting minutes is time-consuming and inaccurate.

Method used

A system that acquires voice data, converts it into text using voice recognition technology, translates it into multiple languages, generates meeting minutes in real-time, and distributes them to participants.

Benefits of technology

Facilitates seamless communication and accurate recording of conference content by overcoming language barriers and ensuring timely and precise meeting minute generation.

✦ Generated by Eureka AI based on patent content.


Abstract

A system is provided. [Solution] The system includes: a voice acquisition means for acquiring voice data; a speech recognition means for converting the acquired voice data into text; a translation means for translating the text into another language; a meeting minutes generation means for generating meeting minutes using the translated text; and an information distribution means for outputting the translated text and the generated meeting minutes.


Description

Technical Field


[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance as a response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In a conventional conference system, when there are participants who speak different languages, accurate real-time communication is difficult. Also, it takes time to create a record of the conference content, and the record may lack accuracy when reviewed later. Therefore, there is a need for a system that can overcome the language barrier and make it easy to generate and share meeting minutes.

Means for Solving the Problems

[0005] The present invention acquires voice data, converts it into text using voice recognition technology, and further enables translation into a plurality of specified languages using translation technology. Based on the obtained translated text, it automatically generates meeting minutes and outputs them in real time, thereby facilitating communication between different languages and providing means for accurately recording the content of a conference.

[0006] "Voice acquisition means" refers to a device or process for capturing the speech of meeting participants in real time.

[0007] "Speech recognition means" refers to a technology or process that converts acquired speech data into text.

[0008] "Translation means" refers to a technology or process that converts speech-recognized text into another specified language.

[0009] "Meeting minutes generation means" refers to a process that automatically generates structured documents recording the content of a meeting based on translated text.

[0010] "Information distribution means" refers to a device or process for providing the generated translated text and meeting minutes to the user.

[0011] "Data splitting and transmission means" refers to a process that divides audio data into small chunks, enabling efficient transmission to servers and other devices.

[0012] "Data storage means" refers to a device or process that records and securely stores generated meeting minutes or translated texts so that they can be referenced later.

Brief Description of the Drawings

[0013]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Mode for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0017] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0018] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] The present invention is a system that processes meeting audio data in real time, translates it into multiple languages, generates meeting minutes, and outputs them. This system consists of several main components.

[0035] Voice acquisition means

[0036] The terminal is equipped with a microphone to capture participants' speech in real time. This makes it possible to capture all audio information during the meeting.

[0037] Speech recognition means

[0038] The server receives audio data sent from the terminal and converts it to text using a speech recognition engine. This conversion is performed in real time, and machine learning algorithms are used to improve the accuracy of the speech-to-text conversion.

[0039] Translation means

[0040] The server translates the transcribed text into the specified language. It uses machine translation technology to translate quickly into English, Chinese, and other languages.

[0041] Meeting minutes generation means

[0042] The server organizes the translated text and generates meeting minutes that reflect the content of the meeting. This process organizes the text order and compiles it into minutes, highlighting important information.

[0043] Information distribution means

[0044] The server distributes the translated text and meeting minutes it generates to the users. Users can access the translated content and meeting minutes in real time via their devices.

[0045] As a concrete example, in a global conference, if a participant speaks in Japanese, their device captures the audio, and the server converts it into Japanese text through speech recognition. Then, it translates it into a specified language, such as English and Chinese, and the server generates meeting minutes in a format appropriate to the conference progress. These translations and minutes are delivered to the user's device, allowing all participants to share information in real time, regardless of language.

[0046] This system supports meetings in multilingual environments, providing efficient communication and accurate record-keeping.
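The component chain described above (speech recognition, translation into several languages, minutes generation) can be sketched as follows. This is a minimal illustration, not the implementation of this disclosure: the source names no concrete engines at this point, so the recognizer and translator are injected as plain callables, and the names `Utterance`, `process_audio`, and `build_minutes` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins: any speech recognition or machine translation
# engine could be wrapped to match these call shapes.
Recognizer = Callable[[bytes], str]      # audio -> text
Translator = Callable[[str, str], str]   # (text, target_language) -> text


@dataclass
class Utterance:
    source_text: str
    translations: Dict[str, str] = field(default_factory=dict)


def process_audio(audio: bytes, recognize: Recognizer, translate: Translator,
                  targets: List[str]) -> Utterance:
    """Run speech recognition, then translate into each target language."""
    text = recognize(audio)
    return Utterance(text, {lang: translate(text, lang) for lang in targets})


def build_minutes(utterances: List[Utterance], lang: str) -> str:
    """Compile the translated utterances, in order, into numbered minutes."""
    lines = [f"{i}. {u.translations[lang]}" for i, u in enumerate(utterances, 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy engines for demonstration only.
    def fake_recognize(audio: bytes) -> str:
        return audio.decode("utf-8")

    def fake_translate(text: str, lang: str) -> str:
        return f"[{lang}] {text}"

    u = process_audio("こんにちは".encode("utf-8"), fake_recognize,
                      fake_translate, ["en", "zh"])
    print(build_minutes([u], "en"))
```

Injecting the engines as callables keeps the sketch independent of any particular vendor and makes each stage testable in isolation.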

[0047] The following describes the processing flow.

[0048] Step 1:

[0049] The user initiates the meeting. This puts the device into voice acquisition mode and activates the microphone.

[0050] Step 2:

[0051] The device captures the participants' speech and simultaneously divides the audio data into chunks of a certain length.

[0052] Step 3:

[0053] The terminal sends the segmented audio data to the server. This transmission is optimized to minimize network latency.

[0054] Step 4:

[0055] The server processes the received audio data using a speech recognition engine and converts it into text data. Machine learning algorithms are used to improve recognition accuracy.

[0056] Step 5:

[0057] The server inputs the converted text into a machine translation system, which then translates it into the specified language. The translation results are provided in real time.

[0058] Step 6:

[0059] The server organizes the translated text chronologically and generates meeting minutes based on the content of the meeting.

[0060] Step 7:

[0061] The server sends the generated translated text and meeting minutes to the user's device. The user can then view this information in real time on their device.

[0062] Step 8:

[0063] After the meeting ends, the user instructs the server to stop audio acquisition and processing. The server and terminal then save the final audio data and meeting minutes.
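Step 2 of the flow above divides captured audio into chunks of a certain length before transmission. A minimal sketch of such chunking is shown below, assuming raw mono PCM input; the default sample rate, sample width, and chunk duration are illustrative assumptions, and `split_into_chunks` is a hypothetical name not found in the source.

```python
from typing import Iterator


def split_into_chunks(pcm: bytes, sample_rate: int = 16000,
                      sample_width: int = 2, chunk_ms: int = 500) -> Iterator[bytes]:
    """Yield consecutive fixed-duration chunks of mono PCM audio.

    The final chunk may be shorter than the others.
    """
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    # Keep chunk boundaries aligned to whole samples.
    bytes_per_chunk -= bytes_per_chunk % sample_width
    if bytes_per_chunk <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]
```

With the defaults, each chunk holds 0.5 seconds of 16-bit audio (16,000 bytes). Smaller chunks reduce the delay before recognition can start, at the cost of more transmissions.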

[0064] (Example 1)

[0065] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0066] In meetings and discussions, there is a need for systems that can efficiently and accurately share information in a multilingual environment. However, current technology makes it difficult to translate audio into multiple languages in real time and share it immediately with all participants. In particular, it is important to process audio data quickly while maintaining accuracy and minimizing mistranslations. Furthermore, accurately distributing and saving the generated meeting minutes is essential in today's business environment.

[0067] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0068] In this invention, the server includes an input means for acquiring sound information, a conversion means for converting the acquired sound information into text data, and a conversion means for converting the text data into different languages. This enables real-time multilingual translation and efficient and accurate information sharing.

[0069] "Sound information" refers to the voice data spoken by participants during meetings and discussions, which is processed as a digital signal.

[0070] "Input means" refers to devices or systems used to capture sound information in real time, and typically includes hardware such as microphones.

[0071] "Conversion means" refers to the process or technology for converting sound information into text data, and specifically includes speech recognition engines and related software.

[0072] "Text data" refers to text-formatted data generated from sound information, which is stored as a digital string that can be processed by a computer.

[0073] "Different languages" refers to the languages to which the text data is converted, meaning that it is translated into a language different from the original sound.

[0074] "Generation means" refers to processes and technologies for creating information records using converted text data, and includes natural language processing technologies.

[0075] "Transmission means" refers to the technologies and methods for transmitting generated converted text data and information records to other terminals or systems, and includes data compression and optimization.

[0076] This invention is a system that enables real-time multilingual translation and information sharing in meetings and discussions. The system aims to efficiently acquire audio information, process it quickly, and deliver it to participants immediately.

[0077] Acquisition of sound information

[0078] The terminal uses a high-performance microphone to capture participants' speech in real time. This audio information is converted into a digital signal, which is then processed to remove background noise. The device may utilize a microphone equipped with noise-canceling technology, for example.

[0079] Data transformation and translation

[0080] The server receives audio information transmitted from the terminal and converts it into text data in real time using a speech recognition engine (for example, a speech conversion engine using the latest machine learning algorithms). This text data is then translated into a specified language using neural machine translation technology. This process includes, for example, translation from English to Chinese.

[0081] Generation and distribution of converted data

[0082] The server uses the translated text data to generate meeting minutes. These minutes are organized logically, with key points highlighted. The server then delivers this information to users in real time. Users can access and view this information via their devices.
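One simple way to picture minutes with key points highlighted is keyword matching over the translated statements. The sketch below is an assumption for illustration only: the source does not say how key points are selected, and the keyword list, the `*...*` markers, and the `!` / `-` line prefixes are all hypothetical choices.

```python
import re
from typing import Iterable, List


def highlight_keywords(text: str, keywords: Iterable[str]) -> str:
    """Wrap each whole-word keyword occurrence in asterisks (case-insensitive)."""
    kws = sorted(set(keywords), key=len, reverse=True)
    if not kws:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in kws) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"*{m.group(0)}*", text)


def generate_minutes(statements: List[str], keywords: List[str]) -> str:
    """List statements in order; prefix those containing a keyword with '!'."""
    lines = []
    for statement in statements:
        marked = highlight_keywords(statement, keywords)
        prefix = "! " if marked != statement else "- "
        lines.append(prefix + marked)
    return "\n".join(lines)
```

Whole-word matching suits space-delimited languages such as English. Text in languages without word spacing, such as Chinese or Japanese, would need a different matching strategy.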

[0083] Specific examples and prompt statements

[0084] As a concrete example, in a global meeting of a multinational corporation, if a participant speaks in Japanese, the terminal captures the audio data, which is then converted into text data on a server. Next, it is translated into English and Chinese in real time, and the resulting meeting minutes are immediately delivered to the user. An example of an input prompt for the generation AI model of this system would be an instruction such as, "Translate the Japanese meeting audio into English and Chinese in real time and generate meeting minutes."

[0085] This system allows participants who speak different languages to collaborate and share information instantly without disrupting the flow of the meeting.

[0086] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0087] Step 1:

[0088] Acquisition of sound information

[0089] The terminal acquires audio information spoken by meeting participants in real time via a microphone. This input audio information is then sent to the server after background noise is removed using digital signal processing technology. Specifically, audio filtering is performed within the terminal, and a noise reduction algorithm is applied.

[0090] Step 2:

[0091] Audio data conversion

[0092] The server converts the denoised audio information received from the terminal into text data using a speech recognition engine. During this process, the server decomposes the audio information into segments, performs analysis using an acoustic model, and then converts it into text. The resulting output is text data that reflects the content of the meeting.

[0093] Step 3:

[0094] Translation of text data

[0095] The server translates text data generated by speech recognition into different languages. This process uses neural machine translation technology, inputting text data into a translation model to convert it into the specified language. The output is translated text data compatible with multiple languages. Specifically, the server refers to English and Chinese translation dictionaries and performs appropriate translation rewriting.

[0096] Step 4:

[0097] Minutes generation

[0098] The server generates meeting minutes based on the translated text data. During this generation process, the server organizes the text logically, extracts important information, and highlights it. The resulting meeting minutes are formatted for easy viewing. Specific actions include extracting key points and highlighting keywords.

[0099] Step 5:

[0100] Information distribution

[0101] The server distributes the generated meeting minutes and translated text data to all participants' devices. Users receive this in real time via their devices and can refer to the meeting content in multiple languages. Data compression technology is applied during distribution, and transmission is performed in the most optimal way depending on the network bandwidth. Specifically, the server generates data packets and transmits them while avoiding network congestion.
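Step 5 mentions data compression and packet generation during distribution. The following sketch shows one possible shape for that, using the standard `zlib` module and sequence-numbered packets. The packet layout, the payload size, and the function names are assumptions, and the sketch does not attempt the congestion avoidance or bandwidth adaptation mentioned in the text.

```python
import zlib
from typing import List, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


def to_packets(text: str, max_payload: int = 1024) -> List[Packet]:
    """Compress the text and split it into sequence-numbered packets."""
    data = zlib.compress(text.encode("utf-8"))
    return [(seq, data[i:i + max_payload])
            for seq, i in enumerate(range(0, len(data), max_payload))]


def from_packets(packets: List[Packet]) -> str:
    """Reorder packets by sequence number, then decompress and decode."""
    ordered = sorted(packets, key=lambda p: p[0])
    data = b"".join(payload for _, payload in ordered)
    return zlib.decompress(data).decode("utf-8")
```

The sequence numbers let the receiving terminal reassemble the minutes even if packets arrive out of order.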

[0102] (Application Example 1)

[0103] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0104] In multilingual households and groups, facilitating smooth communication among members who speak different languages is often challenging. Furthermore, there is a need for efficient systems for real-time translation and meeting minute creation.

[0105] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0106] In this invention, the server includes: a voice data acquisition means for acquiring audio information; a speech recognition means for converting the acquired audio information into text data; a translation means for translating the text data into another language; a document generation means for creating a record document using the translated text data; an information distribution means for outputting the generated translated text data and record document; and a multilingual communication support means for supporting communication between users who speak different languages by analyzing and translating audio in multiple languages. This enables immediate and accurate communication between members who speak different languages.

[0107] "Audio information" refers to acquired audio data that is used for speech recognition and translation.

[0108] "Voice data acquisition means" refers to technical means used to acquire audio information, and includes devices such as microphones and sensors.

[0109] "Speech recognition means" refers to technology that converts acquired audio information into text data, and includes the process of analyzing a large amount of speech data to represent it as text.

[0110] "Translation means" refers to technology used to convert text data obtained through speech recognition into other required languages.

[0111] "Document generation means" refers to a function for creating a record document based on translated text data.

[0112] "Information distribution means" refers to technology that outputs generated translated text data and record documents to terminals or other devices.

[0113] "Multilingual communication support means" refers to technology that analyzes and translates the speech of users who speak different languages to facilitate smooth communication.

[0114] This invention is a multilingual translation system aimed at facilitating communication between users who speak different languages. This system is designed to be installed in household assistant robots and other devices, and to function effectively in daily life.

[0115] The server utilizes the robot's built-in microphone to acquire voice information. The voice data acquisition mechanism captures voice information in real time. The acquired voice information is converted into text data using the Google® Cloud Speech-to-Text API running on the server. This speech recognition mechanism enables highly accurate text conversion.

[0116] Next, the text data is translated into the user's specified language using a translation method that utilizes the Microsoft® Azure® Translator API. This process performs the necessary language conversion for communication between users with different native languages.

[0117] Furthermore, the server uses document generation means to create a record document based on the translated text data. This record document is used to accurately record important conversations and events, especially within the home, and to facilitate later review. The generated record document and translated text data are delivered to the user's terminal or robot display using information distribution means.

[0118] The multilingual communication support means facilitates instant and accurate dialogue between users who speak different languages. For example, it enables smooth conversations between hosts and guests at home or at international events.

[0119] A concrete example is a dinner event at home. For instance, a robot could translate speech and provide real-time support so that an English-speaking guest and a Japanese-speaking host can understand each other's languages.

[0120] An example of a prompt sentence provided to the generative AI model would be: "To streamline multilingual communication at home, please use the robot to translate the following English conversation into Japanese."

[0121] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0122] Step 1:

[0123] The terminal acquires audio information. Using the robot's built-in microphone, it captures ambient sound in real time. The acquired audio information is sent to the server as a digital signal.

[0124] Step 2:

[0125] The server performs speech recognition processing. The input digital speech signal is converted into text data using the Google Cloud Speech-to-Text API. The features of the speech signal are analyzed and converted into text format based on the language grammar and context. The output of this step is text data.

[0126] Step 3:

[0127] The server performs the translation process. Using the resulting text data as input, it utilizes the Microsoft Azure Translator API to translate it into the user-specified target language. This process is designed to translate the text data contextually, resulting in more natural language expression. The output obtained in this step is the translated text data.

[0128] Step 4:

[0129] The server generates the document. Translated text data is used as input, formatted into a standard document type, and processed to create a conversation record. The document generation process ensures that important information is clearly organized.

[0130] Step 5:

[0131] The server distributes information. It delivers the generated translated text data and record document to the terminal. Here, the information is presented visually or audibly on the user's robot or display, allowing the user to confirm it. Simultaneously, the multilingual communication support means is used to facilitate real-time interaction and support communication between users.
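The multilingual communication support in this application example can be pictured as routing each utterance to every other participant in that participant's own language. The sketch below is a hedged illustration: the translator is an injected callable with an assumed `(text, source, target)` signature rather than a call to any real cloud translation API, and `route_utterance` is a hypothetical name.

```python
from typing import Callable, Dict

# Assumed call shape: (text, source_language, target_language) -> text
Translator = Callable[[str, str, str], str]


def route_utterance(text: str, speaker: str, languages: Dict[str, str],
                    translate: Translator) -> Dict[str, str]:
    """Return, for each listener, the utterance in that listener's language."""
    source = languages[speaker]
    delivered = {}
    for listener, lang in languages.items():
        if listener == speaker:
            continue
        # Listeners who share the speaker's language get the original text.
        delivered[listener] = text if lang == source else translate(text, source, lang)
    return delivered
```

For the dinner scenario above, `languages` might map a host to `"ja"` and a guest to `"en"`, so that each person's speech reaches the other in translation.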

[0132] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0133] This invention relates to a system that processes meeting audio data in real time, uses an emotion recognition engine to identify emotions from user statements, translates the statements into multiple languages, and generates and outputs meeting minutes. This system consists of a voice acquisition means, a speech recognition means, a translation means, an emotion recognition means, a meeting minutes generation means, and an information distribution means.

[0134] Voice acquisition means

[0135] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants. This allows for real-time acquisition of audio data, providing the information necessary for subsequent processing.

[0136] Speech recognition means

[0137] The server converts the audio data transferred from the terminal into text using a speech recognition engine. This process utilizes noise cancellation technology and high-precision speech models to maximize the accuracy of the text conversion.

[0138] Translation means

[0139] The server translates the text obtained through speech recognition into a specific language. It utilizes natural language processing techniques to improve translation accuracy between languages with different cultural backgrounds.

[0140] Emotion recognition means

[0141] The server analyzes the tone, speed, and text content of the voice data to determine the user's emotions in real time. It identifies emotions such as joy, anger, sadness, and happiness, and records that information.

[0142] Meeting minutes generation means

[0143] The server automatically generates meeting minutes that reflect the flow of the meeting, based on the translated text and the information obtained from emotion recognition. The minutes include not only the content of each statement but also the emotional fluctuations associated with that statement.

[0144] Information distribution means

[0145] The server delivers the generated meeting minutes and sentiment information to the user's terminal. Users can view the meeting minutes, which are translated in real time and reflect sentiment.

[0146] As a concrete example, in a meeting of a multinational corporation, participants speak English, Chinese, and Japanese. Using this system, all statements made during the meeting are translated in real time, and the speaker's emotions are reflected in the minutes. This information improves the quality of the meeting, transcending language and cultural barriers.

[0147] This system eliminates language barriers in international conferences, facilitates smooth communication, and enables information recording that also takes into account the emotions of the participants.

[0148] The following describes the processing flow.

[0149] Step 1:

[0150] The system becomes active when the user initiates the meeting. This puts the device into audio capture mode, and it begins recording participants' speech in real time.

[0151] Step 2:

[0152] The terminal divides the audio data into time segments so that it can be processed efficiently, and then begins sending the segments to the server.

[0153] Step 3:

[0154] The server receives the audio data, feeds it into the speech recognition engine, and converts it into text data. A high-precision model is used so that even complex speech is transcribed accurately.

[0155] Step 4:

[0156] The server passes the converted text to the translation system, which translates it into the specified target language. The translation results are generated quickly and support multilingual environments.

[0157] Step 5:

[0158] The server analyzes voice features from the voice data and identifies the user's emotions through an emotion recognition engine. The identified emotions are categorized into several major emotional categories, such as joy, anger, sadness, and happiness.

[0159] Step 6:

[0160] The server integrates the translated text and sentiment information to generate meeting minutes. These minutes include the content of each statement, along with the emotional state at the time of the statement.

[0161] Step 7:

[0162] The server sends the generated meeting minutes to the user's terminal in real time. Users can conduct the meeting while checking the translated content of statements and sentiment information obtained during the meeting on their terminal.

[0163] Step 8:

[0164] After the meeting ends, the user instructs the system to shut down. The server saves all meeting minutes and sentiment data and archives them for later reference or analysis.
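
Steps 6 and 8 of the flow above can be illustrated with a small data structure. The sketch below is an assumption-laden example rather than the format used by the system: the `MinutesEntry` fields, the one-line-per-statement layout, and the JSON archive format are all chosen here for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class MinutesEntry:
    """One statement in the minutes (field names are assumptions)."""
    speaker: str
    translated_text: str
    emotion: str


def render_minutes(entries: List[MinutesEntry]) -> str:
    """Step 6: combine each translated statement with its emotion label."""
    return "\n".join(f"{e.speaker} [{e.emotion}]: {e.translated_text}" for e in entries)


def archive_minutes(entries: List[MinutesEntry]) -> str:
    """Step 8: serialize the minutes and emotion data for later reference."""
    return json.dumps([asdict(e) for e in entries], ensure_ascii=False, indent=2)


entries = [
    MinutesEntry("Participant A", "Let's move the launch to May.", "joy"),
    MinutesEntry("Participant B", "The budget is not approved yet.", "anger"),
]
minutes_text = render_minutes(entries)
archived = archive_minutes(entries)
print(minutes_text)
```

Keeping the emotion label as a separate field, instead of embedding it in the text, lets the archived data be filtered or analyzed by emotion later.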

[0165] (Example 2)

[0166] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0167] In modern international conferences and multilingual settings, smooth communication is often hindered by participants having different languages and cultural backgrounds. Furthermore, while there is a need for information sharing that takes into account the emotions of participants during meetings, there is currently a lack of effective methods to achieve this.

[0168] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0169] In this invention, the server includes a device for acquiring audio data, a device for converting the acquired audio data into text, a device for translating the text into different languages, a device for analyzing the characteristics of the audio data and identifying emotions, a device for creating a document using the translated text and identified emotion data, and a device for outputting the generated translated text and document. This makes it possible to translate audio during a meeting into multiple languages in real time and to generate and share meeting minutes that take emotions into account.

[0170] A "device for acquiring audio data" is a device used to capture participants' speech in real time during meetings or discussions, and it collects audio information using high-sensitivity microphones or similar equipment.

[0171] A "device for converting acquired audio data into text" is a device that analyzes collected audio data and converts it into corresponding text data using a language model.

[0172] A "device for translating into different languages" is a device that converts text data into a specified different language, and it provides high translation accuracy by applying natural language processing technology.

[0173] A "device that analyzes the characteristics of voice data to identify emotions" is a device that analyzes the tone and speed of acquired voice data, the content of text, etc., to determine the emotional state of the speaker.

[0174] A "device for creating documents using translated text and identified sentiment data" is a device for automatically generating documented information records based on translated text data and sentiment data.

[0175] A "device for outputting generated translated text and documents" is a device for distributing generated text and documents in real time and providing them in a state where relevant parties can access them.

[0176] This system enables smooth communication among participants in international conferences and multilingual environments by recognizing, translating, and analyzing speech in real time. The system includes the following devices and means:

[0177] The terminal uses a high-sensitivity microphone to capture participants' speech in real time during meetings and discussions. This audio data is then passed on to the next processing stage.

[0178] The server analyzes the audio data transmitted from the terminal using a speech recognition engine and converts it into text. This process ensures high accuracy by employing noise cancellation technology and advanced speech recognition models.

[0179] The server translates the acquired text data into different languages through natural language processing. This provides appropriate translations for participants with different cultural and linguistic backgrounds.

[0180] The server further identifies the speaker's emotional state based on characteristics of the audio data, such as tone and speed, and the content of the text. Major emotions such as joy, anger, sadness, and happiness are determined in real time during this process.

[0181] Based on the translated text and sentiment data, the server automatically creates meeting minutes. These minutes reflect not only the content of the utterances but also the emotional characteristics of the situation.

[0182] The server delivers the generated meeting minutes and translated text to the user's terminal in real time. Through this, the user can instantly grasp the content and emotional elements of the meeting.

[0183] As a concrete example, consider a meeting at a multinational corporation. Participants use English, Chinese, and Japanese, and each statement is recognized and translated in real time by the system. The emotional elements of each utterance are also detected and reflected in the meeting minutes, ensuring that all parties involved share the same understanding and facilitating efficient communication.

[0184] Another example of a prompt for the generative AI model is an instruction such as, "Analyze the meeting audio data in real time and create meeting minutes that reflect the sentiment of the statements along with the translation." This prompt causes the system to process speech recognition, translation, and sentiment analysis in an integrated manner.

[0185] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0186] Step 1:

[0187] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants in real time. The input is the participants' voices. This audio data is digitized to create initial data for the next processing step.

[0188] Step 2:

[0189] The server receives digital audio data transmitted from the terminal and converts this data into text using a speech recognition engine. It receives digital audio data as input and applies noise cancellation technology and a speech recognition model to output the spoken content as text data.

[0190] Step 3:

[0191] The server receives text data obtained through speech recognition and translates it into a specific language using natural language processing techniques. The input for this step is the recognized text, and it outputs a highly accurate translation as text data, taking into account cultural nuances between languages.

[0192] Step 4:

[0193] The server references the initial audio data and identifies the speaker's emotions in real time based on the tone and speed of speech, as well as the content of the translated text. The input includes the initial audio data and translated text, and emotion recognition technology is used to output emotion information corresponding to each utterance.

[0194] Step 5:

[0195] The server uses the translated text data and the identified sentiment information to create meeting minutes that reflect the progress of the meeting. The input is the translated text and the sentiment information, and the output is document data that records each statement together with the accompanying changes in sentiment.

[0196] Step 6:

[0197] The server delivers the generated meeting minutes and sentiment information to each user's terminal. The input for this step is the document data created in the previous step, and the server sends this data so that each user can receive, in real time, meeting minutes that reflect the translations and sentiments.
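
Step 4 of the flow above can be illustrated with a deliberately simple stand-in. The sketch below derives one speed feature (words per second) and combines it with keyword matching on the text. The thresholds, keyword lists, emotion labels, and the function names `speech_rate` and `estimate_emotion` are all assumptions made for illustration. It is a toy rule-based substitute, not the emotion recognition technology referred to in this publication, and it ignores tone entirely.

```python
from typing import Dict, Set

# Illustrative keyword lists; a real system would use a trained model.
EMOTION_KEYWORDS: Dict[str, Set[str]] = {
    "joy": {"great", "glad", "excellent"},
    "anger": {"unacceptable", "angry"},
    "sadness": {"sorry", "unfortunately"},
}


def speech_rate(text: str, duration_seconds: float) -> float:
    """Return words per second, a rough proxy for speaking speed."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(text.split()) / duration_seconds


def estimate_emotion(text: str, duration_seconds: float) -> str:
    """Pick an emotion label from keywords, using speed only as a fallback."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    for emotion, keywords in EMOTION_KEYWORDS.items():
        if words & keywords:
            return emotion
    # No keyword matched: treat fast speech as excitement (assumed threshold).
    return "excited" if speech_rate(text, duration_seconds) > 3.5 else "neutral"


print(estimate_emotion("That is great news!", 2.0))
```

The point of the sketch is only the shape of the step: features from the audio (here, duration) and from the text are combined into one label per utterance.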

[0198] (Application Example 2)

[0199] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0200] In real-time audio content such as meetings, seminars, and lectures, there is a need to provide an environment where participants who speak different languages can smoothly understand information and grasp the speaker's emotional changes. However, conventional systems have faced challenges such as misunderstandings and misinterpretations due to differences in language and emotions, hindering smooth communication.

[0201] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0202] In this invention, the server includes: a voice acquisition means for acquiring voice data; a speech recognition means for converting the acquired voice data into text; a translation means for translating the text into another language; an emotion recognition means for identifying emotions from the voice data and the text; a minutes generation means for creating minutes that reflect emotions; and an information distribution means for outputting the generated translated text and the minutes with added emotions. This allows participants to understand the accurately translated content in real time and to follow the changes in the speaker's emotions.

[0203] "Voice acquisition means" refers to a device or process that captures participants' speech with high sensitivity in settings such as meetings and seminars, and collects audio data necessary for subsequent processing in real time.

[0204] "Speech recognition means" refers to a technology or device that analyzes acquired speech data and converts it into text information using noise cancellation technology or an advanced speech model.

[0205] "Translation means" refers to a process or system that converts text obtained through speech recognition into another specified language, thereby enabling smooth communication between users of different languages.

[0206] "Emotion recognition means" refers to a technology or system that determines emotions based on the speaker's tone, speed, and content through the analysis of voice and text data, and identifies emotions such as joy, anger, sadness, and happiness in real time.

[0207] "Minutes generation means" refers to a device or system for automatically creating meeting minutes that reflect the progress of a meeting or seminar, based on translated text and information obtained from emotion recognition.

[0208] "Information distribution means" refers to a technology or process that delivers generated translated text and sentiment-added meeting minutes to the user's device, thereby sharing information in real time.

[0209] This invention is a system that translates audio content, such as meetings, into multiple languages in real time and generates and distributes meeting minutes with added emotion.

[0210] First, the terminal uses a high-sensitivity microphone to capture participants' speech during the meeting and transmits it to the server as audio data. Smartphones or dedicated microphones are used for audio acquisition.

[0211] Next, the server uses the Google Speech-to-Text API to convert the acquired audio data into text. Noise cancellation technology is applied during this process to improve the accuracy of the text conversion.

[0212] The server then uses the Google Translate API to translate the converted text into the specified language. The translation language varies depending on the user's settings, but it utilizes natural language processing technology to provide highly accurate translations that take cultural context into account.

[0213] The server further uses IBM Watson® sentiment analysis APIs to recognize the speaker's emotions in real time from speech and text. Based on tone, speed, and context, it identifies emotions such as joy, anger, sadness, and happiness, and records that information.

[0214] Based on this, the server automatically generates meeting minutes that reflect the translated text and sentiment information. The minutes include detailed information on the content of each statement, along with fluctuations in sentiment.

[0215] Finally, the server delivers the generated meeting minutes and sentiment information to the user's terminal via an information distribution system. This allows the user to access meeting minutes in real time, with multilingual support and including sentiment information.

[0216] As a concrete example, in an international conference, the English-speaking presenter's remarks are instantly translated into Japanese, and Japanese-speaking participants can view the projected minutes in real time. Furthermore, changes in the presenter's emotional tone while speaking (for example, excitement shown on a particular slide) are also displayed. An example of a prompt to input into the generative AI model would be, "Translate the real-time conference audio and generate a textual transcript with added emotion."

[0217] The flow of the specific processing in Application Example 2 will be explained using Figure 14.

[0218] Step 1:

[0219] The terminal captures meeting speeches as audio data using a high-sensitivity microphone. The input is an audio signal, which is collected in real time by capturing it as digital audio data. The output is digital audio data sent to the server.

[0220] Step 2:

[0221] The server converts the received audio data into text using the Google Speech-to-Text API. The input is the digital audio data generated in step 1, and high-precision text information is obtained by applying noise cancellation technology and performing audio signal analysis. The output is text data.

[0222] Step 3:

[0223] The server translates text data into the specified language using the Google Translate API. The input is the text data generated in step 2, and cross-cultural language conversion is performed using natural language processing techniques. The output is the translated text data.

[0224] Step 4:

[0225] The server uses IBM Watson's sentiment analysis API to identify emotions from audio and text data. The input consists of audio data from Step 1 and text data from Step 2. The server analyzes the tone and speed of the audio, the content of the text, etc., to estimate the emotion. The output is the identified emotion data.

[0226] Step 5:

[0227] The server generates meeting minutes based on the translated text and sentiment data. The input is the translated text data from step 3 and the sentiment data from step 4, and the output is meeting minutes that reflect the progress of the meeting in detail, including the content of the statements and the fluctuations in emotions.

[0228] Step 6:

[0229] The server distributes the generated meeting minutes and sentiment information to the user's terminal using the information distribution means. The input is the meeting minutes data generated in step 5, which is delivered over the network to the user's terminal in real time.

[0230] This process allows users to accurately understand the content of a meeting based on translated information and emotional information.

[0231] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0232] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
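
The pairing of a prompt with inference data described above can be pictured as a simple request object. The following is a hedged sketch only: the class name `InferenceRequest`, its field names, and the `modalities` helper are hypothetical and do not correspond to the interface of any particular generative AI service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InferenceRequest:
    """A prompt plus optional inference data (field names are assumptions)."""
    instruction: str
    text_data: Optional[str] = None
    audio_data: Optional[bytes] = None
    image_data: Optional[bytes] = None

    def modalities(self) -> List[str]:
        """List which kinds of inference data accompany the prompt."""
        present = []
        if self.text_data is not None:
            present.append("text")
        if self.audio_data is not None:
            present.append("audio")
        if self.image_data is not None:
            present.append("image")
        return present


request = InferenceRequest(
    instruction="Summarize the following meeting statement.",
    text_data="The release is scheduled for next week.",
)
print(request.modalities())
```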

[0233] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0234] [Second Embodiment]

[0235] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0236] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0237] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0238] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0239] The microphone 238 receives instructions from the user 20 by capturing the user's voice. It converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0240] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0241] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0242] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0243] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0244] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0245] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0246] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0247] The present invention is a system that processes meeting audio data in real time, translates it into multiple languages, generates meeting minutes, and outputs them. This system consists of several main components.

[0248] Voice acquisition means

[0249] The terminal is equipped with a microphone to capture participants' speech in real time. This makes it possible to capture all audio information during the meeting.

[0250] Speech recognition means

[0251] The server receives audio data sent from the terminal and converts it to text using a speech recognition engine. This conversion is performed in real time, and machine learning algorithms are used to improve the accuracy of the speech-to-text conversion.

[0252] Translation means

[0253] The server translates the text data into the specified language. It utilizes machine translation technology to quickly translate into English, Chinese, and other languages.

[0254] Meeting minutes generation means

[0255] The server organizes the translated text and generates meeting minutes that reflect the content of the meeting. This process organizes the text order and compiles it into minutes, highlighting important information.

[0256] Information distribution means

[0257] The server distributes the translated text and meeting minutes it generates to the users. Users can access the translated content and meeting minutes in real time via their devices.

[0258] As a concrete example, in a global conference, if a participant speaks in Japanese, the participant's terminal captures the audio, and the server converts it into Japanese text through speech recognition. The server then translates the text into the specified languages, such as English and Chinese, and generates meeting minutes in a format suited to the progress of the conference. These translations and minutes are delivered to the users' terminals, allowing all participants to share information in real time, regardless of language.

[0259] This system supports meetings in multilingual environments, providing efficient communication and accurate record-keeping.

[0260] The following describes the processing flow.

[0261] Step 1:

[0262] The user initiates the meeting. This puts the device into voice acquisition mode and activates the microphone.

[0263] Step 2:

[0264] The device captures the participants' speech and simultaneously divides the audio data into chunks of a certain length.

[0265] Step 3:

[0266] The terminal sends the segmented audio data to the server. This transmission is optimized to minimize network latency.

[0267] Step 4:

[0268] The server processes the received audio data using a speech recognition engine and converts it into text data. Machine learning algorithms are used to improve recognition accuracy.

[0269] Step 5:

[0270] The server inputs the converted text into a machine translation system, which then translates it into the specified language. The translation results are provided in real time.

[0271] Step 6:

[0272] The server organizes the translated text chronologically and generates meeting minutes based on the content of the meeting.

[0273] Step 7:

[0274] The server sends the generated translated text and meeting minutes to the user's device. The user can then view this information in real time on their device.

[0275] Step 8:

[0276] After the meeting ends, the user instructs the server to stop audio acquisition and processing. The server and terminal then save the final audio data and meeting minutes.
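
Step 2 of the flow above, dividing the audio data into chunks of a certain length, can be sketched as follows. The sketch assumes uncompressed mono PCM audio held in memory as bytes; the function name `split_audio` and its parameters are illustrative choices, not terms from this publication.

```python
from typing import Iterator


def split_audio(
    pcm: bytes, sample_rate: int, sample_width: int, chunk_seconds: float
) -> Iterator[bytes]:
    """Yield consecutive chunks of mono PCM audio, each at most chunk_seconds long."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    # Use a whole number of samples per chunk so no sample is split across chunks.
    samples_per_chunk = max(1, int(sample_rate * chunk_seconds))
    chunk_bytes = samples_per_chunk * sample_width
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silent 16-bit mono audio at 16 kHz, cut into 0.25-second chunks.
audio = bytes(16000 * 2)
chunks = list(split_audio(audio, sample_rate=16000, sample_width=2, chunk_seconds=0.25))
print(len(chunks), len(chunks[0]))
```

Shorter chunks reduce the delay before the server can start recognition, at the cost of more transmissions; the final chunk may be shorter than the others when the audio length is not an exact multiple of the chunk size.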

[0277] (Example 1)

[0278] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0279] In meetings and discussions, there is a need for systems that can efficiently and accurately share information in a multilingual environment. However, current technology makes it difficult to translate audio into multiple languages in real time and share it immediately with all participants. In particular, it is important to process audio data quickly while maintaining accuracy and minimizing mistranslations. Furthermore, accurately distributing and saving the generated meeting minutes is essential in today's business environment.

[0280] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0281] In this invention, the server includes an input means for acquiring sound information, a conversion means for converting the acquired sound information into text data, and a conversion means for converting the text data into different languages. This enables real-time multilingual translation and efficient and accurate information sharing.

[0282] "Sound information" refers to the voice data spoken by participants during meetings and discussions, which is processed as a digital signal.

[0283] "Input means" refers to devices or systems used to capture sound information in real time, and typically includes hardware such as microphones.

[0284] "Conversion means" refers to the process or technology for converting audio information into character data, specifically including speech recognition engines and related software.

[0285] "Character data" refers to text-formatted data generated from audio information and is stored as a digital character string that can be processed by a computer.

[0286] "Different language" refers to the language into which the character data is converted, meaning it is translated into a language different from the original sound.

[0287] "Generation means" refers to the process or technology for creating information records using the converted character data, including natural language processing technology.

[0288] "Transmission means" refers to the technology or method for transmitting the generated converted character data and information records to other terminals or systems, including data compression and optimization.

[0289] This invention is a system that realizes real-time multilingual translation and information sharing in meetings and discussions. The system aims to efficiently acquire audio information, quickly process it, and immediately distribute it to participants.

[0290] Acquisition of audio information

[0291] The terminal uses a high-performance microphone to acquire the speech of meeting participants in real time. This audio information is converted into a digital signal, and signal processing is then performed to remove background noise. The terminal uses devices such as microphones equipped with noise cancellation technology.

[0292] Conversion and translation of data

[0293] The server receives audio information transmitted from the terminal and converts it into text data in real time using a speech recognition engine (for example, a speech conversion engine using the latest machine learning algorithms). This text data is then translated into a specified language using neural machine translation technology. This process includes, for example, translation from English to Chinese.

[0294] Generation and distribution of converted data

[0295] The server uses the translated text data to generate meeting minutes. These minutes are organized logically, with key points highlighted. The server then delivers this information to users in real time. Users can access and view this information via their devices.

[0296] Specific examples and prompt statements

[0297] As a concrete example, in a global meeting of a multinational corporation, if a participant speaks in Japanese, the terminal captures the audio data, which is then converted into text data on the server. The text data is then translated into English and Chinese in real time, and the resulting meeting minutes are immediately delivered to the user. An example of an input prompt for the generative AI model of this system would be an instruction such as, "Translate the Japanese meeting audio into English and Chinese in real time and generate meeting minutes."

[0298] This system allows participants who speak different languages to collaborate and share information instantly without disrupting the flow of the meeting.

[0299] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0300] Step 1:

[0301] Acquisition of sound information

[0302] The terminal acquires in real time the sound information uttered by the meeting participants through the microphone. After removing the background noise from this input sound information using digital signal processing technology, it is transmitted to the server. As a specific operation, voice filtering is performed within the terminal, and a noise removal algorithm is applied.
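As a non-limiting sketch, the terminal-side filtering in Step 1 may be pictured as follows. The specification does not name the noise removal algorithm, so a simple amplitude noise gate is assumed here; the function name `noise_gate`, the threshold value, and the sample frame are all hypothetical.

```python
from typing import List


def noise_gate(samples: List[float], threshold: float = 0.05) -> List[float]:
    """Zero out samples whose absolute amplitude is below the threshold.

    Assumption: samples are normalized to the range [-1.0, 1.0]. A real
    terminal would more likely use spectral methods; this gate only
    illustrates where filtering sits before transmission to the server.
    """
    return [s if abs(s) >= threshold else 0.0 for s in samples]


# Hypothetical microphone frame: quiet background noise around two louder peaks.
frame = [0.01, -0.02, 0.40, -0.35, 0.03, 0.00]
cleaned = noise_gate(frame)
print(cleaned)  # prints [0.0, 0.0, 0.4, -0.35, 0.0, 0.0]
```

Only the cleaned frame would then be sent to the server, which keeps the recognition input free of low-level background noise.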

[0303] Step 2:

[0304] Conversion of voice data

[0305] The server converts the noise-removed sound information received from the terminal into character data by means of a speech recognition engine. At this time, the server decomposes the sound information into segments, analyzes them using an acoustic model, and converts them into text based on that analysis. The output thus obtained is character data reflecting the speech content of the meeting.

[0306] Step 3:

[0307] Translation of character data

[0308] The server translates the character data generated by speech recognition into different languages. In this process, neural machine translation technology is used, and by inputting the character data into a translation model, conversion to the specified language is performed. The output is translated character data corresponding to multiple languages. As a specific operation, the server refers to English and Chinese translation dictionaries and substitutes appropriate translated words.
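As a non-limiting sketch, the fan-out in Step 3 from one recognized utterance to several target languages may be written as follows. The translation model itself is not specified, so it is passed in as a callable; `translate_to_all`, `demo_translate`, and the lookup table are hypothetical stand-ins and not a real translation API.

```python
from typing import Callable, Dict, List

# Assumed interface: (source text, target language code) -> translated text.
Translator = Callable[[str, str], str]


def translate_to_all(text: str, targets: List[str], translate: Translator) -> Dict[str, str]:
    """Translate one recognized utterance into every requested language."""
    return {lang: translate(text, lang) for lang in targets}


# Hypothetical stand-in for a neural machine translation model.
_DEMO_TABLE = {
    ("おはようございます", "en"): "Good morning",
    ("おはようございます", "zh"): "早上好",
}


def demo_translate(text: str, target_lang: str) -> str:
    # Fall back to the source text when the demo table has no entry.
    return _DEMO_TABLE.get((text, target_lang), text)


result = translate_to_all("おはようございます", ["en", "zh"], demo_translate)
print(result)
```

Keeping the translator behind a callable means the same loop could serve any number of target languages without changing the surrounding steps.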

[0309] Step 4:

[0310] Generation of meeting minutes

[0311] The server generates meeting minutes based on the translated character data. In this generation process, the server arranges the text in logical order, extracts important information, and emphasizes it. The resulting meeting minutes are laid out in a format suitable for viewing. Specific operations include extraction of key points and emphasis of keywords.
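As a non-limiting sketch, the key point extraction and keyword emphasis of Step 4 may look like the following. The specification does not say how key points are chosen, so a caller-supplied keyword list is assumed; the `[KEY]` marker, the asterisk emphasis, and the function names are hypothetical.

```python
import re
from typing import List


def emphasize(text: str, keywords: List[str]) -> str:
    """Wrap each whole-word keyword occurrence in asterisks (case-insensitive)."""
    for kw in keywords:
        text = re.sub(rf"\b({re.escape(kw)})\b", r"*\1*", text, flags=re.IGNORECASE)
    return text


def generate_minutes(utterances: List[str], keywords: List[str]) -> str:
    """Keep utterance order, flag key points, and emphasize keywords."""
    lines = []
    for number, utterance in enumerate(utterances, start=1):
        is_key = any(
            re.search(rf"\b{re.escape(kw)}\b", utterance, re.IGNORECASE)
            for kw in keywords
        )
        marker = "[KEY] " if is_key else ""
        lines.append(f"{number}. {marker}{emphasize(utterance, keywords)}")
    return "\n".join(lines)


minutes = generate_minutes(
    ["The budget was approved.", "Lunch is at noon.", "The deadline moves to Friday."],
    ["budget", "deadline"],
)
print(minutes)
```

In practice the keyword list could itself be produced by the generative AI model; it is supplied by hand here only to keep the sketch self-contained.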

[0312] Step 5:

[0313] Information distribution

[0314] The server distributes the generated meeting minutes and translated text data to all participants' devices. Users receive this in real time via their devices and can refer to the meeting content in multiple languages. Data compression technology is applied during distribution, and transmission is optimized according to the available network bandwidth. Specifically, the server generates data packets and transmits them while avoiding network congestion.
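As a non-limiting sketch, the compression and packet generation of Step 5 may be illustrated with the Python standard library. The specification names neither the compression scheme nor the packet format, so zlib over a JSON payload is assumed; `build_packets`, `reassemble`, and the 32-byte limit are hypothetical, and a smaller limit would simply be chosen when bandwidth is scarce.

```python
import json
import zlib
from typing import List


def build_packets(payload: dict, max_packet_bytes: int) -> List[bytes]:
    """Compress a payload and split it into packets of bounded size."""
    if max_packet_bytes <= 0:
        raise ValueError("max_packet_bytes must be positive")
    raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    compressed = zlib.compress(raw)
    return [
        compressed[start:start + max_packet_bytes]
        for start in range(0, len(compressed), max_packet_bytes)
    ]


def reassemble(packets: List[bytes]) -> dict:
    """Terminal-side inverse: join the packets, decompress, and parse."""
    return json.loads(zlib.decompress(b"".join(packets)).decode("utf-8"))


payload = {"minutes": ["Budget approved.", "Deadline moved to Friday."], "lang": "en"}
packets = build_packets(payload, max_packet_bytes=32)
restored = reassemble(packets)
print(len(packets), restored == payload)
```

Ordering and loss recovery are left to the transport in this sketch; the packets must arrive complete and in order for `reassemble` to succeed.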

[0315] (Application Example 1)

[0316] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0317] In multilingual households and groups, facilitating smooth communication among members who speak different languages is often challenging. Furthermore, there is a need for efficient systems for real-time translation and meeting minute creation.

[0318] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0319] In this invention, the server includes: a voice data acquisition means for acquiring audio information; a speech recognition means for converting the acquired audio information into text data; a translation means for translating the converted text data into another language; a document generation means for creating a record document using the translated text data; an information distribution means for outputting the generated translated text data and record document; and a multilingual communication support means for supporting communication between users who speak different languages by analyzing and translating audio in multiple languages. This enables immediate and accurate communication between members who speak different languages.

[0320] "Audio information" refers to acquired audio data that is used for speech recognition and translation.

[0321] "Voice data acquisition means" refers to technical means used to acquire voice information, and includes devices such as microphones and sensors.

[0322] "Speech recognition means" refers to technology that converts acquired speech information into text data, and includes the process of analyzing a large amount of speech data to represent it as text.

[0323] "Translation means" refers to technology used to convert text data obtained through speech recognition into another required language.

[0324] "Document generation means" refers to a function for creating a record document based on translated text data.

[0325] "Information distribution means" refers to technology that outputs generated translated text data and recorded documents to terminals or other devices.

[0326] "Multilingual communication support means" refers to technology that analyzes and translates the speech of users who speak different languages to facilitate smooth communication.

[0327] This invention is a multilingual translation system aimed at facilitating communication between users who speak different languages. This system is designed to be installed in household assistant robots and other devices, and to function effectively in daily life.

[0328] The server utilizes the robot's built-in microphone to acquire voice information. The voice data acquisition mechanism captures voice information in real time. The acquired voice information is converted into text data using the Google Cloud Speech-to-Text API running on the server. This speech recognition mechanism enables highly accurate text conversion.

[0329] Next, the text data is translated into the user's specified language using the Microsoft Azure Translator API. This process performs the necessary language conversion for communication between users with different native languages.

[0330] Furthermore, the server uses document generation means to create a record document based on the translated text data. This record document is used to accurately record important conversations and events, especially within the home, and to facilitate later review. The generated record document and translated text data are delivered to the user's terminal or robot display using information distribution means.

[0331] The multilingual communication support means facilitates instant and accurate dialogue between users who speak different languages. For example, it enables smooth conversations between hosts and guests at home or at international events.

[0332] A concrete example is a dinner event at home. For instance, a robot could translate speech and provide real-time support so that an English-speaking guest and a Japanese-speaking host can understand each other's languages.

[0333] An example of a prompt sentence provided to the generative AI model would be: "To streamline multilingual communication at home, please use the robot to translate the following English conversation into Japanese."

[0334] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0335] Step 1:

[0336] The terminal acquires audio information. Using the robot's built-in microphone, it captures ambient sound in real time. The acquired audio information is sent to the server as a digital signal.

[0337] Step 2:

[0338] The server performs speech recognition processing. The input digital speech signal is converted into text data using the Google Cloud Speech-to-Text API. The features of the speech signal are analyzed and converted into text format based on the language grammar and context. The output of this step is text data.

[0339] Step 3:

[0340] The server performs the translation process. Using the resulting text data as input, it utilizes the Microsoft Azure Translator API to translate it into the user-specified target language. This process is designed to translate the text data contextually, resulting in more natural language expression. The output obtained in this step is the translated text data.

[0341] Step 4:

[0342] The server generates the document. Translated text data is used as input, formatted into a standard document type, and processed to create a conversation record. The document generation process ensures that important information is clearly organized.

[0343] Step 5:

[0344] The server distributes information. It delivers the generated translated text data and record document to the terminal. Here, the information is presented visually or audibly through the user's robot or display, allowing the user to confirm it. At the same time, the multilingual communication support means is used to facilitate real-time interaction between users.
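As a non-limiting sketch, Steps 2 to 5 above may be chained as follows. The specification names the Google Cloud Speech-to-Text API and the Microsoft Azure Translator API, but their client calls are deliberately not reproduced here; the recognizer and translator are injected as plain callables, and `run_pipeline`, `Delivery`, `fake_recognize`, and `fake_translate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed interfaces for the two external services.
Recognizer = Callable[[bytes], str]       # audio -> text
Translator = Callable[[str, str], str]    # (text, target language) -> text


@dataclass
class Delivery:
    translated_text: str
    record_document: str


def run_pipeline(audio: bytes, target_lang: str,
                 recognize: Recognizer, translate: Translator) -> Delivery:
    text = recognize(audio)                                            # Step 2
    translated = translate(text, target_lang)                          # Step 3
    document = f"Conversation record ({target_lang})\n- {translated}"  # Step 4
    return Delivery(translated, document)                              # Step 5 payload


# Stand-ins so the sketch runs without network access.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, target_lang: str) -> str:
    table = {("Welcome to our home.", "ja"): "我が家へようこそ。"}
    return table.get((text, target_lang), text)


delivery = run_pipeline(b"Welcome to our home.", "ja", fake_recognize, fake_translate)
print(delivery.record_document)
```

Injecting the services keeps the flow testable offline and lets either backend be replaced without touching the step order.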

[0345] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0346] This invention relates to a system that processes meeting audio data in real time, identifies emotions from user statements using an emotion recognition engine, translates the statements into multiple languages, and generates and outputs meeting minutes. This system consists of a voice acquisition means, a speech recognition means, a translation means, a meeting minutes generation means, an information distribution means, and an emotion recognition means.

[0347] Voice acquisition means

[0348] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants. This allows for real-time acquisition of audio data, providing the information necessary for subsequent processing.

[0349] Speech recognition means

[0350] The server converts the audio data transferred from the terminal into text using a speech recognition engine. This process utilizes noise cancellation technology and high-precision speech models to maximize the accuracy of the text conversion.

[0351] Translation means

[0352] The server translates the text obtained through speech recognition into a specific language. It utilizes natural language processing techniques to improve translation accuracy between languages with different cultural backgrounds.

[0353] Emotion recognition means

[0354] The server analyzes the tone, speed, and text content of the voice data to determine the user's emotions in real time. It identifies emotions such as joy, anger, sadness, and happiness, and records that information.

[0355] Meeting minutes generation means

[0356] The server automatically generates meeting minutes that reflect the flow of the meeting, based on the translated text and information obtained from sentiment recognition. The minutes will include not only the content of each statement but also the emotional fluctuations associated with that statement.

[0357] Information distribution means

[0358] The server delivers the generated meeting minutes and sentiment information to the user's terminal. Users can view the meeting minutes, which are translated in real time and reflect sentiment.

[0359] As a concrete example, in a meeting of a multinational corporation, participants speak English, Chinese, and Japanese. Using this system, all statements made during the meeting are translated in real time, and the speaker's emotions are reflected in the minutes. This information improves the quality of the meeting, transcending language and cultural barriers.

[0360] This system eliminates language barriers in international conferences, facilitates smooth communication, and enables information recording that also takes into account the emotions of the participants.

[0361] The following describes the processing flow.

[0362] Step 1:

[0363] The system becomes active when the user initiates the meeting. This puts the device into audio capture mode, and it begins recording participants' speech in real time.

[0364] Step 2:

[0365] The terminal divides the audio data into specific timeframes and prepares it for efficient processing. The process of sending this data to the server then begins.
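As a non-limiting sketch, dividing the captured audio into fixed timeframes in Step 2 may be done as follows. The frame length is not stated in the specification; the function name `split_into_frames` and the example sample rate and frame duration are hypothetical.

```python
from typing import Iterator, List


def split_into_frames(samples: List[int], sample_rate_hz: int,
                      frame_ms: int) -> Iterator[List[int]]:
    """Yield consecutive frames covering frame_ms milliseconds each.

    The final frame may be shorter when the audio does not divide evenly.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame must contain at least one sample")
    for start in range(0, len(samples), frame_len):
        yield samples[start:start + frame_len]


# Ten samples at a (toy) 1000 Hz rate, cut into 4 ms frames of 4 samples each.
frames = list(split_into_frames(list(range(10)), sample_rate_hz=1000, frame_ms=4))
print(frames)  # prints [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each frame can then be sent to the server as soon as it is complete, which is what allows recognition to begin before the speaker has finished.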

[0366] Step 3:

[0367] The server receives the audio data, feeds it into the speech recognition engine, and converts it into text data. A high-precision model is used so that even linguistically complex speech is transcribed accurately.

[0368] Step 4:

[0369] The server passes the converted text to the translation system, which translates it into the specified target language. The translation results are generated quickly and support multilingual environments.

[0370] Step 5:

[0371] The server analyzes voice features from the voice data and identifies the user's emotions through an emotion recognition engine. The identified emotions are categorized into several major emotional categories, such as joy, anger, sadness, and happiness.

[0372] Step 6:

[0373] The server integrates the translated text and sentiment information to generate meeting minutes. These minutes include the content of each statement, along with the emotional state at the time of the statement.

[0374] Step 7:

[0375] The server sends the generated meeting minutes to the user's terminal in real time. Users can conduct the meeting while checking the translated content of statements and sentiment information obtained during the meeting on their terminal.

[0376] Step 8:

[0377] After the meeting ends, the user instructs the system to shut down. The server saves all meeting minutes and sentiment data and archives them for later reference or analysis.
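As a non-limiting sketch, the integration of translated text with emotion information in Step 6 and the archiving in Step 8 may be represented as follows. The record layout is not defined in the specification; `MinutesEntry`, its fields, and the JSON archive format are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class MinutesEntry:
    speaker: str
    translated_text: str
    emotion: str


def render_minutes(entries: List[MinutesEntry]) -> str:
    """Step 6: one line per statement, with the emotion at the time of speaking."""
    return "\n".join(f"{e.speaker} [{e.emotion}]: {e.translated_text}" for e in entries)


def archive(entries: List[MinutesEntry]) -> str:
    """Step 8: serialize the minutes and emotion data for later reference."""
    return json.dumps([asdict(e) for e in entries], ensure_ascii=False, indent=2)


entries = [
    MinutesEntry("Tanaka", "The budget was approved.", "joy"),
    MinutesEntry("Lee", "The deadline is too tight.", "anger"),
]
rendered = render_minutes(entries)
archived = archive(entries)
print(rendered)
```

Storing the structured entries rather than only the rendered text keeps the emotion data available for the later analysis mentioned in Step 8.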

[0378] (Example 2)

[0379] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0380] In modern international conferences and multilingual settings, smooth communication is often hindered by participants having different languages and cultural backgrounds. Furthermore, while there is a need for information sharing that takes into account the emotions of participants during meetings, there is currently a lack of effective methods to achieve this.

[0381] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0382] In this invention, the server includes a device for acquiring audio data, a device for converting the acquired audio data into text, a device for translating the text into different languages, a device for analyzing the characteristics of the audio data and identifying emotions, a device for creating a document using the translated text and identified emotion data, and a device for outputting the generated translated text and document. This makes it possible to translate audio during a meeting into multiple languages in real time and to generate and share meeting minutes that take emotions into account.

[0383] A "device for acquiring audio data" is a device used to capture participants' speech in real time during meetings or discussions, and it collects audio information using high-sensitivity microphones or similar equipment.

[0384] A "device for converting acquired audio data into text" is a device that analyzes collected audio data and converts it into corresponding text data using a language model.

[0385] A "device for translating into different languages" is a device that converts text data into a specified different language, and it provides high translation accuracy by applying natural language processing technology.

[0386] A "device that analyzes the characteristics of voice data to identify emotions" is a device that analyzes the tone and speed of acquired voice data, the content of text, etc., to determine the emotional state of the speaker.

[0387] A "device for creating documents using translated text and identified sentiment data" is a device for automatically generating documented information records based on translated text data and sentiment data.

[0388] A "device for outputting generated translated text and documents" is a device for distributing generated text and documents in real time and providing them in a state where relevant parties can access them.

[0389] This system enables smooth communication among participants in international conferences and multilingual environments by recognizing, translating, and analyzing speech in real time. The system includes the following devices and means:

[0390] The terminal uses a high-sensitivity microphone to capture participants' speech in real time during meetings and discussions. This audio data is then passed on to the next processing stage.

[0391] The server analyzes the audio data transmitted from the terminal using a speech recognition engine and converts it into text. This process ensures high accuracy by employing noise cancellation technology and advanced speech recognition models.

[0392] The server translates the acquired text data into different languages through natural language processing. This provides appropriate translations for participants with different cultural and linguistic backgrounds.

[0393] The server further identifies the speaker's emotional state based on characteristics of the audio data, such as tone and speed, and the content of the text. Major emotions such as joy, anger, sadness, and happiness are determined in real time during this process.

[0394] Based on the translated text and sentiment data, the server automatically creates meeting minutes. These minutes reflect not only the content of the utterances but also the emotional characteristics of the situation.

[0395] The server delivers the generated meeting minutes and translated text to the user's terminal in real time. Through this, the user can instantly grasp the content and emotional elements of the meeting.

[0396] As a concrete example, consider a meeting at a multinational corporation. Participants use English, Chinese, and Japanese, and each statement is recognized and translated in real time by the system. The emotional elements of each utterance are also detected and reflected in the meeting minutes, ensuring that all parties involved share the same understanding and facilitating efficient communication.

[0397] Another example of a prompt for the generative AI model is an instruction such as, "Analyze the meeting audio data in real time and create meeting minutes that reflect the sentiment of the statements along with the translation." This prompt causes the system to process speech recognition, translation, and sentiment analysis in an integrated manner.

[0398] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0399] Step 1:

[0400] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants in real time. The input is the participants' voices. This audio data is digitized to create initial data for the next processing step.

[0401] Step 2:

[0402] The server receives digital audio data transmitted from the terminal and converts this data into text using a speech recognition engine. It receives digital audio data as input and applies noise cancellation technology and a speech recognition model to output the spoken content as text data.

[0403] Step 3:

[0404] The server receives text data obtained through speech recognition and translates it into a specific language using natural language processing techniques. The input for this step is the recognized text, and it outputs a highly accurate translation as text data, taking into account cultural nuances between languages.

[0405] Step 4:

[0406] The server references the initial audio data and identifies the speaker's emotions in real time based on the tone and speed of speech, as well as the content of the translated text. The input includes the initial audio data and translated text, and emotion recognition technology is used to output emotion information corresponding to each utterance.

[0407] Step 5:

[0408] The server uses the translated text data and the identified sentiment information to create meeting minutes that reflect the progress of the meeting. The inputs are the translated text and the sentiment information. The output is document data that reflects each statement and the accompanying changes in sentiment.

[0409] Step 6:

[0410] The server delivers the generated meeting minutes and sentiment information to each user's terminal. The input for this step is the document data created in the previous step, and the server sends the data so that users can receive, in real time, meeting minutes in which the translations and sentiments are reflected.

[0411] (Application Example 2)

[0412] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0413] In real-time audio content such as meetings, seminars, and lectures, there is a need to provide an environment where participants who speak different languages can smoothly understand information and grasp the speaker's emotional changes. However, conventional systems have faced challenges such as misunderstandings and misinterpretations due to differences in language and emotions, hindering smooth communication.

[0414] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0415] In this invention, the server includes: a voice acquisition means for acquiring voice data; a speech recognition means for converting the acquired voice data into text; a translation means for translating the text into another language; an emotion recognition means for identifying emotions from the voice data and the text; a minutes generation means for creating minutes that reflect emotions; and an information distribution means for outputting the generated translated text and the minutes with added emotions. This allows participants to understand the accurately translated content in real time and to perceive the changes in the speaker's emotions.

[0416] "Voice acquisition means" refers to a device or process that captures participants' speech with high sensitivity in settings such as meetings and seminars, and collects audio data necessary for subsequent processing in real time.

[0417] "Speech recognition means" refers to a technology or device that analyzes acquired speech data and converts it into text information using noise cancellation technology or an advanced speech model.

[0418] "Translation means" refers to a process or system that converts text obtained through speech recognition into another specified language, thereby enabling smooth communication between users of different languages.

[0419] "Emotion recognition means" refers to a technology or system that determines emotions based on the speaker's tone, speed, and content through the analysis of voice and text data, and identifies emotions such as joy, anger, sadness, and happiness in real time.

[0420] "Minutes generation means" refers to a device or system for automatically creating meeting minutes that reflect the progress of a meeting or seminar, based on translated text and information obtained from emotion recognition.

[0421] "Information distribution means" refers to a technology or process that delivers generated translated text and sentiment-added meeting minutes to the user's device, thereby sharing information in real time.

[0422] This invention is a system that translates audio content, such as meetings, into multiple languages in real time and generates and distributes meeting minutes with added emotion.

[0423] First, the terminal uses a high-sensitivity microphone to capture participants' speech during the meeting and transmits it to the server as audio data. Smartphones or dedicated microphones are used for audio acquisition.

[0424] Next, the server uses the Google Speech-to-Text API to convert the acquired audio data into text. Noise cancellation technology is applied during this process to improve the accuracy of the text conversion.

[0425] The server then uses the Google Translate API to translate the converted text into the specified language. The translation language varies depending on the user's settings, but it utilizes natural language processing technology to provide highly accurate translations that take cultural context into account.

[0426] The server further uses IBM Watson's sentiment analysis API to recognize the speaker's emotions in real time from speech and text. Based on tone, speed, and context, it identifies emotions such as joy, anger, sadness, and happiness, and records that information.

[0427] Based on this, the server automatically generates meeting minutes that reflect the translated text and sentiment information. The minutes include detailed information on the content of each statement, along with fluctuations in sentiment.

[0428] Finally, the server delivers the generated meeting minutes and sentiment information to the user's terminal via an information distribution system. This allows the user to access meeting minutes in real time, with multilingual support and including sentiment information.

[0429] As a concrete example, in an international conference, the English-speaking presenter's remarks are instantly translated into Japanese, and Japanese-speaking participants can view the projected minutes in real time. Furthermore, changes in the presenter's emotional tone while speaking (for example, when they show excitement on a particular slide) are also displayed. An example of a prompt to input into the generative AI model would be, "Translate the real-time conference audio and generate a textual transcript with added emotion."

[0430] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0431] Step 1:

[0432] The terminal captures meeting speeches as audio data using a high-sensitivity microphone. The input is an audio signal, which is collected in real time by capturing it as digital audio data. The output is digital audio data sent to the server.

[0433] Step 2:

[0434] The server converts the received audio data into text using the Google Speech-to-Text API. The input is the digital audio data generated in step 1, and high-precision text information is obtained by applying noise cancellation technology and performing audio signal analysis. The output is text data.

[0435] Step 3:

[0436] The server translates text data into the specified language using the Google Translate API. The input is the text data generated in step 2, and cross-cultural language conversion is performed using natural language processing techniques. The output is the translated text data.

[0437] Step 4:

[0438] The server uses IBM Watson's sentiment analysis API to identify emotions from audio and text data. The input consists of audio data from Step 1 and text data from Step 2. The server analyzes the tone and speed of the audio, the content of the text, etc., to estimate the emotion. The output is the identified emotion data.

[0439] Step 5:

[0440] The server generates meeting minutes based on the translated text and sentiment data. The input is the translated text data from step 3 and the sentiment data from step 4, and the output is meeting minutes that reflect the progress of the meeting in detail, including the content of the statements and the fluctuations in emotions.

[0441] Step 6:

[0442] The server distributes the generated meeting minutes and sentiment information to the user's terminal using the information distribution means. The input is the meeting minutes data generated in step 5, which is output to the terminal in real time over the network and delivered to the user.

[0443] This process allows users to accurately understand the content of a meeting based on translated information and emotional information.
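As a non-limiting sketch, the input and output wiring of Steps 2 to 5 above may be made explicit as follows, in particular that emotion identification in Step 4 consumes the audio from Step 1 together with the text from Step 2. The three services are injected as callables instead of calling the named Google and IBM Watson APIs; `process_utterance`, `StepOutputs`, and the lambda stand-ins are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepOutputs:
    text: str        # Step 2 output
    translated: str  # Step 3 output
    emotion: str     # Step 4 output
    minutes: str     # Step 5 output


def process_utterance(
    audio: bytes,
    target_lang: str,
    recognize: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    detect_emotion: Callable[[bytes, str], str],
) -> StepOutputs:
    text = recognize(audio)                    # Step 2: audio -> text
    translated = translate(text, target_lang)  # Step 3: text -> translated text
    emotion = detect_emotion(audio, text)      # Step 4: Step 1 audio + Step 2 text
    minutes = f"[{emotion}] {translated}"      # Step 5: combine Steps 3 and 4
    return StepOutputs(text, translated, emotion, minutes)


# Stand-in services so the sketch runs offline.
outputs = process_utterance(
    b"Great results this quarter",
    "ja",
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text, lang: f"({lang}) {text}",
    detect_emotion=lambda audio, text: "joy" if "Great" in text else "neutral",
)
print(outputs.minutes)  # prints [joy] (ja) Great results this quarter
```

Because Steps 3 and 4 both depend only on earlier outputs and not on each other, an implementation could run them concurrently before Step 5 joins the results.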

[0444] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0445] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
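As a non-limiting sketch, the pairing of a prompt with inference data described above may be modeled as a simple request object. No real model API is called; `ModelRequest`, `build_request`, and the dictionary keys are hypothetical and only illustrate that an instruction and at least one kind of inference data travel together.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class ModelRequest:
    prompt: str
    inference_data: Dict[str, Union[str, bytes]] = field(default_factory=dict)


def build_request(instruction: str, *, text: Optional[str] = None,
                  audio: Optional[bytes] = None) -> ModelRequest:
    """Bundle an instruction with the text and/or audio it should be applied to."""
    data: Dict[str, Union[str, bytes]] = {}
    if text is not None:
        data["text"] = text
    if audio is not None:
        data["audio"] = audio
    if not data:
        raise ValueError("at least one kind of inference data is required")
    return ModelRequest(prompt=instruction, inference_data=data)


request = build_request(
    "Translate the real-time conference audio and generate a textual transcript "
    "with added emotion.",
    audio=b"\x00\x01",
)
print(sorted(request.inference_data))  # prints ['audio']
```

The prompt shown is the one quoted in Application Example 2; image data could be added as a further optional field in the same way.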

[0446] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0447] [Third Embodiment]

[0448] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0449] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0450] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0451] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0452] The microphone 238 accepts instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0453] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0454] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0455] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0456] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0457] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0458] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0459] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0460] The present invention is a system that processes meeting audio data in real time, translates it into multiple languages, generates meeting minutes, and outputs them. This system consists of several main components.

[0461] Voice acquisition means

[0462] The terminal is equipped with a microphone to capture participants' speech in real time. This makes it possible to capture all audio information during the meeting.

[0463] Speech recognition means

[0464] The server receives audio data sent from the terminal and converts it to text using a speech recognition engine. This conversion is performed in real time, and machine learning algorithms are used to improve the accuracy of the speech-to-text conversion.

[0465] Translation means

[0466] The server translates the transcribed text into the specified language. It uses machine translation technology to translate quickly into English, Chinese, and other languages.

[0467] Meeting minutes generation means

[0468] The server organizes the translated text and generates meeting minutes that reflect the content of the meeting. This process organizes the text order and compiles it into minutes, highlighting important information.

[0469] Information distribution means

[0470] The server distributes the translated text and meeting minutes it generates to the users. Users can access the translated content and meeting minutes in real time via their devices.

[0471] As a concrete example, in a global conference, if a participant speaks in Japanese, the participant's terminal captures the audio, and the server converts it into Japanese text through speech recognition. The server then translates the text into the specified languages, such as English and Chinese, and generates meeting minutes in a format appropriate to the progress of the conference. These translations and minutes are delivered to the users' terminals, allowing all participants to share information in real time, regardless of language.

[0472] This system supports meetings in multilingual environments, providing efficient communication and accurate record-keeping.

[0473] The following describes the processing flow.

[0474] Step 1:

[0475] The user initiates the meeting. This puts the terminal into voice acquisition mode and activates the microphone.

[0476] Step 2:

[0477] The terminal captures the participants' speech and, at the same time, divides the audio data into chunks of a fixed length.

[0478] Step 3:

[0479] The terminal sends the segmented audio data to the server. This transmission is optimized to minimize network latency.

[0480] Step 4:

[0481] The server processes the received audio data using a speech recognition engine and converts it into text data. Machine learning algorithms are used to improve recognition accuracy.

[0482] Step 5:

[0483] The server inputs the converted text into a machine translation system, which then translates it into the specified language. The translation results are provided in real time.

[0484] Step 6:

[0485] The server organizes the translated text chronologically and generates meeting minutes based on the content of the meeting.

[0486] Step 7:

[0487] The server sends the generated translated text and meeting minutes to the user's device. The user can then view this information in real time on their device.

[0488] Step 8:

[0489] After the meeting ends, the user instructs the server to stop audio acquisition and processing. The server and terminal then save the final audio data and meeting minutes.
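
The chunking in Step 2 and the chronological ordering in Step 6 can be illustrated with a minimal Python sketch. It is not taken from the specification: the names `Segment`, `split_into_chunks`, and `build_minutes`, the timestamp format, and the choice to let the last chunk be shorter are all assumptions made for illustration, and the speech recognition and translation steps are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    """Hypothetical record for one processed chunk."""
    start_sec: float  # offset of the chunk from the start of the meeting
    text: str         # translated text obtained for this chunk


def split_into_chunks(samples: Sequence[int], sample_rate: int,
                      chunk_sec: float) -> List[Sequence[int]]:
    """Split raw samples into fixed-length chunks; the last one may be shorter."""
    size = int(sample_rate * chunk_sec)
    if size <= 0:
        raise ValueError("chunk length must cover at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def build_minutes(segments: List[Segment]) -> str:
    """Order segments by start time and render one line per segment."""
    ordered = sorted(segments, key=lambda s: s.start_sec)
    return "\n".join(f"[{s.start_sec:06.1f}s] {s.text}" for s in ordered)
```

Sorting by start time means that chunks whose results arrive out of order from the server-side processing would still appear in spoken order in the minutes.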

[0490] (Example 1)

[0491] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0492] In meetings and discussions, there is a need for systems that can efficiently and accurately share information in a multilingual environment. However, current technology makes it difficult to translate audio into multiple languages in real time and share it immediately with all participants. In particular, it is important to process audio data quickly while maintaining accuracy and minimizing mistranslations. Furthermore, accurately distributing and saving the generated meeting minutes is essential in today's business environment.

[0493] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0494] In this invention, the server includes an input means for acquiring sound information, a conversion means for converting the acquired sound information into text data, and a conversion means for converting the text data into different languages. This enables real-time multilingual translation and efficient and accurate information sharing.

[0495] "Sound information" refers to the voice data spoken by participants during meetings and discussions, which is processed as a digital signal.

[0496] "Input means" refers to devices or systems used to capture sound information in real time, and typically includes hardware such as microphones.

[0497] "Conversion means" refers to the process or technology for converting sound information into text data, and specifically includes speech recognition engines and related software.

[0498] "Text data" refers to text-formatted data generated from sound information, which is stored as a digital string that can be processed by a computer.

[0499] "Different languages" refers to the languages to which the text data is converted, meaning that it is translated into a language different from the original sound.

[0500] "Generation means" refers to processes and technologies for creating information records using the converted text data, and includes natural language processing technologies.

[0501] "Transmission means" refers to the technologies and methods for transmitting the generated converted text data and information records to other terminals or systems, and includes data compression and optimization.

[0502] This invention is a system that enables real-time multilingual translation and information sharing in meetings and discussions. The system aims to efficiently acquire audio information, process it quickly, and deliver it to participants immediately.

[0503] Acquisition of sound information

[0504] The terminal uses a high-performance microphone to capture participants' speech in real time. This audio information is converted into a digital signal, which is then processed to remove background noise. The terminal may use, for example, a microphone equipped with noise-canceling technology.
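
The specification does not name a particular noise removal algorithm. As one simple possibility, the following Python sketch shows a frame-based noise gate that silences frames whose energy falls below a threshold. The function name `noise_gate`, the frame size of 160 samples, and the use of RMS energy are assumptions made for illustration only.

```python
from typing import List, Sequence


def noise_gate(samples: Sequence[float], threshold: float,
               frame_size: int = 160) -> List[float]:
    """Zero out frames whose RMS energy is below `threshold`.

    A noise gate is a stand-in chosen for illustration; it is not the
    noise reduction algorithm of the specification, which is unspecified.
    """
    out: List[float] = []
    for i in range(0, len(samples), frame_size):
        frame = list(samples[i:i + frame_size])
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        # Keep frames that are loud enough, silence the rest.
        out.extend(frame if rms >= threshold else [0.0] * len(frame))
    return out
```

A gate of this kind only suppresses noise between utterances. Removing noise that overlaps speech would require a different technique, such as spectral filtering.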

[0505] Data transformation and translation

[0506] The server receives audio information transmitted from the terminal and converts it into text data in real time using a speech recognition engine (for example, a speech conversion engine using the latest machine learning algorithms). This text data is then translated into a specified language using neural machine translation technology. This process includes, for example, translation from English to Chinese.

[0507] Generation and distribution of converted data

[0508] The server uses the translated text data to generate meeting minutes. These minutes are organized logically, with key points highlighted. The server then delivers this information to users in real time. Users can access and view this information via their devices.

[0509] Specific examples and prompt statements

[0510] As a concrete example, in a global meeting of a multinational corporation, if a participant speaks in Japanese, the terminal captures the audio data, which is then converted into text data on the server. The text data is then translated into English and Chinese in real time, and the resulting meeting minutes are immediately delivered to the users. An example of an input prompt for the generative AI model of this system would be an instruction such as, "Translate the Japanese meeting audio into English and Chinese in real time and generate meeting minutes."
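
A prompt of this form could be assembled from the source language and the target languages, as in the following minimal Python sketch. The helper names `join_languages` and `build_prompt` are hypothetical. The specification gives only the example sentence itself and does not describe how the prompt is constructed.

```python
from typing import Sequence


def join_languages(names: Sequence[str]) -> str:
    """Join language names into an English list ("A", "A and B", "A, B, and C")."""
    if not names:
        raise ValueError("at least one target language is required")
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + f", and {names[-1]}"


def build_prompt(source_lang: str, target_langs: Sequence[str]) -> str:
    """Fill the example instruction with the given languages."""
    return (f"Translate the {source_lang} meeting audio into "
            f"{join_languages(target_langs)} in real time "
            "and generate meeting minutes.")
```

For example, `build_prompt("Japanese", ["English", "Chinese"])` reproduces the example prompt given above.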

[0511] This system allows participants who speak different languages to collaborate and share information instantly without disrupting the flow of the meeting.

[0512] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0513] Step 1:

[0514] Acquisition of sound information

[0515] The terminal acquires audio information spoken by meeting participants in real time via a microphone. This input audio information is then sent to the server after background noise is removed using digital signal processing technology. Specifically, audio filtering is performed within the terminal, and a noise reduction algorithm is applied.

[0516] Step 2:

[0517] Audio data conversion

[0518] The server converts the denoised audio information received from the terminal into text data using a speech recognition engine. During this process, the server decomposes the audio information into segments, performs analysis using an acoustic model, and then converts it into text. The resulting output is text data that reflects the content of the meeting.

[0519] Step 3:

[0520] Translation of text data

[0521] The server translates text data generated by speech recognition into different languages. This process uses neural machine translation technology, inputting text data into a translation model to convert it into the specified language. The output is translated text data compatible with multiple languages. Specifically, the server refers to English and Chinese translation dictionaries and performs appropriate translation rewriting.

[0522] Step 4:

[0523] Minutes generation

[0524] The server generates meeting minutes based on the translated text data. During this generation process, the server organizes the text logically, extracts important information, and highlights it. The resulting meeting minutes are formatted for easy viewing. Specific actions include extracting key points and highlighting keywords.

[0525] Step 5:

[0526] Information distribution

[0527] The server distributes the generated meeting minutes and translated text data to all participants' terminals. Users receive this information in real time via their terminals and can refer to the meeting content in multiple languages. Data compression technology is applied during distribution, and the transmission method is selected according to the available network bandwidth. Specifically, the server generates data packets and transmits them while avoiding network congestion.
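
The keyword highlighting in Step 4 and the compression applied in Step 5 can be sketched in Python as follows. The asterisk markup, the JSON payload layout, and the use of zlib are assumptions chosen for illustration, since the specification does not state a highlighting format or a compression scheme. The keyword list is assumed to be supplied by some earlier extraction step and to consist of plain words.

```python
import json
import re
import zlib
from typing import Iterable, List


def highlight_keywords(line: str, keywords: Iterable[str]) -> str:
    """Wrap each whole-word keyword occurrence in asterisks (case-insensitive).

    Overlapping keywords are not handled; each keyword is applied in turn.
    """
    for kw in keywords:
        pattern = re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
        line = pattern.sub(lambda m: f"*{m.group(0)}*", line)
    return line


def pack_for_distribution(minutes: List[str]) -> bytes:
    """Serialize the minutes as UTF-8 JSON and compress them with zlib."""
    payload = json.dumps({"minutes": minutes}, ensure_ascii=False)
    return zlib.compress(payload.encode("utf-8"))


def unpack(blob: bytes) -> List[str]:
    """Inverse of pack_for_distribution, as a receiving terminal would run it."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))["minutes"]
```

Keeping `ensure_ascii=False` leaves Japanese and Chinese text as UTF-8 rather than escape sequences, which keeps the uncompressed payload smaller for those languages.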

[0528] (Application Example 1)

[0529] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0530] In multilingual households and groups, facilitating smooth communication among members who speak different languages is often challenging. Furthermore, there is a need for efficient systems for real-time translation and meeting minute creation.

[0531] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0532] In this invention, the server includes: an audio data acquisition means for acquiring audio information; an audio recognition means for converting the acquired audio information into text data; a translation means for translating the acquired text data into another language; a document generation means for creating a record document using the translated text data; an information distribution means for outputting the generated translated text data and record document; and a multilingual communication support means for supporting communication between users who speak different languages by analyzing and translating audio in multiple languages. This enables immediate and accurate communication between members who speak different languages.

[0533] "Audio information" refers to acquired audio data that is used for speech recognition and translation.

[0534] "Voice data acquisition means" refers to technical means used to acquire voice information, and includes devices such as microphones and sensors.

[0535] "Speech recognition means" refers to technology that converts acquired speech information into text data, and includes the process of analyzing a large amount of speech data to represent it as text.

[0536] "Translation means" refers to technology used to convert text data obtained through speech recognition into another required language.

[0537] "Document generation means" refers to a function for creating a record document based on translated text data.

[0538] "Information distribution means" refers to technology that outputs generated translated text data and recorded documents to terminals or other devices.

[0539] "Multilingual communication support means" refers to technology that analyzes and translates the speech of users who speak different languages to facilitate smooth communication.

[0540] This invention is a multilingual translation system aimed at facilitating communication between users who speak different languages. This system is designed to be installed in household assistant robots and other devices, and to function effectively in daily life.

[0541] The server uses the robot's built-in microphone to acquire voice information. The voice data acquisition means captures voice information in real time. The acquired voice information is converted into text data using the Google Cloud Speech-to-Text API running on the server. This speech recognition means enables highly accurate text conversion.
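
As a rough illustration of how audio might be handed to the Speech-to-Text service, the following Python sketch builds a request body for the v1 REST `speech:recognize` method. It assumes 16-bit linear PCM audio at 16 kHz, which the specification does not state. The helper name `build_recognize_body` is hypothetical, authentication and the HTTP call itself are omitted, and the field names should be checked against the current Google Cloud documentation.

```python
import base64
import json

# REST endpoint of the synchronous recognize method (v1).
SPEECH_RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_body(pcm_bytes: bytes, language_code: str,
                         sample_rate_hz: int = 16000) -> str:
    """Build the JSON body for a synchronous recognition request.

    Assumes LINEAR16 (16-bit PCM) audio; the audio format is an assumption.
    """
    body = {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hz,
            "languageCode": language_code,
        },
        # Inline audio is sent base64-encoded in the "content" field.
        "audio": {"content": base64.b64encode(pcm_bytes).decode("ascii")},
    }
    return json.dumps(body)
```

Synchronous recognition suits short clips. For continuous real-time capture, a streaming interface would likely be a closer fit, although the specification does not say which is used.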

[0542] Next, the text data is translated into the user's specified language using the Microsoft Azure Translator API. This process performs the necessary language conversion for communication between users with different native languages.
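
The translation call can be sketched in Python as follows, assuming version 3.0 of the Translator REST API on the global endpoint. The helper names `build_translate_request` and `parse_translations` are hypothetical. The sketch only prepares the request and parses a response of the documented shape; it does not send anything, and the subscription key and region are placeholders.

```python
import json
from typing import Dict, Sequence, Tuple
from urllib.parse import urlencode

TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text: str, to_langs: Sequence[str], key: str,
                            region: str) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a Translator v3.0 translate call."""
    # One "to" parameter per target language.
    query = urlencode([("api-version", "3.0")] + [("to", lang) for lang in to_langs])
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    body = json.dumps([{"Text": text}]).encode("utf-8")
    return f"{TRANSLATOR_ENDPOINT}?{query}", headers, body


def parse_translations(response_json: str) -> Dict[str, str]:
    """Map each target language code to its translated text (first input only)."""
    data = json.loads(response_json)
    return {t["to"]: t["text"] for t in data[0]["translations"]}
```

Because several `to` parameters can be passed in one request, a single call can serve every target language needed for one utterance.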

[0543] Furthermore, the server uses document generation means to create a record document based on the translated text data. This record document is used to accurately record important conversations and events, especially within the home, and to facilitate later review. The generated record document and translated text data are delivered to the user's terminal or robot display using information distribution means.

[0544] The multilingual communication support means facilitates immediate and accurate dialogue between users who speak different languages. For example, it enables smooth conversations between hosts and guests at home or at international events.

[0545] A concrete example is a dinner event at home. For instance, a robot could translate speech and provide real-time support so that an English-speaking guest and a Japanese-speaking host can understand each other's languages.

[0546] An example of a prompt sentence provided to the generative AI model would be: "To streamline multilingual communication at home, please use the robot to translate the following English conversation into Japanese."

[0547] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0548] Step 1:

[0549] The terminal acquires audio information. Using the robot's built-in microphone, it captures ambient sound in real time. The acquired audio information is sent to the server as a digital signal.

[0550] Step 2:

[0551] The server performs speech recognition processing. The input digital speech signal is converted into text data using the Google Cloud Speech-to-Text API. The features of the speech signal are analyzed and converted into text format based on the language grammar and context. The output of this step is text data.

[0552] Step 3:

[0553] The server performs the translation process. Using the resulting text data as input, it utilizes the Microsoft Azure Translator API to translate it into the user-specified target language. This process is designed to translate the text data contextually, resulting in more natural language expression. The output obtained in this step is the translated text data.

[0554] Step 4:

[0555] The server generates the document. Translated text data is used as input, formatted into a standard document type, and processed to create a conversation record. The document generation process ensures that important information is clearly organized.

[0556] Step 5:

[0557] The server distributes information. It delivers the generated translated text data and record documents to the terminal. The information is then presented visually or audibly via the user's robot or display, allowing the user to confirm it. At the same time, the multilingual communication support means is used to facilitate real-time interaction and support communication between users.

[0558] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0559] This invention relates to a system that processes meeting audio data in real time, combines it with an emotion recognition engine to identify emotions from user statements, translates them into multiple languages, generates meeting minutes, and outputs them. This system consists of an audio acquisition means, an audio recognition means, a translation means, a meeting minutes generation means, an information distribution means, and an emotion recognition means.

[0560] Voice acquisition means

[0561] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants. This allows for real-time acquisition of audio data, providing the information necessary for subsequent processing.

[0562] Speech recognition means

[0563] The server converts the audio data transferred from the terminal into text using a speech recognition engine. This process utilizes noise cancellation technology and high-precision speech models to maximize the accuracy of the text conversion.

[0564] Translation means

[0565] The server translates the text obtained through speech recognition into a specific language. It utilizes natural language processing techniques to improve translation accuracy between languages with different cultural backgrounds.

[0566] Emotion recognition means

[0567] The server analyzes the tone, speed, and text content of the voice data to determine the user's emotions in real time. It identifies emotions such as joy, anger, sadness, and happiness, and records that information.

[0568] Meeting minutes generation means

[0569] The server automatically generates meeting minutes that reflect the flow of the meeting, based on the translated text and information obtained from sentiment recognition. The minutes will include not only the content of each statement but also the emotional fluctuations associated with that statement.

[0570] Information distribution means

[0571] The server delivers the generated meeting minutes and sentiment information to the user's terminal. Users can view the meeting minutes, which are translated in real time and reflect sentiment.

[0572] As a concrete example, in a meeting of a multinational corporation, participants speak English, Chinese, and Japanese. Using this system, all statements made during the meeting are translated in real time, and the speaker's emotions are reflected in the minutes. This information improves the quality of the meeting, transcending language and cultural barriers.

[0573] This system eliminates language barriers in international conferences, facilitates smooth communication, and enables information recording that also takes into account the emotions of the participants.

[0574] The following describes the processing flow.

[0575] Step 1:

[0576] The system becomes active when the user initiates the meeting. This puts the terminal into audio capture mode, and the terminal begins recording participants' speech in real time.

[0577] Step 2:

[0578] The terminal divides the audio data into specific timeframes and prepares it for efficient processing. The process of sending this data to the server then begins.

[0579] Step 3:

[0580] The server receives the audio data, feeds it into the speech recognition engine, and converts it into text data. The goal is to transcribe even complex language accurately using a high-precision model.

[0581] Step 4:

[0582] The server passes the converted text to the translation system, which translates it into the specified target language. The translation results are generated quickly and support multilingual environments.

[0583] Step 5:

[0584] The server analyzes voice features from the voice data and identifies the user's emotions through an emotion recognition engine. The identified emotions are categorized into several major emotional categories, such as joy, anger, sadness, and happiness.

[0585] Step 6:

[0586] The server integrates the translated text and sentiment information to generate meeting minutes. These minutes include the content of each statement, along with the emotional state at the time of the statement.

[0587] Step 7:

[0588] The server sends the generated meeting minutes to the user's terminal in real time. Users can conduct the meeting while checking the translated content of statements and sentiment information obtained during the meeting on their terminal.

[0589] Step 8:

[0590] After the meeting ends, the user instructs the system to shut down. The server saves all meeting minutes and sentiment data and archives them for later reference or analysis.
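
Steps 5 and 6 of the flow above can be sketched in Python as follows. The rule-based `toy_emotion` function is a placeholder and is not the emotion identification model 59: its thresholds, its keyword list, and the extra "neutral" label are invented for illustration, and a trained model would take its place. The `Utterance` fields are likewise assumed names for the voice features mentioned in the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    """Hypothetical per-utterance record after translation."""
    speaker: str
    translated_text: str
    words_per_sec: float  # speaking speed
    mean_pitch_hz: float  # rough stand-in for tone of voice


def toy_emotion(u: Utterance) -> str:
    """Placeholder rules only; all thresholds and keywords are invented."""
    text = u.translated_text.lower()
    if any(word in text for word in ("great", "thank", "glad")):
        return "joy"
    if u.words_per_sec > 3.5 and u.mean_pitch_hz > 220:
        return "anger"
    if u.words_per_sec < 1.5:
        return "sadness"
    return "neutral"  # fallback label, not listed in the specification


def minutes_with_emotion(utterances: List[Utterance],
                         classify: Callable[[Utterance], str] = toy_emotion
                         ) -> List[str]:
    """Render one minutes line per utterance, tagged with its emotion."""
    return [f"{u.speaker} ({classify(u)}): {u.translated_text}"
            for u in utterances]
```

Passing the classifier as a parameter keeps the minutes formatting independent of how emotions are estimated, so the placeholder can be swapped for a real model without changing the rest of the flow.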

[0591] (Example 2)

[0592] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0593] In modern international conferences and multilingual settings, smooth communication is often hindered by participants having different languages and cultural backgrounds. Furthermore, while there is a need for information sharing that takes into account the emotions of participants during meetings, there is currently a lack of effective methods to achieve this.

[0594] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0595] In this invention, the server includes a device for acquiring audio data, a device for converting the acquired audio data into text, a device for translating the text into different languages, a device for analyzing the characteristics of the audio data and identifying emotions, a device for creating a document using the translated text and identified emotion data, and a device for outputting the generated translated text and document. This makes it possible to translate audio during a meeting into multiple languages in real time and to generate and share meeting minutes that take emotions into account.

[0596] A "device for acquiring audio data" is a device used to capture participants' speech in real time during meetings or discussions, and it collects audio information using high-sensitivity microphones or similar equipment.

[0597] A "device for converting acquired audio data into text" is a device that analyzes collected audio data and converts it into corresponding text data using a language model.

[0598] A "device for translating into different languages" is a device that converts text data into a specified different language, and it provides high translation accuracy by applying natural language processing technology.

[0599] A "device that analyzes the characteristics of voice data to identify emotions" is a device that analyzes the tone and speed of acquired voice data, the content of text, etc., to determine the emotional state of the speaker.

[0600] A "device for creating documents using translated text and identified sentiment data" is a device for automatically generating documented information records based on translated text data and sentiment data.

[0601] A "device for outputting generated translated text and documents" is a device for distributing generated text and documents in real time and providing them in a state where relevant parties can access them.

[0602] This system enables smooth communication among participants in international conferences and multilingual environments by recognizing, translating, and analyzing speech in real time. The system includes the following devices and means:

[0603] The terminal uses a high-sensitivity microphone to capture participants' speech in real time during meetings and discussions. This audio data is then passed on to the next processing stage.

[0604] The server analyzes the audio data transmitted from the terminal using a speech recognition engine and converts it into text. This process ensures high accuracy by employing noise cancellation technology and advanced speech recognition models.

[0605] The server translates the acquired text data into different languages through natural language processing. This provides appropriate translations for participants with different cultural and linguistic backgrounds.

[0606] The server further identifies the speaker's emotional state based on characteristics of the audio data, such as tone and speed, and the content of the text. Major emotions such as joy, anger, sadness, and happiness are determined in real time during this process.

[0607] Based on the translated text and sentiment data, the server automatically creates meeting minutes. These minutes reflect not only the content of the utterances but also the emotional characteristics of the situation.

[0608] The server delivers the generated meeting minutes and translated text to the user's terminal in real time. Through this, the user can instantly grasp the content and emotional elements of the meeting.

[0609] As a concrete example, consider a meeting at a multinational corporation. Participants use English, Chinese, and Japanese, and each statement is recognized and translated in real time by the system. The emotional elements of each utterance are also detected and reflected in the meeting minutes, ensuring that all parties involved share the same understanding and facilitating efficient communication.

[0610] Another example of a prompt for the generative AI model is an instruction such as, "Analyze the meeting audio data in real time and create meeting minutes that reflect the sentiment of the statements along with the translation." This prompt causes the system to process speech recognition, translation, and sentiment analysis in an integrated manner.

[0611] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0612] Step 1:

[0613] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants in real time. The input is the participants' voices. This audio data is digitized to create initial data for the next processing step.

[0614] Step 2:

[0615] The server receives digital audio data transmitted from the terminal and converts this data into text using a speech recognition engine. It receives digital audio data as input and applies noise cancellation technology and a speech recognition model to output the spoken content as text data.

[0616] Step 3:

[0617] The server receives text data obtained through speech recognition and translates it into a specific language using natural language processing techniques. The input for this step is the recognized text, and it outputs a highly accurate translation as text data, taking into account cultural nuances between languages.

[0618] Step 4:

[0619] The server references the initial audio data and identifies the speaker's emotions in real time based on the tone and speed of speech, as well as the content of the translated text. The input includes the initial audio data and translated text, and emotion recognition technology is used to output emotion information corresponding to each utterance.

[0620] Step 5:

[0621] The server uses the translated text data and the identified sentiment information to create meeting minutes that reflect the progress of the meeting. The inputs are the translated text and the sentiment information. The output is document data that reflects each statement and the accompanying changes in sentiment.

[0622] Step 6:

[0623] The server delivers the generated meeting minutes and sentiment information to each user's terminal. The input for this step is the document data created in the previous step, and the server sends the data so that users receive, in real time, meeting minutes that reflect the translations and sentiments.
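
One way to make changes in sentiment visible in the minutes of Step 5 is to compare each utterance with the previous one by the same speaker. The following Python sketch does this. The function name `annotate_emotion_changes`, the tuple layout, and the arrow notation are assumptions for illustration, and the emotion labels are taken as already produced by the emotion recognition step.

```python
from typing import Dict, List, Tuple


def annotate_emotion_changes(entries: List[Tuple[str, str, str]]) -> List[str]:
    """Render minutes lines, marking a change from a speaker's previous emotion.

    Each entry is (speaker, translated_text, emotion), in utterance order.
    """
    last_emotion: Dict[str, str] = {}
    lines: List[str] = []
    for speaker, text, emotion in entries:
        previous = last_emotion.get(speaker)
        if previous is not None and previous != emotion:
            tag = f"{previous} -> {emotion}"  # the speaker's emotion changed
        else:
            tag = emotion
        lines.append(f"{speaker} [{tag}]: {text}")
        last_emotion[speaker] = emotion
    return lines
```

Tracking the previous emotion per speaker, rather than across the whole meeting, avoids reporting a change merely because a different participant started talking.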

[0624] (Application Example 2)

[0625] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0626] In real-time audio content such as meetings, seminars, and lectures, there is a need to provide an environment where participants who speak different languages can smoothly understand information and grasp the speaker's emotional changes. However, conventional systems have faced challenges such as misunderstandings and misinterpretations due to differences in language and emotions, hindering smooth communication.

[0627] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0628] In this invention, the server includes: a voice acquisition means for acquiring voice data; a speech recognition means for converting the acquired voice data into text; a translation means for translating the text into another language; an emotion recognition means for identifying emotions from the voice data and the text; a minutes generation means for creating minutes that reflect the emotions; and an information distribution means for outputting the generated translated text and the minutes with added emotions. This allows participants to understand the accurately translated content in real time and to grasp the changes in the speaker's emotions.

[0629] "Voice acquisition means" refers to a device or process that captures participants' speech with high sensitivity in settings such as meetings and seminars, and collects audio data necessary for subsequent processing in real time.

[0630] "Speech recognition means" refers to a technology or device that analyzes acquired speech data and converts it into text information using noise cancellation technology or an advanced speech model.

[0631] "Translation means" refers to a process or system that converts text obtained through speech recognition into another specified language, thereby enabling smooth communication between users of different languages.

[0632] "Emotion recognition means" refers to a technology or system that analyzes voice and text data to determine the speaker's emotions from the tone and speed of the speech and the content of the utterance, and identifies emotions such as joy, anger, sadness, and pleasure in real time.

[0633] "Minutes generation means" refers to a device or system for automatically creating meeting minutes that reflect the progress of a meeting or seminar, based on the translated text and the information obtained from emotion recognition.

[0634] "Information distribution means" refers to a technology or process that delivers generated translated text and sentiment-added meeting minutes to the user's device, thereby sharing information in real time.

[0635] This invention is a system that translates audio content, such as meetings, into multiple languages in real time and generates and distributes meeting minutes with added emotion.

[0636] First, the device uses a high-sensitivity microphone to capture participants' speech during the meeting and transmits it to the server as audio data. Smartphones or dedicated microphones are used for audio acquisition.

[0637] Next, the server uses the Google Speech-to-Text API to convert the acquired audio data into text. Noise cancellation technology is applied during this process to improve the accuracy of the text conversion.

[0638] The server then uses the Google Translate API to translate the converted text into the specified language. The translation language varies depending on the user's settings, but it utilizes natural language processing technology to provide highly accurate translations that take cultural context into account.

[0639] The server further uses IBM Watson's sentiment analysis API to recognize the speaker's emotions in real time from speech and text. Based on tone, speed, and context, it identifies emotions such as joy, anger, sadness, and pleasure, and records that information.

[0640] Based on this, the server automatically generates meeting minutes that reflect the translated text and sentiment information. The minutes include detailed information on the content of each statement, along with fluctuations in sentiment.

[0641] Finally, the server delivers the generated meeting minutes and sentiment information to the user's terminal via an information distribution system. This allows the user to access meeting minutes in real time, with multilingual support and including sentiment information.

[0642] As a concrete example, in an international conference, the English-speaking presenter's remarks are instantly translated into Japanese, and Japanese-speaking participants can view the projected minutes in real time. Furthermore, changes in the presenter's emotional tone while speaking (for example, excitement on a particular slide) are also displayed. An example of a prompt to input into the generative AI model would be, "Translate the real-time conference audio and generate a textual transcript with added emotion."

[0643] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0644] Step 1:

[0645] The terminal captures speech in the meeting using a high-sensitivity microphone. The input is an audio signal, which the terminal digitizes in real time. The output is digital audio data, which is sent to the server.

[0646] Step 2:

[0647] The server converts the received audio data into text using the Google Speech-to-Text API. The input is the digital audio data generated in step 1, and high-precision text information is obtained by applying noise cancellation technology and performing audio signal analysis. The output is text data.

[0648] Step 3:

[0649] The server translates text data into the specified language using the Google Translate API. The input is the text data generated in step 2, and cross-cultural language conversion is performed using natural language processing techniques. The output is the translated text data.

[0650] Step 4:

[0651] The server uses IBM Watson's sentiment analysis API to identify emotions from audio and text data. The input consists of audio data from Step 1 and text data from Step 2. The server analyzes the tone and speed of the audio, the content of the text, etc., to estimate the emotion. The output is the identified emotion data.

[0652] Step 5:

[0653] The server generates meeting minutes based on the translated text and sentiment data. The input is the translated text data from step 3 and the sentiment data from step 4, and the output is meeting minutes that reflect the progress of the meeting in detail, including the content of the statements and the fluctuations in emotions.

[0654] Step 6:

[0655] The server distributes the generated meeting minutes and sentiment information to the user's terminal using the information distribution means. The input is the meeting minutes data generated in step 5, which is delivered to the terminal over the network in real time.

[0656] This process allows users to accurately understand the content of a meeting based on translated information and emotional information.
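
The six steps above can be sketched as a single function for illustration. The following is a minimal Python outline under stated assumptions: the names `MinutesEntry` and `process_utterance`, and the three injected callables `recognize`, `translate`, and `detect_emotion`, are hypothetical stand-ins for the speech recognition, translation, and emotion recognition services named in the text. No vendor API is called, and the `fake_*` functions exist only so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MinutesEntry:
    """One line of the minutes: what was said, its translation, and the emotion."""
    original_text: str
    translated_text: str
    emotion: str


def process_utterance(
    audio: bytes,
    target_language: str,
    recognize: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    detect_emotion: Callable[[bytes, str], str],
) -> MinutesEntry:
    """Run steps 2 to 5 for one utterance; the stage functions are injected."""
    text = recognize(audio)                         # step 2: speech -> text
    translated = translate(text, target_language)   # step 3: text -> target language
    emotion = detect_emotion(audio, text)           # step 4: audio + text -> emotion
    return MinutesEntry(text, translated, emotion)  # step 5: one minutes entry


# Hypothetical stand-in stages (assumptions, not real services).
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, target_language: str) -> str:
    return f"[{target_language}] {text}"


def fake_detect_emotion(audio: bytes, text: str) -> str:
    return "joy" if text.endswith("!") else "neutral"
```

Passing the stages in as parameters keeps the flow independent of any particular service, so a real recognizer or translator could be substituted without changing the order of the steps.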

[0657] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0658] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0659] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0660] [Fourth Embodiment]

[0661] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0662] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0663] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0664] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0665] The microphone 238 receives instructions from the user 20 by receiving the voice uttered by the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0666] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0667] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0668] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0669] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0670] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0671] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0672] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0673] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0674] The present invention is a system that processes meeting audio data in real time, translates it into multiple languages, generates meeting minutes, and outputs them. This system consists of several main components.

[0675] Voice acquisition means

[0676] The device is equipped with a microphone to capture participants' speech in real time. This makes it possible to capture all audio information during the meeting.

[0677] Speech recognition means

[0678] The server receives audio data sent from the terminal and converts it to text using a speech recognition engine. This conversion is performed in real time, and machine learning algorithms are used to improve the accuracy of the speech-to-text conversion.

[0679] Translation means

[0680] The server translates the text obtained through speech recognition into the specified language. It uses machine translation technology to translate quickly into English, Chinese, and other languages.

[0681] Minutes generation means

[0682] The server organizes the translated text and generates meeting minutes that reflect the content of the meeting. This process organizes the text order and compiles it into minutes, highlighting important information.

[0683] Information distribution means

[0684] The server distributes the translated text and meeting minutes it generates to the users. Users can access the translated content and meeting minutes in real time via their devices.

[0685] As a concrete example, in a global conference, if a participant speaks in Japanese, the participant's terminal captures the audio, and the server converts it into Japanese text through speech recognition. The server then translates the text into the specified languages, such as English and Chinese, and generates meeting minutes in a format suited to the progress of the conference. These translations and minutes are delivered to the users' terminals, allowing all participants to share information in real time, regardless of language.

[0686] This system supports meetings in multilingual environments, providing efficient communication and accurate record-keeping.

[0687] The following describes the processing flow.

[0688] Step 1:

[0689] The user initiates the meeting. This puts the device into voice acquisition mode and activates the microphone.

[0690] Step 2:

[0691] The device captures the participants' speech and simultaneously divides the audio data into chunks of a certain length.

[0692] Step 3:

[0693] The terminal sends the segmented audio data to the server. This transmission is optimized to minimize network latency.

[0694] Step 4:

[0695] The server processes the received audio data using a speech recognition engine and converts it into text data. Machine learning algorithms are used to improve recognition accuracy.

[0696] Step 5:

[0697] The server inputs the converted text into a machine translation system, which then translates it into the specified language. The translation results are provided in real time.

[0698] Step 6:

[0699] The server organizes the translated text chronologically and generates meeting minutes based on the content of the meeting.

[0700] Step 7:

[0701] The server sends the generated translated text and meeting minutes to the user's device. The user can then view this information in real time on their device.

[0702] Step 8:

[0703] After the meeting ends, the user instructs the server to stop audio acquisition and processing. The server and terminal then save the final audio data and meeting minutes.
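
Step 2 above divides the captured audio into chunks of a certain length. The following is a minimal sketch of that division, assuming 16-bit mono PCM held in a `bytes` object. The function name `split_into_chunks`, the 16 kHz sample rate, and the 500 ms chunk length are illustrative assumptions and are not values given in this disclosure.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM


def split_into_chunks(pcm: bytes, sample_rate: int = 16_000,
                      chunk_ms: int = 500) -> Iterator[bytes]:
    """Yield consecutive fixed-length chunks; the last chunk may be shorter."""
    if sample_rate <= 0 or chunk_ms <= 0:
        raise ValueError("sample_rate and chunk_ms must be positive")
    # Samples per chunk, then converted to bytes.
    chunk_bytes = (sample_rate * chunk_ms // 1000) * BYTES_PER_SAMPLE
    if chunk_bytes == 0:
        raise ValueError("chunk length is shorter than one sample")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With the default values, each chunk holds 8,000 samples (16,000 bytes). A shorter chunk lowers the delay before recognition can start, at the cost of more transmissions in step 3.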

[0704] (Example 1)

[0705] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0706] In meetings and discussions, there is a need for systems that can efficiently and accurately share information in a multilingual environment. However, current technology makes it difficult to translate audio into multiple languages in real time and share it immediately with all participants. In particular, it is important to process audio data quickly while maintaining accuracy and minimizing mistranslations. Furthermore, accurately distributing and saving the generated meeting minutes is essential in today's business environment.

[0707] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0708] In this invention, the server includes an input means for acquiring sound information, a conversion means for converting the acquired sound information into text data, and a conversion means for converting the text data into different languages. This enables real-time multilingual translation and efficient and accurate information sharing.

[0709] "Sound information" refers to the voice data spoken by participants during meetings and discussions, which is processed as a digital signal.

[0710] "Input means" refers to devices or systems used to capture sound information in real time, and typically includes hardware such as microphones.

[0711] "Conversion means" refers to the process or technology for converting sound information into text data, and specifically includes speech recognition engines and related software.

[0712] "Character data" refers to text-formatted data generated from sound information, which is stored as a digital string that can be processed by a computer.

[0713] "Different languages" refers to the languages to which the text data is converted, meaning that it is translated into a language different from the original sound.

[0714] "Generation means" refers to processes and technologies for creating information records using converted character data, and includes natural language processing technologies.

[0715] "Transmission means" refers to the technologies and methods for transmitting generated converted character data and information records to other terminals or systems, and includes data compression and optimization.

[0716] This invention is a system that enables real-time multilingual translation and information sharing in meetings and discussions. The system aims to efficiently acquire audio information, process it quickly, and deliver it to participants immediately.

[0717] Acquisition of sound information

[0718] The terminal uses a high-performance microphone to capture participants' speech in real time. This audio information is converted into a digital signal, which is then processed to remove background noise. The device may utilize a microphone equipped with noise-canceling technology, for example.
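
The disclosure does not specify how background noise is removed. As one very simple illustration, a noise gate is swapped in below: it silences samples whose amplitude falls below a threshold. The function name `noise_gate` and the threshold value are assumptions made for this sketch, and a practical terminal would likely use a more elaborate noise reduction algorithm.

```python
from typing import List


def noise_gate(samples: List[int], threshold: int = 500) -> List[int]:
    """Replace samples whose absolute amplitude is below the threshold with 0."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return [s if abs(s) >= threshold else 0 for s in samples]
```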

[0719] Data transformation and translation

[0720] The server receives audio information transmitted from the terminal and converts it into text data in real time using a speech recognition engine (for example, a speech conversion engine using the latest machine learning algorithms). This text data is then translated into a specified language using neural machine translation technology. This process includes, for example, translation from English to Chinese.

[0721] Generation and distribution of converted data

[0722] The server uses the translated text data to generate meeting minutes. These minutes are organized logically, with key points highlighted. The server then delivers this information to users in real time. Users can access and view this information via their devices.
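
The highlighting of key points mentioned above could take many forms. The following is a minimal sketch that marks given keywords in a line of the minutes. The function name `highlight_keywords`, the `*` marker, and the idea that keywords are supplied by the caller are all assumptions; the sketch also relies on word boundaries, so it suits whitespace-delimited languages such as English rather than Chinese or Japanese.

```python
import re
from typing import Iterable


def highlight_keywords(text: str, keywords: Iterable[str], marker: str = "*") -> str:
    """Wrap each whole-word keyword occurrence in the marker, ignoring case."""
    # Longer keywords first so that they win over shorter overlapping ones.
    words = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not words:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)
```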

[0723] Specific examples and prompt statements

[0724] As a concrete example, in a global meeting of a multinational corporation, if a participant speaks in Japanese, the terminal captures the audio data, which is then converted into text data on the server. The text is then translated into English and Chinese in real time, and the resulting meeting minutes are immediately delivered to the users. An example of an input prompt for the generative AI model of this system would be an instruction such as, "Translate the Japanese meeting audio into English and Chinese in real time and generate meeting minutes."

[0725] This system allows participants who speak different languages to collaborate and share information instantly without disrupting the flow of the meeting.

[0726] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0727] Step 1:

[0728] Acquisition of sound information

[0729] The terminal acquires audio information spoken by meeting participants in real time via a microphone. This input audio information is then sent to the server after background noise is removed using digital signal processing technology. Specifically, audio filtering is performed within the terminal, and a noise reduction algorithm is applied.

[0730] Step 2:

[0731] Audio data conversion

[0732] The server converts the denoised audio information received from the terminal into text data using a speech recognition engine. During this process, the server decomposes the audio information into segments, performs analysis using an acoustic model, and then converts it into text. The resulting output is text data that reflects the content of the meeting.

[0733] Step 3:

[0734] Translation of text data

[0735] The server translates the text data generated by speech recognition into different languages. This process uses neural machine translation technology: the text data is input into a translation model and converted into the specified language. The output is translated text data in one or more languages. Specifically, the server consults English and Chinese translation dictionaries and revises the translation where appropriate.

[0736] Step 4:

[0737] Minutes generation

[0738] The server generates meeting minutes based on the translated text data. During this generation process, the server organizes the text logically, extracts important information, and highlights it. The resulting meeting minutes are formatted for easy viewing. Specific actions include extracting key points and highlighting keywords.

[0739] Step 5:

[0740] Information distribution

[0741] The server distributes the generated meeting minutes and translated text data to all participants' terminals. Users receive them in real time via their terminals and can refer to the meeting content in multiple languages. Data compression technology is applied during distribution, and transmission is performed in a manner suited to the available network bandwidth. Specifically, the server generates data packets and transmits them while avoiding network congestion.
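
The data compression applied during distribution in step 5 can be illustrated as follows. This is a minimal sketch that assumes the minutes are represented as a JSON-serializable dictionary and compressed with the standard `zlib` module; the function names `pack_minutes` and `unpack_minutes` and the choice of JSON are assumptions, since the disclosure names no particular format or compression scheme.

```python
import json
import zlib
from typing import Any, Dict


def pack_minutes(minutes: Dict[str, Any]) -> bytes:
    """Serialize the minutes to UTF-8 JSON and compress them for transmission."""
    raw = json.dumps(minutes, ensure_ascii=False).encode("utf-8")
    return zlib.compress(raw, 6)  # 6 is a middle-of-the-road compression level


def unpack_minutes(payload: bytes) -> Dict[str, Any]:
    """Reverse pack_minutes on the receiving terminal."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

Because minutes tend to repeat field names and speaker names, even general-purpose compression of this kind usually reduces the payload noticeably.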

[0742] (Application Example 1)

[0743] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0744] In multilingual households and groups, facilitating smooth communication among members who speak different languages is often challenging. Furthermore, there is a need for efficient systems for real-time translation and meeting minute creation.

[0745] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0746] In this invention, the server includes: a voice data acquisition means for acquiring audio information; a speech recognition means for converting the acquired audio information into text data; a translation means for translating the converted text data into another language; a document generation means for creating a record document using the translated text data; an information distribution means for outputting the generated translated text data and record document; and a multilingual communication support means for supporting communication between users who speak different languages by analyzing and translating audio in multiple languages. This enables immediate and accurate communication between members who speak different languages.

[0747] "Audio information" refers to acquired audio data that is used for speech recognition and translation.

[0748] "Voice data acquisition means" refers to technical means used to acquire voice information, and includes devices such as microphones and sensors.

[0749] "Speech recognition means" refers to technology that converts acquired speech information into text data, and includes the process of analyzing a large amount of speech data to represent it as text.

[0750] "Translation means" refers to technology used to convert text data obtained through speech recognition into another required language.

[0751] "Document generation means" refers to a function for creating a record document based on translated text data.

[0752] "Information distribution means" refers to technology that outputs generated translated text data and recorded documents to terminals or other devices.

[0753] "Multilingual communication support means" refers to technology that analyzes and translates the speech of users who speak different languages to facilitate smooth communication.

[0754] This invention is a multilingual translation system aimed at facilitating communication between users who speak different languages. This system is designed to be installed in household assistant robots and other devices, and to function effectively in daily life.

[0755] The server acquires voice information through the robot's built-in microphone. The voice data acquisition means captures the voice information in real time. The acquired voice information is converted into text data using the Google Cloud Speech-to-Text API running on the server. This speech recognition means enables highly accurate text conversion.

[0756] Next, the text data is translated into the user's specified language using the Microsoft Azure Translator API. This process performs the necessary language conversion for communication between users with different native languages.

[0757] Furthermore, the server uses document generation means to create a record document based on the translated text data. This record document is used to accurately record important conversations and events, especially within the home, and to facilitate later review. The generated record document and translated text data are delivered to the user's terminal or robot display using information distribution means.

[0758] The multilingual communication support means facilitates instant and accurate dialogue between users who speak different languages. For example, it enables smooth conversations between hosts and guests at home or at international events.

[0759] A concrete example is a dinner event at home. For instance, a robot could translate speech and provide real-time support so that an English-speaking guest and a Japanese-speaking host can understand each other's languages.

[0760] An example of a prompt sentence provided to the generative AI model would be: "To streamline multilingual communication at home, please use the robot to translate the following English conversation into Japanese."

[0761] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0762] Step 1:

[0763] The terminal acquires audio information. Using the robot's built-in microphone, it captures ambient sound in real time. The acquired audio information is sent to the server as a digital signal.

[0764] Step 2:

[0765] The server performs speech recognition processing. The input digital speech signal is converted into text data using the Google Cloud Speech-to-Text API. The features of the speech signal are analyzed and converted into text format based on the language grammar and context. The output of this step is text data.

[0766] Step 3:

[0767] The server performs the translation process. Using the resulting text data as input, it utilizes the Microsoft Azure Translator API to translate it into the user-specified target language. This process is designed to translate the text data contextually, resulting in more natural language expression. The output obtained in this step is the translated text data.

[0768] Step 4:

[0769] The server generates the document. Translated text data is used as input, formatted into a standard document type, and processed to create a conversation record. The document generation process ensures that important information is clearly organized.

[0770] Step 5:

[0771] The server distributes information. It delivers the generated translated text data and recorded documents to the terminal. Here, the information is presented visually or audibly to the user's robot or display, allowing the user to confirm it. Simultaneously, multilingual communication support is used to facilitate real-time interaction and support communication between users.
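
The document generation in step 4 can be illustrated with a plain-text formatter. This is a minimal sketch only: the `Turn` record, the function name `format_record`, and the layout of the record are assumptions, since the disclosure says only that the translated text is formatted into a standard document type.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One utterance in the conversation, before and after translation."""
    speaker: str
    original: str
    translated: str


def format_record(title: str, turns: List[Turn]) -> str:
    """Render the turns as a numbered plain-text conversation record."""
    lines = [title, "=" * len(title)]
    for number, turn in enumerate(turns, start=1):
        lines.append(f"{number}. {turn.speaker}: {turn.translated}")
        lines.append(f"   (original: {turn.original})")
    return "\n".join(lines)
```

Keeping the original utterance next to its translation lets a reader check the record later, which matches the stated purpose of reviewing important conversations.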

[0772] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0773] This invention relates to a system that processes meeting audio data in real time, combines it with an emotion recognition engine to identify emotions from user statements, translates them into multiple languages, generates meeting minutes, and outputs them. This system consists of an audio acquisition means, an audio recognition means, a translation means, a meeting minutes generation means, an information distribution means, and an emotion recognition means.

[0774] Voice acquisition means

[0775] The device uses a high-sensitivity microphone to capture the speech of meeting participants. This allows for real-time acquisition of audio data, providing the information necessary for subsequent processing.

[0776] Speech recognition means

[0777] The server converts the audio data transferred from the terminal into text using a speech recognition engine. This process utilizes noise cancellation technology and high-precision speech models to maximize the accuracy of the text conversion.

[0778] Translation means

[0779] The server translates the text obtained through speech recognition into a specific language. It utilizes natural language processing techniques to improve translation accuracy between languages with different cultural backgrounds.

[0780] Emotion recognition means

[0781] The server analyzes the tone, speed, and text content of the voice data to determine the user's emotions in real time. It identifies emotions such as joy, anger, sadness, and pleasure, and records that information.

[0782] Minutes generation means

[0783] The server automatically generates meeting minutes that reflect the flow of the meeting, based on the translated text and the information obtained from emotion recognition. The minutes include not only the content of each statement but also the emotional fluctuations associated with that statement.

[0784] Information distribution means

[0785] The server delivers the generated meeting minutes and sentiment information to the user's terminal. Users can view the meeting minutes, which are translated in real time and reflect sentiment.

[0786] As a concrete example, in a meeting of a multinational corporation, participants speak English, Chinese, and Japanese. Using this system, all statements made during the meeting are translated in real time, and the speaker's emotions are reflected in the minutes. This information improves the quality of the meeting, transcending language and cultural barriers.

[0787] This system eliminates language barriers in international conferences, facilitates smooth communication, and enables information recording that also takes into account the emotions of the participants.

[0788] The following describes the processing flow.

[0789] Step 1:

[0790] The system becomes active when the user initiates the meeting. This puts the device into audio capture mode, and it begins recording participants' speech in real time.

[0791] Step 2:

[0792] The terminal divides the audio data into specific timeframes and prepares it for efficient processing. The process of sending this data to the server then begins.

[0793] Step 3:

[0794] The server receives the audio data, feeds it into the speech recognition engine, and converts it into text data. The goal is to accurately transcribe even complex languages using a high-precision model.

[0795] Step 4:

[0796] The server passes the converted text to the translation system, which translates it into the specified target language. The translation results are generated quickly and support multilingual environments.

[0797] Step 5:

[0798] The server extracts voice features from the voice data and identifies the user's emotions through an emotion recognition engine. The identified emotions are classified into major categories such as joy, anger, sadness, and pleasure.

[0799] Step 6:

[0800] The server integrates the translated text and sentiment information to generate meeting minutes. These minutes include the content of each statement, along with the emotional state at the time of the statement.

[0801] Step 7:

[0802] The server sends the generated meeting minutes to the user's terminal in real time. Users can conduct the meeting while checking the translated content of statements and sentiment information obtained during the meeting on their terminal.

[0803] Step 8:

[0804] After the meeting ends, the user instructs the system to shut down. The server saves all meeting minutes and sentiment data and archives them for later reference or analysis.
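
Step 2 of the flow above has the terminal divide the audio data into specific timeframes before sending it to the server. The following is a minimal sketch of that division, assuming uncompressed mono PCM audio with 16-bit samples. The function name `split_into_frames` and its parameters are hypothetical and do not appear in the disclosure, which does not specify an audio format or frame length.

```python
from typing import Iterator


def split_into_frames(pcm: bytes, sample_rate: int, frame_ms: int,
                      bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames of mono PCM audio.

    Assumption: the audio is raw PCM; the patent text does not say so.
    """
    if sample_rate <= 0 or frame_ms <= 0:
        raise ValueError("sample_rate and frame_ms must be positive")
    # Number of bytes that cover frame_ms milliseconds of audio.
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes == 0:
        raise ValueError("frame is too short for this sample rate")
    for start in range(0, len(pcm), frame_bytes):
        # The final frame may be shorter than frame_bytes.
        yield pcm[start:start + frame_bytes]


# One second of silence plus a short tail, cut into 500 ms frames.
frames = list(split_into_frames(bytes(32100), sample_rate=16000, frame_ms=500))
```

Each yielded frame could then be transmitted to the server on its own, which keeps the delay before recognition starts bounded by the frame length.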

[0805] (Example 2)

[0806] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0807] In modern international conferences and multilingual settings, smooth communication is often hindered by participants having different languages and cultural backgrounds. Furthermore, while there is a need for information sharing that takes into account the emotions of participants during meetings, there is currently a lack of effective methods to achieve this.

[0808] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0809] In this invention, the server includes a device for acquiring audio data, a device for converting the acquired audio data into text, a device for translating the text into different languages, a device for analyzing the characteristics of the audio data and identifying emotions, a device for creating a document using the translated text and identified emotion data, and a device for outputting the generated translated text and document. This makes it possible to translate audio during a meeting into multiple languages in real time and to generate and share meeting minutes that take emotions into account.

[0810] A "device for acquiring audio data" is a device used to capture participants' speech in real time during meetings or discussions, and it collects audio information using high-sensitivity microphones or similar equipment.

[0811] A "device for converting acquired audio data into text" is a device that analyzes collected audio data and converts it into corresponding text data using a language model.

[0812] A "device for translating into different languages" is a device that converts text data into a specified different language, and it provides high translation accuracy by applying natural language processing technology.

[0813] A "device that analyzes the characteristics of voice data to identify emotions" is a device that analyzes the tone and speed of acquired voice data, the content of text, etc., to determine the emotional state of the speaker.

[0814] A "device for creating documents using translated text and identified sentiment data" is a device for automatically generating documented information records based on translated text data and sentiment data.

[0815] A "device for outputting generated translated text and documents" is a device for distributing generated text and documents in real time and providing them in a state where relevant parties can access them.

[0816] This system enables smooth communication among participants in international conferences and multilingual environments by recognizing, translating, and analyzing speech in real time. The system includes the following devices and means:

[0817] The device uses a high-sensitivity microphone to capture participants' speech in real time during meetings and discussions. This audio data is then passed on to the next processing stage.

[0818] The server analyzes the audio data transmitted from the terminal using a speech recognition engine and converts it into text. This process ensures high accuracy by employing noise cancellation technology and advanced speech recognition models.

[0819] The server translates the acquired text data into different languages through natural language processing. This provides appropriate translations for participants with different cultural and linguistic backgrounds.

[0820] The server further identifies the speaker's emotional state based on characteristics of the audio data, such as tone and speed, and the content of the text. Major emotions such as joy, anger, sadness, and pleasure are determined in real time during this process.

[0821] Based on the translated text and sentiment data, the server automatically creates meeting minutes. These minutes reflect not only the content of the utterances but also the emotional characteristics of the situation.

[0822] The server delivers the generated meeting minutes and translated text to the user's terminal in real time. Through this, the user can instantly grasp the content and emotional elements of the meeting.

[0823] As a concrete example, consider a meeting at a multinational corporation. Participants use English, Chinese, and Japanese, and each statement is recognized and translated in real time by the system. The emotional elements of each utterance are also detected and reflected in the meeting minutes, ensuring that all parties involved share the same understanding and facilitating efficient communication.

[0824] Another example of a prompt for the generative AI model is an instruction such as, "Analyze the meeting audio data in real time and create meeting minutes that reflect the sentiment of the statements along with the translation." This prompt causes the system to process speech recognition, translation, and sentiment analysis in an integrated manner.

[0825] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0826] Step 1:

[0827] The terminal uses a high-sensitivity microphone to capture the speech of meeting participants in real time. The input is the participants' voices. This audio data is digitized to create initial data for the next processing step.

[0828] Step 2:

[0829] The server receives digital audio data transmitted from the terminal and converts this data into text using a speech recognition engine. It receives digital audio data as input and applies noise cancellation technology and a speech recognition model to output the spoken content as text data.

[0830] Step 3:

[0831] The server receives text data obtained through speech recognition and translates it into a specific language using natural language processing techniques. The input for this step is the recognized text, and it outputs a highly accurate translation as text data, taking into account cultural nuances between languages.

[0832] Step 4:

[0833] The server references the initial audio data and identifies the speaker's emotions in real time based on the tone and speed of speech, as well as the content of the translated text. The input includes the initial audio data and translated text, and emotion recognition technology is used to output emotion information corresponding to each utterance.

[0834] Step 5:

[0835] The server uses the translated text data and the identified sentiment information to create meeting minutes that reflect the progress of the meeting. The inputs are the translated text and the sentiment information. The output is document data that records each statement together with the accompanying changes in sentiment.

[0836] Step 6:

[0837] The server delivers the generated meeting minutes and sentiment information to each user's terminal. The input for this step is the document data created in the previous step, and the server sends this data so that users receive, in real time, meeting minutes that reflect the translations and sentiments.
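
Step 5 above produces document data that reflects each statement and the changes in sentiment that accompany it. One possible way to represent and render such minutes is sketched here. The `Utterance` record, the `render_minutes` function, and the line layout are all assumptions made for illustration; the disclosure does not define a document format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    speaker: str          # hypothetical speaker label
    translated_text: str  # output of the translation step
    emotion: str          # output of the emotion identification step


def render_minutes(utterances: List[Utterance]) -> str:
    """Render one line per utterance and flag per-speaker emotion changes."""
    lines = []
    previous: Dict[str, str] = {}  # last emotion seen for each speaker
    for u in utterances:
        line = f"{u.speaker} [{u.emotion}]: {u.translated_text}"
        before = previous.get(u.speaker)
        if before is not None and before != u.emotion:
            line += f" (emotion changed: {before} -> {u.emotion})"
        previous[u.speaker] = u.emotion
        lines.append(line)
    return "\n".join(lines)


minutes = render_minutes([
    Utterance("A", "The schedule looks good.", "joy"),
    Utterance("B", "I have a concern about cost.", "sadness"),
    Utterance("A", "That was not agreed.", "anger"),
])
```

Tracking the previous emotion per speaker, rather than per meeting, is a design choice made here so that a change is only reported when the same person's state shifts.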

[0838] (Application Example 2)

[0839] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0840] In real-time audio content such as meetings, seminars, and lectures, there is a need to provide an environment where participants who speak different languages can smoothly understand information and grasp the speaker's emotional changes. However, conventional systems have faced challenges such as misunderstandings and misinterpretations due to differences in language and emotions, hindering smooth communication.

[0841] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0842] In this invention, the server includes: a voice acquisition means for acquiring voice data; a voice recognition means for converting the acquired voice data into text; a translation means for translating the text into another language; an emotion recognition means for identifying emotions from the translated text; a minutes generation means for creating minutes that reflect emotions; and an information distribution means for outputting the generated translated text and the minutes with added emotions. This allows participants to understand the accurately translated content in real time and to perceive the changes in the speaker's emotions.

[0843] "Voice acquisition means" refers to a device or process that captures participants' speech with high sensitivity in settings such as meetings and seminars, and collects audio data necessary for subsequent processing in real time.

[0844] "Speech recognition means" refers to a technology or device that analyzes acquired speech data and converts it into text information using noise cancellation technology or an advanced speech model.

[0845] "Translation means" refers to a process or system that converts text obtained through speech recognition into another specified language, thereby enabling smooth communication between users of different languages.

[0846] "Emotion recognition means" refers to a technology or system that determines emotions based on the speaker's tone, speed, and content through the analysis of voice and text data, and identifies emotions such as joy, anger, sadness, and pleasure in real time.

[0847] "Minutes generation means" refers to a device or system for automatically creating meeting minutes that reflect the progress of a meeting or seminar, based on the translated text and the information obtained from emotion recognition.

[0848] "Information distribution means" refers to a technology or process that delivers generated translated text and sentiment-added meeting minutes to the user's device, thereby sharing information in real time.

[0849] This invention is a system that translates audio content, such as meetings, into multiple languages in real time and generates and distributes meeting minutes with added emotion.

[0850] First, the device uses a high-sensitivity microphone to capture participants' speech during the meeting and transmits it to the server as audio data. Smartphones or dedicated microphones are used for audio acquisition.

[0851] Next, the server uses the Google Speech-to-Text API to convert the acquired audio data into text. Noise cancellation technology is applied during this process to improve the accuracy of the text conversion.

[0852] The server then uses the Google Translate API to translate the converted text into the specified language. The translation language varies depending on the user's settings, but it utilizes natural language processing technology to provide highly accurate translations that take cultural context into account.

[0853] The server further uses IBM Watson's sentiment analysis API to recognize the speaker's emotions in real time from speech and text. Based on tone, speed, and context, it identifies emotions such as joy, anger, sadness, and pleasure, and records that information.

[0854] Based on this, the server automatically generates meeting minutes that reflect the translated text and sentiment information. The minutes include detailed information on the content of each statement, along with fluctuations in sentiment.

[0855] Finally, the server delivers the generated meeting minutes and sentiment information to the user's terminal via an information distribution system. This allows the user to access meeting minutes in real time, with multilingual support and including sentiment information.

[0856] As a concrete example, in an international conference, the English-speaking presenter's remarks are instantly translated into Japanese, and Japanese-speaking participants can view the projected minutes in real time. Furthermore, changes in the presenter's emotional tone while speaking (for example, when they show excitement on a particular slide) are also displayed. An example of a prompt to input into the generative AI model would be, "Translate the real-time conference audio and generate a textual transcript with added emotion."

[0857] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0858] Step 1:

[0859] The terminal captures meeting speeches as audio data using a high-sensitivity microphone. The input is an audio signal, which is collected in real time by capturing it as digital audio data. The output is digital audio data sent to the server.

[0860] Step 2:

[0861] The server converts the received audio data into text using the Google Speech-to-Text API. The input is the digital audio data generated in step 1, and high-precision text information is obtained by applying noise cancellation technology and performing audio signal analysis. The output is text data.

[0862] Step 3:

[0863] The server translates text data into the specified language using the Google Translate API. The input is the text data generated in step 2, and cross-cultural language conversion is performed using natural language processing techniques. The output is the translated text data.

[0864] Step 4:

[0865] The server uses IBM Watson's sentiment analysis API to identify emotions from audio and text data. The input consists of audio data from Step 1 and text data from Step 2. The server analyzes the tone and speed of the audio, the content of the text, etc., to estimate the emotion. The output is the identified emotion data.

[0866] Step 5:

[0867] The server generates meeting minutes based on the translated text and sentiment data. The input is the translated text data from step 3 and the sentiment data from step 4, and the output is meeting minutes that reflect the progress of the meeting in detail, including the content of the statements and the fluctuations in emotions.

[0868] Step 6:

[0869] The server distributes the generated meeting minutes and sentiment information to the user's terminal using an information distribution method. The input is the meeting minutes data generated in step 5, and network technology is used to output it to the terminal in real time to deliver it to the user.

[0870] This process allows users to accurately understand the content of a meeting based on translated information and emotional information.
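
The data flow of Steps 2 through 6 can be sketched as a small pipeline in which each stage is a replaceable function. This is an illustrative sketch only: the names `process_chunk`, `run_meeting`, and `MinutesEntry` are hypothetical, and the recognition, translation, and emotion stages are passed in as plain callables rather than calls to the Google or IBM services named above, whose client interfaces are not described in this document.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class MinutesEntry:
    original_text: str
    translated_text: str
    emotion: str


def process_chunk(audio: bytes, target_lang: str,
                  recognize: Callable[[bytes], str],
                  translate: Callable[[str, str], str],
                  detect_emotion: Callable[[bytes, str], str]) -> MinutesEntry:
    text = recognize(audio)                     # Step 2: audio -> text
    translated = translate(text, target_lang)   # Step 3: text -> translated text
    # Step 4: inputs are the audio from Step 1 and the text from Step 2.
    emotion = detect_emotion(audio, text)
    return MinutesEntry(text, translated, emotion)


def run_meeting(chunks: Iterable[bytes], target_lang: str,
                recognize: Callable[[bytes], str],
                translate: Callable[[str, str], str],
                detect_emotion: Callable[[bytes, str], str],
                deliver: Callable[[MinutesEntry], None]) -> List[MinutesEntry]:
    minutes: List[MinutesEntry] = []
    for audio in chunks:
        entry = process_chunk(audio, target_lang,
                              recognize, translate, detect_emotion)
        minutes.append(entry)   # Step 5: accumulate the minutes
        deliver(entry)          # Step 6: push each entry to the terminal
    return minutes


# Stand-in stages so the sketch runs without any external service.
delivered: List[MinutesEntry] = []
result = run_meeting(
    [b"hello", b"thanks"], "ja",
    recognize=lambda audio: audio.decode(),
    translate=lambda text, lang: f"[{lang}] {text}",
    detect_emotion=lambda audio, text: "joy",
    deliver=delivered.append,
)
```

Passing the stages in as functions keeps the flow independent of any particular provider, so a different recognition or translation engine could be substituted without changing the pipeline.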

[0871] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0872] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0873] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0874] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0875] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0876] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0877] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible (expressed in actions) it becomes.

[0878] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
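
The layout of the emotion map described above (left half "response," right half "situation," "pleasure" at the top, "displeasure" at the bottom, inner thoughts near the center and actions toward the outside) can be illustrated with a simple coordinate sketch. Everything numeric here is an assumption: the disclosure gives no coordinates, so the unit-circle placement, the `inner_radius` threshold, and the names `MapPosition` and `describe_position` are hypothetical. The sketch also treats the top and bottom as belonging to one half or the other, whereas the text says emotions there arise from both sources.

```python
import math
from dataclasses import dataclass


@dataclass
class MapPosition:
    region: str   # "response" (left half) or "situation" (right half)
    valence: str  # "pleasure" (upper side) or "displeasure" (lower side)
    layer: str    # "inner" (inner thoughts) or "outer" (expressed in actions)


def describe_position(x: float, y: float,
                      inner_radius: float = 0.5) -> MapPosition:
    """Classify a point on a map centered at (0, 0).

    Assumption: x grows to the right and y grows upward.
    """
    region = "situation" if x >= 0 else "response"
    valence = "pleasure" if y >= 0 else "displeasure"
    layer = "inner" if math.hypot(x, y) <= inner_radius else "outer"
    return MapPosition(region, valence, layer)


# A point far to the right and slightly above the center.
position = describe_position(0.9, 0.1)
```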

[0879] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0880] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
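
The paragraph above says that the neural network yields an emotion value for each emotion on the map and that the user's emotion is determined from those values, without stating the selection rule. A minimal sketch follows, assuming the rule is simply to take the emotion with the highest value. The function names, the tolerance parameter, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def dominant_emotion(emotion_values: Dict[str, float]) -> str:
    """Return the emotion with the highest value (assumed selection rule)."""
    if not emotion_values:
        raise ValueError("no emotion values supplied")
    return max(emotion_values, key=lambda name: emotion_values[name])


def emotions_near_top(emotion_values: Dict[str, float],
                      tolerance: float = 0.05) -> List[str]:
    """List emotions whose values lie within `tolerance` of the highest one.

    Emotions that sit close together on the map are trained to have
    similar values, so several of them may appear here together.
    """
    top = max(emotion_values.values())
    return sorted(name for name, value in emotion_values.items()
                  if top - value <= tolerance)


# Illustrative values only; they are not taken from Figure 10.
values = {"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.10}
best = dominant_emotion(values)
close = emotions_near_top(values)
```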

[0881] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0882] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0883] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0884] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0885] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0886] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, a general-purpose processor that functions as a hardware resource performing the specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which are processors with circuit configurations designed specifically to perform the specific processing. Each of these processors has built-in or connected memory and performs the specific processing by using that memory.

[0887] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0888] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0889] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0890] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0891] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0892] The following is further disclosed regarding the embodiments described above.

[0893] (Claim 1)

[0894] A means for acquiring audio data,

[0895] A speech recognition means for converting acquired audio data into text,

[0896] Translation means for translating the aforementioned text into another language,

[0897] A means for generating meeting minutes using translated text,

[0898] Information distribution means for outputting the generated translated text and meeting minutes,

[0899] A system that includes this.

[0900] (Claim 2)

[0901] The system according to claim 1, further comprising data splitting and transmission means for splitting and transferring audio data in real time.

[0902] (Claim 3)

[0903] The system according to claim 1, further comprising data storage means for saving the generated meeting minutes.

[0904] "Example 1"

[0905] (Claim 1)

[0906] An input means for acquiring sound information,

[0907] A conversion means for converting acquired sound information into text data,

[0908] A conversion means for converting the aforementioned character data into a different language,

[0909] A generation means for creating an information record using converted character data,

[0910] A means for transmitting generated converted character data and information records,

[0911] A system that includes this.

[0912] (Claim 2)

[0913] The system according to claim 1, further comprising data processing means for dividing and transmitting sound information in real time.

[0914] (Claim 3)

[0915] The system according to claim 1, further comprising data storage means for storing generated information records.

[0916] "Application Example 1"

[0917] (Claim 1)

[0918] A means for acquiring audio data to obtain audio information,

[0919] A speech recognition means for converting acquired speech information into text data,

[0920] Translation means for translating the aforementioned character data into another language,

[0921] A document generation means for creating a record document using translated text data,

[0922] Information distribution means for outputting generated translated character data and recorded documents,

[0923] A multilingual communication support system that analyzes and translates audio in multiple languages to facilitate communication between users who speak different languages,

[0924] A system that includes this.

[0925] (Claim 2)

[0926] The system according to claim 1, further comprising data splitting transmission means for splitting and transmitting audio information in real time.

[0927] (Claim 3)

[0928] The system according to claim 1, further comprising data storage means for storing generated record documents.

[0929] "Example 2 of combining an emotion engine"

[0930] (Claim 1)

[0931] A device for acquiring audio data,

[0932] A device for converting acquired audio data into text,

[0933] A device for translating the aforementioned text into different languages,

[0934] A device that analyzes the characteristics of voice data to identify emotions,

[0935] A device that creates documents using translated text and identified sentiment data,

[0936] A device that outputs the generated translated text and documents,

[0937] An information processing system comprising the above.

[0938] (Claim 2)

[0939] The information processing system according to claim 1, further comprising means for dividing and transferring audio data in real time.

[0940] (Claim 3)

[0941] The information processing system according to claim 1, further comprising means for saving the generated documents.
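As a rough illustration of a device that analyzes characteristics of audio data to identify emotions and reflects the result in a document, the following Python sketch labels an utterance from its loudness alone. The specification does not state which characteristics or which classifier are used, so the RMS feature, the thresholds, the labels, and the function names here are all placeholder assumptions.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of samples normalized to the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_emotion(samples: Sequence[float],
                     loud_threshold: float = 0.5,
                     quiet_threshold: float = 0.1) -> str:
    """Toy rule: map loudness to a coarse label (thresholds are arbitrary)."""
    level = rms(samples)
    if level >= loud_threshold:
        return "agitated"
    if level <= quiet_threshold:
        return "subdued"
    return "neutral"


def minutes_line(speaker: str, translated_text: str, emotion: str) -> str:
    """Format one document line that carries the identified emotion."""
    return f"{speaker} [{emotion}]: {translated_text}"


label = classify_emotion([0.8, -0.8, 0.8, -0.8])
line = minutes_line("Speaker A", "We must decide today.", label)
```

A practical system would presumably replace the threshold rule with a trained model, while the step that attaches the label to the translated text could stay the same.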

[0942] "Application Example 2 combined with an emotion engine"

[0943] (Claim 1)

[0944] A means for acquiring audio data,

[0945] A speech recognition means for converting acquired audio data into text,

[0946] Translation means for translating the text into another language,

[0947] Emotion recognition means for identifying emotions in the translated text,

[0948] A means of generating meeting minutes that reflect emotions,

[0949] Information distribution means for outputting the generated translated text and the meeting minutes with emotions added,

[0950] A system comprising the above.

[0951] (Claim 2)

[0952] The system according to claim 1, further comprising data splitting and transmission means for splitting and transferring audio data in real time.

[0953] (Claim 3)

[0954] The system according to claim 1, further comprising data storage means for saving the generated meeting minutes.

[Explanation of Symbols]

[0955] 10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising:
a means for acquiring audio data;
a speech recognition means for converting the acquired audio data into text;
a translation means for translating the text into another language;
a means for generating meeting minutes using the translated text; and
an information distribution means for outputting the generated translated text and meeting minutes.

2. The system according to claim 1, further comprising data splitting and transmission means for splitting and transferring audio data in real time.

3. The system according to claim 1, further comprising data storage means for saving the generated meeting minutes.