system
The system addresses language barriers in meetings by converting audio to text, translating, and providing real-time meeting minutes, ensuring smooth communication and accurate information sharing.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-26
AI Technical Summary
Participants speaking different languages in a meeting face challenges in smooth communication, and conventional voice recordings require time-consuming transcription and lack real-time information sharing, leading to potential information discrepancies.
A system that acquires audio signals, converts them into text using speech recognition, translates the text into different languages, and provides the translated text in chronological order to user terminals in real time, enabling immediate meeting minutes creation.
Facilitates smooth information sharing and immediate understanding across multiple languages by converting audio to text, translating it, and providing real-time meeting minutes, overcoming language barriers and ensuring accurate communication.
Figure 2026105528000001_ABST
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] There is a problem that it is difficult for participants speaking different languages in a meeting to communicate smoothly. Also, conventional voice recordings require the work of transcribing the content of the meeting later, which is time-consuming and labor-intensive. Furthermore, since real-time information sharing is not possible, there is a possibility of information discrepancies among participants.
Means for Solving the Problems
[0005] This invention solves the above problems by providing a system that acquires audio signals in a meeting, converts them into text data using speech recognition technology, and further translates this text data into different languages using machine translation technology. This makes it possible to record the translated text in chronological order and provide it to the user terminal in real time, thereby realizing smooth information sharing between multiple languages and immediate creation of meeting minutes.
[0006] "Audio signals" refer to conversations and voice information captured as electrical signals, specifically the flow of sound acquired through a microphone during a meeting.
[0007] "Text data" refers to character information converted from audio signals using speech recognition technology, and is a document expressed in a format that can be processed by a computer.
[0008] "Translation" refers to the process of converting text data expressed in one language into another language, and is a means of enabling communication between different languages.
[0009] "Time series" refers to information that conveys the order in which events occurred, and it describes the temporal arrangement of translated text data when it is recorded.
[0010] A "user terminal" refers to a device used by meeting participants that has the function of receiving and displaying translated meeting minutes in real time.
[0011] "Real-time" means that the entire process, from the moment information is generated until it is processed and delivered to the user, takes place with virtually no delay.
Brief Description of the Drawings
[0012]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
MODE FOR CARRYING OUT THE INVENTION
[0013] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described according to the accompanying drawings.
[0014] First, the language used in the following description will be explained.
[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] This invention relates to a system that converts audio from a meeting into text in real time, translates it into different languages, and provides it to the user. The system mainly consists of a server, user terminals, and a communication environment that connects them via a network.
[0034] First, the user's device acquires audio during the meeting via a microphone. This audio data is transmitted to the server in streaming format. The server passes the received audio data to a speech recognition module, which converts it into text data. The speech recognition module uses advanced algorithms to extract text information from the audio, achieving highly accurate text conversion.
[0035] Next, the server passes the text data to the translation module. The translation module uses machine translation technology to translate the text into the specified target language. The translation module utilizes the latest natural language processing technology to perform smooth, context-aware translations.
[0036] Furthermore, the translated text data is recorded chronologically by the server along with the original text data. This recorded data is updated in real time and sent from the server to the user's terminal. Users can view the translated meeting minutes through their terminal and edit and save them as needed.
[0037] As a concrete example, this system can be used in international conferences of multinational corporations. In a conference held in Japan, the system is used to allow English-speaking participants to understand what Japanese-speaking participants are saying. During the conference, discussions are conducted in Japanese, but the server translates them into English in real time and displays them on the participants' devices. This allows participants to immediately grasp the content and facilitates smooth communication.
[0038] The following describes the processing flow.
[0039] Step 1:
[0040] The user captures audio during the meeting through the device's microphone. This audio data is converted into a digital signal and temporarily stored within the device.
[0041] Step 2:
[0042] The terminal transmits the acquired audio data to the server in streaming format. Data compression technology is used during this process to improve communication efficiency.
[0043] Step 3:
[0044] The server passes the received audio data to the speech recognition module. The speech recognition module analyzes the audio signal and converts it into text data. Digital signal processing technology is applied to this conversion process.
[0045] Step 4:
[0046] The server passes the generated text data to the translation module. The translation module applies a machine translation algorithm to translate the text data into the specified target language.
[0047] Step 5:
[0048] The server attaches the translated text data to the original text data, organizes it chronologically, and creates meeting minutes that are updated in real time.
[0049] Step 6:
[0050] The server sends the created meeting minutes data to the user's terminal. The user can then view the meeting minutes, which are updated in real time, through their terminal.
[0051] Step 7:
[0052] Users can make corrections and annotations to the meeting minutes on their own devices as needed, ensuring the consistency and accuracy of the information.
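The processing flow of Steps 2 to 6 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names MinutesEntry, MeetingMinutes, terminal_send, and server_process are hypothetical, zlib stands in for the unspecified data compression technology, and fake_recognize and fake_translate are placeholders for the speech recognition module and the translation module, which the description does not specify.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MinutesEntry:
    """One utterance in the minutes: original text plus its translation."""
    timestamp: float      # seconds since the meeting started (assumed unit)
    original: str
    translated: str


@dataclass
class MeetingMinutes:
    """Server-side minutes kept in chronological order (Step 5)."""
    entries: List[MinutesEntry] = field(default_factory=list)

    def add(self, entry: MinutesEntry) -> None:
        self.entries.append(entry)
        # Chunks may arrive out of order, so re-sort by timestamp.
        self.entries.sort(key=lambda e: e.timestamp)


def terminal_send(audio: bytes) -> bytes:
    """Step 2: compress a captured audio chunk before transmission."""
    return zlib.compress(audio)


def server_process(
    payload: bytes,
    timestamp: float,
    minutes: MeetingMinutes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
) -> MinutesEntry:
    """Steps 3 to 5: decompress, recognize, translate, and record."""
    audio = zlib.decompress(payload)
    original = recognize(audio)
    entry = MinutesEntry(timestamp, original, translate(original))
    minutes.add(entry)
    return entry  # Step 6 would send this entry to each terminal.


# Hypothetical stand-ins for the speech recognition and translation modules.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    glossary = {"こんにちは": "Hello", "ありがとう": "Thank you"}
    return glossary.get(text, text)


minutes = MeetingMinutes()
server_process(terminal_send("ありがとう".encode("utf-8")), 2.0, minutes,
               fake_recognize, fake_translate)
server_process(terminal_send("こんにちは".encode("utf-8")), 1.0, minutes,
               fake_recognize, fake_translate)
for e in minutes.entries:
    print(f"{e.timestamp:.1f}s  {e.original} -> {e.translated}")
```

Sorting on every insertion keeps the sketch short; a real server would more likely insert in order or rely on an ordered store.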
[0053] (Example 1)
[0054] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0055] In international conferences where multiple languages are used, language barriers can hinder smooth communication among participants. Conventional technologies have struggled to provide efficient information processing necessary for accurate, real-time translation.
[0056] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0057] In this invention, the server includes an information processing device for acquiring audio data, an information processing device for converting the audio data into text information, and an information processing device for translating the text information into different natural languages. This enables smooth communication even in multilingual meetings by converting audio generated during a meeting into text information in real time, immediately translating that text information, and recording it chronologically.
[0058] "Audio data" refers to information obtained in digital format from audio signals generated during a meeting, and is data acquired by an information processing device.
[0059] "Text information" refers to data in text format converted from audio data, and is generated by an information processing device.
[0060] A "different natural language" is a language other than the language in which the original text information was written, and is the target language of the machine translation.
[0061] An "information processing device" is a device that performs a series of processes from acquiring audio data to converting it into text information and then translating it, and includes dedicated software or hardware.
[0062] A "server" is a central processing unit that manages the reception, processing, and translation of audio data, and is a device that communicates with multiple terminals via a network.
[0063] "Providing in real time" means that the process from acquiring audio data to providing the translation results to the user is executed without delay, so that the information reaches the user as soon as it is produced.
[0064] This invention is a system for facilitating multilingual communication during meetings. The system mainly consists of terminals that acquire and process voice data, servers that convert and translate it into text information, and a communication network that connects them.
[0065] The terminal used by the user is a device that acquires audio during meetings via a microphone. This audio data is transmitted to a server over the network. The terminal is equipped with a high-performance microprocessor and appropriate communication protocols for stable connection and processing.
[0066] The server receives audio data and converts it into text using speech recognition software. For speech recognition, software such as a speech recognition API is used. This software utilizes advanced speech recognition technology to generate accurate text from the audio data.
[0067] The generated text information is translated into different natural languages by a translation module on the server. For example, a machine translation service is used for translation. The translation module utilizes the latest natural language processing technologies to provide contextually accurate translations.
[0068] Translation results are recorded chronologically on the server and transmitted to the user's device in real time. Users can view the translated text information on their device and edit and save it as needed. The device is equipped with a web application using HTML5 and JavaScript (registered trademark), allowing for efficient operation through the user interface.
[0069] As a concrete example, this system can be used in meetings of multinational corporations to enable real-time communication between participants who speak different languages. An example of a prompt for the generative AI model would be, "Please design a system that translates meeting audio in real time and displays it to the participants."
[0070] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0071] Step 1:
[0072] The user's terminal acquires audio during the meeting using a microphone. Audio input is obtained as an analog signal through the microphone device, and audio data is generated by converting this signal to a digital format. The terminal transmits the acquired audio data to the server via the network. A low-latency and highly efficient protocol is used for transmitting the audio data.
[0073] Step 2:
[0074] The server receives audio data transmitted from the terminal. This received data is passed to the speech recognition module. The speech recognition module takes the received audio data as input and uses an advanced phoneme analysis algorithm to extract text information from the speech. This process outputs the audio data as text data.
[0075] Step 3:
[0076] The server passes the text data obtained from the speech recognition module to the translation module. This module receives the text data as input and translates it into a specified different natural language using a probabilistic translation method. The translated text information is output while maintaining accuracy and recorded on the server.
[0077] Step 4:
[0078] The server records translated text information in a database in chronological order. The recording process includes timestamps to preserve information about when the utterance occurred. This data is systematically organized for later retrieval and management.
[0079] Step 5:
[0080] The server transmits recorded translation data to the user's device in real time. Push technology is used for this transmission to ensure the information is immediately reflected on the device. The user's device has the functionality to display the received translation data on its screen, providing an intuitive interface for viewing and editing the information.
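Steps 4 and 5 above (timestamped recording in a database and push delivery to terminals) might look like the following minimal sketch. It assumes an in-memory SQLite database and models the push channel as a plain callback; the class name TranslationStore, the table layout, and the column names are all hypothetical, since the description names neither a database product nor a push protocol.

```python
import sqlite3
from typing import Callable, List, Tuple

Row = Tuple[float, str, str]


class TranslationStore:
    """Records translated utterances with timestamps and pushes each new row."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE utterances ("
            " spoken_at REAL NOT NULL,"   # when the utterance occurred
            " original TEXT NOT NULL,"
            " translated TEXT NOT NULL)"
        )
        self.subscribers: List[Callable[[Row], None]] = []

    def subscribe(self, callback: Callable[[Row], None]) -> None:
        """Register a terminal; the callback stands in for a push channel."""
        self.subscribers.append(callback)

    def record(self, spoken_at: float, original: str, translated: str) -> None:
        """Step 4: store the row, then Step 5: push it to every terminal."""
        row = (spoken_at, original, translated)
        self.db.execute("INSERT INTO utterances VALUES (?, ?, ?)", row)
        self.db.commit()
        for push in self.subscribers:
            push(row)

    def history(self) -> List[Row]:
        """Return all rows in the order in which the utterances occurred."""
        cur = self.db.execute(
            "SELECT spoken_at, original, translated "
            "FROM utterances ORDER BY spoken_at"
        )
        return cur.fetchall()


store = TranslationStore()
received: List[Row] = []
store.subscribe(received.append)
store.record(12.5, "議題に入ります", "Let us move to the agenda")
store.record(3.0, "おはようございます", "Good morning")
print(store.history())
```

Ordering by the utterance timestamp rather than by insertion order means that a late-arriving translation still appears at the correct position when the history is read back.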
[0081] (Application Example 1)
[0082] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0083] In international conferences and meetings where multiple languages are spoken, it is extremely difficult for participants who speak different languages to quickly and accurately understand what is being said. Therefore, there is a need for a system that eliminates communication barriers caused by language differences and enables smooth communication.
[0084] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0085] In this invention, the server includes means for acquiring audio signals in a meeting, means for converting the audio signals into text data, and means for translating the text data into different languages. This enables speakers of different languages to instantly understand the meeting content and communicate effectively.
[0086] "Audio signals in a meeting" refers to audio data generated during a meeting, which enables the recording of participants' statements and discussions.
[0087] "Means of converting to text data" refers to technologies that enable the process of analyzing audio signals and converting them into digital data as textual information.
[0088] "Means of translation into different languages" refers to machine translation technology used to convert text expressed in one language into another language.
[0089] A "means of recording based on a time series" is a system that records data in the order in which it occurred, making it easier to refer to and analyze later.
[0090] "Means of providing data to display devices in real time" refers to technologies for instantly transmitting translated text data to the user's visual display device.
[0091] "A means of instantly understanding the content of meetings held in different regions, regardless of the location, using participant devices" refers to a system that enables participants to quickly grasp the content of a meeting even when they are in a remote location.
[0092] The system for implementing this invention converts audio during a meeting into text in real time, translates it into different languages, and provides this information to the participants' terminals. The server first receives the audio signal acquired through the microphone. This audio data is transmitted in streaming format based on a communication protocol.
[0093] The server converts received audio data into text data using a speech recognition module. This module incorporates advanced speech recognition algorithms and is capable of accurately analyzing human language. The converted text is then passed to a machine translation module, which translates it into the specified target language. The translation process incorporates the latest natural language processing technologies, enabling smooth, context-aware translations.
[0094] The translated text data is recorded chronologically and transmitted in real time to participants' display devices. This allows participants to instantly grasp the content of the meeting and overcome communication barriers caused by language differences. For example, in international meetings with multilingual speakers, this system enables each participant to instantly understand what is said in their native language, allowing for efficient meeting progress.
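One way to deliver each utterance in every participant's native language is to translate once per distinct language and fan the result out to the terminals. The sketch below illustrates that idea only; the function fan_out, the terminal identifiers, the language codes, and the three-argument translate callable are assumptions introduced for the example and are not taken from the description.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def fan_out(
    text: str,
    source_lang: str,
    terminal_langs: Dict[str, str],
    translate: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Return the text each terminal should display, keyed by terminal id."""
    # Group terminals by language so each language is translated only once.
    by_lang: Dict[str, List[str]] = defaultdict(list)
    for terminal_id, lang in terminal_langs.items():
        by_lang[lang].append(terminal_id)

    out: Dict[str, str] = {}
    for lang, terminal_ids in by_lang.items():
        # Speakers of the source language see the original text unchanged.
        shown = text if lang == source_lang else translate(text, source_lang, lang)
        for terminal_id in terminal_ids:
            out[terminal_id] = shown
    return out


calls: List[str] = []


def fake_translate(text: str, src: str, dst: str) -> str:
    """Placeholder translator that records which target languages were requested."""
    calls.append(dst)
    return f"[{src}->{dst}] {text}"


result = fan_out(
    "会議を始めます",
    "ja",
    {"t1": "en", "t2": "en", "t3": "ja", "t4": "fr"},
    fake_translate,
)
print(result)
```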
[0095] Furthermore, it is possible to improve translation accuracy using generative AI models. In specific situations where participants require more specific translations or explanations, they can enter the following prompt to generate detailed explanations or supplementary translations: "Please develop a system that translates Japanese audio into English in real time. This will instantly convey the content to participants speaking other languages and help facilitate the smooth running of meetings."
[0096] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0097] Step 1:
[0098] The terminal acquires the conference audio signal through the microphone. It converts the audio signal acquired by the microphone device into a digital format and encodes it into a streaming format for transmission to the server. The input is an analog audio signal, and the output is encoded audio data.
[0099] Step 2:
[0100] The server receives encoded audio data and processes it with a speech recognition module. The speech recognition module uses advanced speech recognition algorithms to analyze and convert the input audio data into text data. This step analyzes the phonemes and context of the speech to generate accurate textual information. The input is encoded audio data, and the output is text data.
[0101] Step 3:
[0102] The server passes the acquired text data to the translation module, which then translates it into the specified target language. The translation module utilizes machine translation technology and generative AI models to achieve highly accurate translations that take context into account. The input is the original text data, and the output is the translated text data.
[0103] Step 4:
[0104] The server records translated text data chronologically and stores it as a history. This record helps participants review past meeting content. The input is translated text data, and the output is a storage of translated text organized chronologically.
[0105] Step 5:
[0106] The server transmits the recorded translated text data to participants' display devices in real time. Participants' devices display the received data on their screens, allowing users to immediately grasp the meeting content. The input is real-time translated text data, and the output is the displayed translated text.
[0107] Step 6:
[0108] Users utilize translation data on their participant devices to understand the content of meetings held in different regions, regardless of location. Users can check the displayed translation results and immediately join the meeting. The input is the real-time displayed translated text, and the output is improved user understanding and communication.
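Step 1 of this flow (digitizing the microphone signal and encoding it for streaming) can be illustrated as follows. The sketch assumes 16 kHz, 16-bit PCM and 20 ms frames tagged with sequence numbers; these values, and the names to_pcm16 and frames, are illustrative choices and are not specified in the description.

```python
import math
import struct
from typing import Iterator, List, Tuple

SAMPLE_RATE = 16_000      # assumed sampling rate in Hz
FRAME_MS = 20             # assumed frame duration in milliseconds


def to_pcm16(samples: List[float]) -> bytes:
    """Quantize samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def frames(pcm: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence number, frame) pairs; the last frame may be short."""
    frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 2 bytes per sample
    for seq, start in enumerate(range(0, len(pcm), frame_bytes)):
        yield seq, pcm[start:start + frame_bytes]


# 50 ms of a 440 Hz tone stands in for the microphone signal.
n = SAMPLE_RATE * 50 // 1000
tone = [0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]
stream = list(frames(to_pcm16(tone)))
print([(seq, len(frame)) for seq, frame in stream])
```

The sequence numbers let the server restore the original order if frames arrive out of sequence, which matters for the chronological recording in Step 4.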
[0109] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0110] This invention is a system that not only converts audio during meetings into text in real time and translates it into different languages, but also analyzes the speaker's emotions and supplements that information. The system consists of a server, user terminals, and a communication environment that links them via a network.
[0111] The devices used by users during meetings have been improved to acquire not only voice input but also voice characteristics (e.g., speaking speed and tone) for emotion analysis. This characteristic information is captured by the device simultaneously with the voice data and transmitted to the server in streaming format.
[0112] On the server, the received audio data is converted into text data by a speech recognition module. Signal processing in this process utilizes advanced speech recognition algorithms to ensure accuracy. Simultaneously, an emotion engine analyzes the characteristics of the speech and extracts emotional information. The emotion engine analyzes parameters such as intonation and speed to identify the speaker's emotional state.
[0113] Next, the server uses the translation module to translate the converted text into the specified target language. The translated text data is then supplemented with the emotion information extracted by the emotion engine. This adds emotional nuance to the translated content, allowing the recipient to gain a deeper understanding of the intended meaning.
[0114] Furthermore, the server organizes the translated text, including emotional information, and records it as meeting minutes in chronological order. This record is updated in real time and sent to the user's device. Through the device, the user can see the speaker's emotional state in real time along with the translated text.
[0115] As a concrete example, when the emotion recognition function is used in a multinational conference where each participant speaks in their native language, the emotions and intentions of the speaker can be understood more clearly when the speech is translated into other languages. For instance, if a participant expresses dissatisfaction with a proposal, the emotion engine can capture their tone and manner of speaking and add emotion labels such as "anger" or "dissatisfaction" to the text, ensuring that the intention is accurately conveyed to other participants and enabling discussions that avoid misunderstandings.
[0116] The following describes the processing flow.
[0117] Step 1:
[0118] The user captures audio during the meeting through the device's microphone, simultaneously capturing features such as intonation and speed. This allows for the acquisition of audio data and parameters related to emotion.
[0119] Step 2:
[0120] The terminal transmits the acquired audio data and its characteristic parameters to the server in streaming format. Data compression technology is applied to ensure rapid transmission to the server over the network.
[0121] Step 3:
[0122] The server passes the received audio data to the speech recognition module, which converts the audio into text data. The speech recognition algorithm is then applied to generate accurate text strings from the audio signal.
[0123] Step 4:
[0124] The server simultaneously activates the emotion engine and analyzes feature parameters related to the voice. The emotion engine analyzes the tone, intonation, and speed of the voice to extract the user's emotions.
[0125] Step 5:
[0126] The server passes the converted text data to the translation module, which translates it into the specified target language. A translation algorithm is used to produce a contextually natural translation.
[0127] Step 6:
[0128] The server adds the sentiment information extracted by the emotion engine to the translated text data. Sentiment labels and sentiment scores are attached to the text data, supplementing the nuances and intent of the statements.
[0129] Step 7:
[0130] The server organizes translated text, including sentiment information, as time-series data and creates meeting minutes that are updated in real time. This data is then sent to the user's terminal.
[0131] Step 8:
[0132] Users can view translated meeting minutes through their devices and see embedded sentiment information in real time. They can also provide feedback on the content of the minutes as needed to deepen their understanding of the information.
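Steps 6 and 7 above can be illustrated with a minimal sketch. The class name `MinutesEntry`, the function `build_minutes`, the time unit, and the 0.0 to 1.0 score range are assumptions introduced here for illustration and are not specified in the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinutesEntry:
    """One translated utterance with its emotion tag (hypothetical layout)."""
    timestamp: float       # seconds since the meeting started (assumed unit)
    translated_text: str
    emotion_label: str     # e.g. "anger" or "dissatisfaction"
    emotion_score: float   # assumed range 0.0 to 1.0


def build_minutes(entries: List[MinutesEntry]) -> List[str]:
    """Return display lines sorted chronologically, each with its emotion tag."""
    ordered = sorted(entries, key=lambda e: e.timestamp)
    return [
        f"[{e.timestamp:.1f}s] {e.translated_text} "
        f"(emotion: {e.emotion_label}, score: {e.emotion_score:.2f})"
        for e in ordered
    ]
```

Entries may arrive out of order, so the sketch sorts by timestamp each time the minutes are rebuilt; an incremental insert would serve equally well.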
[0133] (Example 2)
[0134] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0135] In international conferences and multilingual environments, communication presents challenges not only in terms of language barriers but also in understanding the speaker's intentions and emotions. Translated texts, in particular, tend to lose emotion and nuance, potentially leading to the omission of crucial elements in effective communication. Such situations increase the likelihood of misunderstandings and communication breakdowns.
[0136] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0137] In this invention, the server includes means for analyzing and extracting emotional information from acoustic data, means for adding emotional information to translated text information, and means for arranging and recording the translated text information based on chronological order. This makes it possible to accurately convey emotional nuances along with the content of speech in communication between different languages, thereby facilitating smoother communication.
[0138] "Acoustic data" refers to information, including audio signals, acquired during meetings and other forms of voice communication.
[0139] "Textual information" refers to text data obtained by converting acoustic data using speech recognition technology.
[0140] "Emotional information" refers to information about the speaker's emotional state, extracted from the analysis of acoustic data.
[0141] "Translation" is the process of converting textual information recorded in one language into a different target language.
[0142] A "server" is a central computer system that processes audio data, generates text information, translates, performs sentiment analysis, and records data.
[0143] A "user terminal" is a device that receives recorded data in real time and provides information to the user.
[0144] A "time series" is an arrangement of information in which data is sorted according to the order of time.
[0145] The embodiments for carrying out the present invention are shown below.
[0146] The user's device acquires audio data in real time during the meeting using a microphone. This device is equipped with a high-precision speech recognition microphone and software for analyzing audio characteristics (e.g., volume, speed, tone), and simultaneously captures audio data and its characteristic information. The acquired audio data is transmitted to the server in streaming format via a communication protocol (e.g., WebSocket).
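As a hedged illustration of how a terminal might attach one characteristic (volume) to each audio chunk before streaming, the sketch below assumes 16-bit little-endian mono PCM and uses root-mean-square amplitude as the volume measure. The function names, the message fields, and the 16 kHz sample rate are hypothetical; speed and tone analysis and the WebSocket transport itself are outside the scope of this sketch.

```python
import math
import struct


def rms_volume(pcm_bytes: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian PCM samples."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def make_chunk_message(seq: int, pcm_bytes: bytes, sample_rate: int = 16000) -> dict:
    """Bundle one audio chunk with its volume feature (hypothetical message layout)."""
    return {
        "seq": seq,                      # chunk order, so the server can reassemble
        "sample_rate": sample_rate,      # assumed value, not given in the source
        "volume_rms": rms_volume(pcm_bytes),
        "audio": pcm_bytes,
    }
```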
[0147] When the server receives audio data, it converts it into text information using a speech recognition module. During this process, the server utilizes a generative AI model and executes advanced speech recognition algorithms to maintain conversion accuracy. Simultaneously, the server uses an emotion engine to analyze the audio characteristics of the audio data and extract emotional information. The emotion engine incorporates a machine learning model to analyze voice intonation and speed changes to identify the speaker's emotions.
[0148] The converted text information is translated by the server into the specified target language via a translation module. The translated text information is then supplemented with the aforementioned sentiment information, making it easier for the emotional nuances contained in the text to be conveyed to the recipient.
[0149] The server organizes the translated text and sentiment data chronologically and records it as meeting minutes. This record is updated in real time and sent to the user's terminal. Through the terminal, the user can view the translated text and the speaker's sentiment.
[0150] A concrete example of its use is in multinational conferences. This system allows each participant to speak in their native language while receiving information translated into other languages, and also deepens their understanding of the other person's emotions and intentions. An example of a prompt for the generative AI model is, "Please input audio data, then perform text conversion, translation, and sentiment analysis." This allows the system to efficiently execute each specified process.
[0151] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0152] Step 1:
[0153] The terminal acquires audio data during a meeting in real time using a microphone. Simultaneously, audio characteristics such as volume, speed, and tone are also captured. The input is the raw audio signal, and the output is digital audio data for analysis. Software on the terminal converts this data into a streaming format and transmits it to the server in a stable manner via a communication protocol.
[0154] Step 2:
[0155] The server receives audio data streamed from the terminal. The input is a stream of audio data. This data is passed to a speech recognition module for conversion into text information. Here, an advanced speech recognition algorithm using a generative AI model operates, converting the data into text with high accuracy even in the presence of noise. The output is the converted text information.
[0156] Step 3:
[0157] The server activates an emotion engine to extract emotional information from acoustic data. The input is acoustic data containing speech characteristics. This engine analyzes changes in voice intonation, speed, and tone to identify the speaker's emotions. Specifically, it uses a machine learning model to perform pattern recognition and generate labels such as "joy," "anger," and "sadness." The output is the analysis results, including the emotion labels.
[0158] Step 4:
[0159] The server passes the text information to the translation module, which translates it into the specified target language. The input is the converted text information. A generative AI model performs grammatically and culturally accurate translation. Sentiment information is also added at this stage, adding emotional nuances to the translated text. The output is the translated text information, including the sentiment information.
[0160] Step 5:
[0161] The server organizes translated text and sentiment information chronologically and records it as meeting minutes. Input consists of translated text and sentiment labels. This data is stored in a database and updated in real time as needed. Output is meeting minutes data organized chronologically.
[0162] Step 6:
[0163] The server provides the latest meeting minutes to the user's terminal in real time. The input is meeting minutes data organized chronologically. The user's terminal receives this and displays translated text and sentiment information on the screen. The output is real-time information provided to the user. This allows the user to gain a deeper understanding of the speaker's intentions and emotions.
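The ordering of Steps 2 to 4 can be sketched as follows: speech recognition and emotion analysis run concurrently on the same audio, and translation follows once the text is available. The function `process_utterance` and its parameters are hypothetical, and the three modules are passed in as plain callables because the disclosure does not name concrete implementations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def process_utterance(
    audio: bytes,
    recognize: Callable[[bytes], str],
    analyze_emotion: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    target_language: str,
) -> Dict[str, str]:
    """Run recognition and emotion analysis in parallel, then translate."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, audio)
        emotion_future = pool.submit(analyze_emotion, audio)
        text = text_future.result()
        emotion = emotion_future.result()
    translated = translate(text, target_language)
    return {"original": text, "translated": translated, "emotion": emotion}
```

Injecting the modules as callables keeps the sketch independent of any particular speech recognition, emotion, or translation engine.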
[0164] (Application Example 2)
[0165] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0166] In international communication settings, smooth communication between speakers of different languages is difficult, and accurately conveying emotions and nuances is particularly challenging. Furthermore, while there is a demand for multilingual communication capabilities in home robots and portable devices, there is a lack of technology to provide accurate translations that incorporate emotional information in real time.
[0167] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0168] In this invention, the server includes means for converting speech into text and translating it into different languages, means for adding speaker's emotional information to the translated text, and means for recording the translated text and emotional information in chronological order. This makes it possible to accurately convey the speaker's emotions and intentions across language barriers in communication between people who speak different languages.
[0169] A "meeting" is a place where multiple participants exchange opinions and information and make decisions.
[0170] "Sound" refers to human speech and voice, and is information that can be processed as electrical signals.
[0171] "Text" refers to character information converted from speech, and is data represented in a visually recognizable form.
[0172] "Different languages" refers to a situation where the speaker and listener have different language systems, making direct mutual understanding difficult.
[0173] "Translation" is the act of replacing information expressed in one language with information expressed in another language, thereby conveying its meaning.
[0174] "Emotional information" refers to data that indicates the emotional state of a speaker, analyzed based on factors such as intonation, tone, and speed of speech.
[0175] A "time series" refers to an arrangement of information organized according to chronological order, and is a fundamental concept for managing history.
[0176] "Recording" is the act of preserving information and maintaining it in a format that allows for later reference.
[0177] An "information processing terminal" refers to equipment or devices used to display or process audio, text, and translation results.
[0178] To implement this invention, a server, a user-operated information processing terminal, and a communication network connecting them are primarily required. The server is equipped with a speech recognition module that receives audio data and converts it into text data. This speech recognition module utilizes an advanced speech recognition algorithm to perform highly accurate conversions. Furthermore, an emotion engine simultaneously extracts emotional information of the speaker from the characteristics of the audio. The emotion engine generates this information by analyzing parameters such as voice tone and speed.
[0179] The acquired text data is translated in real time into the specified target language using a translation module on the server. This translation module uses the latest machine translation technology to process the text and adds sentiment information to the translated text to accurately convey the speaker's intent and emotions.
[0180] The final translated text data is organized chronologically by the server and provided to information processing terminals in real time. The terminals used by users instantly display the translated text and sentiment information, facilitating smooth communication between languages.
[0181] As a concrete example, if a home robot utilizes this system, the robot can support conversations with visitors who speak different languages. For instance, it could translate "Hello, how are you?" spoken in English as "Hello, how are you? (emotion: interest)" and convey this emotional information to the Japanese speaker. This enables natural communication while understanding the visitor's emotions.
[0182] An example of a prompt for the generative AI model is "Please translate and analyze the emotion of the following sentence: 'Hello, how are you?'" which will allow the AI to perform appropriate translation and sentiment analysis.
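A minimal sketch of how the prompt above might be assembled and how a reply in the "text (emotion: label)" form shown in the home robot example might be split into its parts. The function names are hypothetical, and the reply format is an assumption: the disclosure does not specify what the generative AI model actually returns.

```python
import re
from typing import Optional, Tuple


def build_prompt(sentence: str) -> str:
    """Fill the example prompt from the text with the sentence to process."""
    return (
        "Please translate and analyze the emotion of the following sentence: "
        f"'{sentence}'"
    )


_REPLY = re.compile(r"^(?P<text>.*)\(emotion:\s*(?P<label>[^)]+)\)\s*$")


def parse_reply(reply: str) -> Optional[Tuple[str, str]]:
    """Split 'text (emotion: label)' into (text, label); None if it does not match."""
    match = _REPLY.match(reply)
    if match is None:
        return None
    return match.group("text").strip(), match.group("label").strip()
```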
[0183] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0184] Step 1:
[0185] The user provides voice input through the device. The device's microphone detects the voice and converts it into a digital audio signal, which the device records as audio data.
[0186] Step 2:
[0187] The terminal transmits the audio data directly to the server. This transmission utilizes a network, transferring the data in real time via streaming. Data compression technology is applied during this process to achieve high-speed and efficient data communication. The input is digital audio data, and the output is compressed audio data.
[0188] Step 3:
[0189] The server inputs the received audio data into a speech recognition module. This module analyzes the audio data and converts it into text data. Using speech signal recognition technology, word recognition is performed by analyzing phonemes. The input here is compressed audio data, and the output is text data converted from the audio.
[0190] Step 4:
[0191] The server inputs the generated text data into the translation module and performs translation into the specified language. The translation module uses a machine translation engine to dynamically perform language conversion. The input is text data, and the output is translated text data.
[0192] Step 5:
[0193] Simultaneously, a server-based emotion engine analyzes the original speech feature data to infer the speaker's emotions. This analysis is based on acoustic features such as tone, intonation, and tempo. The input is speech feature data, and the output is data accompanied by emotion labels.
[0194] Step 6:
[0195] The server adds sentiment information to the translated text data and organizes the entire dataset chronologically. This process generates data as translated text with sentiment information. This dataset is stored along with past speech history and can be reused as needed.
[0196] Step 7:
[0197] The server transmits the prepared data to the terminal in real time. The terminal immediately displays the information on the user's screen, allowing the user to simultaneously view the translated content and sentiment information. The output is display data with text and sentiment labels that the user can visually recognize.
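The compression applied in Step 2 is not specified. As a stand-in, the sketch below uses general-purpose lossless compression from Python's `zlib` module with a sync flush after every chunk, so that the server can decompress each chunk as soon as it arrives. A production system would more likely use a dedicated audio codec; the class names here are hypothetical.

```python
import zlib


class StreamCompressor:
    """Terminal side: compress audio chunks as one continuous stream."""

    def __init__(self) -> None:
        self._compressor = zlib.compressobj()

    def compress_chunk(self, pcm: bytes) -> bytes:
        # Z_SYNC_FLUSH emits everything buffered so far, so each chunk
        # can be decoded on receipt without waiting for later chunks.
        return self._compressor.compress(pcm) + self._compressor.flush(zlib.Z_SYNC_FLUSH)


class StreamDecompressor:
    """Server side: restore the chunks in arrival order."""

    def __init__(self) -> None:
        self._decompressor = zlib.decompressobj()

    def decompress_chunk(self, data: bytes) -> bytes:
        return self._decompressor.decompress(data)
```

Keeping one compressor object per stream lets later chunks reuse the dictionary built from earlier ones, which is why the chunks must be decompressed in the order they were sent.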
[0198] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0199] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0200] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0201] [Second Embodiment]
[0202] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0203] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0204] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0205] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0206] The microphone 238 accepts instructions from the user 20 by receiving the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0207] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0208] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0209] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0210] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0211] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0212] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0213] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0214] This invention relates to a system that converts audio from a meeting into text in real time, translates it into different languages, and provides it to the user. The system mainly consists of a server, user terminals, and a communication environment that connects them via a network.
[0215] First, the user's device acquires audio during the meeting via a microphone. This audio data is transmitted to the server in streaming format. The server passes the received audio data to a speech recognition module, which converts it into text data. The speech recognition module uses advanced algorithms to extract text information from the audio, achieving highly accurate text conversion.
[0216] Next, the server passes the text data to the translation module. The translation module uses machine translation technology to translate the text into the specified target language. The translation module utilizes the latest natural language processing technology to perform smooth, context-aware translations.
[0217] Furthermore, the translated text data is recorded chronologically by the server along with the original text data. This recorded data is updated in real time and sent from the server to the user's terminal. Users can view the translated meeting minutes through their terminal and edit and save them as needed.
[0218] As a concrete example, this system can be used in international conferences of multinational corporations. In a conference held in Japan, the system is used to allow English-speaking participants to understand what Japanese-speaking participants are saying. During the conference, discussions are conducted in Japanese, but the server translates them into English in real time and displays them on the participants' devices. This allows participants to immediately grasp the content and facilitates smooth communication.
[0219] The following describes the processing flow.
[0220] Step 1:
[0221] The user's device captures audio during the meeting through its microphone. This audio data is converted into a digital signal and temporarily stored within the device.
[0222] Step 2:
[0223] The terminal transmits the acquired audio data to the server in streaming format. Data compression technology is used during this process to improve communication efficiency.
[0224] Step 3:
[0225] The server passes the received audio data to the speech recognition module. The speech recognition module analyzes the audio signal and converts it into text data. Digital signal processing technology is applied to this conversion process.
[0226] Step 4:
[0227] The server passes the generated text data to the translation module. The translation module applies a machine translation algorithm to translate the text data into the specified target language.
[0228] Step 5:
[0229] The server attaches the translated text data to the original text data, organizes it chronologically, and creates meeting minutes that are updated in real time.
[0230] Step 6:
[0231] The server sends the created meeting minutes data to the user's terminal. The user can then view the meeting minutes, which are updated in real time, through their terminal.
[0232] Step 7:
[0233] Users can make corrections and annotations to the meeting minutes on their own devices as needed, ensuring the consistency and accuracy of the information.
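Steps 5 to 7 above pair each original utterance with its translation and let users correct or annotate the result. A minimal sketch under assumed names (`MinutesLine`, `Minutes`, and their methods are hypothetical and do not appear in the disclosure):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MinutesLine:
    timestamp: float                 # seconds since meeting start (assumed unit)
    original: str                    # recognized text in the speaker's language
    translated: str                  # translation attached to the original
    notes: List[str] = field(default_factory=list)


class Minutes:
    """Chronological minutes that accept user corrections and annotations."""

    def __init__(self) -> None:
        self._lines: List[MinutesLine] = []

    def add(self, timestamp: float, original: str, translated: str) -> None:
        self._lines.append(MinutesLine(timestamp, original, translated))
        self._lines.sort(key=lambda line: line.timestamp)  # keep time order

    def correct(self, index: int, new_translation: str) -> None:
        self._lines[index].translated = new_translation

    def annotate(self, index: int, note: str) -> None:
        self._lines[index].notes.append(note)

    def lines(self) -> List[MinutesLine]:
        return list(self._lines)
```

Keeping the original text alongside the translation gives a reviewer something to check a correction against.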
[0234] (Example 1)
[0235] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0236] In international conferences where multiple languages are used, language barriers can hinder smooth communication among participants. Conventional technologies have struggled to provide efficient information processing necessary for accurate, real-time translation.
[0237] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0238] In this invention, the server includes an information processing device for acquiring audio data, an information processing device for converting the audio data into text information, and an information processing device for translating the text information into different natural languages. This enables smooth communication even in multilingual meetings by converting audio generated during a meeting into text information in real time, immediately translating that text information, and recording it chronologically.
[0239] "Audio data" refers to information obtained in digital format from audio signals generated during a meeting, and is data acquired by an information processing device.
[0240] "Textual information" refers to data in text format converted from audio data, and is generated by an information processing device.
[0241] A "different natural language" is a language other than the language in which the original written information was written, and is the language to be translated using machine translation technology.
[0242] An "information processing device" is a device that performs a series of processes from acquiring audio data to converting it into text information and then translating it, and includes dedicated software or hardware.
[0243] A "server" is a central processing unit that manages the reception, processing, and translation of audio data, and is a device that communicates with multiple terminals via a network.
[0244] "Providing in real time" means that the process from acquiring audio data to providing the translation results to the user is executed without delay, so that the information is reflected and provided immediately.
[0245] This invention is a system for facilitating multilingual communication during meetings. The system mainly consists of terminals that acquire and process voice data, servers that convert and translate it into text information, and a communication network that connects them.
[0246] The terminal used by the user is a device that acquires audio during meetings via a microphone. This audio data is transmitted to a server over the network. The terminal is equipped with a high-performance microprocessor and appropriate communication protocols for stable connection and processing.
[0247] The server receives audio data and converts it into text using speech recognition software. For speech recognition, software such as a speech recognition API is used. This software utilizes advanced speech recognition technology to generate accurate text from the audio data.
[0248] The generated text information is translated into different natural languages by a translation module on the server. For example, a machine translation service is used for translation. The translation module utilizes the latest natural language processing technologies to provide contextually accurate translations.
[0249] Translation results are recorded chronologically on the server and transmitted to the user's device in real time. Users can view the translated text information on their device and edit and save it as needed. The device is equipped with a web application using HTML5 and JavaScript, allowing for efficient operation through the user interface.
[0250] As a concrete example, this system can be used in meetings of multinational corporations to enable real-time communication between participants who speak different languages. An example of a prompt for the generative AI model would be, "Please design a system that translates meeting audio in real time and displays it to the participants."
[0251] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0252] Step 1:
[0253] The user's terminal acquires audio during the meeting using a microphone. Audio input is obtained as an analog signal through the microphone device, and audio data is generated by converting this signal to a digital format. The terminal transmits the acquired audio data to the server via the network. A low-latency and highly efficient protocol is used for transmitting the audio data.
[0254] Step 2:
[0255] The server receives audio data transmitted from the terminal. This received data is passed to the speech recognition module. The speech recognition module takes the received audio data as input and uses an advanced phoneme analysis algorithm to extract text information from the speech. This process outputs the audio data as text data.
[0256] Step 3:
[0257] The server passes the text data obtained from the speech recognition module to the translation module. This module receives the text data as input and translates it into a specified different natural language using a probabilistic translation method. The translated text information is output while maintaining accuracy and recorded on the server.
[0258] Step 4:
[0259] The server records translated text information in a database in chronological order. The recording process includes timestamps to preserve information about when the utterance occurred. This data is systematically organized for later retrieval and management.
[0260] Step 5:
[0261] The server transmits recorded translation data to the user's device in real time. Push technology is used for this transmission to ensure the information is immediately reflected on the device. The user's device has the functionality to display the received translation data on its screen, providing an intuitive interface for viewing and editing the information.
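The timestamped recording in Step 4 and the retrieval needed before the transmission in Step 5 can be sketched with Python's built-in `sqlite3` module. The table name, the column names, and the use of an in-memory database are assumptions chosen for illustration; the disclosure does not specify a schema or a database product.

```python
import sqlite3
from typing import List, Tuple


def open_store() -> sqlite3.Connection:
    """Create an in-memory store for translated utterances (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE utterances ("
        " id INTEGER PRIMARY KEY,"
        " spoken_at REAL NOT NULL,"     # timestamp of the utterance
        " lang TEXT NOT NULL,"          # target language of the translation
        " translated TEXT NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, spoken_at: float, lang: str, translated: str) -> None:
    conn.execute(
        "INSERT INTO utterances (spoken_at, lang, translated) VALUES (?, ?, ?)",
        (spoken_at, lang, translated),
    )
    conn.commit()


def fetch_since(conn: sqlite3.Connection, since: float) -> List[Tuple[float, str, str]]:
    """Return rows newer than `since` in chronological order."""
    cursor = conn.execute(
        "SELECT spoken_at, lang, translated FROM utterances"
        " WHERE spoken_at > ? ORDER BY spoken_at, id",
        (since,),
    )
    return cursor.fetchall()
```

A terminal that remembers the timestamp of the last row it received can request only newer rows, which keeps each update small.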
[0262] (Application Example 1)
[0263] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0264] In international conferences and meetings where multiple languages are spoken, it is extremely difficult for participants who speak different languages to quickly and accurately understand what is being said. Therefore, there is a need for a system that eliminates communication barriers caused by language differences and enables smooth communication.
[0265] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0266] In this invention, the server includes means for acquiring audio signals in a meeting, means for converting the audio signals into text data, and means for translating the text data into different languages. This enables speakers of different languages to instantly understand the meeting content and communicate effectively.
[0267] "Audio signals in a meeting" refers to audio data generated during a meeting, which enables the recording of participants' statements and discussions.
[0268] "Means of converting to text data" refers to technologies that enable the process of analyzing audio signals and converting them into digital data as textual information.
[0269] "Means of translation into different languages" refers to machine translation technology used to convert text expressed in one language into another language.
[0270] A "means of recording based on a time series" is a system that records data in the order in which it occurred, making it easier to refer to and analyze later.
[0271] "Means of providing data to display devices in real time" refers to technologies for instantly transmitting translated text data to the user's visual display device.
[0272] "A means of instantly understanding the content of meetings held in different regions, regardless of the location, using participant devices" refers to a system that enables participants to quickly grasp the content of a meeting even when they are in a remote location.
[0273] The system for implementing this invention converts audio during a meeting into text in real time, translates it into different languages, and provides this information to the participants' terminals. The server first receives the audio signal acquired through the microphone. This audio data is transmitted in streaming format based on a communication protocol.
[0274] The server converts received audio data into text data using a speech recognition module. This module incorporates advanced speech recognition algorithms and is capable of accurately analyzing human language. The converted text is then passed to a machine translation module, which translates it into the specified target language. The translation process incorporates the latest natural language processing technologies, enabling smooth, context-aware translations.
[0275] The translated text data is recorded chronologically and transmitted in real time to participants' display devices. This allows participants to grasp the content of the meeting instantly and overcome communication barriers caused by language differences. For example, in international meetings with multilingual speakers, this system enables each participant to understand in their native language what is being said, allowing the meeting to proceed efficiently.
[0276] Furthermore, it is also possible to improve translation accuracy by using a generative AI model. In certain situations, when participants need more specific translations or explanations, detailed explanations and supplementary translations can be generated by inputting the following prompt: "Please develop a system that can translate Japanese voice into English in real time. It will immediately convey the content to participants in other languages and support the smooth progress of the meeting."
[0277] The flow of the specific processing in Application Example 1 will be described using Figure 12.
[0278] Step 1:
[0279] The terminal acquires the voice signal of the meeting through the microphone. The voice data acquired using the microphone device is converted into a digital format and encoded into a streaming format for transmission to the server. The input is an analog voice signal, and the output is the encoded voice data.
[0280] Step 2:
[0281] The server receives the encoded voice data and processes it with the voice recognition module. The voice recognition module analyzes and converts the input voice data into text data using an advanced voice recognition algorithm. In this step, the phonemes and context of the voice are analyzed to generate accurate character information. The input is the encoded voice data, and the output is the text data.
[0282] Step 3:
[0283] The server passes the acquired text data to the translation module for translation into the specified target language. The translation module utilizes machine translation technology and a generative AI model to achieve highly accurate translations considering the context. The input is the source text data, and the output is the translated text data.
[0284] Step 4:
[0285] The server records the translated text data based on the time series and stores it as a history. This record is useful when participants check the content of past meetings. The input is the translated text data, and the output is the storage of the translated text organized in time series.
[0286] Step 5:
[0287] The server transmits the recorded translated text data to the participants' display devices in real time. The participants' terminals display the received data on the screen, and users can immediately grasp the content of the meeting. The input is the real-time translated text data, and the output is the displayed translated text.
[0288] Step 6:
[0289] Users utilize the translation data on the participants' terminals and can understand the content even in meetings in different regions regardless of the venue. Users can check the displayed translation results and immediately participate in the meeting. The input is the translated text displayed in real time, and the output is the improvement of the users' understanding and communication.
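As an illustrative, non-limiting sketch, Steps 2 to 5 above can be expressed as a small server-side pipeline. The names `MeetingServer`, `MinutesEntry`, `fake_recognize`, and `fake_translate` are hypothetical and do not appear in this disclosure; the two stand-in functions only simulate the speech recognition module and the translation module so that the sketch runs without real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MinutesEntry:
    timestamp: float      # seconds since the meeting started
    source_text: str      # output of speech recognition (Step 2)
    translated_text: str  # output of translation (Step 3)


@dataclass
class MeetingServer:
    recognize: Callable[[bytes], str]     # hypothetical speech recognition module
    translate: Callable[[str, str], str]  # hypothetical translation module
    target_language: str
    history: List[MinutesEntry] = field(default_factory=list)

    def handle_chunk(self, timestamp: float, encoded_audio: bytes) -> MinutesEntry:
        text = self.recognize(encoded_audio)                     # Step 2
        translated = self.translate(text, self.target_language)  # Step 3
        entry = MinutesEntry(timestamp, text, translated)
        self.history.append(entry)                               # Step 4: store as history
        self.history.sort(key=lambda e: e.timestamp)             # keep the time series ordered
        return entry                                             # Step 5: payload for terminals


# Stand-ins (assumptions) so the sketch runs without real models.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, language: str) -> str:
    return f"[{language}] {text}"


server = MeetingServer(fake_recognize, fake_translate, "en")
server.handle_chunk(2.0, "second utterance".encode("utf-8"))
server.handle_chunk(1.0, "first utterance".encode("utf-8"))
```

In this sketch the history is re-sorted on every insertion so that chunks arriving out of order still appear chronologically; a real deployment might instead rely on an ordered store.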
[0290] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing using the user's emotion.
[0291] The present invention is a system that not only converts the voice during the meeting into text in real time and translates it into different languages, but also analyzes the emotion of the speaker and complements the information. The system is composed of a server, the terminals used by users, and a communication environment that links them through a network.
[0292] The terminal used by the user during the meeting is enhanced to acquire, in addition to the voice input, voice characteristics (e.g., speaking speed and tone) for analyzing emotion. This characteristic information is captured by the terminal simultaneously with the voice data and transmitted to the server in a streaming format.
[0293] On the server, the received audio data is converted into text data by a speech recognition module. Signal processing in this process utilizes advanced speech recognition algorithms to ensure accuracy. Simultaneously, an emotion engine analyzes the characteristics of the speech and extracts emotional information. The emotion engine analyzes parameters such as intonation and speed to identify the speaker's emotional state.
[0294] Next, the server uses the translation module to translate the converted text into the specified target language. The translated text data is then supplemented with the emotion information extracted by the emotion engine. This adds emotional nuance to the translated content, allowing the recipient to gain a deeper understanding of the intended meaning.
[0295] Furthermore, the server organizes the translated text, including emotional information, and records it as meeting minutes in chronological order. This record is updated in real time and sent to the user's device. Through the device, the user can see the speaker's emotional state in real time along with the translated text.
[0296] As a concrete example, using the emotion recognition function in a multinational conference lets each participant speak in their native language while the speaker's emotions and intentions remain clear after translation into other languages. For instance, if a participant expresses dissatisfaction with a proposal, the emotion engine can capture their tone and manner of speaking and add emotion labels such as "anger" or "dissatisfaction" to the text, so that the intention is accurately conveyed to other participants and misunderstandings in the discussion are avoided.
[0297] The following describes the processing flow.
[0298] Step 1:
[0299] The user captures audio during the meeting through the device's microphone, simultaneously capturing features such as intonation and speed. This allows for the acquisition of audio data and parameters related to emotion.
[0300] Step 2:
[0301] The terminal sends the acquired voice data and its feature parameters to the server in a streaming format. Data compression is applied so that the data reaches the server quickly via the network.
[0302] Step 3:
[0303] The server passes the received voice data to the voice recognition module to convert the voice into text data. A voice recognition algorithm is applied to generate an accurate character string from the voice signal.
[0304] Step 4:
[0305] The server concurrently activates the emotion engine and analyzes the feature parameters related to the voice. The emotion engine analyzes the tone, intonation, and speed of the voice to extract the user's emotion.
[0306] Step 5:
[0307] The server passes the converted text data to the translation module to translate it into the specified target language. A translation algorithm is used to perform a natural translation in line with the context.
[0308] Step 6:
[0309] The server adds the emotion information extracted by the emotion engine to the translated text data. Emotion labels and emotion scores are tagged to the text data to complement the nuance and intention of the speech.
[0310] Step 7:
[0311] The server organizes the translated text containing emotion information as time-series data and creates a transcript that is updated in real time. This data is sent to the user terminal.
[0312] Step 8:
[0313] Users can view translated meeting minutes through their devices and see embedded sentiment information in real time. They can also provide feedback on the content of the minutes as needed to deepen their understanding of the information.
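As an illustrative sketch of Steps 4 and 6 above, the following attaches an emotion label and an emotion score to translated text. The feature names, the thresholds in `estimate_emotion`, and the output format are assumptions made for illustration only; the disclosure does not specify how the emotion engine maps voice features to labels, and a practical engine would use a trained model rather than fixed thresholds.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    speed_wpm: float  # speaking rate in words per minute (assumed feature)
    pitch_hz: float   # mean pitch in hertz (assumed feature)


@dataclass
class Emotion:
    label: str
    score: float  # confidence between 0 and 1


def estimate_emotion(features: VoiceFeatures) -> Emotion:
    # Toy rule standing in for the emotion engine; thresholds are hypothetical.
    if features.speed_wpm > 180 and features.pitch_hz > 220:
        return Emotion("anger", 0.8)
    return Emotion("neutral", 0.5)


def tag_translation(translated_text: str, emotion: Emotion) -> str:
    # Step 6: append the emotion label and score to the translated text.
    return f"{translated_text} (emotion: {emotion.label}, score: {emotion.score:.2f})"


tagged = tag_translation(
    "I disagree with this proposal.",
    estimate_emotion(VoiceFeatures(speed_wpm=200, pitch_hz=250)),
)
```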
[0314] (Example 2)
[0315] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0316] In international conferences and multilingual environments, communication presents challenges not only in terms of language barriers but also in understanding the speaker's intentions and emotions. Translated texts, in particular, tend to lose emotion and nuance, potentially leading to the omission of crucial elements in effective communication. Such situations increase the likelihood of misunderstandings and communication breakdowns.
[0317] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0318] In this invention, the server includes means for analyzing and extracting emotional information from acoustic data, means for adding emotional information to translated text information, and means for arranging and recording the translated text information based on chronological order. This makes it possible to accurately convey emotional nuances along with the content of speech in communication between different languages, thereby facilitating smoother communication.
[0319] "Acoustic data" refers to information, including audio signals, acquired during meetings and other forms of voice communication.
[0320] "Textual information" refers to text data obtained by converting acoustic data using speech recognition technology.
[0321] "Emotional information" refers to information about the speaker's emotional state, extracted from the analysis of acoustic data.
[0322] "Translation" refers to the process of converting textual information recorded in one language into a different target language.
[0323] A "server" is a central computer system that processes audio data, generates text information, translates, performs sentiment analysis, and records data.
[0324] A "user terminal" is a device that receives recorded data in real time and provides information to the user.
[0325] A "time series" is an arrangement of information in which data is sorted according to the order of time.
[0326] The embodiments for carrying out the present invention are shown below.
[0327] The user's device acquires audio data in real time during the meeting using a microphone. This device is equipped with a high-precision speech recognition microphone and software for analyzing audio characteristics (e.g., volume, speed, tone), and simultaneously captures audio data and its characteristic information. The acquired audio data is transmitted to the server in streaming format via a communication protocol (e.g., WebSocket).
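As a minimal sketch of capturing audio data together with one of its characteristics, the following splits 16-bit PCM audio into chunks and computes the volume of each chunk as a root-mean-square amplitude. The PCM format, the chunk size, and the function names are assumptions; the transport itself (for example, a WebSocket connection) is omitted.

```python
import math
import struct
from typing import List, Tuple


def rms_volume(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    count = len(pcm) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def split_into_chunks(pcm: bytes, chunk_bytes: int) -> List[bytes]:
    """Cut the audio into fixed-size pieces suitable for streaming."""
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


def build_packets(pcm: bytes, chunk_bytes: int) -> List[Tuple[bytes, float]]:
    """Pair every audio chunk with its volume so both can be sent together."""
    return [(chunk, rms_volume(chunk)) for chunk in split_into_chunks(pcm, chunk_bytes)]


# Four samples: a quiet pair followed by a louder pair.
pcm = struct.pack("<4h", 1000, -1000, 2000, -2000)
packets = build_packets(pcm, chunk_bytes=4)
```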
[0328] When the server receives audio data, it converts it into text information using a speech recognition module. During this process, the server utilizes a generative AI model and executes advanced speech recognition algorithms to maintain conversion accuracy. Simultaneously, the server uses an emotion engine to analyze the audio characteristics of the audio data and extract emotional information. The emotion engine incorporates a machine learning model to analyze voice intonation and speed changes to identify the speaker's emotions.
[0329] The converted text information is translated by the server into the specified target language via a translation module. The translated text information is then supplemented with the aforementioned sentiment information, making it easier for the emotional nuances contained in the text to be conveyed to the recipient.
[0330] The server organizes the translated text and sentiment data chronologically and records it as meeting minutes. This record is updated in real time and sent to the user's terminal. Through the terminal, the user can view the translated text and the speaker's sentiment.
[0331] A concrete example of its use is in multinational conferences. This system allows each participant to speak in their native language while receiving information translated into other languages, and also deepens their understanding of the other person's emotions and intentions. An example of a prompt for the generative AI model is, "Please input audio data, then perform text conversion, translation, and sentiment analysis." This allows the system to efficiently execute each specified process.
[0332] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0333] Step 1:
[0334] The terminal acquires audio data during a meeting in real time using a microphone. Simultaneously, audio characteristics such as volume, speed, and tone are also captured. The input is the raw audio signal, and the output is digital audio data for analysis. Software on the terminal converts this data into a streaming format and transmits it to the server in a stable manner via a communication protocol.
[0335] Step 2:
[0336] The server receives audio data streamed from the terminal. The input is a stream of audio data. This data is passed to a speech recognition module for conversion into text information. Here, an advanced speech recognition algorithm using a generative AI model operates, converting the data into text with high accuracy even in the presence of noise. The output is the converted text information.
[0337] Step 3:
[0338] The server activates an emotion engine to extract emotional information from acoustic data. The input is acoustic data containing speech characteristics. This engine analyzes changes in voice intonation, speed, and tone to identify the speaker's emotions. Specifically, it uses a machine learning model to perform pattern recognition and generate labels such as "joy," "anger," and "sadness." The output is the analysis results, including the emotion labels.
[0339] Step 4:
[0340] The server passes the text information to the translation module, which translates it into the specified target language. The input is the converted text information. A generative AI model performs grammatically and culturally accurate translation. Sentiment information is also added at this stage, adding emotional nuances to the translated text. The output is the translated text information, including the sentiment information.
[0341] Step 5:
[0342] The server organizes translated text and sentiment information chronologically and records it as meeting minutes. Input consists of translated text and sentiment labels. This data is stored in a database and updated in real time as needed. Output is meeting minutes data organized chronologically.
[0343] Step 6:
[0344] The server provides the latest meeting minutes to the user's terminal in real time. The input is meeting minutes data organized chronologically. The user's terminal receives this and displays translated text and sentiment information on the screen. The output is real-time information provided to the user. This allows the user to gain a deeper understanding of the speaker's intentions and emotions.
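Step 5 above states that the minutes are stored in a database in chronological order. As an illustrative sketch only, the following uses an in-memory SQLite database; the table name, the column names, and the helper functions are hypothetical and are not specified in this disclosure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE minutes (
           spoken_at REAL NOT NULL,        -- seconds since the meeting started
           speaker TEXT NOT NULL,
           translated_text TEXT NOT NULL,
           emotion_label TEXT NOT NULL
       )"""
)


def record(conn, spoken_at, speaker, translated_text, emotion_label):
    # Step 5: store one translated utterance together with its emotion label.
    conn.execute(
        "INSERT INTO minutes VALUES (?, ?, ?, ?)",
        (spoken_at, speaker, translated_text, emotion_label),
    )
    conn.commit()


def latest_minutes(conn):
    # Step 6: read the minutes back in chronological order for delivery.
    return conn.execute(
        "SELECT spoken_at, speaker, translated_text, emotion_label "
        "FROM minutes ORDER BY spoken_at"
    ).fetchall()


# Utterances may arrive out of order; the query restores the time series.
record(conn, 12.5, "B", "I have a concern.", "anger")
record(conn, 3.0, "A", "Let us begin.", "joy")
rows = latest_minutes(conn)
```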
[0345] (Application Example 2)
[0346] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0347] In international communication settings, smooth communication between speakers of different languages is difficult, and accurately conveying emotions and nuances is particularly challenging. Furthermore, while there is a demand for multilingual communication capabilities in home robots and portable devices, there is a lack of technology to provide accurate translations that incorporate emotional information in real time.
[0348] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0349] In this invention, the server includes means for converting speech into text and translating it into different languages, means for adding speaker's emotional information to the translated text, and means for recording the translated text and emotional information in chronological order. This makes it possible to accurately convey the speaker's emotions and intentions across language barriers in communication between people who speak different languages.
[0350] A "meeting" is a place where multiple participants exchange opinions and information and make decisions.
[0351] "Sound" refers to human speech and voice, and is information that can be processed as electrical signals.
[0352] "Text" refers to character information converted from speech, and is data represented in a visually recognizable form.
[0353] "Different languages" refers to a situation where the speaker and listener have different language systems, making direct mutual understanding difficult.
[0354] "Translation" refers to the act of replacing information expressed in one language with information expressed in another language, thereby conveying its meaning.
[0355] "Emotional information" refers to data that indicates the emotional state of a speaker, analyzed based on factors such as intonation, tone, and speed of speech.
[0356] A "time series" refers to an arrangement of information organized according to chronological order, and is a fundamental concept for managing history.
[0357] "Recording" is the act of preserving information and maintaining it in a format that allows for later reference.
[0358] An "information processing terminal" refers to equipment or devices used to display or process audio, text, and translation results.
[0359] To implement this invention, a server, a user-operated information processing terminal, and a communication network connecting them are primarily required. The server is equipped with a speech recognition module that receives audio data and converts it into text data. This speech recognition module utilizes an advanced speech recognition algorithm to perform highly accurate conversions. Furthermore, an emotion engine simultaneously extracts emotional information of the speaker from the characteristics of the audio. The emotion engine generates this information by analyzing parameters such as voice tone and speed.
[0360] The acquired text data is translated in real time into the specified target language using a translation module on the server. This translation module uses the latest machine translation technology to process the text and adds sentiment information to the translated text to accurately convey the speaker's intent and emotions.
[0361] The final translated text data is organized chronologically by the server and provided to information processing terminals in real time. The terminals used by users instantly display the translated text and sentiment information, facilitating smooth communication between languages.
[0362] As a concrete example, if a home robot utilizes this system, the robot can support conversations with visitors who speak different languages. For instance, it could translate "Hello, how are you?" spoken in English into Japanese, append an emotion label such as "(emotion: interest)," and convey the result to the Japanese speaker. This enables natural communication while understanding the visitor's emotions.
[0363] An example of a prompt for the generative AI model is "Please translate and analyze the emotion of the following sentence: 'Hello, how are you?'" which will allow the AI to perform appropriate translation and sentiment analysis.
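As an illustrative sketch, the prompt above can be assembled from the recognized sentence, and a reply can be split back into text and emotion label. The reply format "text (emotion: label)" is an assumption modeled on the home robot example; the disclosure does not specify the output format of the generative AI model, and both function names are hypothetical.

```python
import re
from typing import Optional, Tuple


def build_prompt(sentence: str) -> str:
    # Wrap the recognized sentence in the instruction quoted in the text.
    return (
        "Please translate and analyze the emotion of the following sentence: "
        f"'{sentence}'"
    )


# Assumed reply format: "<translated text> (emotion: <label>)"
_TAGGED = re.compile(r"^(.*?)\s*\(emotion:\s*([^)]+)\)\s*$")


def parse_tagged_reply(reply: str) -> Optional[Tuple[str, str]]:
    """Return (text, emotion label), or None if the reply has no label."""
    match = _TAGGED.match(reply)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


prompt = build_prompt("Hello, how are you?")
parsed = parse_tagged_reply("Hello, how are you? (emotion: interest)")
```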
[0364] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0365] Step 1:
[0366] The user provides voice input through the terminal. The terminal's microphone detects the voice and converts it into a digital audio signal, which the terminal records as audio data.
[0367] Step 2:
[0368] The terminal transmits the audio data directly to the server. This transmission utilizes a network, transferring the data in real time via streaming. Data compression technology is applied during this process to achieve high-speed and efficient data communication. The input is digital audio data, and the output is compressed audio data.
[0369] Step 3:
[0370] The server inputs the received audio data into a speech recognition module. This module analyzes the audio data and converts it into text data. Using speech signal recognition technology, word recognition is performed by analyzing phonemes. The input here is compressed audio data, and the output is text data converted from the audio.
[0371] Step 4:
[0372] The server inputs the generated text data into the translation module and performs translation into the specified language. The translation module uses a machine translation engine to dynamically perform language conversion. The input is text data, and the output is translated text data.
[0373] Step 5:
[0374] Simultaneously, a server-based emotion engine analyzes the original speech feature data to infer the speaker's emotions. This analysis is based on acoustic features such as tone, intonation, and tempo. The input is speech feature data, and the output is data accompanied by emotion labels.
[0375] Step 6:
[0376] The server adds sentiment information to the translated text data and organizes the entire dataset chronologically. This process generates data as translated text with sentiment information. This dataset is stored along with past speech history and can be reused as needed.
[0377] Step 7:
[0378] The server transmits the prepared data to the terminal in real time. The terminal immediately displays the information on the user's screen, allowing the user to simultaneously view the translated content and sentiment information. The output is display data with text and sentiment labels that the user can visually recognize.
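Steps 4 and 5 above run translation and emotion analysis at the same time. As a minimal sketch under stated assumptions, the following runs the two tasks concurrently with `asyncio.gather` and merges the results as in Step 6. The coroutines `translate` and `analyze_emotion` are hypothetical stand-ins for the translation module and the emotion engine, and the `pitch_rising` feature is invented for illustration.

```python
import asyncio


async def translate(text: str, target_language: str) -> str:
    # Stand-in for the translation module (assumption).
    await asyncio.sleep(0)
    return f"[{target_language}] {text}"


async def analyze_emotion(features: dict) -> str:
    # Stand-in for the emotion engine (assumption).
    await asyncio.sleep(0)
    return "interest" if features.get("pitch_rising") else "neutral"


async def process_utterance(text: str, features: dict, target_language: str) -> dict:
    # Steps 4 and 5: run both tasks concurrently and wait for both results.
    translated, emotion = await asyncio.gather(
        translate(text, target_language),
        analyze_emotion(features),
    )
    # Step 6: combine the translated text with the emotion information.
    return {"text": translated, "emotion": emotion}


result = asyncio.run(
    process_utterance("Hello, how are you?", {"pitch_rising": True}, "ja")
)
```

Running the two tasks concurrently means the added emotion analysis need not lengthen the time before the translated text can be sent, provided the emotion engine is no slower than the translation module.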
[0379] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0380] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0381] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0382] [Third Embodiment]
[0383] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0384] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0385] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0386] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0387] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0388] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0389] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0390] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0391] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0392] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0393] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0394] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0395] This invention relates to a system that converts audio from a meeting into text in real time, translates it into different languages, and provides it to the user. The system mainly consists of a server, user terminals, and a communication environment that connects them via a network.
[0396] First, the user's device acquires audio during the meeting via a microphone. This audio data is transmitted to the server in streaming format. The server passes the received audio data to a speech recognition module, which converts it into text data. The speech recognition module uses advanced algorithms to extract text information from the audio, achieving highly accurate text conversion.
[0397] Next, the server passes the text data to the translation module. The translation module uses machine translation technology to translate the text into the specified target language. The translation module utilizes the latest natural language processing technology to perform smooth, context-aware translations.
[0398] Furthermore, the translated text data is recorded chronologically by the server along with the original text data. This recorded data is updated in real time and sent from the server to the user's terminal. Users can view the translated meeting minutes through their terminal and edit and save them as needed.
[0399] As a concrete example, this system can be used in international conferences of multinational corporations. In a conference held in Japan, the system is used to allow English-speaking participants to understand what Japanese-speaking participants are saying. During the conference, discussions are conducted in Japanese, but the server translates them into English in real time and displays them on the participants' devices. This allows participants to immediately grasp the content and facilitates smooth communication.
[0400] The following describes the processing flow.
[0401] Step 1:
[0402] The user captures audio during the meeting through the device's microphone. This audio data is converted into a digital signal and temporarily stored within the device.
[0403] Step 2:
[0404] The terminal transmits the acquired audio data to the server in streaming format. Data compression technology is used during this process to improve communication efficiency.
[0405] Step 3:
[0406] The server passes the received audio data to the speech recognition module. The speech recognition module analyzes the audio signal and converts it into text data. Digital signal processing technology is applied to this conversion process.
[0407] Step 4:
[0408] The server passes the generated text data to the translation module. The translation module applies a machine translation algorithm to translate the text data into the specified target language.
[0409] Step 5:
[0410] The server attaches the translated text data to the original text data, organizes it chronologically, and creates meeting minutes that are updated in real time.
[0411] Step 6:
[0412] The server sends the created meeting minutes data to the user's terminal. The user can then view the meeting minutes, which are updated in real time, through their terminal.
[0413] Step 7:
[0414] Users can make corrections and annotations to the meeting minutes on their own devices as needed, ensuring the consistency and accuracy of the information.
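As an illustrative sketch of Steps 5 and 7 above, each line of the minutes can keep the original text, the machine translation, and any user correction or annotation side by side. The class and field names are hypothetical; the disclosure does not specify how corrections are stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MinutesLine:
    timestamp: float                      # position in the time series
    original_text: str                    # speech recognition result
    translated_text: str                  # machine translation of the original
    corrected_text: Optional[str] = None  # user correction, if any (Step 7)
    annotations: List[str] = field(default_factory=list)

    def display_text(self) -> str:
        # Prefer the user's correction over the machine translation.
        if self.corrected_text is not None:
            return self.corrected_text
        return self.translated_text


def correct(line: MinutesLine, new_text: str) -> None:
    line.corrected_text = new_text


def annotate(line: MinutesLine, note: str) -> None:
    line.annotations.append(note)


line = MinutesLine(5.0, "original utterance", "machine translation")
before = line.display_text()
correct(line, "corrected translation")
annotate(line, "Confirm the figure with the finance team.")
after = line.display_text()
```

Keeping the machine translation unchanged and storing the correction separately preserves the original record, which supports the consistency and accuracy checks described in Step 7.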
[0415] (Example 1)
[0416] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0417] In international conferences where multiple languages are used, language barriers can hinder smooth communication among participants. Conventional technologies have struggled to provide efficient information processing necessary for accurate, real-time translation.
[0418] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0419] In this invention, the server includes an information processing device for acquiring audio data, an information processing device for converting the audio data into text information, and an information processing device for translating the text information into different natural languages. This enables smooth communication even in multilingual meetings by converting audio generated during a meeting into text information in real time, immediately translating that text information, and recording it chronologically.
[0420] "Audio data" refers to information obtained in digital format from audio signals generated during a meeting, and is data acquired by an information processing device.
[0421] "Text information" refers to data in text format converted from audio data, and is generated by an information processing device.
[0422] A "different natural language" is a language other than the language in which the original text information is written, and is the target language of machine translation.
[0423] An "information processing device" is a device that performs a series of processes from acquiring audio data to converting it into text information and then translating it, and includes dedicated software or hardware.
[0424] A "server" is a central processing unit that manages the reception, processing, and translation of audio data, and is a device that communicates with multiple terminals via a network.
[0425] "Providing in real time" means that the process from acquiring audio data to providing the translation results to the user is executed without delay, so that the information is reflected on the user's terminal immediately.
[0426] This invention is a system for facilitating multilingual communication during meetings. The system mainly consists of terminals that acquire and process voice data, servers that convert and translate it into text information, and a communication network that connects them.
[0427] The terminal used by the user is a device that acquires audio during meetings via a microphone. This audio data is transmitted to a server over the network. The terminal is equipped with a high-performance microprocessor and appropriate communication protocols for stable connection and processing.
[0428] The server receives audio data and converts it into text using speech recognition software. For speech recognition, software such as a speech recognition API is used. This software utilizes advanced speech recognition technology to generate accurate text from the audio data.
[0429] The generated text information is translated into different natural languages by a translation module on the server. For example, a machine translation service is used for translation. The translation module utilizes the latest natural language processing technologies to provide contextually accurate translations.
[0430] Translation results are recorded chronologically on the server and transmitted to the user's device in real time. Users can view the translated text information on their device and edit and save it as needed. The device is equipped with a web application using HTML5 and JavaScript, allowing for efficient operation through the user interface.
[0431] As a concrete example, this system can be used in meetings of multinational corporations to enable real-time communication between participants who speak different languages. An example of a prompt for the generative AI model would be, "Please design a system that translates meeting audio in real time and displays it to the participants."
[0432] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0433] Step 1:
[0434] The user's terminal acquires audio during the meeting using a microphone. Audio input is obtained as an analog signal through the microphone device, and audio data is generated by converting this signal to a digital format. The terminal transmits the acquired audio data to the server via the network. A low-latency and highly efficient protocol is used for transmitting the audio data.
[0435] Step 2:
[0436] The server receives audio data transmitted from the terminal. This received data is passed to the speech recognition module. The speech recognition module takes the received audio data as input and uses an advanced phoneme analysis algorithm to extract text information from the speech. This process outputs the audio data as text data.
[0437] Step 3:
[0438] The server passes the text data obtained from the speech recognition module to the translation module. This module receives the text data as input and translates it into a specified different natural language using a probabilistic translation method. The translated text information is output while maintaining accuracy and recorded on the server.
[0439] Step 4:
[0440] The server records translated text information in a database in chronological order. The recording process includes timestamps to preserve information about when the utterance occurred. This data is systematically organized for later retrieval and management.
[0441] Step 5:
[0442] The server transmits recorded translation data to the user's device in real time. Push technology is used for this transmission to ensure the information is immediately reflected on the device. The user's device has the functionality to display the received translation data on its screen, providing an intuitive interface for viewing and editing the information.
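Steps 4 and 5 above (timestamped chronological recording followed by push delivery) can be sketched as follows. This is only an illustrative sketch: the class name `TranslationLog`, its methods, and the use of in-process callbacks in place of a real push channel and database are all assumptions made here, not details given in this disclosure.

```python
import bisect
from typing import Callable, List, Tuple

# (utterance timestamp in seconds, translated text)
Record = Tuple[float, str]


class TranslationLog:
    """Hypothetical in-memory stand-in for the server-side database."""

    def __init__(self) -> None:
        self._records: List[Record] = []
        self._subscribers: List[Callable[[Record], None]] = []

    def subscribe(self, callback: Callable[[Record], None]) -> None:
        """Register a terminal; the callback stands in for a push channel."""
        self._subscribers.append(callback)

    def record(self, timestamp: float, text: str) -> None:
        """Step 4: store the record so that the list stays time-ordered.
        Step 5: push the new record to every subscribed terminal."""
        rec = (timestamp, text)
        bisect.insort(self._records, rec)
        for callback in self._subscribers:
            callback(rec)

    def between(self, start: float, end: float) -> List[Record]:
        """Retrieve records whose timestamp t satisfies start <= t < end."""
        lo = bisect.bisect_left(self._records, (start,))
        hi = bisect.bisect_left(self._records, (end,))
        return self._records[lo:hi]
```

Keeping the list sorted at insertion time makes later retrieval by time range a pair of binary searches, which matches the stated goal of organizing the data for later retrieval and management.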
[0443] (Application Example 1)
[0444] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0445] In international conferences and meetings where multiple languages are spoken, it is extremely difficult for participants who speak different languages to quickly and accurately understand what is being said. Therefore, there is a need for a system that eliminates communication barriers caused by language differences and enables smooth communication.
[0446] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0447] In this invention, the server includes means for acquiring audio signals in a meeting, means for converting the audio signals into text data, and means for translating the text data into different languages. This enables speakers of different languages to instantly understand the meeting content and communicate effectively.
[0448] "Audio signals in a meeting" refers to audio data generated during a meeting, which enables the recording of participants' statements and discussions.
[0449] "Means of converting to text data" refers to technologies that enable the process of analyzing audio signals and converting them into digital data as textual information.
[0450] "Means of translation into different languages" refers to machine translation technology used to convert text expressed in one language into another language.
[0451] A "means of recording based on a time series" is a system that records data in the order in which it occurred, making it easier to refer to and analyze later.
[0452] "Means of providing data to display devices in real time" refers to technologies for instantly transmitting translated text data to the user's visual display device.
[0453] "A means of instantly understanding the content of meetings held in different regions, regardless of the location, using participant devices" refers to a system that enables participants to quickly grasp the content of a meeting even when they are in a remote location.
[0454] The system for implementing this invention converts audio during a meeting into text in real time, translates it into different languages, and provides this information to the participants' terminals. The server first receives the audio signal acquired through the microphone. This audio data is transmitted in streaming format based on a communication protocol.
[0455] The server converts received audio data into text data using a speech recognition module. This module incorporates advanced speech recognition algorithms and is capable of accurately analyzing human language. The converted text is then passed to a machine translation module, which translates it into the specified target language. The translation process incorporates the latest natural language processing technologies, enabling smooth, context-aware translations.
[0456] The translated text data is recorded chronologically and transmitted in real time to participants' display devices. This allows participants to instantly grasp the content of the meeting and overcome communication barriers caused by language differences. For example, in international meetings with multilingual speakers, this system enables each participant to instantly understand what is said in their native language, allowing for efficient meeting progress.
[0457] Furthermore, it is possible to improve translation accuracy using generative AI models. In specific situations where participants require more specific translations or explanations, they can enter the following prompt to generate detailed explanations or supplementary translations: "Please develop a system that translates Japanese audio into English in real time. This will instantly convey the content to participants speaking other languages and help facilitate the smooth running of meetings."
[0458] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0459] Step 1:
[0460] The terminal acquires the conference audio signal through the microphone. It converts the audio signal acquired by the microphone device into a digital format and encodes it into a streaming format for transmission to the server. The input is an analog audio signal, and the output is encoded audio data.
[0461] Step 2:
[0462] The server receives encoded audio data and processes it with a speech recognition module. The speech recognition module uses advanced speech recognition algorithms to analyze and convert the input audio data into text data. This step analyzes the phonemes and context of the speech to generate accurate textual information. The input is encoded audio data, and the output is text data.
[0463] Step 3:
[0464] The server passes the acquired text data to the translation module, which then translates it into the specified target language. The translation module utilizes machine translation technology and generative AI models to achieve highly accurate translations that take context into account. The input is the original text data, and the output is the translated text data.
[0465] Step 4:
[0466] The server records translated text data chronologically and stores it as a history. This record helps participants review past meeting content. The input is translated text data, and the output is a storage of translated text organized chronologically.
[0467] Step 5:
[0468] The server transmits the recorded translated text data to participants' display devices in real time. Participants' devices display the received data on their screens, allowing users to immediately grasp the meeting content. The input is real-time translated text data, and the output is the displayed translated text.
[0469] Step 6:
[0470] Users utilize translation data on their participant devices to understand the content of meetings held in different regions, regardless of location. Users can check the displayed translation results and immediately join the meeting. The input is the real-time displayed translated text, and the output is improved user understanding and communication.
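Step 1 above (digitizing the audio signal and encoding it for streaming) can be illustrated with the following sketch. The encoding shown, 16-bit little-endian PCM split into fixed-size frames with sequence numbers, is an assumption chosen for illustration; this disclosure does not specify a sample format, frame size, or codec. The function names `to_pcm16` and `frames` are likewise hypothetical.

```python
import struct
from typing import Iterator, List, Tuple


def to_pcm16(samples: List[float]) -> bytes:
    """Quantize samples in the range -1.0..1.0 to 16-bit little-endian PCM.
    Out-of-range samples are clipped before quantization."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def frames(pcm: bytes, frame_bytes: int = 640) -> Iterator[Tuple[int, bytes]]:
    """Split PCM data into numbered frames for streaming.
    640 bytes corresponds to 20 ms of 16 kHz mono 16-bit audio
    (an assumed format)."""
    for seq, start in enumerate(range(0, len(pcm), frame_bytes)):
        yield seq, pcm[start:start + frame_bytes]
```

The sequence number attached to each frame would let a receiver detect missing or reordered frames, which is one common way to keep a stream usable over an unreliable network.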
[0471] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing using the user's emotions.
[0472] This invention is a system that not only converts audio during meetings into text in real time and translates it into different languages, but also analyzes the speaker's emotions and supplements that information. The system consists of a server, user terminals, and a communication environment that links them via a network.
[0473] The terminals used by users during meetings acquire not only voice input but also voice characteristics (e.g., speech speed and tone) for emotion analysis. This characteristic information is captured by the terminal simultaneously with the voice data and transmitted to the server in streaming format.
[0474] On the server, the received audio data is converted into text data by a speech recognition module. Signal processing in this process utilizes advanced speech recognition algorithms to ensure accuracy. Simultaneously, an emotion engine analyzes the characteristics of the speech and extracts emotional information. The emotion engine analyzes parameters such as intonation and speed to identify the speaker's emotional state.
[0475] Next, the server uses the translation module to translate the converted text into the specified target language. The translated text data is then supplemented with the emotional information extracted by the emotion engine. This adds emotional nuance to the translated content, allowing the recipient to gain a deeper understanding of the intended meaning.
[0476] Furthermore, the server organizes the translated text, including emotional information, and records it as meeting minutes in chronological order. This record is updated in real time and sent to the user's device. Through the device, the user can see the speaker's emotional state in real time along with the translated text.
[0477] As a concrete example, by utilizing the emotion recognition function in a multinational conference, each participant can speak in their native language while the emotions and intentions of the speaker are conveyed more clearly when the speech is translated into other languages. For instance, if a participant expresses dissatisfaction with a proposal, the emotion engine can capture their tone and manner of speaking and add emotion labels such as "anger" or "dissatisfaction" to the text, ensuring that the intention is accurately conveyed to other participants and enabling discussions that avoid misunderstandings.
[0478] The following describes the processing flow.
[0479] Step 1:
[0480] The terminal captures audio during the meeting through its microphone, simultaneously capturing features such as intonation and speed. This allows for the acquisition of audio data and parameters related to emotion.
[0481] Step 2:
[0482] The terminal transmits the acquired audio data and its characteristic parameters to the server in streaming format. Data compression technology is applied to ensure rapid transmission to the server over the network.
[0483] Step 3:
[0484] The server passes the received audio data to the speech recognition module, which applies a speech recognition algorithm to convert the audio signal into text data.
[0485] Step 4:
[0486] The server simultaneously activates the emotion engine and analyzes feature parameters related to the voice. The emotion engine analyzes the tone, intonation, and speed of the voice to extract the user's emotions.
[0487] Step 5:
[0488] The server passes the converted text data to the translation module, which translates it into the specified target language. A translation algorithm is used to produce a contextually natural translation.
[0489] Step 6:
[0490] The server adds the emotional information extracted by the emotion engine to the translated text data. Emotion labels and emotion scores are attached to the text data as tags, supplementing the nuances and intent of the statements.
[0491] Step 7:
[0492] The server organizes translated text, including sentiment information, as time-series data and creates meeting minutes that are updated in real time. This data is then sent to the user's terminal.
[0493] Step 8:
[0494] Users can view translated meeting minutes through their devices and see embedded sentiment information in real time. They can also provide feedback on the content of the minutes as needed to deepen their understanding of the information.
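Steps 4 and 6 above (estimating an emotion from voice features and tagging the translated text with a label and a score) can be sketched as follows. The rule-based classifier is a deliberately simple placeholder, not the emotion engine of this disclosure: the thresholds, the score values, and the names `VoiceFeatures`, `estimate_emotion`, and `tag_translation` are all assumptions made for illustration. Only the labels "anger" and "dissatisfaction" come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class VoiceFeatures:
    speed: float       # speaking rate, e.g. syllables per second
    intonation: float  # pitch variation, normalized to the range 0..1


def estimate_emotion(features: VoiceFeatures) -> Tuple[str, float]:
    """Placeholder for the emotion engine (Step 4).
    The thresholds and scores below are illustrative assumptions."""
    if features.speed > 6.0 and features.intonation > 0.7:
        return ("anger", 0.8)
    if features.speed < 3.0 and features.intonation < 0.3:
        return ("dissatisfaction", 0.6)
    return ("neutral", 0.5)


def tag_translation(translated: str,
                    features: VoiceFeatures) -> Dict[str, Union[str, float]]:
    """Step 6: attach an emotion label and score to the translated text."""
    label, score = estimate_emotion(features)
    return {"text": translated, "emotion": label, "score": score}
```

For example, `tag_translation("I cannot accept this proposal.", VoiceFeatures(7.0, 0.9))` yields an entry labeled "anger" under these assumed thresholds. A practical emotion engine would replace the rules with a trained model.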
[0495] (Example 2)
[0496] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0497] In international conferences and multilingual environments, communication presents challenges not only in terms of language barriers but also in understanding the speaker's intentions and emotions. Translated texts, in particular, tend to lose emotion and nuance, potentially leading to the omission of crucial elements in effective communication. Such situations increase the likelihood of misunderstandings and communication breakdowns.
[0498] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0499] In this invention, the server includes means for analyzing and extracting emotional information from acoustic data, means for adding emotional information to translated text information, and means for arranging and recording the translated text information based on chronological order. This makes it possible to accurately convey emotional nuances along with the content of speech in communication between different languages, thereby facilitating smoother communication.
[0500] "Acoustic data" refers to audio data, including audio signals, acquired during meetings and other forms of voice communication.
[0501] "Textual information" refers to text data obtained by converting acoustic data using speech recognition technology.
[0502] "Emotional information" refers to information about the speaker's emotional state, extracted from the analysis of acoustic data.
[0503] "Translation" is the process of converting text information recorded in one language into a different target language.
[0504] A "server" is a central computer system that processes audio data, generates text information, translates, performs sentiment analysis, and records data.
[0505] A "user terminal" is a device that receives recorded data in real time and provides information to the user.
[0506] A "time series" is an arrangement of information in which data is sorted according to the order of time.
[0507] The embodiments for carrying out the present invention are shown below.
[0508] The user's device acquires audio data in real time during the meeting using a microphone. This device is equipped with a high-precision speech recognition microphone and software for analyzing audio characteristics (e.g., volume, speed, tone), and simultaneously captures audio data and its characteristic information. The acquired audio data is transmitted to the server in streaming format via a communication protocol (e.g., WebSocket).
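As an illustration of how terminal-side software might compute audio characteristics alongside the audio data, the following sketch derives two simple per-frame measures. These are assumptions chosen for illustration, not the analysis specified in this disclosure: root-mean-square amplitude is used as a volume measure, and zero-crossing rate is used as a crude indicator of how high-pitched or noisy a frame is. Estimating speaking speed would require further processing that is not shown.

```python
import math
from typing import List


def rms_volume(samples: List[float]) -> float:
    """Root-mean-square amplitude of a frame, used here as a volume measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ.
    Higher values suggest more high-frequency content in the frame."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)
```

Values like these could be sent to the server as characteristic information together with each audio frame, leaving the actual emotion estimation to the server-side emotion engine.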
[0509] When the server receives audio data, it converts it into text information using a speech recognition module. During this process, the server utilizes a generative AI model and executes advanced speech recognition algorithms to maintain conversion accuracy. Simultaneously, the server uses an emotion engine to analyze the audio characteristics of the audio data and extract emotional information. The emotion engine incorporates a machine learning model to analyze voice intonation and speed changes to identify the speaker's emotions.
[0510] The converted text information is translated by the server into the specified target language via a translation module. The translated text information is then supplemented with the aforementioned sentiment information, making it easier for the emotional nuances contained in the text to be conveyed to the recipient.
[0511] The server organizes the translated text and sentiment data chronologically and records it as meeting minutes. This record is updated in real time and sent to the user's terminal. Through the terminal, the user can view the translated text and the speaker's sentiment.
[0512] A concrete example of its use is in multinational conferences. This system allows each participant to speak in their native language while receiving information translated into other languages, and also deepens their understanding of the other person's emotions and intentions. An example of a prompt for the generative AI model is, "Please input audio data, then perform text conversion, translation, and sentiment analysis." This allows the system to efficiently execute each specified process.
[0513] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0514] Step 1:
[0515] The terminal acquires audio data during a meeting in real time using a microphone. Simultaneously, audio characteristics such as volume, speed, and tone are also captured. The input is the raw audio signal, and the output is digital audio data for analysis. Software on the terminal converts this data into a streaming format and transmits it to the server in a stable manner via a communication protocol.
[0516] Step 2:
[0517] The server receives audio data streamed from the terminal. The input is a stream of audio data. This data is passed to a speech recognition module for conversion into text information. Here, an advanced speech recognition algorithm using a generative AI model operates, converting the data into text with high accuracy even in the presence of noise. The output is the converted text information.
[0518] Step 3:
[0519] The server activates an emotion engine to extract emotional information from acoustic data. The input is acoustic data containing speech characteristics. This engine analyzes changes in voice intonation, speed, and tone to identify the speaker's emotions. Specifically, it uses a machine learning model to perform pattern recognition and generate labels such as "joy," "anger," and "sadness." The output is the analysis results, including the emotion labels.
[0520] Step 4:
[0521] The server passes the text information to the translation module, which translates it into the specified target language. The input is the converted text information. A generative AI model performs grammatically and culturally accurate translation. Sentiment information is also added at this stage, adding emotional nuances to the translated text. The output is the translated text information, including the sentiment information.
[0522] Step 5:
[0523] The server organizes translated text and sentiment information chronologically and records it as meeting minutes. Input consists of translated text and sentiment labels. This data is stored in a database and updated in real time as needed. Output is meeting minutes data organized chronologically.
[0524] Step 6:
[0525] The server provides the latest meeting minutes to the user's terminal in real time. The input is meeting minutes data organized chronologically. The user's terminal receives this and displays translated text and sentiment information on the screen. The output is real-time information provided to the user. This allows the user to gain a deeper understanding of the speaker's intentions and emotions.
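Steps 5 and 6 above can be illustrated by showing one possible form of the minutes data sent to the terminal. The JSON layout, the field names (`type`, `entries`, `time`, `text`, `emotion`), and the function name `minutes_message` are hypothetical and are introduced here only as a sketch; this disclosure does not define a message format.

```python
import json
from typing import Dict, List


def minutes_message(entries: List[Dict]) -> str:
    """Build a message carrying the meeting minutes in chronological order.
    Each entry is assumed to hold a time, translated text, and emotion label."""
    ordered = sorted(entries, key=lambda e: e["time"])
    return json.dumps(
        {"type": "minutes_update", "entries": ordered},
        ensure_ascii=False,  # keep non-ASCII text (e.g. Japanese) readable
    )
```

A terminal receiving such a message, for example over the WebSocket connection mentioned earlier, could parse it and display each translated statement together with its emotion label.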
[0526] (Application Example 2)
[0527] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0528] In international communication settings, smooth communication between speakers of different languages is difficult, and accurately conveying emotions and nuances is particularly challenging. Furthermore, while there is a demand for multilingual communication capabilities in home robots and portable devices, there is a lack of technology to provide accurate translations that incorporate emotional information in real time.
[0529] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0530] In this invention, the server includes means for converting speech into text and translating it into different languages, means for adding speaker's emotional information to the translated text, and means for recording the translated text and emotional information in chronological order. This makes it possible to accurately convey the speaker's emotions and intentions across language barriers in communication between people who speak different languages.
[0531] A "meeting" is a place where multiple participants exchange opinions and information and make decisions.
[0532] "Sound" refers to human speech and voice, and is information that can be processed as electrical signals.
[0533] "Text" refers to character information converted from speech, and is data represented in a visually recognizable form.
[0534] "Different languages" refers to a situation where the speaker and listener have different language systems, making direct mutual understanding difficult.
[0535] "Translation" is the act of replacing information expressed in one language with information expressed in another language, thereby conveying its meaning.
[0536] "Emotional information" refers to data that indicates the emotional state of a speaker, analyzed based on factors such as intonation, tone, and speed of speech.
[0537] A "time series" refers to an arrangement of information organized according to chronological order, and is a fundamental concept for managing history.
[0538] "Recording" is the act of preserving information and maintaining it in a format that allows for later reference.
[0539] An "information processing terminal" refers to equipment or devices used to display or process audio, text, and translation results.
[0540] To implement this invention, a server, a user-operated information processing terminal, and a communication network connecting them are primarily required. The server is equipped with a speech recognition module that receives audio data and converts it into text data. This speech recognition module utilizes an advanced speech recognition algorithm to perform highly accurate conversions. Furthermore, an emotion engine simultaneously extracts emotional information of the speaker from the characteristics of the audio. The emotion engine generates this information by analyzing parameters such as voice tone and speed.
[0541] The acquired text data is translated in real time into the specified target language using a translation module on the server. This translation module uses the latest machine translation technology to process the text and adds sentiment information to the translated text to accurately convey the speaker's intent and emotions.
[0542] The final translated text data is organized chronologically by the server and provided to information processing terminals in real time. The terminals used by users instantly display the translated text and sentiment information, facilitating smooth communication between languages.
[0543] As a concrete example, if a home robot utilizes this system, the robot can support conversations with visitors who speak different languages. For instance, it could translate "Hello, how are you?" spoken in English into Japanese, attach an emotion label such as "(emotion: interest)" to the translation, and convey the result to the Japanese speaker. This enables natural communication while understanding the visitor's emotions.
[0544] An example of a prompt for the generative AI model is "Please translate and analyze the emotion of the following sentence: 'Hello, how are you?'" which will allow the AI to perform appropriate translation and sentiment analysis.
[0545] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0546] Step 1:
[0547] The user provides voice input through the terminal. The terminal's microphone detects the voice and converts it into a digital signal, which the terminal records as audio data.
[0548] Step 2:
[0549] The terminal transmits the audio data directly to the server. This transmission utilizes a network, transferring the data in real time via streaming. Data compression technology is applied during this process to achieve high-speed and efficient data communication. The input is digital audio data, and the output is compressed audio data.
[0550] Step 3:
[0551] The server inputs the received audio data into a speech recognition module. This module analyzes the audio data and converts it into text data. Using speech signal recognition technology, word recognition is performed by analyzing phonemes. The input here is compressed audio data, and the output is text data converted from the audio.
[0552] Step 4:
[0553] The server inputs the generated text data into the translation module and performs translation into the specified language. The translation module uses a machine translation engine to dynamically perform language conversion. The input is text data, and the output is translated text data.
[0554] Step 5:
[0555] Simultaneously, a server-based emotion engine analyzes the original speech feature data to infer the speaker's emotions. This analysis is based on acoustic features such as tone, intonation, and tempo. The input is speech feature data, and the output is data accompanied by emotion labels.
[0556] Step 6:
[0557] The server adds sentiment information to the translated text data and organizes the entire dataset chronologically. This process generates data as translated text with sentiment information. This dataset is stored along with past speech history and can be reused as needed.
[0558] Step 7:
[0559] The server transmits the prepared data to the terminal in real time. The terminal immediately displays the information on the user's screen, allowing the user to simultaneously view the translated content and sentiment information. The output is display data with text and sentiment labels that the user can visually recognize.
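Steps 4 to 6 above describe translation and emotion analysis running simultaneously, followed by merging of the two results. A minimal sketch of that arrangement is shown below. The functions `translate` and `infer_emotion` are hypothetical stubs standing in for the translation module and the emotion engine, the romanized Japanese output and the 0.5 threshold are assumptions for illustration, and a thread pool is only one of several ways the two analyses could be run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def translate(text: str) -> str:
    """Stub for the translation module (assumption: a lookup table)."""
    table = {"Hello, how are you?": "Konnichiwa, ogenki desu ka?"}
    return table.get(text, text)


def infer_emotion(features: Dict[str, float]) -> str:
    """Stub for the emotion engine (assumption: a single threshold rule)."""
    return "interest" if features.get("intonation", 0.0) > 0.5 else "neutral"


def translate_with_emotion(text: str, features: Dict[str, float]) -> str:
    """Run translation and emotion inference concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        translation = pool.submit(translate, text)
        emotion = pool.submit(infer_emotion, features)
        return f"{translation.result()} (emotion: {emotion.result()})"
```

Because the emotion engine works from the original speech features rather than from the text, it does not need to wait for speech recognition or translation to finish, which is what makes the parallel arrangement possible.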
[0560] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0561] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0562] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0563] [Fourth Embodiment]
[0564] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0565] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0566] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0567] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0568] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0569] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0570] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0571] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
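As a minimal sketch of how an estimated emotion could be turned into commands for the controlled object 443, the following maps an emotion label to an eye LED color and an arm motor angle. The class name `Expression`, the function `expression_for`, the labels, and all colors and angles are hypothetical values chosen for illustration; the present disclosure does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Expression:
    eye_led_rgb: Tuple[int, int, int]  # illumination state of the eye LEDs
    arm_angle_deg: float               # target angle for the arm motors


# Hypothetical mapping: the source does not define colors or angles.
NEUTRAL = Expression((255, 255, 255), 0.0)
EXPRESSIONS: Dict[str, Expression] = {
    "joy": Expression((255, 200, 0), 45.0),
    "anger": Expression((255, 0, 0), 10.0),
    "sadness": Expression((0, 0, 255), -20.0),
}


def expression_for(emotion: str) -> Expression:
    """Return actuator targets for an emotion label, falling back to neutral."""
    return EXPRESSIONS.get(emotion, NEUTRAL)
```

An unknown label falls back to the neutral expression, so the robot 414 would keep a defined posture even when no emotion is recognized.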
[0572] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0573] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0574] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0575] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0576] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0577] This invention relates to a system that converts audio from a meeting into text in real time, translates it into different languages, and provides it to the user. The system mainly consists of a server, user terminals, and a communication environment that connects them via a network.
[0578] First, the user's device acquires audio during the meeting via a microphone. This audio data is transmitted to the server in streaming format. The server passes the received audio data to a speech recognition module, which converts it into text data. The speech recognition module uses advanced algorithms to extract text information from the audio, achieving highly accurate text conversion.
[0579] Next, the server passes the text data to the translation module. The translation module uses machine translation technology to translate the text into the specified target language. The translation module utilizes the latest natural language processing technology to perform smooth, context-aware translations.
[0580] Furthermore, the translated text data is recorded chronologically by the server along with the original text data. This recorded data is updated in real time and sent from the server to the user's terminal. Users can view the translated meeting minutes through their terminal and edit and save them as needed.
[0581] As a concrete example, this system can be used in international conferences of multinational corporations. In a conference held in Japan, the system is used to allow English-speaking participants to understand what Japanese-speaking participants are saying. During the conference, discussions are conducted in Japanese, but the server translates them into English in real time and displays them on the participants' devices. This allows participants to immediately grasp the content and facilitates smooth communication.
[0582] The following describes the processing flow.
[0583] Step 1:
[0584] The user captures audio during the meeting through the device's microphone. This audio data is converted into a digital signal and temporarily stored within the device.
[0585] Step 2:
[0586] The terminal transmits the acquired audio data to the server in streaming format. Data compression technology is used during this process to improve communication efficiency.
[0587] Step 3:
[0588] The server passes the received audio data to the speech recognition module. The speech recognition module analyzes the audio signal and converts it into text data. Digital signal processing technology is applied to this conversion process.
[0589] Step 4:
[0590] The server passes the generated text data to the translation module. The translation module applies a machine translation algorithm to translate the text data into the specified target language.
[0591] Step 5:
[0592] The server attaches the translated text data to the original text data, organizes it chronologically, and creates meeting minutes that are updated in real time.
[0593] Step 6:
[0594] The server sends the created meeting minutes data to the user's terminal. The user can then view the meeting minutes, which are updated in real time, through their terminal.
[0595] Step 7:
[0596] Users can make corrections and annotations to the meeting minutes on their own devices as needed, ensuring the consistency and accuracy of the information.
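The server-side portion of the above flow (Steps 3 to 6) can be sketched as follows. This is only an illustrative outline under stated assumptions: `MinutesServer`, `MinutesEntry`, `fake_recognize`, and `fake_translate` are hypothetical names, and the recognition and translation functions are stand-in stubs rather than the speech recognition module or translation module of the present disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MinutesEntry:
    timestamp: float   # seconds since the start of the meeting (assumed unit)
    original: str      # text produced by speech recognition
    translated: str    # text produced by translation


@dataclass
class MinutesServer:
    recognize: Callable[[bytes], str]        # Step 3: audio -> text
    translate: Callable[[str, str], str]     # Step 4: text -> target language
    target_lang: str = "en"
    minutes: List[MinutesEntry] = field(default_factory=list)

    def handle_chunk(self, timestamp: float, audio: bytes) -> MinutesEntry:
        text = self.recognize(audio)
        translated = self.translate(text, self.target_lang)
        entry = MinutesEntry(timestamp, text, translated)
        # Step 5: keep the original and translated text together, in time order.
        self.minutes.append(entry)
        self.minutes.sort(key=lambda e: e.timestamp)
        # Step 6: the caller would send this entry to the terminals.
        return entry


# Stand-in stubs (assumptions): real modules would replace these.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


FAKE_DICTIONARY: Dict[str, str] = {"こんにちは": "Hello", "ありがとう": "Thank you"}


def fake_translate(text: str, target_lang: str) -> str:
    return FAKE_DICTIONARY.get(text, text)
```

Sorting on every insertion keeps the minutes chronological even if chunks arrive out of order; a production system would likely use a more efficient ordered structure.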
[0597] (Example 1)
[0598] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0599] In international conferences where multiple languages are used, language barriers can hinder smooth communication among participants. Conventional technologies have struggled to provide efficient information processing necessary for accurate, real-time translation.
[0600] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0601] In this invention, the server includes an information processing device for acquiring audio data, an information processing device for converting the audio data into text information, and an information processing device for translating the text information into different natural languages. This enables smooth communication even in multilingual meetings by converting audio generated during a meeting into text information in real time, immediately translating that text information, and recording it chronologically.
[0602] "Audio data" refers to information obtained in digital format from audio signals generated during a meeting, and is data acquired by an information processing device.
[0603] "Text information" refers to data in text format converted from audio data, and is generated by an information processing device.
[0604] A "different natural language" is a language other than the language in which the original text information is written, and is the target language of the machine translation.
[0605] An "information processing device" is a device that performs a series of processes from acquiring audio data to converting it into text information and then translating it, and includes dedicated software or hardware.
[0606] A "server" is a central processing unit that manages the reception, processing, and translation of audio data, and is a device that communicates with multiple terminals via a network.
[0607] "Providing in real time" means that the process from acquiring audio data to providing the translation results to the user is executed without delay, and the information is reflected immediately.
[0608] This invention is a system for facilitating multilingual communication during meetings. The system mainly consists of terminals that acquire and process voice data, servers that convert and translate it into text information, and a communication network that connects them.
[0609] The terminal used by the user is a device that acquires audio during meetings via a microphone. This audio data is transmitted to a server over the network. The terminal is equipped with a high-performance microprocessor and appropriate communication protocols for stable connection and processing.
[0610] The server receives audio data and converts it into text using speech recognition software. For speech recognition, software such as a speech recognition API is used. This software utilizes advanced speech recognition technology to generate accurate text from the audio data.
[0611] The generated text information is translated into different natural languages by a translation module on the server. For example, a machine translation service is used for translation. The translation module utilizes the latest natural language processing technologies to provide contextually accurate translations.
[0612] Translation results are recorded chronologically on the server and transmitted to the user's device in real time. Users can view the translated text information on their device and edit and save it as needed. The device is equipped with a web application using HTML5 and JavaScript, allowing for efficient operation through the user interface.
[0613] As a concrete example, this system can be used in meetings of multinational corporations to enable real-time communication between participants who speak different languages. An example of a prompt for the generative AI model would be, "Please design a system that translates meeting audio in real time and displays it to the participants."
[0614] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0615] Step 1:
[0616] The user's terminal acquires audio during the meeting using a microphone. Audio input is obtained as an analog signal through the microphone device, and audio data is generated by converting this signal to a digital format. The terminal transmits the acquired audio data to the server via the network. A low-latency and highly efficient protocol is used for transmitting the audio data.
[0617] Step 2:
[0618] The server receives audio data transmitted from the terminal. This received data is passed to the speech recognition module. The speech recognition module takes the received audio data as input and uses an advanced phoneme analysis algorithm to extract text information from the speech. This process outputs the audio data as text data.
[0619] Step 3:
[0620] The server passes the text data obtained from the speech recognition module to the translation module. This module receives the text data as input and translates it into a specified different natural language using a probabilistic translation method. The translated text information is output while maintaining accuracy and recorded on the server.
[0621] Step 4:
[0622] The server records translated text information in a database in chronological order. The recording process includes timestamps to preserve information about when the utterance occurred. This data is systematically organized for later retrieval and management.
[0623] Step 5:
[0624] The server transmits recorded translation data to the user's device in real time. Push technology is used for this transmission to ensure the information is immediately reflected on the device. The user's device has the functionality to display the received translation data on its screen, providing an intuitive interface for viewing and editing the information.
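Steps 4 and 5 above (timestamped recording in a database and push delivery) could look roughly like the following sketch. It assumes an in-memory SQLite database and plain callback functions in place of a real push channel; the class `TranslationStore`, the table `utterances`, and its column names are hypothetical and are not taken from the present disclosure.

```python
import sqlite3
from typing import Callable, List, Tuple

Subscriber = Callable[[float, str], None]


class TranslationStore:
    """Records translated utterances with timestamps and notifies subscribers."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE utterances ("
            "spoken_at REAL NOT NULL, lang TEXT NOT NULL, translated TEXT NOT NULL)"
        )
        self.subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        # Each callback stands in for one terminal connected by push delivery.
        self.subscribers.append(callback)

    def record(self, spoken_at: float, lang: str, translated: str) -> None:
        # Step 4: store the translation together with the utterance time.
        self.db.execute(
            "INSERT INTO utterances VALUES (?, ?, ?)", (spoken_at, lang, translated)
        )
        self.db.commit()
        # Step 5: push the new entry to every terminal immediately.
        for callback in self.subscribers:
            callback(spoken_at, translated)

    def history(self) -> List[Tuple[float, str]]:
        # Chronological view for later retrieval.
        return self.db.execute(
            "SELECT spoken_at, translated FROM utterances ORDER BY spoken_at"
        ).fetchall()
```

Pushing from inside `record` means a terminal sees each entry in arrival order, while `history` always returns utterance-time order.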
[0625] (Application Example 1)
[0626] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0627] In international conferences and meetings where multiple languages are spoken, it is extremely difficult for participants who speak different languages to quickly and accurately understand what is being said. Therefore, there is a need for a system that eliminates communication barriers caused by language differences and enables smooth communication.
[0628] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0629] In this invention, the server includes means for acquiring audio signals in a meeting, means for converting the audio signals into text data, and means for translating the text data into different languages. This enables speakers of different languages to instantly understand the meeting content and communicate effectively.
[0630] "Audio signals in a meeting" refers to audio data generated during a meeting, which enables the recording of participants' statements and discussions.
[0631] "Means of converting to text data" refers to technologies that enable the process of analyzing audio signals and converting them into digital data as textual information.
[0632] "Means of translation into different languages" refers to machine translation technology used to convert text expressed in one language into another language.
[0633] A "means of recording based on a time series" is a system that records data in the order in which it occurred, making it easier to refer to and analyze later.
[0634] "Means of providing data to display devices in real time" refers to technologies for instantly transmitting translated text data to the user's visual display device.
[0635] "A means of instantly understanding the content of meetings held in different regions, regardless of the location, using participant devices" refers to a system that enables participants to quickly grasp the content of a meeting even when they are in a remote location.
[0636] The system for implementing this invention converts audio during a meeting into text in real time, translates it into different languages, and provides this information to the participants' terminals. The server first receives the audio signal acquired through the microphone. This audio data is transmitted in streaming format based on a communication protocol.
[0637] The server converts received audio data into text data using a speech recognition module. This module incorporates advanced speech recognition algorithms and is capable of accurately analyzing human language. The converted text is then passed to a machine translation module, which translates it into the specified target language. The translation process incorporates the latest natural language processing technologies, enabling smooth, context-aware translations.
[0638] The translated text data is recorded chronologically and transmitted in real time to participants' display devices. This allows participants to instantly grasp the content of the meeting and overcome communication barriers caused by language differences. For example, in international meetings with speakers of multiple languages, this system enables each participant to instantly understand in their native language what is said, allowing the meeting to progress efficiently.
[0639] Furthermore, it is possible to improve translation accuracy using generative AI models. In specific situations where participants require more specific translations or explanations, they can enter the following prompt to generate detailed explanations or supplementary translations: "Please develop a system that translates Japanese audio into English in real time. This will instantly convey the content to participants speaking other languages and help facilitate the smooth running of meetings."
[0640] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0641] Step 1:
[0642] The terminal acquires the conference audio signal through the microphone. It converts the audio signal acquired by the microphone device into a digital format and encodes it into a streaming format for transmission to the server. The input is an analog audio signal, and the output is encoded audio data.
[0643] Step 2:
[0644] The server receives encoded audio data and processes it with a speech recognition module. The speech recognition module uses advanced speech recognition algorithms to analyze and convert the input audio data into text data. This step analyzes the phonemes and context of the speech to generate accurate textual information. The input is encoded audio data, and the output is text data.
[0645] Step 3:
[0646] The server passes the acquired text data to the translation module, which then translates it into the specified target language. The translation module utilizes machine translation technology and generative AI models to achieve highly accurate translations that take context into account. The input is the original text data, and the output is the translated text data.
[0647] Step 4:
[0648] The server records translated text data chronologically and stores it as a history. This record helps participants review past meeting content. The input is translated text data, and the output is a stored history of translated text organized chronologically.
[0649] Step 5:
[0650] The server transmits the recorded translated text data to participants' display devices in real time. Participants' devices display the received data on their screens, allowing users to immediately grasp the meeting content. The input is real-time translated text data, and the output is the displayed translated text.
[0651] Step 6:
[0652] Users utilize translation data on their participant devices to understand the content of meetings held in different regions, regardless of location. Users can check the displayed translation results and immediately join the meeting. The input is the real-time displayed translated text, and the output is improved user understanding and communication.
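Step 1 of the above flow (digitizing the audio signal and encoding it into a streaming format) can be illustrated with the sketch below. The 16-bit PCM quantization and the frame layout (a 4-byte sequence number and a 2-byte payload length, little-endian) are assumptions made for this example only; the present disclosure does not specify an encoding, and the function names are hypothetical.

```python
import struct
from typing import Iterator, List, Tuple

HEADER = "<IH"                      # sequence number (uint32), payload length (uint16)
HEADER_SIZE = struct.calcsize(HEADER)


def to_pcm16(samples: List[float]) -> bytes:
    """Quantize samples in [-1.0, 1.0] to little-endian 16-bit PCM, clipping overflow."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped))


def frames(pcm: bytes, samples_per_frame: int) -> Iterator[bytes]:
    """Split PCM data into numbered frames suitable for streaming transmission."""
    if samples_per_frame <= 0:
        raise ValueError("samples_per_frame must be positive")
    step = samples_per_frame * 2    # 2 bytes per 16-bit sample
    for seq, start in enumerate(range(0, len(pcm), step)):
        payload = pcm[start:start + step]
        yield struct.pack(HEADER, seq, len(payload)) + payload


def parse_frame(frame: bytes) -> Tuple[int, bytes]:
    """Server side: recover the sequence number and payload of one frame."""
    seq, length = struct.unpack_from(HEADER, frame)
    return seq, frame[HEADER_SIZE:HEADER_SIZE + length]
```

The sequence number lets the server detect missing or reordered frames before passing the audio to the speech recognition module.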
[0653] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0654] This invention is a system that not only converts audio during meetings into text in real time and translates it into different languages, but also analyzes the speaker's emotions and supplements that information. The system consists of a server, user terminals, and a communication environment that links them via a network.
[0655] The devices used by users during meetings acquire not only voice input but also voice characteristics (e.g., speaking speed and tone) for emotion analysis. This characteristic information is captured by the device simultaneously with the voice data and transmitted to the server in streaming format.
[0656] On the server, the received audio data is converted into text data by a speech recognition module. Signal processing in this process utilizes advanced speech recognition algorithms to ensure accuracy. Simultaneously, an emotion engine analyzes the characteristics of the speech and extracts emotional information. The emotion engine analyzes parameters such as intonation and speed to identify the speaker's emotional state.
[0657] Next, the server uses the translation module to translate the converted text into the specified target language. The translated text data is then supplemented with the emotional information extracted by the emotion engine. This adds emotional nuance to the translated content, allowing the recipient to gain a deeper understanding of the intended meaning.
[0658] Furthermore, the server organizes the translated text, including emotional information, and records it as meeting minutes in chronological order. This record is updated in real time and sent to the user's device. Through the device, the user can see the speaker's emotional state in real time along with the translated text.
[0659] As a concrete example, by utilizing the emotion recognition function in a multinational conference, each participant can speak in their native language while the speaker's emotions and intentions are conveyed more clearly when the speech is translated into other languages. For instance, if a participant expresses dissatisfaction with a proposal, the emotion engine can capture their tone and manner of speaking and add emotion labels such as "anger" or "dissatisfaction" to the text, ensuring that the intention is accurately conveyed to other participants and enabling discussions that avoid misunderstandings.
[0660] The following describes the processing flow.
[0661] Step 1:
[0662] The user captures audio during the meeting through the device's microphone, simultaneously capturing features such as intonation and speed. This allows for the acquisition of audio data and parameters related to emotion.
[0663] Step 2:
[0664] The terminal transmits the acquired audio data and its characteristic parameters to the server in streaming format. Data compression technology is applied to ensure rapid transmission to the server over the network.
[0665] Step 3:
[0666] The server passes the received audio data to the speech recognition module, which converts the audio into text data. A speech recognition algorithm is applied to generate accurate text strings from the audio signal.
[0667] Step 4:
[0668] The server simultaneously activates the emotion engine and analyzes feature parameters related to the voice. The emotion engine analyzes the tone, intonation, and speed of the voice to extract the user's emotions.
[0669] Step 5:
[0670] The server passes the converted text data to the translation module, which translates it into the specified target language. A translation algorithm is used to produce a contextually natural translation.
[0671] Step 6:
[0672] The server adds the emotional information extracted by the emotion engine to the translated text data. Emotion labels and emotion scores are attached to the text data, supplementing the nuances and intent of the statements.
[0673] Step 7:
[0674] The server organizes the translated text, including the emotional information, as time-series data and creates meeting minutes that are updated in real time. This data is then sent to the user's terminal.
[0675] Step 8:
[0676] Users can view the translated meeting minutes through their devices and see the embedded emotional information in real time. They can also provide feedback on the content of the minutes as needed to deepen their understanding of the information.
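Steps 4 and 6 of the above flow (extracting an emotion from voice features and attaching an emotion label and score to the translated text) can be sketched as follows. The rule-based thresholds are purely hypothetical placeholders and do not represent the emotion identification model 59; the units, labels, scores, and all names in the code are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VoiceFeatures:
    speech_rate: float   # syllables per second (assumed unit)
    pitch_hz: float      # average pitch, used here as a proxy for tone
    loudness_db: float   # average loudness


@dataclass(frozen=True)
class TaggedUtterance:
    timestamp: float
    translated: str
    emotion: str         # emotion label
    score: float         # emotion score in [0, 1]


def estimate_emotion(f: VoiceFeatures) -> Tuple[str, float]:
    """Toy rules standing in for a trained emotion model (thresholds are made up)."""
    if f.loudness_db > 70 and f.speech_rate > 6:
        return "anger", 0.8
    if f.pitch_hz > 220 and f.speech_rate > 5:
        return "joy", 0.7
    if f.speech_rate < 3 and f.loudness_db < 50:
        return "sadness", 0.6
    return "neutral", 0.5


def tag(timestamp: float, translated: str, features: VoiceFeatures) -> TaggedUtterance:
    """Attach the emotion label and score to one translated utterance."""
    label, score = estimate_emotion(features)
    return TaggedUtterance(timestamp, translated, label, score)
```

In practice the rules would be replaced by a learned model; the sketch only shows where the label and score are attached to the translated text.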
[0677] (Example 2)
[0678] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0679] In international conferences and multilingual environments, communication presents challenges not only in terms of language barriers but also in understanding the speaker's intentions and emotions. Translated texts, in particular, tend to lose emotion and nuance, potentially leading to the omission of crucial elements in effective communication. Such situations increase the likelihood of misunderstandings and communication breakdowns.
[0680] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0681] In this invention, the server includes means for analyzing and extracting emotional information from acoustic data, means for adding emotional information to translated text information, and means for arranging and recording the translated text information based on chronological order. This makes it possible to accurately convey emotional nuances along with the content of speech in communication between different languages, thereby facilitating smoother communication.
[0682] "Acoustic data" refers to information, including audio signals, acquired during meetings and other forms of voice communication.
[0683] "Text information" refers to text data obtained by converting acoustic data using speech recognition technology.
[0684] "Emotional information" refers to information about the speaker's emotional state, extracted from the analysis of acoustic data.
[0685] "Translation" is the process of converting text information recorded in one language into a different target language.
[0686] A "server" is a central computer system that processes audio data, generates text information, translates, performs sentiment analysis, and records data.
[0687] A "user terminal" is a device that receives recorded data in real time and provides information to the user.
[0688] A "time series" is an arrangement of information in which data is sorted according to the order of time.
[0689] The embodiments for carrying out the present invention are shown below.
[0690] The user's device acquires audio data in real time during the meeting using a microphone. This device is equipped with a high-precision speech recognition microphone and software for analyzing audio characteristics (e.g., volume, speed, tone), and simultaneously captures audio data and its characteristic information. The acquired audio data is transmitted to the server in streaming format via a communication protocol (e.g., WebSocket).
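One possible message format for sending audio data and its characteristic information together over a text-based channel such as WebSocket is sketched below. The JSON field names (`seq`, `audio_b64`, `features`) and the function names are assumptions for illustration; the present disclosure does not define a message format, and the sketch deliberately omits the WebSocket connection itself.

```python
import base64
import json
from typing import Dict, Tuple


def encode_message(seq: int, audio: bytes, features: Dict[str, float]) -> str:
    """Terminal side: bundle one audio chunk with its voice features as JSON text."""
    return json.dumps({
        "seq": seq,                                            # chunk order
        "audio_b64": base64.b64encode(audio).decode("ascii"),  # binary-safe audio
        "features": features,                                  # e.g. volume, speed, tone
    })


def decode_message(message: str) -> Tuple[int, bytes, Dict[str, float]]:
    """Server side: recover the chunk order, raw audio bytes, and features."""
    obj = json.loads(message)
    return obj["seq"], base64.b64decode(obj["audio_b64"]), obj["features"]
```

Sending the features in the same message as the audio chunk keeps them aligned, so the emotion engine analyzes the same span of speech that the speech recognition module transcribes.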
[0691] When the server receives audio data, it converts it into text information using a speech recognition module. During this process, the server utilizes a generative AI model and executes advanced speech recognition algorithms to maintain conversion accuracy. Simultaneously, the server uses an emotion engine to analyze the audio characteristics of the audio data and extract emotional information. The emotion engine incorporates a machine learning model to analyze voice intonation and speed changes to identify the speaker's emotions.
[0692] The converted text information is translated by the server into the specified target language via a translation module. The translated text information is then supplemented with the aforementioned emotional information, making it easier for the emotional nuances of the speech to be conveyed to the recipient.
[0693] The server organizes the translated text and emotional information chronologically and records them as meeting minutes. This record is updated in real time and sent to the user's terminal. Through the terminal, the user can view the translated text and the speaker's emotions.
[0694] A concrete example of its use is in multinational conferences. This system allows each participant to speak in their native language while receiving information translated into other languages, and also deepens their understanding of the other person's emotions and intentions. An example of a prompt for the generative AI model is, "Please input audio data, then perform text conversion, translation, and sentiment analysis." This allows the system to efficiently execute each specified process.
[0695] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0696] Step 1:
[0697] The terminal acquires audio data during a meeting in real time using a microphone. Simultaneously, audio characteristics such as volume, speed, and tone are also captured. The input is the raw audio signal, and the output is digital audio data for analysis. Software on the terminal converts this data into a streaming format and transmits it to the server in a stable manner via a communication protocol.
[0698] Step 2:
[0699] The server receives audio data streamed from the terminal. The input is a stream of audio data. This data is passed to a speech recognition module for conversion into text information. Here, an advanced speech recognition algorithm using a generative AI model operates, converting the data into text with high accuracy even in the presence of noise. The output is the converted text information.
[0700] Step 3:
[0701] The server activates an emotion engine to extract emotional information from acoustic data. The input is acoustic data containing speech characteristics. This engine analyzes changes in voice intonation, speed, and tone to identify the speaker's emotions. Specifically, it uses a machine learning model to perform pattern recognition and generate labels such as "joy," "anger," and "sadness." The output is the analysis results, including the emotion labels.
[0702] Step 4:
[0703] The server passes the text information to the translation module, which translates it into the specified target language. The input is the converted text information. A generative AI model performs grammatically and culturally accurate translation. Emotional information is also added at this stage, adding emotional nuances to the translated text. The output is the translated text information, including the emotional information.
[0704] Step 5:
[0705] The server organizes the translated text and emotional information chronologically and records them as meeting minutes. The input consists of the translated text and emotion labels. This data is stored in a database and updated in real time as needed. The output is meeting minutes data organized chronologically.
[0706] Step 6:
[0707] The server provides the latest meeting minutes to the user's terminal in real time. The input is meeting minutes data organized chronologically. The user's terminal receives this data and displays the translated text and emotional information on the screen. The output is real-time information provided to the user. This allows the user to gain a deeper understanding of the speaker's intentions and emotions.
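Because speech recognition (Step 2) and emotion analysis (Step 3) operate on the same input, they could be run concurrently before translation (Step 4). The following sketch shows one way to arrange this with `asyncio.gather`. The three coroutines are stand-in stubs with made-up behavior, not the modules of the present disclosure, and all names and the volume threshold are hypothetical.

```python
import asyncio
from typing import Any, Dict


async def recognize(audio: bytes) -> str:
    await asyncio.sleep(0)              # placeholder for real recognition latency
    return audio.decode("utf-8")


async def analyze_emotion(features: Dict[str, float]) -> str:
    await asyncio.sleep(0)              # placeholder for real model inference
    return "anger" if features.get("volume", 0.0) > 0.8 else "neutral"


async def translate(text: str, target: str) -> str:
    await asyncio.sleep(0)              # placeholder for a real translation call
    return f"[{target}] {text}"


async def process(timestamp: float, audio: bytes,
                  features: Dict[str, float], target: str = "en") -> Dict[str, Any]:
    # Steps 2 and 3 run concurrently; neither depends on the other's result.
    text, emotion = await asyncio.gather(recognize(audio), analyze_emotion(features))
    # Step 4 needs the recognized text, so it runs afterwards.
    translated = await translate(text, target)
    # One minutes entry: timestamp, translated text, and emotion label (Step 5).
    return {"timestamp": timestamp, "text": translated, "emotion": emotion}
```

A caller could invoke it as `asyncio.run(process(3.0, audio_bytes, {"volume": 0.9}))` and append the returned entry to the chronological minutes.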
[0708] (Application Example 2)
[0709] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0710] In international communication settings, smooth communication between speakers of different languages is difficult, and accurately conveying emotions and nuances is particularly challenging. Furthermore, while there is a demand for multilingual communication capabilities in home robots and portable devices, there is a lack of technology to provide accurate translations that incorporate emotional information in real time.
[0711] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0712] In this invention, the server includes means for converting speech into text and translating it into different languages, means for adding speaker's emotional information to the translated text, and means for recording the translated text and emotional information in chronological order. This makes it possible to accurately convey the speaker's emotions and intentions across language barriers in communication between people who speak different languages.
[0713] A "meeting" is a place where multiple participants exchange opinions and information and make decisions.
[0714] "Sound" refers to human speech and voice, and is information that can be processed as electrical signals.
[0715] "Text" refers to character information converted from speech, and is data represented in a visually recognizable form.
[0716] "Different languages" refers to a situation where the speaker and listener have different language systems, making direct mutual understanding difficult.
[0717] "Translation" is the act of replacing information expressed in one language with information expressed in another language, thereby conveying its meaning.
[0718] "Emotional information" refers to data that indicates the emotional state of a speaker, analyzed based on factors such as intonation, tone, and speed of speech.
[0719] A "time series" refers to an arrangement of information organized according to chronological order, and is a fundamental concept for managing history.
[0720] "Recording" is the act of preserving information and maintaining it in a format that allows for later reference.
[0721] An "information processing terminal" refers to equipment or devices used to display or process audio, text, and translation results.
[0722] To implement this invention, a server, a user-operated information processing terminal, and a communication network connecting them are primarily required. The server is equipped with a speech recognition module that receives audio data and converts it into text data. This speech recognition module utilizes an advanced speech recognition algorithm to perform highly accurate conversions. Furthermore, an emotion engine simultaneously extracts emotional information of the speaker from the characteristics of the audio. The emotion engine generates this information by analyzing parameters such as voice tone and speed.
[0723] The acquired text data is translated in real time into the specified target language using a translation module on the server. This translation module uses the latest machine translation technology to process the text and adds sentiment information to the translated text to accurately convey the speaker's intent and emotions.
[0724] The final translated text data is organized chronologically by the server and provided to information processing terminals in real time. The terminals used by users instantly display the translated text and sentiment information, facilitating smooth communication between languages.
[0725] As a concrete example, if a home robot utilizes this system, the robot can support conversations with visitors who speak different languages. For instance, it could translate "Hello, how are you?" spoken in English as "Hello, how are you? (emotion: interest)" and convey this emotional information to the Japanese speaker. This enables natural communication while understanding the visitor's emotions.
[0726] An example of a prompt for the generative AI model is "Please translate and analyze the emotion of the following sentence: 'Hello, how are you?'" which will allow the AI to perform appropriate translation and sentiment analysis.
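As a rough sketch of how the prompt above might be assembled and how a reply in the "(emotion: ...)" form shown in the home robot example might be split into text and label: the function names `build_prompt` and `parse_reply` are hypothetical, and the reply format is an assumption, since the disclosure does not specify how the generative AI model formats its output.

```python
import re
from typing import Optional, Tuple

# Prompt wording taken from the example in the text.
PROMPT_TEMPLATE = (
    "Please translate and analyze the emotion of the following sentence: '{sentence}'"
)

# Assumed reply shape: "<translated text> (emotion: <label>)".
_REPLY_PATTERN = re.compile(
    r"^(?P<text>.*?)\s*\(emotion:\s*(?P<emotion>[^)]+)\)\s*$", re.DOTALL
)


def build_prompt(sentence: str) -> str:
    """Insert the recognized sentence into the prompt template."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


def parse_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Split a reply into (translated text, emotion label).

    Returns (reply, None) when no emotion annotation is present.
    """
    match = _REPLY_PATTERN.match(reply.strip())
    if match is None:
        return reply.strip(), None
    return match.group("text"), match.group("emotion").strip()
```

Returning `None` for a missing label lets the caller decide whether to fall back to the emotion engine's acoustic analysis.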
[0727] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0728] Step 1:
[0729] The user provides voice input through the terminal. The terminal's microphone detects the voice, converts it into a digital audio signal, and records it as audio data.
[0730] Step 2:
[0731] The terminal transmits the audio data directly to the server. This transmission utilizes a network, transferring the data in real time via streaming. Data compression technology is applied during this process to achieve high-speed and efficient data communication. The input is digital audio data, and the output is compressed audio data.
[0732] Step 3:
[0733] The server inputs the received audio data into a speech recognition module. This module analyzes the audio data and converts it into text data. Using speech signal recognition technology, word recognition is performed by analyzing phonemes. The input here is compressed audio data, and the output is text data converted from the audio.
[0734] Step 4:
[0735] The server inputs the generated text data into the translation module and performs translation into the specified language. The translation module uses a machine translation engine to dynamically perform language conversion. The input is text data, and the output is translated text data.
[0736] Step 5:
[0737] Simultaneously, a server-based emotion engine analyzes the original speech feature data to infer the speaker's emotions. This analysis is based on acoustic features such as tone, intonation, and tempo. The input is speech feature data, and the output is data accompanied by emotion labels.
[0738] Step 6:
[0739] The server adds sentiment information to the translated text data and organizes the entire dataset chronologically. This process generates data as translated text with sentiment information. This dataset is stored along with past speech history and can be reused as needed.
[0740] Step 7:
[0741] The server transmits the prepared data to the terminal in real time. The terminal immediately displays the information on the user's screen, allowing the user to simultaneously view the translated content and sentiment information. The output is display data with text and sentiment labels that the user can visually recognize.
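The flow of Steps 2 through 7 can be sketched as below. This is an illustration under several assumptions, not the disclosed implementation: `terminal_send`, `server_process`, and `DisplayItem` are hypothetical names; the speech recognition module, translation module, and emotion engine are passed in as plain callables; and `zlib` is only a lossless stand-in for the unspecified data compression technology.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DisplayItem:
    timestamp: float
    translated_text: str
    emotion: str


def terminal_send(audio: bytes) -> bytes:
    """Step 2: compress the digital audio data before sending it to the server."""
    return zlib.compress(audio)


def server_process(
    payload: bytes,
    timestamp: float,
    recognize: Callable[[bytes], str],      # stand-in for the speech recognition module
    translate: Callable[[str], str],        # stand-in for the translation module
    infer_emotion: Callable[[bytes], str],  # stand-in for the emotion engine
    history: List[DisplayItem],
) -> DisplayItem:
    audio = zlib.decompress(payload)
    text = recognize(audio)          # Step 3: audio -> text
    translated = translate(text)     # Step 4: text -> target language
    # Step 5: the text describes this as running simultaneously with translation;
    # it is called sequentially here to keep the sketch short.
    emotion = infer_emotion(audio)
    # Step 6: attach the emotion label and keep the history in chronological order.
    item = DisplayItem(timestamp, translated, emotion)
    history.append(item)
    history.sort(key=lambda entry: entry.timestamp)
    return item                      # Step 7: returned for display on the terminal
```

Passing the three modules as callables keeps the sketch independent of any particular recognition, translation, or emotion analysis library.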
[0742] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0743] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0744] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0745] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0746] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0747] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0748] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).
[0749] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
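The layout of the emotion map described above (left and right halves, pleasure above and displeasure below, inner thoughts toward the center and actions toward the outside) can be illustrated with a small coordinate sketch. The coordinate convention, the `MapPosition` and `describe` names, and the `outer_threshold` value are assumptions made for illustration; the disclosure does not define numeric coordinates for the map.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MapPosition:
    x: float  # negative = left half, positive = right half (assumed convention)
    y: float  # positive = upper side, negative = lower side (assumed convention)

    @property
    def radius(self) -> float:
        # Distance from the center of the concentric circles.
        return math.hypot(self.x, self.y)


def describe(pos: MapPosition, outer_threshold: float = 0.5) -> Dict[str, str]:
    """Classify a position by half, vertical side, and distance from the center."""
    if pos.x < 0:
        region = "response"   # left half: sensation is dominant
    elif pos.x > 0:
        region = "situation"  # right half: situational awareness is dominant
    else:
        region = "both"       # directly above or below the center
    if pos.y > 0:
        valence = "pleasure"
    elif pos.y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"
    # outer_threshold is a hypothetical cut-off between inner thoughts and actions.
    layer = "action" if pos.radius >= outer_threshold else "inner"
    return {"region": region, "valence": valence, "layer": layer}
```

For example, a point at the 3 o'clock position near the rim, `MapPosition(0.9, 0.0)`, falls in the "situation" region and the "action" layer.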
[0750] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0751] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
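The last step of this determination, going from per-emotion values to a single emotion, can be sketched as follows. The sketch assumes that the emotion with the highest value is selected, which the disclosure does not state explicitly; the function names and the numeric values in the usage note are hypothetical, and the neural network itself is not modeled.

```python
from typing import Dict, List, Tuple


def determine_emotion(emotion_values: Dict[str, float]) -> str:
    """Return the emotion with the highest value (assumed selection rule)."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    return max(emotion_values, key=lambda name: emotion_values[name])


def top_emotions(emotion_values: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-valued emotions, highest first."""
    ranked = sorted(emotion_values.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]
```

With hypothetical network outputs such as `{"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.10}`, the three neighboring emotions receive similar values, `determine_emotion` selects "reassured", and `top_emotions` lists the neighbors ahead of "anxious".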
[0752] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0753] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0754] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0755] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0756] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0757] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0758] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0759] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0760] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0761] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0762] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0763] The following is further disclosed regarding the embodiments described above.
[0764] (Claim 1)
[0765] Means for acquiring audio signals in a meeting,
[0766] means for converting the aforementioned audio signal into text data,
[0767] A means for translating the aforementioned text data into a different language,
[0768] A means for recording the translated text data in chronological order,
[0769] A means for providing the recorded data to the user terminal in real time,
[0770] A system that includes this.
[0771] (Claim 2)
[0772] The system according to claim 1, comprising means for compressing an audio signal, optimizing network efficiency, and transmitting it to a server.
[0773] (Claim 3)
[0774] The system according to claim 1, comprising means for adding time information and speaker information to translated text data.
[0775] "Example 1"
[0776] (Claim 1)
[0777] An information processing device that acquires audio data,
[0778] An information processing device that converts the aforementioned audio data into text information,
[0779] An information processing device that translates the aforementioned textual information into a different natural language,
[0780] An information processing device that records the translated text information in chronological order,
[0781] An information processing device that provides the recorded information to a user information processing device in real time,
[0782] A system that includes this.
[0783] (Claim 2)
[0784] The system according to claim 1, comprising an information processing device that compresses audio data, optimizes the transmission path efficiency, and transmits it to a computing device.
[0785] (Claim 3)
[0786] The system according to claim 1, comprising an information processing device that adds time information and speaker information to translated text information.
[0787] "Application Example 1"
[0788] (Claim 1)
[0789] Means for acquiring audio signals in a meeting,
[0790] means for converting the aforementioned audio signal into text data,
[0791] A means for translating the aforementioned text data into a different language,
[0792] A means for recording the translated text data in chronological order,
[0793] Means for providing the recorded data to a display device in real time,
[0794] A means of instantly understanding the content of meetings held in different regions, regardless of the location, depending on the participant's device.
[0795] A system that includes this.
[0796] (Claim 2)
[0797] The system according to claim 1, comprising means for compressing an audio signal, optimizing communication efficiency, and transmitting it to a processing unit.
[0798] (Claim 3)
[0799] The system according to claim 1, comprising means for adding time information and speaker information to translated text data.
[0800] "Example 2 of combining an emotion engine"
[0801] (Claim 1)
[0802] Means of acquiring audio data in a meeting,
[0803] means for converting the aforementioned acoustic data into textual information,
[0804] A means for translating the aforementioned textual information into a different language,
[0805] A means for analyzing and extracting emotional information from the aforementioned acoustic data,
[0806] A means for adding emotional information to the translated text information,
[0807] A means for arranging and recording the translated text information in chronological order,
[0808] A means for transmitting the recorded data to the user terminal in real time,
[0809] A system that includes this.
[0810] (Claim 2)
[0811] The system according to claim 1, comprising means for compressing data and optimizing the communication channel before transmitting it to a server in order to improve the efficiency of transferring acoustic data.
[0812] (Claim 3)
[0813] The system according to claim 1, comprising means for adding time information and speaker information to translated text information.
[0814] "Application example 2 when combining with an emotional engine"
[0815] (Claim 1)
[0816] Means of acquiring audio from a meeting,
[0817] means for converting the aforementioned audio into text,
[0818] A means for translating the aforementioned text into a different language,
[0819] A means for adding information about the speaker's emotions to the translated text,
[0820] A means for recording the translated text and emotional information in chronological order,
[0821] A means for providing the recorded data to an information processing terminal in real time,
[0822] A system that includes this.
[0823] (Claim 2)
[0824] The system according to claim 1, comprising means for compressing audio data, optimizing data communication efficiency, and transmitting it to a server.
[0825] (Claim 3)
[0826] The system according to claim 1, comprising means for adding time information and speaker information to translated text data.
Explanation of Symbols
[0827] 10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: means for acquiring audio signals in a meeting; means for converting the audio signals into text data; means for translating the text data into a different language; means for recording the translated text data in chronological order; means for providing the recorded data to a display device in real time; and means for allowing the content of meetings held in different regions to be understood instantly on each participant's device, regardless of location.
2. The system according to claim 1, comprising means for compressing an audio signal, optimizing communication efficiency, and transmitting it to a processing unit.
3. The system according to claim 1, comprising means for adding time information and speaker information to translated text data.