system

The system addresses the challenge of understanding technical and business terms and language barriers in meetings by converting speech to text, identifying key terms, and providing translated summaries, improving communication efficiency.

JP2026105348A · Pending · Publication Date: 2026-06-26 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-16
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Conventional meetings face challenges with novice participants struggling to understand technical and business terms, and language barriers hinder effective communication, especially in international settings.

Method used

A system integrating speech recognition, part-of-speech analysis, information presentation, terminology conversion, and multilingual translation to facilitate understanding and smooth communication by converting speech to text, identifying key terms, providing summaries, and translating content.

Benefits of technology

Enables participants to easily comprehend meeting content, adapt to diverse language environments, and reduce communication barriers, enhancing work efficiency and clarity.

✦ Generated by Eureka AI based on patent content.

[Figure 2026105348000001_ABST]

Abstract

【Problem】 To provide a system. 【Solution】 A system including: speech recognition means for acquiring voice input in real time and converting the voice into text data; morphological analysis means for morphologically analyzing the text data obtained by the speech recognition means and identifying specific parts of speech; information presentation means for acquiring related information and displaying a summary when a user selects an identified noun; term conversion means for converting technical terms and business terms into general vocabulary; translation means for translating between multiple languages; information distribution means for providing the translated text information to on-site staff via an information display device; and situation analysis means for associating visual information obtained from a monitoring device with voice reports to identify abnormal events.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In conventional meetings, new business professionals and IT beginners often cannot fully understand technical or business terms, and grasping the content of a meeting takes them considerable time and effort. As a result, smooth communication is hindered and work efficiency declines. In international meetings where multiple languages are mixed, the language barrier makes understanding even more difficult. There is a need for a system that solves these problems and enables all participants to understand the content of meetings effectively.

Means for Solving the Problems

[0005] This invention includes a speech recognition means that converts speech into text data in real time, and a part-of-speech analysis means that performs morphological analysis on the text data to identify specific parts of speech. Furthermore, if the user selects an identified noun, it includes an information presentation means that retrieves information related to that noun and provides a summary. In addition, by including a terminology conversion means that converts technical and business terms into general terms, and a translation means that enables multilingual translation, this invention proposes a system that allows all participants, including new business professionals and IT beginners, to easily understand the meeting content and communicate smoothly.

[0006] A "speech recognition means" is a mechanism for acquiring speech data in real time and converting that speech into text data.

[0007] A "part-of-speech analysis means" is a process that uses morphological analysis to identify the part of speech of each word in the text data acquired by speech recognition.

[0008] An "information presentation means" is a mechanism that, when a user selects an identified noun, collects information related to that noun and provides a summary of it.

[0009] A "terminology conversion means" is a function that converts specialized and business terms into general terms to make them easier for users to understand.

[0010] A "translation means" is a mechanism that translates text between different languages and provides information in the language selected by the user.

Brief Description of the Drawings

[0011]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0012] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0013] First, the terminology used in the following description is explained.

[0014] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0015] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as a work memory.

[0016] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0017] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0018] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."

[0019] [First Embodiment]

[0020] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0021] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0022] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0023] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0024] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0025] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0026] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0027] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0028] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0029] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0030] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The data processing system 10 uses the reception output program 60 together with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0031] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0032] This invention is a system that integrates speech recognition, morphological analysis, information presentation, terminology conversion, and multilingual translation, and aims to help participants communicate as intended.

[0033] Server role:

[0034] The server functions as the central processing unit of this system, first receiving the audio of the meeting transmitted from the terminals. Using speech recognition technology, this audio is converted into highly accurate text information. The converted text information is then broken down by part of speech through a morphological analysis engine, with nouns being identified in particular.

[0035] The server also collects information related to nouns selected by the user from databases and external resources. The collected information is summarized by AI and sent to the terminal. Furthermore, the server has the ability to detect technical and business terms in the text and replace them with general terms.

[0036] The server is equipped with a multilingual translation engine that translates information into the language specified by the user, providing information tailored to the user's needs.

[0037] Terminal role:

[0038] The terminal is a device used by users participating in a meeting. It transmits audio to the server and displays text and summary information received from the server to the user. When a user clicks on a specific noun, additional information related to that noun is displayed on the terminal. The latest converted and translated text information is also reflected on the terminal in real time.
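As one way to picture how the terminal could relate a click to an identified noun, the following minimal sketch records the character range of each noun in the transcript and maps a click offset back to the noun under it. The function names `noun_spans` and `noun_at` and the offset-based click model are assumptions made for illustration; the disclosure does not specify how the terminal implements selection.

```python
import re


def noun_spans(text: str, nouns: list) -> list:
    """Return (start, end, noun) for every occurrence of each noun, sorted by position."""
    spans = []
    for noun in set(nouns):
        for match in re.finditer(re.escape(noun), text):
            spans.append((match.start(), match.end(), noun))
    return sorted(spans)


def noun_at(spans: list, offset: int):
    """Map a click at a character offset to the noun under it, or None."""
    for start, end, noun in spans:
        if start <= offset < end:
            return noun
    return None


if __name__ == "__main__":
    transcript = "The roadmap lists each milestone."
    spans = noun_spans(transcript, ["roadmap", "milestone"])
    print(noun_at(spans, 5))
```

Overlapping nouns (for example a compound and one of its parts) are not resolved here; the first matching span is returned.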

[0039] User actions:

[0040] Users can use their devices during meetings to select specific nouns or technical terms. This makes it easier to understand what is being said and the flow of the discussion through the presentation of useful information and translations. This system is designed to enable even new business professionals and IT novices to participate effectively in meetings.

[0041] Specific example:

[0042] For example, if a user hears the term "project management" during a meeting, clicking on that noun causes the server to provide a summary of the definition of "project management" and related common practices. Furthermore, the content of a meeting held in Japanese can be translated into English and shared with colleagues in overseas branches.

[0043] By applying this invention, it is possible to save the time otherwise spent deciphering technical terms and industry-specific jargon, and to adapt to diverse language environments, thereby reducing communication barriers.

[0044] The following describes the processing flow.

[0045] Step 1:

[0046] The terminal uses its microphone to capture the voices of meeting participants, converts them to a digital format, and prepares to send them to the server.

[0047] Step 2:

[0048] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This text data is temporarily stored for further processing.

[0049] Step 3:

[0050] The server sends the acquired text data to a morphological analysis engine. It performs part-of-speech analysis, specifically identifying and marking nouns. The results of this processing are used to display information when a noun is clicked.

[0051] Step 4:

[0052] The user operates the device and clicks on a noun that interests them.

[0053] Step 5:

[0054] The server identifies the clicked noun and retrieves information related to that noun from databases and external sources. The retrieved data is then summarized into concise information through a summarization algorithm.

[0055] Step 6:

[0056] The server sends the summarized information to the terminal, which displays it to the user. This additional information allows the user to deepen their understanding of the meeting content.

[0057] Step 7:

[0058] The server detects technical and business terms and converts them into more general terms. The converted text is then sent to the terminal for the user to review.

[0059] Step 8:

[0060] When a user enters their desired language into their device, the server passes the meeting text to a translation engine, which translates it into the specified language. The translated text is displayed on the device in real time.
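The server-side part of Steps 2 through 8 can be sketched as a small pipeline in which every stage is an injectable callable. This is only an illustrative skeleton: the class `MeetingPipeline`, the helper `build_demo_pipeline`, and the toy stand-ins for recognition, noun detection, lookup, summarization, term conversion, and translation are all hypothetical and would be replaced by real engines in practice.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MeetingPipeline:
    """Server-side flow; every stage is an injectable callable (all names hypothetical)."""
    recognize: Callable[[bytes], str]      # Step 2: audio -> text
    find_nouns: Callable[[str], list]      # Step 3: text -> identified nouns
    lookup: Callable[[str], str]           # Step 5: noun -> related information
    summarize: Callable[[str], str]        # Step 5: related information -> summary
    simplify: Callable[[str], str]         # Step 7: jargon -> general terms
    translate: Callable[[str, str], str]   # Step 8: (text, language) -> translation
    transcript: list = field(default_factory=list)

    def on_audio(self, audio: bytes) -> dict:
        """Steps 2, 3 and 7: transcribe, mark nouns, and simplify terminology."""
        text = self.recognize(audio)
        self.transcript.append(text)
        return {"text": text, "nouns": self.find_nouns(text), "plain": self.simplify(text)}

    def on_noun_selected(self, noun: str) -> str:
        """Steps 4 to 6: fetch related information and return its summary."""
        return self.summarize(self.lookup(noun))

    def on_language_selected(self, lang: str) -> list:
        """Step 8: translate the transcript collected so far."""
        return [self.translate(line, lang) for line in self.transcript]


def build_demo_pipeline() -> MeetingPipeline:
    """Wire the pipeline with toy stand-ins so the flow can be exercised offline."""
    known_nouns = {"KPI", "roadmap"}
    info = {"KPI": "Key performance indicator. A measurable value used to track progress."}
    return MeetingPipeline(
        recognize=lambda audio: audio.decode("utf-8"),  # pretend the audio is already text
        find_nouns=lambda text: [w for w in text.split() if w in known_nouns],
        lookup=lambda noun: info.get(noun, "No entry found."),
        summarize=lambda text: text.split(". ")[0].rstrip(".") + ".",  # first sentence only
        simplify=lambda text: text.replace("KPI", "key performance indicator"),
        translate=lambda text, lang: f"[{lang}] {text}",  # tags the text instead of translating
    )


if __name__ == "__main__":
    pipeline = build_demo_pipeline()
    print(pipeline.on_audio(b"Review the KPI roadmap"))
    print(pipeline.on_noun_selected("KPI"))
    print(pipeline.on_language_selected("en"))
```

Keeping each stage behind a plain callable means the order of the steps stays visible while the concrete engines remain interchangeable.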

[0061] (Example 1)

[0062] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0063] In meetings and discussions, it is difficult to understand the specialized terminology and industry-specific language used by participants in real time and to ensure smooth communication across multiple languages. Furthermore, there is a need for a means to quickly gather and summarize information and for participants to easily follow the progress of the meeting. Traditional systems often fail to effectively address these challenges.

[0064] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0065] In this invention, the server includes a speech conversion means that acquires speech data in real time and converts that data into text information; a data analysis means that performs morphological analysis on the text information acquired by the speech conversion means to identify parts of speech; and an information provision means that acquires related information when a user selects an identified noun and provides summary information. This generalizes specialized and business terms used during meetings and enables accurate and rapid communication among participants who speak different languages.

[0066] "Speech conversion means" refers to technology for acquiring speech data in real time and converting that speech into text information.

[0067] "Data analysis means" refers to technology for identifying, through morphological analysis, the parts of speech used in the text information obtained by the speech conversion means.

[0068] "Information provision means" refers to technology that retrieves relevant information based on a noun selected by the user, summarizes it, and provides it to the user.

[0069] "Data conversion means" refers to technology that converts technical or business terms into general terms, making the content easier for users to understand.

[0070] "Translation processing means" refers to technology that accurately translates text information between multiple languages, enabling mutual understanding between speakers of different languages.

[0071] "Language conversion means" refers to technology for translating text information into a language specified by the user.

[0072] A "generative model" is an algorithm that uses artificial intelligence technology to summarize information and generate new text.

[0073] This embodiment of the invention is a system for supporting smooth communication in meetings and discussions. The configuration and operation of this system will be described in detail below.

[0074] The server functions as the central processing unit of this system. The server first receives audio data from the terminals. To convert the audio data into text in real time, the server utilizes a common speech recognition API. This API has the capability to rapidly convert audio data into text data.

[0075] The recognized text information is broken down into parts of speech by a morphological analysis engine, which identifies keywords and nouns; the server uses these to judge which information is relevant. The server also collects relevant information from external resources and databases and, acting as the information provision means, generates summary information using an artificial intelligence model.

[0076] The terminal is a user-operated device that captures meeting audio and sends it to the server. The terminal also displays to the user the converted text and summary information returned by the server. This allows the user to obtain relevant information directly by selecting identified nouns. Additionally, this information is translated into multiple languages using a multilingual translation API, enabling smooth communication across language barriers even in international meetings.

[0077] Users can understand important technical and business terms in a discussion in real time based on the information displayed on their device. This system can provide additional information using prompts based on a generative AI model for specific words. For example, a prompt such as, "Please provide a summary of the general definition and related techniques for the noun 'project management' used during the meeting," can be used.
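The prompt quoted above can be produced from a template once the user has selected a noun. The following is a minimal sketch; the function name `build_summary_prompt` and the optional `excerpt` parameter (a piece of transcript added for context) are assumptions for illustration and do not appear in the disclosure.

```python
def build_summary_prompt(noun: str, excerpt: str = "") -> str:
    """Fill the prompt template for a selected noun (hypothetical helper)."""
    noun = noun.strip()
    if not noun:
        raise ValueError("noun must not be empty")
    prompt = (
        "Please provide a summary of the general definition and related "
        f"techniques for the noun '{noun}' used during the meeting."
    )
    if excerpt.strip():
        # Assumption: passing some transcript context helps disambiguate the noun.
        prompt += f"\n\nMeeting excerpt for context:\n{excerpt.strip()}"
    return prompt


if __name__ == "__main__":
    print(build_summary_prompt("project management"))
```

The returned string would then be sent to whichever generative model the server uses; that call is intentionally left out of the sketch.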

[0078] Thus, this system, starting with speech recognition and continuing through a series of processes including multilingual translation, provides an environment where participants can more efficiently and accurately grasp the flow of conversation.

[0079] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0080] Step 1:

[0081] The terminal captures audio data via the microphone during the meeting. This input audio data is streamed to the server in real time via a secure communication protocol. The terminal appropriately compresses this audio data and packetizes it to improve data transfer efficiency.

[0082] Step 2:

[0083] The server inputs the audio data received from the terminal into a speech recognition API. This API converts the audio into text data. Specifically, it analyzes the language from the audio waveform and outputs it as text information. Through this process, the audio input is converted into clean text data.

[0084] Step 3:

[0085] The server passes the generated text data to the morphological analysis engine. The morphological analysis engine divides the input text into words and identifies the part of speech of each word. The output is a list of identified nouns in the text. This list is used for subsequent information extraction.

[0086] Step 4:

[0087] The user selects a noun displayed on the device. Based on the selected noun, the server retrieves relevant information from an external source or database. This information gathering is performed via an API, and the information is returned to the server. The output of this process is detailed information related to the selected noun.

[0088] Step 5:

[0089] The server summarizes the acquired detailed information using a generative AI model. This generative AI model quickly summarizes large amounts of text information, extracting key points and organizing them concisely. This summarized information is output and used to provide information to users.

[0090] Step 6:

[0091] The server applies a data transformation algorithm to replace technical terms with more general ones. This process converts complex words and phrases into simpler terms to aid user understanding. The output is revised text information.

[0092] Step 7:

[0093] The server passes text information to a multilingual translation API as needed, translating it into the specified language. The translated text is processed on the server and immediately sent to the terminal. The terminal displays this translated information to the user, facilitating communication in a multilingual environment. The user can then use prompts on the terminal to request additional information.
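Step 1 of this flow mentions compressing and packetizing the captured audio before it is streamed. A minimal sketch of that idea follows, under several assumptions: the 8-byte header layout, the chunk size, and the function names are hypothetical, `zlib` is used only as a general-purpose stand-in for a real audio codec, and the secure transport itself is not shown.

```python
import struct
import zlib

# Hypothetical header: sequence number and payload length, both unsigned 32-bit, network byte order.
HEADER = struct.Struct("!II")


def packetize(pcm: bytes, chunk_size: int = 3200) -> list:
    """Terminal side: split raw audio into chunks, compress each, and prefix a header."""
    packets = []
    for seq, start in enumerate(range(0, len(pcm), chunk_size)):
        payload = zlib.compress(pcm[start:start + chunk_size])
        packets.append(HEADER.pack(seq, len(payload)) + payload)
    return packets


def depacketize(packets: list) -> bytes:
    """Server side: restore chunk order from the sequence numbers and decompress."""
    chunks = {}
    for packet in packets:
        seq, length = HEADER.unpack_from(packet)
        chunks[seq] = zlib.decompress(packet[HEADER.size:HEADER.size + length])
    return b"".join(chunks[seq] for seq in sorted(chunks))


if __name__ == "__main__":
    audio = bytes(range(256)) * 40              # placeholder bytes standing in for PCM samples
    packets = packetize(audio)
    restored = depacketize(list(reversed(packets)))  # simulate out-of-order arrival
    print(len(packets), restored == audio)
```

The sequence number lets the server reassemble the stream even when packets arrive out of order, which is why it is carried in every header.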

[0094] (Application Example 1)

[0095] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0096] In communication settings, there is a need for information exchange among team members who speak different languages and for facilitating conversations in fields with a lot of specialized terminology. In particular, in on-site security operations, rapid and reliable information transmission is crucial, but language and jargon barriers often hinder communication. Furthermore, there is a lack of support for effectively integrating audio and visual information acquired on-site to correctly understand the situation. Technologies are needed to address these challenges.

[0097] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0098] In this invention, the server includes: a speech recognition means for acquiring voice input in real time and converting the voice into text data; a part-of-speech parsing means for performing morphological analysis on the text data acquired by the speech recognition means and identifying specific parts of speech; an information presentation means for acquiring related information and displaying a summary when a user selects an identified noun; a term conversion means for converting technical terms and business terms into general vocabulary; a translation means for performing translation between multiple languages; an information distribution means for providing translated text information to on-site staff via an information display device; and a situation analysis means for linking visual information acquired from a monitoring device with voice reports to identify abnormal events. This enables rapid information sharing and accurate understanding of situations among staff who speak different languages.

[0099] "Voice input" refers to acquiring acoustic signals in real time and using that information as data for processing.

[0100] "Text data" refers to text information obtained by converting speech, and it can be processed and displayed electronically.

[0101] "Speech recognition means" refers to a device or system for receiving speech information and converting it into text data.

[0102] A "part-of-speech analysis tool" is a device or system for performing morphological analysis on character data and identifying specific parts of speech.

[0103] An "information presentation means" is a device or system for acquiring knowledge related to a noun selected by the user, summarizing it, and displaying it.

[0104] A "terminology conversion means" is a device or system for replacing technical terms or business jargon with general vocabulary.

[0105] "Translation means" refers to a device or system for converting text information between languages.

[0106] "Information distribution means" refers to a device or system for transmitting acquired information to an information display device and presenting it to staff.

[0107] A "situation analysis means" is a device or system that integrates visual and auditory information to identify abnormal events.

[0108] This invention is designed as an information gathering and presentation system for use by on-site security staff. The server acquires voice input in real time; the acoustic signal is captured by the microphone of a smartphone or information display device. Speech recognition uses the Google® Speech-to-Text API or Amazon Transcribe, and the speech recognition means outputs the result as text data.

[0109] Next, the text data is preprocessed for morphological analysis. Here, parts of speech are analyzed using a morphological analysis engine such as MeCab, and particularly important nouns are extracted. When the user selects a specific noun, the server retrieves relevant information from a database or external resources and summarizes that information using a generative AI model. In this process, OpenAI's GPT model or similar might be used.
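To illustrate the noun extraction described here, the sketch below parses text laid out in MeCab's default output format with an IPADIC-style dictionary (one token per line as `surface<TAB>comma-separated features`, terminated by `EOS`) and keeps the tokens whose first feature is 名詞 (noun). The sample string is hand-written in that layout, not captured from a real run, and the function name `extract_nouns` is hypothetical. With the mecab-python3 binding installed, such output could typically be obtained from `MeCab.Tagger().parse(text)`; that call is deliberately omitted so the sketch runs without MeCab.

```python
# Hand-written sample in MeCab's default IPADIC-style layout (an assumption, not real output).
SAMPLE_OUTPUT = (
    "プロジェクト\t名詞,一般,*,*,*,*,プロジェクト,プロジェクト,プロジェクト\n"
    "管理\t名詞,サ変接続,*,*,*,*,管理,カンリ,カンリ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "行う\t動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ\n"
    "EOS\n"
)


def extract_nouns(mecab_output: str) -> list:
    """Return the surface forms of all tokens whose top-level part of speech is 名詞."""
    nouns = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the sentence terminator and any malformed line
        surface, features = line.split("\t", 1)
        if features.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns


if __name__ == "__main__":
    print(extract_nouns(SAMPLE_OUTPUT))
```

A production system would probably also merge adjacent nouns into compounds and filter by noun subcategory; neither refinement is attempted here.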

[0110] Subsequently, the extracted technical and business terms are converted into general vocabulary using a terminology conversion tool and displayed in a format that is easy for the user to understand. Text translation between multiple languages is performed using translation tools such as the Google Translate API or the DeepL API.
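One simple way to realize term conversion is a glossary lookup, sketched below. The glossary entries, the function name `convert_terms`, and the choice of whole-word, case-insensitive matching are assumptions for illustration; the disclosure does not state how terms are detected or which replacements are used.

```python
import re

# Hypothetical glossary mapping jargon to plainer wording.
GLOSSARY = {
    "KPI": "key performance indicator",
    "onboarding": "new-member orientation",
    "SLA": "service level agreement",
}


def convert_terms(text: str, glossary: dict) -> str:
    """Replace every whole-word glossary term in text with its general equivalent."""
    if not glossary:
        return text
    # Longest terms first so that multi-word terms win over shorter ones they contain.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    lookup = {term.lower(): plain for term, plain in glossary.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


if __name__ == "__main__":
    print(convert_terms("Check the KPI before onboarding.", GLOSSARY))
```

The `\b` word boundary suits space-delimited languages; for Japanese text the terms would more naturally be matched against the tokens produced by morphological analysis.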

[0111] The translated information is provided to on-site staff in real time via information display devices through information distribution means. Smart glasses or mobile devices may be used for this information display. Situation analysis means are also used to integrate visual information obtained from monitoring devices with audio reports. This makes it possible to quickly identify abnormal events and take appropriate countermeasures.
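The disclosure does not say how visual information and voice reports are associated. As one possible reading, the sketch below pairs a voice report with a camera detection when the detection label appears in the report text and the two timestamps lie within a fixed window. The data classes, the field names, the 60-second window, and the label-in-text rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualEvent:
    timestamp: float   # seconds since some shared reference time
    label: str         # detection label from the monitoring device, e.g. "vehicle"
    camera_id: str


@dataclass(frozen=True)
class VoiceReport:
    timestamp: float
    text: str          # transcribed report
    reporter: str


def find_abnormal_events(visual: list, reports: list, window_s: float = 60.0) -> list:
    """Pair each report with detections it mentions that occurred within window_s seconds."""
    matches = []
    for report in reports:
        text = report.text.lower()
        for event in visual:
            close_in_time = abs(event.timestamp - report.timestamp) <= window_s
            mentioned = event.label.lower() in text
            if close_in_time and mentioned:
                matches.append((report, event))
    return matches


if __name__ == "__main__":
    detections = [
        VisualEvent(100.0, "vehicle", "cam-1"),
        VisualEvent(500.0, "vehicle", "cam-2"),
        VisualEvent(110.0, "person", "cam-1"),
    ]
    spoken = [VoiceReport(130.0, "Suspicious vehicle near gate B", "staff-7")]
    for report, event in find_abnormal_events(detections, spoken):
        print(report.reporter, event.camera_id, event.label)
```

Requiring agreement between both sources is a conservative choice: a detection or a report alone is not flagged by this rule.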

[0112] For example, if a security staff member verbally reports a suspicious vehicle, a voice recognition system converts the report into text data, extracting the key phrase "suspicious vehicle." The server automatically summarizes the information, translates it into multiple languages, and immediately shares it with the entire team.

[0113] An example of a prompt for a generative AI model might be: "Summarize the latest reports related to the keyword 'suspicious vehicle' and generate a multilingual description. Translate it into English and add any relevant information."

[0114] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0115] Step 1:

[0116] The terminal acquires voice input from on-site staff in real time. This input voice is collected through the microphone of the information display device.

[0117] Step 2:

[0118] The server converts the acquired speech input into text data using the Google Speech-to-Text API or Amazon Transcribe. This conversion process transforms the data from speech to text.

[0119] Step 3:

[0120] The server performs morphological analysis on the obtained text data using the MeCab library and identifies the part of speech of each word. Through this analysis, nouns and important phrases are extracted. The input is text data, and the output is a list of nouns.

[0121] Step 4:

[0122] The user selects a noun of interest from the displayed list. This selection action triggers the next information retrieval process.

[0123] Step 5:

[0124] The server retrieves information related to the selected noun from databases and external resources. This information is then inserted into a prompt and summarized by a generative AI model. The output of this step is the summarized information.

[0125] Step 6:

[0126] The server uses the terminology conversion means to convert technical terms into general vocabulary and presents the result in a format that is easily understood by a wider range of users. The input is the summarized information, and the output is the converted information.

[0127] Step 7:

[0128] The server uses the DeepL API or Google Translate API to perform translations and generate multilingual information. This makes it possible to communicate information to users who speak different languages.

[0129] Step 8:

[0130] The server transmits the translated information to the terminal at the site via an information distribution system. The terminal then presents the information to the user using an information display device. This allows for immediate feedback.

[0131] Step 9:

[0132] The server integrates visual and audio data from monitoring devices and identifies abnormal events using the situation analysis means. Based on the integrated data, it creates a prompt for a generative AI model, which generates specific action guidelines.
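The server-side portion of the above flow (Steps 2 to 7) can be sketched in Python as a chain of interchangeable stages. This is an illustrative sketch only: the `Report` class, the `run_pipeline` function, and the stand-in lambdas are hypothetical, the real stages would call a speech recognition service, a morphological analyzer, a generative AI model, and a translation service, and Steps 8 and 9 (distribution and situation analysis) are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Report:
    """Data accumulated for one voice report as it moves through the steps."""
    text: str = ""
    nouns: list[str] = field(default_factory=list)
    summary: str = ""
    translations: dict[str, str] = field(default_factory=dict)


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],          # Step 2: speech to text
    extract_nouns: Callable[[str], list[str]],   # Step 3: morphological analysis
    choose_noun: Callable[[list[str]], str],     # Step 4: user selection
    summarize: Callable[[str], str],             # Step 5: retrieval and summary
    simplify: Callable[[str], str],              # Step 6: terminology conversion
    translate: Callable[[str, str], str],        # Step 7: translation
    languages: list[str],
) -> Report:
    report = Report(text=transcribe(audio))
    report.nouns = extract_nouns(report.text)
    selected = choose_noun(report.nouns)
    report.summary = simplify(summarize(selected))
    report.translations = {
        lang: translate(report.summary, lang) for lang in languages
    }
    return report


# Stand-in stages so the sketch runs without any external service.
report = run_pipeline(
    audio=b"raw-audio-bytes",
    transcribe=lambda _audio: "suspicious vehicle near gate 3",
    extract_nouns=lambda _text: ["suspicious vehicle", "gate 3"],
    choose_noun=lambda nouns: nouns[0],
    summarize=lambda noun: f"Latest reports about {noun}.",
    simplify=lambda text: text,
    translate=lambda text, lang: f"[{lang}] {text}",
    languages=["en", "ja"],
)
print(report.translations["en"])
```

Passing each stage as a callable keeps the order of the steps explicit while leaving the choice of concrete service open.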

[0133] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0134] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication. The system includes functions for speech recognition, morphological analysis, information presentation, terminology conversion, multilingual translation, and emotion recognition.

[0135] 1. Overall structure:

[0136] The server functions as the central hub of the system, processing voice data, analyzing emotions, and presenting and translating information. The terminal provides the interface with the user: it collects voice input from the user, sends it to the server, and displays the processed information.

[0137] 2. Server processing:

[0138] The server converts the audio transmitted from the terminal into text data using a speech recognition engine. This text data is then broken down by morphological analysis to identify nouns and important words. Furthermore, the server can use an emotion engine to analyze the emotions contained in the audio data. The analysis results are used as data that provides a detailed indication of the user's emotional state.

[0139] The server adjusts information presentation based on the user's emotional state, collecting, summarizing, and displaying information related to the noun selected by the user. This information is presented in a way that aligns with the user's emotions, enabling a more personalized experience.

[0140] 3. Role of the terminal:

[0141] The terminal is a device used by the user and primarily manages voice input and output. The terminal acquires voice and sends it to the server, and displays the server's response to the user. It also has an interface for visualizing the user's emotional state, allowing changes to be monitored in real time.

[0142] 4. User interaction and experience:

[0143] Users speak into their terminals and receive information based on the analysis results. For example, if a user says "project management," the server uses an emotion engine to analyze the emotions expressed in the statement and presents information in the most appropriate format. Furthermore, if the system determines that the user is feeling stressed, it can suggest information and methods to help them relax.

[0144] Specific example:

[0145] When a user mentions "the introduction of a new product" during a meeting, the server analyzes the emotions behind the statement and, if it determines that the user is feeling anxious, provides reassuring information or examples of past success stories.

[0146] This system provides an environment that enables users to make better decisions by delivering information with consideration for their emotions. Furthermore, the system continuously learns and improves the relevance and accuracy of the information presented.

[0147] The following describes the processing flow.

[0148] Step 1:

[0149] The terminal uses a microphone to capture the user's voice. The captured voice is converted into a digital format and sent to the server.

[0150] Step 2:

[0151] The server inputs the received audio data into the speech recognition engine and converts it into corresponding text data. This text data is then ready for further processing within the system.

[0152] Step 3:

[0153] The server passes the text data to a morphological analysis engine, which breaks down words into parts of speech. Nouns and keywords, in particular, are identified and marked so that their relevant information can be presented to the user later.

[0154] Step 4:

[0155] The server uses an emotion engine to analyze the user's emotions in the voice data. The emotion recognition results are used as data indicating the user's state.

[0156] Step 5:

[0157] When a user selects a noun or keyword on their terminal, the server collects relevant information from databases and external sources. The collected information can then be summarized according to the user's emotional state.

[0158] Step 6:

[0159] The server optimizes the summary information based on the results of the emotion analysis. For example, if the user is feeling anxious, it may emphasize information that provides reassurance.

[0160] Step 7:

[0161] The server sends the summarized information to the terminal. The terminal displays it clearly to the user to support their understanding. The display method is adjusted as appropriate based on the emotion analysis results.

[0162] Step 8:

[0163] The terminal visualizes the user's emotional state in real time and displays changes. This allows the user to understand their own emotional changes and respond as needed.

[0164] Step 9:

[0165] If a user requests multilingual translation, the terminal sends the request to the server. The server translates the text into the selected language, and the translation result is returned to the terminal and provided to the user.
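The emotion-based optimization of Step 6 can be illustrated with the following Python sketch, which moves reassuring information to the front when the user is judged to be anxious or stressed. The `Snippet` class, the tone labels, the `PREFERRED_TONE` table, and the sample texts are all hypothetical illustrations and do not represent the disclosed emotion engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    text: str
    tone: str  # hypothetical label: "reassuring", "neutral", or "cautionary"


# Hypothetical mapping from a recognized emotion to the tone shown first.
PREFERRED_TONE = {"anxious": "reassuring", "stressed": "reassuring"}


def order_for_emotion(snippets: list[Snippet], emotion: str) -> list[Snippet]:
    """Put snippets whose tone suits the emotion first; keep the rest in order."""
    preferred = PREFERRED_TONE.get(emotion)
    # sorted() is stable, so snippets with equal keys keep their original order.
    return sorted(snippets, key=lambda s: s.tone != preferred)


snippets = [
    Snippet("Launch risks to review.", "cautionary"),
    Snippet("Similar launches have gone well before.", "reassuring"),
    Snippet("The launch date is in June.", "neutral"),
]
ordered = order_for_emotion(snippets, "anxious")
print([s.text for s in ordered])
```

Reordering rather than filtering keeps all collected information available to the user while changing only what is emphasized.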

[0166] (Example 2)

[0167] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0168] Conventional systems simply convert user speech into text data and present information, lacking the ability to provide flexible information based on the user's emotional state. Furthermore, they lack mechanisms to support decision-making through information presentation that considers the user's emotions, thus failing to contribute to smoother communication.

[0169] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0170] In this invention, the server includes speech recognition means, part-of-speech analysis means, emotion recognition means, information presentation means, and display means. This enables the recognition of emotions from the user's speech and the presentation of optimal information corresponding to those emotions.

[0171] "Speech recognition means" refers to a function that converts voice input acquired by a terminal into text data in real time.

[0172] "Part-of-speech analysis means" refers to a function that performs morphological analysis on the acquired text data to identify specific parts of speech.

[0173] "Emotion recognition means" refers to a function that analyzes identified terms and recognizes the emotional state of an utterance.

[0174] "Information presentation means" refers to a function that collects information in accordance with the user's emotions and uses artificial intelligence to generate and present a summary in an optimal form.

[0175] "Display means" refers to a function that includes an interface for transmitting information to a terminal and providing that information to the user visually or audibly.

[0176] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication with users. The system mainly consists of a server and terminals, with the server functioning as the central hub of the system.

[0177] The server receives audio data from the terminal and converts it into text data using speech recognition. For this purpose, the server can use general-purpose speech recognition software. Specifically, a speech recognition API can be used as the speech recognition engine. The converted text data is then parsed into parts of speech via a morphological analysis tool. This analysis allows the server to identify specific parts of speech and nouns.
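As an illustration of the noun identification described above, the following Python sketch extracts nouns from analyzer output. It assumes the analyzer emits lines in the tab-separated layout that MeCab produces by default with an IPA-style dictionary (surface form, a tab, then comma-separated features whose first field is the part of speech). The sample string is hand-written for illustration and was not produced by running an analyzer, and the function name `extract_nouns` is hypothetical.

```python
def extract_nouns(analyzer_output: str) -> list[str]:
    """Return the surface forms whose first feature field is 名詞 (noun)."""
    nouns = []
    for line in analyzer_output.splitlines():
        # Skip the end-of-sentence marker and any line without a feature part.
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        if features.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns


sample = (
    "プロジェクト\t名詞,一般,*,*,*,*,プロジェクト,プロジェクト,プロジェクト\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "管理\t名詞,サ変接続,*,*,*,*,管理,カンリ,カンリ\n"
    "する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル\n"
    "EOS\n"
)
print(extract_nouns(sample))
```

Working on the analyzer's text output keeps the sketch independent of any particular binding library.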

[0178] Next, the server uses an emotion recognition engine to analyze the user's emotional state from the obtained data. For example, an emotion analysis API can be used for emotion recognition. This analysis allows the server to prepare information that reflects the user's emotional state.

[0179] Furthermore, the server uses the information presentation means to collect information tailored to the user's emotions and uses an artificial intelligence model to generate an optimal summary. This information is transmitted to the terminal via the display means and provided to the user visually or audibly.

[0180] The terminal functions as a user interface, handling voice input and output. It acquires voice from the user and transmits it to the server. It also presents the processed information to the user on its screen or through its speaker, allowing the user to make decisions while monitoring their emotional state in real time.

[0181] For example, if a user says, "I feel anxious about the next project meeting," the server analyzes the user's emotions from that statement and suggests ways to relax and provides helpful success stories. In this way, the system provides personalized information based on the user's emotions.

[0182] An example of a prompt to input into a generative AI model is, "Show how to provide effective information to a user who is feeling anxious while preparing a presentation." Using this prompt, the model can generate appropriate responses for a user's specific emotional state.
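A prompt of this kind could be generated from the recognized emotion and the user's current activity, as in the following minimal Python sketch. The function name `build_support_prompt`, its parameters, and the treatment of a "neutral" emotion are assumptions made for illustration.

```python
def build_support_prompt(emotion: str, activity: str) -> str:
    """Build a prompt that conditions the model's response on the user's emotion."""
    if emotion == "neutral":
        # Assumed fallback: omit the emotion clause when none was recognized.
        return f"Show how to provide effective information to a user who is {activity}."
    return (
        "Show how to provide effective information to a user "
        f"who is feeling {emotion} while {activity}."
    )


print(build_support_prompt("anxious", "preparing a presentation"))
```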

[0183] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0184] Step 1:

[0185] The terminal acquires voice input from the user. This voice data is collected in real time by the microphone and sent to the server as raw voice data.

[0186] Step 2:

[0187] The server inputs the received audio data into a speech recognition engine, which converts it into text data. This engine analyzes sound wave information and identifies words and phrases from combinations of phonemes. The output is text data of the user's speech.

[0188] Step 3:

[0189] The server inputs the converted character data into a morphological analysis tool and performs part-of-speech analysis. This analysis identifies grammatical structures and tags parts of speech such as nouns and verbs. The output is text data with part-of-speech information attached.

[0190] Step 4:

[0191] The server inputs the part-of-speech analysis results into an emotion recognition engine to identify the emotional state of the text. This engine evaluates the context of words based on specific emotion indicators and identifies the intensity and type of emotion. The output is structured data indicating the emotional state.

[0192] Step 5:

[0193] The server uses the information presentation means to collect relevant information based on the user's emotional state and the identified keywords. Using artificial intelligence, it generates summaries tailored to the user's emotions and extracts information useful to the user. The output is summarized information that matches the user's emotions.

[0194] Step 6:

[0195] The server sends the generated information to the terminal, which then provides it to the user in visual or auditory form. Specific actions include displaying relevant information on the screen or playing audio messages through the speaker. The user can then use this information to make further decisions or take action.
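The structured emotion data output in Step 4 (type and intensity of emotion) can be illustrated with the following Python sketch. It substitutes a small keyword table for the emotion recognition engine, which in practice would more likely be a trained model. The `INDICATORS` table, its weights, and the `EmotionState` class are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical emotion indicators: word -> (emotion type, weight).
INDICATORS = {
    "anxious": ("anxiety", 0.8),
    "worried": ("anxiety", 0.6),
    "glad": ("joy", 0.7),
}


@dataclass(frozen=True)
class EmotionState:
    kind: str         # type of emotion
    intensity: float  # 0.0 (none) to 1.0 (strong)


def recognize_emotion(tokens: list[str]) -> EmotionState:
    """Sum indicator weights per emotion type and return the strongest type."""
    scores: dict[str, float] = {}
    for token in tokens:
        hit = INDICATORS.get(token.lower())
        if hit is not None:
            kind, weight = hit
            scores[kind] = scores.get(kind, 0.0) + weight
    if not scores:
        return EmotionState("neutral", 0.0)
    kind = max(scores, key=lambda k: scores[k])
    return EmotionState(kind, min(scores[kind], 1.0))


tokens = "I feel anxious about the next project meeting".split()
print(recognize_emotion(tokens))
```

Returning a small immutable record makes the emotional state easy to pass on to the information collection and presentation of Steps 5 and 6.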

[0196] (Application Example 2)

[0197] Next, we will describe Application Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0198] In elderly care settings, providing appropriate support in response to the emotions of both the care recipient and the caregiver is crucial. However, it is difficult for care staff to immediately understand the emotions of the care recipient and respond appropriately based on that understanding, highlighting the need for improved communication quality. Furthermore, language barriers pose a problem in communication between caregivers and care recipients who speak diverse languages. A system is needed to address these challenges.

[0199] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0200] In this invention, the server includes a speech recognition means, a morphological analysis means, an information presentation means, a terminology conversion means for converting technical terms and business terms into general terms, a translation means for performing translation between multiple languages, and an emotion analysis and information provision means for determining the emotional state of the user or person being cared for and providing appropriate information or messages based on those emotions. This enables timely and appropriate emotionally sensitive information presentation in care settings and facilitates smooth communication between users who speak diverse languages.

[0201] A "speech recognition means" is a device that has the function of acquiring speech input in real time and converting that speech into text data.

[0202] A "morphological analysis means" is a device that breaks down character data acquired by a speech recognition means and has the function of identifying specific parts of speech.

[0203] An "information presentation means" is a device that has the function of acquiring related information when an identified noun is selected, and summarizing and displaying that information.

[0204] A "terminology conversion means" is a device that has the function of converting specialized or business-related terms into general terms.

[0205] A "translation means" is a device that performs translation between multiple languages and has the function of converting text into a specified language.

[0206] "Emotional analysis and information provision means" refers to a system that has the function of determining the emotional state of a user or person receiving care and providing appropriate information or messages based on those emotions.

[0207] "Machine learning" is a technology that uses data to automatically improve algorithms, and is applied to improve the accuracy of information.

[0208] To implement the present invention, a system centered on speech recognition and emotion analysis can be envisioned operating on a smart device. This system consists of a server and a terminal, and the user provides voice input via the terminal. Voice data is acquired in real time through the terminal's microphone and transmitted to the server. The server converts the voice into text data using speech recognition means.

[0209] The server uses the morphological analysis means to break down the text data into parts of speech and identify specific important words and phrases. Then, using the emotion analysis and information provision means, it determines the user's emotional state and collects, summarizes, and presents relevant information based on the analysis results. The information is adjusted according to the user's emotions and displayed on the terminal in a personalized format.

[0210] If multilingual support is required, the text is translated into the user's specified language using the translation means. The presented information and the translated text are intended to provide support that is sensitive to the feelings of both the care recipient and the caregiver, and are provided on the terminal as a screen display or audio output.

[0211] For example, if a care recipient inputs "I'm feeling a little anxious today...", the system uses emotion analysis to recognize this state and suggests calming music or relaxation techniques. It also advises care staff to "take a moment, hold the care recipient's hand, and encourage slow breathing."

[0212] An example of a prompt might be, "Please suggest effective communication methods to help caregivers alleviate the anxiety of those they care for."

[0213] This allows the server to personalize and deliver appropriate information in real time according to the user's needs, enabling effective, emotion-based support in caregiving settings.

[0214] The flow of the specific processing in Application Example 2 will be explained using Figure 14.

[0215] Step 1:

[0216] The user inputs voice using the terminal's microphone. The terminal captures the voice in real time and generates digital audio data. This audio data is sent to the server for analysis.

[0217] Step 2:

[0218] The server receives the audio data and converts it to text using the speech recognition means. In this step, the input is the audio data, and the output is the converted text data.

[0219] Step 3:

[0220] The server analyzes the character data using the morphological analysis means and extracts specific parts of speech, particularly nouns. The input in this step is character data, and the output is a list of extracted nouns. The analysis is performed using a morphological analysis algorithm.

[0221] Step 4:

[0222] The server uses emotion analysis and information provision means to determine the user's emotions from the voice input. The input in this step is voice data and text data, and the output is emotion data indicating the emotional state. Emotion analysis is performed by an emotion recognition engine.

[0223] Step 5:

[0224] The server collects relevant information based on the emotional state and the noun list, and summarizes the information using the information presentation means. The input in this step is the emotion data and the noun list, and the output is the summarized information.

[0225] Step 6:

[0226] The server uses the translation means to translate the text into the user's specified language as needed. The input in this step is the summarized information, and the output is the translated information.

[0227] Step 7:

[0228] The server adjusts the information based on emotional data and sends it to the terminal. The terminal presents this information to the user as a screen display or audio output. In this step, the input is the adjusted information, and the output is the information delivered to the user.
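The conditional translation of Step 6 and the choice between screen display and audio output in Step 7 can be sketched in Python as follows. The `Delivery` class, the `prepare_delivery` function, and the stand-in translator are hypothetical, and the emotion-based adjustment of Step 7 is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Delivery:
    text: str
    language: str
    channel: str  # "screen" or "audio"


def prepare_delivery(
    summary: str,
    source_language: str,
    user_language: Optional[str],
    prefers_audio: bool,
    translate: Callable[[str, str], str],
) -> Delivery:
    """Translate only when another language was requested, then pick a channel."""
    language = user_language or source_language
    if language == source_language:
        text = summary  # Step 6 is skipped when no translation is needed
    else:
        text = translate(summary, language)
    return Delivery(text, language, "audio" if prefers_audio else "screen")


def fake_translate(text: str, lang: str) -> str:
    """Stand-in for a real translation service."""
    return f"<{lang}>{text}"


print(prepare_delivery("Take a slow breath.", "ja", "en", True, fake_translate))
```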

[0229] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0230] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
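The input and output relationship described for the data generation model 58 (a prompt plus inference data in, an inference result out) can be expressed as a small interface, sketched below in Python. The names `InferenceInput`, `GenerativeModel`, and `EchoModel` are hypothetical. `EchoModel` is a stand-in that only reports which inputs it received and performs no actual inference.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceInput:
    prompt: str                    # instructions for the model
    text: Optional[str] = None     # text data representing text
    audio: Optional[bytes] = None  # audio data representing speech
    image: Optional[bytes] = None  # image data representing an image


class GenerativeModel(Protocol):
    def infer(self, request: InferenceInput) -> str:
        """Return the inference result as text."""
        ...


class EchoModel:
    """Stand-in model: reports what it was given instead of inferring."""

    def infer(self, request: InferenceInput) -> str:
        kinds = [
            name for name in ("text", "audio", "image")
            if getattr(request, name) is not None
        ]
        return f"{request.prompt} [inputs: {', '.join(kinds) or 'none'}]"


model: GenerativeModel = EchoModel()
print(model.infer(InferenceInput(prompt="Summarize", text="meeting notes")))
```

Coding against the interface rather than a specific service would let the specific processing be tested with a stand-in such as this one.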

[0231] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0232] [Second Embodiment]

[0233] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0234] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0235] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0236] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0237] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0238] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0239] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0240] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0241] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0242] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0243] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0244] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0245] This invention is a system that integrates speech recognition, morphological analysis, information presentation, terminology conversion, and multilingual translation, and aims to facilitate smooth communication.

[0246] Server role:

[0247] The server functions as the central hub of this system. It first receives the audio of the meeting transmitted from the terminals and converts this audio into highly accurate text information using speech recognition technology. The converted text information is then broken down by part of speech through a morphological analysis engine, and nouns in particular are identified.

[0248] The server also collects information related to nouns selected by the user from databases and external resources. The collected information is summarized by AI and sent to the terminal. Furthermore, the server has the ability to detect technical and business terms in the text and replace them with general terms.
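The detection and replacement of technical and business terms could, in a simple case, be driven by a glossary, as in the following Python sketch. The glossary entries and the function name `generalize_terms` are hypothetical. The sketch relies on word boundaries and is therefore suited to space-delimited text such as English. For Japanese text, the replacement would more plausibly operate on the tokens produced by morphological analysis.

```python
import re

# Hypothetical glossary: technical or business term -> general wording.
GLOSSARY = {
    "KPI": "performance target",
    "stakeholders": "people involved",
    "bandwidth": "available time",
}


def generalize_terms(text: str, glossary: dict[str, str]) -> str:
    """Replace whole-word glossary terms with general wording, ignoring case."""
    if not glossary:
        return text
    # Try longer terms first so a short term never splits a longer one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): plain for term, plain in glossary.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


print(generalize_terms("Share the KPI with stakeholders.", GLOSSARY))
```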

[0249] The server is equipped with a multilingual translation engine that translates information into the language specified by the user, providing information tailored to the user's needs.

[0250] Terminal role:

[0251] The terminal is a device used by users participating in a meeting. It transmits audio to the server and displays text and summary information received from the server to the user. When a user clicks on a specific noun, additional information related to that noun is displayed on the terminal. The latest converted and translated text information is also reflected on the terminal in real time.

[0252] User actions:

[0253] Users can use their devices during meetings to select specific nouns or technical terms. This makes it easier to understand what is being said and the flow of the discussion through the presentation of useful information and translations. This system is designed to enable even new business professionals and IT novices to participate effectively in meetings.

[0254] Specific example:

[0255] For example, if a user hears the term "project management" during a meeting, clicking on that noun causes the server to provide a summary of the definition of "project management" and related common practices. Furthermore, the Japanese meeting content can be translated into English and shared with colleagues in overseas branches.

[0256] By applying this invention, it is possible to reduce the time needed to understand technical terms and industry-specific jargon, and to adapt to diverse language environments, thereby reducing communication barriers.

[0257] The following describes the processing flow.

[0258] Step 1:

[0259] The terminal uses its microphone to capture the voices of meeting participants, converts them to a digital format, and sends them to the server.

[0260] Step 2:

[0261] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This text data is temporarily stored for further processing.

[0262] Step 3:

[0263] The server sends the acquired text data to a morphological analysis engine. The engine performs part-of-speech analysis, specifically identifying and marking nouns. The results of this processing are used to display information when a noun is clicked.

[0264] Step 4:

[0265] The user operates the device and clicks on a noun that interests them.

[0266] Step 5:

[0267] The server identifies the clicked noun and retrieves information related to that noun from databases and external sources. The retrieved data is then summarized into concise information through a summarization algorithm.

[0268] Step 6:

[0269] The server sends the summarized information to the terminal, which displays it to the user. This additional information allows the user to deepen their understanding of the meeting content.

[0270] Step 7:

[0271] The server detects technical and business terms and converts them into more general terms. The converted text is then sent to the terminal for the user to review.

[0272] Step 8:

[0273] When a user specifies a desired language on the terminal, the server passes the meeting text to a translation engine, which translates it into the specified language. The translated text is displayed on the terminal in real time.
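The handling of a clicked noun in Steps 4 to 6 can be sketched in Python as follows. The in-memory `KNOWLEDGE_BASE` is a hypothetical stand-in for the databases and external sources, and the summary simply keeps the leading sentences rather than using the summarization algorithm or AI model described above. The function name and the payload keys are likewise assumptions.

```python
# Hypothetical stand-in for the databases and external sources.
KNOWLEDGE_BASE = {
    "project management": [
        "Project management is the practice of planning and guiding work to meet goals.",
        "Common methods include waterfall and agile approaches.",
        "Typical tools are schedules, task boards, and status reports.",
    ],
}


def handle_noun_click(noun: str, max_sentences: int = 2) -> dict:
    """Look up the clicked noun and build a summary payload for the terminal."""
    sentences = KNOWLEDGE_BASE.get(noun.lower(), [])
    if not sentences:
        return {"noun": noun, "found": False, "summary": ""}
    # Naive extractive summary: keep only the leading sentences.
    summary = " ".join(sentences[:max_sentences])
    return {"noun": noun, "found": True, "summary": summary}


print(handle_noun_click("Project Management"))
```

Returning an explicit `found` flag lets the terminal distinguish a noun with no available information from one with an empty summary.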

[0274] (Example 1)

[0275] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0276] In meetings and discussions, it is difficult to understand the specialized terminology and industry-specific language used by participants in real time and to ensure smooth communication across multiple languages. Furthermore, there is a need for a means to quickly gather and summarize information and for participants to easily follow the progress of the meeting. Traditional systems often fail to effectively address these challenges.

[0277] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0278] In this invention, the server includes: voice conversion means for acquiring voice data in real time and converting the data into text information; data analysis means for performing morphological analysis on the text information acquired by the voice conversion means to identify the part of speech; and information providing means for acquiring related information and providing summary information when the user selects a noun that has been identified. This generalizes technical terms and business terms used during meetings, enabling accurate and rapid communication with participants who use different languages.

[0279] The "voice conversion means" is a technology for acquiring voice data in real time and converting the voice into text information.

[0280] The "data analysis means" is a technology for performing morphological analysis on the text information acquired by the voice conversion means to identify the part of speech used therein.

[0281] The "information providing means" is a technology for acquiring related information based on the noun selected by the user, summarizing it, and providing it to the user.

[0282] The "data conversion means" is a technology for converting technical terms and business terms into general terms so that users can easily understand the content.

[0283] The "translation processing means" is a technology for accurately translating text information between multiple languages to enable mutual understanding between different languages.

[0284] The "language conversion means" is a technology for translating text information into the language specified by the user.

[0285] The "generation model" is an algorithm for summarizing information using artificial intelligence technology and generating new text.

[0286] Embodiments of this invention are systems for assisting smooth communication in meetings and discussion settings. The configuration and operation of this system will be specifically described below.

[0287] The server functions as the central processing unit of this system. First, the server receives voice data from the terminal. The server uses a general-purpose speech recognition API to convert the voice data into text in real time. This API can quickly convert voice data into character data.

[0288] The converted text information is decomposed into parts of speech by a morphological analysis engine. This engine identifies major keywords and nouns, and the server evaluates the relevance of information based on them. The server also collects relevant information from external resources and databases, and the information presentation means generates summary information using an artificial intelligence model.

[0289] The terminal is a device operated by the user; it captures the voice of the meeting and transmits it to the server. It also displays to the user the converted text and summary information returned by the server. The user can thereby obtain relevant information directly by selecting an identified noun. Moreover, because this information is translated into multiple languages using a multilingual translation API, smooth communication across language barriers can be achieved even in international meetings.

[0290] Based on the information displayed on the terminal, the user can understand important technical terms and business terms in the discussion in real time. For specific words, this system can provide additional information by passing prompt sentences to the generative AI model. As a specific example, a prompt sentence such as "Regarding the noun 'project management' used during the meeting, please summarize the general definition and related methods and provide them." can be used.
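As an illustration only, the prompt sentence quoted above can be treated as a template into which the selected noun is filled. The function name `build_term_prompt` is a hypothetical helper introduced for this sketch; the specification does not state how the prompt is assembled.

```python
def build_term_prompt(noun: str) -> str:
    """Fill the selected noun into the prompt wording quoted in the text.

    Hypothetical helper: the name and the whitespace handling are assumptions.
    """
    cleaned = noun.strip()
    if not cleaned:
        raise ValueError("a non-empty noun is required")
    return (
        f"Regarding the noun '{cleaned}' used during the meeting, "
        "please summarize the general definition and related methods "
        "and provide them."
    )


if __name__ == "__main__":
    print(build_term_prompt("project management"))
```

Keeping the wording in one place would make it straightforward to swap in other prompt sentences for other kinds of words.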

[0291] Thus, this system, starting with speech recognition and continuing through a series of processes including multilingual translation, provides an environment where participants can more efficiently and accurately grasp the flow of conversation.

[0292] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0293] Step 1:

[0294] The terminal captures audio data via the microphone during the meeting. This input audio data is streamed to the server in real time via a secure communication protocol. The terminal compresses and packetizes this audio data as appropriate to improve data transfer efficiency.

[0295] Step 2:

[0296] The server inputs the audio data received from the terminal into a speech recognition API. This API converts the audio into text data. Specifically, it analyzes the language from the audio waveform and outputs it as text information. Through this process, the audio input is converted into clean text data.

[0297] Step 3:

[0298] The server passes the generated text data to the morphological analysis engine. The morphological analysis engine divides the input text into words and identifies the part of speech of each word. The output is a list of identified nouns in the text. This list is used for subsequent information extraction.

[0299] Step 4:

[0300] The user selects a noun displayed on the device. Based on the selected noun, the server retrieves relevant information from an external source or database. This information gathering is performed via an API, and the information is returned to the server. The output of this process is detailed information related to the selected noun.

[0301] Step 5:

[0302] The server summarizes the acquired detailed information using a generation AI model. This generation AI model quickly summarizes a large amount of text information, extracts important points, and compiles them compactly. This summarized information is output and used to provide information to the user.

[0303] Step 6:

[0304] The server applies a data conversion algorithm to replace technical terms with more common terms. Through this process, complex words and phrases are converted into plain terms to assist the user's understanding. What is output is the revised text information.

[0305] Step 7:

[0306] The server passes the text information to a multilingual translation API as needed and translates it into the specified language. The translated text is processed on the server and immediately sent to the terminal. The terminal displays this translated information to the user to support communication in a multilingual environment. The user can also use a prompt sentence on the terminal to request additional information.
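The server-side part of Steps 2 to 7 can be sketched as a chain of replaceable stages. This is a minimal sketch under stated assumptions, not the implementation described in the specification: the class `MeetingPipeline`, its field names, and the toy stand-ins in `demo_pipeline` are all hypothetical, and the real stages would call a speech recognition API, a morphological analysis engine, a generative AI model, and a translation API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MeetingPipeline:
    """Server-side stages of Steps 2 to 7; every stage is injected (hypothetical)."""

    recognize: Callable[[bytes], str]          # Step 2: audio -> text
    extract_nouns: Callable[[str], List[str]]  # Step 3: text -> noun list
    fetch_info: Callable[[str], str]           # Step 4: noun -> detailed information
    summarize: Callable[[str], str]            # Step 5: information -> summary
    simplify: Callable[[str], str]             # Step 6: technical terms -> plain terms
    translate: Callable[[str, str], str]       # Step 7: (text, language) -> text

    def transcribe(self, audio: bytes) -> List[str]:
        """Steps 2 and 3: return the nouns the user can select from."""
        return self.extract_nouns(self.recognize(audio))

    def explain(self, noun: str, target_lang: str) -> str:
        """Steps 4 to 7: return a translated plain-language summary."""
        detail = self.fetch_info(noun)
        summary = self.summarize(detail)
        plain = self.simplify(summary)
        return self.translate(plain, target_lang)


def demo_pipeline() -> MeetingPipeline:
    """Toy stand-ins so the sketch runs without any external service."""
    known_terms = {"KPI", "roadmap"}
    return MeetingPipeline(
        recognize=lambda audio: audio.decode("utf-8"),  # pretend the bytes are text
        extract_nouns=lambda text: [w for w in text.split() if w in known_terms],
        fetch_info=lambda noun: f"{noun}: detailed reference text",
        summarize=lambda info: info.split(":")[0] + " (summary)",
        simplify=lambda text: text.replace("KPI", "key performance indicator"),
        translate=lambda text, lang: f"[{lang}] {text}",
    )


if __name__ == "__main__":
    pipeline = demo_pipeline()
    print(pipeline.transcribe(b"review the KPI and roadmap"))
    print(pipeline.explain("KPI", "en"))
```

Injecting each stage as a callable keeps the order of the steps explicit while leaving the choice of concrete API open.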

[0307] (Application Example 1)

[0308] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0309] In communication settings, there is a need for information exchange among team members who speak different languages and for facilitating conversations in fields with a lot of specialized terminology. In particular, in on-site security operations, rapid and reliable information transmission is crucial, but language and jargon barriers often hinder communication. Furthermore, there is a lack of support for effectively integrating audio and visual information acquired on-site to correctly understand the situation. Technologies are needed to address these challenges.

[0310] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0311] In this invention, the server includes: a speech recognition means for acquiring voice input in real time and converting the voice into text data; a part-of-speech parsing means for performing morphological analysis on the text data acquired by the speech recognition means and identifying specific parts of speech; an information presentation means for acquiring related information and displaying a summary when a user selects an identified noun; a term conversion means for converting technical terms and business terms into general vocabulary; a translation means for performing translation between multiple languages; an information distribution means for providing translated text information to on-site staff via an information display device; and a situation analysis means for linking visual information acquired from a monitoring device with voice reports to identify abnormal events. This enables rapid information sharing and accurate understanding of situations among staff who speak different languages.

[0312] "Voice input" refers to acquiring acoustic signals in real time and using that information as data for processing.

[0313] "Text data" refers to text information obtained by converting speech, and it can be processed and displayed electronically.

[0314] "Speech recognition means" refers to a device or system for receiving speech information and converting it into text data.

[0315] A "part-of-speech parsing means" is a device or system for performing morphological analysis on text data and identifying specific parts of speech.

[0316] An "information presentation means" is a device or system for acquiring knowledge related to a noun selected by the user, summarizing it, and displaying it.

[0317] A "term conversion means" is a device or system for replacing technical terms or business jargon with general vocabulary.

[0318] "Translation means" refers to a device or system for converting text information between languages.

[0319] "Information distribution means" refers to a device or system for transmitting acquired information to an information display device and presenting it to staff.

[0320] A "situation analysis means" is a device or system that integrates visual and auditory information to identify abnormal events.

[0321] This invention is designed as an information gathering and presentation system for use by on-site security staff. The server acquires voice input in real time, with the acoustic signals captured by the microphones of smartphones or information display devices. The speech recognition means uses the Google Speech-to-Text API or Amazon Transcribe to convert the speech into text data.

[0322] Next, the text data is preprocessed for morphological analysis. Here, parts of speech are analyzed using a morphological analysis engine such as MeCab, and particularly important nouns are extracted. When the user selects a specific noun, the server retrieves relevant information from a database or external resources and summarizes that information using a generative AI model. In this process, OpenAI's GPT model or similar might be used.
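To make the noun extraction concrete, the following sketch filters nouns out of text laid out in MeCab's default output format (surface form, a tab, then comma-separated features whose first field is the part of speech, with `EOS` closing a sentence). It assumes an IPAdic-style dictionary in which nouns are tagged 名詞. The sample lines are hand-written for illustration rather than captured from an actual MeCab run, and `extract_nouns` is a hypothetical helper name.

```python
from typing import List

# Hand-written lines in MeCab's default "surface<TAB>feature,feature,..." layout.
# Illustrative only: not captured from a real MeCab run.
SAMPLE_OUTPUT = "\n".join([
    "車両\t名詞,一般,*,*,*,*,車両,シャリョウ,シャリョー",
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ",
    "発見\t名詞,サ変接続,*,*,*,*,発見,ハッケン,ハッケン",
    "EOS",
])


def extract_nouns(parsed: str) -> List[str]:
    """Return the unique noun surface forms, in order of first appearance."""
    nouns: List[str] = []
    for line in parsed.splitlines():
        # Skip the end-of-sentence marker and anything without a feature column.
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        part_of_speech = features.split(",")[0]
        if part_of_speech == "名詞" and surface not in nouns:
            nouns.append(surface)
    return nouns


if __name__ == "__main__":
    print(extract_nouns(SAMPLE_OUTPUT))
```

In a deployment the `parsed` string would come from the morphological analysis engine itself; only the filtering step is shown here.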

[0323] Subsequently, the extracted technical and business terms are converted into general vocabulary by the term conversion means and displayed in a format that is easy for the user to understand. Text translation between multiple languages is performed using translation tools such as the Google Translate API or the DeepL API.
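One simple way to picture the conversion of technical terms into general vocabulary is a glossary lookup. The sketch below is an assumption made for illustration: the specification does not say how the conversion is performed, and the glossary entries and the function name `to_plain_terms` are hypothetical.

```python
import re
from typing import Dict

# Hypothetical glossary: technical or business term -> general wording.
GLOSSARY: Dict[str, str] = {
    "KPI": "key performance indicator",
    "perimeter breach": "entry into a restricted area",
}


def to_plain_terms(text: str, glossary: Dict[str, str]) -> str:
    """Replace whole-word glossary terms (case-insensitive) with plain wording."""
    if not glossary:
        return text
    # Longest terms first so multi-word entries win over shorter overlaps.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): plain for term, plain in glossary.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


if __name__ == "__main__":
    print(to_plain_terms("Report every perimeter breach and update the KPI.", GLOSSARY))
```

The word-boundary match suits space-delimited languages; for Japanese text the replacement would more naturally operate on the tokens produced by the morphological analysis.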

[0324] The translated information is provided to on-site staff in real time via information display devices through information distribution means. Smart glasses or mobile devices may be used for this information display. Situation analysis means are also used to integrate visual information obtained from monitoring devices with audio reports. This makes it possible to quickly identify abnormal events and take appropriate countermeasures.

[0325] For example, if a security staff member verbally reports a suspicious vehicle, a voice recognition system converts the report into text data, extracting the key phrase "suspicious vehicle." The server automatically summarizes the information, translates it into multiple languages, and immediately shares it with the entire team.

[0326] An example of a prompt for a generative AI model might be: "Summarize the latest reports related to the keyword 'suspicious vehicle' and generate a multilingual description. Translate it into English and add any relevant information."

[0327] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0328] Step 1:

[0329] The terminal acquires voice input from on-site staff in real time. This input voice is collected through the microphone of the information display device.

[0330] Step 2:

[0331] The server converts the acquired speech input into text data using the Google Speech-to-Text API or Amazon Transcribe. This conversion process transforms the data from speech to text.

[0332] Step 3:

[0333] The server performs morphological analysis on the obtained character data using the MeCab library and identifies the part of speech of each word. Through this analysis, nouns and important phrases are extracted. The input is character data, and the output is a list of nouns.

[0334] Step 4:

[0335] The user selects a noun of interest from the displayed list. This selection action triggers the next information retrieval process.

[0336] Step 5:

[0337] The server retrieves information related to the selected noun from databases and external resources. This information is then embedded in a prompt sentence and summarized using a generative AI model. The output of this step is the summarized information.

[0338] Step 6:

[0339] The server uses a terminology conversion mechanism to convert technical terms into general vocabulary and arranges them in a format that is easily understood by a wider range of users. The input is summary information, and the output is the converted information.

[0340] Step 7:

[0341] The server uses the DeepL API or Google Translate API to perform translations and generate multilingual information. This makes it possible to communicate information to users who speak different languages.

[0342] Step 8:

[0343] The server transmits the translated information to the terminal at the site via an information distribution system. The terminal then presents the information to the user using an information display device. This allows for immediate feedback.

[0344] Step 9:

[0345] The server integrates visual and audio data from monitoring devices and identifies abnormal events using the situation analysis means. Based on the integrated data, it creates a prompt sentence for a generative AI model, which then provides specific action guidelines.
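The linking of visual information with voice reports in Step 9 could, for example, be based on time proximity and matching labels. The specification does not define the matching rule, so the rule used here (same label within a time window) is an assumption, as are the types `VoiceReport` and `Detection`, the function `find_abnormal_events`, and the 30-second default window.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VoiceReport:
    time_s: float               # when the report was spoken, in seconds
    keywords: Tuple[str, ...]   # nouns extracted from the transcribed report


@dataclass(frozen=True)
class Detection:
    time_s: float               # when the monitoring device saw the object
    label: str                  # object label from the visual analysis
    camera_id: str


def find_abnormal_events(
    reports: List[VoiceReport],
    detections: List[Detection],
    window_s: float = 30.0,
) -> List[Tuple[VoiceReport, Detection]]:
    """Pair each voice report with detections of a reported object nearby in time.

    Assumed rule: a pair counts as an abnormal event candidate when the
    detection label is among the report keywords and the timestamps differ
    by at most window_s seconds.
    """
    events = []
    for report in reports:
        for det in detections:
            close_in_time = abs(report.time_s - det.time_s) <= window_s
            if close_in_time and det.label in report.keywords:
                events.append((report, det))
    return events


if __name__ == "__main__":
    reports = [VoiceReport(100.0, ("suspicious", "vehicle"))]
    detections = [
        Detection(110.0, "vehicle", "cam-1"),
        Detection(500.0, "vehicle", "cam-2"),
        Detection(105.0, "person", "cam-1"),
    ]
    for report, det in find_abnormal_events(reports, detections):
        print(det.camera_id, det.label, det.time_s)
```

The resulting pairs could then be written into the prompt sentence from which the generative AI model produces action guidelines.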

[0346] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0347] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication. The system includes functions for speech recognition, morphological analysis, information presentation, terminology conversion, multilingual translation, and emotion recognition.

[0348] 1. Overall structure:

[0349] The server functions as the central hub of the system, processing voice data, analyzing emotion data, and presenting and translating information. The terminal provides the interface with the user, collecting voice input from the user, sending it to the server, and displaying the processed information.

[0350] 2. Server processing:

[0351] The server converts the audio transmitted from the terminal into text data using a speech recognition engine. This text data is then broken down by morphological analysis to identify nouns and important words. Furthermore, the server can use an emotion engine to analyze the emotions contained in the audio data. The analysis results are used as data that provides a detailed indication of the user's emotional state.

[0352] The server adjusts information presentation based on the user's emotional state, collecting, summarizing, and displaying information related to the noun selected by the user. This information is presented in a way that aligns with the user's emotions, enabling a more personalized experience.

[0353] 3. Role of the terminal:

[0354] The terminal is a device used by the user and primarily manages voice input and output. The terminal acquires voice and sends it to the server, and displays the server's response to the user. It also has an interface for visualizing the user's emotional state, allowing changes to be monitored in real time.

[0355] 4. User interaction and experience:

[0356] Users input voice commands into their devices and receive information based on the analyzed results. For example, if a user says "project management," the server uses an emotion engine to analyze the emotions expressed in the statement and presents information in the most appropriate format. Furthermore, if the system determines that the user is feeling stressed, it can suggest information and methods to help them relax.

[0357] Specific example:

[0358] When a user mentions "the introduction of a new product" during a meeting, the server analyzes the emotions behind the statement and, if it determines that the user is feeling anxious, provides reassuring information or examples of past success stories.

[0359] This system provides an environment that enables users to make better decisions by delivering information with consideration for their emotions. Furthermore, the system continuously learns and improves the relevance and accuracy of the information presented.

[0360] The following describes the processing flow.

[0361] Step 1:

[0362] The device uses a microphone to capture the user's voice. The captured voice is converted into a digital format and sent to the server.

[0363] Step 2:

[0364] The server inputs the received audio data into the speech recognition engine and converts it into corresponding text data. This text data is then ready for further processing within the system.

[0365] Step 3:

[0366] The server passes the text data to a morphological analysis engine, which breaks down words into parts of speech. Nouns and keywords, in particular, are identified and marked so that their relevant information can be presented to the user later.

[0367] Step 4:

[0368] The server uses an emotion engine to analyze the user's emotions in the voice data. The emotion recognition results are used as data indicating the user's state.

[0369] Step 5:

[0370] When a user selects a noun or keyword on their device, the server collects relevant information from databases and external sources. The collected information can then be summarized according to the user's emotional state.

[0371] Step 6:

[0372] The server optimizes the summary information based on the results of sentiment analysis. For example, if the user is feeling anxious, it may emphasize information that provides reassurance.

[0373] Step 7:

[0374] The server sends the summarized information to the terminal. The terminal displays it clearly to the user to support their understanding. The display method is adjusted as appropriate based on the emotion analysis results.

[0375] Step 8:

[0376] The device visualizes the user's emotional state in real time and displays changes. This allows the user to understand their own emotional changes and respond as needed.

[0377] Step 9:

[0378] If a user requests multilingual translation, the device sends the request to the server. The server translates the text into the selected language, and the translation result is returned to the device and provided to the user.
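The adjustment of the summary according to the estimated emotion (Steps 5 to 7 of this flow) might look like the following. This is a sketch under stated assumptions: the emotion labels, the `Presentation` settings, and the function `adapt_summary` are hypothetical, and the specification does not prescribe concrete presentation rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Presentation:
    lead: str        # sentence placed before the summary points
    max_points: int  # how many summary points to show


# Hypothetical presentation styles keyed by the estimated emotion label.
STYLES: Dict[str, Presentation] = {
    "anxious": Presentation("Here are some reassuring points and past successes:", 2),
    "stressed": Presentation("Here is a short version; take your time:", 1),
}
DEFAULT_STYLE = Presentation("Summary:", 3)


def adapt_summary(points: List[str], emotion: str) -> str:
    """Format summary points according to the user's estimated emotion."""
    style = STYLES.get(emotion, DEFAULT_STYLE)
    shown = points[: style.max_points]
    return "\n".join([style.lead] + [f"- {p}" for p in shown])


if __name__ == "__main__":
    points = ["Budget approved", "Pilot succeeded", "Launch in May"]
    print(adapt_summary(points, "anxious"))
```

A table of styles keeps the emotion-dependent behavior in data, so new emotion labels can be supported without changing the formatting logic.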

[0379] (Example 2)

[0380] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0381] Conventional systems simply convert user speech into text data and present information, lacking the ability to provide flexible information based on the user's emotional state. Furthermore, they lack mechanisms to support decision-making through information presentation that considers the user's emotions, thus failing to contribute to smoother communication.

[0382] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0383] In this invention, the server includes speech recognition means, part-of-speech analysis means, emotion recognition means, information presentation means, and display means. This enables the recognition of emotions from the user's speech and the presentation of optimal information corresponding to those emotions.

[0384] "Speech recognition means" refers to a function that converts voice input acquired by a terminal into text data in real time.

[0385] "Part-of-speech analysis means" refers to a function that performs morphological analysis on acquired character data to identify specific parts of speech.

[0386] "Emotion recognition means" refers to a function that analyzes identified terms and recognizes the emotional state of an utterance.

[0387] "Information presentation means" refers to a function that collects information in accordance with the user's emotions and uses artificial intelligence to generate and present a summary in the most optimal form.

[0388] "Display means" refers to a function that includes an interface for transmitting information to a terminal and providing that information to the user visually or audibly.

[0389] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication with users. The system mainly consists of a server and terminals, with the server functioning as the central hub of the system.

[0390] The server receives audio data from the terminal and converts it into text data using speech recognition. For this purpose, the server can use general-purpose speech recognition software. Specifically, a speech recognition API can be used as the speech recognition engine. The converted text data is then parsed into parts of speech via a morphological analysis tool. This analysis allows the server to identify specific parts of speech and nouns.

[0391] Next, the server uses an emotion recognition engine to analyze the user's emotional state from the obtained data. For example, an emotion analysis API can be used for emotion recognition. This analysis allows the server to prepare information that reflects the user's emotional state.

[0392] Furthermore, the server uses the information presentation means to collect information tailored to the user's emotions and uses an artificial intelligence model to generate an optimal summary. This information is transmitted to the terminal via the display means and provided to the user visually or audibly.

[0393] The terminal functions as a user interface, handling voice input and output. It acquires voice from the user and transmits it to the server. It also presents the processed information to the user as a screen display or audio output, allowing the user to make decisions while monitoring their emotional state in real time.

[0394] For example, if a user says, "I feel anxious about the next project meeting," the server analyzes the user's emotions from that statement and suggests ways to relax and provides helpful success stories. In this way, the system provides personalized information based on the user's emotions.

[0395] An example of a prompt to input into a generative AI model is, "Show how to provide effective information to a user who is feeling anxious while preparing a presentation." Using this prompt, the model can generate appropriate responses for a user's specific emotional state.

[0396] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0397] Step 1:

[0398] The terminal acquires voice input from the user. This voice data is collected in real time by the microphone and sent to the server as raw voice data.

[0399] Step 2:

[0400] The server inputs the received audio data into a speech recognition engine, which converts it into text data. This engine analyzes sound wave information and identifies words and phrases from combinations of phonemes. The output is text data of the user's speech.

[0401] Step 3:

[0402] The server inputs the converted character data into a morphological analysis tool and performs part-of-speech analysis. This analysis identifies grammatical structures and tags parts of speech such as nouns and verbs. The output is text data with part-of-speech information attached.

[0403] Step 4:

[0404] The server inputs the part-of-speech analysis results into an emotion recognition engine to identify the emotional state of the text. This engine evaluates the context of words based on specific emotion indicators and identifies the intensity and type of emotion. The output is structured data indicating the emotional state.

[0405] Step 5:

[0406] The server uses information presentation tools to collect relevant information based on the user's emotional state and identified keywords. Using artificial intelligence, it generates summaries tailored to the user's emotions and extracts information useful to the user. The output is summarized information that matches the user's emotions.

[0407] Step 6:

[0408] The server sends the generated information to the terminal, which then provides it to the user in visual or auditory form. Specific actions include displaying relevant information on the screen or playing audio messages through the speaker. The user can then use this information to make further decisions or take action.
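Step 4 of this flow turns part-of-speech-tagged text into structured data giving the type and intensity of an emotion. A minimal lexicon-based sketch of that idea follows; the lexicon entries, weights, and the function `recognize_emotion` are hypothetical, and an actual emotion recognition engine or emotion analysis API would be considerably more elaborate.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical emotion indicators: word -> (emotion type, weight).
EMOTION_LEXICON: Dict[str, Tuple[str, float]] = {
    "anxious": ("anxiety", 0.8),
    "worried": ("anxiety", 0.6),
    "glad": ("joy", 0.7),
}


def recognize_emotion(tagged_tokens: List[Tuple[str, str]]) -> Dict[str, object]:
    """Score (word, part-of-speech) pairs against the lexicon.

    Returns structured data with the dominant emotion type and an intensity
    capped at 1.0; text without any indicator word is reported as neutral.
    """
    scores: Dict[str, float] = defaultdict(float)
    for word, _pos in tagged_tokens:
        hit = EMOTION_LEXICON.get(word.lower())
        if hit is not None:
            emotion, weight = hit
            scores[emotion] += weight
    if not scores:
        return {"type": "neutral", "intensity": 0.0}
    emotion, total = max(scores.items(), key=lambda item: item[1])
    return {"type": emotion, "intensity": min(1.0, total)}


if __name__ == "__main__":
    tokens = [
        ("I", "pronoun"), ("feel", "verb"), ("anxious", "adjective"),
        ("about", "preposition"), ("the", "determiner"), ("meeting", "noun"),
    ]
    print(recognize_emotion(tokens))
```

The part-of-speech tag is carried through but not used in this sketch; a fuller version could, for instance, weight adjectives differently from nouns.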

[0409] (Application Example 2)

[0410] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0411] In elderly care settings, providing appropriate support in response to the emotions of both the care recipient and the caregiver is crucial. However, it is difficult for care staff to immediately understand the emotions of the care recipient and respond appropriately based on that understanding, highlighting the need for improved communication quality. Furthermore, language barriers pose a problem in communication between caregivers and care recipients who speak diverse languages. A system is needed to address these challenges.

[0412] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0413] In this invention, the server includes a speech recognition means, a morphological analysis means, an information presentation means, a terminology conversion means for converting technical terms and business terms into general terms, a translation means for performing translation between multiple languages, and an emotion analysis and information provision means for determining the emotional state of the user or person being cared for and providing appropriate information or messages based on those emotions. This enables timely and appropriate emotionally sensitive information presentation in care settings and facilitates smooth communication between users who speak diverse languages.

[0414] A "speech recognition means" is a device that has the function of acquiring speech input in real time and converting that speech into text data.

[0415] A "morphological analysis means" is a device that breaks down character data acquired by a speech recognition means and has the function of identifying specific parts of speech.

[0416] An "information presentation means" is a device that has the function of acquiring related information when an identified noun is selected, and summarizing and displaying that information.

[0417] A "terminology conversion means" is a device that has the function of converting specialized or business-related terms into general terms.

[0418] A "translation means" is a device that performs translation between multiple languages and has the function of converting text into a specified language.

[0419] "Emotional analysis and information provision means" refers to a system that has the function of determining the emotional state of a user or person receiving care and providing appropriate information or messages based on those emotions.

[0420] "Machine learning" is a technology that uses data to automatically improve algorithms, and is applied to improve the accuracy of information.

[0421] To implement the present invention, a system centered on speech recognition and emotion analysis can be envisioned operating on a smart device. This system consists of a server and a terminal, and the user provides voice input via the terminal. Voice data is acquired in real time through the terminal's microphone and transmitted to the server. The server converts the voice into text data using speech recognition means.

[0422] The server uses morphological analysis to break down text data into parts of speech and identify specific important words and phrases. Then, using sentiment analysis and information provision tools, it determines the user's emotional state and collects, summarizes, and presents relevant information based on the analysis results. The information is adjusted according to the user's emotions and displayed on the device in a personalized format.

[0423] If multilingual support is required, the text will be translated into the user's specified language using translation tools. The information presented and the translated text are intended to provide support that is sensitive to the feelings of both the care recipient and the caregiver, and are provided on the device as screen display or audio output.

[0424] For example, if a care recipient inputs "I'm feeling a little anxious today...", the system uses emotion analysis to recognize this state and suggests calming music or relaxation techniques. It also advises care staff to "take a moment, hold the care recipient's hand, and encourage slow breathing."

[0425] An example of a prompt might be, "Please suggest effective communication methods to help caregivers alleviate the anxiety of those they care for."

[0426] This allows the server to personalize and deliver appropriate information in real time according to the user's needs, enabling effective, emotion-based support in caregiving settings.

[0427] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0428] Step 1:

[0429] The user inputs voice using the device's microphone. The device captures the voice in real time and generates digital audio data. This audio data is sent to a server for analysis.

[0430] Step 2:

[0431] The server receives the audio data and converts it to text using speech recognition technology. In this step, the input is the audio data, and the output is the converted text data.

[0432] Step 3:

[0433] The server analyzes character data using morphological analysis tools and extracts specific parts of speech, particularly nouns. The input in this step is character data, and the output is a list of analyzed nouns. The analysis is performed using a morphological analysis algorithm.

[0434] Step 4:

[0435] The server uses emotion analysis and information provision means to determine the user's emotions from the voice input. The input in this step is voice data and text data, and the output is emotion data indicating the emotional state. Emotion analysis is performed by an emotion recognition engine.

[0436] Step 5:

[0437] The server collects relevant information based on the emotional state and noun list, and summarizes the information using an information presentation tool. The input in this step is emotional data and a noun list, and the output is the summarized information.

[0438] Step 6:

[0439] The server uses translation tools to translate the text into the user's specified language as needed. The input in this step is summarized information, and the output is the translated information.

[0440] Step 7:

[0441] The server adjusts the information based on emotional data and sends it to the terminal. The terminal presents this information to the user as a screen display or audio output. In this step, the input is the adjusted information, and the output is the information delivered to the user.

[0442] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0443] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58 together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0444] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0445] [Third Embodiment]

[0446] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0447] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0448] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0449] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0450] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0451] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0452] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0453] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0454] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0455] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0456] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0457] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0458] This invention is a system that integrates speech recognition, morphological analysis, information presentation, terminology conversion, and multilingual translation, and aims to facilitate smooth communication.

[0459] Server role:

[0460] The server functions as the central processing unit of this system, first receiving the audio of the meeting transmitted from the terminals. Using speech recognition technology, this audio is converted into highly accurate text information. The converted text information is then broken down by part of speech through a morphological analysis engine, with nouns being identified in particular.

[0461] The server also collects information related to nouns selected by the user from databases and external resources. The collected information is summarized by AI and sent to the terminal. Furthermore, the server has the ability to detect technical and business terms in the text and replace them with general terms.

[0462] The server is equipped with a multilingual translation engine that translates information into the language specified by the user, providing information tailored to the user's needs.

[0463] Terminal role:

[0464] The terminal is a device used by users participating in a meeting. It transmits audio to the server and displays text and summary information received from the server to the user. When a user clicks on a specific noun, additional information related to that noun is displayed on the terminal. The latest converted and translated text information is also reflected on the terminal in real time.

[0465] User actions:

[0466] Users can use their devices during meetings to select specific nouns or technical terms. This makes it easier to understand what is being said and the flow of the discussion through the presentation of useful information and translations. This system is designed to enable even new business professionals and IT novices to participate effectively in meetings.

[0467] Specific example:

[0468] For example, if a user hears the word "project management" during a meeting, clicking on the noun will cause the server to provide a summary of the definition of "project management" and related common practices. Furthermore, the Japanese meeting content can be translated into English and shared with colleagues in overseas branches.

[0469] Applying this invention saves the time otherwise spent looking up technical terms and industry-specific jargon and makes it possible to adapt to diverse language environments, thereby reducing communication barriers.

[0470] The following describes the processing flow.

[0471] Step 1:

[0472] The terminal uses its microphone to capture the voices of meeting participants, converts them to a digital format, and prepares to send them to the server.

[0473] Step 2:

[0474] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This text data is temporarily stored for further processing.

[0475] Step 3:

[0476] The server sends the acquired text data to a morphological analysis engine. It performs part-of-speech analysis, specifically identifying and marking nouns. The results of this processing are used to display information when a noun is clicked.

[0477] Step 4:

[0478] The user operates the device and clicks on a noun that interests them.

[0479] Step 5:

[0480] The server identifies the clicked noun and retrieves information related to that noun from databases and external sources. The retrieved data is then summarized into concise information through a summarization algorithm.

[0481] Step 6:

[0482] The server sends summarized information to the terminal and displays it to the user. This additional information allows the user to deepen their understanding of the meeting content.

[0483] Step 7:

[0484] The server detects technical and business terms and converts them into more general terms. The converted text is then sent to the terminal for the user to review.

[0485] Step 8:

[0486] When a user enters their desired language into their device, the server passes it to a translation engine, which then translates the meeting content into the specified language. The translated text is displayed on the device in real time.
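Steps 2 through 8 above can be sketched as a small server-side pipeline. This is a minimal illustration, not the disclosed implementation: the class name MeetingPipeline, its method names, and the stub engines built by make_demo_pipeline are all hypothetical, and the stubs merely stand in for real speech recognition, morphological analysis, summarization, and translation engines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MeetingPipeline:
    """Server-side steps wired together; every engine is an injected callable."""
    recognize: Callable[[bytes], str]          # Step 2: audio -> text
    extract_nouns: Callable[[str], List[str]]  # Step 3: text -> nouns
    lookup: Callable[[str], str]               # Step 5: noun -> related info
    summarize: Callable[[str], str]            # Step 5: info -> summary
    simplify: Callable[[str], str]             # Step 7: jargon -> plain terms
    translate: Callable[[str, str], str]       # Step 8: (text, language) -> text
    transcript: List[str] = field(default_factory=list)

    def on_audio(self, audio: bytes) -> Dict[str, object]:
        """Transcribe one audio chunk, mark its nouns, and simplify its wording."""
        text = self.recognize(audio)
        self.transcript.append(text)  # kept so it can be translated later
        return {
            "text": text,
            "nouns": self.extract_nouns(text),
            "plain_text": self.simplify(text),
        }

    def on_noun_clicked(self, noun: str) -> str:
        """Steps 4-6: look up the clicked noun and return a short summary."""
        return self.summarize(self.lookup(noun))

    def on_language_selected(self, language: str) -> List[str]:
        """Step 8: translate everything transcribed so far."""
        return [self.translate(line, language) for line in self.transcript]


# Hypothetical demo data; a real system would query databases or external sources.
_DEMO_INFO = {
    "project management": "Planning and tracking work to meet a goal. "
                          "It often uses schedules and budgets.",
}


def _first_sentence(info: str) -> str:
    """Toy summarizer: keep only the first sentence."""
    first = info.split(". ")[0]
    return first if first.endswith(".") else first + "."


def make_demo_pipeline() -> MeetingPipeline:
    """Build a pipeline from trivial stubs so the flow can be exercised offline."""
    return MeetingPipeline(
        recognize=lambda audio: audio.decode("utf-8"),  # pretend bytes are text
        extract_nouns=lambda text: [t for t in ("project management", "KPI") if t in text],
        lookup=lambda noun: _DEMO_INFO.get(noun, "No entry found."),
        summarize=_first_sentence,
        simplify=lambda text: text.replace("KPI", "key performance indicator"),
        translate=lambda text, language: f"[{language}] {text}",
    )
```

Injecting each engine as a callable keeps the flow independent of any particular vendor API, so a stub can later be swapped for a real engine without changing the step order.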

[0487] (Example 1)

[0488] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0489] In meetings and discussions, it is difficult to understand the specialized terminology and industry-specific language used by participants in real time and to ensure smooth communication across multiple languages. Furthermore, there is a need for a means to quickly gather and summarize information and for participants to easily follow the progress of the meeting. Traditional systems often fail to effectively address these challenges.

[0490] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0491] In this invention, the server includes a speech conversion means that acquires speech data in real time and converts that data into text information; a data analysis means that performs morphological analysis on the text information acquired by the speech conversion means to identify parts of speech; and an information provision means that acquires related information when a user selects an identified noun and provides summary information. This generalizes specialized and business terms used during meetings and enables accurate and rapid communication among participants who speak different languages.

[0492] "Speech conversion means" refers to technology for acquiring speech data in real time and converting that speech into text information.

[0493] "Data analysis means" refers to technology for identifying, through morphological analysis, the parts of speech used in the text information obtained by the speech conversion means.

[0494] "Information provision means" refers to technology that retrieves relevant information based on a noun selected by the user, summarizes it, and provides it to the user.

[0495] "Data conversion means" refers to technologies that convert technical or business terms into general terms, making the content easier for users to understand.

[0496] "Translation processing means" refers to technology that accurately translates text information between multiple languages, enabling mutual understanding between speakers of different languages.

[0497] "Language conversion means" refers to technology for translating text information into a language specified by the user.

[0498] A "generative model" is an algorithm that uses artificial intelligence technology to summarize information and generate new text.

[0499] This embodiment of the invention is a system for supporting smooth communication in meetings and discussions. The configuration and operation of this system will be described in detail below.

[0500] The server functions as the central processing unit of this system. The server first receives audio data from the terminals. To convert the audio data into text in real time, the server utilizes a common speech recognition API. This API has the capability to rapidly convert audio data into text data.

[0501] The recognized text information is broken down into parts of speech using a morphological analysis engine. This engine identifies keywords and nouns, and the server uses them to evaluate relevance. The server also collects relevant information from external resources and databases, and the information provision means generates summary information using an artificial intelligence model.

[0502] The terminal is a user-operated device that captures meeting audio and sends it to the server. The terminal also displays to the user the converted text and summary information returned in the server's response. This allows the user to obtain relevant information directly by selecting identified nouns. Additionally, this information is translated into multiple languages using a multilingual translation API, enabling smooth communication across language barriers even in international meetings.

[0503] Users can understand important technical and business terms in a discussion in real time based on the information displayed on their device. This system can provide additional information using prompts based on a generative AI model for specific words. For example, a prompt such as, "Please provide a summary of the general definition and related techniques for the noun 'project management' used during the meeting," can be used.
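The prompt quoted above could be assembled from the selected noun along the following lines. This is a sketch only: the function name build_term_prompt and the length limit are assumptions made for illustration, and the resulting string would still need to be passed to whichever generative AI model the server uses.

```python
def build_term_prompt(noun: str, max_length: int = 80) -> str:
    """Fill the selected noun into the prompt template quoted in the text.

    max_length is a hypothetical guard against pasting very long selections
    into the prompt; the source does not specify any limit.
    """
    cleaned = " ".join(noun.split())  # collapse stray whitespace and newlines
    if not cleaned:
        raise ValueError("noun must not be empty")
    cleaned = cleaned[:max_length]
    return (
        "Please provide a summary of the general definition and related "
        f"techniques for the noun '{cleaned}' used during the meeting."
    )
```

For example, build_term_prompt("project management") reproduces the prompt given in the text.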

[0504] Thus, this system, starting with speech recognition and continuing through a series of processes including multilingual translation, provides an environment where participants can more efficiently and accurately grasp the flow of conversation.

[0505] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0506] Step 1:

[0507] The terminal captures audio data via the microphone during the meeting. This input audio data is streamed to the server in real time via a secure communication protocol. The terminal appropriately compresses this audio data and packets it to improve data transfer efficiency.

[0508] Step 2:

[0509] The server inputs the audio data received from the terminal into a speech recognition API. This API converts the audio into text data. Specifically, it analyzes the language from the audio waveform and outputs it as text information. Through this process, the audio input is converted into clean text data.

[0510] Step 3:

[0511] The server passes the generated text data to the morphological analysis engine. The morphological analysis engine divides the input text into words and identifies the part of speech of each word. The output is a list of identified nouns in the text. This list is used for subsequent information extraction.

[0512] Step 4:

[0513] The user selects a noun displayed on the device. Based on the selected noun, the server retrieves relevant information from an external source or database. This information gathering is performed via an API, and the information is returned to the server. The output of this process is detailed information related to the selected noun.

[0514] Step 5:

[0515] The server summarizes the acquired detailed information using a generative AI model. This generative AI model quickly summarizes large amounts of text information, extracting key points and organizing them concisely. This summarized information is output and used to provide information to users.

[0516] Step 6:

[0517] The server applies a data transformation algorithm to replace technical terms with more general ones. This process converts complex words and phrases into simpler terms to aid user understanding. The output is revised text information.

[0518] Step 7:

[0519] The server passes text information to a multilingual translation API as needed, translating it into the specified language. The translated text is processed on the server and immediately sent to the terminal. The terminal displays this translated information to the user, facilitating communication in a multilingual environment. The user can then use prompts on the terminal to request additional information.
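Step 1 of this flow mentions compressing the audio and splitting it into packets before streaming. A minimal sketch of that idea follows. It uses the standard-library zlib module as a stand-in for an audio codec, which is an assumption: the source does not name a compression method, and a real system would more likely use a dedicated speech codec. The function names packetize and reassemble and the payload size are likewise hypothetical.

```python
import zlib
from typing import Iterable, Iterator, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


def packetize(audio: bytes, payload_size: int = 1024) -> Iterator[Packet]:
    """Compress the audio and yield numbered chunks of at most payload_size bytes."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    compressed = zlib.compress(audio)
    for seq, start in enumerate(range(0, len(compressed), payload_size)):
        yield seq, compressed[start:start + payload_size]


def reassemble(packets: Iterable[Packet]) -> bytes:
    """Server side: reorder packets by sequence number and decompress."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return zlib.decompress(b"".join(payload for _, payload in ordered))
```

The sequence numbers let the server restore the original order if packets arrive out of order. Transport security (for example TLS) is outside the scope of this sketch.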

[0520] (Application Example 1)

[0521] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0522] In communication settings, there is a need for information exchange among team members who speak different languages and for facilitating conversations in fields with a lot of specialized terminology. In particular, in on-site security operations, rapid and reliable information transmission is crucial, but language and jargon barriers often hinder communication. Furthermore, there is a lack of support for effectively integrating audio and visual information acquired on-site to correctly understand the situation. Technologies are needed to address these challenges.

[0523] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0524] In this invention, the server includes: a speech recognition means for acquiring voice input in real time and converting the voice into text data; a part-of-speech parsing means for performing morphological analysis on the text data acquired by the speech recognition means and identifying specific parts of speech; an information presentation means for acquiring related information and displaying a summary when a user selects an identified noun; a term conversion means for converting technical terms and business terms into general vocabulary; a translation means for performing translation between multiple languages; an information distribution means for providing translated text information to on-site staff via an information display device; and a situation analysis means for linking visual information acquired from a monitoring device with voice reports to identify abnormal events. This enables rapid information sharing and accurate understanding of situations among staff who speak different languages.

[0525] "Voice input" refers to acquiring acoustic signals in real time and using that information as data for processing.

[0526] "Text data" refers to text information obtained by converting speech, and it can be processed and displayed electronically.

[0527] "Speech recognition means" refers to a device or system for receiving speech information and converting it into text data.

[0528] "Part-of-speech parsing means" refers to a device or system for performing morphological analysis on text data and identifying specific parts of speech.

[0529] An "information presentation means" is a device or system for acquiring knowledge related to a noun selected by the user, summarizing it, and displaying it.

[0530] "Term conversion means" refers to a device or system for replacing technical terms or business jargon with general vocabulary.

[0531] "Translation means" refers to a device or system for converting text information between languages.

[0532] "Information distribution means" refers to a device or system for transmitting acquired information to an information display device and presenting it to staff.

[0533] A "situation analysis means" is a device or system that integrates visual and auditory information to identify abnormal events.

[0534] This invention is designed as an information gathering and presentation system for use by on-site security staff. The server acquires voice input in real time, with the acoustic signals captured by the microphones of smartphones or information presentation devices. The Google Speech-to-Text API or Amazon Transcribe is used for speech recognition, and the speech recognition means converts the voice into text data.

[0535] Next, the text data is preprocessed for morphological analysis. Here, parts of speech are analyzed using a morphological analysis engine such as MeCab, and particularly important nouns are extracted. When the user selects a specific noun, the server retrieves relevant information from a database or external resources and summarizes that information using a generative AI model. In this process, OpenAI's GPT model or similar might be used.
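The noun extraction described above can be illustrated by parsing MeCab's default text output, in which each line holds a surface form, a tab, and a comma-separated feature list, and the sentence ends with an EOS line. The sketch below assumes an IPAdic-style dictionary, where the first feature field is the part of speech and nouns are tagged 名詞. The sample output is hand-written for illustration rather than captured from a real run, and the function name extract_nouns is hypothetical. With the mecab-python3 package installed, a string in this format would typically come from MeCab.Tagger().parse(text).

```python
from typing import List

# Hand-written sample in MeCab's default output format (IPAdic-style features).
SAMPLE_MECAB_OUTPUT = (
    "会議\t名詞,サ変接続,*,*,*,*,会議,カイギ,カイギ\n"
    "で\t助詞,格助詞,一般,*,*,*,で,デ,デ\n"
    "プロジェクト\t名詞,一般,*,*,*,*,プロジェクト,プロジェクト,プロジェクト\n"
    "管理\t名詞,サ変接続,*,*,*,*,管理,カンリ,カンリ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "説明\t名詞,サ変接続,*,*,*,*,説明,セツメイ,セツメイ\n"
    "する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル\n"
    "EOS\n"
)


def extract_nouns(mecab_output: str, noun_tag: str = "名詞") -> List[str]:
    """Return the unique nouns in order of first appearance."""
    nouns: List[str] = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, features = line.split("\t", 1)
        part_of_speech = features.split(",", 1)[0]
        if part_of_speech == noun_tag and surface not in nouns:
            nouns.append(surface)
    return nouns
```

The resulting list is what the terminal would present as selectable nouns. Other dictionaries may lay out the feature fields differently, so the tag check would need adjusting.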

[0536] Subsequently, the extracted technical and business terms are converted into general vocabulary by the term conversion means and displayed in a format that is easy for the user to understand. Text translation between multiple languages is performed using translation tools such as the Google Translate API or the DeepL API.
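One simple way to realize the conversion of technical and business terms into general vocabulary is a glossary lookup. The sketch below is an assumption about how such a conversion might work, not the disclosed method: the function name convert_terms and the sample glossary are hypothetical, and a production system might instead delegate the rewording to a generative AI model.

```python
import re
from typing import Dict


def convert_terms(text: str, glossary: Dict[str, str]) -> str:
    """Replace each glossary term in the text with its plain-language equivalent.

    Matching is case-insensitive and limited to whole words, and longer terms
    are tried first so that a multi-word entry wins over a shorter one.
    """
    if not glossary:
        return text
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    plain = {term.lower(): replacement for term, replacement in glossary.items()}
    return pattern.sub(lambda match: plain[match.group(0).lower()], text)


# Hypothetical glossary used only for illustration.
SAMPLE_GLOSSARY = {
    "KPI": "key performance indicator",
    "MVP": "first usable version",
    "burn rate": "monthly spending",
}
```

Whole-word matching avoids rewriting fragments of unrelated words, at the cost of missing inflected forms such as plurals.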

[0537] The translated information is provided to on-site staff in real time via information display devices through information distribution means. Smart glasses or mobile devices may be used for this information display. Situation analysis means are also used to integrate visual information obtained from monitoring devices with audio reports. This makes it possible to quickly identify abnormal events and take appropriate countermeasures.

[0538] For example, if a security staff member verbally reports a suspicious vehicle, a voice recognition system converts the report into text data, extracting the key phrase "suspicious vehicle." The server automatically summarizes the information, translates it into multiple languages, and immediately shares it with the entire team.

[0539] An example of a prompt for a generative AI model might be: "Summarize the latest reports related to the keyword 'suspicious vehicle' and generate a multilingual description. Translate it into English and add any relevant information."

[0540] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0541] Step 1:

[0542] The terminal acquires voice input from on-site staff in real time. This input voice is collected through the microphone of the information display device.

[0543] Step 2:

[0544] The server converts the acquired speech input into text data using the Google Speech-to-Text API or Amazon Transcribe. This conversion process transforms the data from speech to text.

[0545] Step 3:

[0546] The server performs morphological analysis on the obtained text data using the MeCab library and analyzes the parts of speech. Through this analysis, nouns and important phrases are extracted. The input is text data, and the output is a list of nouns.

[0547] Step 4:

[0548] The user selects a noun of interest from the displayed list. This selection action triggers the next information retrieval process.

[0549] Step 5:

[0550] The server retrieves information related to the selected noun from databases and external resources. This information is then embedded in a prompt and summarized using a generative AI model. The output of this step is the summarized information.

[0551] Step 6:

[0552] The server uses the term conversion means to convert technical terms into general vocabulary and arranges them in a format that is easily understood by a wider range of users. The input is summary information, and the output is the converted information.

[0553] Step 7:

[0554] The server uses the DeepL API or Google Translate API to perform translations and generate multilingual information. This makes it possible to communicate information to users who speak different languages.

[0555] Step 8:

[0556] The server transmits the translated information to the terminal at the site via an information distribution system. The terminal then presents the information to the user using an information display device. This allows for immediate feedback.

[0557] Step 9:

[0558] The server integrates visual and audio data from monitoring devices and identifies abnormal events using the situation analysis means. Based on the integrated data, it creates a prompt for a generative AI model, which generates specific action guidelines.
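The linking of visual information with voice reports in Step 9 could, under simple assumptions, be done by matching camera detections and reports that share a zone, fall within a short time window, and mention the same object. Everything in the sketch below is hypothetical: the data classes, the keyword list, the 60-second window, and the prompt wording are illustrative choices, not details taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class CameraDetection:
    timestamp: float  # seconds since the start of the shift
    label: str        # object class reported by the monitoring device
    zone: str


@dataclass(frozen=True)
class VoiceReport:
    timestamp: float
    text: str         # transcript of the staff member's report
    zone: str


SUSPICIOUS_WORDS = ("suspicious", "unattended", "intruder")  # assumed keyword list


def find_abnormal_events(
    detections: Sequence[CameraDetection],
    reports: Sequence[VoiceReport],
    window_s: float = 60.0,
) -> List[Dict[str, object]]:
    """Pair each alarming voice report with camera detections that corroborate it."""
    events: List[Dict[str, object]] = []
    for report in reports:
        lowered = report.text.lower()
        if not any(word in lowered for word in SUSPICIOUS_WORDS):
            continue  # routine report, nothing to corroborate
        for detection in detections:
            same_zone = detection.zone == report.zone
            close_in_time = abs(detection.timestamp - report.timestamp) <= window_s
            mentioned = detection.label.lower() in lowered
            if same_zone and close_in_time and mentioned:
                events.append({
                    "zone": report.zone,
                    "label": detection.label,
                    "report": report.text,
                    "detected_at": detection.timestamp,
                })
    return events


def build_guidance_prompt(event: Dict[str, object]) -> str:
    """Turn one corroborated event into a prompt asking for action guidelines."""
    return (
        f"A {event['label']} was detected in zone {event['zone']} and staff "
        f"reported: \"{event['report']}\". Suggest specific actions for the team."
    )
```

Requiring agreement between the two sources is one way to reduce false alarms from either source alone. A deployed system would likely replace the keyword test with a learned classifier.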

[0559] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0560] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication. The system includes functions for speech recognition, morphological analysis, information presentation, terminology conversion, multilingual translation, and emotion recognition.

[0561] 1. Overall structure:

[0562] The server functions as the central hub of the system, processing voice data, analyzing emotion data, and presenting and translating information. The terminal provides the interface with the user, collecting voice input from the user, sending it to the server, and displaying the processed information.

[0563] 2. Server processing:

[0564] The server converts the audio transmitted from the terminal into text data using a speech recognition engine. This text data is then broken down by morphological analysis to identify nouns and important words. Furthermore, the server can use an emotion engine to analyze the emotions contained in the audio data. The analysis results are used as data that provides a detailed indication of the user's emotional state.

[0565] The server adjusts information presentation based on the user's emotional state, collecting, summarizing, and displaying information related to the noun selected by the user. This information is presented in a way that aligns with the user's emotions, enabling a more personalized experience.

[0566] 3. Role of the terminal:

[0567] The terminal is a device used by the user and primarily manages voice input and output. The terminal acquires voice and sends it to the server, and displays the server's response to the user. It also has an interface for visualizing the user's emotional state, allowing changes to be monitored in real time.

[0568] 4. User interaction and experience:

[0569] Users input voice commands into their devices and receive information based on the analyzed results. For example, if a user says "project management," the server uses an emotion engine to analyze the emotions expressed in the statement and presents information in the most appropriate format. Furthermore, if the system determines that the user is feeling stressed, it can suggest information and methods to help them relax.

[0570] Specific example:

[0571] When a user mentions "the introduction of a new product" during a meeting, the server analyzes the emotions behind the statement and, if it determines that the user is feeling anxious, provides reassuring information or examples of past success stories.

[0572] This system provides an environment that enables users to make better decisions by delivering information with consideration for their emotions. Furthermore, the system continuously learns and improves the relevance and accuracy of the information presented.

[0573] The following describes the processing flow.

[0574] Step 1:

[0575] The device uses a microphone to capture the user's voice. The captured voice is converted into a digital format and sent to the server.

[0576] Step 2:

[0577] The server inputs the received audio data into the speech recognition engine and converts it into corresponding text data. This text data is then ready for further processing within the system.

[0578] Step 3:

[0579] The server passes the text data to a morphological analysis engine, which breaks down words into parts of speech. Nouns and keywords, in particular, are identified and marked so that their relevant information can be presented to the user later.

[0580] Step 4:

[0581] The server uses an emotion engine to analyze the user's emotions in the voice data. The emotion recognition results are used as data indicating the user's state.

[0582] Step 5:

[0583] When a user selects a noun or keyword on their device, the server collects relevant information from databases and external sources. The collected information can then be summarized according to the user's emotional state.

[0584] Step 6:

[0585] The server optimizes the summary information based on the results of the emotion analysis. For example, if the user is feeling anxious, it may emphasize information that provides reassurance.

[0586] Step 7:

[0587] The server sends the summarized information to the terminal. The terminal displays it clearly to the user to support their understanding. The display method is adjusted as appropriate based on the emotion analysis results.

[0588] Step 8:

[0589] The device visualizes the user's emotional state in real time and displays changes. This allows the user to understand their own emotional changes and respond as needed.

[0590] Step 9:

[0591] If a user requests multilingual translation, the device sends the request to the server. The server translates the text into the selected language, and the translation result is returned to the device and provided to the user.
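Steps 6 and 7 of this flow, in which the summary is adjusted to the user's emotional state, might look like the following. This is a sketch under stated assumptions: the emotion labels "anxious" and "stressed" echo the examples in the text, but the Presentation type, the tone names, and the specific adjustments are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Presentation:
    tone: str                # hint the terminal can use when rendering
    lines: Tuple[str, ...]   # text shown or spoken to the user, in order


def adapt_summary(summary: str, emotion: str, success_story: str = "") -> Presentation:
    """Choose what to show, and in what order, based on the estimated emotion."""
    if emotion == "anxious" and success_story:
        # Lead with reassurance, then give the factual summary.
        return Presentation(tone="reassuring", lines=(success_story, summary))
    if emotion == "stressed":
        # Keep it short: first sentence only, plus a gentle suggestion.
        first_sentence = summary.split(". ")[0].rstrip(".") + "."
        return Presentation(
            tone="brief",
            lines=(first_sentence, "Consider taking a short break."),
        )
    return Presentation(tone="neutral", lines=(summary,))
```

Keeping the adjustment rules in one small function makes it easy to review how each emotional state changes what the user sees.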

[0592] (Example 2)

[0593] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0594] Conventional systems simply convert user speech into text data and present information, lacking the ability to provide flexible information based on the user's emotional state. Furthermore, they lack mechanisms to support decision-making through information presentation that considers the user's emotions, thus failing to contribute to smoother communication.

[0595] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0596] In this invention, the server includes speech recognition means, part-of-speech analysis means, emotion recognition means, information presentation means, and display means. This enables the recognition of emotions from the user's speech and the presentation of optimal information corresponding to those emotions.

[0597] "Speech recognition means" refers to a function that converts voice input acquired by a terminal into text data in real time.

[0598] "Part-of-speech analysis means" refers to a function that performs morphological analysis on the acquired text data to identify specific parts of speech.

[0599] "Emotion recognition means" refers to a function that analyzes identified terms and recognizes the emotional state of an utterance.

[0600] "Information presentation means" refers to a function that collects information in accordance with the user's emotions and uses artificial intelligence to generate and present a summary in the most optimal form.

[0601] "Display means" refers to a function that includes an interface for transmitting information to a terminal and providing that information to the user visually or audibly.

[0602] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication with users. The system mainly consists of a server and terminals, with the server functioning as the central hub of the system.

[0603] The server receives audio data from the terminal and converts it into text data using speech recognition. For this purpose, the server can use general-purpose speech recognition software. Specifically, a speech recognition API can be used as the speech recognition engine. The converted text data is then parsed into parts of speech via a morphological analysis tool. This analysis allows the server to identify specific parts of speech and nouns.

[0604] Next, the server uses an emotion recognition engine to analyze the user's emotional state from the obtained data. For example, an emotion analysis API can be used for emotion recognition. This analysis allows the server to prepare information that reflects the user's emotional state.
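As a toy stand-in for the emotion analysis API mentioned above, the sketch below scores an utterance against a small keyword lexicon and returns structured data with an emotion type and an intensity. The lexicon, the labels, and the intensity scale (saturating at two keyword hits) are assumptions made purely for illustration. A real emotion recognition engine would use a trained model and could also draw on acoustic features of the voice.

```python
import re
from typing import Dict, Union

# Hypothetical keyword lexicon; real engines learn these associations from data.
EMOTION_LEXICON = {
    "anxious": {"anxious", "worried", "nervous", "afraid"},
    "stressed": {"stressed", "overwhelmed", "exhausted"},
    "positive": {"glad", "excited", "confident"},
}


def recognize_emotion(text: str) -> Dict[str, Union[str, float]]:
    """Return {"type": label, "intensity": 0.0-1.0} for one utterance."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {
        label: sum(word in vocabulary for word in words)
        for label, vocabulary in EMOTION_LEXICON.items()
    }
    label, hits = max(counts.items(), key=lambda item: item[1])
    if hits == 0:
        return {"type": "neutral", "intensity": 0.0}
    # Assumed scale: one hit gives 0.5, two or more hits give 1.0.
    return {"type": label, "intensity": min(1.0, hits / 2)}
```

The structured result can then be used by the server when deciding how to present information to the user.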

[0605] Furthermore, the server uses the information presentation means to collect information tailored to the user's emotions and uses an artificial intelligence model to generate an optimal summary. This information is transmitted to the terminal via the display means and provided to the user visually or audibly.

[0606] The terminal functions as a user interface, handling voice input and output. It acquires voice from the user and transmits it to the server. It also presents the processed information to the user visually or audibly, allowing the user to make decisions while monitoring their emotional state in real time.

[0607] For example, if a user says, "I feel anxious about the next project meeting," the server analyzes the user's emotions from that statement, suggests ways to relax, and provides helpful success stories. In this way, the system provides personalized information based on the user's emotions.

[0608] An example of a prompt to input into a generative AI model is, "Show how to provide effective information to a user who is feeling anxious while preparing a presentation." Using this prompt, the model can generate appropriate responses for a user's specific emotional state.

[0609] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0610] Step 1:

[0611] The terminal acquires voice input from the user. This voice data is collected in real time by the microphone and sent to the server as raw voice data.

[0612] Step 2:

[0613] The server inputs the received audio data into a speech recognition engine, which converts it into text data. This engine analyzes sound wave information and identifies words and phrases from combinations of phonemes. The output is text data of the user's speech.

[0614] Step 3:

[0615] The server inputs the converted character data into a morphological analysis tool and performs part-of-speech analysis. This analysis identifies grammatical structures and tags parts of speech such as nouns and verbs. The output is text data with part-of-speech information attached.

[0616] Step 4:

[0617] The server inputs the part-of-speech analysis results into an emotion recognition engine to identify the emotional state of the text. This engine evaluates the context of words based on specific emotion indicators and identifies the intensity and type of emotion. The output is structured data indicating the emotional state.

[0618] Step 5:

[0619] The server uses information presentation tools to collect relevant information based on the user's emotional state and identified keywords. Using artificial intelligence, it generates summaries tailored to the user's emotions and extracts information useful to the user. The output is summarized information that matches the user's emotions.

[0620] Step 6:

[0621] The server sends the generated information to the terminal, which then provides it to the user in visual or auditory form. Specific actions include displaying relevant information on the screen or playing audio messages through the speaker. The user can then use this information to make further decisions or take action.
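The emotion recognition in Step 4 and the response selection in Step 5 can be sketched as follows. This is a minimal illustration, not the engine described in the specification: the lexicon `EMOTION_LEXICON`, the `EmotionState` structure, and the functions `recognize_emotion` and `choose_response` are all hypothetical names, and a real emotion recognition engine would be far more elaborate than a word list.

```python
from dataclasses import dataclass

# Hypothetical emotion indicators: word -> (emotion type, weight).
# The specification does not define the indicators; these are illustrative.
EMOTION_LEXICON = {
    "anxious": ("anxiety", 0.8),
    "worried": ("anxiety", 0.6),
    "nervous": ("anxiety", 0.5),
    "happy": ("joy", 0.7),
    "glad": ("joy", 0.5),
}


@dataclass
class EmotionState:
    kind: str         # dominant emotion type, or "neutral"
    intensity: float  # 0.0 to 1.0


def recognize_emotion(tokens: list[str]) -> EmotionState:
    """Sum indicator weights per emotion type and return the strongest type."""
    totals: dict[str, float] = {}
    for token in tokens:
        entry = EMOTION_LEXICON.get(token.lower())
        if entry is not None:
            kind, weight = entry
            totals[kind] = totals.get(kind, 0.0) + weight
    if not totals:
        return EmotionState("neutral", 0.0)
    kind = max(totals, key=lambda k: totals[k])
    return EmotionState(kind, min(totals[kind], 1.0))


def choose_response(state: EmotionState) -> str:
    """Pick the kind of information to collect for the detected emotion."""
    if state.kind == "anxiety":
        return "relaxation techniques and success stories"
    if state.kind == "joy":
        return "next steps and related opportunities"
    return "general background information"


state = recognize_emotion("I feel anxious about the next project meeting".split())
print(state, "->", choose_response(state))
```

The structured `EmotionState` output corresponds to the "structured data indicating the emotional state" in Step 4; the topic string returned by `choose_response` would then drive information collection in Step 5.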

[0622] (Application Example 2)

[0623] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0624] In elderly care settings, providing appropriate support in response to the emotions of both the care recipient and the caregiver is crucial. However, it is difficult for care staff to immediately understand the emotions of the care recipient and respond appropriately based on that understanding, highlighting the need for improved communication quality. Furthermore, language barriers pose a problem in communication between caregivers and care recipients who speak diverse languages. A system is needed to address these challenges.

[0625] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0626] In this invention, the server includes a speech recognition means, a morphological analysis means, an information presentation means, a terminology conversion means for converting technical terms and business terms into general terms, a translation means for performing translation between multiple languages, and an emotion analysis and information provision means for determining the emotional state of the user or person being cared for and providing appropriate information or messages based on those emotions. This enables timely and appropriate emotionally sensitive information presentation in care settings and facilitates smooth communication between users who speak diverse languages.

[0627] A "speech recognition means" is a device that has the function of acquiring speech input in real time and converting that speech into text data.

[0628] A "morphological analysis means" is a device that breaks down character data acquired by a speech recognition means and has the function of identifying specific parts of speech.

[0629] An "information presentation means" is a device that has the function of acquiring related information when an identified noun is selected, and summarizing and displaying that information.

[0630] A "terminology conversion means" is a device that has the function of converting specialized or business-related terms into general terms.

[0631] A "translation means" is a device that performs translation between multiple languages and has the function of converting text into a specified language.

[0632] "Emotional analysis and information provision means" refers to a system that has the function of determining the emotional state of a user or person receiving care and providing appropriate information or messages based on those emotions.

[0633] "Machine learning" is a technology that uses data to automatically improve algorithms, and is applied to improve the accuracy of information.

[0634] To implement the present invention, a system centered on speech recognition and emotion analysis can be envisioned operating on a smart device. This system consists of a server and a terminal, and the user provides voice input via the terminal. Voice data is acquired in real time through the terminal's microphone and transmitted to the server. The server converts the voice into text data using speech recognition means.

[0635] The server uses morphological analysis to break down text data into parts of speech and identify specific important words and phrases. Then, using the emotion analysis and information provision means, it determines the user's emotional state and collects, summarizes, and presents relevant information based on the analysis results. The information is adjusted according to the user's emotions and displayed on the device in a personalized format.

[0636] If multilingual support is required, the text will be translated into the user's specified language using translation tools. The information presented and the translated text are intended to provide support that is sensitive to the feelings of both the care recipient and the caregiver, and are provided on the device as screen display or audio output.

[0637] For example, if a care recipient inputs "I'm feeling a little anxious today...", the system uses emotion analysis to recognize this state and suggests calming music or relaxation techniques. It also advises care staff to "take a moment, hold the care recipient's hand, and encourage slow breathing."

[0638] An example of a prompt might be, "Please suggest effective communication methods to help caregivers alleviate the anxiety of those they care for."

[0639] This allows the server to personalize and deliver appropriate information in real time according to the user's needs, enabling effective, emotion-based support in caregiving settings.

[0640] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0641] Step 1:

[0642] The user inputs voice using the device's microphone. The device captures the voice in real time and generates digital audio data. This audio data is sent to a server for analysis.

[0643] Step 2:

[0644] The server receives the audio data and converts it to text using speech recognition technology. In this step, the input is the audio data, and the output is the converted text data.

[0645] Step 3:

[0646] The server analyzes character data using morphological analysis tools and extracts specific parts of speech, particularly nouns. The input in this step is character data, and the output is a list of analyzed nouns. The analysis is performed using a morphological analysis algorithm.

[0647] Step 4:

[0648] The server uses emotion analysis and information provision means to determine the user's emotions from the voice input. The input in this step is voice data and text data, and the output is emotion data indicating the emotional state. Emotion analysis is performed by an emotion recognition engine.

[0649] Step 5:

[0650] The server collects relevant information based on the emotional state and noun list, and summarizes the information using an information presentation tool. The input in this step is emotional data and a noun list, and the output is the summarized information.

[0651] Step 6:

[0652] The server uses translation tools to translate the text into the user's specified language as needed. The input in this step is summarized information, and the output is the translated information.

[0653] Step 7:

[0654] The server adjusts the information based on emotional data and sends it to the terminal. The terminal presents this information to the user as a screen display or audio output. In this step, the input is the adjusted information, and the output is the information delivered to the user.
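Steps 6 and 7 above (optional translation, then emotion-based adjustment before delivery) can be sketched as follows. This is only an illustration under stated assumptions: `fake_translate` stands in for a real translation engine, the `Delivery` structure is hypothetical, and the rule "anxious users receive slower audio output" is an invented example of an adjustment, not something the specification prescribes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical translator signature: (text, target_language) -> translated text.
Translator = Callable[[str, str], str]


def fake_translate(text: str, target_language: str) -> str:
    """Stand-in for a real translation engine; it only tags the text."""
    return f"[{target_language}] {text}"


@dataclass
class Delivery:
    text: str
    language: str
    channel: str        # "screen" or "audio"
    speech_rate: float  # 1.0 is normal speed; only meaningful for audio


def prepare_delivery(
    summary: str,
    emotion: str,
    source_language: str,
    target_language: Optional[str],
    translate: Translator,
) -> Delivery:
    # Step 6: translate only when a different language was requested.
    text, language = summary, source_language
    if target_language and target_language != source_language:
        text = translate(summary, target_language)
        language = target_language
    # Step 7: adjust the presentation to the emotional state (assumed rule).
    if emotion == "anxiety":
        return Delivery(text, language, channel="audio", speech_rate=0.8)
    return Delivery(text, language, channel="screen", speech_rate=1.0)


print(prepare_delivery("Try slow breathing.", "anxiety", "en", "ja", fake_translate))
```

Passing the translator in as a parameter keeps the step order explicit and lets a real engine replace the stub without changing the surrounding logic.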

[0655] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0656] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0657] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0658] [Fourth Embodiment]

[0659] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0660] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0661] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0662] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0663] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0664] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0665] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0666] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
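As a rough illustration of how an emotion could be translated into targets for the eye LEDs and limb motors, consider the sketch below. Everything in it is assumed for the example: the `Expression` structure, the values in `EXPRESSIONS`, and the arm angle limits are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    eye_led_on: bool
    eye_led_blink_hz: float  # 0.0 means a steady light
    arm_angle_deg: int       # target angle for the arm motors


# Hypothetical mapping; the specification does not define concrete values.
EXPRESSIONS = {
    "joy": Expression(True, 2.0, 60),
    "sadness": Expression(True, 0.0, -20),
    "neutral": Expression(True, 0.0, 0),
}

ARM_MIN_DEG, ARM_MAX_DEG = -30, 90  # assumed mechanical limits


def plan_expression(emotion: str) -> Expression:
    """Return actuator targets for an emotion, falling back to neutral."""
    target = EXPRESSIONS.get(emotion, EXPRESSIONS["neutral"])
    # Clamp the motor target so a bad table entry cannot exceed the limits.
    clamped = max(ARM_MIN_DEG, min(ARM_MAX_DEG, target.arm_angle_deg))
    return Expression(target.eye_led_on, target.eye_led_blink_hz, clamped)


print(plan_expression("joy"))
```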

[0667] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0668] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0669] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0670] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0671] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0672] This invention is a system that integrates speech recognition, morphological analysis, information presentation, terminology conversion, and multilingual translation, and aims to facilitate smooth communication.

[0673] Server role:

[0674] The server functions as the central processing unit of this system, first receiving the audio of the meeting transmitted from the terminals. Using speech recognition technology, this audio is converted into highly accurate text information. The converted text information is then broken down by part of speech through a morphological analysis engine, with nouns being identified in particular.

[0675] The server also collects information related to nouns selected by the user from databases and external resources. The collected information is summarized by AI and sent to the terminal. Furthermore, the server has the ability to detect technical and business terms in the text and replace them with general terms.

[0676] The server is equipped with a multilingual translation engine that translates information into the language specified by the user, providing information tailored to the user's needs.

[0677] Terminal role:

[0678] The terminal is a device used by users participating in a meeting. It transmits audio to the server and displays text and summary information received from the server to the user. When a user clicks on a specific noun, additional information related to that noun is displayed on the terminal. The latest converted and translated text information is also reflected on the terminal in real time.

[0679] User actions:

[0680] Users can use their devices during meetings to select specific nouns or technical terms. This makes it easier to understand what is being said and the flow of the discussion through the presentation of useful information and translations. This system is designed to enable even new business professionals and IT novices to participate effectively in meetings.

[0681] Specific example:

[0682] For example, if a user hears the term "project management" during a meeting, clicking on that noun causes the server to provide a summary of the definition of "project management" and related common practices. Furthermore, the Japanese meeting content can be translated into English and shared with colleagues in overseas branches.

[0683] By applying this invention, it is possible to reduce the time needed to understand technical terms and industry-specific jargon, and to adapt to diverse language environments, thereby reducing communication barriers.

[0684] The following describes the processing flow.

[0685] Step 1:

[0686] The terminal uses its microphone to capture the voices of meeting participants, converts them to a digital format, and prepares to send them to the server.

[0687] Step 2:

[0688] The server inputs the received audio data into the speech recognition engine, which converts the audio into text data. This text data is temporarily stored for further processing.

[0689] Step 3:

[0690] The server sends the acquired text data to a morphological analysis engine. It performs part-of-speech analysis, specifically identifying and marking nouns. The results of this processing are used to display information when a noun is clicked.

[0691] Step 4:

[0692] The user operates the device and clicks on a noun that interests them.

[0693] Step 5:

[0694] The server identifies the clicked noun and retrieves information related to that noun from databases and external sources. The retrieved data is then summarized into concise information through a summarization algorithm.

[0695] Step 6:

[0696] The server sends the summarized information to the terminal, which displays it to the user. This additional information allows the user to deepen their understanding of the meeting content.

[0697] Step 7:

[0698] The server detects technical and business terms and converts them into more general terms. The converted text is then sent to the terminal for the user to review.

[0699] Step 8:

[0700] When a user enters their desired language into their device, the server passes it to a translation engine, which then translates the meeting content into the specified language. The translated text is displayed on the device in real time.
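The noun marking and click handling in Steps 3 to 6 can be sketched as follows. This is a simplified illustration: the part-of-speech tags are supplied by hand rather than by a morphological analysis engine, `KNOWLEDGE_BASE` is a hypothetical in-memory stand-in for the databases and external sources, and `summarize` simply keeps the first sentence instead of calling a summarization algorithm.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the databases and external sources.
KNOWLEDGE_BASE = {
    "project management": (
        "Project management is the practice of planning and guiding work "
        "to meet goals within constraints. It commonly covers scope, "
        "schedule, and cost. Many teams track progress with shared plans."
    ),
}


def mark_nouns(tagged: list[tuple[str, str]]) -> list[int]:
    """Step 3: return the positions of tokens tagged as nouns."""
    return [i for i, (_, pos) in enumerate(tagged) if pos == "NOUN"]


def summarize(text: str, max_sentences: int = 1) -> str:
    """Naive stand-in for a summarizer: keep the first sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."


def on_click(tagged: list[tuple[str, str]], index: int) -> Optional[str]:
    """Steps 4 to 6: look up the clicked noun and return a short summary."""
    word, pos = tagged[index]
    if pos != "NOUN":
        return None  # only marked nouns are clickable
    article = KNOWLEDGE_BASE.get(word.lower())
    return summarize(article) if article is not None else None


tagged = [("We", "PRON"), ("need", "VERB"), ("project management", "NOUN")]
print(mark_nouns(tagged))
print(on_click(tagged, 2))
```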

[0701] (Example 1)

[0702] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0703] In meetings and discussions, it is difficult to understand the specialized terminology and industry-specific language used by participants in real time and to ensure smooth communication across multiple languages. Furthermore, there is a need for a means to quickly gather and summarize information and for participants to easily follow the progress of the meeting. Traditional systems often fail to effectively address these challenges.

[0704] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0705] In this invention, the server includes a speech conversion means that acquires speech data in real time and converts that data into text information; a data analysis means that performs morphological analysis on the text information acquired by the speech conversion means to identify parts of speech; and an information provision means that acquires related information when a user selects an identified noun and provides summary information. This generalizes specialized and business terms used during meetings and enables accurate and rapid communication among participants who speak different languages.

[0706] "Voice conversion means" refers to technology for acquiring voice data in real time and converting that voice into text information.

[0707] "Data analysis means" refers to a technology for identifying the parts of speech used in text information obtained by speech conversion means through morphological analysis.

[0708] "Information provision means" refers to technology that retrieves relevant information based on a noun selected by the user, summarizes it, and provides it to the user.

[0709] "Data conversion means" refers to technologies that convert technical or business terms into general terms, making the content easier for users to understand.

[0710] A "translation processing means" is a technology that accurately translates text information between multiple languages, enabling mutual understanding between different languages.

[0711] A "language conversion means" is a technology for translating text information into a language specified by the user.

[0712] A "generative model" is an algorithm that uses artificial intelligence technology to summarize information and generate new text.

[0713] This embodiment of the invention is a system for supporting smooth communication in meetings and discussions. The configuration and operation of this system will be described in detail below.

[0714] The server functions as the central processing unit of this system. The server first receives audio data from the terminals. To convert the audio data into text in real time, the server utilizes a common speech recognition API. This API has the capability to rapidly convert audio data into text data.

[0715] The resulting text information is broken down into parts of speech using a morphological analysis engine. This engine identifies keywords and nouns, which the server uses to evaluate relevance. The server also collects relevant information from external resources and databases, and generates summary information using an artificial intelligence model in accordance with the information provision means.

[0716] The terminal is a user-operated device that captures meeting audio and sends it to the server. The terminal also displays to the user the converted text and summary information returned by the server. This allows the user to directly obtain relevant information by selecting identified nouns. Additionally, this information is translated into multiple languages using a multilingual translation API, enabling smooth communication across language barriers even in international meetings.

[0717] Users can understand important technical and business terms in a discussion in real time based on the information displayed on their device. This system can provide additional information using prompts based on a generative AI model for specific words. For example, a prompt such as, "Please provide a summary of the general definition and related techniques for the noun 'project management' used during the meeting," can be used.
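A prompt of this kind could be assembled from the selected noun with a small template function. The sketch below is illustrative only; the function name `build_term_prompt` and its `context` parameter are hypothetical, and the wording simply reuses the example prompt from the paragraph above.

```python
def build_term_prompt(noun: str, context: str = "the meeting") -> str:
    """Fill the example prompt template with the noun the user selected."""
    term = noun.strip()
    if not term:
        raise ValueError("a non-empty noun is required")
    return (
        "Please provide a summary of the general definition and related "
        f"techniques for the noun '{term}' used during {context}."
    )


print(build_term_prompt("project management"))
```

Rejecting an empty noun up front avoids sending a meaningless prompt to the generative AI model.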

[0718] Thus, this system, starting with speech recognition and continuing through a series of processes including multilingual translation, provides an environment where participants can more efficiently and accurately grasp the flow of conversation.

[0719] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0720] Step 1:

[0721] The terminal captures audio data via the microphone during the meeting. This input audio data is streamed to the server in real time via a secure communication protocol. The terminal compresses and packetizes this audio data to improve data transfer efficiency.

[0722] Step 2:

[0723] The server inputs the audio data received from the terminal into a speech recognition API. This API converts the audio into text data. Specifically, it analyzes the language from the audio waveform and outputs it as text information. Through this process, the audio input is converted into clean text data.

[0724] Step 3:

[0725] The server passes the generated text data to the morphological analysis engine. The morphological analysis engine divides the input text into words and identifies the part of speech of each word. The output is a list of identified nouns in the text. This list is used for subsequent information extraction.

[0726] Step 4:

[0727] The user selects a noun displayed on the device. Based on the selected noun, the server retrieves relevant information from an external source or database. This information gathering is performed via an API, and the information is returned to the server. The output of this process is detailed information related to the selected noun.

[0728] Step 5:

[0729] The server summarizes the acquired detailed information using a generative AI model. This generative AI model quickly summarizes large amounts of text information, extracting key points and organizing them concisely. This summarized information is output and used to provide information to users.

[0730] Step 6:

[0731] The server applies a data transformation algorithm to replace technical terms with more general ones. This process converts complex words and phrases into simpler terms to aid user understanding. The output is revised text information.

[0732] Step 7:

[0733] The server passes text information to a multilingual translation API as needed, translating it into the specified language. The translated text is processed on the server and immediately sent to the terminal. The terminal displays this translated information to the user, facilitating communication in a multilingual environment. The user can then use prompts on the terminal to request additional information.
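The term replacement in Step 6 can be sketched as a glossary-driven substitution. The specification does not describe the data transformation algorithm, so the approach below (a regular expression built from a glossary) is an assumed technique, and the entries in `GLOSSARY` are made-up examples.

```python
import re

# Hypothetical glossary: technical or business term -> general wording.
GLOSSARY = {
    "KPI": "key performance measure",
    "stakeholders": "people involved",
    "deliverable": "finished piece of work",
}


def convert_terms(text: str, glossary: dict[str, str]) -> str:
    """Replace whole-word glossary terms, ignoring case."""
    if not glossary:
        return text
    # Longest terms first so multi-word terms win over shorter ones.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): plain for term, plain in glossary.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


print(convert_terms("The KPI review is a deliverable.", GLOSSARY))
```

The `\b` word boundaries suit space-delimited languages such as English; for Japanese text, the replacement would more naturally operate on the tokens produced by the morphological analysis engine.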

[0734] (Application Example 1)

[0735] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0736] In communication settings, there is a need for information exchange among team members who speak different languages and for facilitating conversations in fields with a lot of specialized terminology. In particular, in on-site security operations, rapid and reliable information transmission is crucial, but language and jargon barriers often hinder communication. Furthermore, there is a lack of support for effectively integrating audio and visual information acquired on-site to correctly understand the situation. Technologies are needed to address these challenges.

[0737] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0738] In this invention, the server includes: a speech recognition means for acquiring voice input in real time and converting the voice into text data; a part-of-speech parsing means for performing morphological analysis on the text data acquired by the speech recognition means and identifying specific parts of speech; an information presentation means for acquiring related information and displaying a summary when a user selects an identified noun; a term conversion means for converting technical terms and business terms into general vocabulary; a translation means for performing translation between multiple languages; an information distribution means for providing translated text information to on-site staff via an information display device; and a situation analysis means for linking visual information acquired from a monitoring device with voice reports to identify abnormal events. This enables rapid information sharing and accurate understanding of situations among staff who speak different languages.

[0739] "Voice input" refers to acquiring acoustic signals in real time and using that information as data for processing.

[0740] "Text data" refers to text information obtained by converting speech, and it can be processed and displayed electronically.

[0741] "Speech recognition means" refers to a device or system for receiving speech information and converting it into text data.

[0742] A "part-of-speech parsing means" is a device or system for performing morphological analysis on text data and identifying specific parts of speech.

[0743] An "information presentation means" is a device or system for acquiring knowledge related to a noun selected by the user, summarizing it, and displaying it.

[0744] A "terminology conversion means" is a device or system for replacing technical terms or business jargon with general vocabulary.

[0745] "Translation means" refers to a device or system for converting text information between languages.

[0746] "Information distribution means" refers to a device or system for transmitting acquired information to an information display device and presenting it to staff.

[0747] A "situation analysis means" is a device or system that integrates visual and auditory information to identify abnormal events.

[0748] This invention is designed as an information gathering and presentation system for use by on-site security staff. Voice input is acquired in real time through the microphone of a smartphone or information display device, and the server processes the resulting acoustic signal. The speech recognition means uses the Google Speech-to-Text API or Amazon Transcribe to convert the speech into text data.

[0749] Next, the text data is preprocessed for morphological analysis. Here, parts of speech are analyzed using a morphological analysis engine such as MeCab, and particularly important nouns are extracted. When the user selects a specific noun, the server retrieves relevant information from a database or external resources and summarizes that information using a generative AI model. In this process, OpenAI's GPT model or similar might be used.
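The noun extraction described above can be illustrated by parsing MeCab-style output. The sketch assumes the default output layout used with the IPADIC dictionary (surface form, a tab, then comma-separated features with the part of speech first, and a final `EOS` line). To stay self-contained it does not call MeCab; `SAMPLE` is hand-written text in that layout, not actual analyzer output, and `extract_nouns` is a hypothetical helper name.

```python
# Hand-written sample in the assumed IPADIC-style layout (not real output).
SAMPLE = (
    "車両\t名詞,一般,*,*,*,*,車両,シャリョウ,シャリョー\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "発見\t名詞,サ変接続,*,*,*,*,発見,ハッケン,ハッケン\n"
    "EOS\n"
)


def extract_nouns(mecab_output: str) -> list[str]:
    """Collect surface forms whose first feature field is 名詞 (noun)."""
    nouns = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and malformed lines
        surface, features = line.split("\t", 1)
        if features.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns


print(extract_nouns(SAMPLE))
```

In a deployment, the string passed to `extract_nouns` would come from the morphological analysis engine rather than from a constant.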

[0750] Subsequently, the extracted technical and business terms are converted into general vocabulary by the term conversion means and displayed in a format that is easy for the user to understand. Text translation between multiple languages is performed using translation tools such as the Google Translate API or the DeepL API.

[0751] The translated information is provided to on-site staff in real time via information display devices through information distribution means. Smart glasses or mobile devices may be used for this information display. Situation analysis means are also used to integrate visual information obtained from monitoring devices with audio reports. This makes it possible to quickly identify abnormal events and take appropriate countermeasures.
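One simple way to link voice reports with visual information is to pair them when they occur close together in time and refer to the same kind of object. The sketch below shows that idea only; the specification does not state how the situation analysis means performs the linking, so the time-window rule, the 30-second default, and the `VoiceReport` and `Detection` structures are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class VoiceReport:
    timestamp: float    # seconds since the start of the shift
    keywords: set[str]  # nouns extracted from the spoken report


@dataclass
class Detection:
    timestamp: float
    label: str          # object class reported by the monitoring device


def find_abnormal_events(
    reports: list[VoiceReport],
    detections: list[Detection],
    window_s: float = 30.0,
) -> list[tuple[VoiceReport, Detection]]:
    """Pair each report with detections of the same label within the window."""
    events = []
    for report in reports:
        for detection in detections:
            close_in_time = abs(report.timestamp - detection.timestamp) <= window_s
            if close_in_time and detection.label in report.keywords:
                events.append((report, detection))
    return events


reports = [VoiceReport(100.0, {"vehicle"})]
detections = [Detection(110.0, "vehicle"), Detection(500.0, "vehicle")]
print(find_abnormal_events(reports, detections))
```

Only the detection at 110 seconds is paired with the report; the one at 500 seconds falls outside the window and is ignored.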

[0752] For example, if a security staff member verbally reports a suspicious vehicle, a voice recognition system converts the report into text data, extracting the key phrase "suspicious vehicle." The server automatically summarizes the information, translates it into multiple languages, and immediately shares it with the entire team.

[0753] An example of a prompt for a generative AI model might be: "Summarize the latest reports related to the keyword 'suspicious vehicle' and generate a multilingual description. Translate it into English and add any relevant information."
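A prompt of this kind could be assembled from the selected keyword and the retrieved reports. The helper below is a minimal sketch; the function name `build_summary_prompt`, its parameters, and the sample report text are hypothetical and serve only to show the string assembly. Sending the prompt to a particular model is outside the sketch.

```python
def build_summary_prompt(keyword: str, reports: list[str], languages: list[str]) -> str:
    """Assemble a summarization prompt from a keyword, report texts, and target languages."""
    report_lines = "\n".join(f"- {report}" for report in reports)
    language_list = ", ".join(languages)
    return (
        f"Summarize the latest reports related to the keyword '{keyword}'.\n"
        f"Translate the summary into: {language_list}.\n"
        "Add any relevant information.\n"
        f"Reports:\n{report_lines}"
    )


# Hypothetical sample input.
prompt = build_summary_prompt(
    "suspicious vehicle",
    ["A van has been parked at the north gate for 40 minutes."],
    ["English", "Japanese"],
)
print(prompt)
```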

[0754] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0755] Step 1:

[0756] The terminal acquires voice input from on-site staff in real time. This input voice is collected through the microphone of the information display device.

[0757] Step 2:

[0758] The server converts the acquired speech input into text data using the Google Speech-to-Text API or Amazon Transcribe. The input is speech data, and the output is text data.

[0759] Step 3:

[0760] The server performs morphological analysis on the obtained character data using the MeCab library and tags each word with its part of speech. Through this analysis, nouns and important phrases are extracted. The input is character data, and the output is a list of nouns.

[0761] Step 4:

[0762] The user selects a noun of interest from the displayed list. This selection action triggers the next information retrieval process.

[0763] Step 5:

[0764] The server retrieves information related to the selected noun from databases and external resources. The retrieved information is then inserted into a prompt, and a generative AI model summarizes it. The output of this step is the summarized information.

[0765] Step 6:

[0766] The server uses a terminology conversion mechanism to convert technical terms into general vocabulary and arranges them in a format that is easily understood by a wider range of users. The input is summary information, and the output is the converted information.

[0767] Step 7:

[0768] The server uses the DeepL API or Google Translate API to perform translations and generate multilingual information. This makes it possible to communicate information to users who speak different languages.

[0769] Step 8:

[0770] The server transmits the translated information to the terminal at the site via an information distribution system. The terminal then presents the information to the user using an information display device. This allows for immediate feedback.

[0771] Step 9:

[0772] The server integrates visual and audio data from monitoring devices and identifies abnormal events using situation analysis means. Based on the integrated data, it creates a prompt for a generative AI model, which generates specific action guidelines.
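Steps 2 through 7 above form a chain in which each step consumes the previous step's output. The sketch below shows that chaining with each stage passed in as a function, so that a real speech recognition, morphological analysis, summarization, or translation service could be substituted. The function `run_pipeline`, its parameter names, and the stub stages in the usage example are hypothetical; distribution to terminals (Step 8) and situation analysis (Step 9) are omitted.

```python
from typing import Callable


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    extract_nouns: Callable[[str], list[str]],
    choose: Callable[[list[str]], str],
    summarize: Callable[[str], str],
    simplify: Callable[[str], str],
    translate: Callable[[str, str], str],
    languages: list[str],
) -> dict[str, str]:
    """Run the text-processing steps in order and return one translation per language."""
    text = transcribe(audio)         # Step 2: speech to text
    nouns = extract_nouns(text)      # Step 3: morphological analysis
    selected = choose(nouns)         # Step 4: user selection
    summary = summarize(selected)    # Step 5: retrieval and summary
    plain = simplify(summary)        # Step 6: terminology conversion
    return {lang: translate(plain, lang) for lang in languages}  # Step 7


# Stub stages standing in for real services (all hypothetical).
result = run_pipeline(
    b"\x00\x01",
    transcribe=lambda audio: "suspicious vehicle at the north gate",
    extract_nouns=lambda text: ["vehicle", "gate"],
    choose=lambda nouns: nouns[0],
    summarize=lambda noun: f"Summary of reports about {noun}",
    simplify=lambda summary: summary,
    translate=lambda text, lang: f"[{lang}] {text}",
    languages=["en", "ja"],
)
print(result)
```

Passing the stages as parameters keeps the flow independent of any one vendor API, which matches the way the text names several interchangeable services.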

[0773] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0774] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication. The system includes functions for speech recognition, morphological analysis, information presentation, terminology conversion, multilingual translation, and emotion recognition.

[0775] 1. Overall structure:

[0776] The server functions as the central hub of the system, processing voice data, analyzing sentiment data, and presenting and translating information. The terminal provides the interface with the user, collecting voice input from the user, sending it to the server, and displaying the processed information.

[0777] 2. Server processing:

[0778] The server converts the audio transmitted from the terminal into text data using a speech recognition engine. This text data is then broken down by morphological analysis to identify nouns and important words. Furthermore, the server can use an emotion engine to analyze the emotions contained in the audio data. The analysis results are used as data that provides a detailed indication of the user's emotional state.

[0779] The server adjusts information presentation based on the user's emotional state, collecting, summarizing, and displaying information related to the noun selected by the user. This information is presented in a way that aligns with the user's emotions, enabling a more personalized experience.

[0780] 3. Role of the terminal:

[0781] The terminal is a device used by the user and primarily manages voice input and output. The terminal acquires voice and sends it to the server, and displays the server's response to the user. It also has an interface for visualizing the user's emotional state, allowing changes to be monitored in real time.

[0782] 4. User interaction and experience:

[0783] Users input voice commands into their devices and receive information based on the analyzed results. For example, if a user says "project management," the server uses an emotion engine to analyze the emotions expressed in the statement and presents information in the most appropriate format. Furthermore, if the system determines that the user is feeling stressed, it can suggest information and methods to help them relax.

[0784] Specific example:

[0785] When a user mentions "the introduction of a new product" during a meeting, the server analyzes the emotions behind the statement and, if it determines that the user is feeling anxious, provides reassuring information or examples of past success stories.

[0786] This system provides an environment that enables users to make better decisions by delivering information with consideration for their emotions. Furthermore, the system continuously learns and improves the relevance and accuracy of the information presented.

[0787] The following describes the processing flow.

[0788] Step 1:

[0789] The device uses a microphone to capture the user's voice. The captured voice is converted into a digital format and sent to the server.

[0790] Step 2:

[0791] The server inputs the received audio data into the speech recognition engine and converts it into corresponding text data. This text data is then ready for further processing within the system.

[0792] Step 3:

[0793] The server passes the text data to a morphological analysis engine, which breaks down words into parts of speech. Nouns and keywords, in particular, are identified and marked so that their relevant information can be presented to the user later.

[0794] Step 4:

[0795] The server uses an emotion engine to analyze the user's emotions in the voice data. The emotion recognition results are used as data indicating the user's state.

[0796] Step 5:

[0797] When a user selects a noun or keyword on their device, the server collects relevant information from databases and external sources. The collected information can then be summarized according to the user's emotional state.

[0798] Step 6:

[0799] The server optimizes the summary information based on the results of sentiment analysis. For example, if the user is feeling anxious, it may emphasize information that provides reassurance.

[0800] Step 7:

[0801] The server sends the summarized information to the terminal. The terminal displays it clearly to the user to support their understanding, adjusting the display method as appropriate based on the sentiment analysis results.

[0802] Step 8:

[0803] The device visualizes the user's emotional state in real time and displays changes. This allows the user to understand their own emotional changes and respond as needed.

[0804] Step 9:

[0805] If a user requests multilingual translation, the device sends the request to the server. The server translates the text into the selected language, and the translation result is returned to the device and provided to the user.
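The emotion-dependent adjustment in Steps 5 to 7 above can be illustrated by reordering summary items so that the most suitable ones come first. This is a sketch under assumptions: the `tone` tag on each item, the `TONE_PRIORITY` table, and the emotion labels are hypothetical, and the original does not specify how the summary is optimized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryItem:
    text: str
    tone: str  # hypothetical tag: "reassuring", "neutral", or "risk"


# Hypothetical policy: which tones to show first for each emotional state.
TONE_PRIORITY = {
    "anxious": ["reassuring", "neutral", "risk"],
    "calm": ["neutral", "risk", "reassuring"],
}
DEFAULT_PRIORITY = ["neutral", "reassuring", "risk"]


def order_for_emotion(items: list[SummaryItem], emotion: str) -> list[SummaryItem]:
    """Return items sorted so that tones preferred for the emotion come first."""
    priority = TONE_PRIORITY.get(emotion, DEFAULT_PRIORITY)
    rank = {tone: index for index, tone in enumerate(priority)}
    # sorted() is stable, so items with the same tone keep their original order.
    return sorted(items, key=lambda item: rank.get(item.tone, len(rank)))


items = [
    SummaryItem("Two open risks remain in the schedule.", "risk"),
    SummaryItem("The last three launches finished on time.", "reassuring"),
    SummaryItem("The kickoff meeting is on Monday.", "neutral"),
]
for item in order_for_emotion(items, "anxious"):
    print(item.text)
```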

[0806] (Example 2)

[0807] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0808] Conventional systems simply convert user speech into text data and present information, lacking the ability to provide flexible information based on the user's emotional state. Furthermore, they lack mechanisms to support decision-making through information presentation that considers the user's emotions, thus failing to contribute to smoother communication.

[0809] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0810] In this invention, the server includes speech recognition means, part-of-speech analysis means, emotion recognition means, information presentation means, and display means. This enables the recognition of emotions from the user's speech and the presentation of optimal information corresponding to those emotions.

[0811] "Speech recognition means" refers to a function that converts voice input acquired by a terminal into text data in real time.

[0812] "Part-of-speech analysis means" refers to a function that performs morphological analysis on acquired character data to identify specific parts of speech.

[0813] "Emotion recognition means" refers to a function that analyzes identified terms and recognizes the emotional state of an utterance.

[0814] "Information presentation means" refers to a function that collects information in accordance with the user's emotions and uses artificial intelligence to generate and present a summary in the most optimal form.

[0815] "Display means" refers to a function that includes an interface for transmitting information to a terminal and providing that information to the user visually or audibly.

[0816] This invention is a system that combines speech recognition and emotion recognition to support smooth and effective communication with users. The system mainly consists of a server and terminals, with the server functioning as the central hub of the system.

[0817] The server receives audio data from the terminal and converts it into text data using speech recognition. For this purpose, the server can use general-purpose speech recognition software. Specifically, a speech recognition API can be used as the speech recognition engine. The converted text data is then parsed into parts of speech via a morphological analysis tool. This analysis allows the server to identify specific parts of speech and nouns.

[0818] Next, the server uses an emotion recognition engine to analyze the user's emotional state from the obtained data. For example, an emotion analysis API can be used for emotion recognition. This analysis allows the server to prepare information that reflects the user's emotional state.

[0819] Furthermore, the server uses the information presentation means to collect information tailored to the user's emotions and uses an artificial intelligence model to generate an optimal summary. This information is transmitted to the terminal via the display means and provided to the user visually or aurally.

[0820] The terminal functions as a user interface, handling voice input and output. It acquires voice from the user and transmits it to the server. It also presents the processed information to the user through a medium, allowing the user to make decisions while monitoring their emotional state in real time.

[0821] For example, if a user says, "I feel anxious about the next project meeting," the server analyzes the user's emotions from that statement and suggests ways to relax and provides helpful success stories. In this way, the system provides personalized information based on the user's emotions.

[0822] An example of a prompt to input into a generative AI model is, "Show how to provide effective information to a user who is feeling anxious while preparing a presentation." Using this prompt, the model can generate appropriate responses for a user's specific emotional state.

[0823] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0824] Step 1:

[0825] The terminal acquires voice input from the user. This voice data is collected in real time by the microphone and sent to the server as raw voice data.

[0826] Step 2:

[0827] The server inputs the received audio data into a speech recognition engine, which converts it into text data. This engine analyzes sound wave information and identifies words and phrases from combinations of phonemes. The output is text data of the user's speech.

[0828] Step 3:

[0829] The server inputs the converted character data into a morphological analysis tool and performs part-of-speech analysis. This analysis identifies grammatical structures and tags parts of speech such as nouns and verbs. The output is text data with part-of-speech information attached.

[0830] Step 4:

[0831] The server inputs the part-of-speech analysis results into an emotion recognition engine to identify the emotional state of the text. This engine evaluates the context of words based on specific emotion indicators and identifies the intensity and type of emotion. The output is structured data indicating the emotional state.

[0832] Step 5:

[0833] The server uses information presentation tools to collect relevant information based on the user's emotional state and identified keywords. Using artificial intelligence, it generates summaries tailored to the user's emotions and extracts information useful to the user. The output is summarized information that matches the user's emotions.

[0834] Step 6:

[0835] The server sends the generated information to the terminal, which then provides it to the user in visual or auditory form. Specific actions include displaying relevant information on the screen or playing audio messages through the speaker. The user can then use this information to make further decisions or take action.
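Step 4 above outputs structured data giving the type and intensity of the emotion. The sketch below shows one possible shape for that data, using a small word lexicon as a stand-in for the emotion recognition engine. The lexicon contents, the `EmotionState` type, and the scoring rule (sum of weights, capped at 1.0) are hypothetical; the original names no specific engine or scoring method.

```python
from dataclasses import dataclass

# Hypothetical lexicon: word -> (emotion type, weight).
LEXICON = {
    "anxious": ("anxiety", 0.8),
    "worried": ("anxiety", 0.6),
    "glad": ("joy", 0.7),
    "relieved": ("relief", 0.6),
}


@dataclass(frozen=True)
class EmotionState:
    kind: str
    intensity: float  # 0.0 (none) to 1.0 (strong)


def recognize_emotion(tokens: list[str]) -> EmotionState:
    """Sum lexicon weights per emotion type and return the strongest one."""
    totals: dict[str, float] = {}
    for token in tokens:
        entry = LEXICON.get(token.lower())
        if entry:
            kind, weight = entry
            totals[kind] = totals.get(kind, 0.0) + weight
    if not totals:
        return EmotionState("neutral", 0.0)
    strongest = max(totals, key=lambda kind: totals[kind])
    return EmotionState(strongest, min(totals[strongest], 1.0))


state = recognize_emotion("I feel anxious about the next project meeting".split())
print(state)
```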

[0836] (Application Example 2)

[0837] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0838] In elderly care settings, providing appropriate support in response to the emotions of both the care recipient and the caregiver is crucial. However, it is difficult for care staff to immediately understand the emotions of the care recipient and respond appropriately based on that understanding, highlighting the need for improved communication quality. Furthermore, language barriers pose a problem in communication between caregivers and care recipients who speak diverse languages. A system is needed to address these challenges.

[0839] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0840] In this invention, the server includes a speech recognition means, a morphological analysis means, an information presentation means, a terminology conversion means for converting technical terms and business terms into general terms, a translation means for performing translation between multiple languages, and an emotion analysis and information provision means for determining the emotional state of the user or person being cared for and providing appropriate information or messages based on those emotions. This enables timely and appropriate emotionally sensitive information presentation in care settings and facilitates smooth communication between users who speak diverse languages.

[0841] A "speech recognition means" is a device that has the function of acquiring speech input in real time and converting that speech into text data.

[0842] A "morphological analysis means" is a device that breaks down character data acquired by a speech recognition means and has the function of identifying specific parts of speech.

[0843] An "information presentation means" is a device that has the function of acquiring related information when an identified noun is selected, and summarizing and displaying that information.

[0844] A "terminology conversion means" is a device that has the function of converting technical or business terms into general terms.

[0845] A "translation means" is a device that performs translation between multiple languages and has the function of converting text into a specified language.

[0846] "Emotional analysis and information provision means" refers to a system that has the function of determining the emotional state of a user or person receiving care and providing appropriate information or messages based on those emotions.

[0847] "Machine learning" is a technology that uses data to automatically improve algorithms, and is applied to improve the accuracy of information.

[0848] To implement the present invention, a system centered on speech recognition and emotion analysis can be envisioned operating on a smart device. This system consists of a server and a terminal, and the user provides voice input via the terminal. Voice data is acquired in real time through the terminal's microphone and transmitted to the server. The server converts the voice into text data using speech recognition means.

[0849] The server uses morphological analysis to break down text data into parts of speech and identify specific important words and phrases. Then, using the emotion analysis and information provision means, it determines the user's emotional state and collects, summarizes, and presents relevant information based on the analysis results. The information is adjusted according to the user's emotions and displayed on the device in a personalized format.

[0850] If multilingual support is required, the text will be translated into the user's specified language using translation tools. The information presented and the translated text are intended to provide support that is sensitive to the feelings of both the care recipient and the caregiver, and are provided on the device as screen display or audio output.

[0851] For example, if a care recipient inputs "I'm feeling a little anxious today...", the system uses emotion analysis to recognize this state and suggests calming music or relaxation techniques. It also advises care staff to "take a moment, hold the care recipient's hand, and encourage slow breathing."

[0852] An example of a prompt might be, "Please suggest effective communication methods to help caregivers alleviate the anxiety of those they care for."

[0853] This allows the server to personalize and deliver appropriate information in real time according to the user's needs, enabling effective, emotion-based support in caregiving settings.

[0854] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0855] Step 1:

[0856] The user inputs voice using the device's microphone. The device captures the voice in real time and generates digital audio data. This audio data is sent to a server for analysis.

[0857] Step 2:

[0858] The server receives the audio data and converts it to text using speech recognition technology. In this step, the input is the audio data, and the output is the converted text data.

[0859] Step 3:

[0860] The server analyzes character data using morphological analysis tools and extracts specific parts of speech, particularly nouns. The input in this step is character data, and the output is a list of analyzed nouns. The analysis is performed using a morphological analysis algorithm.

[0861] Step 4:

[0862] The server uses emotion analysis and information provision means to determine the user's emotions from the voice input. The input in this step is voice data and text data, and the output is emotion data indicating the emotional state. Emotion analysis is performed by an emotion recognition engine.

[0863] Step 5:

[0864] The server collects relevant information based on the emotional state and noun list, and summarizes the information using an information presentation tool. The input in this step is emotional data and a noun list, and the output is the summarized information.

[0865] Step 6:

[0866] The server uses translation tools to translate the text into the user's specified language as needed. The input in this step is summarized information, and the output is the translated information.

[0867] Step 7:

[0868] The server adjusts the information based on emotional data and sends it to the terminal. The terminal presents this information to the user as a screen display or audio output. In this step, the input is the adjusted information, and the output is the information delivered to the user.
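The last steps of this flow produce messages for both the care recipient and the care staff, translated when a language is specified. A minimal sketch of that selection follows, assuming a fixed message table keyed by emotion label. The table `SUPPORT_MESSAGES`, the function `build_support_messages`, and the `translate` callback are hypothetical; in the system described, a generative AI model and the translation means would produce this content.

```python
from typing import Callable, Optional

# Hypothetical message table keyed by emotion label.
SUPPORT_MESSAGES = {
    "anxiety": {
        "care_recipient": "Would you like to listen to some calming music?",
        "care_staff": (
            "Take a moment, hold the care recipient's hand, "
            "and encourage slow breathing."
        ),
    },
}


def build_support_messages(
    emotion: str,
    language: Optional[str] = None,
    translate: Optional[Callable[[str, str], str]] = None,
) -> dict[str, str]:
    """Return one message per role, translated if a language and translator are given."""
    messages = dict(SUPPORT_MESSAGES.get(emotion, {}))
    if language and translate:
        messages = {role: translate(text, language) for role, text in messages.items()}
    return messages


# Stub translator for illustration only.
tagged = build_support_messages("anxiety", "vi", lambda text, lang: f"[{lang}] {text}")
print(tagged["care_staff"])
```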

[0869] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0870] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0871] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0872] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0873] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0874] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0875] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
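The layout of the emotion map described above (left and right halves, upper and lower sides, inner and outer rings) can be expressed as a classification of a point relative to the map center. The sketch assumes Cartesian coordinates with the center at (0, 0) and the positive y axis pointing up; the function name `describe_position`, the `inner_radius` threshold, and the "neutral" label for points on the horizontal axis are hypothetical and are not taken from Figure 9.

```python
import math


def describe_position(x: float, y: float, inner_radius: float = 0.5) -> dict[str, str]:
    """Describe a point on an emotion map centered at (0, 0), with +y pointing up."""
    if x < 0:
        origin = "reaction"                 # left half
    elif x > 0:
        origin = "situation"                # right half
    else:
        origin = "reaction and situation"   # on the vertical axis
    if y > 0:
        valence = "pleasure"                # upper side
    elif y < 0:
        valence = "displeasure"             # lower side
    else:
        valence = "neutral"                 # assumption for the horizontal axis
    distance = math.hypot(x, y)
    layer = "inner state" if distance <= inner_radius else "expressed in action"
    return {"origin": origin, "valence": valence, "layer": layer}


# A point at the 3 o'clock position on the outer part of the map.
print(describe_position(1.0, 0.0))
```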

[0876] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
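The idea that pleasure and discomfort follow from how far a set of balances is from its ideal can be written as a simple score. The sketch below assumes each balance is normalized to the range 0 to 1 and uses the mean absolute deviation; the function name `balance_valence`, the normalization, and the formula 1 / (1 + deviation) are hypothetical choices, since the original gives no formula.

```python
def balance_valence(current: dict[str, float], ideal: dict[str, float]) -> float:
    """Return 1.0 when every balance is at its ideal, and less as balances deviate."""
    deviation = sum(abs(current[name] - ideal[name]) for name in ideal) / len(ideal)
    return 1.0 / (1.0 + deviation)


# Hypothetical normalized balances for a robot.
ideal = {"posture": 1.0, "battery": 1.0}
nearly_ideal = {"posture": 0.9, "battery": 0.9}
low_battery = {"posture": 1.0, "battery": 0.2}

print(balance_valence(nearly_ideal, ideal) > balance_valence(low_battery, ideal))
```

A higher score corresponds to the pleasure side and a lower score to the discomfort side.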

[0877] The emotion map defines two emotions that promote learning. One is a negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is a positive emotion located around "desire" on the reaction side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0878] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
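One way to obtain training targets in which neighboring emotions have similar values is to spread each label over nearby emotions according to their distance on the map. The sketch below does this with a Gaussian weighting. It is an illustration of the property described, not the training method of the original: the coordinates in `POSITIONS`, the function `soft_targets`, and the `sigma` value are hypothetical and are not taken from Figure 9 or Figure 10.

```python
import math

# Hypothetical 2-D positions on the emotion map (illustrative only).
POSITIONS = {
    "reassured": (0.9, 0.1),
    "calm": (0.8, 0.2),
    "confident": (0.85, 0.3),
    "anxious": (0.9, -0.4),
}


def soft_targets(
    label: str,
    positions: dict[str, tuple[float, float]] = POSITIONS,
    sigma: float = 0.2,
) -> dict[str, float]:
    """Spread a single label over nearby emotions with a Gaussian of map distance."""
    center_x, center_y = positions[label]
    weights = {
        name: math.exp(-((x - center_x) ** 2 + (y - center_y) ** 2) / (2 * sigma**2))
        for name, (x, y) in positions.items()
    }
    total = sum(weights.values())
    # Normalize so the target values sum to 1.
    return {name: weight / total for name, weight in weights.items()}


targets = soft_targets("calm")
for name, value in sorted(targets.items(), key=lambda pair: -pair[1]):
    print(f"{name}: {value:.3f}")
```

With these positions, "reassured" and "confident" receive values close to that of "calm", while the more distant "anxious" receives a much smaller value.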

[0879] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0880] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0881] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-temporary storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-temporary storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0882] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0883] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0884] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0885] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0886] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0887] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0888] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0889] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0890] The following is further disclosed regarding the embodiments described above.

[0891] (Claim 1)

[0892] A speech recognition means that acquires voice input in real time and converts that voice into text data,

[0893] A part-of-speech parsing means that performs morphological analysis on the text data acquired by the speech recognition means and identifies a specific part of speech,

[0894] An information presentation means that retrieves related information and displays a summary when a user selects an identified noun,

[0895] A terminology conversion means that converts technical terms and business terms into general terms,

[0896] A translation means for performing translation between multiple languages,

[0897] A system including the above means.

[0898] (Claim 2)

[0899] The system according to claim 1, further comprising a translation means for translating text into a language specified by the user.

[0900] (Claim 3)

[0901] The system according to claim 1, further comprising an information presentation means for collecting information based on noun selection and generating a summary using artificial intelligence.
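The chain of means recited in claim 1 above can be illustrated with a minimal sketch. It assumes the speech recognition step is replaced by an already transcribed string, and that the morphological analyzer, knowledge base, term dictionary, and translation engine are replaced by small hypothetical lookup tables. Every name below (`recognize_speech`, `identify_parts_of_speech`, `present_information`, `convert_term`, `translate`, and the tables) is illustrative and does not appear in the disclosure.

```python
import re
from dataclasses import dataclass

# All tables below are hypothetical stand-ins for a morphological analyzer,
# a knowledge base, a term dictionary, and a translation engine.
NOUN_LEXICON = {"kpi", "roadmap"}
SUMMARIES = {"kpi": "A number used to track progress toward a goal."}
PLAIN_TERMS = {"kpi": "performance measure", "roadmap": "plan"}
TRANSLATIONS = {("performance measure", "es"): "indicador de rendimiento"}


@dataclass(frozen=True)
class Token:
    surface: str
    pos: str  # "NOUN" or "OTHER"


def recognize_speech(audio_chunk: str) -> str:
    """Stand-in for speech recognition: the input is already a transcript."""
    return audio_chunk.strip()


def identify_parts_of_speech(text: str) -> list[Token]:
    """Toy analysis: a word is tagged NOUN only if it is in NOUN_LEXICON."""
    words = re.findall(r"[A-Za-z]+", text)
    return [Token(w, "NOUN" if w.lower() in NOUN_LEXICON else "OTHER") for w in words]


def present_information(noun: str) -> str:
    """Return a stored summary for the noun the user selected."""
    return SUMMARIES.get(noun.lower(), "No summary available.")


def convert_term(term: str) -> str:
    """Replace a technical or business term with a general term, if known."""
    return PLAIN_TERMS.get(term.lower(), term)


def translate(text: str, target_lang: str) -> str:
    """Look up a translation; fall back to the untranslated text."""
    return TRANSLATIONS.get((text, target_lang), text)


transcript = recognize_speech(" The KPI is on the roadmap ")
nouns = [t.surface for t in identify_parts_of_speech(transcript) if t.pos == "NOUN"]
selected = nouns[0]  # simulate the user selecting the first identified noun
summary = present_information(selected)
plain_es = translate(convert_term(selected), "es")
```

Keeping each means as a separate function mirrors the claim structure, so any single stand-in could be swapped for a real engine without touching the others.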

[0902] "Example 1"

[0903] (Claim 1)

[0904] A speech conversion means that acquires audio data in real time and converts that data into text information,

[0905] A data analysis means that performs morphological analysis on text information acquired by the speech conversion means to identify parts of speech,

[0906] An information provision means that retrieves relevant information and provides summary information when a user selects an identified noun,

[0907] A data conversion means that converts technical terms and business terms into general terms,

[0908] A translation processing means for performing translation between multiple languages,

[0909] A system including the above means.

[0910] (Claim 2)

[0911] The system according to claim 1, further comprising a language conversion means for translating text information into a language specified by the user.

[0912] (Claim 3)

[0913] The system according to claim 1, further comprising an information providing means for obtaining relevant information based on noun selection and creating summary information using a generative model.

[0914] "Application Example 1"

[0915] (Claim 1)

[0916] A speech recognition means that acquires voice input in real time and converts that voice into text data,

[0917] A part-of-speech parsing means that performs morphological analysis on the text data acquired by the speech recognition means and identifies a specific part of speech,

[0918] An information presentation means that retrieves related information and displays a summary when a user selects an identified noun,

[0919] A terminology conversion means that converts technical terms and business terms into general vocabulary,

[0920] A translation means for performing translation between multiple languages,

[0921] An information distribution means for providing translated text information to on-site staff via an information display device,

[0922] A situation analysis means for identifying abnormal events by linking visual information and audio reports acquired from monitoring devices,

[0923] A system including the above means.

[0924] (Claim 2)

[0925] The system according to claim 1, further comprising a translation means for translating text into a language specified by the user and displaying it using an information display device.

[0926] (Claim 3)

[0927] The system according to claim 1, further comprising information presentation means for collecting information from a knowledge base based on noun selection, generating a summary using a generative AI model, and outputting a prompt sentence.
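The situation analysis means recited in claim 1 above links visual information with voice reports to identify abnormal events. One possible reading is sketched below: a visual observation and a voice report are linked when they are close in time and both mention the same abnormal keyword. The 30-second window, the keyword list, and all class and function names are assumptions made for illustration; the disclosure does not specify a linking rule.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 30.0  # assumed linking window, not taken from the disclosure
ABNORMAL_KEYWORDS = ("smoke", "leak", "fallen")  # assumed vocabulary


@dataclass(frozen=True)
class VisualObservation:
    timestamp: float  # seconds since some shared reference time
    label: str        # text label produced from the monitoring device image


@dataclass(frozen=True)
class VoiceReport:
    timestamp: float
    text: str         # transcript of the spoken report


def find_abnormal_events(observations, reports):
    """Return (observation, report) pairs that are within WINDOW_SECONDS
    of each other and share at least one abnormal keyword."""
    events = []
    for obs in observations:
        for rep in reports:
            close = abs(obs.timestamp - rep.timestamp) <= WINDOW_SECONDS
            shared = [
                k for k in ABNORMAL_KEYWORDS
                if k in obs.label.lower() and k in rep.text.lower()
            ]
            if close and shared:
                events.append((obs, rep))
    return events


observations = [
    VisualObservation(100.0, "Smoke near conveyor"),
    VisualObservation(500.0, "normal operation"),
]
reports = [
    VoiceReport(110.0, "I see smoke on line 2"),
    VoiceReport(900.0, "smoke again"),
]
events = find_abnormal_events(observations, reports)
```

Requiring agreement between both sources reduces false alarms from a single noisy channel, at the cost of missing events that only one source reports.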

[0928] "Example 2 combined with an emotion engine"

[0929] (Claim 1)

[0930] A speech recognition means that acquires voice input in real time on a terminal and converts that voice into text data,

[0931] A part-of-speech parsing means that performs morphological analysis on the text data acquired by the aforementioned speech recognition means and identifies a specific part of speech,

[0932] An emotion recognition means that analyzes identified terms and recognizes the emotional state of the utterance,

[0933] An information presentation means that collects information according to the user's emotions and generates a summary in an optimal form using artificial intelligence,

[0934] A display means including an interface that transmits information to a terminal and displays the user's emotional state,

[0935] ...

[0936] A system including the above means.

[0937] (Claim 2)

[0938] The system according to claim 1, which adaptively provides information in response to the user's request based on the aforementioned emotion recognition.

[0939] (Claim 3)

[0940] The system according to claim 1, which provides information received by a terminal to a user visually or audibly.
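The emotion recognition means and the adaptive information provision of claims 1 and 2 above can be sketched as follows. This is a toy, lexicon-based stand-in: the emotion labels, the word list, and the rule of shortening the summary when confusion is detected are all assumptions chosen for illustration, and a real emotion engine would use a trained model instead.

```python
from collections import Counter

# Hypothetical mapping from identified terms to emotion labels.
EMOTION_LEXICON = {
    "confused": "confusion",
    "lost": "confusion",
    "great": "joy",
    "worried": "anxiety",
}


def recognize_emotion(terms):
    """Return the most frequent emotion label among the terms, or 'neutral'."""
    counts = Counter(
        EMOTION_LEXICON[t.lower()] for t in terms if t.lower() in EMOTION_LEXICON
    )
    return counts.most_common(1)[0][0] if counts else "neutral"


def adapt_summary(summary, emotion):
    """Assumed adaptation rule: a confused user gets only the first sentence."""
    if emotion == "confusion":
        return summary.split(". ")[0].rstrip(".") + "."
    return summary


terms = ["I", "am", "lost", "and", "confused", "about", "KPI"]
emotion = recognize_emotion(terms)
full_summary = "A KPI tracks progress. It is reviewed monthly."
shown = adapt_summary(full_summary, emotion)
```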

[0941] "Application Example 2 combined with an emotion engine"

[0942] (Claim 1)

[0943] A speech recognition means that acquires voice input in real time and converts that voice into text data,

[0944] A part-of-speech parsing means that performs morphological analysis on the text data acquired by the speech recognition means and identifies a specific part of speech,

[0945] An information presentation means that retrieves related information and displays a summary when a user selects an identified noun,

[0946] A terminology conversion means that converts technical terms and business terms into general terms,

[0947] A translation means for performing translation between multiple languages,

[0948] An emotion analysis and information provision means that assesses the emotional state of a user or care recipient and provides appropriate information or messages based on that emotional state,

[0949] A system including the above means.

[0950] (Claim 2)

[0951] The system according to claim 1, further comprising a translation means for translating text into a language specified by the user, thereby providing multilingual support to a person receiving care.

[0952] (Claim 3)

[0953] The system according to claim 1, further comprising an information presentation means for collecting information based on noun selection, generating a summary using machine learning, and adjusting and providing the information based on sentiment analysis results.

[Explanation of Symbols]

[0954]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A speech recognition means that acquires voice input in real time and converts that voice into text data,
A part-of-speech parsing means that performs morphological analysis on the text data acquired by the speech recognition means and identifies a specific part of speech,
An information presentation means that retrieves related information and displays a summary when a user selects an identified noun,
A terminology conversion means that converts technical terms and business terms into general vocabulary,
A translation means for performing translation between multiple languages,
An information distribution means for providing translated text information to on-site staff via an information display device,
A situation analysis means for identifying abnormal events by linking visual information and audio reports acquired from monitoring devices,
A system including the above means.

2. The system according to claim 1, further comprising a translation means for translating text into a language specified by the user and displaying it using an information display device.

3. The system according to claim 1, further comprising information presentation means for collecting information from a knowledge base based on noun selection, generating a summary using a generative AI model, and outputting a prompt sentence.