system

The system automates voice message transcription, translation, and schedule management, addressing inefficiencies in conventional systems by providing convenient and efficient call handling and schedule adjustment across languages.

JP2026100600A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Conventional answering machine systems require users to check voice messages manually, which is time-consuming. Messages in other languages need separate translation, and schedules are managed inefficiently when users are absent.

Method used

A system that automatically answers phone calls, transcribes voice messages to text, performs translation, and adjusts schedules using AI, reducing manual effort and improving convenience.

Benefits of technology

The system significantly reduces the time and effort spent managing phone calls and schedules, enabling efficient communication across languages without manual intervention.

✦ Generated by Eureka AI based on patent content.

Figure 2026100600000001_ABST

Abstract

[Problem] To provide a system. [Solution] A system including: means for receiving voice data; analysis means for converting the received voice data into text data; means for transmitting the converted text data to a user; analysis means for extracting schedule information from the text data; and means for coordinating appointments based on the extracted schedule information.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Conventional answering machine systems require the user to check voice messages manually, and listening to the recordings again takes time and effort. In addition, when the content of a message needs to be conveyed in another language, separate translation work is required. For messages about schedule adjustment, the user must register the appointment manually in a calendar application or the like, so the telephone response cannot be called efficient. A means of solving these problems and realizing a more efficient and convenient telephone response has therefore been in demand.

Means for Solving the Problems

[0005] The present invention provides a system that automatically answers phone calls even when the user is absent, efficiently transcribes voice messages into text, and performs translation and schedule adjustments as needed. This system includes communication means for receiving voice data, analysis means for converting the received voice data into text data, transmission means for notifying the user of the converted text data, analysis means for extracting schedule information from the text data, and coordination means for scheduling appointments based on the extracted schedule information. This significantly reduces the time and effort associated with answering phone calls and improves convenience.

[0006] "Voice data" refers to data that represents voice information obtained through telephone calls, recordings, etc., in digital or analog format.

[0007] "Communication means" refers to equipment or protocols for sending and receiving voice data, and includes devices and systems that have the function of detecting incoming calls and connecting with the caller.

[0008] "Analysis means" refers to software or algorithms that convert audio data into text data by analyzing the audio with speech recognition technology.

[0009] "Text data" refers to string data converted and generated by speech recognition, in a format that users can visually confirm.

[0010] "Transmission means" refers to a device or software that has the function of notifying the user of the analyzed text data and transfers the data to the receiving terminal via a network.

[0011] "Schedule information" refers to information about the date, time, and activity communicated by telephone, and is basic data necessary for schedule management.

[0012] "Integration means" refers to a device or program that has the function of sharing information with external calendar applications or systems based on analyzed schedule information and automatically registering or adjusting schedules.

Brief Description of the Drawings

[0013]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Embodiments for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0017] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0018] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] Embodiments of the present invention will be described in detail from the perspectives of the server, terminal, and user, detailing a specific system in which an AI personal assistant handles telephone calls.

[0035] This system aims to deliver voice messages to the user as text, so that the user does not need to check incoming calls manually. Upon receiving incoming call information from a terminal, the server initiates an AI-powered automated response. During the automated response, the server records the voice and converts the audio data into text data using the analysis means. The resulting text data is then sent from the server to the user's terminal, allowing the user to review it visually right away.

[0036] Furthermore, the server identifies appointment information within the text data and registers the schedule in conjunction with the calendar. This allows users to automatically and quickly manage their schedules based on the content of their calls. If necessary, the server translates the text data to support multilingual communication.

[0037] As a concrete example, consider a situation where a user is in a meeting and unable to answer a call. The terminal notifies the server of the incoming call, and the server automatically responds. If the caller leaves the message "Are you available for a meeting tomorrow at 10 AM?", the audio is immediately transcribed into text and the user is notified. The server then analyzes this message and reflects it in the calendar as an appointment. This allows the user to check and register the meeting appointment without having to listen to the audio message. If the message is in a different language, the server automatically translates it and notifies the user in the appropriate language.

[0038] Thus, the system of the present invention reduces the time and effort associated with answering phone calls, even when the user is away, and enables efficient schedule management.

[0039] The following describes the processing flow.

[0040] Step 1:

[0041] The terminal detects an incoming call and transmits incoming call data, including caller information and the time of the call, to the server.

[0042] Step 2:

[0043] The server activates the AI personal assistant as soon as it receives the incoming call data and answers the call with an automated response message.

[0044] Step 3:

[0045] The server records the audio during the call in real time and temporarily stores the data.

[0046] Step 4:

[0047] The server sends the stored audio data to the speech recognition engine, which then processes the audio into text.

[0048] Step 5:

[0049] The server verifies the converted text data and sends its contents as a push notification to the user's device.

[0050] Step 6:

[0051] The server automatically extracts date and time information from the text data and determines whether it relates to a schedule.

[0052] Step 7:

[0053] If the text data contains schedule information, the server synchronizes with the user's calendar application and enters or updates the necessary information.

[0054] Step 8:

[0055] If the text is in a language different from the user's, the server uses a translation engine to translate it automatically and notifies the user in the most appropriate language.

[0056] Step 9:

[0057] The user receives a notification sent from the server, reviews the text, and modifies or adds calendar information if necessary.

[0058] This process allows users to effectively manage information through automated telephone support.
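The nine steps above can be read as a single server-side pipeline. The following is a minimal sketch of that reading, not the implementation disclosed in the patent: `IncomingCall`, `Notification`, `handle_call`, and the four injected callables are hypothetical names, and the speech recognition, schedule extraction, translation, and calendar components are passed in as plain functions so the control flow can be shown without assuming any particular engine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class IncomingCall:
    caller: str            # caller information sent by the terminal (Step 1)
    received_at: datetime  # time of the call (Step 1)
    audio: bytes           # audio recorded during the automated response (Step 3)


@dataclass
class Notification:
    caller: str
    text: str                        # transcript, translated if needed
    schedule: Optional[dict] = None  # extracted schedule information, if any


def handle_call(
    call: IncomingCall,
    transcribe: Callable[[bytes], str],
    extract_schedule: Callable[[str, datetime], Optional[dict]],
    translate: Callable[[str], str],
    register_event: Callable[[dict], None],
) -> Notification:
    """Run Steps 4 to 8 for one recorded call and return the notification payload."""
    text = transcribe(call.audio)                        # Step 4: speech to text
    schedule = extract_schedule(text, call.received_at)  # Step 6: find schedule info
    if schedule is not None:
        register_event(schedule)                         # Step 7: update the calendar
    # Steps 5 and 8: the (possibly translated) text is what the user is notified of.
    return Notification(call.caller, translate(text), schedule)


# Usage with stand-in components (all values below are placeholders).
calendar: list = []
call = IncomingCall("caller-0001", datetime(2024, 12, 9, 9, 0), b"recorded-audio")
note = handle_call(
    call,
    transcribe=lambda audio: "Are you available for a meeting tomorrow at 10 AM?",
    extract_schedule=lambda text, now: {"title": "meeting"} if "meeting" in text else None,
    translate=lambda text: text,  # same language, so no translation is applied
    register_event=calendar.append,
)
```

Injecting the components keeps the sketch independent of any vendor API; Step 9 (the user reviewing and editing the result) happens on the terminal and is outside this function.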

[0059] (Example 1)

[0060] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0061] Traditional voice response systems were inefficient because checking voice messages and managing schedules were done manually, and they had problems with providing appropriate support when users were absent. Furthermore, communication problems sometimes arose due to difficulties in transferring information between different languages.

[0062] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0063] In this invention, the server includes means for receiving incoming information and starting processing, means for converting the received voice data into text data, and means for notifying the information processing device of the converted text data. This enables users to efficiently manage their schedules and facilitates smooth information transmission between different languages without having to directly check voice messages.

[0064] A "communication means" is a technical element that has the function of receiving incoming information and starting to process it.

[0065] "Analysis means" refers to a technical element for converting received audio data into text data and then extracting schedule information from that text data.

[0066] A "transmission means" is a technical element that has the function of notifying an information processing device of converted text data.

[0067] An "integration means" is a technical element that works in conjunction with other functions to register a schedule based on the schedule information extracted by the analysis means.

[0068] A "translation means" is a technical element that has the function of translating text data between different languages.

[0069] A "response means" is a technical element that has the function of responding automatically when the user of the information processing device is absent.

[0070] This invention uses a voice response system to enable users to easily receive information and manage their schedules. This system primarily functions through communication and data processing between a server, a terminal, and the user.

[0071] When the server receives incoming call information from the terminal, it begins recording the voice message. The recorded voice data is converted into text data using speech recognition technology. A common speech recognition API is used for this process. For example, existing technologies such as Google's Cloud Speech-to-Text API can be used. This conversion process uses asynchronous processing to ensure efficient data processing.
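As an illustration of the asynchronous conversion mentioned above, the sketch below transcribes several recordings concurrently with Python's `asyncio`. It is only a sketch under stated assumptions: `transcribe` is a hypothetical stand-in that returns a placeholder string, not a call to Google's Cloud Speech-to-Text API or any other real service, and `transcribe_all` is likewise an illustrative name.

```python
import asyncio
from typing import List


async def transcribe(audio: bytes) -> str:
    """Hypothetical stand-in for a speech recognition request."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return f"<transcript of {len(audio)} bytes>"


async def transcribe_all(recordings: List[bytes]) -> List[str]:
    # gather() runs the requests concurrently and preserves input order.
    return await asyncio.gather(*(transcribe(audio) for audio in recordings))


results = asyncio.run(transcribe_all([b"abc", b"defgh"]))
```

With this structure the server does not block on one long recording while others are waiting, which is one plausible reading of "asynchronous processing" in the paragraph above.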

[0072] The converted text data is sent to the user's device. The server then uses natural language processing technology to analyze the text and automatically registers the schedule based on the extracted data. This schedule registration is facilitated by integrating with user-friendly platforms such as calendar APIs. For example, by utilizing the Google Calendar API, the schedule can be directly reflected in the user's calendar.

[0073] Furthermore, the server includes a translation function that translates text data into different languages as needed. This allows users to handle messages in other languages. Translation technology could utilize a multilingual translation service such as Microsoft® Translator API.

[0074] For example, if a user is unable to answer a call during a meeting, their device notifies the server of the incoming call, and the server automatically records a voice message. If the recorded voice message is "Is a meeting possible tomorrow at 2pm?", the server transcribes it into text and immediately notifies the user's device. The analyzed schedule information is automatically added to the user's calendar, allowing the user to understand their schedule without having to listen to the audio. Furthermore, even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language, facilitating smooth communication between languages.
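To make the schedule extraction in this example concrete, the following sketch pulls a date and time out of a transcript such as "Is a meeting possible tomorrow at 2pm?". The patent only says that natural language processing is used; the regular expression here is a deliberately narrow substitute chosen for illustration, and `extract_datetime` is a hypothetical name. It only recognizes the form "today/tomorrow at H[:MM] am/pm" and resolves it relative to the time the call was received.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches e.g. "tomorrow at 2pm", "today at 10:30 AM".
_PATTERN = re.compile(
    r"\b(?P<day>today|tomorrow)\s+at\s+"
    r"(?P<hour>\d{1,2})(?::(?P<minute>\d{2}))?\s*(?P<ampm>am|pm)\b",
    re.IGNORECASE,
)


def extract_datetime(text: str, now: datetime) -> Optional[datetime]:
    """Return the referenced date and time, or None if no known phrase is found."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    hour = int(match["hour"]) % 12           # 12 am -> 0, 12 pm -> 12 after the shift
    if match["ampm"].lower() == "pm":
        hour += 12
    minute = int(match["minute"] or 0)
    offset = 1 if match["day"].lower() == "tomorrow" else 0
    day = now.date() + timedelta(days=offset)
    return datetime(day.year, day.month, day.day, hour, minute)
```

A production system would need far broader coverage (other word orders such as "2 PM tomorrow", explicit dates, other languages), which is where a general natural language processing component would replace this pattern.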

[0075] Example of a prompt

[0076] "Please explain the process by which the server converts received voice messages into text and automatically registers them as appointments."

[0077] This system handles incoming calls quickly and efficiently through automated responses and lets users manage their schedules even when they are away.

[0078] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0079] Step 1:

[0080] The terminal receives incoming call information. This information is metadata, including the caller's number and the date and time of the call. The terminal sends this data to the server, which recognizes the incoming call trigger and initiates an automated response.

[0081] Step 2:

[0082] The server plays an automated response message to the caller. Following this, the server records the caller's voice message. The recorded data is saved to the server as an audio file.

[0083] Step 3:

[0084] The server calls a speech recognition API to convert the audio data into text data. This process analyzes the recorded audio (input) and generates the corresponding text (output). In this way, the natural language spoken in the audio is captured as electronic text.

[0085] Step 4:

[0086] The server analyzes the converted text data and extracts the schedule information. In this step, natural language processing techniques are used to identify the date, time, and event details (output) from the text (input).

[0087] Step 5:

[0088] The analyzed schedule information is registered in the user's schedule using the Calendar API. The server sends the schedule information (input) to the Calendar service, where it is reflected as a new event (output). This automatically updates the user's calendar.

[0089] Step 6:

[0090] The server translates text data into different languages as needed. Using a translation API, it generates text (output) converted from the original text (input) into the desired language. This allows users to understand information without language barriers.

[0091] Step 7:

[0092] The server notifies the user's terminal of the converted text data. Ultimately, the user can visually review and understand the text message. In this process, the translated text (input) is displayed on the user's terminal screen (output).
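For Step 5, the schedule information has to be turned into a calendar event before it can be sent to a calendar service. The sketch below builds a dictionary using the `summary`, `description`, `start`, and `end` fields of a Google Calendar event resource, since the Google Calendar API is named earlier as one option. It does not call the API. The function name `build_calendar_event`, the one-hour default duration, and the `Asia/Tokyo` default time zone are assumptions made for illustration and are not stated in the patent.

```python
from datetime import datetime, timedelta


def build_calendar_event(
    title: str,
    start: datetime,
    transcript: str,
    duration_minutes: int = 60,      # assumed default; the message may not state a length
    time_zone: str = "Asia/Tokyo",   # assumed default time zone
) -> dict:
    """Build an event body from extracted schedule information (Step 4 output)."""
    end = start + timedelta(minutes=duration_minutes)
    return {
        "summary": title,
        "description": transcript,  # keep the original message for the user's review
        "start": {"dateTime": start.isoformat(), "timeZone": time_zone},
        "end": {"dateTime": end.isoformat(), "timeZone": time_zone},
    }


event = build_calendar_event(
    "Meeting",
    datetime(2024, 12, 10, 14, 0),
    "Is a meeting possible tomorrow at 2pm?",
)
```

Storing the transcript in the description lets the user confirm, from the calendar entry itself, what the caller actually said before accepting the appointment.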

[0093] (Application Example 1)

[0094] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0095] Traditional telephone answering systems have the potential to miss important incoming calls when users are away, which can reduce the efficiency of schedule management, especially for busy users. Furthermore, language barriers hinder smooth information transfer in communication involving different languages. There is a need to solve these problems and streamline information management within the home.

[0096] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0097] In this invention, the server includes communication means for receiving voice information, analysis means for converting the received voice information into text information, and transmission means for notifying the user of the converted text information. This enables the automatic conversion of important voice messages into text information even when the user is absent, allowing for quick and accurate schedule management.

[0098] "Voice information" refers to sound data, including human voices, acquired through telephones or recording devices.

[0099] "Communication means" refers to infrastructure and equipment for receiving voice information from external sources, and is based on network technologies such as telephone lines and the internet.

[0100] "Analysis means" refers to technologies and devices for processing received voice information and converting it into text information, often using speech recognition technology.

[0101] "Text information" refers to data obtained by converting voice information into text format, and is text that can be displayed visually.

[0102] "Transmission means" refers to the technology and equipment used to deliver the converted text information to the user, and may include communication methods such as email and messaging applications.

[0103] "Schedule information" refers to data about dates and appointments extracted from voice or text information.

[0104] "Integration means" refers to technologies and devices for linking extracted schedule information with schedule management systems and calendar applications.

[0105] "Home-use assistance devices" are devices that provide information management and work support in a home environment, and may include robots and smart home devices.

[0106] This invention provides a system for a home-use assistance device that efficiently manages voice information and automatically adjusts schedules even when the user is absent. The server receives voice information from an external source using communication means. The received voice information is converted into text information by analysis means. The text information is transmitted to the user's terminal by transmission means, and the user can visually confirm it through a messaging application or similar.

[0107] Furthermore, the server extracts schedule information from the text data and registers it in the schedule management system using the integration means. This system consists of home assistance devices equipped with AI chipsets such as NVIDIA Jetson. The Google Cloud Speech-to-Text API is used for analyzing speech data, and the Google Translate API is used for translating text data.

[0108] For example, if a user receives a phone call while out saying, "The meeting has been rescheduled to 2 PM tomorrow," the system automatically transcribes the message into text and updates their schedule. The user can then easily check and manage their schedule upon returning home.

[0109] An example of a prompt message to input into a generative AI model is as follows:

[0110] "Scene: You are away from home. A robot automatically manages phone messages, sends notifications, and schedules appointments. Imagine the interactions that take place."

[0111] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0112] Step 1:

[0113] The server receives voice information from external sources using the communication means. At this time, it receives voice data acquired via telephone lines or the internet as input and retains it in its original format. The server temporarily stores the received voice data in a database or temporary storage in order to pass it to the next processing step.

[0114] Step 2:

[0115] The server uses the analysis means to convert the received voice information into text information. In this process, it takes the voice data as input and calls the Google Cloud Speech-to-Text API to convert the audio to text. It generates text data as output and stores the converted text information. The server then prepares the text data for use in the next step.

[0116] Step 3:

[0117] The server sends the converted text information to the terminal using the transmission means. As input, it retrieves the text data generated in the previous step and delivers it to the user via email or messaging apps. As output, it provides the text information to the terminal. The server notifies the user and keeps a transmission record for later reference and verification.

[0118] Step 4:

[0119] The server extracts schedule information from the text data and registers it in the schedule using the integration means. In this step, the text data is parsed as input, and schedule-related information such as date, time, and activity is extracted. The output is schedule information to be registered in calendar applications and schedule management systems. The server uses an API to programmatically update the schedule data.

[0120] Step 5:

[0121] Users can check notifications on their devices and easily manage their schedules. This step uses text information sent to the device and registered schedule information as input, and outputs a visual calendar display and reminders. Users can edit, delete, or add new schedules through the device interface.
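Steps 4 and 5 imply a schedule store that can both register new entries and move existing ones, as in the earlier "The meeting has been rescheduled to 2 PM tomorrow" example. The following in-memory sketch shows one way such a store could behave. `ScheduleEntry`, `ScheduleStore`, and the choice to identify an entry by its case-insensitive title are all assumptions made for illustration; the patent does not describe how entries are matched.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class ScheduleEntry:
    title: str
    start: datetime


class ScheduleStore:
    """In-memory stand-in for a calendar application or schedule management system."""

    def __init__(self) -> None:
        self._entries: Dict[str, ScheduleEntry] = {}

    @staticmethod
    def _key(title: str) -> str:
        # Assumed matching rule: titles are compared case-insensitively.
        return title.strip().lower()

    def upsert(self, title: str, start: datetime) -> str:
        """Register a new entry or move an existing one; report which happened."""
        key = self._key(title)
        action = "rescheduled" if key in self._entries else "created"
        self._entries[key] = ScheduleEntry(title, start)
        return action

    def get(self, title: str) -> Optional[ScheduleEntry]:
        return self._entries.get(self._key(title))

    def delete(self, title: str) -> bool:
        """Remove an entry; return False if it did not exist."""
        return self._entries.pop(self._key(title), None) is not None


store = ScheduleStore()
first = store.upsert("Meeting", datetime(2024, 12, 10, 10, 0))
second = store.upsert("meeting", datetime(2024, 12, 10, 14, 0))  # rescheduled by phone
```

Returning the action taken gives the notification in Step 5 something concrete to show, so the user can tell a new appointment from a changed one.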

[0122] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0123] This invention provides a system for effectively processing voice data acquired through voice communication, and in particular, efficiently combines an emotion engine that analyzes user emotions from voice. A specific embodiment of the system, involving a server, terminal, and user, is described below.

[0124] The terminal detects an incoming call and transmits this information to the server. The server automatically activates the AI personal assistant and begins recording the voice communication. Simultaneously, the server utilizes an emotion engine to analyze this voice data. The emotion engine analyzes the intonation, speed, tone, etc., of the voice to determine the emotion. The emotion information obtained from this analysis is used in subsequent processing.

[0125] The server converts the recorded audio data into text using a speech recognition engine. This text data is then supplemented with the emotion analysis results obtained from the emotion engine. As a result, the text data goes beyond mere message content and includes information indicating the speaker's emotional state. The server sends the text data and emotion information to the user's terminal, and the user receives this information as a notification.

[0126] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal. I would love to work with you again in the future," the emotion engine extracts positive emotions from this message. The transcribed message is tagged with emotion tags, allowing the user to understand not only its meaning but also the caller's emotional state. This information is useful for users when scheduling or responding to messages.

[0127] In addition, the server automatically extracts schedule information from text data and synchronizes it with the calendar application. By providing a system that allows voice messages to be used efficiently and with awareness of the caller's emotions even when the user is absent, the quality of the call experience can be improved.

[0128] The following describes the processing flow.

[0129] Step 1:

[0130] The terminal detects an incoming call and sends incoming call data, including caller information and the time of the call, to the server.

[0131] Step 2:

[0132] When the server receives the incoming call data, it activates the AI personal assistant and sends an automated message to respond to the call.

[0133] Step 3:

[0134] The server records the call's audio data in real time and sends the recorded audio data to the emotion engine.

[0135] Step 4:

[0136] The emotion engine analyzes audio data to determine the speaker's emotions. It generates information about the intensity and type of emotion (e.g., joy, surprise, anger).

[0137] Step 5:

[0138] The server sends the recorded audio data to the speech recognition engine, which converts the audio into text data.

[0139] Step 6:

[0140] The generated text data is then supplemented with emotional information generated by the emotion engine.

[0141] Step 7:

[0142] The server sends text data containing emotional information to the user's device as a notification. This allows the user to communicate while understanding the sender's emotions.

[0143] Step 8:

[0144] The server analyzes date and time information from the text data and extracts information related to the schedule.

[0145] Step 9:

[0146] Based on the extracted schedule information, the server synchronizes with the user's calendar application. If necessary, it registers new schedules.

[0147] This series of processes allows users to understand the message content, including the sender's emotions, and take appropriate action.
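As a non-limiting illustration, steps 4 to 9 above can be sketched as a single function that receives the engines as parameters. The names `Emotion`, `Notification`, and `handle_call`, the 0.0 to 1.0 intensity scale, and the injected callables are all assumptions made for this sketch and do not appear in the disclosure. The sketch returns the notification instead of sending it, so the ordering of steps 7 and 8 is simplified.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Emotion:
    kind: str         # e.g. "joy", "surprise", "anger"
    intensity: float  # hypothetical scale from 0.0 to 1.0


@dataclass
class Notification:
    caller: str
    text: str
    emotion: Emotion
    schedule: Optional[str]  # extracted schedule text, if any


def handle_call(
    caller: str,
    audio: bytes,
    analyze_emotion: Callable[[bytes], Emotion],
    transcribe: Callable[[bytes], str],
    extract_schedule: Callable[[str], Optional[str]],
    register_schedule: Callable[[str], None],
) -> Notification:
    """Run steps 4 to 9 for one recorded call (steps 1 to 3 are assumed done)."""
    emotion = analyze_emotion(audio)      # step 4: emotion engine
    text = transcribe(audio)              # step 5: speech recognition engine
    schedule = extract_schedule(text)     # step 8: schedule extraction
    if schedule is not None:
        register_schedule(schedule)       # step 9: calendar registration
    # steps 6 and 7: attach the emotion and hand back the notification to send
    return Notification(caller, text, emotion, schedule)
```

Passing the engines as callables keeps the flow independent of any particular speech recognition or emotion analysis product, and allows each step to be replaced by a stub during testing.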

[0148] (Example 2)

[0149] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0150] In modern voice communication, there are problems with efficiently managing voice messages when the recipient is absent, and with capturing not only the content but also the emotional nuances of the message. Furthermore, in situations where there is a need to extract specific schedule information from messages and support multiple languages, the lack of sufficient systems to automate these processes reduces user convenience.

[0151] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0152] In this invention, the server includes means for utilizing an information processing device for receiving voice data, means for utilizing an analysis device for analyzing the received voice data and determining emotional information, and means for utilizing a conversion device for converting the analyzed voice data into text data. This makes it possible to provide users with detailed information, including emotional and schedule information, when receiving voice messages, thereby improving user convenience.

[0153] "Audio data" refers to information that represents audio signals in digital or analog format.

[0154] An "information processing device" is a device that has a hardware or software configuration for collecting, recording, analyzing, and outputting data.

[0155] An "analytical device" is a device that extracts useful information from input data and makes decisions based on a specific purpose.

[0156] A "conversion device" is a device or function used to convert data in one format into another format.

[0157] An "additional device" refers to a function or device used to add additional information to the main data.

[0158] A "transmitting device" is a device used to send data along a designated communication channel to another device or system.

[0159] An "extraction device" is a device or software that has the function of selecting specific information from a large amount of data.

[0160] A "synchronization device" is a device that ensures consistency between multiple data sets or processes by matching timing and data content.

[0161] A "translation device" is a device or software that has the function of converting text written in one language into another language.

[0162] This invention provides a system that highly streamlines the communication and analysis of voice data. At its core, it includes a process that extracts emotional information from voice and manages schedules based on that information.

[0163] The terminal is responsible for receiving voice data and transmits incoming call information to the server. The hardware used here includes acoustic sensors and microphones that enable voice input and output. A dedicated software algorithm detects incoming calls and sends the information to the server in real time.

[0164] The server activates the recording function as soon as it receives the audio data sent from the terminal. The recorded audio data is immediately sent to an analysis device. This analysis device runs an advanced emotion recognition algorithm that analyzes parameters such as intonation, speed, and tone of the voice in order to determine emotional information. Once the analysis is complete, an emotion determination result is generated and managed within the server.
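As a non-limiting illustration of determining emotion information from intonation, speed, and tone, the following toy classifier applies hand-picked thresholds to three numeric features. It is a rule-based stand-in and not the emotion recognition algorithm of the disclosure; the feature names, units, thresholds, and labels are all assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Prosody:
    pitch_variation: float  # intonation: spread of pitch, normalized 0..1 (assumed)
    speech_rate: float      # speed: syllables per second (assumed unit)
    energy: float           # tone proxy: loudness, normalized 0..1 (assumed)


def classify_emotion(p: Prosody) -> Tuple[str, float]:
    """Return (label, intensity) using illustrative, hand-picked thresholds."""
    intensity = round(min(1.0, (p.pitch_variation + p.energy) / 2), 2)
    if p.energy > 0.7 and p.speech_rate > 5.0:
        return "anger", intensity      # loud and fast
    if p.pitch_variation > 0.6:
        return "joy", intensity        # lively intonation
    return "neutral", intensity
```

In practice, a trained model such as the emotion identification model would replace these rules; the sketch only shows the shape of the input and of the emotion determination result.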

[0165] Subsequently, the server uses speech recognition technology to convert the audio data into text data. This process utilizes a natural language processing engine to create the text. Emotion information is also added to this text data, which is then transmitted to the user's device.

[0166] Users can receive notifications that include emotional tags along with the text data. This allows users to understand not only the message content but also the sender's emotional state. Furthermore, the server has the ability to extract schedule information from the text data and automatically synchronize it with the user's calendar application. This enables efficient schedule management using voice messages.

[0167] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal the other day. I would love to work with you on the next project," the server analyzes this message as a positive emotion. This information is then communicated to the user, enabling them to respond quickly and appropriately in their work.

[0168] An example of a prompt might be: "Convert the received voice message into text and extract sentiment information. Example: 'I was very impressed with your proposal the other day. I would love to work with you on the next project.'" This prompt allows the system to quickly and accurately understand the voice data and perform sentiment analysis.

[0169] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0170] Step 1:

[0171] The terminal detects incoming calls by utilizing its ability to receive voice signals. When an incoming call is received, it notifies the server of the corresponding incoming call data. This notification is sent via the terminal's network module. The input is a signal indicating whether or not an incoming call is received, and the output is the incoming call notification sent to the server.

[0172] Step 2:

[0173] When the server receives an incoming call notification from the terminal, it immediately activates the AI personal assistant and records the voice communication. This process utilizes a recording device, and the voice data is stored digitally. The input is the incoming call notification sent from the terminal, and the output is the recorded voice file.

[0174] Step 3:

[0175] The server transmits the recorded audio data to an analysis device for sentiment analysis. The analysis device determines emotional information based on the intonation, speed, and tone of the voice. The input here is the recorded audio data, and the output is the determined emotional information. This quantifies the emotional state of the speaker.

[0176] Step 4:

[0177] The server runs a speech recognition engine to convert audio data into text data. Because the speech recognition engine utilizes a machine learning model, it can convert speech to text with high accuracy. The input is a pre-analyzed audio file, and the output is text data with added emotional information.

[0178] Step 5:

[0179] The server extracts schedule information from the text data. Using natural language processing techniques, it identifies the schedule information and extracts it in a specified format. The input is text data, and the output is the extracted schedule information. This information is used in subsequent processes.

[0180] Step 6:

[0181] The server sends text data and sentiment information to the user's terminal. The terminal receives this information as a notification and displays it through the user interface. The input is text data and sentiment information, and the output is visualized data notified to the user. This step allows the user to quickly grasp the message content and the sender's sentiment.

[0182] Step 7:

[0183] The server uses the extracted schedule information to synchronize with the user's calendar application. The synchronization is automated and adds new appointments to the user's schedule. The input is the schedule information, and the output is the updated calendar data. This allows the user to manage their schedule efficiently.
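As a non-limiting illustration of the calendar synchronization in step 7, the following sketch adds extracted appointments to an in-memory calendar and skips any that are already present, so that processing the same message twice does not create duplicates. The `Appointment` type and the `sync_calendar` function are hypothetical, and a real system would call the calendar application instead of editing a list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Appointment:
    title: str
    start: datetime


def sync_calendar(calendar: List[Appointment],
                  extracted: List[Appointment]) -> List[Appointment]:
    """Return the calendar plus any extracted appointments it did not contain."""
    existing = set(calendar)
    updated = list(calendar)
    for appt in extracted:
        if appt not in existing:  # skip duplicates so re-running the sync is harmless
            updated.append(appt)
            existing.add(appt)
    return updated
```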

[0184] (Application Example 2)

[0185] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0186] In today's communication environment, receiving and managing voice messages is essential. However, there is a need for technology that not only converts voice to text but also recognizes emotional nuances and provides information in a user-friendly format. Furthermore, efficiently processing information while the user is absent, automatically taking appropriate action, and coordinating with individual schedules are key challenges.

[0187] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0188] In this invention, the server includes communication means for receiving voice data, analysis means for converting the received voice data into text data, and emotion analysis means for analyzing emotions from the voice data and adding emotion information. This allows the user to manage their schedule while understanding not only the specific content of the voice message but also the emotional state of the speaker.

[0189] A "communication means" is an interface for receiving voice data and a device that plays the role of incorporating voice information into the system.

[0190] The "analysis means" is a device that processes received audio data and converts it into text data, achieving accurate text conversion using speech recognition technology.

[0191] An "emotion analysis means" is a device that determines the speaker's emotions from audio data by analyzing the intonation and tone of the voice, and provides that emotion information.

[0192] A "transmission means" is a device that notifies the user of converted text data and sentiment information, and is responsible for conveying information to the user.

[0193] A "coordination means" is a device that has the function of reserving or updating schedules based on schedule information extracted from text data, and synchronizes with calendars and other similar systems.

[0194] The system of the present invention mainly consists of a server, a terminal, and a user. The terminal is equipped with a communication interface for receiving voice communication data and transmits the received voice data to the server. When the server receives the voice data, it first uses an analysis means to convert the voice into text data. Next, it uses an emotion analysis means to extract emotion information from the voice data.

[0195] The server analyzes the intonation, speed, and tone of the speech to evaluate the speaker's emotional state. The emotional information obtained by the emotion analysis is added to the converted text data, resulting in the generation of composite data that includes not only text data but also emotional information.

[0196] This composite data is sent to the user's terminal via the transmission means. This allows the user to understand not only the content of the message but also the speaker's emotional state. Furthermore, the server has the functionality to extract schedule information from the text data and, using the coordination means, integrate it with a schedule reservation system or calendar.

[0197] As a concrete example of use, consider a situation where a user is in a meeting and unable to answer a phone call. In this case, the terminal receives a voice message, and the server automatically transcribes the message into text and analyzes its sentiment. If the server recognizes a positive sentiment in a message such as "I was very impressed with your proposal. I would love to work with you again in the future," and notifies the user of this information, the user can also understand the nuances of the message.

[0198] An example of a prompt to set in a generative AI model is, "Analyze the voice message, identify its emotional state, and then transcribe it into text to convey the relevant emotional information to the user." This makes it possible to extract emotions from voice and convert them to text more efficiently.
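As a non-limiting illustration, the prompt above could be combined with a request for a machine-readable reply so that the server can separate the transcript from the emotion information. The JSON reply format, the key names `text` and `emotion`, and the functions `build_prompt` and `parse_reply` are assumptions for this sketch; the disclosure does not specify how the model's output is structured.

```python
import json

INSTRUCTION = (
    "Analyze the voice message, identify its emotional state, and then "
    "transcribe it into text to convey the relevant emotional information to the user."
)


def build_prompt(instruction: str = INSTRUCTION) -> str:
    """Append a hypothetical output-format request to the instruction."""
    return instruction + '\nReply with JSON only, using the keys "text" and "emotion".'


def parse_reply(reply: str) -> dict:
    """Parse the model reply; fall back to untagged text if it is not a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {"text": reply, "emotion": "unknown"}
    if not isinstance(data, dict):
        return {"text": reply, "emotion": "unknown"}
    return {
        "text": str(data.get("text", "")),
        "emotion": str(data.get("emotion", "unknown")),
    }
```

The fallback branch matters because a generative model may ignore the format request; in that case the reply is still delivered as plain text with an unknown emotion rather than being discarded.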

[0199] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0200] Step 1:

[0201] The terminal receives voice data. Specifically, it detects an incoming call via the communication interface and records a voice message. The input to this process is voice data, and the output is the same voice data retained as is.

[0202] Step 2:

[0203] The terminal sends the received audio data to the server. The terminal uploads the recorded audio file to the server via the network. The input to this process is the audio data recorded in step 1, and the output is the audio data sent to the server.

[0204] Step 3:

[0205] The server analyzes the audio data and converts it into text data. The server uses a speech recognition engine to convert the audio data into text. In this process, audio data is used as input and text data is generated as output.

[0206] Step 4:

[0207] The server analyzes the audio data using emotion analysis tools and extracts emotional information. It analyzes the intonation and speed of the speech to estimate the speaker's emotions. The input for this process is the original audio data, and the output is emotional information.

[0208] Step 5:

[0209] The server integrates text data and sentiment information and notifies the user. The server adds sentiment information as a tag to the text data and sends it to the user's terminal. The input to this process is text data and sentiment information, and the output is integrated notification data.

[0210] Step 6:

[0211] The server extracts schedule information from text data and links it to the schedule. Using natural language processing technology, it identifies schedule information within text messages and registers it in the calendar system. The input to this process is text data, and the output is updated schedule information.
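As a non-limiting illustration of step 5, the integrated notification data could be serialized as JSON on the server and rendered on the terminal. The payload layout (`text` plus a `tags` object) and the function names are assumptions for this sketch and are not specified in the disclosure.

```python
import json


def build_notification(text: str, emotion: str) -> str:
    """Server side: serialize the transcript with its emotion tag as JSON."""
    payload = {"text": text, "tags": {"emotion": emotion}}
    # ensure_ascii=False keeps non-English transcripts readable in the payload
    return json.dumps(payload, ensure_ascii=False)


def render_notification(body: str) -> str:
    """Terminal side: turn the JSON body into the line shown to the user."""
    payload = json.loads(body)
    return f"[{payload['tags']['emotion']}] {payload['text']}"
```

Keeping the emotion in a separate `tags` field, instead of embedding it in the text, leaves the transcript untouched for the later schedule extraction in step 6.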

[0212] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0213] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0214] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0215] [Second Embodiment]

[0216] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0217] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0218] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0219] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0220] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0221] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0222] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0223] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0224] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0225] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0226] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0227] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0228] Embodiments of the present invention will be described in detail from the perspectives of the server, terminal, and user, detailing a specific system in which an AI personal assistant handles telephone calls.

[0229] This system aims to receive voice messages as text and notify the user without requiring manual confirmation of incoming calls. Upon receiving incoming call information from a terminal, the server initiates an AI-powered automated response. During the automated response, the server records the voice and converts the audio data into text data using an analysis tool. The analyzed text data is then sent to the user's terminal via the server, allowing the user to instantly review it visually.

[0230] Furthermore, the server identifies appointment information within the text data and registers the schedule in conjunction with the calendar. This allows users to automatically and quickly manage their schedules based on the content of their calls. If necessary, the server translates the text data to support multilingual communication.

[0231] As a concrete example, consider a situation where a user is in a meeting and unable to answer a call. The device notifies the server of the incoming call, and the server automatically responds. If the caller leaves a message asking, "Are you available for a meeting tomorrow at 10 AM?", the audio is immediately transcribed into text and sent to the user as a notification. The server then analyzes this message and reflects it in the calendar as an appointment. This allows the user to check and register the meeting appointment without having to listen to the audio message. Even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language.

[0232] Thus, the system of the present invention reduces the time and effort associated with answering phone calls, even when the user is away, and enables efficient schedule management.

[0233] The following describes the processing flow.

[0234] Step 1:

[0235] The terminal detects an incoming call and transmits incoming call data, including caller information and the time of the call, to the server.

[0236] Step 2:

[0237] As soon as it receives the incoming call data, the server activates the AI personal assistant and answers the call with an automated response message.

[0238] Step 3:

[0239] The server records the audio during the call in real time and temporarily stores the data.

[0240] Step 4:

[0241] The server sends the stored audio data to the speech recognition engine, which then processes the audio into text.

[0242] Step 5:

[0243] The server verifies the converted text data and sends its contents as a push notification to the user's device.

[0244] Step 6:

[0245] The server automatically extracts date and time information from the text data and determines whether it relates to a schedule.

[0246] Step 7:

[0247] If the text contains schedule information, the server synchronizes with the user's calendar application and enters or updates the necessary information.

[0248] Step 8:

[0249] If the text is in a language other than the user's, the server uses a translation engine to translate it automatically and notifies the user in the most appropriate language.

[0250] Step 9:

[0251] The user receives a notification sent from the server, reviews the text, and modifies or adds calendar information if necessary.

[0252] This process allows users to effectively manage information through automated telephone support.
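As a non-limiting illustration of step 6, the following sketch extracts a date and time from a message such as "Are you available for a meeting tomorrow at 10 AM?". It handles only the English pattern "today/tomorrow at H[:MM] AM/PM" with a regular expression; the disclosure does not state which natural language processing technique is used, so the pattern and the function name `extract_appointment` are assumptions.

```python
import re
from datetime import date, datetime, time, timedelta
from typing import Optional

_PATTERN = re.compile(
    r"\b(today|tomorrow)\s+at\s+(\d{1,2})(?::([0-5]\d))?\s*(am|pm)\b",
    re.IGNORECASE,
)


def extract_appointment(text: str, today: date) -> Optional[datetime]:
    """Return the datetime named in the text, or None if no match is found."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    day_word, hour_s, minute_s, meridiem = m.groups()
    hour = int(hour_s) % 12          # maps 12 to 0 before the PM offset is applied
    if meridiem.lower() == "pm":
        hour += 12
    minute = int(minute_s) if minute_s else 0
    offset = 1 if day_word.lower() == "tomorrow" else 0
    return datetime.combine(today + timedelta(days=offset), time(hour, minute))
```

The reference date is passed in as `today` so that relative words like "tomorrow" are resolved against the time of the call and not against the time the code happens to run.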

[0253] (Example 1)

[0254] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0255] Traditional voice response systems were inefficient because checking voice messages and managing schedules were done manually, and they had problems with providing appropriate support when users were absent. Furthermore, communication problems sometimes arose due to difficulties in transferring information between different languages.

[0256] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0257] In this invention, the server includes means for receiving incoming call information and starting processing, means for converting the received voice data into text data, and means for notifying the information processing device of the converted text data. This enables users to efficiently manage their schedules and facilitates smooth information transmission between different languages without having to directly check voice messages.

[0258] A "communication means" is a technical element that has the function of receiving incoming call information and starting to process it.

[0259] "Analysis means" refers to a technical element for converting received audio data into text data and then extracting schedule information from that text data.

[0260] A "transmission means" is a technical element that has the function of notifying an information processing device of converted text data.

[0261] An "integration means" is a technical element that works in conjunction with other functions to register a schedule based on the schedule information extracted by the analysis means.

[0262] A "translation tool" is a technological element that has the function of translating text data between different languages.

[0263] A "response mechanism" is a technical element that has the function of automatically responding when the user of the information processing device is absent.

[0264] This invention uses a voice response system to enable users to easily receive information and manage their schedules. This system primarily functions through communication and data processing between a server, a terminal, and the user.

[0265] When the server receives incoming call information from the terminal, it begins recording the voice message. The recorded voice data is converted into text data using speech recognition technology. A common speech recognition API is used for this process. For example, existing technologies such as the Google Cloud Speech-to-Text API can be used. This conversion process uses asynchronous processing to ensure efficient data processing.
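As a non-limiting illustration of the asynchronous processing mentioned above, the following sketch transcribes several recordings concurrently with Python's `asyncio`. The `transcribe` coroutine is a hypothetical stand-in that only simulates a wait; it does not call the Google Cloud Speech-to-Text API or any other real service.

```python
import asyncio
from typing import List


async def transcribe(audio: bytes) -> str:
    """Hypothetical stand-in for a network call to a speech recognition API."""
    await asyncio.sleep(0.01)  # simulates waiting on the remote service
    return f"<transcript of {len(audio)} bytes>"


async def transcribe_all(recordings: List[bytes]) -> List[str]:
    """Transcribe several recordings concurrently; results keep the input order."""
    return list(await asyncio.gather(*(transcribe(r) for r in recordings)))
```

Because each call spends most of its time waiting on the remote service, running the calls concurrently lets the server keep handling other incoming calls while transcriptions are pending.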

[0266] The converted text data is sent to the user's device. The server then uses natural language processing technology to analyze the text and automatically registers the schedule based on the extracted data. This schedule registration is facilitated by integrating with user-friendly platforms such as calendar APIs. For example, by utilizing the Google Calendar API, the schedule can be directly reflected in the user's calendar.
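As a non-limiting illustration, the extracted schedule could be converted into an event body before being passed to a calendar API. The `summary`, `start`, and `end` fields with `dateTime` and `timeZone` are modeled on the event resource of the Google Calendar API; the 60-minute default duration, the `Asia/Tokyo` time zone, and the function name `to_calendar_event` are assumptions for this sketch. The sketch only builds the body and does not send any request.

```python
from datetime import datetime, timedelta


def to_calendar_event(summary: str, start: datetime, minutes: int = 60,
                      time_zone: str = "Asia/Tokyo") -> dict:
    """Build an event body with summary, start, and end fields."""
    end = start + timedelta(minutes=minutes)  # assumed duration; messages rarely state one
    return {
        "summary": summary,
        "start": {"dateTime": start.isoformat(), "timeZone": time_zone},
        "end": {"dateTime": end.isoformat(), "timeZone": time_zone},
    }
```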

[0267] Furthermore, the server includes a translation function that translates text data into different languages as needed. This allows users to handle messages in other languages. Translation technology could utilize a multilingual translation service such as the Microsoft Translator API.
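As a non-limiting illustration of translating "as needed", the following sketch calls a translator only when the message language differs from the user's language. The `localize` function and the `translate` callable are hypothetical; the callable stands in for whichever translation service is used, and how the source language is detected is outside the scope of the sketch.

```python
from typing import Callable


def localize(text: str, source_lang: str, user_lang: str,
             translate: Callable[[str, str, str], str]) -> str:
    """Translate only when the message language differs from the user's."""
    if source_lang == user_lang:
        return text  # same language: no translation call is made
    return translate(text, source_lang, user_lang)
```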

[0268] For example, if a user is unable to answer a call during a meeting, their device notifies the server of the incoming call, and the server automatically records a voice message. If the recorded voice message is "Is a meeting possible tomorrow at 2pm?", the server transcribes it into text and immediately notifies the user's device. The analyzed schedule information is automatically added to the user's calendar, allowing the user to understand their schedule without having to listen to the audio. Furthermore, even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language, facilitating smooth communication between languages.

[0269] Example of a prompt

[0270] "Please explain the process by which the server converts received voice messages into text and automatically registers them as appointments."

[0271] This system allows users to respond quickly and efficiently to automated inquiries and manage their schedules even when they are away.

[0272] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0273] Step 1:

[0274] The terminal receives incoming call information. This information is metadata, including the caller's number and the date and time of the call. The terminal sends this data to the server, which recognizes the incoming call trigger and initiates an automated response.

[0275] Step 2:

[0276] The server plays an automated response message to the caller. Following this, the server records the caller's voice message. The recorded data is saved to the server as an audio file.

[0277] Step 3:

[0278] The server calls the speech recognition API to convert the speech data into text data. In this process, the recorded speech (input) is analyzed to generate the corresponding text (output). In this way, the natural language in the speech is captured as electronic text.

[0279] Step 4:

[0280] The server analyzes the converted text data and extracts schedule information. In this step, natural language processing technology is used to identify the date, time, and event details (output) from the text (input).

[0281] Step 5:

[0282] The analyzed schedule information is registered in the user's schedule using the calendar API. The server sends the schedule information (input) to the calendar service and it is reflected as a new schedule (output). Thereby, the user's calendar is automatically updated.

[0283] Step 6:

[0284] The server translates the text data into different languages as needed. Using the translation API, the original text (input) is converted into text (output) in the desired language. Thereby, the user can understand the information without language barriers.

[0285] Step 7:

[0286] The server notifies the user's terminal of the converted text data. Finally, the user can visually check the text message and understand the content. In this process, the translated text (input) is displayed on the screen of the user terminal (output).

[0287] (Application Example 1)

[0288] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0289] Traditional telephone answering systems have the potential to miss important incoming calls when users are away, which can reduce the efficiency of schedule management, especially for busy users. Furthermore, language barriers hinder smooth information transfer in communication involving different languages. There is a need to solve these problems and streamline information management within the home.

[0290] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0291] In this invention, the server includes communication means for receiving voice information, analysis means for converting the received voice information into text information, and transmission means for notifying the user of the converted text information. This enables the automatic conversion of important voice messages into text information even when the user is absent, allowing for quick and accurate schedule management.

[0292] "Voice information" refers to sound data, including human voices, acquired through telephones or recording devices.

[0293] "Communication means" refers to infrastructure and equipment for receiving voice information from external sources, and is based on network technologies such as telephone lines and the Internet.

[0294] "Analysis means" refers to technologies and devices for processing received audio information and converting it into text information, often using speech recognition technology.

[0295] "Text information" refers to data obtained by converting voice information into text format, and is text that can be displayed visually.

[0296] "Transmission means" refers to the technology and equipment used to deliver the converted text information to the user, and may include communication methods such as email and messaging applications.

[0297] "Schedule information" refers to data about dates and appointments extracted from audio or text information.

[0298] "Linkage means" refers to technologies and devices for linking extracted schedule information with schedule management systems and calendar applications.

[0299] "Home-use assistance devices" are devices that provide information management and work support in a home environment, and may include robots and smart home devices.

[0300] This invention provides a system for a home-use assistance device that efficiently manages voice information and automatically adjusts schedules even when the user is absent. The server receives voice information from an external source using communication means. The received voice information is converted into text information by analysis means. The text information is transmitted to the user's terminal by transmission means, and the user can visually confirm it through a messaging application or similar.

[0301] Furthermore, the server extracts schedule information from the text information and registers it in the schedule management system using linkage means. This system consists of home-use assistance devices equipped with AI chipsets such as NVIDIA Jetson. The Google Cloud Speech-to-Text API is used for converting voice data into text, and the Google Translate API is used for translating text data.

[0302] For example, if a user receives a phone call while out saying, "The meeting has been rescheduled to 2 PM tomorrow," the system automatically transcribes the message into text and updates their schedule. The user can then easily check and manage their schedule upon returning home.
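As one possible illustration of how a message such as "The meeting has been rescheduled to 2 PM tomorrow" could be turned into a concrete date and time, the following sketch resolves "tomorrow" against the time at which the message was received. The function name `extract_start` and the regular expression are assumptions made for this sketch; this disclosure does not specify the extraction method, and a practical system would need to handle far more expressions than this single pattern.

```python
import re
from datetime import datetime, time, timedelta
from typing import Optional

# Hypothetical pattern: a 12-hour clock time such as "2 PM" or "10:30 am".
_CLOCK = re.compile(r"\b(\d{1,2})(?::([0-5]\d))?\s*(AM|PM)\b", re.IGNORECASE)


def extract_start(message: str, received_at: datetime) -> Optional[datetime]:
    """Resolve a 'tomorrow' plus clock-time expression relative to receipt time."""
    if "tomorrow" not in message.lower():
        return None
    match = _CLOCK.search(message)
    if match is None:
        return None
    hour = int(match.group(1)) % 12        # 12 AM -> 0; 12 PM -> 0, then +12 below
    if match.group(3).upper() == "PM":
        hour += 12
    minute = int(match.group(2) or 0)
    next_day = (received_at + timedelta(days=1)).date()
    return datetime.combine(next_day, time(hour, minute))
```

For a message received on 9 December 2024, the example sentence above resolves to 14:00 on 10 December 2024, which could then be passed to the schedule management system.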

[0303] An example of a prompt message to input into a generative AI model is as follows:

[0304] "Scenario: You are away from home. A robot automatically manages, notifies, and schedules phone messages. Please imagine what kind of conversations took place."

[0305] The flow of the specific process in Application Example 1 will be described using FIG. 12.

[0306] Step 1:

[0307] The server receives voice information from the outside using communication means. At this time, it receives voice data acquired through a telephone line or the Internet as input and retains it in the original format. The server temporarily stores the received voice data in a database or temporary storage in order to send it to the next processing step.

[0308] Step 2:

[0309] The server converts the received voice information into text information using analysis means. In this process, it uses voice data as input, calls the Google Cloud Speech-to-Text API to convert the voice into text, generates text data as output, and stores the converted text information. The server prepares the text data for use in the next step.

[0310] Step 3:

[0311] The server transmits the converted text information to the terminal using transmission means. As input, it acquires the text data generated in the previous step and delivers it to the user through email or a messaging app. As output, it provides the text information received by the terminal. The server retains a transmission record so that the user can refer to and confirm the message later.

[0312] Step 4:

[0313] The server extracts schedule information from the text data and registers it in the schedule using linkage means. In this step, text data is parsed as input, and schedule-related information such as date, time, and activity is extracted. The output is schedule information to be registered in calendar applications and schedule management systems. The server uses an API to programmatically update the schedule data.

[0314] Step 5:

[0315] Users can check notifications on their devices and easily manage their schedules. This step uses text information sent to the device and registered schedule information as input, and outputs a visual calendar display and reminders. Users can edit, delete, or add new schedules through the device interface.
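The add, edit, and delete operations mentioned in Step 5 may be pictured, purely as an illustration, with a small in-memory store. The class `ScheduleStore`, the `Entry` type, and their methods are hypothetical and stand in for whatever schedule management system or calendar application is actually used; persistence, reminders, and access control are omitted.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from itertools import count
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    entry_id: int
    title: str
    start: datetime


class ScheduleStore:
    """Hypothetical in-memory stand-in for the schedule management system."""

    def __init__(self) -> None:
        self._entries: Dict[int, Entry] = {}
        self._ids = count(1)

    def add(self, title: str, start: datetime) -> Entry:
        entry = Entry(next(self._ids), title, start)
        self._entries[entry.entry_id] = entry
        return entry

    def edit(self, entry_id: int, **changes) -> Entry:
        # Entries are immutable, so an edit stores a modified copy.
        updated = replace(self._entries[entry_id], **changes)
        self._entries[entry_id] = updated
        return updated

    def delete(self, entry_id: int) -> None:
        del self._entries[entry_id]

    def upcoming(self, now: datetime) -> List[Entry]:
        """Entries starting at or after `now`, earliest first (a calendar view)."""
        return sorted(
            (e for e in self._entries.values() if e.start >= now),
            key=lambda e: e.start,
        )
```

In this sketch, the server would call `add` when it registers extracted schedule information, while `edit` and `delete` correspond to changes the user makes through the terminal interface.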

[0316] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0317] This invention provides a system that effectively processes voice data acquired through voice communication and, in particular, incorporates an emotion engine that analyzes emotions from the voice. A specific embodiment of the system, involving a server, a terminal, and a user, is described below.

[0318] The device detects an incoming call and transmits this information to the server. The server automatically activates the AI personal assistant and begins recording the voice communication. Simultaneously, the server utilizes an emotion engine to analyze this voice data. The emotion engine analyzes the intonation, speed, tone, etc., of the voice to determine the emotion. The emotion information obtained from this analysis is used in subsequent processing.

[0319] The server converts the recorded audio data into text using a speech recognition engine. This text data is then supplemented with the emotion analysis results obtained from the emotion engine. As a result, the text data goes beyond mere message content and includes information indicating the speaker's emotional state. The server sends the text data and emotion information to the user's device, and the user receives this information as a notification.

[0320] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal. I would love to work with you again in the future," the emotion engine extracts positive emotions from this message. The transcribed message is tagged with emotion tags, allowing the user to understand not only its meaning but also the caller's emotional state. This information is useful for users when scheduling or responding to messages.

[0321] In addition, the server automatically extracts schedule information from the text data and synchronizes it with the calendar application. By providing a system that makes efficient use of voice messages, including their emotional content, even when the user is absent, the quality of the call experience can be improved.

[0322] The following describes the processing flow.

[0323] Step 1:

[0324] The terminal detects an incoming call and sends incoming call data, including caller information and the time of the call, to the server.

[0325] Step 2:

[0326] When the server receives the incoming call data, it activates the AI personal assistant and answers the call with an automated message.

[0327] Step 3:

[0328] The server records the call's audio data in real time and sends the recorded audio data to the emotion engine.

[0329] Step 4:

[0330] The emotion engine analyzes audio data to determine the speaker's emotions. It generates information about the intensity and type of emotion (e.g., joy, surprise, anger).

[0331] Step 5:

[0332] The server sends the recorded audio data to the speech recognition engine, which converts the audio into text data.

[0333] Step 6:

[0334] The generated text data is then supplemented with emotional information generated by the emotion engine.

[0335] Step 7:

[0336] The server sends text data containing emotional information to the user's device as a notification. This allows the user to communicate while understanding the sender's emotions.

[0337] Step 8:

[0338] The server analyzes date and time information from the text data and extracts information related to the schedule.

[0339] Step 9:

[0340] Based on the extracted schedule information, the server synchronizes with the user's calendar application. If necessary, it registers new schedules.

[0341] This series of processes allows users to understand the message content, including the sender's emotions, and take appropriate action.
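As a hedged illustration of Steps 4, 6, and 7, the text data supplemented with emotion information may be represented as a small record that is rendered into a notification string. The type names `Emotion` and `TaggedMessage`, the 0.0 to 1.0 intensity scale, and the bracketed tag format are all assumptions introduced for this sketch; this disclosure states only that the type and intensity of the emotion are generated and added to the text.

```python
from dataclasses import dataclass


@dataclass
class Emotion:
    kind: str         # e.g. "joy", "surprise", "anger" (the types named above)
    intensity: float  # hypothetical scale: 0.0 (weak) to 1.0 (strong)


@dataclass
class TaggedMessage:
    caller: str
    text: str
    emotion: Emotion

    def to_notification(self) -> str:
        """Render the text with an emotion tag for display on the terminal."""
        percent = round(self.emotion.intensity * 100)
        return f"[{self.emotion.kind} {percent}%] {self.caller}: {self.text}"
```

Keeping the emotion information in a separate field, rather than embedding it in the message text, leaves the original transcription intact for the schedule extraction of Step 8.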

[0342] (Example 2)

[0343] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0344] In modern voice communication, there are problems with efficiently managing voice messages when the recipient is absent, and with capturing not only the content but also the emotional nuances of the message. Furthermore, in situations where there is a need to extract specific schedule information from messages and support multiple languages, the lack of sufficient systems to automate these processes reduces user convenience.

[0345] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0346] In this invention, the server includes means for utilizing an information processing device for receiving voice data, means for utilizing an analysis device for analyzing the received voice data and determining emotional information, and means for utilizing a conversion device for converting the analyzed voice data into text data. This makes it possible to provide users with detailed information, including emotional and schedule information, when receiving voice messages, thereby improving user convenience.

[0347] "Audio data" refers to information that represents audio signals in digital or analog format.

[0348] An "information processing device" is a device that has a hardware or software configuration for collecting, recording, analyzing, and outputting data.

[0349] An "analytical device" is a device that extracts useful information from input data and makes decisions based on a specific purpose.

[0350] A "conversion device" is a device or function used to convert data in one format into another format.

[0351] An "additional device" refers to a function or device used to add additional information to the main data.

[0352] A "transmitting device" is a device used to send data along a designated communication channel to another device or system.

[0353] An "extraction device" is a device or software that has the function of selecting specific information from a large amount of data.

[0354] A "synchronization device" is a device that ensures consistency between multiple data sets or processes by matching timing and data content.

[0355] A "translation device" is a device or software that has the function of converting text written in one language into another language.

[0356] This invention provides a system that highly streamlines the communication and analysis of voice data. At its core, it includes a process that extracts emotional information from voice and manages schedules based on that information.

[0357] The terminal is responsible for receiving voice data and transmits incoming call information to the server. The hardware used here includes acoustic sensors and microphones that enable voice input and output. A dedicated software algorithm detects incoming calls and sends the information to the server in real time.

[0358] The server activates the recording function as soon as it receives the audio data sent from the terminal. The recorded audio data is immediately sent to an analysis device. This analysis device runs an advanced emotion recognition algorithm that analyzes parameters such as intonation, speed, and tone of the voice in order to determine emotional information. Once the analysis is complete, an emotion determination result is generated and managed within the server.

[0359] Subsequently, the server uses speech recognition technology to convert the audio data into text data. This process utilizes a natural language processing engine to create the text. The emotion information is also added to this text data, which is then transmitted to the user's device.

[0360] Users can receive notifications that include emotional tags along with the text data. This allows users to understand not only the message content but also the sender's emotional state. Furthermore, the server has the ability to extract schedule information from the text data and automatically synchronize it with the user's calendar application. This enables efficient schedule management using voice messages.

[0361] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal the other day. I would love to work with you on the next project," the server analyzes this message as a positive emotion. This information is then communicated to the user, enabling them to respond quickly and appropriately in their work.

[0362] An example of a prompt might be: "Translate the received voice message into text and extract sentiment information. Example: 'I was very impressed with your proposal the other day. I would love to work with you on the next project.'" This prompt allows the system to quickly and accurately understand the voice data and perform sentiment analysis.

[0363] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0364] Step 1:

[0365] The terminal detects incoming calls by utilizing its ability to receive voice signals. When an incoming call is received, it notifies the server of the corresponding incoming call data. This notification is sent via the terminal's network module. The input is a signal indicating whether or not an incoming call is received, and the output is the incoming call notification sent to the server.

[0366] Step 2:

[0367] When the server receives an incoming call notification from the terminal, it immediately activates the AI personal assistant and records the voice communication. This process utilizes a recording device, and the voice data is stored digitally. The input is the incoming call notification sent from the terminal, and the output is the recorded voice file.

[0368] Step 3:

[0369] The server transmits the recorded audio data to an analysis device for sentiment analysis. The analysis device determines emotional information based on the intonation, speed, and tone of the voice. The input here is the recorded audio data, and the output is the determined emotional information. This quantifies the emotional state of the speaker.

[0370] Step 4:

[0371] The server runs a speech recognition engine to convert the audio data into text data. Because the speech recognition engine utilizes a machine learning model, it can convert speech to text with high accuracy. The input is the recorded audio file analyzed in Step 3, and the output is text data to which the emotional information has been added.

[0372] Step 5:

[0373] The server extracts schedule information from the text data. Using natural language processing techniques, it identifies the schedule information and extracts it in a specified format. The input is text data, and the output is the extracted schedule information. This information is used in subsequent processes.

[0374] Step 6:

[0375] The server sends text data and sentiment information to the user's terminal. The terminal receives this information as a notification and displays it through the user interface. The input is text data and sentiment information, and the output is visualized data notified to the user. This step allows the user to quickly grasp the message content and the sender's sentiment.

[0376] Step 7:

[0377] The server uses the extracted schedule information to synchronize with the user's calendar application. The synchronization is automated and adds new appointments to the user's schedule. The input is the schedule information, and the output is the updated calendar data. This allows the user to manage schedules efficiently.
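One way to picture the synchronization of Step 7 is as an operation that adds only those appointments not already present, so that processing the same message twice does not create duplicates. This duplicate check is a design assumption of the sketch and is not stated in this disclosure; the function name `sync`, the dictionary-based calendar, and the `start` and `title` keys are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical identity of an appointment: (ISO 8601 start time, title).
AppointmentKey = Tuple[str, str]


def sync(calendar: Dict[AppointmentKey, dict], extracted: List[dict]) -> List[dict]:
    """Add appointments that are not yet in the calendar; return the new ones."""
    added = []
    for item in extracted:
        key = (item["start"], item["title"])
        if key not in calendar:  # skip appointments that were already synchronized
            calendar[key] = item
            added.append(item)
    return added
```

Returning the list of newly added appointments lets the caller notify the user only about actual changes to the calendar.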

[0378] (Application Example 2)

[0379] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0380] In today's communication environment, receiving and managing voice messages is essential. However, there is a need for technology that not only converts voice to text but also recognizes emotional nuances and provides information in a user-friendly format. Furthermore, efficiently processing information while the user is absent, automatically taking appropriate action, and coordinating with individual schedules are key challenges.

[0381] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0382] In this invention, the server includes communication means for receiving voice data, analysis means for converting the received voice data into text data, and emotion analysis means for analyzing emotions from the voice data and adding emotion information. This allows the user to manage their schedule while understanding not only the specific content of the voice message but also the emotional state of the speaker.

[0383] A "communication means" is an interface for receiving voice data and a device that plays the role of incorporating voice information into the system.

[0384] The "analysis means" is a device that processes received audio data and converts it into text data, achieving accurate text conversion using speech recognition technology.

[0385] An "emotion analysis means" is a device that determines the speaker's emotions from audio data by analyzing the intonation and tone of the voice, and provides the resulting emotion information.

[0386] A "transmission means" is a device that notifies the user of converted text data and sentiment information, and is responsible for conveying information to the user.

[0387] A "coordination device" is a device that has the function of reserving or updating schedules based on schedule information extracted from text data, and synchronizes with calendars and other similar systems.

[0388] The system of the present invention mainly consists of a server, a terminal, and a user. The terminal is equipped with a communication interface for receiving voice communication data and transmits the received voice data to the server. When the server receives the voice data, it first uses an analysis means to convert the voice into text data. Next, it uses an emotion analysis means to extract emotion information from the voice data.

[0389] The server analyzes the intonation, speed, and tone of the speech to evaluate the speaker's emotional state. The emotional information obtained by the emotion analysis is added to the converted text data, resulting in the generation of composite data that includes not only text data but also emotional information.

[0390] This composite data is sent to the user's terminal via the transmission means. This allows the user to understand not only the content of the message but also the speaker's emotional state. Furthermore, the server has the functionality to extract schedule information from the text data using the coordination device and integrate it with a schedule reservation system or calendar.

[0391] As a concrete example of use, consider a situation where a user is in a meeting and unable to answer a phone call. In this case, the terminal receives a voice message, and the server automatically transcribes the message into text and analyzes its sentiment. If the server recognizes a positive sentiment in a message such as "I was very impressed with your proposal. I would love to work with you again in the future," and notifies the user of this information, the user can also understand the nuances of the message.

[0392] An example of a prompt to set in a generative AI model is, "Analyze the voice message, identify its emotional state, and then transcribe it into text to convey the relevant emotional information to the user." This makes it possible to extract emotions from voice and convert them to text more efficiently.

[0393] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0394] Step 1:

[0395] The terminal receives voice data. Specifically, it detects an incoming call via the communication interface and records a voice message. The input to this process is voice data, and the output is the same voice data retained as is.

[0396] Step 2:

[0397] The terminal sends the received audio data to the server. The terminal uploads the recorded audio file to the server via the network. The input to this process is the audio data recorded in step 1, and the output is the audio data sent to the server.

[0398] Step 3:

[0399] The server analyzes the audio data and converts it into text data. The server uses a speech recognition engine to convert the audio data into text. In this process, audio data is used as input and text data is generated as output.

[0400] Step 4:

[0401] The server analyzes the audio data using the emotion analysis means and extracts emotional information. It analyzes the intonation and speed of the speech to estimate the speaker's emotions. The input for this process is the original audio data, and the output is emotional information.

[0402] Step 5:

[0403] The server integrates text data and sentiment information and notifies the user. The server adds sentiment information as a tag to the text data and sends it to the user's terminal. The input to this process is text data and sentiment information, and the output is integrated notification data.

[0404] Step 6:

[0405] The server extracts schedule information from text data and links it to the schedule. Using natural language processing technology, it identifies schedule information within text messages and registers it in the calendar system. The input to this process is text data, and the output is updated schedule information.
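Because Steps 3 and 4 both take the original audio data as input and do not depend on each other's output, they could, as one possible arrangement, be executed concurrently before the integration of Step 5. The following sketch shows this under that assumption; this disclosure does not state whether the two steps run sequentially or in parallel, and `analyze_message`, `transcribe`, and `estimate_emotion` are hypothetical names for the speech recognition engine and the emotion analysis.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def analyze_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    estimate_emotion: Callable[[bytes], str],
) -> Dict[str, str]:
    """Run Steps 3 and 4 concurrently, then merge the results as in Step 5."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(transcribe, audio)            # Step 3
        emotion_future = pool.submit(estimate_emotion, audio)   # Step 4
        # Step 5: integrate text data and emotion information into one record.
        return {"text": text_future.result(), "emotion": emotion_future.result()}
```

Threads are a reasonable fit here if both stages mainly wait on remote services; if either stage is computation-heavy and runs locally, a process pool or sequential execution may be more appropriate.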

[0406] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0407] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
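The pairing of a prompt with inference data described above may be illustrated by a simple request structure. This is only a sketch: the type `InferenceRequest`, its fields, and the `ALLOWED_KINDS` set are hypothetical and do not correspond to the interface of ChatGPT, Gemini, or any other actual generative AI service.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The inference data types named above; the string labels are assumptions.
ALLOWED_KINDS = {"audio", "text", "image"}


@dataclass
class InferenceRequest:
    prompt: str  # instructions given to the data generation model
    inputs: List[Tuple[str, bytes]] = field(default_factory=list)

    def add(self, kind: str, payload: bytes) -> "InferenceRequest":
        """Attach one piece of inference data, rejecting unknown kinds."""
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported inference data kind: {kind}")
        self.inputs.append((kind, payload))
        return self
```

A request built this way would then be serialized in whatever format the chosen model's interface requires, which is outside the scope of this sketch.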

[0408] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0409] [Third Embodiment]

[0410] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0411] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0412] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0413] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0414] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0415] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0416] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0417] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0418] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0419] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0420] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0421] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0422] An embodiment of the present invention, a specific system in which an AI personal assistant handles telephone calls, will be described in detail from the perspectives of the server, the terminal, and the user.

[0423] This system aims to deliver voice messages to the user as text, without requiring the user to check incoming calls manually. Upon receiving incoming call information from a terminal, the server initiates an AI-powered automated response. During the automated response, the server records the voice and converts the audio data into text data using an analysis means. The converted text data is then sent from the server to the user's terminal, allowing the user to review it visually right away.

[0424] Furthermore, the server identifies appointment information within the text data and registers the schedule in conjunction with the calendar. This allows users to automatically and quickly manage their schedules based on the content of their calls. If necessary, the server translates the text data to support multilingual communication.

[0425] As a concrete example, consider a situation where a user is in a meeting and unable to answer a call. The device notifies the server of the incoming call, and the server automatically responds. If the caller leaves a message asking, "Are you available for a meeting tomorrow at 10 AM?", the audio is immediately transcribed into text and the user is notified. The server then analyzes this message and reflects it in the calendar as an appointment. This allows the user to check and register the meeting appointment without having to listen to the audio message. Even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language.

[0426] Thus, the system of the present invention reduces the time and effort associated with answering phone calls, even when the user is away, and enables efficient schedule management.

[0427] The following describes the processing flow.

[0428] Step 1:

[0429] The terminal detects an incoming call and transmits incoming call data, including caller information and the time of the call, to the server.

[0430] Step 2:

[0431] The server activates the AI personal assistant as soon as it receives the incoming call data and answers the call with an automated response message.

[0432] Step 3:

[0433] The server records the audio during the call in real time and temporarily stores the data.

[0434] Step 4:

[0435] The server sends the stored audio data to the speech recognition engine, which then processes the audio into text.

[0436] Step 5:

[0437] The server verifies the converted text data and sends its contents as a push notification to the user's device.

[0438] Step 6:

[0439] The server automatically extracts date and time information from the text data and determines whether it relates to a schedule.

[0440] Step 7:

[0441] If the text data contains schedule information, the server synchronizes with the user's calendar application and enters or updates the necessary information.

[0442] Step 8:

[0443] If the text is in a language other than the user's, the server uses a translation engine to translate it automatically and notifies the user in the most appropriate language.

[0444] Step 9:

[0445] The user receives a notification sent from the server, reviews the text, and modifies or adds calendar information if necessary.

[0446] This process allows users to effectively manage information through automated telephone support.
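The flow of Steps 4 to 8 above can be sketched as a single server-side function. This is a minimal illustration, not the implementation of the disclosure: the function name `handle_incoming_call`, the `CallResult` type, and all five injected callables are hypothetical stand-ins for the speech recognition engine, translation engine, schedule extraction, calendar registration, and push notification. The sketch also assumes that translation happens before the notification is sent so that the user receives a single message, whereas the steps above list translation later.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallResult:
    transcript: str          # text as spoken by the caller
    notified_text: str       # text actually shown to the user
    event: Optional[dict]    # schedule information, if any was found


def handle_incoming_call(
    audio: bytes,
    user_language: str,
    transcribe: Callable[[bytes], tuple[str, str]],      # -> (text, language)
    translate: Callable[[str, str], str],                # (text, target) -> text
    extract_schedule: Callable[[str], Optional[dict]],
    register_event: Callable[[dict], None],
    notify: Callable[[str], None],
) -> CallResult:
    # Step 4: convert the recorded audio into text.
    text, language = transcribe(audio)
    # Step 8 (moved earlier by assumption): translate only when needed.
    shown = text if language == user_language else translate(text, user_language)
    # Step 5: push the text to the user's device.
    notify(shown)
    # Steps 6 and 7: extract schedule information and register it.
    event = extract_schedule(text)
    if event is not None:
        register_event(event)
    return CallResult(text, shown, event)
```

Passing the engines in as callables keeps the sketch independent of any particular speech, translation, or calendar service, and allows each one to be replaced by a stub when testing.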

[0447] (Example 1)

[0448] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0449] Traditional voice response systems were inefficient because checking voice messages and managing schedules were done manually, and they had problems with providing appropriate support when users were absent. Furthermore, communication problems sometimes arose due to difficulties in transferring information between different languages.

[0450] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0451] In this invention, the server includes communication means for receiving incoming call information and starting processing, analysis means for converting the received voice data into text data, and transmission means for notifying the information processing device of the converted text data. This enables users to manage their schedules efficiently without listening to voice messages directly, and facilitates smooth information transfer between different languages.

[0452] A "communication means" is a technical element that has the function of receiving incoming call information and starting to process it.

[0453] "Analysis means" refers to a technical element for converting received audio data into text data and then extracting schedule information from that text data.

[0454] A "transmission means" is a technical element that has the function of notifying an information processing device of converted text data.

[0455] An "integration means" is a technical element that works in conjunction with other functions to register a schedule based on the schedule information extracted by the analysis means.

[0456] A "translation means" is a technical element that has the function of translating text data between different languages.

[0457] A "response means" is a technical element that has the function of automatically responding to a call when the user of the information processing device is absent.

[0458] This invention uses a voice response system to enable users to easily receive information and manage their schedules. This system primarily functions through communication and data processing between a server, a terminal, and the user.

[0459] When the server receives incoming call information from the terminal, it begins recording the voice message. The recorded voice data is converted into text data using speech recognition technology. A common speech recognition API is used for this process. For example, existing technologies such as the Google Cloud Speech-to-Text API can be used. This conversion process uses asynchronous processing to ensure efficient data processing.
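The asynchronous conversion mentioned above can be illustrated as follows. This is only a sketch: `recognize` is a hypothetical placeholder that simulates the latency of a speech recognition request and does not call any real service, and `transcribe_all` is an assumed helper name. The point shown is that several recordings can be in flight at once instead of being processed one after another.

```python
import asyncio


async def recognize(audio: bytes) -> str:
    """Hypothetical stand-in for a request to a speech recognition service."""
    await asyncio.sleep(0.01)  # simulates network and processing latency
    return f"<transcript of {len(audio)} bytes>"


async def transcribe_all(recordings: list[bytes]) -> list[str]:
    # gather() runs the requests concurrently and returns the results
    # in the same order as the input recordings.
    return await asyncio.gather(*(recognize(r) for r in recordings))


if __name__ == "__main__":
    print(asyncio.run(transcribe_all([b"abc", b"de"])))
```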

[0460] The converted text data is sent to the user's device. The server then uses natural language processing technology to analyze the text and automatically registers the schedule based on the extracted data. This schedule registration is facilitated by integrating with user-friendly platforms such as calendar APIs. For example, by utilizing the Google Calendar API, the schedule can be directly reflected in the user's calendar.

[0461] Furthermore, the server includes a translation function that translates text data into different languages as needed. This allows users to handle messages in other languages. The translation could use a multilingual translation service such as the Microsoft Translator API.

[0462] For example, if a user is unable to answer a call during a meeting, their device notifies the server of the incoming call, and the server automatically records a voice message. If the recorded voice message is "Is a meeting possible tomorrow at 2pm?", the server transcribes it into text and immediately notifies the user's device. The analyzed schedule information is automatically added to the user's calendar, allowing the user to understand their schedule without having to listen to the audio. Furthermore, even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language, facilitating smooth communication between languages.
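For a message such as the one above, the extraction of date and time can be approximated with simple rules. The disclosure refers to natural language processing without naming a method, so the following is a deliberately small rule-based stand-in: the function name `extract_datetime` is hypothetical, and it only understands "today" or "tomorrow" combined with a 12-hour clock time.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches times such as "2pm", "10 AM", or "2:30 pm".
_TIME = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b", re.IGNORECASE)


def extract_datetime(text: str, now: datetime) -> Optional[datetime]:
    """Resolve a relative day plus a 12-hour clock time against `now`."""
    match = _TIME.search(text)
    if match is None:
        return None
    minute = int(match.group(2) or 0)
    if minute > 59:
        return None
    hour = int(match.group(1)) % 12          # 12am -> 0, 12pm -> 12 below
    if match.group(3).lower() == "pm":
        hour += 12
    day = now.date()
    if re.search(r"\btomorrow\b", text, re.IGNORECASE):
        day += timedelta(days=1)
    return datetime(day.year, day.month, day.day, hour, minute)


if __name__ == "__main__":
    received_at = datetime(2024, 12, 9, 9, 0)
    print(extract_datetime("Is a meeting possible tomorrow at 2pm?", received_at))
```

Resolving "tomorrow" against the time the call was received, rather than the time the user reads the notification, keeps the registered date stable.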

[0463] Example of a prompt

[0464] "Please explain the process by which the server converts received voice messages into text and automatically registers them as appointments."

[0465] This system allows users to respond quickly and efficiently to automated inquiries and manage their schedules even when they are away.

[0466] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0467] Step 1:

[0468] The terminal receives incoming call information. This information is metadata, including the caller's number and the date and time of the call. The terminal sends this data to the server, which recognizes the incoming call trigger and initiates an automated response.

[0469] Step 2:

[0470] The server plays an automated response message to the caller. Following this, the server records the caller's voice message. The recorded data is saved to the server as an audio file.

[0471] Step 3:

[0472] The server calls a speech recognition API to convert the audio data into text data. This process analyzes the recorded audio (input) and generates the corresponding text (output). In this way, the natural language spoken in the audio is captured as electronic text.

[0473] Step 4:

[0474] The server analyzes the converted text data and extracts the schedule information. In this step, natural language processing techniques are used to identify the date, time, and event details (output) from the text (input).

[0475] Step 5:

[0476] The analyzed schedule information is registered in the user's schedule using the Calendar API. The server sends the schedule information (input) to the Calendar service, where it is reflected as a new event (output). This automatically updates the user's calendar.

[0477] Step 6:

[0478] The server translates text data into different languages as needed. Using a translation API, it generates text (output) converted from the original text (input) into the desired language. This allows users to understand information without language barriers.

[0479] Step 7:

[0480] The server notifies the user's terminal of the converted text data. Ultimately, the user can visually review and understand the text message. In this process, the translated text (input) is displayed on the user's terminal screen (output).
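As an illustration of Step 5 above, the schedule information can be packaged into an event body before it is handed to a calendar service. The sketch below only builds the data and makes no network call. The field names `summary`, `start`, `end`, `dateTime`, and `timeZone` follow the event resource of the Google Calendar API; the function name `build_event` and the default duration of 60 minutes are assumptions, since a voice message often states only a start time.

```python
from datetime import datetime, timedelta


def build_event(summary: str, start: datetime, time_zone: str,
                duration_minutes: int = 60) -> dict:
    """Build a calendar event body from extracted schedule information."""
    end = start + timedelta(minutes=duration_minutes)  # assumed length
    return {
        "summary": summary,
        "start": {"dateTime": start.isoformat(), "timeZone": time_zone},
        "end": {"dateTime": end.isoformat(), "timeZone": time_zone},
    }


if __name__ == "__main__":
    print(build_event("Meeting", datetime(2024, 12, 10, 14, 0), "Asia/Tokyo"))
```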

[0481] (Application Example 1)

[0482] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0483] Traditional telephone answering systems have the potential to miss important incoming calls when users are away, which can reduce the efficiency of schedule management, especially for busy users. Furthermore, language barriers hinder smooth information transfer in communication involving different languages. There is a need to solve these problems and streamline information management within the home.

[0484] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0485] In this invention, the server includes communication means for receiving voice information, analysis means for converting the received voice information into text information, and transmission means for notifying the user of the converted text information. This enables the automatic conversion of important voice messages into text information even when the user is absent, allowing for quick and accurate schedule management.

[0486] "Voice information" refers to sound data, including human voices, acquired through telephones or recording devices.

[0487] "Communication means" refers to infrastructure and equipment for receiving voice information from external sources, and is based on network technologies such as telephone lines and the internet.

[0488] "Analysis means" refers to technologies and devices for processing received audio information and converting it into text information, often using speech recognition technology.

[0489] "Text information" refers to data obtained by converting voice information into text format, and is text that can be displayed visually.

[0490] "Transmission means" refers to the technology and equipment used to deliver the converted text information to the user, and may include communication methods such as email and messaging applications.

[0491] "Schedule information" refers to data about dates and appointments extracted from audio or text information.

[0492] "Integration means" refers to technologies and devices for linking extracted schedule information with schedule management systems and calendar applications.

[0493] "Home-use assistance devices" are devices that provide information management and work support in a home environment, and may include robots and smart home devices.

[0494] This invention provides a system for a home-use assistance device that efficiently manages voice information and automatically adjusts schedules even when the user is absent. The server receives voice information from an external source using communication means. The received voice information is converted into text information by analysis means. The text information is transmitted to the user's terminal by transmission means, and the user can visually confirm it through a messaging application or similar.

[0495] Furthermore, the server extracts schedule information from the text information and registers it in the schedule management system using the integration means. This system consists of home-use assistance devices equipped with AI chipsets such as NVIDIA Jetson. The Google Cloud Speech-to-Text API is used for analyzing voice data, and the Google Translate API is used for translating text data.

[0496] For example, if a user receives a phone call while out saying, "The meeting has been rescheduled to 2 PM tomorrow," the system automatically transcribes the message into text and updates their schedule. The user can then easily check and manage their schedule upon returning home.

[0497] An example of a prompt message to input into a generative AI model is as follows:

[0498] "Scene: You are away from home. A robot automatically manages phone messages, sends notifications, and schedules appointments. Imagine the interactions that take place."

[0499] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0500] Step 1:

[0501] The server receives voice information from external sources using the communication means. At this time, it receives voice data acquired via telephone lines or the internet as input and retains it in its original format. The server temporarily stores the received voice data in a database or temporary storage so that it can be passed to the next processing step.

[0502] Step 2:

[0503] The server uses the analysis means to convert the received voice information into text information. In this process, it takes the voice data as input and calls the Google Cloud Speech-to-Text API to convert the speech to text. It generates text data as output, stores the converted text information, and prepares the text data for use in the next step.

[0504] Step 3:

[0505] The server sends the converted text information to the terminal using the transmission means. As input, it retrieves the text data generated in the previous step and delivers it to the user via email or a messaging application. As output, the text information is provided to the terminal. The server notifies the user and keeps a transmission record for later reference and verification.

[0506] Step 4:

[0507] The server extracts schedule information from the text data and registers it in the schedule using the integration means. In this step, the text data is parsed as input, and schedule-related information such as date, time, and activity is extracted. The output is schedule information to be registered in calendar applications and schedule management systems. The server uses an API to programmatically update the schedule data.

[0508] Step 5:

[0509] Users can check notifications on their devices and easily manage their schedules. This step uses text information sent to the device and registered schedule information as input, and outputs a visual calendar display and reminders. Users can edit, delete, or add new schedules through the device interface.

[0510] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0511] This invention provides a system for effectively processing voice data acquired through voice communication, and in particular, efficiently combines an emotion engine that analyzes user emotions from voice. A specific embodiment of the system, involving a server, terminal, and user, is described below.

[0512] The device detects an incoming call and transmits this information to the server. The server automatically activates the AI personal assistant and begins recording the voice communication. Simultaneously, the server uses an emotion engine to analyze this voice data. The emotion engine analyzes the intonation, speed, tone, etc., of the voice to determine the emotion. The emotion information obtained from this analysis is used in subsequent processing.
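The disclosure does not specify how intonation and speed are measured. As one possible illustration, simple summary features could be computed before any emotion is determined. Everything in the sketch below is an assumption: the function name `prosody_features`, the use of a list of pitch samples in hertz as a proxy for intonation, and the use of a word count over the call duration as a proxy for speed. It does not determine an emotion; it only shows the kind of numeric input an emotion engine might consume.

```python
from statistics import mean, pstdev


def prosody_features(pitch_hz: list[float], word_count: int,
                     duration_s: float) -> dict[str, float]:
    """Summarize speaking speed and pitch variation for one voice message."""
    if duration_s <= 0 or not pitch_hz:
        raise ValueError("need a positive duration and at least one pitch sample")
    return {
        "speaking_rate_wps": word_count / duration_s,  # words per second
        "pitch_mean_hz": mean(pitch_hz),               # average pitch
        "pitch_std_hz": pstdev(pitch_hz),              # spread of pitch
    }
```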

[0513] The server converts the recorded audio data into text using a speech recognition engine. This text data is then supplemented with the emotion analysis results obtained from the emotion engine. As a result, the text data goes beyond mere message content and includes information indicating the speaker's emotional state. The server sends the text data and emotion information to the user's device, and the user receives this information as a notification.

[0514] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal. I would love to work with you again in the future," the emotion engine extracts positive emotions from this message. The transcribed message is tagged with emotion tags, allowing the user to understand not only its meaning but also the caller's emotional state. This information is useful for users when scheduling or responding to messages.

[0515] In addition, the server automatically extracts schedule information from the text data and synchronizes it with the calendar application. By providing a system that allows voice messages to be used efficiently and with awareness of the caller's emotions even when the user is absent, the quality of the call experience can be improved.

[0516] The following describes the processing flow.

[0517] Step 1:

[0518] The terminal detects an incoming call and sends incoming call data, including caller information and the time of the call, to the server.

[0519] Step 2:

[0520] When the server receives the incoming call data, it activates the AI personal assistant and answers the call with an automated message.

[0521] Step 3:

[0522] The server records the call's audio data in real time and sends the recorded audio data to the emotion engine.

[0523] Step 4:

[0524] The emotion engine analyzes audio data to determine the speaker's emotions. It generates information about the intensity and type of emotion (e.g., joy, surprise, anger).

[0525] Step 5:

[0526] The server sends the recorded audio data to the speech recognition engine, which converts the audio into text data.

[0527] Step 6:

[0528] The generated text data is then supplemented with emotional information generated by the emotion engine.

[0529] Step 7:

[0530] The server sends text data containing emotional information to the user's device as a notification. This allows the user to communicate while understanding the sender's emotions.

[0531] Step 8:

[0532] The server analyzes date and time information from the text data and extracts information related to the schedule.

[0533] Step 9:

[0534] Based on the extracted schedule information, the server synchronizes with the user's calendar application. If necessary, it registers new schedules.

[0535] This series of processes allows users to understand the message content, including the sender's emotions, and take appropriate action.
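Steps 6 and 7 above, in which emotion information is attached to the text and sent as a notification, can be sketched as a small data structure. The class names `Emotion` and `AnnotatedMessage`, the field names, the intensity range of 0.0 to 1.0, and the JSON encoding of the notification are all assumptions made for illustration; the disclosure only states that the type and intensity of the emotion are generated and added to the text data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Emotion:
    kind: str         # e.g. "joy", "surprise", "anger"
    intensity: float  # assumed to lie between 0.0 and 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must be between 0.0 and 1.0")


@dataclass(frozen=True)
class AnnotatedMessage:
    caller: str
    text: str
    emotion: Emotion

    def to_notification(self) -> str:
        # asdict() also converts the nested Emotion into a plain dict.
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    message = AnnotatedMessage(
        caller="+81-00-0000-0000",  # placeholder number
        text="I was very impressed with your proposal.",
        emotion=Emotion("joy", 0.8),
    )
    print(message.to_notification())
```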

[0536] (Example 2)

[0537] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0538] In modern voice communication, there are problems with efficiently managing voice messages when the recipient is absent, and with capturing not only the content but also the emotional nuances of the message. Furthermore, in situations where there is a need to extract specific schedule information from messages and support multiple languages, the lack of sufficient systems to automate these processes reduces user convenience.

[0539] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0540] In this invention, the server includes means for utilizing an information processing device for receiving voice data, means for utilizing an analysis device for analyzing the received voice data and determining emotional information, and means for utilizing a conversion device for converting the analyzed voice data into text data. This makes it possible to provide users with detailed information, including emotional and schedule information, when receiving voice messages, thereby improving user convenience.

[0541] "Audio data" refers to information that represents audio signals in digital or analog format.

[0542] An "information processing device" is a device that has a hardware or software configuration for collecting, recording, analyzing, and outputting data.

[0543] An "analytical device" is a device that extracts useful information from input data and makes decisions based on a specific purpose.

[0544] A "conversion device" is a device or function used to convert data in one format into another format.

[0545] An "additional device" refers to a function or device used to add additional information to the main data.

[0546] A "transmitting device" is a device used to send data along a designated communication channel to another device or system.

[0547] An "extraction device" is a device or software that has the function of selecting specific information from a large amount of data.

[0548] A "synchronization device" is a device that ensures consistency between multiple data sets or processes by matching timing and data content.

[0549] A "translation device" is a device or software that has the function of converting text written in one language into another language.

[0550] This invention provides a system that highly streamlines the communication and analysis of voice data. At its core, it includes a process that extracts emotional information from voice and manages schedules based on that information.

[0551] The terminal is responsible for receiving voice data and transmits incoming call information to the server. The hardware used here includes acoustic sensors and microphones that enable voice input and output. A dedicated software algorithm detects incoming calls and sends the information to the server in real time.

[0552] The server activates the recording function as soon as it receives the audio data sent from the terminal. The recorded audio data is immediately sent to an analysis device. This analysis device runs an advanced emotion recognition algorithm that analyzes parameters such as intonation, speed, and tone of the voice in order to determine emotional information. Once the analysis is complete, an emotion determination result is generated and managed within the server.

[0553] Subsequently, the server uses speech recognition technology to convert the audio data into text data. This process uses a natural language processing engine to create the text. Emotion information is also added to this text data, which is then transmitted to the user's device.

[0554] Users can receive notifications that include emotional tags along with the text data. This allows users to understand not only the message content but also the sender's emotional state. Furthermore, the server has the ability to extract schedule information from the text data and automatically synchronize it with the user's calendar application. This enables efficient schedule management using voice messages.

[0555] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal the other day. I would love to work with you on the next project," the server analyzes this message as a positive emotion. This information is then communicated to the user, enabling them to respond quickly and appropriately in their work.

[0556] An example of a prompt might be: "Convert the received voice message into text and extract emotion information. Example: 'I was very impressed with your proposal the other day. I would love to work with you on the next project.'" This prompt allows the system to quickly and accurately understand the voice data and perform emotion analysis.

[0557] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0558] Step 1:

[0559] The terminal detects incoming calls by utilizing its ability to receive voice signals. When an incoming call is received, it notifies the server of the corresponding incoming call data. This notification is sent via the terminal's network module. The input is a signal indicating whether or not an incoming call is received, and the output is the incoming call notification sent to the server.

[0560] Step 2:

[0561] When the server receives an incoming call notification from the terminal, it immediately activates the AI personal assistant and records the voice communication. This process utilizes a recording device, and the voice data is stored digitally. The input is the incoming call notification sent from the terminal, and the output is the recorded voice file.

[0562] Step 3:

[0563] The server transmits the recorded audio data to an analysis device for sentiment analysis. The analysis device determines emotional information based on the intonation, speed, and tone of the voice. The input here is the recorded audio data, and the output is the determined emotional information. This quantifies the emotional state of the speaker.

[0564] Step 4:

[0565] The server runs a speech recognition engine to convert audio data into text data. Because the speech recognition engine utilizes a machine learning model, it can convert speech to text with high accuracy. The input is a pre-analyzed audio file, and the output is text data with added emotional information.

[0566] Step 5:

[0567] The server extracts schedule information from text data. Using natural language processing techniques, it identifies and extracts the schedule information in a specified format. The input is character data, and the output is the extracted schedule information. This information is used in subsequent processes.

[0568] Step 6:

[0569] The server sends text data and sentiment information to the user's terminal. The terminal receives this information as a notification and displays it through the user interface. The input is text data and sentiment information, and the output is visualized data notified to the user. This step allows the user to quickly grasp the message content and the sender's sentiment.

[0570] Step 7:

[0571] The server uses the extracted schedule information to synchronize with the user's calendar application. The synchronization is automated and adds new appointments to the user's schedule. The input is the schedule information, and the output is the updated calendar data. This gives the user efficient schedule management.
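One practical concern in Step 7 is that the same message may be processed more than once, which should not create duplicate appointments. The following sketch shows one way to make the synchronization idempotent. The in-memory dictionary stands in for the user's calendar application, and the function name `sync_events` and the choice of title plus start time as the identifying key are assumptions, not details taken from the disclosure.

```python
def sync_events(calendar: dict, extracted: list[dict]) -> list[dict]:
    """Add extracted events that are not yet in the calendar.

    `calendar` maps (title, start) to the stored event and is updated in
    place. The list of newly added events is returned.
    """
    added = []
    for event in extracted:
        key = (event["title"], event["start"])
        if key not in calendar:
            calendar[key] = event
            added.append(event)
    return added


if __name__ == "__main__":
    calendar = {}
    meeting = {"title": "Meeting", "start": "2024-12-10T14:00:00"}
    print(sync_events(calendar, [meeting]))  # the meeting is added
    print(sync_events(calendar, [meeting]))  # nothing new to add
```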

[0572] (Application Example 2)

[0573] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0574] In today's communication environment, receiving and managing voice messages is essential. However, there is a need for technology that not only converts voice to text but also recognizes emotional nuances and provides information in a user-friendly format. Furthermore, efficiently processing information while the user is absent, automatically taking appropriate action, and coordinating with individual schedules are key challenges.

[0575] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0576] In this invention, the server includes communication means for receiving voice data, analysis means for converting the received voice data into text data, and emotion analysis means for analyzing emotions from the voice data and adding emotion information. This allows the user to manage their schedule while understanding not only the specific content of the voice message but also the emotional state of the speaker.

[0577] A "communication means" is an interface for receiving voice data and a device that plays the role of incorporating voice information into the system.

[0578] The "analysis means" is a device that processes received audio data and converts it into text data, achieving accurate text conversion using speech recognition technology.

[0579] An "emotion analysis means" is a device that determines the speaker's emotions from voice data by analyzing the intonation and tone of the voice, and provides that emotion information.

[0580] A "transmission means" is a device that notifies the user of converted text data and sentiment information, and is responsible for conveying information to the user.

[0581] An "integration means" is a device that has the function of reserving or updating schedules based on schedule information extracted from text data, and synchronizes with calendars and other similar systems.

[0582] The system of the present invention mainly consists of a server, a terminal, and a user. The terminal is equipped with a communication interface for receiving voice communication data and transmits the received voice data to the server. When the server receives the voice data, it first uses an analysis means to convert the voice into text data. Next, it uses an emotion analysis means to extract emotion information from the voice data.

[0583] The server analyzes the intonation, speed, and tone of the speech to evaluate the speaker's emotional state. The emotional information obtained by the emotion analysis is added to the converted text data, resulting in the generation of composite data that includes not only text data but also emotional information.

[0584] This composite data is sent to the user's terminal via the transmission means. This allows the user to understand not only the content of the message but also the speaker's emotional state. Furthermore, the server can extract schedule information from the text data and, using the integration means, link it with a schedule reservation system or calendar.

[0585] As a concrete example of use, consider a situation where a user is in a meeting and unable to answer a phone call. In this case, the terminal receives a voice message, and the server automatically transcribes the message into text and analyzes its sentiment. If the server recognizes a positive sentiment in a message such as "I was very impressed with your proposal. I would love to work with you again in the future," and notifies the user of this information, the user can also understand the nuances of the message.

[0586] An example of a prompt to set in a generative AI model is, "Analyze the voice message, identify its emotional state, and then transcribe it into text to convey the relevant emotional information to the user." This makes it possible to extract emotions from voice and convert them to text more efficiently.

[0587] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0588] Step 1:

[0589] The terminal receives voice data. Specifically, it detects an incoming call via the communication interface and records a voice message. The input to this process is voice data, and the output is the same voice data retained as is.

[0590] Step 2:

[0591] The terminal sends the received audio data to the server. The terminal uploads the recorded audio file to the server via the network. The input to this process is the audio data recorded in step 1, and the output is the audio data sent to the server.

[0592] Step 3:

[0593] The server analyzes the audio data and converts it into text data. The server uses a speech recognition engine to convert the audio data into text. In this process, audio data is used as input and text data is generated as output.

[0594] Step 4:

[0595] The server analyzes the audio data using the emotion analysis means and extracts emotion information. It analyzes the intonation and speed of the speech to estimate the speaker's emotions. The input for this process is the original audio data, and the output is emotion information.

[0596] Step 5:

[0597] The server integrates text data and sentiment information and notifies the user. The server adds sentiment information as a tag to the text data and sends it to the user's terminal. The input to this process is text data and sentiment information, and the output is integrated notification data.

[0598] Step 6:

[0599] The server extracts schedule information from text data and links it to the schedule. Using natural language processing technology, it identifies schedule information within text messages and registers it in the calendar system. The input to this process is text data, and the output is updated schedule information.
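In the flow above, Steps 3 and 4 both take the original audio data as input and do not depend on each other, so they could run in parallel before Step 5 combines their results. The sketch below illustrates this under stated assumptions: `analyze_message` is a hypothetical function name, the two callables are placeholders for the speech recognition engine and the emotion analysis means, and the use of a thread pool is a design choice of this sketch rather than something the disclosure prescribes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def analyze_message(audio: bytes,
                    transcribe: Callable[[bytes], str],
                    analyze_emotion: Callable[[bytes], str]) -> dict:
    """Run transcription and emotion analysis on the same audio in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(transcribe, audio)          # Step 3
        emotion_future = pool.submit(analyze_emotion, audio)  # Step 4
        # Step 5: combine both results into one notification payload.
        return {
            "text": text_future.result(),
            "emotion_tag": emotion_future.result(),
        }
```

Threads suit this case when both analyses spend most of their time waiting on external services; CPU-bound local models would call for a different arrangement.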

[0600] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0601] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0602] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0603] [Fourth Embodiment]

[0604] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0605] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0606] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0607] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0608] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0609] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0610] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0611] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0612] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0613] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0614] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0615] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0616] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0617] Embodiments of the present invention will be described in detail from the perspectives of the server, terminal, and user, detailing a specific system in which an AI personal assistant handles telephone calls.

[0618] This system aims to receive voice messages as text and notify the user without requiring manual confirmation of incoming calls. Upon receiving incoming call information from a terminal, the server initiates an AI-powered automated response. During the automated response, the server records the voice and converts the audio data into text data using an analysis tool. The analyzed text data is then sent to the user's terminal via the server, allowing the user to instantly review it visually.

[0619] Furthermore, the server identifies appointment information within the text data and registers the schedule in conjunction with the calendar. This allows users to automatically and quickly manage their schedules based on the content of their calls. If necessary, the server translates the text data to support multilingual communication.

[0620] As a concrete example, consider a situation where a user is in a meeting and unable to answer a call. The terminal notifies the server of the incoming call, and the server automatically responds. If the caller leaves a message asking, "Are you available for a meeting tomorrow at 10 AM?", the audio is immediately transcribed into text and the user is notified. The server then analyzes this message and adds it to the calendar as an appointment. This allows the user to check and register the meeting appointment without having to listen to the audio message. Even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language.

[0621] Thus, the system of the present invention reduces the time and effort associated with answering phone calls, even when the user is away, and enables efficient schedule management.

[0622] The following describes the processing flow.

[0623] Step 1:

[0624] The terminal detects an incoming call and transmits incoming call data, including caller information and the time of the call, to the server.

[0625] Step 2:

[0626] The server activates the AI personal assistant as soon as it receives the incoming call data and answers the call with an automated response message.

[0627] Step 3:

[0628] The server records the audio during the call in real time and temporarily stores the data.

[0629] Step 4:

[0630] The server sends the stored audio data to the speech recognition engine, which then processes the audio into text.

[0631] Step 5:

[0632] The server verifies the converted text data and sends its contents as a push notification to the user's device.

[0633] Step 6:

[0634] The server automatically extracts date and time information from the text data and determines whether it relates to a schedule.

[0635] Step 7:

[0636] If the text data contains schedule information, the server synchronizes with the user's calendar application and enters or updates the necessary information.

[0637] Step 8:

[0638] If the text is in a language different from the user's, the server uses a translation engine to translate it automatically and notifies the user in the most appropriate language.

[0639] Step 9:

[0640] The user receives a notification sent from the server, reviews the text, and modifies or adds calendar information if necessary.

[0641] This process allows users to effectively manage information through automated telephone support.
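
The nine steps above can be sketched as a single server-side function. This is a minimal illustration under stated assumptions, not the disclosed implementation: `IncomingCall`, `Notification`, `handle_call`, and the injected `transcribe`, `extract_schedule`, `detect_language`, and `translate` callables are hypothetical names standing in for the speech recognition engine, the schedule analysis, and the translation engine, and the calendar is modeled as a plain list.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IncomingCall:
    """Step 1: incoming call data sent by the terminal (hypothetical shape)."""
    caller: str
    received_at: str
    audio: bytes  # Step 3: audio recorded during the automated response


@dataclass
class Notification:
    """Steps 5 and 8: what the server pushes to the user's terminal."""
    caller: str
    text: str
    schedule: Optional[dict] = None


def handle_call(
    call: IncomingCall,
    transcribe: Callable[[bytes], str],
    extract_schedule: Callable[[str], Optional[dict]],
    detect_language: Callable[[str], str],
    translate: Callable[[str, str], str],
    calendar: list,
    user_language: str = "en",
) -> Notification:
    text = transcribe(call.audio)                     # Step 4: speech to text
    schedule = extract_schedule(text)                 # Step 6: find schedule info
    if schedule is not None:                          # Step 7: update the calendar
        calendar.append(schedule)
    if detect_language(text) != user_language:        # Step 8: translate if needed
        text = translate(text, user_language)
    return Notification(call.caller, text, schedule)  # Step 5: notify the user
```

Passing the engines in as callables keeps the flow itself independent of any particular speech recognition, translation, or calendar service, so each can be swapped or stubbed out separately.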

[0642] (Example 1)

[0643] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0644] Traditional voice response systems were inefficient because checking voice messages and managing schedules were done manually, and they had problems with providing appropriate support when users were absent. Furthermore, communication problems sometimes arose due to difficulties in transferring information between different languages.

[0645] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0646] In this invention, the server includes means for receiving incoming information and starting processing, means for converting the received voice data into text data, and means for notifying the information processing device of the converted text data. This enables users to efficiently manage their schedules and facilitates smooth information transmission between different languages without having to directly check voice messages.

[0647] A "communication means" is a technical element that has the function of receiving incoming information and starting to process it.

[0648] "Analysis means" refers to a technical element for converting received audio data into text data and then extracting schedule information from that text data.

[0649] A "transmission means" is a technical element that has the function of notifying an information processing device of converted text data.

[0650] An "integration means" is a technical element that works in conjunction with other functions to register a schedule based on the schedule information extracted by the analysis means.

[0651] A "translation means" is a technical element that has the function of translating text data between different languages.

[0652] A "response means" is a technical element that has the function of automatically responding when the user of the information processing device is absent.

[0653] This invention uses a voice response system to enable users to easily receive information and manage their schedules. This system primarily functions through communication and data processing between a server, a terminal, and the user.

[0654] When the server receives incoming call information from the terminal, it begins recording the voice message. The recorded voice data is converted into text data using speech recognition technology. A common speech recognition API is used for this process. For example, existing technologies such as the Google Cloud Speech-to-Text API can be used. This conversion process uses asynchronous processing to ensure efficient data processing.
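
The asynchronous conversion mentioned above can be illustrated with Python's `asyncio`. This is only a sketch: `recognize` is a hypothetical stand-in that simulates latency and returns a placeholder string; it does not call the Google Cloud Speech-to-Text API or any other real service.

```python
import asyncio


async def recognize(audio: bytes) -> str:
    """Hypothetical stand-in for a speech recognition API call."""
    await asyncio.sleep(0.01)  # simulates network and processing latency
    return f"<transcript of {len(audio)} bytes>"


async def transcribe_all(recordings: list[bytes]) -> list[str]:
    # Start all requests at once so one slow recording does not block the others.
    # gather() returns the results in the same order as the inputs.
    return await asyncio.gather(*(recognize(a) for a in recordings))


def transcribe_recordings(recordings: list[bytes]) -> list[str]:
    """Synchronous entry point for callers that are not themselves async."""
    return asyncio.run(transcribe_all(recordings))
```

With this structure the server can keep accepting new calls while earlier recordings are still being transcribed.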

[0655] The converted text data is sent to the user's device. The server then uses natural language processing technology to analyze the text and automatically registers the schedule based on the extracted data. This schedule registration is facilitated by integrating with user-friendly platforms such as calendar APIs. For example, by utilizing the Google Calendar API, the schedule can be directly reflected in the user's calendar.

[0656] Furthermore, the server includes a translation function that translates text data into different languages as needed. This allows users to handle messages in other languages. The translation could use a multilingual translation service such as the Microsoft Translator API.

[0657] For example, if a user is unable to answer a call during a meeting, their device notifies the server of the incoming call, and the server automatically records a voice message. If the recorded voice message is "Is a meeting possible tomorrow at 2pm?", the server transcribes it into text and immediately notifies the user's device. The analyzed schedule information is automatically added to the user's calendar, allowing the user to understand their schedule without having to listen to the audio. Furthermore, even if the message is in a different language, the server automatically translates it and notifies the user in the appropriate language, facilitating smooth communication between languages.
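
As a rough illustration of how schedule information might be pulled out of a transcript such as the one above, the following sketch uses a small regular expression. It is a toy under stated assumptions: the function name `extract_datetime` and the pattern are hypothetical, only phrases of the form "today/tomorrow at H[:MM] am/pm" are recognized, and a production system would rely on a proper natural language processing component instead.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Toy pattern: "tomorrow at 2pm", "tomorrow at 10 AM", "today at 9:30 am"
_PATTERN = re.compile(
    r"\b(today|tomorrow)\s+at\s+(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b",
    re.IGNORECASE,
)


def extract_datetime(text: str, now: datetime) -> Optional[datetime]:
    """Return the start time mentioned in text, or None if none is found."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    day, hour, minute, meridiem = m.groups()
    hour = int(hour) % 12  # maps 12 to 0 so that 12am is 0:00 and 12pm is 12:00
    if meridiem.lower() == "pm":
        hour += 12
    date = now.date() + timedelta(days=1 if day.lower() == "tomorrow" else 0)
    return datetime(date.year, date.month, date.day, hour, int(minute or 0))
```

The reference time `now` is passed in explicitly so that relative words such as "tomorrow" are resolved against the time the call was received rather than the time the text happens to be processed.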

[0658] Example of a prompt

[0659] "Please explain the process by which the server converts received voice messages into text and automatically registers them as appointments."

[0660] This system allows users to respond quickly and efficiently to automated inquiries and manage their schedules even when they are away.

[0661] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0662] Step 1:

[0663] The terminal receives incoming call information. This information is metadata, including the caller's number and the date and time of the call. The terminal sends this data to the server, which recognizes the incoming call trigger and initiates an automated response.

[0664] Step 2:

[0665] The server plays an automated response message to the caller. Following this, the server records the caller's voice message. The recorded data is saved to the server as an audio file.

[0666] Step 3:

[0667] The server calls a speech recognition API to convert the audio data into text data. This process analyzes the recorded audio (input) and generates the corresponding text (output). Here, the natural language spoken in the audio is captured as machine-readable text.

[0668] Step 4:

[0669] The server analyzes the converted text data and extracts the schedule information. In this step, natural language processing techniques are used to identify the date, time, and event details (output) from the text (input).

[0670] Step 5:

[0671] The analyzed schedule information is registered in the user's schedule using the Calendar API. The server sends the schedule information (input) to the Calendar service, where it is reflected as a new event (output). This automatically updates the user's calendar.

[0672] Step 6:

[0673] The server translates text data into different languages as needed. Using a translation API, it generates text (output) converted from the original text (input) into the desired language. This allows users to understand information without language barriers.

[0674] Step 7:

[0675] The server notifies the user's terminal of the converted text data. Ultimately, the user can visually review and understand the text message. In this process, the translated text (input) is displayed on the user's terminal screen (output).
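
The calendar registration in Step 5 can be sketched as follows. This is an illustrative stand-in rather than a call to any real calendar API: the calendar is an in-memory dictionary, the field names `summary`, `start`, and `end` are chosen for illustration, and the default 60-minute duration is an assumption, since a voice message rarely states an end time.

```python
from datetime import datetime, timedelta


def build_event(summary: str, start: datetime, minutes: int = 60) -> dict:
    """Build a calendar event record; the field names are illustrative."""
    end = start + timedelta(minutes=minutes)
    return {
        "summary": summary,
        "start": start.isoformat(),
        "end": end.isoformat(),
    }


def register_event(calendar: dict, event: dict) -> bool:
    """Insert the event, or update it if the same start time already exists.

    Returns True when a new entry was created and False when one was updated.
    """
    key = event["start"]
    created = key not in calendar
    calendar[key] = event
    return created
```

Keying entries by start time means that processing the same message twice updates the existing entry instead of creating a duplicate appointment.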

[0676] (Application Example 1)

[0677] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0678] Traditional telephone answering systems have the potential to miss important incoming calls when users are away, which can reduce the efficiency of schedule management, especially for busy users. Furthermore, language barriers hinder smooth information transfer in communication involving different languages. There is a need to solve these problems and streamline information management within the home.

[0679] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0680] In this invention, the server includes communication means for receiving voice information, analysis means for converting the received voice information into text information, and transmission means for notifying the user of the converted text information. This enables the automatic conversion of important voice messages into text information even when the user is absent, allowing for quick and accurate schedule management.

[0681] "Voice information" refers to sound data, including human voices, acquired through telephones or recording devices.

[0682] "Communication means" refers to infrastructure and equipment for receiving voice information from external sources, and is based on network technologies such as telephone lines and the internet.

[0683] "Analysis means" refers to technologies and devices for processing received voice information and converting it into text information, often using speech recognition technology.

[0684] "Text information" refers to data obtained by converting voice information into text format, and is text that can be displayed visually.

[0685] "Transmission means" refers to the technology and equipment used to deliver the converted text information to the user, and may include communication methods such as email and messaging applications.

[0686] "Schedule information" refers to data about dates and appointments extracted from voice or text information.

[0687] "Integration means" refers to technologies and devices for linking extracted schedule information with schedule management systems and calendar applications.

[0688] "Home-use assistance devices" are devices that provide information management and work support in a home environment, and may include robots and smart home devices.

[0689] This invention provides a system for a home-use assistance device that efficiently manages voice information and automatically adjusts schedules even when the user is absent. The server receives voice information from an external source using communication means. The received voice information is converted into text information by analysis means. The text information is transmitted to the user's terminal by transmission means, and the user can visually confirm it through a messaging application or similar.

[0690] Furthermore, the server extracts schedule information from the text data and registers it in the schedule management system using the integration means. This system is built from home-use assistance devices equipped with AI chipsets such as NVIDIA Jetson. The Google Cloud Speech-to-Text API is used for analyzing speech data, and the Google Translate API is used for translating text data.

[0691] For example, if a user receives a phone call while out saying, "The meeting has been rescheduled to 2 PM tomorrow," the system automatically transcribes the message into text and updates their schedule. The user can then easily check and manage their schedule upon returning home.

[0692] An example of a prompt message to input into a generative AI model is as follows:

[0693] "Scene: You are away from home. A robot automatically manages phone messages, sends notifications, and schedules appointments. Imagine the interactions that take place."

[0694] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0695] Step 1:

[0696] The server receives voice information from external sources using the communication means. At this time, it receives voice data acquired via telephone lines or the internet as input and retains it in its original format. The server temporarily stores the received voice data in a database or temporary storage so that it can be passed to the next processing step.

[0697] Step 2:

[0698] The server uses the analysis means to convert the received voice information into text information. In this process, it takes the voice data as input and calls the Google Cloud Speech-to-Text API to convert the audio to text. It generates text data as output and stores the converted text information. The server then prepares the text data for use in the next step.

[0699] Step 3:

[0700] The server sends the converted text information to the terminal using the transmission means. As input, it retrieves the text data generated in the previous step and delivers it to the user via email or a messaging application. As output, the terminal receives the text information. The server also keeps a transmission record so that the notification can be referenced and verified later.

[0701] Step 4:

[0702] The server extracts schedule information from the text data and registers it in the schedule using the integration means. In this step, the text data is parsed as input, and schedule-related information such as date, time, and activity is extracted. The output is schedule information to be registered in calendar applications and schedule management systems. The server uses an API to programmatically update the schedule data.

[0703] Step 5:

[0704] Users can check notifications on their devices and easily manage their schedules. This step uses text information sent to the device and registered schedule information as input, and outputs a visual calendar display and reminders. Users can edit, delete, or add new schedules through the device interface.
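
The delivery and record-keeping described in Step 3 can be sketched with a small helper. The names here are hypothetical: `Notifier` wraps an injected `send` callable that stands in for an email or messaging channel, and the transmission record is kept as an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Notifier:
    send: Callable[[str, str], None]  # hypothetical channel: (recipient, body)
    log: list = field(default_factory=list)

    def notify(self, recipient: str, text: str) -> dict:
        """Deliver the text and keep a record for later reference."""
        self.send(recipient, text)
        record = {
            "recipient": recipient,
            "text": text,
            "sent_at": datetime.now(timezone.utc).isoformat(),
        }
        self.log.append(record)
        return record
```

Because the record is appended only after `send` returns, an exception raised by the channel leaves no entry in the log, so the log reflects messages that were actually handed off.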

[0705] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0706] This invention provides a system for effectively processing voice data acquired through voice communication, and in particular, efficiently combines an emotion engine that analyzes user emotions from voice. A specific embodiment of the system, involving a server, terminal, and user, is described below.

[0707] The device detects an incoming call and transmits this information to the server. The server automatically activates the AI personal assistant and begins recording the voice communication. Simultaneously, the server utilizes an emotion engine to analyze this voice data. The emotion engine analyzes the intonation, speed, tone, etc., of the voice to determine the emotion. The emotion information obtained from this analysis is used in subsequent processing.

[0708] The server converts the recorded audio data into text using a speech recognition engine. This text data is then supplemented with the results of sentiment analysis obtained from an emotion engine. As a result, the text data goes beyond mere message content and includes information indicating the speaker's emotional state. When the server sends the text data and sentiment information to the user's device, the user receives this information as a notification.

[0709] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal. I would love to work with you again in the future," the emotion engine extracts positive emotions from this message. The transcribed message is tagged with emotion tags, allowing the user to understand not only its meaning but also the caller's emotional state. This information is useful for users when scheduling or responding to messages.

[0710] In addition, the server automatically extracts schedule information from text data and synchronizes it with the calendar application. By providing a system that makes efficient, emotion-aware use of voice messages even when the user is absent, the quality of the call experience can be improved.

[0711] The following describes the processing flow.

[0712] Step 1:

[0713] The terminal detects an incoming call and sends incoming call data, including caller information and the time of the call, to the server.

[0714] Step 2:

[0715] When the server receives the incoming call data, it activates the AI personal assistant and answers the call with an automated message.

[0716] Step 3:

[0717] The server records the call's audio data in real time and sends the recorded audio data to the emotion engine.

[0718] Step 4:

[0719] The emotion engine analyzes audio data to determine the speaker's emotions. It generates information about the intensity and type of emotion (e.g., joy, surprise, anger).

[0720] Step 5:

[0721] The server sends the recorded audio data to the speech recognition engine, which converts the audio into text data.

[0722] Step 6:

[0723] The generated text data is then supplemented with emotional information generated by the emotion engine.

[0724] Step 7:

[0725] The server sends text data containing emotional information to the user's device as a notification. This allows the user to communicate while understanding the sender's emotions.

[0726] Step 8:

[0727] The server analyzes date and time information from the text data and extracts information related to the schedule.

[0728] Step 9:

[0729] Based on the extracted schedule information, the server synchronizes with the user's calendar application. If necessary, it registers new schedules.

[0730] This series of processes allows users to understand the message content, including the sender's emotions, and take appropriate action.
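
Steps 4, 6, and 7 above, in which the emotion engine's result is attached to the transcript before notification, can be sketched as follows. The `Emotion` type, the `tag_message` function, and the 0.0 to 1.0 intensity scale are assumptions made for illustration; the source only states that the type and intensity of the emotion are generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Emotion:
    kind: str         # e.g. "joy", "surprise", "anger"
    intensity: float  # 0.0 (weak) to 1.0 (strong); the scale is an assumption


def tag_message(text: str, emotion: Emotion) -> dict:
    """Attach the emotion engine's result to the transcript as a tag."""
    if not 0.0 <= emotion.intensity <= 1.0:
        raise ValueError("intensity must be between 0.0 and 1.0")
    return {
        "text": text,
        "emotion": {"kind": emotion.kind, "intensity": emotion.intensity},
        "display": f"[{emotion.kind} {emotion.intensity:.0%}] {text}",
    }
```

Keeping the structured `emotion` field alongside the human-readable `display` string lets the terminal choose its own presentation while still allowing later steps to filter or sort by emotion.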

[0731] (Example 2)

[0732] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0733] In modern voice communication, there are problems with efficiently managing voice messages when the recipient is absent, and with capturing not only the content but also the emotional nuances of the message. Furthermore, in situations where there is a need to extract specific schedule information from messages and support multiple languages, the lack of sufficient systems to automate these processes reduces user convenience.

[0734] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0735] In this invention, the server includes means for utilizing an information processing device for receiving voice data, means for utilizing an analysis device for analyzing the received voice data and determining emotional information, and means for utilizing a conversion device for converting the analyzed voice data into text data. This makes it possible to provide users with detailed information, including emotional and schedule information, when receiving voice messages, thereby improving user convenience.

[0736] "Audio data" refers to information that represents audio signals in digital or analog format.

[0737] An "information processing device" is a device that has a hardware or software configuration for collecting, recording, analyzing, and outputting data.

[0738] An "analytical device" is a device that extracts useful information from input data and makes decisions based on a specific purpose.

[0739] A "conversion device" is a device or function used to convert data in one format into another format.

[0740] An "additional device" refers to a function or device used to add additional information to the main data.

[0741] A "transmitting device" is a device used to send data along a designated communication channel to another device or system.

[0742] An "extraction device" is a device or software that has the function of selecting specific information from a large amount of data.

[0743] A "synchronization device" is a device that ensures consistency between multiple data sets or processes by matching timing and data content.

[0744] A "translation device" is a device or software that has the function of converting text written in one language into another language.

[0745] This invention provides a system that highly streamlines the communication and analysis of voice data. At its core, it includes a process that extracts emotional information from voice and manages schedules based on that information.

[0746] The terminal is responsible for receiving voice data and transmits incoming call information to the server. The hardware used here includes acoustic sensors and microphones that enable voice input and output. A dedicated software algorithm detects incoming calls and sends the information to the server in real time.

[0747] The server activates the recording function as soon as it receives the audio data sent from the terminal. The recorded audio data is immediately sent to an analysis device. This analysis device runs an advanced emotion recognition algorithm that analyzes parameters such as intonation, speed, and tone of the voice in order to determine emotional information. Once the analysis is complete, an emotion determination result is generated and managed within the server.

[0748] Subsequently, the server uses speech recognition technology to convert the audio data into text data. This process uses a natural language processing engine to produce the text. Emotion information is also added to this text data, which is then transmitted to the user's device.

[0749] Users can receive notifications that include emotional tags along with the text data. This allows users to understand not only the message content but also the sender's emotional state. Furthermore, the server has the ability to extract schedule information from the text data and automatically synchronize it with the user's calendar application. This enables efficient schedule management using voice messages.

[0750] As a concrete example, consider a situation where a user is in a meeting and unable to answer a phone call. If the caller leaves a message saying, "I was very impressed with your proposal the other day. I would love to work with you on the next project," the server analyzes this message as a positive emotion. This information is then communicated to the user, enabling them to respond quickly and appropriately in their work.

[0751] An example of a prompt might be: "Convert the received voice message into text and extract sentiment information. Example: 'I was very impressed with your proposal the other day. I would love to work with you on the next project.'" This prompt allows the system to quickly and accurately understand the voice data and perform sentiment analysis.

[0752] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0753] Step 1:

[0754] The terminal detects incoming calls by utilizing its ability to receive voice signals. When an incoming call is received, it notifies the server of the corresponding incoming call data. This notification is sent via the terminal's network module. The input is a signal indicating whether or not an incoming call is received, and the output is the incoming call notification sent to the server.

[0755] Step 2:

[0756] When the server receives an incoming call notification from the terminal, it immediately activates the AI personal assistant and records the voice communication. This process utilizes a recording device, and the voice data is stored digitally. The input is the incoming call notification sent from the terminal, and the output is the recorded voice file.

[0757] Step 3:

[0758] The server transmits the recorded audio data to an analysis device for sentiment analysis. The analysis device determines emotional information based on the intonation, speed, and tone of the voice. The input here is the recorded audio data, and the output is the determined emotional information. This quantifies the emotional state of the speaker.

[0759] Step 4:

[0760] The server runs a speech recognition engine to convert audio data into text data. Because the speech recognition engine utilizes a machine learning model, it can convert speech to text with high accuracy. The input is a pre-analyzed audio file, and the output is text data with added emotional information.

[0761] Step 5:

[0762] The server extracts schedule information from text data. Using natural language processing techniques, it identifies and extracts the schedule information in a specified format. The input is text data, and the output is the extracted schedule information. This information is used in subsequent processes.

[0763] Step 6:

[0764] The server sends text data and sentiment information to the user's terminal. The terminal receives this information as a notification and displays it through the user interface. The input is text data and sentiment information, and the output is visualized data notified to the user. This step allows the user to quickly grasp the message content and the sender's sentiment.

[0765] Step 7:

[0766] The server uses the extracted schedule information to synchronize with the user's calendar application. The synchronization is automated and adds new appointments to the user's schedule. The input is the schedule information, and the output is the updated calendar data. This lets the user manage their schedule efficiently.
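
The server-side portion of Steps 3 to 7 above can be sketched as follows. This is only an illustrative outline under stated assumptions: the function name `handle_incoming_call`, the `CallResult` type, and the five callables passed in are hypothetical stand-ins for the analysis device, speech recognition engine, extraction step, notification step, and calendar synchronization, none of which are specified in code form in this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallResult:
    text: str                # Step 4 output: transcribed message
    emotion: str             # Step 3 output: estimated emotion label
    schedule: Optional[str]  # Step 5 output: extracted schedule phrase, if any


def handle_incoming_call(
    audio: bytes,
    analyze_emotion: Callable[[bytes], str],
    transcribe: Callable[[bytes], str],
    extract_schedule: Callable[[str], Optional[str]],
    notify_user: Callable[[str, str], None],
    sync_calendar: Callable[[str], None],
) -> CallResult:
    """Run Steps 3-7 on audio already recorded in Steps 1-2 (hypothetical sketch)."""
    emotion = analyze_emotion(audio)    # Step 3: emotion analysis on the recording
    text = transcribe(audio)            # Step 4: speech recognition
    schedule = extract_schedule(text)   # Step 5: schedule extraction from the text
    notify_user(text, emotion)          # Step 6: notify the user's terminal
    if schedule is not None:            # Step 7: sync only when a schedule was found
        sync_calendar(schedule)
    return CallResult(text, emotion, schedule)
```

Passing the components in as callables keeps the sketch independent of any particular recognition or calendar product; each one could be replaced without changing the order of the steps.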

[0767] (Application Example 2)

[0768] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0769] In today's communication environment, receiving and managing voice messages is essential. However, there is a need for technology that not only converts voice to text but also recognizes emotional nuances and provides information in a user-friendly format. Furthermore, efficiently processing information while the user is absent, automatically taking appropriate action, and coordinating with individual schedules are key challenges.

[0770] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0771] In this invention, the server includes communication means for receiving voice data, analysis means for converting the received voice data into text data, and emotion analysis means for analyzing emotions from the voice data and adding emotion information. This allows the user to manage their schedule while understanding not only the specific content of the voice message but also the emotional state of the speaker.

[0772] A "communication means" is an interface for receiving voice data and a device that plays the role of incorporating voice information into the system.

[0773] The "analysis means" is a device that processes received audio data and converts it into text data, achieving accurate text conversion using speech recognition technology.

[0774] The "emotion analysis means" is a device that determines the speaker's emotions from audio data by analyzing the intonation and tone of the voice, and provides that emotional information.

[0775] A "transmission means" is a device that notifies the user of converted text data and sentiment information, and is responsible for conveying information to the user.

[0776] The "coordination means" is a device that reserves or updates schedules based on schedule information extracted from text data, and synchronizes with calendars and other similar systems.

[0777] The system of the present invention mainly consists of a server, a terminal, and a user. The terminal is equipped with a communication interface for receiving voice communication data and transmits the received voice data to the server. When the server receives the voice data, it first uses an analysis means to convert the voice into text data. Next, it uses an emotion analysis means to extract emotion information from the voice data.

[0778] The server analyzes the intonation, speed, and tone of the speech to evaluate the speaker's emotional state. The emotional information obtained by the emotion analysis is added to the converted text data, resulting in the generation of composite data that includes not only text data but also emotional information.

[0779] This composite data is sent to the user's terminal via the transmission means. This allows the user to understand not only the content of the message but also the speaker's emotional state. Furthermore, the server uses the coordination means to extract schedule information from the text data and integrate it with a schedule reservation system or calendar.
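
As a rough illustration of the composite data described above, the text and the emotion information could be carried in one record and serialized for transmission. The sketch below assumes a JSON payload; the `CompositeMessage` type, its field names, and `to_notification` are hypothetical and are not defined by this disclosure.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CompositeMessage:
    text: str     # text data produced by the analysis means
    emotion: str  # emotion information produced by the emotion analysis means


def to_notification(message: CompositeMessage) -> str:
    """Serialize the composite data to a JSON string (assumed wire format)."""
    # ensure_ascii=False keeps non-ASCII message text readable in the payload.
    return json.dumps(asdict(message), ensure_ascii=False)
```

For example, `to_notification(CompositeMessage("I would love to work with you again.", "positive"))` yields a single JSON object holding both fields, which the terminal can display together.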

[0780] As a concrete example of use, consider a situation where a user is in a meeting and unable to answer a phone call. In this case, the terminal receives a voice message, and the server automatically transcribes the message into text and analyzes its sentiment. If the server recognizes a positive sentiment in a message such as "I was very impressed with your proposal. I would love to work with you again in the future," and notifies the user of this information, the user can also understand the nuances of the message.

[0781] An example of a prompt to set in a generative AI model is, "Analyze the voice message, identify its emotional state, and then transcribe it into text to convey the relevant emotional information to the user." This makes it possible to extract emotions from voice and convert them to text more efficiently.
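
One way to picture how such a prompt is used is shown below. The sketch assumes the generative AI model is reachable as a plain callable that takes the prompt and the raw audio and returns text; that interface, and the names `INSTRUCTION` and `summarize_voice_message`, are assumptions made for illustration and do not correspond to any specific product API.

```python
from typing import Callable

# Prompt text taken from the example in paragraph [0781].
INSTRUCTION = (
    "Analyze the voice message, identify its emotional state, and then "
    "transcribe it into text to convey the relevant emotional information "
    "to the user."
)


def summarize_voice_message(model: Callable[[str, bytes], str], audio: bytes) -> str:
    """Send the fixed instruction plus the audio to a model (hypothetical interface)."""
    if not audio:
        raise ValueError("audio must not be empty")
    # The prompt carries the instruction; the audio is the inference data.
    return model(INSTRUCTION, audio)
```

Keeping the instruction separate from the audio mirrors the distinction between the prompt and the inference data supplied to the model.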

[0782] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0783] Step 1:

[0784] The terminal receives voice data. Specifically, it detects an incoming call via the communication interface and records a voice message. The input to this process is voice data, and the output is the same voice data retained as is.

[0785] Step 2:

[0786] The terminal sends the received audio data to the server. The terminal uploads the recorded audio file to the server via the network. The input to this process is the audio data recorded in step 1, and the output is the audio data sent to the server.

[0787] Step 3:

[0788] The server analyzes the audio data and converts it into text data. The server uses a speech recognition engine to convert the audio data into text. In this process, audio data is used as input and text data is generated as output.

[0789] Step 4:

[0790] The server analyzes the audio data using emotion analysis tools and extracts emotional information. It analyzes the intonation and speed of the speech to estimate the speaker's emotions. The input for this process is the original audio data, and the output is emotional information.

[0791] Step 5:

[0792] The server integrates text data and sentiment information and notifies the user. The server adds sentiment information as a tag to the text data and sends it to the user's terminal. The input to this process is text data and sentiment information, and the output is integrated notification data.

[0793] Step 6:

[0794] The server extracts schedule information from text data and links it to the schedule. Using natural language processing technology, it identifies schedule information within text messages and registers it in the calendar system. The input to this process is text data, and the output is updated schedule information.
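
Step 6 can be illustrated with a deliberately simplified sketch. The disclosure relies on natural language processing technology; the code below substitutes a single regular expression that only recognizes phrases of the form "YYYY-MM-DD at HH:MM", and it treats the calendar as a plain Python list. The pattern and the names `extract_schedule` and `register` are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

# Matches e.g. "2025-01-15 at 14:00"; a stand-in for real NLP-based extraction.
_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})\s+at\s+(\d{1,2}:\d{2})")


def extract_schedule(text: str) -> Optional[datetime]:
    """Return the first date and time found in the text, or None."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    try:
        return datetime.strptime(f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M")
    except ValueError:
        # The digits matched the pattern but do not form a valid date or time.
        return None


def register(calendar: list, text: str) -> bool:
    """Append an entry to the calendar list when the text contains a schedule."""
    when = extract_schedule(text)
    if when is None:
        return False
    calendar.append({"start": when, "note": text})
    return True
```

Messages without a recognizable date leave the calendar unchanged, so the output of the step is updated schedule information only when something was actually extracted.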

[0795] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0796] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. It receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0797] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0798] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0799] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0800] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0801] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible (expressed in actions) it becomes.

[0802] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
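
The layout of the emotion map 400 described above can be pictured with a small geometric sketch. It assumes, purely for illustration, that a position on the map is given as Cartesian coordinates with the center at the origin, and that a radius threshold of 0.5 separates the inner region from the outer one; the disclosure gives no coordinates or thresholds, so `MapPosition`, `describe`, and `outer_threshold` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MapPosition:
    x: float  # negative = left (reaction side), positive = right (situation side)
    y: float  # positive = upper (pleasure), negative = lower (displeasure)

    @property
    def radius(self) -> float:
        """Distance from the center of the concentric circles."""
        return math.hypot(self.x, self.y)


def describe(pos: MapPosition, outer_threshold: float = 0.5) -> Tuple[str, str, str]:
    """Classify a position by side, pleasure or displeasure, and inner or outer layer."""
    side = "situation" if pos.x > 0 else "reaction" if pos.x < 0 else "both"
    valence = "pleasure" if pos.y > 0 else "displeasure" if pos.y < 0 else "neutral"
    layer = "action" if pos.radius >= outer_threshold else "inner"
    return side, valence, layer
```

Under these assumptions, a point at the 3 o'clock position on the outer ring, `MapPosition(1.0, 0.0)`, falls on the situation side, between pleasure and displeasure, in the outer layer that is expressed as action.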

[0803] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0804] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
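
How the user's emotion is determined from the emotion values is not spelled out here. A minimal sketch, assuming the emotion with the largest value is selected, is given below; the function name `dominant_emotion` and the example values are hypothetical, and the neural network itself is outside the scope of the sketch.

```python
from typing import Mapping


def dominant_emotion(emotion_values: Mapping[str, float]) -> str:
    """Pick the emotion with the highest value (assumed selection rule)."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    # max over the keys, ranked by their associated emotion value
    return max(emotion_values, key=lambda name: emotion_values[name])
```

With neighboring emotions such as "reassured", "calm", and "confident" receiving similar values, this rule picks whichever of them is marginally highest, so a practical system might instead report the few top-ranked emotions together.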

[0805] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0806] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0807] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0808] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0809] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0810] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0811] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0812] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0813] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0814] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0815] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0816] The following is further disclosed regarding the embodiments described above.

[0817] (Claim 1)

[0818] A means of receiving voice data,

[0819] An analysis means for converting received audio data into text data,

[0820] A means of sending the converted text data to the user,

[0821] An analytical means for extracting schedule information from text data,

[0822] A means of coordinating appointments based on extracted schedule information,

[0823] A system comprising all of the above.

[0824] (Claim 2)

[0825] The system according to claim 1, further comprising a translation means for translating text data converted by an analysis means into a different language.

[0826] (Claim 3)

[0827] The system according to claim 1, wherein the communication means has a function to automatically respond when the user is absent.

[0828] "Example 1"

[0829] (Claim 1)

[0830] A communication means that receives incoming information and starts processing it,

[0831] An analysis means for converting received audio data into text data,

[0832] A transmission means for notifying an information processing device of the converted text data,

[0833] An analysis means for extracting schedule information from text data,

[0834] An integrated means for registering schedules based on extracted schedule information,

[0835] A translation means for translating text data between different languages,

[0836] A response means that automatically responds when the information processing device is absent,

[0837] A system comprising all of the above.

[0838] (Claim 2)

[0839] The system according to claim 1, wherein the information processing device records a voice message and immediately notifies the converted text data.

[0840] (Claim 3)

[0841] The system according to claim 1, wherein the information processing device cooperates with a schedule management function to automatically register schedule information.

[0842] "Application Example 1"

[0843] (Claim 1)

[0844] A means of receiving voice information,

[0845] An analysis means for converting received audio information into text information,

[0846] A means of transmission that notifies the user of the converted character information,

[0847] An analysis means for extracting schedule information from text information,

[0848] A means of coordinating schedules based on extracted schedule information,

[0849] A system including a home assistance device that has a function to automatically manage and notify the user of their schedule based on received voice information when the user is absent.

[0850] (Claim 2)

[0851] The system according to claim 1, further comprising a translation means for translating character information converted by an analysis means into a different natural language.

[0852] (Claim 3)

[0853] The system according to claim 1, wherein the communication means has a function to automatically respond when the user is absent, and further automatically records the schedule based on the received voice information and notifies the user from the support device.

[0854] "Example 2 of combining an emotion engine"

[0855] (Claim 1)

[0856] An information processing device that receives audio data,

[0857] An analysis device that analyzes received audio data and determines emotional information,

[0858] A conversion device that converts analyzed audio data into text data,

[0859] An additive device that adds emotional information to the converted text data,

[0860] A transmitting device that notifies the user of text data and emotional information,

[0861] An extraction device that automatically extracts schedule information from text data,

[0862] A synchronization device that synchronizes schedules based on extracted schedule information,

[0863] A system comprising all of the above.

[0864] (Claim 2)

[0865] The system according to claim 1, further comprising a translation device for translating converted character data into a different language.

[0866] (Claim 3)

[0867] The system according to claim 1, wherein the communication means has a function to automatically respond when the user is absent.

[0868] "Application example 2 of combining emotional engines"

[0869] (Claim 1)

[0870] A means of receiving voice data,

[0871] An analysis means for converting received audio data into text data,

[0872] An emotion analysis means that analyzes emotions from audio data and adds emotional information,

[0873] A means of transmitting converted text data and sentiment information to the user,

[0874] A means of integrating to extract schedule information from text data and book appointments,

[0875] A system comprising all of the above.

[0876] (Claim 2)

[0877] The system according to claim 1, further comprising a translation means for translating text data converted by an analysis means into a different language.

[0878] (Claim 3)

[0879] The system according to claim 1, wherein the communication means has a function to automatically respond when the user is absent.

[Explanation of Symbols]

[0880]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: a means of receiving voice data; an analysis means for converting the received voice data into text data; a means of sending the converted text data to the user; an analysis means for extracting schedule information from the text data; and a means of coordinating appointments based on the extracted schedule information.

2. The system according to claim 1, further comprising a translation means for translating text data converted by an analysis means into a different language.

3. The system according to claim 1, wherein the communication means has a function to automatically respond when the user is absent.