An AI system with facial recognition, emotion analysis, and multilingual support addresses labor shortages and language barriers in reception systems, providing personalized and efficient interactions.

JP2026096459A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP

Cites: 1 · Cited by: 0

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

Application Information

Patent Timeline

03 Dec 2024: Application

15 Jun 2026: Publication (JP2026096459A)

IPC: G06Q50/10; G06T7/00; G10L15/22; G10L13/00

AI Tagging

Application Domain

Image analysis; Data processing applications

AI Technical Summary

Technical Problem

Existing reception systems in corporations and public institutions face challenges with labor shortages, inefficient communication, and the inability to handle multiple languages and cultural backgrounds, leading to potential customer dissatisfaction.

Method used

An AI-powered system that integrates facial recognition, emotion analysis, speech recognition, and multilingual support to provide personalized and efficient responses, utilizing generative models for natural interactions and data-driven improvements.

Benefits of technology

The system enhances customer satisfaction by offering tailored services in multiple languages, improving operational efficiency through continuous learning, and ensuring smooth communication.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To provide a system. [Solution] A system comprising: a facial recognition means for recognizing a user's face and obtaining corresponding personal information; an emotion recognition means for analyzing the user's emotions based on the recognized facial data; a speech recognition means for converting voice input from the user into text and analyzing the information request; a response generation means using a generative model that generates an appropriate response based on the analysis results; a multilingual support means that translates the response and provides it in multiple languages; and a data storage and analysis means for accumulating data obtained from each of the above means and performing learning.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] As labor shortages grow more serious, reception work in corporations and public institutions must be handled with limited staff. In addition, although efficient and friendly responses are required, customer satisfaction may fall if communication with users does not proceed smoothly. Furthermore, conventional methods may have difficulty appropriately handling visitors with diverse languages and cultural backgrounds.

Means for Solving the Problems

[0005] This invention provides a system that uses AI technology to automate reception duties. Specifically, it achieves personalized communication by using means to recognize the user's face and analyze their emotional state. Furthermore, it accurately understands user inquiries using speech recognition technology and generates natural responses using generative models, thereby responding appropriately to diverse requests. In addition, by incorporating multilingual support capabilities and building a system that can respond appropriately to visitors who speak various languages, it is possible to effectively solve conventional problems.

[0006] "Facial recognition means" refers to technology that identifies individuals by capturing a user's face using a camera and comparing it with information in a database.

[0007] "Emotion recognition means" refers to technology that analyzes the emotional state from the user's facial expression data and adjusts the style and tone of responses based on that information.

[0008] "Speech recognition means" refers to technology that converts user speech into text, recognizes its content, and analyzes it.

[0009] A "generative model" refers to an algorithm that automatically generates responses in a natural format based on the user's inquiry.

[0010] "Response generation means" refers to technology that constructs an accurate response based on analyzed information and communicates it to the user.

[0011] "Multilingual support" refers to technology that can automatically translate and provide responses to inquiries in different languages.

[0012] "Data storage and analysis means" refers to technology that stores the data the system generates and collects day to day over a long period and analyzes it for learning and improvement.

Brief Description of the Drawings

[0013]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, each processor denoted by a reference numeral (hereinafter simply "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0017] In the following embodiments, each RAM (Random Access Memory) denoted by a reference numeral is memory in which information is temporarily stored and which the processor uses as work memory.

[0018] In the following embodiments, each storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tape.

[0019] In the following embodiments, each communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same reading applies in this specification when three or more items are linked by "and/or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] This invention is an AI concierge robot system designed to streamline reception operations in corporations and public institutions. At the heart of the system is a terminal that interacts with the user, integrating a wide range of functions including facial recognition, emotion analysis, speech recognition, response generation, and multilingual support. The specific implementation of the system is described below.

[0035] First, when a user stands in front of the device, the device captures the user's face with its built-in camera. This image data is analyzed by a facial recognition algorithm and then queried to a server to be matched with the user's personal information. For already registered users, their history information can be used to provide more personalized service. For new users, the registration process begins, and the acquired information is stored on the server.
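As one non-limiting illustration of the match-or-register behavior described above, the following Python sketch compares a face feature vector against stored records and registers a new user when no record is similar enough. The feature-vector representation, the cosine-similarity comparison, the 0.9 threshold, and all names (UserRecord, identify_or_register) are assumptions made for illustration; the disclosure does not specify how matching is performed.

```python
import math
from dataclasses import dataclass, field


@dataclass
class UserRecord:
    user_id: str
    embedding: list[float]                      # hypothetical face feature vector
    history: list[str] = field(default_factory=list)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_or_register(embedding, database, threshold=0.9):
    """Return (record, is_new). The threshold is an assumed tuning value."""
    best = max(
        database,
        key=lambda r: cosine_similarity(embedding, r.embedding),
        default=None,
    )
    if best is not None and cosine_similarity(embedding, best.embedding) >= threshold:
        return best, False                      # registered user: history is available
    # No sufficiently similar record: start registration and store the new user.
    record = UserRecord(user_id=f"user-{len(database) + 1}", embedding=list(embedding))
    database.append(record)
    return record, True
```

In this sketch the database is a plain in-memory list; in the described system the records would be held on the server.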

[0036] Next, the device analyzes the user's facial expressions to recognize emotions. By recognizing the emotional state, it can determine whether the user is relaxed or stressed, and select a response tone appropriate to that state. For example, if the user is tense, a more friendly and reassuring tone will be used.
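The tone selection described above could, for example, be expressed as a simple lookup from a recognized emotional state to a response tone. The following Python sketch is illustrative only: the two emotion labels come from the description (relaxed, tense), while the tone strings, the Emotion enum, and the select_tone function are hypothetical.

```python
from enum import Enum


class Emotion(Enum):
    RELAXED = "relaxed"
    TENSE = "tense"


# Hypothetical tone table. The description only states that a tense user
# receives a more friendly and reassuring tone.
TONE_BY_EMOTION = {
    Emotion.RELAXED: "neutral and concise",
    Emotion.TENSE: "friendly and reassuring",
}


def select_tone(emotion: Emotion) -> str:
    """Map an emotional state to a response tone, with a neutral fallback."""
    return TONE_BY_EMOTION.get(emotion, "neutral and concise")
```

A table keeps the mapping easy to extend if the emotion recognition step distinguishes more states.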

[0037] The device has a built-in microphone that accepts voice input from the user. Using speech recognition technology, the device converts the voice into text, and a generative AI model analyzes the content. This analysis identifies the user's specific requests and questions. The server then searches for the necessary information based on the analysis results and generates the optimal answer.

[0038] The generated responses are adapted to the user's language using multilingual support, ensuring that accurate information is provided to users who speak different languages. For example, an English-speaking user will receive an audio response in English.

[0039] All interactions are logged and stored on the server, and a machine learning engine periodically analyzes the data to improve future response accuracy. In this way, the system continuously learns and evolves through all interactions with the user. For example, if a visitor asks, "Where is the nearest ATM?", the terminal will determine the user's location and guide them to the nearest ATM on a map. Providing this information via voice makes the guidance quick and efficient.

[0040] The following describes the processing flow.

[0041] Step 1:

[0042] The device activates its camera and captures the user's face. The captured image is then analyzed using a built-in facial recognition algorithm.

[0043] Step 2:

[0044] The device sends the facial recognition results to the server, which compares them with past records to identify the user. The server then refers to an existing database to identify the user information and sends it back to the device. For new users, the information is recorded, and the initial registration process is performed.

[0045] Step 3:

[0046] The device captures facial expression data from the user's face and uses emotion recognition to determine the user's current emotional state. This prepares the device to appropriately adjust the tone of conversation.

[0047] Step 4:

[0048] The user communicates questions and requests by voice to the device. The voice input is transmitted to the device via the microphone.

[0049] Step 5:

[0050] The device uses a speech recognition algorithm to convert the user's voice into text data. The converted text is then parsed to understand the user's request.

[0051] Step 6:

[0052] The device then uses a generative AI model to initiate a process that generates an appropriate response based on the analyzed request.

[0053] Step 7:

[0054] The terminal sends a request to the server to obtain the necessary information. The server retrieves the appropriate information from the database and provides it to the terminal.

[0055] Step 8:

[0056] The device uses its multilingual support feature to translate the generated response into the user's preferred language.

[0057] Step 9:

[0058] The terminal uses speech synthesis to output the system-generated response as audio and convey it to the user.

[0059] Step 10:

[0060] Once the interaction is complete, the server logs all interactions and analyzes the data to improve accuracy in the future.
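Steps 1 to 10 above can be read as a linear pipeline. The following Python sketch shows one possible arrangement in which each stage is supplied as a callable, so that any recognition, generation, or translation engine could be substituted. Every name here (Stages, handle_visit, and the field names) is hypothetical, and the information retrieval of Step 7 is folded into the generate stage for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stages:
    """Pluggable stage implementations; every field is a hypothetical stand-in."""
    recognize_face: Callable[[bytes], str]      # Steps 1-2: image -> user id
    detect_emotion: Callable[[bytes], str]      # Step 3: image -> emotion label
    transcribe: Callable[[bytes], str]          # Steps 4-5: audio -> request text
    generate: Callable[[str, str, str], str]    # Steps 6-7: (user id, emotion, text) -> reply
    translate: Callable[[str, str], str]        # Step 8: (reply, language) -> localized reply
    synthesize: Callable[[str], bytes]          # Step 9: localized reply -> audio


def handle_visit(stages: Stages, image: bytes, audio: bytes,
                 language: str, log: list) -> bytes:
    """Run one reception interaction and append a record of it to the log."""
    user_id = stages.recognize_face(image)
    emotion = stages.detect_emotion(image)
    request_text = stages.transcribe(audio)
    reply = stages.generate(user_id, emotion, request_text)
    localized = stages.translate(reply, language)
    speech = stages.synthesize(localized)
    # Step 10: keep a record of the interaction for later analysis.
    log.append({"user": user_id, "emotion": emotion,
                "request": request_text, "reply": localized})
    return speech
```

Passing the stages in as callables also makes the flow testable with simple stubs before any real engine is connected.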

[0061] (Example 1)

[0062] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0063] Traditional reception systems provided a uniform approach to users, making it difficult to offer services tailored to individual needs. Furthermore, their inability to support multiple languages meant they could not adequately assist users who spoke different languages. Additionally, learning and improvement based on collected data was difficult, resulting in a lack of an efficient feedback loop for improving service quality.

[0064] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0065] In this invention, the server includes biometric authentication means, emotion detection means, voice conversion means, response formation means, language conversion means, information storage and analysis means, and location identification means. This enables the provision of flexible services tailored to the individual needs of users and provides accurate information to users who speak different languages. Furthermore, it enables qualitative improvement of services through the analysis of collected data and allows the system to evolve through continuous learning.

[0066] "Biometric authentication means" refers to technology that scans a user's face to obtain personal information and authenticate the individual.

[0067] "Emotion detection means" refers to technology that analyzes a user's facial expressions to determine their emotional state and selects a response based on that.

[0068] "Voice conversion means" refers to technology that converts voice input from a user into text data and analyzes the content of the request.

[0069] "Response formation means" refers to a technology that generates an appropriate response using a generative AI model based on the analyzed user request.

[0070] "Language conversion means" refers to technology that translates generated responses into multiple languages, providing accurate information to users who speak different languages.

[0071] "Information storage and analysis means" refers to technology that accumulates data obtained at each processing stage and improves system performance through machine learning.

[0072] "Location identification means" refers to technology that acquires a user's geographical location information and guides them to relevant facilities and services.

[0073] This system is an AI concierge robot system that streamlines reception operations in corporations and public institutions. Here, we will show a specific form for implementing the invention.

[0074] At the core of the system are terminals that interact with users. These terminals are equipped with input and output devices such as cameras, microphones, and speakers, which are used to facilitate interaction with users. Servers are located on the cloud or local network and are responsible for processing and storing data.

[0075] First, when a user stands in front of the device, the device's built-in camera captures the user's face. A facial recognition algorithm analyzes this image data, and the server uses biometric authentication to verify the user's personal information. If the user is already registered, the server retrieves their history information to enable individualized service. For new users, the registration process begins on the device, and the information is stored on the server.

[0076] Next, the terminal uses the emotion detection means to analyze the user's facial expressions and estimate their emotional state. Based on this, the server determines whether the user is relaxed or stressed and selects an appropriate response tone using the response formation means.

[0077] When a user asks a question or makes a request by voice, the device captures the audio with its microphone and converts it to text using the voice conversion means. This text data is then analyzed using a generative AI model to generate an appropriate response. The server translates the generated response into the user's language using the language conversion means, and the response is output as audio through the speaker.

[0078] Furthermore, all interactions are logged on the server using the information storage and analysis means. This data is periodically analyzed using machine learning techniques to improve the system's response accuracy.

[0079] As a concrete example, consider a scenario where a user asks, "Where is the nearest cafe?" The device obtains the user's location information and uses the location identification means to search for nearby cafes. The server can output the generated information as voice in the user's language and also display map information on the device's screen. An example prompt is: "How should we respond when a visitor asks about nearby facilities?"
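As a non-limiting sketch of the nearby-facility search in this scenario, the following Python code selects the closest facility of a requested category using the haversine great-circle distance. The distance formula is a standard one chosen here for illustration; the disclosure does not state how proximity is computed, and the record keys (name, category, lat, lon) and function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, using a mean Earth radius of 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_facility(user_pos, facilities, category):
    """Return the closest facility of the given category, or None if there is none.

    user_pos is a (latitude, longitude) tuple; facilities is a list of dicts
    with the hypothetical keys name, category, lat, and lon.
    """
    candidates = [f for f in facilities if f["category"] == category]
    if not candidates:
        return None
    return min(candidates, key=lambda f: haversine_km(*user_pos, f["lat"], f["lon"]))
```

The returned record could then be rendered as map information on the screen and summarized in the spoken response.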

[0080] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0081] Step 1: User Recognition

[0082] The user stands in front of the terminal. The terminal captures the user's face with its built-in camera. The input is the captured facial image data, which is analyzed by a facial recognition algorithm. The server receives the analysis results, compares them with the database, and determines whether the user is already registered. The output is the user's personal information and past history.

[0083] Step 2: Emotion Analysis

[0084] The system analyzes the user's facial expressions using the facial images acquired by the terminal. The input is the user's facial image data, which is analyzed by the emotion detection means. The server receives the analysis results and determines the user's emotional state (relaxed, tense, etc.). The output is the user's emotional state data. Specifically, if the user is tense, data for adjusting the tone of the response is generated.

[0085] Step 3: Speech Recognition

[0086] The user communicates questions and requests verbally to the device. The device captures the audio using its built-in microphone. The input is the user's voice data, which is converted into text data by the voice conversion means. The output is the resulting text data, which serves as the basis for the subsequent analysis.

[0087] Step 4: Response Generation

[0088] The server receives the text data obtained through speech recognition and analyzes it using a generative AI model. The input is the text data, and the model interprets the user's request based on the prompt text. Based on the analysis results, the response formation means generates the optimal answer. The output is the generated response data.

[0089] Step 5: Multilingual Translation

[0090] The system receives the response data generated by the server and applies the language conversion means. The input consists of the generated response data and the user's language information, and the response is translated into the user's native language. The output is the translated response data, which the terminal then outputs as audio.

[0091] Step 6: Location Information Guidance

[0092] When a user makes a location-related inquiry (for example, "Where is the nearest cafe?"), the device uses the location identification means. The input is the user's current location information, which the device uses to search for relevant facilities and services. The output is optimal guidance information, which the device displays on the screen as map information and also explains by audio.

[0093] Step 7: Data accumulation and analysis

[0094] After all interactions are complete, the server stores the collected data using the information storage and analysis means. The input is all the data generated at each step. Machine learning algorithms analyze this data to improve the system's accuracy and response quality. The output is analysis reports and improvement suggestions, which allow the system to be continuously improved.
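As a minimal illustration of the accumulation and analysis in Step 7, the following Python sketch aggregates logged interactions into a simple report. It deliberately substitutes plain counting for the machine learning analysis mentioned above, which the disclosure does not detail; the record keys (language, emotion, request, resolved) and the function name are hypothetical.

```python
from collections import Counter


def summarize_interactions(log):
    """Aggregate logged interactions into a report; the record keys are hypothetical."""
    by_language = Counter(entry["language"] for entry in log)
    by_emotion = Counter(entry["emotion"] for entry in log)
    # Requests explicitly marked as unresolved are candidates for improvement.
    unresolved = [e["request"] for e in log if not e.get("resolved", True)]
    return {
        "total": len(log),
        "by_language": dict(by_language),
        "by_emotion": dict(by_emotion),
        "unresolved_requests": unresolved,
    }
```

A report of this kind could serve as one input to the improvement suggestions described in this step.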

[0095] (Application Example 1)

[0096] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0097] In user reception operations, there is a need to provide multilingual support and personalized services quickly. Furthermore, it is necessary to improve convenience by utilizing user location information to guide users to details about relevant facilities and services. In addition, it is important to utilize data collected through the system to streamline reception operations.

[0098] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0099] In this invention, the server includes a facial recognition means for recognizing the user's face and acquiring personal information, an emotion recognition means for analyzing emotions, a speech recognition means for analyzing voice input, a multilingual support means for providing responses in multiple languages, a facility guidance means, and a usage status recording means. This enables multilingual support and personalized responses tailored to the user, and allows for rapid guidance to relevant facilities based on the user's location information. Furthermore, by accumulating and learning from data, continuous operational efficiency improvements can be achieved.

[0100] A "user" is an individual or group that utilizes a system and is the entity that receives information services.

[0101] "Facial recognition means" refers to technology for identifying an individual's face from image data acquired using a camera.

[0102] "Emotion recognition means" refers to technology that analyzes a user's emotional state from their facial expressions and behavior, and then derives an appropriate response based on that analysis.

[0103] "Speech recognition means" refers to a process that converts speech sounds into text data and analyzes information based on the results.

[0104] "Response generation means using a generative model" refers to a technology that utilizes a generative AI model to create an appropriate response based on voice input or other data.

[0105] "Multilingual support means" refers to a function that provides information in different languages by utilizing translation technology.

[0106] "Data storage and analysis means" refers to technology that improves operations and enhances services by storing and analyzing information collected through the system.

[0107] "Facility guidance means" refers to a process that guides users to the location and services of relevant facilities based on their location information.

[0108] A "usage status recording means" is a function that records how users use the system and uses that information to optimize business processes and improve services.

[0109] The system for implementing this invention is configured as a concierge robot that integrates various technologies.

[0110] First, when a user stands in front of the system, the terminal's built-in camera captures the user's face, and facial recognition technology is used to identify the individual. The facial recognition algorithm then analyzes the data and compares it with the user's personal information. If the user is already registered, personalized responses can be provided based on their history.

[0111] Next, the device uses the user's facial expressions to activate its emotion recognition mechanism. This involves applying an emotion analysis algorithm to determine the user's mental state. Based on the emotion recognition results, the tone of response and conversation style are adjusted. For example, if the device determines that the user is nervous, it will respond in a friendly and reassuring tone.

[0112] Subsequently, voice input from the user is acquired via a microphone and converted to text by a speech recognition system. The text data is then analyzed using a generative AI model to understand the user's specific requests. Based on this analysis, the server uses the generative model to create the optimal response.

[0113] Furthermore, the multilingual support means translates the generated responses into the user's native language and provides them as voice or text. This process ensures that users who speak different languages can receive efficient support.

[0114] For example, if a user says by voice input, "Please tell me the nearest public facility," the device can obtain location information using its facility guidance means and guide the user to nearby facilities. The response includes specific map information that the user can check visually.

[0115] Examples of software and hardware used include "face recognition algorithms" and "emotion analysis algorithms" for face recognition and emotion recognition, a "speech recognition engine" for speech recognition, a "generative AI engine" for generative AI models, and a "language translation API" for multilingual support.

[0116] Examples of prompt statements to be input to a generative AI model include the following:

[0117] "Please explain how to guide visitors who ask about nearby public facilities."

[0118] "Please ask users if there are any events that require specific notifications."

[0119] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0120] Step 1:

[0121] The device captures the user's face with its camera and acquires image data. This image data is analyzed using facial recognition technology and compared with the user's personal information database. In this process, the image input by the camera is analyzed by a facial recognition algorithm and matched with existing user data. As a result, the user's personal information is output.

[0122] Step 2:

[0123] The device uses the user's facial expression data to activate its emotion recognition mechanism. The device obtains emotions from facial expressions and applies an emotion analysis algorithm to determine the user's mental state. The user's facial image is used as input, and the emotional state is output. This process selects a response tone appropriate to the user's emotions.

[0124] Step 3:

[0125] The user requests information by voice. The device's microphone acquires the voice input, and a speech recognition system converts this voice data into text. Here, the input is the user's voice information, and the output is generated as text data. The transcribed data forms the basis for the next analysis step.

[0126] Step 4:

[0127] The server uses a generative AI model to analyze the text data and understand the user's request. Based on the analysis results, it generates the optimal response. The input is the text-based request, and the output is the generated response text. Based on this result, the server provides information tailored to the user.

[0128] Step 5:

[0129] The generated text response is translated into the user's language using a multilingual support system. The server uses a language translation API to convert the generated text into the user's native language and outputs it as audio or text. This allows for accurate and easily understandable responses to multilingual users.

[0130] Step 6:

[0131] The terminal uses a facility guidance system to obtain the user's location information and provides detailed information about relevant facilities. In this step, the user's location data is input, and the output provides guidance information about facilities best suited to the user. This function enables more precise service guidance based on the user's current location.

[0132] Step 7:

[0133] The server collects and stores all interaction data using data storage and analysis tools. This data is analyzed for continuous operational efficiency improvements. Input includes details of all interactions with the user, and output provides analytical data to improve response accuracy.
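The seven steps above can be sketched as a single pipeline, under the assumption that each component is supplied as a replaceable function. All names below (`Interaction`, `ReceptionPipeline`, and the callable fields) are hypothetical, and passing the emotion label into the generation step is an illustrative choice, not something the steps prescribe.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Interaction:
    """Values produced by the steps (field names are illustrative)."""
    user_id: Optional[str] = None
    emotion: Optional[str] = None
    request_text: Optional[str] = None
    response_text: Optional[str] = None
    translated_text: Optional[str] = None
    guidance: Optional[str] = None


@dataclass
class ReceptionPipeline:
    # Each callable stands in for a component that the text only names.
    identify: Callable[[bytes], str]             # face image -> user id
    detect_emotion: Callable[[bytes], str]       # face image -> emotion label
    transcribe: Callable[[bytes], str]           # audio -> text
    generate: Callable[[str, str], str]          # (text, emotion) -> response
    translate: Callable[[str, str], str]         # (response, language) -> text
    guide: Callable[[Tuple[float, float]], str]  # location -> facility guidance
    log: List[Interaction] = field(default_factory=list)

    def run(self, image: bytes, audio: bytes, language: str,
            location: Tuple[float, float]) -> Interaction:
        record = Interaction()
        record.user_id = self.identify(image)            # Step 1
        record.emotion = self.detect_emotion(image)      # Step 2
        record.request_text = self.transcribe(audio)     # Step 3
        record.response_text = self.generate(            # Step 4
            record.request_text, record.emotion)
        record.translated_text = self.translate(         # Step 5
            record.response_text, language)
        record.guidance = self.guide(location)           # Step 6
        self.log.append(record)                          # Step 7
        return record
```

Because every stage is injected, a stub such as `lambda audio: "nearest public facility"` can replace the speech recognition engine during testing.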

[0134] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0135] This invention is a system that efficiently handles reception duties for corporations and public institutions using AI technology. Utilizing an emotion engine, this system achieves advanced customer service through user facial recognition, emotion analysis, speech recognition, response generation, multilingual support, and data storage and analysis. The specific implementation of the system is described below.

[0136] When a user appears in front of the device, the device uses facial recognition to capture the user's face. The facial data is sent to a server, where it is compared with an existing database to obtain the user's personal information. For new users, the information is recorded in the system and used to improve future services.
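As a rough illustration of the matching and registration described above, the following sketch assumes that the facial data has already been reduced to a numeric feature vector and that matching uses cosine similarity with a fixed threshold. Both are assumptions; the disclosure does not specify the matching method. The class `UserRegistry` and the identifier format are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Embedding = List[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class UserRegistry:
    """In-memory stand-in for the server-side user database."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed value, would need tuning
        self._users: Dict[str, Embedding] = {}

    def match(self, probe: Embedding) -> Optional[str]:
        """Return the best registered user at or above the threshold."""
        best_id, best_score = None, self.threshold
        for user_id, stored in self._users.items():
            score = cosine_similarity(probe, stored)
            if score >= best_score:
                best_id, best_score = user_id, score
        return best_id

    def identify_or_register(self, probe: Embedding) -> Tuple[str, bool]:
        """Return (user_id, is_new); unknown faces are registered."""
        user_id = self.match(probe)
        if user_id is not None:
            return user_id, False
        user_id = f"user-{len(self._users) + 1}"
        self._users[user_id] = list(probe)
        return user_id, True
```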

[0137] The emotion engine analyzes emotions in real time across multiple dimensions based on facial recognition data and the user's voice characteristics. It extracts changes in the user's facial expressions and voice tone as features, and the resulting emotional state directly participates in the subsequent response generation process. For example, if the user is feeling anxious, the device will generate a more reassuring response.
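One simple way to combine facial and vocal features, shown here only as a sketch, is a weighted average of per-emotion scores from each modality. The weighting scheme, the weight value, and the function name `fuse_emotions` are assumptions made for illustration; the disclosure does not state how the emotion engine combines its inputs.

```python
from typing import Dict, Tuple


def fuse_emotions(face: Dict[str, float], voice: Dict[str, float],
                  face_weight: float = 0.6) -> Tuple[str, Dict[str, float]]:
    """Blend per-emotion scores from two modalities and pick the strongest.

    A label missing from one modality is treated as a score of 0.0.
    """
    labels = set(face) | set(voice)
    if not labels:
        raise ValueError("at least one emotion score is required")
    fused = {
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes the choice deterministic when scores tie.
    top = max(sorted(fused), key=lambda label: fused[label])
    return top, fused
```

For example, facial scores of `{"anxiety": 0.7, "joy": 0.2}` and vocal scores of `{"anxiety": 0.5, "anger": 0.4}` yield "anxiety" as the dominant emotion under the default weight.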

[0138] The user inputs questions and requests into the terminal via voice. The terminal uses speech recognition to convert the voice into text, which is then analyzed by a generative AI model. The analysis results are used to retrieve information from the server, providing the most appropriate information. The server searches its database for the necessary data and provides it to the terminal.

[0139] The generated responses are adjusted based on the user's emotional state. The emotion engine optimizes each response according to emotional data, creating natural and empathetic communication. Multilingual support automatically translates the response according to the user's language.

[0140] All interactions are saved as logs on the server and analyzed by a machine learning engine. This allows for further accuracy improvements by taking into account the user's individual emotional history. For example, if a user who was previously dissatisfied returns, the device can take their previous emotional data into consideration and strive to provide particularly helpful service.
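The use of a returning user's emotional history could be sketched as follows, assuming that each logged interaction stores one emotion label per visit. The class `InteractionLog`, its methods, and the set of "negative" labels are hypothetical, and the machine learning analysis itself is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import DefaultDict, List, Optional, Tuple


@dataclass(frozen=True)
class LogEntry:
    user_id: str
    emotion: str
    request: str
    timestamp: datetime


class InteractionLog:
    """In-memory stand-in for the server-side interaction log."""

    def __init__(self) -> None:
        self._by_user: DefaultDict[str, List[LogEntry]] = defaultdict(list)

    def record(self, user_id: str, emotion: str, request: str,
               timestamp: Optional[datetime] = None) -> LogEntry:
        entry = LogEntry(user_id, emotion, request,
                         timestamp or datetime.now(timezone.utc))
        self._by_user[user_id].append(entry)
        return entry

    def last_emotion(self, user_id: str) -> Optional[str]:
        """Emotion recorded on the most recent visit, or None if unknown."""
        entries = self._by_user.get(user_id)
        return entries[-1].emotion if entries else None

    def needs_extra_care(self, user_id: str,
                         negative: Tuple[str, ...] = ("dissatisfied", "angry")) -> bool:
        """True if the user's previous visit ended on a negative emotion."""
        return self.last_emotion(user_id) in negative
```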

[0141] In this way, by using AI that integrates multiple functions, it is possible to achieve both high-quality customer service and improved operational efficiency simultaneously.

[0142] The following describes the processing flow.

[0143] Step 1:

[0144] When a user approaches the device, the device captures the user's face with its built-in camera. It then performs facial recognition and uses the captured image data to identify the user.

[0145] Step 2:

[0146] The device sends facial recognition data to the server, which compares it with an existing user database. If the user is already registered, the server returns the relevant personal information to the device. For new users, the initial registration process begins.

[0147] Step 3:

[0148] The device uses an emotion engine to analyze facial expression data obtained from the user's face. It also analyzes voice tone and other emotional indicators to determine the user's emotional state.

[0149] Step 4:

[0150] The user verbally communicates questions or requests to the device. The device receives the audio through its microphone and converts the audio data into text using speech recognition technology.

[0151] Step 5:

[0152] The device uses a generative AI model to analyze the content of the transcribed questions and processes them to understand the user's intent.

[0153] Step 6:

[0154] Based on the analysis results, the terminal requests the necessary data from the server. The server retrieves the requested information from its database and returns it to the terminal.

[0155] Step 7:

[0156] The device generates responses based on information and emotional data. The emotion engine considers the user's current emotional state and appropriately adjusts the tone and content of the generated responses.

[0157] Step 8:

[0158] The device uses its multilingual support function to translate responses into the user's selected language and output them as audio. This ensures that information is provided in a way that is easy for the user to understand.

[0159] Step 9:

[0160] The server logs all interaction data and uses machine learning algorithms to analyze it for continuous system improvement, thereby enhancing the user experience.

[0161] (Example 2)

[0162] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0163] In reception services at corporations and public institutions, smooth communication with visitors is essential, but traditional methods have limitations in terms of quality and efficiency. In particular, current technology may not be able to provide sufficient satisfaction in situations requiring multilingual support and consideration for visitors' feelings. Furthermore, there is a lack of mechanisms to effectively utilize visitor information and continuously improve services.

[0164] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0165] In this invention, the server includes adaptive means for adjusting responses according to the user's emotional state, acquisition means for obtaining information from a database, and storage means for accumulating and learning from the information obtained from each of the means. This enables natural and friendly responses that take into account the visitor's emotions, smooth communication in multiple languages, and continuous service improvement based on accumulated data.

[0166] "Identification means for recognizing faces" refers to a device or technology for identifying a user's face and capturing its characteristics as data.

[0167] "Emotional evaluation analysis means" refers to a device or technology for determining a user's emotional state in real time based on their facial expressions and vocal characteristics.

[0168] "Recording means for converting voice input into text and analyzing information requests" refers to a device or technology that converts a user's voice into text data, analyzes its content, and understands the information request.

[0169] A "generative model-based response method" refers to a machine learning model or technique used to create an appropriate response based on an analyzed information request.

[0170] A "translation means that translates and provides responses in various languages" refers to a device or technology for translating the generated response into the user's native language or a specified language and providing it.

[0171] "Storage means for accumulating information and learning" refers to a device or technology for storing acquired data and using it to improve the performance of the system.

[0172] "An adaptive means that adjusts responses according to emotional state" refers to a device or technology that dynamically changes the tone and content of a response based on the user's emotional information.

[0173] "Acquisition means for obtaining information" refers to a device or technology for retrieving necessary information from databases or other information sources based on user requests.

[0174] To implement this invention, the following hardware and software are used. The terminal is a device equipped with a camera and microphone that captures the user's face and voice. The server is a computer system with high-speed data processing capabilities that performs functions such as face recognition, emotion analysis, voice analysis, and database access.

[0175] When a user stands in front of the device, the terminal uses its camera to capture their face. The facial data is sent to a server and compared against a database. Based on this facial data, the server uses identification methods to retrieve personal information from an existing database.

[0176] In addition, the user's facial expressions and voice data are analyzed using an emotional evaluation tool to determine their emotional state. The server converts the user's voice input into text using speech recognition technology and interprets the information request using a generative AI model. Based on this analysis, the server retrieves the necessary information from the database and performs multilingual translation using a conversion tool.

[0177] A notable feature is that the server is equipped with adaptive mechanisms to adjust its response to the user's emotional state. This enables empathetic and natural interactions. Furthermore, it uses memory mechanisms to accumulate dialogue data, allowing for more refined service on subsequent visits.

[0178] For example, if a user visits the reception desk of a facility and speaks into a terminal saying, "I'd like to know the facility's event schedule," the voice is converted into text, analyzed by the server, and then the most appropriate response is generated. Furthermore, if the user appears confused, supplementary information can be provided, such as, "If you could tell us more specific dates and times, we can provide you with more detailed information."

[0179] An example of a prompt message would be, "Please tell me the date and time of the event. Also, please take an appropriate approach based on a specific sentiment analysis." In this way, a practical and interactive reception system can be realized.
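A prompt that carries both the request and the sentiment analysis result could be assembled as in the following sketch. The template wording, field order, and the function name `build_prompt` are assumptions for illustration and are not the prompt format used by any particular generative AI model.

```python
def build_prompt(request_text: str, emotion: str, language: str) -> str:
    """Combine the transcribed request with emotion and language hints."""
    lines = [
        "You are a reception assistant.",
        f"Visitor request: {request_text.strip()}",
        f"Detected emotion: {emotion}",
        f"Answer in: {language}",
        "Adjust the tone of the answer to the detected emotion.",
    ]
    return "\n".join(lines)
```

Keeping the emotion on its own line makes it straightforward to log or audit what the model was told about the visitor's state.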

[0180] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0181] Step 1:

[0182] When a user stands in front of the device, the terminal activates its camera and captures the user's face. The input at this time is image data of the user's face, and the output is face data that is sent to the server. The face data is transferred to the server, which then checks against a database to obtain the user's personal information.

[0183] Step 2:

[0184] The server analyzes the user's emotional state using an analysis method that evaluates emotions based on facial recognition data. The input is facial and voice feature data, and the output is the analyzed emotional state. The server analyzes changes in facial expressions and voice to determine what emotional state the user is in.

[0185] Step 3:

[0186] The user inputs questions or requests into the terminal using voice. The terminal converts the input voice data into text using a speech recognition engine. In this case, the input is the user's voice data, and the output is text data. This converted text data is sent to the server.

[0187] Step 4:

[0188] The server uses a generative AI model to analyze text data and understand the user's intent. The input is text data, and the output is the analyzed information request. The server then uses the generative AI model to apply prompts and determine the optimal response based on the user's request.

[0189] Step 5:

[0190] The server retrieves the necessary information from the database based on the analysis results. The input in this step is the analyzed information request, and the output is the retrieved information data. The server searches the database and extracts the exact information the user is looking for.

[0191] Step 6:

[0192] Based on the acquired information, the server generates a response using its multilingual translation function. The input in this process consists of informational data and the user's language settings, while the output is the translated response data. The server converts the information into the specified language and formats it as a response.

[0193] Step 7:

[0194] The terminal provides the user with translated responses sent from the server. The input is the translated response data, and the output is information provided to the user in either voice or text. The terminal uses speech synthesis technology to respond to the user in a natural-sounding manner.
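The exchange between the terminal and the server in the steps above can be illustrated with two message types, assuming JSON is used as the wire format. The disclosure does not specify a transport or message schema, so `TextRequest`, `TranslatedResponse`, and their fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextRequest:
    """Step 3 output: transcribed text sent from the terminal to the server."""
    session_id: str
    text: str
    language: str


@dataclass
class TranslatedResponse:
    """Step 6 output: translated response sent from the server to the terminal."""
    session_id: str
    text: str
    language: str
    output_mode: str  # "voice" or "text", as chosen for Step 7


_TYPES = {cls.__name__: cls for cls in (TextRequest, TranslatedResponse)}


def encode(message) -> str:
    """Serialize a message with a type tag so the receiver can rebuild it."""
    return json.dumps(
        {"type": type(message).__name__, "body": asdict(message)},
        ensure_ascii=False,  # keep non-ASCII text readable on the wire
    )


def decode(payload: str):
    """Rebuild a message object from its JSON form."""
    data = json.loads(payload)
    return _TYPES[data["type"]](**data["body"])
```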

[0195] (Application Example 2)

[0196] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0197] In today's retail environment, personalized service and multilingual support are required for visiting customers, but traditional methods make it difficult to efficiently achieve both simultaneously. Furthermore, there is a lack of natural, real-time responses based on customer emotions and language, hindering the delivery of a high-quality customer experience.

[0198] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0199] In this invention, the server includes biometric authentication means, emotion analysis means, and voice analysis means. This enables personalized responses tailored to each customer visiting the store, as well as responses in appropriate language on the spot. In particular, by adapting responses in real time based on the customer's emotional state, it becomes possible to provide a natural and empathetic customer experience.

[0200] "Biometric authentication means" refers to technology that uses a user's face or other physical characteristics to identify an individual and obtain related personal information.

[0201] "Emotion analysis means" refers to technology that analyzes a user's facial expressions and vocal characteristics to identify their emotional state.

[0202] "Voice analysis means" refers to technology that converts voice input from a user into text, analyzes its content, and understands the user's information request.

[0203] "Response formation means using generative artificial intelligence" refers to artificial intelligence technology for generating the optimal response based on analysis results.

[0204] "Multilingual translation means" refers to technology for translating the generated response into the user's language and providing it appropriately.

[0205] "Information storage and analysis means" refers to technologies that store data acquired from various means, analyze it through machine learning, and use it to improve services.

[0206] "Language detection means" refers to technology that identifies the user's language and optimizes communication for that language.

[0207] "Response adaptation means" is a technology that adjusts responses in real time based on the user's emotional state, enabling more appropriate communication.

[0208] In one embodiment of this invention, the server first recognizes the user's face using biometric authentication means and identifies the individual. This allows the server to retrieve relevant personal information from a database. Using the recognized facial data, the server analyzes the user's emotional state using emotion analysis means. Technologies such as EmotionEngine can be used for emotion analysis.

[0209] Next, the terminal receives voice input from the user through the microphone. This voice is converted into text data by a voice analysis system, and the user's request is analyzed using natural language processing. At this stage, a response formation system using generative artificial intelligence generates an appropriate response to the user's request. A generative AI model is used for this generation.

[0210] The generated response is translated into the user's language by a multilingual translation system. Since the user's language is pre-identified by a language detection system, accurate translation is possible. Finally, the response is output to the user, but the response adaptation system adjusts the tone and content in real time according to the user's emotional state.
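As a toy illustration of the language detection step, the following sketch guesses a language from the Unicode script of the transcribed text. This heuristic is an assumption made purely for illustration: it cannot distinguish languages that share a script, and a practical language detection means would more likely rely on a trained model or on the speech signal itself. The mapping `SCRIPT_TO_LANGUAGE` and the function `detect_language` are hypothetical.

```python
import unicodedata
from collections import Counter

# Rough mapping from Unicode character-name prefixes to language codes.
# Deliberately incomplete; scripts shared by many languages map to one guess.
SCRIPT_TO_LANGUAGE = {
    "HIRAGANA": "ja",
    "KATAKANA": "ja",
    "HANGUL": "ko",
    "CYRILLIC": "ru",
    "LATIN": "en",
}


def detect_language(text: str, default: str = "en") -> str:
    """Return the language code whose script appears most often in the text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, and punctuation
        name = unicodedata.name(ch, "")
        for prefix, language in SCRIPT_TO_LANGUAGE.items():
            if name.startswith(prefix):
                counts[language] += 1
                break
    return counts.most_common(1)[0][0] if counts else default
```

The detected code would then be passed to the translation step as the target language.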

[0211] As a concrete example, if a customer visiting a store asks, "Where can I buy this product?", the terminal recognizes the language and emotion, generates product information, and provides guidance in the appropriate language. A prompt such as "Generate an answer to the question 'Where can I buy this product?'" is input into the generative AI model, leading to a natural response. This entire process results in a high-quality and personalized customer experience.

[0212] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0213] Step 1:

[0214] The server uses biometric authentication to recognize the user's face using image data from the camera as input. Using the recognized facial data, personal information is retrieved from the database to identify individual users.

[0215] Step 2:

[0216] The server proceeds to the emotion analysis means, which receives facial feature data and voice characteristics as input. By analyzing this data, the user's emotional state is identified. Based on the user's facial expressions and voice tone, the emotion engine measures emotions such as anger, anxiety, and joy.

[0217] Step 3:

[0218] The device converts the voice input received from the user via the microphone into text using the voice analysis means. The converted text data is then analyzed using natural language processing to understand the user's request. Based on the text analysis, the user's questions and requests are interpreted.

[0219] Step 4:

[0220] The server then proceeds to the response formation means using generative artificial intelligence. The text data generated in step 3 is given to the generative AI model as a prompt, and the model generates the optimal response. For example, in response to the question "Where can I buy this product?", a response based on the product's location information is generated.

[0221] Step 5:

[0222] The server translates the response using a multilingual translation tool. After identifying the user's language using a language detection tool, it translates the response into the appropriate language and prepares it.

[0223] Step 6:

[0224] The device uses response adaptation mechanisms to adjust its response in real time based on the emotional state detected in step 2. The final response to the user is modified to be more empathetic and natural according to the emotion and then provided to the user.

[0225] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0226] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0227] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0228] [Second Embodiment]

[0229] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0230] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0231] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0232] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0233] The microphone 238 accepts instructions from the user 20 by receiving the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0234] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0235] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0236] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0237] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0238] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0239] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0240] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0241] This invention is an AI concierge robot system designed to streamline reception operations in corporations and public institutions. At the heart of the system is a terminal that interacts with the user, integrating a wide range of functions including facial recognition, emotion analysis, speech recognition, response generation, and multilingual support. The specific implementation of the system is described below.

[0242] First, when a user stands in front of the device, the device captures the user's face with its built-in camera. This image data is analyzed by a facial recognition algorithm and then queried to a server to be matched with the user's personal information. For already registered users, their history information can be used to provide more personalized service. For new users, the registration process begins, and the acquired information is stored on the server.

[0243] Next, the device analyzes the user's facial expressions to recognize emotions. By recognizing the emotional state, it can determine whether the user is relaxed or stressed, and select a response tone appropriate to that state. For example, if the user is tense, a more friendly and reassuring tone will be used.

[0244] The device has a built-in microphone that accepts voice input from the user. Using speech recognition technology, the device converts the voice into text, and a generative AI model analyzes the content. This analysis allows the device to understand the user's specific requests and questions. The server then searches for the necessary information based on the analysis results and generates the optimal answer.

[0245] The generated responses are adapted to the user's language using multilingual support, ensuring that accurate information is provided to users who speak different languages. For example, an English-speaking user will receive an audio response in English.

[0246] All interactions are logged and stored on the server, and a machine learning engine periodically analyzes the data to improve future response accuracy. In this way, the system continuously learns and evolves through all interactions with the user. For example, if a visitor asks, "Where is the nearest ATM?", the terminal will determine the user's location and guide them to the nearest ATM on a map. Providing this information via voice makes the guidance quick and efficient.
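The "nearest ATM" lookup in the example above can be sketched with the haversine great-circle distance, assuming that the user's location and each facility are available as latitude and longitude in degrees. The facility tuple layout and the function names are hypothetical, and a deployed system might instead use walking distance from a map service.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

Facility = Tuple[str, float, float]  # (name, latitude, longitude)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_facility(user: Tuple[float, float],
                     facilities: List[Facility]) -> Facility:
    """Return the facility closest to the user's (latitude, longitude)."""
    if not facilities:
        raise ValueError("no facilities to search")
    return min(
        facilities,
        key=lambda f: haversine_km(user[0], user[1], f[1], f[2]),
    )
```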

[0247] The following describes the processing flow.

[0248] Step 1:

[0249] The device activates its camera and captures the user's face. The captured image is then analyzed using a built-in facial recognition algorithm.

[0250] Step 2:

[0251] The device sends the facial recognition results to the server, which compares them with past records to identify the user. The server then refers to an existing database to identify the user information and sends it back to the device. For new users, the information is recorded, and the initial registration process is performed.

[0252] Step 3:

[0253] The device captures facial expression data from the user's face and uses emotion recognition to determine the user's current emotional state. This prepares the device to appropriately adjust the tone of conversation.

[0254] Step 4:

[0255] The user communicates questions and requests by voice to the device. The voice input is transmitted to the device via the microphone.

[0256] Step 5:

[0257] The device uses a speech recognition algorithm to convert the user's voice into text data. The converted text is then parsed to understand the user's request.

[0258] Step 6:

[0259] The device then uses a generative AI model to initiate a process that generates an appropriate response based on the analyzed request.

[0260] Step 7:

[0261] The terminal sends a request to the server to obtain the necessary information. The server retrieves the appropriate information from the database and provides it to the terminal.

[0262] Step 8:

[0263] The device uses its multilingual support feature to translate the generated response into the user's preferred language.

[0264] Step 9:

[0265] The terminal uses speech synthesis to output the system-generated response as audio and convey it to the user.

[0266] Step 10:

[0267] Once the interaction is complete, the server logs all interactions and analyzes the data to improve accuracy in the future.

[0268] (Example 1)

[0269] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0270] Traditional reception systems provided a uniform approach to users, making it difficult to offer services tailored to individual needs. Furthermore, their inability to support multiple languages meant they could not adequately assist users who spoke different languages. Additionally, learning and improvement based on collected data was difficult, resulting in a lack of an efficient feedback loop for improving service quality.

[0271] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0272] In this invention, the server includes biometric authentication means, emotion detection means, voice conversion means, response formation means, language conversion means, information storage and analysis means, and location identification means. This enables the provision of flexible services tailored to the individual needs of users and provides accurate information to users who speak different languages. Furthermore, it enables qualitative improvement of services through the analysis of collected data and allows the system to evolve through continuous learning.

[0273] "Biometric authentication means" refers to technology that scans a user's face to obtain personal information and enables individual authentication.

[0274] "Emotion detection means" refers to technology that analyzes a user's facial expressions to determine their emotional state and selects a response based on that.

[0275] "Voice conversion means" refers to technology that converts voice input from a user into text data and analyzes the content of the request.

[0276] "Response formation means" refers to a technology that generates an appropriate response using a generative AI model based on the analyzed user request.

[0277] "Language conversion means" refers to technology that translates generated responses into multiple languages, providing accurate information to users who speak different languages.

[0278] "Information storage and analysis means" refers to technology that accumulates data obtained at each processing stage and improves system performance through machine learning.

[0279] "Location identification means" refers to technology that acquires a user's geographical location information and guides them to relevant facilities and services.

[0280] This system is an AI concierge robot system that streamlines reception operations in corporations and public institutions. Here, we will show a specific form for implementing the invention.

[0281] At the core of the system, a terminal for interacting with the user is placed. The terminal is equipped with input and output devices such as a camera, a microphone, and a speaker, and these are used to realize interaction with the user. The server is installed on the cloud or a local network and is responsible for data processing and storage.

[0282] First, when the user stands in front of the terminal, the built-in camera of the terminal captures the user's face. The face recognition algorithm analyzes this image data, and the server uses this data to verify the user's personal information through biometric authentication means. If the user is already registered, the server calls the history information to enable individual responses. In the case of a new user, the registration process is started on the terminal, and the information is saved on the server.
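
The branch between a registered user and a new user described above can be sketched as follows. This is a minimal illustration, not the implementation described in this specification: the names UserRecord, UserRegistry, and identify_or_register are hypothetical, the in-memory dictionary stands in for the server database, and face_key stands in for whatever identifier the face recognition algorithm produces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class UserRecord:
    """Personal information and history kept on the server (hypothetical shape)."""
    user_id: str
    history: List[str] = field(default_factory=list)


class UserRegistry:
    """In-memory stand-in for the server-side user database (assumption)."""

    def __init__(self) -> None:
        self._records: Dict[str, UserRecord] = {}

    def identify_or_register(self, face_key: str) -> Tuple[UserRecord, bool]:
        """Return (record, is_new) for the given face-matching key.

        A registered user gets the stored record, including history.
        An unknown key starts the registration path and stores a new record.
        """
        record = self._records.get(face_key)
        if record is not None:
            return record, False
        record = UserRecord(user_id=f"user-{len(self._records) + 1}")
        self._records[face_key] = record
        return record, True
```

On a second visit with the same key, the stored history is returned, which is what makes an individual response possible.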

[0283] Next, the terminal uses the emotion detection means to analyze the user's facial expression and estimate the user's emotional state. From this, the server determines whether the user is relaxed or stressed and selects an appropriate response tone by the response formation means.
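
The selection of a response tone from the detected emotional state can be illustrated with a simple lookup. This is a sketch under assumptions: the Emotion labels, the tone strings, and the function select_tone are hypothetical and are not specified in this document, which only states that a tone is chosen according to whether the user is relaxed or stressed.

```python
from enum import Enum


class Emotion(Enum):
    """Emotional states used in this sketch (hypothetical label set)."""
    RELAXED = "relaxed"
    TENSE = "tense"
    UNKNOWN = "unknown"


# Tone descriptions are illustrative assumptions, not values from the source.
TONE_BY_EMOTION = {
    Emotion.RELAXED: "friendly and concise",
    Emotion.TENSE: "calm and reassuring",
}
DEFAULT_TONE = "neutral and polite"


def select_tone(emotion: Emotion) -> str:
    """Map a detected emotion to a response tone, falling back to a default."""
    return TONE_BY_EMOTION.get(emotion, DEFAULT_TONE)
```

A table-driven mapping keeps the tone policy separate from the detector, so either side can be changed without touching the other.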

[0284] When the user asks a question or makes a request by voice, the terminal captures the voice with the microphone and converts it into text by the voice conversion means. The text data is analyzed using the generative AI model, and an appropriate response is generated. The server uses the language conversion means to translate the generated response into the user's language, and the translated response is output as voice through the speaker.

[0285] Also, all interactions are accumulated as logs on the server using the information storage and analysis means. This data is periodically analyzed by machine learning techniques to improve the response accuracy of the system.

[0286] As a specific example, consider the case where a user asks, "Where is the nearest café?" The terminal acquires the user's location information and searches for nearby cafés using the location identification means. The server can output the generated information in the user's language as voice and further display map information on the screen of the terminal. An example of a prompt sentence is "When a visitor asks a question about nearby facilities, how should one respond?"
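
One way to picture the nearby-café search is a great-circle distance comparison over a list of known facilities. The sketch below is an assumption about how such a search could work; the specification does not say which distance measure or data source is used. Facility, haversine_km, and nearest_facility are hypothetical names, and the coordinates in any usage are made up.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Facility:
    name: str
    category: str
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres, using a mean Earth radius of 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_facility(
    facilities: Iterable[Facility], category: str, lat: float, lon: float
) -> Optional[Facility]:
    """Return the closest facility of the requested category, or None if there is none."""
    candidates = [f for f in facilities if f.category == category]
    if not candidates:
        return None
    return min(candidates, key=lambda f: haversine_km(lat, lon, f.lat, f.lon))
```

The returned facility could then be passed to the response generation and map display steps.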

[0287] The flow of the specific process in Example 1 will be described using FIG. 11.

[0288] Step 1: User recognition

[0289] The user stands in front of the terminal. The terminal captures the user's face with the built-in camera. The input is the captured face image data, and a face recognition algorithm analyzes this. The server receives the analysis result, collates it with the database, and determines whether the user is already registered. The output is the user's personal information and past history information.

[0290] Step 2: Sentiment analysis

[0291] The terminal analyzes the user's expression using the face image obtained. The input is the user's face image data, and the emotion detection means analyzes it. The server receives the analysis result and determines the user's emotional state (relaxed, tense, etc.). The output is the user's emotional state data. For example, when the user is tense, data for changing the tone of the response is generated.

[0292] Step 3: Speech recognition

[0293] The user conveys a question or request to the terminal by voice. The terminal captures the voice with the built-in microphone. The input is the user's voice data, and this is converted into text data using voice conversion means. The output is the text data converted from the voice. This text serves as the basis for the next analysis.

[0294] Step 4: Response generation

[0295] The server receives the text data obtained through speech recognition and analyzes it using a generative AI model. The input is the text data, and the model interprets the user's request based on the prompt sentence. Based on the analysis results, the response formation means generates the optimal answer. The output is the generated response data.

[0296] Step 5: Multilingual Translation

[0297] The server applies the language conversion means to the generated response data. The input consists of the generated response data and the user's language information, and the response is translated into the user's native language. The output is the translated response data, which the terminal then outputs as audio.

[0298] Step 6: Location Information Guidance

[0299] When a user makes a location-related inquiry (for example, "Where is the nearest café?"), the device uses the location identification means. The input is the user's current location information, which the device uses to search for relevant facilities and services. The output is optimal guidance information, which the device displays on the screen as map information and also explains by audio.

[0300] Step 7: Data accumulation and analysis

[0301] After all interactions are complete, the server stores the collected data using the information storage and analysis means. The input is all the data generated at each step. Machine learning algorithms analyze this data to improve the system's accuracy and response quality. The output is analysis reports and improvement suggestions, which allow the system to be continuously improved.
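
The order of Steps 1 to 5 and Step 7 above can be sketched as a single function that chains the stages. This is only an outline under assumptions: every stage is passed in as a plain callable so that no particular recognition, generation, or translation engine is implied, and the names Stages and handle_visit are hypothetical. The conditional location step (Step 6) is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stages:
    """Pluggable stand-ins for the means used in each step (all hypothetical)."""
    recognize_face: Callable[[bytes], str]          # face image -> user id
    detect_emotion: Callable[[bytes], str]          # face image -> emotion label
    speech_to_text: Callable[[bytes], str]          # audio -> request text
    generate_response: Callable[[str, str], str]    # (request, emotion) -> response
    translate: Callable[[str, str], str]            # (response, language) -> translated
    log: Callable[[dict], None]                     # stores one interaction record


def handle_visit(stages: Stages, face_image: bytes, audio: bytes, language: str) -> str:
    """Run one interaction in the order given by the steps and return the final reply."""
    user_id = stages.recognize_face(face_image)                   # Step 1
    emotion = stages.detect_emotion(face_image)                   # Step 2
    request_text = stages.speech_to_text(audio)                   # Step 3
    response = stages.generate_response(request_text, emotion)    # Step 4
    translated = stages.translate(response, language)             # Step 5
    stages.log({                                                  # Step 7
        "user_id": user_id,
        "emotion": emotion,
        "request": request_text,
        "response": translated,
        "language": language,
    })
    return translated
```

Passing the stages in as callables makes it possible to exercise the flow with simple stubs before any real engine is connected.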

[0302] (Application Example 1)

[0303] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0304] User reception services are expected to provide multilingual support and personalized service quickly. Convenience also needs to be improved by using the user's location information to provide detailed guidance on related facilities and services. Furthermore, it is important to use the data collected through the system to improve the efficiency of the reception service.

[0305] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following respective means.

[0306] In this invention, the server includes a face recognition means for recognizing the user's face and acquiring personal information, an emotion recognition means for analyzing emotions, a voice recognition means for analyzing voice input, a multilingual support means for providing responses in multiple languages, a facility guidance means, and a usage status recording means. As a result, the system can support multiple languages, provide responses personalized to the user, and quickly guide the user to related facilities based on the user's location information. In addition, accumulating and learning from data enables continuous improvement of business efficiency.

[0307] A "user" is an individual or group that uses the system and is the entity that receives information services.

[0308] "Face recognition means" is a technology for identifying an individual's face from image data acquired using a camera.

[0309] "Emotion recognition means" is a technology for analyzing the emotional state from the user's expressions and behaviors and deriving an appropriate response based on that.

[0310] "Voice recognition means" is a process of converting voice audio into text data and analyzing information based on the result.

[0311] "Response generation means using a generative model" is a technology that utilizes a generative AI model to create an appropriate response based on voice input and other data.

[0312] "Multilingual support means" refers to a function that enables the provision of information in different languages by utilizing translation technology.

[0313] "Data storage and analysis methods" refer to technologies used to improve operations and enhance services by storing and analyzing information collected through systems.

[0314] "Facility guidance means" is a process for guiding users to the location and services of relevant facilities based on their location information.

[0315] A "usage status recording means" is a function that records how users use the system and uses that information to optimize business processes and improve services.

[0316] The system for implementing this invention is configured as a concierge robot that integrates various technologies.

[0317] First, when a user stands in front of the system, the terminal's built-in camera captures the user's face, and facial recognition technology is used to identify the individual. The facial recognition algorithm then analyzes the data and compares it with the user's personal information. If the user is already registered, personalized responses can be provided based on their history.

[0318] Next, the device uses the user's facial expressions to activate its emotion recognition mechanism. This involves applying an emotion analysis algorithm to determine the user's mental state. Based on the emotion recognition results, the tone of response and conversation style are adjusted. For example, if the device determines that the user is nervous, it will respond in a friendly and reassuring tone.

[0319] Subsequently, voice input from the user is acquired via a microphone and converted to text by a speech recognition system. The text data is then analyzed using a generative AI model to understand the user's specific requests. Based on this analysis, the server uses the generative model to create the optimal response.

[0320] Furthermore, the multilingual support means translates the generated response into the user's native language and provides it as either voice or text. This process ensures that users who speak different languages can receive efficient support.

[0321] For example, if a user says by voice input, "Please tell me the nearest public facility," the device can obtain location information using the facility guidance means and guide the user to nearby facilities. The response includes specific map information, which the user can confirm visually.

[0322] Examples of software and hardware used include "face recognition algorithms" and "emotion analysis algorithms" for face recognition and emotion recognition, a "speech recognition engine" for speech recognition, a "generative AI engine" for generative AI models, and a "language translation API" for multilingual support.

[0323] Examples of prompt statements to be input to a generative AI model include the following:

[0324] "Please explain how to guide visitors who ask about nearby public facilities."

[0325] "Please ask users if there are any events that require specific notifications."
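
A prompt statement like those above would typically be combined with the recognized request and the detected context before being passed to the generative AI model. The sketch below assumes one possible layout; the function build_prompt and the field labels are hypothetical, and the specification does not define a prompt format.

```python
def build_prompt(instruction: str, request_text: str, emotion: str, language: str) -> str:
    """Assemble a plain-text prompt from an instruction and per-visit context.

    The four-line layout is an illustrative assumption, not a format taken
    from the source document.
    """
    if not request_text.strip():
        raise ValueError("request_text must not be empty")
    return "\n".join([
        instruction.strip(),
        f"Visitor request: {request_text.strip()}",
        f"Detected emotional state: {emotion}",
        f"Respond in language: {language}",
    ])
```

Keeping the instruction separate from the per-visit fields lets the same prompt statement be reused across visitors.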

[0326] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0327] Step 1:

[0328] The device captures the user's face with its camera and acquires image data. This image data is analyzed using facial recognition technology and compared with the user's personal information database. In this process, the image input by the camera is analyzed by a facial recognition algorithm and matched with existing user data. As a result, the user's personal information is output.

[0329] Step 2:

[0330] The device uses the user's facial expression data to activate its emotion recognition means. It applies an emotion analysis algorithm to the facial expressions to determine the user's mental state. The user's facial image is the input, and the emotional state is the output. Based on this result, a response tone appropriate to the user's emotions is selected.

[0331] Step 3:

[0332] The user requests information by voice. The device's microphone acquires the voice input, and a speech recognition system converts this voice data into text. Here, the input is the user's voice information, and the output is generated as text data. The transcribed data forms the basis for the next analysis step.

[0333] Step 4:

[0334] The server uses a generative AI model to analyze the text data and understand the user's request. Based on the analysis results, it generates the optimal response. The input is the text-based request, and the output is the generated response text. Based on this result, the server provides information tailored to the user.

[0335] Step 5:

[0336] The generated text response is translated into the user's language using a multilingual support system. The server uses a language translation API to convert the generated text into the user's native language and outputs it as audio or text. This allows for accurate and easily understandable responses to multilingual users.

[0337] Step 6:

[0338] The terminal uses a facility guidance system to obtain the user's location information and provides detailed information about relevant facilities. In this step, the user's location data is input, and the output provides guidance information about facilities best suited to the user. This function enables more precise service guidance based on the user's current location.

[0339] Step 7:

[0340] The server collects and stores all interaction data using data storage and analysis tools. This data is analyzed for continuous operational efficiency improvements. Input includes details of all interactions with the user, and output provides analytical data to improve response accuracy.
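
The storage and analysis in Step 7 can be pictured as writing one record per interaction and then aggregating the records. The sketch below is an assumption about the record shape: the Interaction fields and the functions to_log_line and summarize are hypothetical, and simple counting stands in for the analysis, whose method the specification does not detail.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, Iterable


@dataclass
class Interaction:
    """One logged exchange (hypothetical field set)."""
    user_id: str
    language: str
    emotion: str
    request_text: str
    response_text: str


def to_log_line(interaction: Interaction) -> str:
    """Serialize one interaction as a single JSON line."""
    return json.dumps(asdict(interaction), ensure_ascii=False)


def summarize(log_lines: Iterable[str]) -> Dict[str, object]:
    """Count interactions overall, per language, and per detected emotion."""
    records = [json.loads(line) for line in log_lines]
    return {
        "total": len(records),
        "by_language": dict(Counter(r["language"] for r in records)),
        "by_emotion": dict(Counter(r["emotion"] for r in records)),
    }
```

A summary of this kind could show, for example, which languages are requested most often.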

[0341] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0342] This invention is a system that efficiently handles reception duties for corporations and public institutions using AI technology. Utilizing an emotion engine, this system achieves advanced customer service through user facial recognition, emotion analysis, speech recognition, response generation, multilingual support, and data storage and analysis. The specific implementation of the system is described below.

[0343] When a user appears in front of the device, the device captures the user's face for facial recognition. The facial data is sent to a server, where it is compared with an existing database to obtain the user's personal information. For new users, the information is recorded in the system and used to improve future services.

[0344] The emotion engine analyzes emotions in real time across multiple dimensions based on facial recognition data and the user's voice characteristics. It extracts changes in the user's facial expressions and voice tone as features, and the resulting emotional state feeds directly into the subsequent response generation process. For example, if the user is feeling anxious, the device will generate a more reassuring response.
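
Combining facial and vocal cues into one emotional state could be done in many ways. The sketch below uses a weighted average of per-emotion scores purely as an illustration; the specification does not state how the emotion engine fuses the two sources. The function fuse_emotions, the face_weight parameter, and the score dictionaries are all hypothetical.

```python
from typing import Dict, Tuple


def fuse_emotions(
    face_scores: Dict[str, float],
    voice_scores: Dict[str, float],
    face_weight: float = 0.6,
) -> Tuple[str, Dict[str, float]]:
    """Blend two score dictionaries and return (top_label, fused_scores).

    A label missing from one source counts as 0.0 for that source.
    Labels are visited in sorted order so that ties resolve deterministically.
    """
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be between 0 and 1")
    labels = sorted(set(face_scores) | set(voice_scores))
    if not labels:
        raise ValueError("at least one emotion score is required")
    fused = {
        label: face_weight * face_scores.get(label, 0.0)
        + (1.0 - face_weight) * voice_scores.get(label, 0.0)
        for label in labels
    }
    top_label = max(labels, key=lambda label: fused[label])
    return top_label, fused
```

With equal weights, a face that looks mostly joyful and a voice that sounds mostly anxious can resolve to "anxiety" when the vocal signal is stronger.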

[0345] The user inputs questions and requests into the terminal via voice. The terminal uses speech recognition to convert the voice into text, which is then analyzed by a generative AI model. The analysis results are used to retrieve information from the server, providing the most appropriate information. The server searches its database for the necessary data and provides it to the terminal.

[0346] The generated responses are adjusted based on the user's emotional state. The emotion engine optimizes each response according to emotional data, creating natural and empathetic communication. Multilingual support automatically translates the response according to the user's language.

[0347] All interactions are saved as logs on the server and analyzed by a machine learning engine. This allows for further accuracy improvements by taking into account the user's individual emotional history. For example, if a user who was previously dissatisfied returns, the device can take their previous emotional data into consideration and strive to provide particularly helpful service.
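
The use of a returning user's emotional history can be sketched as a per-user list of past emotion labels that is consulted at the start of a visit. This is an illustration under assumptions: EmotionHistory, needs_extra_care, and the set of labels treated as negative are hypothetical, and the specification does not define how past emotions are weighed.

```python
from collections import defaultdict
from typing import DefaultDict, FrozenSet, List

# Labels treated as negative in this sketch (assumption).
NEGATIVE_EMOTIONS: FrozenSet[str] = frozenset({"dissatisfied", "angry"})


class EmotionHistory:
    """Keeps the emotion label recorded at the end of each past visit."""

    def __init__(self) -> None:
        self._visits: DefaultDict[str, List[str]] = defaultdict(list)

    def record(self, user_id: str, emotion: str) -> None:
        """Append the emotion observed for this user's latest visit."""
        self._visits[user_id].append(emotion)

    def needs_extra_care(self, user_id: str) -> bool:
        """True if the user's most recent recorded visit ended on a negative emotion."""
        visits = self._visits.get(user_id)
        return bool(visits) and visits[-1] in NEGATIVE_EMOTIONS
```

Only the most recent visit is checked here; a longer window or a decay over time would be an equally plausible design.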

[0348] In this way, by using AI that integrates multiple functions, it is possible to achieve both high-quality customer service and improved operational efficiency simultaneously.

[0349] The following describes the processing flow.

[0350] Step 1:

[0351] When a user approaches the device, the device captures the user's face with its built-in camera. It then performs facial recognition and uses the captured image data to identify the user.

[0352] Step 2:

[0353] The device sends facial recognition data to the server, which compares it with an existing user database. If the user is already registered, the server returns the relevant personal information to the device. For new users, the initial registration process begins.

[0354] Step 3:

[0355] The device uses an emotion engine to analyze facial expression data obtained from the user's face. It also analyzes voice tone and other emotional indicators to determine the user's emotional state.

[0356] Step 4:

[0357] The user verbally communicates questions or requests to the device. The device receives the audio through its microphone and converts the audio data into text using speech recognition technology.

[0358] Step 5:

[0359] The device uses a generative AI model to analyze the content of the transcribed questions and processes them to understand the user's intent.

[0360] Step 6:

[0361] Based on the analysis results, the terminal requests the necessary data from the server. The server retrieves the requested information from its database and returns it to the terminal.

[0362] Step 7:

[0363] The device generates responses based on information and emotional data. The emotion engine considers the user's current emotional state and appropriately adjusts the tone and content of the generated responses.

[0364] Step 8:

[0365] The device uses its multilingual support function to translate responses into the user's selected language and output them as audio. This ensures that information is provided in a way that is easy for the user to understand.

[0366] Step 9:

[0367] The server logs all interaction data and uses machine learning algorithms to analyze it for continuous system improvement, thereby enhancing the user experience.

[0368] (Example 2)

[0369] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0370] In reception services at corporations and public institutions, smooth communication with visitors is essential, but traditional methods have limitations in terms of quality and efficiency. In particular, current technology may not be able to provide sufficient satisfaction in situations requiring multilingual support and consideration for visitors' feelings. Furthermore, there is a lack of mechanisms to effectively utilize visitor information and continuously improve services.

[0371] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0372] In this invention, the server includes adaptive means for adjusting responses according to the user's emotional state, acquisition means for obtaining information from a database, and storage means for accumulating and learning from the information obtained from each of the means. This enables natural and friendly responses that take into account the visitor's emotions, smooth communication in multiple languages, and continuous service improvement based on accumulated data.

[0373] "Identification means for recognizing faces" refers to a device or technology for identifying a user's face and capturing its characteristics as data.

[0374] "Emotional evaluation analysis means" refers to a device or technology for determining a user's emotional state in real time based on their facial expressions and vocal characteristics.

[0375] "Recording means for converting voice input into text and analyzing information requests" refers to a device or technology that converts a user's voice into text data, analyzes its content, and understands the information request.

[0376] A "generative model-based response method" refers to a machine learning model or technique used to create an appropriate response based on an analyzed information request.

[0377] A "translation means that translates and provides responses in various languages" refers to a device or technology for translating the generated response into the user's native language or a specified language and providing it.

[0378] "Storage means for accumulating information and learning" refers to a device or technology for storing acquired data and using it to improve the performance of a system.

[0379] "An adaptive means that adjusts responses according to emotional state" refers to a device or technology that dynamically changes the tone and content of a response based on the user's emotional information.

[0380] "Means of acquiring information" refers to devices or technologies for retrieving necessary information from databases or other information sources based on user requests.

[0381] To implement this invention, the following hardware and software are used. The terminal is a device equipped with a camera and microphone that captures the user's face and voice. The server is a computer system with high-speed data processing capabilities that performs functions such as face recognition, emotion analysis, voice analysis, and database access.

[0382] When a user stands in front of the device, the terminal uses its camera to capture their face. The facial data is sent to a server and compared against a database. Based on this facial data, the server uses identification methods to retrieve personal information from an existing database.

[0383] In addition, the user's facial expressions and voice data are analyzed using the emotional evaluation analysis means to determine their emotional state. The server converts the user's voice input into text using speech recognition technology and interprets the information request using a generative AI model. Based on this analysis, the server retrieves the necessary information from the database and performs multilingual translation using the translation means.

[0384] A notable feature is that the server is equipped with adaptive means for adjusting its response to the user's emotional state. This enables empathetic and natural interactions. Furthermore, it uses the storage means to accumulate dialogue data, allowing for more refined service on subsequent visits.

[0385] For example, if a user visits the reception desk of a facility and speaks into a terminal saying, "I'd like to know the facility's event schedule," the voice is converted into text, analyzed by the server, and then the most appropriate response is generated. Furthermore, if the user appears confused, supplementary information can be provided, such as, "If you could tell us more specific dates and times, we can provide you with more detailed information."

[0386] An example of a prompt message would be, "Please tell me the date and time of the event. Also, please take an appropriate approach based on a specific sentiment analysis." In this way, a practical and interactive reception system can be realized.

[0387] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0388] Step 1:

[0389] When a user stands in front of the device, the terminal activates its camera and captures the user's face. The input at this time is image data of the user's face, and the output is face data that is sent to the server. The server checks the face data against a database to obtain the user's personal information.

[0390] Step 2:

[0391] The server analyzes the user's emotional state using an analysis method that evaluates emotions based on facial recognition data. The input is facial and voice feature data, and the output is the analyzed emotional state. The server analyzes changes in facial expressions and voice to determine what emotional state the user is in.

[0392] Step 3:

[0393] The user inputs questions or requests into the terminal using voice. The terminal converts the input voice data into text using a speech recognition engine. In this case, the input is the user's voice data, and the output is text data. This converted text data is sent to the server.

[0394] Step 4:

[0395] The server uses a generative AI model to analyze text data and understand the user's intent. The input is text data, and the output is the analyzed information request. The server then uses the generative AI model to apply prompts and determine the optimal response based on the user's request.

[0396] Step 5:

[0397] The server retrieves the necessary information from the database based on the analysis results. The input in this step is the analyzed information request, and the output is the retrieved information data. The server searches the database and extracts the exact information the user is looking for.

[0398] Step 6:

[0399] Based on the acquired information, the server generates a response using its multilingual translation function. The input in this process consists of informational data and the user's language settings, while the output is the translated response data. The server converts the information into the specified language and formats it as a response.

[0400] Step 7:

[0401] The terminal provides the user with translated responses sent from the server. The input is the translated response data, and the output is information provided to the user in either voice or text. The terminal uses speech synthesis technology to respond to the user in a natural-sounding manner.
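
The exchange between the terminal and the server in Steps 3 to 7 can be sketched as a request and a reply passed as JSON messages. Everything here is an assumption made for illustration: the message fields, the KNOWLEDGE dictionary standing in for the database, the keyword match standing in for the generative AI analysis, and the translate placeholder, which only tags the text with a language code instead of translating it.

```python
import json

# Stand-in for the server database; the entry is made-up sample data.
KNOWLEDGE = {"event schedule": "The next event starts at 10:00 on Saturday."}
FALLBACK = "Could you tell me more about what you are looking for?"


def translate(text: str, language: str) -> str:
    """Placeholder translation: returns English as-is, otherwise tags the text."""
    return text if language == "en" else f"[{language}] {text}"


def server_handle(message: str) -> str:
    """Server side: interpret the request, look up information, translate, reply."""
    request = json.loads(message)
    text = request["text"].lower()
    topic = next((key for key in KNOWLEDGE if key in text), None)
    info = KNOWLEDGE[topic] if topic is not None else FALLBACK
    return json.dumps({"text": translate(info, request["language"])}, ensure_ascii=False)


def terminal_ask(recognized_text: str, language: str) -> str:
    """Terminal side: send recognized speech to the server and return the reply text."""
    message = json.dumps({"text": recognized_text, "language": language}, ensure_ascii=False)
    reply = json.loads(server_handle(message))
    return reply["text"]
```

In a deployment the two functions would sit on different machines, with the JSON strings carried over a network connection; the direct call here only keeps the sketch self-contained.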

[0402] (Application Example 2)

[0403] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0404] In today's retail environment, personalized service and multilingual support are required for visiting customers, but traditional methods make it difficult to efficiently achieve both simultaneously. Furthermore, there is a lack of natural, real-time responses based on customer emotions and language, hindering the delivery of a high-quality customer experience.

[0405] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0406] In this invention, the server includes biometric authentication means, emotion analysis means, and voice analysis means. This enables personalized responses tailored to each customer visiting the store, as well as responses in appropriate language on the spot. In particular, by adapting responses in real time based on the customer's emotional state, it becomes possible to provide a natural and empathetic customer experience.

[0407] "Biometric authentication means" refers to technology that uses a user's face or other physical characteristics to identify an individual and obtain related personal information.

[0408] "Emotion analysis means" refers to technology that analyzes a user's facial expressions and vocal characteristics to identify their emotional state.

[0409] "Voice analysis means" refers to technology that converts voice input from a user into text, analyzes its content, and understands the user's information request.

[0410] "Response formation means using generative artificial intelligence" refers to artificial intelligence technology for generating the optimal response based on analysis results.

[0411] "Multilingual translation means" refers to technology for translating the generated response into the user's language and providing it appropriately.

[0412] "Information storage and analysis means" refers to technologies that store data acquired from various means, analyze it through machine learning, and use it to improve services.

[0413] "Language detection means" refers to technology that identifies the user's language and optimizes communication for that language.

[0414] "Response adaptation means" is a technology that adjusts responses in real time based on the user's emotional state, enabling more appropriate communication.

[0415] In one embodiment of this invention, the server first recognizes the user's face using biometric authentication means and identifies the individual. This allows the server to retrieve relevant personal information from a database. Using the recognized facial data, the server analyzes the user's emotional state using emotion analysis means. Technologies such as EmotionEngine can be used for emotion analysis.

[0416] Next, the terminal receives voice input from the user through the microphone. This voice is converted into text data by the voice analysis means, and the user's request is analyzed using natural language processing. At this stage, the response formation means using generative artificial intelligence generates an appropriate response to the user's request. A generative AI model is used for this generation.

[0417] The generated response is translated into the user's language by the multilingual translation means. Since the user's language is identified in advance by the language detection means, accurate translation is possible. Finally, the response is output to the user, with the response adaptation means adjusting the tone and content in real time according to the user's emotional state.
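
Identifying the user's language before translating can be illustrated with a rough script-based heuristic applied to the recognized text. This is a stand-in chosen for brevity, not the language detection means of this specification, which is not described in detail: it only separates Japanese, Korean, and Chinese by Unicode block and treats everything else as English. The names detect_language and answer_in_user_language are hypothetical.

```python
from typing import Callable


def detect_language(text: str) -> str:
    """Guess a language code from the Unicode blocks present in the text.

    Kana implies Japanese, Hangul syllables imply Korean, ideographs without
    kana are treated as Chinese, and anything else defaults to English.
    """
    has_ideograph = False
    for ch in text:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:    # Hiragana and Katakana
            return "ja"
        if 0xAC00 <= code <= 0xD7A3:    # Hangul Syllables
            return "ko"
        if 0x4E00 <= code <= 0x9FFF:    # CJK Unified Ideographs
            has_ideograph = True
    return "zh" if has_ideograph else "en"


def answer_in_user_language(
    request_text: str,
    generate: Callable[[str], str],
    translate: Callable[[str, str], str],
) -> str:
    """Detect the language first, then generate a response and translate it."""
    language = detect_language(request_text)
    return translate(generate(request_text), language)
```

Detecting the language once, up front, means the same code can be reused for both the translation and the speech output.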

[0418] As a concrete example, if a customer visiting a store asks, "Where can I buy this product?", the terminal recognizes the language and emotion, generates product information, and provides guidance in the appropriate language. A prompt such as "Generate an answer to the question 'Where can I buy this product?'" is input into the generative AI model, leading to a natural response. This entire process results in a high-quality and personalized customer experience.

[0419] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0420] Step 1:

[0421] The server uses biometric authentication to recognize the user's face using image data from the camera as input. Using the recognized facial data, personal information is retrieved from the database to identify individual users.

[0422] Step 2:

[0423] The server invokes the emotion analysis system, which receives facial feature data and voice characteristics as input. By analyzing this data, the user's emotional state is identified. Based on the user's facial expressions and voice tone, the emotion engine measures emotions such as anger, anxiety, and joy.

[0424] Step 3:

[0425] The device converts the voice input from the user via the microphone into text using voice analysis. The converted text data is then analyzed to understand the user's request using natural language processing. Based on the text analysis, the user's questions and requests are interpreted.

[0426] Step 4:

[0427] The server then invokes a response formation mechanism using generative artificial intelligence. Using the text data generated in step 3 as a prompt, the generative AI model generates the optimal response. For example, in response to the question "Where can I buy this product?", a response based on the product's location information is generated.

[0428] Step 5:

[0429] The server translates the response using a multilingual translation tool. After identifying the user's language using a language detection tool, it translates the response into the appropriate language and prepares it.

[0430] Step 6:

[0431] The device uses response adaptation mechanisms to adjust its response in real time based on the emotional state detected in step 2. The final response to the user is modified to be more empathetic and natural according to the emotion and then provided to the user.
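
As a non-limiting illustration, the six steps above can be read as a linear pipeline and sketched in Python as follows. The function names, the `Session` record, and the canned return values are hypothetical placeholders introduced only for this sketch; each stub stands in for the corresponding means, which in practice would call a recognition or generation model.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """State carried through steps 1 to 6 (hypothetical structure)."""
    user_id: str = ""
    emotion: str = "neutral"
    request_text: str = ""
    language: str = "en"
    response: str = ""


# Each stub below returns a canned value in place of a real model call.
def authenticate(image: bytes, s: Session) -> Session:
    s.user_id = "user-001"  # step 1: face image -> registered user
    return s


def analyze_emotion(image: bytes, audio: bytes, s: Session) -> Session:
    s.emotion = "anxiety"  # step 2: face and voice -> emotion label
    return s


def transcribe(audio: bytes, s: Session) -> Session:
    s.request_text = "Where can I buy this product?"  # step 3: voice -> text
    return s


def form_response(s: Session) -> Session:
    s.response = f"Answer to: {s.request_text}"  # step 4: generative model
    return s


def translate(s: Session) -> Session:
    s.response = f"[{s.language}] {s.response}"  # step 5: tag with target language
    return s


def adapt(s: Session) -> Session:
    # step 6: soften the response when a negative emotion was detected
    if s.emotion in {"anger", "anxiety"}:
        s.response = "No need to worry. " + s.response
    return s


def run_pipeline(image: bytes, audio: bytes) -> Session:
    s = Session()
    s = authenticate(image, s)
    s = analyze_emotion(image, audio, s)
    s = transcribe(audio, s)
    s = form_response(s)
    s = translate(s)
    return adapt(s)
```

Keeping every step behind the same `Session` signature is one possible design choice; it lets an individual means be swapped out without changing the order of the flow.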

[0432] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0433] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
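
For illustration only, the input to the data generation model 58 (a prompt plus one or more items of inference data) could be represented as in the following Python sketch. The class names, field names, and the consistency check are assumptions made for this sketch; the disclosure does not specify a data format or any validation.

```python
from dataclasses import dataclass
from typing import Literal, Union

Modality = Literal["audio", "text", "image"]


@dataclass(frozen=True)
class InferenceData:
    modality: Modality
    payload: Union[str, bytes]  # text as str, audio and image as raw bytes


@dataclass(frozen=True)
class InferenceRequest:
    prompt: str  # the instruction, e.g. "Classify the request"
    data: tuple[InferenceData, ...]


def validate(req: InferenceRequest) -> None:
    """Basic consistency checks (an assumption; not specified in the source)."""
    if not req.prompt.strip():
        raise ValueError("prompt must contain an instruction")
    for item in req.data:
        expects_text = item.modality == "text"
        if expects_text != isinstance(item.payload, str):
            raise ValueError(
                f"payload type does not match modality {item.modality}"
            )
```

A request combining a text question with a camera frame would then carry two `InferenceData` items under a single prompt.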

[0434] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0435] [Third Embodiment]

[0436] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0437] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0438] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0439] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0440] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0441] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0442] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0443] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0444] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0445] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0446] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0447] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0448] This invention is an AI concierge robot system designed to streamline reception operations in corporations and public institutions. At the heart of the system is a terminal that interacts with the user, integrating a wide range of functions including facial recognition, emotion analysis, speech recognition, response generation, and multilingual support. The specific implementation of the system is described below.

[0449] First, when a user stands in front of the device, the device captures the user's face with its built-in camera. This image data is analyzed by a facial recognition algorithm, and the result is sent to the server, where it is matched against the user's personal information. For already registered users, their history information can be used to provide more personalized service. For new users, the registration process begins, and the acquired information is stored on the server.

[0450] Next, the device analyzes the user's facial expressions to recognize emotions. By recognizing the emotional state, it can determine whether the user is relaxed or stressed, and select a response tone appropriate to that state. For example, if the user is tense, a more friendly and reassuring tone will be used.
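
The selection of a response tone from the recognized emotional state can be sketched, for example, as a simple lookup. Only the pairing of a tense user with a friendly and reassuring tone comes from the description above; the other table entry, the default tone, and the function names are hypothetical.

```python
# Mapping from recognized emotional state to response tone.
# "tense" follows the description; the remaining values are assumptions.
TONE_BY_EMOTION = {
    "tense": "friendly and reassuring",
    "relaxed": "neutral and concise",
}
DEFAULT_TONE = "polite"


def select_tone(emotion: str) -> str:
    """Return the tone for an emotion label, falling back to a default."""
    return TONE_BY_EMOTION.get(emotion.lower(), DEFAULT_TONE)


def tone_instruction(emotion: str) -> str:
    """Build an instruction sentence that could be prepended to a prompt."""
    return f"Respond in a {select_tone(emotion)} tone."
```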

[0451] The device has a built-in microphone that accepts voice input from the user. Using speech recognition technology, the device converts the voice into text, and a generative AI model analyzes the content. This analysis allows the system to understand the user's specific requests and questions. Then, the server starts a process of searching for the necessary information based on the analysis results and generating the optimal answer.

[0452] The generated responses are adapted to the user's language using multilingual support, ensuring that accurate information is provided to users who speak different languages. For example, an English-speaking user will receive an audio response in English.
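
As one possible sketch of how the response language might be chosen, the following Python fragment normalizes a detected language tag and falls back to a default when the language is not supported. The supported set, the default, and the function name are assumptions for illustration and are not specified in the embodiment.

```python
SUPPORTED_LANGUAGES = ("ja", "en", "zh", "ko")  # hypothetical supported set
DEFAULT_LANGUAGE = "en"                         # hypothetical fallback


def response_language(detected: str) -> str:
    """Reduce a tag such as 'en-US' to its primary subtag and check support."""
    code = detected.strip().lower().split("-")[0]
    return code if code in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE
```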

[0453] All interactions are logged and stored on the server, and a machine learning engine periodically analyzes the data to improve future response accuracy. In this way, the system continuously learns and evolves through all interactions with the user. For example, if a visitor asks, "Where is the nearest ATM?", the terminal will determine the user's location and guide them to the nearest ATM on a map. Providing this information via voice makes the guidance quick and efficient.

[0454] The following describes the processing flow.

[0455] Step 1:

[0456] The device activates its camera and captures the user's face. The captured image is then analyzed using a built-in facial recognition algorithm.

[0457] Step 2:

[0458] The device sends the facial recognition results to the server, which compares them with past records to identify the user. The server then refers to an existing database to identify the user information and sends it back to the device. For new users, the information is recorded, and the initial registration process is performed.

[0459] Step 3:

[0460] The device captures facial expression data from the user's face and uses emotion recognition to determine the user's current emotional state. This prepares the device to appropriately adjust the tone of conversation.

[0461] Step 4:

[0462] The user communicates questions and requests by voice to the device. The voice input is transmitted to the device via the microphone.

[0463] Step 5:

[0464] The device uses a speech recognition algorithm to convert the user's voice into text data. The converted text is then parsed to understand the user's request.

[0465] Step 6:

[0466] The device then uses a generative AI model to initiate a process that generates an appropriate response based on the analyzed request.

[0467] Step 7:

[0468] The terminal sends a request to the server to obtain the necessary information. The server retrieves the appropriate information from the database and provides it to the terminal.

[0469] Step 8:

[0470] The device uses its multilingual support feature to translate the generated response into the user's preferred language.

[0471] Step 9:

[0472] The terminal uses speech synthesis to output the system-generated response as audio and convey it to the user.

[0473] Step 10:

[0474] Once the interaction is complete, the server logs all interactions and analyzes the data to improve accuracy in the future.
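
The branch in step 2 between an already registered user and a new user can be illustrated as follows. This sketch assumes that the facial recognition algorithm outputs a numeric feature vector and that matching is done by cosine similarity against stored vectors; the threshold value, the identifier format, and the class name are hypothetical and are not taken from the embodiment.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class UserRegistry:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # assumed similarity needed for a match
        self._users: dict[str, list[float]] = {}

    def identify_or_register(self, embedding: list[float]) -> tuple[str, bool]:
        """Return (user_id, is_new) for a face feature vector."""
        best_id, best_score = None, -1.0
        for uid, stored in self._users.items():
            score = cosine(embedding, stored)
            if score > best_score:
                best_id, best_score = uid, score
        if best_id is not None and best_score >= self.threshold:
            return best_id, False  # existing user: history can be looked up
        new_id = f"user-{len(self._users) + 1:04d}"
        self._users[new_id] = embedding  # initial registration
        return new_id, True
```

A linear scan is adequate for a small registry; a larger deployment would likely need an index, which is outside the scope of this sketch.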

[0475] (Example 1)

[0476] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0477] Traditional reception systems provided a uniform approach to users, making it difficult to offer services tailored to individual needs. Furthermore, their inability to support multiple languages meant they could not adequately assist users who spoke different languages. Additionally, learning and improvement based on collected data was difficult, resulting in a lack of an efficient feedback loop for improving service quality.

[0478] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0479] In this invention, the server includes biometric authentication means, emotion detection means, voice conversion means, response formation means, language conversion means, information storage and analysis means, and location identification means. This enables the provision of flexible services tailored to the individual needs of users and provides accurate information to users who speak different languages. Furthermore, it enables qualitative improvement of services through the analysis of collected data and allows the system to evolve through continuous learning.

[0480] "Biometric authentication means" refers to technology that scans a user's face to obtain personal information and enables individual authentication.

[0481] "Emotion detection means" refers to technology that analyzes a user's facial expressions to determine their emotional state and selects a response based on that.

[0482] "Voice conversion means" refers to technology that converts voice input from a user into text data and analyzes the content of the request.

[0483] "Response formation means" refers to a technology that generates an appropriate response using a generative AI model based on the analyzed user request.

[0484] "Language conversion means" refers to technology that translates generated responses into multiple languages, providing accurate information to users who speak different languages.

[0485] "Information storage and analysis means" refers to technology that accumulates data obtained at each processing stage and improves system performance through machine learning.

[0486] "Location identification means" refers to technology that acquires a user's geographical location information and guides them to relevant facilities and services.

[0487] This system is an AI concierge robot system that streamlines reception operations in corporations and public institutions. Here, we will show a specific form for implementing the invention.

[0488] At the core of the system are terminals that interact with users. These terminals are equipped with input and output devices such as cameras, microphones, and speakers, which are used to facilitate interaction with users. Servers are located on the cloud or local network and are responsible for processing and storing data.

[0489] First, when a user stands in front of the device, the device's built-in camera captures the user's face. A facial recognition algorithm analyzes this image data, and the server uses biometric authentication to verify the user's personal information. If the user is already registered, the server retrieves their history information to enable individualized service. For new users, the registration process begins on the device, and the information is stored on the server.

[0490] Next, the terminal uses the emotion detection means to analyze the user's facial expressions and determine their emotional state. Based on this, the server determines whether the user is relaxed or stressed, and selects an appropriate response tone using the response formation means.

[0491] When a user asks a question or makes a request by voice, the device captures the audio using its microphone and converts it to text using the voice conversion means. This text data is then analyzed using a generative AI model to generate an appropriate response. The server then translates the generated response into the user's language using the language conversion means, and the terminal outputs it as audio through its speaker.

[0492] Furthermore, all interactions are logged on the server using the information storage and analysis means. This data is periodically analyzed using machine learning techniques to improve the system's response accuracy.

[0493] As a concrete example, consider a scenario where a user asks, "Where is the nearest cafe?" The device obtains the user's location information and uses location-based searching to find nearby cafes. The server can output the generated information as voice in the user's language and also display map information on the device's screen. An example of a prompt message could be something like, "How should we respond when a visitor asks about nearby facilities?"
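
The location-based search in this example can be sketched, under stated assumptions, as a nearest-neighbor lookup over a list of facilities using the haversine great-circle distance. The facility record layout and the function names are hypothetical; the embodiment does not specify how distances are computed.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(user: tuple[float, float], facilities: list[dict], category: str):
    """Return the closest facility of the given category, or None."""
    candidates = [f for f in facilities if f["category"] == category]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda f: haversine_km(user[0], user[1], f["lat"], f["lon"]),
    )
```

The returned record could then be passed to the response generation stage and rendered both as speech and as a map marker on the device's screen.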

[0494] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0495] Step 1: User Recognition

[0496] The user stands in front of the terminal. The terminal captures the user's face with its built-in camera. The input is the captured facial image data, which is analyzed by a facial recognition algorithm. The server receives the analysis results, compares them with the database, and determines whether the user is already registered. The output is the user's personal information and past history.

[0497] Step 2: Emotion Analysis

[0498] The system analyzes the user's facial expressions using facial images acquired by the terminal. The input is the user's facial image data, which is then analyzed by an emotion detection system. The server receives the analysis results and determines the user's emotional state (relaxed, tense, etc.). The output is the user's emotional state data. Specifically, if the user is tense, data is generated to change the tone of response.

[0499] Step 3: Speech Recognition

[0500] The user communicates questions and requests verbally to the device. The device captures the audio using its built-in microphone. The input is the user's voice data, which is converted into text data using a speech-to-text conversion device. The output is the text data converted from the audio. This text serves as the basis for the next analysis.

[0501] Step 4: Response Generation

[0502] The server receives text data obtained through speech recognition and analyzes it using a generative AI model. The input is text data, and the system accurately understands the user's request based on the prompt text. Based on the analysis results, a response formation mechanism generates the optimal answer. The output is the generated response data.

[0503] Step 5: Multilingual Translation

[0504] The system receives the response data generated by the server and applies the language conversion means. The input consists of the generated response data and the user's language information, and the response is translated into the user's native language. The output is the translated response data, which the terminal then outputs as audio.

[0505] Step 6: Location Information Guidance

[0506] When a user makes a location-related inquiry (for example, "Where is the nearest cafe?"), the device uses location-determining means. The input is the user's current location information, and the device uses this location information to search for relevant facilities and services. The output is optimal guidance information, which the device displays on the screen as map information and also provides audio explanations.

[0507] Step 7: Data accumulation and analysis

[0508] After all interactions are complete, the server stores the collected data using the information storage and analysis means. The input is all the data generated at each step. Machine learning algorithms analyze this data to improve the system's accuracy and response quality. The output is analysis reports and improvement suggestions, which allow the system to be continuously improved.
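
The accumulation and analysis in step 7 can be illustrated by the following minimal Python sketch, which stores each interaction as a JSON line and produces simple counts. The record fields and function names are hypothetical, and plain counting is used here in place of the machine learning analysis, which the embodiment does not detail.

```python
import json
from collections import Counter


def log_interaction(log: list[str], *, user_id: str, language: str,
                    emotion: str, request: str, response: str) -> None:
    """Append one interaction to the log as a JSON line."""
    log.append(json.dumps({
        "user_id": user_id, "language": language, "emotion": emotion,
        "request": request, "response": response,
    }, ensure_ascii=False))


def summarize(log: list[str]) -> dict:
    """Aggregate the log into counts by language and by emotion."""
    records = [json.loads(line) for line in log]
    return {
        "total": len(records),
        "by_language": dict(Counter(r["language"] for r in records)),
        "by_emotion": dict(Counter(r["emotion"] for r in records)),
    }
```

Such a summary could serve as one input to an analysis report, for example to show which languages are requested most often.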

[0509] (Application Example 1)

[0510] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0511] In user reception operations, there is a need to provide multilingual support and personalized services quickly. Furthermore, it is necessary to improve convenience by utilizing user location information to guide users to details about relevant facilities and services. In addition, it is important to utilize data collected through the system to streamline reception operations.

[0512] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0513] In this invention, the server includes a facial recognition means for recognizing the user's face and acquiring personal information, an emotion recognition means for analyzing emotions, a speech recognition means for analyzing voice input, a multilingual support means for providing responses in multiple languages, a facility guidance means, and a usage status recording means. This enables multilingual support and personalized responses tailored to the user, and allows for rapid guidance to relevant facilities based on the user's location information. Furthermore, by accumulating and learning from data, continuous operational efficiency improvements can be achieved.

[0514] A "user" is an individual or group that utilizes a system and is the entity that receives information services.

[0515] "Facial recognition means" refers to technology for identifying an individual's face from image data acquired using a camera.

[0516] "Emotion recognition means" refers to technology that analyzes a user's emotional state from their facial expressions and behavior, and then derives an appropriate response based on that analysis.

[0517] "Speech recognition means" refers to a process that converts speech sounds into text data and analyzes information based on the results.

[0518] "Response generation means using a generative model" refers to a technology that utilizes a generative AI model to create an appropriate response based on voice input or other data.

[0519] "Multilingual support means" refers to a function that enables the provision of information in different languages by utilizing translation technology.

[0520] "Data storage and analysis means" refers to technology used to improve operations and enhance services by storing and analyzing information collected through the system.

[0521] "Facility guidance means" refers to a process for guiding users to the location and services of relevant facilities based on their location information.

[0522] A "usage status recording means" is a function that records how users use the system and uses that information to optimize business processes and improve services.

[0523] The system for implementing this invention is configured as a concierge robot that integrates various technologies.

[0524] First, when a user stands in front of the system, the terminal's built-in camera captures the user's face, and facial recognition technology is used to identify the individual. The facial recognition algorithm then analyzes the data and compares it with the user's personal information. If the user is already registered, personalized responses can be provided based on their history.

[0525] Next, the device uses the user's facial expressions to activate its emotion recognition mechanism. This involves applying an emotion analysis algorithm to determine the user's mental state. Based on the emotion recognition results, the tone of response and conversation style are adjusted. For example, if the device determines that the user is nervous, it will respond in a friendly and reassuring tone.

[0526] Subsequently, voice input from the user is acquired via a microphone and converted to text by a speech recognition system. The text data is then analyzed using a generative AI model to understand the user's specific requests. Based on this analysis, the server uses the generative model to create the optimal response.

[0527] Furthermore, the multilingual support means translates the generated responses into the user's native language and provides them as either voice or text. This process ensures that users who speak different languages can receive efficient support.

[0528] For example, if a user uses voice input to say, "Please tell me the nearest public facility," the device can obtain location information using its facility guidance system and guide the user to nearby facilities. The response includes specific map information, which can be visually confirmed.

[0529] Examples of software and hardware used include "face recognition algorithms" and "emotion analysis algorithms" for face recognition and emotion recognition, a "speech recognition engine" for speech recognition, a "generative AI engine" for generative AI models, and a "language translation API" for multilingual support.

[0530] Examples of prompt statements to be input to a generative AI model include the following:

[0531] "Please explain how to guide visitors who ask about nearby public facilities."

[0532] "Please ask users if there are any events that require specific notifications."
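
As a non-limiting sketch, a prompt of this kind could be assembled from the recognized request, the detected language, and the estimated emotion as shown below. The template wording, the parameter names, and the function name are assumptions introduced for illustration only.

```python
def build_prompt(question: str, *, language: str, emotion: str,
                 context: str = "") -> str:
    """Assemble a prompt for a generative AI model (hypothetical template)."""
    lines = [
        "You are a reception concierge.",
        f"The visitor appears {emotion}; reply in language code '{language}'.",
    ]
    if context:
        # Optional facts retrieved from the database, such as facility data.
        lines.append(f"Reference information: {context}")
    lines.append(f"Visitor request: {question}")
    return "\n".join(lines)
```

For example, `build_prompt("Please tell me the nearest public facility", language="en", emotion="nervous")` yields a three-line prompt ending with the visitor's request.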

[0533] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0534] Step 1:

[0535] The device captures the user's face with its camera and acquires image data. This image data is analyzed using facial recognition technology and compared with the user's personal information database. In this process, the image input by the camera is analyzed by a facial recognition algorithm and matched with existing user data. As a result, the user's personal information is output.

[0536] Step 2:

[0537] The device uses the user's facial expression data to activate its emotion recognition mechanism. The device obtains emotions from facial expressions and applies an emotion analysis algorithm to determine the user's mental state. The user's facial image is used as input, and the emotional state is output. This process selects a response tone appropriate to the user's emotions.

[0538] Step 3:

[0539] The user requests information by voice. The device's microphone acquires the voice input, and a speech recognition system converts this voice data into text. Here, the input is the user's voice information, and the output is generated as text data. The transcribed data forms the basis for the next analysis step.

[0540] Step 4:

[0541] The server uses a generative AI model to analyze the text data and understand the user's request. Based on the analysis results, it generates the optimal response. The input data is a text-based request, and the generated text data is output as the response. Based on this result, the server provides information tailored to the user.

[0542] Step 5:

[0543] The generated text response is translated into the user's language using a multilingual support system. The server uses a language translation API to convert the generated text into the user's native language and outputs it as audio or text. This allows for accurate and easily understandable responses to multilingual users.

[0544] Step 6:

[0545] The terminal uses a facility guidance system to obtain the user's location information and provides detailed information about relevant facilities. In this step, the user's location data is input, and the output provides guidance information about facilities best suited to the user. This function enables more precise service guidance based on the user's current location.

[0546] Step 7:

[0547] The server collects and stores all interaction data using data storage and analysis tools. This data is analyzed for continuous operational efficiency improvements. Input includes details of all interactions with the user, and output provides analytical data to improve response accuracy.
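
One way to decide whether a request should be handled by the facility guidance step is a simple keyword check on the transcribed text, sketched below. The keyword list and the route names are hypothetical, and a real system might instead let the generative AI model classify the request.

```python
import re

# Hypothetical English keywords that suggest a location-related inquiry.
LOCATION_PATTERN = re.compile(
    r"\b(nearest|nearby|closest|where is)\b", re.IGNORECASE
)


def route(request_text: str) -> str:
    """Return which processing branch should handle the request."""
    if LOCATION_PATTERN.search(request_text):
        return "facility_guidance"
    return "general_response"
```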

[0548] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0549] This invention is a system that efficiently handles reception duties for corporations and public institutions using AI technology. Utilizing an emotion engine, this system achieves advanced customer service through user facial recognition, emotion analysis, speech recognition, response generation, multilingual support, and data storage and analysis. The specific implementation of the system is described below.

[0550] When a user appears in front of the device, the device uses facial recognition to capture the user's face. The facial data is sent to a server, where it is compared with an existing database to obtain the user's personal information. For new users, the information is recorded in the system and used to improve future services.

[0551] The emotion engine analyzes emotions in real time across multiple dimensions based on facial recognition data and the user's voice characteristics. It extracts changes in the user's facial expressions and voice tone as features, and the resulting emotional state feeds directly into the subsequent response generation process. For example, if the user is feeling anxious, the device will generate a more reassuring response.
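
The combination of facial and voice features can be illustrated, under stated assumptions, as a weighted average of per-emotion scores. The embodiment does not describe how the two sources are combined; the weighting scheme, the default weight, and the function name below are hypothetical.

```python
def fuse_emotions(face: dict[str, float], voice: dict[str, float],
                  face_weight: float = 0.6) -> tuple[str, dict[str, float]]:
    """Blend face and voice emotion scores; return (top label, fused scores)."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be in [0, 1]")
    labels = set(face) | set(voice)
    fused = {
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    total = sum(fused.values())
    if total > 0:
        # Normalize so the fused scores sum to 1.
        fused = {k: v / total for k, v in fused.items()}
    top = max(fused, key=fused.get) if fused else "neutral"
    return top, fused
```

With face scores favoring anxiety and voice scores split between anxiety and anger, the fused result selects anxiety, which would then drive a more reassuring response.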

[0552] The user inputs questions and requests into the terminal via voice. The terminal uses speech recognition to convert the voice into text, which is then analyzed by a generative AI model. The analysis results are used to retrieve information from the server, providing the most appropriate information. The server searches its database for the necessary data and provides it to the terminal.

[0553] The generated responses are adjusted based on the user's emotional state. The emotion engine optimizes each response according to emotional data, creating natural and empathetic communication. Multilingual support automatically translates the response according to the user's language.

[0554] All interactions are saved as logs on the server and analyzed by a machine learning engine. This allows for further accuracy improvements by taking into account the user's individual emotional history. For example, if a user who was previously dissatisfied returns, the device can take their previous emotional data into consideration and strive to provide particularly helpful service.
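
The use of a user's emotional history can be sketched as follows. The set of emotions treated as negative, the rule of checking only the most recent visit, and the class name are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical labels regarded as negative outcomes of a visit.
NEGATIVE = {"anger", "anxiety", "dissatisfied"}


class EmotionHistory:
    def __init__(self) -> None:
        self._visits: dict[str, list[str]] = defaultdict(list)

    def record(self, user_id: str, emotion: str) -> None:
        """Store the emotion observed for one visit."""
        self._visits[user_id].append(emotion)

    def needs_extra_care(self, user_id: str) -> bool:
        """True if the user's most recent recorded visit was negative."""
        visits = self._visits.get(user_id)
        return bool(visits) and visits[-1] in NEGATIVE
```

When `needs_extra_care` returns true for a returning user, the device could select a more attentive tone from the start of the interaction.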

[0555] In this way, by using AI that integrates multiple functions, it is possible to achieve both high-quality customer service and improved operational efficiency simultaneously.

[0556] The following describes the processing flow.

[0557] Step 1:

[0558] When a user approaches the device, the device captures the user's face with its built-in camera. It then performs facial recognition and uses the captured image data to identify the user.

[0559] Step 2:

[0560] The device sends facial recognition data to the server, which compares it with an existing user database. If the user is already registered, the server returns the relevant personal information to the device. For new users, the initial registration process begins.

[0561] Step 3:

[0562] The device uses an emotion engine to analyze facial expression data obtained from the user's face. It also analyzes voice tone and other emotional indicators to determine the user's emotional state.

[0563] Step 4:

[0564] The user verbally communicates questions or requests to the device. The device receives the audio through its microphone and converts the audio data into text using speech recognition technology.

[0565] Step 5:

[0566] The device uses a generative AI model to analyze the content of the transcribed questions and processes them to understand the user's intent.

[0567] Step 6:

[0568] Based on the analysis results, the terminal requests the necessary data from the server. The server retrieves the requested information from its database and returns it to the terminal.

[0569] Step 7:

[0570] The device generates responses based on information and emotional data. The emotion engine considers the user's current emotional state and appropriately adjusts the tone and content of the generated responses.

[0571] Step 8:

[0572] The device uses its multilingual support function to translate responses into the user's selected language and output them as audio. This ensures that information is provided in a way that is easy for the user to understand.

[0573] Step 9:

[0574] The server logs all interaction data and uses machine learning algorithms to analyze it for continuous system improvement, thereby enhancing the user experience.
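The nine steps above can be sketched as a single pipeline. Every name below (`Session`, `ReceptionFlow`, and the injected callables) is hypothetical; the specification does not define these interfaces, and each callable is a placeholder for the corresponding recognition, generation, or translation component.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Session:
    user_id: Optional[str] = None
    emotion: str = "neutral"
    request_text: str = ""
    intent: str = ""
    data: str = ""
    response: str = ""


@dataclass
class ReceptionFlow:
    identify: Callable[[bytes], Optional[str]]     # Steps 1-2: face -> user
    detect_emotion: Callable[[bytes, bytes], str]  # Step 3: face + voice -> emotion
    transcribe: Callable[[bytes], str]             # Step 4: audio -> text
    understand: Callable[[str], str]               # Step 5: text -> intent
    fetch: Callable[[str], str]                    # Step 6: intent -> server data
    compose: Callable[[str, str], str]             # Step 7: data + emotion -> response
    translate: Callable[[str, str], str]           # Step 8: response + language -> output
    log: list = field(default_factory=list)        # Step 9: interaction log

    def run(self, image: bytes, audio: bytes, language: str) -> Session:
        s = Session()
        s.user_id = self.identify(image)
        s.emotion = self.detect_emotion(image, audio)
        s.request_text = self.transcribe(audio)
        s.intent = self.understand(s.request_text)
        s.data = self.fetch(s.intent)
        s.response = self.translate(self.compose(s.data, s.emotion), language)
        self.log.append(s)
        return s
```

Passing the components in as callables keeps the sketch independent of whether a given step runs on the terminal or on the server.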

[0575] (Example 2)

[0576] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0577] In reception services at corporations and public institutions, smooth communication with visitors is essential, but traditional methods have limitations in terms of quality and efficiency. In particular, current technology may not be able to provide sufficient satisfaction in situations requiring multilingual support and consideration for visitors' feelings. Furthermore, there is a lack of mechanisms to effectively utilize visitor information and continuously improve services.

[0578] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0579] In this invention, the server includes adaptive means for adjusting responses according to the user's emotional state, acquisition means for obtaining information from a database, and storage means for accumulating and learning from the information obtained from each of the means. This enables natural and friendly responses that take into account the visitor's emotions, smooth communication in multiple languages, and continuous service improvement based on accumulated data.

[0580] "Identification means for recognizing faces" refers to a device or technology for identifying a user's face and capturing its characteristics as data.

[0581] "Emotional evaluation analysis means" refers to a device or technology for determining a user's emotional state in real time based on their facial expressions and vocal characteristics.

[0582] "Recording means for converting voice input into text and analyzing information requests" refers to a device or technology that converts a user's voice into text data, analyzes its content, and understands the information request.

[0583] A "generative model-based response method" refers to a machine learning model or technique used to create an appropriate response based on an analyzed information request.

[0584] A "translation means that translates and provides responses in various languages" refers to a device or technology for translating the generated response into the user's native language or a specified language and providing it.

[0585] "Memory means for accumulating information and learning" refers to a device or technology for storing acquired data and using it to improve the performance of a system.

[0586] "An adaptive means that adjusts responses according to emotional state" refers to a device or technology that dynamically changes the tone and content of a response based on the user's emotional information.

[0587] "Means of acquiring information" refers to devices or technologies for retrieving necessary information from databases or other information sources based on user requests.

[0588] To implement this invention, the following hardware and software are used. The terminal is a device equipped with a camera and microphone that captures the user's face and voice. The server is a computer system with high-speed data processing capabilities that performs functions such as face recognition, emotion analysis, voice analysis, and database access.

[0589] When a user stands in front of the device, the terminal uses its camera to capture their face. The facial data is sent to the server and compared against a database. Based on this facial data, the server uses the identification means to retrieve personal information from the existing database.

[0590] In addition, the user's facial expressions and voice data are analyzed by the emotional evaluation analysis means to determine their emotional state. The server converts the user's voice input into text using speech recognition technology and interprets the information request using a generative AI model. Based on this analysis, the server retrieves the necessary information from the database and translates the response into the required language using the translation means.

[0591] A notable feature is that the server is equipped with adaptive mechanisms to adjust its response to the user's emotional state. This enables empathetic and natural interactions. Furthermore, it uses memory mechanisms to accumulate dialogue data, allowing for more refined service on subsequent visits.

[0592] For example, if a user visits the reception desk of a facility and speaks into a terminal saying, "I'd like to know the facility's event schedule," the voice is converted into text, analyzed by the server, and then the most appropriate response is generated. Furthermore, if the user appears confused, supplementary information can be provided, such as, "If you could tell us more specific dates and times, we can provide you with more detailed information."

[0593] An example of a prompt message would be, "Please tell me the date and time of the event. Also, please take an appropriate approach based on a specific sentiment analysis." In this way, a practical and interactive reception system can be realized.
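One way to combine the transcribed request with the sentiment analysis result in a prompt is sketched below. The function `build_prompt` and its template are assumptions; the specification gives only an example prompt message, not a prompt format.

```python
def build_prompt(request_text: str, emotion: str, language: str = "en") -> str:
    # Hypothetical template: request, detected emotion, and reply language,
    # followed by an instruction to adapt the tone.
    return (
        f"Visitor request: {request_text}\n"
        f"Detected emotional state: {emotion}\n"
        f"Reply in language: {language}\n"
        "Answer the request and adapt the tone to the emotional state."
    )
```

For the example in the text, the call might be `build_prompt("I'd like to know the facility's event schedule", "confused")`.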

[0594] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0595] Step 1:

[0596] When a user stands in front of the device, the terminal activates its camera and captures the user's face. The input at this time is image data of the user's face, and the output is face data that is sent to the server. The face data is transferred to the server, which then checks against a database to obtain the user's personal information.

[0597] Step 2:

[0598] The server analyzes the user's emotional state using an analysis method that evaluates emotions based on facial recognition data. The input is facial and voice feature data, and the output is the analyzed emotional state. The server analyzes changes in facial expressions and voice to determine what emotional state the user is in.

[0599] Step 3:

[0600] The user inputs questions or requests into the terminal using voice. The terminal converts the input voice data into text using a speech recognition engine. In this case, the input is the user's voice data, and the output is text data. This converted text data is sent to the server.

[0601] Step 4:

[0602] The server uses a generative AI model to analyze text data and understand the user's intent. The input is text data, and the output is the analyzed information request. The server then uses the generative AI model to apply prompts and determine the optimal response based on the user's request.

[0603] Step 5:

[0604] The server retrieves the necessary information from the database based on the analysis results. The input in this step is the analyzed information request, and the output is the retrieved information data. The server searches the database and extracts the exact information the user is looking for.

[0605] Step 6:

[0606] Based on the acquired information, the server generates a response using its multilingual translation function. The input in this process consists of informational data and the user's language settings, while the output is the translated response data. The server converts the information into the specified language and formats it as a response.

[0607] Step 7:

[0608] The terminal provides the user with translated responses sent from the server. The input is the translated response data, and the output is information provided to the user in either voice or text. The terminal uses speech synthesis technology to respond to the user in a natural-sounding manner.

[0609] (Application Example 2)

[0610] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0611] In today's retail environment, personalized service and multilingual support are required for visiting customers, but traditional methods make it difficult to efficiently achieve both simultaneously. Furthermore, there is a lack of natural, real-time responses based on customer emotions and language, hindering the delivery of a high-quality customer experience.

[0612] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0613] In this invention, the server includes biometric authentication means, emotion analysis means, and voice analysis means. This enables personalized responses tailored to each customer visiting the store, as well as responses in appropriate language on the spot. In particular, by adapting responses in real time based on the customer's emotional state, it becomes possible to provide a natural and empathetic customer experience.

[0614] "Biometric authentication means" refers to technology that uses a user's face or other physical characteristics to identify an individual and obtain related personal information.

[0615] "Emotion analysis means" refers to technology that analyzes a user's facial expressions and vocal characteristics to identify their emotional state.

[0616] "Voice analysis means" refers to technology that converts voice input from a user into text, analyzes its content, and understands the information request.

[0617] "Response formation means using generative artificial intelligence" refers to artificial intelligence technology for generating the optimal response based on analysis results.

[0618] A "multilingual translation method" is a technology for translating the generated response into the user's language and providing it appropriately.

[0619] "Information storage and analysis means" refers to technologies that store data acquired from various means, analyze it through machine learning, and use it to improve services.

[0620] "Language detection means" refers to technology that identifies the user's language and optimizes communication for that language.

[0621] "Response adaptation means" is a technology that adjusts responses in real time based on the user's emotional state, enabling more appropriate communication.

[0622] In one embodiment of this invention, the server first recognizes the user's face using biometric authentication means and identifies the individual. This allows the server to retrieve relevant personal information from a database. Using the recognized facial data, the server analyzes the user's emotional state using emotion analysis means. Technologies such as EmotionEngine can be used for emotion analysis.

[0623] Next, the terminal receives voice input from the user through the microphone. This voice is converted into text data by a voice analysis system, and the user's request is analyzed using natural language processing. At this stage, a response formation system using generative artificial intelligence generates an appropriate response to the user's request. A generative AI model is used for this generation.

[0624] The generated response is translated into the user's language by a multilingual translation system. Since the user's language is pre-identified by a language detection system, accurate translation is possible. Finally, the response is output to the user, but the response adaptation system adjusts the tone and content in real time according to the user's emotional state.

[0625] As a concrete example, if a customer visiting a store asks, "Where can I buy this product?", the terminal recognizes the language and emotion, generates product information, and provides guidance in the appropriate language. A prompt such as "Generate an answer to the question 'Where can I buy this product?'" is input into the generative AI model, leading to a natural response. This entire process results in a high-quality and personalized customer experience.

[0626] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0627] Step 1:

[0628] The server uses biometric authentication to recognize the user's face using image data from the camera as input. Using the recognized facial data, personal information is retrieved from the database to identify individual users.

[0629] Step 2:

[0630] The server invokes the emotion analysis means, which receives facial feature data and voice characteristics as input. By analyzing this data, the server identifies the user's emotional state. Based on the user's facial expressions and voice tone, the emotion engine measures emotions such as anger, anxiety, and joy.

[0631] Step 3:

[0632] The device converts the voice input from the user via the microphone into text using voice analysis. The converted text data is then analyzed to understand the user's request using natural language processing. Based on the text analysis, the user's questions and requests are interpreted.

[0633] Step 4:

[0634] The server then invokes the response formation means using generative artificial intelligence. The text data generated in step 3 is supplied to the generative AI model as a prompt, and the model generates the optimal response. For example, in response to the question "Where can I buy this product?", a response based on the product's location information is generated.

[0635] Step 5:

[0636] The server translates the response using a multilingual translation tool. After identifying the user's language using a language detection tool, it translates the response into the appropriate language and prepares it.

[0637] Step 6:

[0638] The device uses response adaptation mechanisms to adjust its response in real time based on the emotional state detected in step 2. The final response to the user is modified to be more empathetic and natural according to the emotion and then provided to the user.
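The response adaptation in step 6 can be sketched with a simple rule table. The opening phrases and the function `adapt_response` are assumptions; only the emotion labels (anger, anxiety, joy) come from step 2, and an actual system might instead adjust the response with the generative AI model.

```python
# Hypothetical opening phrases keyed by the emotions named in step 2.
PREFIXES = {
    "anger": "I apologize for the inconvenience. ",
    "anxiety": "No need to worry. ",
    "joy": "",
}


def adapt_response(response: str, emotion: str) -> str:
    # Unknown emotions leave the response unchanged.
    return PREFIXES.get(emotion, "") + response
```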

[0639] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0640] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
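The input and output of the data generation model 58 described above can be sketched as data types. All names (`InferenceData`, `ModelRequest`, `DataGenerationModel`, `EchoModel`) are hypothetical, and `EchoModel` performs no real inference; it only shows the shape of a call that pairs a prompt with inference data.

```python
from dataclasses import dataclass
from typing import Literal, Protocol, Tuple


@dataclass(frozen=True)
class InferenceData:
    kind: Literal["audio", "text", "image"]
    payload: bytes


@dataclass(frozen=True)
class ModelRequest:
    prompt: str                        # instructions for the model
    inputs: Tuple[InferenceData, ...]  # data the instructions apply to


class DataGenerationModel(Protocol):
    def infer(self, request: ModelRequest) -> InferenceData: ...


class EchoModel:
    """Stand-in that echoes the prompt and any text inputs as text output."""

    def infer(self, request: ModelRequest) -> InferenceData:
        text = b" ".join(d.payload for d in request.inputs if d.kind == "text")
        return InferenceData("text", request.prompt.encode() + b": " + text)
```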

[0641] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0642] [Fourth Embodiment]

[0643] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0644] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0645] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0646] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0647] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0648] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0649] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0650] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
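The mapping from an emotion to LED and motor commands for the controlled object 443 can be sketched as a lookup table. The pattern names, gesture names, and the function `express` are assumptions; the specification states only that emotions and facial expressions are expressed by controlling the motors and the eye LEDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    led_pattern: str  # illumination state of the eye LEDs
    gesture: str      # motion issued to the arm, hand, and foot motors


# Hypothetical command names for illustration only.
EXPRESSIONS = {
    "neutral": Expression("steady", "idle"),
    "joy": Expression("blink_fast", "raise_arms"),
    "sadness": Expression("dim", "lower_head"),
}


def express(emotion: str) -> Expression:
    # Fall back to the neutral expression for emotions without an entry.
    return EXPRESSIONS.get(emotion, EXPRESSIONS["neutral"])
```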

[0651] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0652] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0653] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0654] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0655] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0656] This invention is an AI concierge robot system designed to streamline reception operations in corporations and public institutions. At the heart of the system is a terminal that interacts with the user, integrating a wide range of functions including facial recognition, emotion analysis, speech recognition, response generation, and multilingual support. The specific implementation of the system is described below.

[0657] First, when a user stands in front of the device, the device captures the user's face with its built-in camera. This image data is analyzed by a facial recognition algorithm, and the result is sent to the server to be matched against the user's personal information. For already registered users, their history information can be used to provide more personalized service. For new users, the registration process begins, and the acquired information is stored on the server.
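The branch between registered and new users can be sketched as follows. Comparing face feature vectors by cosine similarity against a threshold is an assumption, as are the names `UserDirectory` and `identify_or_register` and the threshold value; the specification does not say how facial data is matched.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class UserDirectory:
    """Hypothetical server-side store of face feature vectors."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._users: Dict[str, List[float]] = {}

    def identify_or_register(self, embedding: Sequence[float]) -> Tuple[str, bool]:
        """Return (user_id, is_new)."""
        best_id, best_score = None, self.threshold
        for user_id, stored in self._users.items():
            score = cosine(embedding, stored)
            if score >= best_score:
                best_id, best_score = user_id, score
        if best_id is not None:
            return best_id, False  # registered user: history can be loaded
        new_id = f"user-{len(self._users) + 1}"
        self._users[new_id] = list(embedding)  # initial registration
        return new_id, True
```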

[0658] Next, the device analyzes the user's facial expressions to recognize emotions. By recognizing the emotional state, it can determine whether the user is relaxed or stressed, and select a response tone appropriate to that state. For example, if the user is tense, a more friendly and reassuring tone will be used.

[0659] The device has a built-in microphone that accepts voice input from the user. Using speech recognition technology, the device converts the voice into text, and a generative AI model analyzes the content. This analysis allows the device to understand the user's specific requests and questions. The server then searches for the necessary information based on the analysis results and generates the optimal answer.

[0660] The generated responses are adapted to the user's language using multilingual support, ensuring that accurate information is provided to users who speak different languages. For example, an English-speaking user will receive an audio response in English.

[0661] All interactions are logged and stored on the server, and a machine learning engine periodically analyzes the data to improve future response accuracy. In this way, the system continuously learns and evolves through all interactions with the user. For example, if a visitor asks, "Where is the nearest ATM?", the terminal will determine the user's location and guide them to the nearest ATM on a map. Providing this information via voice makes the guidance quick and efficient.

[0662] The following describes the processing flow.

[0663] Step 1:

[0664] The device activates its camera and captures the user's face. The captured image is then analyzed using a built-in facial recognition algorithm.

[0665] Step 2:

[0666] The device sends the facial recognition results to the server, which compares them with past records to identify the user. The server then refers to an existing database to identify the user information and sends it back to the device. For new users, the information is recorded, and the initial registration process is performed.

[0667] Step 3:

[0668] The device captures facial expression data from the user's face and uses emotion recognition to determine the user's current emotional state. This prepares the device to appropriately adjust the tone of conversation.

[0669] Step 4:

[0670] The user communicates questions and requests by voice to the device. The voice input is transmitted to the device via the microphone.

[0671] Step 5:

[0672] The device uses a speech recognition algorithm to convert the user's voice into text data. The converted text is then parsed to understand the user's request.

[0673] Step 6:

[0674] The device then uses a generative AI model to initiate a process that generates an appropriate response based on the analyzed request.

[0675] Step 7:

[0676] The terminal sends a request to the server to obtain the necessary information. The server retrieves the appropriate information from the database and provides it to the terminal.

[0677] Step 8:

[0678] The device uses its multilingual support feature to translate the generated response into the user's preferred language.

[0679] Step 9:

[0680] The terminal uses speech synthesis to output the system-generated response as audio and convey it to the user.

[0681] Step 10:

[0682] Once the interaction is complete, the server logs all interactions and analyzes the data to improve accuracy in the future.

[0683] (Example 1)

[0684] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0685] Traditional reception systems provided a uniform approach to users, making it difficult to offer services tailored to individual needs. Furthermore, their inability to support multiple languages meant they could not adequately assist users who spoke different languages. Additionally, learning and improvement based on collected data was difficult, resulting in a lack of an efficient feedback loop for improving service quality.

[0686] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0687] In this invention, the server includes biometric authentication means, emotion detection means, voice conversion means, response formation means, language conversion means, information storage and analysis means, and location identification means. This enables the provision of flexible services tailored to the individual needs of users and provides accurate information to users who speak different languages. Furthermore, it enables qualitative improvement of services through the analysis of collected data and allows the system to evolve through continuous learning.

[0688] "Biometric authentication means" refers to technology that scans a user's face to obtain personal information and enables individual authentication.

[0689] "Emotion detection means" refers to technology that analyzes a user's facial expressions to determine their emotional state and selects a response based on that.

[0690] "Voice conversion means" refers to technology that converts voice input from a user into text data and analyzes the content of the request.

[0691] "Response formation means" refers to a technology that generates an appropriate response using a generative AI model based on the analyzed user request.

[0692] "Language conversion means" refers to technology that translates generated responses into multiple languages, providing accurate information to users who speak different languages.

[0693] "Information storage and analysis means" refers to technology that accumulates data obtained at each processing stage and improves system performance through machine learning.

[0694] "Location identification means" refers to technology that acquires a user's geographical location information and guides them to relevant facilities and services.

[0695] This system is an AI concierge robot system that streamlines reception operations in corporations and public institutions. Here, we will show a specific form for implementing the invention.

[0696] At the core of the system are terminals that interact with users. These terminals are equipped with input and output devices such as cameras, microphones, and speakers, which are used to facilitate interaction with users. Servers are located on the cloud or local network and are responsible for processing and storing data.

[0697] First, when a user stands in front of the device, the device's built-in camera captures the user's face. A facial recognition algorithm analyzes this image data, and the server uses biometric authentication to verify the user's personal information. If the user is already registered, the server retrieves their history information to enable individualized service. For new users, the registration process begins on the device, and the information is stored on the server.

[0698] Next, the terminal uses emotion detection means to analyze the user's facial expressions and determine their emotional state. Based on this, the server determines whether the user is relaxed or stressed, and selects an appropriate response tone using response formation means.

[0699] When a user asks a question or makes a request by voice, the device captures the audio using its microphone and converts it to text using voice conversion means. This text data is then analyzed using a generative AI model to generate an appropriate response. The server then translates the generated response into the user's language using language conversion means, and the terminal outputs it as audio through its speaker.

[0700] Furthermore, all interactions are logged on the server using information storage and analysis tools. This data is periodically analyzed using machine learning techniques to improve the system's response accuracy.

[0701] As a concrete example, consider a scenario where a user asks, "Where is the nearest cafe?" The device obtains the user's location information and uses location-based searching to find nearby cafes. The server can output the generated information as voice in the user's language and also display map information on the device's screen. An example of a prompt message could be something like, "How should we respond when a visitor asks about nearby facilities?"
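The location-based search in this example can be sketched with a great-circle distance. Using the haversine formula, the facility record layout, and the function names are all assumptions; the specification states only that the device obtains the user's location and searches for nearby facilities.

```python
import math
from typing import Dict, List, Optional


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest(facilities: List[Dict], lat: float, lon: float, category: str) -> Optional[Dict]:
    """Return the closest facility of the given category, or None if there is none."""
    candidates = [f for f in facilities if f["category"] == category]
    if not candidates:
        return None
    return min(candidates, key=lambda f: haversine_km(lat, lon, f["lat"], f["lon"]))
```

The returned record would then feed both the spoken guidance and the map display on the device's screen.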

[0702] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0703] Step 1: User Recognition

[0704] The user stands in front of the terminal. The terminal captures the user's face with its built-in camera. The input is the captured facial image data, which is analyzed by a facial recognition algorithm. The server receives the analysis results, compares them with the database, and determines whether the user is already registered. The output is the user's personal information and past history.

[0705] Step 2: Sentiment Analysis

[0706] The system analyzes the user's facial expressions using facial images acquired by the terminal. The input is the user's facial image data, which is then analyzed by an emotion detection system. The server receives the analysis results and determines the user's emotional state (relaxed, tense, etc.). The output is the user's emotional state data. Specifically, if the user is tense, data is generated to change the tone of response.

[0707] Step 3: Speech Recognition

[0708] The user communicates questions and requests verbally to the device. The device captures the audio using its built-in microphone. The input is the user's voice data, which is converted into text data using a speech-to-text conversion device. The output is the text data converted from the audio. This text serves as the basis for the next analysis.

[0709] Step 4: Response Generation

[0710] The server receives text data obtained through speech recognition and analyzes it using a generative AI model. The input is text data, and the system accurately understands the user's request based on the prompt text. Based on the analysis results, a response formation mechanism generates the optimal answer. The output is the generated response data.

[0711] Step 5: Multilingual Translation

[0712] The system receives the response data generated by the server and passes it to a language translation mechanism. The input consists of the generated response data and the user's language information. The mechanism translates the response into the user's native language. The output is the multilingual response data, which the terminal then outputs as audio.

[0713] Step 6: Location Information Guidance

[0714] When a user makes a location-related inquiry (for example, "Where is the nearest cafe?"), the device uses location-determining means. The input is the user's current location information, and the device uses this location information to search for relevant facilities and services. The output is optimal guidance information, which the device displays on the screen as map information and also provides audio explanations.

[0715] Step 7: Data Accumulation and Analysis

[0716] After all interactions are complete, the server stores the collected data using data storage and analysis tools. The input is all the data generated at each step. Machine learning algorithms analyze this data to improve the system's accuracy and response quality. The output is analysis reports and improvement suggestions, which allow the system to be continuously improved.
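
The seven steps above can be read as a simple pipeline. The following is a minimal Python sketch under stated assumptions: every function name, data field, and stubbed return value is a hypothetical placeholder and not a component named in this disclosure. A real system would substitute actual facial recognition, speech-to-text, generative AI, and translation services for the stubs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """Data accumulated over one visit (all field names are hypothetical)."""
    language: str = "en"
    user_id: Optional[str] = None
    emotion: str = "unknown"
    request_text: str = ""
    response_text: str = ""
    guidance: Optional[str] = None


# Stub components; each stands in for a real recognizer or model.
def recognize_user(face_image: bytes) -> Optional[str]:
    # Step 1: pretend any non-empty image matches one registered user.
    return "user-001" if face_image else None


def analyze_emotion(face_image: bytes) -> str:
    # Step 2: the stub reads a marker from the bytes instead of a face.
    return "tense" if b"tense" in face_image else "relaxed"


def speech_to_text(audio: bytes) -> str:
    # Step 3: the stub treats the audio bytes as UTF-8 text.
    return audio.decode("utf-8")


def generate_response(text: str, emotion: str) -> str:
    # Step 4: change the tone of the response when the user is tense.
    prefix = "No need to worry. " if emotion == "tense" else ""
    return f"{prefix}Here is the information about: {text}"


def translate(text: str, language: str) -> str:
    # Step 5: the stub only tags the text with the target language.
    return f"[{language}] {text}"


def guide(text: str, location: tuple) -> Optional[str]:
    # Step 6: runs only for location-related inquiries.
    if "nearest" in text.lower():
        return f"map centered on {location}"
    return None


def handle_visit(face_image, audio, language, location, store):
    it = Interaction(language=language)
    it.user_id = recognize_user(face_image)
    it.emotion = analyze_emotion(face_image)
    it.request_text = speech_to_text(audio)
    answer = generate_response(it.request_text, it.emotion)
    it.response_text = translate(answer, language)
    it.guidance = guide(it.request_text, location)
    store.append(it)  # Step 7: accumulate for later analysis
    return it
```

Keeping each step as a separate function mirrors the input and output stated for each step, so any single stub can be replaced without touching the others.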

[0717] (Application Example 1)

[0718] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0719] In user reception operations, there is a need to provide multilingual support and personalized services quickly. Furthermore, it is necessary to improve convenience by utilizing user location information to guide users to details about relevant facilities and services. In addition, it is important to utilize data collected through the system to streamline reception operations.

[0720] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0721] In this invention, the server includes a facial recognition means for recognizing the user's face and acquiring personal information, an emotion recognition means for analyzing emotions, a speech recognition means for analyzing voice input, a multilingual support means for providing responses in multiple languages, a facility guidance means, and a usage status recording means. This enables multilingual support and personalized responses tailored to the user, and allows for rapid guidance to relevant facilities based on the user's location information. Furthermore, by accumulating and learning from data, continuous operational efficiency improvements can be achieved.

[0722] A "user" is an individual or group that utilizes a system and is the entity that receives information services.

[0723] "Facial recognition means" refers to technology for identifying an individual's face from image data acquired using a camera.

[0724] "Emotion recognition means" refers to technology that analyzes a user's emotional state from their facial expressions and behavior, and then derives an appropriate response based on that analysis.

[0725] "Speech recognition means" refers to a process that converts speech sounds into text data and analyzes information based on the results.

[0726] "Response generation means using a generative model" refers to a technology that utilizes a generative AI model to create an appropriate response based on voice input or other data.

[0727] "Multilingual support" refers to a function that enables the provision of information in different languages by utilizing translation technology.

[0728] "Data storage and analysis methods" refer to technologies used to improve operations and enhance services by storing and analyzing information collected through systems.

[0729] A "facility guidance system" is a process for guiding users to the location and services of relevant facilities based on their location information.

[0730] A "usage status recording means" is a function that records how users use the system and uses that information to optimize business processes and improve services.

[0731] The system for implementing this invention is configured as a concierge robot that integrates various technologies.

[0732] First, when a user stands in front of the system, the terminal's built-in camera captures the user's face, and facial recognition technology is used to identify the individual. The facial recognition algorithm then analyzes the data and compares it with the user's personal information. If the user is already registered, personalized responses can be provided based on their history.

[0733] Next, the device uses the user's facial expressions to activate its emotion recognition mechanism. This involves applying an emotion analysis algorithm to determine the user's mental state. Based on the emotion recognition results, the tone of response and conversation style are adjusted. For example, if the device determines that the user is nervous, it will respond in a friendly and reassuring tone.

[0734] Subsequently, voice input from the user is acquired via a microphone and converted to text by a speech recognition system. The text data is then analyzed using a generative AI model to understand the user's specific requests. Based on this analysis, the server uses the generative model to create the optimal response.

[0735] Furthermore, the multilingual support means translates the generated responses into the user's native language and provides them as either voice or text. This process ensures that users who speak different languages can receive efficient support.

[0736] For example, if a user uses voice input to say, "Please tell me the nearest public facility," the device can obtain location information using its facility guidance system and guide the user to nearby facilities. The response includes specific map information, which can be visually confirmed.

[0737] Examples of software and hardware used include "face recognition algorithms" and "emotion analysis algorithms" for face recognition and emotion recognition, a "speech recognition engine" for speech recognition, a "generative AI engine" for generative AI models, and a "language translation API" for multilingual support.

[0738] Examples of prompt statements to be input to a generative AI model include the following:

[0739] "Please explain how to guide visitors who ask about nearby public facilities."

[0740] "Please ask users if there are any events that require specific notifications."
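
As an illustration of how such a prompt statement might be combined with per-visit context before being input to a generative AI model, the following Python sketch assembles one prompt string. The function name, the field labels, and the choice to pass the emotional state and language as plain text lines are assumptions made for this example only.

```python
def build_prompt(instruction: str, user_text: str, emotion: str, language: str) -> str:
    """Combine a fixed instruction with per-visit context into one prompt string."""
    lines = [
        instruction.strip(),
        f"Visitor request: {user_text.strip()}",
        f"Estimated emotional state: {emotion}",
        f"Answer in language: {language}",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "Please explain how to guide visitors who ask about nearby public facilities.",
    "Please tell me the nearest public facility",
    "nervous",
    "en",
)
```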

[0741] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0742] Step 1:

[0743] The device captures the user's face with its camera and acquires image data. This image data is analyzed using facial recognition technology and compared with the user's personal information database. In this process, the image input by the camera is analyzed by a facial recognition algorithm and matched with existing user data. As a result, the user's personal information is output.

[0744] Step 2:

[0745] The device uses the user's facial expression data to activate its emotion recognition mechanism. The device obtains emotions from facial expressions and applies an emotion analysis algorithm to determine the user's mental state. The user's facial image is used as input, and the emotional state is output. This process selects a response tone appropriate to the user's emotions.

[0746] Step 3:

[0747] The user requests information by voice. The device's microphone acquires the voice input, and a speech recognition system converts this voice data into text. Here, the input is the user's voice information, and the output is generated as text data. The transcribed data forms the basis for the next analysis step.

[0748] Step 4:

[0749] The server uses a generative AI model to analyze the text data and understand the user's request. Based on the analysis results, it generates the optimal response. The input data is a text-based request, and the generated text data is output as the response. Based on this result, the server provides information tailored to the user.

[0750] Step 5:

[0751] The generated text response is translated into the user's language using a multilingual support system. The server uses a language translation API to convert the generated text into the user's native language and outputs it as audio or text. This allows for accurate and easily understandable responses to multilingual users.

[0752] Step 6:

[0753] The terminal uses a facility guidance system to obtain the user's location information and provides detailed information about relevant facilities. In this step, the user's location data is input, and the output provides guidance information about facilities best suited to the user. This function enables more precise service guidance based on the user's current location.

[0754] Step 7:

[0755] The server collects and stores all interaction data using data storage and analysis tools. This data is analyzed for continuous operational efficiency improvements. Input includes details of all interactions with the user, and output provides analytical data to improve response accuracy.
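
Step 6 above, which selects facilities based on the user's location, can be sketched as a nearest-neighbor search. The disclosure does not state how distance is computed; the haversine great-circle formula below is an assumption chosen for illustration, and the facility records and dictionary keys are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_facility(user_pos, facilities):
    """Return the facility record closest to user_pos, a (lat, lon) tuple."""
    return min(facilities, key=lambda f: haversine_km(*user_pos, f["lat"], f["lon"]))


# Hypothetical data for illustration only.
facilities = [
    {"name": "Library", "lat": 35.7000, "lon": 139.8000},
    {"name": "City Hall", "lat": 35.6820, "lon": 139.7680},
]
closest = nearest_facility((35.6812, 139.7671), facilities)
```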

[0756] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0757] This invention is a system that efficiently handles reception duties for corporations and public institutions using AI technology. Utilizing an emotion engine, this system achieves advanced customer service through user facial recognition, emotion analysis, speech recognition, response generation, multilingual support, and data storage and analysis. The specific implementation of the system is described below.

[0758] When a user appears in front of the device, the device uses facial recognition to capture the user's face. The facial data is sent to a server, where it is compared with an existing database to obtain the user's personal information. For new users, the information is recorded in the system and used to improve future services.

[0759] The emotion engine analyzes emotions in real time across multiple dimensions based on facial recognition data and the user's voice characteristics. It extracts changes in the user's facial expressions and voice tone as features, and the resulting emotional state directly participates in the subsequent response generation process. For example, if the user is feeling anxious, the device will generate a more reassuring response.

[0760] The user inputs questions and requests into the terminal via voice. The terminal uses speech recognition to convert the voice into text, which is then analyzed by a generative AI model. The analysis results are used to retrieve information from the server, providing the most appropriate information. The server searches its database for the necessary data and provides it to the terminal.

[0761] The generated responses are adjusted based on the user's emotional state. The emotion engine optimizes each response according to emotional data, creating natural and empathetic communication. Multilingual support automatically translates the response according to the user's language.

[0762] All interactions are saved as logs on the server and analyzed by a machine learning engine. This allows for further accuracy improvements by taking into account the user's individual emotional history. For example, if a user who was previously dissatisfied returns, the device can take their previous emotional data into consideration and strive to provide particularly helpful service.
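
The use of a returning user's emotional history can be sketched as follows. This is a minimal in-memory example; the class name, the emotion label "dissatisfied", and the two service levels are hypothetical, and a real system would presumably persist the log on the server.

```python
from collections import defaultdict


class EmotionHistory:
    """In-memory log of the emotions observed for each user."""

    def __init__(self):
        self._log = defaultdict(list)

    def record(self, user_id, emotion):
        self._log[user_id].append(emotion)

    def last_emotion(self, user_id):
        # .get() avoids creating an empty entry for unknown users.
        history = self._log.get(user_id)
        return history[-1] if history else None


def service_level(history, user_id):
    """Choose extra care when the user's previous visit ended in dissatisfaction."""
    if history.last_emotion(user_id) == "dissatisfied":
        return "extra_care"
    return "standard"
```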

[0763] In this way, by using AI that integrates multiple functions, it is possible to achieve both high-quality customer service and improved operational efficiency simultaneously.

[0764] The following describes the processing flow.

[0765] Step 1:

[0766] When a user arrives in front of the device, the device captures the user's face with its built-in camera. It then performs facial recognition and uses the captured image data to identify the user.

[0767] Step 2:

[0768] The device sends facial recognition data to the server, which compares it with an existing user database. If the user is already registered, the server returns the relevant personal information to the device. For new users, the initial registration process begins.

[0769] Step 3:

[0770] The device uses an emotion engine to analyze facial expression data obtained from the user's face. It also analyzes voice tone and other emotional indicators to determine the user's emotional state.

[0771] Step 4:

[0772] The user verbally communicates questions or requests to the device. The device receives the audio through its microphone and converts the audio data into text using speech recognition technology.

[0773] Step 5:

[0774] The device uses a generative AI model to analyze the content of the transcribed questions and processes them to understand the user's intent.

[0775] Step 6:

[0776] Based on the analysis results, the terminal requests the necessary data from the server. The server retrieves the requested information from its database and returns it to the terminal.

[0777] Step 7:

[0778] The device generates responses based on information and emotional data. The emotion engine considers the user's current emotional state and appropriately adjusts the tone and content of the generated responses.

[0779] Step 8:

[0780] The device uses its multilingual support function to translate responses into the user's selected language and output them as audio. This ensures that information is provided in a way that is easy for the user to understand.

[0781] Step 9:

[0782] The server logs all interaction data and uses machine learning algorithms to analyze it for continuous system improvement, thereby enhancing the user experience.
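
Step 2 of this flow, in which the server either returns a registered user or begins initial registration, can be sketched as below. The disclosure does not specify the matching method; comparing face feature vectors by cosine similarity against a fixed threshold is an assumption made for this example, as are the function names, the threshold value, and the identifier format.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lookup_or_register(embedding, database, threshold=0.9):
    """Return (user_id, is_new). Registers the embedding when nothing matches."""
    best_id, best_score = None, threshold
    for user_id, stored in database.items():
        score = cosine(embedding, stored)
        if score >= best_score:
            best_id, best_score = user_id, score
    if best_id is not None:
        return best_id, False
    new_id = f"user-{len(database) + 1:03d}"  # hypothetical identifier scheme
    database[new_id] = list(embedding)
    return new_id, True
```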

[0783] (Example 2)

[0784] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0785] In reception services at corporations and public institutions, smooth communication with visitors is essential, but traditional methods have limitations in terms of quality and efficiency. In particular, current technology may not be able to provide sufficient satisfaction in situations requiring multilingual support and consideration for visitors' feelings. Furthermore, there is a lack of mechanisms to effectively utilize visitor information and continuously improve services.

[0786] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0787] In this invention, the server includes adaptive means for adjusting responses according to the user's emotional state, acquisition means for obtaining information from a database, and storage means for accumulating and learning from the information obtained from each of the means. This enables natural and friendly responses that take into account the visitor's emotions, smooth communication in multiple languages, and continuous service improvement based on accumulated data.

[0788] "Identification means for recognizing faces" refers to a device or technology for identifying a user's face and capturing its characteristics as data.

[0789] "Emotional evaluation analysis means" refers to a device or technology for determining a user's emotional state in real time based on their facial expressions and vocal characteristics.

[0790] "Recording means for converting voice input into text and analyzing information requests" refers to a device or technology that converts a user's voice into text data, analyzes its content, and understands the information request.

[0791] A "generative model-based response method" refers to a machine learning model or technique used to create an appropriate response based on an analyzed information request.

[0792] A "translation means that translates and provides responses in various languages" refers to a device or technology for translating the generated response into the user's native language or a specified language and providing it.

[0793] "Memory means for accumulating information and learning" refers to a device or technology for storing acquired data and using it to improve the performance of a system.

[0794] "An adaptive means that adjusts responses according to emotional state" refers to a device or technology that dynamically changes the tone and content of a response based on the user's emotional information.

[0795] "Means of acquiring information" refers to devices or technologies for retrieving necessary information from databases or other information sources based on user requests.

[0796] To implement this invention, the following hardware and software are used. The terminal is a device equipped with a camera and microphone that captures the user's face and voice. The server is a computer system with high-speed data processing capabilities that performs functions such as face recognition, emotion analysis, voice analysis, and database access.

[0797] When a user stands in front of the device, the terminal uses its camera to capture their face. The facial data is sent to a server and compared against a database. Based on this facial data, the server uses identification methods to retrieve personal information from an existing database.

[0798] In addition, the user's facial expressions and voice data are analyzed using an emotional evaluation tool to determine their emotional state. The server converts the user's voice input into text using speech recognition technology and interprets the information request using a generative AI model. Based on this analysis, the server retrieves the necessary information from the database and performs multilingual translation using a conversion tool.

[0799] A notable feature is that the server is equipped with adaptive mechanisms to adjust its response to the user's emotional state. This enables empathetic and natural interactions. Furthermore, it uses memory mechanisms to accumulate dialogue data, allowing for more refined service on subsequent visits.

[0800] For example, if a user visits the reception desk of a facility and speaks into a terminal saying, "I'd like to know the facility's event schedule," the voice is converted into text, analyzed by the server, and then the most appropriate response is generated. Furthermore, if the user appears confused, supplementary information can be provided, such as, "If you could tell us more specific dates and times, we can provide you with more detailed information."

[0801] An example of a prompt message would be, "Please tell me the date and time of the event. Also, please take an appropriate approach based on a specific sentiment analysis." In this way, a practical and interactive reception system can be realized.

[0802] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0803] Step 1:

[0804] When a user stands in front of the device, the terminal activates its camera and captures the user's face. The input at this time is image data of the user's face, and the output is face data that is sent to the server. The face data is transferred to the server, which then checks against a database to obtain the user's personal information.

[0805] Step 2:

[0806] The server analyzes the user's emotional state using an analysis method that evaluates emotions based on facial recognition data. The input is facial and voice feature data, and the output is the analyzed emotional state. The server analyzes changes in facial expressions and voice to determine what emotional state the user is in.

[0807] Step 3:

[0808] The user inputs questions or requests into the terminal using voice. The terminal converts the input voice data into text using a speech recognition engine. In this case, the input is the user's voice data, and the output is text data. This converted text data is sent to the server.

[0809] Step 4:

[0810] The server uses a generative AI model to analyze text data and understand the user's intent. The input is text data, and the output is the analyzed information request. The server then uses the generative AI model to apply prompts and determine the optimal response based on the user's request.

[0811] Step 5:

[0812] The server retrieves the necessary information from the database based on the analysis results. The input in this step is the analyzed information request, and the output is the retrieved information data. The server searches the database and extracts the exact information the user is looking for.

[0813] Step 6:

[0814] Based on the acquired information, the server generates a response using its multilingual translation function. The input in this process consists of informational data and the user's language settings, while the output is the translated response data. The server converts the information into the specified language and formats it as a response.

[0815] Step 7:

[0816] The terminal provides the user with translated responses sent from the server. The input is the translated response data, and the output is information provided to the user in either voice or text. The terminal uses speech synthesis technology to respond to the user in a natural-sounding manner.
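
Steps 4 to 6 above (understanding the request, retrieving information, and translating the response) can be sketched as follows. This is only an illustration: a keyword match stands in for the generative AI model, the translation function is supplied by the caller, and the database contents, request key, and function names are hypothetical.

```python
from typing import Callable, Optional


def parse_request(text: str) -> Optional[str]:
    """Step 4 stand-in: a keyword match instead of a generative AI model."""
    lowered = text.lower()
    if "event" in lowered and "schedule" in lowered:
        return "event_schedule"
    return None


def retrieve(request_key: str, database: dict) -> Optional[str]:
    """Step 5: look up the analyzed information request in the database."""
    return database.get(request_key)


def respond(text: str, language: str, database: dict,
            translate: Callable[[str, str], str], confused: bool = False) -> str:
    """Step 6: build the response and translate it into the user's language."""
    key = parse_request(text)
    info = retrieve(key, database) if key else None
    if info is None:
        info = "Sorry, I could not find that information."
    if confused:
        # Supplementary guidance for a user who appears confused.
        info += (" If you could tell us more specific dates and times,"
                 " we can provide you with more detailed information.")
    return translate(info, language)
```

Passing the translation function as a parameter keeps the sketch independent of any particular translation service.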

[0817] (Application Example 2)

[0818] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0819] In today's retail environment, personalized service and multilingual support are required for visiting customers, but traditional methods make it difficult to efficiently achieve both simultaneously. Furthermore, there is a lack of natural, real-time responses based on customer emotions and language, hindering the delivery of a high-quality customer experience.

[0820] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0821] In this invention, the server includes biometric authentication means, emotion analysis means, and voice analysis means. This enables personalized responses tailored to each customer visiting the store, as well as responses in appropriate language on the spot. In particular, by adapting responses in real time based on the customer's emotional state, it becomes possible to provide a natural and empathetic customer experience.

[0822] "Biometric authentication methods" are technologies that use a user's face or other physical characteristics to identify an individual and obtain related personal information.

[0823] "Emotional analysis methods" are technologies that analyze a user's facial expressions and vocal characteristics to identify their emotional state.

[0824] "Voice analysis means" refers to technology that converts voice input from a user into text, analyzes its content, and understands the information request.

[0825] "Response formation means using generative artificial intelligence" refers to artificial intelligence technology for generating the optimal response based on analysis results.

[0826] A "multilingual translation method" is a technology for translating the generated response into the user's language and providing it appropriately.

[0827] "Information storage and analysis means" refers to technologies that store data acquired from various means, analyze it through machine learning, and use it to improve services.

[0828] "Language detection means" refers to technology that identifies the user's language and optimizes communication for that language.

[0829] "Response adaptation means" is a technology that adjusts responses in real time based on the user's emotional state, enabling more appropriate communication.

[0830] In one embodiment of this invention, the server first recognizes the user's face using biometric authentication means and identifies the individual. This allows the server to retrieve relevant personal information from a database. Using the recognized facial data, the server analyzes the user's emotional state using emotion analysis means. Technologies such as EmotionEngine can be used for emotion analysis.

[0831] Next, the terminal receives voice input from the user through the microphone. This voice is converted into text data by a voice analysis system, and the user's request is analyzed using natural language processing. At this stage, a response formation system using generative artificial intelligence generates an appropriate response to the user's request. A generative AI model is used for this generation.

[0832] The generated response is translated into the user's language by a multilingual translation system. Since the user's language is pre-identified by a language detection system, accurate translation is possible. Finally, the response is output to the user, but the response adaptation system adjusts the tone and content in real time according to the user's emotional state.

[0833] As a concrete example, if a customer visiting a store asks, "Where can I buy this product?", the terminal recognizes the language and emotion, generates product information, and provides guidance in the appropriate language. A prompt such as "Generate an answer to the question 'Where can I buy this product?'" is input into the generative AI model, leading to a natural response. This entire process results in a high-quality and personalized customer experience.

[0834] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0835] Step 1:

[0836] The server uses biometric authentication to recognize the user's face using image data from the camera as input. Using the recognized facial data, personal information is retrieved from the database to identify individual users.

[0837] Step 2:

[0838] The server invokes the emotion analysis means, which receives facial feature data and voice characteristics as input. By analyzing this data, the user's emotional state is identified. Based on the user's facial expressions and voice tone, the emotion engine measures emotions such as anger, anxiety, and joy.

[0839] Step 3:

[0840] The device converts the voice input from the user via the microphone into text using voice analysis. The converted text data is then analyzed to understand the user's request using natural language processing. Based on the text analysis, the user's questions and requests are interpreted.

[0841] Step 4:

[0842] The server then invokes the response formation means using generative artificial intelligence. The text data generated in Step 3 is given to the generative AI model as a prompt, and the model generates the optimal response. For example, in response to the question "Where can I buy this product?", a response based on the product's location information is generated.

[0843] Step 5:

[0844] The server translates the response using a multilingual translation tool. After identifying the user's language using a language detection tool, it translates the response into the appropriate language and prepares it.

[0845] Step 6:

[0846] The device uses response adaptation mechanisms to adjust its response in real time based on the emotional state detected in step 2. The final response to the user is modified to be more empathetic and natural according to the emotion and then provided to the user.
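
The language detection in Step 5 and the response adaptation in Step 6 can be sketched together as below. The disclosure does not say how the language is detected; the Unicode-range heuristic here is a deliberately crude assumption (any text without kana, hangul, or Han characters defaults to "en"), and the tone prefixes and function names are hypothetical.

```python
def detect_language(text: str) -> str:
    """Guess a language code from the scripts present in the text."""
    has_han = False
    for ch in text:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:    # hiragana and katakana blocks
            return "ja"
        if 0xAC00 <= code <= 0xD7A3:    # hangul syllables
            return "ko"
        if 0x4E00 <= code <= 0x9FFF:    # CJK unified ideographs
            has_han = True
    return "zh" if has_han else "en"


# Hypothetical tone prefixes keyed by the emotion measured in Step 2.
TONE_PREFIX = {
    "anger": "I apologize for the inconvenience. ",
    "anxiety": "Please don't worry. ",
    "joy": "",
}


def adapt_response(text: str, emotion: str) -> str:
    """Prepend a tone-setting phrase that matches the detected emotion."""
    return TONE_PREFIX.get(emotion, "") + text
```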

[0847] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0848] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
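
The input and output relationship described above (a prompt plus inference data in, an inference result out) can be expressed as a small interface. The following Python sketch is an assumption-laden illustration: the class names, the method name `infer`, and the echo stand-in are hypothetical and do not correspond to the API of ChatGPT, Gemini, or any other product.

```python
from dataclasses import dataclass
from typing import Protocol, Union


@dataclass
class InferenceData:
    kind: str                    # "audio", "text", or "image"
    payload: Union[str, bytes]


class DataGenerationModel(Protocol):
    """Anything that maps a prompt and inference data to a result."""

    def infer(self, prompt: str, data: InferenceData) -> str: ...


class EchoModel:
    """Stand-in that performs no learning or inference; it only echoes its input."""

    def infer(self, prompt: str, data: InferenceData) -> str:
        if data.kind != "text":
            return f"[{data.kind} input of {len(data.payload)} bytes] {prompt}"
        return f"{prompt} -> {data.payload}"


def run(model: DataGenerationModel, prompt: str, data: InferenceData) -> str:
    """Call any object that satisfies the DataGenerationModel protocol."""
    return model.infer(prompt, data)
```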

[0849] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0850] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0851] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0852] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0853] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible (expressed in actions) it becomes.
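
The layout of the emotion map 400 can be illustrated by treating a position on the map as a point (x, y) measured from the center. The following Python sketch is an assumption made for explanation only: the coordinate system, the radius that separates the inner and outer regions, and the labels returned are hypothetical, and the disclosure does not define numeric coordinates for the map.

```python
import math


def describe_position(x: float, y: float, inner_radius: float = 1.0) -> dict:
    """Describe a point on a map laid out like the emotion map 400.

    Left (x < 0) is the side of reactions occurring in the brain, right (x > 0)
    the side of situational judgment; up (y > 0) is pleasure, down (y < 0) is
    displeasure; points near the center are more primitive emotions.
    """
    radius = math.hypot(x, y)
    return {
        "side": "reaction" if x < 0 else "situation" if x > 0 else "center",
        "valence": "pleasure" if y > 0 else "displeasure" if y < 0 else "neutral",
        "layer": "primitive" if radius <= inner_radius else "expressed",
    }
```

For example, a point at the 3 o'clock position outside the inner radius is reported as lying on the situation side with neutral valence.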

[0854] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
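
The idea that pleasure and discomfort follow from how far various balances are from their ideal can be sketched numerically. The scoring rule below (mean normalized deviation, capped and mapped to the range from -1 to 1) is an assumption introduced for illustration; the disclosure does not give a formula, and the quantity names and tolerances are hypothetical.

```python
def balance_score(current: dict, ideal: dict, tolerance: dict) -> float:
    """Return a value in [-1, 1]: positive near the ideal, negative far from it."""
    deviations = [
        # Normalize each deviation by its tolerance and cap it at 2.0.
        min(abs(current[k] - ideal[k]) / tolerance[k], 2.0)
        for k in ideal
    ]
    mean_dev = sum(deviations) / len(deviations)
    return 1.0 - mean_dev


# Hypothetical robot state: tilted posture and a half-charged battery.
ideal = {"posture_deg": 0.0, "battery": 1.0}
tolerance = {"posture_deg": 10.0, "battery": 0.5}
score = balance_score({"posture_deg": 20.0, "battery": 0.5}, ideal, tolerance)
```

A score near 1 corresponds to pleasure and a negative score to discomfort under this assumed rule.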

[0855] The emotion map defines two emotions that promote learning. One is the negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion located around "desire" on the reaction side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0856] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
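
One way to obtain training targets in which neighboring emotions have similar values is to spread each labeled emotion over its neighbors on the map. The sketch below does this with a Gaussian kernel. It is an assumption about how such targets could be built, not the training procedure of the emotion identification model 59. The coordinates in `EMOTION_POSITIONS`, the extra labels "anxious" and "regret", the `width` parameter, and the function name are all hypothetical.

```python
import math

# Hypothetical 2-D positions on the map; the actual layout is given by the figures.
EMOTION_POSITIONS = {
    "reassured": (0.60, 0.10),
    "calm": (0.65, 0.05),
    "confident": (0.70, 0.15),
    "anxious": (0.60, -0.40),
    "regret": (0.50, -0.70),
}


def soft_targets(label: str, width: float = 0.2) -> dict[str, float]:
    """Spread one labeled emotion over nearby emotions with a Gaussian kernel."""
    cx, cy = EMOTION_POSITIONS[label]
    weights = {}
    for name, (x, y) in EMOTION_POSITIONS.items():
        dist_sq = (x - cx) ** 2 + (y - cy) ** 2
        weights[name] = math.exp(-dist_sq / (2 * width ** 2))
    total = sum(weights.values())
    # Normalize so the target values sum to 1.
    return {name: w / total for name, w in weights.items()}
```

With these made-up coordinates, a sample labeled "reassured" yields high target values for "calm" and "confident" and values near zero for the distant emotions, which mirrors the similarity between nearby emotions described above.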

[0857] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0858] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0859] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0860] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0861] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0862] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, a general-purpose processor that functions as a hardware resource for performing specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), and ASICs (Application Specific Integrated Circuits), which are processors with circuit configurations designed specifically to perform specific processing. Each of these processors has built-in or connected memory and performs specific processing by using that memory.

[0863] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0864] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0865] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0866] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge that require no special explanation to enable implementation of those aspects have been omitted from the descriptions and illustrations presented above.

[0867] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0868] The following is further disclosed regarding the embodiments described above.

[0869] (Claim 1)

[0870] A facial recognition system for recognizing a user's face and obtaining corresponding personal information,

[0871] An emotion recognition means that analyzes the user's emotions based on the recognized facial data,

[0872] A speech recognition means for converting voice input from a user into text and analyzing the information request,

[0873] A response generation means using a generation model that generates an appropriate response based on the analysis results,

[0874] A multilingual support system that translates and provides responses in multiple languages,

[0875] A data storage and analysis means for accumulating data obtained from each of the above means and performing learning,

[0876] A system comprising the above.
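
The means listed in paragraphs [0870] to [0875] can be pictured as stages of one pipeline. The sketch below is an assumption about how they might be wired together, with every stage supplied by the caller as a plain function. The class name `ReceptionPipeline`, the field names, and the argument shapes are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReceptionPipeline:
    # Each field stands in for one of the listed means; all are caller-supplied stubs.
    recognize_face: Callable[[bytes], str]        # image -> user id
    analyze_emotion: Callable[[bytes], str]       # image -> emotion label
    transcribe: Callable[[bytes], str]            # audio -> request text
    generate: Callable[[str, str, str], str]      # (user id, emotion, request) -> reply
    translate: Callable[[str, str], str]          # (reply, language code) -> translated reply
    log: list[dict] = field(default_factory=list)  # accumulated records for later learning

    def handle(self, image: bytes, audio: bytes, language: str) -> str:
        user_id = self.recognize_face(image)
        emotion = self.analyze_emotion(image)
        request = self.transcribe(audio)
        reply = self.generate(user_id, emotion, request)
        translated = self.translate(reply, language)
        # Store what each stage produced so it can be analyzed afterwards.
        self.log.append(
            {"user": user_id, "emotion": emotion, "request": request, "reply": translated}
        )
        return translated
```

Keeping each stage as an injected callable means any recognizer, generative model, or translator could be swapped in without changing the flow, which is one possible reading of the separate means in the text.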

[0877] (Claim 2)

[0878] The system according to claim 1, further comprising a process for accumulating user information through an initial registration process if the user is new.
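
The initial registration step for new users could look like the following sketch. It assumes that face recognition yields a stable identifier and that profiles are kept in memory. `UserRegistry`, `lookup_or_register`, and the profile fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UserRegistry:
    # Maps a face identifier (hypothetical) to the stored profile for that user.
    profiles: dict[str, dict] = field(default_factory=dict)

    def lookup_or_register(self, face_id: str, initial_profile: dict) -> tuple[dict, bool]:
        """Return (profile, is_new); unknown users are stored via initial registration."""
        if face_id in self.profiles:
            return self.profiles[face_id], False
        # Copy the profile so later changes by the caller do not alter the stored record.
        self.profiles[face_id] = dict(initial_profile)
        return self.profiles[face_id], True
```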

[0879] (Claim 3)

[0880] The system according to claim 1, further comprising means for automatically adjusting response tone and conversation style based on user emotion data.
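
Adjusting response tone and conversation style from emotion data can be sketched as a lookup that shapes the instruction passed to a generative model. The emotion labels, the style table, and the prompt wording below are assumptions, since the disclosure does not enumerate them.

```python
# Hypothetical emotion labels and styles; the disclosure does not list them.
STYLE_BY_EMOTION = {
    "anxious": {"tone": "reassuring", "pace": "slow"},
    "irritated": {"tone": "apologetic", "pace": "brisk"},
    "calm": {"tone": "neutral", "pace": "normal"},
}
DEFAULT_STYLE = {"tone": "neutral", "pace": "normal"}


def build_prompt(request: str, emotion: str) -> str:
    """Compose a generative-model instruction whose style depends on the emotion."""
    # Unknown emotion labels fall back to a neutral style.
    style = STYLE_BY_EMOTION.get(emotion, DEFAULT_STYLE)
    return (
        f"Respond in a {style['tone']} tone at a {style['pace']} pace.\n"
        f"Visitor request: {request}"
    )
```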

[0881] "Example 1"

[0882] (Claim 1)

[0883] A biometric authentication method for recognizing the user's face and obtaining corresponding personal information,

[0884] An emotion detection means that analyzes the user's emotions based on the recognized facial data,

[0885] A speech-to-text conversion method for converting user voice input into text and analyzing information requests,

[0886] A response formation means using a generative AI model that generates an appropriate response based on the analysis results,

[0887] A language conversion means that translates into multiple languages and provides a response,

[0888] A means for storing and analyzing data obtained from each of the above means and performing machine learning,

[0889] A location-determining means that acquires the user's geographical location information and guides them to relevant facilities and services,

[0890] A system comprising the above.
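
For the location-determining means in paragraph [0889], one simple way to choose a facility to guide the user to is a nearest-neighbor lookup over coordinates. The sketch below assumes locations are latitude and longitude pairs and uses the haversine formula for distance. The function names and the facility table format are hypothetical.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_facility(user: tuple[float, float], facilities: dict[str, tuple[float, float]]) -> str:
    """Return the name of the facility closest to the user's (lat, lon)."""
    return min(facilities, key=lambda name: haversine_km(*user, *facilities[name]))
```

Inside a single building, an indoor positioning scheme would likely replace latitude and longitude, but the selection step would stay the same.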

[0891] (Claim 2)

[0892] The system according to claim 1, further comprising a process for saving user information via an initial registration process if the user is new.

[0893] (Claim 3)

[0894] The system according to claim 1, further comprising means for automatically adjusting response tone and conversational style based on user emotion data.

[0895] "Application Example 1"

[0896] (Claim 1)

[0897] A facial recognition system for recognizing a user's face and obtaining corresponding personal information,

[0898] An emotion recognition means that analyzes the user's emotions based on the recognized facial data,

[0899] A speech recognition means for converting voice input from a user into text and analyzing the information request,

[0900] A response generation means using a generation model that generates an appropriate response based on the analysis results,

[0901] A multilingual support system that translates and provides responses in multiple languages,

[0902] A data storage and analysis means for accumulating data obtained from each of the above means and performing learning,

[0903] A facility guidance means that acquires the user's location information and enables guidance to related facilities,

[0904] A means for recording user usage to optimize reception operations,

[0905] A system comprising the above.

[0906] (Claim 2)

[0907] The system according to claim 1, further comprising a process for accumulating user information through an initial registration process if the user is new.

[0908] (Claim 3)

[0909] The system according to claim 1, further comprising means for automatically adjusting response tone and conversation style based on user emotion data, and providing optimal service to the user in conjunction with acquired location information.

[0910] "Example 2 of combining an emotion engine"

[0911] (Claim 1)

[0912] An identification method for recognizing the user's face and obtaining corresponding personal information,

[0913] An analysis means for evaluating the user's emotions based on the recognized feature data,

[0914] A recording means for converting voice input from users into text and analyzing information requests,

[0915] A response means using a generative model that generates an appropriate response based on the aforementioned analysis results,

[0916] A translation means that translates and provides responses in various languages,

[0917] A memory means for storing and learning information obtained from each of the above means,

[0918] Adaptive means that adjust the response according to the user's emotional state,

[0919] A means of obtaining information from a database,

[0920] A system comprising the above.

[0921] (Claim 2)

[0922] The system according to claim 1, which has a process for accumulating information via an initial registration process if the user is new.

[0923] (Claim 3)

[0924] The system according to claim 1, further comprising a device for automatically adjusting response tone and conversation style based on user emotional information.

[0925] "Application example 2 when combining with an emotional engine"

[0926] (Claim 1)

[0927] A biometric authentication method for recognizing the user's face and obtaining corresponding personal information,

[0928] An emotion analysis means for analyzing the user's emotions based on the recognized feature data,

[0929] A voice analysis means for converting voice input from users into text and analyzing information requests,

[0930] A response formation means using generative artificial intelligence to generate an appropriate response based on the analysis results,

[0931] A multilingual translation means that translates and provides responses in multiple languages,

[0932] Information storage and analysis means for storing and learning data obtained from each of the above means,

[0933] Language detection means for detecting the customer's language and optimizing the response,

[0934] A response adaptation mechanism for adjusting responses in real time based on the customer's emotional state,

[0935] A system comprising the above.

[0936] (Claim 2)

[0937] The system according to claim 1, further comprising a process for accumulating user information through an initial registration process if the user is new, and a function for customizing responses by referring to the customer's past sentiment information.
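
Referring to a customer's past sentiment information could be as simple as keeping a per-customer history and reading off the most frequent entry. The sketch below assumes sentiment is stored as plain labels. The class name `SentimentHistory`, its methods, and the "neutral" default are hypothetical.

```python
from collections import Counter, defaultdict


class SentimentHistory:
    """Keeps past sentiment labels per customer (in memory, for illustration only)."""

    def __init__(self) -> None:
        self._records: defaultdict[str, list[str]] = defaultdict(list)

    def record(self, customer_id: str, sentiment: str) -> None:
        self._records[customer_id].append(sentiment)

    def dominant(self, customer_id: str, default: str = "neutral") -> str:
        """Return the most frequent past sentiment, or the default for unknown customers."""
        history = self._records.get(customer_id)
        if not history:
            return default
        return Counter(history).most_common(1)[0][0]
```

The returned label could then select a response style before the current emotion analysis result is available.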

[0938] (Claim 3)

[0939] The system according to claim 1, comprising means for automatically adjusting response tone and conversation style based on the customer's emotional state, and further providing translation according to the customer's language.

[Explanation of Symbols]

[0940]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A facial recognition system for recognizing a user's face and obtaining corresponding personal information,
an emotion recognition means that analyzes the user's emotions based on the recognized facial data,
a speech recognition means for converting voice input from a user into text and analyzing the information request,
a response generation means using a generation model that generates an appropriate response based on the analysis results,
a multilingual support system that translates and provides responses in multiple languages, and
a data storage and analysis means for accumulating data obtained from each of the above means and performing learning;
a system comprising the above.

2. The system according to claim 1, further comprising a process for accumulating user information via an initial registration process if the user is new.

3. The system according to claim 1, further comprising means for automatically adjusting response tone and conversation style based on user emotion data.