system

An interactive system with speech recognition and natural language processing facilitates quick and effective troubleshooting of information terminal malfunctions through visual and audio guidance, eliminating the need for specialized knowledge.

JP2026096598A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

Application Information

Patent Timeline

03 Dec 2024: Application

15 Jun 2026: Publication (JP2026096598A)

IPC: G06Q50/10; G06Q10/10

AI Tagging

Application Domain

Office automation


AI Technical Summary

Technical Problem

Users face difficulties in quickly and efficiently resolving malfunctions in information terminals like smartphones or tablets due to insufficient understanding of technical terms, leading to lengthy interactions with call centers or stores.

Method used

An interactive system utilizing speech recognition, natural language processing, and visual/audio guidance to identify and resolve issues, allowing users to follow clear instructions without specialized knowledge.

Benefits of technology

Enables users to rapidly and effectively troubleshoot problems using intuitive visual and audio guidance, reducing the burden of specialized knowledge requirements.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To provide a system. [Solution] A system including: means for converting user input into text data using speech recognition technology; means for analyzing, from the text data, the information necessary to identify a malfunction using natural language processing technology; means for searching multiple information sources for relevant information and selecting the optimal solution; means for presenting the selected solution to the user visually and audibly; and means for receiving user feedback and continuing or ending the dialogue depending on the resolution status.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Conventionally, when a malfunction occurs in an information terminal such as a smartphone or a tablet, the user has had to contact a call center or visit a store. The associated waiting time, and the stress caused by not understanding technical terms, have also been problems. There is a need for a means of reducing these inconveniences so that users can solve problems quickly and efficiently.

Means for Solving the Problems

[0005] This invention provides an interactive system for identifying and resolving malfunctions in information terminals experienced by users. Specifically, it uses speech recognition technology to convert user input into text and analyzes it using natural language processing technology to identify problems. Furthermore, it searches for the optimal solution from multiple information sources and presents it to the user through visual and audio guidance. This allows users to easily resolve problems even without specialized knowledge. The ability to choose to continue or end the dialogue through feedback also reduces the burden on the user.

[0006] "Speech recognition technology" is a technology that converts input speech data into text data.

[0007] "Natural language processing technology" is a technology that analyzes meaning and intent from text data to understand human language.

[0008] "Information sources" refer to external or internal databases or historical data that are referenced to collect the data necessary for solving a problem.

[0009] A "solution" refers to the specific steps or methods for resolving an identified problem.

[0010] "Feedback" refers to data received from users, such as responses and results of actions, which the system uses to advance its interaction.

[0011] "Visual guidance" is a method of conveying information to users through text and images displayed on a screen.

[0012] "Voice guidance" is a method of conveying information to users using synthesized speech.

Brief Description of the Drawings

[0013]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] It is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

MODE FOR CARRYING OUT THE INVENTION

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, a numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0017] In the following embodiments, a numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0018] In the following embodiments, a numbered storage is one or more non-volatile storage devices that store various programs, parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, a numbered communication I / F (Interface) is an interface including a communication processor and an antenna, etc. The communication I / F controls communication between multiple computers. Examples of communication standards applied to the communication I / F include wireless communication standards including 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark), and the like.

[0020] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] This invention is an interactive system that enables users to self-resolve problems with information terminals via a server that interacts with a dedicated terminal installed in a store. This system combines speech recognition technology, natural language processing technology, visual guidance, and voice guidance. The program's processing is described below in natural language.

[0035] First, the user approaches a terminal installed in the store and accesses the system using voice or touch controls. The terminal converts the user's input into text data using speech recognition technology. This text data is then transmitted to a server via the internet.

[0036] The server uses natural language processing technology to analyze the received text data and extract keywords necessary to identify the problem. Based on this, the server consults internal databases and external information sources to search for relevant information. Once the optimal solution is determined, the server sends the steps to the terminal.

[0037] The terminal presents the received solution to the user through visual and audio guidance. This allows the user to understand and implement the solution even without specialized knowledge. When the user reports feedback on the results of their actions to the terminal, it is sent to the server, and the interaction continues. If the problem is reported to be resolved, the server terminates the session and records the history.

[0038] As a concrete example, consider a case where a user enters "My smartphone can't connect to Wi-Fi." The server analyzes Wi-Fi-related keywords and searches for the best solution from past history and the internet. For example, it might select a solution such as "Recheck your Wi-Fi settings and try reconnecting." The terminal then guides the user through this solution using character animations and voice prompts, providing specific instructions.
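The Wi-Fi example can be illustrated with a minimal sketch of keyword extraction followed by a lookup against a small knowledge base. The specification does not say how keywords are matched to solutions, so the overlap-count rule, the `KNOWLEDGE_BASE` entries, and the function names below are all assumptions made for illustration.

```python
import re
from typing import Optional, Set

# Hypothetical knowledge base: trigger keywords mapped to a solution text.
KNOWLEDGE_BASE = [
    {"keywords": {"wi-fi", "wifi", "connect"},
     "solution": "Recheck your Wi-Fi settings and try reconnecting."},
    {"keywords": {"battery", "drain", "charge"},
     "solution": "Review battery usage and enable power saving mode."},
]


def extract_keywords(text: str) -> Set[str]:
    """Lowercase the text and split it into word tokens, keeping hyphens."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def select_solution(text: str) -> Optional[str]:
    """Return the solution whose keywords overlap most with the input.

    Returns None when no entry shares any keyword with the input.
    """
    words = extract_keywords(text)
    best = max(KNOWLEDGE_BASE, key=lambda e: len(e["keywords"] & words))
    return best["solution"] if best["keywords"] & words else None
```

Calling `select_solution("My smartphone can't connect to Wi-Fi.")` returns the Wi-Fi entry because two of its keywords appear in the input. A production system would presumably replace this overlap count with the natural language processing model the specification refers to.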

[0039] In this way, this system supports users in resolving problems with their information terminals themselves, enabling a rapid response.

[0040] The following describes the processing flow.

[0041] Step 1:

[0042] The user approaches a terminal installed in the store and activates the agent via touch or voice. The terminal uses sensors to detect the user's presence and initiates an initial screen or voice guidance.

[0043] Step 2:

[0044] The user describes a problem with their smartphone by voice. The terminal activates its speech recognition system and converts the user's voice into text data. This text data is then sent to the server via the internet.

[0045] Step 3:

[0046] The server analyzes the received text data using natural language processing techniques, extracts keywords related to the problem, and understands its content. Based on the analysis results, the server searches its internal database and external information sources for possible causes and potential solutions.

[0047] Step 4:

[0048] The server selects the most suitable solution from multiple options. The selected solution is then sent to the user's terminal in the form of visual and audio guidance to ensure user understanding.

[0049] Step 5:

[0050] The terminal displays the received solution on the screen and provides voice guidance to the user. The user understands the specific solution and attempts to perform the operation on their smartphone.

[0051] Step 6:

[0052] The user gives feedback to the terminal on the result of trying the solution. This feedback may be, for example, "The problem is solved" or "It's not solved yet."

[0053] Step 7:

[0054] The server receives feedback from the user and ends the session if the problem is resolved. If it remains unresolved, it explores alternative solutions and sends further instructions to the terminal. This allows the interaction to continue.

[0055] Step 8:

[0056] Once the problem is resolved or support has ended, the server logs the session and saves the support history to the database. The terminal then prompts the user to confirm the support details and guides them through the termination process.
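Steps 4 through 8 amount to a loop: present a solution, collect feedback, and either end the session or try an alternative, logging each event. The sketch below is one possible reading of that loop, not the implementation in the specification; the `Session` class, the event labels, and the `get_feedback` callback are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    problem: str
    candidates: list            # solutions ordered from most to least suitable
    history: list = field(default_factory=list)
    resolved: bool = False


def run_session(session, get_feedback):
    """Present candidates one at a time until the user reports success.

    get_feedback(solution) stands in for Steps 5 and 6: the terminal shows
    the solution and returns True if the user reports the problem as solved.
    """
    for solution in session.candidates:
        session.history.append(("presented", solution))
        if get_feedback(solution):
            session.resolved = True
            session.history.append(("feedback", "resolved"))
            break
        # Step 7: unresolved, so fall through to the next alternative.
        session.history.append(("feedback", "unresolved"))
    # Step 8: record how the session ended, whether resolved or not.
    session.history.append(("session_end", session.resolved))
    return session
```

Keeping the feedback source as a callback lets the same loop be driven by a touch panel, a voice reply, or a test stub.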

[0057] (Example 1)

[0058] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0059] When information terminals malfunction, there is a need for a method that allows users to quickly and effectively resolve problems themselves, even without specialized knowledge. Current support systems often make it difficult for users to understand the appropriate operating procedures, resulting in lengthy troubleshooting processes. Therefore, there is a need for a system that intuitively understands the user's problem and guides them to the appropriate solution.

[0060] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0061] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for visually displaying the user's progress and clearly indicating specific operating procedures when presenting a solution. As a result, the user can intuitively understand the problem with the information terminal and implement the solution themselves.

[0062] "Speech recognition technology" is a technology that converts voice input from a user into digital text data.

[0063] "Natural language processing technology" is a technique that analyzes text data, understands the structure of the information, and extracts necessary keywords and context.

[0064] "A means of exploring relevant information and selecting the optimal solution" refers to a function that collects necessary data from multiple sources and determines the most suitable solution for the user's problem.

[0065] "Means of presenting to the user visually and aurally" refers to methods of effectively communicating solutions to the user through on-screen visual elements and audio guides.

[0066] "Means of receiving user feedback and continuing or ending the dialogue depending on the resolution status" refers to a mechanism in which the system determines the next step in the dialogue based on responses and reports from the user and provides supplementary information as needed.

[0067] "Means of visually displaying the user's progress and clearly indicating specific operating procedures" refers to a method of supporting the user by visualizing on the screen which step the user is currently in and clearly indicating the flow of operations.
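As an illustration of the progress display described in [0067], the following sketch renders which step the user is on as plain text. The marker symbols and the function name `render_progress` are assumptions; the specification only states that progress is visualized on the screen.

```python
def render_progress(steps, current):
    """Return a text view marking completed, current, and pending steps.

    steps   -- list of step descriptions
    current -- 0-based index of the step the user is on
    """
    if not 0 <= current < len(steps):
        raise ValueError("current must index an existing step")
    lines = [f"Step {current + 1} of {len(steps)}"]
    for i, step in enumerate(steps):
        if i < current:
            marker = "[x]"      # already completed
        elif i == current:
            marker = "[>]"      # step in progress
        else:
            marker = "[ ]"      # not yet started
        lines.append(f"{marker} {i + 1}. {step}")
    return "\n".join(lines)
```

On a real terminal the same state would more likely drive a graphical progress bar or animation; the text form simply makes the underlying data explicit.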

[0068] This invention is a system that allows users to self-resolve problems with information terminals using dedicated terminals installed in stores or other locations. Specific embodiments for carrying out the invention are shown below.

[0069] The user approaches a terminal in the store and accesses the system using voice or touch controls. The terminal is equipped with voice recognition technology, which converts the user's voice input into text data. Voice recognition software is used for this process. The generated text data is sent to a server via the internet.

[0070] The server analyzes the received text data using natural language processing techniques. This process employs a natural language processing algorithm, which is a generative AI model. Based on the analysis, the server extracts the information necessary to identify the problem and consults internal databases and external information sources to find a solution.

[0071] Next, the server selects the optimal solution based on the relevant information and sends it to the terminal. The terminal then presents the received solution to the user using a combination of visual and audio guidance. This allows users to understand the specific operating procedures and solve problems even without specialized knowledge.

[0072] The user follows the given instructions, performs the operation, and reports the results as feedback to the terminal. Based on this feedback, the terminal continues its interaction with the server and provides additional solutions as needed. If the problem is resolved based on the user's report, the server terminates the session and records the history.

[0073] As a concrete example, consider a scenario where a user enters "My smartphone can't connect to Wi-Fi" into the terminal. The server analyzes the Wi-Fi-related information and selects a solution such as "Recheck Wi-Fi settings and try reconnecting." The terminal then displays this solution on the screen, providing the user with specific steps.

[0074] This system addresses problems by using prompt messages such as, "Please tell me effective steps to improve my smartphone's Wi-Fi connection." This allows for the quick and effective resolution of problems with the user's information device.

[0075] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0076] Step 1: The user approaches a terminal installed in the store and provides voice or touch input. This input includes details about the problem with the user's device. The terminal uses speech recognition technology to convert the voice input into text data. Speech recognition software is used for this conversion, and the output is in text format.

[0077] Step 2: The terminal sends the generated text data to the server via the internet. The server receives this text data and analyzes it using natural language processing technology. Here, a generative AI model is used to extract keywords and contextual information necessary to identify problems from the data. The output is the analyzed information.

[0078] Step 3: The server identifies the type of malfunction based on the extracted information and searches for the best solution by accessing internal databases and external information sources. This process may utilize historical data and statistical data. The output is a specific solution presented to the user.

[0079] Step 4: The server sends the selected solution to the terminal. The terminal receives the solution and presents specific steps to the user using visual and auditory means. Visually, animations and illustrations are displayed on the screen, and audio guidance is output using synthesized speech technology. The output is clear and easy-to-understand instructions for the user.

[0080] Step 5: The user follows the instructions on the terminal and performs the troubleshooting steps, then enters the result as feedback on the terminal. The feedback reports whether the problem has been resolved and may include additional comments as needed.

[0081] Step 6: The terminal sends user feedback to the server, which uses this to decide whether to continue or end the interaction. The server analyzes the feedback, ends the session if the problem is resolved, and records the history. The output is either the end of the interaction or further instructions.
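Step 3 mentions that the search may use historical and statistical data, without saying how. One plausible approach, shown here purely as an assumption, is to rank candidate solutions by their past resolution rate with Laplace smoothing, so that untried solutions are neither favored nor excluded. The function name and the shape of `history` are hypothetical.

```python
def rank_solutions(candidates, history):
    """Order candidate solutions by smoothed historical resolution rate.

    history maps a solution id to (times_resolved, times_tried).
    The score (resolved + 1) / (tried + 2) gives untried solutions 0.5.
    """
    def score(solution):
        resolved, tried = history.get(solution, (0, 0))
        return (resolved + 1) / (tried + 2)

    return sorted(candidates, key=score, reverse=True)
```

The history recorded at the end of each session in Step 6 could feed these counts, so that solutions that worked for earlier users are offered first.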

[0082] (Application Example 1)

[0083] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0084] When users encounter problems with their information terminals, it is difficult for them to resolve the issue themselves without specialized knowledge. Furthermore, traditional support systems often struggle to provide users with the information they need quickly and accurately, leading to prolonged interactions. Additionally, the solutions presented may be unclear, increasing the risk of users making incorrect decisions.

[0085] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0086] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for displaying virtual elements using augmented reality technology and guiding the user through specific operating procedures. As a result, the user receives clear visual and auditory guidance, enabling them to quickly and accurately resolve their problems.

[0087] "Speech recognition technology" is a technology that converts voice input into text data, processing what the user says as text information.

[0088] "Natural language processing technology" is a technology that analyzes text data to understand its meaning and intent, and extracts information necessary to identify problems from language data.

[0089] "Information sources" refer to internal databases or external knowledge bases that a server accesses when searching for relevant information.

[0090] "Presenting visually and aurally" means providing the selected solution to the user through visual displays and audio explanations.

[0091] "Augmented reality technology" is a technology that overlays virtual elements onto the real world, visually guiding users through specific operating procedures.

[0092] "Communication services" refer to system functions that record user feedback and history, and exchange information between servers.

[0093] A system for carrying out this invention includes a user, a terminal, and a server as its main components.

[0094] First, the user accesses the terminal to resolve an issue with their information terminal. The terminal accepts input from the user via voice or touch, and converts this input into text data using speech recognition technology. This process utilizes speech recognition software such as Google® Cloud Speech-to-Text.

[0095] Next, the text data is sent to a server. This server uses natural language processing technologies such as OpenAI® GPT to analyze the text and extract keywords necessary to identify the problem. Based on the analyzed text, the server searches for relevant information by referring to various sources such as internal databases and the internet, and selects the optimal solution.

[0096] The selected solution is returned to the terminal, which uses ARKit (iOS) or ARCore (Android®) to present virtual elements to the user using augmented reality technology. Additionally, Amazon Polly is used to provide voice guidance, visually and audibly guiding the user through specific operating procedures.

[0097] For example, if a user reports a problem such as "my smartphone's Wi-Fi won't connect," the server analyzes Wi-Fi-related keywords and selects a solution such as "check your Wi-Fi settings and try reconnecting." The terminal then presents these instructions using augmented reality guidelines and voice prompts, allowing the user to resolve the issue by following the instructions.

[0098] By using a generative AI model, it is possible to generate appropriate solutions to user questions. An example of a prompt to be input to the generative AI model is: "The user has reported a problem with their smartphone's Wi-Fi. Please briefly explain the steps to the best solution. In particular, please provide information that can be supported both visually and audibly in a way that even a beginner can understand." In this way, the system of the present invention helps to quickly and effectively resolve technical problems faced by users.
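The example prompt in [0098] can be assembled from a template so that the reported problem is filled in per session. This is a minimal sketch; the function name `build_prompt` and the `beginner` flag are illustrative assumptions, and the wording is taken from the example prompt above.

```python
def build_prompt(problem_summary: str, beginner: bool = True) -> str:
    """Assemble a prompt for the generative AI model.

    problem_summary -- noun phrase describing the reported problem,
                       e.g. "a problem with their smartphone's Wi-Fi"
    beginner        -- when True, ask for beginner-friendly guidance
    """
    parts = [
        f"The user has reported {problem_summary}.",
        "Please briefly explain the steps to the best solution.",
    ]
    if beginner:
        parts.append(
            "In particular, please provide information that can be "
            "supported both visually and audibly in a way that even a "
            "beginner can understand."
        )
    return " ".join(parts)
```

With `problem_summary="a problem with their smartphone's Wi-Fi"` the function reproduces the example prompt word for word.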

[0099] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0100] Step 1:

[0101] The user inputs a question into the terminal via voice or touch. The terminal accepts the input and uses speech recognition technology to convert voice input into text data. During this process, the user's voice data is converted to text using Google Cloud Speech-to-Text. The output is then sent to the server as natural language text.

[0102] Step 2:

[0103] The server analyzes the received text data using natural language processing techniques. Specifically, it uses OpenAI's natural language processing model to extract keywords from the text data to identify the problem. The input to this process is the text data from step 1, and the output is the extracted keyword set.

[0104] Step 3:

[0105] The server references internal databases and external information sources to search for and select the optimal solution based on extracted keywords. For example, it collects relevant technical information from the internet and determines the optimal solution using its own algorithm. The input is the keyword set from step 2, and the output is the selected solution.

[0106] Step 4:

[0107] The server sends the selected solution to the terminal. The terminal receives this and prepares a display to provide visual guidance to the user using augmented reality technology. Using ARKit or ARCore, virtual guidelines are overlaid on the user's real field of view. The input is the solution from step 3, and the output is the guidance in the visual display.

[0108] Step 5:

[0109] The terminal simultaneously uses Amazon Polly to generate voice guidance and provide auditory feedback to the user. Specifically, the terminal plays an audio explanation of the solution through its speaker, guiding the user through the specific actions they should take next. The input is the solution from step 3, and the output is the audio format explanation.

[0110] Step 6:

[0111] The user attempts to resolve the problem by following the instructions and provides feedback to the terminal. The terminal collects this feedback and reports it to the server. The server processes the received feedback and determines whether the problem has been resolved. If the problem is resolved, the session ends and the user's history is recorded. The input is the feedback from the user, and the output is the update of the history database.
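Each step above names its input and output, which suggests a simple pipeline in which one stage's output feeds the next. The sketch below wires Steps 1 through 5 together with the stages passed in as plain functions. It deliberately does not call Google Cloud Speech-to-Text, an OpenAI model, ARKit, ARCore, or Amazon Polly; the parameters `transcribe`, `extract`, and `select` and the `Guidance` type are hypothetical stand-ins for those services.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Guidance:
    visual: str   # stands in for the AR overlay content (Step 4)
    audio: str    # stands in for the synthesized speech script (Step 5)


def run_pipeline(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    extract: Callable[[str], FrozenSet[str]],
    select: Callable[[FrozenSet[str]], str],
) -> Guidance:
    """Chain the stages so each output becomes the next stage's input."""
    text = transcribe(audio_in)      # Step 1: voice data -> text
    keywords = extract(text)         # Step 2: text -> keyword set
    solution = select(keywords)      # Step 3: keyword set -> solution
    # Steps 4 and 5: the same solution is rendered visually and as audio.
    return Guidance(visual=solution, audio=solution)
```

Passing the stages as functions keeps the flow testable with stubs and lets any one service be swapped without touching the others.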

[0112] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0113] This invention combines emotion recognition technology with an interactive system that allows users to self-resolve problems with information terminals. By incorporating an emotion engine, this system can analyze the user's emotional state from their voice and facial expressions and communicate appropriately based on that analysis.

[0114] First, the user accesses the system via a terminal installed in the store and describes the problem verbally. The terminal uses speech recognition technology to transcribe what is said into text and sends it to the server. Simultaneously, an emotion engine analyzes the user's tone of voice and facial expressions to identify their emotional state.

[0115] The server analyzes the received text data using natural language processing technology to extract information necessary to identify the problem. Furthermore, it considers the user's emotional state based on information from the emotion engine, and provides more careful guidance if the user is feeling anxious or frustrated.

[0116] The server selects the optimal solution from multiple sources based on the identified problem and the user's emotional state, and sends it to the terminal. The terminal provides the solution through visual and audio guidance, adjusting the tone of communication according to the user's emotional state.

[0117] For example, if a user complains that their smartphone battery dies too quickly, the emotion engine might detect frustration. In this case, the server will quickly and concisely suggest battery-saving methods that can be tried with simple steps, and the device will guide the user through these steps in a gentle and friendly tone.

[0118] Thus, the system of the present invention can easily provide guidance that reflects the user's emotional state, thereby improving the user experience.

[0119] The following describes the processing flow.

[0120] Step 1:

[0121] The user approaches a terminal installed in the store and begins to operate it. The terminal activates its emotion engine and starts collecting emotion data from the user's voice and facial expressions. At the same time, the terminal also activates its voice recognition system to convert the voice input into text.

[0122] Step 2:

[0123] The terminal converts the voice input into text data and, at the same time, sends the emotion data analyzed by the emotion engine to the server. The server needs this data to understand the user's emotional state.

[0124] Step 3:

[0125] The server analyzes the received text data using a natural language processing engine to extract keywords necessary for identifying the problem. Simultaneously, it analyzes the emotion data to understand the user's emotional state.

[0126] Step 4:

[0127] The server selects the optimal solution from multiple sources based on the analysis results. This selection process takes into account the user's emotional state; for example, if the user is frustrated, an intuitive and concise solution is more likely to be chosen.

[0128] Step 5:

[0129] After the server selects the optimal solution, it sends that information to the terminal. The terminal visually displays the received solution on its screen and explains it clearly to the user through voice guidance.

[0130] Step 6:

[0131] The device adjusts the tone and expression of its guidance based on the user's emotional state. For example, if the user is feeling anxious, the guidance will be delivered in a gentle tone.

[0132] Step 7:

[0133] The user tries the suggested solution and reports the results as feedback to the device. The device then sends this feedback to the server.

[0134] Step 8:

[0135] The server receives user feedback and terminates the session if the problem is resolved. If the problem remains unresolved, the server explores alternative solutions and continues the interaction by sending instructions to the terminal again.
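The emotion-aware selection in Step 4 and the feedback loop in Steps 7 and 8 can be sketched as follows. This is only an illustration under stated assumptions: the `Solution` class, its `steps` count, the emotion label `"frustrated"`, and the `get_feedback` callback are all hypothetical, and "prefer solutions with fewer steps" is one possible reading of "an intuitive and concise solution is more likely to be chosen".

```python
from dataclasses import dataclass


@dataclass
class Solution:
    text: str
    steps: int  # assumed measure of how many user actions are required


def order_candidates(candidates, emotion):
    """Step 4: when the user is frustrated, try shorter solutions first."""
    if emotion == "frustrated":
        return sorted(candidates, key=lambda s: s.steps)
    return list(candidates)


def run_session(candidates, emotion, get_feedback):
    """Steps 5-8: present solutions until one is reported as resolved."""
    history = []
    for solution in order_candidates(candidates, emotion):
        resolved = get_feedback(solution)  # Step 7: the user tries it and reports
        history.append((solution.text, resolved))
        if resolved:  # Step 8: terminate the session
            return True, history
    return False, history  # no alternative solutions remain


if __name__ == "__main__":
    options = [Solution("Reset network settings", 5),
               Solution("Toggle Wi-Fi off and on", 1)]
    print(run_session(options, "frustrated",
                      lambda s: s.text == "Reset network settings"))
```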

[0136] (Example 2)

[0137] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0138] When users attempt to resolve issues with their information terminals themselves, typical interactive systems often fail to consider the user's emotional state, resulting in a degraded user experience. In particular, if users are feeling anxious or frustrated, an inadequate response can lead to further stress. The challenge lies in providing a system that addresses these issues while efficiently and effectively resolving problems.

[0139] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0140] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for identifying the user's emotional state using sentiment analysis technology and adjusting the tone of communication accordingly. This enables flexible and appropriate responses in accordance with the user's emotional state, improving the user experience and enabling efficient problem solving.

[0141] "Speech recognition technology" is a technology that converts speech data into text data, and it is a process that analyzes a user's voice and turns it into text.

[0142] "Natural language processing technology" is a technique for extracting and analyzing specific information from text data, and is a method for computers to understand and process human language.

[0143] "Sentiment analysis technology" (also called emotion analysis technology) is a technology that analyzes a user's voice tone and facial expressions to identify their current emotional state.

[0144] "Communication tone" refers to the language and atmosphere used when presenting information or solutions to users, and it should be adjusted to match the user's emotions.

[0145] "Identifying a problem" means diagnosing and identifying the root cause of a problem or malfunction that a user is experiencing.

[0146] "Information sources" refer to various databases and knowledge bases that provide relevant information useful for solving problems.

[0147] "Feedback" refers to information collected from users, such as their reactions and opinions, which is used to adjust the system's response.

[0148] This invention is an interactive system centered on a user-operated information terminal, realized by integrating speech recognition technology, natural language processing technology, and sentiment analysis technology. The user begins by verbally describing the problem to the information terminal. The hardware used in this process is a general-purpose information terminal including a microphone and camera. The software then uses speech recognition technology (e.g., general-purpose speech recognition software) to convert the speech into text.

[0149] The text data generated by speech recognition is sent to the server, where natural language processing software (e.g., a natural language processing engine) analyzes the information necessary to identify the problem. In parallel, the server uses sentiment analysis technology (e.g., a sentiment analysis module) to identify the user's emotional state from their voice tone and facial expressions.

[0150] Based on information received from the server, the device presents solutions visually and audibly, in a tone appropriate to the user's emotional state. This reduces stress and facilitates self-resolution of problems. For example, if a user says, "My smartphone screen isn't working," and the emotion analysis technology detects frustration, the server will quickly select a specific solution, and the device will gently and calmly instruct the user to "press and hold the power button for 10 seconds to restart."
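Adjusting the tone of an instruction to the detected emotional state, as in the restart example above, could be sketched with simple templates. The emotion labels, the template wording, and the names `TONE_TEMPLATES` and `render_guidance` are assumptions made for illustration; the description does not specify how the tone is produced.

```python
# Hypothetical tone templates keyed by an assumed emotion label.
TONE_TEMPLATES = {
    "frustrated": "No problem, this is a quick fix. Please {instruction}.",
    "anxious": "Don't worry, we'll go through this together. Please {instruction}.",
    "neutral": "Please {instruction}.",
}


def render_guidance(instruction: str, emotion: str) -> str:
    """Wrap the same instruction in wording that matches the user's emotion."""
    template = TONE_TEMPLATES.get(emotion, TONE_TEMPLATES["neutral"])
    return template.format(instruction=instruction)


if __name__ == "__main__":
    print(render_guidance(
        "press and hold the power button for 10 seconds to restart",
        "frustrated"))
```

Keeping the instruction text separate from the tone wrapper means the selected solution itself does not change with the emotion; only its delivery does.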

[0151] An example of a prompt might be, "Tell me how to recognize a specific emotional state and design a user-friendly solution based on it." This prompt allows the generative AI model to learn and provide methods for adjusting responses based on the user's emotions.

[0152] In this way, it is possible to improve the user experience while efficiently solving problems.

[0153] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0154] Step 1:

[0155] The user inputs a description of the problem by voice into the information terminal. The terminal's microphone captures this voice data, and the terminal uses speech recognition technology to convert it into text data. The input is the voice data, and the output is the text data.

[0156] Step 2:

[0157] The terminal sends the generated text data to the server. The server analyzes this text data using natural language processing technology and extracts the information necessary to identify the problem. The input is the text data, and the output is the cause of the problem and related information.

[0158] Step 3:

[0159] The device processes the user's voice tone and facial expressions using emotion analysis technology to identify the user's emotional state. The input is voice tone and facial expression data, and the output identifies the emotional state (e.g., anxiety or frustration).

[0160] Step 4:

[0161] The server selects the optimal solution from multiple information sources based on malfunction information obtained from natural language processing and emotional state information from sentiment analysis technology. The input includes malfunction information and emotional state information, and the output is the selection of the best solution to propose to the user.

[0162] Step 5:

[0163] The terminal provides the user with solutions received from the server. Solutions are presented via a visual display and audio output, with guidance delivered in a tone appropriate to the user's emotional state. The input to this process is the solution data, and the output is the guidance presented to the user.
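Step 3 of this flow takes voice tone and facial expression data as input and outputs a single emotional state. One way to combine the two, sketched here under assumptions, is a weighted average of per-label scores. The score dictionaries, the label names, and the `voice_weight` parameter are hypothetical; the description does not state how the two signals are fused.

```python
from __future__ import annotations


def fuse_emotions(voice: dict[str, float], face: dict[str, float],
                  voice_weight: float = 0.5) -> tuple[str, float]:
    """Combine per-label scores from voice and face into one emotional state."""
    labels = set(voice) | set(face)
    combined = {
        label: voice_weight * voice.get(label, 0.0)
        + (1.0 - voice_weight) * face.get(label, 0.0)
        for label in labels
    }
    # Iterate in sorted order so ties are broken deterministically.
    best = max(sorted(combined), key=combined.get)
    return best, combined[best]


if __name__ == "__main__":
    voice_scores = {"frustration": 0.7, "anxiety": 0.2, "neutral": 0.1}
    face_scores = {"frustration": 0.5, "anxiety": 0.1, "neutral": 0.4}
    print(fuse_emotions(voice_scores, face_scores))
```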

[0164] (Application Example 2)

[0165] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0166] Users facing malfunctions in information terminals or home electronic devices often spend considerable time and effort identifying the problem and selecting a solution. Furthermore, responses that disregard the user's emotional state can detract from the user experience. This invention aims to provide a system for quickly and accurately resolving malfunctions while considering the user's emotional state.

[0167] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0168] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for recognizing the user's emotional state from their voice and facial expressions using emotion analysis technology. This enables appropriate communication according to the user's emotional state, improving the user experience while allowing for rapid problem resolution.

[0169] "Speech recognition technology" is a technology that records a user's speech as digital data and converts it into text data.

[0170] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text data.

[0171] "Emotion analysis technology" is a technology used to identify a user's emotional state from their voice and facial expressions.

[0172] An "information source" is a collection of multiple data sets that provide reference data or a knowledge base for resolving problems.

[0173] "Communication style" refers to the guidelines for selecting appropriate expressions and tone of voice when interacting with users.

[0174] "Perceptual devices" are devices such as sensors and cameras that robots use to acquire information about the user and their environment.

[0175] "Dialogue history" refers to a record of information exchanges that have taken place between the user and the system in the past.

[0176] "User experience" is a comprehensive evaluation of the satisfaction and convenience that users feel when using a system.

[0177] The system for implementing the present invention combines speech recognition, natural language processing, and sentiment analysis technologies. The server receives user voice data and visual data from a terminal equipped with a microphone and camera. Speech recognition software (e.g., Google Speech-to-Text) converts this voice data into text data. Next, a natural language processing library (e.g., NLTK) analyzes this text data to obtain information necessary for identifying defects.

[0178] Simultaneously, emotion analysis technology (e.g., Microsoft® Azure® Emotion API) is used to analyze the user's voice and facial expression data and recognize their emotional state. The server combines these analysis results, searches databases and knowledge bases for information to resolve problems, and selects the optimal solution according to the user's emotional state.

[0179] The selected solution is presented to the user visually and audibly through the device. The tone of communication is adjusted according to the user's emotional state. Furthermore, additional information about the user's surroundings can be acquired using perceptual devices, such as the sensors and cameras of a robot.

[0180] For example, if a user states that they are experiencing a problem where their smart speaker is not playing music, the system analyzes the statement and detects dissatisfaction through sentiment analysis. The system then provides a quick solution, improving the user experience by gently suggesting, "First, let's check the power and connection status of your smart speaker."

[0181] An example of a prompt message could be text that reads, "Recognize the emotions of users experiencing problems with their smart home devices and design guidelines to provide solutions based on those emotions."

[0182] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0183] Step 1:

[0184] The user describes the problem through a voice input device. The voice data is input into the terminal and converted into text data by speech recognition software. This process converts the user's spoken content into an analyzable text format.

[0185] Step 2:

[0186] The server receives text data and analyzes its content using natural language processing technology. It receives a text description of a problem as input and processes the data to extract relevant keywords and important information. As a result, the information necessary to identify the problem is output.

[0187] Step 3:

[0188] The user's voice and visual data are input into the emotion analysis system. The server uses emotion analysis technology to analyze the data and identify the user's emotional state. The input is the data obtained from voice tone and facial expressions, and the output is the identified emotional state.

[0189] Step 4:

[0190] Based on the identified problem and the user's emotional state, the server selects several potential solutions. In this step, the database is referenced, and data retrieval and calculations are performed to provide the user with the most relevant information.

[0191] Step 5:

[0192] The server sends the selected solution to the terminal. The solution is presented to the user visually and audibly on the terminal, with the communication style adjusted according to the user's emotional state. This ensures that the information is presented in a way that is more easily accepted by the user.

[0193] Step 6:

[0194] The user provides feedback on the presented solution. The terminal sends this feedback to the server, which then analyzes the results to determine the next step in the interaction. The output of this step is a decision to either continue or end the interaction.
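The continue-or-end decision in Step 6 can be illustrated with a small sketch. Simple phrase matching stands in for whatever analysis the server actually performs, and ending the interaction when no alternatives remain is an added assumption. The phrase lists and the names `Decision` and `decide_next_step` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    END = "end"
    CONTINUE = "continue"


# Illustrative phrase lists. Unresolved phrases are checked first because
# "not solved" also contains "solved".
UNRESOLVED_PHRASES = ("not solved", "not resolved", "not fixed", "still")
RESOLVED_PHRASES = ("solved", "resolved", "fixed", "works")


def decide_next_step(feedback: str, alternatives_left: int) -> Decision:
    """Return END if the problem is resolved or nothing else can be tried."""
    text = feedback.lower()
    resolved = (not any(p in text for p in UNRESOLVED_PHRASES)
                and any(p in text for p in RESOLVED_PHRASES))
    if resolved or alternatives_left == 0:
        return Decision.END
    return Decision.CONTINUE


if __name__ == "__main__":
    print(decide_next_step("The speaker still does not play music", 2))
```

Feedback that matches neither list is treated as unresolved, so an ambiguous reply keeps the interaction going rather than closing it prematurely.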

[0195] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0196] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
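The input to the data generation model 58, a prompt containing instructions plus inference data in one or more modalities, can be pictured as a simple request structure. The sketch below is an assumption about how such an input might be packaged; `InferenceRequest` and `build_request` are hypothetical names and do not correspond to the API of any particular generative AI service.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InferenceRequest:
    prompt: str                 # instructions for the model
    text: str | None = None     # text data representing text
    audio: bytes | None = None  # audio data representing speech
    image: bytes | None = None  # image data representing images

    def modalities(self) -> list[str]:
        """List which kinds of inference data are attached."""
        present = []
        if self.text is not None:
            present.append("text")
        if self.audio is not None:
            present.append("audio")
        if self.image is not None:
            present.append("image")
        return present


def build_request(prompt: str, *, text: str | None = None,
                  audio: bytes | None = None,
                  image: bytes | None = None) -> InferenceRequest:
    """Validate that both a prompt and some inference data are supplied."""
    if not prompt.strip():
        raise ValueError("a prompt with instructions is required")
    request = InferenceRequest(prompt, text, audio, image)
    if not request.modalities():
        raise ValueError("at least one kind of inference data is required")
    return request
```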

[0197] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0198] [Second Embodiment]

[0199] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0200] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0201] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0202] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0203] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0204] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0205] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0206] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0207] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0208] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0209] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0210] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0211] This invention is an interactive system that enables users to self-resolve problems with information terminals via a server that interacts with a dedicated terminal installed in a store. This system combines speech recognition technology, natural language processing technology, visual guidance, and voice guidance. The program's processing is described below in natural language.

[0212] First, the user approaches a terminal installed in the store and accesses the system using voice or touch controls. The terminal converts the user's input into text data using speech recognition technology. This text data is then transmitted to a server via the internet.

[0213] The server uses natural language processing technology to analyze the received text data and extract keywords necessary to identify the problem. Based on this, the server consults internal databases and external information sources to search for relevant information. Once the optimal solution is determined, the server sends the steps to the terminal.

[0214] The terminal presents the received solution to the user through visual and audio guidance. This allows the user to understand and implement the solution even without specialized knowledge. When the user reports feedback on the results of their actions to the terminal, it is sent to the server, and the interaction continues. If the problem is reported to be resolved, the server terminates the session and records the history.

[0215] As a concrete example, consider a case where a user enters "My smartphone can't connect to Wi-Fi." The server analyzes Wi-Fi-related keywords and searches for the best solution from past history and the internet. For example, it might select a solution such as "Recheck your Wi-Fi settings and try reconnecting." The device then guides the user through this solution using character animations and voice prompts, providing specific instructions.

[0216] In this way, this system supports users in resolving problems with their information terminals themselves, enabling a rapid response.

[0217] The following describes the processing flow.

[0218] Step 1:

[0219] The user approaches a terminal installed in the store and activates the agent via touch or voice. The terminal uses sensors to detect the user's presence and initiates an initial screen or voice guidance.

[0220] Step 2:

[0221] The user describes a problem with their smartphone using voice. The device activates its voice recognition system and converts the user's voice into text data. This text data is then sent to a server via the internet.

[0222] Step 3:

[0223] The server analyzes the received text data using natural language processing techniques, extracts keywords related to the problem, and understands its content. Based on the analysis results, the server searches its internal database and external information sources for possible causes and potential solutions.

[0224] Step 4:

[0225] The server selects the most suitable solution from multiple options. The selected solution is then sent to the user's terminal in the form of visual and audio guidance to ensure user understanding.

[0226] Step 5:

[0227] The device visually displays the received solution on the screen and provides voice guidance to the user. The user understands the specific solution and attempts to perform the operation on their smartphone.

[0228] Step 6:

[0229] The user reports the result of trying the solution to the terminal as feedback. This feedback may be, for example, "The problem is solved" or "It's not solved yet."

[0230] Step 7:

[0231] The server receives feedback from the user and ends the session if the problem is resolved. If it remains unresolved, it explores alternative solutions and sends further instructions to the terminal. This allows the interaction to continue.

[0232] Step 8:

[0233] Once the problem is resolved or support has ended, the server logs the session and saves the support history to the database. The terminal then prompts the user to confirm the support details and guides them through the termination process.
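Step 8, in which the server logs the session and saves the support history to the database, could look like the following sketch. SQLite is used here only because it ships with Python; the description does not name a database product, and the table name `support_history` and its columns are hypothetical.

```python
import sqlite3


def open_history_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database with an assumed support-history table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS support_history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               problem TEXT NOT NULL,
               solution TEXT NOT NULL,
               resolved INTEGER NOT NULL,
               ended_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def log_session(conn: sqlite3.Connection, problem: str,
                solution: str, resolved: bool) -> int:
    """Record one finished support session and return its row id."""
    cur = conn.execute(
        "INSERT INTO support_history (problem, solution, resolved) "
        "VALUES (?, ?, ?)",
        (problem, solution, int(resolved)),
    )
    conn.commit()
    return cur.lastrowid


if __name__ == "__main__":
    db = open_history_db()
    log_session(db, "My smartphone can't connect to Wi-Fi",
                "Recheck your Wi-Fi settings and try reconnecting.", True)
    print(db.execute("SELECT problem, resolved FROM support_history").fetchall())
```

A stored history of this kind is also what Step 3 of the flow would consult when it searches past cases for a solution.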

[0234] (Example 1)

[0235] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0236] When information terminals malfunction, there is a need for a method that allows users to quickly and effectively resolve problems themselves, even without specialized knowledge. Current support systems often make it difficult for users to understand the appropriate operating procedures, resulting in lengthy troubleshooting processes. Therefore, there is a need for a system that intuitively understands the user's problem and guides them to the appropriate solution.

[0237] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0238] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for visually displaying the user's progress and clearly indicating specific operating procedures when presenting a solution. As a result, the user can intuitively understand the problem with the information terminal and implement the solution themselves.

[0239] "Speech recognition technology" is a technology that converts voice input from a user into digital text data.

[0240] "Natural language processing technology" is a technique that analyzes text data, understands the structure of the information, and extracts necessary keywords and context.

[0241] "A means of exploring relevant information and selecting the optimal solution" refers to a function that collects necessary data from multiple sources and determines the most suitable solution for the user's problem.

[0242] "Means of presenting to the user visually and aurally" refers to methods of effectively communicating solutions to the user through on-screen visual elements and audio guides.

[0243] "Means of receiving user feedback and continuing or ending the dialogue depending on the resolution status" refers to a mechanism in which the system determines the next step in the dialogue based on responses and reports from the user and provides supplementary information as needed.

[0244] "Means of visually displaying the user's progress and clearly indicating specific operating procedures" refers to a method of supporting the user by visualizing on the screen which step the user is currently in and clearly indicating the flow of operations.

[0245] This invention is a system that allows users to self-resolve problems with information terminals using dedicated terminals installed in stores or other locations. Specific embodiments for carrying out the invention are shown below.

[0246] The user approaches a terminal in the store and accesses the system using voice or touch controls. The terminal is equipped with voice recognition technology, which converts the user's voice input into text data. Voice recognition software is used for this process. The generated text data is sent to a server via the internet.

[0247] The server analyzes the received text data using natural language processing techniques. This process employs a generative AI model as the natural language processing algorithm. Based on the analysis, the server extracts the information necessary to identify the problem and consults internal databases and external information sources to find a solution.

[0248] Next, the server selects the optimal solution based on the relevant information and sends it to the terminal. The terminal then presents the received solution to the user using a combination of visual and audio guidance. This allows users to understand the specific operating procedures and solve problems even without specialized knowledge.

[0249] The user follows the given instructions, performs the operation, and reports the results as feedback to the terminal. Based on this feedback, the terminal continues its interaction with the server and provides additional solutions as needed. If the problem is resolved based on the user's report, the server terminates the session and records the history.

[0250] As a concrete example, consider a scenario where a user enters "My smartphone can't connect to Wi-Fi" into their device. The server analyzes the Wi-Fi-related information and selects a solution such as "Recheck Wi-Fi settings and try reconnecting." The device then visually displays this solution on the screen, providing the user with specific steps.
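The means of "visually displaying the user's progress and clearly indicating specific operating procedures" can be illustrated with a text-only sketch that marks completed, current, and remaining steps. The breakdown of the Wi-Fi solution into three steps, the marker symbols, and the function name `render_progress` are assumptions; an actual terminal would draw this with on-screen graphics.

```python
from typing import List


def render_progress(steps: List[str], current: int) -> str:
    """Render a checklist showing which step the user is currently on."""
    if not 0 <= current < len(steps):
        raise ValueError("current must be a valid index into steps")
    lines = [f"Step {current + 1} of {len(steps)}"]
    for i, step in enumerate(steps):
        if i < current:
            marker = "[x]"  # already completed
        elif i == current:
            marker = "[>]"  # step the user should perform now
        else:
            marker = "[ ]"  # not yet reached
        lines.append(f"{marker} {step}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical breakdown of "Recheck Wi-Fi settings and try reconnecting."
    wifi_steps = ["Open Settings", "Turn Wi-Fi off",
                  "Turn Wi-Fi on and reconnect"]
    print(render_progress(wifi_steps, 1))
```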

[0251] This system addresses problems by using prompt messages such as, "Please tell me effective steps to improve my smartphone's Wi-Fi connection." This allows for the quick and effective resolution of problems with the user's information device.

[0252] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0253] Step 1: The user approaches a terminal installed in the store and provides voice or touch input. This input includes details about the problem with the user's device. The terminal uses speech recognition technology to convert the voice input into text data. Speech recognition software is used for this conversion, and the output is in text format.

[0254] Step 2: The terminal sends the generated text data to the server via the internet. The server receives this text data and analyzes it using natural language processing technology. Here, a generative AI model is used to extract keywords and contextual information necessary to identify problems from the data. The output is the analyzed information.

[0255] Step 3: The server identifies the type of malfunction based on the extracted information and searches for the best solution by accessing internal databases and external information sources. This process may utilize historical data and statistical data. The output is a specific solution presented to the user.

[0256] Step 4: The server sends the selected solution to the terminal. The terminal receives the solution and presents specific steps to the user using visual and auditory means. Visually, animations and illustrations are displayed on the screen, and audio guidance is output using synthesized speech technology. The output is clear and easy-to-understand instructions for the user.

[0257] Step 5: The user follows the instructions on the device and performs the troubleshooting steps. They then input the results as feedback on the device. The input reports whether the problem has been resolved and may include additional comments as needed.

[0258] Step 6: The terminal sends user feedback to the server, which uses this to decide whether to continue or end the interaction. The server analyzes the feedback, ends the session if the problem is resolved, and records the history. The output is either the end of the interaction or further instructions.
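
The feedback loop in Steps 3 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `SessionRecord`, `run_session`, and the `get_feedback` callback are hypothetical names, and the candidate solutions are assumed to arrive already ranked from Step 3.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SessionRecord:
    """History kept by the server for one support session (hypothetical shape)."""
    problem: str
    attempts: List[str] = field(default_factory=list)
    resolved: bool = False


def run_session(problem: str,
                ranked_solutions: List[str],
                get_feedback: Callable[[str], bool]) -> SessionRecord:
    """Present solutions one at a time until the user reports success.

    get_feedback(solution) stands in for Steps 4 and 5: the terminal presents
    the solution and returns True if the user reports the problem resolved.
    """
    record = SessionRecord(problem=problem)
    for solution in ranked_solutions:
        record.attempts.append(solution)  # every presented solution is logged
        if get_feedback(solution):
            record.resolved = True
            break  # Step 6: end the session once the problem is resolved
    return record
```

If every candidate is exhausted without a positive report, the returned record has `resolved` set to `False`, which a caller could use to escalate to a human operator. That escalation path is an assumption and is not described in the text.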

[0259] (Application Example 1)

[0260] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0261] When users encounter problems with their information terminals, it is difficult for them to resolve the issue themselves without specialized knowledge. Furthermore, traditional support systems often struggle to provide users with the information they need quickly and accurately, leading to prolonged interactions. Additionally, the solutions presented may be unclear, increasing the risk of users making incorrect decisions.

[0262] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0263] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for displaying virtual elements using augmented reality technology and guiding the user through specific operating procedures. As a result, the user receives clear visual and auditory guidance, enabling them to quickly and accurately resolve their problems.

[0264] "Speech recognition technology" is a technology that converts voice input into text data, processing what the user says as text information.

[0265] "Natural language processing technology" is a technology that analyzes text data to understand its meaning and intent, and extracts information necessary to identify problems from language data.

[0266] "Information sources" refer to internal databases or external knowledge bases that a server accesses when searching for relevant information.

[0267] "Presenting visually and aurally" means providing the selected solution to the user through visual displays and audio explanations.

[0268] Augmented reality technology is a technology that overlays virtual elements onto the real world, visually guiding users through specific operating procedures.

[0269] "Communication services" refer to system functions that record user feedback and history, and exchange information between servers.

[0270] A system for carrying out this invention includes a user, a terminal, and a server as its main components.

[0271] First, the user accesses the terminal to resolve a problem with their information terminal. The terminal accepts input from the user via voice or touch, and converts voice input into text data using speech recognition technology. This process utilizes speech recognition software such as Google Cloud Speech-to-Text.

[0272] Next, the text data is sent to a server. This server uses natural language processing technologies such as OpenAI GPT to analyze the text and extract keywords necessary to identify the problem. Based on the analyzed text, the server searches for relevant information by referencing various sources such as internal databases and the internet, and selects the optimal solution.

[0273] The selected solution is returned to the device, which uses ARKit (iOS) or ARCore (Android) to present virtual elements to the user using augmented reality technology. Additionally, Amazon Polly is used to provide voice guidance, visually and audibly guiding the user through specific operating procedures.

[0274] For example, if a user reports a problem such as "my smartphone's Wi-Fi won't connect," the server analyzes Wi-Fi-related keywords and selects a solution such as "check your Wi-Fi settings and try reconnecting." The device then presents these instructions using augmented reality guidelines and voice prompts, allowing the user to resolve the issue by following the instructions.

[0275] By using a generative AI model, it is possible to generate appropriate solutions to user questions. An example of a prompt to be input to the generative AI model is: "The user has reported a problem with their smartphone's Wi-Fi. Please briefly explain the steps to the best solution. In particular, please provide information that can be supported both visually and audibly in a way that even a beginner can understand." In this way, the system of the present invention helps to quickly and effectively resolve technical problems faced by users.
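
As an illustration of how such a prompt might be assembled before it is passed to the generative AI model, the following sketch fills the quoted wording from two parameters. The function name `build_prompt` and its parameters are hypothetical; the text does not state how the prompt is constructed.

```python
def build_prompt(device: str, symptom: str) -> str:
    """Assemble a prompt modelled on the example quoted in paragraph [0275].

    device and symptom are hypothetical parameters, e.g. "smartphone" and "Wi-Fi".
    """
    return (
        f"The user has reported a problem with their {device}'s {symptom}. "
        "Please briefly explain the steps to the best solution. "
        "In particular, please provide information that can be supported "
        "both visually and audibly in a way that even a beginner can understand."
    )
```

Calling `build_prompt("smartphone", "Wi-Fi")` reproduces the example prompt given above word for word.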

[0276] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0277] Step 1:

[0278] The user inputs a question into the device via voice or touch. The device accepts the input and uses speech recognition technology to convert voice input into text data. During this process, the user's voice data is converted to text using Google Cloud Speech-to-Text. The output is then sent to the server as natural language text.

[0279] Step 2:

[0280] The server analyzes the received text data using natural language processing techniques. Specifically, it uses OpenAI's natural language processing model to extract keywords from the text data to identify the problem. The input to this process is the text data from step 1, and the output is the extracted keyword set.

[0281] Step 3:

[0282] The server references internal databases and external information sources to search for and select the optimal solution based on extracted keywords. For example, it collects relevant technical information from the internet and determines the optimal solution using its own algorithm. The input is the keyword set from step 2, and the output is the selected solution.

[0283] Step 4:

[0284] The server sends the selected solution to the terminal. The terminal receives this and prepares a display to visually guide the user using augmented reality technology. Using ARKit or ARCore, virtual guidelines are overlaid on the user's real field of vision. The input is the solution from Step 3, and the output is the guidance in a visual display.

[0285] Step 5:

[0286] The terminal simultaneously uses Amazon Polly to generate voice guidance and provides auditory feedback to the user. Specifically, the terminal plays a voice explanation of the solution through the speaker to guide the user on the specific actions to take next. The input is the solution from Step 3, and the output is an explanation in audio format.

[0287] Step 6:

[0288] The user attempts to solve the problem according to the guidance and provides feedback on the result to the terminal. The terminal collects this feedback and reports it to the server. The server processes the received feedback and determines whether the problem has been solved. If the problem is solved, the session is terminated and the user's history is recorded. The input is the feedback from the user, and the output is an update to the history database.
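
Steps 2 to 5 above can be sketched in simplified form as follows. This is only an illustration: the services named in the text (Google Cloud Speech-to-Text, the OpenAI model, ARKit or ARCore, Amazon Polly) are not called here. A vocabulary filter stands in for keyword extraction, a small in-memory dictionary stands in for the internal database, and `KNOWLEDGE_BASE`, `extract_keywords`, `select_solution`, `Guidance`, and `make_guidance` are all hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical knowledge base: keyword set -> solution text.
KNOWLEDGE_BASE: Dict[FrozenSet[str], str] = {
    frozenset({"wi-fi", "connect"}): "Check your Wi-Fi settings and try reconnecting.",
    frozenset({"battery"}): "Lower the screen brightness and close unused apps.",
}

VOCABULARY: Set[str] = {kw for keys in KNOWLEDGE_BASE for kw in keys}


def extract_keywords(text: str) -> Set[str]:
    """Step 2 stand-in: keep only tokens that appear in the known vocabulary."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    return {t for t in tokens if t in VOCABULARY}


def select_solution(keywords: Set[str]) -> Optional[str]:
    """Step 3 stand-in: pick the entry whose keys overlap the keywords most."""
    best, best_score = None, 0
    for keys, solution in KNOWLEDGE_BASE.items():
        score = len(keys & keywords)
        if score > best_score:
            best, best_score = solution, score
    return best  # None when no entry matches at all


@dataclass
class Guidance:
    overlay_text: str  # would be rendered as an AR overlay (Step 4)
    speech_text: str   # would be passed to a text-to-speech service (Step 5)


def make_guidance(solution: str) -> Guidance:
    """Prepare the same solution for both the visual and the audio channel."""
    return Guidance(overlay_text=solution, speech_text=solution)
```

For the input "My smartphone's Wi-Fi won't connect", the extracted keyword set is `{"wi-fi", "connect"}` and the Wi-Fi entry is selected.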

[0289] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing using the user's emotion.

[0290] This invention combines emotion recognition technology with an interactive system that allows users to self-resolve problems with information terminals. By incorporating an emotion engine, this system can analyze the user's emotional state from their voice and facial expressions and communicate appropriately based on that analysis.

[0291] First, the user accesses the system via a terminal installed in the store and describes the problem verbally. The terminal uses speech recognition technology to transcribe what is said into text and sends it to the server. Simultaneously, an emotion engine analyzes the user's tone of voice and facial expressions to identify their emotional state.

[0292] The server analyzes the received text data using natural language processing technology to extract information necessary to identify the problem. Furthermore, it considers the user's emotional state based on information from the emotion engine, and provides more careful guidance if the user is feeling anxious or frustrated.

[0293] The server selects the optimal solution from multiple sources based on the identified problem and the user's emotional state, and sends it to the terminal. The terminal provides the solution through visual and audio guidance, adjusting the tone of communication according to the user's emotional state.

[0294] For example, if a user complains that their smartphone battery dies too quickly, the emotion engine might detect frustration. In this case, the server will quickly and concisely suggest battery-saving methods that can be tried with simple steps, and the device will guide the user through these steps in a gentle and friendly tone.

[0295] Thus, the system of the present invention can easily provide guidance that reflects the user's emotional state, thereby improving the user experience.
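
One simple way to reflect the estimated emotion in the guidance is to prepend an opening phrase chosen per emotional state. The sketch below assumes the emotion engine has already produced a label. The `Emotion` values, the `OPENERS` phrases, and `adjust_tone` are hypothetical and do not describe the emotion identification model 59.

```python
from enum import Enum


class Emotion(Enum):
    NEUTRAL = "neutral"
    ANXIOUS = "anxious"
    FRUSTRATED = "frustrated"


# Hypothetical opening phrases per emotional state.
OPENERS = {
    Emotion.NEUTRAL: "",
    Emotion.ANXIOUS: "No need to worry, we will go through this together. ",
    Emotion.FRUSTRATED: "Sorry for the trouble. Here is a quick fix to try. ",
}


def adjust_tone(solution: str, emotion: Emotion) -> str:
    """Prefix the solution with an opener matching the estimated emotion."""
    return OPENERS[emotion] + solution
```

The solution text itself is left unchanged, so the technical content of the guidance does not depend on the emotion estimate.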

[0296] The following describes the processing flow.

[0297] Step 1:

[0298] The user approaches the terminal installed in the store and starts operating it. The terminal activates the emotion engine and begins collecting emotion data from the user's voice and expression. At this time, the terminal also operates the voice recognition system to convert the voice input into text.

[0299] Step 2:

[0300] While converting the voice input into text data, the terminal sends the emotion data analyzed by the emotion engine to the server. This is an important process for grasping the user's emotional state.

[0301] Step 3:

[0302] The server analyzes the received text data with the natural language processing engine and extracts the keywords necessary for problem identification. At the same time, it analyzes the emotion data to grasp the user's emotional state.

[0310] The user tries the suggested solution and reports the results as feedback to the device. The device then sends this feedback to the server.

[0311] Step 8:

[0312] The server receives user feedback and terminates the session if the problem is resolved. If the problem remains unresolved, the server explores alternative solutions and continues the interaction by sending instructions to the terminal again.

[0313] (Example 2)

[0314] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0315] When users attempt to resolve issues with their information terminals themselves, typical interactive systems often fail to consider the user's emotional state, resulting in a degraded user experience. In particular, if users are feeling anxious or frustrated, an inadequate response can lead to further stress. The challenge lies in providing a system that addresses these issues while efficiently and effectively resolving problems.

[0316] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0317] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for identifying the user's emotional state using sentiment analysis technology and adjusting the tone of communication accordingly. This enables flexible and appropriate responses in accordance with the user's emotional state, improving the user experience and enabling efficient problem solving.

[0318] "Speech recognition technology" is a technology that converts speech data into text data, and it is a process that analyzes a user's voice and turns it into text.

[0319] "Natural language processing technology" is a technique for extracting and analyzing specific information from text data, and is a method for computers to understand and process human language.

[0320] "Sentiment analysis technology" (also referred to below as emotion analysis technology) is a technology that analyzes a user's voice tone and facial expressions to identify their current emotional state.

[0321] "Communication tone" refers to the language and atmosphere used when presenting information or solutions to users, and it should be adjusted to match the user's emotions.

[0322] "Identifying a problem" means diagnosing and identifying the root cause of a problem or malfunction that a user is experiencing.

[0323] "Information sources" refer to various databases and knowledge bases that provide relevant information useful for solving problems.

[0324] "Feedback" refers to information collected from users, such as their reactions and opinions, which is used to adjust the system's response.

[0325] This invention is an interactive system centered on a user-operated information terminal, realized by integrating speech recognition technology, natural language processing technology, and sentiment analysis technology. The user begins by verbally describing the problem to the information terminal. The hardware used in this process is a general-purpose information terminal including a microphone and camera. The software then uses speech recognition technology (e.g., general-purpose speech recognition software) to convert the speech into text.

[0326] The text data generated by speech recognition is sent to the server, where natural language processing software (e.g., a natural language processing engine) analyzes the information necessary to identify the problem. In parallel, the server uses sentiment analysis technology (e.g., a sentiment analysis module) to identify the user's emotional state from their voice tone and facial expressions.

[0327] Based on information received from the server, the device presents solutions visually and audibly, in a tone appropriate to the user's emotional state. This reduces stress and facilitates self-resolution of problems. For example, if a user says, "My smartphone screen isn't working," and the emotion analysis technology detects frustration, the server will quickly select a specific solution, and the device will gently and calmly instruct the user to "press and hold the power button for 10 seconds to restart."

[0328] An example of a prompt might be, "Tell me how to recognize a specific emotional state and design a user-friendly solution based on it." This prompt allows the generative AI model to learn and provide methods for adjusting responses based on the user's emotions.

[0329] In this way, it is possible to improve the user experience while efficiently solving problems.

[0330] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0331] Step 1:

[0332] The user inputs a description of the problem by voice into the information terminal. The terminal's microphone captures this voice data and uses speech recognition technology to convert the voice into text data. This input of voice data results in the output of text data.

[0333] Step 2:

[0334] The terminal sends the generated text data to the server. The server analyzes this text data using natural language processing technology and extracts the information necessary to identify the problem. The input is the text data, and the output is the cause of the problem and related information.

[0335] Step 3:

[0336] The device processes the user's voice tone and facial expressions using emotion analysis technology to identify the user's emotional state. The input is voice tone and facial expression data, and the output identifies the emotional state (e.g., anxiety or frustration).

[0337] Step 4:

[0338] The server selects the optimal solution from multiple information sources based on malfunction information obtained from natural language processing and emotional state information from sentiment analysis technology. The input includes malfunction information and emotional state information, and the output is the selection of the best solution to propose to the user.

[0339] Step 5:

[0340] The terminal provides the user with solutions received from the server. Solutions are presented via a visual display and audio output, with guidance delivered in a tone appropriate to the user's emotional state. The input to this process is the solution data, and the output is the guidance presented to the user.
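
The selection in Step 4, which combines malfunction information with the emotional state, might be sketched as a scoring function that penalizes long procedures when the user is under stress. Everything in this sketch is an assumption made for illustration: the `Candidate` fields, the emotion labels, and the 0.1-per-step penalty are not taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    title: str
    steps: List[str]
    relevance: float  # assumed score in [0, 1] from the Step 2 analysis


def select_for_user(candidates: List[Candidate], emotion: str) -> Candidate:
    """Pick a candidate; under stress, penalise long procedures.

    The 0.1-per-step penalty is an arbitrary illustrative value.
    Raises ValueError if candidates is empty.
    """
    stressed = emotion in {"anxiety", "frustration"}

    def score(c: Candidate) -> float:
        penalty = 0.1 * len(c.steps) if stressed else 0.0
        return c.relevance - penalty

    return max(candidates, key=score)
```

With this scoring, a calm user receives the most relevant procedure even if it is long, while an anxious or frustrated user is steered toward a shorter one.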

[0341] (Application Example 2)

[0342] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".

[0343] Users facing malfunctions in information terminals or home electronic devices often spend considerable time and effort identifying the problem and selecting a solution. Furthermore, responses that disregard the user's emotional state can detract from the user experience. This invention aims to provide a system for quickly and accurately resolving malfunctions while considering the user's emotional state.

[0344] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0345] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for recognizing the user's emotional state from their voice and facial expressions using emotion analysis technology. This enables appropriate communication according to the user's emotional state, improving the user experience while allowing for rapid problem resolution.

[0346] "Speech recognition technology" is a technology that records a user's speech as digital data and converts it into text data.

[0347] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text data.

[0348] "Emotion analysis technology" is a technology used to identify a user's emotional state from their voice and facial expressions.

[0349] An "information source" is a collection of multiple data sets that provide reference data or a knowledge base for resolving problems.

[0350] "Communication style" refers to the guidelines for selecting appropriate expressions and tone of voice when interacting with users.

[0351] "Perceptual devices" are devices such as sensors and cameras that robots use to acquire information about the user and their environment.

[0352] "Dialogue history" refers to a record of information exchanges that have taken place between the user and the system in the past.

[0353] "User experience" is a comprehensive evaluation of the satisfaction and convenience that users feel when using a system.

[0354] The system for implementing the present invention combines speech recognition, natural language processing, and sentiment analysis technologies. The server receives user voice data and visual data from a terminal equipped with a microphone and camera. Speech recognition software (e.g., Google Speech-to-Text) converts this voice data into text data. Next, a natural language processing library (e.g., NLTK) analyzes this text data to obtain the information necessary for identifying the malfunction.

[0355] Simultaneously, emotion analysis technology (e.g., Microsoft Azure's Emotion API) is used to analyze the user's voice and facial expression data and recognize their emotional state. The server combines these analysis results, searches databases and knowledge bases for information to resolve problems, and selects the optimal solution based on the user's emotional state.

[0356] The selected solution is presented to the user visually and audibly through the device. The tone of communication is adjusted according to the user's emotional state. Furthermore, additional information about the user's surroundings can be acquired using perceptual devices, such as the sensors and cameras of a robot.

[0357] For example, if a user states that they are experiencing a problem where their smart speaker is not playing music, the system analyzes the statement and detects dissatisfaction through sentiment analysis. The system then provides a quick solution, improving the user experience by gently suggesting, "First, let's check the power and connection status of your smart speaker."

[0358] An example of a prompt message could be text that reads, "Recognize the emotions of users experiencing problems with their smart home devices and design guidelines to provide solutions based on those emotions."

[0359] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0360] Step 1:

[0361] The user describes the problem through a voice input device. The voice data is input into the terminal and converted into text data by speech recognition software. This process converts the user's spoken content into an analyzable text format.

[0362] Step 2:

[0363] The server receives text data and analyzes its content using natural language processing technology. It receives a text description of a problem as input and processes the data to extract relevant keywords and important information. As a result, the information necessary to identify the problem is output.

[0364] Step 3:

[0365] The user's voice and visual data are input into the emotion analysis system. The server uses emotion analysis technology to analyze the data and identify the user's emotional state. This process operates on data obtained from voice tone and facial expressions, and its output is the identified emotional state.

[0366] Step 4:

[0367] Based on the identified problem and the user's emotional state, the server selects several potential solutions. In this step, the database is referenced, and data retrieval and calculations are performed to provide the user with the most relevant information.

[0368] Step 5:

[0369] The server sends the selected solution to the terminal. The solution is presented to the user visually and audibly on the terminal, with the communication style adjusted according to the user's emotional state. This ensures that the information is presented in a way that is more easily accepted by the user.

[0370] Step 6:

[0371] The user provides feedback on the presented solution. The terminal sends this feedback to the server, which then analyzes the results to determine the next step in the interaction. The output of this step is a decision to either continue or end the interaction.
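
The continue-or-end decision in Step 6 can be illustrated with a deliberately simple stand-in. The text indicates that the server analyzes the feedback, presumably with natural language processing. The sketch below substitutes plain phrase matching, and the marker lists and the function name `next_action` are hypothetical.

```python
# Hypothetical phrase lists; a real system would rely on the NLP analysis instead.
NEGATIVE_MARKERS = ("not solved", "not resolved", "still", "didn't work", "not yet")
POSITIVE_MARKERS = ("solved", "resolved", "fixed", "works now")


def next_action(feedback: str) -> str:
    """Return "end" or "continue" from free-text feedback.

    Negative markers are checked first so that "not solved" is not
    mistaken for "solved".
    """
    text = feedback.lower()
    if any(marker in text for marker in NEGATIVE_MARKERS):
        return "continue"
    if any(marker in text for marker in POSITIVE_MARKERS):
        return "end"
    return "continue"  # unclear feedback: keep the interaction open
```

Treating unclear feedback as "continue" is a design choice made here so that a session is never closed on an ambiguous report.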

[0372] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0373] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
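
The input and output relationship described for the data generation model 58 (a prompt plus inference data in, an inference result out) might be expressed as the following interface sketch. `InferenceRequest`, `DataGenerationModel`, and `EchoModel` are hypothetical names. They do not correspond to the API of ChatGPT, Gemini, or any other product, and only a text result is modelled here.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceRequest:
    prompt: str                    # instructions for the model
    text: Optional[str] = None     # text data representing text
    audio: Optional[bytes] = None  # audio data representing speech
    image: Optional[bytes] = None  # image data representing images


class DataGenerationModel(Protocol):
    """Anything that turns a request into an inference result."""

    def infer(self, request: InferenceRequest) -> str:
        ...


class EchoModel:
    """Trivial stand-in used only to exercise the interface."""

    def infer(self, request: InferenceRequest) -> str:
        return f"[{request.prompt}] {request.text or ''}".strip()
```

Any class with a matching `infer` method satisfies the protocol, so a real model client could replace `EchoModel` without changes to the calling code.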

[0374] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0375] [Third Embodiment]

[0376] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0377] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0378] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0379] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0380] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0381] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0382] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0383] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0384] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0385] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0386] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0387] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0388] This invention is an interactive system that enables users to self-resolve problems with information terminals via a server that interacts with a dedicated terminal installed in a store. This system combines speech recognition technology, natural language processing technology, visual guidance, and voice guidance. The program's processing is described below in natural language.

[0389] First, the user approaches a terminal installed in the store and accesses the system using voice or touch controls. The terminal converts the user's input into text data using speech recognition technology. This text data is then transmitted to a server via the internet.

[0390] The server uses natural language processing technology to analyze the received text data and extract keywords necessary to identify the problem. Based on this, the server consults internal databases and external information sources to search for relevant information. Once the optimal solution is determined, the server sends the steps to the terminal.

[0391] The terminal presents the received solution to the user through visual and audio guidance. This allows the user to understand and implement the solution even without specialized knowledge. When the user reports feedback on the results of their actions to the terminal, it is sent to the server, and the interaction continues. If the problem is reported to be resolved, the server terminates the session and records the history.

[0392] As a concrete example, consider a case where a user enters "My smartphone can't connect to Wi-Fi." The server analyzes Wi-Fi-related keywords and searches for the best solution from past history and the internet. For example, it might select a solution such as "Recheck your Wi-Fi settings and try reconnecting." The device then guides the user through this solution using character animations and voice prompts, providing specific instructions.

[0393] In this way, this system supports users in resolving problems with their information terminals themselves, enabling a rapid response.

[0394] The following describes the processing flow.

[0395] Step 1:

[0396] The user approaches a terminal installed in the store and activates the agent via touch or voice. The terminal uses sensors to detect the user's presence and initiates an initial screen or voice guidance.

[0397] Step 2:

[0398] The user describes a problem with their smartphone by voice. The terminal activates its voice recognition system and converts the user's voice into text data. This text data is then sent to the server via the internet.

[0399] Step 3:

[0400] The server analyzes the received text data using natural language processing techniques, extracts keywords related to the problem, and understands its content. Based on the analysis results, the server searches its internal database and external information sources for possible causes and potential solutions.

[0401] Step 4:

[0402] The server selects the most suitable solution from multiple options. The selected solution is then sent to the user's terminal in the form of visual and audio guidance to ensure user understanding.

[0403] Step 5:

[0404] The terminal displays the received solution on its screen and provides voice guidance to the user. Having understood the specific solution, the user attempts the operation on their smartphone.

[0405] Step 6:

[0406] The user reports the result of trying the solution to the terminal as feedback. This feedback may be, for example, "The problem is solved" or "It's not solved yet."

[0407] Step 7:

[0408] The server receives the feedback from the user and ends the session if the problem is resolved. If the problem remains unresolved, the server explores alternative solutions and sends further instructions to the terminal, so that the interaction continues.

[0409] Step 8:

[0410] Once the problem is resolved or support has ended, the server logs the session and saves the support history to the database. The terminal then prompts the user to confirm the support details and guides them through the termination process.
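As a non-limiting illustration, the feedback loop of Steps 4 to 8 above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `SessionLog`, `run_session`, and `is_resolved` are hypothetical, the candidate solutions are assumed to be already ranked by the server, and persisting the history to a database is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class SessionLog:
    """Support history that would be saved when the session ends (Step 8)."""
    problem: str
    attempted: List[str] = field(default_factory=list)
    resolved: bool = False


def run_session(
    problem: str,
    candidate_solutions: Iterable[str],
    is_resolved: Callable[[str], bool],
) -> SessionLog:
    """Present solutions one at a time until the user reports success.

    candidate_solutions: assumed to be ordered from most to least suitable
        (Steps 3-4); the ranking itself is outside this sketch.
    is_resolved: hypothetical callback that presents one solution on the
        terminal and returns the user's feedback (Steps 5-6).
    """
    log = SessionLog(problem=problem)
    for solution in candidate_solutions:
        log.attempted.append(solution)
        if is_resolved(solution):  # Step 7: end the session on success
            log.resolved = True
            break
    return log  # Step 8: the caller records this history
```

If every candidate is exhausted without success, the returned log has `resolved` set to `False`, which corresponds to the case where support ends without resolution.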

[0411] (Example 1)

[0412] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0413] When information terminals malfunction, there is a need for a method that allows users to quickly and effectively resolve problems themselves, even without specialized knowledge. Current support systems often make it difficult for users to understand the appropriate operating procedures, resulting in lengthy troubleshooting processes. Therefore, there is a need for a system that intuitively understands the user's problem and guides them to the appropriate solution.

[0414] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0415] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for visually displaying the user's progress and clearly indicating specific operating procedures when presenting a solution. As a result, the user can intuitively understand the problem with the information terminal and implement the solution themselves.

[0416] "Speech recognition technology" is a technology that converts voice input from a user into digital text data.

[0417] "Natural language processing technology" is a technique that analyzes text data, understands the structure of the information, and extracts necessary keywords and context.

[0418] "A means of exploring relevant information and selecting the optimal solution" refers to a function that collects necessary data from multiple sources and determines the most suitable solution for the user's problem.

[0419] "Means of presenting to the user visually and aurally" refers to methods of effectively communicating solutions to the user through on-screen visual elements and audio guides.

[0420] "Means of receiving user feedback and continuing or ending the dialogue depending on the resolution status" refers to a mechanism in which the system determines the next step in the dialogue based on responses and reports from the user and provides supplementary information as needed.

[0421] "Means of visually displaying the user's progress and clearly indicating specific operating procedures" refers to a method of supporting the user by visualizing on the screen which step the user is currently in and clearly indicating the flow of operations.
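As a non-limiting illustration of the progress display defined above, the following sketch renders which step the user is currently on as plain text. The function name `render_progress` and the text markers are assumptions made for illustration; an actual terminal would render the same information graphically.

```python
from typing import List


def render_progress(steps: List[str], current: int) -> str:
    """Return a text view marking completed, current, and pending steps.

    current is a zero-based index into steps.
    """
    if not 0 <= current < len(steps):
        raise ValueError("current must index an existing step")
    lines = [f"Step {current + 1} of {len(steps)}"]
    for i, step in enumerate(steps):
        if i < current:
            marker = "[x]"  # already completed
        elif i == current:
            marker = "[>]"  # the step the user is on now
        else:
            marker = "[ ]"  # not yet started
        lines.append(f"{marker} {i + 1}. {step}")
    return "\n".join(lines)
```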

[0422] This invention is a system that allows users to self-resolve problems with information terminals using dedicated terminals installed in stores or other locations. Specific embodiments for carrying out the invention are shown below.

[0423] The user approaches a terminal in the store and accesses the system using voice or touch controls. The terminal is equipped with voice recognition technology, which converts the user's voice input into text data. Voice recognition software is used for this process. The generated text data is sent to a server via the internet.

[0424] The server analyzes the received text data using natural language processing techniques. This process employs a natural language processing algorithm, which is a generative AI model. Based on the analysis, the server extracts the information necessary to identify the problem and consults internal databases and external information sources to find a solution.

[0425] Next, the server selects the optimal solution based on the relevant information and sends it to the terminal. The terminal then presents the received solution to the user using a combination of visual and audio guidance. This allows users to understand the specific operating procedures and solve problems even without specialized knowledge.

[0426] The user follows the given instructions, performs the operation, and reports the results as feedback to the terminal. Based on this feedback, the terminal continues its interaction with the server and provides additional solutions as needed. If the user reports that the problem is resolved, the server terminates the session and records the history.

[0427] As a concrete example, consider a scenario where a user tells the terminal, "My smartphone can't connect to Wi-Fi." The server analyzes the Wi-Fi-related information and selects a solution such as "Recheck Wi-Fi settings and try reconnecting." The terminal then displays this solution on its screen, providing the user with specific steps.

[0428] This system addresses problems by using prompt messages such as, "Please tell me effective steps to improve my smartphone's Wi-Fi connection." This allows for the quick and effective resolution of problems with the user's information device.

[0429] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0430] Step 1: The user approaches a terminal installed in the store and provides voice or touch input. This input includes details about the problem with the user's device. The terminal uses speech recognition technology to convert the voice input into text data. Speech recognition software is used for this conversion, and the output is in text format.

[0431] Step 2: The terminal sends the generated text data to the server via the internet. The server receives this text data and analyzes it using natural language processing technology. Here, a generative AI model is used to extract keywords and contextual information necessary to identify problems from the data. The output is the analyzed information.

[0432] Step 3: The server identifies the type of malfunction based on the extracted information and searches for the best solution by accessing internal databases and external information sources. This process may utilize historical data and statistical data. The output is a specific solution presented to the user.

[0433] Step 4: The server sends the selected solution to the terminal. The terminal receives the solution and presents specific steps to the user using visual and auditory means. Visually, animations and illustrations are displayed on the screen, and audio guidance is output using synthesized speech technology. The output is clear and easy-to-understand instructions for the user.

[0434] Step 5: The user follows the instructions presented by the terminal and performs the troubleshooting steps. The user then inputs the result to the terminal as feedback. The feedback reports whether the problem has been resolved and may include additional comments as needed.

[0435] Step 6: The terminal sends user feedback to the server, which uses this to decide whether to continue or end the interaction. The server analyzes the feedback, ends the session if the problem is resolved, and records the history. The output is either the end of the interaction or further instructions.
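As a non-limiting illustration of Steps 2 and 3 above, the following sketch extracts keywords from the recognized text and looks up a solution. It deliberately swaps in simple vocabulary matching in place of the generative AI model described above, and the `SOLUTIONS` table stands in for the internal database; the table contents other than the Wi-Fi entry, and the names `extract_keywords` and `select_solution`, are hypothetical.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical internal database: problem keyword -> solution text.
# Only the Wi-Fi entry comes from the description; the other is illustrative.
SOLUTIONS: Dict[str, str] = {
    "wi-fi": "Recheck Wi-Fi settings and try reconnecting.",
    "battery": "Turn on battery saver mode.",
}


def extract_keywords(text: str, vocabulary: Set[str]) -> Set[str]:
    """Return the vocabulary words that appear in the text (Step 2).

    Plain token matching, used here instead of a generative AI model.
    """
    tokens = set(re.findall(r"[a-z0-9\-]+", text.lower()))
    return tokens & vocabulary


def select_solution(text: str) -> Optional[str]:
    """Look up a solution for the first matching keyword (Step 3)."""
    keywords = extract_keywords(text, set(SOLUTIONS))
    if not keywords:
        return None  # nothing in the database matches this problem
    return SOLUTIONS[min(keywords)]  # min() gives a deterministic choice
```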

[0436] (Application Example 1)

[0437] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0438] When users encounter problems with their information terminals, it is difficult for them to resolve the issue themselves without specialized knowledge. Furthermore, traditional support systems often struggle to provide users with the information they need quickly and accurately, leading to prolonged interactions. Additionally, the solutions presented may be unclear, increasing the risk of users making incorrect decisions.

[0439] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0440] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for displaying virtual elements using augmented reality technology and guiding the user through specific operating procedures. As a result, the user receives clear visual and auditory guidance, enabling them to quickly and accurately resolve their problems.

[0441] "Speech recognition technology" is a technology that converts voice input into text data, processing what the user says as text information.

[0442] "Natural language processing technology" is a technology that analyzes text data to understand its meaning and intent, and extracts information necessary to identify problems from language data.

[0443] "Information sources" refer to internal databases or external knowledge bases that a server accesses when searching for relevant information.

[0444] "Presenting visually and aurally" means providing the selected solution to the user through visual displays and audio explanations.

[0445] "Augmented reality technology" is a technology that overlays virtual elements onto the real world, visually guiding users through specific operating procedures.

[0446] "Communication services" refer to system functions that record user feedback and history, and exchange information between servers.

[0447] A system for carrying out this invention includes a user, a terminal, and a server as its main components.

[0448] First, the user accesses the terminal to resolve an issue with their information terminal. The terminal accepts input from the user via voice or touch, and converts voice input into text data using speech recognition technology. This process utilizes speech recognition software such as Google Cloud Speech-to-Text.

[0449] Next, the text data is sent to a server. This server uses natural language processing technologies such as OpenAI GPT to analyze the text and extract keywords necessary to identify the problem. Based on the analyzed text, the server searches for relevant information by referencing various sources such as internal databases and the internet, and selects the optimal solution.

[0450] The selected solution is returned to the terminal, which uses ARKit (iOS) or ARCore (Android) to present virtual elements to the user using augmented reality technology. Amazon Polly is additionally used to provide voice guidance, so that the user is guided through specific operating procedures both visually and audibly.

[0451] For example, if a user reports a problem such as "my smartphone's Wi-Fi won't connect," the server analyzes Wi-Fi-related keywords and selects a solution such as "check your Wi-Fi settings and try reconnecting." The device then presents these instructions using augmented reality guidelines and voice prompts, allowing the user to resolve the issue by following the instructions.

[0452] By using a generative AI model, it is possible to generate appropriate solutions to user questions. An example of a prompt to be input to the generative AI model is: "The user has reported a problem with their smartphone's Wi-Fi. Please briefly explain the steps to the best solution. In particular, please provide information that can be supported both visually and audibly in a way that even a beginner can understand." In this way, the system of the present invention helps to quickly and effectively resolve technical problems faced by users.
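As a non-limiting illustration, the example prompt above can be generalized into a template. This is a minimal sketch: the wording is taken from the example prompt, while the placeholders `device` and `feature` and the function name `build_prompt` are assumptions. How the resulting string is sent to a generative AI model is not shown.

```python
PROMPT_TEMPLATE = (
    "The user has reported a problem with their {device}'s {feature}. "
    "Please briefly explain the steps to the best solution. "
    "In particular, please provide information that can be supported both "
    "visually and audibly in a way that even a beginner can understand."
)


def build_prompt(device: str, feature: str) -> str:
    """Fill the template with the device and feature named by the user."""
    return PROMPT_TEMPLATE.format(device=device, feature=feature)
```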

[0453] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0454] Step 1:

[0455] The user inputs a question into the device via voice or touch. The device accepts the input and uses speech recognition technology to convert voice input into text data. During this process, the user's voice data is converted to text using Google Cloud Speech-to-Text. The output is then sent to the server as natural language text.

[0456] Step 2:

[0457] The server analyzes the received text data using natural language processing techniques. Specifically, it uses OpenAI's natural language processing model to extract keywords from the text data to identify the problem. The input to this process is the text data from step 1, and the output is the extracted keyword set.

[0458] Step 3:

[0459] The server references internal databases and external information sources to search for and select the optimal solution based on extracted keywords. For example, it collects relevant technical information from the internet and determines the optimal solution using its own algorithm. The input is the keyword set from step 2, and the output is the selected solution.

[0460] Step 4:

[0461] The server sends the selected solution to the terminal. The terminal receives this and prepares a display to provide visual guidance to the user using augmented reality technology. Using ARKit or ARCore, virtual guidelines are overlaid on the user's real field of view. The input is the solution from step 3, and the output is the guidance in the visual display.

[0462] Step 5:

[0463] The device simultaneously uses Amazon Polly to generate voice guidance and provide auditory feedback to the user. Specifically, the device plays an audio explanation of the solution through its speaker, guiding the user through the specific actions they should take next. The input is the solution from step 3, and the output is the audio format explanation.

[0464] Step 6:

[0465] The user attempts to resolve the problem by following the instructions and provides feedback to the terminal. The terminal collects this feedback and reports it to the server. The server processes the received feedback and determines whether the problem has been resolved. If the problem is resolved, the session ends and the user's history is recorded. The input is the feedback from the user, and the output is the update of the history database.
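As a non-limiting illustration, the input and output chaining of Steps 1 to 5 above can be sketched as a pipeline of interchangeable stages. This is a sketch under stated assumptions: the class `Pipeline` and its field names are hypothetical, and each stage is an injected callable, so no actual service such as Google Cloud Speech-to-Text, ARKit, ARCore, or Amazon Polly is called here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pipeline:
    """Each field is a placeholder for one processing stage."""
    transcribe: Callable[[bytes], str]            # Step 1: voice -> text
    extract_keywords: Callable[[str], List[str]]  # Step 2: text -> keywords
    select_solution: Callable[[List[str]], str]   # Step 3: keywords -> solution
    render_visual: Callable[[str], str]           # Step 4: solution -> visual guidance
    synthesize_audio: Callable[[str], bytes]      # Step 5: solution -> audio guidance

    def handle(self, audio: bytes) -> Tuple[str, bytes]:
        """Run one utterance through all stages in order."""
        text = self.transcribe(audio)
        keywords = self.extract_keywords(text)
        solution = self.select_solution(keywords)
        # Steps 4 and 5 both take the solution from Step 3 as input.
        return self.render_visual(solution), self.synthesize_audio(solution)
```

Keeping the stages as injected callables means a stage can be replaced, for example by a different speech recognition service, without changing the flow itself.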

[0466] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0467] This invention combines emotion recognition technology with an interactive system that allows users to self-resolve problems with information terminals. By incorporating an emotion engine, this system can analyze the user's emotional state from their voice and facial expressions and communicate appropriately based on that analysis.

[0468] First, the user accesses the system via a terminal installed in the store and describes the problem verbally. The terminal uses speech recognition technology to transcribe what is said into text and sends it to the server. Simultaneously, an emotion engine analyzes the user's tone of voice and facial expressions to identify their emotional state.

[0469] The server analyzes the received text data using natural language processing technology to extract information necessary to identify the problem. Furthermore, it considers the user's emotional state based on information from the emotion engine, and provides more careful guidance if the user is feeling anxious or frustrated.

[0470] The server selects the optimal solution from multiple sources based on the identified problem and the user's emotional state, and sends it to the terminal. The terminal provides the solution through visual and audio guidance, adjusting the tone of communication according to the user's emotional state.

[0471] For example, if a user complains that their smartphone battery dies too quickly, the emotion engine might detect frustration. In this case, the server will quickly and concisely suggest battery-saving methods that can be tried with simple steps, and the device will guide the user through these steps in a gentle and friendly tone.

[0472] Thus, the system of the present invention can easily provide guidance that reflects the user's emotional state, thereby improving the user experience.

[0473] The following describes the processing flow.

[0474] Step 1:

[0475] The user approaches a terminal installed in the store and begins to operate it. The terminal activates its emotion engine and starts collecting emotion data from the user's voice and facial expressions. At the same time, the terminal also activates its voice recognition system to convert the voice input into text.

[0476] Step 2:

[0477] The device converts voice input into text data while simultaneously sending emotional data analyzed by the emotion engine to the server. This is a crucial process for understanding the user's emotional state.

[0478] Step 3:

[0479] The server analyzes the received text data using a natural language processing engine to extract keywords necessary for identifying the problem. Simultaneously, it analyzes sentiment data to understand the user's emotional state.

[0480] Step 4:

[0481] The server selects the optimal solution from multiple sources based on the analysis results. This selection process takes into account the user's emotional state; for example, if the user is frustrated, an intuitive and concise solution is more likely to be chosen.

[0482] Step 5:

[0483] After the server selects the optimal solution, it sends that information to the terminal. The terminal visually displays the received solution on its screen and explains it clearly to the user through voice guidance.

[0484] Step 6:

[0485] The device adjusts the tone and expression of its guidance based on the user's emotional state. For example, if the user is feeling anxious, the guidance will be delivered in a gentle tone.

[0486] Step 7:

[0487] The user tries the suggested solution and reports the results as feedback to the device. The device then sends this feedback to the server.

[0488] Step 8:

[0489] The server receives user feedback and terminates the session if the problem is resolved. If the problem remains unresolved, the server explores alternative solutions and continues the interaction by sending instructions to the terminal again.
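As a non-limiting illustration of Steps 4 and 6 above, the following sketch prefers a shorter solution when the user is frustrated and prepends a tone-setting phrase to the guidance. The emotion labels, the phrases in `TONE_PREFIX`, and the names `Solution`, `select_for_emotion`, and `phrase_guidance` are hypothetical; the candidates are assumed to be already ordered by relevance.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Solution:
    title: str
    steps: List[str]


def select_for_emotion(candidates: List[Solution], emotion: str) -> Solution:
    """Pick a solution, taking the user's emotional state into account (Step 4)."""
    if not candidates:
        raise ValueError("no candidate solutions")
    if emotion == "frustrated":
        # Prefer the most concise solution; ties keep the more relevant one.
        return min(candidates, key=lambda s: len(s.steps))
    return candidates[0]


# Hypothetical tone-setting phrases, keyed by emotion label.
TONE_PREFIX: Dict[str, str] = {
    "anxious": "No need to worry. ",
    "frustrated": "Let's fix this quickly. ",
}


def phrase_guidance(solution: Solution, emotion: str) -> str:
    """Adjust the wording of the guidance to the emotional state (Step 6)."""
    return TONE_PREFIX.get(emotion, "") + " ".join(solution.steps)
```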

[0490] (Example 2)

[0491] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0492] When users attempt to resolve issues with their information terminals themselves, typical interactive systems often fail to consider the user's emotional state, resulting in a degraded user experience. In particular, if users are feeling anxious or frustrated, an inadequate response can lead to further stress. The challenge lies in providing a system that addresses these issues while efficiently and effectively resolving problems.

[0493] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0494] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for identifying the user's emotional state using sentiment analysis technology and adjusting the tone of communication accordingly. This enables flexible and appropriate responses in accordance with the user's emotional state, improving the user experience and enabling efficient problem solving.

[0495] "Speech recognition technology" is a technology that converts speech data into text data, and it is a process that analyzes a user's voice and turns it into text.

[0496] "Natural language processing technology" is a technique for extracting and analyzing specific information from text data, and is a method for computers to understand and process human language.

[0497] "Sentiment analysis technology" is a technology that analyzes a user's voice tone and facial expressions to identify their current emotional state.

[0498] "Communication tone" refers to the language and atmosphere used when presenting information or solutions to users, and it should be adjusted to match the user's emotions.

[0499] "Identifying a problem" means diagnosing and identifying the root cause of a problem or malfunction that a user is experiencing.

[0500] "Information sources" refer to various databases and knowledge bases that provide relevant information useful for solving problems.

[0501] "Feedback" refers to information collected from users, such as their reactions and opinions, which is used to adjust the system's response.

[0502] This invention is an interactive system centered on a user-operated information terminal, realized by integrating speech recognition technology, natural language processing technology, and sentiment analysis technology. The user begins by verbally describing the problem to the information terminal. The hardware used in this process is a general-purpose information terminal including a microphone and camera. The software then uses speech recognition technology (e.g., general-purpose speech recognition software) to convert the speech into text.

[0503] The text data generated by speech recognition is sent to the server, where natural language processing software (e.g., a natural language processing engine) analyzes the information necessary to identify the problem. In parallel, the server uses sentiment analysis technology (e.g., a sentiment analysis module) to identify the user's emotional state from their voice tone and facial expressions.

[0504] Based on information received from the server, the device presents solutions visually and audibly, in a tone appropriate to the user's emotional state. This reduces stress and facilitates self-resolution of problems. For example, if a user says, "My smartphone screen isn't working," and the emotion analysis technology detects frustration, the server will quickly select a specific solution, and the device will gently and calmly instruct the user to "press and hold the power button for 10 seconds to restart."

[0505] An example of a prompt might be, "Tell me how to recognize a specific emotional state and design a user-friendly solution based on it." This prompt allows the generative AI model to learn and provide methods for adjusting responses based on the user's emotions.

[0506] In this way, it is possible to improve the user experience while efficiently solving problems.

[0507] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0508] Step 1:

[0509] The user inputs a description of the problem by voice into the information terminal. The terminal's microphone captures this voice data, and the terminal uses speech recognition technology to convert the voice into text data. The input is voice data, and the output is text data.

[0510] Step 2:

[0511] The terminal sends the generated text data to the server. The server analyzes this text data using natural language processing technology and extracts the information necessary to identify the problem. The input is text data, and the output is the cause of the problem and related information.

[0512] Step 3:

[0513] The device processes the user's voice tone and facial expressions using emotion analysis technology to identify the user's emotional state. The input is voice tone and facial expression data, and the output identifies the emotional state (e.g., anxiety or frustration).

[0514] Step 4:

[0515] The server selects the optimal solution from multiple information sources based on malfunction information obtained from natural language processing and emotional state information from sentiment analysis technology. The input includes malfunction information and emotional state information, and the output is the selection of the best solution to propose to the user.

[0516] Step 5:

[0517] The terminal provides the user with the solution received from the server. The solution is presented via a visual display and audio output, with guidance delivered in a tone appropriate to the user's emotional state. The input to this process is the solution data, and the output is the guidance for the user.
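As a non-limiting illustration of Step 3 above, the following sketch combines emotion scores derived from voice tone and from facial expressions into a single emotional state. The description does not specify how the two inputs are combined; the weighted average used here, the parameter `voice_weight`, and the function name `fuse_emotions` are assumptions made purely for illustration.

```python
from typing import Dict


def fuse_emotions(
    voice: Dict[str, float],
    face: Dict[str, float],
    voice_weight: float = 0.5,
) -> str:
    """Return the emotion label with the highest weighted-average score.

    voice, face: hypothetical per-label scores from each modality; a label
    missing from one modality is treated as a score of 0.0.
    """
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be between 0 and 1")
    labels = set(voice) | set(face)
    if not labels:
        raise ValueError("no emotion scores given")
    combined = {
        label: voice_weight * voice.get(label, 0.0)
        + (1.0 - voice_weight) * face.get(label, 0.0)
        for label in labels
    }
    # Iterating in sorted order makes ties resolve deterministically.
    return max(sorted(combined), key=combined.get)
```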

[0518] (Application Example 2)

[0519] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0520] Users facing malfunctions in information terminals or home electronic devices often spend considerable time and effort identifying the problem and selecting a solution. Furthermore, responses that disregard the user's emotional state can detract from the user experience. This invention aims to provide a system for quickly and accurately resolving malfunctions while considering the user's emotional state.

[0521] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0522] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for recognizing the user's emotional state from their voice and facial expressions using emotion analysis technology. This enables appropriate communication according to the user's emotional state, improving the user experience while allowing for rapid problem resolution.

[0523] "Speech recognition technology" is a technology that records a user's speech as digital data and converts it into text data.

[0524] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text data.

[0525] "Emotion analysis technology" is a technology used to identify a user's emotional state from their voice and facial expressions.

[0526] An "information source" is a collection of multiple data sets that provide reference data or a knowledge base for resolving problems.

[0527] "Communication style" refers to the guidelines for selecting appropriate expressions and tone of voice when interacting with users.

[0528] "Perceptual devices" are devices such as sensors and cameras that robots use to acquire information about the user and their environment.

[0529] "Dialogue history" refers to a record of information exchanges that have taken place between the user and the system in the past.

[0530] "User experience" is a comprehensive evaluation of the satisfaction and convenience that users feel when using a system.

[0531] The system for implementing the present invention combines speech recognition, natural language processing, and sentiment analysis technologies. The server receives user voice data and visual data from a terminal equipped with a microphone and camera. Speech recognition software (e.g., Google Speech-to-Text) converts this voice data into text data. Next, a natural language processing library (e.g., NLTK) analyzes this text data to obtain the information necessary for identifying the problem.

[0532] Simultaneously, emotion analysis technology (e.g., Microsoft Azure's Emotion API) is used to analyze the user's voice and facial expression data and recognize their emotional state. The server combines these analysis results, searches databases and knowledge bases for information to resolve problems, and selects the optimal solution based on the user's emotional state.

[0533] The selected solution is presented to the user visually and audibly through the terminal. The tone of communication is adjusted according to the user's emotional state. Furthermore, additional information about the user's surroundings can be acquired using perceptual devices, such as those of a robot.

[0534] For example, if a user states that they are experiencing a problem where their smart speaker is not playing music, the system analyzes the statement and detects dissatisfaction through sentiment analysis. The system then provides a quick solution, improving the user experience by gently suggesting, "First, let's check the power and connection status of your smart speaker."

[0535] An example of a prompt message could be text that reads, "Recognize the emotions of users experiencing problems with their smart home devices and design guidelines to provide solutions based on those emotions."

[0536] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0537] Step 1:

[0538] The user describes the problem through a voice input device. The voice data is input into the terminal and converted into text data by speech recognition software. This process converts the user's spoken content into an analyzable text format.

[0539] Step 2:

[0540] The server receives text data and analyzes its content using natural language processing technology. It receives a text description of a problem as input and processes the data to extract relevant keywords and important information. As a result, the information necessary to identify the problem is output.

[0541] Step 3:

[0542] The user's voice and visual data are input into the emotion analysis system. The server uses emotion analysis technology to analyze the data and identify the user's emotional state. This process performs computations on data obtained from voice tone and facial expressions, and outputs the identified emotional state.

[0543] Step 4:

[0544] Based on the identified problem and the user's emotional state, the server selects several potential solutions. In this step, the database is referenced, and data retrieval and calculations are performed to provide the user with the most relevant information.

[0545] Step 5:

[0546] The server sends the selected solution to the terminal. The solution is presented to the user visually and audibly on the terminal, with the communication style adjusted according to the user's emotional state. This ensures that the information is presented in a way that is more easily accepted by the user.

[0547] Step 6:

[0548] The user provides feedback on the presented solution. The terminal sends this feedback to the server, which then analyzes it to determine the next step in the interaction. The output of this step is a decision to either continue or end the interaction.

[0549] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0550] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
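
The input and output of the data generation model 58 described above can be pictured with the following minimal sketch. The type names `InferenceRequest`, `InferenceResult`, and `DataGenerationModel`, and the stand-in `EchoModel`, are hypothetical and are introduced only to show a prompt travelling together with inference data; they do not correspond to the interface of any actual generative AI service.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class InferenceRequest:
    """A prompt (instructions) plus optional inference data."""
    prompt: str
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None


@dataclass(frozen=True)
class InferenceResult:
    """Inference output as text data and/or audio data."""
    text: Optional[str] = None
    audio: Optional[bytes] = None


class DataGenerationModel(Protocol):
    """Hypothetical interface; a real model would sit behind this."""
    def infer(self, request: InferenceRequest) -> InferenceResult: ...


class EchoModel:
    """Stand-in that echoes its input instead of performing inference."""

    def infer(self, request: InferenceRequest) -> InferenceResult:
        body = f"[{request.prompt}] {request.text or ''}".strip()
        return InferenceResult(text=body)


def run_inference(
    model: DataGenerationModel, request: InferenceRequest
) -> InferenceResult:
    return model.infer(request)


result = run_inference(
    EchoModel(),
    InferenceRequest(prompt="Summarize", text="Wi-Fi will not connect"),
)
```

Keeping the prompt separate from the inference data mirrors the description, in which the instructions and the data to be analyzed are distinct inputs.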

[0551] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0552] [Fourth Embodiment]

[0553] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0554] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0555] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0556] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0557] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0558] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0559] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0560] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0561] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0562] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0563] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0564] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0565] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0566] This invention is an interactive system that enables users to self-resolve problems with information terminals via a server that interacts with a dedicated terminal installed in a store. This system combines speech recognition technology, natural language processing technology, visual guidance, and voice guidance. The program's processing is described below in natural language.

[0567] First, the user approaches a terminal installed in the store and accesses the system using voice or touch controls. The terminal converts the user's input into text data using speech recognition technology. This text data is then transmitted to a server via the internet.

[0568] The server uses natural language processing technology to analyze the received text data and extract keywords necessary to identify the problem. Based on this, the server consults internal databases and external information sources to search for relevant information. Once the optimal solution is determined, the server sends the steps to the terminal.

[0569] The terminal presents the received solution to the user through visual and audio guidance. This allows the user to understand and implement the solution even without specialized knowledge. When the user reports feedback on the results of their actions to the terminal, it is sent to the server, and the interaction continues. If the problem is reported to be resolved, the server terminates the session and records the history.

[0570] As a concrete example, consider a case where a user enters "My smartphone can't connect to Wi-Fi." The server analyzes Wi-Fi-related keywords and searches for the best solution from past history and the internet. For example, it might select a solution such as "Recheck your Wi-Fi settings and try reconnecting." The device then guides the user through this solution using character animations and voice prompts, providing specific instructions.

[0571] In this way, this system supports users in resolving problems with their information terminals themselves, enabling a rapid response.

[0572] The following describes the processing flow.

[0573] Step 1:

[0574] The user approaches a terminal installed in the store and activates the agent via touch or voice. The terminal uses sensors to detect the user's presence and initiates an initial screen or voice guidance.

[0575] Step 2:

[0576] The user describes a problem with their smartphone using voice. The device activates its voice recognition system and converts the user's voice into text data. This text data is then sent to a server via the internet.

[0577] Step 3:

[0578] The server analyzes the received text data using natural language processing techniques, extracts keywords related to the problem, and understands its content. Based on the analysis results, the server searches its internal database and external information sources for possible causes and potential solutions.

[0579] Step 4:

[0580] The server selects the most suitable solution from multiple options. The selected solution is then sent to the user's terminal in the form of visual and audio guidance to ensure user understanding.

[0581] Step 5:

[0582] The device visually displays the received solution on the screen and provides voice guidance to the user. The user understands the specific solution and attempts to perform the operation on their smartphone.

[0583] Step 6:

[0584] The user gives the terminal feedback on the result of trying the solution. This feedback may be, for example, "The problem is solved" or "It's not solved yet."

[0585] Step 7:

[0586] The server receives feedback from the user and ends the session if the problem is resolved. If it remains unresolved, it explores alternative solutions and sends further instructions to the terminal. This allows the interaction to continue.

[0587] Step 8:

[0588] Once the problem is resolved or support has ended, the server logs the session and saves the support history to the database. The terminal then prompts the user to confirm the support details and guides them through the termination process.
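
The loop formed by Steps 4 to 8 above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `SessionLog` and `run_support_session`, the in-memory `history` list standing in for the database, and the second candidate solution ("Restart the router") are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SessionLog:
    """Support history for one session (Step 8)."""
    problem: str
    attempts: List[str] = field(default_factory=list)
    resolved: bool = False


def run_support_session(
    problem: str,
    candidate_solutions: List[str],
    ask_user_if_resolved: Callable[[str], bool],
) -> SessionLog:
    """Present solutions one at a time until the user reports success."""
    log = SessionLog(problem=problem)
    for solution in candidate_solutions:
        log.attempts.append(solution)        # Steps 4 and 5: send and present
        if ask_user_if_resolved(solution):   # Step 6: user feedback
            log.resolved = True              # Step 7: end the session
            break
    return log


# Simulated feedback: "not solved yet", then "solved".
answers = iter([False, True])
history: List[SessionLog] = []   # stand-in for the support history database

log = run_support_session(
    "My smartphone can't connect to Wi-Fi",
    [
        "Recheck your Wi-Fi settings and try reconnecting",
        "Restart the router",   # hypothetical alternative solution
    ],
    lambda _solution: next(answers),
)
history.append(log)              # Step 8: record the session
```

If every candidate is exhausted without success, the returned log has `resolved` set to `False`, which corresponds to the case in which support ends without the problem being resolved.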

[0589] (Example 1)

[0590] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0591] When information terminals malfunction, there is a need for a method that allows users to quickly and effectively resolve problems themselves, even without specialized knowledge. Current support systems often make it difficult for users to understand the appropriate operating procedures, resulting in lengthy troubleshooting processes. Therefore, there is a need for a system that intuitively understands the user's problem and guides them to the appropriate solution.

[0592] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0593] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for visually displaying the user's progress and clearly indicating specific operating procedures when presenting a solution. As a result, the user can intuitively understand the problem with the information terminal and implement the solution themselves.

[0594] "Speech recognition technology" is a technology that converts voice input from a user into digital text data.

[0595] "Natural language processing technology" is a technique that analyzes text data, understands the structure of the information, and extracts necessary keywords and context.

[0596] "A means of exploring relevant information and selecting the optimal solution" refers to a function that collects necessary data from multiple sources and determines the most suitable solution for the user's problem.

[0597] "Means of presenting to the user visually and aurally" refers to methods of effectively communicating solutions to the user through on-screen visual elements and audio guides.

[0598] "Means of receiving user feedback and continuing or ending the dialogue depending on the resolution status" refers to a mechanism in which the system determines the next step in the dialogue based on responses and reports from the user and provides supplementary information as needed.

[0599] "Means of visually displaying the user's progress and clearly indicating specific operating procedures" refers to a method of supporting the user by visualizing on the screen which step the user is currently in and clearly indicating the flow of operations.
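
As a rough illustration of the last of these means, the following sketch renders a text-only progress display that marks completed steps, the current step, and remaining steps. The function name `render_progress`, the bracket markers, and the three example steps are assumptions chosen for illustration; an actual terminal would presumably use graphical elements rather than text.

```python
from typing import List


def render_progress(steps: List[str], current: int) -> str:
    """Return a text view showing which step the user is currently on.

    `current` is a zero-based index into `steps`.
    """
    if not 0 <= current < len(steps):
        raise ValueError("current must index an existing step")
    lines = [f"Step {current + 1} of {len(steps)}"]
    for index, step in enumerate(steps):
        if index < current:
            marker = "[x]"   # already completed
        elif index == current:
            marker = "[>]"   # step in progress
        else:
            marker = "[ ]"   # not yet started
        lines.append(f"{marker} {step}")
    return "\n".join(lines)


# Hypothetical operating procedure for a Wi-Fi problem.
view = render_progress(
    ["Open Settings", "Select Wi-Fi", "Reconnect to the network"],
    current=1,
)
print(view)
```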

[0600] This invention is a system that allows users to self-resolve problems with information terminals using dedicated terminals installed in stores or other locations. Specific embodiments for carrying out the invention are shown below.

[0601] The user approaches a terminal in the store and accesses the system using voice or touch controls. The terminal is equipped with voice recognition technology, which converts the user's voice input into text data. Voice recognition software is used for this process. The generated text data is sent to a server via the internet.

[0602] The server analyzes the received text data using natural language processing techniques. This process employs a natural language processing algorithm, which is a generative AI model. Based on the analysis, the server extracts the information necessary to identify the problem and consults internal databases and external information sources to find a solution.

[0603] Next, the server selects the optimal solution based on the relevant information and sends it to the terminal. The terminal then presents the received solution to the user using a combination of visual and audio guidance. This allows users to understand the specific operating procedures and solve problems even without specialized knowledge.

[0604] The user follows the given instructions, performs the operation, and reports the results as feedback to the terminal. Based on this feedback, the terminal continues its interaction with the server and provides additional solutions as needed. If the problem is resolved based on the user's report, the server terminates the session and records the history.

[0605] As a concrete example, consider a scenario where a user tells the terminal, "My smartphone can't connect to Wi-Fi." The server analyzes the Wi-Fi-related information and selects a solution such as "Recheck Wi-Fi settings and try reconnecting." The terminal then displays this solution on the screen, providing the user with specific steps.

[0606] This system addresses problems by using prompt messages such as, "Please tell me effective steps to improve my smartphone's Wi-Fi connection." This allows for the quick and effective resolution of problems with the user's information device.

[0607] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0608] Step 1: The user approaches a terminal installed in the store and provides voice or touch input. This input includes details about the problem with the user's device. The terminal uses speech recognition technology to convert the voice input into text data. Speech recognition software is used for this conversion, and the output is in text format.

[0609] Step 2: The terminal sends the generated text data to the server via the internet. The server receives this text data and analyzes it using natural language processing technology. Here, a generative AI model is used to extract keywords and contextual information necessary to identify problems from the data. The output is the analyzed information.

[0610] Step 3: The server identifies the type of malfunction based on the extracted information and searches for the best solution by accessing internal databases and external information sources. This process may utilize historical data and statistical data. The output is a specific solution presented to the user.

[0611] Step 4: The server sends the selected solution to the terminal. The terminal receives the solution and presents specific steps to the user using visual and auditory means. Visually, animations and illustrations are displayed on the screen, and audio guidance is output using synthesized speech technology. The output is clear and easy-to-understand instructions for the user.

[0612] Step 5: The user follows the instructions presented by the terminal and performs the troubleshooting steps. The user then enters the result into the terminal as feedback. This input reports whether the problem has been resolved and may include additional comments as needed.

[0613] Step 6: The terminal sends user feedback to the server, which uses this to decide whether to continue or end the interaction. The server analyzes the feedback, ends the session if the problem is resolved, and records the history. The output is either the end of the interaction or further instructions.
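
Steps 2 and 3 above can be pictured with the following minimal sketch, in which a simple keyword match stands in for the generative AI model and a small dictionary stands in for the internal database. Both substitutions, and the names `KNOWLEDGE_BASE`, `extract_keywords`, and `select_solution`, are assumptions made for illustration; the two solution texts are taken from the examples given in this specification.

```python
import re
from typing import Dict, Optional, Set

# Stand-in for the internal database: keyword -> solution text.
KNOWLEDGE_BASE: Dict[str, str] = {
    "wi-fi": "Recheck Wi-Fi settings and try reconnecting.",
    "screen": "Press and hold the power button for 10 seconds to restart.",
}


def extract_keywords(text: str, vocabulary: Set[str]) -> Set[str]:
    """Return the known keywords found in the text (Step 2).

    A plain token match is used here in place of natural language
    processing by a generative AI model.
    """
    tokens = set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))
    return tokens & vocabulary


def select_solution(text: str) -> Optional[str]:
    """Look up a solution for the first matching keyword (Step 3)."""
    matches = sorted(extract_keywords(text, set(KNOWLEDGE_BASE)))
    return KNOWLEDGE_BASE[matches[0]] if matches else None


solution = select_solution("My smartphone can't connect to Wi-Fi")
```

When no keyword matches, `select_solution` returns `None`, which a fuller system might treat as a cue to consult external information sources.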

[0614] (Application Example 1)

[0615] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0616] When users encounter problems with their information terminals, it is difficult for them to resolve the issue themselves without specialized knowledge. Furthermore, traditional support systems often struggle to provide users with the information they need quickly and accurately, leading to prolonged interactions. Additionally, the solutions presented may be unclear, increasing the risk of users making incorrect decisions.

[0617] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0618] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary to identify a problem from the text data using natural language processing technology, and means for displaying virtual elements using augmented reality technology and guiding the user through specific operating procedures. As a result, the user receives clear visual and auditory guidance, enabling them to quickly and accurately resolve their problems.

[0619] "Speech recognition technology" is a technology that converts voice input into text data, processing what the user says as text information.

[0620] "Natural language processing technology" is a technology that analyzes text data to understand its meaning and intent, and extracts information necessary to identify problems from language data.

[0621] "Information sources" refer to internal databases or external knowledge bases that a server accesses when searching for relevant information.

[0622] "Presenting visually and aurally" means providing the selected solution to the user through visual displays and audio explanations.

[0623] "Augmented reality technology" is a technology that overlays virtual elements onto the real world, visually guiding users through specific operating procedures.

[0624] "Communication services" refer to system functions that record user feedback and history, and exchange information between servers.

[0625] A system for carrying out this invention includes a user, a terminal, and a server as its main components.

[0626] First, the user accesses the terminal to resolve an issue with their information terminal. The terminal accepts input from the user via voice or touch, and converts voice input into text data using speech recognition technology. This process utilizes speech recognition software such as Google Cloud Speech-to-Text.

[0627] Next, the text data is sent to a server. This server uses natural language processing technologies such as OpenAI GPT to analyze the text and extract keywords necessary to identify the problem. Based on the analyzed text, the server searches for relevant information by referencing various sources such as internal databases and the internet, and selects the optimal solution.

[0628] The selected solution is returned to the device, which uses ARKit (iOS) or ARCore (Android) to present virtual elements to the user using augmented reality technology. Additionally, Amazon Polly is used to provide voice guidance, visually and audibly guiding the user through specific operating procedures.

[0629] For example, if a user reports a problem such as "my smartphone's Wi-Fi won't connect," the server analyzes Wi-Fi-related keywords and selects a solution such as "check your Wi-Fi settings and try reconnecting." The device then presents these instructions using augmented reality guidelines and voice prompts, allowing the user to resolve the issue by following the instructions.

[0630] By using a generative AI model, it is possible to generate appropriate solutions to user questions. An example of a prompt to be input to the generative AI model is: "The user has reported a problem with their smartphone's Wi-Fi. Please briefly explain the steps to the best solution. In particular, please provide information that can be supported both visually and audibly in a way that even a beginner can understand." In this way, the system of the present invention helps to quickly and effectively resolve technical problems faced by users.
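
The prompt quoted above could be assembled from the reported problem along the following lines. This is only a sketch: the function name `build_prompt` and its parameters are hypothetical, and the wording simply follows the example prompt in this paragraph.

```python
def build_prompt(problem: str, audience: str = "a beginner") -> str:
    """Build a prompt for the generative AI model from a problem report.

    `problem` is expected to be a complete sentence describing the issue.
    """
    return (
        f"The user has reported a problem: {problem} "
        "Please briefly explain the steps to the best solution. "
        "In particular, please provide information that can be supported "
        f"both visually and audibly in a way that even {audience} can "
        "understand."
    )


prompt = build_prompt("Their smartphone's Wi-Fi will not connect.")
```

Parameterizing only the problem description keeps the instruction about visual and audible support fixed across different malfunctions.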

[0631] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0632] Step 1:

[0633] The user inputs a question into the device via voice or touch. The device accepts the input and uses speech recognition technology to convert voice input into text data. During this process, the user's voice data is converted to text using Google Cloud Speech-to-Text. The output is then sent to the server as natural language text.

[0634] Step 2:

[0635] The server analyzes the received text data using natural language processing techniques. Specifically, it uses OpenAI's natural language processing model to extract keywords from the text data to identify the problem. The input to this process is the text data from step 1, and the output is the extracted keyword set.

[0636] Step 3:

[0637] The server references internal databases and external information sources to search for and select the optimal solution based on extracted keywords. For example, it collects relevant technical information from the internet and determines the optimal solution using its own algorithm. The input is the keyword set from step 2, and the output is the selected solution.

[0638] Step 4:

[0639] The server sends the selected solution to the terminal. The terminal receives this and prepares a display to provide visual guidance to the user using augmented reality technology. Using ARKit or ARCore, virtual guidelines are overlaid on the user's real field of view. The input is the solution from step 3, and the output is the guidance in the visual display.

[0640] Step 5:

[0641] The device simultaneously uses Amazon Polly to generate voice guidance and provide auditory feedback to the user. Specifically, the device plays an audio explanation of the solution through its speaker, guiding the user through the specific actions they should take next. The input is the solution from step 3, and the output is the audio format explanation.

[0642] Step 6:

[0643] The user attempts to resolve the problem by following the instructions and provides feedback to the terminal. The terminal collects this feedback and reports it to the server. The server processes the received feedback and determines whether the problem has been resolved. If the problem is resolved, the session ends and the user's history is recorded. The input is the feedback from the user, and the output is the update of the history database.
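
Steps 1 to 5 above form a pipeline that can be sketched as follows. Each stage is passed in as a plain function so that the sketch does not depend on any particular service; the stubs used here are assumptions and do not call Google Cloud Speech-to-Text, OpenAI models, ARKit, ARCore, or Amazon Polly. The names `Guidance` and `run_pipeline` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Guidance:
    overlay_steps: List[str]   # what an augmented reality layer would draw
    narration: str             # what a speech synthesizer would read aloud


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    extract_keywords: Callable[[str], List[str]],
    select_solution: Callable[[List[str]], List[str]],
) -> Guidance:
    text = transcribe(audio)             # Step 1: speech to text
    keywords = extract_keywords(text)    # Step 2: keyword extraction
    steps = select_solution(keywords)    # Step 3: solution selection
    narration = " ".join(
        f"Step {number}: {step}." for number, step in enumerate(steps, start=1)
    )
    # Steps 4 and 5: the same solution feeds the visual and audio guidance.
    return Guidance(overlay_steps=steps, narration=narration)


# Stub stages standing in for the real services.
guidance = run_pipeline(
    audio=b"",
    transcribe=lambda _audio: "my smartphone's Wi-Fi won't connect",
    extract_keywords=lambda text: [w for w in ("wi-fi",) if w in text.lower()],
    select_solution=lambda keywords: (
        ["Check your Wi-Fi settings", "Try reconnecting"]
        if "wi-fi" in keywords
        else []
    ),
)
```

Because the visual overlay and the narration are derived from the same list of steps, the two forms of guidance stay consistent with each other.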

[0644] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0645] This invention combines emotion recognition technology with an interactive system that allows users to self-resolve problems with information terminals. By incorporating an emotion engine, this system can analyze the user's emotional state from their voice and facial expressions and communicate appropriately based on that analysis.

[0646] First, the user accesses the system via a terminal installed in the store and describes the problem verbally. The terminal uses speech recognition technology to transcribe what is said into text and sends it to the server. Simultaneously, an emotion engine analyzes the user's tone of voice and facial expressions to identify their emotional state.

[0647] The server analyzes the received text data using natural language processing technology to extract information necessary to identify the problem. Furthermore, it considers the user's emotional state based on information from the emotion engine, and provides more careful guidance if the user is feeling anxious or frustrated.

[0648] The server selects the optimal solution from multiple sources based on the identified problem and the user's emotional state, and sends it to the terminal. The terminal provides the solution through visual and audio guidance, adjusting the tone of communication according to the user's emotional state.

[0649] For example, if a user complains that their smartphone battery dies too quickly, the emotion engine might detect frustration. In this case, the server will quickly and concisely suggest battery-saving methods that can be tried with simple steps, and the device will guide the user through these steps in a gentle and friendly tone.

[0650] Thus, the system of the present invention can easily provide guidance that reflects the user's emotional state, thereby improving the user experience.

[0651] The following describes the processing flow.

[0652] Step 1:

[0653] The user approaches a terminal installed in the store and begins to operate it. The terminal activates its emotion engine and starts collecting emotion data from the user's voice and facial expressions. At the same time, the terminal also activates its voice recognition system to convert the voice input into text.

[0654] Step 2:

[0655] The device converts voice input into text data while simultaneously sending emotional data analyzed by the emotion engine to the server. This is a crucial process for understanding the user's emotional state.

[0656] Step 3:

[0657] The server analyzes the received text data using a natural language processing engine to extract keywords necessary for identifying the problem. Simultaneously, it analyzes sentiment data to understand the user's emotional state.

[0658] Step 4:

[0659] The server selects the optimal solution from multiple sources based on the analysis results. This selection process takes into account the user's emotional state; for example, if the user is frustrated, an intuitive and concise solution is more likely to be chosen.

[0660] Step 5:

[0661] After the server selects the optimal solution, it sends that information to the terminal. The terminal visually displays the received solution on its screen and explains it clearly to the user through voice guidance.

[0662] Step 6:

[0663] The device adjusts the tone and expression of its guidance based on the user's emotional state. For example, if the user is feeling anxious, the guidance will be delivered in a gentle tone.

[0664] Step 7:

[0665] The user tries the suggested solution and reports the results as feedback to the device. The device then sends this feedback to the server.

[0666] Step 8:

[0667] The server receives user feedback and terminates the session if the problem is resolved. If the problem remains unresolved, the server explores alternative solutions and continues the interaction by sending instructions to the terminal again.
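
The emotion-dependent selection in Step 4 above might look like the following minimal sketch. The `Solution` type, the relevance scores, the rule that a frustrated user receives the solution with the fewest steps, and the two battery-related candidates are all assumptions introduced for illustration; the specification says only that a concise solution is more likely to be chosen when the user is frustrated.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Solution:
    title: str
    steps: List[str]
    relevance: float   # hypothetical score from the search stage


def choose_solution(candidates: List[Solution], emotion: str) -> Solution:
    """Pick a solution, favoring brevity when the user is frustrated."""
    if not candidates:
        raise ValueError("no candidate solutions")
    if emotion == "frustrated":
        # Fewest steps first; relevance breaks ties.
        return min(candidates, key=lambda s: (len(s.steps), -s.relevance))
    return max(candidates, key=lambda s: s.relevance)


# Hypothetical candidates for "my smartphone battery dies too quickly".
candidates = [
    Solution("Review battery usage", ["Open Settings", "Open Battery",
                                      "Close high-usage apps"], 0.9),
    Solution("Enable battery saver", ["Turn on battery saver"], 0.7),
]

quick = choose_solution(candidates, "frustrated")
thorough = choose_solution(candidates, "calm")
```

Under these assumptions a frustrated user is offered the one-step solution, while a calm user is offered the more relevant but longer procedure.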

[0668] (Example 2)

[0669] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0670] When users attempt to resolve issues with their information terminals themselves, typical interactive systems often fail to consider the user's emotional state, resulting in a degraded user experience. In particular, if users are feeling anxious or frustrated, an inadequate response can lead to further stress. The challenge lies in providing a system that addresses these issues while efficiently and effectively resolving problems.

[0671] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0672] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for identifying the user's emotional state using sentiment analysis technology and adjusting the tone of communication accordingly. This enables flexible and appropriate responses in accordance with the user's emotional state, improving the user experience and enabling efficient problem solving.

[0673] "Speech recognition technology" is a technology that converts speech data into text data, and it is a process that analyzes a user's voice and turns it into text.

[0674] "Natural language processing technology" is a technique for extracting and analyzing specific information from text data, and is a method for computers to understand and process human language.

[0675] "Sentiment analysis technology" is a technology that analyzes a user's voice tone and facial expressions to identify their current emotional state.

[0676] "Communication tone" refers to the language and atmosphere used when presenting information or solutions to users, and it should be adjusted to match the user's emotions.

[0677] "Identifying a problem" means diagnosing and identifying the root cause of a problem or malfunction that a user is experiencing.

[0678] "Information sources" refer to various databases and knowledge bases that provide relevant information useful for solving problems.

[0679] "Feedback" refers to information collected from users, such as their reactions and opinions, which is used to adjust the system's response.

[0680] This invention is an interactive system centered on a user-operated information terminal, realized by integrating speech recognition technology, natural language processing technology, and sentiment analysis technology. The user begins by verbally describing the problem to the information terminal. The hardware used in this process is a general-purpose information terminal including a microphone and camera. The software then uses speech recognition technology (e.g., general-purpose speech recognition software) to convert the speech into text.

[0681] The text data generated by speech recognition is sent to the server, where natural language processing software (e.g., a natural language processing engine) analyzes the information necessary to identify the problem. In parallel, the server uses sentiment analysis technology (e.g., a sentiment analysis module) to identify the user's emotional state from their voice tone and facial expressions.

[0682] Based on information received from the server, the device presents solutions visually and audibly, in a tone appropriate to the user's emotional state. This reduces stress and facilitates self-resolution of problems. For example, if a user says, "My smartphone screen isn't working," and the emotion analysis technology detects frustration, the server will quickly select a specific solution, and the device will gently and calmly instruct the user to "press and hold the power button for 10 seconds to restart."
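
The tone adjustment described above can be illustrated with the following small sketch, which prepends a phrase chosen according to the identified emotional state. The `TONE_PREFIX` table, its wording, and the function name `apply_tone` are hypothetical; only the restart instruction is taken from the example in this paragraph.

```python
from typing import Dict

# Hypothetical opening phrases for each emotional state.
TONE_PREFIX: Dict[str, str] = {
    "frustrated": "No problem, this is a quick fix. ",
    "anxious": "Don't worry, we'll go through this together. ",
}


def apply_tone(instruction: str, emotion: str) -> str:
    """Return the instruction with an opening suited to the user's emotion.

    Unknown emotional states leave the instruction unchanged.
    """
    return TONE_PREFIX.get(emotion, "") + instruction


message = apply_tone(
    "Press and hold the power button for 10 seconds to restart.",
    "frustrated",
)
```

A real terminal would presumably also vary speech rate and voice quality, which this text-only sketch does not attempt to model.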

[0683] An example of a prompt might be, "Tell me how to recognize a specific emotional state and design a user-friendly solution based on it." This prompt allows the generative AI model to learn and provide methods for adjusting responses based on the user's emotions.

[0684] In this way, it is possible to improve the user experience while efficiently solving problems.

[0685] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0686] Step 1:

[0687] The user inputs a description of the problem by voice into the information terminal. The terminal's microphone captures this voice data and uses speech recognition technology to convert the voice into text data. This input of voice data results in the output of text data.

[0688] Step 2:

[0689] The terminal sends the generated text data to the server. The server analyzes this text data using natural language processing technology and extracts the information necessary to identify the problem. It receives text data as input and can obtain the cause of the problem and related information as output.

[0690] Step 3:

[0691] The device processes the user's voice tone and facial expressions using emotion analysis technology to identify the user's emotional state. The input is voice tone and facial expression data, and the output identifies the emotional state (e.g., anxiety or frustration).

[0692] Step 4:

[0693] The server selects the optimal solution from multiple information sources based on malfunction information obtained from natural language processing and emotional state information from sentiment analysis technology. The input includes malfunction information and emotional state information, and the output is the selection of the best solution to propose to the user.

[0694] Step 5:

[0695] The terminal provides the user with solutions received from the server. Solutions are presented via a visual display and audio output, with guidance delivered in a tone appropriate to the user's emotional state. The input to this process is the solution data, and the output is the form of guidance for the user.
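The flow of Steps 2, 4, and 5 above can be pictured with a minimal Python sketch. It assumes that speech recognition (Step 1) and emotion analysis (Step 3) have already produced a text string and an emotion label. The keyword table, the solution table, the tone prefixes, and every function name are hypothetical placeholders introduced only for illustration; they are not part of this disclosure.

```python
from dataclasses import dataclass

# Hypothetical keyword table standing in for the server-side NLP analysis.
PROBLEM_KEYWORDS = {
    "screen": "display_unresponsive",
    "battery": "battery_drain",
}

# Hypothetical knowledge base mapping an identified problem to a solution.
SOLUTIONS = {
    "display_unresponsive": "press and hold the power button for 10 seconds to restart",
    "battery_drain": "close unused apps and lower the screen brightness",
}

# Hypothetical tone prefixes chosen according to the emotional state.
TONE_PREFIX = {
    "frustration": "No need to worry. Please ",
    "anxiety": "This is easy to fix. Please ",
    "neutral": "Please ",
}


@dataclass
class Guidance:
    problem: str
    emotion: str
    message: str


def identify_problem(text: str) -> str:
    """Step 2: identify the problem from recognized text (simple keyword lookup)."""
    lowered = text.lower()
    for keyword, problem in PROBLEM_KEYWORDS.items():
        if keyword in lowered:
            return problem
    return "unknown"


def build_guidance(text: str, emotion: str) -> Guidance:
    """Steps 2, 4 and 5: identify the problem, select a solution, apply the tone."""
    problem = identify_problem(text)
    solution = SOLUTIONS.get(problem, "contact support for further help")
    prefix = TONE_PREFIX.get(emotion, TONE_PREFIX["neutral"])
    return Guidance(problem, emotion, prefix + solution + ".")
```

For instance, calling `build_guidance("My smartphone screen isn't working", "frustration")` yields guidance for the unresponsive-display case worded with the calming prefix. A real system would replace the keyword lookup with a natural language processing engine.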

[0696] (Application Example 2)

[0697] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0698] Users facing malfunctions in information terminals or home electronic devices often spend considerable time and effort identifying the problem and selecting a solution. Furthermore, responses that disregard the user's emotional state can detract from the user experience. This invention aims to provide a system for quickly and accurately resolving malfunctions while considering the user's emotional state.

[0699] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0700] In this invention, the server includes means for converting user input into text data using speech recognition technology, means for analyzing information necessary for identifying problems from the text data using natural language processing technology, and means for recognizing the user's emotional state from their voice and facial expressions using emotion analysis technology. This enables appropriate communication according to the user's emotional state, improving the user experience while allowing for rapid problem resolution.

[0701] "Speech recognition technology" is a technology that records a user's speech as digital data and converts it into text data.

[0702] "Natural language processing technology" refers to techniques for analyzing and extracting meaningful information from text data.

[0703] "Emotion analysis technology" is a technology used to identify a user's emotional state from their voice and facial expressions.

[0704] An "information source" is a collection of multiple data sets that provide reference data or a knowledge base for resolving problems.

[0705] "Communication style" refers to the guidelines for selecting appropriate expressions and tone of voice when interacting with users.

[0706] "Perceptual devices" are devices such as sensors and cameras that robots use to acquire information about the user and their environment.

[0707] "Dialogue history" refers to a record of information exchanges that have taken place between the user and the system in the past.

[0708] "User experience" is a comprehensive evaluation of the satisfaction and convenience that users feel when using a system.

[0709] The system for implementing the present invention is comprised of a combination of speech recognition, natural language processing, and sentiment analysis technologies. The server receives user voice data and visual data from a terminal equipped with a microphone and camera. Speech recognition software (e.g., Google Speech-to-Text) converts this voice data into text data. Next, a natural language processing library (e.g., NLTK) analyzes this text data to obtain information necessary for identifying defects.

[0710] Simultaneously, emotion analysis technology (e.g., Microsoft Azure's Emotion API) is used to analyze the user's voice and facial expression data and recognize their emotional state. The server combines these analysis results, searches databases and knowledge bases for information to resolve problems, and selects the optimal solution based on the user's emotional state.

[0711] The selected solution is presented to the user visually and audibly through the device. The tone of communication is adjusted according to the user's emotional state. Furthermore, additional information about the user's surroundings can be acquired using perceptual devices, such as the sensors and cameras of a robot.

[0712] For example, if a user states that they are experiencing a problem where their smart speaker is not playing music, the system analyzes the statement and detects dissatisfaction through sentiment analysis. The system then provides a quick solution, improving the user experience by gently suggesting, "First, let's check the power and connection status of your smart speaker."

[0713] An example of a prompt message could be text that reads, "Recognize the emotions of users experiencing problems with their smart home devices and design guidelines to provide solutions based on those emotions."

[0714] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0715] Step 1:

[0716] The user describes the problem through a voice input device. The voice data is input into the terminal and converted into text data by speech recognition software. This process converts the user's spoken content into an analyzable text format.

[0717] Step 2:

[0718] The server receives text data and analyzes its content using natural language processing technology. It receives a text description of a problem as input and processes the data to extract relevant keywords and important information. As a result, the information necessary to identify the problem is output.

[0719] Step 3:

[0720] The user's voice and visual data are input into the emotion analysis system. The server uses emotion analysis technology to analyze the data and identify the user's emotional state. This process performs computations on data obtained from voice tone and facial expressions and outputs the identified emotional state.

[0721] Step 4:

[0722] Based on the identified problem and the user's emotional state, the server selects several potential solutions. In this step, the database is referenced, and data retrieval and calculations are performed to provide the user with the most relevant information.
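One possible way to picture the selection in Step 4 is the following Python sketch. The scoring rule is an assumption made purely for illustration: each candidate carries a hypothetical relevance score and step count, and long procedures are penalized when the user's emotional state is negative. The disclosure does not specify how candidates are ranked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    relevance: float  # hypothetical match score against the identified problem (0 to 1)
    steps: int        # hypothetical number of operations the user must perform


def select_solutions(candidates: List[Candidate], emotion: str, limit: int = 2) -> List[Candidate]:
    """Rank candidates; for negative emotions, penalize long procedures."""
    # Assumed rule: frustrated or anxious users are offered shorter procedures first.
    penalty = 0.1 if emotion in {"frustration", "anxiety"} else 0.0
    ranked = sorted(
        candidates,
        key=lambda c: c.relevance - penalty * c.steps,
        reverse=True,
    )
    return ranked[:limit]
```

With this rule, a highly relevant but lengthy procedure ranks first for a calm user and drops below shorter procedures for a frustrated one.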

[0723] Step 5:

[0724] The server sends the selected solution to the terminal. The solution is presented to the user visually and audibly on the terminal, with the communication style adjusted according to the user's emotional state. This ensures that the information is presented in a way that is more easily accepted by the user.

[0725] Step 6:

[0726] The user provides feedback on the presented solution. The terminal sends this feedback to the server, which then analyzes the results to determine the next step in the interaction. During this process, the output is determined as either continuing or ending the interaction.
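The continue-or-end decision in Step 6 can be sketched as a simple loop. In this Python sketch the feedback is modeled as a hypothetical callback `is_resolved` that reports whether the presented solution worked; the function name, the log format, and the stopping rule are illustrative assumptions rather than the disclosed implementation.

```python
from typing import Callable, Iterable, List


def run_dialogue(solutions: Iterable[str], is_resolved: Callable[[str], bool]) -> List[str]:
    """Present solutions one at a time until user feedback reports resolution."""
    log: List[str] = []
    for solution in solutions:
        log.append(f"present: {solution}")
        if is_resolved(solution):
            # Feedback indicates the problem is solved, so the dialogue ends.
            log.append("end: resolved")
            return log
        # Feedback indicates the problem persists, so the dialogue continues.
        log.append("continue: not resolved")
    log.append("end: no further solutions")
    return log
```

The returned log makes the outcome of each round explicit, which could also serve as an entry in the dialogue history.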

[0727] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0728] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0729] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0730] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0731] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0732] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0733] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).

[0734] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
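The layout of the emotion map described above (left and right halves, pleasure above and displeasure below, inner thoughts toward the center and actions toward the outside) can be modeled, as a rough sketch, with planar coordinates. The class name `MapPoint`, the coordinate convention, and any numeric positions are assumptions introduced only for illustration; the actual positions in Figure 9 are not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPoint:
    name: str
    x: float  # negative = left half (sensation-dominant), positive = right half ("situation")
    y: float  # positive = upper side ("pleasure"), negative = lower side ("displeasure")

    @property
    def region(self) -> str:
        # Emotions on the vertical axis are described as arising from both sources.
        if self.x == 0:
            return "mixed"
        return "situation" if self.x > 0 else "reaction"

    @property
    def valence(self) -> str:
        if self.y == 0:
            return "neutral"
        return "pleasure" if self.y > 0 else "displeasure"

    @property
    def radius(self) -> float:
        # Distance from the center: larger means more outwardly expressed.
        return math.hypot(self.x, self.y)


def more_expressed(a: MapPoint, b: MapPoint) -> MapPoint:
    """Return whichever emotion lies further from the center of the map."""
    return a if a.radius >= b.radius else b
```

Under this convention, an emotion placed at hypothetical coordinates `(-2.0, 1.0)` falls in the sensation-dominant half on the pleasure side, and its larger radius marks it as more visibly expressed than one placed at `(1.0, -1.0)`.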

[0735] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0736] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
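As a small illustration of how emotion values might be turned into a determination, the following Python sketch takes a dictionary of emotion values (standing in for the neural network output, which is not modeled here) and returns the emotions whose values lie within a margin of the highest one. The function name, the margin parameter, and the sample values are hypothetical.

```python
from typing import Dict, List


def dominant_emotions(emotion_values: Dict[str, float], margin: float = 0.05) -> List[str]:
    """Return the emotions whose values are within `margin` of the maximum."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    top = max(emotion_values.values())
    # Neighbouring emotions on the map are expected to score similarly,
    # so several names may be returned together.
    return sorted(name for name, value in emotion_values.items() if top - value <= margin)


# Hypothetical emotion values for one user input.
sample = {"reassured": 0.81, "calm": 0.79, "confident": 0.78, "anxious": 0.10}
```

Calling `dominant_emotions(sample)` groups the three closely scored emotions, while `dominant_emotions(sample, margin=0.0)` returns only the single highest one.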

[0737] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0738] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0739] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0740] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0741] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0742] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using this memory.

[0743] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0744] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0745] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0746] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0747] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0748] The following is further disclosed regarding the embodiments described above.

[0749] (Claim 1)

[0750] A means of converting user input into text data using speech recognition technology,

[0751] A means of analyzing information necessary to identify defects from text data using natural language processing technology,

[0752] A means of searching for relevant information from multiple sources and selecting the optimal solution,

[0753] A means of presenting the selected solution to the user visually and audibly,

[0754] A means of receiving user feedback and continuing or ending the dialogue depending on the resolution status,

[0755] A system comprising each of the above means.

[0756] (Claim 2)

[0757] The system according to claim 1, which recognizes the user's terminal screen and provides additional information.

[0758] (Claim 3)

[0759] The system according to claim 1, which refers to the call center's interaction history and records information for use in future interactions.

[0760] "Example 1"

[0761] (Claim 1)

[0762] A means of converting user input into text data using speech recognition technology,

[0763] A means of analyzing information necessary to identify defects from text data using natural language processing technology,

[0764] A means of searching for relevant information from multiple sources and selecting the optimal solution,

[0765] A means of presenting the selected solution to the user visually and audibly,

[0766] A means of receiving user feedback and continuing or ending the dialogue depending on the resolution status,

[0767] A means of visually displaying the user's progress and clearly indicating specific operating procedures when presenting a solution,

[0768] A system comprising each of the above means.

[0769] (Claim 2)

[0770] The system according to claim 1, which recognizes the user's terminal screen and provides additional information.

[0771] (Claim 3)

[0772] The system according to claim 1, which refers to the call center's interaction history and records information for use in future interactions.

[0773] "Application Example 1"

[0774] (Claim 1)

[0775] A means of converting user input into text data using speech recognition technology,

[0776] A means of analyzing information necessary to identify defects from text data using natural language processing technology,

[0777] A means of searching for relevant information from multiple sources and selecting the optimal solution,

[0778] A means of presenting the selected solution to the user visually and audibly,

[0779] A means of displaying virtual elements using augmented reality technology and guiding the user through specific operating procedures,

[0780] A means of receiving user feedback and continuing or ending the dialogue depending on the resolution status,

[0781] A system comprising each of the above means.

[0782] (Claim 2)

[0783] The system according to claim 1, which recognizes the user's display device and provides additional information.

[0784] (Claim 3)

[0785] The system according to claim 1, which refers to the history of communication service interactions and records information for use in future conversations.

[0786] "Example 2 of combining an emotion engine"

[0787] (Claim 1)

[0788] A means of converting user input into text data using speech recognition technology,

[0789] A means of analyzing information necessary to identify defects from text data using natural language processing technology,

[0790] A means of identifying a user's emotional state using emotion analysis technology,

[0791] A means of presenting solutions while adjusting the tone of communication according to the user's emotional state,

[0792] A means of searching for relevant information from multiple sources and selecting the optimal solution,

[0793] A means of presenting the selected solution to the user visually and audibly,

[0794] A means of receiving user feedback and continuing or ending the dialogue depending on the resolution status,

[0795] A system comprising each of the above means.

[0796] (Claim 2)

[0797] The system according to claim 1, which recognizes the user's terminal screen and provides additional information.

[0798] (Claim 3)

[0799] The system according to claim 1, which refers to the call center's interaction history and records information for use in future interactions.

[0800] "Application example 2 when combining with an emotional engine"

[0801] (Claim 1)

[0802] A means of converting user input into text data using speech recognition technology,

[0803] A means of analyzing information necessary to identify defects from text data using natural language processing technology,

[0804] A means of recognizing a user's emotional state from their voice and facial expressions using emotion analysis technology,

[0805] A means of exploring relevant information from multiple sources, selecting the optimal solution, and presenting it in a communication style that matches the user's emotional state,

[0806] A means of presenting selected solutions to the user visually and audibly, receiving user feedback, adjusting the dialogue according to the user's emotional state, and continuing or ending the conversation,

[0807] A system comprising each of the above means.

[0808] (Claim 2)

[0809] The system according to claim 1, which uses a robot's sensory device to recognize the user's environment and provide additional information.

[0810] (Claim 3)

[0811] The system according to claim 1, which analyzes recorded dialogue history and emotional states and uses them to improve the user experience in future dialogues.

[Explanation of symbols]

[0812] 10, 210, 310, 410 Data processing system; 12 Data processing device; 14 Smart device; 214 Smart glasses; 314 Headset-type terminal; 414 Robot

Claims

1. A system comprising: a means of converting user input into text data using speech recognition technology; a means of analyzing information necessary to identify defects from the text data using natural language processing technology; a means of searching for relevant information from multiple sources and selecting the optimal solution; a means of presenting the selected solution to the user visually and audibly; and a means of receiving user feedback and continuing or ending the dialogue depending on the resolution status.

2. The system according to claim 1, which recognizes the user's terminal screen and provides additional information.

3. The system according to claim 1, which refers to the call center's interaction history and records information for use in future conversations.