system

A system using speech recognition and natural language processing provides audio-visual instructions for self-resolving device problems, addressing the challenge of user unfamiliarity and enhancing support efficiency.

JP2026101344A · Pending · Publication Date: 2026-06-22 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-10
Publication Date
2026-06-22

Smart Images

  • Figure 2026101344000001_ABST

Abstract

A system is provided. [Solution] The system includes: speech recognition means for converting speech information obtained from a user's operating platform into text information; natural language processing means for analyzing the text information and identifying the type of failure; means for generating a solution based on the type of failure and presenting the solution audibly and visually; and means for providing self-support that enables customers to resolve problems with their mobile devices on the spot.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Today, many users rely on smartphones and other information terminals. However, when problems occur in these devices, users who are not familiar with the technology often do not know how to handle them and go to stores or support centers. In such cases, waiting times can be long and the in-store response does not necessarily solve the problem, so there is a need for means to solve problems quickly and effectively. The object of the present invention is to provide a system that helps users solve problems easily.

Means for Solving the Problems

[0005] To solve this problem, the present invention provides a system that includes speech recognition means for converting audio data acquired from a user's terminal into text data, and natural language processing means for analyzing the text data and identifying the type of problem. Furthermore, it includes means for generating solutions based on the type of problem and presenting the solutions both audibly and visually, and these means work together to enable the user to quickly and efficiently solve the problem they are facing.

[0006] "Speech recognition means" refers to a technology that receives a user's speech, converts it into a digital signal, and then converts that into analyzable text data.

[0007] "Natural language processing means" refers to technologies for language understanding and generation that analyze acquired text data to identify relevant information and types of problems in a specific field.

[0008] "Solution generation means" refers to a technology that, based on previously accumulated data and algorithms, formulates appropriate countermeasures to present to users for identified problems.

[0009] "Voice presentation means" refers to functions and technologies for guiding users through generated solutions in audio format.

[0010] "Visual presentation means" refers to display methods and technologies used to visually show users solutions or operating procedures.

Brief Description of the Drawings

[0011]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0012] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0013] First, the terminology used in the following description is explained.

[0014] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0015] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0016] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), magnetic tapes, and the like.

[0017] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), Bluetooth (registered trademark), and the like.

[0018] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0019] [First Embodiment]

[0020] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0021] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0022] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0023] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0024] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0025] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0026] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0027] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0028] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0029] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0030] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0031] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0032] This invention is a system that helps users resolve problems with their smartphones and other information terminals themselves using a tablet terminal at a store. This system operates by integrating speech recognition, natural language processing, and solution suggestion technologies.

[0033] First, the user activates the tablet device and begins describing the current problem using voice or text. The device records the user's voice and sends the data to the server. The server then uses speech recognition technology to convert the voice data into text.

[0034] Next, the server applies natural language processing techniques to analyze the type of problem from the text data. In this process, referencing past call center interaction history and device handling databases enables more accurate problem identification and advice.
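As a rough illustration of how past interaction history could inform problem identification, the following Python sketch scores a transcribed utterance by word overlap with labelled past utterances. The history records, the labels, and the overlap scoring are all assumptions made for illustration; the disclosure does not specify the analysis method.

```python
from collections import Counter

# Hypothetical history records: (past user utterance, problem type label).
HISTORY = [
    ("my phone cannot connect to wi-fi", "wifi_connection"),
    ("wi-fi keeps dropping at home", "wifi_connection"),
    ("the battery drains very quickly", "battery_drain"),
    ("battery runs out by noon", "battery_drain"),
    ("the screen does not respond to touch", "touch_screen"),
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of word tokens."""
    cleaned = "".join(
        ch if ch.isalnum() or ch == "-" else " " for ch in text.lower()
    )
    return set(cleaned.split())


def identify_problem_type(text: str, history=HISTORY) -> str:
    """Return the label whose past utterances share the most words with text."""
    words = tokenize(text)
    scores = Counter()
    for utterance, label in history:
        # Accumulate the number of shared words per label.
        scores[label] += len(words & tokenize(utterance))
    if not scores or max(scores.values()) == 0:
        return "unknown"
    return scores.most_common(1)[0][0]
```

A production system would more likely rely on a trained classifier or a language model; the overlap score here only makes the "compare against history" idea concrete.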

[0035] Once a problem is identified, the server generates a solution to resolve it. This solution includes appropriate troubleshooting steps and configuration change instructions, and is prepared as voice instructions using speech synthesis technology. Furthermore, the operating procedures are displayed on the screen as a visual aid.

[0036] Finally, the device provides these instructions to the user via voice and visual means to support problem-solving. For example, if a user reports a Wi-Fi connection problem, the system will provide a visual example of the settings screen along with a voice instruction such as, "Open the Wi-Fi settings screen and check the network status." This makes it easy for even tech-inexperienced users to resolve the problem.
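The Wi-Fi example above pairs one spoken instruction with on-screen guidance. A minimal sketch of a data structure that carries both is shown below. The class name `Solution`, its fields, and the three on-screen captions are hypothetical; only the spoken sentence comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Solution:
    problem_type: str
    voice_text: str  # text that a speech synthesis engine would read aloud
    visual_steps: list = field(default_factory=list)  # captions shown on screen


def build_wifi_solution() -> Solution:
    """Build the Wi-Fi guidance; the visual captions are illustrative only."""
    return Solution(
        problem_type="wifi_connection",
        voice_text="Open the Wi-Fi settings screen and check the network status.",
        visual_steps=[
            "Open Settings",
            "Tap Wi-Fi",
            "Check the network status",
        ],
    )


def render_for_display(solution: Solution) -> str:
    """Format the visual steps as a numbered list for the terminal screen."""
    lines = [
        f"{number}. {step}"
        for number, step in enumerate(solution.visual_steps, start=1)
    ]
    return "\n".join(lines)
```

Keeping the spoken text and the visual steps in one object lets the terminal present both from a single server response.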

[0037] This system is designed to allow users to quickly resolve problems with their own devices, resulting in reduced waiting times and improved support quality.

[0038] The following describes the processing flow.

[0039] Step 1:

[0040] The user operates a tablet terminal in the store to activate the AI agent. Once the agent is activated, the terminal sets the voice input function to standby mode.

[0041] Step 2:

[0042] The user enters the details of the problem via voice. The terminal records this voice data and prepares to send it to the server.

[0043] Step 3:

[0044] The server converts the received audio data into text data using ASR (Automatic Speech Recognition) technology. This makes the audio information available in a text format that can be parsed.

[0045] Step 4:

[0046] The server uses NLP (Natural Language Processing) technology to analyze text data. The purpose of the analysis is to identify the type of reported problem and its possible causes.

[0047] Step 5:

[0048] The server selects the solution needed to resolve the problem based on the results of natural language processing. This selection process involves referencing call center response history and instruction manual databases to choose the most suitable troubleshooting procedure.

[0049] Step 6:

[0050] The server prepares audio and visual information to present the selected solution to the user in an easy-to-understand manner. The prepared audio is generated using TTS (Text-to-Speech) technology.

[0051] Step 7:

[0052] The terminal receives voice instructions and visual information from the server and presents them to the user. Along with the voice instructions, character animations are displayed on the screen to visually support the instructions.

[0053] Step 8:

[0054] The user operates their smartphone following the provided audio and visual instructions. They can also report the results of their operations back to the tablet device.

[0055] This process allows users to resolve problems themselves, resulting in faster problem resolution.
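The exchange between the terminal and the server in Steps 2 to 7 can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the class names, the message types, the fallback sentence, and the stand-ins for ASR and NLP (UTF-8 decoding and a keyword check) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProblemReport:
    audio: bytes  # recorded voice data sent by the terminal


@dataclass
class Guidance:
    voice_text: str   # would be synthesized by TTS on a real system
    screen_text: str  # would be shown alongside the character animation


class Server:
    def transcribe(self, audio: bytes) -> str:
        # Stand-in for ASR: this sketch pretends the audio bytes are UTF-8 text.
        return audio.decode("utf-8")

    def identify(self, text: str) -> str:
        # Stand-in for NLP: a single keyword check.
        return "wifi_connection" if "wi-fi" in text.lower() else "unknown"

    def select_solution(self, problem_type: str) -> str:
        solutions = {
            "wifi_connection": (
                "Open the Wi-Fi settings screen and check the network status."
            ),
        }
        # Hypothetical fallback for problem types without a stored solution.
        return solutions.get(problem_type, "Please ask a staff member for help.")

    def handle(self, report: ProblemReport) -> Guidance:
        text = self.transcribe(report.audio)           # Step 3
        problem_type = self.identify(text)             # Step 4
        solution = self.select_solution(problem_type)  # Step 5
        return Guidance(voice_text=solution, screen_text=solution)  # Step 6


class Terminal:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.presented = []  # guidance already presented to the user

    def report(self, audio: bytes) -> Guidance:
        guidance = self.server.handle(ProblemReport(audio=audio))  # Step 2
        self.presented.append(guidance)                            # Step 7
        return guidance
```

In a deployed system the call to `handle` would cross the network between the communication interfaces; the direct method call here only shows the order of the steps.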

[0056] (Example 1)

[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0058] In modern information devices, user problems are becoming increasingly complex, making self-resolution difficult for users without specialized knowledge. Furthermore, traditional support systems often take a long time to resolve issues, leading to increased user stress and support costs. Therefore, there is a need for an efficient, intuitive, and user-friendly self-resolution support system.

[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0060] In this invention, the server includes speech recognition means for converting voice data acquired from a user's information terminal into text data, language analysis means for analyzing the text data and identifying the type of problem, and means for generating a solution based on the type of problem, while referring to past support history and device handling information, and presenting the solution via voice output and visual display. This enables users to quickly and effectively resolve problems with their information terminals themselves, even without specialized knowledge.

[0061] "User information terminal" refers to a portable or stationary device with computing capabilities used by the user.

[0062] "Audio data" refers to sound information input by the user through a device, represented as a digital signal.

[0063] "Text data" refers to information expressed in string format obtained by analyzing audio data.

[0064] "Speech recognition means" refers to technology that has a mechanism for receiving speech data and converting it into corresponding text data.

[0065] "Language analysis means" refers to techniques for analyzing text data and identifying the type of problem.

[0066] "Past support history" refers to data on user interactions recorded to date, which is used to assist in resolving problems.

[0067] "Device handling information" refers to a collection of data related to the usage and troubleshooting of the information terminal being used.

[0068] "Means for generating solutions" refers to a system that automatically designs appropriate response methods based on the type of problem.

[0069] "Means of presentation via audio output and visual display" refers to a system that includes technology for providing users with guidance on solutions via audio or visual means.

[0070] "Animation" is a method of expression that uses dynamic changes to display visual information to aid user comprehension.

[0071] This invention is a system that helps users resolve problems with information devices themselves using information terminals in stores. This system operates by integrating speech recognition technology, natural language processing technology, and solution presentation technology.

[0072] The user first operates an information terminal and begins describing the problem using voice or text. The terminal records the user's voice and sends it to the server as audio data. At this stage, the server utilizes speech recognition technology as a "speech recognition means" to convert the "audio data" into "text data." A commercial speech recognition system may be used for this process.

[0073] Next, the server processes the text data using the "language analysis means" to identify the type of problem. During this identification process, the server references databases such as past support history and device handling information to improve the accuracy of the identification. Once the problem is identified, the server generates a solution based on it. In this process, the server uses a generative AI model to design an appropriate solution to the problem.

[0074] The solutions generated by the server are presented to the user in both audio and visual form. The terminal receives information from the server, provides guidance to the user via audio output, and simultaneously displays the operating procedure on the screen. If the user reports a problem such as "My smartphone cannot connect to Wi-Fi," the user is presented with specific instructions, such as "Open the Wi-Fi settings screen and check the network status," along with visual information.

[0075] This system allows users with limited technical knowledge to quickly understand and resolve device malfunctions, which reduces waiting times and improves the quality of support. An example of a specific prompt message given to the generative AI model is, "Please provide specific troubleshooting steps for the Wi-Fi connection problem reported by the user."
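One way to assemble such a prompt is sketched below: the quoted prompt sentence is combined with entries from the support history and the device handling information. The function name, the parameter names, and the section layout are assumptions for illustration; the text gives only the prompt sentence itself.

```python
def build_prompt(problem_summary: str, history: list, handling_info: list) -> str:
    """Compose a prompt for a generative model from the identified problem.

    history and handling_info are lists of short strings; either may be empty.
    """
    sections = [
        "Please provide specific troubleshooting steps for the "
        f"{problem_summary} reported by the user."
    ]
    if history:
        sections.append(
            "Past support history:\n" + "\n".join(f"- {item}" for item in history)
        )
    if handling_info:
        sections.append(
            "Device handling information:\n"
            + "\n".join(f"- {item}" for item in handling_info)
        )
    # Blank lines separate the instruction from the reference material.
    return "\n\n".join(sections)
```

With empty lists, the function returns exactly the prompt sentence quoted in the text when called with "Wi-Fi connection problem".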

[0076] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0077] Step 1:

[0078] The user activates an information terminal installed in the store and inputs a description of the problem by voice. The input voice data is collected by the terminal and transmitted to the server in digital format. At this point, the input is voice data, and the output is digital voice data for transmission.

[0079] Step 2:

[0080] The server converts the received audio data into text data using speech recognition technology. Here, the speech recognition engine outputs the audio data as text data. In this conversion, text based on a language model is generated through the analysis of the audio waveform. Therefore, the input is audio data, and the output is text data.

[0081] Step 3:

[0082] The server passes text data to a natural language processing engine to analyze the content of the problem. This analysis identifies the problem from the query based on the user's utterance. The server then refers to the system's database and determines the type of problem by comparing it with relevant data. The input is text data, and the output is the content of the identified problem.

[0083] Step 4:

[0084] Based on the identified problem, the server generates a solution using a generative AI model, drawing on past response history and handling information. The server extracts necessary information from existing databases and designs the optimal solution according to the nature of the problem. The input at this stage is the content of the identified problem, and the output is instructions for the solution.

[0085] Step 5:

[0086] The terminal receives a solution from the server and presents the solution to the user verbally using speech synthesis technology. Simultaneously, visual instructions related to the solution are displayed on the terminal's screen. This allows the user to confirm the specific steps through both sight and sound. The input is the solution instructions, and the output is the presentation of the solution through voice and visual means.

[0087] Step 6:

[0088] The user resolves issues with their information device by following instructions presented on the terminal. The user uses the on-screen interface and follows voice guidance to proceed with the problem-solving process. This step involves the user performing physical actions to resolve the problem. The input is the suggested solution, and the output is the user's response.

[0089] This series of steps allows users to quickly resolve issues without the need for expert support, thereby improving usability.

[0090] (Application Example 1)

[0091] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0092] Consumers often need to visit a store when they encounter problems with mobile devices such as smartphones and tablets. However, this requires waiting for a specialist, which is time-consuming and inconvenient, and consumers unfamiliar with technology often find it difficult to resolve the issue on their own. Furthermore, traditional support systems lack sufficient integration of voice and visual information, and provide inadequate guidance for users to resolve problems themselves. Therefore, there is a need for a comprehensive support system that facilitates rapid self-resolution on-site.

[0093] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0094] In this invention, the server includes: speech recognition means for converting voice information acquired from the user's operating platform into text information; natural language processing means for analyzing the text information and identifying the type of malfunction; means for generating solutions based on the type of malfunction and presenting the solutions audibly and visually; and means for providing self-support to enable customers to resolve problems with their mobile devices on the spot. This enables customers to quickly and effectively resolve problems with their devices within the store.

[0095] An "operating platform" is a device used by users to input information and receive information from a system, and is usually an electronic device such as a tablet.

[0096] "Voice information" refers to data conveyed through voice input from users.

[0097] "Text information" refers to data in text format that has been converted from voice information.

[0098] "Speech recognition means" refers to the technical processes and functions for converting speech information into text information.

[0099] "Natural language processing means" refers to processing techniques that analyze textual information and identify the type of problem based on its content.

[0100] "Type of malfunction" refers to the specific nature and classification of problems in mobile devices.

[0101] A "solution" is a set of procedures or guidelines for resolving a problem, generated according to the type of failure identified.

[0102] A "customer" is a consumer who visits a store to receive a service.

[0103] "Self-support" refers to the assistance and means provided to users to solve their own problems on their own.

[0104] The system for implementing this invention uses a tablet terminal as the customer's operating platform and integrates speech recognition, natural language processing, and solution presentation technologies. The system uses the Google® Speech-to-Text API as its speech recognition engine to convert speech information into text information. The server performs natural language processing on the received text information using the Google Cloud Natural Language API to identify the type of malfunction. Subsequently, the server generates a solution based on the identified malfunction, referencing information from the Firebase Realtime Database. This solution is presented to the terminal both audibly and visually through a web interface built with React.js.

[0105] The solutions generated by the server guide customers on how to efficiently resolve problems with their mobile devices on the spot. Specifically, for example, if a user says, "My smartphone battery drains quickly," the system identifies it as a "battery consumption problem" and presents specific suggestions such as "stopping unnecessary applications" and "setting battery saver mode." Customers can easily perform these steps by following the visual guide.
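The battery example can be expressed as a small rule table that maps an utterance to a malfunction type and its suggestions. The sketch below assumes simple keyword matching, which the text does not prescribe; the `Rule` structure and the keywords are hypothetical, while the malfunction type and the two suggestions are taken from the example.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    keywords: tuple      # every keyword must appear in the utterance
    failure_type: str
    suggestions: tuple


RULES = [
    Rule(
        keywords=("battery", "drain"),
        failure_type="battery consumption problem",
        suggestions=(
            "Stop unnecessary applications",
            "Set battery saver mode",
        ),
    ),
]


def diagnose(utterance: str) -> Optional[Rule]:
    """Return the first rule whose keywords all occur in the utterance."""
    lowered = utterance.lower()
    for rule in RULES:
        # Substring matching, so "drain" also matches "drains".
        if all(keyword in lowered for keyword in rule.keywords):
            return rule
    return None
```

Returning `None` for an unmatched utterance leaves room for a fallback, such as handing the case to store staff.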

[0106] Examples of using generative AI models include prompts such as, "Identify battery problems from voice input and suggest the best solution," or "Provide optimal Wi-Fi connection troubleshooting steps based on user input." In this way, a self-support environment is provided where even customers who are not technically savvy can confidently solve problems.

[0107] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0108] Step 1:

[0109] The user activates the tablet device and provides voice input regarding the problem. This voice data is entered into the device. The entered voice data is then sent to the server in streaming format.

[0110] Step 2:

[0111] The server receives audio data and converts it into text data using the Google Speech-to-Text API. This process involves data processing that converts the audio signal into a string of characters, and the output is text data as character information.

[0112] Step 3:

[0113] The server receives text data and parses it using the Google Cloud Natural Language API. Here, natural language processing techniques are used to identify the type of failure from the input text. As a result, the identified failure type is output.

[0114] Step 4:

[0115] Based on the type of failure identified by the server, the system retrieves suitable solution data from the Firebase Realtime Database. The failure type is used as input, and data processing is performed to generate solutions, resulting in the output of specific solutions.

[0116] Step 5:

[0117] The server prepares to present the generated solution both audibly and visually. This involves using speech synthesis technology to format the solution as voice instructions, and using React.js to create visual instructions. The audio file and visual display information are then output to the device.

[0118] Step 6:

[0119] The terminal presents the user with solutions received from the server. The user receives voice guidance and visual step-by-step instructions, allowing them to attempt to resolve the problem themselves. The solution is ultimately executed as an action by the user.
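Steps 2 to 5 depend on three external services. One possible design, sketched below, places each service behind a small interface so that the server logic can be exercised without network access. The interface and class names are hypothetical, and the fake implementations only stand in for the speech recognition, natural language, and database services named above; no real client-library calls are shown.

```python
from typing import Protocol


class SpeechRecognizer(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class FailureClassifier(Protocol):
    def classify(self, text: str) -> str: ...


class SolutionStore(Protocol):
    def lookup(self, failure_type: str) -> list: ...


class SupportServer:
    def __init__(self, recognizer: SpeechRecognizer,
                 classifier: FailureClassifier, store: SolutionStore) -> None:
        self.recognizer = recognizer
        self.classifier = classifier
        self.store = store

    def handle(self, audio: bytes) -> dict:
        text = self.recognizer.transcribe(audio)       # Step 2
        failure_type = self.classifier.classify(text)  # Step 3
        steps = self.store.lookup(failure_type)        # Step 4
        return {                                       # Step 5
            "failure_type": failure_type,
            "voice_text": " ".join(steps),  # input for speech synthesis
            "visual_steps": steps,          # input for the web interface
        }


class FakeRecognizer:
    def transcribe(self, audio: bytes) -> str:
        # Pretends the audio bytes already contain the spoken words.
        return audio.decode("utf-8")


class KeywordClassifier:
    def classify(self, text: str) -> str:
        if "battery" in text.lower():
            return "battery consumption problem"
        return "unknown"


class DictStore:
    def __init__(self) -> None:
        self.data = {
            "battery consumption problem": [
                "Stop unnecessary applications.",
                "Set battery saver mode.",
            ],
        }

    def lookup(self, failure_type: str) -> list:
        return list(self.data.get(failure_type, []))
```

Adapters that wrap the actual services would implement the same three methods, which keeps `SupportServer` unchanged when a provider is swapped.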

[0120] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0121] This invention aims to improve the user experience by adding emotion recognition to a system that helps users resolve problems with smartphones and other information terminals themselves via in-store tablet devices. This enables appropriate support even for users who are unfamiliar with technology or who are easily stressed.

[0122] This system operates by combining speech recognition, natural language processing, and solution suggestion with an emotion engine that recognizes emotions. The user operates a tablet device to activate the AI agent. The device has the ability to record the user's voice and capture their facial expressions via a camera. The recorded voice and captured facial expression data are sent to a server.

[0123] The server uses an emotion recognition engine to analyze audio and video data to recognize the user's current emotional state. This analysis is then correlated with the nature of the problem using natural language processing technology, forming the basis for guiding users to appropriate solutions.

[0124] For example, if a user reports a problem such as "Wi-Fi not connecting," the server can determine from the user's tone of voice and facial expression that the user is experiencing significant stress. In this case, in addition to offering the usual solutions, adjustments are made to provide more empathetic voice guidance and simplified instructions. In particular, character animations are emphasized to make the guidance feel more approachable and to reassure the user.
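The adjustment described for a stressed user can be sketched as a function that rewrites the guidance before it is presented. The emotion labels, the reassuring sentence, and the idea of simplifying by limiting the number of steps are assumptions made for illustration; the text does not specify how the instructions are simplified.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    voice_text: str
    steps: list
    show_character: bool = False  # whether to emphasize the character animation


# Hypothetical set of emotion labels that trigger the gentler presentation.
NEGATIVE_EMOTIONS = {"stress", "anxiety"}


def adjust_for_emotion(guidance: Guidance, emotion: str,
                       max_steps: int = 2) -> Guidance:
    """Return guidance adapted to the user's estimated emotion."""
    if emotion not in NEGATIVE_EMOTIONS:
        return guidance
    return Guidance(
        # Prepend an empathetic sentence to the spoken instruction.
        voice_text="No need to worry, we will sort this out together. "
        + guidance.voice_text,
        # Simplify by presenting only the first few steps at a time.
        steps=guidance.steps[:max_steps],
        show_character=True,
    )
```

The original guidance object is left unmodified, so the full step list remains available if the user asks for more detail.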

[0125] Finally, the device presents the voice instructions and visual information received from the server, adjusted to the user's emotional state. This allows users to receive guidance tailored to their individual circumstances and solve problems more smoothly. Combining emotion recognition in this way improves the user experience and leads to higher satisfaction.

[0126] The following describes the processing flow.

[0127] Step 1:

[0128] The user activates the AI agent by operating a tablet device in the store. The device starts its camera and microphone and prepares to collect audio and video data.

[0129] Step 2:

[0130] The user describes the problem verbally. The device records the audio using its microphone and simultaneously captures the user's facial expressions in real time using its camera.

[0131] Step 3:

[0132] The device sends recorded audio data and captured facial expression data to the server. The transmitted data is encrypted to protect the user's privacy.

[0133] Step 4:

[0134] The server converts the audio data into text data using ASR (Automatic Speech Recognition) technology. Then, it uses NLP (Natural Language Processing) technology to analyze the details of the problem from the text.

[0135] Step 5:

[0136] The server uses an emotion recognition engine to analyze the user's emotional state from voice tone and facial expression data. This identifies the emotions the user is experiencing (e.g., stress or anxiety).

[0137] Step 6:

[0138] The server generates solutions to problems based on the analysis results. It adjusts how solutions are presented based on emotional data, using empathetic voice guidance and character animations as needed.

[0139] Step 7:

[0140] The server sends generated voice instructions and visual information to the terminal. The terminal receives them and presents them to the user in a tone that matches their emotional state. Along with the voice guidance, a friendly character is displayed on the screen.

[0141] Step 8:

[0142] Users follow instructions and operate their smartphones or information terminals to attempt to resolve the problem. After the operation, they can also report the situation to the terminal again and receive additional support if necessary.

[0143] This approach allows systems that incorporate emotion recognition to provide flexible support tailored to each user's emotional state, facilitating efficient problem-solving.
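
The adjustment described in Step 6 can be illustrated with a minimal Python sketch. It is not taken from the specification: the emotion labels, the `Presentation` structure, the style names, and the reassuring opening sentence are all assumptions made for illustration, and the emotion itself is assumed to have already been recognized in Step 5.

```python
from dataclasses import dataclass
from typing import List

# Assumed labels: the text names stress and anxiety only as examples.
DISTRESS_EMOTIONS = {"stress", "anxiety"}


@dataclass
class Presentation:
    voice_lines: List[str]      # sentences handed to speech synthesis, in order
    voice_style: str            # "neutral" or "empathetic" (assumed style names)
    emphasize_character: bool   # whether the on-screen character is emphasized


def adjust_presentation(steps: List[str], emotion: str) -> Presentation:
    """Step 6: choose how to present the solution for the recognized emotion."""
    if emotion in DISTRESS_EMOTIONS:
        # Empathetic opening, then the same steps numbered one per line.
        lines = ["Don't worry, we can fix this together."]
        lines += [f"Step {i}: {s}" for i, s in enumerate(steps, start=1)]
        return Presentation(lines, "empathetic", True)
    return Presentation(list(steps), "neutral", False)


steps = ["Open the Wi-Fi settings screen.", "Check the network status."]
print(adjust_presentation(steps, "stress").voice_lines)
```

In this sketch the solution content is identical for every user; only the delivery changes, which matches the description of adjusting how solutions are presented rather than what they contain.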

[0144] (Example 2)

[0145] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0146] Traditional information processing systems have standardized support for user-reported problems, making it difficult to provide appropriate assistance, especially to users unfamiliar with technology or those prone to stress. Furthermore, offering uniform solutions without considering user feelings can decrease user satisfaction. There is a need to solve these problems and improve the user experience.

[0147] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0148] In this invention, the server includes speech recognition means for converting speech information acquired from the user's information processing device into text information, natural language processing means for analyzing the text information and identifying the type of problem, means for generating a solution based on the type of problem and the user's emotional state and presenting the solution audibly and visually, and emotion recognition means for analyzing speech information and facial expression information to identify the user's emotional state. This makes it possible to provide personalized solutions that respond to the user's emotions, thereby improving the user experience and satisfaction.

[0149] "Information processing device" is a general term for electronic devices that have the function of inputting, processing, and outputting data, and includes terminals that are directly operated by the user.

[0150] "Voice information" refers to data collected as the user's voice, and is the original information that is converted into text data by speech recognition.

[0151] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is subject to natural language processing.

[0152] "Speech recognition means" refers to a process or device that converts speech information into text information using technical methods, and is implemented using various algorithms.

[0153] "Natural language processing" refers to technologies that analyze human language and understand information, and is particularly used to classify problems and formulate solutions.

[0154] "Emotion recognition means" refers to technology or devices that identify a user's emotional state by analyzing voice and facial expression data.

[0155] "Means for generating and presenting solutions" refers to a process or mechanism for creating the optimal problem-solving method for the user and providing that information to the user in an easily understandable format.

[0156] This invention realizes a process for effectively solving user problems using an information processing system. The user first accesses the system using a tablet terminal, which is an information processing device, and reports the problem via voice input. The terminal has a built-in microphone for collecting voice and a camera for capturing facial expressions. As a result, the user's voice and facial information are recorded in real time.

[0157] The terminal transmits the recorded audio information to the server. Standard data communication protocols are used for this transmission, and the data may be encrypted as a security measure. The server converts the audio information into text using speech recognition technology and then identifies the type of problem using natural language processing technology. Cloud-based speech recognition APIs and natural language processing APIs may be used for this technical implementation.

[0158] Furthermore, the server uses emotion recognition to analyze voice and facial expression information to identify the user's emotional state. For example, if a user reports a problem such as "Wi-Fi is not connecting," the server may determine from their tone of voice and facial expression that they are experiencing stress. In such situations, the solutions offered are adjusted based on the user's emotions.

[0159] The solution presentation method includes the use of visually easy-to-understand character animations, which enhances user engagement. The provided solutions are generated by a generative AI model based on the user's specific state and past interaction history. The generated solutions are presented to the user through voice guidance and visual step-by-step instructions.

[0160] As an example of a prompt, the system can generate optimal responses by asking a question such as, "How should support be customized if the user shows signs of stress?" This allows users to receive support tailored to their individual circumstances, resulting in a system that facilitates smooth problem-solving.
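
As one possible illustration of how such a prompt might be assembled from the identified problem, the recognized emotional state, and the past interaction history, consider the following Python sketch. The function name `build_support_prompt`, the prompt wording, and the `"stressed"` label are assumptions for illustration; the specification does not fix a prompt format.

```python
from typing import List


def build_support_prompt(problem: str, emotional_state: str, history: List[str]) -> str:
    """Assemble a prompt for the generative AI model (wording is illustrative)."""
    lines = [
        "You are an in-store support agent helping a user fix their device.",
        f"Reported problem: {problem}",
        f"Recognized emotional state: {emotional_state}",
    ]
    if history:
        lines.append("Past interactions:")
        lines += [f"- {item}" for item in history]
    if emotional_state == "stressed":
        # Mirrors the example question in the text about users showing stress.
        lines.append("The user shows signs of stress. "
                     "Use a reassuring tone and short, simple steps.")
    lines.append("Provide step-by-step instructions suitable for voice guidance.")
    return "\n".join(lines)


print(build_support_prompt("Wi-Fi is not connecting", "stressed",
                           ["Router was restarted last week"]))
```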

[0161] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0162] Step 1:

[0163] The user activates a tablet device, which is an information processing device, and reports a problem through a voice input system. The user's voice is used as input, and this voice is recorded in real time by the device's microphone.

[0164] Step 2:

[0165] The terminal processes the recorded audio information as digital data and sends it to the server. The input audio data is converted to an appropriate format and securely transferred to the server according to the network protocol. Data compression and encryption may also be performed at this stage.

[0166] Step 3:

[0167] The server receives the transmitted audio data and converts it into text data using speech recognition technology. The input data is in audio format, and the output is in text format. This conversion process may utilize a cloud-based speech recognition API.

[0168] Step 4:

[0169] The server analyzes the textual information obtained through speech recognition and identifies the problem using natural language processing. At this stage, the input is textual data, and the output is a specific problem category. A generative AI model is used for this analysis, and inference is performed according to the prompt text.

[0170] Step 5:

[0171] The server simultaneously performs analysis using emotion recognition means with voice and facial expression data. The input here is voice tone and facial expression data, and the output is the user's emotional state. The emotional state is classified into states such as "calm" or "stressed."

[0172] Step 6:

[0173] The server generates and presents the optimal solution based on the identified problem and perceived emotional state. This process involves inputs including the problem category and emotional state, and outputting a tailored solution. It is designed to include visual character animations and voice guidance.

[0174] Step 7:

[0175] The terminal displays customized solutions received from the server to the user. The input is solution data from the server, and the output is visual and audio guidance to the user. The user can then use this information to proceed with problem solving.
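
The classification in Step 5 takes voice tone and facial expression data as input and outputs a state such as "calm" or "stressed." The following Python sketch shows one simple way the two signals could be fused. It assumes that upstream models have already reduced each modality to a stress score between 0 and 1; the weighted average, the default weight, and the threshold are hypothetical choices, not values from the specification.

```python
def classify_emotional_state(voice_stress: float, face_stress: float,
                             voice_weight: float = 0.5,
                             threshold: float = 0.5) -> str:
    """Fuse two stress scores in [0, 1] into "calm" or "stressed"."""
    for score in (voice_stress, face_stress):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    # Weighted average of the voice-based and face-based scores.
    fused = voice_weight * voice_stress + (1.0 - voice_weight) * face_stress
    return "stressed" if fused >= threshold else "calm"


print(classify_emotional_state(0.9, 0.8))
print(classify_emotional_state(0.1, 0.2))
```

The resulting label, together with the problem category from Step 4, would form the input to Step 6.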

[0176] (Application Example 2)

[0177] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0178] Users often experience stress and difficulty resolving problems in stores due to malfunctions with technical equipment or unfamiliarity with its operation. Therefore, there is a need for a system that can understand users' feelings and provide appropriate and user-friendly support accordingly.

[0179] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0180] In this invention, the server includes speech recognition means for converting voice data acquired from the user's terminal into text data, natural language processing means for analyzing the text data and identifying the type of problem, and emotion recognition means for recognizing the user's emotional state and adjusting the presentation of solutions based on that emotional state. This makes it possible to quickly provide appropriate support according to the user's emotional state and to smoothly resolve problems.

[0181] "Speech recognition means" refers to a device or system that has the function of acquiring voice data from a user's terminal and converting it into text data.

[0182] "Natural language processing means" refers to a technology or system that analyzes text data and performs processing to identify the type of problem.

[0183] "Means of presenting solutions" refers to means of presenting solutions generated based on the type of problem, either audibly or visually, and having the function of conveying information to the user.

[0184] "Emotion recognition means" refers to a device or system that has the function of recognizing the emotional state of a user and adjusting the content and method of the solution provided based on that emotional state.

[0185] "Character animation" refers to visual animation used to enhance user engagement and to convey instructions and information.

[0186] The system for implementing this invention is designed to improve in-store user service using emotion recognition. The server uses speech recognition to convert audio data acquired from the user's terminal into text data. This text data is analyzed using natural language processing to identify the type of problem. The audio and video data are then passed to emotion recognition, which recognizes the user's emotional state. Using these results, the server adjusts the provided solutions according to the user's emotional state.

[0187] The hardware includes a microphone for acquiring audio data, a camera for acquiring video data, and a server for data processing. The software includes speech recognition using the Google Speech-to-Text API, natural language processing using Dialogflow, and emotion recognition using the Microsoft® Emotion API.

[0188] As a concrete example, when a customer is confused while searching for products in a store, the server can detect this confusion through emotion recognition, emphasize character animations, and provide product guidance in a gentle voice. In this scenario, the AI model is instructed using a prompt message such as, "Analyze the customer's voice and facial expression data with the emotion recognition engine, and generate and provide gentle guidance text to the customer based on the results."

[0189] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0190] Step 1:

[0191] The device acquires the user's voice and facial expressions using a microphone and camera, and inputs them as audio and video data. This data is primary data for analyzing the user's state. The device then transmits this data to a server.

[0192] Step 2:

[0193] The server uses speech recognition to convert the incoming audio data into text data. The speech recognition results are output in text format and become data for natural language processing. The Google Speech-to-Text API handles this process.

[0194] Step 3:

[0195] The server uses natural language processing to analyze text data and identify the type of problem. The input is text data, and the output is information about the analyzed problem type. Dialogflow is used to analyze the context and problem, and appropriate tagging is performed.

[0196] Step 4:

[0197] The server uses emotion recognition to analyze the received video data and recognize the user's emotional state. The input is video data, and the output is information about the recognized emotional state. The Microsoft Emotion API is used to analyze emotional patterns.

[0198] Step 5:

[0199] The server generates and adjusts solutions based on the results of natural language processing and emotion recognition. The input is the type of problem and the user's emotional state, and the output is the adjusted solution. The server sends a prompt to the generative AI model, which generates the necessary solution according to that prompt.

[0200] Step 6:

[0201] The server sends the adjusted solution to the terminal. The terminal presents the solution to the user both audibly and visually. This includes actions such as using character animations to explain things in a friendly manner. In this process, the terminal adjusts the tone of voice and animation movements to increase the user's sense of security.
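
The server-side portion of this flow (Steps 2 to 6) can be sketched as an orchestration of four interchangeable engines. The Python example below is only a sketch under stated assumptions: `SupportPipeline` and its field names are hypothetical, each engine is represented by a plain callable rather than by any real service client, and the stub engines return fixed values purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SupportPipeline:
    # Each engine is injected as a plain callable; real services would sit behind these.
    transcribe: Callable[[bytes], str]          # Step 2: audio -> text
    classify_problem: Callable[[str], str]      # Step 3: text -> problem type
    recognize_emotion: Callable[[bytes], str]   # Step 4: video -> emotional state
    generate: Callable[[str], str]              # Step 5: prompt -> solution text

    def handle(self, audio: bytes, video: bytes) -> dict:
        text = self.transcribe(audio)
        problem = self.classify_problem(text)
        emotion = self.recognize_emotion(video)
        prompt = (f"Problem type: {problem}. User emotional state: {emotion}. "
                  "Generate gentle, step-by-step guidance.")
        # Step 6 payload: what the server would send back to the terminal.
        return {"problem": problem, "emotion": emotion,
                "solution": self.generate(prompt)}


# Stub engines for illustration only.
pipeline = SupportPipeline(
    transcribe=lambda audio: "Wi-Fi is not connecting",
    classify_problem=lambda text: "wifi" if "wi-fi" in text.lower() else "unknown",
    recognize_emotion=lambda video: "stressed",
    generate=lambda prompt: "Open the Wi-Fi settings screen and check the network status.",
)
print(pipeline.handle(b"audio-bytes", b"video-bytes"))
```

Injecting the engines keeps the flow independent of any particular speech recognition, natural language processing, or emotion recognition service, so the services named above could be swapped without changing the orchestration.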

[0202] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0203] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0204] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0205] [Second Embodiment]

[0206] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0207] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0208] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0209] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0210] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0211] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0212] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0213] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0214] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0215] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0216] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0217] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0218] This invention is a system that helps users resolve problems with their smartphones and other information terminals themselves using a tablet terminal at a store. This system operates by integrating speech recognition, natural language processing, and solution suggestion technologies.

[0219] First, the user activates the tablet device and begins describing the current problem using voice or text. The device records the user's voice and sends the data to the server. The server then uses speech recognition technology to convert the voice data into text.

[0220] Next, the server applies natural language processing techniques to analyze the type of problem from the text data. In this process, referencing past call center interaction history and device handling databases enables more accurate problem identification and advice.

[0221] Once a problem is identified, the server generates a solution to resolve it. This solution includes appropriate troubleshooting steps and configuration change instructions, and is prepared as voice instructions using speech synthesis technology. Furthermore, the operating procedures are displayed on the screen as a visual aid.

[0222] Finally, the device provides these instructions to the user via voice and visual means to support problem-solving. For example, if a user reports a Wi-Fi connection problem, the system will provide a visual example of the settings screen along with a voice instruction such as, "Open the Wi-Fi settings screen and check the network status." This makes it easy for even tech-inexperienced users to resolve the problem.

[0223] This system is designed to allow users to quickly resolve problems with their own devices, resulting in reduced waiting times and improved support quality.

[0224] The following describes the processing flow.

[0225] Step 1:

[0226] The user operates a tablet terminal in the store to activate the AI agent. Once the agent is activated, the terminal sets the voice input function to standby mode.

[0227] Step 2:

[0228] The user enters the details of the problem via voice. The terminal records this voice data and prepares to send it to the server.

[0229] Step 3:

[0230] The server converts the received audio data into text data using ASR (Automatic Speech Recognition) technology. This makes the audio information available in a text format that can be parsed.

[0231] Step 4:

[0232] The server uses NLP (Natural Language Processing) technology to analyze text data. The purpose of the analysis is to identify the type of reported problem and its possible causes.

[0233] Step 5:

[0234] The server selects the solution needed to resolve the problem based on the results of natural language processing. This selection process references the call center response history and instruction manual databases to determine the optimal troubleshooting procedure.

[0235] Step 6:

[0236] The server prepares audio and visual information to present the selected solution to the user in an easy-to-understand manner. The prepared audio is generated using TTS (Text-to-Speech) technology.

[0237] Step 7:

[0238] The terminal receives voice instructions and visual information from the server and presents them to the user. Along with the voice instructions, character animations are displayed on the screen to visually support the instructions.

[0239] Step 8:

[0240] The user operates their smartphone following the provided audio and visual instructions. They can also report the results of their operations back to the tablet device.

[0241] This process allows users to resolve problems themselves, resulting in faster problem resolution.
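
The selection in Step 5, which references the call center response history and the instruction manual database, can be illustrated with the following Python sketch. The two dictionaries are hypothetical in-memory stand-ins for those databases, and the rule of preferring past responses over the manual is an assumption made for illustration; the specification does not state a priority between the two sources.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-ins for the two databases named in Step 5.
CALL_CENTER_HISTORY: Dict[str, List[str]] = {
    "wifi": ["Open the Wi-Fi settings screen and check the network status."],
}
MANUAL_DB: Dict[str, List[str]] = {
    "wifi": ["Turn Wi-Fi off and on again."],
    "battery": ["Check battery usage in the settings."],
}


def select_solution(problem_type: str) -> Optional[List[str]]:
    """Prefer steps from past responses, then fall back to the manual."""
    for source in (CALL_CENTER_HISTORY, MANUAL_DB):
        steps = source.get(problem_type)
        if steps:
            return steps
    return None  # no known solution for this problem type


print(select_solution("wifi"))
print(select_solution("battery"))
```

The selected steps would then be passed to Step 6, where the voice and visual information is prepared.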

[0242] (Example 1)

[0243] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0244] In modern information devices, user problems are becoming increasingly complex, making self-resolution difficult for users without specialized knowledge. Furthermore, traditional support systems often take a long time to resolve issues, leading to increased user stress and support costs. Therefore, there is a need for an efficient, intuitive, and user-friendly self-resolution support system.

[0245] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0246] In this invention, the server includes speech recognition means for converting voice data acquired from a user's information terminal into text data, language analysis means for analyzing the text data and identifying the type of problem, and means for generating a solution based on the type of problem, while referring to past support history and device handling information, and presenting the solution via voice output and visual display. This enables users to quickly and effectively resolve problems with their information terminals themselves, even without specialized knowledge.

[0247] "User information terminal" refers to a portable or stationary device with computing capabilities used by the user.

[0248] "Audio data" refers to sound information input by the user through a device, represented as a digital signal.

[0249] "Text data" refers to information expressed in string format obtained by analyzing audio data.

[0250] "Speech recognition means" refers to technology that has a mechanism for receiving speech data and converting it into corresponding text data.

[0251] "Language analysis means" refers to a technical method for analyzing text data and identifying the type of problem.

[0252] "Past interaction history" refers to data on user interactions recorded to date, which is used to assist in resolving problems.

[0253] "Device handling information" refers to a collection of data related to the usage and troubleshooting of the information terminal being used.

[0254] "Means for generating solutions" refers to a system that automatically designs appropriate response methods based on the type of problem.

[0255] "Means of presentation via audio output and visual display" refers to a system that includes technology for providing users with guidance on solutions via audio or visual means.

[0256] "Animation" is a method of expression that uses dynamic changes to display visual information to aid user comprehension.

[0257] This invention is a system that helps users resolve problems with information devices themselves using information terminals in stores. This system operates by integrating speech recognition technology, natural language processing technology, and solution presentation technology.

[0258] The user first operates an information terminal and begins describing the problem using voice or text. The terminal records the user's voice and sends it to the server as audio data. At this stage, the server utilizes speech recognition technology as a "speech recognition means" to convert the "audio data" into "text data." A commercial speech recognition system may be used for this process.

[0259] Next, the server processes the text data using the "language analysis means" to identify the type of problem. During this identification process, the server references databases such as past user interaction history and device handling information to improve the accuracy of the identification. Once the problem is identified, the server generates a solution based on it. In this process, the server uses a generative AI model to design an appropriate solution to the problem.

[0260] The solutions generated by the server are presented to the user in both audio and visual form. The terminal receives information from the server, provides guidance to the user via audio output, and simultaneously displays the operating procedure on the screen. If the user reports a problem such as "My smartphone cannot connect to Wi-Fi," the user is presented with specific instructions, such as "Open the Wi-Fi settings screen and check the network status," along with visual information.

[0261] This system allows users with limited technical knowledge to quickly understand and resolve device malfunctions. An example of a specific prompt message is, "Please provide specific troubleshooting steps for the Wi-Fi connection problem reported by the user." As a result, waiting times are reduced and the quality of support improves.

[0262] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0263] Step 1:

[0264] The user activates an information terminal installed in the store and inputs a description of the problem by voice. The input voice data is collected by the terminal and transmitted to the server in digital format. At this point, the input is voice data, and the output is digital voice data for transmission.

[0265] Step 2:

[0266] The server converts the received audio data into text data using speech recognition technology. In this conversion, the speech recognition engine analyzes the audio waveform and generates text using a language model. Therefore, the input is audio data, and the output is text data.

[0267] Step 3:

[0268] The server passes text data to a natural language processing engine to analyze the content of the problem. This analysis identifies the problem from the query based on the user's utterance. The server then refers to the system's database and determines the type of problem by comparing it with relevant data. The input is text data, and the output is the content of the identified problem.

[0269] Step 4:

[0270] Based on the identified problem, the server generates a solution using a generative AI model, drawing on past response history and handling information. The server extracts necessary information from existing databases and designs the optimal solution according to the nature of the problem. The input at this stage is the content of the identified problem, and the output is instructions for the solution.

[0271] Step 5:

[0272] The terminal receives a solution from the server and presents the solution to the user verbally using speech synthesis technology. Simultaneously, visual instructions related to the solution are displayed on the terminal's screen. This allows the user to confirm the specific steps through both sight and sound. The input is the solution instructions, and the output is the presentation of the solution through voice and visual means.

[0273] Step 6:

[0274] The user resolves issues with their information device by following instructions presented on the terminal. The user uses the on-screen interface and follows voice guidance to proceed with the problem-solving process. This step involves the user performing physical actions to resolve the problem. The input is the suggested solution, and the output is the user's response.

[0275] This series of steps allows users to quickly resolve issues without the need for expert support, thereby improving usability.
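
The problem identification in Step 3 takes text data as input and outputs the identified problem. As a deliberately simplified stand-in for the natural language processing engine, the Python sketch below matches the utterance against keyword sets. The keyword table, the function name `identify_problem`, and the `"unknown"` fallback are hypothetical, and a real engine would perform far richer analysis than keyword overlap.

```python
from typing import Dict, Set

# Hypothetical keyword sets per problem type; a real system would use an NLP engine.
PROBLEM_KEYWORDS: Dict[str, Set[str]] = {
    "wifi": {"wi-fi", "wifi", "network", "connect"},
    "battery": {"battery", "charge", "charging"},
}


def identify_problem(text: str) -> str:
    """Return the problem type whose keywords overlap most with the utterance."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    best_type, best_hits = "unknown", 0
    for problem_type, keywords in PROBLEM_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_type, best_hits = problem_type, hits
    return best_type


print(identify_problem("My smartphone cannot connect to Wi-Fi."))
```

The returned problem type would then be the input to Step 4, where the generative AI model produces the solution instructions.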

[0276] (Application Example 1)

[0277] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0278] Consumers often need to visit a store when they encounter problems with mobile devices such as smartphones and tablets. However, this requires waiting for a specialist, which is time-consuming and inconvenient, and consumers unfamiliar with technology often find it difficult to resolve the issue on their own. Furthermore, traditional support systems lack sufficient integration of voice and visual information, and provide inadequate guidance for users to resolve problems themselves. Therefore, there is a need for a comprehensive support system that facilitates rapid self-resolution on-site.

[0279] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0280] In this invention, the server includes: speech recognition means for converting voice information acquired from the user's operating platform into text information; natural language processing means for analyzing the text information and identifying the type of failure; means for generating a solution based on the type of failure and presenting the solution audibly and visually; and means for providing self-support to enable customers to resolve problems with their mobile devices on the spot. This enables customers to quickly and effectively resolve problems with their devices within the store.

[0281] The "operating platform" is the device through which a user enters information into the system and receives information from it, usually an electronic device such as a tablet terminal.

[0282] "Voice information" refers to data representing the voice acquired from the user.

[0283] "Text information" is data in text format obtained by converting the voice information.

[0284] "Speech recognition means" refers to the technical process or function for converting voice information into text information.

[0285] "Natural language processing means" is a processing technology for analyzing the text information and identifying the type of failure from its content.

[0286] "Type of failure" refers to the specific content or classification of defects in mobile devices.

[0287] "Solution" is a procedure or guideline for solving problems generated according to the identified type of failure.

[0288] "Customers" refers to consumers who visit stores to receive services.

[0289] "Self-support" refers to the support or means provided so that users can solve problems by themselves.

[0290] The system for implementing this invention uses a tablet terminal as the customer's operating platform and integrates speech recognition, natural language processing, and solution presentation technologies. The system uses the Google Speech-to-Text API as the speech recognition engine to convert voice information into text information. The server performs natural language processing on the received text information using the Google Cloud Natural Language API to identify the type of failure. Subsequently, the server generates a solution based on the identified failure type, referencing information from the Firebase Realtime Database. This solution is presented on the terminal both audibly and visually through a web interface built with React.js.

[0291] The solutions generated by the server guide customers on how to efficiently resolve problems with their mobile devices on the spot. Specifically, for example, if a user says, "My smartphone battery drains quickly," the system identifies it as a "battery consumption problem" and presents specific suggestions such as "stopping unnecessary applications" and "setting battery saver mode." Customers can easily perform these steps by following the visual guide.
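As a rough illustration of the mapping described above, the following Python sketch classifies recognized text by keyword and looks up suggestions. The keyword table, the function names `identify_failure_type` and `suggest`, and the in-memory dictionary standing in for the database are all assumptions made for illustration; the source does not specify how the natural language processing output is mapped to a failure type.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword table; a real system would rely on an NLP service instead.
FAILURE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "battery consumption problem": ("battery", "drain", "charge"),
    "Wi-Fi connection problem": ("wi-fi", "wifi", "network"),
}

# Hypothetical stand-in for the solution database.
SOLUTIONS: Dict[str, List[str]] = {
    "battery consumption problem": [
        "Stop unnecessary applications",
        "Set battery saver mode",
    ],
    "Wi-Fi connection problem": [
        "Open the Wi-Fi settings screen and check the network status",
    ],
}


def identify_failure_type(text: str) -> Optional[str]:
    """Return the first failure type whose keywords appear in the text."""
    lowered = text.lower()
    for failure_type, keywords in FAILURE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return failure_type
    return None


def suggest(text: str) -> List[str]:
    """Map recognized text to suggestions; empty list when nothing matches."""
    failure_type = identify_failure_type(text)
    if failure_type is None:
        return []
    return SOLUTIONS.get(failure_type, [])
```

Calling `suggest("My smartphone battery drains quickly")` under these assumptions returns the two battery-related suggestions named in the example above.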

[0292] Examples of using generative AI models include prompts such as, "Identify battery problems from voice input and suggest the best solution," or "Provide optimal Wi-Fi connection troubleshooting steps based on user input." In this way, a self-support environment is provided where even customers who are not technically savvy can confidently solve problems.

[0293] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0294] Step 1:

[0295] The user activates the tablet device and describes the problem by voice. The device captures this voice data and sends it to the server in streaming format.

[0296] Step 2:

[0297] The server receives audio data and converts it into text data using the Google Speech-to-Text API. This process involves data processing that converts the audio signal into a string of characters, and the output is text data as character information.

[0298] Step 3:

[0299] The server receives text data and parses it using the Google Cloud Natural Language API. Here, natural language processing techniques are used to identify the type of failure from the input text. As a result, the identified failure type is output.

[0300] Step 4:

[0301] Based on the type of failure identified by the server, the system retrieves suitable solution data from the Firebase Realtime Database. The failure type is used as input, and data processing is performed to generate solutions, resulting in the output of specific solutions.

[0302] Step 5:

[0303] The server prepares to present the generated solutions both audibly and visually. This involves using speech synthesis technology to format the solutions as voice instructions, and using React.js to create visual instructions. The audio file and visual display information are then output to the device.

[0304] Step 6:

[0305] The terminal presents the user with solutions received from the server. The user receives voice guidance and visual step-by-step instructions, allowing them to attempt to resolve the problem themselves. The solution is ultimately executed as an action by the user.
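The six steps above can be read as a simple pipeline. The Python sketch below is one possible arrangement, assuming each external service (speech recognition, natural language processing, database lookup) is passed in as a plain callable; the names `run_support_flow` and `Presentation` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Presentation:
    voice_text: str          # text handed to speech synthesis
    visual_steps: List[str]  # steps rendered on the terminal screen


def run_support_flow(
    audio: bytes,
    transcribe: Callable[[bytes], str],           # Step 2: speech recognition
    classify: Callable[[str], str],               # Step 3: natural language processing
    lookup_solution: Callable[[str], List[str]],  # Step 4: solution retrieval
) -> Presentation:
    """Run Steps 2 to 5 on audio captured in Step 1."""
    text = transcribe(audio)
    failure_type = classify(text)
    steps = lookup_solution(failure_type)
    # Step 5: prepare the same solution for audible and visual presentation.
    voice_text = ". ".join(steps) + "." if steps else ""
    return Presentation(voice_text=voice_text, visual_steps=steps)
```

Injecting the stages as callables keeps the sketch independent of any particular cloud API, so stub functions can stand in for the real services during testing.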

[0306] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion recognition model 59 and perform specific processing using the user's emotion.

[0307] An object of the present invention is to improve the user experience by providing an emotion recognition function to a system that supports a user in self-solving problems of a smartphone or an information terminal through a tablet terminal at a storefront. This enables appropriate support for users who are unfamiliar with technology or who are likely to feel stressed.

[0308] This system operates in combination with an emotion engine that recognizes emotions, in addition to speech recognition, natural language processing, and solution presentation. The user operates the tablet terminal to activate the AI agent. The terminal records the user's voice and captures the user's facial expression through the camera. The recorded voice and the captured facial expression data are transmitted to the server.

[0309] The server uses the emotion recognition engine to analyze the voice data and video data and recognize the user's current emotional state. The server associates this analysis result with the content of the problem, identified using natural language processing technology, and uses it as a basis for guiding the user to an appropriate solution.

[0310] For example, when the user reports a problem such as "Wi-Fi cannot be connected", the server determines from the user's tone of voice and facial expression that the user is feeling strong stress. In this case, in addition to presenting the normal solution, the presentation is adjusted to provide more personal voice guidance and a simplified explanation of the procedure. In particular, emphasizing the character animation makes the guidance feel friendlier and gives the user a sense of security.

[0311] Ultimately, the device adjusts and presents the voice instructions and visual information received from the server based on the user's emotional state. This allows users to receive guidance tailored to their individual circumstances and solve problems more smoothly. This form, which combines emotion recognition, improves the user experience and enables a higher level of satisfaction.
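A minimal Python sketch of such an adjustment is shown below. The emotion label `"stressed"`, the `Guidance` fields, and the choice to show one step per screen as the "simplified" form are assumptions for illustration only; the source does not define how the simplification is carried out.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Guidance:
    steps: List[str]
    voice_style: str = "neutral"
    steps_per_screen: int = 3          # how many steps are shown at once
    emphasize_character: bool = False  # whether the character animation is emphasized


def adjust_for_emotion(guidance: Guidance, emotion: str) -> Guidance:
    """Return guidance adapted to the estimated emotion (labels are assumed)."""
    if emotion != "stressed":
        return guidance
    return replace(
        guidance,
        voice_style="empathetic",
        steps_per_screen=1,        # one step at a time as a simplified presentation
        emphasize_character=True,
    )
```

The solution steps themselves are left unchanged; only the presentation parameters vary with the emotional state.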

[0312] The following describes the processing flow.

[0313] Step 1:

[0314] The user activates the AI agent by operating a tablet device in the store. The device starts its camera and microphone and prepares to collect audio and video data.

[0315] Step 2:

[0316] The user describes the problem verbally. The device records the audio using its microphone and simultaneously captures the user's facial expressions in real time using its camera.

[0317] Step 3:

[0318] The device sends recorded audio data and captured facial expression data to the server. The transmitted data is encrypted to protect the user's privacy.

[0319] Step 4:

[0320] The server converts the audio data into text data using ASR (Automatic Speech Recognition) technology. Then, it uses NLP (Natural Language Processing) technology to analyze the details of the problem from the text.

[0321] Step 5:

[0322] The server uses an emotion recognition engine to analyze the user's emotional state from voice tone and facial expression data. This identifies the emotions the user is experiencing (e.g., stress, anxiety, etc.).

[0323] Step 6:

[0324] The server generates solutions to problems based on the analysis results. It adjusts how solutions are presented based on emotional data, using empathetic voice guidance and character animations as needed.

[0325] Step 7:

[0326] The server sends generated voice instructions and visual information to the terminal. The terminal receives them and presents them to the user in a tone that matches their emotional state. Along with the voice guidance, a friendly character is displayed on the screen.

[0327] Step 8:

[0328] Users follow instructions and operate their smartphones or information terminals to attempt to resolve the problem. After the operation, they can also report the situation to the terminal again and receive additional support if necessary.

[0329] This approach allows systems that incorporate emotion recognition to provide flexible support tailored to each user's emotional state, facilitating efficient problem-solving.
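Step 8 allows the user to report back and receive additional support. One way to picture that loop is the Python sketch below, which presents candidate solutions one at a time until the user reports success. The function name `support_session` and the callback signatures are hypothetical.

```python
from typing import Callable, Iterable, Optional


def support_session(
    candidate_solutions: Iterable[str],
    present: Callable[[str], None],
    user_reports_resolved: Callable[[str], bool],
) -> Optional[str]:
    """Present solutions in order until the user reports that one worked."""
    for solution in candidate_solutions:
        present(solution)
        if user_reports_resolved(solution):
            return solution
    return None  # no candidate resolved the problem
```

The return value identifies which solution worked, or `None` when every candidate was tried without success.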

[0330] (Example 2)

[0331] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0332] Traditional information processing systems have standardized support for user-reported problems, making it difficult to provide appropriate assistance, especially to users unfamiliar with technology or those prone to stress. Furthermore, offering uniform solutions without considering user feelings can decrease user satisfaction. There is a need to solve these problems and improve the user experience.

[0333] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0334] In this invention, the server includes speech recognition means for converting speech information acquired from the user's information processing device into text information, natural language processing means for analyzing the text information and identifying the type of problem, means for generating a solution based on the type of problem and the user's emotional state and presenting the solution audibly and visually, and emotion recognition means for analyzing speech information and facial expression information to identify the user's emotional state. This makes it possible to provide personalized solutions that respond to the user's emotions, thereby improving the user experience and satisfaction.

[0335] "Information processing device" is a general term for electronic devices that have the function of inputting, processing, and outputting data, and includes terminals that are directly operated by the user.

[0336] "Speech information" refers to data collected from the user's voice, and is the source information that speech recognition converts into text information.

[0337] "Text information" refers to text data obtained by converting speech information using speech recognition technology, and is the input to natural language processing.

[0338] "Speech recognition means" refers to a process or device that converts speech information into text information using technical methods, and is implemented using various algorithms.

[0339] "Natural language processing means" refers to technology that analyzes human language to extract information, and is used in particular to classify problems and formulate solutions.

[0340] "Emotion recognition means" refers to technology or devices that identify a user's emotional state by analyzing voice and facial expression data.

[0341] "Means for generating and presenting solutions" refers to a process or mechanism for creating the optimal problem-solving method for the user and providing that information to the user in an easily understandable format.

[0342] This invention realizes a process for effectively solving user problems using an information processing system. The user first accesses the system using a tablet terminal, which is an information processing device, and reports the problem via voice input. The terminal has a built-in microphone for collecting voice and a camera for capturing facial expressions. As a result, the user's voice and facial information are recorded in real time.

[0343] The terminal transmits the recorded audio information to the server. Standard data communication protocols are used for this transmission, and the data may be encrypted as a security measure. The server converts the audio information into text using speech recognition technology and then identifies the type of problem using natural language processing technology. Cloud-based speech recognition APIs and natural language processing APIs may be used for this technical implementation.

[0344] Furthermore, the server uses emotion recognition to analyze voice and facial expression information to identify the user's emotional state. For example, if a user reports a problem such as "Wi-Fi is not connecting," the server may determine from their tone of voice and facial expression that they are experiencing stress. In such situations, the solutions offered are adjusted based on the user's emotions.

[0345] The solution presentation method includes the use of visually easy-to-understand character animations, which enhances user engagement. The provided solutions are generated by a generative AI model based on the user's specific state and past interaction history. The generated solutions are presented to the user through voice guidance and visual step-by-step instructions.

[0346] As an example of a prompt, the system can generate optimal responses by asking a question such as, "How should support be customized if the user shows signs of stress?" This allows users to receive support tailored to their individual circumstances, resulting in a system that facilitates smooth problem-solving.
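To make the prompt example more concrete, the Python sketch below assembles a prompt from the problem, the estimated emotional state, and optional interaction history. The wording of the prompt lines and the function name `build_support_prompt` are assumptions; the source gives only the single example question quoted above.

```python
from typing import Sequence


def build_support_prompt(
    problem: str,
    emotional_state: str,
    history: Sequence[str] = (),
) -> str:
    """Assemble a text prompt for a generative AI model (wording is illustrative)."""
    lines = [
        "You are a support assistant for mobile devices.",
        f"Reported problem: {problem}",
        f"Estimated emotional state of the user: {emotional_state}",
    ]
    if history:
        lines.append("Past interaction history:")
        lines.extend(f"- {item}" for item in history)
    if emotional_state == "stressed":
        # Customization requested when the user shows signs of stress.
        lines.append("The user shows signs of stress. Use short, reassuring steps.")
    lines.append("Propose a step-by-step solution.")
    return "\n".join(lines)
```

The resulting string would then be passed to whichever generative AI model the system uses; that call is deliberately left out because the source does not name a specific interface.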

[0347] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0348] Step 1:

[0349] The user activates a tablet device, which is an information processing device, and reports a problem through a voice input system. The user's voice is used as input, and this voice is recorded in real time by the device's microphone.

[0350] Step 2:

[0351] The terminal processes the recorded audio information as digital data and sends it to the server. The input audio data is converted to an appropriate format and securely transferred to the server according to the network protocol. Data compression and encryption may also be performed at this stage.

[0352] Step 3:

[0353] The server receives the transmitted audio data and converts it into text data using speech recognition technology. The input data is in audio format, and the output is in text format. This conversion process may utilize a cloud-based speech recognition API.

[0354] Step 4:

[0355] The server analyzes the textual information obtained through speech recognition and identifies the problem using natural language processing. At this stage, the input is textual data, and the output is a specific problem category. A generative AI model is used for this analysis, and data calculations are performed based on the prompt text.

[0356] Step 5:

[0357] The server simultaneously performs analysis using emotion recognition means with voice and facial expression data. The input here is voice tone and facial expression data, and the output is the user's emotional state. The emotional state is classified into states such as "calm" or "stressed."

[0358] Step 6:

[0359] The server generates and presents the optimal solution based on the identified problem and the recognized emotional state. The inputs are the problem category and the emotional state, and the output is a tailored solution. The solution is designed to include visual character animations and voice guidance.

[0360] Step 7:

[0361] The terminal displays customized solutions received from the server to the user. The input is solution data from the server, and the output is visual and audio guidance to the user. The user can then use this information to proceed with problem solving.
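Step 5 above classifies the emotional state into labels such as "calm" or "stressed" from voice tone and facial expression data. The source does not say how the two signals are combined; the Python sketch below assumes each signal has already been reduced to a stress score between 0 and 1 and fuses them with a weighted average. The weight and threshold values are arbitrary placeholders.

```python
def classify_emotion(
    voice_stress: float,
    face_stress: float,
    voice_weight: float = 0.5,  # assumed weighting between the two signals
    threshold: float = 0.6,     # assumed cut-off for the "stressed" label
) -> str:
    """Fuse two stress scores in [0, 1] into a coarse emotional-state label."""
    for name, value in (("voice_stress", voice_stress), ("face_stress", face_stress)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    combined = voice_weight * voice_stress + (1.0 - voice_weight) * face_stress
    return "stressed" if combined >= threshold else "calm"
```

A real emotion recognition engine would likely output richer categories; this two-label version only mirrors the example labels given in Step 5.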

[0362] (Application Example 2)

[0363] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0364] Users often experience stress and difficulty resolving problems in stores due to malfunctions with technical equipment or unfamiliarity with its operation. Therefore, there is a need for a system that can understand users' feelings and provide appropriate and user-friendly support accordingly.

[0365] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0366] In this invention, the server includes speech recognition means for converting voice data acquired from the user's terminal into text data, natural language processing means for analyzing the text data and identifying the type of problem, and emotion recognition means for recognizing the user's emotional state and adjusting the presentation of solutions based on that emotional state. This makes it possible to quickly provide appropriate support according to the user's emotional state and to smoothly resolve problems.

[0367] "Speech recognition means" refers to a device or system that has the function of acquiring voice data from a user's terminal and converting it into text data.

[0368] "Natural language processing means" refers to a technology or system that analyzes text data and performs processing to identify the type of problem.

[0369] "Means of presenting solutions" refers to means of presenting solutions generated based on the type of problem, either audibly or visually, and having the function of conveying information to the user.

[0370] "Emotion recognition means" refers to a device or system that has the function of recognizing the emotional state of a user and adjusting the content and method of the solution provided based on that emotional state.

[0371] "Character animation" refers to visual animation used to enhance user engagement and to convey instructions and information.

[0372] The system for implementing this invention is designed to improve in-store user service using emotion recognition. The server uses speech recognition to convert audio data acquired from the user's terminal into text data. This text data is analyzed using natural language processing to identify the type of problem. The analysis result is then passed to emotion recognition, which recognizes the user's emotional state from the audio and video data. Using this data, the server adjusts the solutions it provides according to the user's emotional state.

[0373] The hardware includes a microphone for acquiring audio data, a camera for acquiring video data, and a server for data processing. The software includes speech recognition using the Google Speech-to-Text API, natural language processing using Dialogflow, and emotion recognition using the Microsoft Emotion API.

[0374] As a concrete example, when a customer is confused while searching for products in a store, the server can detect this confusion through emotion recognition, emphasize character animations, and provide product guidance in a gentle voice. In this scenario, the AI model is instructed using a prompt message such as, "Analyze the customer's voice and facial expression data with the emotion recognition engine, and generate and provide gentle guidance text to the customer based on the results."

[0375] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0376] Step 1:

[0377] The device acquires the user's voice and facial expressions using a microphone and camera, and inputs them as audio and video data. This data is primary data for analyzing the user's state. The device then transmits this data to a server.

[0378] Step 2:

[0379] The server uses speech recognition to convert the incoming audio data into text data. The speech recognition results are output in text format and become data for natural language processing. The Google Speech-to-Text API handles this process.

[0380] Step 3:

[0381] The server uses natural language processing to analyze text data and identify the type of problem. The input is text data, and the output is information about the analyzed problem type. Dialogflow is used to analyze the context and problem, and appropriate tagging is performed.

[0382] Step 4:

[0383] The server uses emotion recognition to analyze the transmitted video data and recognize the user's emotional state. The input is video data, and the output is information about the recognized emotional state. The Microsoft Emotion API is used to analyze emotional patterns.

[0384] Step 5:

[0385] The server generates and refines solutions based on the results of natural language processing and emotion recognition. The input is the type of problem and the user's emotional state, and the output is the refined solution. The server sends a prompt to the generative AI model, which uses the prompt to generate the necessary solutions.

[0386] Step 6:

[0387] The server sends the adjusted solution to the terminal. The terminal presents the solution to the user both audibly and visually. This includes actions such as using character animations to explain things in a friendly manner. In this process, the terminal adjusts the tone of voice and animation movements to increase the user's sense of security.

[0388] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0389] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
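The input and output contract described above (a prompt plus optional audio, text, or image inference data in, an inference result out) can be sketched as a small Python interface. The names `InferenceInput`, `DataGenerationModel`, and `EchoModel` are hypothetical, and the stand-in model performs no inference; it only reports which kinds of data it received.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceInput:
    prompt: str                    # instructions for the model
    text: Optional[str] = None     # text data representing text
    audio: Optional[bytes] = None  # audio data representing speech
    image: Optional[bytes] = None  # image data representing images


class DataGenerationModel(Protocol):
    def infer(self, request: InferenceInput) -> str:
        """Return an inference result as text."""
        ...


class EchoModel:
    """Stand-in model for testing; it does not perform real inference."""

    def infer(self, request: InferenceInput) -> str:
        kinds = [
            name for name in ("text", "audio", "image")
            if getattr(request, name) is not None
        ]
        return f"{request.prompt} [inputs: {', '.join(kinds) or 'none'}]"
```

Typing the model as a `Protocol` lets a hosted service or a local model be swapped in without changing the calling code, which matches the statement that ChatGPT and Gemini are only examples.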

[0390] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0391] [Third Embodiment]

[0392] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0393] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0394] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0395] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0396] The microphone 238 picks up the voice of the user 20 and thereby receives instructions from the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0397] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0398] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0399] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0400] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0401] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0402] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0403] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0404] This invention is a system that helps users resolve problems with their smartphones and other information terminals themselves using a tablet terminal at a store. This system operates by integrating speech recognition, natural language processing, and solution suggestion technologies.

[0405] First, the user activates the tablet device and begins describing the current problem using voice or text. The device records the user's voice and sends the data to the server. The server then uses speech recognition technology to convert the voice data into text.

[0406] Next, the server applies natural language processing techniques to analyze the type of problem from the text data. In this process, referencing past call center interaction history and device handling databases enables more accurate problem identification and advice.

[0407] Once a problem is identified, the server generates a solution to resolve it. This solution includes appropriate troubleshooting steps and configuration change instructions, and is prepared as voice instructions using speech synthesis technology. Furthermore, the operating procedures are displayed on the screen as a visual aid.

[0408] Finally, the device provides these instructions to the user via voice and visual means to support problem-solving. For example, if a user reports a Wi-Fi connection problem, the system will provide a visual example of the settings screen along with a voice instruction such as, "Open the Wi-Fi settings screen and check the network status." This makes it easy for even users inexperienced with technology to resolve the problem.

[0409] This system is designed to allow users to quickly resolve problems with their own devices, resulting in reduced waiting times and improved support quality.

[0410] The following describes the processing flow.

[0411] Step 1:

[0412] The user operates a tablet terminal in the store to activate the AI agent. Once the agent is activated, the terminal sets the voice input function to standby mode.

[0413] Step 2:

[0414] The user enters the details of the problem via voice. The terminal records this voice data and prepares to send it to the server.

[0415] Step 3:

[0416] The server converts the received audio data into text data using ASR (Automatic Speech Recognition) technology. This makes the audio information available in a text format that can be parsed.

[0417] Step 4:

[0418] The server uses NLP (Natural Language Processing) technology to analyze text data. The purpose of the analysis is to identify the type of reported problem and its possible causes.

[0419] Step 5:

[0420] The server selects the solution needed to resolve the problem based on the results of natural language processing. This selection process involves referencing call center response history and instruction manual databases to perform optimal troubleshooting.

[0421] Step 6:

[0422] The server prepares audio and visual information to present the selected solution to the user in an easy-to-understand manner. The prepared audio is generated using TTS (Text-to-Speech) technology.

[0423] Step 7:

[0424] The terminal receives voice instructions and visual information from the server and presents them to the user. Along with the voice instructions, character animations are displayed on the screen to visually support the instructions.

[0425] Step 8:

[0426] The user operates their smartphone following the provided audio and visual instructions. They can also report the results of their operations back to the tablet device.

[0427] This process allows users to resolve problems themselves, resulting in faster problem resolution.
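Step 5 of this flow selects a solution by referencing call center response history. One plausible reading, sketched below in Python, is to rank candidate solutions by how often they resolved the same problem type in the past. The record layout `(problem_type, solution, resolved)` and the function name `rank_solutions` are assumptions; the source does not describe the history format or the selection criterion.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record layout: (problem_type, solution, resolved)
HistoryRecord = Tuple[str, str, bool]


def rank_solutions(problem_type: str, history: Iterable[HistoryRecord]) -> List[str]:
    """Order past solutions for a problem type by number of successful uses."""
    successes: Counter = Counter()
    for past_type, solution, resolved in history:
        if past_type == problem_type and resolved:
            successes[solution] += 1
    # most_common() sorts by descending count.
    return [solution for solution, _ in successes.most_common()]
```

Solutions that never succeeded for the given problem type are omitted, so an empty list signals that the history offers no guidance.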

[0428] (Example 1)

[0429] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."

[0430] In modern information devices, user problems are becoming increasingly complex, making self-resolution difficult for users without specialized knowledge. Furthermore, traditional support systems often take a long time to resolve issues, leading to increased user stress and support costs. Therefore, there is a need for an efficient, intuitive, and user-friendly self-resolution support system.

[0431] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0432] In this invention, the server includes speech recognition means for converting voice data acquired from a user's information terminal into text data, language analysis means for analyzing the text data and identifying the type of problem, and means for generating a solution based on the type of problem, while referring to past support history and device handling information, and presenting the solution via voice output and visual display. This enables users to quickly and effectively resolve problems with their information terminals themselves, even without specialized knowledge.

[0433] "User information terminal" refers to a portable or stationary device with computing capabilities used by the user.

[0434] "Audio data" refers to sound information input by the user through a device, represented as a digital signal.

[0435] "Text data" refers to information expressed in string format obtained by analyzing audio data.

[0436] "Speech recognition means" refers to technology that has a mechanism for receiving speech data and converting it into corresponding text data.

[0437] "Language analysis means" refers to techniques for analyzing text data and identifying the type of problem.

[0438] "Past support history" refers to data on support provided to users that has been recorded to date, which is used to assist in resolving problems.

[0439] "Device handling information" refers to a collection of data related to the usage and troubleshooting of the information terminal being used.

[0440] "Means for generating solutions" refers to a system that automatically designs appropriate response methods based on the type of problem.

[0441] "Means of presentation via audio output and visual display" refers to a system that includes technology for providing users with guidance on solutions via audio or visual means.

[0442] "Animation" is a method of expression that uses dynamic changes to display visual information to aid user comprehension.

[0443] This invention is a system that helps users resolve problems with information devices themselves using information terminals in stores. This system operates by integrating speech recognition technology, natural language processing technology, and solution presentation technology.

[0444] The user first operates an information terminal and begins describing the problem using voice or text. The terminal records the user's voice and sends it to the server as audio data. At this stage, the server utilizes speech recognition technology as a "speech recognition means" to convert the "audio data" into "text data." A commercial speech recognition system may be used for this process.

[0445] Next, the server processes the text data using the "language analysis means" to identify the type of problem. During this identification process, the server references databases such as the past support history and device handling information to improve the accuracy of the identification. Once the problem is identified, the server generates a solution based on it. In this process, the server uses a generative AI model to design an appropriate solution to the problem.

[0446] The solutions generated by the server are presented to the user in both audio and visual form. The terminal receives information from the server, provides guidance to the user via audio output, and simultaneously displays the operating procedure on the screen. If the user reports a problem such as "My smartphone cannot connect to Wi-Fi," the user is presented with specific instructions, such as "Open the Wi-Fi settings screen and check the network status," along with visual information.

[0447] This system allows users with limited technical knowledge to quickly understand and resolve device malfunctions, which reduces waiting times and improves the quality of support. An example of a specific prompt message given to the generative AI model is, "Please provide specific troubleshooting steps for the Wi-Fi connection problem reported by the user."
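
The prompt message described above could be assembled from the identified problem type together with the past support history and device handling information. The following is a minimal sketch under that assumption; the function name `build_troubleshooting_prompt`, the section labels, and the sample records are hypothetical and are not taken from the specification.

```python
from __future__ import annotations

from typing import Sequence


def _bullets(items: Sequence[str]) -> list[str]:
    """Format reference records as bullet lines, with a placeholder when empty."""
    return [f"- {item}" for item in items] if items else ["- (none available)"]


def build_troubleshooting_prompt(
    problem_type: str,
    support_history: Sequence[str],
    handling_info: Sequence[str],
) -> str:
    """Combine the identified problem type and reference data into one prompt."""
    sections = [
        # Instruction sentence modeled on the example prompt in the text.
        f"Please provide specific troubleshooting steps for the {problem_type} "
        "problem reported by the user.",
        "Past support history:",
        *_bullets(support_history),
        "Device handling information:",
        *_bullets(handling_info),
        # Hypothetical output constraint so the answer suits voice and screen.
        "Answer as short numbered steps suitable for voice guidance and on-screen display.",
    ]
    return "\n".join(sections)


# Hypothetical sample records, for illustration only.
prompt = build_troubleshooting_prompt(
    "Wi-Fi connection",
    ["Restarting the router resolved a similar report."],
    ["Wi-Fi settings are under Settings > Network."],
)
print(prompt)
```

Keeping the reference data in separate labeled sections is only one possible layout; it makes it easy to see which parts of the prompt come from which database.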

[0448] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0449] Step 1:

[0450] The user activates an information terminal installed in the store and inputs a description of the problem by voice. The input voice data is collected by the terminal and transmitted to the server in digital format. At this point, the input is voice data, and the output is digital voice data for transmission.

[0451] Step 2:

[0452] The server converts the received audio data into text data using speech recognition technology. In this conversion, the speech recognition engine analyzes the audio waveform and generates text with the aid of a language model. Therefore, the input is audio data, and the output is text data.

[0453] Step 3:

[0454] The server passes text data to a natural language processing engine to analyze the content of the problem. This analysis identifies the problem from the query based on the user's utterance. The server then refers to the system's database and determines the type of problem by comparing it with relevant data. The input is text data, and the output is the content of the identified problem.

[0455] Step 4:

[0456] Based on the identified problem, the server generates a solution using a generative AI model, drawing on past response history and handling information. The server extracts necessary information from existing databases and designs the optimal solution according to the nature of the problem. The input at this stage is the content of the identified problem, and the output is instructions for the solution.

[0457] Step 5:

[0458] The terminal receives a solution from the server and presents the solution to the user verbally using speech synthesis technology. Simultaneously, visual instructions related to the solution are displayed on the terminal's screen. This allows the user to confirm the specific steps through both sight and sound. The input is the solution instructions, and the output is the presentation of the solution through voice and visual means.

[0459] Step 6:

[0460] The user resolves issues with their information device by following instructions presented on the terminal. The user uses the on-screen interface and follows voice guidance to proceed with the problem-solving process. This step involves the user performing physical actions to resolve the problem. The input is the suggested solution, and the output is the user's response.

[0461] This series of steps allows users to quickly resolve issues without the need for expert support, thereby improving usability.
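
The input and output of each step above can be sketched as a chain of functions. This is an illustrative sketch only: the stage functions are passed in as parameters, and the `fake_*` stand-ins below are hypothetical placeholders (the recognizer simply decodes UTF-8 bytes rather than analyzing a waveform, and the fallback message is invented), not the engines described in the specification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Presentation:
    speech_text: str          # text handed to speech synthesis (Step 5)
    display_steps: List[str]  # steps shown on the terminal screen (Step 5)


def run_support_flow(
    audio: bytes,
    recognize: Callable[[bytes], str],
    identify: Callable[[str], str],
    generate: Callable[[str], List[str]],
) -> Presentation:
    text = recognize(audio)    # Step 2: audio data -> text data
    problem = identify(text)   # Step 3: text data -> identified problem
    steps = generate(problem)  # Step 4: identified problem -> solution instructions
    # Step 5: the same instructions feed both the voice and the visual output.
    return Presentation(speech_text=" ".join(steps), display_steps=steps)


def fake_recognize(audio: bytes) -> str:
    # Placeholder: treats the bytes as UTF-8 text instead of decoding audio.
    return audio.decode("utf-8")


def fake_identify(text: str) -> str:
    # Placeholder keyword check standing in for natural language processing.
    return "wifi_connection" if "wi-fi" in text.lower() else "unknown"


def fake_generate(problem: str) -> List[str]:
    # Placeholder lookup standing in for the generative AI model.
    known = {
        "wifi_connection": [
            "Open the Wi-Fi settings screen.",
            "Check the network status.",
        ]
    }
    return known.get(problem, ["Please ask a staff member for assistance."])


result = run_support_flow(
    b"My smartphone cannot connect to Wi-Fi",
    fake_recognize,
    fake_identify,
    fake_generate,
)
print(result.display_steps)
```

Passing the stages in as callables keeps the flow independent of any particular speech recognition or generative AI product.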

[0462] (Application Example 1)

[0463] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0464] Consumers often need to visit a store when they encounter problems with mobile devices such as smartphones and tablets. However, this requires waiting for a specialist, which is time-consuming and inconvenient, and consumers unfamiliar with technology often find it difficult to resolve the issue on their own. Furthermore, traditional support systems lack sufficient integration of voice and visual information, and provide inadequate guidance for users to resolve problems themselves. Therefore, there is a need for a comprehensive support system that facilitates rapid self-resolution on-site.

[0465] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0466] In this invention, the server includes: speech recognition means for converting voice information acquired from the user's operating platform into text information; natural language processing means for analyzing the text information and identifying the type of malfunction; means for generating solutions based on the type of malfunction and presenting the solutions audibly and visually; and means for providing self-support to enable customers to resolve problems with their mobile devices on the spot. This enables customers to quickly and effectively resolve problems with their devices within the store.

[0467] An "operating platform" is a device used by users to input information and receive information from a system, and is usually an electronic device such as a tablet.

[0468] "Audio information" refers to data conveyed through voice input from users.

[0469] "Textual information" refers to data in text format that has been converted from audio information.

[0470] "Speech recognition means" refers to the technical processes and functions for converting speech information into text information.

[0471] "Natural language processing means" refers to processing techniques that analyze textual information and identify the type of problem based on its content.

[0472] "Type of malfunction" refers to the specific nature and classification of problems in mobile devices.

[0473] A "solution" is a set of procedures or guidelines for resolving a problem, generated according to the type of failure identified.

[0474] A "customer" is a consumer who visits a store to receive a service.

[0475] "Self-support" refers to the assistance and means provided so that users can solve problems on their own.

[0476] The system for implementing this invention uses a tablet terminal as the customer's operating platform and integrates speech recognition, natural language processing, and solution presentation technologies. The system uses the Google Speech-to-Text API as the speech recognition engine to convert speech information into text information. The server performs natural language processing on the received text information using the Google Cloud Natural Language API to identify the type of fault. Subsequently, the server generates a solution based on the identified fault, referencing information from the Firebase Realtime Database. This solution is presented to the terminal both audibly and visually through a web interface built with React.js.

[0477] The solutions generated by the server guide customers on how to efficiently resolve problems with their mobile devices on the spot. Specifically, for example, if a user says, "My smartphone battery drains quickly," the system identifies it as a "battery consumption problem" and presents specific suggestions such as "stopping unnecessary applications" and "setting battery saver mode." Customers can easily perform these steps by following the visual guide.

[0478] Examples of using generative AI models include prompts such as, "Identify battery problems from voice input and suggest the best solution," or "Provide optimal Wi-Fi connection troubleshooting steps based on user input." In this way, a self-support environment is provided where even customers who are not technically savvy can confidently solve problems.

[0479] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0480] Step 1:

[0481] The user activates the tablet device and provides voice input regarding the problem. This voice data is entered into the device. The entered voice data is then sent to the server in streaming format.

[0482] Step 2:

[0483] The server receives audio data and converts it into text data using the Google Speech-to-Text API. This process involves data processing that converts the audio signal into a string of characters, and the output is text data as character information.

[0484] Step 3:

[0485] The server receives text data and parses it using the Google Cloud Natural Language API. Here, natural language processing techniques are used to identify the type of failure from the input text. As a result, the identified failure type is output.

[0486] Step 4:

[0487] Based on the type of failure identified by the server, the system retrieves suitable solution data from the Firebase Realtime Database. The failure type is used as input, and data processing is performed to generate solutions, resulting in the output of specific solutions.

[0488] Step 5:

[0489] The server prepares to present the generated solutions both audibly and visually. This involves using speech synthesis technology to format the solutions as voice instructions, and using React.js to create visual instructions. The audio file and visual display information are then output to the terminal.

[0490] Step 6:

[0491] The terminal presents the user with solutions received from the server. The user receives voice guidance and visual step-by-step instructions, allowing them to attempt to resolve the problem themselves. The solution is ultimately executed as an action by the user.
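
Steps 3 to 5 above (identify the failure type, retrieve a matching solution, and prepare it for audio and visual presentation) can be illustrated as follows. This is a simplified sketch: the in-memory dictionaries stand in for the Google Cloud Natural Language API and the Firebase Realtime Database named in the text, and the keyword lists, dictionary keys, and function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# In-memory stand-in for solution records held in a database.
SOLUTIONS: Dict[str, List[str]] = {
    "battery_consumption": ["Stop unnecessary applications.", "Set battery saver mode."],
    "wifi_connection": ["Open the Wi-Fi settings screen and check the network status."],
}

# Hypothetical keyword lists standing in for natural language processing.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "battery_consumption": ("battery", "drain", "charge"),
    "wifi_connection": ("wi-fi", "wifi", "network"),
}


def identify_failure_type(text: str) -> Optional[str]:
    """Return the failure type whose keywords match most often, or None."""
    lowered = text.lower()
    scores = {
        failure_type: sum(word in lowered for word in words)
        for failure_type, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_payload(text: str) -> dict:
    """Build what the terminal needs: text to speak and numbered steps to show."""
    failure_type = identify_failure_type(text)
    steps = SOLUTIONS.get(failure_type, []) if failure_type else []
    return {
        "failure_type": failure_type,
        "speech_text": " ".join(steps),
        "steps": [f"{i}. {step}" for i, step in enumerate(steps, 1)],
    }


payload = build_payload("My smartphone battery drains quickly")
print(payload)
```

A real deployment would replace both dictionaries with calls to the external services; the payload shape is likewise only an assumption about what the web interface might consume.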

[0492] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform specific processing using the user's emotions.

[0493] This invention aims to improve the user experience by adding emotion recognition functionality to a system that assists users in self-solving problems with smartphones and information terminals via in-store tablet devices. This will enable appropriate support even for users who are unfamiliar with technology or who are easily stressed.

[0494] This system operates by combining speech recognition, natural language processing, and solution suggestion with an emotion engine that recognizes emotions. The user operates a tablet device to activate the AI agent. The device has the ability to record the user's voice and capture their facial expressions via a camera. The recorded voice and captured facial expression data are sent to a server.

[0495] The server uses an emotion recognition engine to analyze audio and video data to recognize the user's current emotional state. This analysis is then correlated with the nature of the problem using natural language processing technology, forming the basis for guiding users to appropriate solutions.

[0496] For example, if a user reports a problem such as "Wi-Fi not connecting," the server can determine from the user's tone of voice and facial expression that the user is experiencing significant stress. In this case, in addition to offering the usual solutions, adjustments are made to provide more empathetic voice guidance and simplified instructions. In particular, character animations are emphasized to make the guidance feel more approachable and to reassure the user.

[0497] Ultimately, the device adjusts and presents the voice instructions and visual information received from the server based on the user's emotional state. This allows users to receive guidance tailored to their individual circumstances and solve problems more smoothly. This embodiment, which incorporates emotion recognition, improves the user experience and raises user satisfaction.

[0498] The following describes the processing flow.

[0499] Step 1:

[0500] The user activates the AI agent by operating a tablet device in the store. The device starts its camera and microphone and prepares to collect audio and video data.

[0501] Step 2:

[0502] The user describes the problem verbally. The device records the audio using its microphone and simultaneously captures the user's facial expressions in real time using its camera.

[0503] Step 3:

[0504] The device sends recorded audio data and captured facial expression data to the server. The transmitted data is encrypted to protect the user's privacy.

[0505] Step 4:

[0506] The server converts the audio data into text data using ASR (Automatic Speech Recognition) technology. Then, it uses NLP (Natural Language Processing) technology to analyze the details of the problem from the text.

[0507] Step 5:

[0508] The server uses an emotion recognition engine to analyze the user's emotional state from voice tone and facial expression data. This identifies the emotions the user is experiencing (e.g., stress, anxiety, etc.).

[0509] Step 6:

[0510] The server generates solutions to problems based on the analysis results. It adjusts how solutions are presented based on emotional data, using empathetic voice guidance and character animations as needed.

[0511] Step 7:

[0512] The server sends generated voice instructions and visual information to the terminal. The terminal receives them and presents them to the user in a tone that matches their emotional state. Along with the voice guidance, a friendly character is displayed on the screen.

[0513] Step 8:

[0514] Users follow instructions and operate their smartphones or information terminals to attempt to resolve the problem. After the operation, they can also report the situation to the terminal again and receive additional support if necessary.

[0515] This approach allows systems that incorporate emotion recognition to provide flexible support tailored to each user's emotional state, facilitating efficient problem-solving.
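
The adjustment described in Steps 6 and 7 (empathetic voice guidance, simplified instructions, and emphasized character animation for a stressed user) might look like the sketch below. The `GuidancePlan` fields, the emotion labels, and the empathetic sentence are hypothetical choices made for illustration; the specification does not fix a concrete data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuidancePlan:
    spoken_lines: List[str]    # lines passed to speech synthesis, in order
    steps_per_screen: int      # how many steps the terminal shows at once
    emphasize_character: bool  # whether the character animation is emphasized


def adapt_guidance(steps: List[str], emotion: str) -> GuidancePlan:
    """Adjust how a solution is presented according to the recognized emotion."""
    if emotion in {"stress", "anxiety"}:
        # Hypothetical empathetic preface; one step per screen keeps it simple.
        preface = "That sounds frustrating. Let's fix it together, one step at a time."
        return GuidancePlan(
            spoken_lines=[preface, *steps],
            steps_per_screen=1,
            emphasize_character=True,
        )
    # Default presentation: all steps at once, no extra emphasis.
    return GuidancePlan(
        spoken_lines=list(steps),
        steps_per_screen=max(len(steps), 1),
        emphasize_character=False,
    )


wifi_steps = ["Open the Wi-Fi settings screen.", "Check the network status."]
stressed_plan = adapt_guidance(wifi_steps, "stress")
neutral_plan = adapt_guidance(wifi_steps, "neutral")
print(stressed_plan)
```

Note that the solution steps themselves are unchanged; only the way they are presented differs between the two plans.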

[0516] (Example 2)

[0517] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0518] Traditional information processing systems have standardized support for user-reported problems, making it difficult to provide appropriate assistance, especially to users unfamiliar with technology or those prone to stress. Furthermore, offering uniform solutions without considering user feelings can decrease user satisfaction. There is a need to solve these problems and improve the user experience.

[0519] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0520] In this invention, the server includes speech recognition means for converting speech information acquired from the user's information processing device into text information, natural language processing means for analyzing the text information and identifying the type of problem, means for generating a solution based on the type of problem and the user's emotional state and presenting the solution audibly and visually, and emotion recognition means for analyzing speech information and facial expression information to identify the user's emotional state. This makes it possible to provide personalized solutions that respond to the user's emotions, thereby improving the user experience and satisfaction.

[0521] "Information processing device" is a general term for electronic devices that have the function of inputting, processing, and outputting data, and includes terminals that are directly operated by the user.

[0522] "Voice information" refers to data collected as the user's voice, and is the original information that is converted into text data by speech recognition.

[0523] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is subject to natural language processing.

[0524] "Speech recognition means" refers to a process or device that converts speech information into text information using technical methods, and is implemented using various algorithms.

[0525] "Natural language processing means" refers to technologies that analyze human language and understand information, and is particularly used to classify problems and formulate solutions.

[0526] "Emotion recognition means" refers to technology or devices that identify a user's emotional state by analyzing voice and facial expression data.

[0527] "Means for generating and presenting solutions" refers to a process or mechanism for creating the optimal problem-solving method for the user and providing that information to the user in an easily understandable format.

[0528] This invention realizes a process for effectively solving user problems using an information processing system. The user first accesses the system using a tablet terminal, which is an information processing device, and reports the problem via voice input. The terminal has a built-in microphone for collecting voice and a camera for capturing facial expressions. As a result, the user's voice and facial information are recorded in real time.

[0529] The terminal transmits the recorded audio information to the server. Standard data communication protocols are used for this transmission, and the data may be encrypted as a security measure. The server converts the audio information into text using speech recognition technology and then identifies the type of problem using natural language processing technology. Cloud-based speech recognition APIs and natural language processing APIs may be used for this technical implementation.

[0530] Furthermore, the server uses emotion recognition to analyze voice and facial expression information to identify the user's emotional state. For example, if a user reports a problem such as "Wi-Fi is not connecting," the server may determine from their tone of voice and facial expression that they are experiencing stress. In such situations, the solutions offered are adjusted based on the user's emotions.

[0531] The solution presentation method includes the use of visually easy-to-understand character animations, which enhances user engagement. The provided solutions are generated by a generative AI model based on the user's specific state and past interaction history. The generated solutions are presented to the user through voice guidance and visual step-by-step instructions.

[0532] As an example of a prompt, the system can ask the generative AI model a question such as, "How should support be customized if the user shows signs of stress?" and use the answer to generate a suitable response. This allows users to receive support tailored to their individual circumstances, resulting in a system that facilitates smooth problem-solving.

[0533] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0534] Step 1:

[0535] The user activates a tablet device, which is an information processing device, and reports a problem through a voice input system. The user's voice is used as input, and this voice is recorded in real time by the device's microphone.

[0536] Step 2:

[0537] The terminal processes the recorded audio information as digital data and sends it to the server. The input audio data is converted to an appropriate format and securely transferred to the server according to the network protocol. Data compression and encryption may also be performed at this stage.

[0538] Step 3:

[0539] The server receives the transmitted audio data and converts it into text data using speech recognition technology. The input data is in audio format, and the output is in text format. This conversion process may utilize a cloud-based speech recognition API.

[0540] Step 4:

[0541] The server analyzes the textual information obtained through speech recognition and identifies the problem using natural language processing. At this stage, the input is textual data, and the output is a specific problem category. A generative AI model is used for this analysis, and data calculations are performed based on the prompt text.

[0542] Step 5:

[0543] The server simultaneously performs analysis using emotion recognition means with voice and facial expression data. The input here is voice tone and facial expression data, and the output is the user's emotional state. The emotional state is classified into states such as "calm" or "stressed."

[0544] Step 6:

[0545] The server generates the optimal solution based on the identified problem and the recognized emotional state. The inputs to this process are the problem category and the emotional state, and the output is a tailored solution. The solution is designed to include visual character animations and voice guidance.

[0546] Step 7:

[0547] The terminal displays customized solutions received from the server to the user. The input is solution data from the server, and the output is visual and audio guidance to the user. The user can then use this information to proceed with problem solving.
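
Step 5 above classifies the emotional state into labels such as "calm" or "stressed" from voice tone and facial expression data. The specification does not state how the two signals are combined, so the following sketch assumes, purely for illustration, that each signal has already been reduced to a stress score between 0 and 1 and that a weighted average is compared against a threshold. The weight and threshold values are hypothetical.

```python
def classify_emotional_state(
    voice_stress: float,
    face_stress: float,
    voice_weight: float = 0.5,
    threshold: float = 0.6,
) -> str:
    """Fuse two stress scores in [0, 1] and return "calm" or "stressed"."""
    for name, value in (
        ("voice_stress", voice_stress),
        ("face_stress", face_stress),
        ("voice_weight", voice_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # Weighted average of the voice-tone and facial-expression scores.
    combined = voice_weight * voice_stress + (1.0 - voice_weight) * face_stress
    return "stressed" if combined >= threshold else "calm"


state_a = classify_emotional_state(voice_stress=0.8, face_stress=0.7)
state_b = classify_emotional_state(voice_stress=0.2, face_stress=0.3)
print(state_a, state_b)
```

The resulting label can then serve as the emotional state input to Step 6, alongside the problem category from Step 4.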

[0548] (Application Example 2)

[0549] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0550] Users often experience stress and difficulty resolving problems in stores due to malfunctions with technical equipment or unfamiliarity with its operation. Therefore, there is a need for a system that can understand users' feelings and provide appropriate and user-friendly support accordingly.

[0551] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0552] In this invention, the server includes speech recognition means for converting voice data acquired from the user's terminal into text data, natural language processing means for analyzing the text data and identifying the type of problem, and emotion recognition means for recognizing the user's emotional state and adjusting the presentation of solutions based on that emotional state. This makes it possible to quickly provide appropriate support according to the user's emotional state and to smoothly resolve problems.

[0553] "Speech recognition means" refers to a device or system that has the function of acquiring voice data from a user's terminal and converting it into text data.

[0554] "Natural language processing means" refers to a technology or system that analyzes text data and performs processing to identify the type of problem.

[0555] "Means of presenting solutions" refers to means of presenting solutions generated based on the type of problem, either audibly or visually, and having the function of conveying information to the user.

[0556] "Emotion recognition means" refers to a device or system that has the function of recognizing the emotional state of a user and adjusting the content and method of the solution provided based on that emotional state.

[0557] "Character animation" refers to visual animation used to enhance user engagement and to convey instructions and information.

[0558] The system for implementing this invention is designed to improve in-store user service using emotion recognition. A server uses speech recognition to convert audio data acquired from a user's terminal into text data. This text data is analyzed using natural language processing to identify the type of problem. The audio and video data are also passed to emotion recognition, which recognizes the user's emotional state. Using these results, the server adjusts the provided solutions according to the user's emotional state.

[0559] The hardware includes a microphone for acquiring audio data, a camera for acquiring video data, and a server for data processing. The software includes speech recognition using the Google Speech-to-Text API, natural language processing using Dialogflow, and emotion recognition using the Microsoft Emotion API.

[0560] As a concrete example, when a customer is confused while searching for products in a store, the server can detect this confusion through emotion recognition, emphasize character animations, and provide product guidance in a gentle voice. In this scenario, the AI model is instructed using a prompt message such as, "Analyze the customer's voice and facial expression data with the emotion recognition engine, and generate and provide gentle guidance text to the customer based on the results."

[0561] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0562] Step 1:

[0563] The device acquires the user's voice and facial expressions using a microphone and camera, and inputs them as audio and video data. This data is primary data for analyzing the user's state. The device then transmits this data to a server.

[0564] Step 2:

[0565] The server uses speech recognition to convert the incoming audio data into text data. The speech recognition results are output in text format and become data for natural language processing. The Google Speech-to-Text API handles this process.

[0566] Step 3:

[0567] The server uses natural language processing to analyze text data and identify the type of problem. The input is text data, and the output is information about the analyzed problem type. Dialogflow is used to analyze the context and problem, and appropriate tagging is performed.

[0568] Step 4:

[0569] The server uses emotion recognition to analyze the transmitted video data and recognize the user's emotional state. The input is video data, and the output is information about the recognized emotional state. The Microsoft Emotion API is used to analyze emotional patterns.

[0570] Step 5:

[0571] The server generates and adjusts solutions based on the results of natural language processing and emotion recognition. The input is the type of problem and the user's emotional state, and the output is the adjusted solution. The server sends prompts to the generative AI model, which uses them to generate the necessary solutions.

[0572] Step 6:

[0573] The server sends the adjusted solution to the terminal. The terminal presents the solution to the user both audibly and visually. This includes actions such as using character animations to explain things in a friendly manner. In this process, the terminal adjusts the tone of voice and animation movements to increase the user's sense of security.
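
One way the tone-of-voice adjustment in Step 6 could be expressed is with SSML (Speech Synthesis Markup Language) prosody settings. This is a sketch under the assumption that the speech synthesis engine accepts SSML, which the specification does not state; the mapping from emotional state to `rate` and `pitch` values is hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from recognized emotional state to SSML prosody values.
PROSODY_BY_EMOTION = {
    "stressed": {"rate": "slow", "pitch": "low"},
    "calm": {"rate": "medium", "pitch": "medium"},
}


def to_ssml(text: str, emotion: str) -> str:
    """Wrap guidance text in an SSML prosody element chosen by emotional state."""
    # Unknown emotional states fall back to the "calm" settings.
    prosody = PROSODY_BY_EMOTION.get(emotion, PROSODY_BY_EMOTION["calm"])
    return (
        f'<speak><prosody rate="{prosody["rate"]}" pitch="{prosody["pitch"]}">'
        f"{escape(text)}"  # escape &, <, > so the markup stays well-formed
        "</prosody></speak>"
    )


ssml = to_ssml("Open the Wi-Fi settings & check the network status.", "stressed")
print(ssml)
```

A slower rate and lower pitch are used here only as a plausible example of a calmer delivery; the actual settings would need to be tuned for the chosen voice.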

[0574] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0575] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
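
The input described above, a prompt containing instructions plus inference data in audio, text, or image form, can be pictured as a simple request structure. The class name `InferenceRequest`, its field names, and the validation rules are hypothetical and serve only to illustrate the pairing of a prompt with inference data; they do not describe the interface of any particular generative AI service.

```python
from dataclasses import dataclass, field
from typing import Dict

# Kinds of inference data named in the text.
ALLOWED_KINDS = {"audio", "text", "image"}


@dataclass
class InferenceRequest:
    prompt: str  # instruction for the model
    inference_data: Dict[str, bytes] = field(default_factory=dict)  # kind -> raw data

    def __post_init__(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must contain an instruction")
        unknown = set(self.inference_data) - ALLOWED_KINDS
        if unknown:
            raise ValueError(f"unsupported inference data kinds: {sorted(unknown)}")


request = InferenceRequest(
    prompt="Classify the problem described in the text.",
    inference_data={"text": b"My smartphone cannot connect to Wi-Fi"},
)
print(sorted(request.inference_data))
```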

[0576] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset-type terminal 314.

[0577] [Fourth Embodiment]

[0578] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0579] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0580] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0581] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0582] The microphone 238 accepts instructions and other input from the user 20 by receiving the user's voice. The microphone 238 captures the voice uttered by the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0583] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0584] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0585] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0586] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0587] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0588] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0589] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0590] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0591] This invention is a system that helps users resolve problems with their smartphones and other information terminals themselves using a tablet terminal at a store. This system operates by integrating speech recognition, natural language processing, and solution suggestion technologies.

[0592] First, the user activates the tablet device and begins describing the current problem using voice or text. The device records the user's voice and sends the data to the server. The server then uses speech recognition technology to convert the voice data into text.

[0593] Next, the server applies natural language processing techniques to analyze the type of problem from the text data. In this process, referencing past call center interaction history and device handling databases enables more accurate problem identification and advice.

[0594] Once a problem is identified, the server generates a solution to resolve it. This solution includes appropriate troubleshooting steps and configuration change instructions, and is prepared as voice instructions using speech synthesis technology. Furthermore, the operating procedures are displayed on the screen as a visual aid.

[0595] Finally, the device provides these instructions to the user via voice and visual means to support problem-solving. For example, if a user reports a Wi-Fi connection problem, the system provides a visual example of the settings screen along with a voice instruction such as, "Open the Wi-Fi settings screen and check the network status." This makes it easy even for users who are unfamiliar with technology to resolve the problem.

[0596] This system is designed to allow users to quickly resolve problems with their own devices, resulting in reduced waiting times and improved support quality.

[0597] The following describes the processing flow.

[0598] Step 1:

[0599] The user operates a tablet terminal in the store to activate the AI agent. Once the agent is activated, the terminal sets the voice input function to standby mode.

[0600] Step 2:

[0601] The user enters the details of the problem via voice. The terminal records this voice data and prepares to send it to the server.

[0602] Step 3:

[0603] The server converts the received audio data into text data using ASR (Automatic Speech Recognition) technology. This makes the audio information available in a text format that can be parsed.

[0604] Step 4:

[0605] The server uses NLP (Natural Language Processing) technology to analyze text data. The purpose of the analysis is to identify the type of reported problem and its possible causes.

[0606] Step 5:

[0607] The server selects the solution needed to resolve the problem based on the results of natural language processing. In this selection, the server references the call center response history and instruction manual databases so that the most suitable troubleshooting can be performed.

[0608] Step 6:

[0609] The server prepares audio and visual information to present the selected solution to the user in an easy-to-understand manner. The prepared audio is generated using TTS (Text-to-Speech) technology.

[0610] Step 7:

[0611] The terminal receives voice instructions and visual information from the server and presents them to the user. Along with the voice instructions, character animations are displayed on the screen to visually support the instructions.

[0612] Step 8:

[0613] The user operates their smartphone following the provided audio and visual instructions. They can also report the results of their operations back to the tablet device.

[0614] This process allows users to resolve problems themselves, resulting in faster problem resolution.
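The flow of Steps 1 to 8 above can be sketched as a small server-side pipeline. The following is only a minimal illustration under stated assumptions: the function names, the `Guidance` structure, and the `SOLUTIONS` table are hypothetical, and the ASR and NLP stages are replaced by trivial stubs (the "audio" is UTF-8 text and the problem type is found by keyword matching) rather than real recognition engines.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical knowledge base standing in for the call center response
# history and instruction manual databases referenced in Step 5.
SOLUTIONS = {
    "wifi": ["Open the Wi-Fi settings screen and check the network status."],
    "unknown": ["Please ask a staff member for assistance."],
}


@dataclass
class Guidance:
    problem_type: str
    voice_text: str          # text that would be handed to a TTS engine (Step 6)
    screen_steps: List[str]  # steps shown on the terminal screen (Step 7)


def recognize_speech(audio: bytes) -> str:
    """Stub for Step 3 (ASR). Assumption: the 'audio' is simply UTF-8 text."""
    return audio.decode("utf-8")


def identify_problem(text: str) -> str:
    """Stub for Step 4 (NLP): a keyword match, not a real language model."""
    return "wifi" if "wi-fi" in text.lower() else "unknown"


def build_guidance(audio: bytes) -> Guidance:
    """Run Steps 3 to 6 on the server and return what the terminal presents."""
    text = recognize_speech(audio)
    problem = identify_problem(text)
    steps = SOLUTIONS[problem]
    return Guidance(problem, " ".join(steps), steps)
```

For example, `build_guidance("My smartphone cannot connect to Wi-Fi".encode("utf-8"))` yields a `Guidance` whose `problem_type` is `"wifi"`. In an actual system each stub would be replaced by the corresponding recognition, analysis, and synthesis technology.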

[0615] (Example 1)

[0616] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0617] In modern information devices, user problems are becoming increasingly complex, making self-resolution difficult for users without specialized knowledge. Furthermore, traditional support systems often take a long time to resolve issues, leading to increased user stress and support costs. Therefore, there is a need for an efficient, intuitive, and user-friendly self-resolution support system.

[0618] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0619] In this invention, the server includes speech recognition means for converting voice data acquired from a user's information terminal into text data, language analysis means for analyzing the text data and identifying the type of problem, and means for generating a solution based on the type of problem, while referring to past support history and device handling information, and presenting the solution via voice output and visual display. This enables users to quickly and effectively resolve problems with their information terminals themselves, even without specialized knowledge.

[0620] "User information terminal" refers to a portable or stationary device with computing capabilities used by the user.

[0621] "Audio data" refers to sound information input by the user through a device, represented as a digital signal.

[0622] "Text data" refers to information expressed in string format obtained by analyzing audio data.

[0623] "Speech recognition means" refers to technology that has a mechanism for receiving speech data and converting it into corresponding text data.

[0624] "Language analysis means" refers to techniques for analyzing text data and identifying the type of problem.

[0625] "Past support history" refers to data on support interactions with users recorded to date, which is used to assist in resolving problems.

[0626] "Device handling information" refers to a collection of data related to the usage and troubleshooting of the information terminal being used.

[0627] "Means for generating solutions" refers to a system that automatically designs appropriate response methods based on the type of problem.

[0628] "Means of presentation via audio output and visual display" refers to a system that includes technology for providing users with guidance on solutions via audio or visual means.

[0629] "Animation" is a method of expression that uses dynamic changes to display visual information to aid user comprehension.

[0630] This invention is a system that helps users resolve problems with information devices themselves using information terminals in stores. This system operates by integrating speech recognition technology, natural language processing technology, and solution presentation technology.

[0631] The user first operates an information terminal and begins describing the problem using voice or text. The terminal records the user's voice and sends it to the server as audio data. At this stage, the server utilizes speech recognition technology as a "speech recognition means" to convert the "audio data" into "text data." A commercial speech recognition system may be used for this process.

[0632] Next, the server processes the text data using the "language analysis means" to identify the type of problem. During this identification, the server references databases such as the past support history and device handling information to improve the accuracy of the identification. Once the problem is identified, the server generates a solution based on the identified problem. In this process, the server uses a generative AI model to design an appropriate solution to the problem.

[0633] The solution generated by the server is presented to the user in both audio and visual form. The terminal receives the information from the server, guides the user via audio output, and simultaneously displays the operating procedure on the screen. For example, if the user reports a problem such as "My smartphone cannot connect to Wi-Fi," the user is presented with specific instructions, such as "Open the Wi-Fi settings screen and check the network status," along with visual information.

[0634] This system allows users with limited technical knowledge to quickly understand and resolve device malfunctions, which reduces waiting times and improves the quality of support. An example of a specific prompt given to the generative AI model is, "Please provide specific troubleshooting steps for the Wi-Fi connection problem reported by the user."
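One possible way to assemble such a prompt from the identified problem type, the user's report, and entries drawn from the support history is sketched below. The function name `build_prompt`, its parameters, and the closing formatting instruction are assumptions made for illustration; only the first sentence of the prompt follows the example wording above.

```python
from typing import List


def build_prompt(problem_type: str, user_report: str, references: List[str]) -> str:
    """Assemble a prompt for a generative AI model (hypothetical helper)."""
    lines = [
        # Wording taken from the example prompt in the text.
        f"Please provide specific troubleshooting steps for the {problem_type} "
        "problem reported by the user.",
        f'User report: "{user_report}"',
    ]
    if references:
        # Assumption: history and handling information are passed as plain text.
        lines.append("Reference information:")
        lines.extend(f"- {ref}" for ref in references)
    # Assumption: an output-format hint suited to voice guidance.
    lines.append("Answer as short numbered steps suitable for voice guidance.")
    return "\n".join(lines)
```

Calling `build_prompt("Wi-Fi connection", "My smartphone cannot connect to Wi-Fi", ["Restarting the router resolved a similar case."])` returns a multi-line string that could be passed to the data generation model as its prompt.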

[0635] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0636] Step 1:

[0637] The user activates an information terminal installed in the store and inputs a description of the problem by voice. The input voice data is collected by the terminal and transmitted to the server in digital format. At this point, the input is voice data, and the output is digital voice data for transmission.

[0638] Step 2:

[0639] The server converts the received audio data into text data using speech recognition technology. Here, the speech recognition engine outputs the audio data as text data. In this conversion, text based on a language model is generated through the analysis of the audio waveform. Therefore, the input is audio data, and the output is text data.

[0640] Step 3:

[0641] The server passes text data to a natural language processing engine to analyze the content of the problem. This analysis identifies the problem from the query based on the user's utterance. The server then refers to the system's database and determines the type of problem by comparing it with relevant data. The input is text data, and the output is the content of the identified problem.

[0642] Step 4:

[0643] Based on the identified problem, the server generates a solution using a generative AI model, drawing on past response history and handling information. The server extracts necessary information from existing databases and designs the optimal solution according to the nature of the problem. The input at this stage is the content of the identified problem, and the output is instructions for the solution.

[0644] Step 5:

[0645] The terminal receives a solution from the server and presents the solution to the user verbally using speech synthesis technology. Simultaneously, visual instructions related to the solution are displayed on the terminal's screen. This allows the user to confirm the specific steps through both sight and sound. The input is the solution instructions, and the output is the presentation of the solution through voice and visual means.

[0646] Step 6:

[0647] The user resolves issues with their information device by following instructions presented on the terminal. The user uses the on-screen interface and follows voice guidance to proceed with the problem-solving process. This step involves the user performing physical actions to resolve the problem. The input is the suggested solution, and the output is the user's response.

[0648] This series of steps allows users to quickly resolve issues without the need for expert support, thereby improving usability.
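Step 3 above determines the type of problem by comparing the recognized text with data held in the system's database. As a rough illustration only, the comparison can be sketched with a word-overlap (Jaccard) similarity. This is a swapped-in, simplified technique and not the natural language processing engine of the disclosure; the database rows, the type labels, and the threshold value are hypothetical.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {w.strip(".,!?\"'").lower() for w in text.split()} - {""}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Ratio of shared words to all words in the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical database rows: (problem type, example description).
PROBLEM_DB: List[Tuple[str, str]] = [
    ("wifi_connection", "smartphone cannot connect to wi-fi network"),
    ("battery_drain", "battery drains quickly and runs out fast"),
    ("screen_frozen", "screen is frozen and does not respond to touch"),
]


def determine_problem_type(text: str, db=PROBLEM_DB, threshold: float = 0.1) -> str:
    """Return the best-matching problem type, or 'unknown' below the threshold."""
    query = tokenize(text)
    best_type, best_score = "unknown", 0.0
    for problem_type, description in db:
        score = jaccard(query, tokenize(description))
        if score > best_score:
            best_type, best_score = problem_type, score
    return best_type if best_score >= threshold else "unknown"
```

With this table, `determine_problem_type("My smartphone cannot connect to Wi-Fi")` selects `"wifi_connection"`, and the result can then be used as the input to Step 4.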

[0649] (Application Example 1)

[0650] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0651] Consumers often need to visit a store when they encounter problems with mobile devices such as smartphones and tablets. However, this requires waiting for a specialist, which is time-consuming and inconvenient, and consumers unfamiliar with technology often find it difficult to resolve the issue on their own. Furthermore, traditional support systems lack sufficient integration of voice and visual information, and provide inadequate guidance for users to resolve problems themselves. Therefore, there is a need for a comprehensive support system that facilitates rapid self-resolution on-site.

[0652] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0653] In this invention, the server includes: speech recognition means for converting voice information acquired from the user's operating platform into text information; natural language processing means for analyzing the text information and identifying the type of malfunction; means for generating solutions based on the type of malfunction and presenting the solutions audibly and visually; and means for providing self-support to enable customers to resolve problems with their mobile devices on the spot. This enables customers to quickly and effectively resolve problems with their devices within the store.

[0654] An "operating platform" is a device used by users to input information and receive information from a system, and is usually an electronic device such as a tablet.

[0655] "Voice information" refers to data conveyed through voice input from users.

[0656] "Text information" refers to data in text format that has been converted from voice information.

[0657] "Speech recognition means" refers to the technical processes and functions for converting voice information into text information.

[0658] "Natural language processing means" refers to processing techniques that analyze textual information and identify the type of problem based on its content.

[0659] "Type of malfunction" refers to the specific nature and classification of problems in mobile devices.

[0660] A "solution" is a set of procedures or guidelines for resolving a problem, generated according to the type of failure identified.

[0661] A "customer" is a consumer who visits a store to receive a service.

[0662] "Self-support" refers to the assistance and means provided to users to solve their own problems on their own.

[0663] The system for implementing this invention uses a tablet terminal as the customer's operating platform and integrates speech recognition, natural language processing, and solution presentation technologies. The system uses the Google Speech-to-Text API as the speech recognition engine to convert voice information into text information. The server performs natural language processing on the resulting text information using the Google Cloud Natural Language API to identify the type of malfunction. The server then generates a solution based on the identified malfunction, referencing information from the Firebase Realtime Database. This solution is presented on the terminal both audibly and visually through a web interface built with React.js.

[0664] The solutions generated by the server guide customers on how to efficiently resolve problems with their mobile devices on the spot. Specifically, for example, if a user says, "My smartphone battery drains quickly," the system identifies it as a "battery consumption problem" and presents specific suggestions such as "stopping unnecessary applications" and "setting battery saver mode." Customers can easily perform these steps by following the visual guide.

[0665] Examples of using generative AI models include prompts such as, "Identify battery problems from voice input and suggest the best solution," or "Provide optimal Wi-Fi connection troubleshooting steps based on user input." In this way, a self-support environment is provided where even customers who are not technically savvy can confidently solve problems.

[0666] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0667] Step 1:

[0668] The user activates the tablet device and describes the problem by voice. The device captures this voice data and sends it to the server in streaming format.

[0669] Step 2:

[0670] The server receives audio data and converts it into text data using the Google Speech-to-Text API. This process involves data processing that converts the audio signal into a string of characters, and the output is text data as character information.

[0671] Step 3:

[0672] The server receives text data and parses it using the Google Cloud Natural Language API. Here, natural language processing techniques are used to identify the type of failure from the input text. As a result, the identified failure type is output.

[0673] Step 4:

[0674] Based on the identified failure type, the server retrieves suitable solution data from the Firebase Realtime Database. The failure type is the input, data processing is performed to generate the solution, and the output is a specific solution.

[0675] Step 5:

[0676] The server prepares to present the generated solution both audibly and visually. This involves using speech synthesis technology to format the solution as voice instructions and using React.js to create visual instructions. The audio file and the visual display information are then output to the terminal.

[0677] Step 6:

[0678] The terminal presents the user with solutions received from the server. The user receives voice guidance and visual step-by-step instructions, allowing them to attempt to resolve the problem themselves. The solution is ultimately executed as an action by the user.
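Steps 4 and 5 above can be illustrated by a lookup that maps a failure type to stored solution data and packages it for audible and visual presentation. The sketch below deliberately uses an in-memory dictionary as a stand-in for the database and makes no calls to any cloud service; the record layout, the key names, and the function `build_presentation` are assumptions. The JSON payload is one plausible form in which a web front end could receive the data.

```python
import json

# In-memory stand-in for stored solution records. A real deployment would
# read them from the database named in the text; this layout is an assumption.
SOLUTION_STORE = {
    "battery_consumption": {
        "title": "Battery consumption problem",
        "steps": ["Stop unnecessary applications.", "Set battery saver mode."],
    }
}


def build_presentation(failure_type: str, store=SOLUTION_STORE) -> str:
    """Return a JSON payload holding speech text and on-screen steps."""
    record = store.get(failure_type)
    if record is None:
        # No stored solution: let the terminal fall back to staff assistance.
        return json.dumps({"failure_type": failure_type, "found": False})
    numbered = [f"Step {i}: {s}" for i, s in enumerate(record["steps"], start=1)]
    payload = {
        "failure_type": failure_type,
        "found": True,
        "title": record["title"],
        "speech_text": " ".join(numbered),  # input for speech synthesis
        "visual_steps": numbered,           # rendered by the web interface
    }
    return json.dumps(payload, ensure_ascii=False)
```

The terminal would parse this payload, pass `speech_text` to speech synthesis, and render `visual_steps` as the step-by-step guide.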

[0679] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0680] This invention aims to improve the user experience by adding emotion recognition functionality to a system that assists users in self-solving problems with smartphones and information terminals via in-store tablet devices. This will enable appropriate support even for users who are unfamiliar with technology or who are easily stressed.

[0681] This system operates by combining speech recognition, natural language processing, and solution suggestion with an emotion engine that recognizes emotions. The user operates a tablet device to activate the AI agent. The device has the ability to record the user's voice and capture their facial expressions via a camera. The recorded voice and captured facial expression data are sent to a server.

[0682] The server uses an emotion recognition engine to analyze audio and video data to recognize the user's current emotional state. This analysis is then correlated with the nature of the problem using natural language processing technology, forming the basis for guiding users to appropriate solutions.

[0683] For example, if a user reports a problem such as "Wi-Fi not connecting," the server can determine from the user's tone of voice and facial expression that the user is experiencing significant stress. In this case, in addition to offering the usual solutions, adjustments are made to provide more empathetic voice guidance and simplified instructions. In particular, character animations are emphasized to make the user feel more approachable and reassured.

[0684] Ultimately, the device adjusts and presents the voice instructions and visual information received from the server based on the user's emotional state. This allows users to receive guidance tailored to their individual circumstances and solve problems more smoothly. This form, which combines emotion recognition, improves the user experience and enables a higher level of satisfaction.

[0685] The following describes the processing flow.

[0686] Step 1:

[0687] The user activates the AI agent by operating a tablet device in the store. The device starts its camera and microphone and prepares to collect audio and video data.

[0688] Step 2:

[0689] The user describes the problem verbally. The device records the audio using its microphone and simultaneously captures the user's facial expressions in real time using its camera.

[0690] Step 3:

[0691] The device sends recorded audio data and captured facial expression data to the server. The transmitted data is encrypted to protect the user's privacy.

[0692] Step 4:

[0693] The server converts the audio data into text data using ASR (Automatic Speech Recognition) technology. Then, it uses NLP (Natural Language Processing) technology to analyze the details of the problem from the text.

[0694] Step 5:

[0695] The server uses an emotion recognition engine to analyze the user's emotional state from voice tone and facial expression data. This identifies the emotions the user is experiencing (e.g., stress, anxiety, etc.).

[0696] Step 6:

[0697] The server generates solutions to problems based on the analysis results. It adjusts how solutions are presented based on emotional data, using empathetic voice guidance and character animations as needed.

[0698] Step 7:

[0699] The server sends generated voice instructions and visual information to the terminal. The terminal receives them and presents them to the user in a tone that matches their emotional state. Along with the voice guidance, a friendly character is displayed on the screen.

[0700] Step 8:

[0701] Users follow instructions and operate their smartphones or information terminals to attempt to resolve the problem. After the operation, they can also report the situation to the terminal again and receive additional support if necessary.

[0702] This approach allows systems that incorporate emotion recognition to provide flexible support tailored to each user's emotional state, facilitating efficient problem-solving.
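The adjustment described in Steps 6 and 7 above, in which the same solution is presented differently depending on the user's emotional state, can be sketched as follows. This is a minimal illustration under assumptions: the emotion labels, the `Presentation` structure, the wording of the empathetic opening, and the reading of "simplified instructions" as one step per screen are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List

# Emotional states treated as needing extra care. The labels are assumptions
# based on the examples in the text (stress, anxiety).
CAREFUL_STATES = {"stress", "anxiety"}


@dataclass
class Presentation:
    voice_text: str            # text passed to speech synthesis
    screens: List[List[str]]   # steps grouped per screen
    emphasize_character: bool  # whether the character animation is emphasized


def adjust_presentation(steps: List[str], emotion: str) -> Presentation:
    """Adapt how a fixed list of steps is presented to the user's emotion."""
    if emotion in CAREFUL_STATES:
        # Empathetic opening, and one step per screen as a simple
        # interpretation of "simplified instructions".
        opening = "There is no need to worry. Let us go through this one step at a time. "
        screens = [[step] for step in steps]
        emphasize = True
    else:
        opening = ""
        screens = [list(steps)]
        emphasize = False
    return Presentation(opening + " ".join(steps), screens, emphasize)
```

The troubleshooting content itself is unchanged in both branches; only the tone of the voice guidance, the grouping of steps, and the emphasis on the character animation differ.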

[0703] (Example 2)

[0704] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0705] Traditional information processing systems have standardized support for user-reported problems, making it difficult to provide appropriate assistance, especially to users unfamiliar with technology or those prone to stress. Furthermore, offering uniform solutions without considering user feelings can decrease user satisfaction. There is a need to solve these problems and improve the user experience.

[0706] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0707] In this invention, the server includes speech recognition means for converting speech information acquired from the user's information processing device into text information, natural language processing means for analyzing the text information and identifying the type of problem, means for generating a solution based on the type of problem and the user's emotional state and presenting the solution audibly and visually, and emotion recognition means for analyzing speech information and facial expression information to identify the user's emotional state. This makes it possible to provide personalized solutions that respond to the user's emotions, thereby improving the user experience and satisfaction.

[0708] "Information processing device" is a general term for electronic devices that have the function of inputting, processing, and outputting data, and includes terminals that are directly operated by the user.

[0709] "Speech information" refers to data collected as the user's voice, and is the original information that is converted into text information by speech recognition.

[0710] "Text information" refers to text data obtained by converting speech information using speech recognition technology, and is subject to natural language processing.

[0711] "Speech recognition means" refers to a process or device that converts speech information into text information using technical methods, and is implemented using various algorithms.

[0712] "Natural language processing means" refers to technologies that analyze human language and understand information, and is particularly used to classify problems and formulate solutions.

[0713] "Emotion recognition means" refers to technology or devices that identify a user's emotional state by analyzing voice and facial expression data.

[0714] "Means for generating and presenting solutions" refers to a process or mechanism for creating the optimal problem-solving method for the user and providing that information to the user in an easily understandable format.

[0715] This invention realizes a process for effectively solving user problems using an information processing system. The user first accesses the system using a tablet terminal, which is an information processing device, and reports the problem via voice input. The terminal has a built-in microphone for collecting voice and a camera for capturing facial expressions. As a result, the user's voice and facial information are recorded in real time.

[0716] The terminal transmits the recorded audio information to the server. Standard data communication protocols are used for this transmission, and the data may be encrypted as a security measure. The server converts the audio information into text using speech recognition technology and then identifies the type of problem using natural language processing technology. Cloud-based speech recognition APIs and natural language processing APIs may be used for this technical implementation.

[0717] Furthermore, the server uses emotion recognition to analyze voice and facial expression information to identify the user's emotional state. For example, if a user reports a problem such as "Wi-Fi is not connecting," the server may determine from their tone of voice and facial expression that they are experiencing stress. In such situations, the solutions offered are adjusted based on the user's emotions.

[0718] The solution presentation method includes the use of visually easy-to-understand character animations, which enhances user engagement. The provided solutions are generated by a generative AI model based on the user's specific state and past interaction history. The generated solutions are presented to the user through voice guidance and visual step-by-step instructions.

[0719] As an example of a prompt, the generative AI model can be given a question such as, "How should support be customized if the user shows signs of stress?" to generate an appropriate response. This allows users to receive support tailored to their individual circumstances, resulting in a system that facilitates smooth problem-solving.

[0720] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0721] Step 1:

[0722] The user activates a tablet device, which is an information processing device, and reports a problem through a voice input system. The user's voice is used as input, and this voice is recorded in real time by the device's microphone.

[0723] Step 2:

[0724] The terminal processes the recorded audio information as digital data and sends it to the server. The input audio data is converted to an appropriate format and securely transferred to the server according to the network protocol. Data compression and encryption may also be performed at this stage.

[0725] Step 3:

[0726] The server receives the transmitted audio data and converts it into text data using speech recognition technology. The input data is in audio format, and the output is in text format. This conversion process may utilize a cloud-based speech recognition API.

[0727] Step 4:

[0728] The server analyzes the text information obtained through speech recognition and identifies the problem using natural language processing. At this stage, the input is text data, and the output is a specific problem category. A generative AI model is used for this analysis, and inference is performed according to the prompt text.

[0729] Step 5:

[0730] The server simultaneously performs analysis using emotion recognition means with voice and facial expression data. The input here is voice tone and facial expression data, and the output is the user's emotional state. The emotional state is classified into states such as "calm" or "stressed."

[0731] Step 6:

[0732] The server generates and presents the optimal solution based on the identified problem and perceived emotional state. This process involves inputs including the problem category and emotional state, and outputting a tailored solution. It is designed to include visual character animations and voice guidance.

[0733] Step 7:

[0734] The terminal displays customized solutions received from the server to the user. The input is solution data from the server, and the output is visual and audio guidance to the user. The user can then use this information to proceed with problem solving.
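Step 5 above classifies the user's emotional state into states such as "calm" or "stressed" from voice tone and facial expression data. One simple way to picture this is a weighted combination of two scores followed by a threshold, as sketched below. This is an illustrative assumption and not the emotion identification model 59 itself: the per-modality stress scores in the range 0 to 1, the weight, and the threshold are all hypothetical, and in practice the scores would be produced by separate voice and image analysis.

```python
def classify_emotion(voice_score: float, face_score: float,
                     voice_weight: float = 0.5, threshold: float = 0.6) -> str:
    """Fuse two stress scores in [0, 1] and return 'stressed' or 'calm'."""
    for name, value in (("voice_score", voice_score), ("face_score", face_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be between 0 and 1")
    # Weighted average of the voice-tone and facial-expression scores.
    combined = voice_weight * voice_score + (1.0 - voice_weight) * face_score
    return "stressed" if combined >= threshold else "calm"
```

The returned label, together with the problem category from Step 4, would serve as the input to Step 6, where the tailored solution is generated.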

[0735] (Application Example 2)

[0736] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0737] Users often experience stress and difficulty resolving problems in stores due to malfunctions with technical equipment or unfamiliarity with its operation. Therefore, there is a need for a system that can understand users' feelings and provide appropriate and user-friendly support accordingly.

[0738] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0739] In this invention, the server includes speech recognition means for converting voice data acquired from the user's terminal into text data, natural language processing means for analyzing the text data and identifying the type of problem, and emotion recognition means for recognizing the user's emotional state and adjusting the presentation of solutions based on that emotional state. This makes it possible to quickly provide appropriate support according to the user's emotional state and to smoothly resolve problems.

[0740] "Speech recognition means" refers to a device or system that has the function of acquiring voice data from a user's terminal and converting it into text data.

[0741] "Natural language processing means" refers to a technology or system that analyzes text data and performs processing to identify the type of problem.

[0742] "Means of presenting solutions" refers to means of presenting solutions generated based on the type of problem, either audibly or visually, and having the function of conveying information to the user.

[0743] "Emotion recognition means" refers to a device or system that has the function of recognizing the emotional state of a user and adjusting the content and method of the solution provided based on that emotional state.

[0744] "Character animation" refers to visual animation used to enhance user engagement and to convey instructions and information.

[0745] The system for implementing this invention is designed to improve in-store user service using emotion recognition. The server converts audio data acquired from the user's terminal into text data via speech recognition. This text data is analyzed using natural language processing to identify the type of problem. In addition, emotion recognition is applied to the audio and video data to recognize the user's emotional state. Using these results, the server adjusts the provided solutions according to the user's emotional state.

[0746] The hardware includes a microphone for acquiring audio data, a camera for acquiring video data, and a server for data processing. The software includes speech recognition using the Google Speech-to-Text API, natural language processing using Dialogflow, and emotion recognition using the Microsoft Emotion API.

[0747] As a concrete example, when a customer is confused while searching for products in a store, the server can detect this confusion through emotion recognition, emphasize character animations, and provide product guidance in a gentle voice. In this scenario, the AI model is instructed using a prompt message such as, "Analyze the customer's voice and facial expression data with the emotion recognition engine, and generate and provide gentle guidance text to the customer based on the results."

[0748] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0749] Step 1:

[0750] The terminal acquires the user's voice and facial expressions using a microphone and camera, and inputs them as audio and video data. This data is the primary data for analyzing the user's state. The terminal then transmits this data to the server.

[0751] Step 2:

[0752] The server uses speech recognition to convert the incoming audio data into text data. The speech recognition results are output in text format and become data for natural language processing. The Google Speech-to-Text API handles this process.

[0753] Step 3:

[0754] The server uses natural language processing to analyze text data and identify the type of problem. The input is text data, and the output is information about the analyzed problem type. Dialogflow is used to analyze the context and problem, and appropriate tagging is performed.

[0755] Step 4:

[0756] The server uses emotion recognition to analyze the received video data and recognize the user's emotional state. The input is video data, and the output is information about the recognized emotional state. The Microsoft Emotion API is used to analyze emotional patterns.

[0757] Step 5:

[0758] The server generates and refines solutions based on the results of natural language processing and emotion recognition. The input is the type of problem and the user's emotional state, and the output is the refined solution. The server sends a prompt to the generative AI model, which generates the necessary solution according to that prompt.

[0759] Step 6:

[0760] The server sends the adjusted solution to the terminal. The terminal presents the solution to the user both audibly and visually. This includes actions such as using character animations to explain things in a friendly manner. In this process, the terminal adjusts the tone of voice and animation movements to increase the user's sense of security.
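The server-side portion of the flow above (Steps 2 through 6) can be sketched as a single orchestration function. This is a minimal sketch, not the disclosed implementation: the speech recognition, natural language processing, emotion recognition, and generative AI stages are passed in as plain functions, so no specific service API is assumed, and the names `Guidance`, `build_prompt`, and `handle_request` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Guidance:
    problem_type: str
    emotion: str
    text: str


def build_prompt(problem_type: str, emotion: str, utterance: str) -> str:
    """Compose the prompt sent to the generative AI model in Step 5."""
    return (
        "Generate gentle, step-by-step guidance for the customer.\n"
        f"Problem type: {problem_type}\n"
        f"Emotional state: {emotion}\n"
        f"Customer said: {utterance}"
    )


def handle_request(
    audio: bytes,
    video: bytes,
    transcribe: Callable[[bytes], str],
    classify: Callable[[str], str],
    recognize_emotion: Callable[[bytes], str],
    generate: Callable[[str], str],
) -> Guidance:
    """Run Steps 2-5 on the server and return the payload sent in Step 6."""
    utterance = transcribe(audio)          # Step 2: speech recognition
    problem_type = classify(utterance)     # Step 3: natural language processing
    emotion = recognize_emotion(video)     # Step 4: emotion recognition
    text = generate(build_prompt(problem_type, emotion, utterance))  # Step 5
    return Guidance(problem_type, emotion, text)  # Step 6: sent to the terminal
```

Passing the stages in as functions keeps the sketch testable with stubs and leaves open which service backs each stage.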

[0761] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0762] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0763] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0764] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0765] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0766] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0767] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).

[0768] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
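The geometry of the emotion map described above can be illustrated with a small data structure. This is a sketch under assumptions: the polar-coordinate representation, the angle convention (0 degrees at the 3 o'clock position, counter-clockwise), the 0.5 radius threshold, and the class name `MapPosition` are all hypothetical and are introduced only to make the left/right, upper/lower, and inner/outer distinctions concrete.

```python
import math
from dataclasses import dataclass

_EPS = 1e-9  # tolerance for positions lying exactly on an axis


@dataclass(frozen=True)
class MapPosition:
    angle_deg: float  # 0 = 3 o'clock, measured counter-clockwise (assumed convention)
    radius: float     # 0.0 = center of the map, 1.0 = outermost ring (assumed scale)

    def side(self) -> str:
        """Left half is the 'reaction' region, right half the 'situation' region."""
        x = math.cos(math.radians(self.angle_deg))
        if abs(x) < _EPS:
            return "both"  # directly above or below the center
        return "situation" if x > 0 else "reaction"

    def valence(self) -> str:
        """Upper half corresponds to pleasure, lower half to displeasure."""
        y = math.sin(math.radians(self.angle_deg))
        if abs(y) < _EPS:
            return "neutral"
        return "pleasure" if y > 0 else "displeasure"

    def layer(self) -> str:
        """Inner rings hold inner states; outer rings hold states expressed in action."""
        return "inner" if self.radius < 0.5 else "outer"
```

For example, a position at 135 degrees near the outer ring falls in the reaction region, on the pleasure side, and in the layer that is expressed in actions.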

[0769] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0770] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
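One possible way to obtain training targets in which neighboring emotions have similar values is to smooth a labeled emotion over the map with a distance-based kernel. The following sketch is an assumption made for illustration and is not stated in the disclosure: the two-dimensional coordinates, the Gaussian kernel, the width `sigma`, and the emotion "anxious" are all hypothetical.

```python
import math
from typing import Dict, Tuple

# Hypothetical map coordinates; nearby emotions are placed close together.
POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (0.50, 0.10),
    "calm": (0.55, 0.05),
    "confident": (0.60, 0.15),
    "anxious": (0.50, -0.60),
}


def soft_targets(label: str, sigma: float = 0.2) -> Dict[str, float]:
    """Emotion values that decay with map distance from the labeled emotion."""
    lx, ly = POSITIONS[label]
    return {
        name: math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in POSITIONS.items()
    }


def dominant_emotion(values: Dict[str, float]) -> str:
    """Determine the emotion with the highest emotion value."""
    return max(values, key=lambda name: values[name])
```

With these assumed coordinates, a sample labeled "calm" yields high values for "reassured" and "confident" as well, and a value near zero for "anxious".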

[0771] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0772] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0773] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0774] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0775] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0776] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, a general-purpose processor that functions as a hardware resource for performing specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which are processors having circuit configurations specifically designed to perform specific processing. Each of these processors has built-in or connected memory, and each performs specific processing by using that memory.

[0777] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0778] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0779] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0780] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0781] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0782] The following is further disclosed regarding the embodiments described above.

[0783] (Claim 1)

[0784] A speech recognition means that converts voice data acquired from the user's device into text data,

[0785] A natural language processing means for analyzing the aforementioned text data and identifying the type of problem,

[0786] A means for generating a solution based on the type of problem and presenting the solution audibly and visually,

[0787] A system that includes this.

[0788] (Claim 2)

[0789] The system according to claim 1, wherein the natural language processing means performs analysis based on past user interaction history and standard operation instruction data.

[0790] (Claim 3)

[0791] The system according to claim 1, wherein the means for presenting the aforementioned solution is to guide the user through the operation method while increasing user familiarity using character animation.

[0792] "Example 1"

[0793] (Claim 1)

[0794] A speech recognition means that converts voice data acquired from a user's information terminal into text data,

[0795] A language analysis means for analyzing the aforementioned text data and identifying the type of problem,

[0796] Based on the type of problem, a means for generating a solution while referring to past response history and device handling information, and presenting the solution via audio output and visual display,

[0797] A system that includes this.

[0798] (Claim 2)

[0799] The system according to claim 1, wherein the language analysis means performs analysis based on past user interaction history and standard operation information.

[0800] (Claim 3)

[0801] The system according to claim 1, wherein the means for presenting the aforementioned solution is to guide the user through the procedure using animation to enhance user familiarity.

[0802] "Application Example 1"

[0803] (Claim 1)

[0804] A speech recognition means that converts speech information obtained from the user's operating platform into text information,

[0805] A natural language processing means for analyzing the aforementioned textual information and identifying the type of failure,

[0806] A means for generating a solution based on the type of failure and presenting the solution audibly and visually,

[0807] A means of providing self-support for customers to resolve problems with their mobile devices on the spot,

[0808] A system that includes this.

[0809] (Claim 2)

[0810] The system according to claim 1, wherein the natural language processing means performs analysis based on historical customer service data and standard operation guidance information.

[0811] (Claim 3)

[0812] The system according to claim 1, wherein the means for presenting the aforementioned solution is to guide the user through the operation method while increasing user familiarity with the system using visual aids.

[0813] "Example 2 of combining an emotion engine"

[0814] (Claim 1)

[0815] A speech recognition means that converts speech information acquired from a user's information processing device into text information,

[0816] A natural language processing means for analyzing the aforementioned textual information and identifying the type of problem,

[0817] A means for generating a solution based on the type of problem and the user's emotional state, and for presenting the solution audibly and visually,

[0818] An emotion recognition means that analyzes the aforementioned voice information and facial expression information to identify the user's emotional state,

[0819] A system that includes this.

[0820] (Claim 2)

[0821] The system according to claim 1, wherein the natural language processing means performs analysis based on past user interaction history and standard operation instruction information.

[0822] (Claim 3)

[0823] The system according to claim 1, wherein the means for presenting the aforementioned solution is to guide the user through the operation method while increasing familiarity with the user by using character images based on visual display technology.

[0824] "Application example 2 when combining with an emotional engine"

[0825] (Claim 1)

[0826] A speech recognition means that converts voice data acquired from the user's device into text data,

[0827] A natural language processing means for analyzing the aforementioned text data and identifying the type of problem,

[0828] A means for generating a solution based on the type of problem and presenting the solution audibly and visually,

[0829] An emotion recognition means that recognizes the user's emotional state and adjusts the presentation of solutions based on that emotional state,

[0830] A system that includes this.

[0831] (Claim 2)

[0832] The system according to claim 1, wherein the natural language processing means performs analysis based on past user interaction history and standard operation instruction data.

[0833] (Claim 3)

[0834] The system according to claim 1, wherein the means for presenting the aforementioned solution is to guide the user through the operation method while increasing familiarity with the user using character animation, and further adjust the voice guidance to be more user-friendly based on the user's emotional state.

[Explanation of symbols]

[0835] 10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A speech recognition means that converts speech information obtained from the user's operating platform into text information, A natural language processing means for analyzing the aforementioned textual information and identifying the type of failure, A means for generating a solution based on the type of failure and presenting the solution audibly and visually, A means of providing self-support for customers to resolve problems with their mobile devices on the spot, A system that includes this.

2. The system according to claim 1, wherein the natural language processing means performs analysis based on historical customer service data and standard operation guidance information.

3. The system according to claim 1, wherein the means for presenting the aforementioned solution is to guide the user through the operation method while increasing familiarity with the user by using visual aids.