system

The system addresses inefficiencies in customer inquiry handling by using natural language processing and audio-visual content generation to deliver rapid, accurate, and personalized responses, enhancing user understanding and satisfaction.

JP2026101279A · Pending · Publication Date: 2026-06-22 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-10
Publication Date
2026-06-22

AI Technical Summary

Technical Problem

Modern enterprises face challenges in efficiently handling customer inquiries, with conventional methods leading to longer response times and variations in answer quality, and self-service systems failing to provide comprehensive understanding due to reliance on text and diagrams.

Method used

A system that utilizes natural language processing to analyze user inquiries, generates optimal answers, and converts them into integrated audio-visual content, including speech synthesis and video generation to enhance user comprehension.

Benefits of technology

The system provides quick and accurate responses through both sight and hearing, reducing the number of inquiries and improving user experience by ensuring clarity and personalization.



Abstract

[Problem] To provide a system. [Solution] An information processing system including: a data receiving means for receiving user inquiries; a data analysis means for analyzing the received inquiries to identify frequently occurring questions; an information generation means for generating an optimal answer based on the analysis results; a multimedia generation means for integrating visual and audio information based on the generated answer; and a data distribution means for providing the generated multimedia content to users.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, the method including the steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Modern enterprises are required to efficiently handle inquiries from customers. Conventionally, the response to inquiries largely depends on customer support staff, which may lead to longer response times and variations in the quality of answers. In addition, in self-service with only text and diagrams, it is difficult for users to understand the information, and inquiries may occur again. There is a need to solve such problems.

Means for Solving the Problems

[0005] This invention provides a system that receives inquiries from users, analyzes their content to identify frequently asked questions, and generates optimal answers. Specifically, it analyzes the inquiry content using natural language processing technology, identifies frequently asked questions, and generates optimal answers using generative AI. It then converts these answers into speech using speech synthesis technology and automatically generates a video that integrates the speech with visual information. Providing the generated video to the user makes it possible to give efficient, easy-to-understand answers through both sight and hearing, thereby reducing the number of inquiries.

[0006] An "information receiving means" is a component that has the function of acquiring inquiry data from users and transferring it to the system.

[0007] "Analysis means" refers to a component that uses natural language processing technology to analyze received inquiry data and has the function of identifying frequently occurring questions and user intentions from its content.

[0008] "Answer generation means" refers to components that utilize algorithms or AI technologies to generate the optimal answer based on the analyzed results.

[0009] The "video generation means" is a component that has the function of converting the generated response into visual and audio information and automatically generating video content to present to the user.

[0010] "Information provision means" refers to a component that has the function of delivering generated video images to the user in an appropriate format and conveying information through sight and hearing.

Brief Description of the Drawings

[0011] [Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0012] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0013] First, the terminology used in the following description is explained.

[0014] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0015] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0016] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0017] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0018] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. The same interpretation applies in this specification when three or more items are linked by "and/or."

[0019] [First Embodiment]

[0020] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0021] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0022] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0023] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0024] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0025] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0026] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0027] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0028] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0029] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0030] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0031] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0032] In embodiments of the present invention, a system is provided that efficiently and effectively responds to user inquiries. This system is mainly server-based and improves the user experience through the following processes.

[0033] Inquiry reception and analysis

[0034] First, the user enters an inquiry via an information terminal. The terminal receives this inquiry and sends the data to the server. The server feeds the received data into its analysis engine, which analyzes the inquiry using natural language processing technology. This allows the server to accurately understand the user's intent and topic, and determine if it is a frequently asked question.

[0035] Answer generation

[0036] For questions identified by the analysis engine, the server activates an answer generation engine. This engine refers to FAQs and knowledge bases recorded in an internal database to create the most appropriate answer. The generated answer is output as natural-sounding text.

[0037] Video generation

[0038] The server uses the generated text-based responses to activate a video generation module. This module uses speech synthesis technology to convert the responses into audio and then generates images and animations to visually represent the content. This results in a video that integrates the responses as audiovisual information.

[0039] Video provision and viewing

[0040] The server uploads the generated video to a hosting server and creates a link that the user can access. The terminal receives the video through this link and presents it to the user. By watching the video, the user receives the explanation through both sight and sound, allowing for a deeper understanding of the content.

[0041] Specific example

[0042] For example, if a user asks, "Please tell me how to return a product," the server will identify the FAQ related to "return procedures" and generate a detailed answer outlining those procedures. Then, based on this answer, it will generate a video visualizing the return process and provide it to the user. By watching the video, users can more easily understand the details of the return procedure.
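The FAQ identification in the example above can be illustrated with a minimal sketch. The disclosure does not specify a matching method, so the following assumes a simple word-overlap (Jaccard) score; the FAQ table, the function names `tokens` and `match_faq`, and the topic descriptions are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical FAQ topics and descriptions; the disclosure does not list the database contents.
FAQ = {
    "return procedures": "how to return a product and get a refund",
    "order history": "how to check my order history",
    "shipping status": "where is my package and when will it arrive",
}

def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def match_faq(inquiry: str) -> Optional[str]:
    """Return the FAQ topic with the highest Jaccard overlap, or None if nothing overlaps."""
    query = tokens(inquiry)
    best_topic, best_score = None, 0.0
    for topic, description in FAQ.items():
        candidate = tokens(description)
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

print(match_faq("Please tell me how to return a product"))  # return procedures
```

A production system would more likely rely on the natural language processing engine described in this disclosure than on raw word overlap; the sketch only shows where the matching decision sits.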

[0043] Thus, the present invention aims to provide users with quick and accurate information in response to their inquiries, thereby improving the user experience.

[0044] The following describes the processing flow.

[0045] Step 1:

[0046] The user uses an information terminal to enter text into a customer service inquiry form and submit it. The terminal receives this inquiry text.

[0047] Step 2:

[0048] The terminal sends the received query text to the server. The server receives this data and passes it to the natural language processing engine.

[0049] Step 3:

[0050] The server uses a natural language processing engine to analyze the query. Specifically, it tokenizes the text, tags each word with its part of speech, and determines the user's intent. This helps identify whether the question is frequently asked.

[0051] Step 4:

[0052] The server uses the analyzed results to cross-reference them with the FAQ database and find relevant answers. The answer generation engine then uses this information to generate answers in natural language.

[0053] Step 5:

[0054] The server sends the generated text responses to the video generation module. The module uses a speech synthesis engine to convert the responses into speech and then creates images and animations to add visual information. This process generates a video that integrates audio and visuals.

[0055] Step 6:

[0056] The server uploads the generated video to the video hosting platform and generates an access link for it.

[0057] Step 7:

[0058] The server sends this access link to the terminal. The terminal then presents the link to the user, allowing the user to access the video.

[0059] Step 8:

[0060] The user accesses the link through the terminal and obtains the answer to the inquiry by watching the provided video.

[0061] (Example 1)

[0062] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0063] The problem that this invention aims to solve is to improve the user experience by generating quick and effective responses to user inquiries and providing them in an integrated format of visual and auditory information.

[0064] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0065] In this invention, the server includes data receiving means for receiving requests from users, analysis means for analyzing the received requests and identifying the intent of the utterance, answer generating means for collecting information based on the analysis results and generating an optimal solution, and media generating means for generating multimedia including visual and auditory information based on the generated solution. This enables users to gain a detailed understanding of their inquiries through audiovisual information.

[0066] "Data receiving means" refers to a device or process that has the function of receiving requests from users and transmitting them to a server.

[0067] "Analysis means" refers to technology that analyzes received requests, identifies the intent of the utterance from its content, and determines whether it is a frequently asked question.

[0068] "Answer generation means" refers to a technology or device that has the function of generating the optimal solution by referring to an internally stored database based on analyzed information.

[0069] "Media generation means" refers to a device or process that has the function of generating multimedia content that integrates visual and auditory information based on the generated solution.

[0070] "Information provision means" refers to technology or devices that have the function of transmitting generated multimedia to users and providing it in a viewable format.

[0071] This invention is an information provision system for responding quickly and appropriately to user inquiries. This system is mainly composed of a server, and a specific embodiment thereof is shown below.

[0072] Each user can enter inquiries using a dedicated application or web browser on the terminal. The terminal converts the entered inquiries into data packets and sends them to the server. Standard internet connections and protocols are used for this data transfer.

[0073] The server uses natural language processing techniques to analyze the received data packets. These analysis methods include, for example, tokenization, part-of-speech tagging, and dependency analysis. This allows the server to accurately grasp the user's intent and identify frequently occurring questions.
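As a rough illustration of the analysis described above, the following sketch covers only tokenization and the counting of recurring intents. Part-of-speech tagging and dependency analysis would normally come from an NLP library and are omitted here. The keyword table and the names `tokenize`, `detect_intent`, and `frequent_intents` are assumptions made for this example, not part of the disclosure.

```python
import re
from collections import Counter

# Hypothetical keyword-to-intent table; a real system would use a trained model.
INTENT_KEYWORDS = {
    "return": "return_procedure",
    "refund": "return_procedure",
    "order": "order_history",
    "password": "account_access",
}

def tokenize(text: str) -> list[str]:
    """Split an inquiry into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def detect_intent(text: str) -> str:
    """Return the intent of the first token found in the keyword table."""
    for token in tokenize(text):
        if token in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[token]
    return "unknown"

def frequent_intents(inquiries: list[str], min_count: int = 2) -> list[str]:
    """Return intents seen at least min_count times, most frequent first."""
    counts = Counter(detect_intent(q) for q in inquiries)
    return [intent for intent, n in counts.most_common()
            if n >= min_count and intent != "unknown"]

log = [
    "How do I return a product?",
    "I want a refund for my shoes",
    "Where can I see my order history?",
    "What is your phone number?",
]
print(frequent_intents(log))  # ['return_procedure']
```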

[0074] Based on the analyzed data, the server generates the optimal answer using the answer generation means. At this stage, FAQs and knowledge bases in the internal database are referenced. A generative AI model is used to form the optimal prompt sentence for the user's question, resulting in a natural and accurate answer.

[0075] Subsequently, the server operates the media generation means based on the generated response to create multimedia content that integrates audio and visual information. Speech synthesis technology is used to convert the response text into speech, and related images and animations are generated. Third-party media generation software and libraries may be used in this process.

[0076] The generated multimedia is provided to the user by the information provision means. The video is uploaded to a hosting server, and an access link is sent to the user's terminal. The user uses this link to watch the video and take in the information visually and aurally.

[0077] A concrete example is an inquiry asking, "Please tell me how to return a product." In this case, the server identifies FAQs regarding the return procedure and generates an answer that includes specific steps. A video based on that answer is then created and provided to the user, making it easier to understand the return process in detail.

[0078] Example prompt: "Please provide a video guide explaining the product return process in detail."
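How a prompt sentence might be assembled from the analysis result and the referenced FAQ text can be sketched as follows. The template wording, the function name `build_prompt`, and the sample FAQ sentence are hypothetical; the disclosure gives only the example prompt above and does not define a template.

```python
def build_prompt(inquiry: str, intent: str, faq_entries: list[str]) -> str:
    """Assemble a prompt from the user's inquiry, the detected intent, and FAQ text."""
    # Fall back to a placeholder line when no FAQ entry was found.
    context = "\n".join(f"- {entry}" for entry in faq_entries) or "- (no matching FAQ entry)"
    return (
        "You are a customer support assistant.\n"
        f"Detected intent: {intent}\n"
        "Reference material:\n"
        f"{context}\n"
        f"Customer inquiry: {inquiry}\n"
        "Write a step-by-step answer that can be narrated in a short video."
    )

prompt = build_prompt(
    "Please tell me how to return a product",
    "return_procedure",
    ["Returns are requested from the order page."],  # illustrative FAQ text only
)
print(prompt)
```

Keeping the reference material in the prompt is one way to ground the generated answer in the internal database rather than in the model's general knowledge.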

[0079] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0080] Step 1:

[0081] The user opens a dedicated application or web browser on the terminal and enters an inquiry.

[0082] The input information is converted into data packets by the terminal. These data packets are sent to the server via the internet. The input is string data, and the output is the transmitted data packets. The terminal provides a user interface (UI) to facilitate user input processing.

[0083] Step 2:

[0084] The server inputs the data packets received from the terminal into its analysis engine.

[0085] The analysis engine uses natural language processing technology to analyze user inquiries through processes such as tokenization, part-of-speech tagging, and dependency analysis.

[0086] The input is a data packet, and the output is parsed query data. The server performs this parsing step to determine the user's intent.

[0087] Step 3:

[0088] The server then activates the answer generation engine based on the analyzed data.

[0089] At this stage, data queries are performed to extract appropriate information from the internal database based on the analysis results. A generative AI model is used to form the most suitable prompt sentence based on the analysis results, and natural language generation of the response is performed based on that prompt sentence.

[0090] The input is analysis data, and the output is a text-formatted response. The server applies various algorithms to generate the optimal response.

[0091] Step 4:

[0092] The server uses the generated text-formatted response to start the media generation engine.

[0093] This engine utilizes speech synthesis technology to convert text into speech and generate related visual information. This includes image generation and animation creation. The input is a text-based response, and the output is a multimedia file integrating visual and audio information. Through this, the server can provide information to the user in a more intuitive way.

[0094] Step 5:

[0095] The server uploads the generated multimedia files to the hosting server.

[0096] After uploading, a link to the file is generated, and the link information is sent to the user's terminal.

[0097] The input is a multimedia file, and the output is an access link. The server uses hosting capabilities to make it easy for users to access the content.

[0098] Step 6:

[0099] The terminal uses the link received from the server to download and play the multimedia content.

[0100] The user views this content and understands the information related to the inquiry through visual and auditory means.

[0101] The input is an access link, and the output is the displayed video content. The terminal uses a viewing application to support the user's viewing of the content.
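The input and output stated for each step above can be summarized as a chain of typed functions. This is a sketch under stated assumptions: the JSON packet format, the stubbed intent rule, the placeholder file name, and the `videos.example.com` link are all hypothetical stand-ins for the analysis engine, answer generation engine, media generation engine, and hosting server.

```python
import json
from dataclasses import dataclass

@dataclass
class ParsedQuery:
    text: str
    intent: str

@dataclass
class MediaFile:
    name: str
    narration: str

def to_packet(inquiry: str) -> bytes:
    """Step 1: the terminal encodes the input string as a data packet."""
    return json.dumps({"inquiry": inquiry}).encode("utf-8")

def parse(packet: bytes) -> ParsedQuery:
    """Step 2: the server decodes the packet and determines the intent (stubbed)."""
    text = json.loads(packet.decode("utf-8"))["inquiry"]
    intent = "return_procedure" if "return" in text.lower() else "unknown"
    return ParsedQuery(text=text, intent=intent)

def generate_answer(query: ParsedQuery) -> str:
    """Step 3: produce a text response (stub in place of a generative model)."""
    topic = query.intent.replace("_", " ")
    return f"Here is how to handle your request about {topic}."

def generate_media(answer: str) -> MediaFile:
    """Step 4: stand-in for speech synthesis and video generation."""
    return MediaFile(name="answer.mp4", narration=answer)

def upload(media: MediaFile) -> str:
    """Step 5: stand-in for uploading; returns a placeholder access link."""
    return f"https://videos.example.com/{media.name}"

def handle(inquiry: str) -> str:
    """Run steps 1 to 5 in order and return the link the terminal opens in step 6."""
    return upload(generate_media(generate_answer(parse(to_packet(inquiry)))))

print(handle("Please tell me how to return a product"))
```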

[0102] (Application Example 1)

[0103] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0104] In modern information processing systems, there is a need for methods to respond quickly and accurately to diverse user inquiries. However, conventional systems often struggled to accurately grasp user intent and provide appropriate information. Furthermore, providing answers solely as text may not allow users to fully understand the information. Therefore, there is an urgent need to develop new information processing methods to improve the user experience.

[0105] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0106] In this invention, the server includes data receiving means for receiving inquiries from users, data analysis means for analyzing the received inquiries and identifying frequently occurring questions, information generation means for generating optimal answers based on the analysis results, multimedia generation means for integrating visual and audio information based on the generated answers, and data distribution means for providing the generated multimedia content to the user. This makes it possible to quickly provide information combining visual and audio in response to user inquiries.

[0107] "Data receiving means" refers to devices or software used to receive inquiries sent from users.

[0108] "Data analysis means" refers to processes and systems for analyzing received inquiries to identify frequently occurring questions and important information.

[0109] "Information generation means" refers to devices or software used to create appropriate answers based on analyzed data.

[0110] A "multimedia generation means" refers to a device or software that combines generated answers with visual and audio information to create integrated multimedia content.

[0111] "Data distribution means" refers to communication methods and platforms for providing generated multimedia content to users.

[0112] The invention will now be described in terms of embodiments for carrying out the invention. This invention relates to an information processing system for responding quickly and accurately to user inquiries. The operation of the system will be described in detail below.

[0113] The server first receives inquiries sent from user terminals using the data receiving means. The received inquiries are then analyzed by the data analysis means within the server. One example of software that can be used here is Google Cloud's Dialogflow. This allows the inquiry content to be converted into structured data, enabling the identification of frequently occurring questions.

[0114] Based on the analysis results, the server uses the information generation means to generate appropriate responses. Generative AI models such as GPT-3 (registered trademark) can be used at this stage. The generated text-based responses are then converted by the multimedia generation means into multimedia content integrating visual and audio information. Software such as Amazon Polly can be used for speech synthesis, and FFmpeg can be used for video generation.
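One way the video generation step could invoke FFmpeg is sketched below. It assumes the narration has already been synthesized (for example by Amazon Polly) and saved as an audio file, and that a single still image serves as the visual. The file names and the function name `build_ffmpeg_command` are hypothetical, and the chosen codecs assume an FFmpeg build that includes libx264 and AAC support.

```python
import shlex

def build_ffmpeg_command(image_path: str, audio_path: str, output_path: str) -> list[str]:
    """Build an FFmpeg argument list that pairs one still image with a narration track."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-loop", "1",            # repeat the single image as a video stream
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",       # encode video as H.264
        "-tune", "stillimage",   # encoder tuning for static content
        "-c:a", "aac",           # encode audio as AAC
        "-pix_fmt", "yuv420p",   # widely compatible pixel format
        "-shortest",             # stop when the shortest input (here, the audio) ends
        output_path,
    ]

cmd = build_ffmpeg_command("slide.png", "narration.mp3", "answer.mp4")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed; the sketch only constructs the command so that it can be inspected without running the encoder.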

[0115] The generated multimedia content is uploaded to a hosting service such as AWS® S3 by the data distribution means of the server. Users can access this content through their terminals and view the information via the provided links.

[0116] For example, if a user enters the prompt, "Please tell me how to check my order history," the system will generate a video demonstrating the appropriate steps, helping the user understand the process visually and audibly. In this way, users can easily obtain and understand the information in response to their inquiries.

[0117] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0118] Step 1:

[0119] The terminal receives a query from the user. The user enters the query details into the terminal's input interface, and that data is sent to the server. In this step, the input is the user's query data, and the output is the query information transferred to the server.

[0120] Step 2:

[0121] The server analyzes the received query using the data analysis means. Specifically, the server uses Google Cloud's Dialogflow to apply natural language processing to the query, identifying its intent and pinpointing frequently occurring questions. The input in this step is the user's query information, and the output is the analyzed intent and identified questions.

[0122] Step 3:

[0123] The server generates the optimal response using the information generation means based on the analyzed data. In this process, the server uses GPT-3 as the generative AI model to generate appropriate text responses. The input for this step is the analyzed intent and question, and the output is the generated text-based response.

[0124] Step 4:

[0125] The server uses the multimedia generation means to create multimedia content that integrates visual and audio information based on the generated text-based responses. Here, Amazon Polly is used for speech synthesis, and FFmpeg is used to generate video for visual representation. The input for this step is the generated text responses, and the output is multimedia content.

[0126] Step 5:

[0127] The server uploads the generated multimedia content to AWS S3 and provides the user with a video link. The user can access this link through the terminal and view the multimedia content. The input for this step is the multimedia content, and the output is the link shared with the user.
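The link produced in Step 5 above could be derived as in the following sketch, which only formats an object key and a virtual-hosted-style S3 URL and does not call any AWS API. The bucket name, region, key layout, and the functions `object_key` and `object_url` are hypothetical. Whether such a URL is actually reachable depends on the bucket's access settings, and a deployment might issue time-limited signed links instead.

```python
import hashlib
from urllib.parse import quote

def object_key(inquiry_id: str, answer_text: str) -> str:
    """Derive a stable object key from the inquiry ID and a hash of the answer text."""
    digest = hashlib.sha256(answer_text.encode("utf-8")).hexdigest()[:12]
    return f"answers/{inquiry_id}/{digest}.mp4"

def object_url(bucket: str, region: str, key: str) -> str:
    """Format a virtual-hosted-style S3 object URL, percent-encoding the key."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"

key = object_key("inq-001", "Open the order page and select the item to return.")
print(object_url("example-bucket", "ap-northeast-1", key))
```

Hashing the answer text means that identical answers map to the same key, so a video generated once for a frequently asked question could be reused rather than regenerated.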

[0128] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0129] One embodiment of the present invention is a system that generates and provides customized video images that take emotions into account in response to user inquiries. This system receives inquiries from users, recognizes and analyzes the emotions contained therein to generate an optimal response, and then generates video images based on that response.

[0130] Inquiry reception and sentiment analysis

[0131] The user enters and submits an inquiry via an information terminal. The terminal sends the inquiry to the server. Upon receiving the inquiry, the server first uses an emotion engine to recognize the user's emotions as expressed in the inquiry text. This emotion analysis accurately determines whether the user is angry, troubled, etc.
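As a simplified stand-in for the emotion engine, the following sketch assigns a coarse polarity to the inquiry text by counting cue words. The disclosure relies on the emotion identification model 59, whose internals are not described here, so the word lists and the function name `emotion_polarity` are purely illustrative assumptions.

```python
import re

# Hypothetical cue-word lists for illustration only.
NEGATIVE = {"angry", "broken", "disappointed", "frustrated", "terrible", "upset"}
POSITIVE = {"great", "happy", "love", "pleased", "thanks"}

def emotion_polarity(text: str) -> str:
    """Classify an inquiry as 'negative', 'positive', or 'neutral' by counting cue words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

print(emotion_polarity("I am frustrated, the product arrived broken"))  # negative
```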

[0132] Analysis of inquiry content and generation of responses

[0133] The server uses analytical tools to combine natural language processing techniques and sentiment analysis results to analyze the inquiry in detail. Based on this information, the server generates the most appropriate answer from the FAQ database. The answer generation takes sentiment information into account, and natural language generation is performed with a tone and content that matches the user's emotions.

[0134] Video generation

[0135] The generated response text is sent to the server's video generation system. Here, the speech synthesis engine is adjusted based on the emotion engine's analysis results to produce speech with a sound quality and speed appropriate to the user's emotions. Visual content is also generated taking emotions into consideration, and finally, a video is produced in which visual information and audio are integrated.
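The adjustment of the synthesized voice to the user's emotion can be sketched with SSML, a markup accepted by many speech synthesis engines. The emotion-to-prosody table and the function name `to_ssml` are assumptions made for this example; the disclosure does not state how the speech synthesis engine is parameterized.

```python
from xml.sax.saxutils import escape

# Hypothetical emotion-to-prosody table; the values are illustrative, not from the disclosure.
PROSODY = {
    "negative": {"rate": "slow", "volume": "soft"},
    "positive": {"rate": "medium", "volume": "medium"},
    "neutral": {"rate": "medium", "volume": "medium"},
}

def to_ssml(answer: str, emotion: str) -> str:
    """Wrap the answer text in an SSML prosody element chosen by the user's emotion."""
    settings = PROSODY.get(emotion, PROSODY["neutral"])
    rate, volume = settings["rate"], settings["volume"]
    # escape() protects &, <, and > so the answer text cannot break the markup.
    return (
        f'<speak><prosody rate="{rate}" volume="{volume}">'
        f"{escape(answer)}</prosody></speak>"
    )

print(to_ssml("Sorry & thanks for your patience.", "negative"))
```

Here a negative emotion selects a slower, softer delivery, which corresponds to the polite and considerate explanation described for dissatisfied users; the specific values would need tuning against the chosen voice.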

[0136] Video provision and utilization

[0137] The server uploads the generated video to an appropriate hosting platform and provides a link that the user can access immediately. The device presents this link to the user, who then accesses it to watch the video. For example, if it is determined that the user is dissatisfied with a product return, the server generates a video providing a polite and considerate explanation. This makes the user feel that their feelings have been accurately recognized and addressed appropriately, resulting in a better experience.

[0138] By implementing this invention, responses to user inquiries will be more personalized, and customized content that takes emotions into consideration will be provided. This is expected to improve customer satisfaction and reduce the number of inquiries.

[0139] The following describes the processing flow.

[0140] Step 1:

[0141] The user uses a terminal to enter and submit their inquiry. The terminal receives this inquiry text and sends it to the server.

[0142] Step 2:

[0143] The server passes the received query text to a natural language processing engine, which analyzes the text's content. Simultaneously, it activates an emotion engine to detect the user's emotions contained in the text and determine whether those emotions are positive or negative.
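
As one non-limiting illustration, the simultaneous content analysis and emotion detection of Step 2 can be sketched as below. The word lists, the function names `analyze_content`, `detect_polarity`, and `analyze_inquiry`, and the extra "neutral" outcome are assumptions made for this sketch; they stand in for the natural language processing engine and the emotion engine, whose internals the description does not specify.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical word lists; a real emotion engine would be far richer.
NEGATIVE_WORDS = {"angry", "broken", "late", "disappointed", "refund"}
POSITIVE_WORDS = {"thanks", "great", "happy", "love"}


def analyze_content(text: str) -> list[str]:
    """Stand-in for the natural language processing engine: lowercase tokens."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]


def detect_polarity(text: str) -> str:
    """Stand-in for the emotion engine: 'positive', 'negative' or 'neutral'."""
    words = set(analyze_content(text))
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def analyze_inquiry(text: str) -> dict:
    """Run content analysis and emotion detection concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        tokens_future = pool.submit(analyze_content, text)
        polarity_future = pool.submit(detect_polarity, text)
        return {
            "tokens": tokens_future.result(),
            "polarity": polarity_future.result(),
        }
```

Both results are collected into one dictionary so that the later answer generation step can consume the content and the emotion together.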

[0144] Step 3:

[0145] Based on the analyzed content and sentiment analysis results, the server searches the FAQ database for the most suitable answer. The answer generation engine generates answers naturally, taking into account the user's emotions in tone and content.
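
The FAQ search of Step 3 may, for example, be approximated as follows. This is a minimal sketch that assumes a token-overlap (Jaccard) match and a fixed tone prefix per emotion; the FAQ entries, `TONE_PREFIX`, and `best_answer` are hypothetical, and the description does not state which matching method the answer generation engine actually uses.

```python
import re

# Hypothetical FAQ entries; the description does not give the database contents.
FAQ = {
    "How do I return a product?": "Open the order page and select 'Return item'.",
    "When will my order arrive?": "Delivery dates are shown on the order page.",
}

# Hypothetical tone adjustments keyed by the detected polarity.
TONE_PREFIX = {
    "negative": "We are sorry for the trouble. ",
    "positive": "Thank you for reaching out. ",
}


def tokens(text: str) -> set[str]:
    """Lowercase word set used for the overlap comparison."""
    return set(re.findall(r"[a-z']+", text.lower()))


def best_answer(inquiry: str, polarity: str) -> str:
    """Pick the FAQ entry with the highest Jaccard overlap and adjust its tone."""
    query = tokens(inquiry)

    def jaccard(question: str) -> float:
        q = tokens(question)
        union = query | q
        return len(query & q) / len(union) if union else 0.0

    best_question = max(FAQ, key=jaccard)
    return TONE_PREFIX.get(polarity, "") + FAQ[best_question]
```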

[0146] Step 4:

[0147] The server sends the generated response text to the video generation module. Here, the speech synthesis engine adjusts the tone and pace of the voice according to the emotion, and generates related images and animations to create a video that combines audio and visuals.
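
One possible way to express the emotion-dependent tone and pace of Step 4 is to wrap the response text in SSML `prosody` markup before handing it to a speech synthesis engine. The sketch below assumes that the engine accepts SSML; the `PROSODY` mapping and the function `to_ssml` are hypothetical and are not taken from the description.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from detected emotion to SSML prosody settings.
PROSODY = {
    "negative": {"rate": "slow", "pitch": "low"},
    "positive": {"rate": "medium", "pitch": "high"},
    "neutral": {"rate": "medium", "pitch": "medium"},
}


def to_ssml(response_text: str, emotion: str) -> str:
    """Wrap the response text in SSML whose prosody reflects the emotion."""
    settings = PROSODY.get(emotion, PROSODY["neutral"])
    return (
        "<speak>"
        f'<prosody rate="{settings["rate"]}" pitch="{settings["pitch"]}">'
        f"{escape(response_text)}"
        "</prosody></speak>"
    )
```

Escaping the text keeps characters such as `&` from breaking the markup.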

[0148] Step 5:

[0149] The server uploads the generated video to a video hosting platform and generates a link that users can access.

[0150] Step 6:

[0151] The server sends this link to the terminal, which then provides the link to the user. The user clicks the link and views the content to receive the answer.

[0152] Step 7:

[0153] By watching the video, users understand the answer to their inquiry through visual and auditory means, and achieve the purpose of their inquiry.

[0154] (Example 2)

[0155] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0156] Traditional systems provided automated responses to user inquiries, often failing to adequately consider user emotions. This resulted in an inadequate user experience and decreased satisfaction. In particular, they failed to meet the need for customized responses that reflected user emotions.

[0157] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0158] In this invention, the server includes emotion analysis means for analyzing emotions, natural language processing means for generating optimal answers, and video generation means for generating video by adjusting sound quality and speed. This enables customized responses that take into account the user's emotions.

[0159] "Emotion analysis means" refers to means for analyzing inquiry text received from users and for recognizing and quantifying the user's emotions contained within the text.

[0160] "Natural language processing means" refers to means for generating the optimal response based on the user's inquiry and the analyzed emotion information. It aims to generate natural language with a tone and content that match the user's emotions.

[0161] "Video generation means" refers to means for generating a video by integrating audio and visual information based on the generated response text, with a sound quality and speed that match the user's emotions.

[0162] "Information provision means" refers to the means of uploading generated video images to a hosting platform and providing a link that users can access.

[0163] This system generates and provides customized videos that take emotions into account in response to user inquiries. Users can input inquiries via an information terminal and send them to the server. The terminal has the function of receiving the inquiry data and forwarding it to the server. The server has multiple means of processing that data.

[0164] First, the server uses sentiment analysis tools to analyze the user's emotions contained in the text data. This involves using sentiment analysis libraries (for example, "VADER" or other natural language processing tools). As a result of this analysis, the user's emotions can be quantified and that information can be obtained.

[0165] Next, the server uses natural language processing to generate the optimal response based on the inquiry content and sentiment information. In this process, a generative AI model (e.g., the "GPT" series) can be used to create a response with an appropriate tone and context that matches the user's emotions.

[0166] The generated responses are then sent to a video generation system. A speech synthesis engine (e.g., "Text-to-Speech" technology) is used to generate the audio. The server generates the audio, adjusting the sound quality and speed to reflect emotions. The responses also include appropriate visual content, which is then integrated and generated as a video.

[0167] Ultimately, the server uploads the generated video to a hosting platform and provides a link that the user can access immediately. The terminal is responsible for presenting this link to the user, who can then watch the video via the provided link. For example, if a user is "dissatisfied with a product return," the server can analyze their emotions and generate a video that provides an explanation in an appropriate tone. This system allows the user to feel that their emotions have been recognized and addressed appropriately. An example of a prompt might be, "Regarding your desire to have experienced better customer service." By inputting such prompts into the system, customized content that matches the user's intent is generated.

[0168] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0169] Step 1:

[0170] The user enters and submits their inquiry using an information terminal. The input is represented as text data, and the terminal's role is to send this data to the server. At this stage, the information and solutions the user is seeking are collected as input data.

[0171] Step 2:

[0172] The server transfers the text data received from the terminal to the sentiment analysis system. The sentiment analysis uses a natural language processing library to score keywords and emotional phrases within the text, extracting the emotions the user is experiencing. As a result of this analysis, the user's emotional state (e.g., dissatisfaction, joy) is quantified and output.
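
The scoring of keywords and emotional phrases in Step 2 can be illustrated with a small lexicon-based scorer. This sketch does not use VADER or any other named library; the `LEXICON` entries, their weights, and the function `score_emotions` are assumptions chosen only to show how an emotional state might be quantified.

```python
import re

# Hypothetical emotion lexicon: phrase -> (emotion label, weight).
LEXICON = {
    "disappointed": ("dissatisfaction", 1.0),
    "not working": ("dissatisfaction", 0.8),
    "refund": ("dissatisfaction", 0.5),
    "thank you": ("joy", 0.8),
    "love": ("joy", 1.0),
}


def score_emotions(text: str) -> dict[str, float]:
    """Return each detected emotion's share of the total lexicon weight."""
    normalized = re.sub(r"\s+", " ", text.lower())
    totals: dict[str, float] = {}
    for phrase, (label, weight) in LEXICON.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, normalized))
        if hits:
            totals[label] = totals.get(label, 0.0) + weight * hits
    total = sum(totals.values())
    if not total:
        return {}
    return {label: value / total for label, value in totals.items()}
```

An empty result means that no lexicon phrase was found, which a caller could treat as a neutral inquiry.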

[0173] Step 3:

[0174] The server passes the query content along with the sentiment analysis results to a natural language processing system. Here, a generative AI model is used to generate a contextually appropriate answer. The input consists of the analyzed sentiment data and the query content, while the output is the result of natural language generation tailored to the user's emotional tone. This process ensures that the user receives a properly customized response.
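
For illustration, the input to the generative AI model in Step 3 might be assembled as below. The tone hints, the prompt wording, and the function `build_prompt` are hypothetical; the sketch only shows how the inquiry content and the quantified emotion data could be combined into a single prompt, and it does not call any model.

```python
# Hypothetical tone hints keyed by emotion label; not taken from the description.
TONE_HINTS = {
    "dissatisfaction": "apologetic and reassuring",
    "joy": "warm and upbeat",
}
DEFAULT_TONE = "neutral and polite"


def build_prompt(inquiry: str, emotions: dict[str, float]) -> str:
    """Combine the inquiry and the quantified emotions into one prompt string."""
    dominant = max(emotions, key=emotions.get) if emotions else None
    tone = TONE_HINTS.get(dominant, DEFAULT_TONE)
    summary = ", ".join(
        f"{label}={value:.2f}" for label, value in sorted(emotions.items())
    ) or "none detected"
    return (
        f"Customer inquiry: {inquiry}\n"
        f"Detected emotions: {summary}\n"
        f"Answer the inquiry in a {tone} tone."
    )
```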

[0175] Step 4:

[0176] The server forwards the generated text response to the video generation system. Using a speech synthesis engine, the system adjusts the sound quality and reading speed based on the emotion analysis results, and converts the response into speech. Simultaneously, visual information is also generated, and finally, a video integrating audio and visuals is output. This prepares content that aligns with the user's emotional expression.

[0177] Step 5:

[0178] The server uploads the prepared video to a cloud-based hosting platform and generates a link to access it. The terminal provides this link to the user, who can then view the video via the link. This allows the user to receive a customized visual response to their inquiry.
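
The upload and link generation of Step 5 can be sketched with an in-memory stand-in for the hosting platform. The class `InMemoryHost`, its methods, and the example base URL are hypothetical; a real deployment would call the API of whichever cloud storage service is chosen, which the description leaves open.

```python
import hashlib


class InMemoryHost:
    """Hypothetical stand-in for a cloud-based hosting platform."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")
        self._files: dict[str, bytes] = {}

    def upload(self, data: bytes, extension: str = "mp4") -> str:
        """Store the video bytes and return a link that identifies them."""
        key = hashlib.sha256(data).hexdigest()[:16] + "." + extension
        self._files[key] = data
        return f"{self.base_url}/{key}"

    def fetch(self, link: str) -> bytes:
        """Resolve a previously issued link back to the stored bytes."""
        return self._files[link.rsplit("/", 1)[1]]
```

Deriving the key from the content hash means that uploading the same video twice yields the same link, which is one simple design choice among many.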

[0179] (Application Example 2)

[0180] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0181] Modern information delivery systems often fail to accurately grasp users' emotions and provide insufficient individualized responses. As a result, users become frustrated when their feelings are not properly recognized, leading to decreased satisfaction.

[0182] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0183] In this invention, the server includes information acquisition means for receiving communications from the user, emotion analysis means for analyzing the received communications to identify the user's emotions, and response generation means for generating an optimal answer based on the analyzed emotions. This makes it possible to provide appropriate video images that match the user's emotions.

[0184] "User communications" refers to data, including inquiries and requests, that users of an information system send through means of obtaining information.

[0185] "Information acquisition means" refers to a device or software that receives communications transmitted by a user and organizes them for further analysis.

[0186] "Emotional analysis means" refers to a function that processes received communication data to identify the user's emotions and feelings.

[0187] A "response generation means" is a process that automatically creates an optimized response based on analyzed emotional information.

[0188] "Video generation means" refers to a technology that creates videos by integrating visual and auditory information that corresponds to the user's emotions, based on the generated response.

[0189] "Information provision means" refers to the means of quickly distributing generated video images to users and making them viewable.

[0190] The system implementing this invention receives communication from a user, analyzes the user's emotions, generates an optimal response, and provides it as a moving image including visual and auditory information. The server first receives communication from the user using information acquisition means. This communication is often a text message sent via a smartphone or other information terminal. The received communication is analyzed by emotion analysis means to identify the user's emotions contained therein.

[0191] This emotion analysis utilizes emotion analysis engines such as Microsoft® Text Analytics API. Based on the analysis results, the server uses natural language processing techniques to analyze the content in detail and generate the optimal response. This response is then spoken in a tone appropriate to the emotion. Speech synthesis utilizes speech generation technologies such as Google Cloud Text-to-Speech.

[0192] Furthermore, a video is generated by a video generation means. Here, based on the generated response, visual and audio information is integrated using visual content creation software such as Adobe Premiere Pro. Finally, this video is delivered to the user by an information delivery means.

[0193] For example, when a user makes a product inquiry on an online shopping platform, this system generates and promptly provides an explanatory video with content and tone tailored to the user's emotions. This allows the user to feel that their emotions have been appropriately recognized and that they have received a quick and accurate response. An example of a prompt for the generative AI model would be, "Analyze the user's emotions and generate an explanatory video addressing their dissatisfaction with the delivery delay."

[0194] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0195] Step 1:

[0196] The user enters the inquiry using a terminal and sends it to the server. The input is primarily text data. This data is received by the server's information acquisition means.

[0197] Step 2:

[0198] The server passes the received text data to an emotion analysis tool to identify the user's emotions. This step uses an emotion analysis engine such as the Microsoft Text Analytics API. The input is text data, and the output is data indicating the user's emotions. Through the analysis, emotions such as anger, joy, and sadness are obtained as data.

[0199] Step 3:

[0200] The server generates an appropriate response using natural language processing techniques based on analyzed sentiment and text data. An FAQ database is used for this process. Input consists of text and sentiment data, and output is a response sentence best suited to the user's sentiment. Google Cloud Natural Language API and similar tools are used for response generation.

[0201] Step 4:

[0202] The server passes the response text to a speech synthesis system to generate audio data. This step uses speech generation technology such as Google Cloud Text-to-Speech. The input is the response text, and the output is audio data. The tone and speed of the voice are adjusted according to the user's emotions.

[0203] Step 5:

[0204] The server generates a video by integrating audio data and visual content using a video generation system. This process utilizes visual content generation software such as Adobe Premiere Pro or FFmpeg. The input consists of audio data and visual elements, while the output is video data that reflects the user's emotions.
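
Where FFmpeg is used in Step 5, the integration of audio data with a single still image could, for example, be driven by a command line such as the one built below. The function `build_ffmpeg_command` and the file names are hypothetical, and combining one still image with one audio track is an assumption made to keep the sketch short; the function only assembles the argument list and does not run FFmpeg.

```python
def build_ffmpeg_command(image_path: str, audio_path: str, output_path: str) -> list[str]:
    """Build an FFmpeg argument list that pairs a still image with an audio track."""
    return [
        "ffmpeg", "-y",                 # overwrite the output file if it exists
        "-loop", "1", "-i", image_path,  # repeat the still image as the video stream
        "-i", audio_path,               # synthesized speech as the audio stream
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",          # widely supported pixel format
        "-shortest",                    # stop when the audio ends
        output_path,
    ]
```

On a machine where FFmpeg is installed, the returned list can be passed to `subprocess.run(cmd, check=True)`.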

[0205] Step 6:

[0206] The server provides the generated video to the terminal via the information provision means, allowing the user to view the video. The input is video data, and the output is a link or a file reference provided to the user. The user can view the video by clicking the link.

[0207] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0208] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0209] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0210] [Second Embodiment]

[0211] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0212] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0213] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0214] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0215] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0216] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0217] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0218] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0219] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0220] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0221] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0222] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0223] In embodiments of the present invention, a system is provided that efficiently and effectively responds to user inquiries. This system is mainly server-based and improves the user experience through the following processes.

[0224] Inquiry reception and analysis

[0225] First, the user enters an inquiry via an information terminal. The terminal receives this inquiry and sends the data to the server. The server feeds the received data into its analysis engine, which analyzes the inquiry using natural language processing technology. This allows the server to accurately understand the user's intent and topic, and determine if it is a frequently asked question.

[0226] Answer generation

[0227] For questions identified by the analysis engine, the server activates an answer generation engine. This engine refers to FAQs and knowledge bases recorded in an internal database to create the most appropriate answer. The generated answer is output as natural-sounding text.

[0228] Video generation

[0229] The server uses the generated text-based responses to activate a video generation module. This module uses speech synthesis technology to convert the responses into audio and then generates images and animations to visually represent the content. This results in a video that integrates the responses as audiovisual information.

[0230] Video provision and viewing

[0231] The server uploads the generated video to the hosting server and creates a link that the user can access. The device receives the video through this link and provides it to the user. By watching the video, the user receives explanations through both sight and sound, allowing for a deeper understanding of the content.

[0232] Specific example

[0233] For example, if a user asks, "Please tell me how to return a product," the server will identify the FAQ related to "return procedures" and generate a detailed answer outlining those procedures. Then, based on this answer, it will generate a video visualizing the return process and provide it to the user. By watching the video, users can more easily understand the details of the return procedure.

[0234] Thus, the present invention aims to provide users with quick and accurate information in response to their inquiries, thereby improving the user experience.

[0235] The following describes the processing flow.

[0236] Step 1:

[0237] The user uses an information terminal to enter text into a customer service inquiry form and submit it. The terminal receives this inquiry text.

[0238] Step 2:

[0239] The terminal sends the received query text to the server. The server receives this data and passes it to the natural language processing engine.

[0240] Step 3:

[0241] The server uses a natural language processing engine to analyze the query. Specifically, it tokenizes the text, tags each word with its part of speech, and determines the user's intent. This helps identify whether the question is frequently asked.
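
The analysis in Step 3 can be sketched in a simplified form as below. Part-of-speech tagging is omitted here because it would require a trained tagger; the keyword table, the set `FREQUENT_INTENTS`, and the function names are hypothetical and serve only to show tokenization, intent determination, and the check for a frequently asked question.

```python
import re

# Hypothetical intent keywords; a production system would learn these.
INTENT_KEYWORDS = {
    "return_procedure": {"return", "refund", "exchange"},
    "delivery_status": {"delivery", "shipping", "arrive", "tracking"},
}
# Hypothetical set of intents already known to be frequently asked.
FREQUENT_INTENTS = {"return_procedure"}


def tokenize(text: str) -> list[str]:
    """Split the inquiry into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def classify(text: str) -> dict:
    """Determine the intent and whether it is a frequently asked question."""
    words = tokenize(text)
    word_set = set(words)
    scores = {
        intent: len(word_set & keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    intent = best if scores[best] > 0 else None
    return {"tokens": words, "intent": intent, "is_faq": intent in FREQUENT_INTENTS}
```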

[0242] Step 4:

[0243] The server uses the analyzed results to cross-reference them with the FAQ database and find relevant answers. The answer generation engine then uses this information to generate answers in natural language.

[0244] Step 5:

[0245] The server sends the generated text responses to the video generation module. The module uses a speech synthesis engine to convert the responses into speech and then creates images and animations to add visual information. This process generates a video that integrates audio and visuals.

[0246] Step 6:

[0247] The server uploads the generated video to the video hosting platform and generates an access link for it.

[0248] Step 7:

[0249] The server sends this access link to the terminal. The terminal then presents the link to the user, allowing the user to access the video.

[0250] Step 8:

[0251] Users access the link through their device and obtain answers to their questions by watching the provided video.

[0252] (Example 1)

[0253] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0254] The problem that this invention aims to solve is to improve the user experience by generating quick and effective responses to user inquiries and providing them in an integrated format of visual and auditory information.

[0255] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0256] In this invention, the server includes data receiving means for receiving requests from users, analysis means for analyzing the received requests and identifying the intent of the utterance, answer generating means for collecting information based on the analysis results and generating an optimal solution, and media generating means for generating multimedia including visual and auditory information based on the generated solution. This enables users to gain a detailed understanding of their inquiries through audiovisual information.

[0257] "Data receiving means" refers to a device or process that has the function of receiving requests from users and transmitting them to a server.

[0258] "Analysis means" refers to technology that analyzes received requests, identifies the intent of the utterance from its content, and determines whether it is a frequently asked question.

[0259] "Answer generation means" refers to a technology or device that has the function of generating the optimal solution by referring to an internally stored database based on analyzed information.

[0260] "Media generation means" refers to a device or process that has the function of generating multimedia content that integrates visual and auditory information based on the generated solution.

[0261] "Information provision means" refers to technology or devices that have the function of transmitting generated multimedia to users and providing it in a viewable format.

[0262] This invention is an information provision system for responding quickly and appropriately to user inquiries. This system is mainly composed of a server, and a specific embodiment thereof is shown below.

[0263] Each user can enter inquiries using a dedicated application or web browser on their device. The device converts the entered inquiries into data packets and sends them to the server. Standard internet connections and protocols are used for this data transmission.

[0264] The server uses natural language processing techniques to analyze the received data packets. These analysis methods include, for example, tokenization, part-of-speech tagging, and dependency analysis. This allows the server to accurately grasp the user's intent and identify frequently occurring questions.

[0265] Based on the analyzed data, the server generates the optimal answer using an answer generation mechanism. At this stage, FAQs and knowledge bases in the internal database are referenced. A generative AI model is used to form the optimal prompt sentence for the user's question, and a natural and accurate answer is generated from that prompt sentence.

[0266] Subsequently, the server operates media generation tools based on the generated responses to create multimedia content that integrates audio and visual information. Speech synthesis technology is used to convert the response text into speech and generate related images and animations. Third-party media generation software and libraries may be used in this process.

[0267] The generated multimedia is provided to the user through an information delivery method. The video is uploaded to a hosting server, and an access link is sent to the user's device. The user uses this link to watch the video and utilize the information obtained visually and aurally.

[0268] A concrete example is an inquiry asking, "Please tell me how to return a product." In this case, the server identifies FAQs regarding the return procedure and generates an answer that includes specific steps. A video based on that answer is then created and provided to the user, making it easier to understand the return process in detail.

[0269] Example prompt: "Please provide a video guide explaining the product return process in detail."

[0270] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0271] Step 1:

[0272] The user opens a dedicated application or web browser on their device and enters their inquiry.

[0273] The input information is converted into data packets by the terminal. These data packets are sent to the server via the internet. The input is string data, and the output is the transmitted data packets. The terminal provides a user interface (UI) to facilitate user input processing.
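
As a non-limiting sketch of Step 1, the terminal might serialize the entered string into a UTF-8 JSON payload before transmission. The field names `user_id`, `text`, and `sent_at` and the two functions are assumptions; the description does not define the packet format.

```python
import json
from datetime import datetime, timezone


def encode_inquiry(user_id: str, text: str) -> bytes:
    """Serialize an inquiry into bytes suitable for sending to the server."""
    payload = {
        "user_id": user_id,
        "text": text,
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def decode_inquiry(packet: bytes) -> dict:
    """Server-side counterpart: recover the inquiry fields from the bytes."""
    return json.loads(packet.decode("utf-8"))
```

Using `ensure_ascii=False` with UTF-8 encoding keeps non-ASCII inquiries, such as Japanese text, compact and intact across the round trip.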

[0274] Step 2:

[0275] The server inputs the data packets received from the terminal into its analysis engine.

[0276] The analysis engine uses natural language processing technology to analyze user inquiries through processes such as tokenization, part-of-speech tagging, and dependency analysis.

[0277] The input is a data packet, and the output is parsed query data. The server performs this parsing step to determine the user's intent.

[0278] Step 3:

[0279] The server then activates the answer generation engine based on the analyzed data.

[0280] At this stage, data queries are performed to extract appropriate information from the internal database based on the analysis results. A generative AI model is used to form the most suitable prompt sentence based on the analysis results, and natural language generation of the response is performed based on that prompt sentence.

[0281] The input is analysis data, and the output is a text-formatted response. The server applies various algorithms to generate the optimal response.
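
The database query and prompt formation of Step 3 can be illustrated with an in-memory SQLite table. The table layout, the sample rows, and the functions `create_faq_db` and `build_answer_prompt` are hypothetical; the sketch stops at producing the prompt sentence and does not call a generative AI model.

```python
import sqlite3


def create_faq_db() -> sqlite3.Connection:
    """Create a small in-memory FAQ table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE faq (intent TEXT PRIMARY KEY, answer TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO faq VALUES (?, ?)",
        [
            ("return_procedure", "Open the order page and select 'Return item'."),
            ("delivery_status", "Delivery dates are shown on the order page."),
        ],
    )
    return conn


def build_answer_prompt(conn: sqlite3.Connection, intent: str, inquiry: str) -> str:
    """Look up reference text for the intent and wrap it in a prompt sentence."""
    row = conn.execute(
        "SELECT answer FROM faq WHERE intent = ?", (intent,)
    ).fetchone()
    reference = row[0] if row else "No matching FAQ entry was found."
    return (
        f"Inquiry: {inquiry}\n"
        f"Reference: {reference}\n"
        "Write a clear, step-by-step answer based on the reference."
    )
```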

[0282] Step 4:

[0283] The server uses the generated text-formatted response to start the media generation engine.

[0284] This engine utilizes speech synthesis technology to convert text into speech and generate related visual information. This includes image generation and animation creation. The input is a text-formatted response, and the output is a multimedia file integrating visual and audio information. Through this, the server can provide information to users in a more intuitive manner.

[0285] Step 5:

[0286] The server uploads the generated multimedia file to the hosting server.

[0287] After uploading, it generates a link to the file and transmits the link information to the user's terminal.

[0288] The input is a multimedia file, and the output is an access link. The server uses the hosting function to enable users to easily access the content.

[0289] Step 6:

[0290] The terminal uses the link received from the server to download and play the multimedia content.

[0291] The user watches this content and understands the information related to the inquiry through vision and hearing.

[0292] The input is an access link, and the output is the video content to be displayed. The terminal uses a viewing application to help the user take in the content.

[0293] (Application Example 1)

[0294] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0295] In modern information processing systems, there is a need for methods to respond quickly and accurately to diverse user inquiries. However, conventional systems often struggled to accurately grasp user intent and provide appropriate information. Furthermore, providing answers solely as text may not allow users to fully understand the information. Therefore, there is an urgent need to develop new information processing methods to improve the user experience.

[0296] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0297] In this invention, the server includes data receiving means for receiving inquiries from users, data analysis means for analyzing the received inquiries and identifying frequently occurring questions, information generation means for generating optimal answers based on the analysis results, multimedia generation means for integrating visual and audio information based on the generated answers, and data distribution means for providing the generated multimedia content to the user. This makes it possible to quickly provide information combining visual and audio in response to user inquiries.

[0298] "Data receiving means" refers to devices or software used to receive inquiries sent from users.

[0299] "Data analysis means" refers to processes and systems for analyzing received inquiries to identify frequently occurring questions and important information.

[0300] "Information generation means" refers to devices or software used to create appropriate answers based on analyzed data.

[0301] A "multimedia generation means" refers to a device or software that combines generated answers with visual and audio information to create integrated multimedia content.

[0302] "Data distribution means" refers to communication methods and platforms for providing generated multimedia content to users.

[0303] The invention will now be described in terms of embodiments for carrying out the invention. This invention relates to an information processing system for responding quickly and accurately to user inquiries. The operation of the system will be described in detail below.

[0304] The server first receives inquiries sent from user terminals using the data receiving means. The received inquiries are then analyzed by the data analysis means within the server. One example of software that can be used here is Google Cloud's Dialogflow. This converts the inquiry content into structured data, enabling the identification of frequently occurring questions.

[0305] Based on the analysis results, the server uses the information generation means to generate appropriate responses. Generative AI models such as GPT-3 can be used at this stage. The generated text-based responses are then converted by the multimedia generation means into multimedia content integrating visual and audio information. Software such as Amazon Polly can be used for speech synthesis, and FFmpeg can be used for video generation.

[0306] The generated multimedia content is uploaded to a hosting service such as AWS S3 by the data distribution means on the server. Users can access this content from their devices via the provided links.

[0307] For example, if a user enters the prompt, "Please tell me how to check my order history," the system will generate a video demonstrating the appropriate steps, helping the user understand the process visually and audibly. In this way, users can easily obtain and understand the information in response to their inquiries.
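
As one possible illustration of the video generation mentioned above, the following sketch builds an FFmpeg command line that combines a single still image with a narration audio file into an MP4 video. The file names and the choice of a single still image are assumptions for this example. The text does not specify how the visual content is produced or which FFmpeg options are used.

```python
from pathlib import Path


def build_ffmpeg_command(image: Path, narration: Path, output: Path) -> list[str]:
    """Return an FFmpeg argument list that shows one image for the length of the audio."""
    return [
        "ffmpeg", "-y",                   # overwrite the output file if it exists
        "-loop", "1", "-i", str(image),   # repeat the still image as the video track
        "-i", str(narration),             # synthesized speech as the audio track
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",            # widely supported pixel format
        "-shortest",                      # stop when the audio ends
        str(output),
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed. The sketch only constructs the command and does not run it.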

[0308] The flow of the specific processing in Application Example 1 will be described using Figure 12.

[0309] Step 1:

[0310] The terminal receives an inquiry from the user. The user inputs the inquiry content into the input interface of the terminal, and the data is sent to the server. The input in this step is the user's inquiry data, and the output is the inquiry information transferred to the server.

[0311] Step 2:

[0312] The server analyzes the received inquiry using data analysis means. Specifically, the server uses Dialogflow of Google Cloud to perform natural language processing on the inquiry, identify the intention of the inquiry, and distinguish frequently asked questions. The input in this step is the user's inquiry information, and the output is the analyzed intention and identified questions.

[0313] Step 3:

[0314] The server generates an optimal answer using the information generation means based on the analyzed data. At this time, the server uses GPT-3 as the generative AI model to generate an appropriate text response. The input in this step is the analyzed intention and questions, and the output is the generated text-based answer.

[0315] Step 4:

[0316] The server creates multimedia content integrating visual information and audio information using multimedia generation means based on the generated text-based answer. Here, Amazon Polly is used for speech synthesis, and FFmpeg is used to generate a video for visual representation. The input in this step is the generated text answer, and the output is the multimedia content.

[0317] Step 5:

[0318] The server uploads the generated multimedia content to AWS S3 and provides the user with a video link. The user can access this link through their device and view the multimedia content. The input for this step is the multimedia content, and the output is the sharing of the link with the user.
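
The server-side flow of Steps 2 to 5 above can be sketched as a chain of interchangeable stages. The `Pipeline` class and its field names are hypothetical. Each stage is passed in as a plain function so that the sketch does not assume any particular call signature of Dialogflow, GPT-3, Amazon Polly, FFmpeg, or AWS S3.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains the server-side steps; each stage is supplied by the caller."""
    analyze: Callable[[str], str]          # Step 2: inquiry text -> intent
    generate_answer: Callable[[str], str]  # Step 3: intent -> answer text
    make_media: Callable[[str], bytes]     # Step 4: answer text -> media file
    publish: Callable[[bytes], str]        # Step 5: media file -> access link

    def handle(self, inquiry: str) -> str:
        """Run one inquiry through all stages and return the link."""
        intent = self.analyze(inquiry)
        answer = self.generate_answer(intent)
        media = self.make_media(answer)
        return self.publish(media)
```

Keeping the stages separate in this way would let any single tool be replaced without changing the rest of the flow.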

[0319] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0320] One embodiment of the present invention is a system that generates and provides customized videos that take emotions into account in response to user inquiries. This system receives inquiries from users, recognizes and analyzes the emotions contained therein to generate an optimal response, and then generates a video based on that response.

[0321] Inquiry reception and sentiment analysis

[0322] The user enters and submits an inquiry via an information terminal. The terminal sends the inquiry to the server. Upon receiving the inquiry, the server first uses an emotion engine to recognize the user's emotions as expressed in the inquiry text. This emotion analysis accurately determines whether the user is angry, troubled, etc.

[0323] Analysis of inquiry content and generation of responses

[0324] The server uses analytical tools to combine natural language processing techniques and sentiment analysis results to analyze the inquiry in detail. Based on this information, the server generates the most appropriate answer from the FAQ database. The answer generation takes sentiment information into account, and natural language generation is performed with a tone and content that matches the user's emotions.

[0325] Video generation

[0326] The generated response text is sent to the server's video generation system. Here, the speech synthesis engine is adjusted based on the emotion engine's analysis results to produce speech with a sound quality and speed appropriate to the user's emotions. Visual content is also generated taking emotions into consideration, and finally, a video is produced in which visual information and audio are integrated.

[0327] Video provision and utilization

[0328] The server uploads the generated video to an appropriate hosting platform and provides a link that the user can access immediately. The device presents this link to the user, who then accesses it to watch the video. For example, if it is determined that the user is dissatisfied with a product return, the server generates a video providing a polite and considerate explanation. This makes the user feel that their feelings have been accurately recognized and addressed appropriately, resulting in a better experience.

[0329] By implementing this invention, responses to user inquiries will be more personalized, and customized content that takes emotions into consideration will be provided. This is expected to improve customer satisfaction and reduce the number of inquiries.

[0330] The following describes the processing flow.

[0331] Step 1:

[0332] The user uses a terminal to enter and submit their inquiry. The terminal receives this inquiry text and sends it to the server.

[0333] Step 2:

[0334] The server passes the received query text to a natural language processing engine, which analyzes the text's content. Simultaneously, it activates an emotion engine to detect the user's emotions contained in the text and determine whether those emotions are positive or negative.

[0335] Step 3:

[0336] Based on the analyzed content and sentiment analysis results, the server searches the FAQ database for the most suitable answer. The answer generation engine generates answers naturally, taking into account the user's emotions in tone and content.

[0337] Step 4:

[0338] The server sends the generated response text to the video generation module. Here, the speech synthesis engine adjusts the tone and pace of the voice according to the emotion, and generates related images and animations to create a video that combines audio and visuals.

[0339] Step 5:

[0340] The server uploads the generated video to a video hosting platform and generates a link that users can access.

[0341] Step 6:

[0342] The server sends this link to the terminal, which then provides the link to the user. The user clicks the link and views the content to receive the answer.

[0343] Step 7:

[0344] By watching the video, users understand the answer to their inquiry through visual and auditory means, and achieve the purpose of their inquiry.
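
The tone adjustment in Step 3 above can be pictured with a small sketch that wraps a base answer in wording chosen by emotion label. The labels `negative` and `positive`, the phrases, and the function name `apply_tone` are assumptions for illustration. The text leaves the actual natural language generation to the answer generation engine.

```python
# Hypothetical opening and closing phrases keyed by emotion label.
OPENINGS = {
    "negative": "We apologize for the inconvenience. ",
    "positive": "Thank you for contacting us. ",
}
CLOSINGS = {
    "negative": " Please let us know if the problem continues.",
}


def apply_tone(answer: str, emotion: str) -> str:
    """Wrap the base answer in phrases matching the detected emotion."""
    return OPENINGS.get(emotion, "") + answer + CLOSINGS.get(emotion, "")
```

An unknown or neutral label leaves the answer unchanged, so the sketch degrades to the plain answer when no emotion is detected.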

[0345] (Example 2)

[0346] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0347] Traditional systems provided automated responses to user inquiries, often failing to adequately consider user emotions. This resulted in an inadequate user experience and decreased satisfaction. In particular, they failed to meet the need for customized responses that reflected user emotions.

[0348] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0349] In this invention, the server includes emotion analysis means for analyzing emotions, natural language processing means for generating optimal answers, and video generation means for generating video by adjusting sound quality and speed. This enables customized responses that take into account the user's emotions.

[0350] "Emotion analysis means" refers to a means of analyzing the inquiry text received from the user and recognizing and quantifying the user's emotions contained in the text.

[0351] "Natural language processing means" refers to a means of generating the optimal response based on the user's inquiry and the analyzed emotion information. It aims to generate natural language with a tone and content that match the user's emotions.

[0352] "Video generation means" refers to a means of generating a video by integrating audio and visual information based on the generated response text, with sound quality and speed that match the user's emotions.

[0353] "Information provision means" refers to the means of uploading generated video images to a hosting platform and providing a link that users can access.

[0354] This system generates and provides customized videos that reflect the user's emotions in response to user inquiries. Users can input inquiries via an information terminal and send them to the server. The terminal has the function of receiving the inquiry data and forwarding it to the server. The server has multiple means of processing that data.

[0355] First, the server uses sentiment analysis tools to analyze the user's emotions contained in the text data. This involves using sentiment analysis libraries (for example, "VADER" or other natural language processing tools). As a result of this analysis, the user's emotions can be quantified and that information can be obtained.

[0356] Next, the server uses natural language processing to generate the optimal response based on the inquiry content and sentiment information. In this process, a generative AI model (e.g., the "GPT" series) can be used to create a response with an appropriate tone and context that matches the user's emotions.

[0357] The generated responses are then sent to a video generation system. A speech synthesis engine (e.g., "Text-to-Speech" technology) is used to generate the audio. The server generates the audio, adjusting the sound quality and speed to reflect emotions. The responses also include appropriate visual content, which is then integrated and generated as a video.

[0358] Ultimately, the server uploads the generated video to a hosting platform and provides a link that the user can access immediately. The terminal is responsible for presenting this link to the user, who can then watch the video via the provided link. For example, if a user is "dissatisfied with a product return," the server can analyze their emotions and generate a video that provides an explanation in an appropriate tone. This system allows the user to feel that their emotions have been recognized and addressed appropriately. An example of a prompt might be, "Regarding your desire to have experienced better customer service." By inputting such prompts into the system, customized content that matches the user's intent is generated.
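
The quantification of emotions described above can be illustrated with a toy lexicon-based scorer. This is not VADER itself. The word list, the weights, the simple negation rule, and the scaling to the range -1 to 1 are all assumptions chosen only to show the idea of scoring emotional words in the inquiry text.

```python
import re

# Hypothetical lexicon: negative weights for complaints, positive for praise.
LEXICON = {
    "dissatisfied": -2.0, "angry": -3.0, "late": -1.0, "broken": -2.0,
    "thanks": 2.0, "great": 3.0, "happy": 2.0,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]; below zero suggests a negative emotion."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value  # "not happy" counts as negative
        total += value
    return max(-1.0, min(1.0, total / 4.0))
```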

[0359] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0360] Step 1:

[0361] The user enters and submits their inquiry using an information terminal. The input is represented as text data, and the terminal's role is to send this data to the server. At this stage, the information and solutions the user is seeking are collected as input data.

[0362] Step 2:

[0363] The server transfers the text data received from the terminal to the sentiment analysis system. The sentiment analysis uses a natural language processing library to score keywords and emotional phrases within the text, extracting the emotions the user is experiencing. As a result of this analysis, the user's emotional state (e.g., dissatisfaction, joy) is quantified and output.

[0364] Step 3:

[0365] The server passes the query content along with the sentiment analysis results to a natural language processing system. Here, a generative AI model is used to generate a contextually appropriate answer. The input consists of the analyzed sentiment data and the query content, while the output is the result of natural language generation tailored to the user's emotional tone. This process ensures that the user receives a properly customized response.

[0366] Step 4:

[0367] The server forwards the generated text response to the video generation system. Using a speech synthesis engine, the system adjusts the sound quality and reading speed based on the emotion analysis results, and converts the response into speech. Simultaneously, visual information is also generated, and finally, a video integrating audio and visuals is output. This prepares content that aligns with the user's emotional expression.

[0368] Step 5:

[0369] The server uploads the prepared video to a cloud-based hosting platform and generates a link to access it. The terminal provides this link to the user, who can then view the video via the link. This allows the user to receive a customized visual response to their inquiry.
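
The adjustment of reading speed and voice in Step 4 above could, for instance, be expressed in SSML, assuming the speech synthesis engine accepts SSML input. The thresholds and the mapping from sentiment score to `rate` and `pitch` values are assumptions for this sketch. The text does not state how the adjustment is parameterized.

```python
from xml.sax.saxutils import escape


def to_ssml(answer: str, score: float) -> str:
    """Wrap the answer in an SSML prosody element chosen from a sentiment score."""
    if score < -0.3:
        rate, pitch = "slow", "low"       # calmer delivery for a dissatisfied user
    elif score > 0.3:
        rate, pitch = "medium", "high"    # brighter delivery for a pleased user
    else:
        rate, pitch = "medium", "medium"
    # escape() protects characters such as & and < inside the answer text
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f"{escape(answer)}</prosody></speak>")
```

Support for individual prosody attributes differs between speech engines and voices, so the attribute set would need to be checked against the engine actually used.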

[0370] (Application Example 2)

[0371] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0372] Modern information delivery systems often fail to accurately grasp users' emotions and provide insufficient individualized support. As a result, users become frustrated when their feelings are not properly recognized, leading to decreased satisfaction.

[0373] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0374] In this invention, the server includes information acquisition means for receiving communications from the user, emotion analysis means for analyzing the received communications to identify the user's emotions, and response generation means for generating an optimal answer based on the analyzed emotions. This makes it possible to provide appropriate video images that match the user's emotions.

[0375] "User communications" refers to data, including inquiries and requests, that users of an information system send via the information acquisition means.

[0376] "Information acquisition means" refers to a device or software that receives communications transmitted by a user and organizes them for further analysis.

[0377] "Emotion analysis means" refers to a function that processes received communication data to identify the user's emotions and feelings.

[0378] A "response generation means" is a process that automatically creates an optimized response based on analyzed emotional information.

[0379] "Video generation means" refers to a technology that creates a video by integrating visual and auditory information that corresponds to the user's emotions, based on the generated response.

[0380] "Information provision means" refers to the means of quickly distributing generated videos to users and making them viewable.

[0381] The system implementing this invention receives communication from a user, analyzes the user's emotions, generates an optimal response, and provides it as a moving image including visual and auditory information. The server first receives communication from the user using information acquisition means. This communication is often a text message sent via a smartphone or other information terminal. The received communication is analyzed by emotion analysis means to identify the user's emotions contained therein.

[0382] This sentiment analysis utilizes sentiment analysis engines such as the Microsoft Text Analytics API. Based on the analysis results, the server uses natural language processing techniques to analyze the content in detail and generate the optimal response. This response is then spoken in a tone appropriate to the emotion. Speech synthesis uses speech generation technologies such as Google Cloud Text-to-Speech.

[0383] Furthermore, a video is generated by the video generation means. Here, based on the generated response, visual and audio information is integrated using visual content creation software such as Adobe Premiere Pro. Finally, this video is delivered to the user by the information provision means.

[0384] For example, when a user makes a product inquiry on an online shopping platform, this system generates and promptly provides an explanatory video with content and tone tailored to the user's emotions. This allows the user to feel that their emotions have been appropriately recognized and that they have received a quick and accurate response. An example of a prompt for the generative AI model would be, "Analyze the user's emotions and generate an explanatory video addressing their dissatisfaction with the delivery delay."

[0385] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0386] Step 1:

[0387] The user enters the inquiry using a terminal and sends it to the server. The input is primarily text data. This data is received by the server's information acquisition means.

[0388] Step 2:

[0389] The server passes the received text data to an emotion analysis tool to identify the user's emotions. This step uses an emotion analysis engine such as the Microsoft Text Analytics API. The input is text data, and the output is data indicating the user's emotions. Through the analysis, emotions such as anger, joy, and sadness are obtained as data.

[0390] Step 3:

[0391] The server generates an appropriate response using natural language processing techniques based on analyzed sentiment and text data. An FAQ database is used for this process. Input consists of text and sentiment data, and output is a response sentence best suited to the user's sentiment. Google Cloud Natural Language API and similar tools are used for response generation.

[0392] Step 4:

[0393] The server passes the response text to a speech synthesis system to generate audio data. This step uses speech generation technology such as Google Cloud Text-to-Speech. The input is the response text, and the output is audio data. The tone and speed of the voice are adjusted according to the user's emotions.

[0394] Step 5:

[0395] The server generates a video by integrating audio data and visual content using a video generation system. This process utilizes visual content generation software such as Adobe Premiere Pro or FFmpeg. The input consists of audio data and visual elements, while the output is video data that reflects the user's emotions.

[0396] Step 6:

[0397] The server provides the generated video to the terminal via the information provision means, allowing the user to view it. The input is video data, and the output is a link or a reference to the video file provided to the user. The user can view the video by clicking the link.
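
Step 2 above outputs data indicating the user's emotions. One way to turn such data into a single label for the later steps is sketched below. The input format (a mapping from emotion name to score), the threshold, and the fallback label `neutral` are assumptions. The sketch does not reproduce the response format of any particular emotion analysis service.

```python
def dominant_emotion(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Pick the highest-scoring emotion, or "neutral" if none is clear enough."""
    if not scores:
        return "neutral"
    label, value = max(scores.items(), key=lambda kv: kv[1])
    return label if value >= threshold else "neutral"
```

The threshold keeps a weak or ambiguous result from changing the tone of the response.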

[0398] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0399] The data generation model 58 is a so-called generative AI (Artificial Intelligence) model. Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
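
The pairing of a prompt containing instructions with inference data can be sketched as a simple request structure. The `InferenceRequest` class, its field names, and the dictionary payload are hypothetical. They do not correspond to the request format of ChatGPT, Gemini, or any other specific model.

```python
from dataclasses import dataclass, asdict


@dataclass
class InferenceRequest:
    """Instructions plus whichever kinds of inference data are available."""
    instruction: str        # the prompt containing instructions
    text: str = ""          # text data, if any
    audio_path: str = ""    # location of audio data, if any
    image_path: str = ""    # location of image data, if any

    def to_payload(self) -> dict:
        """Drop empty fields so only the supplied data is sent to the model."""
        return {k: v for k, v in asdict(self).items() if v}
```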

[0400] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0401] [Third Embodiment]

[0402] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0403] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0404] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0405] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0406] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0407] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0408] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0409] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0410] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0411] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0412] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0413] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0414] In embodiments of the present invention, a system is provided that efficiently and effectively responds to user inquiries. This system is mainly server-based and improves the user experience through the following processes.

[0415] Inquiry reception and analysis

[0416] First, the user enters an inquiry via an information terminal. The terminal receives this inquiry and sends the data to the server. The server feeds the received data into its analysis engine, which analyzes the inquiry using natural language processing technology. This allows the server to accurately understand the user's intent and topic, and determine if it is a frequently asked question.

[0417] Answer generation

[0418] For questions identified by the analysis engine, the server activates an answer generation engine. This engine refers to FAQs and knowledge bases recorded in an internal database to create the most appropriate answer. The generated answer is output as natural-sounding text.

[0419] Video generation

[0420] The server uses the generated text-based responses to activate a video generation module. This module uses speech synthesis technology to convert the responses into audio and then generates images and animations to visually represent the content. This results in a video that integrates the responses as audiovisual information.

[0421] Video provision and viewing

[0422] The server uploads the generated video to the hosting server and creates a link that the user can access. The device receives the video through this link and provides it to the user. By watching the video, the user receives explanations through both sight and sound, allowing for a deeper understanding of the content.

[0423] Specific example

[0424] For example, if a user asks, "Please tell me how to return a product," the server will identify the FAQ related to "return procedures" and generate a detailed answer outlining those procedures. Then, based on this answer, it will generate a video visualizing the return process and provide it to the user. By watching the video, users can more easily understand the details of the return procedure.

[0425] Thus, the present invention aims to provide users with quick and accurate information in response to their inquiries, thereby improving the user experience.

[0426] The following describes the processing flow.

[0427] Step 1:

[0428] The user uses an information terminal to enter text into a customer service inquiry form and submit it. The terminal receives this inquiry text.

[0429] Step 2:

[0430] The terminal sends the received query text to the server. The server receives this data and passes it to the natural language processing engine.

[0431] Step 3:

[0432] The server uses a natural language processing engine to analyze the query. Specifically, it tokenizes the text, tags each word with its part of speech, and determines the user's intent. This helps identify whether the question is frequently asked.

[0433] Step 4:

[0434] The server uses the analyzed results to cross-reference them with the FAQ database and find relevant answers. The answer generation engine then uses this information to generate answers in natural language.

[0435] Step 5:

[0436] The server sends the generated text responses to the video generation module. The module uses a speech synthesis engine to convert the responses into speech and then creates images and animations to add visual information. This process generates a video that integrates audio and visuals.

[0437] Step 6:

[0438] The server uploads the generated video to the video hosting platform and generates an access link for it.

[0439] Step 7:

[0440] The server sends this access link to the terminal. The terminal then presents the link to the user, allowing the user to access the video.

[0441] Step 8:

[0442] Users access the link through their device and obtain answers to their questions by watching the provided video.
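
Steps 3 and 4 above (tokenizing the inquiry and cross-referencing the FAQ database) can be illustrated with a minimal token-overlap match. The FAQ entries, the Jaccard similarity measure, and the `min_overlap` threshold are assumptions for this sketch. Part-of-speech tagging and intent determination are omitted, and the text does not specify the matching method.

```python
import re

# Hypothetical FAQ database: question -> answer text.
FAQ_DB = {
    "How do I return a product?": "Open the order details page and choose 'Return item'.",
    "How do I check my order history?": "Open your account page and select 'Order history'.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_faq(inquiry: str, min_overlap: float = 0.2):
    """Return the FAQ question most similar to the inquiry, or None."""
    query = tokenize(inquiry)
    best, best_score = None, 0.0
    for question in FAQ_DB:
        tokens = tokenize(question)
        union = query | tokens
        score = len(query & tokens) / len(union) if union else 0.0  # Jaccard
        if score > best_score:
            best, best_score = question, score
    return best if best_score >= min_overlap else None
```

A result of `None` would indicate that the inquiry is not a frequently asked question and needs a different handling path.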

[0443] (Example 1)

[0444] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0445] The problem that this invention aims to solve is to improve the user experience by generating quick and effective responses to user inquiries and providing them in an integrated format of visual and auditory information.

[0446] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0447] In this invention, the server includes data receiving means for receiving requests from users, analysis means for analyzing the received requests and identifying the intent of the utterance, answer generating means for collecting information based on the analysis results and generating an optimal solution, and media generating means for generating multimedia including visual and auditory information based on the generated solution. This enables users to gain a detailed understanding of their inquiries through audiovisual information.

[0448] "Data receiving means" refers to a device or process that has the function of receiving requests from users and transmitting them to a server.

[0449] "Analysis means" refers to technology that analyzes received requests, identifies the intent of the utterance from its content, and determines whether it is a frequently asked question.

[0450] "Answer generation means" refers to a technology or device that has the function of generating the optimal solution by referring to an internally stored database based on analyzed information.

[0451] "Media generation means" refers to a device or process that has the function of generating multimedia content that integrates visual and auditory information based on the generated solution.

[0452] "Information provision means" refers to technology or devices that have the function of transmitting generated multimedia to users and providing it in a viewable format.

[0453] This invention is an information provision system for responding quickly and appropriately to user inquiries. This system is mainly composed of a server, and a specific embodiment thereof is shown below.

[0454] Each user can enter inquiries using a dedicated application or web browser on their device. The device converts the entered inquiries into data packets and sends them to the server. Standard internet connections and protocols are used for this data reception.

[0455] The server uses natural language processing techniques to analyze the received data packets. These analysis methods include, for example, tokenization, part-of-speech tagging, and dependency analysis. This allows the server to accurately grasp the user's intent and identify frequently occurring questions.
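
As an illustration only, the analysis described in [0455] can be sketched as follows. The sketch covers tokenization and the counting of recurring inquiries; part-of-speech tagging and dependency analysis are omitted. The stop-word list, the function names, and the `min_count` threshold are assumptions made for this example and do not appear in the specification.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a production system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "how", "do", "i", "me", "my",
              "please", "tell", "is", "can", "of"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def question_key(text: str) -> frozenset[str]:
    """Reduce an inquiry to its set of content words (stop words removed)."""
    return frozenset(t for t in tokenize(text) if t not in STOP_WORDS)


def frequent_questions(inquiries: list[str], min_count: int = 2):
    """Return (content-word set, count) pairs seen at least min_count times."""
    counts = Counter(question_key(q) for q in inquiries)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


inquiries = [
    "Please tell me how to return a product.",
    "How do I return a product?",
    "How can I check my order history?",
]
# The two return inquiries collapse to the same content-word set.
print(frequent_questions(inquiries))
```

Grouping by content words is a deliberately crude stand-in for intent identification; it shows only how differently worded inquiries can be counted as the same frequently occurring question.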

[0456] Based on the analyzed data, the server generates the optimal answer using the answer generation means. At this stage, the FAQs and knowledge bases in the internal database are referenced. A generative AI model is used to form the optimal prompt sentence for the user's question, which yields a natural and accurate answer.
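
As an illustration only, the FAQ reference and prompt formation described in [0456] might look like the following sketch. The FAQ entries, the word-overlap (Jaccard) matching, and the prompt wording are assumptions made for this example; the specification does not state the database schema or the matching method, and the call to the generative AI model itself is left out.

```python
import re

# Hypothetical FAQ entries, for illustration only.
FAQ = {
    "How do I return a product?":
        "Request a return from your order page, pack the item, "
        "and ship it with the provided label.",
    "How do I check my order history?":
        "Open your account page and select the order history tab.",
}


def words(text: str) -> set[str]:
    """Return the set of lower-cased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_faq(inquiry: str, faq: dict[str, str]) -> tuple[str, str]:
    """Pick the FAQ entry whose question shares the most words (Jaccard)."""
    q = words(inquiry)

    def score(item: tuple[str, str]) -> float:
        w = words(item[0])
        union = q | w
        return len(q & w) / len(union) if union else 0.0

    return max(faq.items(), key=score)


def build_prompt(inquiry: str, faq: dict[str, str]) -> str:
    """Form a prompt that grounds the model in the matched FAQ answer."""
    question, answer = best_faq(inquiry, faq)
    return (
        f"User inquiry: {inquiry}\n"
        f"Relevant FAQ: {question}\n"
        f"FAQ answer: {answer}\n"
        "Write a clear, step-by-step answer to the inquiry using the FAQ answer."
    )


print(build_prompt("Please tell me how to return a product.", FAQ))
```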

[0457] Subsequently, the server operates media generation tools based on the generated responses to create multimedia content that integrates audio and visual information. Speech synthesis technology is used to convert the response text into speech and generate related images and animations. Third-party media generation software and libraries may be used in this process.

[0458] The generated multimedia is provided to the user by the information provision means. The video is uploaded to a hosting server, and an access link is sent to the user's device. The user follows this link to watch the video and take in the information both visually and aurally.

[0459] A concrete example is an inquiry asking, "Please tell me how to return a product." In this case, the server identifies FAQs regarding the return procedure and generates an answer that includes specific steps. A video based on that answer is then created and provided to the user, making it easier to understand the return process in detail.

[0460] Example prompt: "Please provide a video guide explaining the product return process in detail."

[0461] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0462] Step 1:

[0463] The user opens a dedicated application or web browser on their device and enters their inquiry.

[0464] The input information is converted into data packets by the terminal. These data packets are sent to the server via the internet. The input is string data, and the output is the transmitted data packets. The terminal provides a user interface (UI) to facilitate user input processing.

[0465] Step 2:

[0466] The server inputs the data packets received from the terminal into its analysis engine.

[0467] The analysis engine uses natural language processing technology to analyze user inquiries through processes such as tokenization, part-of-speech tagging, and dependency analysis.

[0468] The input is a data packet, and the output is parsed query data. The server performs this parsing step to determine the user's intent.

[0469] Step 3:

[0470] The server then activates the answer generation engine based on the analyzed data.

[0471] At this stage, data queries are performed to extract appropriate information from the internal database based on the analysis results. A generative AI model is used to form the most suitable prompt sentence based on the analysis results, and natural language generation of the response is performed based on that prompt sentence.

[0472] The input is analysis data, and the output is a text-formatted response. The server applies various algorithms to generate the optimal response.

[0473] Step 4:

[0474] The server uses the generated text-formatted response to start the media generation engine.

[0475] This engine utilizes speech synthesis technology to convert text into speech and generate related visual information. This includes image generation and animation creation. The input is a text-based response, and the output is a multimedia file integrating visual and audio information. Through this, the server can provide information to the user in a more intuitive way.

[0476] Step 5:

[0477] The server uploads the generated multimedia files to the hosting server.

[0478] After uploading, a link to the file is generated, and the link information is sent to the user's device.

[0479] The input is a multimedia file, and the output is an access link. The server uses hosting capabilities to make it easy for users to access the content.

[0480] Step 6:

[0481] The device uses a link received from the server to download and play multimedia content.

[0482] Users view this content and understand information related to their inquiry through visual and auditory means.

[0483] The input is an access link, and the output is the displayed video content. The device uses a viewing application to support the user's consumption of the content.
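
As an illustration only, the input and output of Steps 1 to 5 above can be chained as in the following sketch. Every function body is a placeholder: the analysis, answer generation, and media generation engines are replaced by trivial stand-ins, the hosting server is an in-memory dictionary, and the base URL `https://media.example.com/` is hypothetical. None of these names come from the specification.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class ParsedQuery:
    text: str
    tokens: list[str]


@dataclass
class MediaFile:
    name: str
    data: bytes


def to_packet(inquiry: str) -> bytes:
    """Step 1: string data in, data packet out."""
    return json.dumps({"inquiry": inquiry}).encode("utf-8")


def analyze(packet: bytes) -> ParsedQuery:
    """Step 2: data packet in, parsed query data out (placeholder analysis)."""
    text = json.loads(packet.decode("utf-8"))["inquiry"]
    return ParsedQuery(text=text, tokens=text.lower().split())


def generate_answer(query: ParsedQuery) -> str:
    """Step 3: analysis data in, text response out (placeholder generator)."""
    return f"Answer to: {query.text}"


def generate_media(answer: str) -> MediaFile:
    """Step 4: text response in, multimedia file out (placeholder bytes)."""
    data = answer.encode("utf-8")
    name = hashlib.sha256(data).hexdigest()[:16] + ".mp4"
    return MediaFile(name=name, data=data)


def upload(media: MediaFile, store: dict[str, bytes],
           base_url: str = "https://media.example.com/") -> str:
    """Step 5: multimedia file in, access link out (in-memory 'hosting')."""
    store[media.name] = media.data
    return base_url + media.name


def handle_inquiry(inquiry: str, store: dict[str, bytes]) -> str:
    """Run Steps 1 to 5 and return the access link sent to the device."""
    return upload(generate_media(generate_answer(analyze(to_packet(inquiry)))), store)


hosting: dict[str, bytes] = {}
print(handle_inquiry("How do I return a product?", hosting))
```

Keeping each step as a function with one input and one output mirrors the way the flow is described and lets any single stage be swapped for a real engine without touching the others.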

[0484] (Application Example 1)

[0485] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0486] In modern information processing systems, there is a need for methods to respond quickly and accurately to diverse user inquiries. However, conventional systems often struggle to accurately grasp user intent and provide appropriate information. Furthermore, answers provided solely as text may not allow users to fully understand the information. There is therefore an urgent need for new information processing methods that improve the user experience.

[0487] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0488] In this invention, the server includes data receiving means for receiving inquiries from users, data analysis means for analyzing the received inquiries and identifying frequently occurring questions, information generation means for generating optimal answers based on the analysis results, multimedia generation means for integrating visual and audio information based on the generated answers, and data distribution means for providing the generated multimedia content to the user. This makes it possible to quickly provide information combining visual and audio in response to user inquiries.

[0489] "Data receiving means" refers to devices or software used to receive inquiries sent from users.

[0490] "Data analysis means" refers to processes and systems for analyzing received inquiries to identify frequently occurring questions and important information.

[0491] "Information generation means" refers to devices or software used to create appropriate answers based on analyzed data.

[0492] A "multimedia generation means" refers to a device or software that combines generated answers with visual and audio information to create integrated multimedia content.

[0493] "Data distribution means" refers to communication methods and platforms for providing generated multimedia content to users.

[0494] The invention will now be described in terms of embodiments for carrying out the invention. This invention relates to an information processing system for responding quickly and accurately to user inquiries. The operation of the system will be described in detail below.

[0495] The server first receives queries sent from user terminals using data reception means. These received queries are then analyzed by data analysis means within the server. One example of software used here is Google Cloud's Dialogflow. This allows the query content to be converted into structured data, enabling the identification of frequently occurring questions.

[0496] Based on the analysis results, the server uses the information generation means to generate appropriate responses. Generative AI models such as GPT-3 can be used at this stage. The generated text-based responses are then converted by the multimedia generation means into multimedia content integrating visual and audio information. Software such as Amazon Polly can be used for speech synthesis, and FFmpeg can be used for video generation.
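
As an illustration only, one way FFmpeg could combine a synthesized audio file with a single still image into a video is sketched below. The specification does not say how FFmpeg is invoked; the file names are hypothetical, the choice of a single still image is an assumption, and the command assumes an FFmpeg build that includes the libx264 encoder. The sketch only builds the argument list and does not run FFmpeg.

```python
import shlex


def build_ffmpeg_command(image_path: str, audio_path: str,
                         output_path: str) -> list[str]:
    """Build an FFmpeg argument list: one looping image plus one audio track."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-loop", "1",            # repeat the still image as the video stream
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",       # H.264 video (requires libx264 in the build)
        "-tune", "stillimage",   # x264 tuning for static content
        "-c:a", "aac",           # AAC audio
        "-pix_fmt", "yuv420p",   # widely supported pixel format
        "-shortest",             # stop when the audio (shorter input) ends
        output_path,
    ]


cmd = build_ffmpeg_command("slide.png", "answer.mp3", "answer.mp4")
print(shlex.join(cmd))
```

To actually produce the file, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.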

[0497] The generated multimedia content is uploaded to a hosting service such as AWS S3 by the server's data distribution means. Users can access this content through their devices and view the information via the provided links.

[0498] For example, if a user enters the prompt, "Please tell me how to check my order history," the system will generate a video demonstrating the appropriate steps, helping the user understand the process visually and audibly. In this way, users can easily obtain and understand the information in response to their inquiries.

[0499] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0500] Step 1:

[0501] The terminal receives a query from the user. The user enters the query details into the terminal's input interface, and that data is sent to the server. In this step, the input is the user's query data, and the output is the query information transferred to the server.

[0502] Step 2:

[0503] The server analyzes the received query using data analysis tools. Specifically, the server uses Google Cloud's Dialogflow to process the query using natural language processing to identify the query's intent and pinpoint frequently occurring questions. The input in this step is the user's query information, and the output is the analyzed intent and identified questions.

[0504] Step 3:

[0505] The server generates the optimal response using the information generation means based on the analyzed data. In this process, the server uses GPT-3 as the generative AI model to generate appropriate text responses. The input for this step is the analyzed intent and question, and the output is the generated text-based response.

[0506] Step 4:

[0507] The server uses multimedia generation tools to create multimedia content that integrates visual and audio information based on the generated text-based responses. Here, Amazon Polly is used for speech synthesis, and FFmpeg is used to generate video for visual representation. The input for this step is the generated text responses, and the output is multimedia content.

[0508] Step 5:

[0509] The server uploads the generated multimedia content to AWS S3 and provides the user with a video link. The user can access this link through their device and view the multimedia content. The input for this step is the multimedia content, and the output is the sharing of the link with the user.

[0510] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0511] One embodiment of the present invention is a system that generates and provides customized video images that take emotions into account in response to user inquiries. This system receives inquiries from users, recognizes and analyzes the emotions contained therein to generate an optimal response, and then generates video images based on that response.

[0512] Inquiry reception and sentiment analysis

[0513] The user enters and submits an inquiry via an information terminal. The terminal sends the inquiry to the server. Upon receiving the inquiry, the server first uses an emotion engine to recognize the user's emotions as expressed in the inquiry text. This emotion analysis determines, for example, whether the user is angry or troubled.

[0514] Analysis of inquiry content and generation of responses

[0515] The server uses analytical tools to combine natural language processing techniques and sentiment analysis results to analyze the inquiry in detail. Based on this information, the server generates the most appropriate answer from the FAQ database. The answer generation takes sentiment information into account, and natural language generation is performed with a tone and content that matches the user's emotions.
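
As an illustration only, taking the recognized emotion into account during answer generation could be done by adding a tone instruction to the prompt, as in the following sketch. The emotion labels, the tone instructions, and the prompt layout are assumptions made for this example; the specification does not define them, and the call to the generative AI model is omitted.

```python
# Hypothetical mapping from a recognized emotion to a tone instruction.
TONE_INSTRUCTIONS = {
    "angry": "Apologize briefly, stay calm, and give concrete next steps.",
    "troubled": "Be reassuring and explain each step plainly.",
    "neutral": "Be concise and friendly.",
}


def build_emotion_aware_prompt(inquiry: str, faq_answer: str, emotion: str) -> str:
    """Combine the inquiry, the FAQ answer, and a tone matched to the emotion."""
    # Unknown emotion labels fall back to the neutral tone.
    tone = TONE_INSTRUCTIONS.get(emotion, TONE_INSTRUCTIONS["neutral"])
    return "\n".join([
        f"User inquiry: {inquiry}",
        f"Detected emotion: {emotion}",
        f"FAQ answer: {faq_answer}",
        f"Tone: {tone}",
        "Rewrite the FAQ answer for this user in the requested tone.",
    ])


print(build_emotion_aware_prompt(
    "I want to return this.",
    "Request a return from your order page.",
    "angry",
))
```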

[0516] Video generation

[0517] The generated response text is sent to the server's video generation system. Here, the speech synthesis engine is adjusted based on the emotion engine's analysis results to produce speech with a sound quality and speed appropriate to the user's emotions. Visual content is also generated taking emotions into consideration, and finally, a video is produced in which visual information and audio are integrated.

[0518] Video provision and utilization

[0519] The server uploads the generated video to an appropriate hosting platform and provides a link that the user can access immediately. The device presents this link to the user, who then accesses it to watch the video. For example, if it is determined that the user is dissatisfied with a product return, the server generates a video providing a polite and considerate explanation. This makes the user feel that their feelings have been accurately recognized and addressed appropriately, resulting in a better experience.

[0520] By implementing this invention, responses to user inquiries will be more personalized, and customized content that takes emotions into consideration will be provided. This is expected to improve customer satisfaction and reduce the number of inquiries.

[0521] The following describes the processing flow.

[0522] Step 1:

[0523] The user uses a terminal to enter and submit their inquiry. The terminal receives this inquiry text and sends it to the server.

[0524] Step 2:

[0525] The server passes the received query text to a natural language processing engine, which analyzes the text's content. Simultaneously, it activates an emotion engine to detect the user's emotions contained in the text and determine whether those emotions are positive or negative.

[0526] Step 3:

[0527] Based on the analyzed content and sentiment analysis results, the server searches the FAQ database for the most suitable answer. The answer generation engine generates answers naturally, taking into account the user's emotions in tone and content.

[0528] Step 4:

[0529] The server sends the generated response text to the video generation module. Here, the speech synthesis engine adjusts the tone and pace of the voice according to the emotion, and generates related images and animations to create a video that combines audio and visuals.

[0530] Step 5:

[0531] The server uploads the generated video to a video hosting platform and generates a link that users can access.

[0532] Step 6:

[0533] The server sends this link to the terminal, which then provides the link to the user. The user clicks the link and views the content to receive the answer.

[0534] Step 7:

[0535] By watching the video, users understand the answer to their inquiry through visual and auditory means, and achieve the purpose of their inquiry.

[0536] (Example 2)

[0537] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0538] Traditional systems provided automated responses to user inquiries, often failing to adequately consider user emotions. This resulted in an inadequate user experience and decreased satisfaction. In particular, they failed to meet the need for customized responses that reflected user emotions.

[0539] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0540] In this invention, the server includes emotion analysis means for analyzing emotions, natural language processing means for generating optimal answers, and video generation means for generating video by adjusting sound quality and speed. This enables customized responses that take into account the user's emotions.

[0541] "Emotion analysis means" refers to a means of analyzing the inquiry text received from the user and recognizing and quantifying the user's emotions contained in the text.

[0542] "Natural language processing means" refers to a means of generating the optimal response based on the user's inquiry and the analyzed emotion information. It aims to generate natural language with a tone and content that match the user's emotions.

[0543] "Video generation means" refers to a means of generating a video by integrating audio and visual information based on the generated response text, with sound quality and speed that match the user's emotions.

[0544] "Information provision means" refers to the means of uploading generated video images to a hosting platform and providing a link that users can access.

[0545] This system generates and provides customized videos that take emotions into account in response to user inquiries. Users can input inquiries via an information terminal and send them to the server. The terminal has the function of receiving the inquiry data and forwarding it to the server. The server has multiple means of processing that data.

[0546] First, the server uses sentiment analysis tools to analyze the user's emotions contained in the text data. This involves using sentiment analysis libraries (for example, "VADER" or other natural language processing tools). As a result of this analysis, the user's emotions can be quantified and that information can be obtained.
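
As an illustration only, quantifying emotion from text can be sketched with a small word-score lexicon. This is not VADER and does not reproduce its rules; the lexicon, the normalization to the range -1 to 1, and the `threshold` value are all assumptions made for this example.

```python
import re

# Hypothetical word scores from -3 (very negative) to +3 (very positive).
LEXICON = {
    "angry": -3, "terrible": -3, "dissatisfied": -2, "late": -1, "problem": -1,
    "thanks": 2, "great": 3, "happy": 3,
}
MAX_ABS_SCORE = 3


def sentiment_score(text: str) -> float:
    """Average the scores of known words and scale the result to [-1, 1]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    if not hits:
        return 0.0  # no emotional words found
    return sum(hits) / (MAX_ABS_SCORE * len(hits))


def sentiment_label(score: float, threshold: float = 0.2) -> str:
    """Turn the numeric score into a coarse label."""
    if score <= -threshold:
        return "negative"
    if score >= threshold:
        return "positive"
    return "neutral"


for text in ("I am dissatisfied, my return is late.", "Where is my order?"):
    s = sentiment_score(text)
    print(text, s, sentiment_label(s))
```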

[0547] Next, the server uses natural language processing to generate the optimal response based on the inquiry content and sentiment information. In this process, a generative AI model (e.g., the "GPT" series) can be used to create a response with an appropriate tone and context that matches the user's emotions.

[0548] The generated responses are then sent to a video generation system. A speech synthesis engine (e.g., "Text-to-Speech" technology) is used to generate the audio. The server generates the audio, adjusting the sound quality and speed to reflect emotions. The responses also include appropriate visual content, which is then integrated and generated as a video.
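
As an illustration only, adjusting speed and pitch to the user's emotion can be expressed with the `prosody` element of SSML (Speech Synthesis Markup Language), which many speech synthesis engines accept. The specification does not state that SSML is used, and the mapping from emotion label to `rate` and `pitch` values below is an assumption made for this example; how an engine renders these values depends on the engine.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from an emotion label to SSML prosody settings.
PROSODY = {
    "negative": {"rate": "slow", "pitch": "low"},
    "neutral": {"rate": "medium", "pitch": "medium"},
    "positive": {"rate": "medium", "pitch": "high"},
}


def to_ssml(text: str, emotion: str) -> str:
    """Wrap the response text in SSML with prosody matched to the emotion."""
    p = PROSODY.get(emotion, PROSODY["neutral"])
    # escape() protects characters such as & and < inside the XML document.
    return (
        f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}">'
        f"{escape(text)}</prosody></speak>"
    )


print(to_ssml("Returns & refunds take three steps.", "negative"))
```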

[0549] Ultimately, the server uploads the generated video to a hosting platform and provides a link that the user can access immediately. The terminal is responsible for presenting this link to the user, who can then watch the video via the provided link. For example, if a user is "dissatisfied with a product return," the server can analyze their emotions and generate a video that provides an explanation in an appropriate tone. This system allows the user to feel that their emotions have been recognized and addressed appropriately. An example of a prompt might be, "Regarding your desire to have experienced better customer service." By inputting such prompts into the system, customized content that matches the user's intent is generated.

[0550] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0551] Step 1:

[0552] The user enters and submits their inquiry using an information terminal. The input is represented as text data, and the terminal's role is to send this data to the server. At this stage, the information and solutions the user is seeking are collected as input data.

[0553] Step 2:

[0554] The server transfers the text data received from the terminal to the sentiment analysis system. The sentiment analysis uses a natural language processing library to score keywords and emotional phrases within the text, extracting the emotions the user is experiencing. As a result of this analysis, the user's emotional state (e.g., dissatisfaction, joy) is quantified and output.

[0555] Step 3:

[0556] The server passes the query content along with the sentiment analysis results to a natural language processing system. Here, a generative AI model is used to generate a contextually appropriate answer. The input consists of the analyzed sentiment data and the query content, while the output is the result of natural language generation tailored to the user's emotional tone. This process ensures that the user receives a properly customized response.

[0557] Step 4:

[0558] The server forwards the generated text response to the video generation system. Using a speech synthesis engine, the system adjusts the sound quality and reading speed based on the emotion analysis results, and converts the response into speech. Simultaneously, visual information is also generated, and finally, a video integrating audio and visuals is output. This prepares content that aligns with the user's emotional expression.

[0559] Step 5:

[0560] The server uploads the prepared video to a cloud-based hosting platform and generates a link to access it. The terminal provides this link to the user, who can then view the video via the link. This allows the user to receive a customized visual response to their inquiry.

[0561] (Application Example 2)

[0562] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0563] Modern information delivery systems often fail to accurately grasp users' emotions and provide insufficient individualized support. As a result, users become frustrated when their feelings are not properly recognized, leading to decreased satisfaction.

[0564] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0565] In this invention, the server includes information acquisition means for receiving communications from the user, emotion analysis means for analyzing the received communications to identify the user's emotions, and response generation means for generating an optimal answer based on the analyzed emotions. This makes it possible to provide appropriate video images that match the user's emotions.

[0566] "User communications" refers to data, including inquiries and requests, that users of an information system send through means of obtaining information.

[0567] "Information acquisition means" refers to a device or software that receives communications transmitted by a user and organizes them for further analysis.

[0568] "Emotional analysis means" refers to a function that processes received communication data to identify the user's emotions and feelings.

[0569] A "response generation means" is a process that automatically creates an optimized response based on analyzed emotional information.

[0570] "Video generation means" refers to a technology that creates videos by integrating visual and auditory information that corresponds to the user's emotions, based on the generated response.

[0571] "Information provision means" refers to the means of quickly distributing generated video images to users and making them viewable.

[0572] The system implementing this invention receives communication from a user, analyzes the user's emotions, generates an optimal response, and provides it as a moving image including visual and auditory information. The server first receives communication from the user using information acquisition means. This communication is often a text message sent via a smartphone or other information terminal. The received communication is analyzed by emotion analysis means to identify the user's emotions contained therein.

[0573] This sentiment analysis utilizes sentiment analysis engines such as the Microsoft Text Analytics API. Based on the analysis results, the server uses natural language processing techniques to analyze the content in detail and generate the optimal response. This response is then spoken in a tone appropriate to the emotion. Speech synthesis uses speech generation technologies such as Google Cloud Text-to-Speech.

[0574] Furthermore, a video is generated by a video generation means. Here, based on the generated response, visual and audio information is integrated using visual content creation software such as Adobe Premiere Pro. Finally, this video is delivered to the user by an information delivery means.

[0575] For example, when a user makes a product inquiry on an online shopping platform, this system generates and promptly provides an explanatory video with content and tone tailored to the user's emotions. This allows the user to feel that their emotions have been appropriately recognized and that they have received a quick and accurate response. An example of a prompt for the generative AI model would be, "Analyze the user's emotions and generate an explanatory video addressing their dissatisfaction with the delivery delay."

[0576] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0577] Step 1:

[0578] The user enters the inquiry using a terminal and sends it to the server. The input is primarily text data. This data is received by the server's information acquisition means.

[0579] Step 2:

[0580] The server passes the received text data to an emotion analysis tool to identify the user's emotions. This step uses an emotion analysis engine such as the Microsoft Text Analytics API. The input is text data, and the output is data indicating the user's emotions. Through the analysis, emotions such as anger, joy, and sadness are obtained as data.

[0581] Step 3:

[0582] The server generates an appropriate response using natural language processing techniques based on analyzed sentiment and text data. An FAQ database is used for this process. Input consists of text and sentiment data, and output is a response sentence best suited to the user's sentiment. Google Cloud Natural Language API and similar tools are used for response generation.

[0583] Step 4:

[0584] The server passes the response text to a speech synthesis system to generate audio data. This step uses speech generation technology such as Google Cloud Text-to-Speech. The input is the response text, and the output is audio data. The tone and speed of the voice are adjusted according to the user's emotions.

[0585] Step 5:

[0586] The server generates a video by integrating audio data and visual content using a video generation system. This process utilizes visual content generation software such as Adobe Premiere Pro or FFmpeg. The input consists of audio data and visual elements, while the output is video data that reflects the user's emotions.

[0587] Step 6:

[0588] The server provides the generated video to the terminal via the information provision means, allowing the user to view the video. The input is video data, and the output is a link or file reference provided to the user. The user can view the video by clicking the link.

[0589] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0590] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0591] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0592] [Fourth Embodiment]

[0593] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0594] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0595] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0596] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0597] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0598] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0599] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0600] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0601] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0602] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0603] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0604] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0605] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0606] In embodiments of the present invention, a system is provided that efficiently and effectively responds to user inquiries. This system is mainly server-based and improves the user experience through the following processes.

[0607] Inquiry reception and analysis

[0608] First, the user enters an inquiry via an information terminal. The terminal receives this inquiry and sends the data to the server. The server feeds the received data into its analysis engine, which analyzes the inquiry using natural language processing technology. This allows the server to accurately understand the user's intent and topic, and determine if it is a frequently asked question.

[0609] Answer generation

[0610] For questions identified by the analysis engine, the server activates an answer generation engine. This engine refers to FAQs and knowledge bases recorded in an internal database to create the most appropriate answer. The generated answer is output as natural-sounding text.

[0611] Video generation

[0612] The server uses the generated text-based responses to activate a video generation module. This module uses speech synthesis technology to convert the responses into audio and then generates images and animations to visually represent the content. This results in a video that integrates the responses as audiovisual information.

[0613] Video provision and viewing

[0614] The server uploads the generated video to a hosting server and creates a link that the user can access. The terminal receives the video through this link and presents it to the user. By watching the video, the user receives the explanation through both sight and sound, allowing for a deeper understanding of the content.

[0615] Specific example

[0616] For example, if a user asks, "Please tell me how to return a product," the server will identify the FAQ related to "return procedures" and generate a detailed answer outlining those procedures. Then, based on this answer, it will generate a video visualizing the return process and provide it to the user. By watching the video, users can more easily understand the details of the return procedure.

[0617] Thus, the present invention aims to provide users with quick and accurate information in response to their inquiries, thereby improving the user experience.

[0618] The following describes the processing flow.

[0619] Step 1:

[0620] The user uses an information terminal to enter text into a customer service inquiry form and submit it. The terminal receives this inquiry text.

[0621] Step 2:

[0622] The terminal sends the received query text to the server. The server receives this data and passes it to the natural language processing engine.

[0623] Step 3:

[0624] The server uses a natural language processing engine to analyze the query. Specifically, it tokenizes the text, tags each word with its part of speech, and determines the user's intent. This helps identify whether the question is frequently asked.

[0625] Step 4:

[0626] The server uses the analyzed results to cross-reference them with the FAQ database and find relevant answers. The answer generation engine then uses this information to generate answers in natural language.

[0627] Step 5:

[0628] The server sends the generated text responses to the video generation module. The module uses a speech synthesis engine to convert the responses into speech and then creates images and animations to add visual information. This process generates a video that integrates audio and visuals.

[0629] Step 6:

[0630] The server uploads the generated video to the video hosting platform and generates an access link for it.

[0631] Step 7:

[0632] The server sends this access link to the terminal. The terminal then presents the link to the user, allowing the user to access the video.

[0633] Step 8:

[0634] Users access the link through their device and obtain answers to their questions by watching the provided video.
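
The analysis and lookup in Steps 3 and 4 can be illustrated with a minimal sketch. It assumes a small in-memory FAQ table and simple word overlap in place of a full natural language processing engine; part-of-speech tagging is omitted. The names FAQ, STOPWORDS, tokenize, normalize, and match_faq are hypothetical and do not appear in the disclosure.

```python
import re
from typing import List, Optional

# Hypothetical FAQ entries; a real system would read these from its FAQ database.
FAQ = {
    "return procedures": "Open your order history, select the item, and request a return.",
    "order history": "Sign in and open the Orders page to see past orders.",
}

# Hypothetical list of words that carry no topic information.
STOPWORDS = {"please", "tell", "me", "how", "to", "a", "the", "my", "i", "do"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token: str) -> str:
    """Rough stemming: drop a trailing 's' so 'procedures' matches 'procedure'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def match_faq(inquiry: str) -> Optional[str]:
    """Return the FAQ topic sharing the most content words with the inquiry."""
    words = {normalize(t) for t in tokenize(inquiry) if t not in STOPWORDS}
    best_topic, best_score = None, 0
    for topic in FAQ:
        topic_words = {normalize(t) for t in tokenize(topic)}
        score = len(words & topic_words)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic  # None when no topic shares a word with the inquiry


topic = match_faq("Please tell me how to return a product")
print(topic, "->", FAQ[topic] if topic else "no FAQ match")
```

A return value of None corresponds to an inquiry that is not a frequently asked question; how such inquiries are handled is outside this sketch.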

[0635] (Example 1)

[0636] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0637] The problem that this invention aims to solve is to improve the user experience by generating quick and effective responses to user inquiries and providing them in an integrated format of visual and auditory information.

[0638] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0639] In this invention, the server includes data receiving means for receiving requests from users, analysis means for analyzing the received requests and identifying the intent of the utterance, answer generating means for collecting information based on the analysis results and generating an optimal solution, and media generating means for generating multimedia including visual and auditory information based on the generated solution. This enables users to gain a detailed understanding of their inquiries through audiovisual information.

[0640] "Data receiving means" refers to a device or process that has the function of receiving requests from users and transmitting them to a server.

[0641] "Analysis means" refers to technology that analyzes received requests, identifies the intent of the utterance from its content, and determines whether it is a frequently asked question.

[0642] "Answer generation means" refers to a technology or device that has the function of generating the optimal solution by referring to an internally stored database based on analyzed information.

[0643] "Media generation means" refers to a device or process that has the function of generating multimedia content that integrates visual and auditory information based on the generated solution.

[0644] "Information provision means" refers to technology or devices that have the function of transmitting generated multimedia to users and providing it in a viewable format.

[0645] This invention is an information provision system for responding quickly and appropriately to user inquiries. This system is mainly composed of a server, and a specific embodiment thereof is shown below.

[0646] Each user can enter inquiries using a dedicated application or web browser on their device. The device converts the entered inquiries into data packets and sends them to the server. Standard internet connections and protocols are used for this data transmission.

[0647] The server uses natural language processing techniques to analyze the received data packets. These analysis methods include, for example, tokenization, part-of-speech tagging, and dependency analysis. This allows the server to accurately grasp the user's intent and identify frequently occurring questions.

[0648] Based on the analyzed data, the server generates the optimal answer using an answer generation mechanism. At this stage, FAQs and knowledge bases in the internal database are referenced. A generative AI model is used to form the optimal prompt sentence for the user's question, and the answer generated from that prompt is natural and accurate.

[0649] Subsequently, the server operates media generation tools based on the generated responses to create multimedia content that integrates audio and visual information. Speech synthesis technology is used to convert the response text into speech and generate related images and animations. Third-party media generation software and libraries may be used in this process.

[0650] The generated multimedia is provided to the user through an information delivery method. The video is uploaded to a hosting server, and an access link is sent to the user's device. The user uses this link to watch the video and utilize the information obtained visually and aurally.

[0651] A concrete example is an inquiry asking, "Please tell me how to return a product." In this case, the server identifies FAQs regarding the return procedure and generates an answer that includes specific steps. A video based on that answer is then created and provided to the user, making it easier to understand the return process in detail.

[0652] Example prompt: "Please provide a video guide explaining the product return process in detail."

[0653] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0654] Step 1:

[0655] The user opens a dedicated application or web browser on their device and enters their inquiry.

[0656] The input information is converted into data packets by the terminal. These data packets are sent to the server via the internet. The input is string data, and the output is the transmitted data packets. The terminal provides a user interface (UI) to facilitate user input processing.

[0657] Step 2:

[0658] The server inputs the data packets received from the terminal into its analysis engine.

[0659] The analysis engine uses natural language processing technology to analyze user inquiries through processes such as tokenization, part-of-speech tagging, and dependency analysis.

[0660] The input is a data packet, and the output is parsed query data. The server performs this parsing step to determine the user's intent.

[0661] Step 3:

[0662] The server then activates the answer generation engine based on the analyzed data.

[0663] At this stage, data queries are performed to extract appropriate information from the internal database based on the analysis results. A generative AI model is used to form the most suitable prompt sentence based on the analysis results, and natural language generation of the response is performed based on that prompt sentence.

[0664] The input is analysis data, and the output is a text-formatted response. The server applies various algorithms to generate the optimal response.

[0665] Step 4:

[0666] The server uses the generated text-formatted response to start the media generation engine.

[0667] This engine utilizes speech synthesis technology to convert text into speech and generate related visual information. This includes image generation and animation creation. The input is a text-based response, and the output is a multimedia file integrating visual and audio information. Through this, the server can provide information to the user in a more intuitive way.

[0668] Step 5:

[0669] The server uploads the generated multimedia files to the hosting server.

[0670] After uploading, a link to the file is generated, and the link information is sent to the user's device.

[0671] The input is a multimedia file, and the output is an access link. The server uses hosting capabilities to make it easy for users to access the content.

[0672] Step 6:

[0673] The device uses a link received from the server to download and play multimedia content.

[0674] Users view this content and understand information related to their inquiry through visual and auditory means.

[0675] The input is an access link, and the output is the displayed video content. The device uses a viewing application to support the user's consumption of the content.
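
The input and output of each step above can be summarized as a chain of functions. The following is a minimal sketch under stated assumptions: every engine is replaced by a stub, and the type names, function names, canned answer, fallback message, and the host media.example.com are hypothetical placeholders rather than parts of the disclosed system.

```python
from dataclasses import dataclass


@dataclass
class ParsedQuery:
    text: str
    intent: str


@dataclass
class MediaFile:
    name: str
    narration: str


def analyze(packet: bytes) -> ParsedQuery:
    """Step 2: data packet in, parsed query out (keyword check as a stand-in for NLP)."""
    text = packet.decode("utf-8")
    intent = "return_procedure" if "return" in text.lower() else "unknown"
    return ParsedQuery(text=text, intent=intent)


def generate_answer(query: ParsedQuery) -> str:
    """Step 3: analysis data in, text response out (lookup stands in for the AI model)."""
    answers = {"return_procedure": "Here is how to return a product."}
    return answers.get(query.intent, "Your inquiry has been received.")


def generate_media(answer: str) -> MediaFile:
    """Step 4: text response in, multimedia file out (no real synthesis is performed)."""
    return MediaFile(name="answer.mp4", narration=answer)


def upload(media: MediaFile) -> str:
    """Step 5: multimedia file in, access link out (no real upload is performed)."""
    return "https://media.example.com/" + media.name


def handle_inquiry(inquiry: str) -> str:
    """Run Steps 1 to 5 and return the link that the terminal plays in Step 6."""
    packet = inquiry.encode("utf-8")  # Step 1: the terminal packs the string input
    return upload(generate_media(generate_answer(analyze(packet))))


print(handle_inquiry("Please tell me how to return a product"))
```

Keeping each step as a function with one input type and one output type mirrors the step descriptions and lets any stub be swapped for a real engine without touching the others.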

[0676] (Application Example 1)

[0677] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0678] In modern information processing systems, there is a need for methods to respond quickly and accurately to diverse user inquiries. However, conventional systems often struggled to accurately grasp user intent and provide appropriate information. Furthermore, providing answers solely as text may not allow users to fully understand the information. Therefore, there is an urgent need to develop new information processing methods to improve the user experience.

[0679] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0680] In this invention, the server includes data receiving means for receiving inquiries from users, data analysis means for analyzing the received inquiries and identifying frequently occurring questions, information generation means for generating optimal answers based on the analysis results, multimedia generation means for integrating visual and audio information based on the generated answers, and data distribution means for providing the generated multimedia content to the user. This makes it possible to quickly provide information combining visual and audio in response to user inquiries.

[0681] "Data receiving means" refers to devices or software used to receive inquiries sent from users.

[0682] "Data analysis means" refers to processes and systems for analyzing received inquiries to identify frequently occurring questions and important information.

[0683] "Information generation means" refers to devices or software used to create appropriate answers based on analyzed data.

[0684] A "multimedia generation means" refers to a device or software that combines generated answers with visual and audio information to create integrated multimedia content.

[0685] "Data distribution means" refers to communication methods and platforms for providing generated multimedia content to users.

[0686] The invention will now be described in terms of embodiments for carrying out the invention. This invention relates to an information processing system for responding quickly and accurately to user inquiries. The operation of the system will be described in detail below.

[0687] The server first receives inquiries sent from user terminals using the data receiving means. The received inquiries are then analyzed by the data analysis means within the server. One example of software that can be used here is Google Cloud's Dialogflow. This converts the inquiry content into structured data, enabling the identification of frequently occurring questions.

[0688] Based on the analysis results, the server uses the information generation means to generate appropriate responses. Generative AI models such as GPT-3 can be used at this stage. The generated text-based responses are then converted into multimedia content integrating visual and audio information by the multimedia generation means. Software such as Amazon Polly can be used for speech synthesis, and FFmpeg can be used for video generation.

[0689] The generated multimedia content is uploaded to a hosting service such as AWS S3 by the data distribution means of the server. Users can access this content through their devices and view the information via the provided links.

[0690] For example, if a user enters the prompt, "Please tell me how to check my order history," the system will generate a video demonstrating the appropriate steps, helping the user understand the process visually and audibly. In this way, users can easily obtain and understand the information in response to their inquiries.

[0691] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0692] Step 1:

[0693] The terminal receives a query from the user. The user enters the query details into the terminal's input interface, and that data is sent to the server. In this step, the input is the user's query data, and the output is the query information transferred to the server.

[0694] Step 2:

[0695] The server analyzes the received query using the data analysis means. Specifically, the server uses Google Cloud's Dialogflow to apply natural language processing to the query, identifying its intent and determining whether it corresponds to a frequently occurring question. The input in this step is the user's query information, and the output is the analyzed intent and identified questions.

[0696] Step 3:

[0697] The server generates the optimal response using the information generation means based on the analyzed data. In this process, the server uses GPT-3 as the generative AI model to generate appropriate text responses. The input for this step is the analyzed intent and question, and the output is the generated text-based response.

[0698] Step 4:

[0699] The server uses the multimedia generation means to create multimedia content that integrates visual and audio information based on the generated text-based responses. Here, Amazon Polly is used for speech synthesis, and FFmpeg is used to generate video for visual representation. The input for this step is the generated text responses, and the output is multimedia content.

[0700] Step 5:

[0701] The server uploads the generated multimedia content to AWS S3 and provides the user with a video link. The user can access this link through their device and view the multimedia content. The input for this step is the multimedia content, and the output is the sharing of the link with the user.
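
As an illustration of the FFmpeg part of Step 4, the following sketch builds a command line that combines one still image with a narration audio file into a video. It is only one possible arrangement: the disclosure does not specify the FFmpeg options, and the file names and the helper build_ffmpeg_command are hypothetical. The narration file is assumed to have been produced beforehand by the speech synthesis step.

```python
import shlex
from typing import List


def build_ffmpeg_command(image_path: str, audio_path: str, output_path: str) -> List[str]:
    """Build an FFmpeg argument list pairing a still image with a narration track."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it already exists
        "-loop", "1",            # repeat the still image as the video stream
        "-i", image_path,        # input 0: the image
        "-i", audio_path,        # input 1: the synthesized narration
        "-c:v", "libx264",       # H.264 video encoder
        "-tune", "stillimage",   # encoder tuning for static content
        "-c:a", "aac",           # AAC audio encoder
        "-pix_fmt", "yuv420p",   # pixel format accepted by most players
        "-shortest",             # end the video when the narration ends
        output_path,
    ]


cmd = build_ffmpeg_command("slide.png", "narration.mp3", "answer.mp4")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where FFmpeg with libx264 is installed; the sketch only prints the command so that it runs without FFmpeg.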

[0702] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0703] One embodiment of the present invention is a system that generates and provides customized video images that take emotions into account in response to user inquiries. This system receives inquiries from users, recognizes and analyzes the emotions contained therein to generate an optimal response, and then generates video images based on that response.

[0704] Inquiry reception and sentiment analysis

[0705] The user enters and submits an inquiry via an information terminal. The terminal sends the inquiry to the server. Upon receiving the inquiry, the server first uses an emotion engine to recognize the user's emotions as expressed in the inquiry text. This emotion analysis accurately determines whether the user is angry, troubled, etc.

[0706] Analysis of inquiry content and generation of responses

[0707] The server uses analytical tools to combine natural language processing techniques and sentiment analysis results to analyze the inquiry in detail. Based on this information, the server generates the most appropriate answer from the FAQ database. The answer generation takes sentiment information into account, and natural language generation is performed with a tone and content that matches the user's emotions.

[0708] Video generation

[0709] The generated response text is sent to the server's video generation system. Here, the speech synthesis engine is adjusted based on the emotion engine's analysis results to produce speech with a sound quality and speed appropriate to the user's emotions. Visual content is also generated taking emotions into consideration, and finally, a video is produced in which visual information and audio are integrated.

[0710] Video provision and utilization

[0711] The server uploads the generated video to an appropriate hosting platform and provides a link that the user can access immediately. The device presents this link to the user, who then accesses it to watch the video. For example, if it is determined that the user is dissatisfied with a product return, the server generates a video providing a polite and considerate explanation. This makes the user feel that their feelings have been accurately recognized and addressed appropriately, resulting in a better experience.

[0712] By implementing this invention, responses to user inquiries will be more personalized, and customized content that takes emotions into consideration will be provided. This is expected to improve customer satisfaction and reduce the number of inquiries.

[0713] The following describes the processing flow.

[0714] Step 1:

[0715] The user uses a terminal to enter and submit their inquiry. The terminal receives this inquiry text and sends it to the server.

[0716] Step 2:

[0717] The server passes the received query text to a natural language processing engine, which analyzes the text's content. Simultaneously, it activates an emotion engine to detect the user's emotions contained in the text and determine whether those emotions are positive or negative.

[0718] Step 3:

[0719] Based on the analyzed content and sentiment analysis results, the server searches the FAQ database for the most suitable answer. The answer generation engine generates answers naturally, taking into account the user's emotions in tone and content.

[0720] Step 4:

[0721] The server sends the generated response text to the video generation module. Here, the speech synthesis engine adjusts the tone and pace of the voice according to the emotion, and generates related images and animations to create a video that combines audio and visuals.

[0722] Step 5:

[0723] The server uploads the generated video to a video hosting platform and generates a link that users can access.

[0724] Step 6:

[0725] The server sends this link to the terminal, which then provides the link to the user. The user clicks the link and views the content to receive the answer.

[0726] Step 7:

[0727] By watching the video, users understand the answer to their inquiry through visual and auditory means, and achieve the purpose of their inquiry.
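
The adjustment of voice tone and pace in Step 4 can be sketched with SSML, the W3C markup that many speech synthesis engines accept. This is an assumption for illustration: the disclosure does not state that SSML is used, and the emotion labels, the mapping PROSODY_BY_EMOTION, and the function to_ssml are hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from the detected emotion to SSML prosody settings.
PROSODY_BY_EMOTION = {
    "negative": {"rate": "slow", "pitch": "low"},
    "positive": {"rate": "medium", "pitch": "medium"},
}
DEFAULT_PROSODY = {"rate": "medium", "pitch": "medium"}


def to_ssml(answer_text: str, emotion: str) -> str:
    """Wrap the answer in an SSML prosody element chosen from the emotion label."""
    style = PROSODY_BY_EMOTION.get(emotion, DEFAULT_PROSODY)
    return '<speak><prosody rate="{rate}" pitch="{pitch}">{text}</prosody></speak>'.format(
        rate=style["rate"],
        pitch=style["pitch"],
        text=escape(answer_text),  # escape &, <, > so the markup stays well-formed
    )


print(to_ssml("We are sorry for the trouble. Here is how to return the product.", "negative"))
```

Support for individual prosody attributes varies between engines and voices, so the chosen values should be checked against the engine actually used.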

[0728] (Example 2)

[0729] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0730] Traditional systems provided automated responses to user inquiries, often failing to adequately consider user emotions. This resulted in an inadequate user experience and decreased satisfaction. In particular, they failed to meet the need for customized responses that reflected user emotions.

[0731] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0732] In this invention, the server includes emotion analysis means for analyzing emotions, natural language processing means for generating optimal answers, and video generation means for generating video by adjusting sound quality and speed. This enables customized responses that take into account the user's emotions.

[0733] "Emotion analysis means" refers to a means of analyzing the inquiry text received from the user and recognizing and quantifying the user's emotions contained within the text.

[0734] "Natural language processing means" refers to a means of generating the optimal response based on the user's inquiry and the analyzed emotion information. It aims to generate natural language with a tone and content that matches the user's emotions.

[0735] "Video generation means" refers to a means of generating a video by integrating audio and visual information based on the generated response text, with sound quality and speed that match the user's emotions.

[0736] "Information provision means" refers to the means of uploading generated video images to a hosting platform and providing a link that users can access.

[0737] This system generates and provides customized videos that take emotions into account in response to user inquiries. Users can input inquiries via an information terminal and send them to the server. The terminal has the function of receiving the inquiry data and forwarding it to the server. The server has multiple means of processing that data.

[0738] First, the server uses sentiment analysis tools to analyze the user's emotions contained in the text data. This involves using sentiment analysis libraries (for example, "VADER" or other natural language processing tools). As a result of this analysis, the user's emotions can be quantified and that information can be obtained.

[0739] Next, the server uses natural language processing to generate the optimal response based on the inquiry content and sentiment information. In this process, a generative AI model (e.g., the "GPT" series) can be used to create a response with an appropriate tone and context that matches the user's emotions.

[0740] The generated responses are then sent to a video generation system. A speech synthesis engine (e.g., "Text-to-Speech" technology) is used to generate the audio. The server generates the audio, adjusting the sound quality and speed to reflect emotions. The responses also include appropriate visual content, which is then integrated and generated as a video.

[0741] Ultimately, the server uploads the generated video to a hosting platform and provides a link that the user can access immediately. The terminal is responsible for presenting this link to the user, who can then watch the video via the provided link. For example, if a user is "dissatisfied with a product return," the server can analyze their emotions and generate a video that provides an explanation in an appropriate tone. This system allows the user to feel that their emotions have been recognized and addressed appropriately. An example of a prompt might be, "Regarding your desire to have experienced better customer service." By inputting such prompts into the system, customized content that matches the user's intent is generated.

[0742] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0743] Step 1:

[0744] The user enters and submits their inquiry using an information terminal. The input is represented as text data, and the terminal's role is to send this data to the server. At this stage, the information and solutions the user is seeking are collected as input data.

[0745] Step 2:

[0746] The server transfers the text data received from the terminal to the sentiment analysis system. The sentiment analysis uses a natural language processing library to score keywords and emotional phrases within the text, extracting the emotions the user is experiencing. As a result of this analysis, the user's emotional state (e.g., dissatisfaction, joy) is quantified and output.

[0747] Step 3:

[0748] The server passes the query content along with the sentiment analysis results to a natural language processing system. Here, a generative AI model is used to generate a contextually appropriate answer. The input consists of the analyzed sentiment data and the query content, while the output is the result of natural language generation tailored to the user's emotional tone. This process ensures that the user receives a properly customized response.

[0749] Step 4:

[0750] The server forwards the generated text response to the video generation system. Using a speech synthesis engine, the system adjusts the sound quality and reading speed based on the emotion analysis results, and converts the response into speech. Simultaneously, visual information is also generated, and finally, a video integrating audio and visuals is output. This prepares content that aligns with the user's emotional expression.

[0751] Step 5:

[0752] The server uploads the prepared video to a cloud-based hosting platform and generates a link to access it. The terminal provides this link to the user, who can then view the video via the link. This allows the user to receive a customized visual response to their inquiry.
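
The keyword scoring described in Step 2 can be illustrated with a toy lexicon-based scorer. This is not VADER or any other named library; the word weights, the threshold, and the names LEXICON, sentiment_score, and emotion_label are hypothetical and chosen only to show how text can be turned into a quantified emotional state.

```python
import re

# Hypothetical word weights; real sentiment lexicons are far larger.
LEXICON = {
    "angry": -3.0, "terrible": -3.0, "dissatisfied": -2.0, "late": -1.0,
    "thanks": 2.0, "happy": 2.0, "great": 3.0,
}


def sentiment_score(text: str) -> float:
    """Return the mean weight of the lexicon words found, or 0.0 if none are found."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def emotion_label(score: float, threshold: float = 0.5) -> str:
    """Map a numeric score to a coarse emotional state."""
    if score <= -threshold:
        return "dissatisfaction"
    if score >= threshold:
        return "joy"
    return "neutral"


score = sentiment_score("I am dissatisfied with this terrible return process")
print(score, emotion_label(score))
```

The numeric score is what Step 3 would receive together with the inquiry text; the label is a convenience for choosing tone in Steps 3 and 4.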

[0753] (Application Example 2)

[0754] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0755] Modern information delivery systems often fail to accurately grasp users' emotions and provide insufficient individualized support. As a result, users become frustrated when their feelings are not properly recognized, leading to decreased satisfaction.

[0756] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0757] In this invention, the server includes information acquisition means for receiving communications from the user, emotion analysis means for analyzing the received communications to identify the user's emotions, and response generation means for generating an optimal answer based on the analyzed emotions. This makes it possible to provide appropriate video images that match the user's emotions.

[0758] "User communications" refers to data, including inquiries and requests, that users of an information system send via the information acquisition means.

[0759] "Information acquisition means" refers to a device or software that receives communications transmitted by a user and organizes them for further analysis.

[0760] "Emotion analysis means" refers to a function that processes received communication data to identify the user's emotions and feelings.

[0761] A "response generation means" is a process that automatically creates an optimized response based on analyzed emotional information.

[0762] "Video generation means" refers to a technology that creates video by integrating visual and auditory information that corresponds to the user's emotions, based on the generated response.

[0763] "Information provision means" refers to the means of quickly distributing generated video images to users and making them viewable.

[0764] The system implementing this invention receives communication from a user, analyzes the user's emotions, generates an optimal response, and provides it as a moving image including visual and auditory information. The server first receives communication from the user using information acquisition means. This communication is often a text message sent via a smartphone or other information terminal. The received communication is analyzed by emotion analysis means to identify the user's emotions contained therein.

[0765] This sentiment analysis utilizes sentiment analysis engines such as the Microsoft Text Analytics API. Based on the analysis results, the server uses natural language processing techniques to analyze the content in detail and generate the optimal response. This response is then spoken in a tone appropriate to the emotion. Speech synthesis uses speech generation technologies such as Google Cloud Text-to-Speech.

[0766] Furthermore, a moving image is generated by the motion image generation means. Here, based on the generated response, visual and audio information is integrated using visual content creation software such as Adobe Premiere Pro. Finally, the moving image is delivered to the user by the information provision means.

[0767] For example, when a user makes a product inquiry on an online shopping platform, this system generates and promptly provides an explanatory video with content and tone tailored to the user's emotions. This allows the user to feel that their emotions have been appropriately recognized and that they have received a quick and accurate response. An example of a prompt for the generative AI model would be, "Analyze the user's emotions and generate an explanatory video addressing their dissatisfaction with the delivery delay."

[0768] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0769] Step 1:

[0770] The user enters an inquiry on a terminal and sends it to the server. The input is primarily text data. This data is received by the server's information acquisition means.

[0771] Step 2:

[0772] The server passes the received text data to an emotion analysis tool to identify the user's emotions. This step uses an emotion analysis engine such as the Microsoft Text Analytics API. The input is text data, and the output is data indicating the user's emotions. Through the analysis, emotions such as anger, joy, and sadness are obtained as data.
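The input and output of Step 2 can be illustrated with a minimal sketch. It is not the Microsoft Text Analytics API or any other real engine: the keyword table, the `EmotionResult` type, and the `analyze_emotion` function are hypothetical stand-ins that only show the shape of the data (text in, emotion data out).

```python
from dataclasses import dataclass

# Hypothetical keyword lists; a real deployment would call an external
# emotion analysis engine instead of matching words.
EMOTION_KEYWORDS = {
    "anger": {"angry", "unacceptable", "terrible", "late", "delay"},
    "joy": {"thanks", "great", "love", "happy"},
    "sadness": {"sad", "disappointed", "unfortunately", "sorry"},
}


@dataclass
class EmotionResult:
    label: str                 # dominant emotion, or "neutral" if nothing matched
    scores: dict[str, float]   # share of matched keywords per emotion


def analyze_emotion(text: str) -> EmotionResult:
    """Toy stand-in for the emotion analysis means: text in, emotion data out."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    hits = {emotion: len(words & keys) for emotion, keys in EMOTION_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return EmotionResult("neutral", {e: 0.0 for e in hits})
    scores = {e: n / total for e, n in hits.items()}
    return EmotionResult(max(scores, key=scores.get), scores)
```

The later steps only need the emotion label and, optionally, the scores, so an external engine could be swapped in behind the same function signature.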

[0773] Step 3:

[0774] The server generates an appropriate response using natural language processing techniques based on analyzed sentiment and text data. An FAQ database is used for this process. Input consists of text and sentiment data, and output is a response sentence best suited to the user's sentiment. Google Cloud Natural Language API and similar tools are used for response generation.
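As a rough illustration of Step 3, the sketch below looks up the closest FAQ entry and prefixes an opener chosen by emotion. The FAQ entries, the openers, and the `generate_response` function are all hypothetical; the text only states that an FAQ database and natural language processing are used, and simple string similarity from the Python standard library stands in for both here.

```python
import difflib

# Hypothetical FAQ entries; the text only says an FAQ database is used.
FAQ = {
    "when will my order arrive": "Orders usually arrive within 3 to 5 business days.",
    "how do i return a product": "You can start a return from the order history page.",
}

# Hypothetical openers keyed by the emotion label from the previous step.
OPENERS = {
    "anger": "We are sorry for the trouble.",
    "sadness": "We are sorry to hear that.",
    "joy": "Thank you for your message!",
}


def generate_response(text: str, emotion: str) -> str:
    """Pick the closest FAQ answer and prefix an opener that fits the emotion."""
    question = text.lower().strip(" ?!.")
    match = difflib.get_close_matches(question, list(FAQ), n=1, cutoff=0.4)
    answer = FAQ[match[0]] if match else "An agent will follow up with you shortly."
    opener = OPENERS.get(emotion, "")
    return f"{opener} {answer}".strip()
```

Keeping the emotion-dependent wording separate from the FAQ answer means the factual content stays the same regardless of the detected emotion.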

[0775] Step 4:

[0776] The server passes the response text to a speech synthesis system to generate audio data. This step uses speech generation technology such as Google Cloud Text-to-Speech. The input is the response text, and the output is audio data. The tone and speed of the voice are adjusted according to the user's emotions.
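One way to express the tone and speed adjustment in Step 4 is to wrap the response text in SSML, whose `prosody` element carries `rate` and `pitch` attributes that speech engines accepting SSML can interpret. The sketch below only builds the markup and does not call any speech service; the emotion-to-prosody table and the `to_ssml` function are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical emotion-to-prosody table; the values are illustrative only.
PROSODY = {
    "anger": {"rate": "slow", "pitch": "-2st"},
    "sadness": {"rate": "slow", "pitch": "-1st"},
    "joy": {"rate": "medium", "pitch": "+2st"},
}
DEFAULT_PROSODY = {"rate": "medium", "pitch": "+0st"}


def to_ssml(response_text: str, emotion: str) -> str:
    """Wrap the response in SSML so a speech engine can adjust rate and pitch."""
    p = PROSODY.get(emotion, DEFAULT_PROSODY)
    body = escape(response_text)  # escape &, <, > so the markup stays well formed
    return f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}">{body}</prosody></speak>'
```

The resulting string would be passed to the speech synthesis system as its SSML input in place of plain text.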

[0777] Step 5:

[0778] The server generates a video by integrating audio data and visual content using a video generation system. This process utilizes visual content generation software such as Adobe Premiere Pro or FFmpeg. The input consists of audio data and visual elements, while the output is video data that reflects the user's emotions.
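For the FFmpeg case, Step 5 could be as simple as combining one still image with the synthesized audio. The sketch below only assembles the command line; the function name `build_ffmpeg_command` and the choice of a single still image as the visual element are assumptions, not something the text specifies.

```python
def build_ffmpeg_command(image_path: str, audio_path: str, output_path: str) -> list[str]:
    """Return an FFmpeg argument list that turns a still image plus audio into an MP4."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-loop", "1",            # repeat the single image as a video stream
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",       # H.264 video
        "-tune", "stillimage",   # encoder tuning for static pictures
        "-c:a", "aac",           # AAC audio
        "-pix_fmt", "yuv420p",   # widely supported pixel format
        "-shortest",             # stop when the audio ends
        output_path,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.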

[0779] Step 6:

[0780] The server provides the generated video to the terminal via the information provision means so that the user can view it. The input is video data, and the output is a link or a file reference provided to the user. The user can view the video by clicking the link.
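Steps 1 to 6 can be summarized as a chain of functions in which each step consumes the output of the previous one. The following is a minimal sketch under that reading; the `Pipeline` class and its field names are hypothetical, and each callable is a placeholder for the corresponding means.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    analyze: Callable[[str], str]            # Step 2: text -> emotion label
    respond: Callable[[str, str], str]       # Step 3: (text, emotion) -> response text
    synthesize: Callable[[str, str], bytes]  # Step 4: (response, emotion) -> audio
    render: Callable[[bytes, str], bytes]    # Step 5: (audio, emotion) -> video
    publish: Callable[[bytes], str]          # Step 6: video -> link for the user

    def handle(self, inquiry: str) -> str:
        """Step 1 delivers `inquiry`; the remaining steps run in order."""
        emotion = self.analyze(inquiry)
        response = self.respond(inquiry, emotion)
        audio = self.synthesize(response, emotion)
        video = self.render(audio, emotion)
        return self.publish(video)
```

Passing the emotion label to Steps 3, 4, and 5 reflects the description that the response text, the voice, and the video are each adapted to the user's emotion.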

[0781] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0782] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0783] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0784] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0785] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0786] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0787] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).
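The layout of the emotion map described above can be read as a coordinate system. The sketch below is one possible interpretation, not the map's actual definition: it assumes an origin at the centre of the concentric circles, and the function name `describe_position`, the `max_radius` parameter, and the half-radius cutoff between "inner" and "action" are all hypothetical.

```python
import math


def describe_position(x: float, y: float, max_radius: float = 1.0) -> dict[str, str]:
    """Read off the qualitative regions of the emotion map for a point (x, y).

    Assumed coordinates: origin at the map centre, +x to the right, +y upward.
    """
    radius = math.hypot(x, y)
    if x < 0:
        origin = "reaction"    # left side: reactions occurring in the brain
    elif x > 0:
        origin = "situation"   # right side: induced by situational judgment
    else:
        origin = "mixed"       # directly above or below the centre: both
    if y > 0:
        valence = "pleasure"   # upper side
    elif y < 0:
        valence = "displeasure"  # lower side
    else:
        valence = "neutral"
    # Assumed cutoff: the inner half is inner feeling, the outer half is action.
    expression = "inner" if radius < max_radius / 2 else "action"
    return {"origin": origin, "valence": valence, "expression": expression}
```

For example, a point near the right edge on the horizontal axis (the 3 o'clock position) comes out as situation-driven and outwardly expressed.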

[0788] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
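The idea that deviation from an ideal balance produces discomfort and approaching it produces pleasure can be illustrated with a very small score. This is a sketch under assumptions: the function `comfort_score`, the use of mean absolute deviation, and the example quantities (battery level, posture) expressed as plain numbers are illustrative choices, not the method described in the text.

```python
def comfort_score(balances: dict[str, float], ideals: dict[str, float]) -> float:
    """Negative mean absolute deviation from the ideal: 0.0 is ideal, lower is worse."""
    deviations = [abs(balances[k] - ideals[k]) for k in ideals]
    return -sum(deviations) / len(deviations)
```

With this convention, a robot whose battery level drifts away from its ideal value gets a lower score (discomfort), and the score rises toward zero (pleasure) as the level recovers.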

[0789] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0790] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
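One way to obtain training data in which nearby emotions have similar values is to derive the target emotion values from distances on the map. The sketch below uses a Gaussian falloff for this; it is an assumption about how such targets could be built, not the training procedure of the emotion identification model 59. The map positions, the `sigma` parameter, and the `soft_targets` function are hypothetical.

```python
import math

# Hypothetical 2-D positions on the emotion map; not taken from the figures.
POSITIONS = {
    "reassured": (0.60, 0.10),
    "calm": (0.62, 0.05),
    "confident": (0.65, 0.15),
    "anxious": (0.60, -0.60),
}


def soft_targets(labeled: str, sigma: float = 0.2) -> dict[str, float]:
    """Build training targets that decay with map distance from the labeled emotion."""
    cx, cy = POSITIONS[labeled]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in POSITIONS.items()
    }
```

With these illustrative positions, an input labeled "calm" yields high target values for "reassured" and "confident" and a value near zero for "anxious", which matches the behaviour described for Figure 10.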

[0791] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0792] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0793] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0794] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0795] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0796] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0797] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0798] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0799] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0800] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0801] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0802] The following is further disclosed regarding the embodiments described above.

[0803] (Claim 1)

[0804] Information receiving means for receiving inquiries from users,

[0805] An analytical means for analyzing received inquiries to identify frequently occurring questions,

[0806] A response generation means that generates the optimal answer based on the analyzed results,

[0807] A motion image generation means that generates motion images including visual and audio information based on the generated response,

[0808] Information provision means for providing generated video images to the user,

[0809] A system that includes this.

[0810] (Claim 2)

[0811] The system according to claim 1, wherein the analysis means analyzes the query using natural language processing technology.

[0812] (Claim 3)

[0813] The system according to claim 1, wherein the motion image generation means uses speech synthesis technology to convert the response text into speech.

[0814] "Example 1"

[0815] (Claim 1)

[0816] A data receiving means for receiving requests from users,

[0817] An analysis means for analyzing a received request and identifying the intent of the utterance,

[0818] A solution generation means that collects information based on the results of the analysis and generates the optimal solution,

[0819] A media generation means that generates multimedia including visual and auditory information based on the generated solution,

[0820] Information provision means for transmitting generated multimedia to the user,

[0821] A system that includes this.

[0822] (Claim 2)

[0823] The system according to claim 1, wherein the analysis means analyzes the request using natural language processing technology.

[0824] (Claim 3)

[0825] The system according to claim 1, wherein the media generation means uses speech synthesis technology to convert the text of the generated solution into speech.

[0826] "Application Example 1"

[0827] (Claim 1)

[0828] A data receiving means for receiving user inquiries,

[0829] A data analysis means that analyzes received inquiries to identify frequently occurring questions,

[0830] Information generation means that generates the optimal answer based on the analyzed results,

[0831] A multimedia generation means that integrates visual and audio information based on the generated answer,

[0832] A data distribution means for providing generated multimedia content to users,

[0833] An information processing system that includes this.

[0834] (Claim 2)

[0835] The information processing system according to claim 1, wherein the data analysis means analyzes queries using natural language processing technology.

[0836] (Claim 3)

[0837] The information processing system according to claim 1, wherein the multimedia generation means uses speech synthesis technology to convert the generated text into speech.

[0838] "Example 2 of combining an emotion engine"

[0839] (Claim 1)

[0840] An emotion analysis means that receives inquiries from users and analyzes their emotions,

[0841] A natural language processing means that generates the optimal answer based on the analyzed emotional information and content,

[0842] A motion image generation means that generates motion images including visual and auditory information by adjusting sound quality and speed based on emotion analysis results,

[0843] Information provision means for providing generated video images to the user,

[0844] A system that includes this.

[0845] (Claim 2)

[0846] The system according to claim 1, wherein the emotion analysis means analyzes emotional information and generates a tone appropriate to the user's emotions.

[0847] (Claim 3)

[0848] The system according to claim 1, wherein the motion image generation means generates viewing media that matches emotions using speech synthesis technology and image generation technology.

[0849] "Application example 2 when combining with an emotional engine"

[0850] (Claim 1)

[0851] Information acquisition means for receiving communications from users,

[0852] An emotion analysis means that analyzes received communications to identify the user's emotions,

[0853] A response generation means that generates the optimal answer based on the analyzed emotions,

[0854] A motion image generation means that generates motion images including visual and auditory information in an emotionally appropriate tone based on the generated response,

[0855] A means of providing information that delivers generated video images to users,

[0856] A system that includes this.

[0857] (Claim 2)

[0858] The system according to claim 1, wherein the emotion analysis means analyzes communications using natural language processing techniques.

[0859] (Claim 3)

[0860] The system according to claim 1, wherein the motion image generation means converts response text into speech using speech generation technology.

[Explanation of Symbols]

[0861]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. An information processing system comprising: a data receiving means for receiving user inquiries; a data analysis means that analyzes the received inquiries to identify frequently occurring questions; an information generation means that generates an optimal answer based on the analysis results; a multimedia generation means that integrates visual and audio information based on the generated answer; and a data distribution means for providing the generated multimedia content to users.

2. The information processing system according to claim 1, wherein the data analysis means analyzes queries using natural language processing technology.

3. The information processing system according to claim 1, wherein the multimedia generation means uses speech synthesis technology to convert the generated text into speech.