system
The system addresses the challenge of providing 24/7 multilingual customer support by converting voice data to text, analyzing inquiries, and generating responses, ensuring quick and emotionally appropriate interactions for improved user satisfaction.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-22
AI Technical Summary
Existing customer support systems face challenges in providing 24/7 multilingual support, quick responses, and accurately addressing customer inquiries due to time and language constraints, leading to potential dissatisfaction and contract cancellation risks.
A system that converts user inquiries into digital data, analyzes the content, generates responses, and provides them in multiple languages, using speech-to-text and text-to-speech technology, with continuous learning for improved accuracy.
Enables rapid, accurate, and emotionally responsive customer support across various languages, enhancing user satisfaction and improving system performance over time.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In recent years, in customer support, it has been required to respond quickly and accurately to inquiries from customers. However, in a response that depends on personnel, there are time and language constraints, and customers may not receive satisfactory services. In such a situation, customers may feel dissatisfied, leading to a risk of contract cancellation. Therefore, it is necessary to provide a system that can respond 24 hours a day, 365 days a year, support various languages, and respond quickly and accurately to customer inquiries.
Means for Solving the Problems
[0005] This invention converts user inquiries into digital data using means for acquiring audio data and means for converting audio to text. Furthermore, it generates a response using means for analyzing the converted data and understanding the content of the inquiry. The generated response is converted back into audio data and provided to the customer. In addition, multilingual support enables the system to serve a wider range of users. Furthermore, the system's accuracy is improved through means for collecting and learning from response evaluation results. By combining these means, the invention provides a system that responds to customer inquiries quickly and accurately, thereby improving customer satisfaction.
[0006] "Voice data" refers to data used to record or transmit voice input from a user in digital format.
[0007] "Text data" refers to data that represents audio data as text information, making it processable by computers.
[0008] "Analysis" is the process of thoroughly examining acquired data to understand its content and characteristics.
[0009] "Response generation" is a process aimed at creating appropriate responses to user inquiries.
[0010] "Multilingual support" refers to a system's ability to process input and responses in multiple languages.
[0011] "Learning" is the process by which a system improves its performance based on past data and feedback.
Brief Description of the Drawings
[0012]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
Mode for Carrying Out the Invention
[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described according to the accompanying drawings.
[0014] First, the terminology used in the following description will be explained.
[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0016] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is memory in which information is temporarily stored and which the processor uses as work memory.
[0017] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), a magnetic disk (e.g., a hard disk), and magnetic tape.
[0018] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F manages communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. The same concept applies in this specification when three or more items are linked by "and/or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] In this invention, the functions of the server, terminal, and user are coordinated to realize speech-to-text conversion, text analysis, response generation, and speech synthesis of the response.
[0034] When a user calls customer support, the terminal detects the call and sends the audio data to the server. The server uses speech recognition technology to convert this audio data into text data, and then analyzes the converted text data using natural language processing technology to understand the user's inquiry.
[0035] Based on the analysis results, the server generates an appropriate response. The generated response is verified for accuracy by comparing it with existing databases and FAQs, and automatic translation is performed as needed for multilingual support. The response is then converted into speech data by a speech synthesis engine and provided to the user via the terminal.
[0036] For example, if a user says, "I have a question about my invoice," the server recognizes this as an "invoice-related inquiry" and generates a relevant response. This response can be delivered to the user verbally, with specific details such as, "Your most recent invoice was issued on [date]."
[0037] Furthermore, the server uses the feedback obtained during these interactions to keep learning, improving the response accuracy of the system as a whole.
[0038] As a result, the present invention provides a system that can respond quickly and appropriately to a wide range of customer inquiries, thereby improving customer satisfaction.
[0039] The following describes the processing flow.
[0040] Step 1:
[0041] The user makes a call to customer service. The terminal detects this call and notifies the server to start the system.
[0042] Step 2:
[0043] The server activates the speech recognition engine and acquires the audio data sent from the terminal in real time.
[0044] Step 3:
[0045] The server converts the acquired audio data into text data. Using speech recognition technology, it accurately transcribes the user's speech into text.
[0046] Step 4:
[0047] The server analyzes the converted text data using natural language processing technology to understand the user's intent and inquiry. Based on the analysis, it identifies the category of the inquiry.
[0048] Step 5:
[0049] The server searches for relevant information from the FAQ database and other resources based on the identified category and generates an appropriate response.
[0050] Step 6:
[0051] The generated response will be translated into multiple languages as needed. In this process, the server uses an automatic translation engine.
[0052] Step 7:
[0053] The server converts the response into audio data using a speech synthesis engine and plays it back to the user through the terminal. High-quality speech synthesis technology is used to ensure a natural flow of speech.
[0054] Step 8:
[0055] If the user has any additional feedback or questions, the terminal sends them to the server and the process restarts from step 3.
[0056] Step 9:
[0057] The server utilizes the collected feedback data and continues to learn through machine learning to improve the overall response accuracy of the system.
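The steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the class SupportPipeline, the function classify_by_keyword, the FAQ_ANSWERS table, and the stand-in engines are all hypothetical names that do not appear in the specification, and the stand-ins merely mimic real speech recognition, translation, and speech synthesis engines so that the flow can be run end to end.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SupportPipeline:
    """Hypothetical wiring of steps 3 to 7; each engine is injected as a callable."""
    transcribe: Callable[[bytes], str]        # step 3: audio data -> text data
    classify: Callable[[str], str]            # step 4: text data -> inquiry category
    answer: Callable[[str], str]              # step 5: category -> response text
    translate: Callable[[str, str], str]      # step 6: (response, language) -> response
    synthesize: Callable[[str], bytes]        # step 7: response text -> audio data
    base_language: str = "en"

    def handle(self, audio: bytes, user_language: str = "en") -> bytes:
        text = self.transcribe(audio)
        category = self.classify(text)
        response = self.answer(category)
        # Translate only when the user's language differs from the base language.
        if user_language != self.base_language:
            response = self.translate(response, user_language)
        return self.synthesize(response)


# Assumed FAQ content; the invoice answer is the example given in the text.
FAQ_ANSWERS = {
    "invoice": "Your most recent invoice was issued on [date].",
    "general": "How can we help you today?",
}


def classify_by_keyword(text: str) -> str:
    """Toy stand-in for natural language processing: a single keyword check."""
    return "invoice" if "invoice" in text.lower() else "general"


# Stand-in engines: bytes are treated as UTF-8 text instead of real audio.
demo_pipeline = SupportPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    classify=classify_by_keyword,
    answer=lambda category: FAQ_ANSWERS[category],
    translate=lambda response, language: f"[{language}] {response}",
    synthesize=lambda response: response.encode("utf-8"),
)
```

Calling demo_pipeline.handle(b"I have a question about my invoice") walks through steps 3 to 7 in order. Injecting the engines as callables keeps the flow independent of any particular speech recognition, translation, or speech synthesis product, which the specification leaves open.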
[0058] (Example 1)
[0059] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0060] Conventional speech recognition systems faced challenges in deeply understanding the content of text after converting speech data, and in responding quickly and accurately to diverse user inquiries. In particular, supporting multiple natural languages and maintaining the accuracy of generated responses proved difficult, potentially compromising customer satisfaction. Furthermore, real-time processing and continuous improvement of response accuracy were required.
[0061] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0062] In this invention, the server includes means for acquiring voice information, means for converting voice information into text information, and means for generating appropriate responses based on queries using a generative AI model. This makes it possible to process user queries in real time and provide accurate responses that support multiple natural languages.
[0063] "Audio information" refers to the digital representation of speech obtained from users, and is data that is subject to text conversion and analysis.
[0064] "Textual information" refers to text data obtained by converting audio information, and is used for analysis and response generation.
[0065] A "generative AI model" is an algorithm or system used to generate text-based responses based on machine learning.
[0066] A "knowledge base" is a database or source of information that contains data used to verify response accuracy.
[0067] A "prompt" is a text instruction input into a generative AI model, which tells the model what kind of response to generate.
[0068] "Natural language" refers to the language that humans use on a daily basis, and which can be processed by computers in the form of speech or text.
[0069] This invention embodies a system that acquires voice information and generates an appropriate response using natural language processing technology. The server, terminal, and user functions are operated in coordination. When a user calls customer support, the terminal acquires voice information. The terminal sends this voice information to the server, which then processes the data.
[0070] The server uses a cloud-based speech recognition service to convert speech information into text. Specifically, open-source or commercial speech recognition software can be used on the server. The converted text information is then analyzed by natural language processing tools. For example, widely used natural language processing libraries can be used.
[0071] Based on the analysis, the server uses a generative AI model to generate an appropriate response to the user's inquiry. During this response generation process, prompts such as "Please provide the latest information regarding the invoice" are input to the AI model. Furthermore, if multilingual support is required, the server can also translate the response using a translation service.
[0072] The text response generated by the server is converted into speech information using a speech synthesis service. By using a speech synthesis engine, it is possible to provide a natural-sounding response. Finally, the terminal receives this speech data and provides it to the user.
[0073] For example, if a user says, "I have a question about my invoice," the system can analyze this and generate an appropriate response such as, "Your latest invoice was issued on [date]," which can then be communicated to the user verbally. This enables the present invention to provide prompt and appropriate customer support, thereby improving customer satisfaction.
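As a rough illustration of how the analyzed inquiry might be turned into a prompt and how a generated response might be checked against a knowledge base, consider the sketch below. The function names build_prompt and is_consistent_with_knowledge_base, the prompt wording, and the idea of checking for required terms are assumptions made for illustration; the specification does not state how the prompt is assembled or how verification is performed.

```python
def build_prompt(category: str, user_text: str, reply_language: str = "English") -> str:
    """Assemble a text prompt for a generative AI model (hypothetical layout)."""
    return "\n".join([
        "You are a customer support assistant.",
        f"Inquiry category: {category}",
        f"Customer said: {user_text}",
        f"Reply in {reply_language}.",
    ])


def is_consistent_with_knowledge_base(response: str, required_terms: list[str]) -> bool:
    """Very rough verification: every term the knowledge base requires for this
    category must appear in the generated response (case-insensitive)."""
    lowered = response.lower()
    return all(term.lower() in lowered for term in required_terms)


prompt = build_prompt("invoice-related inquiry", "I have a question about my invoice")
```

A real system would more likely compare the response with retrieved FAQ entries than with a fixed term list; the term check is used here only because it is easy to follow.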
[0074] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0075] Step 1:
[0076] The terminal acquires the user's voice information. When a user contacts customer support by phone, the terminal records that voice information in real time. The input is the user's raw voice, which is captured as digital voice data. The output is the voice data sent to the server.
[0077] Step 2:
[0078] The server converts audio information into text information. The server uses a speech recognition service to convert the received audio data into text information. The input is digital audio data sent from the terminal, and the output is the converted text data. The server uses a cloud-based speech recognition tool to perform this conversion.
[0079] Step 3:
[0080] The server analyzes the text information to understand the content of the inquiry. The server uses natural language processing tools to analyze the text information and identify the user's intent. The input is the text data obtained from speech recognition, and the output is the analyzed intent and inquiry category. For example, the text information "I want to know the details of the invoice" is classified as an "invoice-related inquiry."
[0081] Step 4:
[0082] The server generates an appropriate response using a generative AI model. The server inputs a prompt sentence into the generative AI model and generates a response. The input is the analyzed intent and prompt sentence, and the output is the response message to be provided to the user. The server generates a response based on a prompt sentence such as "Please provide an update on the invoice."
[0083] Step 5:
[0084] The server translates responses as needed. If multilingual support is required, the server uses a translation service to translate the generated responses into the user's language. The input is the response message output from the generative AI model, and the output is the translated response message.
[0085] Step 6:
[0086] The server converts the text response into audio information, which the terminal then provides to the user. The server uses a speech synthesis engine to convert the text response into speech. The input is the final text response (including translated versions), and the output is audio data. The terminal receives this audio data and plays it back to the user through its speaker.
[0087] Step 7:
[0088] The server collects response evaluation results and performs continuous learning. The server records user feedback and response accuracy and uses this to improve the system. Input is user feedback and evaluation results, and output is system learning data, which is used to improve response accuracy.
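Step 7 can be pictured as bookkeeping over evaluation records. The sketch below is an assumption-laden illustration: the Feedback record, the helpful flag, and the 0.8 threshold are hypothetical, and the specification does not say how evaluation results are stored or how learning data is selected.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """One evaluation result for one response (hypothetical record layout)."""
    category: str
    response: str
    helpful: bool


def accuracy_by_category(records: list[Feedback]) -> dict[str, float]:
    """Share of responses rated helpful, per inquiry category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.category] += 1
        hits[record.category] += int(record.helpful)
    return {category: hits[category] / totals[category] for category in totals}


def categories_needing_learning(records: list[Feedback], threshold: float = 0.8) -> list[str]:
    """Categories whose rated accuracy falls below the assumed threshold."""
    scores = accuracy_by_category(records)
    return sorted(category for category, score in scores.items() if score < threshold)
```

The low-scoring categories returned by categories_needing_learning could then be used to pick which past dialogues go into the next round of learning data.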
[0089] (Application Example 1)
[0090] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0091] In public spaces and environments where security is paramount, there is a need to quickly and accurately detect potential threats and suspicious statements, and to take appropriate action. Conventional systems often lack real-time capabilities and struggle to respond flexibly to linguistic diversity and specific situations.
[0092] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0093] In this invention, the server includes means for acquiring audio information, means for converting it into text information, and means for analyzing it to understand the content of a potential threat. This makes it possible to react immediately to suspicious statements.
[0094] "Audio information" refers to audio data acquired in a given space, and is the subject of analysis by the system.
[0095] "Textual information" refers to data obtained by converting audio information into text format, and is a digital format that allows for further analysis.
[0096] "Analysis" is the process of understanding the content of converted text information and generating necessary responses or warnings.
[0097] A "potential threat" refers to any suspicious remarks or actions in a public place that could potentially endanger safety.
[0098] "Real-time" refers to a system's ability to react instantly and execute processes without delay.
[0099] This invention is a system for analyzing audio information in real time and detecting potential threats in public places. The server acquires audio information and processes it to convert it into text information. Using speech recognition technology such as Google's Speech-to-Text API, it is possible to efficiently convert audio data into text data. This allows for appropriate processing of audio in various languages.
[0100] The terminal is responsible for continuously recording ambient sounds and sending that data to the server. The server analyzes the text information to verify whether a potential threat exists. Natural language processing libraries such as NLTK and spaCy are used to determine if any suspicious statements are included.
[0101] Furthermore, based on the analysis results, an alert will be sent to the administrator immediately if necessary. By using push notification services such as Firebase Cloud Messaging, the information can be delivered without delay.
[0102] For example, when the system of the present invention is operated in a shopping mall, the terminals constantly monitor the surrounding sounds, and if any suspicious remarks are detected, the security team is notified in real time. This type of operation allows for a rapid response to potential threats, further enhancing public safety.
[0103] An example of a prompt for a generative AI model is: "Design an application that uses speech recognition technology to analyze suspicious statements in real time and detect potential threats, as a public security monitoring system. Specifically, please tell me what architecture and technology stack are required." Based on this prompt, it is possible to extract the knowledge necessary to build a more specific system and then design it.
[0104] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0105] Step 1:
[0106] The terminal acquires ambient sound. It uses the smartphone's microphone to continuously record surrounding audio. The input is raw audio data, which is then prepared for transmission to the server. The output is audio data ready for transmission.
[0107] Step 2:
[0108] The server converts received audio data into text. It uses the Google Speech-to-Text API to convert audio data into text format. The input is audio data, and the output is text. This conversion supports multiple languages.
[0109] Step 3:
[0110] The server analyzes textual information. Using a natural language processing library (NLTK or spaCy), it analyzes the textual information and determines whether it contains suspicious statements. The input is textual information, and the results of the analysis are output.
[0111] Step 4:
[0112] The server generates warning information based on the analysis results. When a potential threat is detected, it generates information that requires real-time action. The input is the analysis results, and it outputs specific warning content.
[0113] Step 5:
[0114] The server notifies the terminal of the generated alert. Firebase Cloud Messaging is used to push the alert details to the administrator's device. The input is the alert information, and the output is the notification displayed on the administrator's device.
[0115] Step 6:
[0116] The user (administrator) will take necessary actions based on the received warning. Specific actions may include going directly to the site or conducting additional investigations. The input is the notified warning information, and the output is the execution of the countermeasures.
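Steps 3 and 4 of this flow can be sketched in a few lines. The example deliberately uses plain word matching from the Python standard library rather than NLTK, spaCy, or Firebase Cloud Messaging, so it is a simplification and not the analysis the specification describes; the WATCH_TERMS list, the Alert record, and screen_transcript are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed watch list; a deployed system would rely on a trained language model
# or an NLP library rather than a fixed set of words.
WATCH_TERMS = ("explosive", "weapon")


@dataclass
class Alert:
    """Warning information to be pushed to the administrator's device."""
    terminal_id: str
    matched_terms: list[str]
    transcript: str


def screen_transcript(terminal_id: str, transcript: str) -> Optional[Alert]:
    """Return an Alert if the transcript contains a watched term, else None."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    matched = sorted(term for term in WATCH_TERMS if term in words)
    if not matched:
        return None
    return Alert(terminal_id=terminal_id, matched_terms=matched, transcript=transcript)
```

Returning None when nothing matches keeps the notification step (step 5) simple: the server pushes a message only when screen_transcript yields an Alert.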
[0117] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0118] This invention provides a response system that combines an emotion engine that acquires and processes voice data and analyzes the user's emotional state. When a user calls customer support, the terminal detects the voice input and sends it to the server. This voice data is converted into text data by a speech recognition engine, and the content of the user's inquiry is analyzed by a natural language processing module.
[0119] The emotion engine identifies the user's emotional state from voice input and provides this emotion data to the server. This emotion data is combined with the results of query analysis and used in the response generation module.
[0120] The server generates a response whose tone and suggestions are appropriate to the user's emotions. The response can use more proactive and friendly language that takes the user's feelings into consideration. In particular, if the user expresses dissatisfaction, the response is adjusted to suggest a quick resolution.
[0121] The generated response is converted into speech data by a speech synthesis engine and delivered to the user via the terminal. This entire process allows the system to respond in real time and appropriately address the user's emotional state.
[0122] For example, if a user expresses dissatisfaction that their inquiry has not been resolved, the emotion engine detects this emotion as "dissatisfaction." In response, the server generates a response promising a prompt resolution, such as "We are very sorry. We will provide additional support immediately," and further synthesizes it into polite speech. In this way, the present invention aims to improve customer relationships by realizing interactions that take user emotions into account.
[0123] The following describes the processing flow.
[0124] Step 1:
[0125] The user makes a call to customer service. The terminal detects the call signal and sends the voice data to the server.
[0126] Step 2:
[0127] The server activates the speech recognition engine and converts the audio data received from the terminal into text data. This conversion process is performed in real time.
[0128] Step 3:
[0129] The server passes the text data to a natural language processing module, which analyzes the user's inquiry. This analysis identifies the user's intent and purpose.
[0130] Step 4:
[0131] The server uses an emotion engine to analyze the user's voice data and identify the emotions contained in the voice. This emotion data influences the generation of subsequent responses.
[0132] Step 5:
[0133] The server combines the analyzed inquiry content with the emotion data, and the response generation module creates an appropriate response. Here, the tone and content of the response are adjusted based on the emotion data.
[0134] Step 6:
[0135] The generated response is converted back into audio data by a speech synthesis engine. The server then sends this audio data to the terminal.
[0136] Step 7:
[0137] The terminal plays the generated voice response to the user. It is delivered in a tone that is easy for the user to hear and that is emotionally appropriate.
[0138] Step 8:
[0139] If the user has any further questions or feedback, the terminal will send this back to the server and repeat the process from step 2 as needed.
[0140] Step 9:
[0141] The server collects past dialogue data and uses it to train the emotion engine and response generation modules, thereby improving the overall accuracy of the system.
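Step 5 of this flow, in which the emotion data shapes the response, might look like the sketch below. It assumes the emotion engine returns a label with a confidence score; the EmotionReading record, the TONE_PREFIX table, and the 0.6 confidence cut-off are hypothetical and are not specified in the text. The apology wording is taken from the dissatisfaction example given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionReading:
    """Assumed output of the emotion engine: a label and a confidence in [0, 1]."""
    label: str
    confidence: float


# Assumed mapping from emotion label to an opening phrase that sets the tone.
TONE_PREFIX = {
    "dissatisfaction": "We are very sorry. ",
}


def adjust_response(base_response: str, emotion: EmotionReading,
                    min_confidence: float = 0.6) -> str:
    """Prepend a tone-setting phrase when the emotion is recognized confidently."""
    if emotion.confidence < min_confidence:
        # Low-confidence readings leave the response unchanged.
        return base_response
    return TONE_PREFIX.get(emotion.label, "") + base_response
```

For a reading of EmotionReading("dissatisfaction", 0.9), the base response "We will provide additional support immediately." gains the apology in front of it. Ignoring low-confidence readings is one possible design choice that avoids apologizing to a user who is not actually upset.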
[0142] (Example 2)
[0143] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0144] Modern customer service systems often fail to generate appropriate responses that align with user emotions, making it particularly difficult to adequately satisfy dissatisfied customers. Furthermore, there is a lack of technology to provide real-time responses while supporting various language environments. This can lead to a decline in the quality of the customer experience and potentially negatively impact brand image.
[0145] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0146] In this invention, the server includes means for analyzing emotional states from voice information and using that data to generate responses, means for generating responses that include tone and suggestions that take into account the user's emotional state, and means for providing responses in real time. This makes it possible to provide appropriate responses that correspond to the user's emotions, thereby improving the quality of customer service.
[0147] "Audio information" refers to information used to acquire, process, and transmit human voices as digital data.
[0148] "Textual information" refers to language data converted from audio or other formats, and is information expressed in the form of characters.
[0149] "Emotional state" refers to data that indicates an individual's emotional state and tendencies, analyzed from audio and textual information.
[0150] "Response generation" is the process of generating a response to the user based on the analyzed questions and emotional states.
[0151] "Real-time" means that processes and responses are executed instantly with very short delays.
[0152] "Tone" refers to the nuances and tendencies of language and expression in a response, and is an element that provides appropriate expression for the user's emotions.
[0153] A "knowledge base" refers to the accumulation of information and data used to verify the accuracy of responses.
[0154] A "machine learning model" is an algorithm or framework used to analyze large amounts of data and learn patterns.
[0155] A "prompt" is a sentence that indicates a question or instruction specified as input to an AI model, and serves as the starting point for generating a response.
[0156] This invention is a system that generates an appropriate response by acquiring voice information when a user calls customer support and analyzing its content. The system functions through the cooperation of a terminal and a server.
[0157] The terminal uses a microphone to capture the user's voice, converts it to a digital format, and sends it to the server. A digital communication protocol is used to ensure the data is securely transmitted to the server.
[0158] The server converts the received audio information into text using a speech recognition engine. A commonly available speech recognition service can be used for this process. The server then analyzes the converted text using a natural language processing library to understand the query. At this stage, the user's intent is clarified based on the analysis results.
[0159] Next, the emotion analysis engine identifies the user's emotional state from the voice information and generates data for it. The server uses this emotion data to generate a response that corresponds to the user's emotions based on a generative AI model. This generated response is then output by the response generation module, including appropriate tone and suggestions.
[0160] For example, if a user is dissatisfied because their inquiry hasn't been resolved, the sentiment analysis engine detects this as "dissatisfaction," and the server generates a response suggesting a quick resolution, such as, "We are very sorry. We will provide additional support immediately." This response is then converted back into voice information by a speech synthesis engine and delivered to the user through their device.
[0161] An example of a prompt is, "Generate an appropriate response when a customer is dissatisfied with the service." In this way, the system aims to improve customer satisfaction by providing responses based on the user's emotions.
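As a minimal sketch of how the detected emotion and the inquiry text might be combined into a prompt, the following function is offered under stated assumptions. The function name `build_prompt`, the emotion label "dissatisfaction" used as a key, the fallback instruction, and the line layout of the prompt are hypothetical. Only the instruction sentence for the dissatisfied case is taken from the example above.

```python
def build_prompt(query_text: str, emotion: str) -> str:
    """Combine the detected emotion and the inquiry into one prompt string."""
    # Instruction for the dissatisfied case is quoted from the example prompt;
    # the fallback instruction is an assumption for other emotions.
    instructions = {
        "dissatisfaction": (
            "Generate an appropriate response when a customer is "
            "dissatisfied with the service."
        ),
    }
    instruction = instructions.get(
        emotion, "Generate an appropriate response to the customer's inquiry."
    )
    return (
        f"{instruction}\n"
        f"Detected emotion: {emotion}\n"
        f"Customer inquiry: {query_text}"
    )
```

The resulting string would then be passed to whichever generative AI model the server uses; that call is omitted here because the disclosure does not specify its interface.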
[0162] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0163] Step 1:
[0164] The terminal acquires voice input from the user using a microphone. The acquired voice is converted into a digital signal and sent to the server. The input is the user's raw voice data, and the output is voice data in digital format. Specifically, the terminal divides the voice signal into small data packets and transmits them over the internet.
[0165] Step 2:
[0166] The server inputs the received digital audio data into the speech recognition engine and converts it into text information. The input is digital audio data, and the output is the corresponding text information. The server analyzes the features of the audio waveform, identifies phonemes, and outputs the recognized words.
[0167] Step 3:
[0168] The server inputs textual information into a natural language processing module and analyzes the user's query. The input is textual information, and the output is structured query data. The server tokenizes the text, performs syntax analysis, and identifies the intent of the query.
[0169] Step 4:
[0170] The server uses a sentiment analysis engine to analyze the user's emotional state from textual information. The input is text data, and the output is data indicating the emotional state. Specifically, the server evaluates the words and context within the text and generates an emotion score.
[0171] Step 5:
[0172] The server uses the analyzed query content and sentiment data to generate a response using a generative AI model. The input is query content data and sentiment data, and the output is the generated response text. The server instructs the AI model to generate a response based on the prompt text, including appropriate tone and suggestions.
[0173] Step 6:
[0174] The server inputs the generated response text into a speech synthesis engine and converts it into speech information. The input is the generated response text, and the output is the synthesized speech data. Specifically, the speech synthesis engine generates natural-sounding speech that resembles a human voice from the text.
[0175] Step 7:
[0176] The terminal plays back audio data received from the server and provides a response to the user. The input is synthesized audio data, and the output is the audio response delivered to the user. The terminal uses its audio speaker to play back the response with clear and easily understandable sound quality.
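The packet division mentioned in Step 1 can be sketched as follows. This is an illustration only: the packet size of 320 bytes, the use of a sequence number, and the function names `packetize` and `reassemble` are assumptions, and the actual network transmission is omitted.

```python
from typing import List, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


def packetize(audio: bytes, packet_size: int = 320) -> List[Packet]:
    """Split digital audio into fixed-size packets tagged with sequence numbers."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    return [
        (seq, audio[start:start + packet_size])
        for seq, start in enumerate(range(0, len(audio), packet_size))
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Restore the original audio even if packets arrive out of order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(payload for _, payload in ordered)
```

The sequence number lets the server restore the original order before passing the audio to the speech recognition engine in Step 2.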
[0177] (Application Example 2)
[0178] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0179] In today's information society, efficiently and effectively handling inquiries from a large number of users is crucial. However, conventional response systems merely answer questions without considering the user's emotional state, limiting their ability to improve the user experience. Furthermore, conventional technologies are insufficient for multilingual support and rapid response generation, and there is a particular need to understand and appropriately respond to user emotions.
[0180] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0181] In this invention, the server includes means for acquiring voice information, means for analyzing the user's emotions from the voice information, and means for optimizing the response using the analyzed emotion information. This makes it possible to generate well-suited responses that take the user's emotions into consideration.
[0182] "Audio information" refers to data or signals that are treated as audio.
[0183] "Character information" refers to data or signals that are represented as characters.
[0184] "Natural language" refers to the language systems that humans use on a daily basis.
[0185] "Machine learning" is a technology that enables systems to autonomously improve their performance using data.
[0186] A "knowledge base" is a collection of data that systematically organizes information related to a specific domain.
[0187] "Emotional information" refers to data about a user's emotional state obtained from their voice and actions.
[0188] "Analysis" is the process of thoroughly examining data and information to understand its structure and meaning.
[0189] "Optimization" means adjusting something to the most effective or efficient state according to its purpose.
[0190] "Real-time" refers to a state in which information processing and operations are performed instantly.
[0191] As an embodiment for carrying out the present invention, the system realizing this application example comprises a server, a user terminal, and a network connecting them. The server converts the user's voice information into text information using a speech recognition API, and analyzes this converted data using a natural language processing library. SpaCy or similar open-source libraries are used for the analysis. For the analysis of emotional information, a custom model using PyTorch is applied to estimate the emotional state from the user's voice.
[0192] The user terminal is a smartphone or similar mobile device equipped with a microphone for voice input and a speaker for voice output. The voice input is sent to the server in real time, and a response is generated based on the analysis results. This response is then converted into speech by a speech synthesis engine and provided to the user.
[0193] The server also performs a process of verifying the accuracy of the generated response by comparing it against a knowledge base. This enables the system to provide appropriate and efficient responses that meet the user's expectations. For example, if a user states that "the delivery of my item is delayed," the system will promptly apologize, check the shipping status, and process the approved claim.
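The knowledge base check described above can be illustrated with a minimal sketch. The knowledge base contents, the order IDs, the status values, and the fallback message are all hypothetical. The disclosure does not specify how the comparison is performed, so a simple exact-match lookup is assumed here.

```python
from typing import Dict

# Hypothetical knowledge base: order ID -> shipping status.
KNOWLEDGE_BASE: Dict[str, str] = {"A-100": "delayed", "A-101": "delivered"}


def verify_claim(order_id: str, claimed_status: str) -> bool:
    """Return True only if the knowledge base holds the same status for the order."""
    return KNOWLEDGE_BASE.get(order_id) == claimed_status


def finalize_response(order_id: str, claimed_status: str, draft: str) -> str:
    """Send the draft only when its claim is verified; otherwise fall back."""
    if verify_claim(order_id, claimed_status):
        return draft
    # Assumed fallback wording for an unverified claim.
    return "We are checking the status of your order and will follow up shortly."
```

Under this sketch, a generated response stating that order "A-100" is delayed would be sent as is, while a response whose claim does not match the knowledge base would be replaced by the fallback message.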
[0194] An example of a prompt message is, "We have detected dissatisfaction from the user's voice. Generate a response that addresses the issue promptly and includes an apology." This enables the generative AI model to provide emotionally responsive interactions.
[0195] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0196] Step 1:
[0197] The user initiates voice input on the device. The device uses its microphone to acquire this voice information, converts it into digital data in real time, and sends it to the server. The input is the user's voice, and the output is compressed voice data transferred to the server.
[0198] Step 2:
[0199] The server converts the received audio data into text information using a speech recognition API. In this process, audio waveform data is input and text data is output. Specifically, APIs such as Google Cloud Speech-to-Text are used to convert the audio into a portable text format.
[0200] Step 3:
[0201] The server analyzes the converted text information using a natural language processing library (e.g., spaCy). The input here is speech-recognized text data, and the analyzed contextual information and commands are output. Specifically, it extracts knowledge from the text data and performs syntactic and semantic analysis to understand the user's intent.
[0202] Step 4:
[0203] The server uses an emotion information analysis module to identify the user's emotional state from textual information. The input is the contextual information obtained in step 3, and the output is the user's emotion (e.g., dissatisfied, satisfied). Specifically, a machine learning model (e.g., a PyTorch model) is used to perform pattern recognition to determine the emotion.
[0204] Step 5:
[0205] The server generates an appropriate response based on the analyzed sentiment information and the query content. The input is the user's intent and sentiment information, and the output is a text-based response. A generative AI model is used to create context-aware, real-time responses using prompt sentences.
[0206] Step 6:
[0207] The generated text response is converted into speech data through a speech synthesis engine. The server sends this speech data to the terminal. The input is a text-based response, and the output is a speech-based response. Specifically, a speech synthesis technology that produces natural-sounding speech (e.g., a Text-to-Speech API) is used.
[0208] Step 7:
[0209] The user terminal plays the audio data received from the server and provides an immediate response to the user. The input is the audio data from the server, and the output is the audio the user hears. Specifically, the response is presented to the user through the terminal's audio output device (speaker).
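The server-side portion of the flow above (Steps 2 to 6) can be sketched as a chain of replaceable stages. This is an architectural illustration under stated assumptions, not the implementation of this disclosure. The class name `SupportPipeline`, the stage signatures, and the stand-in lambda stages are hypothetical, and no real speech recognition, spaCy, PyTorch, or speech synthesis call is made. As a simplification, the emotion stage receives the recognized text rather than the contextual information from Step 3.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SupportPipeline:
    speech_to_text: Callable[[bytes], str]
    analyze_intent: Callable[[str], str]
    analyze_emotion: Callable[[str], str]
    generate_response: Callable[[str, str], str]
    text_to_speech: Callable[[str], bytes]

    def handle(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)                # Step 2
        intent = self.analyze_intent(text)               # Step 3
        emotion = self.analyze_emotion(text)             # Step 4
        reply = self.generate_response(intent, emotion)  # Step 5
        return self.text_to_speech(reply)                # Step 6


# Stand-in stages for demonstration only; they do not call any real service.
demo = SupportPipeline(
    speech_to_text=lambda audio: audio.decode("utf-8"),
    analyze_intent=lambda text: "delivery_delay" if "delayed" in text else "other",
    analyze_emotion=lambda text: "dissatisfied" if "delayed" in text else "neutral",
    generate_response=lambda intent, emotion: f"[{emotion}] reply for {intent}",
    text_to_speech=lambda reply: reply.encode("utf-8"),
)
```

Keeping each stage behind a callable makes it possible to swap in an actual speech recognition API, natural language processing library, emotion model, or speech synthesis engine without changing the order of the steps.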
[0210] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0211] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0212] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0213] [Second Embodiment]
[0214] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0215] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0216] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0217] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0218] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0219] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0220] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0221] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0222] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0223] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0224] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0225] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0226] In this invention, the functions of the server, terminal, and user are coordinated to realize speech-to-text conversion, text analysis, response generation, and speech synthesis of the response.
[0227] When a user calls customer support, the terminal detects the call and sends the audio data to the server. The server uses speech recognition technology to convert this audio data into text data. The converted text data is then analyzed using natural language processing technology to understand the user's inquiry.
[0228] Based on the analysis results, the server generates an appropriate response. The generated response is verified for accuracy by comparing it with existing databases and FAQs, and automatic translation is performed as needed for multilingual support. The response is then converted into speech data by a speech synthesis engine and provided to the user via the terminal.
[0229] For example, if a user says, "I have a question about my invoice," the server recognizes this as an "invoice-related inquiry" and generates a relevant response. This response can be delivered to the user verbally, with specific details such as, "Your most recent invoice was issued on [date]."
[0230] Furthermore, the server uses the feedback obtained during these interactions so that the system as a whole learns continuously and improves response accuracy.
[0231] As a result, the present invention provides a system that can respond quickly and appropriately to a wide range of customer inquiries, thereby improving customer satisfaction.
[0232] The following describes the processing flow.
[0233] Step 1:
[0234] The user makes a call to customer service. The terminal detects this call and notifies the server to start the system.
[0235] Step 2:
[0236] The server activates the speech recognition engine and acquires the audio data sent from the terminal in real time.
[0237] Step 3:
[0238] The server converts the acquired audio data into text data. Using speech recognition technology, it accurately transcribes the user's speech into text.
[0239] Step 4:
[0240] The server analyzes the converted text data using natural language processing technology to understand the user's intent and inquiry. Based on the analysis, it identifies the category of the inquiry.
[0241] Step 5:
[0242] The server searches for relevant information from the FAQ database and other resources based on the identified category and generates an appropriate response.
[0243] Step 6:
[0244] The generated response will be translated into multiple languages as needed. In this process, the server uses an automatic translation engine.
[0245] Step 7:
[0246] The server converts the response into audio data using a speech synthesis engine and plays it back to the user through the terminal. High-quality speech synthesis technology is used to ensure a natural flow of speech.
[0247] Step 8:
[0248] If the user has any additional feedback or questions, the terminal sends them to the server, and the process restarts from Step 3.
[0249] Step 9:
[0250] The server utilizes the collected feedback data and continues to learn through machine learning to improve the overall response accuracy of the system.
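Steps 4 and 5 of the flow above (identifying the inquiry category and retrieving a response from the FAQ database) can be sketched as follows. The keyword lists, the FAQ entries, and the clarifying question are hypothetical. Keyword matching is used here only as a simple stand-in for the natural language processing the server would actually perform.

```python
from typing import Dict, List, Optional

# Hypothetical keyword lists and FAQ entries, for illustration only.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "invoice": ["invoice", "bill", "charge"],
    "delivery": ["delivery", "shipping", "package"],
}
FAQ: Dict[str, str] = {
    "invoice": "You can view your invoices on the billing page.",
    "delivery": "You can track your delivery from the order history page.",
}


def identify_category(text: str) -> Optional[str]:
    """Step 4: return the first category whose keyword appears in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


def answer(text: str) -> str:
    """Step 5: look up the FAQ entry for the identified category."""
    category = identify_category(text)
    if category is None:
        # Assumed behavior when no category matches: ask for more detail.
        return "Could you tell me a little more about your question?"
    return FAQ[category]
```

With these sample entries, the inquiry "I have a question about my invoice" is classified as an invoice-related inquiry and answered from the corresponding FAQ entry.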
[0251] (Example 1)
[0252] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0253] Conventional speech recognition systems faced challenges in deeply understanding the content of text after converting speech data, and in responding quickly and accurately to diverse user inquiries. In particular, supporting multiple natural languages and maintaining the accuracy of generated responses proved difficult, potentially compromising customer satisfaction. Furthermore, real-time processing and continuous improvement of response accuracy were required.
[0254] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0255] In this invention, the server includes means for acquiring voice information, means for converting voice information into text information, and means for generating appropriate responses based on queries using a generative AI model. This makes it possible to process user queries in real time and provide accurate responses that support multiple natural languages.
[0256] "Audio information" refers to the digital representation of speech obtained from users, and is data that is subject to text conversion and analysis.
[0257] "Textual information" refers to text data obtained by converting audio information, and is used for analysis and response generation.
[0258] A "generative AI model" is an algorithm or system used to generate text-based responses based on machine learning.
[0259] A "knowledge base" is a database or source of information that contains data used to verify response accuracy.
[0260] A "prompt" is a text instruction input into a generative AI model, which tells the model what kind of response to generate.
[0261] "Natural language" refers to the language that humans use on a daily basis, and which can be processed by computers in the form of speech or text.
[0262] This invention embodies a system that acquires voice information and generates an appropriate response using natural language processing technology. The server, terminal, and user functions are operated in coordination. When a user calls customer support, the terminal acquires voice information. The terminal sends this voice information to the server, which then processes the data.
[0263] The server uses a cloud-based speech recognition service to convert speech information into text. Specifically, open-source or commercial speech recognition software can be used on the server. The converted text information is then analyzed by natural language processing tools. For example, widely used natural language processing libraries can be used.
[0264] Based on the analysis, the server uses a generative AI model to generate an appropriate response to the user's inquiry. During this response generation process, prompts such as "Please provide the latest information regarding the invoice" are input to the AI model. Furthermore, if multilingual support is required, the server can also translate the response using a translation service.
[0265] The text response generated by the server is converted into speech information using a speech synthesis service. By using a speech synthesis engine, it is possible to provide a natural-sounding response. Finally, the terminal receives this speech data and provides it to the user.
[0266] For example, if a user says, "I have a question about my invoice," the system can analyze this and generate an appropriate response such as, "Your latest invoice was issued on [date]," which can then be communicated to the user verbally. This enables the present invention to provide prompt and appropriate customer support, thereby improving customer satisfaction.
[0267] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0268] Step 1:
[0269] The terminal acquires the user's voice information. When a user contacts customer support by phone, the terminal records that voice information in real time. The input is the user's raw voice, which is captured as digital voice data. The output is the voice data sent to the server.
[0270] Step 2:
[0271] The server converts audio information into text information. The server uses a speech recognition service to convert the received audio data into text information. The input is digital audio data sent from the terminal, and the output is the converted text data. The server uses a cloud-based speech recognition tool to perform this conversion.
[0272] Step 3:
[0273] The server analyzes the text information to understand the content of the inquiry. The server uses natural language processing tools to analyze the text information and identify the user's intent. The input is the text data obtained from speech recognition, and the output is the analyzed intent and inquiry category. For example, the text information "I want to know the details of the invoice" is classified as an "invoice-related inquiry."
[0274] Step 4:
[0275] The server generates an appropriate response using a generative AI model. The server inputs a prompt sentence into the generative AI model and generates a response. The input is the analyzed intent and prompt sentence, and the output is the response message to be provided to the user. The server generates a response based on a prompt sentence such as "Please provide an update on the invoice."
[0276] Step 5:
[0277] The server translates the response as needed. If multilingual support is required, the server uses a translation service to translate the generated response into the user's language. The input is the response message output from the generative AI model, and the output is the translated response message.
[0278] Step 6:
[0279] The server converts the text response into voice information, and the terminal provides it to the user. The server uses a voice synthesis engine to convert the text response into voice. The input is the final text response (the translated version, if translation was performed), and the output is voice data. The terminal receives this voice data and plays it back to the user through the speaker.
[0280] Step 7:
[0281] The server collects the evaluation results of the response and performs continuous learning. The server records the user's feedback and the accuracy of the response and utilizes it to improve the system. The input is the feedback and evaluation results from the user, and the output is the learning data of the system, which is used to improve the response accuracy.
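The feedback collection in Step 7 can be sketched as a simple log from which learning data is derived. The class name `FeedbackLog`, the record layout, and the use of a boolean "helpful" flag as the evaluation result are assumptions. The disclosure does not specify the format of the feedback or the learning procedure.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FeedbackLog:
    # Each record: (category, inquiry text, response text, helpful flag).
    records: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def record(self, category: str, inquiry: str, response: str, helpful: bool) -> None:
        """Store one evaluated exchange."""
        self.records.append((category, inquiry, response, helpful))

    def accuracy_by_category(self) -> Dict[str, float]:
        """Fraction of responses rated helpful, per inquiry category."""
        totals: Dict[str, int] = defaultdict(int)
        hits: Dict[str, int] = defaultdict(int)
        for category, _, _, helpful in self.records:
            totals[category] += 1
            hits[category] += int(helpful)
        return {category: hits[category] / totals[category] for category in totals}

    def training_examples(self) -> List[Tuple[str, str]]:
        """Inquiry and response pairs rated helpful, usable as learning data."""
        return [(inquiry, response) for _, inquiry, response, helpful in self.records if helpful]
```

Per-category accuracy indicates where responses need improvement, and the helpful pairs are one possible form of the learning data mentioned in Step 7.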
[0282] (Application Example 1)
[0283] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".
[0284] In public places and environments where security is required, it is necessary to quickly and accurately detect potential threats and suspicious remarks and take appropriate actions. Conventional systems often lack real-time capabilities and have difficulty in flexibly responding according to language diversity and specific situations.
[0285] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0286] In this invention, the server includes means for acquiring voice information, means for converting it into character information, and means for analyzing and understanding the content of potential threats. As a result, it becomes possible to immediately respond to suspicious remarks.
[0287] "Voice information" refers to voice data acquired in space and is the object to be analyzed by the system.
[0288] "Character information" refers to data obtained by converting voice information into text format and is a digital format that allows further analysis.
[0289] "Analysis" is a process for understanding the content of the converted character information and generating necessary responses or warnings.
[0290] "Potential threat" refers to suspicious remarks or actions that may pose a threat to safety in public places.
[0291] "Real-time" means that the system has the ability to react immediately and execute processes without delay.
[0292] The present invention is a system for analyzing voice information in real time and detecting potential threats in public places and the like. The server acquires voice information and converts the voice into character information. For speech recognition, the Google Speech-to-Text API, for example, can be used to efficiently convert voice data into text data. As a result, voices in various languages can be appropriately processed.
[0293] The terminal is responsible for constantly recording ambient sound and transmitting the data to the server. The server performs analysis based on the character information and verifies whether there are potential threats. As natural language processing libraries, NLTK and spaCy are used to determine whether suspicious remarks are included.
[0294] Furthermore, based on the analysis results, an alert will be sent to the administrator immediately if necessary. By using push notification services such as Firebase Cloud Messaging, the information can be delivered without delay.
[0295] For example, when the system of the present invention is operated in a shopping mall, the terminals constantly monitor the surrounding sounds, and if any suspicious remarks are detected, the security team is notified in real time. This type of operation allows for a rapid response to potential threats, further enhancing public safety.
[0296] An example of a prompt for a generative AI model is: "Design an application that uses speech recognition technology to analyze suspicious statements in real time and detect potential threats, as a public security monitoring system. Specifically, please tell me what architecture and technology stack are required." Based on this prompt, it is possible to extract the knowledge necessary to build a more specific system and then design it.
[0297] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0298] Step 1:
[0299] The device acquires ambient sound. It uses the smartphone's microphone to continuously record surrounding audio. The input is raw audio data, which is then prepared for transmission to the server. The output is audio data ready for transmission.
[0300] Step 2:
[0301] The server converts received audio data into text. It uses the Google Speech-to-Text API to convert audio data into text format. The input is audio data, and the output is text. This conversion supports multiple languages.
[0302] Step 3:
[0303] The server analyzes the character information. Using a natural language processing library (NLTK or spaCy), it analyzes the character information to determine whether it contains suspicious statements. The input is the character information, and the result of the determination is output as the analysis result.
[0304] Step 4:
[0305] Based on the analysis result, the server generates warning information. When a potential threat is detected, it generates information that should be dealt with in real time. The input is the analysis result, and the specific warning content is output.
[0306] Step 5:
[0307] The server notifies the terminal of the generated warning. Using Firebase Cloud Messaging, it pushes the warning content to the administrator's device. The input is the warning information, and the output is the notification displayed on the administrator's device.
[0308] Step 6:
[0309] Based on the received warning, the user (administrator) takes necessary actions. Specific actions may include going directly to the site or conducting additional investigations. The input is the notified warning information, and the output is the execution of countermeasures.
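As a non-limiting illustration of Steps 3 and 4 above, the following minimal sketch screens a transcript against a watchlist and builds warning information when a match is found. It uses only the Python standard library as a stand-in for NLTK or spaCy; the names `WATCHLIST`, `Alert`, `find_suspicious_terms`, and `build_alert`, as well as the watchlist contents, are assumptions introduced here and do not appear in the disclosure.

```python
import re
from dataclasses import dataclass

# Hypothetical watchlist for illustration only; a real deployment would
# curate, localize, and review such a list.
WATCHLIST = frozenset({"bomb", "weapon", "attack"})


@dataclass
class Alert:
    """Warning information handed to the notification step (Step 5)."""
    transcript: str
    matched_terms: list


def find_suspicious_terms(transcript, watchlist=WATCHLIST):
    """Step 3: return watchlist terms that occur as whole words."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return sorted(set(tokens) & set(watchlist))


def build_alert(transcript):
    """Step 4: return an Alert when a potential threat is found, else None."""
    hits = find_suspicious_terms(transcript)
    return Alert(transcript, hits) if hits else None
```

Plain keyword matching is only a placeholder; it ignores context and would produce false positives, which is why the disclosure refers to natural language processing libraries for this step.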
[0310] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing using the user's emotion.
[0311] This invention provides a response system that combines an emotion engine that acquires and processes voice data and analyzes the user's emotional state. When a user calls customer support, the terminal detects the voice input and sends it to the server. This voice data is converted into text data by a speech recognition engine, and the content of the user's inquiry is analyzed by a natural language processing module.
[0312] The emotion engine identifies the user's emotional state from voice input and provides this emotion data to the server. This emotion data is combined with the results of query analysis and used in the response generation module.
[0313] The server generates a response that includes an appropriate tone and suggestions based on the user's emotions. This response can choose more proactive and friendly language, taking the user's feelings into consideration. In particular, if the user expresses dissatisfaction, the response is adjusted to suggest a quick resolution.
[0314] The generated response is converted into speech data by a speech synthesis engine and delivered to the user via the terminal. This entire process allows the system to respond in real time and appropriately address the user's emotional state.
[0315] For example, if a user expresses dissatisfaction that their inquiry has not been resolved, the emotion engine detects this emotion as "dissatisfaction." In response, the server generates a response promising a prompt resolution, such as "We are very sorry. We will provide additional support immediately," and further synthesizes it into polite speech. In this way, the present invention aims to improve customer relationships by realizing interactions that take user emotions into account.
[0316] The following describes the processing flow.
[0317] Step 1:
[0318] The user makes a call to customer service. The terminal detects the call signal and sends the voice data to the server.
[0319] Step 2:
[0320] The server activates the speech recognition engine and converts the audio data received from the terminal into text data. This conversion process is performed in real time.
[0321] Step 3:
[0322] The server passes the text data to a natural language processing module, which analyzes the user's inquiry. This analysis identifies the user's intent and purpose.
[0323] Step 4:
[0324] The server uses an emotion engine to analyze the user's voice data and identify the emotions contained in the voice. This emotion data influences the generation of subsequent responses.
[0325] Step 5:
[0326] The server combines the analyzed query content with the emotion data, and the response generation module creates an appropriate response. Here, the tone and content of the response are adjusted based on the emotion data.
[0327] Step 6:
[0328] The generated response is converted back into audio data by a speech synthesis engine. The server then sends this audio data to the terminal.
[0329] Step 7:
[0330] The terminal plays the generated voice response to the user. It is delivered in a tone that is easy for the user to hear and that is emotionally appropriate.
[0331] Step 8:
[0332] If the user has any further questions or feedback, the terminal sends them to the server, and the process is repeated from Step 2 as needed.
[0333] Step 9:
[0334] The server collects past dialogue data and uses it to train the emotion engine and response generation modules, thereby improving the overall accuracy of the system.
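The tone adjustment in Step 5 can be sketched, under stated assumptions, as a lookup from the emotion label to an opening phrase that is prepended to the base response. The mapping `TONE_PREFIX` and the function `apply_tone` are hypothetical names; only the apology sentence for "dissatisfaction" is taken from the example above, and the other entry is an assumption.

```python
# Hypothetical mapping from an emotion label to an opening phrase.
# Only the "dissatisfaction" phrase comes from the description above.
TONE_PREFIX = {
    "dissatisfaction": (
        "We are very sorry. We will provide additional support immediately."
    ),
    "satisfaction": "Thank you for your kind words.",
}


def apply_tone(base_response: str, emotion: str) -> str:
    """Prepend an emotion-appropriate opening.

    Unknown emotion labels leave the response unchanged.
    """
    prefix = TONE_PREFIX.get(emotion)
    return f"{prefix} {base_response}" if prefix else base_response
```

In a fuller system the emotion label would more likely be passed to the response generation module as part of its input rather than applied as a fixed prefix; the lookup is used here only to keep the sketch short.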
[0335] (Example 2)
[0336] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0337] Modern customer service systems often fail to generate appropriate responses that align with user emotions, making it particularly difficult to adequately satisfy dissatisfied customers. Furthermore, there is a lack of technology to provide real-time responses while supporting various language environments. This can lead to a decline in the quality of the customer experience and potentially negatively impact brand image.
[0338] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0339] In this invention, the server includes means for analyzing emotional states from voice information and using that data to generate responses, means for generating responses that include tone and suggestions that take into account the user's emotional state, and means for providing responses in real time. This makes it possible to provide appropriate responses that correspond to the user's emotions, thereby improving the quality of customer service.
[0340] "Audio information" refers to information used to acquire, process, and transmit human voices as digital data.
[0341] "Textual information" refers to language data converted from audio or other formats, and is information expressed in the form of characters.
[0342] "Emotional state" refers to data that indicates an individual's emotional state and tendencies, analyzed from audio and textual information.
[0343] "Response generation" is the process of generating a response to the user based on the analyzed questions and emotional states.
[0344] "Real-time" means that processes and responses are executed instantly with very short delays.
[0345] "Tone" refers to the nuances and tendencies of language and expression in a response, and is an element that provides appropriate expression for the user's emotions.
[0346] A "knowledge base" refers to the accumulation of information and data used to verify the accuracy of responses.
[0347] A "machine learning model" is an algorithm or framework used to analyze large amounts of data and learn patterns.
[0348] A "prompt" is a sentence that indicates a question or instruction specified as input to an AI model, and serves as the starting point for generating a response.
[0349] This invention is a system that generates an appropriate response by acquiring voice information when a user calls customer support and analyzing its content. The system functions through the cooperation of a terminal and a server.
[0350] The terminal uses a microphone to capture the user's voice, converts it to a digital format, and sends it to the server. A digital communication protocol is used to ensure the data is securely transmitted to the server.
[0351] The server converts the received audio information into text using a speech recognition engine. A commonly available speech recognition service can be used for this process. The server then analyzes the converted text using a natural language processing library to understand the query. At this stage, the user's intent is clarified based on the analysis results.
[0352] Next, the emotion analysis engine identifies the user's emotional state from the voice information and generates data for it. The server uses this emotion data to generate a response that corresponds to the user's emotions based on a generative AI model. This generated response is then output by the response generation module, including appropriate tone and suggestions.
[0353] For example, if a user is dissatisfied because their inquiry has not been resolved, the emotion analysis engine detects this as "dissatisfaction," and the server generates a response suggesting a quick resolution, such as, "We are very sorry. We will provide additional support immediately." This response is then converted back into voice information by a speech synthesis engine and delivered to the user through the terminal.
[0354] An example of a prompt is, "Generate an appropriate response when a customer is dissatisfied with the service." In this way, the system aims to improve customer satisfaction by providing responses based on the user's emotions.
[0355] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0356] Step 1:
[0357] The terminal acquires voice input from the user using a microphone. The acquired voice is converted into a digital signal and sent to the server. The input is the user's raw voice data, and the output is voice data in digital format. Specifically, the terminal divides the voice signal into small data packets and transmits them over the internet.
[0358] Step 2:
[0359] The server inputs the received digital audio data into the speech recognition engine and converts it into text information. The input is digital audio data, and the output is the corresponding text information. The server analyzes the features of the audio waveform, identifies phonemes, and outputs the recognized words.
[0360] Step 3:
[0361] The server inputs textual information into a natural language processing module and analyzes the user's query. The input is textual information, and the output is structured query data. The server tokenizes the text, performs syntax analysis, and identifies the intent of the query.
[0362] Step 4:
[0363] The server uses a sentiment analysis engine to analyze the user's emotional state from textual information. The input is text data, and the output is data indicating the emotional state. The server performs specific actions such as evaluating words and context within the text and generating an emotional score.
[0364] Step 5:
[0365] The server uses the analyzed query content and sentiment data to generate a response using a generative AI model. The input is query content data and sentiment data, and the output is the generated response text. The server instructs the AI model to generate a response based on the prompt text, including appropriate tone and suggestions.
[0366] Step 6:
[0367] The server inputs the generated response text into a speech synthesis engine and converts it into speech information. The input is the generated response text, and the output is the synthesized speech data. The speech synthesis engine generates natural-sounding speech that resembles a human voice.
[0368] Step 7:
[0369] The terminal plays back audio data received from the server and provides a response to the user. The input is synthesized audio data, and the output is the audio response delivered to the user. The terminal uses its audio speaker to play back the response with clear and easily understandable sound quality.
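The server-side portion of the flow above (Steps 2 to 6) can be sketched as a chain of interchangeable stages. This is a minimal sketch, not the disclosed implementation: the `Engines` container, `handle_utterance`, and the stub lambdas are assumptions that stand in for a real speech recognition engine, natural language processing module, emotion analysis engine, generative AI model, and speech synthesis engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Engines:
    """Pluggable stages; each field stands in for one external engine."""
    speech_to_text: Callable[[bytes], str]
    analyze_query: Callable[[str], dict]
    analyze_emotion: Callable[[str], str]
    generate_response: Callable[[dict, str], str]
    text_to_speech: Callable[[str], bytes]


def handle_utterance(audio: bytes, engines: Engines) -> bytes:
    """Run Steps 2 to 6 on one utterance and return synthesized audio."""
    text = engines.speech_to_text(audio)                # Step 2
    query = engines.analyze_query(text)                 # Step 3
    emotion = engines.analyze_emotion(text)             # Step 4
    reply = engines.generate_response(query, emotion)   # Step 5
    return engines.text_to_speech(reply)                # Step 6


# Stub engines so the sketch runs without any external service.
stub = Engines(
    speech_to_text=lambda audio: audio.decode("utf-8"),
    analyze_query=lambda text: {
        "intent": "billing" if "invoice" in text.lower() else "other"
    },
    analyze_emotion=lambda text: (
        "dissatisfaction" if "still" in text.lower() else "neutral"
    ),
    generate_response=lambda query, emotion: (
        f"[{emotion}] reply for {query['intent']}"
    ),
    text_to_speech=lambda text: text.encode("utf-8"),
)
```

Passing the stages in as callables keeps each engine replaceable, so that, for example, a different speech recognition service could be substituted without touching the control flow.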
[0370] (Application Example 2)
[0371] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0372] In today's information society, efficiently and effectively handling inquiries from a large number of users is crucial. However, conventional response systems merely answer questions without considering the user's emotional state, limiting their ability to improve the user experience. Furthermore, conventional technologies are insufficient for multilingual support and rapid response generation, and there is a particular need to understand and appropriately respond to user emotions.
[0373] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0374] In this invention, the server includes means for acquiring voice information, means for analyzing the user's emotions from the voice information, and means for optimizing the response using the analyzed emotion information. This makes it possible to generate highly compatible responses that take the user's emotions into consideration.
[0375] "Audio information" refers to data or signals that are treated as audio.
[0376] "Character information" refers to data or signals that are represented as characters.
[0377] "Natural language" refers to the language systems that humans use on a daily basis.
[0378] "Machine learning" is a technology that enables systems to autonomously improve their performance using data.
[0379] A "knowledge base" is a collection of data that systematically organizes information related to a specific domain.
[0380] "Emotional information" refers to data about a user's emotional state obtained from their voice and actions.
[0381] "Analysis" is the process of thoroughly examining data and information to understand its structure and meaning.
[0382] "Optimization" means adjusting something to the most effective or efficient state according to its purpose.
[0383] "Real-time" refers to a state in which information processing and operations are performed instantly.
[0384] As an embodiment for carrying out the present invention, the system realizing this application example comprises a server, a user terminal, and a network connecting them. The server converts the user's voice information into text information using a speech recognition API, and analyzes this converted data using a natural language processing library. SpaCy or similar open-source libraries are used for the analysis. For the analysis of emotional information, a custom model using PyTorch is applied to estimate the emotional state from the user's voice.
[0385] The user terminal is a smartphone or similar mobile device equipped with a microphone for voice input and a speaker for voice output. The voice input is sent to the server in real time, and a response is generated based on the analysis results. This response is then converted into speech by a speech synthesis engine and provided to the user.
[0386] The server also performs a process of verifying the accuracy of the generated response by comparing it against a knowledge base. This enables the system to provide appropriate and efficient responses that meet the user's expectations. For example, if a user states that "the delivery of my item is delayed," the system will promptly apologize, check the shipping status, and process the approved claim.
[0387] An example of a prompt message is, "We have detected dissatisfaction from the user's voice. Generate a response that includes a prompt response and an apology." This enables the generative AI model to provide emotionally responsive interactions.
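One possible way to assemble such a prompt from the detected emotion and the recognized text is sketched below. The function `build_prompt`, the `INSTRUCTIONS` table, and the fallback instruction are assumptions for illustration; only the wording for the dissatisfaction case follows the example prompt above.

```python
# Hypothetical per-emotion instructions; only the "dissatisfaction"
# wording follows the example prompt in the description.
INSTRUCTIONS = {
    "dissatisfaction": (
        "Generate a response that includes a prompt response and an apology."
    ),
}
DEFAULT_INSTRUCTION = "Generate a helpful response."


def build_prompt(emotion: str, user_text: str) -> str:
    """Compose a prompt string for a generative AI model."""
    instruction = INSTRUCTIONS.get(emotion, DEFAULT_INSTRUCTION)
    return (
        f"We have detected {emotion} from the user's voice. "
        f"{instruction}\n"
        f"User said: {user_text}"
    )
```

For instance, `build_prompt("dissatisfaction", "the delivery of my item is delayed")` yields a prompt that carries both the emotional context and the user's statement to the model.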
[0388] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0389] Step 1:
[0390] The user initiates voice input on the terminal. The terminal uses its microphone to acquire this voice information, converts it into digital data in real time, and sends it to the server. The input is the user's voice, and the output is compressed voice data transferred to the server.
[0391] Step 2:
[0392] The server converts the received audio data into text information using a speech recognition API. In this process, audio waveform data is input and text data is output. Specifically, APIs such as Google Cloud Speech-to-Text are used to convert the audio into a portable text format.
[0393] Step 3:
[0394] The server analyzes the converted text information using a natural language processing library (e.g., spaCy). The input here is speech-recognized text data, and the analyzed contextual information and commands are output. Specifically, it extracts knowledge from the text data and performs syntactic and semantic analysis to understand the user's intent.
[0395] Step 4:
[0396] The server uses an emotion information analysis module to identify the user's emotional state from textual information. The input is the contextual information obtained in step 3, and the output is the user's emotion (e.g., dissatisfied, satisfied). Specifically, a machine learning model (e.g., a PyTorch model) is used to perform pattern recognition to determine the emotion.
[0397] Step 5:
[0398] The server generates an appropriate response based on the analyzed sentiment information and the query content. The input is the user's intent and sentiment information, and the output is a text-based response. A generative AI model is used to create context-aware, real-time responses using prompt sentences.
[0399] Step 6:
[0400] The generated text response is converted into speech data through a speech synthesis engine. The server sends this speech data to the terminal. The input is a text-based response, and the output is a speech-based response. The specific operation utilizes speech synthesis technology for natural-sounding speech (e.g., Text-to-Speech API).
[0401] Step 7:
[0402] The user terminal plays the audio data received from the server and provides an immediate response to the user. The input is the audio data from the server, and the output is the audio the user hears. The specific action at this stage is to present a response to the user using the terminal's audio output device (speaker).
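The emotion determination in Step 4 typically ends with turning raw model scores into a single label. The sketch below shows that last stage only, assuming a classifier that emits one score per emotion; the label set `EMOTIONS` (including "neutral") and the function names are hypothetical, and the model itself (for example, a PyTorch model) is not shown.

```python
import math

# Hypothetical label set; the description names "dissatisfied" and
# "satisfied" as examples, and "neutral" is added here as an assumption.
EMOTIONS = ("dissatisfied", "neutral", "satisfied")


def softmax(scores):
    """Numerically stable softmax over raw model scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_emotion(scores, labels=EMOTIONS):
    """Return (label, probability) for the highest-scoring emotion."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

Returning the probability alongside the label lets the response generation step fall back to a neutral tone when the classifier is not confident.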
[0403] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.
[0404] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0405] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0406] [Third Embodiment]
[0407] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0408] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0409] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0410] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0411] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0412] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0413] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0414] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0415] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0416] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0417] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0418] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0419] In this invention, the functions of the server, terminal, and user are coordinated to realize speech-to-text conversion, text analysis, response generation, and speech synthesis of the response.
[0420] When a user calls customer support, the terminal detects the call and sends the audio data to the server. The server uses speech recognition technology to convert this audio data into text data. The converted text data is then analyzed using natural language processing technology to understand the user's inquiry.
[0421] Based on the analysis results, the server generates an appropriate response. The generated response is verified for accuracy by comparing it with existing databases and FAQs, and automatic translation is performed as needed for multilingual support. The response is then converted into speech data by a speech synthesis engine and provided to the user via the terminal.
[0422] For example, if a user says, "I have a question about my invoice," the server recognizes this as an "invoice-related inquiry" and generates a relevant response. This response can be delivered to the user verbally, with specific details such as, "Your most recent invoice was issued on [date]."
[0423] Furthermore, the server uses the feedback obtained during these interactions so that the system as a whole continues to learn and improve its response accuracy.
[0424] As a result, the present invention provides a system that can respond quickly and appropriately to a wide range of customer inquiries, thereby improving customer satisfaction.
[0425] The following describes the processing flow.
[0426] Step 1:
[0427] The user makes a call to customer service. The terminal detects this call and notifies the server to start the system.
[0428] Step 2:
[0429] The server activates the speech recognition engine and acquires the audio data sent from the terminal in real time.
[0430] Step 3:
[0431] The server converts the acquired audio data into text data. Using speech recognition technology, it accurately transcribes the user's speech into text.
[0432] Step 4:
[0433] The server analyzes the converted text data using natural language processing technology to understand the user's intent and inquiry. Based on the analysis, it identifies the category of the inquiry.
[0434] Step 5:
[0435] The server searches for relevant information from the FAQ database and other resources based on the identified category and generates an appropriate response.
[0436] Step 6:
[0437] The generated response will be translated into multiple languages as needed. In this process, the server uses an automatic translation engine.
[0438] Step 7:
[0439] The server converts the response into audio data using a speech synthesis engine and plays it back to the user through the terminal. High-quality speech synthesis technology is used to ensure a natural flow of speech.
[0440] Step 8:
[0441] If the user has any additional feedback or questions, the terminal sends them to the server, and the process restarts from Step 3.
[0442] Step 9:
[0443] The server utilizes the collected feedback data and continues to learn through machine learning to improve the overall response accuracy of the system.
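Steps 4 and 5 above (identifying the inquiry category and retrieving a matching answer) can be sketched as follows. This is a simplified illustration under stated assumptions: the keyword rules, the FAQ entries, the fallback message, and the names `identify_category` and `answer` are all hypothetical, and the "[date]" placeholder is kept as it appears in the description.

```python
# Hypothetical keyword rules and FAQ entries for illustration only.
CATEGORY_KEYWORDS = {
    "invoice": ("invoice", "bill", "charge"),
    "delivery": ("delivery", "shipping", "parcel"),
}
FAQ = {
    "invoice": "Your most recent invoice was issued on [date].",
    "delivery": "You can track your order from the order history page.",
}
FALLBACK = "I will connect you with a support agent."


def identify_category(text):
    """Step 4: return the first category whose keyword occurs in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


def answer(text):
    """Step 5: look up the FAQ answer, or fall back when nothing matches."""
    return FAQ.get(identify_category(text), FALLBACK)
```

Substring matching is used only for brevity; the natural language processing described in Step 4 would identify intent more robustly than a keyword table can.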
[0444] (Example 1)
[0445] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0446] Conventional speech recognition systems faced challenges in deeply understanding the content of text after converting speech data, and in responding quickly and accurately to diverse user inquiries. In particular, supporting multiple natural languages and maintaining the accuracy of generated responses proved difficult, potentially compromising customer satisfaction. Furthermore, real-time processing and continuous improvement of response accuracy were required.
[0447] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0448] In this invention, the server includes means for acquiring voice information, means for converting voice information into text information, and means for generating appropriate responses based on queries using a generative AI model. This makes it possible to process user queries in real time and provide accurate responses that support multiple natural languages.
[0449] "Audio information" refers to the digital representation of speech obtained from users, and is data that is subject to text conversion and analysis.
[0450] "Textual information" refers to text data obtained by converting audio information, and is used for analysis and response generation.
[0451] A "generative AI model" is an algorithm or system used to generate text-based responses based on machine learning.
[0452] A "knowledge base" is a database or source of information that contains data used to verify response accuracy.
[0453] A "prompt" is a text instruction input into a generative AI model, which tells the model what kind of response to generate.
[0454] "Natural language" refers to the language that humans use on a daily basis, and which can be processed by computers in the form of speech or text.
[0455] This invention embodies a system that acquires voice information and generates an appropriate response using natural language processing technology. The server, terminal, and user functions are operated in coordination. When a user calls customer support, the terminal acquires voice information. The terminal sends this voice information to the server, which then processes the data.
[0456] The server uses a cloud-based speech recognition service to convert speech information into text. Specifically, open-source or commercial speech recognition software can be used on the server. The converted text information is then analyzed by natural language processing tools. For example, widely used natural language processing libraries can be used.
[0457] Based on the analysis, the server uses a generative AI model to generate an appropriate response to the user's inquiry. During this response generation process, prompts such as "Please provide the latest information regarding the invoice" are input to the AI model. Furthermore, if multilingual support is required, the server can also translate the response using a translation service.
[0458] The text response generated by the server is converted into speech information using a speech synthesis service. By using a speech synthesis engine, it is possible to provide a natural-sounding response. Finally, the terminal receives this speech data and provides it to the user.
[0459] For example, if a user says, "I have a question about my invoice," the system can analyze this and generate an appropriate response such as, "Your latest invoice was issued on [date]," which can then be communicated to the user verbally. This enables the present invention to provide prompt and appropriate customer support, thereby improving customer satisfaction.
[0460] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0461] Step 1:
[0462] The terminal acquires the user's voice information. When a user contacts customer support by phone, the terminal records that voice information in real time. The input is the user's raw voice, which is captured as digital voice data. The output is the voice data sent to the server.
[0463] Step 2:
[0464] The server converts audio information into text information. The server uses a speech recognition service to convert the received audio data into text information. The input is digital audio data sent from the terminal, and the output is the converted text data. The server uses a cloud-based speech recognition tool to perform this conversion.
[0465] Step 3:
[0466] The server analyzes the text information to understand the content of the inquiry. The server uses natural language processing tools to analyze the text information and identify the user's intent. The input is the text data obtained from speech recognition, and the output is the analyzed intent and inquiry category. For example, the text information "I want to know the details of the invoice" is classified as an "invoice-related inquiry."
[0467] Step 4:
[0468] The server generates an appropriate response using a generative AI model. The server inputs a prompt sentence into the generative AI model and generates a response. The input is the analyzed intent and prompt sentence, and the output is the response message to be provided to the user. The server generates a response based on a prompt sentence such as "Please provide an update on the invoice."
[0469] Step 5:
[0470] The server translates responses as needed. If multilingual support is required, the server uses a translation service to translate the generated response into the user's language. The input is the response message output from the generative AI model, and the output is the translated response message.
[0471] Step 6:
[0472] The server converts the text response into audio information, which the terminal then provides to the user. The server uses a speech synthesis engine to convert the text response into speech. The input is the final text response (including translated versions), and the output is audio data. The terminal receives this audio data and plays it back to the user through its speaker.
[0473] Step 7:
[0474] The server collects response evaluation results and performs continuous learning. The server records user feedback and response accuracy and uses this to improve the system. Input is user feedback and evaluation results, and output is system learning data, which is used to improve response accuracy.
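The seven steps above can be sketched as a single chain of stages. The following is a minimal illustration only, not the implementation of the disclosure: the class name SupportPipeline, the stage signatures, and the stub stages in demo_pipeline are assumptions introduced here, and a real system would replace each stub with a speech recognition service, a natural language processing tool, a generative AI model, a translation service, and a speech synthesis engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SupportPipeline:
    """Chains Steps 2 to 7; every stage is an injected, hypothetical callable."""
    speech_to_text: Callable[[bytes], str]          # Step 2
    classify_intent: Callable[[str], str]           # Step 3
    generate_response: Callable[[str, str], str]    # Step 4: (intent, prompt)
    translate: Callable[[str, str], str]            # Step 5: (text, language)
    text_to_speech: Callable[[str], bytes]          # Step 6
    feedback_log: List[dict] = field(default_factory=list)  # Step 7

    def handle(self, audio: bytes, user_language: str = "en") -> bytes:
        text = self.speech_to_text(audio)
        intent = self.classify_intent(text)
        prompt = f"Please provide an update on the {intent}."
        reply = self.generate_response(intent, prompt)
        # Translate only when the user's language differs from the default.
        if user_language != "en":
            reply = self.translate(reply, user_language)
        return self.text_to_speech(reply)

    def record_feedback(self, inquiry: str, reply: str, rating: int) -> None:
        # Stored records can later serve as learning data.
        self.feedback_log.append(
            {"inquiry": inquiry, "reply": reply, "rating": rating}
        )


def demo_pipeline() -> SupportPipeline:
    """Build a pipeline from trivial stand-in stages for demonstration."""
    return SupportPipeline(
        speech_to_text=lambda audio: audio.decode("utf-8"),
        classify_intent=lambda text: (
            "invoice" if "invoice" in text.lower() else "general inquiry"
        ),
        generate_response=lambda intent, prompt: f"[{intent}] {prompt}",
        translate=lambda text, lang: f"({lang}) {text}",
        text_to_speech=lambda text: text.encode("utf-8"),
    )
```

Passing each stage as a callable keeps the order of the steps fixed while allowing any individual service to be exchanged without touching the others.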
[0475] (Application Example 1)
[0476] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0477] In public spaces and environments where security is paramount, there is a need to quickly and accurately detect potential threats and suspicious statements, and to take appropriate action. Conventional systems often lack real-time capabilities and struggle to respond flexibly to linguistic diversity and specific situations.
[0478] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0479] In this invention, the server includes means for acquiring audio information, means for converting it into text information, and means for analyzing it to understand the content of a potential threat. This makes it possible to react immediately to suspicious statements.
[0480] "Audio information" refers to audio data acquired in a given space, and is the subject of analysis by the system.
[0481] "Textual information" refers to data obtained by converting audio information into text format, and is a digital format that allows for further analysis.
[0482] "Analysis" is the process of understanding the content of converted text information and generating necessary responses or warnings.
[0483] A "potential threat" refers to any suspicious remark or action in a public place that could endanger safety.
[0484] "Real-time" refers to a system's ability to react instantly and execute processes without delay.
[0485] This invention is a system for analyzing audio information in real time and detecting potential threats in public places. The server acquires audio information and converts it into text. Using speech recognition technology such as the Google Speech-to-Text API, it is possible to efficiently convert audio data into text data. This allows for the appropriate processing of audio in various languages.
[0486] The terminal is responsible for continuously recording ambient sounds and sending that data to the server. The server analyzes the text information to verify whether a potential threat exists. Natural language processing libraries such as NLTK and spaCy are used to determine if any suspicious statements are included.
[0487] Furthermore, based on the analysis results, an alert will be sent to the administrator immediately if necessary. By using push notification services such as Firebase Cloud Messaging, the information can be delivered without delay.
[0488] For example, when the system of the present invention is operated in a shopping mall, the terminals constantly monitor the surrounding sounds, and if any suspicious remarks are detected, the security team is notified in real time. This type of operation allows for a rapid response to potential threats, further enhancing public safety.
[0489] An example of a prompt for a generative AI model is: "Design an application that uses speech recognition technology to analyze suspicious statements in real time and detect potential threats, as a public security monitoring system. Specifically, please tell me what architecture and technology stack are required." Based on this prompt, the knowledge needed to build a more concrete system can be extracted and used in its design.
[0490] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0491] Step 1:
[0492] The terminal acquires ambient sound. It uses a microphone, such as a smartphone's microphone, to continuously record the surrounding audio. The input is raw audio data, and the output is audio data prepared for transmission to the server.
[0493] Step 2:
[0494] The server converts received audio data into text. It uses the Google Speech-to-Text API to convert audio data into text format. The input is audio data, and the output is text. This conversion supports multiple languages.
[0495] Step 3:
[0496] The server analyzes textual information. Using a natural language processing library (NLTK or spaCy), it analyzes the textual information and determines whether it contains suspicious statements. The input is textual information, and the results of the analysis are output.
[0497] Step 4:
[0498] The server generates warning information based on the analysis results. When a potential threat is detected, it generates information that requires real-time action. The input is the analysis results, and it outputs specific warning content.
[0499] Step 5:
[0500] The server notifies the terminal of the generated alert. Firebase Cloud Messaging is used to push the alert details to the administrator's device. The input is the alert information, and the output is the notification displayed on the administrator's device.
[0501] Step 6:
[0502] The user (administrator) will take necessary actions based on the received warning. Specific actions may include going directly to the site or conducting additional investigations. The input is the notified warning information, and the output is the execution of the countermeasures.
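Steps 3 and 4 above can be illustrated with a simple watch-list check. This is a sketch under stated assumptions: the terms in SUSPICIOUS_TERMS, the Alert structure, and the function names are hypothetical, and whole-word matching stands in for the analysis that the disclosure assigns to a natural language processing library such as NLTK or spaCy. Delivery of the alert (Step 5) is outside the scope of the sketch.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical watch list; a real deployment would use a trained classifier.
SUSPICIOUS_TERMS = ("bomb", "weapon", "attack")


@dataclass
class Alert:
    location: str
    matched_terms: List[str]
    transcript: str


def analyze_transcript(transcript: str) -> List[str]:
    """Step 3: return watch-list terms found as whole words (case-insensitive)."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return [term for term in SUSPICIOUS_TERMS if term in words]


def build_alert(transcript: str, location: str) -> Optional[Alert]:
    """Step 4: create warning information only when something matched."""
    matched = analyze_transcript(transcript)
    if not matched:
        return None
    return Alert(location=location, matched_terms=matched, transcript=transcript)
```

Returning None when nothing matches means the notification step is never invoked for ordinary conversation, which keeps alerts limited to transcripts that actually contain a listed term.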
[0503] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0504] This invention provides a response system that acquires and processes voice data and incorporates an emotion engine that analyzes the user's emotional state. When a user calls customer support, the terminal detects the voice input and sends it to the server. This voice data is converted into text data by a speech recognition engine, and the content of the user's inquiry is analyzed by a natural language processing module.
[0505] The emotion engine identifies the user's emotional state from voice input and provides this emotion data to the server. This emotion data is combined with the results of query analysis and used in the response generation module.
[0506] The server generates a response that includes an appropriate tone and suggestions based on the user's emotions. This response can choose more proactive and friendly language, taking the user's feelings into consideration. In particular, if the user expresses dissatisfaction, the response is adjusted to suggest a quick resolution.
[0507] The generated response is converted into speech data by a speech synthesis engine and delivered to the user via the terminal. This entire process allows the system to respond in real time and appropriately address the user's emotional state.
[0508] For example, if a user expresses dissatisfaction that their inquiry has not been resolved, the emotion engine detects this emotion as "dissatisfaction." In response, the server generates a response promising a prompt resolution, such as "We are very sorry. We will provide additional support immediately," and further synthesizes it into polite speech. In this way, the present invention aims to improve customer relationships by realizing interactions that take user emotions into account.
[0509] The following describes the processing flow.
[0510] Step 1:
[0511] The user makes a call to customer service. The terminal detects the call signal and sends the voice data to the server.
[0512] Step 2:
[0513] The server activates the speech recognition engine and converts the audio data received from the terminal into text data. This conversion process is performed in real time.
[0514] Step 3:
[0515] The server passes the text data to a natural language processing module, which analyzes the user's inquiry. This analysis identifies the user's intent and purpose.
[0516] Step 4:
[0517] The server uses an emotion engine to analyze the user's voice data and identify the emotions contained in the voice. This emotion data influences the generation of subsequent responses.
[0518] Step 5:
[0519] The server combines the analyzed inquiry content with the emotion data, and the response generation module creates an appropriate response. Here, the tone and content of the response are adjusted based on the emotion data.
[0520] Step 6:
[0521] The generated response is converted back into audio data by a speech synthesis engine. The server then sends this audio data to the terminal.
[0522] Step 7:
[0523] The terminal plays the generated voice response to the user. It is delivered in a tone that is easy for the user to hear and that is emotionally appropriate.
[0524] Step 8:
[0525] If the user has any further questions or feedback, the terminal will send this back to the server and repeat the process from step 2 as needed.
[0526] Step 9:
[0527] The server collects past dialogue data and uses it to train the emotion engine and response generation modules, thereby improving the overall accuracy of the system.
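The tone adjustment of Step 5 can be sketched as follows. This is an illustrative assumption rather than the method of the disclosure: the emotion labels, the TONE_PREFIX table, and the adjust_tone function are hypothetical, and in the described system the adjustment would be performed by the response generation module using the output of the emotion engine.

```python
from typing import Dict

# Hypothetical mapping from an emotion label to an opening phrase.
TONE_PREFIX: Dict[str, str] = {
    "dissatisfaction": "We are very sorry. ",
    "neutral": "",
    "satisfaction": "Thank you for your kind words. ",
}


def adjust_tone(base_reply: str, emotion: str) -> str:
    """Wrap a base reply in phrasing chosen from the detected emotion."""
    # Unknown labels fall back to the unmodified reply.
    reply = TONE_PREFIX.get(emotion, "") + base_reply
    if emotion == "dissatisfaction":
        # Promise a quick resolution when the user is dissatisfied.
        reply += " We will provide additional support immediately."
    return reply
```

For example, adjust_tone("Your request has been escalated.", "dissatisfaction") places an apology before the base reply and a promise of immediate support after it.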
[0528] (Example 2)
[0529] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0530] Modern customer service systems often fail to generate appropriate responses that align with user emotions, making it particularly difficult to adequately satisfy dissatisfied customers. Furthermore, there is a lack of technology to provide real-time responses while supporting various language environments. This can lead to a decline in the quality of the customer experience and potentially negatively impact brand image.
[0531] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0532] In this invention, the server includes means for analyzing emotional states from voice information and using that data to generate responses, means for generating responses that include tone and suggestions that take into account the user's emotional state, and means for providing responses in real time. This makes it possible to provide appropriate responses that correspond to the user's emotions, thereby improving the quality of customer service.
[0533] "Audio information" refers to human voice that is acquired, processed, and transmitted as digital data.
[0534] "Textual information" refers to language data converted from audio or other formats, and is information expressed in the form of characters.
[0535] "Emotional state" refers to data that indicates an individual's emotional state and tendencies, analyzed from audio and textual information.
[0536] "Response generation" is the process of generating a response to the user based on the analyzed questions and emotional states.
[0537] "Real-time" means that processes and responses are executed instantly with very short delays.
[0538] "Tone" refers to the nuances and tendencies of language and expression in a response, and is an element that provides appropriate expression for the user's emotions.
[0539] A "knowledge base" refers to the accumulation of information and data used to verify the accuracy of responses.
[0540] A "machine learning model" is an algorithm or framework used to analyze large amounts of data and learn patterns.
[0541] A "prompt" is a sentence that indicates a question or instruction specified as input to an AI model, and serves as the starting point for generating a response.
[0542] This invention is a system that generates an appropriate response by acquiring voice information when a user calls customer support and analyzing its content. The system functions through the cooperation of a terminal and a server.
[0543] The terminal uses a microphone to capture the user's voice, converts it to a digital format, and sends it to the server. A digital communication protocol is used to ensure the data is securely transmitted to the server.
[0544] The server converts the received audio information into text using a speech recognition engine; a speech recognition service is typically used for this step. The server then analyzes the converted text using a natural language processing library to understand the inquiry. At this stage, the user's intent is clarified based on the analysis results.
[0545] Next, the emotion analysis engine identifies the user's emotional state from the voice information and generates data for it. The server uses this emotion data to generate a response that corresponds to the user's emotions based on a generative AI model. This generated response is then output by the response generation module, including appropriate tone and suggestions.
[0546] For example, if a user is dissatisfied because their inquiry has not been resolved, the emotion analysis engine detects this as "dissatisfaction," and the server generates a response suggesting a quick resolution, such as, "We are very sorry. We will provide additional support immediately." This response is then converted back into voice information by a speech synthesis engine and delivered to the user through the terminal.
[0547] An example of a prompt is, "Generate an appropriate response when a customer is dissatisfied with the service." In this way, the system aims to improve customer satisfaction by providing responses based on the user's emotions.
[0548] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0549] Step 1:
[0550] The terminal acquires voice input from the user using a microphone. The acquired voice is converted into a digital signal and sent to the server. The input is the user's raw voice data, and the output is voice data in digital format. Specifically, the terminal divides the voice signal into small data packets and transmits them over the internet.
[0551] Step 2:
[0552] The server inputs the received digital audio data into the speech recognition engine and converts it into text information. The input is digital audio data, and the output is the corresponding text information. The server analyzes the features of the audio waveform, identifies phonemes, and outputs the recognized words.
[0553] Step 3:
[0554] The server inputs textual information into a natural language processing module and analyzes the user's query. The input is textual information, and the output is structured query data. The server tokenizes the text, performs syntax analysis, and identifies the intent of the query.
[0555] Step 4:
[0556] The server uses a sentiment analysis engine to analyze the user's emotional state from the text information. The input is text data, and the output is data indicating the emotional state. Specifically, the server evaluates the words and context in the text and generates an emotion score.
[0557] Step 5:
[0558] The server uses the analyzed query content and sentiment data to generate a response using a generative AI model. The input is query content data and sentiment data, and the output is the generated response text. The server instructs the AI model to generate a response based on the prompt text, including appropriate tone and suggestions.
[0559] Step 6:
[0560] The server inputs the generated response text into a speech synthesis engine and converts it into speech information. The input is the generated response text, and the output is the synthesized speech data. Specifically, the speech synthesis engine generates natural-sounding speech that resembles a human voice.
[0561] Step 7:
[0562] The terminal plays back audio data received from the server and provides a response to the user. The input is synthesized audio data, and the output is the audio response delivered to the user. The terminal uses its audio speaker to play back the response with clear and easily understandable sound quality.
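Steps 4 and 5 above, in which an emotion score is generated from the text and passed to the generative AI model together with the inquiry, can be sketched as follows. The word weights in EMOTION_LEXICON, the three labels, and the wording of the prompt are assumptions made for illustration; the disclosure does not specify how the score is computed, and a lexicon sum is only one simple possibility.

```python
import re
from typing import Dict

# Hypothetical word weights; negative values indicate dissatisfaction.
EMOTION_LEXICON: Dict[str, float] = {
    "unresolved": -1.0, "late": -0.5, "angry": -1.0, "terrible": -1.0,
    "thanks": 0.5, "great": 1.0, "resolved": 0.5,
}


def emotion_score(text: str) -> float:
    """Step 4: sum the word weights and clamp the result to [-1.0, 1.0]."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = sum(EMOTION_LEXICON.get(token, 0.0) for token in tokens)
    return max(-1.0, min(1.0, total))


def emotion_label(score: float) -> str:
    """Map a score to a coarse label used in the prompt."""
    if score < 0:
        return "dissatisfaction"
    if score > 0:
        return "satisfaction"
    return "neutral"


def build_prompt(inquiry: str, label: str) -> str:
    """Step 5: combine the inquiry and the emotion label into a prompt sentence."""
    return (
        f"The customer's emotional state is '{label}'. "
        f"Generate an appropriate response to this inquiry: {inquiry}"
    )
```

Clamping the score keeps long messages from producing unbounded values, so downstream thresholds can be chosen independently of message length.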
[0563] (Application Example 2)
[0564] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0565] In today's information society, efficiently and effectively handling inquiries from a large number of users is crucial. However, conventional response systems merely answer questions without considering the user's emotional state, limiting their ability to improve the user experience. Furthermore, conventional technologies are insufficient for multilingual support and rapid response generation, and there is a particular need to understand and appropriately respond to user emotions.
[0566] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0567] In this invention, the server includes means for acquiring voice information, means for analyzing the user's emotions from the voice information, and means for optimizing the response using the analyzed emotion information. This makes it possible to generate highly compatible responses that take the user's emotions into consideration.
[0568] "Audio information" refers to data or signals that are treated as audio.
[0569] "Text information" refers to data or signals that are represented as characters.
[0570] "Natural language" refers to the language systems that humans use on a daily basis.
[0571] "Machine learning" is a technology that enables systems to autonomously improve their performance using data.
[0572] A "knowledge base" is a collection of data that systematically organizes information related to a specific domain.
[0573] "Emotional information" refers to data about a user's emotional state obtained from their voice and actions.
[0574] "Analysis" is the process of thoroughly examining data and information to understand its structure and meaning.
[0575] "Optimization" means adjusting something to the most effective or efficient state according to its purpose.
[0576] "Real-time" refers to a state in which information processing and operations are performed instantly.
[0577] As an embodiment for carrying out the present invention, the system realizing this application example comprises a server, a user terminal, and a network connecting them. The server converts the user's voice information into text information using a speech recognition API, and analyzes the converted data using a natural language processing library. spaCy or a similar open-source library is used for the analysis. For the analysis of emotional information, a custom model built with PyTorch is applied to estimate the emotional state from the user's voice.
[0578] The user terminal is a smartphone or similar mobile device equipped with a microphone for voice input and a speaker for voice output. The voice input is sent to the server in real time, and a response is generated based on the analysis results. This response is then converted into speech by a speech synthesis engine and provided to the user.
[0579] The server also verifies the accuracy of the generated response by comparing it against a knowledge base. This enables the system to provide appropriate and efficient responses that meet the user's expectations. For example, if a user states that "the delivery of my item is delayed," the system promptly apologizes, checks the shipping status, and handles the resulting claim.
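The comparison against a knowledge base can be sketched as a check that the facts a reply must state agree with stored values. This is only an assumed form of verification: the KNOWLEDGE_BASE entries are sample data invented for the sketch, and the function find_conflicts and its substring comparison are hypothetical, since the disclosure does not describe how the comparison is carried out.

```python
from typing import Dict, List

# Hypothetical sample knowledge base: fact name -> value a reply must agree with.
KNOWLEDGE_BASE: Dict[str, str] = {
    "shipping_status": "delayed",
    "expected_delivery": "2025-01-15",
}


def find_conflicts(reply: str, required_facts: List[str]) -> List[str]:
    """Return the fact names that the reply fails to state as stored.

    A fact is also reported when the knowledge base has no entry for it,
    because the reply cannot be verified in that case.
    """
    lowered = reply.lower()
    unverified = []
    for name in required_facts:
        value = KNOWLEDGE_BASE.get(name)
        if value is None or value.lower() not in lowered:
            unverified.append(name)
    return unverified
```

An empty result means every required fact was found in the reply; a non-empty result could be used to regenerate the response before it is sent to speech synthesis.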
[0580] An example of a prompt message is, "We have detected dissatisfaction from the user's voice. Generate a reply that offers a prompt resolution and an apology." This enables the generative AI model to provide emotionally responsive interactions.
[0581] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0582] Step 1:
[0583] The user initiates voice input on the device. The device uses its microphone to acquire this voice information, converts it into digital data in real time, and sends it to the server. The input is the user's voice, and the output is compressed voice data transferred to the server.
[0584] Step 2:
[0585] The server converts the received audio data into text information using a speech recognition API. In this process, audio waveform data is input and text data is output. Specifically, an API such as Google Cloud Speech-to-Text is used to convert the audio into text.
[0586] Step 3:
[0587] The server analyzes the converted text information using a natural language processing library (e.g., spaCy). The input here is speech-recognized text data, and the analyzed contextual information and commands are output. Specifically, it extracts knowledge from the text data and performs syntactic and semantic analysis to understand the user's intent.
[0588] Step 4:
[0589] The server uses an emotion information analysis module to identify the user's emotional state from textual information. The input is the contextual information obtained in step 3, and the output is the user's emotion (e.g., dissatisfied, satisfied). Specifically, a machine learning model (e.g., a PyTorch model) is used to perform pattern recognition to determine the emotion.
[0590] Step 5:
[0591] The server generates an appropriate response based on the analyzed sentiment information and the query content. The input is the user's intent and sentiment information, and the output is a text-based response. A generative AI model is used to create context-aware, real-time responses using prompt sentences.
[0592] Step 6:
[0593] The generated text response is converted into speech data by a speech synthesis engine. The server sends this speech data to the terminal. The input is a text-based response, and the output is a speech-based response. Specifically, a speech synthesis technology that produces natural-sounding speech (e.g., a text-to-speech API) is used.
[0594] Step 7:
[0595] The user terminal plays the audio data received from the server and provides an immediate response to the user. The input is the audio data from the server, and the output is the audio the user hears. Specifically, the response is presented to the user through the terminal's audio output device (speaker).
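Step 1 above, in which the terminal sends compressed voice data to the server in real time, can be sketched as framing followed by per-frame compression. The frame size and the use of zlib are assumptions made so that the sketch runs with the standard library alone; zlib is a general-purpose compressor and only a stand-in, since the disclosure does not name a codec and a real terminal would more likely use a dedicated audio codec.

```python
import zlib
from typing import Iterable, Iterator


def frame_and_compress(pcm: bytes, frame_bytes: int = 3200) -> Iterator[bytes]:
    """Yield zlib-compressed frames of raw PCM audio.

    3200 bytes corresponds to 100 ms of 16 kHz, 16-bit mono audio.
    """
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    for start in range(0, len(pcm), frame_bytes):
        yield zlib.compress(pcm[start:start + frame_bytes])


def reassemble(frames: Iterable[bytes]) -> bytes:
    """Server side: decompress the frames and join them in arrival order."""
    return b"".join(zlib.decompress(frame) for frame in frames)
```

Sending short frames as they are produced, rather than one file at the end of the utterance, is what allows the server to begin speech recognition while the user is still speaking.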
[0596] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0597] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0598] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the headset-type terminal 314.
[0599] [Fourth Embodiment]
[0600] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0601] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0602] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0603] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0604] The microphone 238 receives instructions and other input from the user 20 by picking up the user's voice. The microphone 238 converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0605] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0606] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0607] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
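The relationship between an emotion and the controlled object 443 can be pictured as a lookup from an emotion label to actuator settings. The following sketch is purely illustrative: the Expression structure, the emotion labels, the colour values, and the angles are hypothetical and do not come from the disclosure, which states only that emotions and facial expressions can be expressed through the motors and the eye LEDs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Expression:
    eye_led_rgb: Tuple[int, int, int]  # illumination state of the eye LEDs
    arm_angle_deg: int                 # target angle for the arm motors


# Hypothetical emotion-to-actuator table; the values are illustrative only.
EXPRESSIONS: Dict[str, Expression] = {
    "neutral": Expression((255, 255, 255), 0),
    "joy": Expression((255, 200, 0), 45),
    "apology": Expression((0, 80, 255), -20),
}


def select_expression(emotion: str) -> Expression:
    """Fall back to the neutral pose for emotions the table does not cover."""
    return EXPRESSIONS.get(emotion, EXPRESSIONS["neutral"])
```

Keeping the table separate from the motor and LED drivers lets the set of expressible emotions be extended without changing the control code.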
[0608] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0609] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0610] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0611] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0612] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0613] In this invention, the functions of the server, terminal, and user are coordinated to realize speech-to-text conversion, text analysis, response generation, and speech synthesis of the response.
[0614] When a user calls customer support, the terminal detects the call and sends the audio data to the server. The server uses speech recognition technology to convert this audio data into text data. The converted text data is then analyzed using natural language processing technology to understand the user's inquiry.
[0615] Based on the analysis results, the server generates an appropriate response. The generated response is verified for accuracy by comparing it with existing databases and FAQs, and automatic translation is performed as needed for multilingual support. The response is then converted into speech data by a speech synthesis engine and provided to the user via the terminal.
[0616] For example, if a user says, "I have a question about my invoice," the server recognizes this as an "invoice-related inquiry" and generates a relevant response. This response can be delivered to the user verbally, with specific details such as, "Your most recent invoice was issued on [date]."
[0617] Furthermore, the server uses the feedback obtained during these interactions so that the system as a whole learns continuously and improves its response accuracy.
[0618] As a result, the present invention provides a system that can respond quickly and appropriately to a wide range of customer inquiries, thereby improving customer satisfaction.
[0619] The following describes the processing flow.
[0620] Step 1:
[0621] The user makes a call to customer service. The terminal detects this call and notifies the server to start the system.
[0622] Step 2:
[0623] The server activates the speech recognition engine and acquires the audio data sent from the terminal in real time.
[0624] Step 3:
[0625] The server converts the acquired audio data into text data. Using speech recognition technology, it accurately transcribes the user's speech into text.
[0626] Step 4:
[0627] The server analyzes the converted text data using natural language processing technology to understand the user's intent and inquiry. Based on the analysis, it identifies the category of the inquiry.
[0628] Step 5:
[0629] The server searches for relevant information from the FAQ database and other resources based on the identified category and generates an appropriate response.
[0630] Step 6:
[0631] The generated response is translated into multiple languages as needed. In this process, the server uses an automatic translation engine.
[0632] Step 7:
[0633] The server converts the response into audio data using a speech synthesis engine and plays it back to the user through the terminal. High-quality speech synthesis technology is used to ensure a natural flow of speech.
[0634] Step 8:
[0635] If the user has any additional feedback or questions, the terminal sends them to the server and the process restarts from step 3.
[0636] Step 9:
[0637] The server utilizes the collected feedback data and continues to learn through machine learning to improve the overall response accuracy of the system.
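As a non-limiting illustration, the processing flow of Steps 2 to 9 above can be sketched as a chain of interchangeable stages. The class name `SupportPipeline`, its field names, and the stand-in lambda stages are assumptions introduced only for this sketch; actual speech recognition, translation, and speech synthesis engines would replace them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SupportPipeline:
    """Chains the server-side stages; every stage is an injected callable."""
    transcribe: Callable[[bytes], str]      # Step 3: audio -> text
    classify: Callable[[str], str]          # Step 4: text -> inquiry category
    answer: Callable[[str, str], str]       # Step 5: (category, text) -> response
    translate: Callable[[str, str], str]    # Step 6: (response, language) -> response
    synthesize: Callable[[str], bytes]      # Step 7: response text -> audio
    feedback_log: List[Tuple[str, str]] = field(default_factory=list)

    def handle_turn(self, audio: bytes, language: str = "en") -> bytes:
        text = self.transcribe(audio)
        category = self.classify(text)
        response = self.answer(category, text)
        if language != "en":
            response = self.translate(response, language)
        # Step 9: keep (inquiry, response) pairs as material for later learning.
        self.feedback_log.append((text, response))
        return self.synthesize(response)


# Stand-in stages for illustration only (hypothetical, not real engines).
pipeline = SupportPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    classify=lambda text: "invoice" if "invoice" in text.lower() else "general",
    answer=lambda category, text: f"[{category}] We received your inquiry.",
    translate=lambda response, language: f"({language}) {response}",
    synthesize=lambda response: response.encode("utf-8"),
)

reply = pipeline.handle_turn(b"I have a question about my invoice")
print(reply.decode("utf-8"))
```

Calling `handle_turn` again for a follow-up utterance corresponds to the loop of Step 8, and `feedback_log` is one possible place to accumulate the data referred to in Step 9.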
[0638] (Example 1)
[0639] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0640] Conventional speech recognition systems faced challenges in deeply understanding the content of text after converting speech data, and in responding quickly and accurately to diverse user inquiries. In particular, supporting multiple natural languages and maintaining the accuracy of generated responses proved difficult, potentially compromising customer satisfaction. Furthermore, real-time processing and continuous improvement of response accuracy were required.
[0641] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0642] In this invention, the server includes means for acquiring voice information, means for converting voice information into text information, and means for generating appropriate responses based on queries using a generative AI model. This makes it possible to process user queries in real time and provide accurate responses that support multiple natural languages.
[0643] "Audio information" refers to the digital representation of speech obtained from users, and is data that is subject to text conversion and analysis.
[0644] "Textual information" refers to text data obtained by converting audio information, and is used for analysis and response generation.
[0645] A "generative AI model" is an algorithm or system used to generate text-based responses based on machine learning.
[0646] A "knowledge base" is a database or source of information that contains data used to verify response accuracy.
[0647] A "prompt" is a text instruction input into a generative AI model, which tells the model what kind of response to generate.
[0648] "Natural language" refers to the language that humans use on a daily basis, and which can be processed by computers in the form of speech or text.
[0649] This invention embodies a system that acquires voice information and generates an appropriate response using natural language processing technology. The server, terminal, and user functions are operated in coordination. When a user calls customer support, the terminal acquires voice information. The terminal sends this voice information to the server, which then processes the data.
[0650] The server uses a cloud-based speech recognition service to convert speech information into text. Specifically, open-source or commercial speech recognition software can be used on the server. The converted text information is then analyzed by natural language processing tools. For example, widely used natural language processing libraries can be used.
[0651] Based on the analysis, the server uses a generative AI model to generate an appropriate response to the user's inquiry. During this response generation process, prompts such as "Please provide the latest information regarding the invoice" are input to the AI model. Furthermore, if multilingual support is required, the server can also translate the response using a translation service.
[0652] The text response generated by the server is converted into speech information using a speech synthesis service. By using a speech synthesis engine, it is possible to provide a natural-sounding response. Finally, the terminal receives this speech data and provides it to the user.
[0653] For example, if a user says, "I have a question about my invoice," the system can analyze this and generate an appropriate response such as, "Your latest invoice was issued on [date]," which can then be communicated to the user verbally. This enables the present invention to provide prompt and appropriate customer support, thereby improving customer satisfaction.
[0654] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0655] Step 1:
[0656] The terminal acquires the user's voice information. When a user contacts customer support by phone, the terminal records that voice information in real time. The input is the user's raw voice, which is captured as digital voice data. The output is the voice data sent to the server.
[0657] Step 2:
[0658] The server converts audio information into text information. The server uses a speech recognition service to convert the received audio data into text information. The input is digital audio data sent from the terminal, and the output is the converted text data. The server uses a cloud-based speech recognition tool to perform this conversion.
[0659] Step 3:
[0660] The server analyzes the text information to understand the content of the inquiry. The server uses natural language processing tools to analyze the text information and identify the user's intent. The input is the text data obtained from speech recognition, and the output is the analyzed intent and inquiry category. For example, the text information "I want to know the details of the invoice" is classified as an "invoice-related inquiry."
[0661] Step 4:
[0662] The server generates an appropriate response using a generative AI model. The server inputs a prompt sentence into the generative AI model and generates a response. The input is the analyzed intent and prompt sentence, and the output is the response message to be provided to the user. The server generates a response based on a prompt sentence such as "Please provide an update on the invoice."
[0663] Step 5:
[0664] The server translates responses as needed. If multilingual support is required, the server uses a translation service to translate the generated responses into the user's language. The input is the response message output from the generative AI model, and the output is the translated response message.
[0665] Step 6:
[0666] The server converts the text response into audio information, which the terminal then provides to the user. The server uses a speech synthesis engine to convert the text response into speech. The input is the final text response (including translated versions), and the output is audio data. The terminal receives this audio data and plays it back to the user through its speaker.
[0667] Step 7:
[0668] The server collects response evaluation results and performs continuous learning. The server records user feedback and response accuracy and uses this to improve the system. Input is user feedback and evaluation results, and output is system learning data, which is used to improve response accuracy.
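As one possible illustration of Steps 3 and 4 of Example 1, the following sketch classifies an inquiry and assembles a prompt sentence for the generative AI model. The keyword table, the function names `classify_inquiry` and `build_prompt`, and the prompt layout are hypothetical; keyword matching is a simplified stand-in for the natural language processing tools mentioned above.

```python
from typing import Dict, List

# Hypothetical category keywords; a real system would use an NLP model.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "invoice-related inquiry": ["invoice", "bill", "charge"],
    "contract-related inquiry": ["contract", "cancel"],
}


def classify_inquiry(text: str) -> str:
    """Return the first category whose keyword occurs in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "general inquiry"


def build_prompt(category: str, user_text: str, language: str) -> str:
    """Combine the analyzed intent with an instruction for the model."""
    return (
        f"Inquiry category: {category}\n"
        f"User said: {user_text}\n"
        f"Respond in language: {language}\n"
        "Please provide the latest information for this inquiry."
    )


utterance = "I want to know the details of the invoice"
category = classify_inquiry(utterance)
prompt = build_prompt(category, utterance, "en")
print(prompt)
```

Keeping classification and prompt construction as separate functions lets the category also be reused for the knowledge base lookup or for routing, independent of how the prompt is worded.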
[0669] (Application Example 1)
[0670] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0671] In public spaces and environments where security is paramount, there is a need to quickly and accurately detect potential threats and suspicious statements, and to take appropriate action. Conventional systems often lack real-time capabilities and struggle to respond flexibly to linguistic diversity and specific situations.
[0672] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0673] In this invention, the server includes means for acquiring audio information, means for converting it into text information, and means for analyzing it to understand the content of a potential threat. This makes it possible to react immediately to suspicious statements.
[0674] "Audio information" refers to audio data acquired in a given space, and is the subject of analysis by the system.
[0675] "Textual information" refers to data obtained by converting audio information into text format, and is a digital format that allows for further analysis.
[0676] "Analysis" is the process of understanding the content of converted text information and generating necessary responses or warnings.
[0677] A "potential threat" refers to any suspicious remark or action in a public place that could endanger safety.
[0678] "Real-time" refers to a system's ability to react instantly and execute processes without delay.
[0679] This invention is a system for analyzing audio information in real time and detecting potential threats in public places. The server acquires audio information and converts it into text. Using speech recognition technology such as the Google Speech-to-Text API, it is possible to efficiently convert audio data into text data. This allows for the appropriate processing of audio in various languages.
[0680] The terminal is responsible for continuously recording ambient sounds and sending that data to the server. The server analyzes the text information to verify whether a potential threat exists. Natural language processing libraries such as NLTK and spaCy are used to determine if any suspicious statements are included.
[0681] Furthermore, based on the analysis results, an alert will be sent to the administrator immediately if necessary. By using push notification services such as Firebase Cloud Messaging, the information can be delivered without delay.
[0682] For example, when the system of the present invention is operated in a shopping mall, the terminals constantly monitor the surrounding sounds, and if any suspicious remarks are detected, the security team is notified in real time. This type of operation allows for a rapid response to potential threats, further enhancing public safety.
[0683] An example of a prompt for a generative AI model is: "Design an application that uses speech recognition technology to analyze suspicious statements in real time and detect potential threats, as a public security monitoring system. Specifically, please tell me what architecture and technology stack are required." Based on this prompt, it is possible to extract the knowledge necessary to build a more specific system and then design it.
[0684] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0685] Step 1:
[0686] The terminal acquires ambient sound. It uses the smartphone's microphone to continuously record surrounding audio. The input is raw audio data, which is then prepared for transmission to the server. The output is audio data ready for transmission.
[0687] Step 2:
[0688] The server converts received audio data into text. It uses the Google Speech-to-Text API to convert audio data into text format. The input is audio data, and the output is text. This conversion supports multiple languages.
[0689] Step 3:
[0690] The server analyzes textual information. Using a natural language processing library (NLTK or spaCy), it analyzes the textual information and determines whether it contains suspicious statements. The input is textual information, and the results of the analysis are output.
[0691] Step 4:
[0692] The server generates warning information based on the analysis results. When a potential threat is detected, it generates information that requires real-time action. The input is the analysis results, and it outputs specific warning content.
[0693] Step 5:
[0694] The server notifies the terminal of the generated alert. Firebase Cloud Messaging is used to push the alert details to the administrator's device. The input is the alert information, and the output is the notification displayed on the administrator's device.
[0695] Step 6:
[0696] The user (administrator) takes the necessary actions based on the received warning. Specific actions may include going directly to the site or conducting additional investigations. The input is the notified warning information, and the output is the execution of the countermeasures.
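Steps 3 to 5 of Application Example 1 can be illustrated, under simplifying assumptions, as follows. The watch list `SUSPICIOUS_TERMS`, the `Alert` structure, and both function names are hypothetical. Plain word matching stands in for the NLTK or spaCy analysis, and the notification function only formats a message instead of calling a push notification service.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical watch list; a deployed system would rely on a trained model.
SUSPICIOUS_TERMS = {"bomb", "weapon", "explosive"}


@dataclass
class Alert:
    location: str
    matched_terms: List[str]
    transcript: str


def detect_threat(transcript: str, location: str) -> Optional[Alert]:
    """Step 3 and Step 4: scan the transcript and build warning information."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    matched = sorted(set(tokens) & SUSPICIOUS_TERMS)
    if not matched:
        return None
    return Alert(location=location, matched_terms=matched, transcript=transcript)


def notify_administrator(alert: Alert) -> str:
    """Step 5 placeholder: format the message a push service would carry."""
    return f"ALERT at {alert.location}: matched {', '.join(alert.matched_terms)}"


alert = detect_threat("He said there is a bomb in the bag", "Mall entrance")
if alert is not None:
    print(notify_administrator(alert))
```

Returning `None` when nothing matches keeps the no-threat path explicit, so that a notification is sent only when warning information was actually generated.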
[0697] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0698] This invention provides a response system that combines an emotion engine that acquires and processes voice data and analyzes the user's emotional state. When a user calls customer support, the terminal detects the voice input and sends it to the server. This voice data is converted into text data by a speech recognition engine, and the content of the user's inquiry is analyzed by a natural language processing module.
[0699] The emotion engine identifies the user's emotional state from voice input and provides this emotion data to the server. This emotion data is combined with the results of query analysis and used in the response generation module.
[0700] The server generates a response that includes an appropriate tone and suggestions based on the user's emotions. This response can choose more proactive and friendly language, taking the user's feelings into consideration. In particular, if the user expresses dissatisfaction, the response is adjusted to suggest a quick resolution.
[0701] The generated response is converted into speech data by a speech synthesis engine and delivered to the user via the terminal. This entire process allows the system to respond in real time and appropriately address the user's emotional state.
[0702] For example, if a user expresses dissatisfaction that their inquiry has not been resolved, the emotion engine detects this emotion as "dissatisfaction." In response, the server generates a response promising a prompt resolution, such as "We are very sorry. We will provide additional support immediately," and further synthesizes it into polite speech. In this way, the present invention aims to improve customer relationships by realizing interactions that take user emotions into account.
[0703] The following describes the processing flow.
[0704] Step 1:
[0705] The user makes a call to customer service. The terminal detects the call signal and sends the voice data to the server.
[0706] Step 2:
[0707] The server activates the speech recognition engine and converts the audio data received from the terminal into text data. This conversion process is performed in real time.
[0708] Step 3:
[0709] The server passes the text data to a natural language processing module, which analyzes the user's inquiry. This analysis identifies the user's intent and purpose.
[0710] Step 4:
[0711] The server uses an emotion engine to analyze the user's voice data and identify the emotions contained in the voice. This emotion data influences the generation of subsequent responses.
[0712] Step 5:
[0713] The server combines the analyzed query content with the emotion data, and the response generation module creates an appropriate response. Here, the tone and content of the response are adjusted based on the emotion data.
[0714] Step 6:
[0715] The generated response is converted back into audio data by a speech synthesis engine. The server then sends this audio data to the terminal.
[0716] Step 7:
[0717] The terminal plays the generated voice response to the user. It is delivered in a tone that is easy for the user to hear and that is emotionally appropriate.
[0718] Step 8:
[0719] If the user has any further questions or feedback, the terminal will send this back to the server and repeat the process from step 2 as needed.
[0720] Step 9:
[0721] The server collects past dialogue data and uses it to train the emotion engine and response generation modules, thereby improving the overall accuracy of the system.
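The tone adjustment of Step 5 above can be sketched, for example, as a lookup from an emotion label to a response template. The emotion labels, the template wording other than the apology quoted earlier, and the function name `adjust_tone` are assumptions made for illustration only.

```python
from typing import Dict

# Hypothetical emotion labels and tone templates.
TONE_TEMPLATES: Dict[str, str] = {
    "dissatisfaction": (
        "We are very sorry. {body} We will provide additional support immediately."
    ),
    "anxiety": "Please do not worry. {body}",
    "neutral": "{body}",
}


def adjust_tone(body: str, emotion: str) -> str:
    """Wrap the factual response body in wording suited to the emotion."""
    # Unknown emotion labels fall back to the neutral template.
    template = TONE_TEMPLATES.get(emotion, TONE_TEMPLATES["neutral"])
    return template.format(body=body)


print(adjust_tone("Your inquiry has been escalated.", "dissatisfaction"))
```

In this sketch the factual content is generated once and only its framing changes with the emotion, which is one simple way to keep the answer itself unaffected by the tone adjustment.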
[0722] (Example 2)
[0723] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0724] Modern customer service systems often fail to generate appropriate responses that align with user emotions, making it particularly difficult to adequately satisfy dissatisfied customers. Furthermore, there is a lack of technology to provide real-time responses while supporting various language environments. This can lead to a decline in the quality of the customer experience and potentially negatively impact brand image.
[0725] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0726] In this invention, the server includes means for analyzing emotional states from voice information and using that data to generate responses, means for generating responses that include tone and suggestions that take into account the user's emotional state, and means for providing responses in real time. This makes it possible to provide appropriate responses that correspond to the user's emotions, thereby improving the quality of customer service.
[0727] "Audio information" refers to information used to acquire, process, and transmit human voices as digital data.
[0728] "Textual information" refers to language data converted from audio or other formats, and is information expressed in the form of characters.
[0729] "Emotional state" refers to data that indicates an individual's emotional state and tendencies, analyzed from audio and textual information.
[0730] "Response generation" is the process of generating a response to the user based on the analyzed questions and emotional states.
[0731] "Real-time" means that processes and responses are executed instantly with very short delays.
[0732] "Tone" refers to the nuances and tendencies of language and expression in a response, and is an element that provides appropriate expression for the user's emotions.
[0733] A "knowledge base" refers to the accumulation of information and data used to verify the accuracy of responses.
[0734] A "machine learning model" is an algorithm or framework used to analyze large amounts of data and learn patterns.
[0735] A "prompt" is a sentence that indicates a question or instruction specified as input to an AI model, and serves as the starting point for generating a response.
[0736] This invention is a system that generates an appropriate response by acquiring voice information when a user calls customer support and analyzing its content. The system functions through the cooperation of a terminal and a server.
[0737] The terminal uses a microphone to capture the user's voice, converts it to a digital format, and sends it to the server. A digital communication protocol is used to ensure the data is securely transmitted to the server.
[0738] The server converts the received audio information into text using a speech recognition engine. A commonly available speech recognition service can be used for this process. The server then analyzes the converted text using a natural language processing library to understand the query. At this stage, processing is performed to clarify the user's intent based on the analysis results.
[0739] Next, the emotion analysis engine identifies the user's emotional state from the voice information and generates data for it. The server uses this emotion data to generate a response that corresponds to the user's emotions based on a generative AI model. This generated response is then output by the response generation module, including appropriate tone and suggestions.
[0740] For example, if a user is dissatisfied because their inquiry has not been resolved, the emotion analysis engine detects this as "dissatisfaction," and the server generates a response suggesting a quick resolution, such as, "We are very sorry. We will provide additional support immediately." This response is then converted into voice information by a speech synthesis engine and delivered to the user through the terminal.
[0741] An example of a prompt is, "Generate an appropriate response when a customer is dissatisfied with the service." In this way, the system aims to improve customer satisfaction by providing responses based on the user's emotions.
[0742] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0743] Step 1:
[0744] The terminal acquires voice input from the user using a microphone. The acquired voice is converted into a digital signal and sent to the server. The input is the user's raw voice data, and the output is voice data in digital format. Specifically, the terminal divides the voice signal into small data packets and transmits them over the internet.
[0745] Step 2:
[0746] The server inputs the received digital audio data into the speech recognition engine and converts it into text information. The input is digital audio data, and the output is the corresponding text information. The server analyzes the features of the audio waveform, identifies phonemes, and outputs the recognized words.
[0747] Step 3:
[0748] The server inputs textual information into a natural language processing module and analyzes the user's query. The input is textual information, and the output is structured query data. The server tokenizes the text, performs syntax analysis, and identifies the intent of the query.
[0749] Step 4:
[0750] The server uses a sentiment analysis engine to analyze the user's emotional state from textual information. The input is text data, and the output is data indicating the emotional state. The server performs specific actions such as evaluating words and context within the text and generating an emotional score.
[0751] Step 5:
[0752] The server uses the analyzed query content and sentiment data to generate a response using a generative AI model. The input is query content data and sentiment data, and the output is the generated response text. The server instructs the AI model to generate a response based on the prompt text, including appropriate tone and suggestions.
[0753] Step 6:
[0754] The server inputs the generated response text into a speech synthesis engine and converts it into speech information. The input is the generated response text, and the output is the synthesized speech data. Specifically, the speech synthesis engine generates natural-sounding speech that resembles a human voice.
[0755] Step 7:
[0756] The terminal plays back audio data received from the server and provides a response to the user. The input is synthesized audio data, and the output is the audio response delivered to the user. The terminal uses its audio speaker to play back the response with clear and easily understandable sound quality.
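Step 4 of Example 2 describes evaluating words and context within the text and generating an emotional score. A minimal sketch of such scoring, assuming a small hand-written word list, is shown below. The lexicon weights, the negation rule, the thresholds, and the function names are all hypothetical and stand in for the emotion analysis engine.

```python
import re
from typing import Dict

# Hypothetical word weights; negative values indicate dissatisfaction.
EMOTION_LEXICON: Dict[str, float] = {
    "unresolved": -0.8, "late": -0.5, "angry": -1.0,
    "thanks": 0.6, "great": 0.8, "resolved": 0.7,
}
NEGATIONS = {"not", "never", "no"}


def emotion_score(text: str) -> float:
    """Sum word weights, flipping a weight that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for index, token in enumerate(tokens):
        weight = EMOTION_LEXICON.get(token, 0.0)
        # Minimal context handling: "not resolved" counts as negative.
        if index > 0 and tokens[index - 1] in NEGATIONS:
            weight = -weight
        total += weight
    return total


def emotion_label(score: float) -> str:
    """Map the score to a coarse emotional state (assumed thresholds)."""
    if score <= -0.5:
        return "dissatisfaction"
    if score >= 0.5:
        return "satisfaction"
    return "neutral"


print(emotion_label(emotion_score("My issue is still not resolved")))
```

The resulting label is the kind of emotion data that Step 5 combines with the query content when instructing the generative AI model.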
[0757] (Application Example 2)
[0758] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0759] In today's information society, efficiently and effectively handling inquiries from a large number of users is crucial. However, conventional response systems merely answer questions without considering the user's emotional state, limiting their ability to improve the user experience. Furthermore, conventional technologies are insufficient for multilingual support and rapid response generation, and there is a particular need to understand and appropriately respond to user emotions.
[0760] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0761] In this invention, the server includes means for acquiring voice information, means for analyzing the user's emotions from the voice information, and means for optimizing the response using the analyzed emotion information. This makes it possible to generate highly compatible responses that take the user's emotions into consideration.
[0762] "Audio information" refers to data or signals that are treated as audio.
[0763] "Character information" refers to data or signals that are represented as characters.
[0764] "Natural language" refers to the language systems that humans use on a daily basis.
[0765] "Machine learning" is a technology that enables systems to autonomously improve their performance using data.
[0766] A "knowledge base" is a collection of data that systematically organizes information related to a specific domain.
[0767] "Emotional information" refers to data about a user's emotional state obtained from their voice and actions.
[0768] "Analysis" is the process of thoroughly examining data and information to understand its structure and meaning.
[0769] "Optimization" means adjusting something to the most effective or efficient state according to its purpose.
[0770] "Real-time" refers to a state in which information processing and operations are performed instantly.
[0771] As an embodiment for carrying out the present invention, the system realizing this application example comprises a server, a user terminal, and a network connecting them. The server converts the user's voice information into text information using a speech recognition API, and analyzes this converted data using a natural language processing library. spaCy or a similar open-source library is used for the analysis. For the analysis of emotional information, a custom model built with PyTorch is applied to estimate the emotional state from the user's voice.
[0772] The user terminal is a smartphone or similar mobile device equipped with a microphone for voice input and a speaker for voice output. The voice input is sent to the server in real time, and a response is generated based on the analysis results. This response is then converted into speech by a speech synthesis engine and provided to the user.
[0773] The server also performs a process of verifying the accuracy of the generated response by comparing it against a knowledge base. This enables the system to provide appropriate and efficient responses that meet the user's expectations. For example, if a user states that "the delivery of my item is delayed," the system will promptly apologize, check the shipping status, and process the approved claim.
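The verification against a knowledge base described above could take, for example, the following form. The in-memory dictionary `KNOWLEDGE_BASE`, the order identifiers, and the fallback wording are hypothetical; an actual knowledge base would typically be a database or a search index.

```python
from typing import Dict, Optional

# Hypothetical knowledge base keyed by order ID.
KNOWLEDGE_BASE: Dict[str, Dict[str, str]] = {
    "A-1001": {"status": "delayed"},
    "A-1002": {"status": "delivered"},
}


def verify_response(order_id: str, claimed_status: str) -> bool:
    """Check that the status asserted in a draft matches the stored record."""
    record: Optional[Dict[str, str]] = KNOWLEDGE_BASE.get(order_id)
    return record is not None and record["status"] == claimed_status


def finalize_response(order_id: str, claimed_status: str, draft: str) -> str:
    """Release the draft only when it agrees with the knowledge base."""
    if verify_response(order_id, claimed_status):
        return draft
    # Fall back to a safe reply rather than sending an unverified claim.
    return "Let me confirm the latest status of your order and get back to you."


print(finalize_response("A-1001", "delayed",
                        "We apologize. Your delivery is delayed."))
```

Separating the check from the fallback makes it possible to log rejected drafts as feedback without changing what the user hears.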
[0774] An example of a prompt message is, "We have detected dissatisfaction from the user's voice. Generate a response that includes a prompt response and an apology." This enables the generative AI model to provide emotionally responsive interactions.
[0775] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0776] Step 1:
[0777] The user initiates voice input on the terminal. The terminal uses its microphone to acquire this voice information, converts it into digital data in real time, and sends it to the server. The input is the user's voice, and the output is compressed voice data transferred to the server.
[0778] Step 2:
[0779] The server converts the received audio data into text information using a speech recognition API. In this process, audio waveform data is input and text data is output. Specifically, APIs such as Google Cloud Speech-to-Text are used to convert the audio into a portable text format.
[0780] Step 3:
[0781] The server analyzes the converted text information using a natural language processing library (e.g., spaCy). The input here is speech-recognized text data, and the analyzed contextual information and commands are output. Specifically, it extracts knowledge from the text data and performs syntactic and semantic analysis to understand the user's intent.
[0782] Step 4:
[0783] The server uses an emotion information analysis module to identify the user's emotional state from textual information. The input is the contextual information obtained in step 3, and the output is the user's emotion (e.g., dissatisfied, satisfied). Specifically, a machine learning model (e.g., a PyTorch model) is used to perform pattern recognition to determine the emotion.
[0784] Step 5:
[0785] The server generates an appropriate response based on the analyzed sentiment information and the query content. The input is the user's intent and sentiment information, and the output is a text-based response. A generative AI model is used to create context-aware, real-time responses using prompt sentences.
[0786] Step 6:
[0787] The generated text response is converted into speech data through a speech synthesis engine. The server sends this speech data to the terminal. The input is a text-based response, and the output is a speech-based response. Specifically, speech synthesis technology that produces natural-sounding speech (e.g., a Text-to-Speech API) is used.
[0788] Step 7:
[0789] The user terminal plays the audio data received from the server and provides an immediate response to the user. The input is the audio data from the server, and the output is the audio the user hears. The specific action at this stage is to present a response to the user using the terminal's audio output device (speaker).
[0790] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0791] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0792] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0793] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0794] Figure 9 shows an emotion map 400 on which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions located there. Toward the outside of the concentric circles, emotions representing states and actions arising from mental states are located. Here, emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. At the top and bottom of the concentric circles, emotions that are generally generated from reactions occurring in the brain and also induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure by which emotions arise, and emotions that are likely to occur simultaneously are mapped close to each other.
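The layout described in paragraph [0794] can be sketched as a small data structure in which each emotion has a radius and an angle. This is only an illustrative model: the class MappedEmotion, the function map_distance, and the coordinate conventions are assumptions introduced here, and the actual placement of emotions is given by the figure, not by this code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedEmotion:
    """One emotion placed on a circular map (placement values are illustrative)."""
    name: str
    radius: float     # 0.0 = center (more primitive), 1.0 = outer edge (expressed as action)
    angle_deg: float  # 0 = right (3 o'clock), 90 = top, 180 = left, 270 = bottom

    def position(self) -> tuple:
        a = math.radians(self.angle_deg)
        return (self.radius * math.cos(a), self.radius * math.sin(a))

    def valence(self, eps: float = 1e-9) -> str:
        """Upper half is pleasure, lower half is displeasure."""
        y = self.position()[1]
        if abs(y) < eps:
            return "neutral"
        return "pleasure" if y > 0 else "displeasure"

    def origin(self, eps: float = 1e-9) -> str:
        """Left half is reaction-driven, right half is situation-driven."""
        x = self.position()[0]
        if abs(x) < eps:
            return "both"
        return "reaction" if x < 0 else "situation"


def map_distance(a: MappedEmotion, b: MappedEmotion) -> float:
    """Euclidean distance on the map; a small value means the two are mapped close together."""
    return math.dist(a.position(), b.position())
```

Under this model, two emotions that are likely to occur simultaneously would simply be given nearby coordinates, so that map_distance returns a small value for them.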
[0795] These emotions are distributed at the 3 o'clock position on the emotion map 400 and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0796] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the closer an emotion is to the outside of the emotion map 400, the more visible it becomes (that is, the more it is expressed in actions).
[0797] Here, human emotions are based on various balances, such as posture and blood sugar level. When these balances move away from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. Similarly, for robots, cars, motorcycles, and the like, emotions can be created based on various balances, such as posture and battery level. When these balances move away from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
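The balance-based idea in paragraph [0797] can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each balance is a number, that the ideal is known, and that pleasure or discomfort is judged from whether the total deviation shrinks or grows between two observations. The function names and the keys battery_level and tilt are hypothetical.

```python
def total_deviation(state: dict, ideal: dict) -> float:
    """Sum of absolute deviations from the ideal, over the keys of `ideal`."""
    return sum(abs(state[k] - ideal[k]) for k in ideal)


def balance_feeling(previous: dict, current: dict, ideal: dict, tol: float = 1e-9) -> str:
    """Compare two observations of the same balances against the ideal."""
    before = total_deviation(previous, ideal)
    after = total_deviation(current, ideal)
    if after < before - tol:
        return "pleasure"      # the balances moved toward the ideal
    if after > before + tol:
        return "discomfort"    # the balances moved away from the ideal
    return "neutral"


ideal = {"battery_level": 1.0, "tilt": 0.0}
charging = balance_feeling(
    previous={"battery_level": 0.5, "tilt": 0.0},
    current={"battery_level": 0.7, "tilt": 0.0},
    ideal=ideal,
)
print(charging)  # pleasure
```

Summing unweighted deviations is a simplification; a practical system would likely normalize or weight each balance before combining them.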
[0798] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0799] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
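One way to obtain training targets in which emotions located close together have similar values, as described in paragraph [0799], is to let the target value of each emotion fall off smoothly with its map distance from the labeled emotion. The sketch below assumes a Gaussian falloff and entirely hypothetical coordinates; neither is stated in this disclosure. It also shows the final step of picking the emotion with the largest value.

```python
import math

# Hypothetical 2-D positions on the emotion map (the real layout is in the figures).
POSITIONS = {
    "reassured": (0.60, 0.10),
    "calm": (0.65, 0.00),
    "confident": (0.55, 0.10),
    "angry": (-0.70, -0.50),
}


def soft_targets(label: str, sigma: float = 0.3) -> dict:
    """Target emotion values: 1.0 at the labeled emotion, decaying with map distance."""
    cx, cy = POSITIONS[label]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in POSITIONS.items()
    }


def dominant_emotion(values: dict) -> str:
    """Determine the emotion as the one with the highest emotion value."""
    return max(values, key=values.get)


targets = soft_targets("calm")
print(dominant_emotion(targets))  # calm
```

With these assumed coordinates, "reassured" and "confident" receive values close to that of "calm", while "angry" receives a value near zero, which mirrors the grouping of similar emotion values described for Figure 10.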
[0800] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0801] In the above embodiment, an example was given in which the specific processing is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific processing may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.
[0802] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.
[0803] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0804] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0805] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing the specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), and ASICs (Application Specific Integrated Circuits), which are processors having circuit configurations specifically designed to perform the specific processing. Each of these processors has built-in or connected memory and performs the specific processing by using that memory.
[0806] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0807] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0808] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0809] The descriptions and illustrations presented above are detailed explanations of the parts relating to the technology of this disclosure and are merely examples of that technology. For example, the above descriptions of the structure, function, operation, and effect are descriptions of examples of the structure, function, operation, and effect of the parts relating to the technology of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technology of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the parts relating to the technology of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable implementation of the technology of this disclosure have been omitted from the descriptions and illustrations presented above.
[0810] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0811] The following is further disclosed regarding the embodiments described above.
[0812] (Claim 1)
[0813] Means for acquiring audio data,
[0814] A means of converting this audio data into text data,
[0815] A means of analyzing the converted text data to understand the content of the query,
[0816] Means for generating an appropriate response based on an inquiry,
[0817] A means for converting the generated text response into audio data,
[0818] Means for making these operations compatible with multiple languages,
[0819] A system including means for collecting and learning the evaluation results of the aforementioned responses.
[0820] (Claim 2)
[0821] The system according to claim 1, wherein the response generation means verifies the accuracy of the response against a database.
[0822] (Claim 3)
[0823] The system according to claim 1, wherein the means for converting audio data to text and text data to audio operates in real time.
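The chain of means listed in the statements above (speech-to-text, analysis, response generation, text-to-speech, multilingual handling, and collection of evaluation results) can be sketched as a pipeline with injected stages. Everything named below, including SupportPipeline, the fake_* stages, and the REPLIES table, is a hypothetical stand-in; in particular, the "audio" here is just UTF-8 encoded text so that the sketch runs without any speech library.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SupportPipeline:
    """Chains the stages in order; every stage is supplied by the caller."""
    speech_to_text: Callable[[bytes, str], str]   # (audio, language) -> text
    analyze: Callable[[str], str]                 # text -> inquiry category
    generate: Callable[[str, str], str]           # (category, language) -> reply text
    text_to_speech: Callable[[str, str], bytes]   # (reply, language) -> audio
    evaluations: List[Tuple[str, int]] = field(default_factory=list)

    def handle(self, audio: bytes, language: str) -> bytes:
        text = self.speech_to_text(audio, language)
        category = self.analyze(text)
        reply = self.generate(category, language)
        return self.text_to_speech(reply, language)

    def record_evaluation(self, reply: str, rating: int) -> None:
        """Collect feedback; a real system would use it for later learning."""
        self.evaluations.append((reply, rating))


# Stand-in stages (hypothetical).
REPLIES = {
    ("billing", "en"): "I can help with your bill.",
    ("billing", "es"): "Puedo ayudarle con su factura.",
    ("general", "en"): "How can I help you today?",
    ("general", "es"): "¿En qué puedo ayudarle hoy?",
}


def fake_stt(audio: bytes, language: str) -> str:
    return audio.decode("utf-8")


def fake_analyze(text: str) -> str:
    lowered = text.lower()
    return "billing" if ("bill" in lowered or "factura" in lowered) else "general"


def fake_generate(category: str, language: str) -> str:
    # Fall back to English when the language is not in the table.
    return REPLIES.get((category, language), REPLIES[(category, "en")])


def fake_tts(reply: str, language: str) -> bytes:
    return reply.encode("utf-8")


pipeline = SupportPipeline(fake_stt, fake_analyze, fake_generate, fake_tts)
```

Passing the stages in as callables keeps the language-specific parts replaceable, so supporting another language is a matter of supplying stages that handle it rather than changing the pipeline itself.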
[0824] "Example 1"
[0825] (Claim 1)
[0826] Means for acquiring audio information,
[0827] A means of converting this audio information into text information,
[0828] A means of analyzing the converted text information to understand the content of the query,
[0829] A means for generating an appropriate response based on a query using a generative AI model,
[0830] A means for converting the generated text response into audio information,
[0831] Means for making these operations compatible with multiple natural languages,
[0832] A means for collecting and continuously learning the evaluation results of the aforementioned response,
[0833] A system including means for converting audio information to text information and text information to audio information in real time.
[0834] (Claim 2)
[0835] The system according to claim 1, wherein the response generation means verifies the accuracy of the response in light of a knowledge base.
[0836] (Claim 3)
[0837] The system according to claim 1, which generates a response by inputting a prompt sentence into an AI model that generates prompt sentences.
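The verification against a knowledge base mentioned in claim 2 above could, under one simple assumption, be a comparison of the factual statements in a draft response with stored entries. In the sketch below the draft response is assumed to carry its factual statements as key/value pairs; KNOWLEDGE_BASE, its entries, and verify_claims are all hypothetical.

```python
# Hypothetical knowledge base entries, for illustration only.
KNOWLEDGE_BASE = {
    "support_hours": "24 hours a day, 365 days a year",
    "supported_languages": "English, Japanese, Spanish",
}


def verify_claims(claims: dict, knowledge_base: dict) -> list:
    """Return (key, reason) pairs for statements the knowledge base does not support."""
    problems = []
    for key, claimed in claims.items():
        if key not in knowledge_base:
            problems.append((key, "not in knowledge base"))
        elif knowledge_base[key] != claimed:
            problems.append((key, "conflicts with knowledge base"))
    return problems


draft_claims = {"support_hours": "9 to 5", "refund_policy": "30 days"}
print(verify_claims(draft_claims, KNOWLEDGE_BASE))
```

A response with an empty problem list would be passed on to speech synthesis, while a non-empty list could trigger regeneration or escalation; exact string matching is used here only to keep the sketch short.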
[0838] "Application Example 1"
[0839] (Claim 1)
[0840] Means for acquiring audio information,
[0841] A means of converting this audio information into text information,
[0842] A means of analyzing the converted text information to understand the content of the inquiry or potential threat,
[0843] A means of generating appropriate responses or warnings based on inquiries,
[0844] A means for converting the generated text response into audio information,
[0845] Means for making these operations compatible with multiple languages,
[0846] A system that includes means for collecting and continuously learning from evaluation results of responses and warnings.
[0847] (Claim 2)
[0848] The system according to claim 1, wherein the response generation means verifies the accuracy of the response or warning against a database.
[0849] (Claim 3)
[0850] The system according to claim 1, which performs real-time conversion from audio information to text information and from text information to audio information, and responds immediately to suspicious speech.
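The distinction between an ordinary response and a warning in the statements above can be sketched as a triage step applied to the transcribed text. The phrase list and the function triage are hypothetical, and simple substring matching is used only as a placeholder for whatever analysis an actual system performs.

```python
# Hypothetical phrases treated as suspicious, for illustration only.
SUSPICIOUS_PHRASES = ("wire the money", "gift card", "do not tell anyone")


def triage(transcript: str) -> tuple:
    """Return ("warning", message) for suspicious speech, else ("response", message)."""
    lowered = transcript.lower()
    hits = [phrase for phrase in SUSPICIOUS_PHRASES if phrase in lowered]
    if hits:
        message = "Caution: this request matches a suspicious pattern (" + ", ".join(hits) + ")."
        return ("warning", message)
    return ("response", "Thank you for your inquiry. Let me look into that.")


kind, message = triage("Please buy a gift card and do not tell anyone.")
print(kind, message)
```

Because the check runs on each transcript as it arrives, a warning can be issued as soon as the suspicious wording is recognized rather than after the conversation ends.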
[0851] "Example 2 of combining an emotion engine"
[0852] (Claim 1)
[0853] Means for acquiring audio information,
[0854] A means of converting this audio information into text information,
[0855] A means of analyzing the converted text information to understand the content of the query,
[0856] Means for generating an appropriate response based on an inquiry,
[0857] A means for converting the generated text response into audio information,
[0858] Means for making these operations compatible with multiple natural languages,
[0859] A means for collecting the evaluation results of the aforementioned response and learning based on a machine learning model,
[0860] A means of analyzing emotional states from audio information and using that data to generate responses,
[0861] A means for generating responses that include tone and suggestions that take into account the user's emotional state,
[0862] Means of providing responses in real time,
[0863] A system including the above means.
[0864] (Claim 2)
[0865] The system according to claim 1, wherein the response generation means verifies the accuracy of the response in light of a knowledge base.
[0866] (Claim 3)
[0867] The system according to claim 1, wherein the means for converting audio information to text information and converting text information to audio information is operated immediately.
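Generating a response whose tone and suggestions take the user's emotional state into account, as in the statements above, might look like the following sketch. The emotion labels, the opener phrases in TONE_OPENERS, and the function adapt_response are assumptions made for illustration; the emotion itself is assumed to have been estimated already from the audio information.

```python
from typing import Optional

# Hypothetical tone openers keyed by the estimated emotion.
TONE_OPENERS = {
    "angry": "I'm sorry for the trouble this has caused. ",
    "anxious": "Please don't worry, we can sort this out together. ",
}


def adapt_response(base_reply: str, emotion: str, suggestion: Optional[str] = None) -> str:
    """Prepend an emotion-dependent opener and append an optional suggestion."""
    opener = TONE_OPENERS.get(emotion, "")  # no opener for emotions without an entry
    reply = opener + base_reply
    if suggestion:
        reply += " " + suggestion
    return reply


print(adapt_response("Your router needs a restart.", "angry",
                     "Would you like me to walk you through it?"))
```

Keeping the factual reply separate from the opener and the suggestion means the content can still be verified independently of how it is phrased for a given emotional state.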
[0868] "Application example 2 when combining with an emotional engine"
[0869] (Claim 1)
[0870] Means for acquiring audio information,
[0871] A means of converting this audio information into text information,
[0872] A means of analyzing the converted text information to understand the content of the query,
[0873] Means for generating an appropriate response based on an inquiry,
[0874] A means for converting the generated text response into audio information,
[0875] Means for making these operations compatible with multiple natural languages,
[0876] A means for collecting and machine learning the evaluation results of the aforementioned response,
[0877] A means for analyzing user emotions from audio information,
[0878] A system that includes means for optimizing responses using analyzed emotional information.
[0879] (Claim 2)
[0880] The system according to claim 1, wherein the response generation means verifies the accuracy of the response in light of a knowledge base and takes emotional information into consideration.
[0881] (Claim 3)
[0882] The system according to claim 1, wherein the means for converting audio information to text information and converting text information to audio information operates in real time, and emotion information analysis is performed in real time.
[Explanation of Symbols]
[0883]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: means for acquiring audio information; means for converting the audio information into text information; means for analyzing the converted text information to understand the content of an inquiry or a potential threat; means for generating an appropriate response or warning based on the inquiry; means for converting the generated text response into audio information; means for making these operations compatible with multiple languages; and means for collecting and continuously learning from evaluation results of the responses and warnings.
2. The system according to claim 1, wherein the response generation means verifies the accuracy of the response or warning against a database.
3. The system according to claim 1, which performs real-time conversion from audio information to text information and from text information to audio information, and responds immediately to suspicious speech.