system

The system efficiently converts and analyzes audio feedback into actionable suggestions, addressing the challenge of rapid service improvement in the service industry and shelters.

JP2026096694A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

AI Technical Summary

Technical Problem

Existing systems in the service industry and shelters face challenges in efficiently collecting, classifying, and analyzing customer feedback and disaster victim complaints to improve services quickly.

Method used

A system that acquires audio data, converts it into text, analyzes and categorizes it, and generates improvement suggestions based on statistically organized data for efficient feedback collection and presentation.

Benefits of technology

Enables rapid and effective analysis of user feedback, allowing for immediate generation and presentation of improvement suggestions.

✦ Generated by Eureka AI based on patent content.

Abstract

A system is provided. [Solution] The system includes an acquisition means for acquiring audio data, a conversion means for converting the audio data into text data, an analysis means for analyzing and categorizing the text data, an organization means for statistically organizing the analysis results, a generation means for generating improvement proposals based on the analysis results, and an output means for outputting the improvement proposals.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In the service industry and in shelters, it is difficult to efficiently collect and classify feedback, requests, and complaints from customers and disaster victims. Furthermore, such data must be analyzed quickly and used for service improvement. Solving these problems is necessary to improve the efficiency of business operations and the quality of services.

Means for Solving the Problems

[0005] This invention provides a system that includes means for acquiring audio data and means for converting it into text data. The converted data is analyzed and categorized. The system further includes means for generating improvement suggestions based on statistically organized data and, by outputting these suggestions, enables efficient feedback and the presentation of improvement measures. This allows feedback from the field to be collected appropriately and effective improvement suggestions to be provided rapidly.

[0006] "Audio data" refers to the user's speech or voice signals collected through a microphone or other voice acquisition devices.

[0007] "Acquisition means" refers to the process or device that acquires voice from the user using a microphone or similar device and converts it into a format usable within the system.

[0008] "Conversion means" refers to a process or device that analyzes acquired audio data and converts it into text data in string format.

[0009] "Analysis means" refers to a process or device that processes converted text data to identify important information or classify it into categories.

[0010] "Organization means" refers to a process or device that restructures the analyzed data using statistical methods and stores or displays it in a meaningful format.

[0011] "Generating means" refers to a process or device that automatically generates improvement suggestions or other recommendations based on organized data.

[0012] "Output means" refers to a method or device for presenting generated suggestions or feedback to the user, providing information in the form of audio or text.

Brief Description of the Drawings

[0013]

[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.

[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.

[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.

[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.

[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.

[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.

[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.

[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.

[Figure 9] An emotion map to which a plurality of emotions are mapped.

[Figure 10] An emotion map to which a plurality of emotions are mapped.

[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.

[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.

[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.

[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

Embodiments for Carrying Out the Invention

[0014] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0015] First, the terms used in the following description will be explained.

[0016] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0017] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as work memory.

[0018] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0019] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0020] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."

[0021] [First Embodiment]

[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0034] In implementing the present invention, the system mainly consists of three elements: a server, a terminal, and a user. The following describes how these elements interact and how the invention is implemented.

[0035] First, the user speaks into a terminal installed within the facility. The terminal uses a microphone to capture the user's voice and converts the captured voice data into text data using speech recognition technology. Through this conversion, the voice information is sent to the server as an analyzable string.

[0036] Next, the server analyzes the received text data using natural language processing (NLP) techniques. This analysis classifies and organizes the text data into categories. For example, it might be categorized as complaints about the service or suggestions for improvement.
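
The categorization described in [0036] could be sketched as follows. This is a minimal keyword-matching illustration, not the analysis method of the disclosure, which only states that natural language processing is used. The category names and keyword lists are hypothetical.

```python
# Hypothetical categories and keyword lists, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "complaint": ("too long", "slow", "dirty", "rude"),
    "suggestion": ("should", "could you", "please add", "it would be nice"),
}


def categorize(text: str) -> str:
    """Return the first category whose keyword appears in the text, else 'other'."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


if __name__ == "__main__":
    print(categorize("The wait time is too long."))
```

A production system would likely replace the keyword table with a trained classifier; the function signature (text in, category label out) is the part that corresponds to the description.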

[0037] The analyzed data is stored in a database on the server and statistically organized. Through this organization process, the information in the database is continuously updated, making it possible to understand trends such as which feedback patterns are increasing over time.
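
One way to picture the statistical organization in [0037] is a per-week count of feedback by category, from which rising categories can be read off. The sketch below assumes feedback arrives as (date, category) pairs and uses an in-memory counter instead of a database; the function names and the weekly granularity are assumptions.

```python
from collections import Counter
from datetime import date


def weekly_counts(records):
    """Count feedback per (ISO year, ISO week, category)."""
    counts = Counter()
    for day, category in records:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(iso_year, iso_week, category)] += 1
    return counts


def rising_categories(counts, earlier, later):
    """Return categories reported more often in the `later` week than in the `earlier` week.

    `earlier` and `later` are (ISO year, ISO week) tuples.
    """
    categories = {category for (_, _, category) in counts}
    return sorted(
        category
        for category in categories
        if counts[(*later, category)] > counts[(*earlier, category)]
    )


if __name__ == "__main__":
    sample = [
        (date(2024, 12, 2), "wait_time"),
        (date(2024, 12, 9), "wait_time"),
        (date(2024, 12, 10), "wait_time"),
    ]
    print(rising_categories(weekly_counts(sample), (2024, 49), (2024, 50)))
```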

[0038] The server generates new improvement suggestions based on the organized data. An algorithm is applied as the generation method, which forms specific suggestions aimed at improving the service. For example, suggestions may include strengthening employee training or adding new menu items.
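
The disclosure does not specify the generation algorithm in [0038]. As a rough sketch under that caveat, a rule-based version might map frequently reported categories to suggestion templates. The template table and the `min_count` threshold are hypothetical; the suggestion wording is taken from the examples in the description.

```python
# Hypothetical mapping from feedback category to a suggestion template.
SUGGESTION_TEMPLATES = {
    "wait_time": "Increase the number of staff to reduce wait times.",
    "staff": "Strengthen employee training.",
    "menu": "Consider adding new menu items.",
}


def generate_suggestions(category_counts, min_count=2):
    """Return suggestions for categories reported at least `min_count` times.

    Categories are ranked by descending count, with ties broken by name.
    """
    ranked = sorted(category_counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        SUGGESTION_TEMPLATES[category]
        for category, count in ranked
        if count >= min_count and category in SUGGESTION_TEMPLATES
    ]


if __name__ == "__main__":
    for line in generate_suggestions({"wait_time": 5, "menu": 2, "staff": 1}):
        print(line)
```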

[0039] Finally, the server sends the generated suggestions to the terminal, which then outputs them to the user as feedback. Output methods include direct voice notification to the user using speech synthesis technology, as well as displaying the suggestions as text on the screen.

[0040] For example, if a user provides feedback such as "The wait time is too long," the terminal captures the audio, converts it to text, and sends it to the server. The server identifies the complaint as belonging to the "wait time" category, analyzes the relevant data, and generates suggestions for service improvement. These suggestions are then fed back to the user from the terminal, and may include suggestions such as "Increase the number of staff to reduce wait times."

[0041] Through this process, the present invention enables the efficient collection and analysis of user feedback, and the generation of immediate feedback and improvement suggestions.

[0042] The following describes the processing flow.

[0043] Step 1:

[0044] The user speaks into the terminal to input feedback and requests by voice.

[0045] Step 2:

[0046] The terminal acquires the user's voice and records it as audio data.

[0047] Step 3:

[0048] The terminal converts the acquired audio data into text data using speech recognition technology.

[0049] Step 4:

[0050] The terminal sends the converted text data to the server.

[0051] Step 5:

[0052] The server analyzes the text data it receives using natural language processing techniques.

[0053] Step 6:

[0054] The server categorizes the text data.

[0055] Step 7:

[0056] The server updates statistical information using the classified data and saves it to the database.

[0057] Step 8:

[0058] The server generates improvement suggestions based on statistical information.

[0059] Step 9:

[0060] The server sends improvement suggestions it has generated to the terminal.

[0061] Step 10:

[0062] The terminal outputs the received improvement suggestions to the user through speech synthesis and on-screen display.
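
The ten steps above can be summarized in a small sketch that wires a terminal to a server. It is an illustration under several assumptions: speech recognition, categorization, and suggestion generation are injected as plain functions (stubbed in the usage example), a direct method call stands in for the network, and statistics are kept in memory. The class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Server:
    """Steps 5 to 8: categorize text, update statistics, generate a suggestion."""
    categorize: Callable[[str], str]
    suggest: Callable[[Counter], str]
    stats: Counter = field(default_factory=Counter)

    def handle(self, text: str) -> str:
        category = self.categorize(text)   # steps 5 and 6
        self.stats[category] += 1          # step 7 (in memory, not a real database)
        return self.suggest(self.stats)    # step 8


@dataclass
class Terminal:
    """Steps 1 to 4 and 9 to 10: capture, transcribe, send, and output."""
    transcribe: Callable[[bytes], str]
    server: Server

    def on_speech(self, audio: bytes) -> str:
        text = self.transcribe(audio)          # steps 2 and 3
        suggestion = self.server.handle(text)  # steps 4 and 9 (direct call, no network)
        return suggestion                      # step 10 would display or speak this


# Stubs used only to exercise the flow; real components would replace them.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def simple_categorize(text: str) -> str:
    return "wait_time" if "wait" in text.lower() else "other"


def simple_suggest(stats: Counter) -> str:
    top_category, _ = stats.most_common(1)[0]
    if top_category == "wait_time":
        return "Increase the number of staff to reduce wait times."
    return "No suggestion yet."


if __name__ == "__main__":
    terminal = Terminal(fake_transcribe, Server(simple_categorize, simple_suggest))
    print(terminal.on_speech(b"The wait time is too long"))
```

Injecting the components as callables keeps the flow itself independent of any particular speech recognition or language processing service.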

[0063] (Example 1)

[0064] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0065] Traditional customer feedback systems suffer from inefficient processes, from acquiring and analyzing voice data to generating improvement suggestions, making it difficult to translate feedback into rapid service improvements. In particular, delays in accurately converting voice data to text and statistically organizing analysis results can slow down the implementation of feedback, making it difficult to improve customer satisfaction.

[0066] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0067] In this invention, the server includes equipment means for acquiring voice information, technical means for converting voice information into text information, and processing means for analyzing and classifying the text information. This enables rapid and accurate processing of voice feedback from customers, and prompt generation and provision of service improvement suggestions.

[0068] "Audio information" refers to the language and sounds spoken by the user, and is typically data acquired through input devices such as microphones.

[0069] "Equipment means" refers to hardware devices or equipment for acquiring audio information, specifically input devices such as microphones.

[0070] "Textual information" refers to text data obtained after audio information has been converted, and is data in a format that can be processed and analyzed by a computer.

[0071] "Technical means" refers to software or algorithms used to convert speech information into text information, and speech recognition technology is an example of this.

[0072] "Processing means" refers to software or algorithms for analyzing acquired textual information and classifying its content, and natural language processing technology falls under this category.

[0073] "Data means" refers to a system or software for collecting, storing, and organizing analyzed text information, and database management systems are an example of this.

[0074] "Algorithmic means" refers to computational methods or processes for generating improvement suggestions from organized data, and machine learning algorithms fall under this category.

[0075] "Display means" refers to a device or method for presenting generated improvement suggestions to the user, and this includes displays and speech synthesis devices.

[0076] This invention is a voice processing system aimed at improving services through feedback collection and analysis using voice information. The following describes specific implementations of this system.

[0077] Users provide verbal feedback to a terminal installed within the facility. This terminal is equipped with a microphone, which acts as a voice acquisition device to capture voice information. The terminal uses speech recognition technology to convert the voice information into text information. Specifically, it utilizes a commonly available speech recognition API. This API functions as a technical means for converting the acquired voice into text data.

[0078] The text information converted by the terminal is sent to the server. The server analyzes this text information using natural language processing technology. This analysis process includes algorithmic means for classifying text data according to its content. For example, a Python®-based natural language processing library is used to analyze the text and classify it into categories such as "waiting time" or "service."

[0079] The analyzed data is stored in a database management system on the server. Database technologies such as MySQL (registered trademark) are used for data organization. This stored data serves as foundational material for revealing trends through statistical analysis.
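
The storage and organization described in [0079] might look like the following sketch. The description names MySQL; here Python's built-in `sqlite3` module is swapped in so the example runs without a database server. The table layout and column names are assumptions.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a single feedback table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE feedback (
            id INTEGER PRIMARY KEY,
            received_on TEXT NOT NULL,
            category TEXT NOT NULL,
            body TEXT NOT NULL
        )
        """
    )
    return conn


def add_feedback(conn, received_on, category, body):
    """Store one analyzed feedback item; received_on is an ISO date string."""
    conn.execute(
        "INSERT INTO feedback (received_on, category, body) VALUES (?, ?, ?)",
        (received_on, category, body),
    )


def monthly_counts(conn):
    """Return (year-month, category, count) rows ordered by month and category."""
    cursor = conn.execute(
        "SELECT substr(received_on, 1, 7) AS ym, category, COUNT(*) "
        "FROM feedback GROUP BY ym, category ORDER BY ym, category"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = open_store()
    add_feedback(store, "2024-12-03", "wait_time", "The wait time is too long")
    print(monthly_counts(store))
```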

[0080] Based on the organized data, the server uses a generation algorithm to formulate improvement suggestions. Machine learning frameworks such as Scikit-learn are used for this generation process. The generated suggestions are then communicated to the user in a concrete and actionable format.

[0081] The terminal provides the user with improvement suggestions received from the server using a display and speech generation technology. For speech generation, a common speech synthesis technology, such as the Text-to-Speech API, is used.

[0082] For example, if a user provides voice feedback stating, "The wait time is too long," the terminal converts this to text and sends it to the server. The server identifies the complaint about the wait time, analyzes it based on relevant data, and generates a suggestion. This suggestion is ultimately provided to the user, as either voice or text, and may include a message such as, "Increase staffing levels to reduce wait times."

[0083] An example of a prompt to input into the generation AI model is text such as, "How does the system generate improvement suggestions after receiving voice feedback?" This prompt is useful for testing the system's behavior and evaluating the improvement suggestion generation process.

[0084] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0085] Step 1:

[0086] Users provide verbal feedback to terminals installed within the facility. The input is voice information captured via the terminal's microphone. For example, when a user says, "The wait time is long," the voice data is recorded.

[0087] Step 2:

[0088] The terminal converts the acquired audio information into text information using speech recognition technology. Here, the audio data is the input and the text data is the output. A speech recognition API is used in this process; specifically, the audio is converted into the text "The wait time is long."

[0089] Step 3:

[0090] The terminal sends the converted text data to the server. The text data, as input, is securely transferred using the HTTPS protocol. Specifically, the text "The wait time is long" arrives at the server.

[0091] Step 4:

[0092] The server analyzes the received text data using natural language processing techniques. The text data received as input is categorized after analysis. For example, keywords such as "waiting time" are extracted and categorized as complaints. A natural language processing library is used for this process.

[0093] Step 5:

[0094] The server stores the analyzed data in a database and organizes it statistically. The analyzed data, as input, is stored in the database management system, and time-series data is generated as output. Specifically, data related to waiting times is organized to understand feedback trends.

[0095] Step 6:

[0096] The server applies an algorithm to generate improvement suggestions based on the organized data. The organized data, as input, is analyzed through the algorithm, and specific suggestions are generated as output. For example, a suggestion might be made to "increase the number of staff" to reduce waiting times.

[0097] Step 7:

[0098] The server sends the generated improvement suggestions to the terminal, which then presents them to the user either visually or audibly. The input is the generated suggestions, which are presented to the user as output. Specifically, the suggestions are presented either as audio or as text, communicating to the user, "We will increase staff numbers to reduce waiting times."
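
As an illustration of the HTTPS transfer in Step 3, the sketch below builds a POST request carrying the recognized text as JSON, using only the standard library. The endpoint URL, the JSON field names, and the terminal identifier are hypothetical; the description specifies only that HTTPS is used. The request is constructed but not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; the source does not name a URL or payload format.
SERVER_URL = "https://feedback.example.com/api/feedback"


def build_feedback_request(text, terminal_id):
    """Build (but do not send) an HTTPS POST request carrying the recognized text."""
    payload = json.dumps({"terminal_id": terminal_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_feedback_request("The wait time is long", "terminal-01")
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual transfer against a real server.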

[0099] (Application Example 1)

[0100] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0101] In physical stores, there is a need to efficiently and immediately collect and analyze customer feedback to quickly propose concrete improvement measures that enhance the customer experience. However, traditional methods often involve manual feedback collection, lacking immediacy. Furthermore, the process of effectively analyzing the collected feedback and generating specific improvement proposals is not sufficiently automated. As a result, it is difficult to quickly implement improvement measures that lead to increased customer satisfaction.

[0102] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0103] In this invention, the server includes an acquisition means for acquiring audio data, a conversion means for converting audio data into text data, an analysis means for analyzing text data and classifying it into categories related to customer experience, an organization means for accumulating the analysis results and statistically organizing them as time-series data, a generation means for generating new improvement suggestions, and an output means for notifying improvement suggestions using audio output or a display device. This enables efficient collection and analysis of customer feedback in physical stores and allows for the immediate proposal of concrete improvement measures.

[0104] "Acquisition means" refers to devices and technologies for directly collecting voice, which allows user feedback to be incorporated into the system.

[0105] "Conversion means" refers to technology for converting collected audio data into text data, and plays the role of replacing audio information with an analyzable format.

[0106] "Analysis means" refers to technologies and algorithms that analyze converted text data in detail and classify it into specific categories, with the aim of clearly understanding the user's intentions and emotions.

[0107] "Organization means" refers to the process of systematically storing analyzed data using time series and other statistical methods, thereby enabling continuous data management and trend analysis.

[0108] "Generation means" refers to algorithms and technologies that devise new service improvement measures and improvement proposals based on organized data, and that are capable of formulating concrete action plans.

[0109] "Output means" refers to technology for notifying users of generated improvement suggestions via audio or visual means, and plays a role in effectively providing feedback to users.

[0110] Modes for carrying out the invention

[0111] The system implementing this invention mainly consists of three elements: a server, a terminal, and a user. The user can provide voice feedback via a terminal installed in the store. The terminal has a built-in voice input device, which makes it possible to effectively acquire the user's voice data.

[0112] A voice input device (e.g., a standard microphone) is used to acquire the voice data. The terminal sends the acquired voice data to the server. The server converts this voice data into text data using speech recognition software (such as Google's Cloud Speech-to-Text API).

[0113] Text data is analyzed on the server using natural language processing technologies (such as the Google NLP API) and categorized into categories related to the customer experience. This makes it easier to organize the specific feedback provided by users. The analyzed data is stored in a database (such as Firebase or AWS® DynamoDB) and statistically organized over time.

[0114] Based on the organized data, the server uses Python scripts and the scikit-learn library to generate improvement suggestions. These suggestions are then communicated to the user via text-to-speech software or a display device. This allows stores to respond to customer feedback immediately and make improvements.

[0115] For example, if a user speaks into the terminal saying, "The product placement is confusing," the audio is converted into text and categorized. The organized data is stored as a problem related to "product placement," and the server may generate suggestions such as "station staff to provide product guidance" or "improve the store layout."

[0116] An example of a prompt to input into the generating AI model is: "Based on the following feedback data, generate service improvement suggestions for physical stores: 'The product layout is unclear, and store guidance is insufficient.'"
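
A prompt of the form shown in [0116] could be assembled from collected feedback as in the sketch below. The function name and its `setting` parameter are hypothetical, and the call to the generative AI model itself is deliberately left out because the description does not identify a model or API.

```python
def build_prompt(feedback_items, setting="physical stores"):
    """Join non-empty feedback strings and wrap them in the instruction text."""
    joined = " ".join(item.strip() for item in feedback_items if item.strip())
    return (
        "Based on the following feedback data, generate service improvement "
        f"suggestions for {setting}: '{joined}'"
    )


if __name__ == "__main__":
    print(build_prompt(
        ["The product layout is unclear, and store guidance is insufficient."]
    ))
```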

[0117] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0118] Step 1:

[0119] The user provides feedback via a voice input device built into the terminal. Voice data is acquired as input. The terminal collects this as digital voice data in preparation for sending it to the server.

[0120] Step 2:

[0121] The server uses speech recognition software (e.g., Google Cloud Speech-to-Text API) to analyze the received audio data. This process involves data processing, converting the input audio data into text data. The output is a string representing the content of the audio.

[0122] Step 3:

[0123] The server analyzes the converted text data using natural language processing techniques (e.g., Google NLP API). This analysis categorizes the text into specific categories related to the customer experience. The input is text data, and the output is categorized information.

[0124] Step 4:

[0125] The server stores the analyzed data in a database. This data is organized based on time series. The specific operation of this step is to accumulate feedback by category and generate statistical data to understand trends.

[0126] Step 5:

[0127] The server generates new improvement suggestions based on the organized data. By utilizing Python scripts and the scikit-learn library to perform data calculations, it outputs specific improvement proposals for store operations.

[0128] Step 6:

[0129] The server sends the generated improvement suggestions to the terminal. The terminal notifies the user of the suggestions either verbally using speech synthesis technology or visually by displaying them on the screen. The output is the suggested content in either audio or text format. This process allows the user to receive improvement suggestions immediately.

[0130] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0131] This invention is a system that acquires and analyzes voice data and provides consulting and improvement suggestions that include the user's emotions. The system mainly consists of three elements: a server, a terminal, and the user.

[0132] First, the user makes a voice input into a terminal within the facility. This voice input includes feedback, requests, and comments. The terminal uses a microphone to acquire the voice data and immediately converts it into text data using speech recognition technology. This process converts the voice information into a format that the system can analyze.

[0133] Next, the terminal sends the converted text data to the server. The server analyzes the text data using natural language processing technology. This analysis incorporates a newly developed emotion engine that recognizes and extracts emotions from the words spoken by the user. The emotion engine can independently identify emotion categories such as positive, negative, and neutral from the text and evaluate their intensity.
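
The description does not disclose how the emotion engine in [0133] works internally. Purely as an illustration of its stated output (a positive, negative, or neutral label with an intensity), a word-list approach could be sketched as follows. The word lists and weights are hypothetical, and a real engine would presumably use the emotion identification model 59 instead.

```python
import re

# Hypothetical word lists with hand-picked weights, for illustration only.
NEGATIVE = {"frustrating": 2, "rude": 2, "slow": 1, "confusing": 1}
POSITIVE = {"great": 2, "friendly": 1, "helpful": 1, "clean": 1}


def estimate_emotion(text):
    """Return (label, intensity); label is 'positive', 'negative', or 'neutral'."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(POSITIVE.get(word, 0) - NEGATIVE.get(word, 0) for word in words)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", -score
    return "neutral", 0


if __name__ == "__main__":
    print(estimate_emotion("The staff were friendly and helpful"))
```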

[0134] The server uses text data and sentiment information to generate service improvement suggestions. Sentiment information directly influences the priority of the suggestions and the response strategy, and is a crucial element for gaining a rich understanding of the emotions users experience.

[0135] Finally, the generated improvement suggestions are sent from the server to the terminal. The terminal then presents feedback to the user, taking into account the analysis results of the emotion engine, through speech synthesis technology and display. This enables a more flexible and human-like response.

[0136] For example, if a user provides feedback to their device such as, "Recently, the staff's customer service attitude has been frustrating," the device converts this audio into text and sends it to the server. Based on this text, the server recognizes the user's frustration as a negative emotion and suggests "strengthening staff customer service training" as a corresponding improvement measure. This suggestion is then provided to the user as audio feedback.
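
The emotion engine in this example could be approximated, very roughly, by a word lexicon. The sketch below is not the emotion identification model 59 itself. The lexicon, its scores, and the function name `classify_emotion` are hypothetical, chosen only to show how a category (positive, negative, or neutral) and an intensity might be derived from text.

```python
import re

# Hypothetical lexicon: word -> signed score (negative values mean negative emotion).
LEXICON = {
    "frustrating": -0.8, "slow": -0.5, "rude": -0.9,
    "great": 0.8, "friendly": 0.7, "helpful": 0.6,
}

def classify_emotion(text):
    """Return (category, intensity) for a piece of feedback text."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    if not scores:
        return "neutral", 0.0
    mean = sum(scores) / len(scores)
    if mean > 0:
        category = "positive"
    elif mean < 0:
        category = "negative"
    else:
        category = "neutral"
    # Intensity is the magnitude of the mean score, regardless of sign.
    return category, abs(mean)

label, intensity = classify_emotion(
    "Recently, the staff's customer service attitude has been frustrating"
)
print(label, intensity)
```

A trained model would normally replace the lexicon; the return shape (category plus intensity) is the part that matches the description.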

[0137] In this way, the present invention realizes a system that improves the quality of user feedback and supports more effective service improvement by using emotion-based data.

[0138] The following describes the processing flow.

[0139] Step 1:

[0140] Users input feedback and requests via voice into their devices. These voice inputs may contain the user's emotions.

[0141] Step 2:

[0142] The device uses the microphone to capture the user's voice. The captured voice data is temporarily stored in the internal memory.

[0143] Step 3:

[0144] The device uses speech recognition technology to convert the acquired speech data into text data. A highly accurate speech recognition engine is used in this step to reduce recognition errors.

[0145] Step 4:

[0146] The device sends the converted text data to the server. This transmission uses data encryption to protect user privacy.

[0147] Step 5:

[0148] The server receives text data and analyzes its content using natural language processing (NLP) techniques. This analysis includes keyword extraction and contextual understanding.

[0149] Step 6:

[0150] The server uses an emotion engine to analyze the user's emotions contained in the text data. It evaluates the type of emotion (positive, negative, neutral) and its intensity.

[0151] Step 7:

[0152] The server combines text data content with sentiment information to generate improvement suggestions in response to user requests and complaints. These suggestions are prioritized based on the type and intensity of the sentiment.

[0153] Step 8:

[0154] The server generates improvement suggestions and sends them to the terminal. Data security is also considered here, and encryption is performed during transmission.

[0155] Step 9:

[0156] The terminal uses speech synthesis technology to provide voice feedback to the user based on improvement suggestions received from the server. Additionally, the suggestions are displayed as text on the screen as needed.

[0157] Step 10:

[0158] Users can receive feedback and provide further requests and feedback. This process can be repeated.
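
Step 7 above prioritizes suggestions by the type and intensity of the emotion. One possible reading is sketched below. The weights in `TYPE_WEIGHT`, the `Suggestion` class, and the scoring formula are assumptions; the source does not specify how the priority is computed.

```python
from dataclasses import dataclass

# Assumed weighting: negative feedback is treated as more urgent than
# neutral or positive feedback. These values are illustrative only.
TYPE_WEIGHT = {"negative": 2.0, "neutral": 1.0, "positive": 0.5}

@dataclass
class Suggestion:
    text: str
    emotion: str      # "positive", "negative", or "neutral"
    intensity: float  # 0.0 (weak) to 1.0 (strong)

    @property
    def priority(self) -> float:
        """Emotion-type weight multiplied by emotion intensity."""
        return TYPE_WEIGHT[self.emotion] * self.intensity

def prioritize(suggestions):
    """Return suggestions ordered from highest to lowest priority."""
    return sorted(suggestions, key=lambda s: s.priority, reverse=True)

ranked = prioritize([
    Suggestion("Review signage", "neutral", 0.4),
    Suggestion("Strengthen staff customer service training", "negative", 0.8),
    Suggestion("Keep the current seating layout", "positive", 0.9),
])
for s in ranked:
    print(f"{s.priority:.2f}  {s.text}")
```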

[0159] (Example 2)

[0160] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0161] Conventional voice data processing systems lack the ability to effectively analyze user feedback and propose concrete improvement measures, making it difficult to rapidly improve services and enhance user satisfaction. Furthermore, they struggle to provide feedback that takes emotions into account, and are insufficient to meet users' essential needs.

[0162] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0163] In this invention, the server includes receiving means for acquiring audio data, conversion means for converting audio data into text data, and processing means for analyzing text data and extracting emotions. This makes it possible to generate highly accurate improvement suggestions based on user feedback.

[0164] "Receiving means" refers to the devices or functions used to acquire audio data.

[0165] "Conversion means" refers to the process or technology used to convert acquired audio data into text data.

[0166] "Processing means" refers to functions and algorithms for analyzing text data and extracting emotions.

[0167] "Creation means" refers to the methods and functions for creating proposals based on the analysis results.

[0168] "Display means" refers to output functions or devices for presenting the generated proposals to the user.

[0169] This invention is a system that analyzes voice data and generates specific service improvement suggestions based on user feedback. This system primarily consists of a server, a terminal, and the user.

[0170] First, the user provides voice input to the device. For example, they can provide voice feedback such as, "The wait time at this store is long." The device uses a microphone that functions as an acoustic sensor to capture voice data.

[0171] Next, the device uses speech recognition technology to convert this speech data into text data. This process utilizes speech recognition software, and specifically, commonly used speech recognition APIs are suitable.

[0172] Next, the device sends this converted text data to the server. The server uses a natural language processing engine to analyze the text data in detail and extracts emotional data from the user feedback through an emotion engine. This analysis makes it possible to classify the feedback into positive, negative, or neutral emotions.

[0173] Based on the analysis results, the server creates specific improvement suggestions that reflect the sentiment data. For example, in response to negative feedback, suggestions such as increasing staff or managing shifts more efficiently may be generated.

[0174] Ultimately, the server sends the generated improvement suggestions to the terminal, which then uses speech synthesis technology to provide feedback to the user. For example, it's possible to utilize the API of a speech technology provider. In this process, the suggestions can also be presented visually using a display.

[0175] An example of a prompt message might be, "Please provide feedback. Based on your input, we will make suggestions for improving our service." This system allows for the effective use of user feedback based on emotions, enabling service improvements that enhance the user experience.

[0176] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0177] Step 1:

[0178] The user provides voice feedback to the device. The feedback is acquired as a digital audio signal via an acoustic sensor. The device collects this digital audio signal and prepares it for the next processing step.

[0179] Step 2:

[0180] The terminal passes the acquired audio data to speech recognition software, which converts it from audio to text data. Depending on the speech recognition technology used, the audio signal is converted into a string of characters. The input is audio data, and the output is text data. This makes the audio information easier to analyze.

[0181] Step 3:

[0182] The terminal sends the converted text data to the server. During this process, the text data is formatted appropriately and transferred with minimal delay. The input is text data, and the output is a transfer to the server.

[0183] Step 4:

[0184] The server analyzes the received text data using a natural language processing engine. An emotion engine extracts the user's emotions from the text. The input is text data, and the output is analyzed emotion information. Emotions are classified as positive, negative, or neutral.

[0185] Step 5:

[0186] The server generates improvement suggestions based on the analysis results. Specific improvement measures are devised based on the sentiment of user feedback. The input is sentiment information, and the output is suggestion information. This allows feedback to contribute to service improvement.

[0187] Step 6:

[0188] The server sends the created improvement suggestions to the terminal. The suggestions are adjusted based on the priority and content of the feedback. The input is the suggestion information, and the output is the transfer to the terminal.

[0189] Step 7:

[0190] The terminal outputs suggestions to the user using speech synthesis technology. In addition to audio presentation, it can also display the suggestions as text using the display. Input is suggestion information from the server, and output is feedback for the user. This allows the user to understand how their feedback contributes to improvements.
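
The seven steps above form a chain in which each stage's output is the next stage's input. The sketch below expresses that chain with each stage passed in as a function, so that a real speech recognition API, emotion engine, or speech synthesizer could be substituted. All function names and the stand-in implementations are hypothetical, and the network transfers of Steps 3 and 6 are omitted.

```python
from typing import Callable

def run_pipeline(
    audio: bytes,
    recognize: Callable[[bytes], str],
    extract_emotion: Callable[[str], str],
    suggest: Callable[[str, str], str],
    present: Callable[[str], None],
) -> str:
    """Run the flow: audio -> text -> emotion -> suggestion -> output."""
    text = recognize(audio)              # Step 2: audio data in, text data out
    emotion = extract_emotion(text)      # Step 4: text data in, emotion out
    suggestion = suggest(text, emotion)  # Step 5: emotion in, suggestion out
    present(suggestion)                  # Step 7: suggestion in, user feedback out
    return suggestion

# Stand-ins for the real components, for illustration only.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretends the bytes are already a transcript

def fake_emotion(text: str) -> str:
    return "negative" if "long" in text.lower() else "neutral"

def fake_suggest(text: str, emotion: str) -> str:
    if emotion == "negative":
        return "Consider increasing staff or managing shifts more efficiently."
    return "No change suggested."

shown = []
result = run_pipeline(
    b"The wait time at this store is long.",
    fake_recognize, fake_emotion, fake_suggest, shown.append,
)
print(result)
```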

[0191] (Application Example 2)

[0192] Next, we will describe Application Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0193] Data centers are operated by a large staff, and real-time and rapid feedback analysis is required to achieve efficient operations. However, conventional systems have difficulty appropriately analyzing the emotions contained in feedback and immediately proposing improvements, which has posed challenges to improving staff comfort and operational efficiency.

[0194] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0195] In this invention, the server includes an acquisition means for acquiring voice data, a conversion means for converting voice data into text data, and an analysis means for analyzing the text data and categorizing it, including emotions. This makes it possible to instantly analyze the emotions of staff voice feedback in a data center and provide improvement suggestions in real time.

[0196] "Voice data" refers to digital information, including the user's spoken words and sounds, acquired by acoustic sensors.

[0197] "Acquisition means" refers to devices or mechanisms for collecting audio data, and typically includes microphones and acoustic sensors.

[0198] "Conversion means" refers to algorithms or software processes for converting audio data into text data, utilizing speech recognition technology.

[0199] "Analysis means" refers to techniques and methods for analyzing text data and determining emotional categories based on individual words and phrases.

[0200] "Organization means" refers to functions that attach the emotional information obtained by the analysis means to the data and statistically classify, visualize, or organize it.

[0201] "Generation means" refers to a mechanism that creates effective improvement suggestions based on the organized data, taking emotional information into consideration.

[0202] "Output means" refers to methods or devices for presenting improvement suggestions to the user in audio or visual form, utilizing speech synthesis or displays.

[0203] This invention realizes a system that analyzes voice data to understand emotions and generates improvement suggestions. The specific implementation of the system is described below.

[0204] The server receives audio data transmitted from each terminal. This audio data is captured by the terminal's acoustic sensors and converted into text data on the spot using speech recognition technology. The main software used here is a speech recognition API (such as Google Cloud Speech-to-Text).

[0205] When text data is sent to the server, the server analyzes the text using a natural language processing library (such as spaCy). At this stage, the emotion engine classifies the emotion in the text as positive, negative, or neutral. As a result, the type and intensity of the emotion are evaluated.

[0206] Next, the server uses the organization means to statistically classify the analysis results and generates data annotated with emotional information based on this classification. Based on this data, the generation means creates improvement suggestions. The server then sends the generated improvement suggestions back to the terminal, which presents the suggestions to the user using speech synthesis technology (such as Amazon Polly) or a display.

[0207] For example, if a data center staff member gives feedback saying, "The air conditioning in the facility has been too strong lately, making it too cold," the emotion engine will recognize this voice as a "negative emotion." Based on this information, the server will suggest "revising the air conditioning settings" and notify the staff member of this suggestion in either voice or visual form.

[0208] An example of a prompt for a generating AI model would be: "The user feels that the air conditioning in the facility is too strong and it's too cold. What improvements would you suggest?" This allows the system to analyze emotions in real time and provide appropriate improvement suggestions.
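
A prompt of this kind could be assembled from the analysis results with a simple template, as in the sketch below. The template wording beyond the example sentence above, the emotion and intensity fields, and the function name `build_prompt` are assumptions; the call to the generative AI model itself is not shown.

```python
# Template based on the example prompt above; the emotion sentence is an
# assumed addition showing how analysis results might be passed along.
PROMPT_TEMPLATE = (
    "The user feels that {complaint}. "
    "The detected emotion is {emotion} (intensity {intensity:.1f}). "
    "What improvements would you suggest?"
)

def build_prompt(complaint: str, emotion: str, intensity: float) -> str:
    """Fill the template, trimming whitespace and any trailing period."""
    cleaned = complaint.strip().rstrip(".")
    return PROMPT_TEMPLATE.format(
        complaint=cleaned, emotion=emotion, intensity=intensity
    )

prompt = build_prompt(
    "the air conditioning in the facility is too strong and it's too cold",
    "negative",
    0.7,
)
print(prompt)
```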

[0209] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0210] Step 1:

[0211] The device acquires user voice data using an acoustic sensor. The input here is the user's voice, and the output is digitized voice data. The acoustic sensor picks up ambient sounds and records them as digital signals within the device.

[0212] Step 2:

[0213] The device uses a speech recognition API to convert audio data into text data. The input is audio data, and the output is the converted text data. The speech recognition API analyzes the audio waveform and converts each phoneme into its corresponding character to create text.

[0214] Step 3:

[0215] The terminal sends the converted text data to the server. The input is text data, and the output is the transmission of data to the server. Network protocols are used to securely and efficiently transfer the text data to the server.

[0216] Step 4:

[0217] The server applies a natural language processing library to analyze the received text data. The input is text data, and the output is the analysis result. The library divides the text into tokens, identifies the emotion of each token, and categorizes them.

[0218] Step 5:

[0219] The server uses an emotion engine to extract and evaluate emotional information from the analysis results. The input is the analysis results, and the output is the emotion category and its intensity. The emotion engine scores the emotional intensity of the identified tokens and determines whether they are positive, negative, or neutral.

[0220] Step 6:

[0221] The server uses the organization means to statistically classify this emotional information and generate a new data structure. The input is emotional information, and the output is statistically organized data. The organization means aggregates the data by category and generates statistical information.

[0222] Step 7:

[0223] The server generates improvement suggestions using the generation means. The input is statistically organized data, and the output is specific improvement suggestions. The suggestions are generated by applying rule-based models or machine learning models.

[0224] Step 8:

[0225] The server sends the generated improvement suggestions to the terminal. The input is the improvement suggestions, and the output is the transmission of data to the terminal. It performs the operation of transmitting data over the network, delivering information to the user at the appropriate time.

[0226] Step 9:

[0227] The device uses speech synthesis technology to present improvement suggestions to the user as audio. The input is the improvement suggestions sent from the server, and the output is audio feedback. The speech synthesis engine converts the suggestion text into speech and plays it through the device's speaker.
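
Steps 6 and 7 above (statistical organization followed by rule-based generation) might look like the following sketch. The record layout, the rule table `RULES`, and the thresholds are hypothetical. The source names rule-based models as one option but gives no concrete rules.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical output of Step 5: (category, emotion, intensity).
RESULTS = [
    ("air conditioning", "negative", 0.7),
    ("air conditioning", "negative", 0.9),
    ("lighting", "neutral", 0.1),
    ("air conditioning", "positive", 0.2),
]

# Hypothetical rule table mapping a category to an improvement suggestion.
RULES = {
    "air conditioning": "Revise the air conditioning settings.",
    "lighting": "Check the lighting levels.",
}

def organize(results):
    """Step 6: per category, count negative feedback and average its intensity."""
    grouped = defaultdict(list)
    for category, emotion, intensity in results:
        if emotion == "negative":
            grouped[category].append(intensity)
    return {
        category: {"count": len(values), "mean_intensity": mean(values)}
        for category, values in grouped.items()
    }

def generate(stats, min_count=2, min_intensity=0.5):
    """Step 7: emit a suggestion for each category that crosses both thresholds."""
    return [
        RULES[category]
        for category, s in stats.items()
        if category in RULES
        and s["count"] >= min_count
        and s["mean_intensity"] >= min_intensity
    ]

stats = organize(RESULTS)
suggestions = generate(stats)
print(suggestions)
```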

[0228] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0229] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0230] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0231] [Second Embodiment]

[0232] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0233] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0234] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0235] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0236] The microphone 238 accepts instructions from the user 20 by receiving the user's voice. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0237] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0238] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0239] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0240] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0241] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0242] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0243] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0244] In implementing the present invention, the system mainly consists of three elements: a server, a terminal, and a user. The following describes how these elements interact and how the invention is implemented.

[0245] First, the user speaks into a terminal installed within the facility. The terminal uses a microphone to capture the user's voice and converts the captured voice data into text data using speech recognition technology. Through this conversion, the voice information is sent to the server as a character string that can be analyzed.

[0246] Next, the server analyzes the received text data using natural language processing (NLP) techniques. This analysis classifies and organizes the text data into categories. For example, it might be categorized as complaints about the service or suggestions for improvement.

[0247] The analyzed data is stored in a database on the server and statistically organized. Through this organization process, the information in the database is continuously updated, making it possible to understand trends such as which feedback patterns are increasing over time.
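
One way to judge which feedback patterns are increasing over time is to fit a straight line to the periodic counts of each category and look at its slope. The sketch below does this with an ordinary least-squares slope. The weekly counts, the threshold, and the function names are assumptions, and the source does not state which statistical method is used.

```python
def slope(counts):
    """Least-squares slope of counts against their index (0, 1, 2, ...)."""
    n = len(counts)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(counts) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(counts))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical weekly feedback counts per category, oldest week first.
WEEKLY = {
    "wait time": [2, 3, 5, 8],
    "cleanliness": [4, 4, 3, 4],
    "menu": [6, 4, 3, 1],
}

def increasing(weekly, threshold=0.5):
    """Return the categories whose weekly count rises faster than the threshold."""
    return sorted(c for c, counts in weekly.items() if slope(counts) > threshold)

print(increasing(WEEKLY))
```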

[0248] The server generates new improvement suggestions based on the organized data. An algorithm is applied as the generation method, which forms specific suggestions aimed at improving the service. For example, suggestions may include strengthening employee training or adding new menu items.

[0249] Finally, the server sends the generated suggestions to the terminal, which then outputs them to the user as feedback. Output methods include direct voice notification to the user using speech synthesis technology, as well as displaying the suggestions as text on the screen.

[0250] For example, if a user provides feedback such as "The wait time is too long," the device captures the audio, converts it to text, and sends it to the server. The server identifies the complaint about "wait time" as a category, analyzes the relevant data, and generates suggestions for service improvement. These suggestions are then fed back to the user from the device, and may include suggestions such as "Increase the number of staff to reduce wait times."

[0251] Through this process, the present invention enables the efficient collection and analysis of user feedback, and the generation of immediate feedback and improvement suggestions.

[0252] The following describes the processing flow.

[0253] Step 1:

[0254] Users speak into the device to input feedback and requests via voice.

[0255] Step 2:

[0256] The device acquires the user's voice and records it as audio data.

[0257] Step 3:

[0258] The device converts the acquired audio data into text data using speech recognition technology.

[0259] Step 4:

[0260] The terminal sends the converted text data to the server.

[0261] Step 5:

[0262] The server analyzes the text data it receives using natural language processing techniques.

[0263] Step 6:

[0264] The server categorizes the text data.

[0265] Step 7:

[0266] The server updates statistical information using the classified data and saves it to the database.

[0267] Step 8:

[0268] The server generates improvement suggestions based on statistical information.

[0269] Step 9:

[0270] The server sends improvement suggestions it has generated to the terminal.

[0271] Step 10:

[0272] The device outputs improvement suggestions it receives to the user through speech synthesis technology and display.

[0273] (Example 1)

[0274] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0275] Traditional customer feedback systems suffer from inefficient processes, from acquiring and analyzing voice data to generating improvement suggestions, making it difficult to translate feedback into rapid service improvements. In particular, delays in accurately converting voice data to text and statistically organizing analysis results can slow down the implementation of feedback, making it difficult to improve customer satisfaction.

[0276] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0277] In this invention, the server includes equipment means for acquiring voice information, technical means for converting voice information into text information, and processing means for analyzing and classifying the text information. This enables rapid and accurate processing of voice feedback from customers, and prompt generation and provision of service improvement suggestions.

[0278] "Voice information" refers to the language and sounds spoken by the user, and is typically data acquired through input devices such as microphones.

[0279] "Equipment means" refers to hardware devices or equipment for acquiring voice information, specifically input devices such as microphones.

[0280] "Text information" refers to the text data obtained by converting voice information, and is data in a format that can be processed and analyzed by a computer.

[0281] The "technical means" refers to software or algorithms used to convert voice information into text information, and voice recognition technology corresponds to this.

[0282] The "processing means" refers to software or algorithms for analyzing the acquired text information to classify the content, and natural language processing technology corresponds to this.

[0283] The "data means" refers to a system or software for collecting, storing, and organizing the analyzed text information, and a database management system, etc., corresponds to this.

[0284] The "algorithm means" refers to a calculation method or process for generating improvement proposals from the organized data, and machine learning algorithms correspond to this.

[0285] The "display means" refers to a device or method for presenting the generated improvement proposals to the user, and this includes displays and voice synthesis devices.

[0286] The present invention is a voice processing system aimed at improving services through feedback collection and analysis using voice information. Below, the forms for specifically implementing this system are shown.

[0287] The user gives verbal feedback to a terminal installed within the facility. This terminal is equipped with a microphone as a voice collection device for acquiring voice information. The terminal uses voice recognition technology to convert the voice information into text information. Specifically, it utilizes a generally available voice recognition API. This API functions as the technical means for converting the acquired voice into text data.

[0288] The text information converted by the terminal is sent to the server. The server analyzes this text information using natural language processing technology. This analysis process includes algorithmic means for classifying text data according to its content. For example, a Python-based natural language processing library is used to analyze the text and classify it into categories such as "waiting time" or "service."
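
The classification into categories such as "waiting time" or "service" can be illustrated with simple keyword matching. This is a deliberately naive sketch: the keyword table and the `categorize` function are hypothetical, and a natural language processing library would normally replace the substring test used here.

```python
# Hypothetical keyword table: category -> substrings that indicate it.
CATEGORY_KEYWORDS = {
    "waiting time": ("wait", "queue", "line"),
    "service": ("staff", "service", "rude", "attitude"),
}

def categorize(text):
    """Return every category whose keywords appear in the text, else ["other"]."""
    lowered = text.lower()
    matches = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return matches or ["other"]

print(categorize("The wait time is too long"))
print(categorize("The staff were rude while I waited in line"))
```

Because this matches substrings, a word such as "online" would also trigger the "line" keyword; tokenizing the text first would avoid that.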

[0289] The analyzed data is stored in a database management system on the server. Database technologies such as MySQL are used for data organization. This stored data serves as foundational material for revealing trends through statistical analysis.
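
The storage and time-series organization described here can be sketched with SQL. The example uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for a server database such as MySQL. The table name, columns, and sample rows are assumptions.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the server's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedback ("
    "received_on TEXT NOT NULL, category TEXT NOT NULL, body TEXT NOT NULL)"
)

# Hypothetical analyzed feedback: (date, category, original text).
rows = [
    ("2024-12-01", "waiting time", "The wait time is too long"),
    ("2024-12-01", "service", "The staff were friendly"),
    ("2024-12-02", "waiting time", "I waited thirty minutes"),
    ("2024-12-02", "waiting time", "The queue was slow"),
]
conn.executemany("INSERT INTO feedback VALUES (?, ?, ?)", rows)

def daily_counts(connection, category):
    """Return [(date, count), ...] for one category, ordered by date."""
    cursor = connection.execute(
        "SELECT received_on, COUNT(*) FROM feedback "
        "WHERE category = ? GROUP BY received_on ORDER BY received_on",
        (category,),
    )
    return cursor.fetchall()

print(daily_counts(conn, "waiting time"))
```

The parameterized query (`?` placeholders) keeps user-supplied text out of the SQL string, which matters because feedback content originates from end users.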

[0290] Based on the organized data, the server uses a generation algorithm to formulate improvement suggestions. Machine learning frameworks such as scikit-learn are used for this generation process. The generated suggestions are then communicated to the user in a concrete and actionable format.

[0291] The terminal provides the user with improvement suggestions received from the server using a display and speech generation technology. For speech generation, a common speech synthesis technology, such as a text-to-speech API, is used.

[0292] For example, if a user provides voice feedback stating, "The wait time is too long," the device converts this to text and sends it to the server. The server identifies the complaint about the wait time, analyzes it based on relevant data, and generates a suggestion. This suggestion is ultimately provided to the user, either by voice or as text, and may include a message such as, "Increase staffing levels to reduce wait times."

[0293] An example of a prompt to input into the generation AI model is text such as, "How does the system generate improvement suggestions after receiving voice feedback?" This prompt is useful for testing the system's behavior and evaluating the improvement suggestion generation process.

[0294] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0295] Step 1:

[0296] Users provide verbal feedback to terminals installed within the facility. The input is voice information captured via the terminal's microphone. For example, when a user says, "The wait time is long," the voice data is recorded.

[0297] Step 2:

[0298] The device converts the acquired audio information into text information using speech recognition technology. The input is audio data, and the output is text data. A speech recognition API is used in this process; specifically, the audio is converted into the text "The wait time is long."

[0299] Step 3:

[0300] The terminal sends the converted text data to the server. The input text data is securely transferred using the HTTPS protocol. Specifically, the text "The wait time is long" arrives at the server.

[0301] Step 4:

[0302] The server analyzes the received text data using natural language processing techniques. The text data received as input is categorized after analysis. For example, keywords such as "waiting time" are extracted and categorized as complaints. A natural language processing library is used for this process.

[0303] Step 5:

[0304] The server accumulates the analyzed data in a database and statistically organizes it. The analyzed data as input is stored in a database management system, and time-series data is generated as output. Specifically, in order to grasp the trend of feedback, the data related to waiting time is organized.

[0305] Step 6:

[0306] Based on the organized data, the server applies an algorithm for generating improvement suggestions. The input is the organized data, which the algorithm analyzes, and the output is specific suggestions. For example, a suggestion such as "increase the number of staff" is made to shorten the waiting time.

[0307] Step 7:

[0308] The server sends the generated improvement suggestions to the terminal, and the terminal presents them to the user by display or voice. The input is the generated suggestion, and the output is its presentation to the user. Specifically, the suggestion is conveyed to the user by voice or as text, for example "increase the number of staff to shorten the waiting time".
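The server-side part of this flow (Steps 4 to 6) can be sketched as follows. This is a minimal illustration only, assuming simple keyword matching and a fixed rule table; the names `CATEGORY_KEYWORDS`, `SUGGESTIONS`, `categorize`, and `suggest` are hypothetical and do not come from the disclosure, which names no specific algorithm. An in-memory counter stands in for the database.

```python
from collections import Counter
from typing import Optional

# Hypothetical keyword-to-category table; the text only gives "waiting time" as an example.
CATEGORY_KEYWORDS = {
    "wait_time": ("wait time", "waiting time"),
}

# Hypothetical rule table mapping a category to an improvement suggestion.
SUGGESTIONS = {
    "wait_time": "Increase the number of staff to shorten the waiting time.",
}


def categorize(text: str) -> str:
    """Step 4: return the first category whose keyword appears in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


def suggest(counts: Counter) -> Optional[str]:
    """Step 6: suggest a measure for the most frequent category that has a rule."""
    for category, _ in counts.most_common():
        if category in SUGGESTIONS:
            return SUGGESTIONS[category]
    return None


# Step 5: accumulate categorized feedback (in-memory stand-in for the database).
counts = Counter()
for utterance in ("The wait time is too long", "There is a long waiting time"):
    counts[categorize(utterance)] += 1

print(suggest(counts))
```

In a deployed system the keyword table would presumably be replaced by a natural language processing library, as Step 4 states; the sketch only shows how the three steps hand data to one another.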

[0309] (Application Example 1)

[0310] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".

[0311] In a physical store, it is required to quickly propose specific improvement measures to improve the customer experience by efficiently and immediately collecting and analyzing feedback from customers. However, in the conventional method, feedback collection is often done manually, lacking immediacy. Also, the process of effectively analyzing the obtained feedback and generating specific improvement plans is not sufficiently automated. Therefore, it is difficult to quickly introduce improvement measures that lead to an improvement in customer satisfaction.

[0312] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0313] In this invention, the server includes an acquisition means for acquiring audio data, a conversion means for converting audio data into text data, an analysis means for analyzing text data and classifying it into categories related to customer experience, an organization means for accumulating the analysis results and statistically organizing them as time-series data, a generation means for generating new improvement suggestions, and an output means for notifying improvement suggestions using audio output or a display device. This enables efficient collection and analysis of customer feedback in physical stores and allows for the immediate proposal of concrete improvement measures.

[0314] "Acquisition means" refers to devices and technologies for directly collecting voice, which allows user feedback to be incorporated into the system.

[0315] "Conversion means" refers to technology for converting collected audio data into text data, and plays the role of replacing audio information with an analyzable format.

[0316] "Analysis means" refers to technologies and algorithms that analyze the converted text data in detail and classify it into specific categories, with the aim of clearly understanding the user's intentions and emotions.

[0317] "Organization means" refers to the process of systematically storing the analyzed data using time series and other statistical methods, thereby enabling continuous data management and trend analysis.

[0318] "Generation means" refers to algorithms and technologies that devise new service improvement measures and improvement proposals based on the organized data, and that can formulate concrete action plans.

[0319] "Output means" refers to technology for notifying users of generated improvement suggestions via audio or visual means, and plays a role in effectively providing feedback to users.

[0320] Modes for carrying out the invention

[0321] The system implementing this invention mainly consists of three elements: a server, a terminal, and a user. The user can provide voice feedback via a terminal installed in the store. The terminal has a built-in voice input device, which makes it possible to effectively acquire the user's voice data.

[0322] A voice input device (e.g., a standard microphone) is used to acquire the voice data. The terminal sends the acquired voice data to the server. The server converts this voice data into text data using speech recognition software (such as the Google Cloud Speech-to-Text API).

[0323] Text data is analyzed on the server using natural language processing technologies (such as the Google NLP API) and classified into categories related to the customer experience. This makes it easier to organize the specific feedback provided by users. The analyzed data is stored in a database (such as Firebase or AWS DynamoDB) and statistically organized over time.

[0324] Based on the organized data, the server uses Python scripts and the scikit-learn library to generate improvement suggestions. These suggestions are then communicated to the user via text-to-speech software or a display device. This allows stores to respond to customer feedback immediately and make improvements.

[0325] For example, if a user speaks into the terminal saying, "The product placement is confusing," the audio is converted into text and categorized. The organized data is stored as a problem related to "product placement," and the server may generate suggestions such as "assign staff to provide product information" or "improve the store layout."

[0326] An example of a prompt to input into the generative AI model is: "Based on the following feedback data, generate service improvement suggestions for physical stores: 'The product layout is unclear, and store guidance is insufficient.'"
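A prompt of this form could be assembled from stored feedback as in the sketch below. The function name `build_improvement_prompt` and its `setting` parameter are hypothetical, and listing each feedback item on its own line is an assumption made for readability, not something the disclosure specifies.

```python
from typing import Sequence


def build_improvement_prompt(feedback: Sequence[str], setting: str = "physical stores") -> str:
    """Assemble one prompt for the generative AI model from feedback sentences."""
    if not feedback:
        raise ValueError("at least one feedback item is required")
    # Quote each item so the model can tell the instruction from the data.
    quoted = "\n".join(f"- '{item.strip()}'" for item in feedback)
    return (
        "Based on the following feedback data, generate service improvement "
        f"suggestions for {setting}:\n{quoted}"
    )


prompt = build_improvement_prompt(
    ["The product layout is unclear, and store guidance is insufficient."]
)
print(prompt)
```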

[0327] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0328] Step 1:

[0329] The user provides feedback via a voice input device built into the terminal. Voice data is acquired as input. The terminal collects this as digital voice data in preparation for sending it to the server.

[0330] Step 2:

[0331] The server uses speech recognition software (e.g., Google Cloud Speech-to-Text API) to process the received audio data. This process converts the input audio data into text data. The output is a string representing the content of the audio.

[0332] Step 3:

[0333] The server analyzes the converted text data using natural language processing techniques (e.g., Google NLP API). This analysis categorizes the text into specific categories related to the customer experience. The input is text data, and the output is categorized information.

[0334] Step 4:

[0335] The server stores the analyzed data in a database. This data is organized based on time series. The specific operation of this step is to accumulate feedback by category and generate statistical data to understand trends.

[0336] Step 5:

[0337] The server generates new improvement suggestions based on the organized data. By utilizing Python scripts and the scikit-learn library to perform data calculations, it outputs specific improvement proposals for store operations.

[0338] Step 6:

[0339] The server sends the generated improvement suggestions to the terminal. The terminal notifies the user of the suggestions either verbally using speech synthesis technology or visually by displaying them on the screen. The output is the suggested content in either audio or text format. This process allows the user to receive improvement suggestions immediately.
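The time-series organization in Step 4 can be illustrated with a short sketch. It assumes feedback has already been categorized and timestamped; the function names `daily_counts` and `is_increasing`, the category labels, and the trend rule (latest day above the mean of earlier days) are all hypothetical choices for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def daily_counts(records: Iterable[Tuple[date, str]]) -> Dict[str, Dict[date, int]]:
    """Count feedback per category per day, with days in chronological order."""
    series: Dict[str, Dict[date, int]] = defaultdict(lambda: defaultdict(int))
    for day, category in records:
        series[category][day] += 1
    return {category: dict(sorted(days.items())) for category, days in series.items()}


def is_increasing(day_counts: Dict[date, int]) -> bool:
    """Return True if the latest day's count exceeds the mean of the earlier days."""
    values = list(day_counts.values())
    if len(values) < 2:
        return False
    earlier = values[:-1]
    return values[-1] > sum(earlier) / len(earlier)


records = [
    (date(2024, 12, 1), "product_placement"),
    (date(2024, 12, 2), "product_placement"),
    (date(2024, 12, 3), "product_placement"),
    (date(2024, 12, 3), "product_placement"),
    (date(2024, 12, 3), "staff_guidance"),
]
series = daily_counts(records)
for category, days in series.items():
    print(category, is_increasing(days))
```

A production system would keep these counts in the database named in the text rather than in memory; the sketch only shows the shape of the organized data.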

[0340] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0341] This invention is a system that acquires and analyzes voice data and provides consulting and improvement suggestions that take the user's emotions into account. The system mainly consists of three elements: a server, a terminal, and the user.

[0342] First, the user makes a voice input into a terminal within the facility. This voice input includes feedback, requests, and comments. The terminal uses a microphone to acquire the voice data and immediately converts it into text data using speech recognition technology. This process converts the voice information into a format that the system can analyze.

[0343] Next, the terminal sends the converted text data to the server. The server analyzes the text data using natural language processing technology. This analysis incorporates a newly developed emotion engine that recognizes and extracts emotions from the words spoken by the user. The emotion engine can independently identify emotion categories such as positive, negative, and neutral from the text and evaluate their intensity.

[0344] The server uses text data and sentiment information to generate service improvement suggestions. Sentiment information directly influences the priority of the suggestions and the response strategy, and is a crucial element for gaining a rich understanding of the emotions users experience.

[0345] Finally, the generated improvement suggestions are sent from the server to the terminal. The terminal then presents feedback to the user, taking into account the analysis results of the emotion engine, through speech synthesis technology and display. This enables a more flexible and human-like response.

[0346] For example, if a user provides feedback to their device such as, "Recently, the staff's customer service attitude has been frustrating," the device converts this audio into text and sends it to the server. Based on this text, the server recognizes the user's frustration as a negative emotion and suggests "strengthening staff customer service training" as a corresponding improvement measure. This suggestion is then provided to the user as audio feedback.

[0347] In this way, the present invention realizes a system that improves the quality of user feedback and supports more effective service improvement by using emotion-based data.

[0348] The following describes the processing flow.

[0349] Step 1:

[0350] Users input feedback and requests via voice into their devices. These voice inputs may contain the user's emotions.

[0351] Step 2:

[0352] The device uses the microphone to capture the user's voice. The captured voice data is temporarily stored in the internal memory.

[0353] Step 3:

[0354] The device uses speech recognition technology to convert the acquired speech data into text data. A highly accurate speech recognition engine is used during this process to prevent misrecognition of language.

[0355] Step 4:

[0356] The device sends the converted text data to the server. This transmission uses data encryption to protect user privacy.

[0357] Step 5:

[0358] The server receives text data and analyzes its content using natural language processing (NLP) techniques. This analysis includes keyword extraction and contextual understanding.

[0359] Step 6:

[0360] The server uses an emotion engine to analyze the user's emotions contained in the text data. It evaluates the type of emotion (positive, negative, neutral) and its intensity.

[0361] Step 7:

[0362] The server combines text data content with sentiment information to generate improvement suggestions in response to user requests and complaints. These suggestions are prioritized based on the type and intensity of the sentiment.

[0363] Step 8:

[0364] The server sends the generated improvement suggestions to the terminal. Data security is also considered here, and the data is encrypted during transmission.

[0365] Step 9:

[0366] The terminal uses speech synthesis technology to provide voice feedback to the user based on improvement suggestions received from the server. Additionally, the suggestions are displayed as text on the screen as needed.

[0367] Step 10:

[0368] Users can receive feedback and provide further requests and feedback. This process can be repeated.
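The prioritization described in Step 7 might look like the following sketch. The weights in `EMOTION_WEIGHT`, the `Suggestion` class, and the rule that priority is weight times intensity are assumptions for illustration; the disclosure states only that suggestions are prioritized by emotion type and intensity.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical weights: negative feedback is treated as the most urgent.
EMOTION_WEIGHT = {"negative": 2.0, "neutral": 1.0, "positive": 0.5}


@dataclass(frozen=True)
class Suggestion:
    text: str
    emotion: str      # "positive", "negative", or "neutral"
    intensity: float  # assumed to lie in [0.0, 1.0]

    @property
    def priority(self) -> float:
        """Combine emotion type and intensity into one sortable score."""
        return EMOTION_WEIGHT[self.emotion] * self.intensity


def prioritize(suggestions: Iterable[Suggestion]) -> List[Suggestion]:
    """Return suggestions ordered from highest to lowest priority."""
    return sorted(suggestions, key=lambda s: s.priority, reverse=True)


queue = prioritize([
    Suggestion("Keep the current greeting style", "positive", 0.9),
    Suggestion("Strengthen staff customer service training", "negative", 0.8),
    Suggestion("Review signage wording", "neutral", 0.4),
])
print(queue[0].text)
```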

[0369] (Example 2)

[0370] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0371] Conventional voice data processing systems lack the ability to effectively analyze user feedback and propose concrete improvement measures, making it difficult to rapidly improve services and enhance user satisfaction. Furthermore, they struggle to provide feedback that takes emotions into account, and are insufficient to meet users' essential needs.

[0372] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0373] In this invention, the server includes receiving means for acquiring audio data, conversion means for converting audio data into text data, and processing means for analyzing text data and extracting emotions. This makes it possible to generate highly accurate improvement suggestions based on user feedback.

[0374] "Receiving means" refers to the devices or functions used to acquire audio data.

[0375] "Conversion means" refers to the process or technology used to convert acquired audio data into text data.

[0376] "Processing means" refers to functions and algorithms for analyzing text data and extracting emotions.

[0377] "Creation means" refers to the methods and functions for creating proposals based on the analysis results.

[0378] "Display means" refers to output functions or devices for presenting the generated proposals to the user.

[0379] This invention is a system that analyzes voice data and generates specific service improvement suggestions based on user feedback. This system primarily consists of a server, a terminal, and the user.

[0380] First, the user provides voice input to the device. For example, they can provide voice feedback such as, "The wait time at this store is long." The device uses a microphone that functions as an acoustic sensor to capture voice data.

[0381] Next, the device uses speech recognition technology to convert this speech data into text data. This process utilizes speech recognition software, and specifically, commonly used speech recognition APIs are suitable.

[0382] Next, the device sends this converted text data to the server. The server uses a natural language processing engine to analyze the text data in detail and extracts emotional data from the user feedback through an emotion engine. This analysis makes it possible to classify the feedback into positive, negative, or neutral emotions.

[0383] Based on the analysis results, the server creates specific improvement suggestions that reflect the sentiment data. For example, in response to negative feedback, suggestions such as increasing staff or managing shifts more efficiently may be generated.

[0384] Ultimately, the server sends the generated improvement suggestions to the terminal, which then uses speech synthesis technology to provide feedback to the user. For example, the API of a speech technology provider can be used for this purpose. In this process, the suggestions can also be presented visually using a display.

[0385] An example of a prompt message might be, "Please provide feedback. Based on your input, we will make suggestions for improving our service." This system allows for the effective use of user feedback based on emotions, enabling service improvements that enhance the user experience.

[0386] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0387] Step 1:

[0388] The user provides voice feedback to the device. The feedback is acquired as a digital audio signal via an acoustic sensor. The device collects this digital audio signal and prepares it for the next processing step.

[0389] Step 2:

[0390] The terminal passes the acquired audio data to speech recognition software, which converts it from audio to text data. The speech recognition technology converts the audio signal into a string of characters. The input is audio data, and the output is text data. This makes the audio information easier to analyze.

[0391] Step 3:

[0392] The terminal sends the converted text data to the server. During this process, the text data is formatted appropriately and transferred with minimal delay. The input is text data, and the output is a transfer to the server.

[0393] Step 4:

[0394] The server analyzes the received text data using a natural language processing engine. An emotion engine extracts the user's emotions from the text. The input is text data, and the output is analyzed emotion information. Emotions are classified as positive, negative, or neutral.

[0395] Step 5:

[0396] The server generates improvement suggestions based on the analysis results. Specific improvement measures are devised based on the sentiment of user feedback. The input is sentiment information, and the output is suggestion information. This allows feedback to contribute to service improvement.

[0397] Step 6:

[0398] The server sends the created improvement suggestions to the terminal. The suggestions are adjusted based on the priority and content of the feedback. The input is the suggestion information, and the output is the transfer to the terminal.

[0399] Step 7:

[0400] The terminal outputs suggestions to the user using speech synthesis technology. In addition to audio presentation, it can also display the suggestions as text using the display. Input is suggestion information from the server, and output is feedback for the user. This allows the user to understand how their feedback contributes to improvements.
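Steps 4 and 5 above can be sketched with a very small word-list classifier. This is an assumption-laden illustration: the disclosure does not say how the emotion engine works, so the word lists `POSITIVE` and `NEGATIVE`, the `RESPONSES` table, and both function names are hypothetical stand-ins for a trained model.

```python
import re

# Hypothetical word lists; a real emotion engine would likely use a trained model.
POSITIVE = {"good", "great", "friendly", "fast", "clean"}
NEGATIVE = {"long", "slow", "rude", "frustrating", "cold", "confusing"}

# Hypothetical suggestions keyed by emotion label.
RESPONSES = {
    "negative": "Consider increasing staff or managing shifts more efficiently.",
    "neutral": "Record the comment for later review.",
    "positive": "Keep the current practice and share it with other staff.",
}


def classify_emotion(text: str) -> str:
    """Step 4: label the text as positive, negative, or neutral."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def create_suggestion(text: str) -> str:
    """Step 5: pick a suggestion that reflects the detected emotion."""
    return RESPONSES[classify_emotion(text)]


print(create_suggestion("The wait time at this store is long."))
```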

[0401] (Application Example 2)

[0402] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0403] Data centers are operated by a large staff, and real-time and rapid feedback analysis is required to achieve efficient operations. However, conventional systems have difficulty appropriately analyzing the emotions contained in feedback and immediately proposing improvements, which has posed challenges to improving staff comfort and operational efficiency.

[0404] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0405] In this invention, the server includes an acquisition means for acquiring voice data, a conversion means for converting voice data into text data, and an analysis means for analyzing the text data and categorizing it, including emotions. This makes it possible to instantly analyze the emotions of staff voice feedback in a data center and provide improvement suggestions in real time.

[0406] "Voice data" refers to digital information, including the user's spoken words and sounds, acquired by acoustic sensors.

[0407] "Acquisition means" refers to devices or mechanisms for collecting audio data, and typically includes microphones and acoustic sensors.

[0408] "Conversion means" refers to algorithms or software processes for converting audio data into text data, utilizing speech recognition technology.

[0409] "Analysis means" refers to techniques and methods for analyzing text data and determining emotional categories based on individual words and phrases.

[0410] "Organization means" refers to functions that attach the emotional information obtained by the analysis means and statistically classify, visualize, or organize the data.

[0411] "Generation means" refers to a mechanism that creates effective improvement suggestions based on the organized data, taking emotional information into consideration.

[0412] "Output means" refers to methods or devices for presenting improvement suggestions to the user in audio or visual form, utilizing speech synthesis or displays.

[0413] This invention realizes a system that analyzes voice data to understand emotions and generates improvement suggestions. The specific implementation of the system is described below.

[0414] The server receives audio data transmitted from each terminal. This audio data is captured by the terminal's acoustic sensors and converted into text data on the spot using speech recognition technology. The main software used here is a speech recognition API (such as Google Cloud Speech-to-Text).

[0415] When text data is sent to the server, the server analyzes the text using a natural language processing library (such as spaCy). At this stage, the emotion engine classifies the emotion in the text as positive, negative, or neutral. As a result, the type and intensity of the emotion are evaluated.

[0416] Next, the server uses the organization means to statistically classify the analysis results and generates data with added emotional information based on this classification. Based on this data, the generation means creates improvement suggestions. The server then sends the generated improvement suggestions back to the terminal, which presents the suggestions to the user using speech synthesis technology (such as Amazon Polly) or a display.

[0417] For example, if a data center staff member gives feedback saying, "The air conditioning in the facility has been too strong lately, making it too cold," the emotion engine will recognize this voice as a "negative emotion." Based on this information, the server will suggest "revising the air conditioning settings" and notify the staff member of this suggestion in either voice or visual form.

[0418] An example of a prompt for a generative AI model would be: "The user feels that the air conditioning in the facility is too strong and it's too cold. What improvements would you suggest?" This allows the system to analyze emotions in real time and provide appropriate improvement suggestions.

[0419] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0420] Step 1:

[0421] The device acquires user voice data using an acoustic sensor. The input here is the user's voice, and the output is digitized voice data. The acoustic sensor picks up ambient sounds and records them as digital signals within the device.

[0422] Step 2:

[0423] The device uses a speech recognition API to convert audio data into text data. The input is audio data, and the output is the converted text data. The speech recognition API analyzes the audio waveform and converts each phoneme into its corresponding character to create text.

[0424] Step 3:

[0425] The terminal sends the converted text data to the server. The input is text data, and the output is the transmission of data to the server. Network protocols are used to securely and efficiently transfer the text data to the server.

[0426] Step 4:

[0427] The server applies a natural language processing library to analyze the received text data. The input is text data, and the output is the analysis result. The library divides the text into tokens, identifies the emotion of each token, and categorizes them.

[0428] Step 5:

[0429] The server uses an emotion engine to extract and evaluate emotional information from the analysis results. The input is the analysis results, and the output is the emotion category and its intensity. The emotion engine scores the emotional intensity of the identified tokens and determines whether they are positive, negative, or neutral.

[0430] Step 6:

[0431] The server uses the organization means to statistically classify this emotional information and generate a new data structure. The input is emotional information, and the output is statistically organized data. The organization means aggregates the data by category and generates statistical information.

[0432] Step 7:

[0433] The server generates improvement suggestions using the generation means. The input is statistically organized data, and the output is specific improvement suggestions. The suggestions are generated by applying rule-based models or machine learning models.

[0434] Step 8:

[0435] The server sends the generated improvement suggestions to the terminal. The input is the improvement suggestions, and the output is the transmission of data to the terminal. The data is transmitted over the network so that the information reaches the user at the appropriate time.

[0436] Step 9:

[0437] The terminal uses speech synthesis technology to present improvement suggestions to the user as audio. The input is the improvement suggestions sent from the server, and the output is audio feedback. The speech synthesis engine converts the text into speech and conveys the information through the terminal's speaker.
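Steps 4 to 7 of this flow can be sketched as token scoring, aggregation, and a rule-based generator. Everything named here is an assumption for illustration: the per-token scores in `TOKEN_SCORES`, the topic labels (assumed to be assigned upstream), the `RULES` table, and the `min_intensity` threshold. The rule-based model is one of the two options the text mentions; a machine learning model could take its place.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical per-token emotion scores in [-1, 1]; unknown tokens score 0.
TOKEN_SCORES = {"cold": -0.5, "strong": -0.25, "comfortable": 0.75, "quiet": 0.5}

# Hypothetical rule table: topic -> improvement suggestion.
RULES = {"air_conditioning": "Revise the air conditioning settings."}


def evaluate(text: str) -> Tuple[str, float]:
    """Steps 4-5: tokenize, score each token, and derive a label and an intensity."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = sum(TOKEN_SCORES.get(token, 0.0) for token in tokens)
    if total > 0:
        label = "positive"
    elif total < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, abs(total)


def aggregate(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Step 6: per topic, count negative feedback and average its intensity."""
    intensities: Dict[str, List[float]] = defaultdict(list)
    for topic, text in records:
        label, intensity = evaluate(text)
        if label == "negative":
            intensities[topic].append(intensity)
    return {
        topic: {"count": len(values), "mean_intensity": mean(values)}
        for topic, values in intensities.items()
    }


def generate(stats: Dict[str, Dict[str, float]], min_intensity: float = 0.5) -> List[str]:
    """Step 7: rule-based generation for topics with strong negative feedback."""
    return [
        RULES[topic]
        for topic, s in stats.items()
        if topic in RULES and s["mean_intensity"] >= min_intensity
    ]


records = [
    ("air_conditioning", "The air conditioning has been too strong lately, making it too cold"),
    ("air_conditioning", "The server room is cold"),
    ("noise", "The break room is quiet"),
]
stats = aggregate(records)
print(generate(stats))
```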

[0438] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0439] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0440] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0441] [Third Embodiment]

[0442] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0443] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0444] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0445] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0446] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0447] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0448] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0449] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0450] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0451] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0452] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0453] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0454] In implementing the present invention, the system mainly consists of three elements: a server, a terminal, and a user. The following describes how these elements interact and how the invention is implemented.

[0455] First, the user speaks into a terminal installed within the facility. The terminal uses a microphone to capture the user's voice and converts the captured voice data into text data using speech recognition technology. Through this conversion, the voice information is sent to the server as an analyzable string.

[0456] Next, the server analyzes the received text data using natural language processing (NLP) techniques. This analysis classifies and organizes the text data into categories. For example, the text might be categorized as a complaint about the service or a suggestion for improvement.

[0457] The analyzed data is stored in a database on the server and statistically organized. Through this organization process, the information in the database is continuously updated, making it possible to understand trends such as which feedback patterns are increasing over time.

[0458] The server generates new improvement suggestions based on the organized data. An algorithm is applied as the generation method, which forms specific suggestions aimed at improving the service. For example, suggestions may include strengthening employee training or adding new menu items.

[0459] Finally, the server sends the generated suggestions to the terminal, which then outputs them to the user as feedback. Output methods include direct voice notification to the user using speech synthesis technology, as well as displaying the suggestions as text on the screen.

[0460] For example, if a user provides feedback such as "The wait time is too long," the terminal captures the audio, converts it to text, and sends it to the server. The server identifies "wait time" as the category of the complaint, analyzes the relevant data, and generates suggestions for service improvement. The terminal then presents these suggestions to the user; an example is "Increase the number of staff to reduce wait times."

[0461] Through this process, the present invention enables the efficient collection and analysis of user feedback, and the generation of immediate feedback and improvement suggestions.

[0462] The following describes the processing flow.

[0463] Step 1:

[0464] The user speaks into the terminal to input feedback and requests by voice.

[0465] Step 2:

[0466] The terminal acquires the user's voice and records it as audio data.

[0467] Step 3:

[0468] The terminal converts the acquired audio data into text data using speech recognition technology.

[0469] Step 4:

[0470] The terminal sends the converted text data to the server.

[0471] Step 5:

[0472] The server analyzes the text data it receives using natural language processing techniques.

[0473] Step 6:

[0474] The server categorizes the text data.

[0475] Step 7:

[0476] The server updates statistical information using the classified data and saves it to the database.

[0477] Step 8:

[0478] The server generates improvement suggestions based on statistical information.

[0479] Step 9:

[0480] The server sends improvement suggestions it has generated to the terminal.

[0481] Step 10:

[0482] The terminal outputs the improvement suggestions it receives to the user through speech synthesis technology and the display.
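
The ten steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: speech recognition is replaced by a stand-in that decodes UTF-8 bytes, the network transfer is a direct method call, and the keyword table, suggestion table, and all names (`Server`, `recognize_speech`, `terminal_round_trip`) are hypothetical.

```python
from collections import Counter

# Hypothetical keyword and suggestion tables (not from the disclosure).
CATEGORY_KEYWORDS = {"wait time": ["wait", "queue"], "menu": ["menu", "dish"]}
SUGGESTIONS = {
    "wait time": "Increase the number of staff to reduce wait times.",
    "menu": "Consider adding new menu items.",
}

def recognize_speech(audio: bytes) -> str:
    """Steps 2-3 stand-in: a real terminal would call a speech recognizer."""
    return audio.decode("utf-8")  # the demo "audio" is simply UTF-8 text

class Server:
    def __init__(self):
        self.stats = Counter()  # Step 7: per-category statistics

    @staticmethod
    def categorize(text: str) -> str:
        """Steps 5-6: assign the first category whose keyword appears."""
        lowered = text.lower()
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                return category
        return "other"

    def handle(self, text: str) -> str:
        category = self.categorize(text)
        self.stats[category] += 1  # Step 7
        # Step 8: look up a suggestion for the category
        return SUGGESTIONS.get(category, "Thank you for your feedback.")

def terminal_round_trip(server: Server, audio: bytes) -> str:
    text = recognize_speech(audio)      # Steps 2-3
    suggestion = server.handle(text)    # Steps 4 and 9 (transfer omitted)
    return suggestion                   # Step 10 would speak or display this

server = Server()
reply = terminal_round_trip(server, b"The wait time is too long")
```

Keeping the statistics inside the server object mirrors the split of roles in the text: the terminal only captures, converts, and presents, while classification and aggregation stay on the server.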

[0483] (Example 1)

[0484] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0485] Traditional customer feedback systems suffer from inefficient processes, from acquiring and analyzing voice data to generating improvement suggestions, making it difficult to translate feedback into rapid service improvements. In particular, delays in accurately converting voice data to text and statistically organizing analysis results can slow down the implementation of feedback, making it difficult to improve customer satisfaction.

[0486] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0487] In this invention, the server includes equipment means for acquiring voice information, technical means for converting voice information into text information, and processing means for analyzing and classifying the text information. This enables rapid and accurate processing of voice feedback from customers, and prompt generation and provision of service improvement suggestions.

[0488] "Voice information" refers to the language and sounds spoken by the user, and is typically data acquired through input devices such as microphones.

[0489] "Equipment means" refers to hardware devices or equipment for acquiring voice information, specifically input devices such as microphones.

[0490] "Text information" refers to text data obtained by converting voice information, and is data in a format that can be processed and analyzed by a computer.

[0491] "Technical means" refers to software or algorithms used to convert voice information into text information; speech recognition technology is an example.

[0492] "Processing means" refers to software or algorithms for analyzing the acquired text information and classifying its content; natural language processing technology falls under this category.

[0493] "Data means" refers to a system or software for collecting, storing, and organizing the analyzed text information; database management systems are an example.

[0494] "Algorithmic means" refers to computational methods or processes for generating improvement suggestions from organized data, and machine learning algorithms fall under this category.

[0495] "Display means" refers to a device or method for presenting generated improvement suggestions to the user, and this includes displays and speech synthesis devices.

[0496] This invention is a voice processing system aimed at improving services through feedback collection and analysis using voice information. The following describes specific implementations of this system.

[0497] Users provide verbal feedback to a terminal installed within the facility. This terminal is equipped with a microphone, which acts as a voice acquisition device to capture voice information. The terminal uses speech recognition technology to convert the voice information into text information. Specifically, it utilizes a commonly available speech recognition API. This API functions as a technical means for converting the acquired voice into text data.

[0498] The text information converted by the terminal is sent to the server. The server analyzes this text information using natural language processing technology. This analysis process includes algorithmic means for classifying text data according to its content. For example, a Python-based natural language processing library is used to analyze the text and classify it into categories such as "waiting time" or "service."

[0499] The analyzed data is stored in a database management system on the server. Database technologies such as MySQL are used for data organization. This stored data serves as foundational material for revealing trends through statistical analysis.
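
The storage and organization described above might look like the following sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a MySQL server so that the example is self-contained; the table name, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the server-side database (demo assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedback ("
    " id INTEGER PRIMARY KEY,"
    " received_on TEXT NOT NULL,"
    " category TEXT NOT NULL,"
    " body TEXT NOT NULL)"
)

# Hypothetical analyzed feedback: (ISO date, category, original text)
rows = [
    ("2024-12-01", "waiting time", "The wait time is too long"),
    ("2024-12-01", "service", "The staff were very helpful"),
    ("2024-12-02", "waiting time", "I waited thirty minutes"),
    ("2024-12-02", "waiting time", "The line barely moved"),
]
conn.executemany(
    "INSERT INTO feedback (received_on, category, body) VALUES (?, ?, ?)", rows
)
conn.commit()

# Daily count per category: a simple time series for spotting trends.
daily = conn.execute(
    "SELECT received_on, category, COUNT(*) FROM feedback"
    " GROUP BY received_on, category"
    " ORDER BY received_on, category"
).fetchall()
```

The grouped query returns one row per day and category, which is the kind of foundational material a later statistical analysis could consume.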

[0500] Based on the organized data, the server uses a generation algorithm to formulate improvement suggestions. Machine learning frameworks such as scikit-learn are used for this generation process. The generated suggestions are then communicated to the user in a concrete and actionable format.

[0501] The terminal provides the user with improvement suggestions received from the server using a display and speech generation technology. For speech generation, a common speech synthesis technology, such as the Text-to-Speech API, is used.

[0502] For example, if a user provides voice feedback stating, "The wait time is too long," the terminal converts this to text and sends it to the server. The server identifies the complaint about the wait time, analyzes it based on relevant data, and generates a suggestion. This suggestion is ultimately provided to the user, either by voice or as text, and may include a message such as, "Increase staffing levels to reduce wait times."

[0503] An example of a prompt to input into the generative AI model is text such as, "How does the system generate improvement suggestions after receiving voice feedback?" This prompt is useful for testing the system's behavior and evaluating the improvement suggestion generation process.

[0504] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0505] Step 1:

[0506] Users provide verbal feedback to terminals installed within the facility. The input is voice information captured via the terminal's microphone. For example, when a user says, "The wait time is long," the voice data is recorded.

[0507] Step 2:

[0508] The terminal converts the acquired voice information into text information using speech recognition technology. Here, audio data is the input and text data is the output. A speech recognition API is used in this process; specifically, the audio is converted into the text "The wait time is long."

[0509] Step 3:

[0510] The terminal sends the converted text data to the server. The input text data is transferred securely using the HTTPS protocol. Specifically, the text "The wait time is long" arrives at the server.

[0511] Step 4:

[0512] The server analyzes the received text data using natural language processing techniques. The text data received as input is categorized after analysis. For example, keywords such as "waiting time" are extracted and categorized as complaints. A natural language processing library is used for this process.

[0513] Step 5:

[0514] The server stores the analyzed data in a database and organizes it statistically. The analyzed data, as input, is stored in the database management system, and time-series data is generated as output. Specifically, data related to waiting times is organized to understand feedback trends.

[0515] Step 6:

[0516] The server applies an algorithm to generate improvement suggestions based on the organized data. The organized data, as input, is analyzed through the algorithm, and specific suggestions are generated as output. For example, a suggestion might be made to "increase the number of staff" to reduce waiting times.

[0517] Step 7:

[0518] The server sends the generated improvement suggestions to the terminal, which then presents them to the user either visually or audibly. The input is the generated suggestions, and the output is their presentation to the user. Specifically, a suggestion is presented by voice or as text, for example, "We will increase staff numbers to reduce waiting times."
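
The HTTPS transfer in Step 3 of this flow could be prepared as in the sketch below, which builds a JSON POST request with Python's standard `urllib.request` module without sending it. The endpoint URL, the JSON field names, and the terminal identifier are hypothetical; the disclosure specifies only that HTTPS is used.

```python
import json
import urllib.request

def build_feedback_request(text: str, terminal_id: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST carrying the recognized text."""
    payload = json.dumps({"terminal_id": terminal_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://server.example/feedback",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )

request = build_feedback_request("The wait time is long", "terminal-001")
# Sending would be: urllib.request.urlopen(request, timeout=5)
```

Separating request construction from sending keeps the payload format easy to inspect and test without a live server.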

[0519] (Application Example 1)

[0520] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0521] In physical stores, there is a need to efficiently and immediately collect and analyze customer feedback to quickly propose concrete improvement measures that enhance the customer experience. However, traditional methods often involve manual feedback collection, lacking immediacy. Furthermore, the process of effectively analyzing the collected feedback and generating specific improvement proposals is not sufficiently automated. As a result, it is difficult to quickly implement improvement measures that lead to increased customer satisfaction.

[0522] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0523] In this invention, the server includes an acquisition means for acquiring audio data, a conversion means for converting audio data into text data, an analysis means for analyzing text data and classifying it into categories related to customer experience, an organization means for accumulating the analysis results and statistically organizing them as time-series data, a generation means for generating new improvement suggestions, and an output means for notifying improvement suggestions using audio output or a display device. This enables efficient collection and analysis of customer feedback in physical stores and allows for the immediate proposal of concrete improvement measures.

[0524] "Acquisition means" refers to devices and technologies for directly collecting voice, which allows user feedback to be incorporated into the system.

[0525] "Conversion means" refers to technology for converting collected audio data into text data, and plays the role of replacing audio information with an analyzable format.

[0526] "Analysis means" refers to technologies and algorithms that analyze converted text data in detail and classify it into specific categories, with the aim of clearly understanding the user's intentions and emotions.

[0527] "Organization means" refers to the process of systematically storing analyzed data using time series and other statistical methods, thereby enabling continuous data management and trend analysis.

[0528] "Generation means" refers to algorithms and technologies for devising new service improvement measures and improvement proposals based on organized data, and is capable of formulating concrete action plans.

[0529] "Output means" refers to technology for notifying users of generated improvement suggestions via audio or visual means, and plays a role in effectively providing feedback to users.

[0530] Modes for carrying out the invention

[0531] The system implementing this invention mainly consists of three elements: a server, a terminal, and a user. The user can provide voice feedback via a terminal installed in the store. The terminal has a built-in voice input device, which makes it possible to effectively acquire the user's voice data.

[0532] A voice input device (e.g., a standard microphone) is used to acquire the voice data. The terminal sends the acquired voice data to the server. The server converts this voice data into text data using speech recognition software (such as the Google Cloud Speech-to-Text API).

[0533] Text data is analyzed on the server using natural language processing technologies (such as the Google NLP API) and classified into categories related to the customer experience. This makes it easier to organize the specific feedback provided by users. The analyzed data is stored in a database (such as Firebase or AWS DynamoDB) and statistically organized over time.

[0534] Based on the organized data, the server uses Python scripts and the scikit-learn library to generate improvement suggestions. These suggestions are then communicated to the user via text-to-speech software or a display device. This allows stores to respond to customer feedback immediately and make improvements.

[0535] For example, if a user speaks into the terminal saying, "The product placement is confusing," the audio is converted into text and classified. The organized data is stored as a problem related to "product placement," and the server may generate suggestions such as "assign staff to provide product information" or "improve the store layout."

[0536] An example of a prompt to input into the generative AI model is: "Based on the following feedback data, generate service improvement suggestions for physical stores: 'The product layout is unclear, and store guidance is insufficient.'"
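
A prompt of this form could be assembled programmatically, as in the following sketch. The function name `build_improvement_prompt` and its `setting` parameter are assumptions introduced for illustration; only the wording of the prompt follows the example in the text.

```python
def build_improvement_prompt(feedback_items, setting="physical stores"):
    """Join feedback items into one prompt for a generative AI model."""
    joined = " ".join(item.strip() for item in feedback_items if item.strip())
    return (
        "Based on the following feedback data, generate service improvement "
        f"suggestions for {setting}: '{joined}'"
    )

prompt = build_improvement_prompt(
    ["The product layout is unclear, and store guidance is insufficient."]
)
```

Passing several feedback items would place them in one quoted block, so the model sees the accumulated feedback rather than a single remark.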

[0537] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0538] Step 1:

[0539] The user provides feedback via a voice input device built into the terminal. Voice data is acquired as input. The terminal collects this as digital voice data in preparation for sending it to the server.

[0540] Step 2:

[0541] The server uses speech recognition software (e.g., Google Cloud Speech-to-Text API) to convert the received audio data into text data. The input is audio data, and the output is a string representing the content of the audio.

[0542] Step 3:

[0543] The server analyzes the converted text data using natural language processing techniques (e.g., Google NLP API). This analysis classifies the text into specific categories related to the customer experience. The input is text data, and the output is categorized information.

[0544] Step 4:

[0545] The server stores the analyzed data in a database. This data is organized based on time series. The specific operation of this step is to accumulate feedback by category and generate statistical data to understand trends.

[0546] Step 5:

[0547] The server generates new improvement suggestions based on the organized data. By utilizing Python scripts and the scikit-learn library to perform data calculations, it outputs specific improvement proposals for store operations.

[0548] Step 6:

[0549] The server sends the generated improvement suggestions to the terminal. The terminal notifies the user of the suggestions either verbally using speech synthesis technology or visually by displaying them on the screen. The output is the suggested content in either audio or text format. This process allows the user to receive improvement suggestions immediately.

[0550] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0551] This invention is a system that acquires and analyzes voice data and provides consulting and improvement suggestions that include the user's emotions. The system mainly consists of three elements: a server, a terminal, and the user.

[0552] First, the user makes a voice input into a terminal within the facility. This voice input includes feedback, requests, and comments. The terminal uses a microphone to acquire the voice data and immediately converts it into text data using speech recognition technology. This process converts the voice information into a format that the system can analyze.

[0553] Next, the terminal sends the converted text data to the server. The server analyzes the text data using natural language processing technology. This analysis incorporates a newly developed emotion engine that recognizes and extracts emotions from the words spoken by the user. The emotion engine can independently identify emotion categories such as positive, negative, and neutral from the text and evaluate their intensity.
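
One very simple way to obtain an emotion category and an intensity from text is a word-weight lexicon, sketched below. This is not the emotion engine of the disclosure, whose internals are not described; the lexicon entries, the weights, and the function name `estimate_emotion` are assumptions chosen only to illustrate the input and output.

```python
import re

# Hypothetical lexicon: word -> signed weight (negative means negative emotion).
LEXICON = {
    "frustrating": -2.0,
    "long": -1.0,
    "cold": -1.0,
    "helpful": 1.5,
    "great": 2.0,
}

def estimate_emotion(text: str):
    """Return (category, intensity) for a piece of feedback text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0.0) for token in tokens)
    if score > 0:
        category = "positive"
    elif score < 0:
        category = "negative"
    else:
        category = "neutral"
    return category, abs(score)

result = estimate_emotion(
    "Recently, the staff's customer service attitude has been frustrating"
)
```

A production emotion engine would more plausibly use a trained model, but the same two outputs, a category and an intensity, are what the later steps consume.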

[0554] The server uses text data and sentiment information to generate service improvement suggestions. Sentiment information directly influences the priority of the suggestions and the response strategy, and is a crucial element for gaining a rich understanding of the emotions users experience.
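
How sentiment information might influence the priority of suggestions can be sketched as a sort over emotion type and intensity. The weighting below, in which negative feedback is handled first, is an assumed policy for illustration, as are the `Suggestion` class and the `prioritize` function.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    text: str
    emotion: str      # "positive", "negative", or "neutral"
    intensity: float  # non-negative strength reported by the emotion engine

# Assumed weighting: negative feedback first, then neutral, then positive.
EMOTION_WEIGHT = {"negative": 2, "neutral": 1, "positive": 0}

def prioritize(suggestions):
    """Sort by emotion weight, then by intensity, highest priority first."""
    return sorted(
        suggestions,
        key=lambda s: (EMOTION_WEIGHT[s.emotion], s.intensity),
        reverse=True,
    )

queue = prioritize([
    Suggestion("Keep the current greeting style", "positive", 1.5),
    Suggestion("Strengthen staff customer service training", "negative", 2.0),
    Suggestion("Review the air conditioning settings", "negative", 1.0),
])
```

With this ordering, the strongly negative item comes first and the positive item last, regardless of the order in which the feedback arrived.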

[0555] Finally, the generated improvement suggestions are sent from the server to the terminal. The terminal then presents feedback to the user, taking into account the analysis results of the emotion engine, through speech synthesis technology and display. This enables a more flexible and human-like response.

[0556] For example, if a user provides feedback to the terminal such as, "Recently, the staff's customer service attitude has been frustrating," the terminal converts this audio into text and sends it to the server. Based on this text, the server recognizes the user's frustration as a negative emotion and suggests "strengthening staff customer service training" as a corresponding improvement measure. This suggestion is then provided to the user as audio feedback.

[0557] In this way, the present invention realizes a system that improves the quality of user feedback and supports more effective service improvement by using emotion-based data.

[0558] The following describes the processing flow.

[0559] Step 1:

[0560] The user inputs feedback and requests into the terminal by voice. These voice inputs may convey the user's emotions.

[0561] Step 2:

[0562] The terminal uses the microphone to capture the user's voice. The captured voice data is temporarily stored in the internal memory.

[0563] Step 3:

[0564] The terminal uses speech recognition technology to convert the acquired voice data into text data. A highly accurate speech recognition engine is used during this process to reduce misrecognition.

[0565] Step 4:

[0566] The terminal sends the converted text data to the server. This transmission uses data encryption to protect user privacy.

[0567] Step 5:

[0568] The server receives text data and analyzes its content using natural language processing (NLP) techniques. This analysis includes keyword extraction and contextual understanding.

[0569] Step 6:

[0570] The server uses an emotion engine to analyze the user's emotions contained in the text data. It evaluates the type of emotion (positive, negative, neutral) and its intensity.

[0571] Step 7:

[0572] The server combines text data content with sentiment information to generate improvement suggestions in response to user requests and complaints. These suggestions are prioritized based on the type and intensity of the sentiment.

[0573] Step 8:

[0574] The server generates improvement suggestions and sends them to the terminal. Data security is also considered here, and encryption is performed during transmission.

[0575] Step 9:

[0576] The terminal uses speech synthesis technology to provide voice feedback to the user based on improvement suggestions received from the server. Additionally, the suggestions are displayed as text on the screen as needed.

[0577] Step 10:

[0578] Users can receive feedback and provide further requests and feedback. This process can be repeated.

[0579] (Example 2)

[0580] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0581] Conventional voice data processing systems lack the ability to effectively analyze user feedback and propose concrete improvement measures, making it difficult to rapidly improve services and enhance user satisfaction. Furthermore, they struggle to provide feedback that takes emotions into account, and are insufficient to meet users' essential needs.

[0582] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0583] In this invention, the server includes receiving means for acquiring audio data, conversion means for converting audio data into text data, and processing means for analyzing text data and extracting emotions. This makes it possible to generate highly accurate improvement suggestions based on user feedback.

[0584] "Receiving means" refers to the devices or functions used to acquire audio data.

[0585] "Conversion means" refers to the process or technology used to convert acquired audio data into text data.

[0586] "Processing means" refers to functions and algorithms for analyzing text data and extracting emotions.

[0587] "Creation means" refers to the methods and functions for creating proposals based on the analysis results.

[0588] "Display means" refers to output functions or devices for presenting the generated proposals to the user.

[0589] This invention is a system that analyzes voice data and generates specific service improvement suggestions based on user feedback. This system primarily consists of a server, a terminal, and the user.

[0590] First, the user provides voice input to the terminal. For example, the user can provide voice feedback such as, "The wait time at this store is long." The terminal uses a microphone that functions as an acoustic sensor to capture voice data.

[0591] Next, the terminal uses speech recognition technology to convert this voice data into text data. This process utilizes speech recognition software; specifically, commonly used speech recognition APIs are suitable.

[0592] Next, the terminal sends the converted text data to the server. The server uses a natural language processing engine to analyze the text data in detail and extracts emotional data from the user feedback through an emotion engine. This analysis makes it possible to classify the feedback as positive, negative, or neutral.

[0593] Based on the analysis results, the server creates specific improvement suggestions that reflect the sentiment data. For example, in response to negative feedback, suggestions such as increasing staff or managing shifts more efficiently may be generated.

[0594] Finally, the server sends the generated improvement suggestions to the terminal, which then uses speech synthesis technology to provide feedback to the user. For example, the API of a speech technology provider can be used. In this process, the suggestions can also be presented visually using a display.

[0595] An example of a prompt message might be, "Please provide feedback. Based on your input, we will make suggestions for improving our service." This system allows for the effective use of user feedback based on emotions, enabling service improvements that enhance the user experience.

[0596] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0597] Step 1:

[0598] The user provides voice feedback to the terminal. The feedback is acquired as a digital audio signal via an acoustic sensor. The terminal collects this digital audio signal and prepares it for the next processing step.

[0599] Step 2:

[0600] The terminal passes the acquired audio data to speech recognition software, which converts it from audio to text data. Depending on the speech recognition technology used, the audio signal is converted into a string of characters. The input is audio data, and the output is text data. This makes the audio information easier to analyze.

[0601] Step 3:

[0602] The terminal sends the converted text data to the server. During this process, the text data is formatted appropriately and transferred with minimal delay. The input is text data, and the output is a transfer to the server.

[0603] Step 4:

[0604] The server analyzes the received text data using a natural language processing engine. An emotion engine extracts the user's emotions from the text. The input is text data, and the output is analyzed emotion information. Emotions are classified as positive, negative, or neutral.

[0605] Step 5:

[0606] The server generates improvement suggestions based on the analysis results. Specific improvement measures are devised based on the sentiment of user feedback. The input is sentiment information, and the output is suggestion information. This allows feedback to contribute to service improvement.

[0607] Step 6:

[0608] The server sends the created improvement suggestions to the terminal. The suggestions are adjusted based on the priority and content of the feedback. The input is the suggestion information, and the output is the transfer to the terminal.

[0609] Step 7:

[0610] The terminal outputs suggestions to the user using speech synthesis technology. In addition to audio presentation, it can also display the suggestions as text using the display. Input is suggestion information from the server, and output is feedback for the user. This allows the user to understand how their feedback contributes to improvements.

[0611] (Application Example 2)

[0612] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0613] Data centers are operated by a large staff, and real-time and rapid feedback analysis is required to achieve efficient operations. However, conventional systems have difficulty appropriately analyzing the emotions contained in feedback and immediately proposing improvements, which has posed challenges to improving staff comfort and operational efficiency.

[0614] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0615] In this invention, the server includes an acquisition means for acquiring voice data, a conversion means for converting voice data into text data, and an analysis means for analyzing the text data and categorizing it, including emotions. This makes it possible to instantly analyze the emotions of staff voice feedback in a data center and provide improvement suggestions in real time.

[0616] "Voice data" refers to digital information, including the user's spoken words and sounds, acquired by acoustic sensors.

[0617] "Acquisition means" refers to devices or mechanisms for collecting audio data, and typically includes microphones and acoustic sensors.

[0618] "Conversion means" refers to algorithms or software processes for converting audio data into text data, utilizing speech recognition technology.

[0619] "Analysis means" refers to techniques and methods for analyzing text data and determining emotional categories based on individual words and phrases.

[0620] "Organization means" refers to functions that add the emotional information obtained by the analysis means and statistically classify, visualize, or organize the data.

[0621] "Generation means" refers to a mechanism that creates effective improvement suggestions based on the organized data, taking emotional information into consideration.

[0622] "Output means" refers to methods or devices for presenting improvement suggestions to the user in audio or visual form, utilizing speech synthesis or displays.

[0623] This invention realizes a system that analyzes voice data to understand emotions and generates improvement suggestions. The specific implementation of the system is described below.

[0624] The server receives data transmitted from each terminal. Audio data is captured by the terminal's acoustic sensors and converted into text data on the terminal using speech recognition technology. The main software used here is a speech recognition API (such as Google Cloud Speech-to-Text).

[0625] When text data is sent to the server, the server analyzes the text using a natural language processing library (such as spaCy). At this stage, the emotion engine classifies the emotion in the text as positive, negative, or neutral. As a result, the type and intensity of the emotion are evaluated.

[0626] Next, the server uses the organization means to statistically classify the analysis results and generates data with added emotional information based on this classification. Based on this data, the generation means creates improvement suggestions. The server then sends the generated improvement suggestions back to the terminal, which presents the suggestions to the user using speech synthesis technology (such as Amazon Polly) or a display.

[0627] For example, if a data center staff member gives feedback saying, "The air conditioning in the facility has been too strong lately, making it too cold," the emotion engine will recognize this voice as a "negative emotion." Based on this information, the server will suggest "revising the air conditioning settings" and notify the staff member of this suggestion in either voice or visual form.

[0628] An example of a prompt for a generative AI model would be: "The user feels that the air conditioning in the facility is too strong and it's too cold. What improvements would you suggest?" This allows the system to analyze emotions in real time and provide appropriate improvement suggestions.

[0629] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0630] Step 1:

[0631] The terminal acquires user voice data using an acoustic sensor. The input here is the user's voice, and the output is digitized voice data. The acoustic sensor picks up ambient sounds and records them as digital signals within the terminal.

[0632] Step 2:

[0633] The terminal uses a speech recognition API to convert audio data into text data. The input is audio data, and the output is the converted text data. The speech recognition API analyzes the audio waveform and converts each phoneme into its corresponding character to create text.

[0634] Step 3:

[0635] The terminal sends the converted text data to the server. The input is text data, and the output is the transmission of data to the server. Network protocols are used to securely and efficiently transfer the text data to the server.

[0636] Step 4:

[0637] The server applies a natural language processing library to analyze the received text data. The input is text data, and the output is the analysis result. The library divides the text into tokens, identifies the emotion of each token, and categorizes them.

[0638] Step 5:

[0639] The server uses an emotion engine to extract and evaluate emotional information from the analysis results. The input is the analysis results, and the output is the emotion category and its intensity. The emotion engine scores the emotional intensity of the identified tokens and determines whether they are positive, negative, or neutral.

[0640] Step 6:

[0641] The server uses the organization means to statistically classify this emotional information and generate a new data structure. The input is emotional information, and the output is statistically organized data. The organization means aggregates the data by category and generates statistical information.

[0642] Step 7:

[0643] The server generates improvement suggestions using a generation method. The input is statistically organized data, and the output is specific improvement suggestions. It performs the operation of generating the suggestions by applying rule-based models or machine learning models.

[0644] Step 8:

[0645] The server sends the generated improvement suggestions to the terminal. The input is the improvement suggestions, and the output is the transmission of data to the terminal. It performs the operation of transmitting data over the network, delivering information to the user at the appropriate time.

[0646] Step 9:

[0647] The device uses speech synthesis technology to present improvement suggestions to the user as audio. The input is improvement suggestions sent from the server, and the output is audio feedback. The speech synthesis engine generates text as speech and transmits the information through the device's speaker.
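
Steps 4 to 6 above can be illustrated with a minimal sketch. It assumes a tiny hand-written emotion lexicon (`EMOTION_LEXICON`) in place of the natural language processing library and emotion engine, which the disclosure does not specify; all function names and scores here are hypothetical.

```python
import re
from collections import Counter

# Hypothetical emotion lexicon: token -> signed intensity (negative < 0 < positive).
EMOTION_LEXICON = {
    "cold": -0.6,
    "long": -0.5,
    "frustrating": -0.9,
    "comfortable": 0.7,
    "great": 0.8,
}


def tokenize(text):
    """Step 4: split the text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_tokens(tokens):
    """Step 5: give each token an emotion category and an intensity."""
    scored = []
    for token in tokens:
        value = EMOTION_LEXICON.get(token, 0.0)
        if value > 0:
            category = "positive"
        elif value < 0:
            category = "negative"
        else:
            category = "neutral"
        scored.append((token, category, abs(value)))
    return scored


def aggregate(scored):
    """Step 6: count tokens per category and keep the peak intensity."""
    counts = Counter(category for _, category, _ in scored)
    peak = {}
    for _, category, intensity in scored:
        peak[category] = max(peak.get(category, 0.0), intensity)
    return {"counts": dict(counts), "peak_intensity": peak}


summary = aggregate(score_tokens(tokenize("The air conditioning is too cold")))
```

Here `summary` reports one negative token ("cold") and five neutral ones. A production emotion engine would likely use a trained model rather than a fixed word list.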

[0648] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0649] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0650] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0651] [Fourth Embodiment]

[0652] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0653] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0654] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0655] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0656] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0657] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0658] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0659] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0660] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0661] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0662] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0663] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0664] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0665] In implementing the present invention, the system mainly consists of three elements: a server, a terminal, and a user. The following describes how these elements interact and how the invention is implemented.

[0666] First, the user speaks into a terminal installed within the facility. The terminal uses a microphone to capture the user's voice and converts the captured voice data into text data using speech recognition technology. This conversion allows the voice information to be sent to the server as an analyzable string.

[0667] Next, the server analyzes the received text data using natural language processing (NLP) techniques. This analysis classifies and organizes the text data into categories. For example, a comment might be classified as a complaint about the service or as a suggestion for improvement.

[0668] The analyzed data is stored in a database on the server and statistically organized. Through this organization process, the information in the database is continuously updated, making it possible to understand trends such as which feedback patterns are increasing over time.
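
As a rough illustration of how such trends might be detected, the sketch below buckets classified feedback by ISO week and reports the categories whose counts rose between the earliest and latest week. The bucketing by week and the function names are assumptions made for this example; the disclosure does not specify a particular statistical method.

```python
from collections import defaultdict
from datetime import date


def counts_by_period(records):
    """Count feedback per (ISO year, ISO week) and category."""
    table = defaultdict(lambda: defaultdict(int))
    for day, category in records:
        year, week, _ = day.isocalendar()
        table[(year, week)][category] += 1
    return table


def increasing_categories(table):
    """Return categories whose count rose from the earliest to the latest period."""
    periods = sorted(table)
    if len(periods) < 2:
        return []
    first, last = table[periods[0]], table[periods[-1]]
    return sorted(c for c in last if last[c] > first.get(c, 0))


# Illustrative records: (date received, category assigned by the analysis step).
records = [
    (date(2024, 12, 2), "wait time"),
    (date(2024, 12, 3), "cleanliness"),
    (date(2024, 12, 9), "wait time"),
    (date(2024, 12, 10), "wait time"),
    (date(2024, 12, 11), "cleanliness"),
]
rising = increasing_categories(counts_by_period(records))
```

With these sample records, "wait time" goes from one comment in the first week to two in the second, so it is the only category reported as rising.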

[0669] The server generates new improvement suggestions based on the organized data. An algorithm is applied as the generation method, which forms specific suggestions aimed at improving the service. For example, suggestions may include strengthening employee training or adding new menu items.

[0670] Finally, the server sends the generated suggestions to the terminal, which then outputs them to the user as feedback. Output methods include direct voice notification to the user using speech synthesis technology, as well as displaying the suggestions as text on the screen.

[0671] For example, if a user provides feedback such as "The wait time is too long," the device captures the audio, converts it to text, and sends it to the server. The server identifies the complaint about "wait time" as a category, analyzes the relevant data, and generates suggestions for service improvement. These suggestions are then fed back to the user from the device, and may include suggestions such as "Increase the number of staff to reduce wait times."

[0672] Through this process, the present invention enables the efficient collection and analysis of user feedback, and the generation of immediate feedback and improvement suggestions.

[0673] The following describes the processing flow.

[0674] Step 1:

[0675] Users speak into the device to input feedback and requests via voice.

[0676] Step 2:

[0677] The device acquires the user's voice and records it as audio data.

[0678] Step 3:

[0679] The device converts the acquired audio data into text data using speech recognition technology.

[0680] Step 4:

[0681] The terminal sends the converted text data to the server.

[0682] Step 5:

[0683] The server analyzes the text data it receives using natural language processing techniques.

[0684] Step 6:

[0685] The server categorizes the text data.

[0686] Step 7:

[0687] The server updates statistical information using the classified data and saves it to the database.

[0688] Step 8:

[0689] The server generates improvement suggestions based on statistical information.

[0690] Step 9:

[0691] The server sends improvement suggestions it has generated to the terminal.

[0692] Step 10:

[0693] The terminal outputs the improvement suggestions it receives to the user through speech synthesis and on the display.
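
The ten steps above can be wired together as in the following sketch. Speech recognition, classification, and suggestion generation are passed in as plain functions so that the flow itself is visible; the class names `Terminal` and `Server` and the stub functions are hypothetical and stand in for components the disclosure leaves open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Server:
    classify: Callable[[str], str]
    suggest: Callable[[Dict[str, int]], str]
    stats: Dict[str, int] = field(default_factory=dict)

    def handle(self, text: str) -> str:
        category = self.classify(text)                           # Steps 5-6
        self.stats[category] = self.stats.get(category, 0) + 1   # Step 7
        return self.suggest(self.stats)                          # Steps 8-9


@dataclass
class Terminal:
    recognize: Callable[[bytes], str]
    server: Server
    outputs: List[str] = field(default_factory=list)

    def on_voice(self, audio: bytes) -> None:
        text = self.recognize(audio)             # Steps 2-3
        suggestion = self.server.handle(text)    # Step 4 and the server's reply
        self.outputs.append(suggestion)          # Step 10: stand-in for speech and display


# Stub stages used only for this illustration.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def classify(text: str) -> str:
    return "wait time" if "wait" in text.lower() else "other"


def suggest(stats: Dict[str, int]) -> str:
    top = max(stats, key=stats.get)
    return f"Most frequent topic: {top}"


terminal = Terminal(fake_recognize, Server(classify, suggest))
terminal.on_voice(b"The wait time is too long")
```

Injecting the stages as callables keeps the sketch independent of any particular speech recognition or language model service.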

[0694] (Example 1)

[0695] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0696] Traditional customer feedback systems suffer from inefficient processes, from acquiring and analyzing voice data to generating improvement suggestions, making it difficult to translate feedback into rapid service improvements. In particular, delays in accurately converting voice data to text and statistically organizing analysis results can slow down the implementation of feedback, making it difficult to improve customer satisfaction.

[0697] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0698] In this invention, the server includes equipment means for acquiring voice information, technical means for converting voice information into text information, and processing means for analyzing and classifying the text information. This enables rapid and accurate processing of voice feedback from customers, and prompt generation and provision of service improvement suggestions.

[0699] "Voice information" refers to the language and sounds spoken by the user, and is typically data acquired through input devices such as microphones.

[0700] "Equipment means" refers to hardware devices or equipment for acquiring voice information, specifically input devices such as microphones.

[0701] "Text information" refers to text data obtained by converting voice information, and is data in a format that can be processed and analyzed by a computer.

[0702] "Technical means" refers to software or algorithms used to convert voice information into text information, and speech recognition technology is an example of this.

[0703] "Processing means" refers to software or algorithms for analyzing acquired text information and classifying its content, and natural language processing technology falls under this category.

[0704] "Data means" refers to a system or software for collecting, storing, and organizing analyzed text information, and database management systems are an example of this.

[0705] "Algorithmic means" refers to computational methods or processes for generating improvement suggestions from organized data, and machine learning algorithms fall under this category.

[0706] "Display means" refers to a device or method for presenting generated improvement suggestions to the user, and this includes displays and speech synthesis devices.

[0707] This invention is a voice processing system aimed at improving services through feedback collection and analysis using voice information. The following describes specific implementations of this system.

[0708] Users provide verbal feedback to a terminal installed within the facility. This terminal is equipped with a microphone, which acts as a voice acquisition device to capture voice information. The terminal uses speech recognition technology to convert the voice information into text information. Specifically, it utilizes a commonly available speech recognition API. This API functions as a technical means for converting the acquired voice into text data.

[0709] The text information converted by the terminal is sent to the server. The server analyzes this text information using natural language processing technology. This analysis process includes algorithmic means for classifying text data according to its content. For example, a Python-based natural language processing library is used to analyze the text and classify it into categories such as "waiting time" or "service."
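
A minimal sketch of such a classification is shown below. It assumes simple keyword matching with a hand-written keyword table (`CATEGORY_KEYWORDS`), which is only one possible approach; the disclosure does not name a specific library or algorithm.

```python
# Hypothetical keyword table: category -> substrings that indicate it.
CATEGORY_KEYWORDS = {
    "waiting time": ("wait", "queue", "slow"),
    "service": ("staff", "rude", "attitude", "service"),
}


def classify_feedback(text, default="other"):
    """Return the category with the most keyword hits, or `default` if none match."""
    lowered = text.lower()
    hits = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else default


label = classify_feedback("The wait time is too long")
```

Substring matching is crude (for example, it ignores negation), so a real deployment would likely rely on a trained text classifier.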

[0710] The analyzed data is stored in a database management system on the server. Database technologies such as MySQL are used for data organization. This stored data serves as foundational material for revealing trends through statistical analysis.
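
The storage and organization step might look like the sketch below. To keep the example self-contained it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for MySQL; the table name `feedback` and its columns are assumptions made for illustration.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical feedback table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feedback ("
        " id INTEGER PRIMARY KEY,"
        " received_on TEXT NOT NULL,"  # ISO date string, e.g. 2024-12-03
        " category TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def save_feedback(conn, received_on, category, body):
    """Store one analyzed feedback item using a parameterized query."""
    conn.execute(
        "INSERT INTO feedback (received_on, category, body) VALUES (?, ?, ?)",
        (received_on, category, body),
    )


def category_totals(conn):
    """Return (category, count) pairs, most frequent first."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM feedback "
        "GROUP BY category ORDER BY COUNT(*) DESC, category"
    )
    return rows.fetchall()


conn = open_store()
save_feedback(conn, "2024-12-02", "waiting time", "The wait time is too long")
save_feedback(conn, "2024-12-03", "service", "The staff seemed rushed")
save_feedback(conn, "2024-12-03", "waiting time", "Long queue at lunch")
totals = category_totals(conn)
```

The same `GROUP BY` query would work on MySQL, although the connection code and placeholder syntax depend on the driver used.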

[0711] Based on the organized data, the server uses a generation algorithm to formulate improvement suggestions. Machine learning frameworks such as Scikit-learn are used for this generation process. The generated suggestions are then communicated to the user in a concrete and actionable format.

[0712] The terminal provides the user with improvement suggestions received from the server using a display and speech generation technology. For speech generation, a common speech synthesis technology, such as the Text-to-Speech API, is used.

[0713] For example, if a user provides voice feedback stating, "The wait time is too long," the device converts this to text and sends it to the server. The server identifies the complaint about the wait time, analyzes it based on relevant data, and generates a suggestion. This suggestion is ultimately provided to the user, either by voice or as text, and may include a message such as, "Increase staffing levels to reduce wait times."

[0714] An example of a prompt to input into the generative AI model is text such as, "How does the system generate improvement suggestions after receiving voice feedback?" This prompt is useful for testing the system's behavior and evaluating the improvement suggestion generation process.

[0715] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0716] Step 1:

[0717] Users provide verbal feedback to terminals installed within the facility. The input is voice information captured via the terminal's microphone. For example, when a user says, "The wait time is long," the voice data is recorded.

[0718] Step 2:

[0719] The device converts the acquired voice information into text information using speech recognition technology. Here, the input is audio data and the output is text data. A speech recognition API is used in this process; specifically, the audio is converted into the text "The wait time is long."

[0720] Step 3:

[0721] The terminal sends the converted text data to the server. The input text data is securely transferred using the HTTPS protocol. Specifically, the text "The wait time is long" arrives at the server.

[0722] Step 4:

[0723] The server analyzes the received text data using natural language processing techniques. The text data received as input is categorized after analysis. For example, keywords such as "waiting time" are extracted and categorized as complaints. A natural language processing library is used for this process.

[0724] Step 5:

[0725] The server stores the analyzed data in a database and organizes it statistically. The analyzed data, as input, is stored in the database management system, and time-series data is generated as output. Specifically, data related to waiting times is organized to understand feedback trends.

[0726] Step 6:

[0727] The server applies an algorithm to generate improvement suggestions based on the organized data. The organized data, as input, is analyzed through the algorithm, and specific suggestions are generated as output. For example, a suggestion might be made to "increase the number of staff" to reduce waiting times.

[0728] Step 7:

[0729] The server sends the generated improvement suggestions to the terminal, which then presents the suggestions to the user either visually or audibly. The input is the generated suggestions, which are presented to the user as output. Specifically, the suggestions are presented either audibly or as text, communicating to the user, "We will increase staff numbers to reduce waiting times."
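
Step 3 of this flow, the HTTPS transfer, can be sketched as follows. The endpoint URL, the JSON field names, and the terminal identifier are all hypothetical; the disclosure states only that HTTPS is used. The request is built but not sent, because the endpoint is fictional.

```python
import json
import urllib.request

# Hypothetical server endpoint; not taken from the disclosure.
SERVER_URL = "https://feedback.example.com/api/feedback"


def build_feedback_request(text, terminal_id):
    """Package the recognized text as a JSON POST request over HTTPS."""
    payload = json.dumps({"terminal_id": terminal_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_feedback_request("The wait time is long", "terminal-01")
# urllib.request.urlopen(request) would transmit it to a real server.
```

Using an `https://` URL means the standard library negotiates TLS and verifies the server certificate by default when the request is actually opened.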

[0730] (Application Example 1)

[0731] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0732] In physical stores, there is a need to efficiently and immediately collect and analyze customer feedback to quickly propose concrete improvement measures that enhance the customer experience. However, traditional methods often involve manual feedback collection, lacking immediacy. Furthermore, the process of effectively analyzing the collected feedback and generating specific improvement proposals is not sufficiently automated. As a result, it is difficult to quickly implement improvement measures that lead to increased customer satisfaction.

[0733] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0734] In this invention, the server includes an acquisition means for acquiring audio data, a conversion means for converting audio data into text data, an analysis means for analyzing text data and classifying it into categories related to customer experience, an organization means for accumulating the analysis results and statistically organizing them as time-series data, a generation means for generating new improvement suggestions, and an output means for notifying improvement suggestions using audio output or a display device. This enables efficient collection and analysis of customer feedback in physical stores and allows for the immediate proposal of concrete improvement measures.

[0735] "Acquisition means" refers to devices and technologies for directly collecting voice, which allows user feedback to be incorporated into the system.

[0736] "Conversion means" refers to technology for converting collected audio data into text data, and plays the role of replacing audio information with an analyzable format.

[0737] "Analysis means" refers to technologies and algorithms that analyze converted text data in detail and classify it into specific categories, and is used with the aim of clearly understanding the user's intentions and emotions.

[0738] "Organization means" refers to the process of systematically storing analyzed data using time series and other statistical methods, thereby enabling continuous data management and trend analysis.

[0739] "Generation means" refers to algorithms and technologies for devising new service improvement measures and improvement proposals based on organized data, and has the ability to formulate concrete action plans.

[0740] "Output means" refers to technology for notifying users of generated improvement suggestions via audio or visual means, and plays a role in effectively providing feedback to users.

[0741] Modes for carrying out the invention

[0742] The system implementing this invention mainly consists of three elements: a server, a terminal, and a user. The user can provide voice feedback via a terminal installed in the store. The terminal has a built-in voice input device, which makes it possible to effectively acquire the user's voice data.

[0743] A voice input device (e.g., a standard microphone) is used to acquire the voice data. The terminal sends the acquired voice data to the server. The server converts this voice data into text data using speech recognition software (such as the Google Cloud Speech-to-Text API).

[0744] Text data is analyzed on the server using natural language processing technologies (such as the Google NLP API) and classified into categories related to the customer experience. This makes it easier to organize the specific feedback provided by users. The analyzed data is stored in a database (such as Firebase or AWS DynamoDB) and statistically organized over time.

[0745] Based on the organized data, the server uses Python scripts and the scikit-learn library to generate improvement suggestions. These suggestions are then communicated to the user via text-to-speech software or a display device. This allows stores to respond immediately to customer feedback and improve their service.

[0746] For example, if a user speaks into the device saying, "The product placement is confusing," the audio is converted into text and categorized. The organized data is stored as a problem related to "product placement," and the server may generate suggestions such as "assign staff to provide product information" or "improve the store layout."

[0747] An example of a prompt to input into the generative AI model is: "Based on the following feedback data, generate service improvement suggestions for physical stores: 'The product layout is unclear, and store guidance is insufficient.'"
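
A prompt of this form could be assembled from stored feedback as in the sketch below. The function name `build_prompt` and the bulleted layout of the feedback items are assumptions; only the opening sentence follows the example prompt above.

```python
def build_prompt(feedback_items, venue="physical stores"):
    """Assemble a prompt that lists each feedback item on its own line."""
    if not feedback_items:
        raise ValueError("at least one feedback item is required")
    quoted = "\n".join(f"- '{item.strip()}'" for item in feedback_items)
    return (
        "Based on the following feedback data, generate service improvement "
        f"suggestions for {venue}:\n{quoted}"
    )


prompt = build_prompt(
    ["The product layout is unclear", " store guidance is insufficient "]
)
```

The resulting string would then be passed to whichever generative AI model the server uses; that call is omitted here because the disclosure does not fix a particular model API.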

[0748] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0749] Step 1:

[0750] The user provides feedback via a voice input device built into the terminal. Voice data is acquired as input. The terminal collects this as digital voice data in preparation for sending it to the server.

[0751] Step 2:

[0752] The server uses speech recognition software (e.g., Google Cloud Speech-to-Text API) to convert the received audio data into text data. The input is audio data, and the output is a string representing the content of the audio.

[0753] Step 3:

[0754] The server analyzes the converted text data using natural language processing techniques (e.g., Google NLP API). This analysis categorizes the text into specific categories related to the customer experience. The input is text data, and the output is categorized information.

[0755] Step 4:

[0756] The server stores the analyzed data in a database. This data is organized based on time series. The specific operation of this step is to accumulate feedback by category and generate statistical data to understand trends.

[0757] Step 5:

[0758] The server generates new improvement suggestions based on the organized data. By utilizing Python scripts and the scikit-learn library to perform data calculations, it outputs specific improvement proposals for store operations.

[0759] Step 6:

[0760] The server sends the generated improvement suggestions to the terminal. The terminal notifies the user of the suggestions either verbally using speech synthesis technology or visually by displaying them on the screen. The output is the suggested content in either audio or text format. This process allows the user to receive improvement suggestions immediately.

[0761] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0762] This invention is a system that acquires and analyzes voice data and provides consulting and improvement suggestions that include the user's emotions. The system mainly consists of three elements: a server, a terminal, and the user.

[0763] First, the user makes a voice input into a terminal within the facility. This voice input includes feedback, requests, and comments. The terminal uses a microphone to acquire the voice data and immediately converts it into text data using speech recognition technology. This process converts the voice information into a format that the system can analyze.

[0764] Next, the terminal sends the converted text data to the server. The server analyzes the text data using natural language processing technology. This analysis incorporates a newly developed emotion engine that recognizes and extracts emotions from the words spoken by the user. The emotion engine can independently identify emotion categories such as positive, negative, and neutral from the text and evaluate their intensity.

[0765] The server uses text data and sentiment information to generate service improvement suggestions. Sentiment information directly influences the priority of the suggestions and the response strategy, and is a crucial element for gaining a rich understanding of the emotions users experience.
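
One way sentiment could influence priority is sketched below. The weights in `EMOTION_WEIGHT` and the scoring formula are assumptions chosen purely for illustration, so that strongly negative feedback is handled first; the disclosure does not define how priority is computed.

```python
from dataclasses import dataclass

# Assumed weights: negative feedback is treated as most urgent.
EMOTION_WEIGHT = {"negative": 2.0, "neutral": 1.0, "positive": 0.5}


@dataclass
class Candidate:
    suggestion: str
    emotion: str      # "positive", "negative", or "neutral"
    intensity: float  # 0.0 (weak) to 1.0 (strong)


def priority(candidate):
    """Combine emotion type and intensity into a single score."""
    return EMOTION_WEIGHT[candidate.emotion] * (1.0 + candidate.intensity)


def prioritize(candidates):
    """Order suggestions from highest to lowest priority."""
    return sorted(candidates, key=priority, reverse=True)


ranked = prioritize([
    Candidate("Keep the current menu", "positive", 0.8),
    Candidate("Strengthen staff customer service training", "negative", 0.9),
    Candidate("Review store signage", "neutral", 0.2),
])
```

With these sample values the training suggestion ranks first, followed by the signage review and then the menu item.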

[0766] Finally, the generated improvement suggestions are sent from the server to the terminal. The terminal then presents feedback to the user, taking into account the analysis results of the emotion engine, through speech synthesis technology and display. This enables a more flexible and human-like response.

[0767] For example, if a user provides feedback to their device such as, "Recently, the staff's customer service attitude has been frustrating," the device converts this audio into text and sends it to the server. Based on this text, the server recognizes the user's frustration as a negative emotion and suggests "strengthening staff customer service training" as a corresponding improvement measure. This suggestion is then provided to the user as audio feedback.

[0768] In this way, the present invention realizes a system that improves the quality of user feedback and supports more effective service improvement by using emotion-based data.

[0769] The following describes the processing flow.

[0770] Step 1:

[0771] Users input feedback and requests via voice into their devices. These voice inputs may contain the user's emotions.

[0772] Step 2:

[0773] The device uses the microphone to capture the user's voice. The captured voice data is temporarily stored in the internal memory.

[0774] Step 3:

[0775] The device uses speech recognition technology to convert the acquired speech data into text data. A highly accurate speech recognition engine is used during this process to reduce misrecognition of the spoken words.

[0776] Step 4:

[0777] The device sends the converted text data to the server. This transmission uses data encryption to protect user privacy.

[0778] Step 5:

[0779] The server receives text data and analyzes its content using natural language processing (NLP) techniques. This analysis includes keyword extraction and contextual understanding.

[0780] Step 6:

[0781] The server uses an emotion engine to analyze the user's emotions contained in the text data. It evaluates the type of emotion (positive, negative, neutral) and its intensity.

[0782] Step 7:

[0783] The server combines text data content with sentiment information to generate improvement suggestions in response to user requests and complaints. These suggestions are prioritized based on the type and intensity of the sentiment.

[0784] Step 8:

[0785] The server generates improvement suggestions and sends them to the terminal. Data security is also considered here, and encryption is performed during transmission.

[0786] Step 9:

[0787] The terminal uses speech synthesis technology to provide voice feedback to the user based on improvement suggestions received from the server. Additionally, the suggestions are displayed as text on the screen as needed.

[0788] Step 10:

[0789] Users can receive feedback and provide further requests and feedback. This process can be repeated.
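
Steps 4 and 8 above mention encryption during transmission without naming a method. As one possible realization, the sketch below prepares a TLS client context with Python's standard `ssl` module; the choice of TLS and the minimum version of TLS 1.2 are assumptions, not requirements stated in the disclosure.

```python
import ssl


def make_client_tls_context():
    """Build a client-side TLS context with certificate and host name checks."""
    context = ssl.create_default_context()  # enables verification by default
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy
    return context


tls_context = make_client_tls_context()
```

Such a context can be passed to standard library clients, for example as the `context` argument of `urllib.request.urlopen`, so that text data and suggestions travel over an encrypted channel.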

[0790] (Example 2)

[0791] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0792] Conventional voice data processing systems lack the ability to effectively analyze user feedback and propose concrete improvement measures, making it difficult to rapidly improve services and enhance user satisfaction. Furthermore, they struggle to provide feedback that takes emotions into account, and are insufficient to meet users' essential needs.

[0793] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0794] In this invention, the server includes receiving means for acquiring audio data, conversion means for converting audio data into text data, and processing means for analyzing text data and extracting emotions. This makes it possible to generate highly accurate improvement suggestions based on user feedback.

[0795] "Receiving means" refers to the devices or functions used to acquire audio data.

[0796] "Conversion means" refers to the process or technology used to convert acquired audio data into text data.

[0797] "Processing means" refers to functions and algorithms for analyzing text data and extracting emotions.

[0798] "Creation means" refers to the methods and functions for creating proposals based on the analysis results.

[0799] "Display means" refers to output functions or devices for presenting the generated proposals to the user.

[0800] This invention is a system that analyzes voice data and generates specific service improvement suggestions based on user feedback. This system primarily consists of a server, a terminal, and the user.

[0801] First, the user provides voice input to the device. For example, they can provide voice feedback such as, "The wait time at this store is long." The device uses a microphone that functions as an acoustic sensor to capture voice data.

[0802] Next, the device uses speech recognition technology to convert this speech data into text data. This process utilizes speech recognition software, and specifically, commonly used speech recognition APIs are suitable.

[0803] Next, the device sends this converted text data to the server. The server uses a natural language processing engine to analyze the text data in detail and extracts emotional data from the user feedback through an emotion engine. This analysis makes it possible to classify the feedback into positive, negative, or neutral emotions.

[0804] Based on the analysis results, the server creates specific improvement suggestions that reflect the sentiment data. For example, in response to negative feedback, suggestions such as increasing staff or managing shifts more efficiently may be generated.
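
A rule-based version of this step might look like the sketch below. The rule table `SUGGESTION_RULES`, the topic labels, and the fallback messages are hypothetical; the disclosure also allows machine learning models for this step.

```python
# Hypothetical rules: (sentiment, topic) -> candidate improvement suggestions.
SUGGESTION_RULES = {
    ("negative", "wait time"): [
        "Increase staff at peak hours",
        "Manage shifts more efficiently",
    ],
    ("negative", "service"): ["Strengthen staff customer service training"],
}


def create_suggestions(sentiment, topic):
    """Map the analyzed sentiment and topic to improvement suggestions."""
    if sentiment == "positive":
        return ["Keep the current practice for " + topic]
    if sentiment == "neutral":
        return []
    # Negative feedback without a matching rule is flagged for manual review.
    fallback = ["Review feedback on " + topic + " manually"]
    return list(SUGGESTION_RULES.get((sentiment, topic), fallback))


suggestions = create_suggestions("negative", "wait time")
```

Keeping the rules in a table makes it straightforward for facility staff to add new topics without changing the lookup logic.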

[0805] Finally, the server sends the generated improvement suggestions to the terminal, which presents them to the user using speech synthesis technology, for example through the API of a speech technology provider. The suggestions can also be presented visually on a display.

[0806] An example of a prompt message might be, "Please provide feedback. Based on your input, we will make suggestions for improving our service." This system allows for the effective use of user feedback based on emotions, enabling service improvements that enhance the user experience.

[0807] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0808] Step 1:

[0809] The user provides voice feedback to the terminal. The feedback is acquired as a digital audio signal via an acoustic sensor. The terminal collects this digital audio signal and prepares it for the next processing step.

[0810] Step 2:

[0811] The terminal passes the acquired audio data to speech recognition software, which converts it into text data. The speech recognition technology converts the audio signal into a string of characters. The input is audio data, and the output is text data. This makes the audio information easier to analyze.

[0812] Step 3:

[0813] The terminal sends the converted text data to the server. During this process, the text data is formatted appropriately and transferred with minimal delay. The input is text data, and the output is a transfer to the server.

[0814] Step 4:

[0815] The server analyzes the received text data using a natural language processing engine. An emotion engine extracts the user's emotions from the text. The input is text data, and the output is analyzed emotion information. Emotions are classified as positive, negative, or neutral.

[0816] Step 5:

[0817] The server generates improvement suggestions based on the analysis results. Specific improvement measures are devised based on the sentiment of user feedback. The input is sentiment information, and the output is suggestion information. This allows feedback to contribute to service improvement.

[0818] Step 6:

[0819] The server sends the created improvement suggestions to the terminal. The suggestions are adjusted based on the priority and content of the feedback. The input is the suggestion information, and the output is the transfer to the terminal.

[0820] Step 7:

[0821] The terminal outputs suggestions to the user using speech synthesis technology. In addition to audio presentation, it can also display the suggestions as text using the display. Input is suggestion information from the server, and output is feedback for the user. This allows the user to understand how their feedback contributes to improvements.
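The server-side part of the flow above (Steps 3 to 5) can be illustrated with a minimal sketch. It rests on assumptions that do not come from the source: the keyword lexicons, the `SUGGESTIONS` table, the JSON payload shape, and every function name are hypothetical, and a simple keyword count stands in for the natural language processing engine and emotion engine, whose internals the text does not specify.

```python
import json
from dataclasses import dataclass

# Hypothetical keyword lexicons; the source does not say how the emotion engine decides.
NEGATIVE_WORDS = {"long", "slow", "cold", "rude", "dirty"}
POSITIVE_WORDS = {"great", "friendly", "fast", "clean", "helpful"}

# Hypothetical suggestion table keyed by sentiment.
SUGGESTIONS = {
    "negative": "Consider increasing staff or managing shifts more efficiently.",
    "positive": "Keep the current practice and share it with other locations.",
    "neutral": "No change proposed; continue collecting feedback.",
}


@dataclass
class Analysis:
    text: str
    sentiment: str  # "positive", "negative", or "neutral"


def build_payload(text: str) -> str:
    """Step 3: format the recognized text for transfer to the server."""
    return json.dumps({"feedback": text})


def classify_sentiment(payload: str) -> Analysis:
    """Step 4: classify the received text as positive, negative, or neutral."""
    text = json.loads(payload)["feedback"]
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    negative_hits = len(words & NEGATIVE_WORDS)
    positive_hits = len(words & POSITIVE_WORDS)
    if negative_hits > positive_hits:
        sentiment = "negative"
    elif positive_hits > negative_hits:
        sentiment = "positive"
    else:
        sentiment = "neutral"
    return Analysis(text, sentiment)


def create_suggestion(analysis: Analysis) -> str:
    """Step 5: map the sentiment to an improvement suggestion."""
    return SUGGESTIONS[analysis.sentiment]


payload = build_payload("The wait time at this store is long.")
analysis = classify_sentiment(payload)
suggestion = create_suggestion(analysis)
```

In an actual deployment the keyword count would presumably be replaced by a trained model. The sketch only shows how the input and output of each step connect.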

[0822] (Application Example 2)

[0823] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0824] Data centers are operated by a large staff, and real-time and rapid feedback analysis is required to achieve efficient operations. However, conventional systems have difficulty appropriately analyzing the emotions contained in feedback and immediately proposing improvements, which has posed challenges to improving staff comfort and operational efficiency.

[0825] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0826] In this invention, the server includes an acquisition means for acquiring voice data, a conversion means for converting voice data into text data, and an analysis means for analyzing the text data and categorizing it, including emotions. This makes it possible to instantly analyze the emotions of staff voice feedback in a data center and provide improvement suggestions in real time.

[0827] "Voice data" refers to digital information, including the user's spoken words and sounds, acquired by acoustic sensors.

[0828] "Acquisition means" refers to devices or mechanisms for collecting audio data, and typically includes microphones and acoustic sensors.

[0829] "Conversion means" refers to algorithms or software processes for converting audio data into text data, utilizing speech recognition technology.

[0830] "Analysis means" refers to techniques and methods for analyzing text data and determining emotional categories based on individual words and phrases.

[0831] "Organizing means" refers to functions that attach the emotional information obtained by the analysis means to the data and then statistically classify, visualize, or organize it.

[0832] "Generation means" refers to a mechanism that creates effective improvement suggestions based on the organized data, taking the emotional information into consideration.

[0833] "Output means" refers to methods or devices for presenting improvement suggestions to the user in audio or visual form, utilizing speech synthesis or displays.

[0834] This invention realizes a system that analyzes voice data to understand emotions and generates improvement suggestions. The specific implementation of the system is described below.

[0835] The server receives audio data transmitted from each terminal. This audio data is captured by the terminal's acoustic sensors and converted into text data on the spot using speech recognition technology. The main software used here is a speech recognition API (such as Google Cloud Speech-to-Text).

[0836] When text data is sent to the server, the server analyzes the text using a natural language processing library (such as spaCy). At this stage, the emotion engine classifies the emotion in the text as positive, negative, or neutral. As a result, the type and intensity of the emotion are evaluated.

[0837] Next, the server uses a sorting mechanism to statistically classify the analysis results and generates data with added emotional information based on this classification. Based on this data, a generation mechanism creates improvement suggestions. The server then sends the generated improvement suggestions back to the terminal, which presents the suggestions to the user using speech synthesis technology (such as Amazon Polly) or a display.

[0838] For example, if a data center staff member gives feedback saying, "The air conditioning in the facility has been too strong lately, making it too cold," the emotion engine will recognize this voice as a "negative emotion." Based on this information, the server will suggest "revising the air conditioning settings" and notify the staff member of this suggestion in either voice or visual form.

[0839] An example of a prompt for a generative AI model would be: "The user feels that the air conditioning in the facility is too strong and it's too cold. What improvements would you suggest?" This allows the system to analyze emotions in real time and provide appropriate improvement suggestions.
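As an illustration only, a prompt of this form could be assembled from a short description of the complaint. The function name `build_prompt` and its parameter are hypothetical; the source gives only the example wording, not how the prompt is constructed.

```python
def build_prompt(complaint: str) -> str:
    """Wrap a complaint clause in the prompt wording quoted in the text."""
    clause = complaint.strip().rstrip(".")
    if not clause:
        raise ValueError("complaint must not be empty")
    return f"The user feels that {clause}. What improvements would you suggest?"


prompt = build_prompt(
    "the air conditioning in the facility is too strong and it's too cold"
)
```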

[0840] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0841] Step 1:

[0842] The terminal acquires the user's voice data using an acoustic sensor. The input is the user's voice, and the output is digitized voice data. The acoustic sensor picks up ambient sound and records it as a digital signal within the terminal.

[0843] Step 2:

[0844] The terminal uses a speech recognition API to convert the audio data into text data. The input is audio data, and the output is the converted text data. The speech recognition API analyzes the audio waveform and converts each phoneme into its corresponding character to create text.

[0845] Step 3:

[0846] The terminal sends the converted text data to the server. The input is text data, and the output is the transmission of data to the server. Network protocols are used to securely and efficiently transfer the text data to the server.

[0847] Step 4:

[0848] The server applies a natural language processing library to analyze the received text data. The input is text data, and the output is the analysis result. The library divides the text into tokens, identifies the emotion of each token, and categorizes them.

[0849] Step 5:

[0850] The server uses an emotion engine to extract and evaluate emotional information from the analysis results. The input is the analysis results, and the output is the emotion category and its intensity. The emotion engine scores the emotional intensity of the identified tokens and determines whether they are positive, negative, or neutral.

[0851] Step 6:

[0852] The server uses sorting tools to statistically classify this emotional information and generate a new data structure. The input is emotional information, and the output is statistically sorted data. The sorting tools aggregate the data by category and generate statistical information.

[0853] Step 7:

[0854] The server generates improvement suggestions using the generation means. The input is statistically organized data, and the output is specific improvement suggestions. The suggestions are generated by applying rule-based models or machine learning models.

[0855] Step 8:

[0856] The server sends the generated improvement suggestions to the terminal. The input is the improvement suggestions, and the output is the transmission of data to the terminal. The data is transmitted over the network so that the information reaches the user at the appropriate time.

[0857] Step 9:

[0858] The terminal uses speech synthesis technology to present the improvement suggestions to the user as audio. The input is the improvement suggestions sent from the server, and the output is audio feedback. The speech synthesis engine converts the suggestion text into speech and plays it through the terminal's speaker.
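Steps 4 to 7 above (tokenization, emotion scoring with intensity, statistical organization, and rule-based suggestion generation) can be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed implementation: the token scores, the topic keywords, the ±0.1 thresholds, and all function names are hypothetical, and a regular expression stands in for the natural language processing library.

```python
import re
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical token-level emotion scores in [-1, 1] (negative = unpleasant).
TOKEN_SCORES = {"cold": -0.6, "strong": -0.3, "noisy": -0.5,
                "comfortable": 0.7, "quiet": 0.4}
# Hypothetical keywords that map a token to a feedback topic.
TOPICS = {"air": "air conditioning", "conditioning": "air conditioning",
          "noise": "noise", "noisy": "noise"}


def tokenize(text):
    """Step 4: split the text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_emotion(tokens):
    """Step 5: return (category, intensity) from the scored tokens."""
    scores = [TOKEN_SCORES[t] for t in tokens if t in TOKEN_SCORES]
    intensity = mean(scores) if scores else 0.0
    if intensity > 0.1:
        return "positive", intensity
    if intensity < -0.1:
        return "negative", intensity
    return "neutral", intensity


def organize(feedback_items):
    """Step 6: count emotion categories per topic."""
    table = defaultdict(Counter)
    for text in feedback_items:
        tokens = tokenize(text)
        category, _ = score_emotion(tokens)
        topics = {TOPICS[t] for t in tokens if t in TOPICS} or {"other"}
        for topic in topics:
            table[topic][category] += 1
    return table


def generate_suggestions(table):
    """Step 7: rule-based suggestions for topics with mostly negative feedback."""
    return [f"Review the {topic} settings."
            for topic, counts in sorted(table.items())
            if counts["negative"] > counts["positive"]]


feedback = [
    "The air conditioning in the facility has been too strong lately, "
    "making it too cold",
    "The server room is quiet and comfortable",
]
table = organize(feedback)
suggestions = generate_suggestions(table)
```

The rule in `generate_suggestions` is one simple choice. The text also allows machine learning models at this step, which this sketch does not attempt.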

[0859] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0860] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt and outputs the inference results in a data format such as audio data or text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0861] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0862] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0863] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0864] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0865] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
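The layout described for the emotion map 400 can be pictured as a polar coordinate system. The sketch below is illustrative only: the class `EmotionPoint`, the ring numbers, and the angles assigned to each emotion are hypothetical placements chosen to agree with the text (response on the left, situation on the right, pleasure above, displeasure below) and are not read from Figure 9.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionPoint:
    name: str
    ring: int         # 1 = innermost ring (most primitive); larger = further out
    angle_deg: float  # 0 = 3 o'clock, 90 = 12 o'clock, counter-clockwise

    def xy(self):
        """Cartesian position on the map, with the center at (0, 0)."""
        rad = math.radians(self.angle_deg)
        return (self.ring * math.cos(rad), self.ring * math.sin(rad))


def describe(point, eps=1e-9):
    """Return (side, valence) for a point, following the layout in the text."""
    x, y = point.xy()
    if abs(x) < eps:
        side = "both"
    else:
        side = "situation" if x > 0 else "response"
    if abs(y) < eps:
        valence = "neutral"
    else:
        valence = "pleasure" if y > 0 else "displeasure"
    return side, valence


def distance(a, b):
    """Euclidean distance between two emotions on the map."""
    ax, ay = a.xy()
    bx, by = b.xy()
    return math.hypot(ax - bx, ay - by)


# Hypothetical placements, not taken from the figure.
desire = EmotionPoint("desire", ring=2, angle_deg=150.0)
reflection = EmotionPoint("reflection", ring=3, angle_deg=-30.0)
reassured = EmotionPoint("reassured", ring=2, angle_deg=10.0)
anxious = EmotionPoint("anxious", ring=2, angle_deg=-10.0)
```

With these placements, `distance(reassured, anxious)` is small, reflecting emotions near the 3 o'clock position that the text says fluctuate between security and anxiety.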

[0866] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
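The idea that pleasure and discomfort follow from how far a balance is from its ideal can be expressed as a simple score. The following is a hedged sketch: the linear formula, the `tolerance` parameter, the clamping to [-1, 1], and the example readings are all assumptions introduced for illustration and do not appear in the source.

```python
from statistics import mean


def balance_valence(value: float, ideal: float, tolerance: float) -> float:
    """Return 1.0 at the ideal, 0.0 at `tolerance` away, clamped at -1.0."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    deviation = abs(value - ideal) / tolerance
    return max(-1.0, 1.0 - deviation)


def overall_valence(readings: dict) -> float:
    """Average the valence over several balances (name -> (value, ideal, tolerance))."""
    return mean([balance_valence(*r) for r in readings.values()])


# Hypothetical robot readings: battery level in percent, posture tilt in degrees.
robot_readings = {
    "battery": (20.0, 100.0, 80.0),
    "posture": (0.0, 0.0, 30.0),
}
robot_valence = overall_valence(robot_readings)
```

A positive result corresponds to pleasure and a negative one to discomfort. Averaging is only one possible way to combine several balances.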

[0867] The emotion map defines two emotions that promote learning. One is the negative emotion around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion around "desire" on the response side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0868] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
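The source states that the network is trained so that emotions located close together have similar values, but it does not say how. One possible way to express that constraint, shown purely as an assumption, is a smoothness penalty that weights the squared difference between two emotion values by how close the emotions are on the map. The positions, the Gaussian weighting, and the function names below are hypothetical.

```python
import math

# Hypothetical 2-D positions on the emotion map (not taken from Figure 9 or 10).
POSITIONS = {
    "reassured": (2.0, 0.3),
    "calm": (2.2, 0.5),
    "anxious": (2.0, -1.5),
}


def smoothness_penalty(values, positions=POSITIONS, scale=1.0):
    """Sum of w_ij * (v_i - v_j)^2, where w_ij decays with map distance."""
    names = sorted(values)
    penalty = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dx = positions[a][0] - positions[b][0]
            dy = positions[a][1] - positions[b][1]
            weight = math.exp(-(dx * dx + dy * dy) / (2 * scale ** 2))
            penalty += weight * (values[a] - values[b]) ** 2
    return penalty


def dominant_emotion(values):
    """Pick the emotion with the highest emotion value."""
    return max(values, key=values.get)


# Neighbouring emotions share similar values here ...
similar = {"reassured": 0.8, "calm": 0.75, "anxious": 0.1}
# ... while here two neighbours disagree strongly.
dissimilar = {"reassured": 0.8, "calm": 0.1, "anxious": 0.75}
```

Added to a training loss, such a term would favor outputs like `similar` over outputs like `dissimilar`. Whether the disclosed model uses anything of this kind is not stated.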

[0869] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0870] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0871] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0872] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0873] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0874] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0875] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0876] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0877] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0878] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0879] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0880] The following is further disclosed regarding the embodiments described above.

[0881] (Claim 1)

[0882] A means of acquiring audio data,

[0883] A conversion means for converting the aforementioned audio data into text data,

[0884] An analytical means for analyzing and categorizing the aforementioned text data,

[0885] A means for statistically organizing the aforementioned analysis results,

[0886] A generation means for generating improvement proposals based on the aforementioned analysis results,

[0887] An output means for outputting the aforementioned improvement proposal,

[0888] A system that includes this.

[0889] (Claim 2)

[0890] The system according to claim 1, characterized in that the output means provides feedback using speech synthesis technology.

[0891] (Claim 3)

[0892] The system according to claim 1, characterized in that the acquisition means acquires audio data through a microphone.

[0893] "Example 1"

[0894] (Claim 1)

[0895] A device for acquiring audio information,

[0896] Technical means for converting the aforementioned audio information into text information,

[0897] Processing means for analyzing and classifying the aforementioned character information,

[0898] A data means for statistically organizing the aforementioned analysis results,

[0899] An algorithm means for generating improvement suggestions based on the aforementioned analysis results,

[0900] A display means for providing the aforementioned improvement suggestion,

[0901] An information processing device that includes these.

[0902] (Claim 2)

[0903] The information processing apparatus according to claim 1, characterized in that the display means provides a response using voice generation technology.

[0904] (Claim 3)

[0905] The information processing apparatus according to claim 1, characterized in that the aforementioned device means acquires voice information through a voice collection device.

[0906] "Application Example 1"

[0907] (Claim 1)

[0908] A means of acquiring audio data,

[0909] A conversion means for converting the aforementioned audio data into text data,

[0910] An analytical means for analyzing the aforementioned text data and classifying it into categories related to customer experience,

[0911] A means for accumulating the aforementioned analysis results and organizing them statistically as time-series data,

[0912] A generation means for generating new improvement proposals using the aforementioned analysis results,

[0913] Output means for notifying the aforementioned improvement suggestion using voice output or a display device,

[0914] A system that includes this.

[0915] (Claim 2)

[0916] The system according to claim 1, characterized in that the analysis means immediately classifies customer feedback using speech recognition technology.

[0917] (Claim 3)

[0918] The system according to claim 1, characterized in that the acquisition means acquires voice data using a voice input device.

[0919] "Example 2 of combining an emotion engine"

[0920] (Claim 1)

[0921] A receiving means for acquiring audio data,

[0922] A conversion means for converting the aforementioned audio data into text data,

[0923] Processing means for analyzing the aforementioned text data and extracting emotions,

[0924] A means for creating a proposal based on the aforementioned analysis results,

[0925] A display means for presenting the aforementioned proposal,

[0926] A system that includes this.

[0927] (Claim 2)

[0928] The system according to claim 1, characterized in that the display means provides feedback using voice conversion technology.

[0929] (Claim 3)

[0930] The system according to claim 1, characterized in that the receiving means acquires audio data through an acoustic sensor.

[0931] "Application example 2 when combining with an emotional engine"

[0932] (Claim 1)

[0933] A means of acquiring audio data,

[0934] A conversion means for converting the aforementioned audio data into text data,

[0935] An analytical means for analyzing the aforementioned text data and categorizing it, including emotions,

[0936] A method for organizing the results statistically by adding emotional information based on the aforementioned analysis results,

[0937] A generation means that generates improvement suggestions that take into account the sentiment analysis results based on the aforementioned organization results,

[0938] An output means for outputting the aforementioned improvement suggestion by voice or display,

[0939] A system that includes this.

[0940] (Claim 2)

[0941] The system according to claim 1, wherein the output means provides feedback using speech synthesis technology.

[0942] (Claim 3)

[0943] The system according to claim 1, wherein the acquisition means acquires audio data using an acoustic sensor.

[Explanation of symbols]

[0944]
10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A means of acquiring audio data,
A conversion means for converting the aforementioned audio data into text data,
An analytical means for analyzing and categorizing the aforementioned text data,
A means for statistically organizing the aforementioned analysis results,
A generation means for generating improvement proposals based on the aforementioned analysis results,
An output means for outputting the aforementioned improvement proposal,
A system that includes this.

2. The system according to claim 1, characterized in that the output means provides feedback using speech synthesis technology.

3. The system according to claim 1, characterized in that the acquisition means acquires audio data through a microphone.