system
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] The problem to be solved by the present invention is to improve the efficiency of the voice-recognition-based ordering process in television shopping and the security of identity verification and payment processing. In particular, there is a need for a system that enables easy and accurate ordering and payment even for elderly people and consumers who are not familiar with technology. A further issue is ensuring security that protects users from the mistakes, communication congestion, and fraudulent transactions that occur in conventional telephone operations.
Means for Solving the Problems
[0005] This invention provides a system comprising a terminal device for receiving voice input, a server device for analyzing the received voice and extracting order information, a voice synthesis means for generating confirmation voice based on the order information, and means for confirming the order based on the user's response. In addition, it achieves efficiency and security throughout the entire ordering process by verifying the user's legitimacy using a voice pattern analysis verification means and encrypting and securely processing payment information. This enables an innovative television shopping experience that can be used with peace of mind by all users, including the elderly and consumers unfamiliar with technology.
[0006] "Voice input" is a technology that captures the voice spoken by the user in digital format.
[0007] A "terminal device" is a device that receives voice input from a user and transmits the data to a server.
[0008] A "speech recognition engine" is a system, either software or hardware, that analyzes speech input and converts it into text data.
[0009] "Order information" refers to data about the products and quantities ordered by the user.
[0010] A "server device" is a device that analyzes voice data to extract order information and communicates with other systems.
[0011] "Speech synthesis means" refers to a technology that converts text data into a speech format and outputs it to the user.
[0012] An "order confirmation mechanism" is a function that executes a process to finally confirm an order based on the user's response.
[0013] "Voice pattern analysis" is a technology that analyzes voice data to recognize the characteristics of a particular speaker.
[0014] "Identity verification means" is a process of verifying the identity of a user by using voice pattern analysis.
[0015] "Payment information" is information related to a user's payment, which is data including credit card numbers and bank account information.
[0016] "Encryption" is a technology that converts information based on specific rules to protect data.
[0017] "Process securely" refers to managing by taking appropriate protection measures so that data is not leaked or tampered with by a third party.
Brief Description of Drawings
[0018]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which multiple emotions are mapped.
[Figure 10] An emotion map to which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Embodiment 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Embodiment 2 when combined with an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when combined with an emotion engine.
Mode for Carrying Out the Invention
[0019] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0020] First, the terms used in the following description will be explained.
[0021] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0022] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.
[0023] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs and various parameters. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0024] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0025] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."
[0026] [First Embodiment]
[0027] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0028] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0029] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0030] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0031] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0032] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0033] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0034] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0035] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0036] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0037] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0038] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0039] The system according to the present invention is designed to streamline and secure the voice-based ordering process through television shopping. This system integrates voice recognition, voice confirmation generation, identity verification, and secure payment processing.
[0040] The system consists of a terminal, a server, and a speech synthesis device. First, the terminal acquires the user's voice. When the user says that they want to order a specific product, the voice data is sent to the server.
[0041] The server analyzes the received audio data using its speech recognition engine and extracts order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via the speech synthesis device.
[0042] Once the user confirms that there are no problems with their order, the server uses voice pattern analysis technology to verify the user's identity. This verification is performed by comparing the user's voice with previously registered voice data.
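The comparison against previously registered voice data described in [0042] can be illustrated with a minimal sketch. The disclosure does not specify the comparison method; the sketch assumes that each utterance has already been reduced to a fixed-length feature vector and that the two vectors are compared by cosine similarity. The function names and the threshold value 0.8 are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no speaker information
    return dot / (norm_a * norm_b)


def verify_speaker(registered: Sequence[float],
                   sample: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Accept the speaker when the sample is close enough to the registered vector.

    The threshold is an assumed value; a real system would tune it on data.
    """
    return cosine_similarity(registered, sample) >= threshold
```

For instance, `verify_speaker([1.0, 0.0, 1.0], [0.9, 0.1, 1.1])` accepts the speaker, while two orthogonal vectors are rejected. How the feature vectors are obtained from audio is outside the scope of this sketch.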
[0043] Once identity verification is complete, the server proceeds with payment processing. Payment information is securely processed based on the payment method specified by the user. During this process, all data is encrypted and protected from unauthorized access.
[0044] As a concrete example, consider a case where a user orders a specific vacuum cleaner from a TV shopping channel. The user says, "I want to order a vacuum cleaner," using their voice. The terminal receives this voice and sends it to the server. The server analyzes the voice and extracts the keyword "vacuum cleaner." Next, the server creates a confirmation voice message, "Do you want to order the XX vacuum cleaner?", and presents it to the user through a speech synthesis device.
[0045] If the user responds with "yes," the server will reconfirm the order details and prompt the user for identity verification and payment method selection. For example, the user may be asked to enter their credit card information by voice. The server then completes the payment using encrypted information.
[0046] This system allows users to complete orders safely and efficiently without complicated procedures. The voice-activated interface provides a user-friendly shopping experience, especially for the elderly and those unfamiliar with technology.
[0047] The following describes the processing flow.
[0048] Step 1:
[0049] The terminal receives voice input from the user. The user orders products by voice while watching a TV shopping program, and the terminal records the voice via its microphone.
[0050] Step 2:
[0051] The terminal digitizes the recorded audio data and sends it to the server via the internet. Since the data is transferred in real time, processing can begin immediately.
[0052] Step 3:
[0053] The server inputs the received audio data into a speech recognition engine, which converts it from speech to text. In this process, keywords such as product names and quantities are extracted and organized as order information.
[0054] Step 4:
[0055] The server generates text to produce a confirmation voice message based on the order information. For example, it might create a confirmation message such as, "Would you like to order two of item XX?"
[0056] Step 5:
[0057] The server sends the generated text to the speech synthesis engine, which then generates confirmation audio data.
[0058] Step 6:
[0059] The terminal receives the confirmation audio data from the speech synthesis engine and plays it for the user. At this point, the user is asked to confirm the order details.
[0060] Step 7:
[0061] The user listens to the confirmation voice message and responds with "yes" if there are no problems with the order, or "no" if cancellation or modification is needed.
[0062] Step 8:
[0063] The terminal sends the user's voice response back to the server, which analyzes the audio to understand the user's intent. If the user responds with "yes," the server proceeds to prepare to confirm the order.
[0064] Step 9:
[0065] The server uses voice pattern analysis to verify the user's identity by comparing their voice to pre-registered voice data.
[0066] Step 10:
[0067] The user specifies their payment method by voice and enters credit card information and other details as needed. The terminal encrypts this information and sends it to the server.
[0068] Step 11:
[0069] The server securely processes payment information and verifies that the purchase process has been completed successfully. After completion, it notifies the user of the order completion via the terminal.
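The ordering of Steps 1 to 11 can be summarized as a small state machine on the server side. The following is a sketch under stated assumptions, not the disclosed implementation: the state names, the event names, and the choice to cancel the order when identity verification or payment fails are all hypothetical, since the text above only describes the successful path and the "no" response.

```python
from enum import Enum, auto
from typing import Iterable


class OrderState(Enum):
    AWAITING_ORDER = auto()         # Steps 1-3: voice received, order info extracted
    AWAITING_CONFIRMATION = auto()  # Steps 4-7: confirmation voice played to the user
    VERIFYING_IDENTITY = auto()     # Steps 8-9: "yes" received, voice pattern analysis
    AWAITING_PAYMENT = auto()       # Step 10: payment method specified by voice
    COMPLETED = auto()              # Step 11: payment processed, user notified
    CANCELLED = auto()              # assumed terminal state for "no" or failures


# Allowed transitions keyed by (current state, event). Event names are hypothetical.
TRANSITIONS = {
    (OrderState.AWAITING_ORDER, "order_extracted"): OrderState.AWAITING_CONFIRMATION,
    (OrderState.AWAITING_CONFIRMATION, "yes"): OrderState.VERIFYING_IDENTITY,
    (OrderState.AWAITING_CONFIRMATION, "no"): OrderState.CANCELLED,
    (OrderState.VERIFYING_IDENTITY, "identity_verified"): OrderState.AWAITING_PAYMENT,
    (OrderState.VERIFYING_IDENTITY, "identity_rejected"): OrderState.CANCELLED,
    (OrderState.AWAITING_PAYMENT, "payment_succeeded"): OrderState.COMPLETED,
    (OrderState.AWAITING_PAYMENT, "payment_failed"): OrderState.CANCELLED,
}


def next_state(state: OrderState, event: str) -> OrderState:
    """Return the state that follows `state` on `event`, rejecting invalid moves."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.name}") from None


def run(events: Iterable[str]) -> OrderState:
    """Replay a sequence of events starting from AWAITING_ORDER."""
    state = OrderState.AWAITING_ORDER
    for event in events:
        state = next_state(state, event)
    return state
```

Keeping the transitions in a table makes it explicit that payment cannot be reached without both a "yes" response and a successful identity check, which is the ordering the steps above describe.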
[0070] (Example 1)
[0071] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0072] Traditional online ordering systems often require users to perform complex procedures and input tasks, making them difficult to use, especially for the elderly and those unfamiliar with technology. Furthermore, identity verification and payment processing may lack sufficient security and efficiency. There is a need to address these challenges and provide a system that allows anyone to complete orders easily and securely.
[0073] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0074] In this invention, the server includes a receiving means for acquiring voice input, a processing means for analyzing the voice input and extracting information, and an output means for generating and presenting confirmation voice based on the information. This makes it possible to place orders simply and securely using voice.
[0075] A "receiving means for acquiring voice input" is a device that collects voice data from a user and converts it into a format that can be processed as an electronic signal.
[0076] The "processing means for analyzing the voice input and extracting information" refers to a device that uses voice recognition technology to analyze the voice signal and identify order information and other necessary information.
[0077] "Output means for generating and presenting confirmation voice based on the information" refers to a device that generates confirmation audio based on the extracted information and presents it audibly to the user.
[0078] A "confirmation means for confirming an order based on user responses" is a device that finalizes and records the order details based on the responses provided by the user.
[0079] An "authentication means that uses voice analysis to identify and authenticate users" refers to a device that analyzes the characteristics of a voice to individually identify a user and verify their identity.
[0080] A "transaction means that encrypts and processes user transaction information to protect it" is a device that securely processes user payment information using encryption technology and protects it from unauthorized access.
[0081] This invention provides an online ordering system using voice commands, enabling users to easily and securely order products using only their voice. An embodiment thereof is described below.
[0082] In this system, the terminal is equipped with a microphone to capture user speech. When the user verbally states that they wish to order a product, the audio data is converted into a digital format by the terminal and transmitted to the server via the network. The specific software used includes a "speech recognition engine" for analyzing the audio data.
[0083] The server analyzes the received audio data using a "speech recognition engine." This analysis converts the audio signal into text data and extracts order information such as product name and quantity. This information is necessary to confirm the order details with the user.
[0084] The server then uses "speech synthesis software" to generate a confirmation voice message for the user. The confirmation voice message lets the user check the extracted order details. For example, it might say, "Would you like to order two of the XX vacuum cleaners?"
[0085] The user responds to the confirmation voice message. The response is sent back to the server via the terminal. The server verifies the user's identity using voice pattern analysis technology. After authentication is complete, the payment information is securely processed using encryption technology according to the user's instructions.
[0086] As a concrete example, when a user voice-inputs "I want to order a vacuum cleaner," the terminal sends the voice data to the server, which analyzes the data and generates a confirmation voice message asking the user, "Are you sure you want to order a vacuum cleaner?" The order is then confirmed when the user answers "Yes."
[0087] An example of a prompt used as input for the AI model in this system is, "Please tell me about the voice-based product ordering system." This example demonstrates how to achieve an intuitive and easy-to-use user experience via a voice interface.
[0088] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0089] Step 1:
[0090] The terminal acquires voice from the user. The user states their purchase request directly, for example, saying, "I want to order a vacuum cleaner." The terminal receives this voice via its microphone and stores it as audio data. The input is a raw audio signal, and the output is digital audio data.
[0091] Step 2:
[0092] The terminal transmits the acquired digital audio data to the server. Using a stable internet connection, the audio files are uploaded to the server. The input is the audio data within the terminal, and the output is the audio data transferred to the server.
[0093] Step 3:
[0094] The server analyzes the received audio data using its speech recognition engine. Specifically, it converts the audio signal into text and extracts order information (product name and quantity). The input is audio data stored on the server, and the output is order information in text format.
[0095] Step 4:
[0096] The server generates a confirmation voice based on the order information. Using speech synthesis software, it creates a voice message for the user to reconfirm the order details. For example, a confirmation message such as "Do you want to order the XX vacuum cleaner?" is generated. The input is order information in text format, and the output is a confirmation message in digital voice.
[0097] Step 5:
[0098] The terminal presents the confirmation audio received from the server to the user. The terminal plays the message through headphones or speakers so that the user can check its contents. The input is the confirmation audio data from the server, and the output is the user's auditory confirmation.
[0099] Step 6:
[0100] The user listens to the confirmation voice message and then responds. For example, the user might respond with "yes," and the response is sent to the server via the terminal. The input is the user's response to the confirmation voice message, and the output is the response voice data.
[0101] Step 7:
[0102] The server receives the user's voice response and performs identity verification using voice pattern analysis. It compares this to previously registered voice data to check for a match. The input is the user's voice response data, and the output is the result of the identity verification.
[0103] Step 8:
[0104] After identity verification is complete, the server accepts the payment method specified by the user and processes the payment. Transaction information is encrypted and processed securely. The input is authenticated payment information, and the output is a confirmation of transaction completion.
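Steps 3 and 4 of this flow (extracting the product name and quantity from the recognized text, then composing the confirmation message) can be sketched as follows. This is a simplified illustration that assumes the speech recognition engine has already produced an English transcript. The one-entry catalog, the number-word table, and the wording of the confirmation text are hypothetical and are not taken from the disclosure.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog and number words; a real system would load these from data.
CATALOG = ("vacuum cleaner",)
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


@dataclass
class OrderInfo:
    product: str
    quantity: int


def extract_order(transcript: str) -> Optional[OrderInfo]:
    """Find a catalog product and an optional quantity in the transcript.

    The quantity defaults to 1 when no digit or number word is spoken.
    Limitation of this sketch: any number in the utterance (for example a
    model number) would be read as the quantity.
    """
    lowered = transcript.lower()
    product = next((p for p in CATALOG if p in lowered), None)
    if product is None:
        return None
    quantity = 1
    for token in re.findall(r"[a-z]+|\d+", lowered):
        if token.isdigit():
            quantity = int(token)
            break
        if token in NUMBER_WORDS:
            quantity = NUMBER_WORDS[token]
            break
    return OrderInfo(product=product, quantity=quantity)


def confirmation_text(order: OrderInfo) -> str:
    """Compose the text that would be passed to the speech synthesis software."""
    return f"Do you want to order {order.quantity} x {order.product}?"
```

With this sketch, the utterance "I want to order a vacuum cleaner" yields an order for one vacuum cleaner, and an utterance without a catalog product yields `None`, in which case the server could ask the user to repeat the request.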
[0105] (Application Example 1)
[0106] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0107] Conventional voice-based ordering systems have struggled to accurately extract order information from user voices and to confirm orders safely and quickly. Ensuring privacy while processing payments was also a critical challenge. Furthermore, there is a need for a voice interface that is simple and efficient to use, even for elderly users and those unfamiliar with technology.
[0108] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0109] In this invention, the server includes a computing device that analyzes voice input using a recognition mechanism and extracts information related to the order; a voice generation means that creates a confirmation voice based on the order information and presents it to the user; an order confirmation means that confirms the order based on the user's response; and a processing means that receives the specification of the payment method by voice and securely processes the payment information. As a result, users can effectively order products using voice commands and make payments securely while maintaining their privacy.
[0110] "Voice input" is the operation of receiving the voice spoken by the user and processing it as digital information.
[0111] An "information processing device" is a device that acquires voice input and transmits that information to other devices.
[0112] A "recognition mechanism" is an algorithm or software used to analyze audio data and extract meaningful information.
[0113] A "computing device" is a device that analyzes received data and derives specific information such as order details and quantities.
[0114] "Voice generation means" refers to a technology or device for converting text information into speech and presenting it to the user.
[0115] An "order confirmation means" is a process for managing and confirming the completion of an order based on the user's instructions.
[0116] "Payment information" refers to all information related to payments that a buyer uses to purchase goods or services.
[0117] "Processing means" refers to a technology or system for safely and efficiently manipulating, calculating, or storing specific information.
[0118] The system for carrying out this invention consists of the following main components: an information processing device for receiving voice input, a computing device including a recognition mechanism, a voice generation means, an order confirmation means, and a means for processing payment information.
[0119] First, the user places an order for a product with the information processing device via voice input. This voice is transmitted to the computing device and converted into text by speech recognition software such as the Google® Cloud Speech-to-Text API. The computing device analyzes the voice data and extracts information about the order. This analysis includes a process to derive specific order details, such as the color and quantity of sneakers.
[0120] Next, the voice generation means, for example a speech synthesis service such as Amazon Polly, presents the extracted order information to the user as confirmation. This allows the user to verbally confirm that the order details are correct. After the user confirms the order details, the order confirmation means finalizes the order. At this point, the user specifies the payment method via voice.
[0121] Subsequently, payment information is securely processed using encryption technologies such as OpenSSL, according to the user's specifications. This ensures privacy while enabling secure transactions.
[0122] As a concrete example, suppose a user says, "I want to order light blue sneakers." The computer processes this audio and generates a confirmation voice message asking, "Do you want to order light blue sneakers?" If the user responds with "Yes," the order confirmation system confirms this information and proceeds to the next step.
[0123] An example of a prompt message would be, "Please tell me how to use speech recognition technology to analyze a user's voice saying, 'I want to order light blue sneakers,' and process it as an order."
[0124] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0125] Step 1:
[0126] The terminal receives the user's voice input. It acquires the voice data of the user's utterance, "I want to order light blue sneakers," and sends this data to the server for speech recognition using the Google Cloud Speech-to-Text API. The input is the user's voice, and the output is digitized voice data.
[0127] Step 2:
[0128] The server analyzes the received audio data. The computing device processes the audio data and uses a generative AI model to extract order information (such as product name and quantity). The input is digitized audio data, and the output is the extracted order information (e.g., sneaker color and quantity).
[0129] Step 3:
[0130] The server generates a confirmation voice based on the order information. Using a voice generation method such as Amazon Polly, it creates a confirmation message, "Do you want to order light blue sneakers?", and sends it to the terminal. The input is the order information, and the output is a synthesized confirmation voice.
[0131] Step 4:
[0132] The terminal presents the user with a synthesized voice confirmation message. The user listens to it and confirms the order details. The input is the synthesized voice message, and the output is the user's confirmation response (e.g., "Yes").
[0133] Step 5:
[0134] The server receives the user's acknowledgment. The order confirmation means uses speech recognition again to analyze the response and then completes the order confirmation process. The input is the user's acknowledgment, and the output is the confirmed order.
[0135] Step 6:
[0136] The user specifies the payment method by voice. The terminal sends this voice to the server, and the computing device performs voice recognition to extract the payment information. The input is the user's voice instruction for the payment method, and the output is the extracted payment information.
[0137] Step 7:
[0138] The server securely processes payment information. The processing method uses OpenSSL to encrypt payment information, ensuring secure transactions. The input is payment information, and the output is encrypted payment information.
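Step 6 (extracting payment information from the recognized voice) can be illustrated with a short sketch. It assumes the transcript is already available as text with the card digits transcribed as numerals. The payment method labels, the Luhn checksum used as a plausibility check, and the masking helper are assumptions added for illustration; the disclosure does not describe them. Encryption (Step 7) is deliberately left out of this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical payment method labels recognized in the transcript.
PAYMENT_METHODS = ("credit card", "bank transfer")


@dataclass
class PaymentInfo:
    method: str
    card_number: Optional[str]  # None when no plausible card number was spoken


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum the digits."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def extract_payment(transcript: str) -> Optional[PaymentInfo]:
    """Extract the payment method and, if present, a plausible card number."""
    lowered = transcript.lower()
    method = next((m for m in PAYMENT_METHODS if m in lowered), None)
    if method is None:
        return None
    digits = "".join(re.findall(r"\d", transcript))
    plausible = 12 <= len(digits) <= 19 and luhn_valid(digits)
    return PaymentInfo(method=method, card_number=digits if plausible else None)


def masked(card_number: str) -> str:
    """Hide all but the last four digits, e.g. for a spoken read-back."""
    return "*" * (len(card_number) - 4) + card_number[-4:]
```

A checksum of this kind only catches misrecognized digits; it does not authenticate the card holder, which in the described system is the role of the voice pattern analysis.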
[0139] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0140] The system according to the present invention streamlines the voice-based ordering process in television shopping and provides responses that take into account the user's emotional state, thereby realizing a more comfortable and safe purchasing experience. This system consists of a terminal, a server, a voice synthesis device, and an emotion engine.
[0141] First, the terminal acquires the user's speech as audio data. When a user orders a specific product on TV shopping, they communicate their intention verbally. The acquired audio data is then sent to the server.
[0142] The server uses a speech recognition engine to convert this audio data into text and extract order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0143] In this process, the server is equipped with an emotion engine that analyzes the emotions contained in the user's voice. The emotion engine can estimate the user's emotional state from factors such as tone, speed, and rhythm of the voice. For example, if excitement or irritation is detected, it will generate a corresponding confirmation voice in a softer tone.
[0144] A confirmation voice message is presented to the user, and users who agree to its contents respond with "Yes." The server analyzes this response and performs voice pattern analysis to verify the user's identity. Once the user's identity has been verified through voice pattern analysis, the server proceeds to the payment process.
[0145] The user further specifies the payment method by voice and also verbally provides necessary information (e.g., credit card number). The terminal encrypts this information and sends it to the server. The server completes the transaction through a secure payment gateway. Even after payment is complete, the emotion engine may suggest feedback to the user, such as surveys or incentives for future use.
[0146] For example, when a user orders a new smartphone, the terminal receives a voice message saying, "I would like to order the latest smartphone, model XYZ." This voice message is sent to the server, where speech recognition converts it into text and extracts the keyword "model XYZ." The server generates a confirmation message asking, "Do you want to order model XYZ?" and, if the emotion engine detects frustration, adds a softer message saying, "We will process your order quickly, so please rest assured." After the user confirms the order, the transaction is completed securely.
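The adjustment in this example can be sketched as a small function that appends the reassuring sentence only when frustration has been detected. The function name `build_confirmation` and the emotion label `"frustrated"` are hypothetical; the message texts are the ones quoted in the example.

```python
def build_confirmation(product: str, emotion: str = "neutral") -> str:
    """Compose the confirmation text, adding reassurance on frustration."""
    message = f"Do you want to order {product}?"
    if emotion == "frustrated":
        message += " We will process your order quickly, so please rest assured."
    return message
```

The resulting text would then be passed to the speech synthesizer to produce the confirmation voice.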
[0147] This system will not only facilitate smooth information exchange, but will also enable the provision of more satisfying services through flexible responses that cater to user emotions.
[0148] The following describes the processing flow.
[0149] Step 1:
[0150] A user watches a TV shopping program and says aloud, "I want to buy this product." The terminal receives this voice input, performs noise filtering, and converts it into digital audio data.
[0151] Step 2:
[0152] The terminal sends the digitally converted audio data to the server. This transmission takes place in real time via the internet.
[0153] Step 3:
[0154] The server processes the received audio data through a speech recognition engine and converts it into text data. Here, order information such as product name and desired quantity is extracted.
[0155] Step 4:
[0156] The server generates a confirmation message based on the order information. For example, it might prepare a message as text such as, "Would you like to order two of item XX?"
[0157] Step 5:
[0158] The server uses an emotion engine to perform sentiment analysis on the voice data. The emotion engine determines the user's emotions from the tone of voice and the speed of speech, detecting patterns such as joy and anger.
[0159] Step 6:
[0160] The server adjusts the confirmation message based on the results of the sentiment analysis. For example, if anger is detected, the message may be modified to include the phrase, "We will respond as soon as possible."
[0161] Step 7:
[0162] The revised confirmation message is sent to the speech synthesis engine to generate natural-sounding speech.
[0163] Step 8:
[0164] The terminal plays the generated confirmation audio to the user. The user listens to the confirmation audio and responds with "yes" if it is correct, or "no" if correction is needed.
[0165] Step 9:
[0166] The terminal sends the user's response back to the server as audio data, which the server then analyzes.
[0167] Step 10:
[0168] The server uses voice pattern analysis to initiate a process to verify the user's identity. This confirms that the voice belongs to an authenticated user.
[0169] Step 11:
[0170] Once identity verification is complete, the user selects a payment method by voice and provides the required payment information. The terminal encrypts this information and sends it to the server.
[0171] Step 12:
[0172] The server processes the encrypted payment information and completes the transaction through a secure payment gateway. Once completion is confirmed, the terminal presents an order completion message to the user.
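The ordering of Steps 1 to 12 above can be viewed as a small dialogue state machine. The sketch below is one assumed way to model it: the state names, the event names, and the choice to return to the start when the user answers "no" are hypothetical and are not specified in the disclosure.

```python
from enum import Enum, auto


class State(Enum):
    AWAITING_ORDER = auto()
    AWAITING_CONFIRMATION = auto()
    VERIFYING_IDENTITY = auto()
    AWAITING_PAYMENT = auto()
    COMPLETED = auto()


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.AWAITING_ORDER, "order_extracted"): State.AWAITING_CONFIRMATION,
    (State.AWAITING_CONFIRMATION, "yes"): State.VERIFYING_IDENTITY,
    (State.AWAITING_CONFIRMATION, "no"): State.AWAITING_ORDER,
    (State.VERIFYING_IDENTITY, "identity_verified"): State.AWAITING_PAYMENT,
    (State.AWAITING_PAYMENT, "payment_completed"): State.COMPLETED,
}


def advance(state: State, event: str) -> State:
    """Return the next state, or stay put when the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def run(events: list[str]) -> State:
    """Replay a sequence of events from the initial state."""
    state = State.AWAITING_ORDER
    for event in events:
        state = advance(state, event)
    return state
```

Keeping the transitions in a table means that an out-of-order event, such as payment information arriving before identity verification, leaves the state unchanged.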
[0173] (Example 2)
[0174] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0175] Conventional voice-input ordering systems failed to provide flexible responses that took into account the user's emotional state, thus failing to improve the user experience. Furthermore, there were challenges in verifying user identity and ensuring the security of payment information. These problems need to be solved to provide a safer and more comfortable ordering experience.
[0176] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0177] In this invention, the server includes an information processing device for receiving voice input, a computing device for analyzing the voice input based on a conversion device and extracting information related to the order, a voice generation means for generating a confirmation voice based on the information related to the order and presenting it to the user, an emotion analysis means for analyzing the emotion of the voice and adjusting the response, and a means for confirming the order based on the user's response. This enables flexible responses that take into account the user's emotions, as well as secure identity verification and payment processing.
[0178] "Voice input" is the process by which electronic devices acquire a user's speech as digital data.
[0179] An "information processing device" is a device that has the function of receiving voice input and is used for data collection and communication.
[0180] A "conversion device" is a device that converts audio data into text data.
[0181] A "computing device" is a device that analyzes the text data extracted from speech and organizes and extracts information related to the order.
[0182] A "voice generation means" is a means of generating natural-sounding speech from text data and presenting it to the user.
[0183] An "emotion analysis means for analyzing the emotion of the voice and adjusting the response" is a means of determining the user's emotional state from the tone, speed, and other characteristics of their voice and adjusting the content and voice of the response accordingly.
[0184] A "means for confirming the order" is a means of confirming the order based on the user's response and completing the final order process.
[0185] An "identity verification means" is a means of confirming the user's identity through voice pattern analysis and guaranteeing the legitimacy of the order.
[0186] A "processing means for encrypting and securely processing payment information" is a means equipped with a mechanism for encrypting data in order to securely manage and process the user's payment information.
[0187] To carry out this invention, the following configuration is necessary. The system mainly includes a terminal device, a server device, a speech synthesis device, and emotion analysis means. First, the terminal device functions as an information processing device for acquiring voice input from the user. An input device such as a microphone is integrated into the terminal device, which converts the voice into a digital signal and transmits it to the server device via the network.
[0188] The server device acts as a conversion device, converting audio data into text data using a speech recognition engine. DeepSpeech or similar high-precision speech recognition technologies are used for this conversion. From the extracted text data, the server device acts as a computing device, analyzing and organizing order information such as product names and quantities.
[0189] Subsequently, the server uses a speech synthesis engine (TTS engine) to generate a confirmation voice based on the obtained order information. This process utilizes an AI model for natural language processing and generation. In particular, it analyzes the tone and speed of the voice through emotion analysis to capture the user's emotional state. Based on this analysis, the confirmation voice is adjusted.
[0190] When the user confirms the order by voice, the server analyzes the response and the means for confirming the order finalizes the order. At this point, the identity verification means performs voice pattern analysis to verify the user's identity and guarantee the legitimacy of the order. Biometric authentication systems may also be integrated into this process.
[0191] Furthermore, payment information provided by the user is handled by the processing means for encrypting and securely processing payment information. The terminal device encrypts the information and sends it to the server, which completes the transaction through a secure gateway. Strong encryption technology is used to maintain data security.
[0192] Finally, the server also provides post-purchase feedback and incentives through sentiment analysis. This process ensures that users have a comfortable, emotionally responsive, and flexible purchasing experience.
[0193] For example, when a user tries to purchase a smartphone, the terminal receives a voice message saying, "I would like to order the latest smartphone, model XYZ." This voice message is sent to the server, and speech recognition produces the text data "model XYZ." A confirmation message, "Do you want to order model XYZ?", is generated and then adjusted as needed based on the analysis of the user's emotions.
[0194] An example of a prompt for a generative AI model is: "A user is passionately trying to order the latest XYZ smartphone from a TV shopping channel. Convert the voice data into text and generate an acknowledgment response that reflects the user's emotions."
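A prompt of this form could be assembled from a template, as in the following sketch. The names `PROMPT_TEMPLATE` and `build_prompt` and the two placeholders are hypothetical; how the prompt is actually passed to the generative AI model is not shown here.

```python
PROMPT_TEMPLATE = (
    "A user is {manner} trying to order {product} from a TV shopping channel. "
    "Convert the voice data into text and generate an acknowledgment response "
    "that reflects the user's emotions."
)


def build_prompt(product: str, manner: str) -> str:
    """Fill the template; both arguments are supplied by the caller."""
    return PROMPT_TEMPLATE.format(manner=manner, product=product)
```

Calling `build_prompt("the latest XYZ smartphone", "passionately")` reproduces the example prompt quoted above.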
[0195] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0196] Step 1:
[0197] The terminal acquires the user's voice input. The user names the product they want to purchase from the TV shopping program, and the terminal's built-in microphone captures this voice and converts it into digital data. The output is audio data that has undergone digital signal processing.
[0198] Step 2:
[0199] The terminal transmits the acquired audio data to the server. The audio data is transferred to the server over the network, and the communication is protected by an encryption protocol. The input is digital audio data, and the output is the audio data received by the server in a form that can be recognized.
[0200] Step 3:
[0201] The server converts the received audio data into text using a speech recognition engine. The speech recognition engine uses a deep learning model to analyze the audio and output the order details as text data. In this process, important order information such as product name and quantity is extracted.
[0202] Step 4:
[0203] The server generates a confirmation voice based on the order information obtained from the text. It uses a speech synthesis engine to convert the text back into speech and create the confirmation message. The input is text data about the order, and the output is the voice data required for confirmation.
[0204] Step 5:
[0205] The server uses an emotion analysis engine to analyze the tone and speed of the user's voice and evaluate the user's emotional state. For example, if the user's voice contains excitement or frustration, the confirmation voice is adjusted according to that emotion. The input is the initial voice data, and the output is the confirmation voice adjusted according to the emotion.
[0206] Step 6:
[0207] The user receives the confirmation voice message from the server and responds. The user confirms the order by saying, for example, "yes," which the terminal records. The recorded voice data is then sent back to the server.
[0208] Step 7:
[0209] The server analyzes the user's voice response and verifies their identity through voice pattern analysis. This process involves analyzing the frequency and rhythm of the voice and comparing it to past data. The input is the user's voice response, and the output is data confirming the legitimacy of the order.
[0210] Step 8:
[0211] Once the server verifies the user's identity, it begins processing the payment. The payment information provided by the user is encrypted and transmitted to the server via a secure route, where the transaction is completed using the payment gateway. The input is encrypted payment information, and the output is a confirmation of transaction completion.
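The comparison with past data in Step 7 can be illustrated, under assumptions, as a similarity check between two feature vectors. The disclosure does not state which features or which measure are used; cosine similarity, the threshold of 0.9, and the function names below are hypothetical choices for the sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify_speaker(
    sample: list[float], enrolled: list[float], threshold: float = 0.9
) -> bool:
    """Accept the speaker when the new sample is close to the stored one."""
    if len(sample) != len(enrolled):
        raise ValueError("feature vectors must have the same length")
    return cosine_similarity(sample, enrolled) >= threshold
```

In practice the vectors would be derived from the frequency and rhythm characteristics mentioned in Step 7, and the threshold would need to be tuned to balance false acceptance against false rejection.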
[0212] (Application Example 2)
[0213] Next, Application Example 2 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0214] Conventional voice-based ordering systems often provide uniform responses without considering the user's emotional state, potentially leading to decreased user satisfaction. Furthermore, insufficient consideration is given to the security of voice-based identity verification and payment. Therefore, a system is needed that can provide flexible responses tailored to the user's emotional state while ensuring secure identity verification and payment.
[0215] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0216] In this invention, the server includes a calculation means for analyzing and extracting order information based on speech recognition means, a response generation means for generating a response corresponding to the emotional state, and an emotion analysis means for estimating the emotional state. This makes it possible to provide appropriate responses according to the user's emotional state, thereby improving security.
[0217] "Voice input" is a method by which users communicate information through their voice.
[0218] An "information processing device" is a device for receiving and processing audio data.
[0219] "Speech recognition means" refers to technology that analyzes speech data and converts it into text data.
[0220] A "calculation means" is a means of extracting the necessary information from the analyzed data.
[0221] "Speech synthesis means" refers to technology for converting text data into speech data and outputting it.
[0222] "Emotion analysis means" refers to technology that estimates a user's emotional state from voice data.
[0223] A "response generation means" is a technology for generating an appropriate response based on an emotional state.
[0224] An "order confirmation means" is a technology for confirming an order based on the user's response.
[0225] "Verification means" refers to technology that analyzes a user's voice patterns to determine their legitimacy.
[0226] A "payment transaction means" is a technology for securely processing a user's payment information.
[0227] The system for implementing this invention uses a smartphone or smart glasses as an information processing device to receive voice input. When a user wants to order a product, they input the necessary information by voice into these devices. The information processing device receives the voice data and transmits it to a computing device. The server uses voice recognition means to convert the voice data into text and extract the order information.
[0228] At this time, the server uses the emotion analysis means to analyze the tone, speed, and rhythm of the user's voice and estimate their emotional state. For example, if the user appears to be in a hurry, the server can generate a flexible response that corresponds to that emotional state. The response generation means generates a confirmation voice, which is then presented to the user via the speech synthesis means.
[0229] When the user responds with "yes" by voice, the server confirms the order through the order confirmation means. The verification means then analyzes the user's voice pattern to verify their identity and ensure security. Finally, the payment transaction means encrypts the user's payment information and processes it securely.
[0230] For example, if a user speaks into their smartphone saying, "I'd like to order new sneakers, size 9," the system recognizes this as order information and, based on sentiment analysis, generates a reassuring response such as, "If you're in a hurry, we'll process it quickly."
[0231] An example of a prompt for a generative AI model is: "When the user provides voice input, convert it to text using speech recognition, analyze the sentiment using the sentiment analysis engine, and then create an appropriate response based on the results."
[0232] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0233] Step 1:
[0234] The terminal receives voice input from the user. When the user speaks about the product they want to purchase, that voice data is recorded in the terminal. This voice data is then sent for subsequent processing.
[0235] Step 2:
[0236] The server converts the received audio data into text data using speech recognition technology. The speech recognition engine analyzes the waveform of the voice and extracts order information such as product names and quantities as text. As a result, the order information is output.
[0237] Step 3:
[0238] The server uses sentiment analysis tools to estimate the user's emotional state. In addition to the converted text, it analyzes the tone, tempo, and intonation of the audio data to identify emotions, such as whether the user is feeling urgency or satisfaction. This analysis result is output as the emotional state.
[0239] Step 4:
[0240] The server generates a confirmation voice message using a response generation mechanism based on the order information and emotional state. The confirmation voice message incorporates flexible responses depending on the emotional state (e.g., "We will respond promptly" if the customer is in a hurry). This confirmation voice message is output by a speech synthesis mechanism.
[0241] Step 5:
[0242] The user responds to the confirmation audio presented by the server with "yes" or "no." The terminal receives this response again as audio data. This audio data is sent to the server for final confirmation.
[0243] Step 6:
[0244] The server analyzes the received acknowledgment based on the order confirmation mechanism and confirms the order. The order is confirmed when the user responds with "yes," and the process proceeds to the next step. As a result of this process, confirmed order information is output.
[0245] Step 7:
[0246] The server uses verification methods to analyze the user's voice pattern and verify their identity. To ensure security, the voice characteristics are compared to a database to determine legitimacy. This analysis confirms that the user is legitimate.
[0247] Step 8:
[0248] The server uses a payment transaction method to securely process the payment information provided by the user by encrypting it. The transaction is carried out through the payment gateway, and the completed payment information is output.
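The handling of the "yes" or "no" reply in Steps 5 and 6 can be sketched as a mapping from the recognized text to a decision. The word lists, the function name `interpret_response`, and the extra "unclear" outcome are hypothetical additions for illustration.

```python
import re

# Hypothetical vocabularies; a deployed system would use its own.
AFFIRMATIVE = {"yes", "yeah", "correct"}
NEGATIVE = {"no", "nope", "cancel"}


def interpret_response(transcript: str) -> str:
    """Map a recognized reply to 'confirmed', 'rejected', or 'unclear'."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    has_yes = bool(words & AFFIRMATIVE)
    has_no = bool(words & NEGATIVE)
    if has_yes and not has_no:
        return "confirmed"
    if has_no and not has_yes:
        return "rejected"
    return "unclear"
```

Returning "unclear" for mixed or unrecognized replies lets the server ask again instead of confirming an order by mistake.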
[0249] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0250] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0251] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0252] [Second Embodiment]
[0253] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0254] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0255] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0256] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0257] The microphone 238 accepts instructions from the user 20 by receiving the user's voice. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0258] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0259] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
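The text does not specify how the exchange is secured. As one assumed possibility, a terminal-side client could use TLS through Python's standard `ssl` module, which is built on OpenSSL. The function name `make_client_context` and the choice of TLS 1.2 as the minimum version are hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """TLS client settings: verify the server and refuse old protocol versions."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context would then be used to wrap the socket that carries audio data and payment information between the terminal and the server.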
[0260] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0261] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0262] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0263] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0264] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0265] The system according to the present invention is designed to streamline and secure the voice-based ordering process through television shopping. This system integrates voice recognition, voice confirmation generation, identity verification, and secure payment processing.
[0266] The system consists of a terminal, a server, and a speech synthesis device. First, the terminal acquires the user's voice. When the user speaks and says they want to order a specific product, that voice data is sent to the server.
[0267] The server analyzes the received audio data based on its speech recognition engine and extracts order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0268] Once the user confirms that there are no problems with their order, the server uses voice pattern analysis technology to verify the user's identity. This verification is performed by comparing the user's voice with previously registered voice data.
[0269] Once identity verification is complete, the server proceeds with payment processing. Payment information is securely processed based on the payment method specified by the user. During this process, all data is encrypted and protected from unauthorized access.
[0270] As a concrete example, consider a case where a user orders a specific vacuum cleaner from a TV shopping channel. The user says, "I want to order a vacuum cleaner," using their voice. The terminal receives this voice and sends it to the server. The server analyzes the voice and extracts the keyword "vacuum cleaner." Next, the server creates a confirmation voice message, "Do you want to order the XX vacuum cleaner?", and presents it to the user through a speech synthesis device.
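The keyword extraction and confirmation in this example can be sketched as a lookup against a product catalog. The catalog contents, the function names, and the fallback question are hypothetical; only the confirmation wording for the vacuum cleaner comes from the example.

```python
from typing import Optional

# Hypothetical catalog: spoken keyword -> product name used in the confirmation.
CATALOG = {
    "vacuum cleaner": "XX vacuum cleaner",
    "smartphone": "model XYZ smartphone",
}


def find_product(transcript: str) -> Optional[str]:
    """Return the catalog name for the first keyword found, else None."""
    text = transcript.lower()
    for keyword, product in CATALOG.items():
        if keyword in text:
            return product
    return None


def confirmation_for(transcript: str) -> str:
    """Build the confirmation text, or ask again if no product was found."""
    product = find_product(transcript)
    if product is None:
        return "Sorry, which product would you like to order?"
    return f"Do you want to order the {product}?"
```

With this catalog, the utterance "I want to order a vacuum cleaner" yields the confirmation text quoted in the example.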
[0271] If the user responds with "yes," the server will reconfirm the order details and prompt the user for identity verification and payment method selection. For example, the user may be asked to enter their credit card information by voice. The server then completes the payment using encrypted information.
[0272] This system allows users to complete orders safely and efficiently without complicated procedures. The voice-activated interface provides a user-friendly shopping experience, especially for the elderly and those unfamiliar with technology.
[0273] The following describes the processing flow.
[0274] Step 1:
[0275] The terminal receives voice input from the user. The user verbally places a product order while watching a TV shopping program, and the terminal records this through a microphone.
[0276] Step 2:
[0277] The terminal digitizes the recorded voice data and transmits it to the server via an Internet connection. Since the data is transferred in real-time, it can immediately enter the processing.
[0278] Step 3:
[0279] The server inputs the received voice data into a voice recognition engine and converts it from voice to text. In this process, keywords such as product names and quantities are extracted and organized as order information.
[0280] Step 4:
[0281] Based on the order information, the server creates the text for the confirmation voice. For example, it creates a confirmation message such as "Would you like to order 2 units of [product name]?"
[0282] Step 5:
[0283] The server sends the generated text to a voice synthesis engine to generate confirmation voice data.
[0284] Step 6:
[0285] The terminal receives the confirmation voice data from the voice synthesis engine and plays it back to the user. Here, the user is requested to confirm the order details.
[0286] Step 7:
[0287] The user listens to the confirmation voice and responds verbally with "Yes" if there are no problems with the order, or "No" if cancellation or modification is required.
[0288] Step 8:
[0289] The terminal sends the user's voice response back to the server, which analyzes the audio to understand the user's intent. If the user responds with "yes," the server proceeds to prepare to confirm the order.
[0290] Step 9:
[0291] The server uses voice pattern analysis to verify the user's identity by comparing their voice to pre-registered voice data.
[0292] Step 10:
[0293] The user specifies their payment method by voice and enters credit card information and other details as needed. The terminal encrypts this information and sends it to the server.
[0294] Step 11:
[0295] The server securely processes payment information and verifies that the purchase process has been completed successfully. After completion, it notifies the user of the order completion via the terminal.
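The transmission of digitized voice data in Step 2 of this flow can be illustrated by packaging the audio bytes into a message on the terminal and unpacking it on the server. The JSON layout, the field names, and the function names are hypothetical, and the sketch covers only the message format, not the encrypted transport.

```python
import base64
import json


def pack_audio(audio: bytes, sample_rate_hz: int) -> str:
    """Terminal side: wrap raw audio bytes in a JSON message."""
    return json.dumps({
        "sample_rate_hz": sample_rate_hz,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    })


def unpack_audio(message: str) -> tuple[bytes, int]:
    """Server side: recover the audio bytes and the sample rate."""
    payload = json.loads(message)
    return base64.b64decode(payload["audio_b64"]), payload["sample_rate_hz"]
```

Base64 is used here only because JSON cannot carry raw bytes; a binary protocol would avoid the size overhead.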
[0296] (Example 1)
[0297] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0298] Traditional online ordering systems often require users to perform complex procedures and input tasks, making them difficult to use, especially for the elderly and those unfamiliar with technology. Furthermore, identity verification and payment processing may lack sufficient security and efficiency. There is a need to address these challenges and provide a system that allows anyone to complete orders easily and securely.
[0299] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0300] In this invention, the server includes a receiving means for acquiring voice input, a processing means for analyzing the voice input and extracting information, and an output means for generating and presenting a confirmation voice based on the information. Thereby, it becomes possible to perform an order procedure simply and safely using voice.
[0301] The "receiving means for acquiring voice input" is a device that collects voice data from a user and converts it into a form that can be processed as an electronic signal.
[0302] The "processing means for analyzing the voice input and extracting information" is a device that analyzes a voice signal using voice recognition technology to identify order information and other necessary information.
[0303] The "output means for generating and presenting a confirmation voice based on the information" is a device that generates a confirmation voice based on the extracted information and presents it auditorily to the user.
[0304] The "confirmation means for determining an order based on the user's response" is a device that makes a final decision on the order content based on the response provided by the user and records it.
[0305] The "authentication means for identifying the user using voice analysis and performing authentication" is a device that analyzes the characteristics of voice to individually identify the user and perform identity verification.
[0306] The "transaction means for encrypting and processing to protect the user's transaction information" is a device that securely processes the user's payment information using encryption technology and protects it from unauthorized access.
[0307] This invention provides an online ordering system using voice, enabling a user to easily and safely order goods using only voice. Embodiments thereof are described below.
[0308] In this system, the terminal is equipped with a microphone to capture user speech. When the user verbally states that they wish to order a product, the audio data is converted into a digital format by the terminal and transmitted to the server via the network. The specific software used includes a "speech recognition engine" for analyzing the audio data.
[0309] The server analyzes the received audio data using a "speech recognition engine." This analysis converts the audio signal into text data and extracts order information such as product name and quantity. This information is necessary to confirm the order details with the user.
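As a rough illustration of the extraction described above, the following Python sketch derives a product name and quantity from text that has already been transcribed. The names `OrderInfo` and `extract_order` and the number-word table are hypothetical and do not appear in this disclosure; an actual speech recognition engine would rely on a far richer language model than this keyword rule.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical table of spoken quantities (illustrative only).
_NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


@dataclass
class OrderInfo:
    product: str
    quantity: int


def extract_order(text: str) -> Optional[OrderInfo]:
    """Return the order found after the verb "order" or "buy", or None."""
    match = re.search(r"\b(?:order|buy)\b(.*)", text, re.IGNORECASE)
    if match is None:
        return None
    words = re.findall(r"[A-Za-z0-9'-]+", match.group(1))
    quantity = 1  # assumed default when no quantity is spoken
    if words:
        first = words[0].lower()
        if first.isdigit():
            quantity = int(first)
            words = words[1:]
        elif first in _NUMBER_WORDS:
            quantity = _NUMBER_WORDS[first]
            words = words[1:]
    # Drop filler words such as "of the" in "two of the XX vacuum cleaners".
    while words and words[0].lower() in {"of", "the"}:
        words = words[1:]
    if not words:
        return None
    return OrderInfo(product=" ".join(words), quantity=quantity)
```

For the utterance "I want to order a vacuum cleaner," this sketch yields the product "vacuum cleaner" with a quantity of 1; an utterance with no order verb yields no order information.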
[0310] The server then uses "speech synthesis software" to generate a confirmation voice message that lets the user check the extracted order details. For example, it might say, "Would you like to order two of the XX vacuum cleaners?"
[0311] The user responds to the confirmation voice message. The response is sent back to the server via the terminal. The server verifies the user's identity using voice pattern analysis technology. After authentication is complete, the payment information is securely processed using encryption technology according to the user's instructions.
[0312] As a concrete example, when a user voice-inputs "I want to order a vacuum cleaner," the terminal sends the voice data to the server, which analyzes the data and generates a confirmation voice message asking the user, "Are you sure you want to order a vacuum cleaner?" The order is then confirmed when the user answers "Yes."
[0313] An example of a prompt used as input for the AI model in this system is, "Please tell me about the voice-based product ordering system." This example demonstrates how to achieve an intuitive and easy-to-use user experience via a voice interface.
[0314] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0315] Step 1:
[0316] The terminal acquires voice from the user. The user states their purchase request directly, for example, "I want to order a vacuum cleaner." The terminal receives this voice via its microphone and stores it as audio data. The input is a raw audio signal, and the output is digital audio data.
[0317] Step 2:
[0318] The terminal transmits the acquired digital audio data to the server. Using a stable internet connection, the audio files are uploaded to the server. The input is the audio data within the terminal, and the output is the audio data transferred to the server.
[0319] Step 3:
[0320] The server analyzes the received audio data based on its speech recognition engine. Specifically, it converts the audio signal into text and extracts order information (product name and quantity). The input is audio data stored on the server, and the output is order information in text format.
[0321] Step 4:
[0322] The server generates a confirmation voice based on the order information. Using speech synthesis software, it creates a voice message for the user to reconfirm the order details. For example, a confirmation message such as "Do you want to order the XX vacuum cleaner?" is generated. The input is order information in text format, and the output is a confirmation message in digital voice.
[0323] Step 5:
[0324] The terminal presents the confirmation voice received from the server to the user. The message is played through headphones or speakers so that the user can check its contents. The input is the confirmation audio data from the server, and the output is the confirmation voice heard by the user.
[0325] Step 6:
[0326] The user listens to a confirmation voice message and then responds. For example, they might respond with "yes," and the response is sent to the server via their device. The input is the user's response to the confirmation voice message, and the output is the response voice data.
[0327] Step 7:
[0328] The server receives the user's voice response and performs identity verification using voice pattern analysis. It compares this to previously registered voice data to check for a match. The input is the user's voice response data, and the output is the result of the identity verification.
[0329] Step 8:
[0330] After identity verification is complete, the server accepts the payment method specified by the user and processes the payment. Transaction information is encrypted and processed securely. The input is authenticated payment information, and the output is a confirmation of transaction completion.
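The ordering of Steps 3 through 8 can be sketched as follows. This is a minimal outline under stated assumptions, not the implementation of the disclosed system: `OrderServices`, `run_order_flow`, and the returned status strings are hypothetical names, and each component (recognition, synthesis, playback, speaker verification, payment) is injected as a plain function so that the control flow can be read on its own. For brevity, `recognize` is assumed to return the order details directly as text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OrderServices:
    """Hypothetical bundle of the components used by the server and terminal."""
    recognize: Callable[[bytes], str]       # audio -> recognized text
    synthesize: Callable[[str], bytes]      # text -> confirmation audio
    ask_user: Callable[[bytes], bytes]      # play audio, return the spoken reply
    verify_speaker: Callable[[bytes], bool] # voice pattern analysis result
    pay: Callable[[str], str]               # process payment, return a status


def run_order_flow(audio: bytes, services: OrderServices) -> str:
    order_text = services.recognize(audio)                               # Step 3
    prompt = services.synthesize(f"Do you want to order {order_text}?")  # Step 4
    reply_audio = services.ask_user(prompt)                              # Steps 5-6
    reply = services.recognize(reply_audio).strip().lower().rstrip(".!?")
    if reply != "yes":
        return "cancelled"   # the order is not confirmed
    if not services.verify_speaker(reply_audio):                         # Step 7
        return "rejected"    # identity verification failed
    return services.pay(order_text)                                      # Step 8
```

In this arrangement, payment is reached only after both the spoken confirmation and the identity verification succeed, which mirrors the order of the steps above.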
[0331] (Application Example 1)
[0332] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0333] Conventional voice-based ordering systems have struggled to accurately extract order information from user voices and to confirm orders safely and quickly. Ensuring privacy while processing payments was also a critical challenge. Furthermore, there is a need for a voice interface that is simple and efficient to use, even for elderly users and those unfamiliar with technology.
[0334] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0335] In this invention, the server includes a computing device that analyzes voice input using a recognition mechanism and extracts information related to the order; a voice generation means that creates a confirmation voice based on the order information and presents it to the user; an order confirmation means that confirms the order based on the user's response; and a processing means that receives the specification of the payment method by voice and securely processes the payment information. As a result, users can effectively order products using voice commands and make payments securely while maintaining their privacy.
[0336] "Voice input" is the operation of receiving the voice spoken by the user and processing it as digital information.
[0337] An "information processing device" is a device that acquires voice input and transmits that information to other devices.
[0338] A "recognition mechanism" is an algorithm or software used to analyze audio data and extract meaningful information.
[0339] A "computing device" is a device that analyzes received data and derives specific information such as order details and quantities.
[0340] "Voice generation means" refers to a technology or device for converting text information into speech and presenting it to the user.
[0341] "Order confirmation means" refers to a process for managing and confirming the completion of an order based on the user's instructions.
[0342] "Payment information" refers to all information related to payments that a buyer uses to purchase goods or services.
[0343] "Processing means" refers to a technology or system for safely and efficiently manipulating, calculating, or storing specific information.
[0344] The system for carrying out this invention consists of the following main components: an information processing device for receiving voice input, a computing device including a recognition mechanism, a voice generation means, an order confirmation means, and a means for processing payment information.
[0345] First, the user places an order for products with the information processing device via voice input. This voice is sent to the computing device, where it is recognized by software such as the Google Cloud Speech-to-Text API. The computing device analyzes the voice data and extracts information about the order. This analysis includes a process to derive specific order details, such as the color and quantity of sneakers.
[0346] Next, the voice generation means, for example a speech synthesis service such as Amazon Polly, presents the extracted order information to the user as confirmation. This allows the user to verbally confirm that the order details are correct. After the user confirms the order details, the order confirmation means finalizes the order. At this point, the user specifies the payment method via voice.
[0347] Subsequently, payment information is securely processed using encryption technologies such as OpenSSL, according to the user's specifications. This ensures privacy while enabling secure transactions.
[0348] As a concrete example, suppose a user says, "I want to order light blue sneakers." The computing device processes this audio and generates a confirmation voice message asking, "Do you want to order light blue sneakers?" If the user responds with "Yes," the order confirmation means confirms the order and proceeds to the next step.
[0349] An example of a prompt message would be, "Please tell me how to use speech recognition technology to analyze a user's voice saying, 'I want to order light blue sneakers,' and process it as an order."
[0350] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0351] Step 1:
[0352] The terminal receives the user's voice input. It acquires the voice data of the user's utterance, "I want to order light blue sneakers," and sends this data to the server for speech recognition using the Google Cloud Speech-to-Text API. The input is the user's voice, and the output is digitized voice data.
[0353] Step 2:
[0354] The server analyzes the received audio data. The computing device processes the audio data and uses a generative AI model to extract order information (such as product name and quantity). The input is digitized audio data, and the output is the extracted order information (e.g., sneaker color and quantity).
[0355] Step 3:
[0356] The server generates a confirmation voice based on the order information. Using a voice generation method such as Amazon Polly, it creates a confirmation message, "Do you want to order light blue sneakers?", and sends it to the terminal. The input is the order information, and the output is a synthesized confirmation voice.
[0357] Step 4:
[0358] The terminal presents the user with a synthesized voice confirmation message. The user listens to it and confirms the order details. The input is the synthesized voice message, and the output is the user's confirmation response (e.g., "Yes").
[0359] Step 5:
[0360] The server receives the user's acknowledgment and confirms the order. The order confirmation system then uses speech recognition again to analyze the response and complete the order confirmation process. The input is the user's acknowledgment, and the output is the confirmed order.
[0361] Step 6:
[0362] The user specifies the payment method by voice. The terminal sends this voice to the server, and the computing device performs voice recognition to extract the payment information. The input is the user's voice instruction for the payment method, and the output is the extracted payment information.
[0363] Step 7:
[0364] The server securely processes payment information. The processing method uses OpenSSL to encrypt payment information, ensuring secure transactions. The input is payment information, and the output is encrypted payment information.
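Steps 5 and 6 both interpret a recognized utterance: first a confirmation response, then a payment method. The sketch below shows one simple way this might be done on the recognized text. The function names, the accepted phrases, and the payment method keywords are all hypothetical examples and are not taken from this disclosure; encryption of the extracted payment information (Step 7) is outside the scope of the sketch.

```python
from typing import Optional

# Hypothetical phrase lists (illustrative only).
_YES = {"yes", "yeah", "correct", "that's right"}
_NO = {"no", "nope", "cancel"}

# Hypothetical mapping from spoken phrases to payment method identifiers.
_PAYMENT_KEYWORDS = {
    "credit card": "credit_card",
    "bank transfer": "bank_transfer",
    "cash on delivery": "cash_on_delivery",
}


def interpret_confirmation(text: str) -> Optional[bool]:
    """Return True for agreement, False for refusal, None if unclear."""
    normalized = text.strip().lower().rstrip(".!?")
    if normalized in _YES:
        return True
    if normalized in _NO:
        return False
    return None


def extract_payment_method(text: str) -> Optional[str]:
    """Return the first known payment method mentioned in the text, if any."""
    lowered = text.lower()
    for phrase, method in _PAYMENT_KEYWORDS.items():
        if phrase in lowered:
            return method
    return None
```

Returning `None` for an unclear reply lets the caller ask the user again instead of guessing, which suits users who are unfamiliar with voice interfaces.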
[0365] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0366] The system according to the present invention streamlines the voice-based ordering process in television shopping and provides responses that take into account the user's emotional state, thereby realizing a more comfortable and safe purchasing experience. This system consists of a terminal, a server, a voice synthesis device, and an emotion engine.
[0367] First, the terminal acquires the user's speech as audio data. When a user orders a specific product from a TV shopping program, they state their intention verbally. The acquired audio data is then sent to the server.
[0368] The server uses a speech recognition engine to convert this audio data into text and extract order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0369] In this process, the server is equipped with an emotion engine that analyzes the emotions contained in the user's voice. The emotion engine can estimate the user's emotional state from factors such as tone, speed, and rhythm of the voice. For example, if excitement or irritation is detected, it will generate a corresponding confirmation voice in a softer tone.
[0370] A confirmation voice message is presented to the user, and users who agree to its contents respond with "Yes." The server analyzes this response and performs voice pattern analysis to verify the user's identity. Once the user's identity has been verified through voice pattern analysis, the server proceeds to the payment process.
[0371] The user further specifies the payment method by voice and also verbally provides necessary information (e.g., credit card number). The terminal encrypts this information and sends it to the server. The server completes the transaction through a secure payment gateway. Even after payment is complete, the emotion engine may suggest feedback to the user, such as surveys or incentives for future use.
[0372] For example, when a user orders a new smartphone, the terminal receives the utterance, "I would like to order the latest smartphone, model XYZ." This audio is sent to the server, where speech recognition converts it into text and the keyword "model XYZ" is extracted. The server generates a confirmation message asking, "Do you want to order model XYZ?" and, if the emotion engine detects frustration, adds a softer message saying, "We will process your order quickly, so please rest assured." After the user confirms the order, the transaction is completed securely.
[0373] This system will not only facilitate smooth information exchange, but will also enable the provision of more satisfying services through flexible responses that cater to user emotions.
[0374] The following describes the processing flow.
[0375] Step 1:
[0376] A user watches a TV shopping program and says aloud, "I want to buy this product." The terminal receives this utterance, performs noise filtering, and converts it into digital audio data.
[0377] Step 2:
[0378] The terminal sends the digitally converted audio data to the server. This transmission takes place in real time via the internet.
[0379] Step 3:
[0380] The server processes the received audio data through a speech recognition engine and converts it into text data. Here, order information such as product name and desired quantity is extracted.
[0381] Step 4:
[0382] The server generates a confirmation message based on the order information. For example, it might prepare a message as text such as, "Would you like to order two of item XX?"
[0383] Step 5:
[0384] The server uses an emotion engine to perform sentiment analysis on the voice data. The emotion engine determines the user's emotions from the tone of voice and the speed of speech, detecting patterns such as joy and anger.
[0385] Step 6:
[0386] The server adjusts the confirmation message based on the results of the sentiment analysis. For example, if anger is detected, the message may be modified to include the phrase, "We will respond as soon as possible."
[0387] Step 7:
[0388] The revised confirmation message is sent to the speech synthesis engine to generate natural-sounding speech.
[0389] Step 8:
[0390] The terminal plays the generated confirmation voice to the user. The user listens to it and responds with "yes" if it is correct, or "no" if a correction is needed.
[0391] Step 9:
[0392] The terminal sends the user's response back to the server as audio data, which the server then analyzes.
[0393] Step 10:
[0394] The server performs voice pattern analysis to verify the user's identity. This confirms that the voice belongs to an authenticated user.
[0395] Step 11:
[0396] Once identity verification is complete, the user selects a payment method by voice and provides the required payment information. The terminal encrypts this information and sends it to the server.
[0397] Step 12:
[0398] The server processes the encrypted payment information and completes the transaction through a secure payment gateway. Once completion is confirmed, the terminal notifies the user that the order is complete.
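Steps 5 and 6 of this flow (estimating the emotion, then adjusting the confirmation message) might be sketched as below. This is a toy rule-based stand-in, assuming that voice features such as pitch, speaking rate, and loudness have already been measured; the `VoiceFeatures` type, the function names, and every threshold are hypothetical. An actual emotion engine would more likely be a trained model than a set of fixed thresholds.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Hypothetical pre-computed features of the user's utterance."""
    mean_pitch_hz: float
    words_per_second: float
    loudness_db: float


def estimate_emotion(features: VoiceFeatures) -> str:
    """Classify the utterance as "anger", "joy", or "neutral" (toy thresholds)."""
    if features.words_per_second > 3.5 and features.loudness_db > 70.0:
        return "anger"
    if features.mean_pitch_hz > 220.0 and features.words_per_second > 3.0:
        return "joy"
    return "neutral"


def adjust_confirmation(message: str, emotion: str) -> str:
    """Append the reassuring phrase from Step 6 when anger is detected."""
    if emotion == "anger":
        return f"{message} We will respond as soon as possible."
    return message
```

The adjusted text would then be passed to the speech synthesis engine in Step 7; keeping the adjustment as a separate text-to-text function leaves the synthesis stage unchanged.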
[0399] (Example 2)
[0400] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0401] Conventional voice-input ordering systems failed to provide flexible responses that took into account the user's emotional state, thus failing to improve the user experience. Furthermore, there were challenges in verifying user identity and ensuring the security of payment information. These problems need to be solved to provide a safer and more comfortable ordering experience.
[0402] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0403] In this invention, the server includes an information processing device for receiving voice input, a computing device for analyzing the voice input based on a conversion device and extracting information related to the order, a voice generation means for generating a confirmation voice based on the information related to the order and presenting it to the user, an emotion analysis means for analyzing the emotion of the voice and adjusting the response, and a means for confirming the order based on the user's response. This enables flexible responses that take into account the user's emotions, as well as secure identity verification and payment processing.
[0404] "Voice input" is the process by which electronic devices acquire a user's speech as digital data.
[0405] An "information processing device" is a device that has the function of receiving voice input and is used for data collection and communication.
[0406] A "conversion device" is a technology that has a mechanism for converting audio data into text data.
[0407] A "computing device" is a device that analyzes text data extracted from speech and has the function of organizing and extracting information related to orders.
[0408] "Voice generation means" is a means of generating natural-sounding speech from text data and presenting it to the user.
[0409] "Emotion analysis means for analyzing the emotion of the voice and adjusting the response" is a means of determining the emotional state of a user based on their voice tone, speed, etc., and adjusting the content of the response and the voice accordingly.
[0410] "Means for confirming the order" refers to the means of confirming an order based on the user's response and completing the final order process.
[0411] "Identity verification methods" refer to measures used to confirm the user's identity through voice pattern analysis and to guarantee the legitimacy of the order.
[0412] "A processing method for encrypting and securely processing payment information" refers to a method equipped with a mechanism for encrypting data in order to securely manage and process users' payment information.
[0413] To carry out this invention, the following configuration is necessary. The system mainly includes a terminal device, a server device, a speech synthesis device, and emotion analysis means. First, the terminal device functions as an information processing device for acquiring voice input from the user. An input device such as a microphone is integrated into the terminal device, which converts the voice into a digital signal and transmits it to the server device via the network.
[0414] The server device acts as a conversion device, converting audio data into text data using a speech recognition engine. DeepSpeech or similar high-precision speech recognition technologies are used for this conversion. From the extracted text data, the server device acts as a computing device, analyzing and organizing order information such as product names and quantities.
[0415] Subsequently, the server uses a speech synthesis engine (TTS engine) to generate a confirmation voice based on the obtained order information. This process utilizes an AI model for natural language processing and generation. In particular, it analyzes the tone and speed of the voice through emotion analysis to capture the user's emotional state. Based on this analysis, the confirmation voice is adjusted.
[0416] When a user confirms an order by voice using the order confirmation method, their response is analyzed by the server, and the order is confirmed. At this point, voice pattern analysis is performed by the identity verification method to verify the user's identity and guarantee the legitimacy of the order. Biometric authentication systems may also be integrated into this process.
[0417] Furthermore, payment information provided by users is securely processed using a method that encrypts and transmits payment information. The terminal device encrypts the information and sends it to the server, which completes the transaction through a secure gateway. Strong encryption technology is used to maintain data security.
[0418] Finally, the server also provides post-purchase feedback and incentives through sentiment analysis. This process ensures that users have a comfortable, emotionally responsive, and flexible purchasing experience.
[0419] For example, when a user tries to purchase a smartphone, the terminal receives the utterance, "I would like to order the latest smartphone, model XYZ." This audio is sent to the server, and speech recognition yields the text data "model XYZ." A confirmation message, "Do you want to order model XYZ?", is generated and then adjusted as needed based on the analysis of the user's emotions.
[0420] An example of a prompt for a generative AI model is: "A user is passionately trying to order the latest XYZ smartphone from a TV shopping channel. Convert the voice data into text and generate an acknowledgment response that reflects the user's emotions."
[0421] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0422] Step 1:
[0423] The terminal acquires the user's voice input. The user states which product from the TV shopping program they want to purchase, and this voice is captured by the terminal's built-in microphone and converted into digital data. The output is audio data that has undergone digital signal processing.
[0424] Step 2:
[0425] The terminal transmits the acquired audio data to the server. The audio data is securely transferred to the server over the network, and the communication is protected by an encryption protocol. The input is digital audio data, and the output is received by the server as audio data that can be recognized.
[0426] Step 3:
[0427] The server converts the received audio data into text using a speech recognition engine. The speech recognition engine uses a deep learning model to analyze the audio and output the order details as text data. In this process, important order information such as product name and quantity is extracted.
[0428] Step 4:
[0429] The server generates a confirmation voice based on the order information obtained from the text. It uses a speech synthesis engine to convert the text back into speech and create the confirmation message. The input is text data about the order, and the output is the voice data required for confirmation.
[0430] Step 5:
[0431] The server uses an emotion analysis engine to analyze the tone and speed of the user's voice and evaluate the user's emotional state. For example, if the user's voice contains excitement or frustration, the confirmation voice is adjusted according to that emotion. The input is the initial voice data, and the output is the confirmation voice adjusted according to the emotion.
[0432] Step 6:
[0433] The user receives the confirmation voice message from the server and responds. The user confirms the order with a spoken reply such as "yes," which the terminal records. The recorded voice data is then sent back to the server.
[0434] Step 7:
[0435] The server analyzes the user's voice response and verifies their identity through voice pattern analysis. This process involves analyzing the frequency and rhythm of the voice and comparing it to past data. The input is the user's voice response, and the output is data confirming the legitimacy of the order.
[0436] Step 8:
[0437] Once the server verifies the user's identity, it begins processing the payment. The payment information provided by the user is encrypted and transmitted to the server via a secure route, where the transaction is completed using the payment gateway. The input is encrypted payment information, and the output is a confirmation of transaction completion.
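The comparison against past data in Step 7 can be illustrated with a similarity check between two feature vectors. The sketch below assumes that the voice has already been reduced to a fixed-length numeric vector (for example, from frequency and rhythm measurements); how that vector is obtained is not shown. The use of cosine similarity and the threshold of 0.85 are illustrative assumptions and are not specified in this disclosure.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def verify_speaker(
    enrolled: Sequence[float],
    candidate: Sequence[float],
    threshold: float = 0.85,  # hypothetical acceptance threshold
) -> bool:
    """Accept the speaker when the candidate is close enough to the enrolled voice."""
    return cosine_similarity(enrolled, candidate) >= threshold
```

A higher threshold rejects more impostors at the cost of rejecting more legitimate users, so in practice the value would need to be tuned on recorded voice data.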
[0438] (Application Example 2)
[0439] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0440] Conventional voice-based ordering systems often provide uniform responses without considering the user's emotional state, potentially leading to decreased user satisfaction. Furthermore, insufficient consideration is given to the security of voice-based identity verification and payment. Therefore, a system is needed that can provide flexible responses tailored to the user's emotional state while ensuring secure identity verification and payment.
[0441] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0442] In this invention, the server includes a calculation means for analyzing and extracting order information based on speech recognition means, a response generation means for generating a response corresponding to the emotional state, and an emotion analysis means for estimating the emotional state. This makes it possible to provide appropriate responses according to the user's emotional state and, combined with voice-based identity verification and encrypted payment processing, to improve security.
[0443] "Voice input" is a method by which users communicate information through their voice.
[0444] An "information processing device" is a device for receiving and processing audio data.
[0445] "Speech recognition means" refers to technology that analyzes speech data and converts it into text data.
[0446] A "computing device" is a device used to extract necessary information from analyzed data.
[0447] "Speech synthesis means" refers to technology for converting text data into speech data and outputting it.
[0448] "Emotion analysis means" refers to technology that estimates a user's emotional state from voice data.
[0449] A "response generation means" is a technology for generating an appropriate response based on an emotional state.
[0450] "Order confirmation means" refers to technology for confirming an order based on the user's response.
[0451] "Verification means" refers to technology that analyzes a user's voice patterns to determine their legitimacy.
[0452] "Payment transaction means" refers to technology for securely processing a user's payment information.
[0453] The system for implementing this invention uses a smartphone or smart glasses as an information processing device to receive voice input. When a user wants to order a product, they input the necessary information by voice into these devices. The information processing device receives the voice data and transmits it to a computing device. The server uses voice recognition means to convert the voice data into text and extract the order information.
[0454] At this time, the server uses the emotion analysis means to analyze the user's voice tone, speed, and rhythm, and to estimate their emotional state. For example, if the user is perceived as being in a hurry, the server can generate a flexible response that corresponds to their emotional state. The response generation means generates a confirmation voice, which is then presented to the user via the speech synthesis means.
[0455] When the user responds with "yes" by voice, the server confirms the order through the order confirmation mechanism. Then, the user's voice pattern is analyzed by a verification mechanism to ensure security by verifying their identity. Finally, the payment transaction mechanism encrypts the user's payment information and processes it securely.
[0456] For example, if a user speaks into their smartphone saying, "I'd like to order new sneakers, size 9," the system recognizes this as order information and, based on sentiment analysis, generates a reassuring response such as, "If you're in a hurry, we'll process it quickly."
[0457] An example of a prompt for a generative AI model is: "When the user provides voice input, convert it to text using speech recognition, analyze the sentiment using the sentiment analysis engine, and then create an appropriate response based on the results."
[0458] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0459] Step 1:
[0460] The terminal receives voice input from the user. When the user speaks about the product they want to purchase, that voice data is recorded in the terminal. This voice data is then sent for subsequent processing.
[0461] Step 2:
[0462] The server converts the received audio data into text data using speech recognition technology. The speech recognition engine analyzes the waveform of the voice and extracts order information such as product names and quantities as text. As a result, the order information is output.
[0463] Step 3:
[0464] The server uses the emotion analysis means to estimate the user's emotional state. In addition to the converted text, it analyzes the tone, tempo, and intonation of the audio data to identify emotions such as urgency or satisfaction. This analysis result is output as the emotional state.
[0465] Step 4:
[0466] The server generates a confirmation voice message using a response generation mechanism based on the order information and emotional state. The confirmation voice message incorporates flexible responses depending on the emotional state (e.g., "We will respond promptly" if the customer is in a hurry). This confirmation voice message is output by a speech synthesis mechanism.
[0467] Step 5:
[0468] The user responds to the confirmation audio presented by the server with "yes" or "no." The terminal receives this response again as audio data. This audio data is sent to the server for final confirmation.
[0469] Step 6:
[0470] The server analyzes the received acknowledgment based on the order confirmation mechanism and confirms the order. The order is confirmed when the user responds with "yes," and the process proceeds to the next step. As a result of this process, confirmed order information is output.
[0471] Step 7:
[0472] The server uses verification methods to analyze the user's voice pattern and verify their identity. To ensure security, the voice characteristics are compared to a database to determine legitimacy. This analysis confirms that the user is legitimate.
[0473] Step 8:
[0474] The server uses a payment transaction method to securely process the payment information provided by the user by encrypting it. The transaction is carried out through the payment gateway, and the completed payment information is output.
[0475] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0476] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0477] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0478] [Third Embodiment]
[0479] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0480] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0481] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0482] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0483] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0484] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0485] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0486] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0487] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0488] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0489] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0490] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0491] The system according to the present invention is designed to streamline and secure the voice-based ordering process through television shopping. This system integrates voice recognition, voice confirmation generation, identity verification, and secure payment processing.
[0492] The system consists of a terminal, a server, and a speech synthesis device. First, the terminal acquires the user's voice. When the user speaks and says they want to order a specific product, that voice data is sent to the server.
[0493] The server analyzes the received audio data based on its speech recognition engine and extracts order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0494] Once the user confirms that there are no problems with their order, the server uses voice pattern analysis technology to verify the user's identity. This verification is performed by comparing the user's voice with previously registered voice data.
[0495] Once identity verification is complete, the server proceeds with payment processing. Payment information is securely processed based on the payment method specified by the user. During this process, all data is encrypted and protected from unauthorized access.
[0496] As a concrete example, consider a case where a user orders a specific vacuum cleaner from a TV shopping channel. The user says, "I want to order a vacuum cleaner," using their voice. The terminal receives this voice and sends it to the server. The server analyzes the voice and extracts the keyword "vacuum cleaner." Next, the server creates a confirmation voice message, "Do you want to order the XX vacuum cleaner?", and presents it to the user through a speech synthesis device.
[0497] If the user responds with "yes," the server will reconfirm the order details and prompt the user for identity verification and payment method selection. For example, the user may be asked to enter their credit card information by voice. The server then completes the payment using encrypted information.
[0498] This system allows users to complete orders safely and efficiently without complicated procedures. The voice-activated interface provides a user-friendly shopping experience, especially for the elderly and those unfamiliar with technology.
[0499] The following describes the processing flow.
[0500] Step 1:
[0501] The terminal receives voice input from the user. The user orders a product by voice while watching a TV shopping program, and the terminal records the utterance via its microphone.
[0502] Step 2:
[0503] The terminal digitizes the recorded audio data and sends it to the server via the internet. Because the data is transferred in real time, processing can begin immediately.
[0504] Step 3:
[0505] The server inputs the received audio data into a speech recognition engine, which converts it from speech to text. In this process, keywords such as product names and quantities are extracted and organized as order information.
[0506] Step 4:
[0507] The server generates text to produce a confirmation voice message based on the order information. For example, it might create a confirmation message such as, "Would you like to order two of item XX?"
[0508] Step 5:
[0509] The server sends the generated text to the speech synthesis engine, which then generates confirmation audio data.
[0510] Step 6:
[0511] The terminal receives confirmation audio data from the speech synthesis engine and plays it for the user. At this point, the user is asked to confirm the order details.
[0512] Step 7:
[0513] The user listens to a confirmation voice message and responds with "yes" if there are no problems with the order, or "no" if cancellation or modification is needed.
[0514] Step 8:
[0515] The terminal sends the user's voice response back to the server, which analyzes the audio to understand the user's intent. If the user responds with "yes," the server proceeds to prepare to confirm the order.
[0516] Step 9:
[0517] The server uses voice pattern analysis to verify the user's identity by comparing their voice to pre-registered voice data.
[0518] Step 10:
[0519] The user specifies their payment method by voice and enters credit card information and other details as needed. The terminal encrypts this information and sends it to the server.
[0520] Step 11:
[0521] The server securely processes payment information and verifies that the purchase process has been completed successfully. After completion, it notifies the user of the order completion via the terminal.
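As an illustration of Steps 3 and 4 above, the following is a minimal sketch of how order information might be extracted from recognized text and turned into confirmation text. It is not the method of the disclosure: the catalog, the number-word table, the default quantity of 1, and the names `OrderInfo`, `extract_order`, and `confirmation_text` are all assumptions introduced for illustration, and speech recognition itself is assumed to have already produced the input text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical product catalog and number words (assumptions, not from the source).
CATALOG = ("vacuum cleaner", "sneakers", "smartphone")
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


@dataclass
class OrderInfo:
    product: str
    quantity: int


def extract_order(text: str) -> Optional[OrderInfo]:
    """Extract a product name and quantity from recognized text (Step 3)."""
    lowered = text.lower()
    # Pick the first catalog entry mentioned in the utterance.
    product = next((p for p in CATALOG if p in lowered), None)
    if product is None:
        return None
    quantity = 1  # assumed default when no quantity is spoken
    digit = re.search(r"\b(\d+)\b", lowered)
    if digit:
        quantity = int(digit.group(1))
    else:
        for word, value in NUMBER_WORDS.items():
            if re.search(rf"\b{word}\b", lowered):
                quantity = value
                break
    return OrderInfo(product, quantity)


def confirmation_text(order: OrderInfo) -> str:
    """Build the confirmation text handed to speech synthesis (Step 4)."""
    return f"Would you like to order {order.quantity} x {order.product}?"
```

For example, `extract_order("I want to order two vacuum cleaners")` yields a product of "vacuum cleaner" with a quantity of 2, and an utterance that names no catalog product yields `None`, in which case the server could ask the user to repeat the order.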
[0522] (Example 1)
[0523] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0524] Traditional online ordering systems often require users to perform complex procedures and input tasks, making them difficult to use, especially for the elderly and those unfamiliar with technology. Furthermore, identity verification and payment processing may lack sufficient security and efficiency. There is a need to address these challenges and provide a system that allows anyone to complete orders easily and securely.
[0525] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0526] In this invention, the server includes a receiving means for acquiring voice input, a processing means for analyzing the voice input and extracting information, and an output means for generating and presenting confirmation voice based on the information. This makes it possible to place orders simply and securely using voice.
[0527] A "receiving means for acquiring voice input" is a device that collects voice data from a user and converts it into a format that can be processed as an electronic signal.
[0528] The "processing means for analyzing the voice input and extracting information" refers to a device that uses voice recognition technology to analyze the voice signal and identify order information and other necessary information.
[0529] "Output means for generating and presenting confirmation audio based on the aforementioned information" refers to a device for generating confirmation audio based on extracted information and presenting it audibly to the user.
[0530] A "confirmation means for confirming an order based on user responses" is a device that finalizes and records the order details based on the responses provided by the user.
[0531] The "authentication means for identifying and authenticating a user by voice analysis" refers to a device that analyzes the characteristics of a voice to individually identify a user and verify their identity.
[0532] The "transaction means for encrypting and processing user transaction information to protect it" refers to a device that securely processes user payment information using encryption technology and protects it from unauthorized access.
[0533] This invention provides an online ordering system using voice commands, enabling users to easily and securely order products using only their voice. An embodiment thereof is described below.
[0534] In this system, the terminal is equipped with a microphone to capture user speech. When the user verbally states that they wish to order a product, the audio data is converted into a digital format by the terminal and transmitted to the server via the network. The specific software used includes a "speech recognition engine" for analyzing the audio data.
[0535] The server analyzes the received audio data using a "speech recognition engine." This analysis converts the audio signal into text data and extracts order information such as product name and quantity. This information is necessary to confirm the order details with the user.
[0536] The server then uses "speech synthesis software" to generate a confirmation voice message from the extracted order information, so that the user can confirm the order details. For example, it might say, "Would you like to order two of the XX vacuum cleaners?"
[0537] The user responds to the confirmation voice message. The response is sent back to the server via the terminal. The server verifies the user's identity using voice pattern analysis technology. After authentication is complete, the payment information is securely processed using encryption technology according to the user's instructions.
[0538] As a concrete example, when a user voice-inputs "I want to order a vacuum cleaner," the terminal sends the voice data to the server, which analyzes the data and generates a confirmation voice message asking the user, "Are you sure you want to order a vacuum cleaner?" The order is then confirmed when the user answers "Yes."
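The handling of the user's reply can be sketched as follows. This is an illustrative sketch only: the source mentions just "yes" and "no", so the extra vocabulary, the `UNCLEAR` outcome, and the names `Reply` and `classify_reply` are assumptions. The reply is assumed to have already been converted to text by the speech recognition engine.

```python
from enum import Enum


class Reply(Enum):
    CONFIRM = "confirm"
    REJECT = "reject"
    UNCLEAR = "unclear"


# Assumed vocabularies; the source only names "yes" and "no".
YES_WORDS = {"yes", "yeah", "correct"}
NO_WORDS = {"no", "cancel", "wrong"}


def classify_reply(text: str) -> Reply:
    """Classify recognized reply text as confirmation, rejection, or unclear."""
    # Split into lowercase words and drop surrounding punctuation.
    words = {w.strip(".,!?\"'") for w in text.lower().split()}
    has_yes = bool(words & YES_WORDS)
    has_no = bool(words & NO_WORDS)
    if has_yes and not has_no:
        return Reply.CONFIRM
    if has_no and not has_yes:
        return Reply.REJECT
    # Contradictory or empty replies are left for a re-prompt.
    return Reply.UNCLEAR
```

Treating contradictory replies as unclear, rather than guessing, is a conservative design choice: an order is confirmed only when the reply is unambiguous.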
[0539] An example of a prompt used as input for the AI model in this system is, "Please tell me about the voice-based product ordering system." This example demonstrates how to achieve an intuitive and easy-to-use user experience via a voice interface.
[0540] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0541] Step 1:
[0542] The terminal acquires the user's voice. The user states the purchase request aloud, for example, "I want to order a vacuum cleaner." The terminal receives this voice via its microphone and stores it as audio data. The input is a raw audio signal, and the output is digital audio data.
[0543] Step 2:
[0544] The terminal transmits the acquired digital audio data to the server. Using a stable internet connection, the audio files are uploaded to the server. The input is the audio data within the terminal, and the output is the audio data transferred to the server.
[0545] Step 3:
[0546] The server analyzes the received audio data based on its speech recognition engine. Specifically, it converts the audio signal into text and extracts order information (product name and quantity). The input is audio data stored on the server, and the output is order information in text format.
[0547] Step 4:
[0548] The server generates a confirmation voice based on the order information. Using speech synthesis software, it creates a voice message for the user to reconfirm the order details. For example, a confirmation message such as "Do you want to order the XX vacuum cleaner?" is generated. The input is order information in text format, and the output is a confirmation message in digital voice.
[0549] Step 5:
[0550] The terminal presents the confirmation audio received from the server to the user. The terminal plays the message through headphones or speakers so that the user can confirm its contents. The input is the confirmation audio data from the server, and the output is the user's auditory confirmation.
[0551] Step 6:
[0552] The user listens to the confirmation voice message and then responds. For example, the user might respond with "yes," and the response is sent to the server via the terminal. The input is the user's response to the confirmation voice message, and the output is the response voice data.
[0553] Step 7:
[0554] The server receives the user's voice response and performs identity verification using voice pattern analysis. It compares this to previously registered voice data to check for a match. The input is the user's voice response data, and the output is the result of the identity verification.
[0555] Step 8:
[0556] After identity verification is complete, the server accepts the payment method specified by the user and processes the payment. Transaction information is encrypted and processed securely. The input is authenticated payment information, and the output is a confirmation of transaction completion.
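The comparison in Step 7 between the response voice and previously registered voice data can be sketched as a similarity check between feature vectors. The source does not specify how voice patterns are represented or compared. The following assumes that each voice has already been reduced to a fixed-length feature vector by some upstream model, and uses cosine similarity with a placeholder threshold of 0.8. The names `cosine_similarity` and `verify_speaker` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no voice information
    return dot / (norm_a * norm_b)


def verify_speaker(
    enrolled: Sequence[float],
    candidate: Sequence[float],
    threshold: float = 0.8,  # placeholder value, not taken from the source
) -> bool:
    """Accept the speaker when the candidate is close enough to the enrolled voice."""
    return cosine_similarity(enrolled, candidate) >= threshold
```

In practice the threshold would be tuned to balance false acceptances against false rejections. The value used here is only a stand-in.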
[0557] (Application Example 1)
[0558] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0559] Conventional voice-based ordering systems have struggled to accurately extract order information from user voices and to confirm orders safely and quickly. Ensuring privacy while processing payments has also been a critical challenge. Furthermore, there is a need for a voice interface that is simple and efficient to use, even for elderly users and those unfamiliar with technology.
[0560] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0561] In this invention, the server includes a computing device that analyzes voice input using a recognition mechanism and extracts information related to the order; a voice generation means that creates a confirmation voice based on the order information and presents it to the user; an order confirmation means that confirms the order based on the user's response; and a processing means that receives the specification of the payment method by voice and securely processes the payment information. As a result, users can effectively order products using voice commands and make payments securely while maintaining their privacy.
[0562] "Voice input" is the operation of receiving the voice spoken by the user and processing it as digital information.
[0563] An "information processing device" is a device that acquires voice input and transmits that information to other devices.
[0564] A "recognition mechanism" is an algorithm or software used to analyze audio data and extract meaningful information.
[0565] A "computing device" is a device that analyzes received data and derives specific information such as order details and quantities.
[0566] "Voice generation means" refers to a technology or device for converting text information into speech and presenting it to the user.
[0567] "Order confirmation means" refers to a process for managing and confirming the completion of an order based on the user's instructions.
[0568] "Payment information" refers to all information related to payments that a buyer uses to purchase goods or services.
[0569] "Processing means" refers to a technology or system for safely and efficiently manipulating, calculating, or storing specific information.
[0570] The system for carrying out this invention consists of the following main components: an information processing device for receiving voice input, a computing device including a recognition mechanism, a voice generation means, an order confirmation means, and a means for processing payment information.
[0571] First, the user places an order for products with the information processing device via voice input. This voice is sent to the computing device, where it is recognized by software such as the Google Cloud Speech-to-Text API. The computing device analyzes the voice data and extracts information about the order. This analysis includes a process to derive specific order details, such as the color and quantity of sneakers.
[0572] Next, the voice generation means, for example Amazon Polly, presents the extracted order information to the user as a confirmation voice. This allows the user to verbally confirm that the order details are correct. After the user confirms the order details, the order confirmation means finalizes the order. At this point, the user specifies the payment method via voice.
[0573] Subsequently, payment information is securely processed using an encryption library such as OpenSSL, according to the user's specifications. This ensures privacy while enabling secure transactions.
[0574] As a concrete example, suppose a user says, "I want to order light blue sneakers." The computing device processes this audio and generates a confirmation voice message asking, "Do you want to order light blue sneakers?" If the user responds with "Yes," the order confirmation means confirms the order and the process proceeds to the next step.
[0575] An example of a prompt message would be, "Please tell me how to use speech recognition technology to analyze a user's voice saying, 'I want to order light blue sneakers,' and process it as an order."
[0576] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0577] Step 1:
[0578] The terminal receives the user's voice input. It acquires the voice data of the user's utterance, "I want to order light blue sneakers," and sends this data to the server, where speech recognition is performed using the Google Cloud Speech-to-Text API. The input is the user's voice, and the output is digitized voice data.
[0579] Step 2:
[0580] The server analyzes the received audio data. The computing device processes the audio data and uses a generative AI model to extract order information (such as product name and quantity). The input is digitized audio data, and the output is the extracted order information (e.g., sneaker color and quantity).
[0581] Step 3:
[0582] The server generates a confirmation voice based on the order information. Using a voice generation method such as Amazon Polly, it creates a confirmation message, "Do you want to order light blue sneakers?", and sends it to the terminal. The input is the order information, and the output is a synthesized confirmation voice.
[0583] Step 4:
[0584] The terminal presents the user with a synthesized voice confirmation message. The user listens to it and confirms the order details. The input is the synthesized voice message, and the output is the user's confirmation response (e.g., "Yes").
[0585] Step 5:
[0586] The server receives the user's acknowledgment and confirms the order. The order confirmation system then uses speech recognition again to analyze the response and complete the order confirmation process. The input is the user's acknowledgment, and the output is the confirmed order.
[0587] Step 6:
[0588] The user specifies the payment method by voice. The terminal sends this voice to the server, and the computing device performs voice recognition to extract the payment information. The input is the user's voice instruction for the payment method, and the output is the extracted payment information.
[0589] Step 7:
[0590] The server securely processes the payment information. The processing means encrypts the payment information using OpenSSL, ensuring a secure transaction. The input is payment information, and the output is encrypted payment information.
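Before payment information extracted in Step 6 is encrypted in Step 7, it may be useful to check that the recognized card number is plausible. The following sketch is an assumption layered on top of the source, which does not describe such a check: it converts spoken digits to a digit string, validates the string with the Luhn checksum, and masks it for read-back. The word table and the function names are hypothetical, and encryption itself is deliberately left to a library such as OpenSSL and is not shown.

```python
# Assumed mapping from spoken English digit words to digits.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def digits_from_speech(text: str) -> str:
    """Collect digits from recognized text such as 'four one one one'."""
    digits = []
    for token in text.lower().replace("-", " ").split():
        token = token.strip(".,")
        if token.isdigit():
            digits.append(token)
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
    return "".join(digits)


def luhn_valid(number: str) -> bool:
    """Return True when the digit string passes the Luhn checksum."""
    if not number.isdigit() or len(number) < 12:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def mask(number: str) -> str:
    """Hide all but the last four digits for spoken or logged read-back."""
    return "*" * (len(number) - 4) + number[-4:]
```

A failed checksum most likely indicates a recognition error, so the server could ask the user to repeat the number rather than forwarding it to the payment gateway.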
[0591] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0592] The system according to the present invention streamlines the voice-based ordering process in television shopping and provides responses that take into account the user's emotional state, thereby realizing a more comfortable and safe purchasing experience. This system consists of a terminal, a server, a voice synthesis device, and an emotion engine.
[0593] First, the terminal acquires the user's speech as audio data. When a user orders a specific product on TV shopping, they communicate their intention verbally. The acquired audio data is then sent to the server.
[0594] The server uses a speech recognition engine to convert this audio data into text and extract order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0595] In this process, the server is equipped with an emotion engine that analyzes the emotions contained in the user's voice. The emotion engine can estimate the user's emotional state from factors such as tone, speed, and rhythm of the voice. For example, if excitement or irritation is detected, it will generate a corresponding confirmation voice in a softer tone.
[0596] A confirmation voice message is presented to the user, and users who agree to its contents respond with "Yes." The server analyzes this response and performs voice pattern analysis to verify the user's identity. Once the user's identity has been verified through voice pattern analysis, the server proceeds to the payment process.
[0597] The user further specifies the payment method by voice and also verbally provides necessary information (e.g., credit card number). The terminal encrypts this information and sends it to the server. The server completes the transaction through a secure payment gateway. Even after payment is complete, the emotion engine may suggest feedback to the user, such as surveys or incentives for future use.
[0598] For example, when a user orders a new smartphone, the terminal receives the utterance, "I would like to order the latest smartphone, model XYZ." This audio is sent to the server, where speech recognition converts it into text and extracts the keyword "model XYZ." The server generates a confirmation message asking, "Do you want to order model XYZ?" and, if the emotion engine detects frustration, adds a softer message saying, "We will process your order quickly, so please rest assured." After the user confirms the order, the transaction is completed securely.
[0599] This system will not only facilitate smooth information exchange, but will also enable the provision of more satisfying services through flexible responses that cater to user emotions.
[0600] The following describes the processing flow.
[0601] Step 1:
[0602] A user watches a TV shopping program and says aloud, "I want to buy this product." The terminal receives this utterance, performs noise filtering, and converts it into digital audio data.
[0603] Step 2:
[0604] The terminal sends the digitally converted audio data to the server. This transmission takes place in real time via the internet.
[0605] Step 3:
[0606] The server processes the received audio data through a speech recognition engine and converts it into text data. Here, order information such as product name and desired quantity is extracted.
[0607] Step 4:
[0608] The server generates a confirmation message based on the order information. For example, it might prepare a message as text such as, "Would you like to order two of item XX?"
[0609] Step 5:
[0610] The server uses an emotion engine to perform sentiment analysis on the voice data. The emotion engine determines the user's emotions from the tone of voice and the speed of speech, detecting patterns such as joy and anger.
[0611] Step 6:
[0612] The server adjusts the confirmation message based on the results of the sentiment analysis. For example, if anger is detected, the message may be modified to include the phrase, "We will respond as soon as possible."
[0613] Step 7:
[0614] The revised confirmation message is sent to the speech synthesis engine to generate natural-sounding speech.
[0615] Step 8:
[0616] The terminal plays the generated confirmation audio to the user. The user listens to the confirmation audio and responds with "yes" if it is correct, or "no" if correction is needed.
[0617] Step 9:
[0618] The terminal sends the user's response back to the server as audio data, which the server then analyzes.
[0619] Step 10:
[0620] The server uses voice pattern analysis to initiate a process to verify the user's identity. This confirms that the voice belongs to an authenticated user.
[0621] Step 11:
[0622] Once identity verification is complete, the user selects a payment method by voice and provides the required payment information. The terminal encrypts this information and sends it to the server.
[0623] Step 12:
[0624] The server processes the encrypted payment information and completes the transaction through a secure payment gateway. Once completion is confirmed, the terminal notifies the user that the order is complete.
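Step 6 of the flow above, in which the confirmation message is adjusted according to the detected emotion, can be sketched as a simple lookup. The two added phrases are the ones given in this description. The emotion labels used as keys, and the names `REASSURANCE` and `adjust_confirmation`, are assumptions. The emotion label is assumed to be supplied by the emotion engine, which is not modeled here.

```python
# Phrases quoted in the description; the label strings are assumed.
REASSURANCE = {
    "anger": "We will respond as soon as possible.",
    "frustration": "We will process your order quickly, so please rest assured.",
}


def adjust_confirmation(message: str, emotion: str) -> str:
    """Append a reassuring phrase when the detected emotion calls for one."""
    extra = REASSURANCE.get(emotion.lower())
    # Emotions without an entry (for example, joy) leave the message unchanged.
    return f"{message} {extra}" if extra else message
```

The adjusted text would then be passed to the speech synthesis engine as in Step 7.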
[0625] (Example 2)
[0626] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0627] Conventional voice-input ordering systems failed to provide flexible responses that took into account the user's emotional state, thus failing to improve the user experience. Furthermore, there were challenges in verifying user identity and ensuring the security of payment information. These problems need to be solved to provide a safer and more comfortable ordering experience.
[0628] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0629] In this invention, the server includes an information processing device for receiving voice input, a computing device for analyzing the voice input based on a conversion device and extracting information related to the order, a voice generation means for generating a confirmation voice based on the information related to the order and presenting it to the user, an emotion analysis means for analyzing the emotion of the voice and adjusting the response, and a means for confirming the order based on the user's response. This enables flexible responses that take into account the user's emotions, as well as secure identity verification and payment processing.
[0630] "Voice input" is the process by which electronic devices acquire a user's speech as digital data.
[0631] An "information processing device" is a device that has the function of receiving voice input and is used for data collection and communication.
[0632] A "conversion device" is a technology that has a mechanism for converting audio data into text data.
[0633] A "computing device" is a device that analyzes text data extracted from speech and has the function of organizing and extracting information related to orders.
[0634] "Voice generation means" refers to a means of generating natural-sounding speech from text data and presenting it to the user.
[0635] "Emotion analysis means for analyzing the emotion of the voice and adjusting the response" refers to a means of determining the emotional state of a user based on the tone, speed, and other features of their voice, and adjusting the content of the response and the voice accordingly.
[0636] "Means for confirming the order" refers to a means of confirming an order based on the user's response and completing the final order process.
[0637] "Identity verification means" refers to a means of confirming the user's identity through voice pattern analysis and guaranteeing the legitimacy of the order.
[0638] "Processing means for encrypting and securely processing payment information" refers to a means equipped with a mechanism for encrypting data in order to securely manage and process users' payment information.
[0639] To carry out this invention, the following configuration is necessary. The system mainly includes a terminal device, a server device, a speech synthesis device, and emotion analysis means. First, the terminal device functions as an information processing device for acquiring voice input from the user. An input device such as a microphone is integrated into the terminal device, which converts the voice into a digital signal and transmits it to the server device via the network.
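The conversion of captured voice into a digital signal can be illustrated with a minimal sketch. It assumes 16-bit mono PCM at 16 kHz, which the specification does not state, and the helper names `to_pcm16` and `to_wav_bytes` are hypothetical. A synthetic tone stands in for microphone input.

```python
import io
import math
import wave

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz (not given in the specification)


def to_pcm16(samples):
    """Quantize float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    out = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against overdriven input
        out += int(round(clipped * 32767)).to_bytes(2, "little", signed=True)
    return bytes(out)


def to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Wrap the PCM data in a WAV container held in memory, ready to transmit."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(to_pcm16(samples))
    return buf.getvalue()


# A 0.1 s, 440 Hz tone standing in for the user's voice.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(1600)]
payload = to_wav_bytes(tone)
```

The resulting `payload` is what the terminal device would send to the server device over the network in this sketch.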
[0640] The server device acts as a conversion device, converting audio data into text data using a speech recognition engine. DeepSpeech or similar high-precision speech recognition technologies are used for this conversion. From the extracted text data, the server device acts as a computing device, analyzing and organizing order information such as product names and quantities.
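The extraction of order information from recognized text can be sketched as follows. This is a simple keyword and regular-expression approach under stated assumptions, not the method of the disclosure: the catalog `CATALOG`, the table `NUMBER_WORDS`, and the function `extract_order` are all hypothetical, and any number in the utterance is treated as the quantity.

```python
import re
from dataclasses import dataclass

# Hypothetical product catalog; a real system would look products up in a database.
CATALOG = ("model XYZ", "vacuum cleaner", "sneakers")
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


@dataclass
class OrderInfo:
    product: str
    quantity: int


def extract_order(text):
    """Return the first catalog product found in the text and a quantity (default 1)."""
    lowered = text.lower()
    product = next((p for p in CATALOG if p.lower() in lowered), None)
    if product is None:
        return None  # no known product was mentioned
    quantity = 1
    # Accept either digits ("2") or a small set of number words ("two").
    match = re.search(r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b", lowered)
    if match:
        token = match.group(1)
        quantity = int(token) if token.isdigit() else NUMBER_WORDS[token]
    return OrderInfo(product, quantity)


order = extract_order("I would like to order the latest smartphone, model XYZ.")
```

Because of the word-boundary anchors, the "one" inside "smartphone" is not mistaken for a quantity. A production system would need to distinguish quantities from other numbers such as sizes.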
[0641] Subsequently, the server uses a speech synthesis engine (TTS engine) to generate a confirmation voice based on the obtained order information. This process utilizes an AI model for natural language processing and generation. In particular, it analyzes the tone and speed of the voice through emotion analysis to capture the user's emotional state. Based on this analysis, the confirmation voice is adjusted.
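How the confirmation voice might be adjusted to the user's emotional state can be sketched with a rule-based example. The features, the thresholds (3.5 words per second, 220 Hz), and the emotion labels are illustrative assumptions. The disclosure only states that tone and speed are analyzed, and an actual emotion analysis means would more likely be a trained model.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    words_per_second: float  # speaking speed
    mean_pitch_hz: float     # rough proxy for tone


def estimate_emotion(features):
    """Map voice features to a coarse emotion label using assumed thresholds."""
    if features.words_per_second >= 3.5:
        return "hurried"
    if features.mean_pitch_hz >= 220.0:
        return "excited"
    return "neutral"


def confirmation_text(product, emotion):
    """Build the confirmation message, adjusted to the estimated emotion."""
    base = f"Do you want to order {product}?"
    if emotion == "hurried":
        return base + " We will process it quickly."
    if emotion == "excited":
        return "Great choice! " + base
    return base


emotion = estimate_emotion(VoiceFeatures(words_per_second=4.0, mean_pitch_hz=180.0))
message = confirmation_text("model XYZ", emotion)
```

The text produced here would then be passed to the speech synthesis engine to generate the confirmation voice.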
[0642] When the user confirms the order by voice, the server analyzes the response through the order confirmation means, and the order is confirmed. At this point, the identity verification means performs voice pattern analysis to verify the user's identity and guarantee the legitimacy of the order. Biometric authentication systems may also be integrated into this process.
[0643] Furthermore, payment information provided by users is securely processed using a method that encrypts and transmits payment information. The terminal device encrypts the information and sends it to the server, which completes the transaction through a secure gateway. Strong encryption technology is used to maintain data security.
[0644] Finally, the server also provides post-purchase feedback and incentives through sentiment analysis. This process ensures that users have a comfortable, emotionally responsive, and flexible purchasing experience.
[0645] For example, when a user wants to purchase a smartphone, the terminal device receives the utterance "I would like to order the latest smartphone, model XYZ." The voice data is sent to the server, and speech recognition yields the text data "model XYZ." The confirmation message "Do you want to order model XYZ?" is generated, and its wording and voice are adjusted as needed according to the result of analyzing the user's emotions.
[0646] An example of a prompt for a generative AI model is: "A user is passionately trying to order the latest XYZ smartphone from a TV shopping channel. Convert the voice data into text and generate an acknowledgment response that reflects the user's emotions."
[0647] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0648] Step 1:
[0649] The device acquires the user's voice input. The user states the product they want to purchase from the TV shopping program, and this speech is captured by the device's built-in microphone and converted into digital data. The result of this conversion is output as audio data produced by digital signal processing.
[0650] Step 2:
[0651] The terminal transmits the acquired audio data to the server. The audio data is securely transferred to the server over the network, and the communication is protected by an encryption protocol. The input is digital audio data, and the output is received by the server as audio data that can be recognized.
[0652] Step 3:
[0653] The server converts the received audio data into text using a speech recognition engine. The speech recognition engine uses a deep learning model to analyze the audio and output the order details as text data. In this process, important order information such as product name and quantity is extracted.
[0654] Step 4:
[0655] The server generates a confirmation voice based on the order information obtained from the text. It uses a speech synthesis engine to convert the text back into speech and create the confirmation message. The input is text data about the order, and the output is the voice data required for confirmation.
[0656] Step 5:
[0657] The server uses an emotion analysis engine to analyze the tone and speed of the user's voice and evaluate the user's emotional state. For example, if the user's voice contains excitement or frustration, the confirmation voice is adjusted according to that emotion. The input is the initial voice data, and the output is the confirmation voice adjusted according to the emotion.
[0658] Step 6:
[0659] The user receives the confirmation voice message from the server and responds. The user confirms the order with a spoken response such as "yes," which the terminal records. The recorded voice data is then sent to the server.
[0660] Step 7:
[0661] The server analyzes the user's voice response and verifies their identity through voice pattern analysis. This process involves analyzing the frequency and rhythm of the voice and comparing it to past data. The input is the user's voice response, and the output is data confirming the legitimacy of the order.
[0662] Step 8:
[0663] Once the server verifies the user's identity, it begins processing the payment. The payment information provided by the user is encrypted and transmitted to the server via a secure route, where the transaction is completed using the payment gateway. The input is encrypted payment information, and the output is a confirmation of transaction completion.
[0664] (Application Example 2)
[0665] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0666] Conventional voice-based ordering systems often provide uniform responses without considering the user's emotional state, potentially leading to decreased user satisfaction. Furthermore, insufficient consideration is given to the security of voice-based identity verification and payment. Therefore, a system is needed that can provide flexible responses tailored to the user's emotional state while ensuring secure identity verification and payment.
[0667] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0668] In this invention, the server includes a calculation means for analyzing and extracting order information based on speech recognition means, a response generation means for generating a response corresponding to the emotional state, and an emotion analysis means for estimating the emotional state. This makes it possible to provide appropriate responses according to the user's emotional state, thereby improving security.
[0669] "Voice input" is a method by which users communicate information through their voice.
[0670] An "information processing device" is a device for receiving and processing audio data.
[0671] "Speech recognition means" refers to technology that analyzes speech data and converts it into text data.
[0672] A "calculation means" is a means of extracting the necessary information from the analyzed data.
[0673] A "speech synthesis means" is technology for converting text data into speech data and outputting it.
[0674] An "emotion analysis means" is technology that estimates a user's emotional state from voice data.
[0675] A "response generation means" is technology for generating an appropriate response based on the emotional state.
[0676] An "order confirmation means" is technology for confirming an order based on the user's response.
[0677] A "verification means" is technology that analyzes a user's voice patterns to determine their legitimacy.
[0678] A "payment transaction means" is technology for securely processing a user's payment information.
[0679] The system for implementing this invention uses a smartphone or smart glasses as an information processing device to receive voice input. When a user wants to order a product, they input the necessary information by voice into these devices. The information processing device receives the voice data and transmits it to a computing device. The server uses voice recognition means to convert the voice data into text and extract the order information.
[0680] At this time, the server uses the emotion analysis means to analyze the tone, speed, and rhythm of the user's voice and estimate their emotional state. For example, if the user appears to be in a hurry, the server can generate a flexible response suited to that emotional state. The response generation means generates the confirmation message, which is then presented to the user via the speech synthesis means.
[0681] When the user responds with "yes" by voice, the server confirms the order through the order confirmation means. The verification means then analyzes the user's voice pattern and verifies their identity to ensure security. Finally, the payment transaction means encrypts the user's payment information and processes it securely.
[0682] For example, if a user speaks into their smartphone saying, "I'd like to order new sneakers, size 9," the system recognizes this as order information and, based on sentiment analysis, generates a reassuring response such as, "If you're in a hurry, we'll process it quickly."
[0683] An example of a prompt for a generative AI model is: "When the user provides voice input, convert it to text using speech recognition, analyze the sentiment using the sentiment analysis engine, and then create an appropriate response based on the results."
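As an illustration of how such a prompt might be assembled once the transcript and the emotion estimate are available, the following sketch builds the prompt text from the two inputs. The function name `build_prompt` and the exact wording are hypothetical. Collapsing whitespace in the transcript is an added precaution so that the user's words stay on a single quoted line.

```python
def build_prompt(transcript, emotion):
    """Assemble a prompt for the generative AI model from recognized text and emotion."""
    # Collapse newlines and repeated spaces so the transcript stays on one line.
    clean = " ".join(transcript.split())
    return (
        "You are handling a voice order from a TV shopping viewer.\n"
        f'Recognized speech: "{clean}"\n'
        f"Estimated emotional state: {emotion}\n"
        "Create an appropriate confirmation response based on these results."
    )


prompt = build_prompt("I'd like to order\nnew sneakers, size 9", "in a hurry")
```

The returned string would be passed to the generative AI model together with any inference data.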
[0684] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0685] Step 1:
[0686] The terminal receives voice input from the user. When the user speaks about the product they want to purchase, that voice data is recorded in the terminal. This voice data is then sent for subsequent processing.
[0687] Step 2:
[0688] The server converts the received audio data into text data using speech recognition technology. The speech recognition engine analyzes the waveform of the voice and extracts order information such as product names and quantities as text. As a result, the order information is output.
[0689] Step 3:
[0690] The server uses the emotion analysis means to estimate the user's emotional state from the received audio data, together with the converted text. It analyzes the tone, tempo, and intonation of the audio data to identify emotions, such as whether the user is feeling urgency or satisfaction. The analysis result is output as the emotional state.
[0691] Step 4:
[0692] The server generates a confirmation voice message using the response generation means, based on the order information and the emotional state. The message incorporates a flexible response that depends on the emotional state (e.g., "We will respond promptly" if the user is in a hurry). The message is output by the speech synthesis means.
[0693] Step 5:
[0694] The user responds to the confirmation voice presented by the server with "yes" or "no." The terminal receives this response as audio data and sends it to the server for final confirmation.
[0695] Step 6:
[0696] The server analyzes the received response using the order confirmation means. When the user has responded with "yes," the order is confirmed and the process proceeds to the next step. The output of this step is the confirmed order information.
[0697] Step 7:
[0698] The server uses the verification means to analyze the user's voice pattern and verify their identity. To ensure security, the voice characteristics are compared against a database to determine legitimacy. The output of this step is the determination of whether the user is legitimate.
[0699] Step 8:
[0700] The server uses the payment transaction means to encrypt and securely process the payment information provided by the user. The transaction is carried out through the payment gateway, and the completed payment information is output.
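The handling of the user's "yes" or "no" in Steps 5 and 6 above can be sketched as a small classifier over the recognized text of the response. The word lists and the function name `classify_response` are assumptions made for illustration. Treating mixed or empty responses as "unclear" is a conservative design choice so that an order is never confirmed on ambiguous input.

```python
import re

# Assumed vocabularies; a deployed system would use a fuller intent model.
AFFIRMATIVE = {"yes", "yeah", "correct", "ok", "okay"}
NEGATIVE = {"no", "nope", "cancel", "wrong"}


def classify_response(text):
    """Classify a recognized response as 'confirm', 'reject', or 'unclear'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_yes = bool(words & AFFIRMATIVE)
    has_no = bool(words & NEGATIVE)
    if has_yes and not has_no:
        return "confirm"
    if has_no and not has_yes:
        return "reject"
    return "unclear"  # mixed or empty input: ask the user again


result = classify_response("Yes, please.")
```

Only a "confirm" result would let the flow proceed to identity verification in Step 7.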
[0701] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0702] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0703] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0704] [Fourth Embodiment]
[0705] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0706] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0707] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0708] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0709] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0710] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0711] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0712] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
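A minimal sketch of how an emotion label might be mapped to commands for the controlled object 443 is shown below. The emotion names, LED colors, and arm angles are all hypothetical values chosen for illustration. The disclosure states only that the eye LEDs and the motors can express emotions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    led_rgb: tuple       # illumination state of the eye LEDs as (R, G, B)
    arm_angle_deg: int   # target angle for the arm motors


# Hypothetical lookup table from emotion label to expression.
EXPRESSIONS = {
    "joy": Expression((255, 200, 0), 60),
    "concern": Expression((0, 80, 255), 10),
    "neutral": Expression((255, 255, 255), 0),
}


def expression_for(emotion):
    """Return the expression for an emotion, falling back to neutral if unknown."""
    return EXPRESSIONS.get(emotion, EXPRESSIONS["neutral"])


command = expression_for("joy")
```

The fallback to "neutral" keeps the robot in a defined posture when the emotion label is not recognized.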
[0713] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0714] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0715] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0716] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0717] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0718] The system according to the present invention is designed to streamline and secure the voice-based ordering process through television shopping. This system integrates voice recognition, voice confirmation generation, identity verification, and secure payment processing.
[0719] The system consists of a terminal, a server, and a speech synthesis device. First, the terminal acquires the user's voice. When the user speaks and says they want to order a specific product, that voice data is sent to the server.
[0720] The server analyzes the received audio data based on its speech recognition engine and extracts order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0721] Once the user confirms that there are no problems with their order, the server uses voice pattern analysis technology to verify the user's identity. This verification is performed by comparing the user's voice with previously registered voice data.
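The comparison with previously registered voice data can be illustrated by comparing feature vectors with cosine similarity. This is one common technique and is named here as an assumption, since the disclosure does not specify the comparison method. The threshold of 0.85 and the function names are hypothetical, and the extraction of feature vectors from audio is outside the scope of the sketch.

```python
import math

THRESHOLD = 0.85  # hypothetical acceptance threshold


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no voice information
    return dot / (norm_a * norm_b)


def verify_speaker(enrolled, sample, threshold=THRESHOLD):
    """Accept the speaker when the sample is close enough to the enrolled voiceprint."""
    return cosine_similarity(enrolled, sample) >= threshold


accepted = verify_speaker([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Raising the threshold reduces false acceptances at the cost of rejecting more legitimate users, so the value would need tuning on real enrollment data.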
[0722] Once identity verification is complete, the server proceeds with payment processing. Payment information is securely processed based on the payment method specified by the user. During this process, all data is encrypted and protected from unauthorized access.
[0723] As a concrete example, consider a case where a user orders a specific vacuum cleaner from a TV shopping channel. The user says, "I want to order a vacuum cleaner," using their voice. The terminal receives this voice and sends it to the server. The server analyzes the voice and extracts the keyword "vacuum cleaner." Next, the server creates a confirmation voice message, "Do you want to order the XX vacuum cleaner?", and presents it to the user through a speech synthesis device.
[0724] If the user responds with "yes," the server will reconfirm the order details and prompt the user for identity verification and payment method selection. For example, the user may be asked to enter their credit card information by voice. The server then completes the payment using encrypted information.
[0725] This system allows users to complete orders safely and efficiently without complicated procedures. The voice-activated interface provides a user-friendly shopping experience, especially for the elderly and those unfamiliar with technology.
[0726] The following describes the processing flow.
[0727] Step 1:
[0728] The device receives voice input from the user. The user orders products by voice while watching a TV shopping program, and the device records the speech via its microphone.
[0729] Step 2:
[0730] The device digitizes the recorded audio data and sends it to the server via the internet. Since the data is transferred in real time, processing can begin immediately.
[0731] Step 3:
[0732] The server inputs the received audio data into a speech recognition engine, which converts it from speech to text. In this process, keywords such as product names and quantities are extracted and organized as order information.
[0733] Step 4:
[0734] The server generates text to produce a confirmation voice message based on the order information. For example, it might create a confirmation message such as, "Would you like to order two of item XX?"
[0735] Step 5:
[0736] The server sends the generated text to the speech synthesis engine, which then generates confirmation audio data.
[0737] Step 6:
[0738] The terminal receives confirmation audio data from the speech synthesis engine and plays it for the user. At this point, the user is asked to confirm the order details.
[0739] Step 7:
[0740] The user listens to a confirmation voice message and responds with "yes" if there are no problems with the order, or "no" if cancellation or modification is needed.
[0741] Step 8:
[0742] The terminal sends the user's voice response back to the server, which analyzes the audio to understand the user's intent. If the user responds with "yes," the server proceeds to prepare to confirm the order.
[0743] Step 9:
[0744] The server uses voice pattern analysis to verify the user's identity by comparing their voice to pre-registered voice data.
[0745] Step 10:
[0746] The user specifies their payment method by voice and enters credit card information and other details as needed. The terminal encrypts this information and sends it to the server.
[0747] Step 11:
[0748] The server securely processes payment information and verifies that the purchase process has been completed successfully. After completion, it notifies the user of the order completion via the terminal.
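The eleven steps above can be summarized as a small state machine on the server side. The sketch below is one possible arrangement under stated assumptions: the state names, the event names, and the failure transitions (identity rejected, payment failed) are hypothetical, since the disclosure describes only the successful path and the user's "no" response.

```python
from enum import Enum, auto


class State(Enum):
    AWAITING_ORDER = auto()         # Steps 1-3: voice received and order extracted
    AWAITING_CONFIRMATION = auto()  # Steps 4-8: confirmation played, reply awaited
    AWAITING_VERIFICATION = auto()  # Step 9: voice pattern analysis
    AWAITING_PAYMENT = auto()       # Steps 10-11: payment processing
    COMPLETED = auto()
    CANCELLED = auto()


# Allowed transitions; any other (state, event) pair is rejected.
TRANSITIONS = {
    (State.AWAITING_ORDER, "order_extracted"): State.AWAITING_CONFIRMATION,
    (State.AWAITING_CONFIRMATION, "user_said_yes"): State.AWAITING_VERIFICATION,
    (State.AWAITING_CONFIRMATION, "user_said_no"): State.CANCELLED,
    (State.AWAITING_VERIFICATION, "identity_verified"): State.AWAITING_PAYMENT,
    (State.AWAITING_VERIFICATION, "identity_rejected"): State.CANCELLED,
    (State.AWAITING_PAYMENT, "payment_succeeded"): State.COMPLETED,
    (State.AWAITING_PAYMENT, "payment_failed"): State.CANCELLED,
}


def advance(state, event):
    """Return the next state, or raise ValueError for an out-of-order event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.name}") from None


def run(events):
    """Replay a sequence of events from the initial state."""
    state = State.AWAITING_ORDER
    for event in events:
        state = advance(state, event)
    return state


final = run(["order_extracted", "user_said_yes", "identity_verified", "payment_succeeded"])
```

Rejecting out-of-order events means that payment cannot be processed before identity verification has succeeded.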
[0749] (Example 1)
[0750] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0751] Traditional online ordering systems often require users to perform complex procedures and input tasks, making them difficult to use, especially for the elderly and those unfamiliar with technology. Furthermore, identity verification and payment processing may lack sufficient security and efficiency. There is a need to address these challenges and provide a system that allows anyone to complete orders easily and securely.
[0752] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0753] In this invention, the server includes a receiving means for acquiring voice input, a processing means for analyzing the voice input and extracting information, and an output means for generating and presenting confirmation voice based on the information. This makes it possible to place orders simply and securely using voice.
[0754] A "receiving means for acquiring voice input" is a device that collects voice data from a user and converts it into a format that can be processed as an electronic signal.
[0755] The "processing means for analyzing the voice input and extracting information" refers to a device that uses voice recognition technology to analyze the voice signal and identify order information and other necessary information.
[0756] "Output means for generating and presenting confirmation audio based on the aforementioned information" refers to a device for generating confirmation audio based on extracted information and presenting it audibly to the user.
[0757] A "confirmation means for confirming an order based on user responses" is a device that finalizes and records the order details based on the responses provided by the user.
[0758] "An authentication method that uses voice analysis to identify and authenticate users" refers to a device that analyzes the characteristics of a voice to individually identify a user and verify their identity.
[0759] A "transaction method that encrypts and processes user transaction information to protect it" is a device that securely processes user payment information using encryption technology and protects it from unauthorized access.
[0760] This invention provides an online ordering system using voice commands, enabling users to easily and securely order products using only their voice. An embodiment thereof is described below.
[0761] In this system, the terminal is equipped with a microphone to capture user speech. When the user verbally states that they wish to order a product, the audio data is converted into a digital format by the terminal and transmitted to the server via the network. The specific software used includes a "speech recognition engine" for analyzing the audio data.
[0762] The server analyzes the received audio data using a "speech recognition engine." This analysis converts the audio signal into text data and extracts order information such as product name and quantity. This information is necessary to confirm the order details with the user.
[0763] The server then uses "speech synthesis software" to generate a confirmation voice message for the user. This confirmation voice message is used to allow the user to confirm the order details based on the generated data. For example, it might say, "Would you like to order two of the XX vacuum cleaners?"
[0764] The user responds to the confirmation voice message. The response is sent back to the server via the terminal. The server verifies the user's identity using voice pattern analysis technology. After authentication is complete, the payment information is securely processed using encryption technology according to the user's instructions.
[0765] As a concrete example, when a user voice-inputs "I want to order a vacuum cleaner," the terminal sends the voice data to the server, which analyzes the data and generates a confirmation voice message asking the user, "Are you sure you want to order a vacuum cleaner?" The order is then confirmed when the user answers "Yes."
[0766] An example of a prompt used as input for the AI model in this system is, "Please tell me about the voice-based product ordering system." This example demonstrates how to achieve an intuitive and easy-to-use user experience via a voice interface.
[0767] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0768] Step 1:
[0769] The device acquires voice from the user. The user directly speaks about their purchase request, for example, saying, "I want to order a vacuum cleaner." The device receives this voice via its microphone and stores it as audio data. The input is a raw audio signal, and the output is digital audio data.
[0770] Step 2:
[0771] The terminal transmits the acquired digital audio data to the server. Using a stable internet connection, the audio files are uploaded to the server. The input is the audio data within the terminal, and the output is the audio data transferred to the server.
[0772] Step 3:
[0773] The server analyzes the received audio data based on its speech recognition engine. Specifically, it converts the audio signal into text and extracts order information (product name and quantity). The input is audio data stored on the server, and the output is order information in text format.
[0774] Step 4:
[0775] The server generates a confirmation voice based on the order information. Using speech synthesis software, it creates a voice message for the user to reconfirm the order details. For example, a confirmation message such as "Do you want to order the XX vacuum cleaner?" is generated. The input is order information in text format, and the output is a confirmation message in digital voice.
[0776] Step 5:
[0777] The terminal presents the user with a confirmation audio received from the server. The user then plays the message through headphones or speakers to confirm its contents. The input is the confirmation audio data from the server, and the output is the user's auditory confirmation.
[0778] Step 6:
[0779] The user listens to a confirmation voice message and then responds. For example, they might respond with "yes," and the response is sent to the server via their device. The input is the user's response to the confirmation voice message, and the output is the response voice data.
[0780] Step 7:
[0781] The server receives the user's voice response and performs identity verification using voice pattern analysis. It compares this to previously registered voice data to check for a match. The input is the user's voice response data, and the output is the result of the identity verification.
[0782] Step 8:
[0783] After identity verification is complete, the server accepts the payment method specified by the user and processes the payment. Transaction information is encrypted and processed securely. The input is authenticated payment information, and the output is a confirmation of transaction completion.
[0784] (Application Example 1)
[0785] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0786] Conventional voice-based ordering systems have struggled to accurately extract order information from user voices and to confirm orders safely and quickly. Ensuring privacy while processing payments was also a critical challenge. Furthermore, there is a need for a voice interface that is simple and efficient to use, even for elderly users and those unfamiliar with technology.
[0787] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0788] In this invention, the server includes a computing device that analyzes voice input using a recognition mechanism and extracts information related to the order; a voice generation means that creates a confirmation voice based on the order information and presents it to the user; an order confirmation means that confirms the order based on the user's response; and a processing means that receives the specification of the payment method by voice and securely processes the payment information. As a result, users can effectively order products using voice commands and make payments securely while maintaining their privacy.
[0789] "Voice input" is the operation of receiving the voice spoken by the user and processing it as digital information.
[0790] An "information processing device" is a device that acquires voice input and transmits that information to other devices.
[0791] A "recognition mechanism" is an algorithm or software used to analyze audio data and extract meaningful information.
[0792] A "computing device" is a device that analyzes received data and derives specific information such as order details and quantities.
[0793] "Voice generation means" refers to a technology or device for converting text information into speech and presenting it to the user.
[0794] "Order confirmation means" refers to a process for managing and confirming the completion of an order based on the user's instructions.
[0795] "Payment information" refers to all information related to payments that a buyer uses to purchase goods or services.
[0796] "Processing means" refers to a technology or system for safely and efficiently manipulating, calculating, or storing specific information.
[0797] The system for carrying out this invention consists of the following main components: an information processing device for receiving voice input, a computing device including a recognition mechanism, a voice generation means, an order confirmation means, and a means for processing payment information.
[0798] First, the user places an order for a product by speaking to the information processing device. The voice data is sent to the computing device and transcribed by speech recognition software such as the Google Cloud Speech-to-Text API. The computing device analyzes the voice data and extracts information about the order. This analysis includes a process to derive specific order details, such as the color and quantity of sneakers.
[0799] Next, the voice generation means (for example, Amazon Polly) presents the extracted order information to the user as a confirmation voice. This allows the user to verbally confirm that the order details are correct. After the user confirms the order details, the order confirmation means finalizes the order. At this point, the user specifies the payment method by voice.
[0800] Subsequently, payment information is securely processed using encryption technologies such as OpenSSL, according to the user's specifications. This ensures privacy while enabling secure transactions.
[0801] As a concrete example, suppose a user says, "I want to order light blue sneakers." The computing device processes this audio and generates a confirmation voice message asking, "Do you want to order light blue sneakers?" If the user responds with "Yes," the order confirmation means confirms the order and proceeds to the next step.
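The concrete example above can be illustrated with a minimal sketch. It assumes the speech has already been transcribed to text, and it uses a simple keyword match as a stand-in for the recognition mechanism; the vocabularies and the names `Order`, `extract_order`, and `confirmation_text` are hypothetical and do not appear in the original disclosure.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies; a real system would consult a product catalog.
# Longer color names come first so "light blue" is matched before "blue".
COLORS = ("light blue", "blue", "red", "black", "white")
PRODUCTS = ("sneakers", "smartphone")
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}


@dataclass
class Order:
    product: str
    color: Optional[str]
    quantity: int


def extract_order(transcript: str) -> Optional[Order]:
    """Derive product, color, and quantity from a transcribed utterance."""
    text = transcript.lower()
    product = next((p for p in PRODUCTS if p in text), None)
    if product is None:
        return None  # nothing recognizable was ordered
    color = next((c for c in COLORS if c in text), None)
    quantity = 1  # assumed default when no quantity is spoken
    digits = re.search(r"\b(\d+)\b", text)
    if digits:
        quantity = int(digits.group(1))
    else:
        for word, value in NUMBER_WORDS.items():
            if re.search(rf"\b{word}\b", text):
                quantity = value
                break
    return Order(product, color, quantity)


def confirmation_text(order: Order) -> str:
    """Build the text that the voice generation means would speak."""
    description = f"{order.color} {order.product}" if order.color else order.product
    if order.quantity == 1:
        return f"Do you want to order {description}?"
    return f"Do you want to order {order.quantity} {description}?"


order = extract_order("I want to order light blue sneakers")
print(confirmation_text(order))
```

In practice the keyword match would be replaced by the generative AI model or recognition mechanism described above; the sketch only shows the shape of the input and output.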
[0802] An example of a prompt message would be, "Please tell me how to use speech recognition technology to analyze a user's voice saying, 'I want to order light blue sneakers,' and process it as an order."
[0803] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0804] Step 1:
[0805] The terminal receives the user's voice input. It acquires the voice data of the user's utterance, "I want to order light blue sneakers," and sends this data to the server, where speech recognition is performed using the Google Cloud Speech-to-Text API. The input is the user's voice, and the output is digitized voice data.
[0806] Step 2:
[0807] The server analyzes the received audio data. The computing device processes the audio data and uses a generative AI model to extract order information (such as product name and quantity). The input is digitized audio data, and the output is the extracted order information (e.g., sneaker color and quantity).
[0808] Step 3:
[0809] The server generates a confirmation voice based on the order information. Using a voice generation method such as Amazon Polly, it creates a confirmation message, "Do you want to order light blue sneakers?", and sends it to the terminal. The input is the order information, and the output is a synthesized confirmation voice.
[0810] Step 4:
[0811] The terminal presents the user with a synthesized voice confirmation message. The user listens to it and confirms the order details. The input is the synthesized voice message, and the output is the user's confirmation response (e.g., "Yes").
[0812] Step 5:
[0813] The server receives the user's response and confirms the order. The order confirmation means uses speech recognition again to analyze the response and completes the order confirmation process. The input is the user's response, and the output is the confirmed order.
[0814] Step 6:
[0815] The user specifies the payment method by voice. The terminal sends this voice to the server, and the computing device performs voice recognition to extract the payment information. The input is the user's voice instruction for the payment method, and the output is the extracted payment information.
[0816] Step 7:
[0817] The server securely processes the payment information. The processing means encrypts the payment information using OpenSSL, ensuring secure transactions. The input is the payment information, and the output is the encrypted payment information.
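Steps 4 through 7 above amount to a small dialogue state machine: the order is confirmed only on an affirmative response, and payment handling starts only after a payment method has been recognized. The following is a minimal sketch under that reading. The class `OrderSession`, the stage names, the accepted phrases, and the list of payment methods are all assumptions made for illustration, and the input is assumed to be already transcribed text.

```python
from enum import Enum, auto


class Stage(Enum):
    AWAITING_CONFIRMATION = auto()
    AWAITING_PAYMENT_METHOD = auto()
    READY_FOR_PAYMENT = auto()
    CANCELLED = auto()


# Hypothetical phrase lists; a deployed system would be far more tolerant.
AFFIRMATIVE = {"yes", "yes please", "correct"}
NEGATIVE = {"no", "no thanks", "cancel"}
PAYMENT_METHODS = ("credit card", "bank transfer", "cash on delivery")


def normalize(transcript: str) -> str:
    """Lowercase the transcript and drop surrounding spaces and final punctuation."""
    return transcript.strip().lower().rstrip(".!?")


class OrderSession:
    def __init__(self, order_summary: str):
        self.order_summary = order_summary
        self.stage = Stage.AWAITING_CONFIRMATION
        self.payment_method = None

    def handle(self, transcript: str) -> str:
        """Advance the session by one user utterance and return the reply text."""
        text = normalize(transcript)
        if self.stage is Stage.AWAITING_CONFIRMATION:
            if text in AFFIRMATIVE:
                self.stage = Stage.AWAITING_PAYMENT_METHOD
                return "Your order is confirmed. How would you like to pay?"
            if text in NEGATIVE:
                self.stage = Stage.CANCELLED
                return "The order has been cancelled."
            return f"Sorry, do you want to order {self.order_summary}?"
        if self.stage is Stage.AWAITING_PAYMENT_METHOD:
            method = next((m for m in PAYMENT_METHODS if m in text), None)
            if method is None:
                return "Please tell me a payment method, for example credit card."
            self.payment_method = method
            self.stage = Stage.READY_FOR_PAYMENT
            return f"Payment by {method} selected."
        return "No further input is expected."


session = OrderSession("light blue sneakers")
print(session.handle("Yes."))
print(session.handle("I will pay by credit card"))
```

Encryption of the payment information (Step 7) is deliberately left out of the sketch; it would be applied once the session reaches the final stage.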
[0818] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0819] The system according to the present invention streamlines the voice-based ordering process in television shopping and provides responses that take into account the user's emotional state, thereby realizing a more comfortable and safe purchasing experience. This system consists of a terminal, a server, a voice synthesis device, and an emotion engine.
[0820] First, the terminal acquires the user's speech as audio data. When a user orders a specific product from a TV shopping program, they state their intention verbally. The acquired audio data is then sent to the server.
[0821] The server uses a speech recognition engine to convert this audio data into text and extract order information such as product name and quantity. Based on this information, the server generates a confirmation voice message and presents it to the user via a speech synthesizer.
[0822] In this process, the server is equipped with an emotion engine that analyzes the emotions contained in the user's voice. The emotion engine can estimate the user's emotional state from factors such as tone, speed, and rhythm of the voice. For example, if excitement or irritation is detected, it will generate a corresponding confirmation voice in a softer tone.
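As a rough illustration of how tone, speed, and rhythm might be mapped to an emotional state, the following sketch applies fixed thresholds to a few prosodic features. The feature set, the threshold values, and the names `Prosody` and `estimate_emotion` are assumptions for illustration only; the disclosure does not specify how the emotion engine makes this estimate, and a trained model would normally replace these rules.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    words_per_minute: float  # speed of speech
    mean_pitch_hz: float     # overall tone
    pitch_std_hz: float      # variation in tone
    pause_ratio: float       # fraction of the utterance that is silence (rhythm)


def estimate_emotion(p: Prosody, baseline_pitch_hz: float = 180.0) -> str:
    """Classify an utterance as excited, irritated, or neutral (illustrative rules)."""
    fast = p.words_per_minute > 180
    raised = p.mean_pitch_hz > baseline_pitch_hz * 1.15
    variable = p.pitch_std_hz > 40
    clipped = p.pause_ratio < 0.10
    if fast and raised and variable:
        return "excited"
    if fast and clipped and not variable:
        return "irritated"
    return "neutral"


def needs_softer_tone(emotion: str) -> bool:
    """Excitement or irritation triggers a softer confirmation voice."""
    return emotion in {"excited", "irritated"}


sample = Prosody(words_per_minute=200, mean_pitch_hz=185, pitch_std_hz=20, pause_ratio=0.05)
print(estimate_emotion(sample), needs_softer_tone(estimate_emotion(sample)))
```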
[0823] A confirmation voice message is presented to the user, and users who agree to its contents respond with "Yes." The server analyzes this response and performs voice pattern analysis to verify the user's identity. Once the user's identity has been verified through voice pattern analysis, the server proceeds to the payment process.
[0824] The user further specifies the payment method by voice and also verbally provides necessary information (e.g., credit card number). The terminal encrypts this information and sends it to the server. The server completes the transaction through a secure payment gateway. Even after payment is complete, the emotion engine may suggest feedback to the user, such as surveys or incentives for future use.
[0825] For example, when a user orders a new smartphone, the terminal receives a voice message saying, "I would like to order the latest smartphone, model XYZ." This voice message is sent to the server, where speech recognition converts it into text and extracts the keyword "model XYZ." The server generates a confirmation message asking, "Do you want to order model XYZ?" and, if the emotion engine detects frustration, adds a softer message saying, "We will process your order quickly, so please rest assured." After the user responds to confirm the order, the transaction is completed securely.
[0826] This system will not only facilitate smooth information exchange, but will also enable the provision of more satisfying services through flexible responses that cater to user emotions.
[0827] The following describes the processing flow.
[0828] Step 1:
[0829] A user watches a TV shopping program and says aloud, "I want to buy this product." The terminal receives this voice input, performs noise filtering, and converts it into digital audio data.
[0830] Step 2:
[0831] The terminal sends the digitally converted audio data to the server. This transmission takes place in real time via the internet.
[0832] Step 3:
[0833] The server processes the received audio data through a speech recognition engine and converts it into text data. Here, order information such as product name and desired quantity is extracted.
[0834] Step 4:
[0835] The server generates a confirmation message based on the order information. For example, it might prepare a message as text such as, "Would you like to order two of item XX?"
[0836] Step 5:
[0837] The server uses an emotion engine to perform sentiment analysis on the voice data. The emotion engine determines the user's emotions from the tone of voice and the speed of speech, detecting patterns such as joy and anger.
[0838] Step 6:
[0839] The server adjusts the confirmation message based on the results of the sentiment analysis. For example, if anger is detected, the message may be modified to include the phrase, "We will respond as soon as possible."
[0840] Step 7:
[0841] The revised confirmation message is sent to the speech synthesis engine to generate natural-sounding speech.
[0842] Step 8:
[0843] The terminal plays the generated confirmation audio to the user. The user listens to the confirmation audio and responds with "yes" if it is correct, or "no" if correction is needed.
[0844] Step 9:
[0845] The terminal sends the user's response back to the server as audio data, which the server then analyzes.
[0846] Step 10:
[0847] The server uses voice pattern analysis to initiate a process to verify the user's identity. This confirms that the voice belongs to an authenticated user.
[0848] Step 11:
[0849] Once identity verification is complete, the user selects a payment method by voice and provides the required payment information. The terminal encrypts this information and sends it to the server.
[0850] Step 12:
[0851] The server processes the encrypted payment information and completes the transaction through a secure payment gateway. Once completion is confirmed, the terminal presents an order completion message to the user.
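Step 1 of the flow above mentions noise filtering and conversion into digital audio data without naming a method. As one possible reading, the sketch below smooths the samples with a centered moving average (a very simple low-pass filter, used here only as a stand-in for real noise suppression) and then quantizes them to 16-bit little-endian PCM. The function names and the choice of filter are assumptions.

```python
import struct
from typing import List, Sequence


def moving_average(samples: Sequence[float], window: int = 3) -> List[float]:
    """Smooth samples with a centered moving average; edges use a shorter window."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


def to_pcm16(samples: Sequence[float]) -> bytes:
    """Clip samples to [-1.0, 1.0] and pack them as 16-bit little-endian PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


raw = [0.0, 0.9, -0.1, 0.8, 0.0]          # made-up microphone samples
pcm = to_pcm16(moving_average(raw))
print(len(pcm), "bytes of PCM data")
```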
[0852] (Example 2)
[0853] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0854] Conventional voice-input ordering systems failed to provide flexible responses that took into account the user's emotional state, thus failing to improve the user experience. Furthermore, there were challenges in verifying user identity and ensuring the security of payment information. These problems need to be solved to provide a safer and more comfortable ordering experience.
[0855] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0856] In this invention, the server includes an information processing device for receiving voice input, a computing device for analyzing the voice input based on a conversion device and extracting information related to the order, a voice generation means for generating a confirmation voice based on the information related to the order and presenting it to the user, an emotion analysis means for analyzing the emotion of the voice and adjusting the response, and a means for confirming the order based on the user's response. This enables flexible responses that take into account the user's emotions, as well as secure identity verification and payment processing.
[0857] "Voice input" is the process by which electronic devices acquire a user's speech as digital data.
[0858] An "information processing device" is a device that has the function of receiving voice input and is used for data collection and communication.
[0859] A "conversion device" is a technology that has a mechanism for converting audio data into text data.
[0860] A "computing device" is a device that analyzes text data extracted from speech and has the function of organizing and extracting information related to orders.
[0861] "Voice generation means" refers to a means of generating natural-sounding speech from text data and presenting it to the user.
[0862] "Emotion analysis means for analyzing the emotion of the voice and adjusting the response" refers to a means of determining the emotional state of a user based on their voice tone, speed, and the like, and adjusting the content of the response and the voice accordingly.
[0863] "Means for confirming the order" refers to the means of confirming an order based on the user's response and completing the final order process.
[0864] "Identity verification means" refers to a means of confirming the user's identity through voice pattern analysis and guaranteeing the legitimacy of the order.
[0865] "Processing means for encrypting and securely processing payment information" refers to a means equipped with a mechanism for encrypting data in order to securely manage and process users' payment information.
[0866] To carry out this invention, the following configuration is necessary. The system mainly includes a terminal device, a server device, a speech synthesis device, and emotion analysis means. First, the terminal device functions as an information processing device for acquiring voice input from the user. An input device such as a microphone is integrated into the terminal device, which converts the voice into a digital signal and transmits it to the server device via the network.
[0867] The server device acts as a conversion device, converting audio data into text data using a speech recognition engine. DeepSpeech or similar high-precision speech recognition technologies are used for this conversion. From the extracted text data, the server device acts as a computing device, analyzing and organizing order information such as product names and quantities.
[0868] Subsequently, the server uses a speech synthesis engine (TTS engine) to generate a confirmation voice based on the obtained order information. This process utilizes an AI model for natural language processing and generation. In particular, it analyzes the tone and speed of the voice through emotion analysis to capture the user's emotional state. Based on this analysis, the confirmation voice is adjusted.
[0869] When a user confirms an order by voice, the response is analyzed by the server and the order is confirmed by the means for confirming the order. At this point, the identity verification means performs voice pattern analysis to verify the user's identity and guarantee the legitimacy of the order. Biometric authentication systems may also be integrated into this process.
[0870] Furthermore, payment information provided by users is securely processed using a method that encrypts and transmits payment information. The terminal device encrypts the information and sends it to the server, which completes the transaction through a secure gateway. Strong encryption technology is used to maintain data security.
[0871] Finally, the server also provides post-purchase feedback and incentives through sentiment analysis. This process ensures that users have a comfortable, emotionally responsive, and flexible purchasing experience.
[0872] For example, when a user tries to purchase a smartphone, the terminal receives a voice message saying, "I would like to order the latest smartphone, model XYZ." This voice message is sent to the server, and speech recognition produces text data containing "model XYZ." A confirmation message, "Do you want to order model XYZ?", is generated and then adjusted as needed according to the result of analyzing the user's emotions.
[0873] An example of a prompt for a generative AI model is: "A user is passionately trying to order the latest XYZ smartphone from a TV shopping channel. Convert the voice data into text and generate an acknowledgment response that reflects the user's emotions."
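A prompt of the kind quoted above could be assembled from the transcribed utterance and the estimated emotion before being passed to the generative AI model. The following sketch shows one hypothetical template; the function name `build_prompt`, its parameters, and the wording of the template are assumptions, and no call to an actual model is made.

```python
def build_prompt(transcript: str, emotion: str, channel: str = "TV shopping") -> str:
    """Combine the transcript and estimated emotion into a single instruction."""
    return (
        f"A user is ordering from a {channel} channel.\n"
        f'Transcribed utterance: "{transcript}"\n'
        f"Estimated emotion: {emotion}\n"
        "Extract the product name and quantity, then write a one-sentence "
        "confirmation question whose tone reflects the estimated emotion."
    )


print(build_prompt("I would like to order the latest smartphone, model XYZ.", "excited"))
```

Keeping the transcript and the emotion on separate labeled lines is one way to make the inference data easy for the model to distinguish from the instruction itself.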
[0874] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0875] Step 1:
[0876] The terminal acquires the user's voice input. The user states the product they want to purchase from the TV shopping program, and this voice is captured by the terminal's built-in microphone and converted into digital data. The output is audio data that has undergone digital signal processing.
[0877] Step 2:
[0878] The terminal transmits the acquired audio data to the server. The audio data is securely transferred to the server over the network, and the communication is protected by an encryption protocol. The input is digital audio data, and the output is received by the server as audio data that can be recognized.
[0879] Step 3:
[0880] The server converts the received audio data into text using a speech recognition engine. The speech recognition engine uses a deep learning model to analyze the audio and output the order details as text data. In this process, important order information such as product name and quantity is extracted.
[0881] Step 4:
[0882] The server generates a confirmation voice based on the order information obtained from the text. It uses a speech synthesis engine to convert the text back into speech and create the confirmation message. The input is text data about the order, and the output is the voice data required for confirmation.
[0883] Step 5:
[0884] The server uses an emotion analysis engine to analyze the tone and speed of the user's voice and evaluate the user's emotional state. For example, if the user's voice contains excitement or frustration, the confirmation voice is adjusted according to that emotion. The input is the initial voice data, and the output is the confirmation voice adjusted according to the emotion.
[0885] Step 6:
[0886] The user hears the confirmation voice message from the server and responds. The user confirms the order by saying, for example, "yes," and the terminal records this response. The recorded voice data is then sent to the server.
[0887] Step 7:
[0888] The server analyzes the user's voice response and verifies their identity through voice pattern analysis. This process involves analyzing the frequency and rhythm of the voice and comparing it to past data. The input is the user's voice response, and the output is data confirming the legitimacy of the order.
[0889] Step 8:
[0890] Once the server verifies the user's identity, it begins processing the payment. The payment information provided by the user is encrypted and transmitted to the server via a secure route, where the transaction is completed using the payment gateway. The input is encrypted payment information, and the output is a confirmation of transaction completion.
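Step 7 above compares features of the user's voice, such as frequency and rhythm, with past data. One common way to express such a comparison is cosine similarity between a stored profile vector and the features of the new utterance. The sketch below assumes the features have already been extracted as numeric vectors; the functions `enroll` and `verify`, the example feature values, and the threshold of 0.95 are hypothetical and are not taken from the disclosure.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def enroll(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average several past feature vectors into one stored voice profile."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def verify(profile: Sequence[float], features: Sequence[float],
           threshold: float = 0.95) -> bool:
    """Accept the speaker when the new features are close enough to the profile."""
    return cosine_similarity(profile, features) >= threshold


# Made-up feature vectors standing in for frequency and rhythm measurements.
profile = enroll([[0.82, 0.10, 0.35], [0.80, 0.12, 0.33]])
print(verify(profile, [0.81, 0.11, 0.34]))   # same speaker, similar features
print(verify(profile, [0.10, 0.90, 0.05]))   # clearly different features
```

The threshold trades false acceptances against false rejections, so it would need to be tuned on real voice data before the result is used to gate payment.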
[0891] (Application Example 2)
[0892] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0893] Conventional voice-based ordering systems often provide uniform responses without considering the user's emotional state, potentially leading to decreased user satisfaction. Furthermore, insufficient consideration is given to the security of voice-based identity verification and payment. Therefore, a system is needed that can provide flexible responses tailored to the user's emotional state while ensuring secure identity verification and payment.
[0894] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0895] In this invention, the server includes a calculation means for analyzing and extracting order information based on speech recognition means, a response generation means for generating a response corresponding to the emotional state, and an emotion analysis means for estimating the emotional state. This makes it possible to provide appropriate responses according to the user's emotional state, thereby improving security.
[0896] "Voice input" is a method by which users communicate information through their voice.
[0897] An "information processing device" is a device for receiving and processing audio data.
[0898] "Speech recognition means" refers to technology that analyzes speech data and converts it into text data.
[0899] "Calculation means" refers to a means used to extract necessary information from analyzed data.
[0900] "Speech synthesis means" refers to technology for converting text data into speech data and outputting it.
[0901] "Emotion analysis means" refers to technology that estimates a user's emotional state from voice data.
[0902] A "response generation means" is a technology for generating an appropriate response based on an emotional state.
[0903] "Order confirmation means" refers to technology for confirming an order based on the user's response.
[0904] "Verification means" refers to technology that analyzes a user's voice patterns to determine their legitimacy.
[0905] "Payment transaction means" refers to technology for securely processing a user's payment information.
[0906] The system for implementing this invention uses a smartphone or smart glasses as an information processing device to receive voice input. When a user wants to order a product, they input the necessary information by voice into these devices. The information processing device receives the voice data and transmits it to a computing device. The server uses voice recognition means to convert the voice data into text and extract the order information.
[0907] At this time, the server uses the emotion analysis means to analyze the user's voice tone, speed, and rhythm, and to estimate their emotional state. For example, if the user appears to be in a hurry, the server can generate a flexible response that corresponds to that emotional state. The response generation means generates a confirmation message, which is then presented to the user as voice via the speech synthesis means.
[0908] When the user responds with "yes" by voice, the server confirms the order through the order confirmation means. Then, the verification means analyzes the user's voice pattern to verify their identity and ensure security. Finally, the payment transaction means encrypts the user's payment information and processes it securely.
[0909] For example, if a user speaks into their smartphone saying, "I'd like to order new sneakers, size 9," the system recognizes this as order information and, based on sentiment analysis, generates a reassuring response such as, "If you're in a hurry, we'll process it quickly."
[0910] An example of a prompt for a generative AI model is: "When the user provides voice input, convert it to text using speech recognition, analyze the sentiment using the sentiment analysis engine, and then create an appropriate response based on the results."
[0911] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0912] Step 1:
[0913] The terminal receives voice input from the user. When the user speaks about the product they want to purchase, that voice data is recorded in the terminal. This voice data is then sent for subsequent processing.
[0914] Step 2:
[0915] The server converts the received audio data into text data using speech recognition technology. The speech recognition engine analyzes the waveform of the voice and extracts order information such as product names and quantities as text. As a result, the order information is output.
[0916] Step 3:
[0917] The server uses the emotion analysis means to estimate the user's emotional state from the audio data together with the converted text. It analyzes the tone, tempo, and intonation of the audio data to identify emotions, such as whether the user is feeling urgency or satisfaction. The analysis result is output as the emotional state.
[0918] Step 4:
[0919] The server generates a confirmation voice message using a response generation mechanism based on the order information and emotional state. The confirmation voice message incorporates flexible responses depending on the emotional state (e.g., "We will respond promptly" if the customer is in a hurry). This confirmation voice message is output by a speech synthesis mechanism.
[0920] Step 5:
[0921] The user responds to the confirmation audio presented by the server with "yes" or "no." The terminal receives this response again as audio data. This audio data is sent to the server for final confirmation.
[0922] Step 6:
[0923] The server analyzes the received acknowledgment based on the order confirmation mechanism and confirms the order. The order is confirmed when the user responds with "yes," and the process proceeds to the next step. As a result of this process, confirmed order information is output.
[0924] Step 7:
[0925] The server uses the verification means to analyze the user's voice pattern and verify their identity. To ensure security, the voice characteristics are compared against a database to determine legitimacy. The output is the result of determining whether the user is legitimate.
[0926] Step 8:
[0927] The server uses the payment transaction means to encrypt and securely process the payment information provided by the user. The transaction is carried out through the payment gateway, and the completed payment information is output.
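Step 4 of the flow above, in which the response generation means combines the order information with the estimated emotional state, can be sketched as a lookup of an additional reassuring phrase. The phrases are taken from the examples in this description; the emotion labels, the mapping `REASSURANCE`, and the function `confirmation_message` are assumptions made for illustration.

```python
# Reassuring phrases quoted in this description, keyed by hypothetical emotion labels.
REASSURANCE = {
    "hurried": "We will respond promptly.",
    "angry": "We will respond as soon as possible.",
    "frustrated": "We will process your order quickly, so please rest assured.",
}


def confirmation_message(product: str, quantity: int, emotion: str) -> str:
    """Build the confirmation text and append a phrase suited to the emotion."""
    if quantity == 1:
        base = f"Do you want to order {product}?"
    else:
        base = f"Would you like to order {quantity} of {product}?"
    extra = REASSURANCE.get(emotion)
    return f"{base} {extra}" if extra else base


print(confirmation_message("new sneakers, size 9", 1, "hurried"))
print(confirmation_message("item XX", 2, "neutral"))
```

The resulting text would then be passed to the speech synthesis means; emotions without an entry in the mapping leave the message unchanged.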
[0928] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0929] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, is input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0930] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0931] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0932] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0933] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0934] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
[0935] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, and the like, emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0936] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0937] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
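The property that emotions located close together on the map have similar values can be illustrated with training targets that decay with distance from the labeled emotion. In the sketch below, the map coordinates, the Gaussian form of the decay, and the width `sigma` are all assumptions chosen for illustration; the disclosure states only that nearby emotions are trained to take similar values and does not give coordinates or a formula.

```python
import math
from typing import Dict

# Hypothetical (x, y) positions on the emotion map; not taken from Figure 9 or 10.
EMOTION_POSITIONS = {
    "reassured": (0.50, 0.10),
    "calm": (0.55, 0.00),
    "confident": (0.60, 0.15),
    "anxious": (0.50, -0.40),
    "angry": (-0.60, -0.50),
}


def soft_targets(label: str, sigma: float = 0.2) -> Dict[str, float]:
    """Give each emotion a target value that falls off with map distance from the label."""
    lx, ly = EMOTION_POSITIONS[label]
    return {
        name: math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in EMOTION_POSITIONS.items()
    }


def dominant_emotion(values: Dict[str, float]) -> str:
    """Pick the emotion with the largest emotion value."""
    return max(values, key=values.get)


targets = soft_targets("calm")
for name, value in sorted(targets.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
print(dominant_emotion(targets))
```

With these made-up positions, "reassured" and "confident" receive values close to that of "calm", while distant emotions such as "angry" receive values near zero.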
[0938] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0939] In the above embodiment, an example was given in which the specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific process may be distributed across multiple computers, including the computer 22. For example, the data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.
[0940] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.
[0941] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0942] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0943] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing the specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform the specific processing. Each of these processors has built-in or connected memory and performs the specific processing by using that memory.
[0944] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0945] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0946] More specifically, the hardware structure of these various processors can be an electrical circuit that combines circuit elements such as semiconductor devices. The specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as this does not deviate from the main purpose.
[0947] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that are not needed to enable implementation of those aspects have been omitted from the descriptions and illustrations presented above.
[0948] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0949] The following is further disclosed regarding the embodiments described above.
[0950] (Claim 1)
[0951] A terminal device that receives voice input,
[0952] A server device that analyzes the aforementioned voice input based on a voice recognition engine and extracts order information,
[0953] A speech synthesis means that generates a confirmation voice based on the aforementioned order information and presents it to the user,
[0954] An order confirmation means that confirms an order based on the user's response,
[0955] A system comprising the above.
[0956] (Claim 2)
[0957] The system according to claim 1, comprising a means for verifying the user's identity and determining their legitimacy through voice pattern analysis.
[0958] (Claim 3)
[0959] The system according to claim 1, comprising a payment processing means for encrypting and securely processing user payment information.
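The flow recited in the claim set above (extract order information from recognized speech, generate a confirmation, and confirm the order from the user's response) can be sketched as follows. This is a minimal sketch under stated assumptions: speech recognition and speech synthesis are assumed to happen elsewhere, so the functions operate on text; the utterance pattern, the `Order` fields, and the set of accepted affirmative responses are hypothetical and do not come from the disclosure.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    item: str
    quantity: int


def extract_order(transcript: str) -> Optional[Order]:
    """Extract order information from recognized text.

    Assumes a hypothetical utterance pattern: "<quantity> of item <item>".
    """
    match = re.search(r"(\d+)\s+of\s+item\s+(\w+)", transcript.lower())
    if match is None:
        return None
    return Order(item=match.group(2), quantity=int(match.group(1)))


def confirmation_text(order: Order) -> str:
    """Build the text that a speech synthesizer would read back to the user."""
    return f"You ordered {order.quantity} of item {order.item}. Is that correct?"


def is_confirmed(response: str) -> bool:
    """Treat a small, hypothetical set of affirmative replies as confirmation."""
    return response.strip().lower() in {"yes", "correct", "that's right"}


if __name__ == "__main__":
    order = extract_order("I would like 2 of item A123")
    if order is not None:
        print(confirmation_text(order))
        print(is_confirmed("Yes"))
```

A real system would need a far more tolerant parser and a fallback prompt when no order can be extracted; the `None` return marks where that fallback would go.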
[0960] "Example 1"
[0961] (Claim 1)
[0962] A receiving means for acquiring voice input,
[0963] Processing means for analyzing the aforementioned voice input and extracting information,
[0964] An output means that generates and presents confirmation audio based on the aforementioned information,
[0965] A confirmation means for confirming an order based on the user's response,
[0966] A system comprising the above.
[0967] (Claim 2)
[0968] The system according to claim 1, comprising authentication means for identifying a user using voice analysis and performing authentication.
[0969] (Claim 3)
[0970] The system according to claim 1, comprising a transaction means for encrypting and processing user transaction information in order to protect it.
[0971] "Application Example 1"
[0972] (Claim 1)
[0973] An information processing device that receives voice input,
[0974] A computing device that analyzes the aforementioned voice input using a recognition mechanism and extracts information related to the order,
[0975] A voice generation means that creates a confirmation voice based on the information regarding the order and presents it to the user,
[0976] An order confirmation means that confirms the order based on the user's response,
[0977] A processing means that accepts a voice specification of the payment method and securely processes payment information,
[0978] A system comprising the above.
[0979] (Claim 2)
[0980] The system according to claim 1, comprising a means for verifying the identity of a user and evaluating their legitimacy through voice pattern analysis.
[0981] (Claim 3)
[0982] The system according to claim 1, comprising processing means for encrypting and securely processing information related to user payments.
[0983] "Example 2 of combining an emotion engine"
[0984] (Claim 1)
[0985] An information processing device that receives voice input,
[0986] A computing device that analyzes the aforementioned voice input based on a conversion device and extracts information related to the order,
[0987] A voice generation means that generates a confirmation voice based on the information regarding the order and presents it to the user,
[0988] An emotion analysis means that analyzes the emotions in voice and adjusts the response,
[0989] A means of confirming an order based on the user's response,
[0990] A system comprising the above.
[0991] (Claim 2)
[0992] The system according to claim 1, comprising a means for verifying the identity of a user and determining their legitimacy through voice pattern analysis.
[0993] (Claim 3)
[0994] The system according to claim 1, comprising a processing means for encrypting and securely processing the user's payment information.
[0995] "Application example 2 when combining with an emotional engine"
[0996] (Claim 1)
[0997] An information processing device that receives voice input,
[0998] A computing device that analyzes the aforementioned voice input based on voice recognition means and extracts order information,
[0999] A speech synthesis means that generates a confirmation voice based on the aforementioned order information and presents it to the user,
[1000] A sentiment analysis tool that analyzes the user's voice tone and estimates their emotional state,
[1001] A response generation means that generates a response corresponding to an emotional state,
[1002] An order confirmation means that confirms the order based on the user's response,
[1003] A system comprising the above.
[1004] (Claim 2)
[1005] The system according to claim 1, comprising a verification means for verifying the user's identity and determining their legitimacy through voice pattern analysis.
[1006] (Claim 3)
[1007] The system according to claim 1, comprising a payment transaction means for encrypting and securely processing user payment information.
[Explanation of Symbols]
[1008] 10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots
Claims
1. A system comprising:
an information processing device that receives voice input;
a computing device that analyzes the voice input using a recognition mechanism and extracts information related to an order;
a voice generation means that creates a confirmation voice based on the information related to the order and presents it to the user;
an order confirmation means that confirms the order based on the user's response; and
a processing means that accepts a voice specification of the payment method and securely processes payment information.
2. The system according to claim 1, comprising a means for verifying the identity of a user and evaluating their legitimacy through voice pattern analysis.
3. The system according to claim 1, comprising processing means for encrypting and securely processing information related to user payments.